Disclaimer: These are my personal notes on this paper. I am in no way affiliated with this paper. All credit goes to the authors.

MesoNet: a Compact Facial Video Forgery Detection Network

Sept. 4, 2018 - Paper Link - Tags: Deepfake, Detection

Summary

Detects deepfakes via "mesoscopic" image properties, i.e. neither small-scale noise nor entire faces at a time (basically the same idea as XceptionNet). The proposed networks detect both deepfakes and Face2Face forgeries. Two networks are proposed: Meso-4 and MesoInception-4 (which adds Inception modules). They report a classification score of 0.917 for single-frame deepfake detection and 0.984 for video-level detection, with some compression. No idea what classification score metric they used (accuracy?). Github
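Since I could not tell which metric sits behind the two scores, here is a minimal sketch of one plausible reading: plain accuracy over individual frames, and accuracy over videos after averaging each video's frame scores. The function names, the 0.5 threshold, and the toy scores are all my own assumptions for illustration, not values or code from the paper.

```python
from statistics import mean


def frame_accuracy(scores, labels, threshold=0.5):
    """Fraction of frames whose thresholded score matches the label (1 = forged)."""
    hits = [(s >= threshold) == bool(y) for s, y in zip(scores, labels)]
    return sum(hits) / len(hits)


def video_accuracy(videos, threshold=0.5):
    """Average each video's frame scores, then threshold once per video.

    `videos` is a list of (frame_scores, video_label) pairs.
    """
    hits = [(mean(scores) >= threshold) == bool(label) for scores, label in videos]
    return sum(hits) / len(hits)


# Toy data (made up): two forged videos and one real video, three frames each.
videos = [
    ([0.9, 0.4, 0.8], 1),
    ([0.6, 0.7, 0.3], 1),
    ([0.2, 0.6, 0.1], 0),
]
frame_scores = [s for scores, _ in videos for s in scores]
frame_labels = [label for scores, label in videos for _ in scores]

print(frame_accuracy(frame_scores, frame_labels))  # per-frame score
print(video_accuracy(videos))                      # per-video score after averaging
```

On this toy data a few individual frames are misclassified, but averaging within each video recovers the right label for all three, which would be consistent with the video-level score being higher than the single-frame one.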

Notes

Section 1.1 gives a really good description of how deepfakes are trained (using auto-encoders).
Deepfake criticisms:

"some frames can end up with no facial reenactment or with a large blurred area or a doubled facial contour."
"autoencoders tend to poorly reconstruct fine details because of the compression of the input data on a limited encoding space, the result thus often appears a bit blurry."

Both of the proposed networks perform similarly on deepfakes and Face2Face.
Why mesoscopic: "microscopic analyses based on image noise cannot be applied in a compressed video context where the image noise is strongly degraded. Similarly, at a higher semantic level, human eye struggles to distinguish forged images, especially when the image depicts a human face. That is why we propose to adopt an intermediate approach using a deep neural network with a small number of layers."
Figure 4 highlights the Meso-4 architecture
Figure 5 highlights the MesoInception-4 architecture. The first two convolutional layers of Meso-4 are replaced with Inception modules.
Table 2 highlights the datasets used. The deepfake dataset was compressed with the H.264 codec with varying compression levels.
Faces were extracted using the Viola-Jones detector. Alignment was done via a neural network trained for facial landmark detection.
They saw a notable deterioration of classification scores when strong video compression was used.
Performed worse than XceptionNet.
Found that restricting the analysis to intra-frames had a negative effect on the score (intra-frames being frames that are not temporally compressed).
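To make the "compact" part of Meso-4 concrete, here is a rough sketch that tallies trainable parameters for a Meso-4-style stack. The layer settings (256x256x3 input, four conv blocks with 8, 8, 16, 16 filters, batch normalization after each, 2x2/2x2/2x2/4x4 max pooling, a 16-unit dense layer, and a single output unit) are my reading of Figure 4 and should be treated as assumptions; the helper names are mine.

```python
def conv_params(kernel, in_ch, out_ch):
    # Weights plus one bias per output channel.
    return kernel * kernel * in_ch * out_ch + out_ch


def batchnorm_params(channels):
    # Trainable scale and shift per channel.
    return 2 * channels


def dense_params(in_features, out_features):
    # Weights plus one bias per output unit.
    return in_features * out_features + out_features


# (kernel size, output channels, pooling factor) per conv block: assumed from Figure 4.
BLOCKS = [(3, 8, 2), (5, 8, 2), (5, 16, 2), (5, 16, 4)]


def meso4_param_count(side=256, channels=3, hidden=16):
    total = 0
    for kernel, out_ch, pool in BLOCKS:
        total += conv_params(kernel, channels, out_ch) + batchnorm_params(out_ch)
        channels = out_ch
        side //= pool  # "same" padding assumed, so only pooling shrinks the map
    flat = side * side * channels
    total += dense_params(flat, hidden) + dense_params(hidden, 1)
    return total


print(meso4_param_count())
```

Under these assumptions the tally is 27,977 trainable parameters, with more than half of them in the first dense layer rather than in the convolutions.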

Analysis

WHAT CLASSIFICATION METRIC ARE YOU USING?!?
XceptionNet still #1

Citation: Afchar, Darius, et al. "Mesonet: a compact facial video forgery detection network." 2018 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2018.