Disclaimer: These are my personal notes on this paper. I am in no way related to this paper. All credits go towards the authors.
Evading Deepfake-Image Detectors with White- and Black-Box Attacks
April 1, 2020
Paper Link
Tags: Adversarial, Black-Box, Deepfake, Detection, Perturbation
Summary
Created 5 different adversarial perturbation methods to attack deepfake detectors, primarily Wang et al.'s detector. The first two attacks changed at most the least significant bit of each pixel, the third created a universal patch, the fourth used a universal (single) low-level attribute vector of StyleGAN to generate adversarial deepfakes, and the fifth was a black-box method in which an adversarial perturbation was learned on a surrogate network and those samples were tested on the target network. Figure 3 shows the last 4 attacks' results as ROC curves.
Notes
- Uses two detectors: Wang et al. and Frank et al., with a focus on Wang et al.
- Wang et al.:
- Based on ResNet-50
- Pre-trained on ImageNet, then trained on 720,000 training and 4,000 validation images (half real, half fake, generated with ProGAN); a rough sketch of such a detector follows this detector list
- Dataset had been augmented by spatial blurring and JPEG compression
- Accuracy is tested against 10 different generators
- Robust to spatial blurring and JPEG compression
- Frank et al.
- "The authors argue that GAN synthesized-images have a command spatial frequency artifact that emerges from image upsampling..." → temporarily similar image artifacts
- Looked at five different attack strategies:
- Distortion-Minimizing Attack
- Limited their attack to only flipping the lowest bit of each pixel → the maximum perturbation to any pixel is 1/255
- Equation 2 shows their optimization function: minimize the size of the perturbation while driving down the probability that the perturbed image is labeled fake (a rough sketch follows this attack's bullets)
- Figure 2.a shows the results of Fake images being marked as Real. Half of all faked images are classified as real by flipping the least significant bit in 1% of pixels. At 11% of pixel flips, nearly 100% of fakes are labeled real
- Figure 2.b shows the results of Real images being marked as Fake. This is the only attack that tests both scenarios. They find that it takes 7% of pixels being flipped to get 50% of real images to be labeled fake, as opposed to the above 1% findings for fake to real. Easier to change computer generated images? Are the discriminators really learning real photos while kind of guessing at fake ones? Looking at how certain the discriminators are of real images would be an interesting study.
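A minimal PyTorch-style sketch of the least-significant-bit idea above. The `detector` interface (float image in [0, 1] → scalar P(fake)) and the one-shot gradient-ranking heuristic for choosing which pixels to flip are my assumptions; the paper optimizes Equation 2 directly rather than using this approximation.

```python
import torch

def lsb_flip_attack(detector, image_u8, flip_fraction=0.01):
    """One-shot sketch: flip the least significant bit of the `flip_fraction`
    of pixels whose flip most reduces P(fake), estimated from one gradient.
    `image_u8` is a uint8 CxHxW image; `detector` maps a 1xCxHxW float image
    in [0, 1] to a scalar fake probability."""
    x = image_u8.float().div(255.0).unsqueeze(0).requires_grad_(True)
    detector(x).squeeze().backward()
    grad = x.grad.squeeze(0)                  # dP(fake)/dpixel

    # A flip moves a pixel by +1/255 if its LSB is 0, or -1/255 if its LSB is 1.
    lsb = (image_u8 & 1).float()
    delta = (1.0 - 2.0 * lsb) / 255.0
    benefit = -grad * delta                   # first-order drop in P(fake) per flip

    k = int(flip_fraction * benefit.numel())
    idx = benefit.flatten().topk(k).indices   # most helpful pixels to flip
    adv = image_u8.clone().flatten()
    adv[idx] ^= 1                             # flip their least significant bit
    return adv.view_as(image_u8)
```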
- Loss-Maximizing Attack
- Similar to, but simpler than, the Distortion-Minimizing Attack. Equation 3 shows the optimization function: instead of minimizing the perturbation subject to misclassification, they maximize the detector's loss subject to a bound on the perturbation's p-norm (a PGD-style sketch follows this attack's bullets).
- Figure 3.a shows the results. 0.0 means 0% of pixel bits flipped, 1.0 means 100%. The effect is more pronounced (AUC drops further) for raw PNGs than for compressed JPGs. (Smaller AUC is better for the attack)
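Equation 3 is not reproduced here; below is a generic iterated-gradient (PGD-style) sketch of the same idea under an L∞ budget, assuming a differentiable `detector` that returns P(fake) for a float image batch in [0, 1]. The budget, step size, and iteration count are illustrative, not the paper's.

```python
import torch

def loss_maximizing_attack(detector, x, eps=1/255, steps=10, step_size=0.5/255):
    """Maximize the detector's error subject to a bounded perturbation:
    repeatedly step the perturbation in the direction that lowers P(fake),
    then project back into the L-infinity ball of radius `eps`."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        detector(x + delta).sum().backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()        # push toward "real"
            delta.clamp_(-eps, eps)                       # project onto the budget
            delta.copy_((x + delta).clamp(0.0, 1.0) - x)  # keep pixels valid
        delta.grad.zero_()
    return (x + delta).detach()
```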
- Universal Adversarial-Patch Attack
- Create a single perturbation instead of a perturbation for every image
- Equation 4 shows their optimization function: they optimize a single perturbation over a set of images (via random sampling) rather than one perturbation per image (a sketch follows this attack's bullets).
- The perturbation patch is limited to 1% of the input image size, i.e. a 24x24 pixel patch
- Figure 4.a demonstrates the patch in the top left corner. It is a very noticeable patch.
- Figure 3.b shows the results. AUC is reduced from 0.966 down to 0.085.
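A sketch of the universal-patch optimization in the spirit of Equation 4, assuming PyTorch, a differentiable `detector` returning P(fake), and a loader of fake images; the top-left placement, optimizer, and hyperparameters are my choices, not the paper's.

```python
import torch

def train_universal_patch(detector, fake_loader, size=24, epochs=5, lr=0.01):
    """Optimize ONE `size` x `size` patch over many fake images so that pasting
    it into the corner drives the detector's fake probability toward zero.
    `fake_loader` yields float NxCxHxW batches in [0, 1]."""
    patch = torch.rand(3, size, size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(epochs):
        for x in fake_loader:
            x = x.clone()
            x[:, :, :size, :size] = patch.clamp(0.0, 1.0)  # overlay the shared patch
            loss = detector(x).mean()                       # mean P(fake) over the batch
            opt.zero_grad()
            loss.backward()
            opt.step()
    return patch.detach().clamp(0.0, 1.0)
```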
- Universal Latent-Space Attack
- NOTE: Most interesting attack in my opinion
- Used the latent-space of the GAN to generate adversarial samples.
- Used the StyleGAN.
- Recent generative models take two input vectors
- z, which corresponds to high level features, such as gender, pose, skin color, and hair color/length
- w, which corresponds to low-level attributes, such as freckles
- The attack constructs a single (universal) w for all synthesized images, with the goal of having fakes misclassified as real. w is initialized randomly and then iteratively updated to reduce the detector's fake probability on the resulting images (a sketch follows this attack's bullets).
- Figures 4.b, 4.c, and 4.d show generated images.
- Figure 3.c shows the results. AUC is reduced from 0.99 to 0.17 (not the greatest, but an interesting find)
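A sketch of the universal latent-space idea, assuming a `generator(z, w)` callable and free vector dimensions; this interface and the optimizer choice are mine, not StyleGAN's actual API, and the paper's exact loss and stopping criterion are omitted.

```python
import torch

def universal_latent_attack(generator, detector, dim_z, dim_w,
                            steps=500, batch=16, lr=0.01):
    """Keep ONE shared low-level vector w, sample fresh high-level latents z
    each step, and nudge w so the generated images are scored as real."""
    w = torch.randn(dim_w, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        z = torch.randn(batch, dim_z)               # new high-level latents every step
        imgs = generator(z, w.expand(batch, -1))    # every image shares the same w
        loss = detector(imgs).mean()                # mean P(fake): push toward 0
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```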
- Black-Box Attack
- This attack learns how to attack one network and transfers that knowledge to the target network: train a ResNet-18 to classify images as real or fake on a dataset of one million ProGAN-generated fakes and the one million real images used to train ProGAN, augmenting with random 224x224 crops and 50% horizontal flips. Adversarial samples crafted against this surrogate are then tested on the target detector (a sketch follows this attack's bullets).
- Figure 3.d shows the results. AUC is reduced from 0.96 to 0.22
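A sketch of the transfer setup, assuming PyTorch/torchvision: train a ResNet-18 surrogate on ProGAN fakes vs. real images (training loop omitted), craft adversarial images against the surrogate with an iterated-gradient attack like the one above, and only then evaluate them on the target detector. The single-logit head, budget, and attack loop are my assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_surrogate():
    """Surrogate trained from scratch to separate real images from ProGAN fakes
    (random 224x224 crops + 50% horizontal flips, per the notes above). The
    target detector is never queried while crafting the attack."""
    net = models.resnet18(weights=None)
    net.fc = nn.Linear(net.fc.in_features, 1)        # one logit: sigmoid -> P(fake)
    return net

def transfer_attack(surrogate, target_detector, x, eps=2/255, steps=10, step_size=0.5/255):
    """White-box attack on the surrogate, then a single black-box evaluation of
    the resulting images on the unseen target detector."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        surrogate(x + delta).sum().backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()        # lower the surrogate's fake score
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0.0, 1.0) - x)
        delta.grad.zero_()
    adv = (x + delta).detach()
    with torch.no_grad():
        scores = target_detector(adv)                     # transferred prediction
    return adv, scores
```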
- Discussion
- Perturbation isn't the only attack option
- Standard image laundering (resizing, rescaling, cropping, or recompression) often reduces the true positive rate by over 10%.
- The two most effective defenses on large images have been adversarial training (training on adversarial samples) and randomized smoothing (adding Gaussian noise to every pixel so that small perturbations cannot change the output; a sketch follows below).
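A minimal sketch of the randomized-smoothing defense mentioned above, assuming a `detector` returning P(fake); the noise level and sample count are illustrative, and a full certified-smoothing procedure involves more than this simple averaging.

```python
import torch

def smoothed_fake_probability(detector, x, sigma=0.1, n_samples=100):
    """Average the detector over many Gaussian-noised copies of the image, so
    a tiny adversarial perturbation cannot flip the (averaged) decision."""
    with torch.no_grad():
        scores = [detector((x + sigma * torch.randn_like(x)).clamp(0.0, 1.0))
                  for _ in range(n_samples)]
    return torch.stack(scores).mean(dim=0)    # smoothed P(fake)
```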
Interesting References
- High-level forensic techniques that focus on semantically meaningful features:
- Low-level forensic techniques focus on pixel-level artifacts introduced by the synthesis process:
Ye et al.,
Marra et al.,
Rossler et al., and
Zhang et al. (Detecting and simulating artifacts in GAN fake images NOTE: interesting?)
- Low level techniques apparently struggle to "generalize to novel datasets" and "can be sensitive to laundering (e.g. transcoding or resizing)"
- DeepFake GANs:
- BigGAN
- CycleGAN
- GauGAN
- ProGAN
- StarGAN
- StyleGAN
- StyleGAN2
Citation: Carlini, Nicholas, and Hany Farid. "Evading Deepfake-Image Detectors with White- and Black-Box Attacks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.