
Disclaimer: These are my personal notes on this paper. I am in no way affiliated with this paper. All credit goes to the authors.



Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks

Sept. 16, 2019 - Paper Link - Tags: Backdoor, Detection

Summary

The authors learned (reverse-engineered) the perturbation used as the backdoor trigger. They were then able to remove the backdoor from the neural network.

Their detection and identification steps were combined. To identify a backdoor, they found, for every label, the minimal perturbation required to transform samples from all other labels into that label. They then ran outlier detection on the sizes (L1 norms) of these reversed triggers; if a label's trigger was flagged as an outlier, that label was marked as backdoored.
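To make this concrete, here is a minimal sketch of the two steps, under my own assumptions (a PyTorch classifier `model`, a loader of clean samples, and hyperparameters I picked for illustration; none of these names come from the paper): optimize a mask and pattern toward a candidate target label with an L1 penalty on the mask, then run a median-absolute-deviation style outlier test over the per-label trigger sizes.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, target_label, img_shape=(3, 32, 32),
                             steps=1000, lr=0.1, lam=0.01, device="cpu"):
    """Optimize a mask + pattern that pushes any clean input toward target_label,
    while keeping the trigger small via an L1 penalty on the mask."""
    mask = torch.zeros(img_shape[1:], device=device, requires_grad=True)   # H x W
    pattern = torch.zeros(img_shape, device=device, requires_grad=True)    # C x H x W
    opt = torch.optim.Adam([mask, pattern], lr=lr)

    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, _ = next(data_iter)
        x = x.to(device)
        m = torch.sigmoid(mask)           # keep the mask in [0, 1]
        p = torch.tanh(pattern)           # keep the pattern bounded
        x_adv = (1 - m) * x + m * p       # stamp the candidate trigger onto clean inputs
        target = torch.full((x.size(0),), target_label, dtype=torch.long, device=device)
        loss = F.cross_entropy(model(x_adv), target) + lam * m.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask).detach(), torch.tanh(pattern).detach()

def flag_backdoored_labels(trigger_l1_norms, threshold=2.0):
    """Median-absolute-deviation outlier test over the per-label trigger sizes.
    Labels whose reversed trigger is anomalously small are flagged."""
    norms = torch.tensor(trigger_l1_norms, dtype=torch.float)
    median = norms.median()
    mad = (norms - median).abs().median() * 1.4826   # consistency constant
    anomaly_index = (median - norms) / mad           # large when a trigger is unusually small
    return [i for i, a in enumerate(anomaly_index.tolist()) if a > threshold]
```

Running `reverse_engineer_trigger` once per label and feeding the resulting mask L1 norms into `flag_backdoored_labels` mirrors the detect-and-identify flow described above, though the paper's exact optimization schedule and thresholds differ from this sketch.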

To mitigate backdoors, they applied two different approaches. They empirically observed that the top 1% of neuron activations in the second-to-last layer were heavily correlated with the backdoor, so their first approach pruned the network by setting those backdoor-correlated neurons to 0. Their second approach used unlearning: they re-trained the network on samples stamped with the reversed trigger but kept their correct labels, so the trigger-to-target association is unlearned. They used only 10% of the training data for unlearning, and applied the reversed trigger to 20% of those samples.
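A rough sketch of both mitigations, again under my own assumptions (PyTorch; `acts_clean` / `acts_triggered` are penultimate-layer activation matrices of shape [N, D] collected beforehand, and `mask` / `pattern` are the outputs of the reverse-engineering step above). This is illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def backdoor_neuron_indices(acts_clean, acts_triggered, prune_fraction=0.01):
    """Rank penultimate-layer neurons by how much more they fire on triggered
    inputs than on clean ones; return the top fraction to zero out (via a mask or hook)."""
    diff = acts_triggered.mean(0) - acts_clean.mean(0)
    k = max(1, int(prune_fraction * diff.numel()))
    return torch.topk(diff, k).indices

def unlearn_backdoor(model, clean_subset, mask, pattern,
                     patch_fraction=0.2, epochs=1, lr=1e-4, device="cpu"):
    """Fine-tune on a small clean subset where a fraction of each batch gets the
    reversed trigger stamped on but keeps its ORIGINAL label."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in clean_subset:
            x, y = x.to(device), y.to(device)
            n_patch = int(patch_fraction * x.size(0))
            if n_patch > 0:
                x[:n_patch] = (1 - mask) * x[:n_patch] + mask * pattern
            loss = F.cross_entropy(model(x), y)   # labels stay correct, so the trigger loses its effect
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

The key design point in the unlearning sketch is that patched samples retain their true labels; retraining with the trigger attached to the attacker's target label would reinforce the backdoor instead of removing it.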

Notes

Interesting References

Analysis

Citation: Wang, Bolun, et al. "Neural cleanse: Identifying and mitigating backdoor attacks in neural networks." 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019.