DEFENSE AGAINST BACKDOOR ATTACKS VIA IDENTIFYING AND PURIFYING BAD NEURONS

Abstract

Recent studies reveal the vulnerability of neural networks to backdoor attacks. By embedding backdoors into hidden neurons through poisoned training data, a backdoor attacker can override the victim model's normal predictions with attacker-chosen ones whenever the backdoor pattern is present in a test input. In this paper, to mitigate public concerns about the attack, we propose a novel backdoor defense that identifies and purifies the backdoored neurons of the victim neural network. Specifically, we first define a new metric, called benign salience. By incorporating the first-order gradient, benign salience retains the connections between neurons and can thus identify backdoored neurons with high accuracy. Then, a new Adaptive Regularization (AR) mechanism is proposed to assist in purifying these identified bad neurons via fine-tuning. Owing to its ability to adapt to different magnitudes of parameters, AR provides faster and more stable convergence than common regularization mechanisms in neuron purifying. Finally, we test the defense effect of our method against ten different backdoor attacks on three benchmark datasets. Experimental results show that our method decreases the attack success rate by more than 95% on average, the best result among six state-of-the-art defense methods.

1. INTRODUCTION

The brilliant feats of neural networks (NNs) make them a focal point of many attacks, one of the most threatening of which is the backdoor attack (Liu et al., 2017; Barni et al., 2019). By mixing poisoned data into the training set, a backdoor attack can control the victim NN to output attacker-chosen predictions for triggered inputs, while hardly disturbing the predictions on normal inputs. Moreover, the triggers crafted by the attacker can be only a few blocks of pixels (Li et al., 2020b) or even invisible noise (Zhong et al., 2020), which makes the attack notoriously perilous in applications. Based on the findings in (Gu et al., 2017), backdoor attacks succeed because the neurons that memorize trigger patterns, often called bad neurons, keep a strong connection only with the triggers and rarely react to normal features. Naturally, a direct idea for defending against the attack is to remove or mitigate the effect of bad neurons, e.g., by model pruning or fine-tuning (Gu et al., 2017). However, existing defense methods are flawed from two main perspectives.

Neuron Evaluation. To evaluate which neurons are backdoored, current approaches mostly rely on the activation magnitude (AM) of neurons (Gu et al., 2017). Neurons with low AMs on normal inputs are deemed "bad" and pruned during the defense stage. Despite being intuitive and easy to apply, this metric sometimes leads to the over-pruning of good neurons because it ignores the connections between neurons. For instance, consider a clean neuron N that has low AMs but very high weights on its downstream neurons. N has a strong positive effect on the final predictions but is still judged to be bad under the above rule.

Neuron Purifying. Once bad neurons are marked in the evaluation stage, the next step is to purify or prune them. Currently, most pruning-based defense methods simply remove the bad neurons from the backdoored network.
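The neuron-evaluation pitfall can be made concrete with a toy numerical sketch (hypothetical numbers and a simplified score, not the paper's benign-salience metric; here `downstream_w` is a crude stand-in for the first-order gradient of the output with respect to each neuron):

```python
import numpy as np

# Toy sketch: ranking three hidden neurons for pruning. Neuron 0 plays the
# role of the clean neuron N from the text: low activation magnitude (AM)
# on normal inputs, but very high weights on its downstream neurons.
activations = np.array([0.05, 0.9, 0.8])   # mean AM on normal inputs
downstream_w = np.array([5.0, 0.1, 0.1])   # magnitude of downstream weights

# AM-based rule: prune the neuron with the smallest activation.
prune_by_am = int(np.argmin(activations))        # picks neuron 0

# Connection-aware score: activation scaled by downstream influence,
# so a quiet neuron that strongly drives the output is not discarded.
salience = activations * np.abs(downstream_w)    # [0.25, 0.09, 0.08]
prune_by_salience = int(np.argmin(salience))     # picks neuron 2
```

Under the AM rule the genuinely useful neuron 0 would be over-pruned, while the connection-aware score ranks it as the most valuable of the three.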
However, as pointed out by prior works (Liu et al., 2018), some bad neurons are also related to the predictions on normal data. Roughly removing these neurons can easily degrade the performance of the target model.
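This trade-off can be sketched with a toy example (hypothetical numbers; the shrinkage penalty here is plain weight decay, not the paper's Adaptive Regularization). Model a bad neuron's incoming weights as a clean component plus a trigger component: hard pruning zeroes both, whereas fine-tuning on clean data keeps the clean component alive while the unsupported trigger component decays away.

```python
import numpy as np

# A bad neuron's incoming weights: [clean component, trigger component].
w = np.array([1.0, 2.0])
x_clean = np.array([1.0, 0.0])   # clean inputs never excite the trigger dim
y_clean = 1.0                    # target response on clean data

# Hard pruning: zero the whole neuron. The trigger path dies, but the
# clean response drops from 1.0 to 0.0 -> performance degradation.
w_pruned = np.zeros_like(w)

# Soft purifying: fine-tune on clean data with a weight-decay penalty.
# The data loss sustains the clean component; nothing sustains the
# trigger component, so the decay term drives it toward zero.
lr, decay = 0.2, 0.3
for _ in range(50):
    pred = w @ x_clean
    grad = 2.0 * (pred - y_clean) * x_clean + decay * w
    w -= lr * grad

# After fine-tuning: clean component ~0.87 (preserved),
# trigger component ~0.09 (nearly purified).
```

The sketch illustrates why soft purifying can remove the backdoor path without the accuracy loss that hard pruning incurs; the paper's AR mechanism additionally adapts the penalty to each parameter's magnitude.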

