DEFENSE AGAINST BACKDOOR ATTACKS VIA IDENTIFYING AND PURIFYING BAD NEURONS

Abstract

Recent studies reveal the vulnerability of neural networks to backdoor attacks. By embedding backdoors into hidden neurons through poisoned training data, a backdoor attacker can override the normal predictions of the victim model with attacker-chosen ones whenever the backdoor pattern is present in a test input. In this paper, to mitigate public concerns about the attack, we propose a novel backdoor defense that identifies and purifies the backdoored neurons of the victim neural network. Specifically, we first define a new metric, called benign salience. By incorporating the first-order gradient to retain the connections between neurons, benign salience can identify backdoored neurons with high accuracy. Then, a new Adaptive Regularization (AR) mechanism is proposed to assist in purifying the identified bad neurons via fine-tuning. Owing to its ability to adapt to different magnitudes of parameters, AR provides faster and more stable convergence than common regularization mechanisms in neuron purifying. Finally, we test the defense effect of our method on ten different backdoor attacks with three benchmark datasets. Experimental results show that our method decreases the attack success rate by more than 95% on average, the best among six state-of-the-art defense methods.

1. INTRODUCTION

The brilliant feats of neural networks (NNs) make them a focal point of many attacks, one of the most threatening of which is the backdoor attack (Liu et al., 2017; Barni et al., 2019). By mixing poisoned data into the training set, a backdoor attack can control the victim NN to output attacker-chosen predictions for triggered inputs, while hardly disturbing the predictions on normal inputs. Moreover, the triggers crafted by the attacker can be only a few blocks of pixels (Li et al., 2020b) or even invisible noise (Zhong et al., 2020), which makes the attack notoriously perilous in applications. Based on the findings in (Gu et al., 2017), backdoor attacks succeed because the neurons that memorize trigger patterns, often called bad neurons, keep a strong connection only with the triggers and rarely react to normal features. Naturally, a direct idea for defending against the attack is to remove or mitigate the effect of bad neurons, e.g., by model pruning or fine-tuning (Gu et al., 2017). However, existing defense methods are mainly flawed from two perspectives.

Neuron Evaluation. To evaluate which neurons are backdoored, current approaches mostly rely on the activation magnitude (AM) of neurons (Gu et al., 2017). Neurons with low AMs on normal inputs are deemed "bad" and pruned during the defense stage. Despite being intuitive and easy to apply, such a metric can lead to over-pruning of good neurons because it ignores the connections between neurons. For instance, consider a clean neuron N that has low AMs but very high weights on its downstream neurons: N has a strong positive effect on the final predictions, yet it is still judged to be bad under the above rule.

Neuron Purifying. Once bad neurons are marked in the evaluation stage, the next step is to purify or prune them. Currently, most pruning-based defense methods simply remove bad neurons from the backdoored network.
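The neuron-evaluation flaw described above can be made concrete with a toy example. The sketch below is not WIPER's actual BS metric (defined later in Section 4.1); it only assumes, for illustration, a first-order score of the form |activation × ∂output/∂activation|, which keeps the downstream weights in the picture where plain AM does not. All names and sizes are invented for the demo.

```python
import numpy as np

# Toy 2-layer network: x -> hidden (3 ReLU neurons) -> scalar output.
# Hidden neuron 2 has LOW activations but a HIGH downstream weight, so a
# pure activation-magnitude (AM) metric underrates its importance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # clean inputs
W1 = rng.normal(size=(4, 3))
W1[:, 2] *= 0.05                               # neuron 2: small activations...
w2 = np.array([1.0, 1.0, 50.0])                # ...but a large downstream weight

H = np.maximum(X @ W1, 0.0)                    # ReLU hidden activations
y = H @ w2                                     # network output

# Metric 1: activation magnitude (AM) -- mean |activation| on clean data.
am = np.abs(H).mean(axis=0)

# Metric 2: a first-order salience sketch, |activation * d(output)/d(activation)|;
# here d(y)/d(H) = w2, so downstream connections are retained in the score.
salience = np.abs(H * w2).mean(axis=0)

print("AM ranking      :", np.argsort(am))        # neuron 2 ranked least important
print("salience ranking:", np.argsort(salience))  # neuron 2 no longer ranked last
```

Under AM, the clean-but-quiet neuron 2 would be the first pruning candidate; the gradient-weighted score corrects this because the factor d(output)/d(activation) carries the downstream weight of 50.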
However, as pointed out by prior work (Liu et al., 2018), some bad neurons are also related to the predictions on normal data, and roughly removing them can easily degrade the performance of the target model. In this paper, we propose a novel backdoor defense method, called WIPER, to overcome these two flaws. Specifically, instead of utilizing AM, we define a more effective metric called benign salience (BS, Section 4.1) to evaluate the importance of neurons. Compared to AM, BS retains the connections between neurons through the first-order gradient, so bad neurons can be identified more accurately. Then, after identifying bad neurons, our method fine-tunes rather than directly prunes them in the neuron purifying stage (Section 4.2) to avoid unexpected performance degradation. Through a newly designed adaptive piece-wise regularization mechanism, our fine-tuning method is far more effective at mitigating the network's attention to trigger patterns than the existing fine-tuning method (Truong et al., 2020). Our contributions are summarized in four folds:

• We propose WIPER, a novel backdoor defense method that combines the advantages of both model pruning and fine-tuning to identify and purify backdoored neurons.

• We design a new metric, BS, to mark the bad neurons whose attention is misled to the backdoor trigger patterns. Since the connections between neurons are preserved via the first-order gradient, defenders can use BS to distinguish bad neurons with higher accuracy than commonly used metrics, e.g., AM.

• We develop a new type of regularization, adaptive regularization (AR). Compared to common regularization, AR better accelerates and stabilizes the purifying process of bad neurons by adaptively adjusting the penalty degree to different magnitudes of parameters.

• We conduct extensive experiments on benchmark datasets to validate the effectiveness of WIPER.
The results show that our method significantly outperforms state-of-the-art defenses in both decreasing the attack success rate and maintaining model performance.
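To fix intuition for the AR contribution above, here is a minimal sketch of a magnitude-adaptive, piece-wise penalty. The concrete form (a quadratic branch for small weights joined continuously to a linear branch for large ones, i.e., a Huber-style penalty) and the parameters `tau` and `alpha` are our own illustrative assumptions; WIPER's actual AR is specified in Section 4.2 and may differ.

```python
import numpy as np

def adaptive_penalty(w, tau=1.0, alpha=1e-3):
    """Hypothetical piece-wise regularizer: quadratic (L2-like) on small
    weights, linear (L1-like) on large weights, so the penalty gradient does
    not grow unboundedly with parameter magnitude. Illustrative only; not
    WIPER's exact AR formulation."""
    w = np.asarray(w, dtype=float)
    small = np.abs(w) <= tau
    # The two branches are joined continuously at |w| = tau.
    pen = np.where(small, 0.5 * w**2, tau * (np.abs(w) - 0.5 * tau))
    return alpha * pen.sum()

def adaptive_grad(w, tau=1.0, alpha=1e-3):
    """Gradient of the penalty: alpha * w for |w| <= tau, and a
    constant-magnitude alpha * tau * sign(w) beyond, so large parameters are
    not over-penalized during fine-tuning."""
    w = np.asarray(w, dtype=float)
    return alpha * np.clip(w, -tau, tau)
```

The design point this sketch captures is the one named in the contribution: the penalty strength adapts to the parameter's magnitude instead of applying one fixed rule (as plain L1 or L2 would), which is what can make fine-tuning-based purification converge faster and more stably.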

2. RELATED WORKS

Backdoor attack. A typical backdoor attack is carried out by injecting a small volume of poisoned data, crafted with attacker-chosen triggers, into the training set. To maintain stealthiness, attackers use a variety of triggers. For instance, the trigger used in (Li et al., 2020b) was a simple 3 × 3 rectangle of pixels. The blend attack (Chen et al., 2017) adopted everyday objects, e.g., glasses, as triggers so as to evade human inspection. Moreover, the authors of (Zhong et al., 2020) attempted to design human-imperceptible yet effective noise. The clean-label attack (Shafahi et al., 2018) made the natural features in images difficult to learn, so that the model was forced to rely only on the trigger to classify correctly, without any label modification. More recently, and similar in spirit to the clean-label attack, the sinusoidal signal attack (Barni et al., 2019) designed an easily learned trigger to conduct the backdoor attack. WIPER is designed to provide an effective way to defend against the above-mentioned attacks and ensure model security in applications.

Backdoor defense. Backdoor defenses can be roughly divided into two categories: detection-based methods (Wang et al., 2019; 2020; Xu et al., 2021), which aim to detect whether a neural network is backdoored, and purifying-based methods (Zhao et al., 2019; Truong et al., 2020; Yoshida & Fujino, 2020; Li et al., 2021a), which try to remove the backdoor while maintaining the performance of the target model. Detection-based methods are by now well developed, and many remarkable works (Gao et al., 2019; Wang et al., 2020; Xu et al., 2021) achieve quite high detection rates. Thus, this paper mainly focuses on purifying bad neurons to remove the backdoor. Inspired by the fact that bad neurons are dormant in the presence of clean data, fine-pruning (Liu et al., 2018) removed the backdoor by erasing the neurons with activation values below a certain threshold.
However, the effectiveness of fine-pruning heavily depends on the quality of the held-out data, and it can easily be evaded by some state-of-the-art attacks, such as TrojanNN (Liu et al., 2017). In (Truong et al., 2020), the authors suggested that, by analogy with catastrophic forgetting (Delange et al., 2021), fine-tuning the model with some clean data is a simple yet effective way to remove the backdoor. Like fine-pruning, fine-tuning also needs high-quality clean data to mitigate the effect of bad neurons. Knowledge distillation (Yoshida & Fujino, 2020; Li et al., 2021a) was later proposed to achieve backdoor defense by distilling clean knowledge from the infected model into a fresh model, but it fails against TrojanNN (Liu et al., 2017) because it does not account for TrojanNN's special trigger attention mechanism. Recently, a novel work, AI-Lancet (Zhao et al., 2021), was proposed to locate bad neurons by running comparative experiments with triggered inputs and clean inputs whose trigger-pasted regions are cropped. Such an idea achieves competitive defense performance.
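The fine-pruning baseline discussed above can be sketched in a few lines. The sketch below only captures the core idea from (Liu et al., 2018), i.e., zeroing out the outgoing weights of neurons that stay dormant on clean data; the `keep_ratio` threshold and the mean-|activation| score are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def prune_dormant_neurons(W_out, clean_activations, keep_ratio=0.9):
    """Sketch of fine-pruning: neurons that are dormant on clean data are
    assumed to serve the trigger, so their outgoing weights are zeroed.

    W_out:             (n_neurons, n_next) outgoing weight matrix
    clean_activations: (n_samples, n_neurons) activations on clean data
    """
    score = np.abs(clean_activations).mean(axis=0)      # per-neuron AM on clean data
    n_keep = int(np.ceil(keep_ratio * score.size))
    pruned = np.argsort(score)[: score.size - n_keep]   # lowest-activation neurons
    W_pruned = W_out.copy()
    W_pruned[pruned, :] = 0.0                           # cut their downstream effect
    return W_pruned, pruned
```

In fine-pruning this pruning step is followed by fine-tuning on clean data to recover any lost accuracy; the dependence of both steps on clean, representative data is exactly the weakness noted above.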

