BACKDOOR MITIGATION BY CORRECTING ACTIVATION DISTRIBUTION ALTERATION

Abstract

Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever a backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is reversed. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, but achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities.



In this paper, we investigate an interesting distribution alteration property of backdoor attacks. In short, the learned backdoor trigger causes a change in the distribution of internal activations for test instances with the trigger, compared to that for backdoor-free instances; and we demonstrate that instances with the trigger are classified to their original source classes after the distribution alteration is reversed. Accordingly, we propose a method to mitigate backdoor attacks (post-training), such that classification accuracy on instances both with and without the trigger will be close to the accuracy of a clean (backdoor-free) classifier. In particular, we propose a practical way to correct the distribution alteration by exploiting reverse-engineered triggers (Wang et al. (2019); Xiang et al. (2020)). Compared with existing approaches that address the same mitigation problem but require tuning of the whole DNN, our method achieves generally better performance without changing any original parameters of the DNN. Moreover, while most mitigation approaches are designed to correctly classify backdoor-trigger instances blindly, without detection, our method detects those backdoor-trigger instances efficiently. Our main contributions in this paper are twofold: 1) We discover and analyze the activation distribution alteration property of backdoor attacks and its relation to accuracy in classifying backdoor-trigger instances. 2) We propose a post-training backdoor mitigation approach based on our findings, which outperforms several state-of-the-art approaches for a variety of datasets and backdoor attack settings.
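The correction step can be illustrated with a minimal numpy sketch (all names and the moment-matching transformation below are our own simplifications for illustration, not the paper's exact algorithm): estimate per-neuron statistics of internal-layer activations on a small clean dataset, then affinely re-map the activations of suspected backdoor-trigger instances so that their statistics match the clean ones before the remaining layers are applied.

```python
import numpy as np

def clean_stats(clean_acts):
    """Per-neuron mean/std of internal-layer activations on a small clean set.

    clean_acts: array of shape (n_clean, n_neurons).
    """
    return clean_acts.mean(axis=0), clean_acts.std(axis=0) + 1e-8

def correct_activations(acts, clean_mu, clean_sigma):
    """Affinely re-map activations so their batch statistics match the clean ones.

    This reverses a mean/variance shift of the kind shown in Figure 1; it is
    only an illustration of 'reversing the alteration', not the paper's
    actual transformation.
    """
    mu, sigma = acts.mean(axis=0), acts.std(axis=0) + 1e-8
    return (acts - mu) / sigma * clean_sigma + clean_mu

# Toy example: triggered activations are shifted and rescaled versions
# of the clean activation distribution.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(500, 4))
triggered = rng.normal(3.0, 2.0, size=(500, 4))   # altered distribution
mu, sigma = clean_stats(clean)
fixed = correct_activations(triggered, mu, sigma)
print(fixed.mean(axis=0).round(3))                # close to the clean means
```

In practice such a correction would be applied at a chosen internal layer, with the corrected activations then propagated through the rest of the network.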

2. RELATED WORK

Closely related to our method, Neural Cleanse (NC) proposed by Wang et al. (2019) detects backdoor attacks and then fine-tunes the classifier using a reverse-engineered trigger. However, NC is not as effective as our method in backdoor mitigation, especially when its fine-tuning is performed with insufficient data (see the last paragraph in Sec. 5.2 for more details). Moreover, NC does not detect backdoor-trigger instances during inference, unlike our method.
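For concreteness, the trigger reverse-engineering step used by NC-style defenses can be sketched as follows. This is a deliberately tiny, numpy-only analogue on a linear "classifier" (all variable names and hyperparameters here are our own, hypothetical choices): optimize an additive perturbation so that clean inputs are pushed to a putative target class, while an L1 penalty keeps the recovered trigger small.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 8))      # toy linear "classifier": 3 classes, 8 features
X = rng.normal(size=(32, 8))     # small clean dataset available to the defender
target = 0                       # putative backdoor target class

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

delta = np.zeros(8)              # additive trigger to be reverse-engineered
lam, lr = 0.01, 0.1
for _ in range(2000):
    p = softmax((X + delta) @ W.T)              # class posteriors with trigger
    # Gradient of the mean cross-entropy to `target` w.r.t. delta,
    # plus a subgradient of the L1 sparsity penalty.
    grad_ce = ((p - np.eye(3)[target]) @ W).mean(axis=0)
    delta -= lr * (grad_ce + lam * np.sign(delta))

preds = softmax((X + delta) @ W.T).argmax(axis=1)
print((preds == target).mean())  # fraction of clean inputs flipped to target
```

For a real DNN the same objective is optimized with autodiff over a mask and pattern in input space, and the class whose recovered trigger has anomalously small norm is flagged as the backdoor target.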



Deep neural networks (DNNs) have shown impressive performance in many applications, but are vulnerable to adversarial attacks. Recently, backdoor (Trojan) attacks have been proposed against DNNs used for image classification (Gu et al. (2019); Chen et al. (2017); Nguyen & Tran (2021); Li et al. (2019); Saha et al. (2020); Li et al. (2021a)), speech recognition (Liu et al. (2018b)), text classification (Dai et al. (2019)), point cloud classification (Xiang et al. (2021)), and even deep regression (Li et al. (2021b)). The attacked DNN classifies any test instance embedded with the attacker's backdoor trigger to the attacker's target class, while maintaining high accuracy on backdoor-free instances. Typically, a backdoor attack is launched by poisoning the training set of the DNN with a few instances that are embedded with the trigger and (mis)labeled to the target class. Most existing works on backdoors focus on improving the stealthiness of attacks (Zhao et al. (2022); Wang et al. (2022b)), their flexibility of launching (Bai et al. (2022); Qi et al. (2022)), or their adaptation to different learning paradigms (Xie et al. (2020); Yao et al. (2019); Wang et al. (2021)), or develop defenses for different practical scenarios (Du et al. (2020); Liu et al. (2019); Dong et al. (2021); Chou et al. (2020); Gao et al. (2019)). However, few works study the basic properties of backdoor attacks. Tran et al. (2018) first observed that triggered instances (labeled to the target class) are separable from clean target-class instances in the internal-layer activations of the poisoned classifier. This property led to defenses that detect and remove triggered instances from the poisoned training set (Chen et al. (2019a); Xiang et al. (2019)). As another example, Zhang et al. (2022) studied the differences between the parameters of clean and attacked classifiers, which inspired a stealthier attack with minimal degradation in accuracy on clean test instances.

Figure 1: Activation distribution of a neuron in the penultimate layer of ResNet-18 trained on CIFAR-10, for instances with and without a backdoor trigger, for (a) a clean classifier and (b) a backdoor-poisoned classifier (with the same trigger). In (c), the distribution alteration in (b) is reversed by our proposed method; most instances with the trigger will thus be correctly classified.
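The per-neuron alteration illustrated in Figure 1 can be quantified by comparing histogram estimates of the neuron's activation distribution with and without the trigger. The sketch below (our own illustrative statistic, a symmetrized KL divergence; not necessarily the statistic used in the paper) returns a large score only when the two distributions differ:

```python
import numpy as np

def neuron_shift(clean_acts, trig_acts, bins=30):
    """Symmetrized KL divergence between histogram estimates of one neuron's
    activation distribution on clean vs. trigger-bearing inputs."""
    lo = min(clean_acts.min(), trig_acts.min())
    hi = max(clean_acts.max(), trig_acts.max())
    p, _ = np.histogram(clean_acts, bins=bins, range=(lo, hi))
    q, _ = np.histogram(trig_acts, bins=bins, range=(lo, hi))
    eps = 1e-8                       # avoid log(0) on empty bins
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(((p - q) * np.log(p / q)).sum())

rng = np.random.default_rng(2)
clean = rng.normal(0.0, 1.0, 2000)
shifted = rng.normal(2.5, 1.0, 2000)   # distribution altered by the trigger
same = rng.normal(0.0, 1.0, 2000)      # fresh draw from the clean distribution
print(neuron_shift(clean, shifted) > neuron_shift(clean, same))
```

Scores of this kind, computed per neuron, make the alteration in Figure 1(b) visible quantitatively rather than only by eye.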

Existing backdoor defenses are deployed either during the DNN's training stage or post-training. The ultimate goal of training-stage defenses is to train an accurate, backdoor-free DNN given the possibly poisoned training set. To achieve this goal, Shen & Sanghavi (2019); Huang et al. (2022); Li et al. (2021d); Chen et al. (2019a); Xiang et al. (2019); Du et al. (2020) either identify a subset of "high-credible" instances for training, or detect and then remove training instances possibly containing a backdoor trigger before training. Post-training defenders, by contrast, are assumed to have no access to the classifier's training set. Many post-training defenses aim to detect whether a given classifier has been backdoor-compromised: Wang et al. (2019); Xiang et al. (2020); Wang et al. (2020); Liu et al. (2019) perform anomaly detection using triggers reverse-engineered on an assumed independent clean dataset, while Xu et al. (2021); Kolouri et al. (2020) train a (binary) meta-classifier on "shadow" classifiers trained with and without attack. However, model-detection defenses cannot mitigate backdoor attacks at test time. Thus, a family of post-training backdoor mitigation approaches has been proposed that fine-tune the classifier on the assumed clean dataset while pruning a subset of neurons possibly associated with the backdoor attack (Liu et al. (2018a); Wu & Wang (2021); Guan et al. (2022); Zheng et al. (2022)), leverage knowledge distillation to preserve only the classification function on clean instances (Li et al. (2021c); Xia et al. (2022)), or solve a min-max problem analogous to adversarial training against evasion attacks (Zeng et al. (2022); Madry et al. (2018)). These methods all aim to enhance the robustness of the classifier against triggers embedded at test time, but are not equipped with a backdoor detector.
The cost of such robustness is usually a significant degradation in the classifier's accuracy on clean instances, especially when the clean data available for fine-tuning are insufficient. Another family of approaches is designed to detect test instances embedded with the trigger, without altering the classifier (Gao et al. (2019); Chou et al. (2020); Doan et al. (2020)). Defenses in this category may help to catch adversarial entities in the act, but they cannot correctly classify the detected backdoor-trigger instances to their original source classes; moreover, they require heavy computation at test time, where rapid inferences are needed. In contrast, our mitigation framework includes both test-time trigger detection and source-class inference, each with very little computation, as will be detailed in Sec. 4.2.

