BACKDOOR MITIGATION BY CORRECTING ACTIVATION DISTRIBUTION ALTERATION

Abstract

Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever a backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is reversed. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, but achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities.

In this paper, we investigate an interesting distribution alteration property of backdoor attacks. In short, the learned backdoor trigger causes a change in the distribution of internal activations for test instances with the trigger, compared to that for backdoor-free instances; and we demonstrate that instances with the trigger are classified to their original source classes once the distribution alteration is reversed. Accordingly, we propose a method to mitigate backdoor attacks (post-training), such that classification accuracy on instances both with and without the trigger will be close to the accuracy of a clean (backdoor-free) classifier. In particular, we propose a practical way to correct the distribution alteration by exploiting reverse-engineered triggers (Wang et al. (2019); Xiang et al. (2020)). Compared with existing approaches that address the same mitigation problem but require tuning of the whole DNN, our method achieves generally better performance without changing any trainable parameters of the DNN.
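As a loose illustration of what "correcting a distribution alteration" can mean in practice, the following hypothetical sketch affine-transforms the internal activations of (suspected) triggered instances so that their per-neuron mean and standard deviation match statistics estimated on clean instances. The paper's actual correction is derived from reverse-engineered triggers; this simple moment-matching transform, and the function name used here, are illustrative assumptions only.

```python
import numpy as np

def correct_activation_distribution(activations, clean_mean, clean_std, eps=1e-8):
    """Illustrative sketch (not the paper's method): shift and rescale
    internal-layer activations (shape: [num_instances, num_neurons]) so that
    their per-neuron mean and standard deviation match statistics estimated
    on clean (backdoor-free) instances."""
    mean = activations.mean(axis=0)
    std = activations.std(axis=0)
    # Standardize the triggered activations, then map them onto the
    # clean-instance statistics.
    standardized = (activations - mean) / (std + eps)
    return standardized * clean_std + clean_mean
```

In this toy form, downstream layers would then see activations whose first two moments agree with the clean distribution; the intuition carried over from the paper is that undoing the alteration lets the classifier recover the original source-class prediction.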

Deep neural networks (DNNs) have shown impressive performance in many applications, but are vulnerable to adversarial attacks. Recently, backdoor (Trojan) attacks have been proposed against DNNs used for image classification (Gu et al. (2019); Chen et al. (2017); Nguyen & Tran (2021); Li et al. (2019); Saha et al. (2020); Li et al. (2021a)), speech recognition (Liu et al. (2018b)), text classification (Dai et al. (2019)), point cloud classification (Xiang et al. (2021)), and even deep regression (Li et al. (2021b)). The attacked DNN classifies to the attacker's target class whenever a test instance is embedded with the attacker's backdoor trigger, while maintaining high accuracy on backdoor-free instances. Typically, a backdoor attack is launched by poisoning the training set of the DNN with a few instances that are embedded with the trigger and (mis)labeled to the target class. Most existing works on backdoors focus on improving the stealthiness of attacks (Zhao et al. (2022); Wang et al. (2022b)), their flexibility for launching (Bai et al. (2022); Qi et al. (2022)), or their adaptation to different learning paradigms (Xie et al. (2020); Yao et al. (2019); Wang et al. (2021)), or else develop defenses for different practical scenarios (Du et al. (2020); Liu et al. (2019); Dong et al. (2021); Chou et al. (2020); Gao et al. (2019)).

However, there are few works studying the basic properties of backdoor attacks. Tran et al. (2018) first observed that triggered instances (labeled to the target class) are separable from clean target-class instances in terms of the internal layer activations of the poisoned classifier. This property led to defenses that detect and remove triggered instances from the poisoned training set (Chen et al. (2019a); Xiang et al. (2019)). As another example, Zhang et al. (2022) studied the differences between the parameters of clean and attacked classifiers, which inspired a stealthier attack with minimal degradation in accuracy on clean test instances.
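The training-set poisoning described above can be sketched as follows. This is a minimal toy illustration, not any particular published attack: the patch trigger, its placement, the poisoning rate, and all function names are assumptions made for exposition.

```python
import numpy as np

def embed_patch_trigger(image, patch, top=0, left=0):
    """Stamp a small pixel patch (the backdoor trigger) onto a copy of the image.
    Toy example: corner placement and patch contents are arbitrary choices."""
    poisoned = image.copy()
    ph, pw = patch.shape[:2]
    poisoned[top:top + ph, left:left + pw] = patch
    return poisoned

def poison_training_set(images, labels, patch, target_class, poison_rate=0.01, rng=None):
    """Embed the trigger in a small random fraction of training instances and
    (mis)label them to the attacker's target class, leaving the rest untouched."""
    rng = np.random.default_rng(rng)
    n_poison = max(1, int(poison_rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i] = embed_patch_trigger(images[i], patch)
        labels[i] = target_class
    return images, labels
```

A classifier trained on such a set can learn the trigger-to-target-class association while its accuracy on clean instances is barely affected, which is what makes these attacks hard to notice from held-out clean accuracy alone.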

