BACKDOOR MITIGATION BY CORRECTING ACTIVATION DISTRIBUTION ALTERATION

Abstract

Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever a backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is reversed. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, but achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities.

1. INTRODUCTION

Deep neural networks (DNNs) have shown impressive performance in many applications, but are vulnerable to adversarial attacks. Recently, backdoor (Trojan) attacks have been proposed against DNNs used for image classification (Gu et al. (2019); Chen et al. (2017); Nguyen & Tran (2021); Li et al. (2019); Saha et al. (2020); Li et al. (2021a)), speech recognition (Liu et al. (2018b)), text classification (Dai et al. (2019)), point cloud classification (Xiang et al. (2021)), and even deep regression (Li et al. (2021b)). The attacked DNN will classify to the attacker's target class whenever a test instance is embedded with the attacker's backdoor trigger, while maintaining high accuracy on backdoor-free instances. Typically, a backdoor attack is launched by poisoning the training set of the DNN with a few instances embedded with the trigger and (mis)labeled to the target class.

Most existing works on backdoors either focus on improving the stealthiness of attacks (Zhao et al. (2022); Wang et al. (2022b)), their flexibility for launching (Bai et al. (2022); Qi et al. (2022)), their adaptation for different learning paradigms (Xie et al. (2020); Yao et al. (2019); Wang et al. (2021)), or develop defenses for different practical scenarios (Du et al. (2020); Liu et al. (2019); Dong et al. (2021); Chou et al. (2020); Gao et al. (2019)). However, there are few works studying the basic properties of backdoor attacks. Tran et al. (2018) first observed that triggered instances (labeled to the target class) are separable from clean target class instances in terms of internal layer activations of the poisoned classifier. This property led to defenses that detect and remove triggered instances from the poisoned training set (Chen et al. (2019a); Xiang et al. (2019)). As another example, Zhang et al. (2022) studied the differences between the parameters of clean and attacked classifiers, which inspired a stealthier attack with minimum degradation in accuracy on clean test instances.

In this paper, we investigate an interesting distribution alteration property of backdoor attacks. In short, the learned backdoor trigger causes a change in the distribution of internal activations for test instances with the trigger, compared to that for backdoor-free instances; and we demonstrate that instances with the trigger are classified to their original source class after the distribution alteration is reversed. Accordingly, we propose a method to mitigate backdoor attacks (post-training), such that classification accuracy on instances both with and without the trigger will be close to the accuracy of a clean (backdoor-free) classifier. In particular, we propose a practical way to correct the distribution alteration by exploiting reverse-engineered triggers (Wang et al. (2019); Xiang et al. (2020)). Compared with existing approaches that address the same mitigation problem, but which require tuning of the whole DNN, our method achieves generally better performance without changing any original parameters of the DNN. Moreover, while most mitigation approaches are designed to correctly classify backdoor-trigger instances blindly without detection, our method is able to detect those backdoor-trigger instances efficiently.

Figure 1 (caption fragment): ... backdoor-poisoned classifier (with the same trigger). In (c), the distribution alteration in (b) is reversed by our proposed method; most instances with the trigger will thus be correctly classified.

Our main contributions in this paper are twofold:
1) We discover and analyze the activation distribution alteration property of backdoor attacks and its relation to accuracy in classifying backdoor-triggered instances.
2) We propose a post-training backdoor mitigation approach based on our findings, which outperforms several state-of-the-art approaches for a variety of datasets and backdoor attack settings.

2. RELATED WORK

... 2018)). These methods all aim to enhance the robustness of the classifier against triggers embedded at test time, but are not implemented with a backdoor detector. The cost of such robustness is usually a significant degradation in the classifier's accuracy on clean instances, especially when the clean data for fine-tuning are insufficient. Another family of approaches is designed to detect test instances embedded with the trigger, without altering the classifier (Gao et al. (2019); Chou et al. (2020); Doan et al. (2020)). Defenses in this category may help to catch the adversarial entities in the act, but they cannot correctly classify the detected backdoor-trigger instances to their original source classes. Moreover, existing methods in this category require heavy computation at test time (where rapid inferences are needed). In contrast, our mitigation framework includes both test-time trigger detection and source class inference, both with very little computation, as will be detailed in Sec. 4.2.

Closely related to our method, Neural Cleanse (NC) proposed by Wang et al. (2019) detects backdoor attacks and then fine-tunes the classifier using a reverse-engineered trigger. However, NC is not as effective as our method in backdoor mitigation, especially when its fine-tuning is performed with insufficient data (see the last paragraph in Sec. 5.2 for more details). Moreover, NC does not detect backdoor-trigger instances during inference, unlike our method.

3. DISTRIBUTION ALTERATION PROPERTY OF BACKDOOR ATTACKS

In this section, we first present the activation distribution alteration property of backdoor attacks. Then, for a simplified setting, we analytically show how closing the "gap" between the clean-instance and backdoor-trigger instance distributions improves the accuracy in classifying backdoor-trigger instances; this will guide the design of our backdoor mitigation approach in Sec. 4.

Property 3.1. (Activation Distribution Alteration) For a successful backdoor attack, two different backdoor-trigger instances will induce perturbations to the activations of an internal DNN layer that are in a similar direction. Thus, there is effectively a "shift" in the internal layer activation distribution for backdoor-trigger instances, compared to that for backdoor-free instances.

Distribution alteration can be easily visualized empirically. Consider a set of clean instances from CIFAR-10 (Krizhevsky & Hinton (2009)) and the same set of instances but with the backdoor trigger used by Gu et al. (2019) embedded in each instance. For a ResNet-18 (He et al. (2016)) classifier that was successfully attacked using this trigger, there is a divergence between the distributions of the internal layer activations induced by these two sets of instances. This is shown in Fig. 1b for a neuron in the penultimate layer as an example. In comparison, for a clean classifier (not backdoor-attacked), the divergence between the two distributions is almost negligible, as shown in Fig. 1a.

Based on these visualizations, we ask the following question: Suppose the distribution alteration is reversed for each neuron, e.g. by applying a transformation to the internal activations of the triggered instances, so that the transformed distribution now closely agrees with the distribution for clean (without the backdoor trigger) instances (see Fig. 1c). Then, following this compensation, will the classifier accurately predict the true class of origin for these backdoor-trigger instances?

Here, we investigate this problem in a simplified binary classification setting similar to the one considered by Ilyas et al. (2019). As used in Apdx. A.1, clean instances from class '-1' follow $(X \mid Y=-1) \sim \mathcal{N}(-\mu, \sigma^2 I)$ and triggered instances follow $X_b \sim \mathcal{N}(\mu_b, \sigma_b^2 I)$ with $\mu_b = -\mu + \epsilon$. Here, class '-1' is automatically the source class of $X_b$ since there are only two classes. With backdoor poisoning, a multi-layer perceptron (MLP) classifier is trained with one hidden layer of $J$ nodes, a batch normalization (BN) layer (Ioffe & Szegedy (2015)) followed by linear activation, and two output nodes with functions $f_-: \mathbb{R}^d \to \mathbb{R}$ and $f_+: \mathbb{R}^d \to \mathbb{R}$ corresponding to classes '-1' and '+1' respectively. An instance $x$ will be classified to class '-1' if $f_-(x) > f_+(x)$; else it will be classified to '+1'.

Definition 3.1. ($\eta$-erroneous classifier) A classifier is said to be $\eta$-erroneous if the error rate for each class is upper bounded by $\eta$.

Definition 3.2. ($\psi$-successful attack) A backdoor attack is said to be $\psi$-successful if its attack success rate (ASR), i.e. the probability of triggered instances being (mis)classified to the attacker's target class (Li et al. (2022b)), is at least $\psi$; in our case, this means that $P[f_+(X_b) > f_-(X_b)] \geq \psi$.

Given the settings above, for an arbitrary input $x$, the activation of the $j$-th node ($j \in \{1, \cdots, J\}$) (after BN with trained parameters $\gamma_j$ and $\beta_j$), with weight vector $w_j$ in the hidden layer, is:

$$a_j(x) = \frac{w_j^\top x - m_j}{\sqrt{v_j}} \gamma_j + \beta_j. \tag{1}$$

An easy way to eliminate the divergence between the activation distribution for triggered instances $X_b$ and that for clean source class instances is to create a classifier for triggered instances $X_b$ by replacing $a_j$ in Eq. (1) with $a_j^*(x) = (w_j^\top x - m_j^*)\gamma_j / v_j^* + \beta_j$ for each node $j$, where (see Apdx. A.1 for derivation):

$$m_j^* = \frac{\sigma_b}{\sigma} m_j + \Big(\frac{\sigma_b}{\sigma} - 1\Big) w_j^\top \mu + w_j^\top \epsilon \quad \text{and} \quad v_j^* = \frac{\sigma_b}{\sigma} v_j. \tag{2}$$

The proof of Thm. 3.1 is given in Apdx. A.2. Note that the assumptions for Thm. 3.1 are very mild and reasonable. For example, $\eta < 1/2$ is a minimum requirement for the classifier and $\psi > 1/2$ is a minimum requirement for a successful backdoor attack. Moreover, $\sigma_b \leq \sigma$ generally holds empirically since trigger embedding (e.g., consider a patch attack) typically reduces the variance of source class instances (while additive attacks do not change the variance). Also note that $\alpha$ merely gives a way of quantifying distribution divergence for the purpose of analysis.

According to these results, the core part of our proposed backdoor mitigation approach should be to find a modified classifier $g(\cdot|\Theta)$ by minimizing (e.g., using sub-gradient methods) a measure of distribution divergence over a well-chosen set of parameters, $\Theta$. This approach is described next.
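The moment matching behind Eq. (2) can be checked on a single hidden node. The sketch below is illustrative only and not taken from the paper's implementation: all function and variable names are hypothetical, the projections $w_j^\top \mu$, $w_j^\top \epsilon$ and the norm of $w_j$ are passed in as scalars, and the BN denominator is handled directly as a scale `scale` (an assumption of this sketch) rather than through $v_j$. The corrected mean uses the $m_j^*$ expression of Eq. (2).

```python
def activation_moments(proj_mean, proj_std, m, scale, gamma, beta):
    """Mean and std of ((s - m) / scale) * gamma + beta for s ~ N(proj_mean, proj_std^2)."""
    mean = (proj_mean - m) / scale * gamma + beta
    std = abs(gamma) * proj_std / scale
    return mean, std


def corrected_bn_stats(m, scale, w_mu, w_eps, sigma, sigma_b):
    """Corrected BN mean (as in Eq. (2)) and corrected BN denominator (sketch assumption)."""
    ratio = sigma_b / sigma
    m_star = ratio * m + (ratio - 1.0) * w_mu + w_eps
    scale_star = ratio * scale
    return m_star, scale_star


# Hypothetical values for one node.
m, scale, gamma, beta = 0.3, 1.7, 0.8, -0.2
w_mu, w_eps, w_norm = 1.1, 2.5, 0.9      # w^T mu, w^T eps, ||w||
sigma, sigma_b = 1.0, 0.6

# Clean source class: w^T X ~ N(-w^T mu, sigma^2 ||w||^2).
clean = activation_moments(-w_mu, sigma * w_norm, m, scale, gamma, beta)
# Triggered instances through the unmodified node: w^T X_b ~ N(-w^T mu + w^T eps, sigma_b^2 ||w||^2).
shifted = activation_moments(-w_mu + w_eps, sigma_b * w_norm, m, scale, gamma, beta)
# Triggered instances through the corrected node.
m_star, scale_star = corrected_bn_stats(m, scale, w_mu, w_eps, sigma, sigma_b)
fixed = activation_moments(-w_mu + w_eps, sigma_b * w_norm, m_star, scale_star, gamma, beta)
```

With these values, `shifted` differs clearly from `clean`, while `fixed` reproduces the clean mean and standard deviation.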

4. REVERSING DISTRIBUTION ALTERATION FOR BACKDOOR MITIGATION

4.1. PROBLEM DESCRIPTION

Threat model. For input space $\mathcal{X}$ and label space $\mathcal{C}$, a classifier that has been successfully backdoor-attacked will predict the attacker's target class $t^* \in \mathcal{C}$ when a test instance $x \in \mathcal{X}$ is embedded with the backdoor trigger using an incorporation function $\Delta: \mathcal{X} \to \mathcal{X}$. In addition to this "all-to-one" setting, we also consider the "all-to-all" setting where a test instance from any class $c \in \mathcal{C}$ will be (mis)classified to class $(c + 1) \bmod |\mathcal{C}|$ when it is embedded with the trigger (Gu et al. (2019)).

Defender's goal. Given a trained classifier $f: \mathcal{X} \to \mathcal{C}$ that may possibly be attacked, the defender aims to mitigate possible attacks by producing a mapping $\hat{f}: \mathcal{X} \to \mathcal{C}$ which (a) has high accuracy in classifying clean instances, and (b) when there is a backdoor attack, classifies triggered instances to their original source class, as though there is no trigger embedded, i.e., achieves a high SIA.

Defender's assumptions. We consider a post-training scenario where the defender has no access to the training set of the classifier. The defender does possess an independent clean dataset, but this dataset is too small to train an accurate classifier from scratch, and even too small to effectively fine-tune the full set of classifier parameters (Liu et al. (2018a); Zeng et al. (2022); Wang et al. (2019)). The defender has white-box access to the classifier, but does not know whether it has been attacked and, if so, does not know the trigger pattern that was used, i.e., the defense is unsupervised.
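As a small illustration of the two attack settings in the threat model, the label a triggered instance is pushed toward can be written as a function of its source class. This is only a sketch; the function name and the string labels for the settings are hypothetical.

```python
def attack_target(source_class, num_classes, setting, target_class=None):
    """Class that a triggered instance from `source_class` is (mis)classified to."""
    if setting == "all-to-one":
        # Every source class is mapped to the single target class t*.
        return target_class
    if setting == "all-to-all":
        # Class c is mapped to (c + 1) mod |C|.
        return (source_class + 1) % num_classes
    raise ValueError("unknown attack setting: " + setting)
```

For a 10-class problem, the all-to-all setting sends class 9 to class 0, while the all-to-one setting with target class 9 sends every source class to 9.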

4.2. METHOD

Based on Thm. 3.1, it would seem that a good mitigation approach involves modifying the classifier $f$, i.e. creating a new classifier $g(\cdot|\Theta): \mathcal{X} \to \mathcal{C}$ from $f$ by applying a transformation function $h_{j,l}(\cdot|\theta_{j,l}): \mathbb{R} \to \mathbb{R}$ to the activation of each neuron $j \in \{1, \cdots, J_l\}$ in each layer $l \in \{1, \cdots, L\}$. The parameters $\Theta = \{\theta_{j,l}\}$ should be jointly chosen so as to minimize the aggregation (e.g. sum) of the divergences between the distributions $q_{j,l}(\theta_{<l} \cup \theta_{j,l})$ obtained using $h_{j,l}(\hat{z}_{j,l}(\Delta(X)|\theta_{<l})|\theta_{j,l})$ and the target distributions $p_{j,l}$ for $z_{j,l}(X)$, for all $j, l$, where $X$ follows the clean data distribution, i.e.:

$$\underset{\Theta=\{\theta_{j,l}\}}{\text{minimize}} \; \sum_{j,l} D_k\big(p_{j,l} \,\|\, q_{j,l}(\theta_{<l} \cup \theta_{j,l})\big) \tag{3}$$

where: $z_{j,l}: \mathcal{X} \to \mathbb{R}$ and $\hat{z}_{j,l}: \mathcal{X} \to \mathbb{R}$ are activation functions for neuron $j$ in layer $l$ for classifiers $f$ and $g(\cdot|\Theta)$ respectively; $\theta_{<l} = \{\theta_{j,l'} \mid l' < l\}$ represents all transformation parameters prior to layer $l$; and $D_k(p\|q) := E_q[k(p/q)]$, for a convex function $k: [0, \infty) \to \mathbb{R}$ satisfying $k(1) = 0$, belongs to the family of f-divergences for any distributions $p$ and $q$ (Ali & Silvey (1966)). However, in practice, we will face the following challenges.

Challenge 1: The defender does not know a priori whether there is an attack. When there is no attack, no distribution correction should be needed. Moreover, when there is an attack, while the classifier $g(\cdot|\Theta)$ with optimal transformation functions for neuron activation will achieve a high SIA on triggered instances, its accuracy on clean instances (especially those not from the backdoor target class) may be degraded.

Challenge 2: If there is an attack, the attack setting, i.e. all-to-one or all-to-all, and the ground-truth backdoor trigger $\Delta$ are both unknown to the defender.

Challenge 3: The density form for $z_{j,l}(\Delta(X))$ may get altered by the trigger $\Delta$ and will likely be different from the density form for $z_{j,l}(X)$; moreover, both will likely be non-Gaussian. Thus, Eq. (3) cannot be easily minimized, e.g., simply by matching the mean and variance.

For Challenges 1 and 2, we leverage existing post-training backdoor detection approaches to infer whether the classifier $f$ is backdoor-attacked and, if so, the associated target classes (Wang et al. (2019); Chen et al. (2019b); Liu et al. (2019)). These detectors, following the same assumptions as in Sec. 4.1, reverse-engineer a trigger for each putative target class on the clean dataset possessed by the defender. Then anomaly detection is performed on statistics derived from these reverse-engineered triggers, e.g. the estimated size for patch triggers used by Wang et al. (2019). Here, the detector in our framework differs from most existing ones in order to cover a broad range of attack settings, including all-to-one and all-to-all attacks. We first reverse-engineer a trigger by solving an optimization problem defined on the clean set to get a detection statistic for each ordered putative class pair $(s, t) \in \mathcal{C} \times \mathcal{C}$. For the Xiang et al. (2020) method, this statistic is (the reciprocal of) the estimated perturbation size inducing high (mis)classifications from $s$ to $t$. For Wang et al. (2019), it is the estimated patch size inducing high (mis)classifications from $s$ to $t$. Then we apply the anomaly detection approach in Wang et al. (2019), based on the MAD criterion (Hampel (1974)), to all the obtained statistics to find all the outlier statistics. We denote the set of detected class pairs associated with these outlier statistics as $\mathcal{P}$, and denote $\mathcal{T} = \{t \in \mathcal{C} \mid \exists s \in \mathcal{C} \text{ s.t. } (s, t) \in \mathcal{P}\}$ as the set of detected target classes. For each $t \in \mathcal{T}$, we (re-)estimate a trigger $\hat{\Delta}_t$ (as a surrogate for the true backdoor trigger, which is unknown) using clean instances from all detected source classes $\hat{S}(t) = \{s \in \mathcal{C} \mid (s, t) \in \mathcal{P}\}$.
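The pair-wise detection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's detector: the statistics are assumed to be oriented so that larger values are more suspicious, the consistency constant 1.4826 and the threshold 2.0 are conventional choices for MAD-based outlier tests, and all names are hypothetical.

```python
from statistics import median


def mad_outlier_pairs(stats, threshold=2.0):
    """stats maps an ordered class pair (s, t) to its detection statistic.

    Returns the set P of pairs whose statistic is an upper outlier
    under a median-absolute-deviation (MAD) criterion.
    """
    values = list(stats.values())
    med = median(values)
    mad = 1.4826 * median([abs(v - med) for v in values])
    if mad == 0:
        return set()
    return {pair for pair, v in stats.items() if (v - med) / mad > threshold}


def targets_and_sources(pairs):
    """Build T (detected target classes) and S_hat(t) (detected sources per target)."""
    sources = {}
    for s, t in pairs:
        sources.setdefault(t, set()).add(s)
    return set(sources), sources


# Hypothetical statistics for a 4-class problem in which class 3 is the backdoor target.
example_stats = {
    (0, 1): 0.9, (0, 2): 1.0, (1, 0): 1.1, (1, 2): 0.95, (2, 0): 1.05,
    (2, 1): 1.0, (3, 0): 0.9, (3, 1): 1.1, (3, 2): 1.0,
    (0, 3): 10.0, (1, 3): 10.0, (2, 3): 10.0,
}
P = mad_outlier_pairs(example_stats)
T, S_hat = targets_and_sources(P)
```

An empty `P` corresponds to the classifier being deemed attack-free; otherwise a trigger would be re-estimated for each class in `T` using clean instances from `S_hat[t]`.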
Then, for each detected target class $t \in \mathcal{T}$, we construct a classifier $g(\cdot|\Theta_t)$ by solving the distribution divergence minimization problem using its (re-)estimated $\hat{\Delta}_t$. For any test input $x \in \mathcal{X}$, if classifier $f$ is deemed attack-free, i.e. $\mathcal{P} = \emptyset$, the classification output under our mitigation framework will be $\hat{f}(x) = f(x)$. Otherwise, if $f(x) \in \mathcal{C} \setminus \mathcal{T}$, we trust the class decision and set $\hat{f}(x) = f(x)$, both because $x$ is unlikely to possess a trigger and because a successful attack should not degrade the classifier's accuracy on clean instances. However, if $f(x) = t \in \mathcal{T}$, there are two main possibilities: 1) $x$ is a clean instance truly from class $t$; 2) $x$ is classified to class $t$ due to the presence of the trigger. To distinguish these two cases, we feed $x$ to the optimized $g(\cdot|\Theta_t)$. If $g(x|\Theta_t) \neq f(x)$, $x$ likely contains a trigger, and thus we should set $\hat{f}(x) = g(x|\Theta_t)$, which is likely the original source class of $x$ based on our theoretical results. The outline of our mitigation framework is summarized in Fig. 2.

Note that in the test-time inference procedure above, the major (additional) computation for both backdoor-trigger instance detection and source class inference is a forward propagation for feeding $x$ to $g(\cdot|\Theta_t)$, which is comparable to the computation required for classification using $f$. Moreover, such additional computation occurs only if an attack is detected and $f(x) = t$; thus, our test-time inference is very efficient.

Now the remaining problem is to address Challenge 3, which is critical to the estimation of $\Theta_t$ using the reverse-engineered trigger $\hat{\Delta}_t$ for each detected target class $t \in \mathcal{T}$. For simplicity, we will drop the subscript $t$ below without loss of generality. Our main goals are: (a) specifying the structure of the transformation function $h_{j,l}$ with its associated parameters $\theta_{j,l}$, (b) empirical estimation of the distribution divergence in Eq. (3) using a clean dataset (i.e. the subset of clean instances from classes in $\hat{S}(t)$ for each detected class $t$), and (c) choosing the convex function $k$ for the divergence form.

For goal (a), we consider the following transformation function with parameters $\theta_{j,l} = \{\mu_{j,l}, \sigma_{j,l}, \upsilon_{j,l}, \omega_{j,l}\}$:

$$h_{j,l}(z) = \max\Big\{\min\Big\{\frac{z - \mu_{j,l}}{\sigma_{j,l}}, \omega_{j,l}\Big\}, \upsilon_{j,l}\Big\} \tag{4}$$

where $\mu_{j,l}$ and $\sigma_{j,l}$ specify the location and scale of the activation distribution, respectively, while $\upsilon_{j,l}$ and $\omega_{j,l}$ control the shape of the tail of the distribution.

For goal (b), we quantize the real line into $M$ intervals $I_1 = (-\infty, b_1)$, $I_2 = [b_1, b_2)$, $\cdots$, $I_M = [b_{M-1}, \infty)$, for $M$ sufficiently large. Then the distribution divergence in Eq. (3) for each node $j$ and layer $l$ is computed on discrete distributions $\hat{p}_{j,l}$ and $\hat{q}_{j,l}$ over these intervals. Specifically, the discrete distributions are estimated using a subset $D_t$ of instances from classes $\hat{S}(t)$, with the probabilities for interval $I_i$ computed by:

$$\hat{p}^{(i)}_{j,l} = \frac{1}{|D_t|} \sum_{x \in D_t} \mathbb{1}[z_{j,l}(x) \in I_i] \quad \text{and} \quad \hat{q}^{(i)}_{j,l} = \frac{1}{|D_t|} \sum_{x \in D_t} \mathbb{1}[h_{j,l}(\hat{z}_{j,l}(\hat{\Delta}_t(x)|\theta_{<l})|\theta_{j,l}) \in I_i]. \tag{5}$$

To ensure that the distribution divergence is differentiable with respect to the parameters, such that it can be minimized using (e.g.) gradient descent, we approximate the non-differentiable indicator function $\mathbb{1}[\cdot]$ in Eq. (5) using differentiable functions such as the sigmoid, i.e. we redefine:

$$\mathbb{1}[z \in I_i] = \mathrm{sigmoid}(\tau(z - b_{i-1})) - \mathrm{sigmoid}(\tau(z - b_i)) \tag{6}$$

where $\tau$ is a scale factor controlling the error of approximation. For $I_1$ and $I_M$, which have semi-infinite support, we use a single sigmoid in Eq. (6). The choice of the intervals and $\tau$ is not critical to the performance, as long as the length of the finite intervals is sufficiently small, as will be shown in Tab. 4 in Sec. 5.

Finally, for goal (c), we consider several different divergence forms including the total variation (TV) divergence with $k(r) = |r - 1|/2$, the Jensen-Shannon (JS) divergence with $k(r) = r \log \frac{2r}{r+1} + \log \frac{2}{r+1}$, and the Kullback-Leibler (KL) divergence with $k(r) = r \log r$. The choice of the divergence form is also not critical to the mitigation performance (see Apdx. E).
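To make goals (a) and (c) concrete, the following sketch implements the clipping transformation of Eq. (4) and the discrete divergence $D_k(p\|q) = \sum_i q_i\, k(p_i/q_i)$ for the three choices of $k$ listed above. It is a plain-Python illustration rather than the authors' code: the names are hypothetical, and the small constant `eps` that guards against empty bins is an assumption of this sketch.

```python
import math


def clip_transform(z, mu, sigma, lower, upper):
    """Eq. (4): shift and scale the activation, then clip it to [lower, upper]."""
    return max(min((z - mu) / sigma, upper), lower)


def k_tv(r):
    return abs(r - 1.0) / 2.0


def k_js(r):
    first = r * math.log(2.0 * r / (r + 1.0)) if r > 0 else 0.0
    return first + math.log(2.0 / (r + 1.0))


def k_kl(r):
    return r * math.log(r) if r > 0 else 0.0


K_FUNCTIONS = {"tv": k_tv, "js": k_js, "kl": k_kl}


def f_divergence(p, q, form="tv", eps=1e-12):
    """D_k(p || q) = sum_i q_i * k(p_i / q_i) over discrete distributions p and q."""
    k = K_FUNCTIONS[form]
    total = 0.0
    for p_i, q_i in zip(p, q):
        q_safe = max(q_i, eps)  # avoid division by zero for empty bins
        total += q_safe * k(p_i / q_safe)
    return total
```

With $k(r) = |r-1|/2$ this reduces to half the $\ell_1$ distance between the two histograms, and with $k(r) = r \log r$ it reduces to $\sum_i p_i \log(p_i/q_i)$.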

Datasets:

Our main experiments are conducted on the benchmark CIFAR-10 dataset, which contains 60,000 32 × 32 color images from 10 classes, with 5,000 images per class for training and 1,000 images per class for testing (Krizhevsky & Hinton (2009)). We also show the effectiveness of our proposed mitigation framework on other benchmark datasets including GTSRB (Houben et al. (2013)), CIFAR-100 (Krizhevsky & Hinton (2009)), ImageNette (Howard (2020)), and TinyImageNet. Details of these datasets can be found in Apdx. B.1. Data allocation in our experiments strictly follows the assumptions in Sec. 4.1. For each dataset, we randomly sample 10% of the test set to form the small, clean dataset $D_{\text{Defense}}$ assumed for the defender. The remaining test instances, denoted by $D_{\text{Test}}$, are reserved for performance evaluation.

Attack settings:

In this paper, we consider standard backdoor attacks launched by poisoning the training set of the classifier (Gu et al. (2019); Chen et al. (2017)). In particular, we consider both the all-to-one (A2O) attacks and the all-to-all (A2A) attacks in our main experiments on CIFAR-10. For A2O attacks on CIFAR-10, we arbitrarily choose class 9 as the target class; while for A2A attacks, as described in Sec. 4.1, triggered instances from any class $c \in \mathcal{C}$ are supposed to be (mis)classified to class $(c + 1) \bmod |\mathcal{C}|$. For each attack setting, we consider the following triggers: 1) a 3 × 3 random patch (BadNet) with a randomly selected location (fixed for all triggered images for each attack) used in Gu et al. (2019); 2) an additive perturbation (with size 2/255) resembling a chessboard (CB) used in Xiang et al. (2020); 3) a single pixel (SP) perturbed by 75/255 with a randomly selected location (fixed for all triggered images for each attack) used by Tran et al. (2018); 4) invisible triggers generated with $l_0$ and $l_2$ norm constraints ($l_0$ inv and $l_2$ inv respectively) proposed by Li et al. (2021a); 5) a warping-based trigger (WaNet) proposed by Nguyen & Tran (2021) (see Apdx. B.3).

Table 2 (caption fragment): ... (2) the perturbation size under all-to-one CB attack.

... to optimize the transformation functions using learning rate 0.01 for 30 epochs.
If a neuron is followed by a BN layer (which is very common), instead of applying an additional transformation function $h_{j,l}$, we treat the mean and standard deviation of BN as the parameters $\mu_{j,l}$ and $\sigma_{j,l}$ associated with $h_{j,l}$ respectively. Here, we only show results for BNA with the total variation divergence. Results for KL-divergence and JS-divergence are deferred to Apdx. E. To compute the divergence, we use the "interval trick" (Eq. (5)) to obtain the discrete empirical distribution. For simplicity, we let all finite intervals, $I_i = [b_{i-1}, b_i)$, $i = 1, \cdots, M$, have the same length $\Delta b = 0.1$. For each neuron, we set $b_{\min}$ and $b_{\max}$ as the minimum and maximum activations, respectively, when feeding in clean instances from $D_{\text{Defense}}$ to the poisoned classifier $f$. Then, the number of intervals is $M = \lceil (b_{\max} - b_{\min})/\Delta b \rceil$, and all intervals can be specified by $b_0 = b_{\min}$ and $b_i = b_{i-1} + \Delta b$. Finally, the scale factor in Eq. (6) is set to $\tau = 150$, which is obtained by line search to minimize the total variation between the "soft" distribution and the empirical one. In fact, the choices for $\Delta b$ and $\tau$ (over reasonable ranges) have little impact on our mitigation performance, as shown in Tab. 4.

In Tab. 1, we show the ASR, ACC, and SIA for our BNA compared with the other five methods (which are all tuning-based) for attacks on CIFAR-10. Each metric is averaged over the five attacks created for each trigger type and attack setting, with the highest ACC and SIA, and the lowest ASR, in bold. Although the five tuning-based methods we compare here can effectively deactivate the backdoor attacks (i.e., significantly reduce ASRs), there is a clear drop (3%-15%) in both ACC and SIA. This is possibly due to tuning many DNN parameters using very limited data. Moreover, we found these tuning-based methods are sensitive to the choices of hyper-parameters, such as the learning rate. For ANP with neuron pruning, the performance is acceptable only for A2O with the BadNet trigger. One possible reason is that invisible, perturbation-based triggers affect most neurons only moderately (Wang et al. (2022a)); thus, pruning a small number of neurons cannot mitigate the attack. In contrast, our method successfully mitigates all these backdoor attacks (with generally the best ACC and ASR compared with the others) regardless of the trigger type and attack setting. Notably, since the purpose of BNA's divergence minimization is to maximize the SIA, it unsurprisingly achieves the best SIA with a clear margin over all other methods, in all cases (the corresponding distribution divergences are shown in Tab. 9). We also tune the poisoning ratio and perturbation size used in A2O CB attacks, and the performance of our BNA slightly declines as the attack is strengthened, as shown in Tab. 2. However, it still outperforms the other methods (see Tab. 11 in Apdx. F).

Why can tuning-based methods like NC not achieve an SIA as high as our BNA, which does not alter any parameters? Note that NC tunes the classifier using instances embedded with the estimated trigger but without label flipping. This is equivalent to minimizing the divergence between internal activation distributions for clean and triggered instances, but with the parameters changed. Even for an optimal (zero) divergence, the best achievable SIA of NC is still upper-bounded by the ACC of the classifier after tuning, which usually drops due to the data insufficiency. By contrast, the reference distribution for our divergence minimization is obtained by feeding clean instances to the poisoned classifier without changing its parameters; thus, it is a "better" reference with a higher upper-bound ACC.

Results of our BNA on other datasets are shown in Tab. 3. The ACC for DNNs trained without attack for GTSRB, CIFAR-100, ImageNette, and TinyImageNet are 0.9567, 0.6926, 0.8726, and 0.5224, respectively; while ACC, ASR, and SIA for attacked DNNs are shown in the row "Vanilla". We apply our BNA on the poisoned DNNs, with the same settings as for CIFAR-10, which significantly reduces ASR (to less than 1.3% in all cases), with uniformly high SIA and ACC.
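The interval construction and the sigmoid relaxation described above can be sketched for a single neuron as follows. This is an illustrative plain-Python version with hypothetical names; it covers only the finite intervals between $b_{\min}$ and $b_{\max}$ and omits the two semi-infinite end intervals for brevity.

```python
import math


def make_edges(b_min, b_max, delta_b=0.1):
    """Interval edges b_0 = b_min, b_i = b_{i-1} + delta_b, with M = ceil((b_max - b_min) / delta_b)."""
    num_intervals = math.ceil((b_max - b_min) / delta_b)
    return [b_min + i * delta_b for i in range(num_intervals + 1)]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_membership(z, lo, hi, tau):
    """Eq. (6): smooth surrogate for the indicator 1[z in [lo, hi)]."""
    return sigmoid(tau * (z - lo)) - sigmoid(tau * (z - hi))


def histogram(activations, edges, tau=None):
    """Empirical (tau=None) or soft (tau given) distribution over the finite intervals."""
    probs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if tau is None:
            mass = sum(1.0 for z in activations if lo <= z < hi)
        else:
            mass = sum(soft_membership(z, lo, hi, tau) for z in activations)
        probs.append(mass / len(activations))
    return probs


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

A line search over $\tau$, as mentioned in the text, would compare `total_variation(histogram(acts, edges), histogram(acts, edges, tau))` across candidate values and keep the smallest; larger $\tau$ makes the soft histogram closer to the empirical one, at the cost of steeper gradients.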

5.3. TEST-TIME BACKDOOR-TRIGGER INSTANCES DETECTION

Unlike tuning-based backdoor mitigation approaches, our BNA can also detect backdoor-trigger instances at test time, as described in Sec. 4.2 and shown in Fig. 2. Here, we evaluate the accuracy of our test-time detection compared with a state-of-the-art detector named STRIP (Gao et al. (2019)). For any input image during inference, STRIP blends it with clean images possessed by the defender. The blended image is fed into the poisoned DNN, with an entropy calculated on the output posteriors. If the entropy is lower than a prescribed detection threshold, the input is deemed to be embedded with the trigger. Here, we set the detection threshold at 15% FPR for STRIP, which achieves a generally good trade-off between TPR and FPR. In contrast, our BNA does not need to set a detection threshold. In Tab. 5, we show the True Positive Rate (TPR, i.e., the fraction of backdoor-trigger images correctly detected) and the False Positive Rate (FPR, i.e., the fraction of clean test images from the backdoor target class(es) that are falsely detected) for both methods. Although STRIP performs well on A2O attacks for some trigger types, e.g., BadNet, $l_0$ inv, and $l_2$ inv, its TPR drops drastically on attacks using human-imperceptible triggers, especially the WaNet attacks. Moreover, it does not perform well on any of the A2A attacks, with a largest TPR of only 0.5272. By contrast, our BNA is effective for all these attacks; it detects almost all the backdoor-trigger images, with FPRs comparable to STRIP.

Table 5 (caption): TPR and FPR for our BNA, compared with STRIP, against all attacks created on CIFAR-10.
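The test-time rule from Sec. 4.2 and the TPR/FPR definitions used here can be sketched as below. The classifiers are passed in as plain callables and all names are hypothetical; this is a schematic of the decision logic only, not the authors' implementation.

```python
def mitigated_predict(x, f, g_by_target, detected_targets):
    """Return (predicted_label, flagged_as_triggered) for a single input x.

    f: the (possibly poisoned) classifier, a callable x -> label.
    g_by_target: dict mapping each detected target class t to g(.|Theta_t).
    detected_targets: the set T of detected target classes (empty if no attack is detected).
    """
    label = f(x)
    if not detected_targets or label not in detected_targets:
        return label, False           # trust f outside the detected target classes
    corrected = g_by_target[label](x)
    if corrected != label:
        return corrected, True        # disagreement: x likely carries the trigger
    return label, False               # agreement: x is likely a clean target-class instance


def detection_rates(flags_triggered, flags_clean_target):
    """TPR over triggered inputs and FPR over clean inputs from the target class(es)."""
    tpr = sum(flags_triggered) / len(flags_triggered)
    fpr = sum(flags_clean_target) / len(flags_clean_target)
    return tpr, fpr


# Toy stand-ins: inputs are dicts, f is fooled by the trigger, g recovers the source class.
toy_f = lambda x: 9 if x["trigger"] else x["cls"]
toy_g = {9: lambda x: x["cls"]}
```

Because no score is thresholded, the only extra cost per flagged-candidate input is one forward pass through `g_by_target[label]`.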

5.4. MITIGATION PERFORMANCE AGAINST ADAPTIVE ATTACKS

A recent backdoor attack proposed by Doan et al. (2021) minimizes a metric similar to our activation distribution alteration, in order to achieve better stealthiness. This attack can be viewed as an adaptive attack against our mitigation defense since the trained classifier will be more sensitive to even a smaller distribution divergence than for ordinary backdoor attacks. Nevertheless, our method successfully mitigates this attack. In our experiment on CIFAR-10, the average distribution total variation divergence over all neurons is reduced from 8067 to 2789. Accordingly, the ACC/ASR before and after mitigation are 0.9162/0.9978 and 0.8906/0.0072 respectively, with SIA 0.8496.

6. CONCLUSION

In this paper, we revealed an activation distribution alteration property for backdoor attacks. We found that by correcting such alteration, backdoor-trigger instances will be classified to their original source classes. Accordingly, we proposed a backdoor mitigation approach without changing any parameters of the classifier, which outperformed methods that use DNN fine-tuning. Moreover, our method can detect instances with the trigger during inference.

REFERENCES

Yi Zeng, Si Chen, Won Park, Zhuoqing Mao, Ming Jin, and Ruoxi Jia. Adversarial unlearning of backdoors via implicit hypergradient. In ICLR, 2022.

Zhiyuan Zhang, Lingjuan Lyu, Weiqiang Wang, Lichao Sun, and Xu Sun. How to inject backdoors with better consistency: Logit anchoring on clean data. In ICLR, 2022.

Zhendong Zhao, Xiaojun Chen, Yuexin Xuan, Ye Dong, Dakui Wang, and Kaitai Liang. Defeat: Deep hidden feature backdoor attacks by imperceptible perturbation and latent representation constraints. In CVPR, 2022.

Runkai Zheng, Rongjun Tang, Jianze Li, and Li Liu. Data-free backdoor removal based on channel lipschitzness. 2022.

A PROOF OF THEOREMS IN THE MAIN PAPER

A.1 DERIVATION OF EQ. (2)

Here, we provide the derivation showing that $m_j^*$ and $v_j^*$ in Eq. (2) are the solutions to:

$$E[a_j^*(X_b)] = E[a_j(X) \mid Y = -1] \tag{7}$$

$$\mathrm{Var}[a_j^*(X_b)] = \mathrm{Var}[a_j(X) \mid Y = -1] \tag{8}$$

Based on Eq. (1), the above equations can be expanded as the following:

$$E\Big[\frac{w_j^\top X_b - m_j^*}{v_j^*} \gamma_j + \beta_j\Big] = E\Big[\frac{w_j^\top X - m_j}{\sqrt{v_j}} \gamma_j + \beta_j \,\Big|\, Y = -1\Big] \tag{9}$$

$$\mathrm{Var}\Big[\frac{w_j^\top X_b - m_j^*}{v_j^*} \gamma_j + \beta_j\Big] = \mathrm{Var}\Big[\frac{w_j^\top X - m_j}{\sqrt{v_j}} \gamma_j + \beta_j \,\Big|\, Y = -1\Big] \tag{10}$$

We first solve Eq. (10) for $(X \mid Y = -1) \sim \mathcal{N}(-\mu, \sigma^2 I)$ and $X_b \sim \mathcal{N}(\mu_b, \sigma_b^2 I)$, which leads to:

$$v_j^* = \frac{\sigma_b}{\sigma} v_j \tag{11}$$

By substituting Eq. (11) into Eq. (9), and since $\mu_b = -\mu + \epsilon$, we get the following:

$$m_j^* = \frac{v_j^*}{v_j}(m_j + w_j^\top \mu) + w_j^\top \mu_b = \frac{\sigma_b}{\sigma} m_j + \Big(\frac{\sigma_b}{\sigma} - 1\Big) w_j^\top \mu + w_j^\top \epsilon.$$

A.2 PROOF OF THEOREM 3.1

Proof. First, we specify the following vector/matrix representations that will be used throughout this proof:

$$W = [w_1, \cdots, w_J]^\top \in \mathbb{R}^{J \times d}, \quad V = \mathrm{diag}(v_1, \cdots, v_J) \in \mathbb{R}^{J \times J}, \quad \hat{V}(\alpha) = \mathrm{diag}(\hat{v}_1(\alpha), \cdots, \hat{v}_J(\alpha)) \in \mathbb{R}^{J \times J},$$

$$m = [m_1, \cdots, m_J]^\top \in \mathbb{R}^J, \quad \hat{m}(\alpha) = [\hat{m}_1(\alpha), \cdots, \hat{m}_J(\alpha)]^\top \in \mathbb{R}^J,$$

$$\Gamma = \mathrm{diag}(\gamma_1, \cdots, \gamma_J) \in \mathbb{R}^{J \times J}, \quad \beta = [\beta_1, \cdots, \beta_J]^\top \in \mathbb{R}^J,$$

$$a(\cdot) = [a_1(\cdot), \cdots, a_J(\cdot)]^\top \in \mathbb{R}^J, \quad \hat{a}(\cdot|\alpha) = [\hat{a}_1(\cdot|\alpha), \cdots, \hat{a}_J(\cdot|\alpha)]^\top \in \mathbb{R}^J, \quad a^*(\cdot) = [a_1^*(\cdot), \cdots, a_J^*(\cdot)]^\top \in \mathbb{R}^J.$$

Let $X_- = (X \mid Y = -1) \sim \mathcal{N}(-\mu, \sigma^2 I)$ denote a random instance from the source class '-1' for simplicity. Let $u = u_- - u_+$, with $u_-$ and $u_+$ being the weight vectors associated with the node for class '-1' and the node for class '+1' respectively. Then, it is easy to see that:

$$\hat{a}(X_b|\alpha)\big|_{\alpha=1} = a(X_b) \quad \text{and} \quad \hat{a}(X_b|\alpha)\big|_{\alpha=0} = a^*(X_b),$$

and taking one step further by setting $\alpha = 1$, we have the following:

$$\begin{aligned} P[g_-(X_b|\alpha) > g_+(X_b|\alpha) \mid \alpha = 1] &= P[u^\top \hat{a}(X_b|\alpha) > 0 \mid \alpha = 1] \\ &= P[u^\top a(X_b) > 0] = P[f_-(X_b) > f_+(X_b)] \leq 1 - \psi \end{aligned} \tag{12}$$

B DATASETS, TRAINING SETTINGS, AND ATTACK SETTINGS

B.1 DATASETS

In experiments, we show the effectiveness of our proposed backdoor mitigation method on several benchmark datasets including CIFAR-10 (Krizhevsky & Hinton (2009) ), GTSRB (Houben et al. (2013) ), CIFAR-100 (Krizhevsky & Hinton (2009) ), ImageNette (Howard (2020) 

B.2 TRAINING SETTINGS

Training settings for the 5 datasets are shown in Table 6. We train a ResNet-18 (He et al. (2016)) on CIFAR-10 and CIFAR-100 for 30 epochs and 40 epochs, respectively. We train a ResNet-34 (He et al. (2016)) on both TinyImageNet and ImageNette for 90 epochs. For GTSRB, we train a MobileNet (Howard et al. (2017)) for 60 epochs. For all models, we use the Adam optimizer (Kingma & Ba (2015)) for parameter learning and a scheduler that decays the learning rate of each parameter group by a factor of 0.1 every "scheduler step size" epochs (shown in the table). We choose batch size 32 for both CIFAR-10 and CIFAR-100, 64 for GTSRB and ImageNette, and 128 for TinyImageNet.
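As a rough illustration of the step-decay schedule described above, the following sketch computes the learning rate at a given epoch. The base learning rate and step size used in the example are placeholders chosen for illustration; they are not the values from Table 6.

```python
def step_decay_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate at a 0-indexed epoch under step decay.

    The rate is multiplied by `gamma` once every `step_size` epochs,
    i.e. lr = base_lr * gamma ** (epoch // step_size).
    """
    if step_size <= 0:
        raise ValueError("step_size must be a positive number of epochs")
    return base_lr * gamma ** (epoch // step_size)


# Hypothetical settings for illustration only (not taken from Table 6):
# 30 epochs, base learning rate 1e-3, decay every 10 epochs.
schedule = [step_decay_lr(base_lr=1e-3, epoch=e, step_size=10) for e in range(30)]
```

With these placeholder values, the rate stays at its base value for the first 10 epochs and is reduced by a factor of 10 at epochs 10 and 20.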

B.3 ATTACK SETTINGS

On the CIFAR-10 dataset, we consider the following triggers: 1) a 3 × 3 random patch (BadNet) at a randomly selected location (fixed for all triggered images in each attack), as used by Gu et al. (2019) and visualized in Fig. 3b; 2) an additive perturbation (with size 2/255) resembling a chessboard (CB), as used by Xiang et al. (2020) and visualized in Fig. 3c; 3) a single pixel (SP) perturbed by 75/255 at a randomly selected location (fixed for all triggered images in each attack), as used by Tran et al. (2018) and visualized in Fig. 3d; 4) invisible triggers generated with $\ell_0$ and $\ell_2$ norm constraints ($\ell_0$ inv and $\ell_2$ inv, respectively), as proposed by Li et al. (2021a) and visualized in Fig. 3e and 3f; 5) a warping-based trigger (WaNet), as proposed by Nguyen & Tran (2021) and visualized in Fig. 3g. Attack settings for CIFAR-10 are summarized in Tab. 7. For all-to-one attacks, we arbitrarily choose class 9 as the target class, and embed the backdoor trigger in 100 randomly chosen training samples per class (excluding the target class). To make the WaNet attack as effective as the attacks with the other triggers, we poison 900 images per source class in the all-to-one attack using WaNet. For all-to-all attacks, we embed the backdoor trigger into 300 images for each class. To obtain effective attacks, we poison 800 training images and 1500 training images in the all-to-all attacks using SP and WaNet, respectively.

Attack settings for the other datasets are summarized in Tab. 8. Due to the insufficiency of data, we only conduct all-to-one attacks on these datasets. We arbitrarily choose class 0 as the target class for CIFAR-100, GTSRB, and TinyImageNet, and class 9 for ImageNette. All classes other than the target class are source classes. For CIFAR-100, we use the same BadNet, $\ell_0$ inv, and $\ell_2$ inv triggers as for CIFAR-10. We increase the perturbation size of the CB pattern to 6/255 to obtain an effective backdoor attack. For each attack, we poison 10 images per source class using the above triggers. The SP and WaNet triggers are not considered, since we cannot launch a successful backdoor attack with them on CIFAR-100. For GTSRB, in addition to the triggers used for CIFAR-100, we also use the warping-based trigger (WaNet). We poison 2% of the training images per source class using the BadNet trigger and the $\ell_2$ inv trigger, and 5% of the training images per source class using the CB trigger and the $\ell_0$ inv trigger.
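The trigger embedding and all-to-one poisoning described above can be sketched as follows. This is a minimal illustration only: images are assumed to be H × W × C nested lists of floats in [0, 1], the chessboard layout (perturbing pixels where row + column is even) is an assumption, and all function names are hypothetical. The invisible and warping-based triggers are not covered.

```python
import random


def _clip(x):
    """Keep a pixel intensity inside [0, 1]."""
    return min(1.0, max(0.0, x))


def _copy(img):
    """Deep-copy an image stored as H x W x C nested lists."""
    return [[list(px) for px in row] for row in img]


def embed_patch(img, patch, top, left):
    """BadNet-style trigger: overwrite a region with a fixed patch."""
    out = _copy(img)
    for i, patch_row in enumerate(patch):
        for j, patch_px in enumerate(patch_row):
            out[top + i][left + j] = list(patch_px)
    return out


def embed_chessboard(img, size=2 / 255):
    """Additive perturbation on alternating pixels (assumed layout)."""
    return [[[_clip(c + size) if (i + j) % 2 == 0 else c for c in px]
             for j, px in enumerate(row)]
            for i, row in enumerate(img)]


def embed_single_pixel(img, row, col, size=75 / 255):
    """Perturb one pixel at a fixed location."""
    out = _copy(img)
    out[row][col] = [_clip(c + size) for c in out[row][col]]
    return out


def poison_all_to_one(dataset, embed, target, per_class, rng):
    """Return triggered copies of `per_class` images from every source
    class, all relabeled to `target`. `dataset` is a list of (img, label)."""
    by_class = {}
    for img, label in dataset:
        if label != target:
            by_class.setdefault(label, []).append(img)
    poisoned = []
    for label in sorted(by_class):
        imgs = by_class[label]
        for img in rng.sample(imgs, min(per_class, len(imgs))):
            poisoned.append((embed(img), target))
    return poisoned


# One attack: the patch content and location are drawn once and then
# reused for every triggered image.
rng = random.Random(0)
patch = [[[rng.random() for _ in range(3)] for _ in range(3)] for _ in range(3)]
top, left = rng.randrange(32 - 3 + 1), rng.randrange(32 - 3 + 1)


def badnet(img):
    return embed_patch(img, patch, top, left)
```

The poisoned pairs returned by `poison_all_to_one` would then be added to the clean training set before training.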

C PATTERN ESTIMATION AND BACKDOOR DETECTION

For our BNA, following Sec. 4.2, we first perform detection by reverse-engineering a backdoor trigger for each class pair. For patch triggers like BadNet, we use the objective function from Wang et al. (2019) for trigger reverse-engineering. For other, more subtle perturbation-based trigger types, we use the objective function from Xiang et al. (2020). The detection statistic is the reciprocal of the $\ell_0$ norm for estimated patch triggers and the reciprocal of the $\ell_2$ norm for reverse-engineered perturbation-based triggers. We then feed the detection statistics to an anomaly detector based on MAD, a classical approach also used by Wang et al. (2019); Chen et al. (2019b); Wang et al. (2020). It first computes the absolute deviation of each detection statistic from the median of all detection statistics; the median of these absolute deviations is called the Median Absolute Deviation (MAD). The anomaly score of a class pair and its estimated trigger is defined as the absolute deviation divided by the MAD; if this score is larger than a given threshold, the class pair is detected as a backdoor class pair. A suitable detection threshold is easy to find from Fig. 4 and Fig. 5, which show the histograms of the anomaly scores for all class pairs under all-to-one and all-to-all attacks, respectively. Here, we set the detection threshold to 7, which catches all the backdoor class pairs under all the attacks, except for the all-to-all BadNet attack and both attacks using the WaNet trigger.

For the all-to-all BadNet attack, the outlier detector finds two source classes, 0 and 8, for the target class 1, where 0-1 is the true source-target class pair and 8-1 is falsely detected. The $\ell_0$ norm of the trigger estimated on class 0 clean images is 3.02, and that estimated on class 8 images is 7.95. If classes 0 and 8 are both source classes involved in the backdoor attack, then the trigger estimated jointly on the clean images from classes 0 and 8 should also have a small $\ell_0$ norm. Otherwise, the trigger estimated using class 8 images is an intrinsic backdoor pattern (Xiang et al. (2022); Liu et al. (2022); Tao et al. (2022)), and 0-1 is the true source class pair, since the trigger of 0-1 has a smaller size than that of 8-1. By optimizing on clean images from classes 0 and 8 together, the $\ell_0$ norm of the trigger that causes misclassification to class 1 with high confidence is 27.18, which is much larger than that of the triggers estimated on either class 0 images or class 8 images alone. Thus, we detect 0-1 as the true backdoor class pair and discard the trigger for class pair 8-1 in backdoor mitigation.

For the attacks using warping-based triggers (WaNet), unlike the other attacks, the trigger sizes for clean class pairs and backdoor class pairs are both small. However, there is still a "gap" between the anomaly scores of clean class pairs and backdoor class pairs, as shown in Fig. 4f and 5f. The outlier detector successfully detects all the backdoor class pairs using a threshold of 3.
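The MAD-based detection step can be sketched as below. This is a minimal illustration under stated assumptions: the anomaly score follows the definition given in this section (absolute deviation divided by the MAD, with no additional normalization constant), the function names are hypothetical, and the trigger norms in the example are made-up numbers rather than results from our experiments.

```python
from statistics import median


def anomaly_scores(trigger_norms):
    """Map each class pair to its anomaly score.

    trigger_norms: dict {(source, target): norm of the estimated trigger}.
    The detection statistic is the reciprocal of the norm, so unusually
    small triggers yield unusually large statistics.
    """
    stats = {pair: 1.0 / norm for pair, norm in trigger_norms.items()}
    med = median(stats.values())
    abs_dev = {pair: abs(s - med) for pair, s in stats.items()}
    mad = median(abs_dev.values())
    if mad == 0:
        raise ValueError("MAD is zero; anomaly scores are undefined")
    return {pair: dev / mad for pair, dev in abs_dev.items()}


def detect_backdoor_pairs(trigger_norms, threshold=7.0):
    """Return the class pairs whose anomaly score exceeds the threshold."""
    scores = anomaly_scores(trigger_norms)
    return sorted(pair for pair, score in scores.items() if score > threshold)


# Hypothetical trigger norms: only the pair (0, 1) has a very small trigger.
example_norms = {(0, 1): 3.0, (2, 1): 40.0, (3, 1): 45.0, (4, 1): 50.0,
                 (5, 1): 55.0, (6, 1): 60.0, (7, 1): 48.0}
detected = detect_backdoor_pairs(example_norms)
```

Because the score uses an absolute deviation, a pair with an unusually large trigger could in principle also be flagged; restricting detection to statistics above the median would be one way to avoid that, but it is not part of the description above.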

D DISTRIBUTION DIVERGENCES

As stated in Thm. 3.1, for our backdoor mitigation method, the SIA monotonically increases as the divergence between the distributions of clean instances and backdoor-trigger instances decreases. The ACC, ASR, and SIA of our method against all all-to-one attacks on the CIFAR-10 dataset are reported in Tab. 1; here, we report the corresponding distribution divergences in Tab. 9. Tab. 9 shows the average TV distance, JS divergence, and KL divergence between the distributions of penultimate-layer activations of clean images and backdoor-trigger images for a clean ResNet-18, a backdoor-poisoned ResNet-18, and a backdoor-poisoned ResNet-18 with BNA. For the backdoor-poisoned ResNet-18 with BNA, we use the TV distance in backdoor mitigation. For all attacks, all three divergences are small for a clean DNN and relatively large for a backdoor-poisoned DNN; that is, in the poisoned DNN the distribution of backdoor-trigger instances deviates severely from that of clean instances. With our mitigation method, this distribution alteration is largely reversed: all three divergences are drastically reduced, which is consistent with the results in Tab. 1.
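For reference, the three divergences can be computed from activation histograms as in the sketch below. It assumes that the activation distribution of a single neuron is approximated by a histogram over a common range for clean and triggered instances; the number of bins, the range, and the small smoothing constant `eps` are illustrative choices, and the function names are hypothetical.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Normalized histogram of `values` over [lo, hi] with n_bins bins.
    Values outside the range are assigned to the nearest edge bin."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(n_bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps avoids division by zero for empty bins of q."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, using the mixture m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

An average over neurons, as reported in Tab. 9, would then be obtained by applying these functions to each neuron's pair of histograms and averaging the results.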

MITIGATION

To observe the impact of attack settings on the performance of backdoor mitigation methods, we vary the poisoning ratio (i.e., the number of poisoned instances per source class) and the perturbation size used in all-to-one CB attacks, and apply all the mitigation methods to the resulting poisoned DNNs. The results are shown in Tab. 11. Generally, the metrics for all methods decrease with increasing poisoning ratio and perturbation size. Although the performance of our BNA declines slightly as the attack is strengthened, our method still outperforms the other methods in terms of SIA. It also achieves ACC and ASR that are the best among, or comparable to, those of the other methods.



These can be constructed in practice, given an estimated backdoor trigger, by embedding the trigger in clean instances available to the defender.

For example, the histogram for clean activations in Fig. 1b will be shifted away (to the left).

More reliable trigger estimation can be achieved in this way for a detected target class.

The 10 classes are tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, and parachute.



Figure 1: Activation distribution of a neuron in the penultimate layer of ResNet-18 trained on CIFAR-10, for instances with and without a backdoor trigger, for (a) a clean classifier and (b) a backdoor-poisoned classifier (with the same trigger). In (c), the distribution alteration in (b) is reversed by our proposed method, so that most instances with the trigger will be correctly classified.

Existing backdoor defenses are deployed either during the DNN's training stage or post-training. The ultimate goal of training-stage defenses is to train an accurate, backdoor-free DNN given the possibly poisoned training set. To achieve this goal, Shen & Sanghavi (2019); Huang et al. (2022); Li et al. (2021d); Chen et al. (2019a); Xiang et al. (2019); Du et al. (2020) either identify a subset of "high-credible" instances for training, or detect and then remove training instances possibly with a backdoor trigger before training. Post-training defenders, however, are assumed to have no access to the classifier's training set. Many post-training defenses aim to detect whether a given classifier has been backdoor-compromised. Wang et al. (2019); Xiang et al. (2020); Wang et al. (2020); Liu et al. (2019) perform anomaly detection using triggers reverse-engineered on an assumed independent clean dataset; while Xu et al. (2021); Kolouri et al. (2020) train a (binary) meta classifier on "shadow" classifiers trained with and without attack. However, model-detection defenses are not able to mitigate backdoor attacks at test time. Thus, there is a family of post-training backdoor mitigation approaches proposed to fine-tune the classifier on the assumed clean dataset, with a subset of neurons possibly associated with the backdoor attack pruned (Liu et al. (2018a); Wu & Wang (2021); Guan et al. (2022); Zheng et al. (2022)), by leveraging knowledge distillation to preserve only classification functions for clean instances (Li et al. (2021c); Xia et al. (2022)), or by solving a min-max problem as an analogue to adversarial training for evasion attacks (Zeng et al. (2022); Madry et al. (

For a clean training random vector $(X, Y)$ with a uniform class prior, i.e. $Y \sim \mathcal{U}\{-1, +1\}$, and with $X \mid Y \sim \mathcal{N}(Y \cdot \mu, \Sigma)$, where $\mu \in \mathbb{R}^d$ and $\Sigma = \sigma^2 I$, consider a backdoor attack with target class '+1' and triggered instance $X_b \sim \mathcal{N}(\mu_b, \Sigma_b)$, with $\mu_b = -\mu + \epsilon$ and $\Sigma_b = \sigma_b^2 I$.

where $m_j$ and $v_j$ are, respectively, the mean and variance stored by the BN layer during training on the poisoned training set. Then the activation distribution for clean source class instances $(X \mid Y = -1) \sim \mathcal{N}(-\mu, \Sigma)$ is a Gaussian specified by mean $\mathbb{E}[a_j(X) \mid Y = -1]$ and variance $\mathrm{Var}[a_j(X) \mid Y = -1]$, while for triggered instances $X_b \sim \mathcal{N}(\mu_b, \Sigma_b)$, the activation follows a Gaussian specified by mean $\mathbb{E}[a_j(X_b)]$ and variance $\mathrm{Var}[a_j(X_b)]$.

Figure 2: Illustration of our backdoor mitigation framework with a test-time inference rule.

such that $\mathbb{E}[a_j^*(X_b)] = \mathbb{E}[a_j(X) \mid Y = -1]$ and $\mathrm{Var}[a_j^*(X_b)] = \mathrm{Var}[a_j(X) \mid Y = -1]$ are achieved. But here, we aim to study the quantitative relationship between the distribution divergence and the SIA metric of Def. 3.3 below. Thus, we consider an "intermediate state" with a classifier specified by output node functions $g_-(\cdot|\alpha): \mathbb{R}^d \to \mathbb{R}$ and $g_+(\cdot|\alpha): \mathbb{R}^d \to \mathbb{R}$, where for each output node $i \in \{-, +\}$, $g_i(x|\alpha) = u_i^\top \hat{a}(x|\alpha)$ depends on a "transition variable" $\alpha \in [0, 1]$, with $u_i$ the weight vector for the original output function $f_i$. Here, $\hat{a}(x|\alpha) = [\hat{a}_1(x|\alpha), \cdots, \hat{a}_J(x|\alpha)]^\top$ is the activation vector for input $x$, where

$$\hat{a}_j(x|\alpha) = \frac{w_j^\top x - \hat{m}_j(\alpha)}{\sqrt{\hat{v}_j(\alpha)}}\,\gamma_j + \beta_j,$$

with $\hat{m}_j(\alpha) = \alpha m_j + (1 - \alpha) m_j^*$ and $\hat{v}_j(\alpha) = \left(\alpha \sqrt{v_j} + (1 - \alpha) \sqrt{v_j^*}\right)^2$ being the "intermediate" mean and variance, respectively. Given these settings, our main theoretical results are presented below.

Definition 3.3. (Source inference accuracy (SIA)) SIA is the probability that a triggered instance is classified to its original source class (Li et al. (2022a)), i.e. $P[g_-(X_b|\alpha) > g_+(X_b|\alpha)]$ here.

Theorem 3.1. (Monotonicity of SIA with Divergence) If the binary classifier with $f_-$ and $f_+$ is $\eta$-erroneous with $\eta < 1/2$, the attack is $\psi$-successful with $\psi > 1/2$, and $\sigma_b \le \sigma$, then the SIA of the modified classifier, i.e. $P[g_-(X_b|\alpha) > g_+(X_b|\alpha)]$, monotonically decreases as $\alpha \in [0, 1]$ increases.
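The statement of Theorem 3.1 can be checked numerically in a toy setting. The sketch below assumes one input dimension and a single BN neuron, so that all quantities are scalars and the SIA has a closed form through the standard Gaussian CDF. The corrected statistics use the solution for $m_j^*$ and $v_j^*$ derived in Apdx. A.1. All parameter values and function names are hypothetical and were chosen only so that $\eta < 1/2$, $\psi > 1/2$, and $\sigma_b \le \sigma$ hold.

```python
import math


def phi_cdf(z):
    """CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def corrected_bn_stats(m, v, w, mu, eps, sigma, sigma_b):
    """Scalar version of the corrected BN statistics (m*, v*):
    sqrt(v*) = (sigma_b / sigma) * sqrt(v) and
    m* = r*m + (r - 1)*w*mu + w*eps, with r = sigma_b / sigma."""
    r = sigma_b / sigma
    return r * m + (r - 1.0) * w * mu + w * eps, r ** 2 * v


def sia(alpha, m, v, w, gamma, beta, u, mu, eps, sigma, sigma_b):
    """P[u * a_hat(X_b | alpha) > 0] for X_b ~ N(-mu + eps, sigma_b^2)."""
    m_star, v_star = corrected_bn_stats(m, v, w, mu, eps, sigma, sigma_b)
    m_hat = alpha * m + (1.0 - alpha) * m_star
    # Square root of the intermediate variance v_hat(alpha).
    s_hat = alpha * math.sqrt(v) + (1.0 - alpha) * math.sqrt(v_star)
    mu_b = -mu + eps
    mean = u * ((w * mu_b - m_hat) * gamma / s_hat + beta)
    std = abs(u * gamma * w) * sigma_b / s_hat
    return phi_cdf(mean / std)


# Hypothetical toy parameters; u = u_minus - u_plus is a scalar here.
TOY = dict(m=-1.25, v=4.0, w=1.0, gamma=1.0, beta=0.0, u=-1.0,
           mu=2.0, eps=1.5, sigma=1.0, sigma_b=0.5)
sia_curve = [sia(k / 10, **TOY) for k in range(11)]
```

In this toy setting, `sia_curve` starts at the clean source-class accuracy for α = 0 and falls to one minus the attack success rate for α = 1, decreasing in between, which matches the behavior stated in the theorem.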

Figure 3: Example of CIFAR-10 images embedded with the backdoor triggers considered in our experiments.

The $\ell_0$ norm of the trigger estimated on class 0 clean images is 3.02, and that estimated on class 8 images is 7.95. If classes 0 and 8 are both source classes involved in the backdoor attack, then the trigger estimated on the clean images from classes 0 and 8 should also have a small $\ell_0$ norm. Otherwise, the trigger estimated using class 8 images is an intrinsic backdoor pattern (Xiang et al. (2022); Liu et al. (2022); Tao et al. (

Figure 4: Histograms of anomaly scores for each class pair under all all-to-one attacks.

Figure 5: Histograms of anomaly scores for each class pair under all all-to-all attacks.

Average ACC, ASR, and SIA for our BNA, compared with NC, NAD, I-BAU, ANP, and ARGD, against all the created attacks applied to ResNet-18 trained on the CIFAR-10 dataset.

(2018); 4) invisible triggers generated with $\ell_0$ and $\ell_2$ norm constraints ($\ell_0$ inv and $\ell_2$ inv, respectively) proposed by Li et al. (2021a); 5) a warping-based trigger (WaNet) proposed by Nguyen & Tran (2021). Details for generating these triggers are deferred to Apdx. B.3. For experiments on CIFAR-10, we randomly create 5 attacks for each attack setting and each trigger (e.g. with random location). For experiments on the other four datasets, we only consider A2O attacks for a subset of triggers where a sufficiently high success rate can be achieved. For each dataset, we create one attack for each trigger being considered. A2A attacks are not considered for these datasets since they are not successful due to the insufficiency of data. More details about the attacks, including the number of backdoor-trigger images used for poisoning and the target class selected to create A2O attacks for the four datasets other than CIFAR-10, are shown in Apdx. B.3.

ACC, ASR, and SIA for our BNA against all-to-one attacks on CIFAR-100, GTSRB, ImageNette, and TinyImageNet datasets.

ACC, ASR, and SIA for our BNA as a function of scale factor and bin size on ResNet-18 trained on CIFAR-10 poisoned by all-to-one BadNet attack.

), and TinyImageNet. The CIFAR-10 dataset contains 60,000 32 × 32 color images from 10 classes, with 5,000 images per class for training and 1,000 images per class for testing. The GTSRB dataset has more than 50,000 traffic sign images of different sizes from 43 classes. Here, we resize all images in GTSRB to 32 × 32. CIFAR-100 contains 60,000 32 × 32 color images evenly distributed over 100 classes, where 500 images per class are used for training and the others are used for testing. ImageNette is a subset of 10 easily classified classes from ImageNet, with an image size of 256 × 256. For each class, there are around 900 images for training and 400 images for testing. The TinyImageNet dataset is a subset of the ImageNet dataset (Russakovsky et al. (2015)). It contains 100,000 64 × 64 color images evenly distributed over 200 classes (500 training images and 50 test images for each class).

To achieve similarly effective attacks, we embed the WaNet trigger into 24% of the training images per source class. For TinyImageNet and ImageNette, we only consider BadNet as the trigger, as the DNN cannot learn the backdoor mapping using the other (relatively simple and small) triggers on datasets that are much more complicated than CIFAR-10. To successfully plant backdoors, we increase the size of the BadNet patch to 6 × 6 for TinyImageNet and to 21 × 21 for ImageNette. We embed the trigger in 10 training images per source class for TinyImageNet and in 5% of the training images per source class for ImageNette.

Training configurations of the 5 datasets used in our experiments.

Attack configurations on CIFAR-10

Attack configurations on GTSRB, CIFAR-100, ImageNette, and TinyImageNet.

Average TV distance, JS divergence, and KL divergence between distributions of clean instances and backdoor-trigger instances in clean DNN, poisoned DNN, and poisoned DNN with BNA using TV divergence in backdoor mitigation.

ACC, ASR, and SIA for our BNA, NC, I-BAU, ANP, NAD, and ARGD as a function of (1) the number of poisoned instances injected into the training set; (2) the perturbation size under all-to-one CB attack.

ETHICS STATEMENT

The main purpose of this research is to understand the behavior of deep learning systems facing malicious activities, and to enhance their safety level. The backdoor attack considered in this paper is well-known, with open-sourced implementation code. Thus, publication of this paper will be beneficial to the community in defending against backdoor attacks. The code of our defense will be released if the paper is accepted.

This is to say that when $\alpha = 1$, the classifier is not modified at all, thus the SIA will be no larger than $1 - \psi$ since the attack is $\psi$-successful (see Definition 3.2). On the other hand, by setting $\alpha = 0$, we will have the following: [...] and this is to say that when $\alpha = 0$, the distribution shift will be fully recovered, such that the SIA is as high as the accuracy of the source class. Recall that Eq. (14) is due to Eq. (7) and Eq. (8). The inequality (15) is because the classifier specified by $f_-$ and $f_+$ is assumed $\eta$-erroneous (see Definition 3.1).

Here, we prove the theorem by showing that the partial derivative of $P$ [...] (i.e. triggered instances have smaller standard deviation than clean instances, which is generally true). To achieve this, we notice that [...] follows a Gaussian distribution with [...] We also notice that for source class instances $X_-$ [...] Then we have [...] where $\Phi$ is the cumulative distribution function of the standard Gaussian.

Now let us consider Eq. (21) first. Since $\eta < \frac{1}{2}$ as we have reasonably assumed (otherwise the classifier may be worse than a random guess), and also according to Eq. (15), we have [...] Thus, based on Eq. (21) and Eq. (18), we get [...] Next, let us focus on Eq. (20). Again, we set $\alpha = 1$. Based on (12)-(13) and the reasonable assumption that $\psi > \frac{1}{2}$ (otherwise the attack is not deemed successful since the success rate will be even lower than the accuracy on clean instances), we have [...] Then, based on Eq. (20) and Eq. (16), we get [...] Subtracting Eq. (23) from Eq. (22), we get: [...] Based on Eq. (20), we also have [...] where $\phi$ is the probability density function (PDF) of the standard normal distribution. Based on Eq. (16), Eq. (17), and Eq. (2), we have [...] and thus, based on Eq. (23) and Eq. (24), [...] when $\sigma_b \le \sigma$. Substituting it into Eq. (25), and given that the Gaussian PDF is strictly positive, we have [...]

E CHOICE OF DIVERGENCE FORMS

In Tab. 1 and 3, we only show the results for our method using the TV distance in backdoor mitigation (Eq. 3). Here we show that our method is not sensitive to the choice of distribution divergence form. We use the TV distance, JS divergence, and KL divergence in turn to mitigate the 5 all-to-one CB attacks against CIFAR-10, and report the distribution similarity after mitigation as measured by each of the three divergences. The distribution similarity is calculated on the penultimate-layer activations. As shown in Tab. 10, the distribution alteration is significantly relieved after mitigation, regardless of the measurement used in mitigation (Eq. 3).

