FLIP: A PROVABLE DEFENSE FRAMEWORK FOR BACKDOOR MITIGATION IN FEDERATED LEARNING

Abstract

Federated Learning (FL) is a distributed learning paradigm that enables different parties to jointly train a model with high quality and strong privacy protection. In this scenario, individual participants may be compromised and perform backdoor attacks by poisoning the data (or gradients). Existing work on robust aggregation and certified FL robustness does not study how hardening benign clients can affect the global model (and the malicious clients). In this work, we theoretically analyze the connection among cross-entropy loss, attack success rate, and clean accuracy in this setting. Moreover, we propose a trigger reverse engineering based defense and show that our method can achieve robustness improvement with guarantee (i.e., reducing the attack success rate) without affecting benign accuracy. We conduct comprehensive experiments across different datasets and attack settings. Our results against nine competing SOTA defense methods show the empirical superiority of our method on both single-shot and continuous FL backdoor attacks. Code is available at https://github.com/KaiyuanZh/FLIP.



FLIP hardens each benign local model on generated backdoor triggers that can cause misclassification, which counters the data poisoning by malicious local clients. When all local weights are aggregated on the global server, the injected backdoor features in the aggregated global model are mitigated by the hardening performed on the benign clients. Therefore, FLIP can reduce the attack success rate of backdoor samples. The overview of FLIP is shown in Figure 1. As a part of the framework, we provide a theoretical analysis of how our training on a benign client can affect a malicious local client as well as the global model. To the best of our knowledge, this has not been studied in the literature. The theoretical analysis determines that our method ensures a deterministic loss elevation on backdoor samples with only a slight loss variation on clean samples. It guarantees that the attack success rate will decrease while the model maintains the main task accuracy on clean data without much degradation. Certified accuracy is commonly used for evasion attacks, which do not involve training. As data poisoning happens during training, it is more reasonable to certify the behavior of models during training rather than at inference.

Our Contributions. We make contributions on both the theoretical and the empirical fronts.

• We propose FLIP, a new provable defense framework that provides a sufficient condition on the quality of trigger recovery such that the proposed defense is provably effective in mitigating backdoor attacks.

• We propose a new perspective of formally quantifying the loss changes, with and without defense, for both clean and backdoor data.

• We empirically evaluate the effectiveness of FLIP at scale across MNIST, Fashion-MNIST, and CIFAR-10, using non-linear neural networks. The results show that FLIP significantly outperforms SOTAs in the continuous FL backdoor attack setting. The ASRs after applying SOTA defense techniques remain 100% in most cases, whereas FLIP can reduce ASRs to around 15%.

• We design an adaptive attack that is aware of the proposed defense and show that FLIP stays effective.

• We conduct ablation studies on individual components of FLIP and validate that FLIP is generally effective with various downstream trigger inversion techniques.

Threat Model. We consider FL backdoor attacks performed by malicious local clients, which manipulate local models by training with poisoned samples. On benign clients, we do not assume any knowledge about the ground-truth trigger. Backdoor triggers are inverted on benign clients based on the received model weights (from the global server) and their local (non-i.i.d.) data. Standard training on clean data and adversarial training on augmented data (clean samples stamped with inverted triggers) are then performed. The global server does not distinguish weights from trusted or untrusted clients, nor does it access any local data; thus there is no information leakage or privacy violation. The attacker's goal is to inject a backdoor into the global model, achieving a high attack success rate without causing any noticeable accuracy degradation on clean samples. In our setting, a defender has no control over any malicious client, who may perform any kind of attack, e.g., model replacement or weight scaling. Attackers can attack in any round of FL; in an extreme case, they attack in every round after the global model converges (if an attack is persistent since the first round, the model may not converge (Xie et al., 2019)). In this paper, we consider static backdoors, i.e., patch backdoors (Gu et al., 2017). Dynamic backdoors such as reflection backdoors (Liu et al., 2020), composite backdoors (Lin et al., 2020a), and feature space backdoors (Cheng et al., 2021b) will be our future work.

2. RELATED WORK

Backdoor Attack and Defense. In general, the goal of a backdoor attack is to inject a trigger pattern and associate it with a target label, e.g., by poisoning the training dataset. During testing, any input with such a pattern will be classified as the target label. There are a number of existing backdoor attacks, such as patch attacks (Gu et al., 2017; Liu et al., 2018), feature space attacks (Cheng et al., 2021b), etc. To identify whether a model is poisoned, existing works invert triggers (Wang et al., 2019; Liu et al., 2019; Shen et al., 2021; Liu et al., 2022; Tao et al., 2022; Cheng et al., 2023) or identify differences between clean models and backdoored models (Huang et al., 2020; Wang et al., 2020b). There are also methods that detect and reject inputs stamped with triggers (Ma & Liu, 2019; Li et al., 2020b). Federated Learning Backdoor Attack and Defense. Backdoor attacks have also been demonstrated in the FL setting (Xie et al., 2019; Tolpegin et al., 2020; Bagdasaryan et al., 2020; Fang et al., 2020; Shejwalkar et al., 2022). To defend against FL backdoor attacks, a number of defense methods have been proposed (Blanchard et al., 2017; Pillutla et al., 2022; Sun et al., 2019; Nguyen et al., 2021; Ozdayi et al., 2020; Fung et al., 2020; Andreina et al., 2021; Cao et al., 2020). They focus more on detecting and rejecting malicious weights, which could fall short in stronger and stealthier attack settings and may lead to data leakage. Certified and provable defense techniques (Panda et al., 2022; Cao et al., 2021; Xie et al., 2021) have also been proposed to analyze the robustness of FL, but they can only provide robustness certification against backdoors of (relatively) limited magnitude.

3. METHODOLOGY

In this section, we detail the design of FLIP, which consists of three main steps, as illustrated in Figure 1.

Trigger Inversion. Trigger inversion leverages optimization methods to invert the smallest input pattern that flips the classification results of a set of clean images to a target class. Neural Cleanse (Wang et al., 2019) uses optimization to derive a trigger for each class and observes whether any trigger is exceptionally small and hence likely injected rather than a naturally occurring feature. In our paper, we leverage universal trigger inversion, which aims to generate a trigger that can flip samples of all the classes (other than the target class) to the target class.

Class Distance. Recent work quantifies model robustness (against backdoors) by class distance (Tao et al., 2022). Given images from a source class s, it generates a trigger, consisting of a mask m and a pattern δ, which can flip the labels of these images stamped with the trigger to a target class t. The stamping function is given in Equation 1 and the optimization goal in Equation 2, where L(·) is the cross-entropy loss, M denotes the subject model, and ||·|| denotes the L1 norm, i.e., the absolute value sum.

x'_{s→t} = (1 − m) · x_s + m · δ    (1)

Loss = L(M(x'_{s→t}), y_t) + α · ||m||    (2)

The class distance d_{s→t} is measured as ||m||. The intuition is that if it is easy to generate a small trigger from the source class to the target class, the distance between the two classes is small; otherwise, the class distance is large. Accordingly, the model is robust if all the class distances are large, and vulnerable if one can easily generate a small trigger between two classes.

Cached Warm-up. Adversarial training on samples with inverted triggers is a widely used technique for model hardening. Observe that different label pairs have different distance capacities, and enlarging label pair distances by model hardening can improve model robustness and help mitigate backdoors (Tao et al., 2022).
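Equations 1 and 2 can be sketched as follows in NumPy. This is a minimal illustration, not the paper's implementation: the single-logit interface and the `alpha` value are our own assumptions.

```python
import numpy as np

def stamp(x, mask, delta):
    """Eq. 1: blend pattern delta into x where mask is active."""
    return (1.0 - mask) * x + mask * delta

def inversion_loss(logits, target, mask, alpha=0.01):
    """Eq. 2: cross-entropy toward the target class plus an L1 mask penalty."""
    z = logits - logits.max()                 # stabilized softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target] + alpha * np.abs(mask).sum()
```

In an actual inversion loop, `mask` and `delta` would be optimized (e.g., by gradient descent) to minimize `inversion_loss` over a batch of clean samples of the source class.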
Existing trigger inversion methods optimize all combinations of label pairs without selection and hence incur quadratic computation time (O(n^2)). To reduce the trigger optimization cost, we first generate universal triggers with each label as the target and prioritize promising pairs, which has linear time complexity (O(n)). We consider pairs with larger distance capacity as having greater potential. Specifically, to estimate the difficulty of flipping a source class to the target class, we leverage the loss changes during optimization. Classes far from the target class exhibit larger loss variances: they are initially quite different from the target class, yet once the predicted label is flipped to the target class, the loss value becomes quite small. During optimization, we compute the variance of the loss between source classes and the target class. FLIP starts with a warm-up phase, repeats the above process for each class, and utilizes the loss changes of different source labels to approximate class distances. Such information is saved in a distance matrix, or cache matrix, on each client. FLIP then prioritizes the promising pairs, namely those with large distances. When a client is selected for hardening, FLIP generates the label-specific trigger for each source class and updates the distance matrix between the source class and target class. In Algorithm 1, each local client utilizes its local samples, computes its class distances (measured by L1) to a target class, and caches the results in the distance matrix (line 11). Based on the distance matrix, FLIP then prioritizes the promising pairs with large distances (line 12). If the client has been selected before, we can directly retrieve the promising pairs from the cached distance matrix (lines 13, 14). The distance matrix is stored and updated locally by each client. Caching allows more iterations to be allocated to model hardening. (A)symmetric Hardening.
Given a pair of labels a and b, there are two directions for trigger inversion: from a to b and from b to a. A straightforward idea (Tao et al., 2022) is to invert the two directions separately. In Algorithm 2, we present more details about symmetric and asymmetric inversion. If the client has sufficient data (i.e., more than 5 images) for both labels of a pair (a, b), we perform symmetric hardening by generating triggers in the two directions (a → b and b → a) simultaneously. We first initialize the backdoor masks m and patterns δ for the two directions (line 5). The indicator vector p denotes the direction of symmetric hardening, i.e., 1 denotes from label a to b and 0 denotes from b to a. For each optimization step, we stamp the triggers on the corresponding class samples (line 7) as

X'_n = p · [(1 − m[0]) · X_n + m[0] · δ[0]] + (1 − p) · [(1 − m[1]) · X_n + m[1] · δ[1]],

where the indices 0 and 1 denote the optimization direction, and then update the distance matrix in both directions, distance_matrix[a][b] ← L1(a, b) and distance_matrix[b][a] ← L1(b, a) (lines 8, 9). If the client only has sufficient data for the source label of a pair (a, b), we perform asymmetric hardening from the source label to the target (i.e., a → b). We initialize the backdoor mask m and pattern δ in one direction (line 12), stamp the triggers on the corresponding class samples as X'_n = (1 − m) · X_n + m · δ (line 14), and update distance_matrix[a][b] ← L1(a, b) in one direction (line 15).

Low-confidence Sample Rejection. As the hardening on benign clients counters the data poisoning on malicious clients, the aggregated model on the global server tends to have low confidence in predicting backdoor samples (while the confidence on benign samples is largely intact).
During inference of the global model, we apply a threshold τ to filter out samples with low prediction confidence after the softmax layer, which significantly improves the model's robustness against backdoor attacks in federated learning. In the next section, we prove that, as long as the inverted trigger satisfies our given bound, the attack success rate is guaranteed to decrease while the model maintains similar accuracy on clean data.
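The inference-time rejection can be sketched as follows, assuming τ thresholds the top softmax probability; the function name and the `-1` rejection marker are our own conventions.

```python
import numpy as np

def predict_with_rejection(logits, tau):
    """Predict labels but reject samples whose top softmax confidence
    falls below tau; -1 marks rejected samples."""
    z = logits - logits.max(axis=1, keepdims=True)      # stabilized softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    preds[conf < tau] = -1                              # low-confidence rejection
    return preds
```

With a hardened global model, backdoor samples tend to fall below τ and are rejected, while confident benign predictions pass through unchanged.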

4. THEORETICAL ANALYSIS

In this section, we develop a theoretical analysis to study the effectiveness of our proposed defense in a simple but representative FL setting. It consists of the following: (i) developing upper and lower bounds quantifying the cross-entropy loss changes on backdoored and clean data with and without the defense (Theorem 1); (ii) showing a sufficient condition on the quality of trigger recovery such that the proposed defense is provably effective in mitigating backdoor attacks (Theorem 2); (iii) following (ii), showing that inference with confidence thresholding on models trained with our proposed defense can provably reduce the backdoor attack success rate while maintaining similar accuracy on clean data. During the analysis, we leverage the Cauchy-Schwarz inequality (Mitrinovic & Vasić, 1970) and variable substitution to obtain the upper and lower bounds. To the best of our knowledge, our analysis is new to this field due to the novel modeling of trigger recovery and model hardening schemes in our proposed defense method. These findings are also consistent with our empirical results in more complex settings.

Learning Objective and Setting. Suppose the k-th device holds n_k training samples {x_{k,1}, x_{k,2}, ..., x_{k,n_k}}, where x ∈ R^{1×d_x}. The model has one layer of weights W ∈ R^{d_x×I}. The label q ∈ R^{1×I} is a one-hot vector over I classes. In this work, we consider the following distributed optimization problem:

min_W F(W) = Σ_{k=1}^{N} g_k F_k(W),

where N is the number of devices, g_k is the weight of the k-th device such that g_k ≥ 0 and Σ_{k=1}^{N} g_k = 1, and F(·) is the objective function. The local objective F_k(·) is defined by F_k(W) = (1/n_k) Σ_{j=1}^{n_k} L(W; x_{k,j}), where L(·;·) is the loss function.
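The aggregation implied by the objective F(W) = Σ_k g_k F_k(W) under FedAvg can be sketched as a weighted average of client weight matrices; the function name `fedavg` is ours, and real FedAvg aggregates after local SGD rounds.

```python
import numpy as np

def fedavg(client_weights, g):
    """Weighted aggregation of client weight matrices W_k with
    aggregation weights g_k (g_k >= 0, sum to 1)."""
    assert abs(sum(g) - 1.0) < 1e-9, "aggregation weights must sum to 1"
    return sum(gk * Wk for gk, Wk in zip(g, client_weights))
```

In the two-client setting of our analysis (one benign, one malicious), the global update is simply g_1 * W_benign + g_2 * W_malicious, which is why hardening on the benign client directly shifts the global weights.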
On the global server, the global softmax cross-entropy loss function can be written as

L_global = − Σ_{i=1}^{I} q_i · log softmax(xW)_i = − Σ_{i=1}^{I} q_i · log( e^{(xW)_i} / Σ_{t=1}^{I} e^{(xW)_t} ) = − Σ_{i=1}^{I} q_i (xW)_i + log( Σ_{t=1}^{I} e^{(xW)_t} ),

with i and t as label indices over the I classes. Appendix A.16 describes all symbols used in our paper. In the theoretical analysis, we assume we are under the FedAvg protocol (McMahan et al., 2017). To simplify the analysis without losing generality, we assume there is one global server and two clients: one benign and the other malicious. We conduct the analysis on multi-class classification using logistic regression. The model is composed of one linear layer with the softmax function and the cross-entropy loss. Theorem 1 develops the upper and lower bounds quantifying the loss changes on backdoored and clean data with and without the defense.

Theorem 1 (Bounds on Loss Changes). Let L'_g denote the global model loss with defense and L_g the loss without defense, and let ΔW = W' − W denote the weight difference with and without defense. The loss difference with and without defense can be bounded by

min_t (xΔW)_t − Σ_{i=1}^{I} q_i (xΔW)_i ≤ L'_g − L_g ≤ max_t (xΔW)_t − Σ_{i=1}^{I} q_i (xΔW)_i.

The detailed proof is provided in Appendix A.14. The above theorem bounds the loss changes with and without the defense; the lower and upper bounds are derived from the Cauchy-Schwarz inequality (Mitrinovic & Vasić, 1970). Intuitively, given the parameter W of the linear layer and the n_k training samples x_{k,j} held by the k-th device, the loss difference with and without defense must be larger than the lower bound Δ_min loss = min_t (x_{k,j} ΔW)_t − Σ_{i=1}^{I} q_i (x_{k,j} ΔW)_i, and smaller than the upper bound Δ_max loss = max_t (x_{k,j} ΔW)_t − Σ_{i=1}^{I} q_i (x_{k,j} ΔW)_i.
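The bound in Theorem 1 can be checked numerically on random instances; this is a sanity check with arbitrary dimensions, not a substitute for the proof in Appendix A.14.

```python
import numpy as np

def ce_loss(x, W, q):
    """Softmax cross-entropy: -sum_i q_i (xW)_i + log sum_t e^{(xW)_t}."""
    s = x @ W
    return -(q * s).sum() + np.log(np.exp(s - s.max()).sum()) + s.max()

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.normal(size=(1, 5))          # one sample, d_x = 5
    W = rng.normal(size=(5, 4))          # I = 4 classes
    dW = rng.normal(size=(5, 4))         # weight difference ΔW = W' - W
    q = np.eye(4)[rng.integers(4)]       # one-hot label
    diff = ce_loss(x, W + dW, q) - ce_loss(x, W, q)   # L'_g - L_g
    s_d = (x @ dW).ravel()
    lo = s_d.min() - (q * s_d).sum()     # Theorem 1 lower bound
    hi = s_d.max() - (q * s_d).sum()     # Theorem 1 upper bound
    assert lo - 1e-9 <= diff <= hi + 1e-9
```

The check passes because log-sum-exp satisfies lse(a + b) − lse(a) ∈ [min_t b_t, max_t b_t], which is exactly the shift term in the bound.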
To facilitate the analysis of attack success rate (ASR) and clean accuracy (ACC) changes, we intuitively aim to quantify how much the ASR will at least be reduced and how well the ACC will at most be maintained. Thus, we study the lower bound (Δ_min loss) on backdoor data, which indicates the minimal improvement of the backdoor defense that reduces the ASR. Similarly, we study the upper bound (Δ_max loss) on clean data, as it indicates the worst-case accuracy degradation. Denote the number of backdoor samples by n_b and the number of clean samples by n_c. Note that a backdoor sample can be written as x_s + δ. By Theorem 1, we have

Δ_min loss = Σ_{s=1}^{n_b} min_t [(x_s + δ)ΔW]_t − Σ_{s=1}^{n_b} Σ_{i=1}^{I} q_{s,i} [(x_s + δ)ΔW]_i.

Similarly, on benign data x_s, we have

Δ_max loss = Σ_{s=1}^{n_c} max_t (x_s ΔW)_t − Σ_{s=1}^{n_c} Σ_{i=1}^{I} q_{s,i} (x_s ΔW)_i.

Next, we develop a sufficient condition on the quality of trigger recovery such that the proposed defense is provably effective in mitigating backdoor attacks while maintaining similar accuracy on clean data, based on Theorem 1.

Theorem 2 (General Robustness Condition). Let

α = ( η_r Σ_{s=1}^{n_b} Σ_{i=1}^{I} (q*_{s,i} − q_{s,i}) { z_s Σ_{j=1}^{n_1} [ z_j^T (q_j − p(z_j)) ] }_i ) / ( b · η_r Σ_{s=1}^{n_b} Σ_{i=1}^{I} (q*_{s,i} − q_{s,i}) { Σ_{j=1}^{n_1} [ z_j^T (q_j − p(z_j)) ] }_i ),

where b = [b_1, ..., b_d], d is the sample dimension, and

b_v = sign( η_r Σ_{s=1}^{n_b} Σ_{i=1}^{I} (q*_{s,i} − q_{s,i}) Σ_{j=1}^{n_1} [ z_j^T (q_j − p(z_j)) ] )_{i,v}

on every dimension v of the vector. Then for all ||ϵ||_∞ ≤ α, we have Δ_min loss ≥ 0, and

Δ_max loss ≤ η_r Σ_{s=1}^{n_c} Σ_{i=1}^{I} (q*_{s,i} − q_{s,i}) x_s Σ_{j=1}^{n_1} [ z_j^T (q_j − p(z_j)) ]_i.

The detailed proof is provided in Appendix A.15. We denote q*_s as a one-hot vector for sample s with I dimensions; its i-th entry is q*_{s,i}, and q*_{s,i} = 1 if i = argmin_t [(x_s + δ)(W' − W)]_t. We denote δ as the ground-truth trigger, ϵ as the difference between the reversed trigger and the ground-truth trigger, and η_r as the learning rate (a.k.a. step size) in round r.
We denote z = x + δ + ϵ for simplicity, i.e., the benign sample stamped with the recovered trigger. Note that Δ_min loss ≥ 0 indicates that the defense is provably more effective than no defense: benign local clients' training increases the global model's loss on backdoor samples, and hence has a positive effect on mitigating the malicious poisoning. The second condition, Δ_max loss ≤ η_r Σ_{s=1}^{n_c} Σ_{i=1}^{I} (q*_{s,i} − q_{s,i}) x_s Σ_{j=1}^{n_1} [ z_j^T (q_j − p(z_j)) ]_i, indicates that the defense provably guarantees maintaining similar accuracy on clean data; here i = argmax_t [x_s (W' − W)]_t (with a slight abuse of the notation q*_s).

Corollary 1. Assume ϵ satisfies Theorem 2. Let n_b be the number of backdoored samples, n_c the number of benign samples, and τ the confidence threshold. Then the number of backdoored samples that are rejected is R_bd = R'_b − R_b, and the number of benign samples that are rejected is R_bn = R'_c − R_c. R'_b and R_b denote the rejected backdoor samples with and without defense: with defense, R'_b = Σ_{j=1}^{n_b} 1(L_g + Δ_min loss > L_τ); without defense, R_b = Σ_{j=1}^{n_b} 1(L_g > L_τ). Thus, the exact number of rejected backdoored samples can be calculated as R_bd = R'_b − R_b. Similarly, R'_c and R_c denote the rejected clean (benign) samples with and without defense: with defense, R'_c = Σ_{j=1}^{n_c} 1(L_g + Δ_max loss > L_τ); without defense, R_c = Σ_{j=1}^{n_c} 1(L_g > L_τ). Thus, the exact number of rejected benign samples can be calculated as R_bn = R'_c − R_c.
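Corollary 1's counting can be sketched as follows; the per-sample losses, the scalar loss threshold `loss_tau`, and the function name are illustrative assumptions.

```python
import numpy as np

def rejected_counts(losses, delta_loss, loss_tau):
    """Corollary 1 counting: with defense, sample j is rejected when
    L_g(j) + delta_loss(j) > L_tau; without defense, when L_g(j) > L_tau.
    Returns the additional rejections attributable to the defense."""
    with_defense = int(np.sum(losses + delta_loss > loss_tau))     # R'
    without_defense = int(np.sum(losses > loss_tau))               # R
    return with_defense - without_defense                          # R' - R

# On backdoor data, Theorem 2 gives delta_min_loss >= 0, so the
# defense can only increase the number of rejected backdoor samples.
bd_losses = np.array([0.5, 0.9, 1.4])
print(rejected_counts(bd_losses, 0.6, 1.0))  # prints 2
```

The same function applied to clean losses with the (bounded) Δ_max loss gives R_bn, the count of benign samples sacrificed by thresholding.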

5. EXPERIMENT

In this section, we empirically evaluate FLIP under two existing attack settings, i.e., the single-shot attack (Bagdasaryan et al., 2020) and the continuous attack (Xie et al., 2019). We compare the performance of FLIP with 9 state-of-the-art defenses: Krum (Blanchard et al., 2017), Bulyan Krum (El Mhamdi et al., 2018), RFA (Pillutla et al., 2022), FoolsGold (Fung et al., 2020), Median (Yin et al., 2018), Trimmed Mean (Yin et al., 2018), Bulyan Trimmed Mean (Buly-Trim-M) (El Mhamdi et al., 2018), FLTrust (Cao et al., 2020), and DnC (Shejwalkar & Houmansadr, 2021). Besides using the experiment settings in (Bagdasaryan et al., 2020) and (Xie et al., 2019), we conduct experiments using the setting of our theoretical analysis to validate the analysis. Moreover, we evaluate FLIP on adaptive attacks and conduct several ablation studies.

5.1. EXPERIMENT SETUP

We conduct the experiments under the PyTorch framework (Paszke et al., 2019). Evaluation Metrics. We consider attack success rate (ASR) and main task accuracy (ACC) as evaluation metrics to measure defense effectiveness. ASR is the ratio of backdoored samples that are misclassified as the attack target label, while ACC is the ratio of correct classifications on benign samples. Certified accuracy is commonly used for evasion attacks, which do not involve training; as data poisoning happens during training, it is more reasonable to certify the behavior of models during training rather than at inference.
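The two metrics can be computed as follows; the function names and the boolean backdoor indicator are our own conventions.

```python
import numpy as np

def attack_success_rate(preds, is_backdoored, target_label):
    """ASR: fraction of backdoored samples classified as the attack target."""
    return float((preds[is_backdoored] == target_label).mean())

def main_task_accuracy(preds, labels, is_backdoored):
    """ACC: fraction of benign samples classified correctly."""
    benign = ~is_backdoored
    return float((preds[benign] == labels[benign]).mean())
```

A defense succeeds when it drives ASR down while leaving ACC close to the no-attack baseline.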

5.2. EVALUATION ON BACKDOOR MITIGATION

We consider backdoor attacks via the model replacement approach, where attackers train their local models with backdoored samples. We follow the single-shot setting in (Bagdasaryan et al., 2020) and the continuous setting in (Xie et al., 2019) to perform the attacks. A single-shot backdoor attack means every adversary participates in only one round, while there can be multiple attackers. The continuous backdoor attack is more aggressive: the attackers are selected in every round and continuously participate in the FL training from beginning to end. Both attack settings take place after the global model converges, since (Xie et al., 2019) found that if attackers poison from the first round, even after enough training rounds the main accuracy remains low and the models are hard to converge. For a fair comparison, we report the ASR and ACC of FLIP and all baselines in the same round. Note that the confidence threshold τ of FLIP introduced in Section 3 is only used against the continuous attacks to filter out low-confidence predictions, since the single-shot attack is easy to defend against. Based on our empirical study, we typically set τ = 0.3 for MNIST and Fashion-MNIST and τ = 0.4 for CIFAR-10, and we evaluate thresholds with the Area Under the Curve metric (Appendix A.4). Single-shot attack results are shown in Table 1. The second row illustrates the attack performance with No Defense. Observe that the single-shot attack achieves more than 80% ASR across all datasets while preserving a high main task accuracy of over 77%. The following rows show the performance of existing SOTA defenses, and the last row shows the FLIP results. FLIP reduces the ASR to below 8% on all three datasets and keeps the benign accuracy degradation within 5%. FLIP outperforms all the baselines on both MNIST and Fashion-MNIST while being slightly worse on CIFAR-10. We show the results of the continuous attack in Table 2. The continuous attack is more aggressive than the single-shot one.
The former's ASR is 3%-20% higher than the latter's when there is no defense. Note that all the existing defense techniques fail in the continuous attack setting: the ASR remains nearly 100% in most cases on MNIST and Fashion-MNIST and is higher than 63% on CIFAR-10. However, FLIP reduces the ASR to a low level, and the accuracy degradation is within an acceptable range. Specifically, FLIP reduces the ASR on MNIST to 2% while the accuracy degradation is within 2%. For Fashion-MNIST and CIFAR-10, the ASR is reduced to below 18% and 23%, respectively, while the accuracy decreases a bit more compared to the results on MNIST and to the single-shot attack. This is reasonable for the following reasons. First, the complexity of the dataset and continuous backdoor attacks may add to the difficulty of recovering good-quality triggers. In addition, there is a trade-off between the adversarial training accuracy and the standard accuracy of a model, as discussed in (Tsipras et al., 2019); adversarial training on benign clients can induce negative effects on accuracy. However, we argue that FLIP still outperforms existing defenses, as the ASR is reduced to a low level.

5.3. EVALUATION ON THE SAME SETTING AS THEORETICAL ANALYSIS

In this section, we conduct an experiment that follows the same setting as the assumptions in our theoretical analysis to validate its correctness. We conduct experiments on multi-class logistic regression (i.e., one linear layer, the softmax function, and the cross-entropy loss), matching the setting of the theoretical analysis in Section 4. We take MNIST as an example; the analysis can be easily extended to other datasets. Regarding the FL system setting, there are one global server, one benign client, and one malicious client. We train the FL global model until convergence and then apply the attack. The attackers inject the pixel-pattern backdoor into images and relabel them from the source label to the target label. We also place no restrictions on the attackers, as long as they follow the federated learning protocol. Table 3 shows the ACC and ASR of the single-shot and continuous attacks on the logistic regression. We can see that both single-shot and continuous attacks' ASRs are reduced to around 5%, and the accuracy degradations are within an acceptable range. This result is consistent with our observations in the more complex settings above. Besides, the exact numbers of rejected clean samples and backdoored samples under different settings can also be calculated, which corresponds to Corollary 1; details can be found in Appendix A.2.

5.4. ADAPTIVE ATTACKS

We study an attack scenario where the adversary has knowledge of FLIP. Our results show that FLIP still mitigates the backdoor attacks in most cases; for the cases where ACC does degrade, the adaptive attack is not effective. Details can be found in Appendix A.3.

5.5. ABLATION STUDY

In this section, we conduct several ablation studies. We show that both adversarial training (Appendix A.5) and thresholding (Appendix A.6) are critical to FLIP. We also evaluate FLIP with a different trigger inversion technique, which mitigates backdoors as well, indicating that FLIP is compatible with different trigger inversion techniques (Appendix A.7). We study the effect of different trigger sizes and show that our defense can cause a significant ASR reduction while maintaining comparable benign classification performance (Appendix A.8). We also study the influence of different thresholds on ACC and ASR and show the trade-off between attack success rate and accuracy (Appendix A.9).

6. CONCLUSION

We propose FLIP, a new provable defense framework for backdoor mitigation in Federated Learning. The key insight is to combine trigger inversion techniques with FL training. As long as the inverted trigger satisfies our given bound, we can guarantee that the attack success rate will decrease while the model maintains similar accuracy on clean data. Our technique significantly outperforms prior work on the SOTA continuous FL backdoor attack. Our framework is general and can be instantiated with different trigger inversion techniques. While applying various trigger inversion techniques, FLIP may incur slight accuracy degradation, but it can significantly boost robustness against backdoor attacks.

A.1 DETAILED EXPERIMENT SETUP

Some class pairs have small class distances, e.g., the classes cat and dog, which makes models vulnerable to backdoor attacks (Tao et al., 2022). The class distance is independent of the number of local client samples; in other words, the class distance reflects the model's robustness itself rather than the number of samples. Regarding the attack setting, there are 100 clients in total by default. In each round we randomly select 10 clients, including 4 adversaries and 6 benign clients. We place no restrictions on attackers as long as they follow the federated learning communication protocol. The attackers inject the pixel-pattern backdoor into images and relabel them from the source label to the target label (by default, label "2"). Figure 2(b) shows a backdoored example. During the testing phase, any input with such a pattern will be classified as the target label. In the single-shot attack, attackers can choose any round to participate. In the continuous attack, attackers participate in every round after model convergence. Benign clients perform adversarial training continuously in both settings. We report the ACC and ASR after the attack has been running for at least 60 rounds, i.e., when the attackers have already achieved a high and stable attack success rate.

A.2 EVALUATION ON THE SAME SETTING AS THEORETICAL ANALYSIS

In this section, we conduct an extended experiment that follows the same setting as our assumptions to validate our theoretical analysis. Table 4 shows the sample counts of clean samples and backdoored samples under different settings.

A.3 ADAPTIVE ATTACKS

We consider attackers who are aware of FLIP and adapt accordingly: (1) attackers invert triggers on their local models, as benign clients do; (2) attackers stamp the inverted triggers on their local images and add them to the training phase for backdoor attacks; (3) attackers submit the updated model weights to the global server. We conduct experiments on three datasets under the continuous attack setting. Table 5 shows the results. Observe that even under an adaptive attack setting, FLIP can still mitigate the backdoor attacks on both MNIST and Fashion-MNIST. On CIFAR-10, the accuracy drops and the adaptive attack is not effective. This indicates that even when the attackers are aware of our technique during poison training, under the FLIP framework, benign clients can still effectively reduce the attackers' poisoning confidence and keep the attack success rate in a low range.

A.4 AUC-ROC EVALUATION OF CONFIDENCE-BASED REJECTION

In this section, we take MNIST as an example to show the AUC-ROC curves (Area Under the Receiver Operating Characteristic Curve); other datasets can be analyzed similarly. Figure 3 shows the AUC-ROC curve of our confidence-based sample rejection on MNIST. The curve plots the TPR (True Positive Rate, y-axis) against the FPR (False Positive Rate, x-axis). In our evaluation, the AUC is 0.97. As stated by many existing works (Mandrekar, 2010), an AUC of 0.5 (indicated by the orange dashed line) means the model is unable to discriminate positive and negative samples, while an AUC higher than 0.9 is considered outstanding. Therefore, our confidence-based rejection strategy is effective in distinguishing backdoored samples from benign samples.

A.5 EFFECT OF ADVERSARIAL TRAINING

In Table 6, we observe that without adversarial training, malicious clients can successfully inject backdoor patterns even with a high confidence threshold τ. The underlying reason is that adversarial training on benign clients hardens the model against malicious samples and reduces the model's confidence on malicious samples.
Interestingly, we notice that on MNIST the ASR drops even compared with no defense. The reason could be that MNIST features are simpler; with many benign clients continuously training in parallel, the model quickly forgets the injected backdoor patterns (Wang et al., 2020a), so the attacker's poisoning confidence is reduced and part of the samples are rejected. Hence, the results show that adversarial training is significantly effective in reducing the attacker's confidence on backdoor samples during backdoor training, which is consistent with our theoretical analysis.

A.6 EFFECT OF CONFIDENCE THRESHOLD

In this section, we demonstrate that thresholding is a critical component of FLIP by evaluating our defense with and without it; the results can be found in Table 7. We conduct thresholding experiments under continuous backdoor attacks on three datasets. Each benign client performs trigger inversion and adversarial training as before, while at global inference time, we set the confidence threshold τ to 0 (i.e., no threshold) and keep all other settings unchanged. We observe that without thresholding, though the ASRs are reduced to some extent, they are still much higher compared with the FLIP results in Table 2. The underlying reason is that adversarial training does help reduce the confidence of backdoored samples; however, without applying a confidence threshold to reject them, the ASR remains high. This validates that the threshold is critical in FLIP, and the observation is consistent with our results in Corollary 1.

A.7 OTHER TRIGGER INVERSION TECHNIQUES EVALUATION

In general, FLIP is compatible with any trigger inversion technique. In this section, we use another widely used technique, ABS (Liu et al., 2019), as the trigger inversion component of our framework. Specifically, we replace the "Trigger inversion" part in Figure 1 with ABS, while keeping all other settings the same. We conduct experiments in both the single-shot and continuous attack settings. Note that we only evaluate CIFAR-10, since the released version of ABS focuses on complex datasets with three color channels instead of greyscale images. In each training round of the local clients, we use ABS to invert the 10 most likely triggers and perform adversarial training. Table 8 shows the evaluation results, which are consistent with the results shown previously in Tables 1 and 2. We observe that in the continuous attack, FLIP equipped with ABS keeps a higher clean accuracy (74% compared to 71% in Table 2), and both reduce the ASR to a low level, near 22%. However, in the single-shot attack, FLIP with ABS only reduces the ASR to 8%. The underlying reason is that ABS inverts effective triggers within a small size range, while the method in our main text is more aggressive in hardening the model. The results demonstrate that FLIP is generally effective with various downstream trigger inversion techniques against backdoor attacks.
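For readers unfamiliar with trigger inversion, the following is a highly simplified sketch of the generic optimization such techniques perform, shown on a toy linear softmax model rather than ABS itself (the function names and toy setup are illustrative, not the actual ABS procedure): gradient-descend a small additive perturbation that flips clean inputs to a chosen target label.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def invert_trigger(W, x_clean, target, steps=500, lr=0.3, lam=0.01):
    """Toy inversion for a linear model p(x) = softmax(xW): optimize an
    additive perturbation delta that pushes every clean input toward the
    target class, with an L1 penalty keeping delta small."""
    d, num_classes = W.shape
    delta = np.zeros(d)
    q = np.zeros(num_classes)
    q[target] = 1.0
    for _ in range(steps):
        p = softmax((x_clean + delta) @ W)       # (n, I) class probabilities
        grad = ((p - q) @ W.T).mean(axis=0)      # d CE(target) / d delta
        delta -= lr * (grad + lam * np.sign(delta))
    return delta

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
x = rng.normal(size=(8, 5))
delta = invert_trigger(W, x, target=2)
flipped = (softmax((x + delta) @ W).argmax(axis=1) == 2).mean()
print(flipped)  # fraction of toy inputs flipped to the target class
```

Any component with this interface (model in, candidate trigger out) can be dropped into the "Trigger inversion" box of Figure 1.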

A.8 IMPACT OF TRIGGER SIZE

In this section, we study the effect of different trigger sizes; the evaluation results are in Table 9. We define the initial trigger size as X; 2*X denotes a trigger scaled up two times compared with the initial trigger. We conduct experiments with trigger sizes of 1*X, 2*X, 4*X, 6*X, and 8*X. Taking MNIST as an example, we observe that the single-shot ASR is low when the trigger size (TS) is 1*X; the reason is that each local trigger is too small to be recognized during the global model testing phase. The evaluation shows that our defense can significantly degrade the ASR while maintaining comparable benign classification performance, no matter how the trigger size changes.

A.9 TRADE-OFF BETWEEN ATTACK SUCCESS RATE AND ACCURACY

In this section, we show the trade-off between attack success rate and accuracy when we apply the confidence threshold. We conduct an extensive evaluation to study the influence of different thresholds on ACC and ASR. We test our framework on the MNIST dataset in the continuous attack setting with three thresholds: 0.0, 0.3, and 0.7. With increasing confidence threshold, the ACC is 97.2%, 96.62%, and 88.86%, respectively, while the ASR is 22.35%, 1.93%, and 0.91%, respectively. We observe that benign local model hardening has a controllable negative effect on accuracy. Meanwhile, there is a trade-off between the adversarial-training accuracy and the standard accuracy of a model (Tsipras et al., 2019). If we aim for a much lower attack success rate, we sacrifice part of the clean accuracy: when we set a higher threshold, the ASR indeed decreases, but some low-confidence benign samples are also rejected, which reduces benign accuracy to some extent.

A.10 DISCUSSION ON OTHER DEFENSES

In this section, we provide additional experimental results comparing Multi-KRUM (Blanchard et al., 2017) and our method. Taking the CIFAR-10 dataset as an example, in the single-shot attack Multi-KRUM drops the ASR from 80.46% to 4.18%, while our defense's ASR is 7.83%. However, in the continuous attack, Multi-KRUM only reduces the ASR from 84.73% to 61.86%, whereas our defense's ASR is 17.27%, with the ACC at a similar level. Our technique thus achieves comparable performance to Multi-KRUM in the single-shot attack and outperforms it in the more complex continuous attack scenario. In addition, we also tried to evaluate FLAME (Nguyen et al., 2021). We contacted the authors of FLAME several times for their experiment and parameter setup but had received no response by submission time.
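For reference, Multi-KRUM scores each update by its summed squared distance to its closest n − f − 2 neighbors and averages the m best-scoring updates. A compact sketch under toy vectors (not our experimental setup) follows.

```python
import numpy as np

def multi_krum(updates, f, m):
    """Multi-KRUM aggregation sketch (Blanchard et al., 2017).

    Each update's score is the sum of squared distances to its n - f - 2
    nearest other updates; the m lowest-scoring updates are averaged.
    """
    n = len(updates)
    U = np.stack(updates)
    d2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    scores = []
    for i in range(n):
        others = np.delete(d2[i], i)
        scores.append(np.sort(others)[: n - f - 2].sum())
    chosen = np.argsort(scores)[:m]
    return U[chosen].mean(axis=0), chosen

# Six benign updates near zero plus two large poisoned outliers.
rng = np.random.default_rng(1)
updates = [rng.normal(0, 0.1, 4) for _ in range(6)] + \
          [np.full(4, 5.0), np.full(4, 5.0)]
agg, chosen = multi_krum(updates, f=2, m=3)
print(sorted(chosen.tolist()))  # the selected indices are all benign (0..5)
```

Distance-based selection like this works well when poisoned updates are geometric outliers (single-shot attacks) but degrades when continuous attackers blend into the update distribution, consistent with the comparison above.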

A.11 JUSTIFICATIONS FOR SOTA DEFENSES NOT WORKING

In this section, we provide concrete justifications for why SOTA defenses yield a nearly 100% attack success rate in the continuous attack setting. Continuous backdoor attacks denote that in each round the attackers are selected and continuously participate in federated learning. We suspect there are three reasons SOTA defenses perform poorly on continuous attacks. First, continuous backdoor attacks are more aggressive: in each round, 40% of the selected participants are attackers, and they participate in every round of model training. Second, as mentioned in (Wang et al., 2020a), even under a very low attack frequency, the attacker still manages to gradually inject the backdoor as long as federated learning runs for long enough. Third, some of their assumptions are restrictive: e.g., FoolsGold (Fung et al., 2020) assumes that benign data are non-iid while the manipulated data are iid, which could cause FoolsGold to be effective only under simpler attack scenarios, e.g., single-shot attacks.

A.12 IMPACT OF TRIGGER QUALITY

In this section, we study the effect of trigger quality; the evaluation results are in Table 10. We evaluate the CIFAR-10 dataset on single-shot attacks, injecting triggers with random shapes in both white and color. We observe that randomly chosen triggers do not have much influence on the attackers and cannot reduce the attack success rate. However, the ground-truth triggers do achieve the best performance in both reducing the ASR on the backdoor task and maintaining the ACC on the benign task. The evaluation shows that random triggers cannot reduce the ASR, while the FLIP-inverted trigger achieves performance comparable to the ground-truth trigger and can be further improved in future work.
In this section, we present the bound on the loss changes, formulate the benign local clients' training and the global model aggregation process, and then provide the detailed proofs related to the loss-change bound of our Theorem 1. Note that we list all the notations used in the paper in Table 12.

Generalizing the proof to complex model architectures. Existing works (Xie et al., 2021; Li et al., 2021; Nguyen et al., 2021) all focus on linear models, as the complexity of real-world models makes theoretical analysis infeasible. Besides the theoretical results, we empirically extended to non-linear models and showed in Section 5 that our defense gives outstanding performance against state-of-the-art backdoor attacks, consistent with our theoretical analysis.

The aggregated global weight can be written as

W_{r+1} = \sum_{k=1}^{N} W^k_{r+1} = W^1_{r+1} + W^2_{r+1} = -\eta_r \sum_{j=1}^{n_1} [x_j^{T,1} (p(x_j)^1 - Y_j^1)] - \eta_r \sum_{j=1}^{n_1} [(x_j + \delta + \epsilon)^{T,1} (p(x_j + \delta + \epsilon)^1 - Y_j^1)] + 2W_r - W_M.

In the without-defense setting, δ + ϵ does not exist, and in round r + 1 the global weight can be written as

W_{r+1} = -\eta_r \sum_{j=1}^{n_1} [x_j^{T,1} (p(x_j)^1 - Y_j^1)] + 2W_r - W_M.   (12)

In the with-defense setting, δ + ϵ exists, and in round r + 1 the global weight can be written as

W'_{r+1} = -\eta_r \sum_{j=1}^{n_1} [x_j^{T,1} (p(x_j)^1 - Y_j^1)] - \eta_r \sum_{j=1}^{n_1} [(x_j + \delta + \epsilon)^{T,1} (p(x_j + \delta + \epsilon)^1 - Y_j^1)] + 2W_r - W_M.

The difference between with-defense and without-defense training is exactly how much the adversarial training on the benign client influences the other clients:

W'_{r+1} - W_{r+1} = -\eta_r \sum_{j=1}^{n_1} [(x_j + \delta + \epsilon)^{T,1} (p(x_j + \delta + \epsilon)^1 - Y_j^1)].

Given the model parameter W of one linear layer, the k-th device holds n_k training data {x_{k,j}, y_{k,j}}_{j=1}^{n_k}. We denote the loss as L(W; {x_{k,j}, y_{k,j}}_{j=1}^{n_k}).
Denote xW as the output of the linear layer and p_i(x) = softmax(xW + b)_i as the normalized probability for class i (the output of the softmax function). We omit the bias b in the following theoretical analysis for simplicity; adding the bias term to our analysis is straightforward. The global softmax cross-entropy loss function can be written as:

L_global = -\sum_{i=1}^{I} q_i \log(p_i) = -\sum_{i=1}^{I} q_i \log \mathrm{softmax}(xW)_i = -\sum_{i=1}^{I} q_i \log\!\Big(\frac{e^{(xW)_i}}{\sum_{t=1}^{I} e^{(xW)_t}}\Big) = -\sum_{i=1}^{I} q_i (xW)_i + \log\!\Big(\sum_{t=1}^{I} e^{(xW)_t}\Big).

Since we want to compare the loss changes in the two cases (the with-defense and without-defense settings) and observe whether their difference increases or decreases, we subtract the two losses (L'_g with defense, L_g without defense):

L'_g - L_g = -\sum_{i=1}^{I} q_i (xW')_i + \log\!\Big(\sum_{t=1}^{I} e^{(xW')_t}\Big) + \sum_{i=1}^{I} q_i (xW)_i - \log\!\Big(\sum_{t=1}^{I} e^{(xW)_t}\Big) = -\sum_{i=1}^{I} q_i [x(W' - W)]_i + \log\frac{\sum_{t=1}^{I} e^{(xW')_t}}{\sum_{t=1}^{I} e^{(xW)_t}}.

Since the ratio of the two sums is bounded between m = \min_t e^{(xW')_t}/e^{(xW)_t} and M = \max_t e^{(xW')_t}/e^{(xW)_t}, the difference L'_g - L_g can be bounded as:

\log m - \sum_{i=1}^{I} q_i [x(W' - W)]_i \le L'_g - L_g \le \log M - \sum_{i=1}^{I} q_i [x(W' - W)]_i,   (24)

i.e.,

\log \min_t \frac{e^{(xW')_t}}{e^{(xW)_t}} - \sum_{i=1}^{I} q_i [x(W' - W)]_i \le L'_g - L_g \le \log \max_t \frac{e^{(xW')_t}}{e^{(xW)_t}} - \sum_{i=1}^{I} q_i [x(W' - W)]_i.

Denote the left-hand side of the above formula as ∆_min loss and the right-hand side as ∆_max loss. Then

∆_min loss = \log \min_t \frac{e^{(xW')_t}}{e^{(xW)_t}} - \sum_{i=1}^{I} q_i [x(W' - W)]_i = \min_t \log e^{[x(W' - W)]_t} - \sum_{i=1}^{I} q_i [x(W' - W)]_i = \min_t [x(W' - W)]_t - \sum_{i=1}^{I} q_i [x(W' - W)]_i,

and symmetrically for ∆_max loss with min replaced by max. We thus obtain the lower and upper bounds of L'_g - L_g:

\min_t [x(W' - W)]_t - \sum_{i=1}^{I} q_i [x(W' - W)]_i \le L'_g - L_g \le \max_t [x(W' - W)]_t - \sum_{i=1}^{I} q_i [x(W' - W)]_i.

Let L'_g denote the global model loss with defense and L_g that without defense, and let ∆W = W' - W denote the weight difference with and without defense.
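The bound in (24) can be sanity-checked numerically. The sketch below, on random toy dimensions not tied to our experiments, verifies that the cross-entropy loss difference indeed falls between the two bounds.

```python
import numpy as np

def ce_loss(x, W, q):
    """Softmax cross-entropy for a single sample under a linear model xW."""
    z = x @ W
    logp = z - (z.max() + np.log(np.exp(z - z.max()).sum()))  # log softmax
    return -(q * logp).sum()

rng = np.random.default_rng(0)
d, I = 6, 4
x = rng.normal(size=d)
W = rng.normal(size=(d, I))             # weights without defense
Wp = W + 0.1 * rng.normal(size=(d, I))  # hypothetical weights with defense
q = np.zeros(I)
q[1] = 1.0                              # one-hot label distribution

diff = ce_loss(x, Wp, q) - ce_loss(x, W, q)  # L'_g - L_g
v = x @ (Wp - W)                              # the vector x(W' - W)
lower = v.min() - (q * v).sum()
upper = v.max() - (q * v).sum()
print(lower <= diff <= upper)  # True: the bound holds
```

The bound holds for any x, W, W', and distribution q, since the log-sum-exp ratio is a softmax-weighted average of e^{v_t} and therefore lies between min_t e^{v_t} and max_t e^{v_t}.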
The loss difference with and without defense can then be upper- and lower-bounded (as shown in Theorem 1) by

\min_t (x \Delta W)_t - \sum_{i=1}^{I} q_i (x \Delta W)_i \le L'_g - L_g \le \max_t (x \Delta W)_t - \sum_{i=1}^{I} q_i (x \Delta W)_i.

To facilitate the analysis, we denote the upper bound as ∆_max loss and the lower bound as ∆_min loss. To efficiently reduce the attack success rate while maintaining the clean accuracy, we study this lower bound on backdoor data, as it indicates the minimal improvement of the backdoor defense. Similarly, we study the upper bound on clean data, as it indicates the worst-case accuracy degradation. Denote the number of backdoor samples as n_b and the number of benign samples as n_c; backdoor samples are written as x_s + δ. By Theorem 1, we have

∆_min loss = \sum_{s=1}^{n_b} \min_t [(x_s + \delta) \Delta W]_t - \sum_{s=1}^{n_b} \sum_{i=1}^{I} q_{s,i} [(x_s + \delta) \Delta W]_i,

and similarly on benign data x_s,

∆_max loss = \sum_{s=1}^{n_c} \max_t (x_s \Delta W)_t - \sum_{s=1}^{n_c} \sum_{i=1}^{I} q_{s,i} (x_s \Delta W)_i.

A.15 PROOF OF GENERAL ROBUSTNESS CONDITION

In this section, we present the general robustness condition on trigger generation, formulate ∆_min loss on backdoored data and ∆_max loss on clean data, and then provide the detailed proofs related to the general robustness condition of our Theorem 2. Our intuition is that we want the loss to increase more on backdoored data and less on clean data: after applying the defense, the global server's loss on backdoored data should increase, while the loss on clean data should change only within a constant range. Accordingly, when evaluating on n_b backdoored data, we want the lower bound to be at least greater than 0, i.e., ∆_min loss ≥ 0; when evaluating on n_c clean data, we want the upper bound ∆_max loss ≤ ζ, where ζ is a constant. Suppose the global server has n_b backdoored data and n_c clean data for testing. When evaluating on n_b backdoored data,

L'_g - L_g \ge \sum_{s=1}^{n_b} \min_t [(x_s + \delta)(W' - W)]_t - \sum_{s=1}^{n_b} \sum_{i=1}^{I} q_{s,i} [(x_s + \delta)(W' - W)]_i = \Delta_{\min} loss \ge 0.

We denote q*_s as a one-hot vector for sample s with I dimensions; its i-th dimension is q*_{s,i}, with q*_{s,i} = 1 if i = \arg\min_t [(x_s + \delta)(W' - W)]_t. Substituting ϵ = αb into eq. (34), and replacing Y_j with q_j to be consistent with Theorem 2 in the main text, we have

f(\epsilon) \ge \eta_r \sum_{s=1}^{n_b} \sum_{i=1}^{I} (q^*_{s,i} - q_{s,i}) \Big\{ (z_s - \alpha b) \sum_{j=1}^{n_1} [z_j^T (q_j - p(z_j))] \Big\}_i.   (38)

The sufficient condition for f(ϵ) ≥ 0 is thus that the term involving αb does not outweigh the term involving z_s. Note that for any vector x we have sign(x) x ≥ 0. Dividing the right-hand side by the left-hand side finishes the proof:

\alpha \le \frac{\eta_r \sum_{s=1}^{n_b} \sum_{i=1}^{I} (q^*_{s,i} - q_{s,i}) \big\{ z_s \sum_{j=1}^{n_1} [z_j^T (q_j - p(z_j))] \big\}_i}{b \, \eta_r \sum_{s=1}^{n_b} \sum_{i=1}^{I} (q^*_{s,i} - q_{s,i}) \big\{ \sum_{j=1}^{n_1} [z_j^T (q_j - p(z_j))] \big\}_i}.

Each term above can be computed, so we can always find a small enough error range ϵ within which the loss function is surely improved.
Similarly, for the upper bound ∆_max loss, we denote q*_s as a one-hot vector for sample s with I dimensions; its i-th dimension is q*_{s,i}, with q*_{s,i} = 1 if i = \arg\max_t [x_s (W' - W)]_t. Let

g(\epsilon) = \Delta_{\max} loss = \eta_r \sum_{s=1}^{n_c} \sum_{i=1}^{I} (q^*_{s,i} - q_{s,i}) \Big\{ x_s \sum_{j=1}^{n_1} [z_j^T (q_j - p(z_j))] \Big\}_i.   (42)

Note that g(ϵ) is constant with respect to ϵ; the upper-bound loss is thus bounded by a constant determined by the recovered trigger z_j. The condition ∆_min loss ≥ 0 indicates that the defense is provably more effective than no defense: benign local clients' training increases the global model's backdoor loss and therefore has a positive effect in mitigating the malicious poisoning. The second condition, ∆_max loss ≤ \eta_r \sum_{s=1}^{n_c} \sum_{i=1}^{I} (q^*_{s,i} - q_{s,i}) \{ x_s \sum_{j=1}^{n_1} [z_j^T (q_j - p(z_j))] \}_i, indicates that the defense provably maintains similar accuracy on clean data.

Notation | Description

x_{k,j}, y_{k,j} | the j-th data sample of the k-th client device and its label
q_{s,i} | the i-th dimension of the s-th sample's label vector



Figure 1: Overview of FLIP. The upper left part (red box) shows the malicious client's backdoor attack, and the lower left part (green box) illustrates the main steps of benign client model training; both submit their local updates to the global server. The middle part illustrates that the global server aggregates all received local model weights and updates the global model. The right part shows global server inference based on the updated global model. On benign clients, we do not assume any knowledge of the ground-truth trigger.

The procedure is summarized in Algorithm 1. (1) Trigger inversion. During local client training, benign local clients apply trigger inversion techniques to recover the triggers, stamp them on clean images (without changing their original labels) to constitute the augmented dataset. (2) Model hardening. Benign local clients combine the augmented data with the clean data to perform model hardening (adversarial training). The local clients submit updated local model weights to the global server, which aggregates all the received weights. (3) Low-confidence sample rejection. Our adversarial training can substantially reduce the prediction confidence of backdoor samples. During inference, we apply an additional sample filtering step, in which we use a threshold to preclude samples with low prediction confidence. Note that this filtering step is infeasible for most existing techniques as they focus on rejecting abnormal weights during training.

Figure 2: Trigger examples. (a) the ground-truth trigger; (b) data poisoned by the malicious client; (c) augmented data after benign-client trigger inversion.

By Corollary 1, the exact values of rejected backdoored samples (R_b / R'_b) and rejected clean samples (R_c / R'_c) can be calculated. The "No defense" column gives the counts of R_b and R_c without defense; the last column ("FLIP") gives the counts of R'_b and R'_c with defense, corresponding to the quantities defined in the theoretical analysis.

Figure 3: AUC-ROC Curves

Throughout this paper, "clean training" refers to benign local clients training with clean data. "Adversarial training" refers to benign local clients applying trigger inversion techniques to obtain a reversed trigger, stamping the trigger onto their local clean images while assigning the ground-truth clean labels to obtain the augmented dataset, and then training on that augmented dataset. On benign clients, we generate the trigger with our defense technique, perform adversarial training, and submit gradients to the global server. Given the model parameter W of one linear layer, the k-th device holds n_k training data {x_{k,j}, y_{k,j}}, and we denote the loss as L(W; {x_{k,j}, y_{k,j}}). Let Y ∈ {0, 1}^I denote a one-hot label vector of local samples. For x, we denote xW as the output of the linear layer and p_i(x) = softmax(xW + b)_i as the normalized probability for class i (the output of the softmax function).
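The adversarial-training data construction described above can be sketched as follows; the toy arrays stand in for real images, and `stamp` is a hypothetical helper, not the released implementation.

```python
import numpy as np

def stamp(images, trigger, mask):
    """Stamp a (reversed) trigger onto images: trigger pixels replace
    image pixels wherever the binary mask is 1."""
    return images * (1 - mask) + trigger * mask

def build_augmented_set(images, labels, trigger, mask):
    """Benign-client adversarial-training data: stamped copies of the
    clean images that KEEP their ground-truth clean labels."""
    stamped = stamp(images, trigger, mask)
    return np.concatenate([images, stamped]), np.concatenate([labels, labels])

# Toy 4x4 grayscale "images" with a 2x2 all-white trigger patch in the corner.
rng = np.random.default_rng(0)
images = rng.uniform(size=(3, 4, 4))
labels = np.array([0, 1, 2])
mask = np.zeros((4, 4)); mask[:2, :2] = 1.0
trigger = np.ones((4, 4))

aug_x, aug_y = build_augmented_set(images, labels, trigger, mask)
print(aug_x.shape, aug_y.tolist())  # (6, 4, 4) [0, 1, 2, 0, 1, 2]
```

Keeping the clean labels on the stamped copies is what forces the model to unlearn the association between the trigger feature and the attacker's target label.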

\sum_{s=1}^{n_c} \max_t [x_s (W' - W)]_t - \sum_{s=1}^{n_c} \sum_{i=1}^{I} q_{s,i} [x_s (W' - W)]_i = \Delta_{\max} loss \le \zeta.   (30)

From the previous results, we know W'_{r+1} - W_{r+1} can be represented as

W'_{r+1} - W_{r+1} = -\eta_r \sum_{j=1}^{n_1} [(x_j + \delta + \epsilon)^T (p(x_j + \delta + \epsilon) - Y_j)] = \eta_r \sum_{j=1}^{n_1} [(x_j + \delta + \epsilon)^T (Y_j - p(x_j + \delta + \epsilon))].

\Delta_{\min} loss = \sum_{s=1}^{n_b} \min_t [(x_s + \delta)(W' - W)]_t - \sum_{s=1}^{n_b} \sum_{i=1}^{I} q_{s,i} [(x_s + \delta)(W' - W)]_i = \eta_r \sum_{s=1}^{n_b} \sum_{i=1}^{I} (q^*_{s,i} - q_{s,i}) \Big[ (x_s + \delta) \sum_{j=1}^{n_1} [(x_j + \delta + \epsilon)^T (Y_j - p(x_j + \delta + \epsilon))] \Big]_i.   (32)

Let z_s = x_s + δ + ϵ and z_j = x_j + δ + ϵ; then

\Delta_{\min} loss = \eta_r \sum_{s=1}^{n_b} \sum_{i=1}^{I} (q^*_{s,i} - q_{s,i}) \Big[ (z_s - \epsilon) \sum_{j=1}^{n_1} [z_j^T (Y_j - p(z_j))] \Big]_i.   (35)

Let ||ϵ||_∞ ≤ α. Since f(ϵ) is a linear function of ϵ, its minimal value is achieved when ϵ_k = α \, \mathrm{sign}\big( [\sum_{s,i} (q^*_{s,i} - q_{s,i}) \sum_{j=1}^{n_1} [z_j^T (Y_j - p(z_j))]]_{i,k} \big); writing this sign vector as b = [b_1, ..., b_d], the minimal condition is thus ϵ = αb.


η_r | the learning rate (a.k.a. step size)
W^k_r | the weights of the k-th client device in the r-th round
W | local model weights without defense
W' | local model weights with defense
τ | confidence threshold
R_b | the number of rejected backdoor samples without defense
R'_b | the number of rejected backdoor samples with defense
R_bd = R'_b - R_b | the number of backdoored samples rejected after the defense is applied
R_c | the number of rejected benign samples without defense
R'_c | the number of rejected benign samples with defense
R_bn = R'_c - R_c | the number of benign samples rejected after the defense is applied
δ | the ground-truth trigger
δ + ϵ | the trigger recovered by a trigger inversion technique
ϵ | the difference between the reversed trigger and the ground-truth trigger
z = x + δ + ϵ | the benign sample stamped with the recovered trigger
L_g | the global model loss without defense
L'_g | the global model loss with defense

and reimplement existing attacks and defenses following their original designs. Regarding the attack setting, there are 100 clients in total by default. In each round we randomly select 10 clients, including 4 adversaries. We consider a non-i.i.d. data distribution, which is more practical in real-world applications, and leverage the same setting as (Bagdasaryan et al., 2020), which applies a Dirichlet distribution (Minka, 2000) with parameter α = 0.5 to model the non-i.i.d. distribution. The poison ratio denotes the fraction of backdoored samples added to each training batch: 20/64 for MNIST and Fashion-MNIST, and 5/64 for CIFAR-10.

Single-shot attack evaluation

Continuous attack evaluation

Logistic regression evaluation

Logistic regression sample counts

Attack Type | Samples count | Total samples | No defense | FLIP

Adaptive attacks evaluation

Effect of adversarial training

Effect of confidence threshold

Other Trigger Inversion Techniques Evaluation

Trigger Size

Trigger Quality

In this section, we leverage the Dirichlet distribution (Minka, 2000) with a hyperparameter α to model different non-i.i.d. distributions. By increasing the hyperparameter α in the Dirichlet distribution, we can simulate from non-i.i.d. to i.i.d. distributions for the datasets (Xie et al., 2019). Here we evaluate MNIST on single-shot attacks without defense and with FLIP. We conduct experiments with α values of 0.2, 0.4, 0.6, 0.8, 1.0, and 2.0. The evaluation demonstrates that without defense, backdoor attack performance is affected by the non-i.i.d. degree. However, our defense still causes a significant ASR degradation across different non-i.i.d. degrees, with only slight differences, while maintaining comparable benign classification performance.

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments. This research was supported, in part by IARPA TrojAI W911NF-19-S-0012, NSF 1901242 and 1910300, ONR N000141712045, N000141410468 and N000141712947. Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

ETHICS STATEMENT

In this paper, our studies do not involve human subjects, dataset-release practices, or discrimination/bias/fairness concerns, and they raise no legal-compliance or research-integrity issues. Backdoor attacks aim to make any input stamped with a specific pattern misclassified to a target label; backdoors are hence becoming a prominent security threat to the real-world deployment of federated learning. FLIP is a provable defense framework that provides a sufficient condition on the quality of trigger recovery such that the proposed defense is provably effective in mitigating backdoor attacks.

REPRODUCIBILITY STATEMENT

The implementation code is available at https://github.com/KaiyuanZh/FLIP. All datasets and code platform (PyTorch) we use are public. In addition, we also provide detailed experiment parameters in the Appendix.

A APPENDIX

We provide a simple 

A.1 EXPERIMENT SETUP

In this section, we illustrate more details about the experimental setups, neural network structures, parameter setups, etc. For more detailed hyperparameter settings and evaluations, please refer to our code repository; we will release our code upon paper acceptance. We train the FL system following our FLIP framework on three datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2009). MNIST has a training set of 60,000 examples, a test set of 10,000 examples, and 10 classes. Fashion-MNIST consists of a training set of 60,000 examples and a test set of 10,000 examples; each example is a 28x28 grayscale image associated with a label from 10 classes. CIFAR-10 is an object recognition dataset with 32x32 colour images in 10 classes; it consists of 60,000 images divided into a training set (50,000 images) and a test set (10,000 images). We split the training data for FL clients in a non-i.i.d. manner by a Dirichlet distribution (Minka, 2000) with hyperparameter α = 0.5, following the same setting as (Bagdasaryan et al., 2020; Xie et al., 2019). We train the FL global model until convergence and then apply the various trigger inversion defense techniques; otherwise the main-task accuracy is low and the backdoored model is hard to converge (Xie et al., 2019). For the augmented dataset, we use 64 augmented samples in each batch of each local client's training, following existing work on backdoor removal (Tao et al., 2022). Note that the confidence threshold τ of FLIP discussed in Methodology Section 3 is only used in the continuous backdoor attack setting to filter out low-confidence predictions. Based on our empirical study, we typically set τ = 0.3 for simpler datasets, e.g., MNIST and Fashion-MNIST, and τ = 0.4 for more complex datasets, e.g., CIFAR-10. We use two convolutional layers and two fully connected layers for MNIST and Fashion-MNIST, and ResNet-18 for CIFAR-10.
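The Dirichlet-based non-i.i.d. split used here can be sketched as follows; this is a generic implementation consistent with the setting of (Bagdasaryan et al., 2020; Xie et al., 2019), not the exact released code.

```python
import numpy as np

def dirichlet_split(labels, n_clients, alpha, seed=0):
    """Partition sample indices across clients, drawing each class's
    client proportions from Dirichlet(alpha); small alpha gives a highly
    non-i.i.d. split, large alpha approaches an i.i.d. split."""
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            clients[k].extend(part.tolist())
    return clients

# Toy labels: 10 classes with 100 samples each.
labels = np.repeat(np.arange(10), 100)
clients = dirichlet_split(labels, n_clients=10, alpha=0.5)
print(sum(len(c) for c in clients))  # 1000: every sample is assigned once
```

With α = 0.5 most clients end up with skewed class histograms; sweeping α (as in the non-i.i.d. study above) interpolates toward uniform per-client class proportions.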

FLTrust Setting

In FLTrust (Cao et al., 2020), we follow the original settings and collect a root dataset of 100 training examples for the learning task; the root dataset has the same distribution as the overall training data of the learning task. We exclude the sampled root dataset from the clients' local training data, i.e., the root dataset is collected independently by the global server.

Class distance

The class distance is defined via the trigger size, which we measure with the L1 norm. The intuition is that if it is easy to generate a small trigger from the source class to the target class, the distance between the two classes is small; otherwise, the class distance is large. Furthermore, a model is robust when all its class distances are large, since otherwise one can easily generate a small trigger between two classes. The class distances of models in the wild are very small and do not align well with human intuition; for example, the classes turtle and bird have small class distances.

The bias b is omitted in the following equations for simplicity, but the analysis still works if it is added. For one sample, the cross-entropy loss is calculated as

L(W; x, Y) = -\sum_{i=1}^{I} Y_i \log p_i(x).

We define G as the gradient for one sample:

G = x^T (p(x) - Y).

Similarly, when the defense technique obtains the reversed trigger and stamps it on a clean image, we get the augmented dataset, denoted x_aug; the gradient on the augmented dataset G' can be written as

G' = (x + \delta + \epsilon)^T (p(x + \delta + \epsilon) - Y).

Here, we describe one round (say the r-th) of the standard FedAvg algorithm. When the k-th benign device receives the global weights W_r, it performs E (= 1) local updates (let W^k_r = W_r); benign clients train on both the clean dataset and the augmented dataset:

W^k_{r+1} = W^k_r - \eta_r \sum_{j=1}^{n_k} G_j,

where η_r is the learning rate (a.k.a. step size) and n_k is the number of samples on the k-th client. On the global server, define δ as the malicious clients' generated trigger and δ + ϵ as the benign clients' generated trigger; then a backdoored sample can be represented as (x + δ) and an augmented sample as (x + δ + ϵ). In the threat model, we consider the practical oblivious-but-honest attack setting: the defender has no control over malicious clients, and they can perform any kind of attack as long as they follow the federated learning protocol. Our proof focuses on the two-client setting, one benign and one malicious. Thus, we represent the malicious clients' updates as W_M. After each local client finishes its training, it submits its model update to the global server. The global aggregation step performs

W_{r+1} = \sum_{k=1}^{N} g_k W^k_{r+1},

where g_k is the weight of the k-th device. To simplify, we take g_k as 1 and assume only two clients (N = 2), where k = 1 is the benign client and n_1 denotes its number of samples. The aggregated global weight is then the sum of the local weights.
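The single-sample gradient G = x^T(p(x) − Y) used throughout the proofs can be verified against finite differences on a toy linear model (random toy dimensions, purely for checking the formula):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce(W, x, Y):
    """Cross-entropy of the linear softmax model p(x) = softmax(xW)."""
    p = softmax(x @ W)
    return -(Y * np.log(p)).sum()

rng = np.random.default_rng(0)
d, I = 5, 3
x = rng.normal(size=d)
W = rng.normal(size=(d, I))
Y = np.zeros(I); Y[1] = 1.0  # one-hot label

# Closed-form gradient from the derivation: G = x^T (p(x) - Y),
# realised here as an outer product of shape (d, I).
p = softmax(x @ W)
G = np.outer(x, p - Y)

# Finite-difference check of every entry of dL/dW.
eps = 1e-6
G_num = np.zeros_like(W)
for a in range(d):
    for b in range(I):
        Wp = W.copy(); Wp[a, b] += eps
        Wm = W.copy(); Wm[a, b] -= eps
        G_num[a, b] = (ce(Wp, x, Y) - ce(Wm, x, Y)) / (2 * eps)

print(np.allclose(G, G_num, atol=1e-5))  # True
```

The same closed form, applied to the stamped input x + δ + ϵ, gives the augmented-data gradient G' used in the aggregation equations above.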

