FLGAME: A GAME-THEORETIC DEFENSE AGAINST BACKDOOR ATTACKS IN FEDERATED LEARNING

Abstract

Federated learning enables a distributed training paradigm in which multiple local clients jointly train a global model without sharing their local training data. However, recent studies have shown that federated learning provides an additional surface for backdoor attacks. For instance, an attacker can compromise a subset of clients and thus corrupt the global model so that it predicts an attacker-chosen target class for any input embedded with the backdoor trigger. Existing defenses for federated learning against backdoor attacks usually detect and exclude the corrupted information from the compromised clients based on a static attacker model. Such defenses, however, are less effective when faced with dynamic attackers who can strategically adapt their attack strategies. In this work, we model the strategic interaction between the (global) defender and attacker as a minimax game. Based on the analysis of our model, we design an interactive defense mechanism that we call FLGAME. Theoretically, we prove that under mild assumptions, the global model trained with FLGAME under backdoor attacks is close to that trained without attacks. Empirically, we perform extensive evaluations on benchmark datasets and compare FLGAME with multiple state-of-the-art baselines. Our experimental results show that FLGAME can effectively defend against strategic attackers and achieves significantly higher robustness than baselines.

1. INTRODUCTION

Federated learning (FL) (McMahan et al., 2017a) aims to train machine learning models (called global models) over training data that is distributed across multiple clients (e.g., mobile phones, IoT devices). FL has been widely used in many real-world applications such as finance (Long et al., 2020) and healthcare (Long et al., 2022). FL trains a global model iteratively. In each communication round, a cloud server shares its global model with selected clients; each selected client uses the global model to initialize its local model, trains the local model on its local training dataset, and sends the local model update to the server; the server then uses an aggregation rule to aggregate the local model updates from clients and update its global model.

Due to the distributed nature of FL, many recent studies (Bhagoji et al., 2019; Bagdasaryan et al., 2020; Baruch et al., 2019; Wang et al., 2020; Kairouz et al., 2021) have shown that it is vulnerable to backdoor attacks. For instance, an attacker can compromise a subset of clients and manipulate their local training datasets to corrupt the global model such that it predicts an attacker-chosen target class for any input embedded with a backdoor trigger (Bagdasaryan et al., 2020). To defend against backdoor attacks, many defenses (Sun et al., 2019; Cao et al., 2021a) have been proposed. For example, Sun et al. (2019) proposed clipping the local model update from each client so that its $L_2$-norm is no larger than a defender-chosen threshold. Cao et al. (2021a) proposed FLTrust, in which the server computes a model update itself and uses its similarity with the local model update of a client as a trust score, which is leveraged when updating the global model. However, all of those defenses consider a static attack model in which the attacker does not adapt its attack strategies. As a result, they are less effective under adaptive attacks; e.g., Wang et al. (2020) showed that the defenses proposed by Sun et al. (2019) and Blanchard et al. (2017) can be bypassed by appropriately designed attacks.

Our contribution: In this work, we propose FLGAME, a game-theoretic defense against backdoor attacks to FL. Specifically, we formulate FLGAME as a minimax game between the server (defender) and attacker, which enables them to strategically adapt their defense and attack strategies. In the rest of the paper, we use the term benign client to denote a valid (uncompromised) client and the term genuine score to quantify the extent to which a client is benign. Our key idea is that, in each communication round, the server can compute a genuine score for each client whose value is large (or small) if the client is benign (or compromised). The genuine score serves as a weight for the local model update of the client when it is used to update the global model. The goal of the defender is to minimize the genuine scores for compromised clients and maximize them for benign ones. To solve the resulting minimax game for the defender, we follow a three-step process consisting of 1) building an auxiliary global model, 2) exploiting it to reverse engineer a backdoor trigger and target class, and 3) inspecting whether the local model of a client predicts an input embedded with the reverse engineered backdoor trigger as the target class, in order to compute a genuine score for the client. Based on the deployed defense, the goal of the attacker is to optimize its attack strategy by maximizing the effectiveness of the backdoor attack. Our key observation is that the attack effectiveness is determined by two factors: the genuine score and the local model of the client. We optimize the attack strategy with respect to those two factors to maximize the effectiveness of backdoor attacks against our defense. We perform both theoretical analysis and empirical evaluations for FLGAME.
Theoretically, we prove that the global model trained with our defense under backdoor attacks is close to that trained without attacks (measured by the $L_2$-norm of the difference between the global model parameters). Empirically, we evaluate FLGAME on benchmark datasets to demonstrate its effectiveness under state-of-the-art backdoor attacks. Moreover, we compare it with state-of-the-art baselines. Our results indicate that FLGAME outperforms them by a significant margin. Our key contributions can be summarized as follows:

• We propose a game-theoretic defense FLGAME. We formulate FLGAME as a minimax game between the defender and attacker, which enables them to strategically optimize their defense and attack strategies.
• We theoretically analyze the robustness of FLGAME. In particular, we show that the global model trained with FLGAME under backdoor attacks is close to that without attacks.
• We perform a systematic evaluation of FLGAME on benchmark datasets and demonstrate that FLGAME significantly outperforms state-of-the-art baselines.

2. RELATED WORK

Backdoor attacks on federated learning: In backdoor attacks to FL (Bhagoji et al., 2019; Bagdasaryan et al., 2020; Baruch et al., 2019; Wang et al., 2020; Zhang et al., 2022b), an attacker aims to make a global model predict a target class for any input embedded with a backdoor trigger via compromised clients. For instance, Bagdasaryan et al. (2020) proposed the scaling attack, in which an attacker uses a mix of backdoored and clean training examples to train its local model and then scales the local model update by a factor before sending it to the server. Xie et al. (2019) proposed the distributed backdoor attack (DBA) to FL. Roughly speaking, the idea is to decompose a backdoor trigger into different sub-triggers and embed each of them into the local training dataset of a different compromised client. In our work, we leverage those attacks to perform strategic backdoor attacks against our defense.

Defenses for federated learning against backdoor attacks: Many defenses (Sun et al., 2019; Cao et al., 2021a; Ozdayi et al., 2021; Wu et al., 2020; Rieger et al., 2022; Nguyen et al., 2022) were proposed to mitigate backdoor attacks to FL. For instance, Sun et al. (2019) proposed norm-clipping, which clips the local model update of a client such that its norm is no larger than a threshold. They also extended differential privacy (Dwork et al., 2006; Abadi et al., 2016; McMahan et al., 2017b) to mitigate backdoor attacks to federated learning. The idea is to clip the local model update and add Gaussian noise to it. Cao et al. (2021a) proposed FLTrust, which leverages the similarity of the local model update of a client with that computed by the server itself on its clean dataset. Other defenses include Byzantine-robust FL methods such as Krum (Blanchard et al., 2017), Trimmed Mean (Yin et al., 2018), and Median (Yin et al., 2018). However, all of those defenses consider a static attacker model.
As a result, they become less effective against dynamic attackers who strategically adapt their attack strategies. Another line of research focuses on detecting malicious clients (Li et al., 2020a; Zhang et al., 2022a). For instance, Li et al. (2020a) proposed to train a variational autoencoder (VAE) and use its reconstruction loss on the local model update of a client to detect malicious clients. However, those defenses need to collect many local model updates from a client to make a confident detection. As a result, the global model may already be backdoored before those clients are detected. Two recent studies (Cao et al., 2021b; Xie et al., 2021) proposed certified defenses against compromised clients. However, they can only tolerate a moderate fraction of malicious clients (e.g., less than 10%), as shown in their experimental results.

3. BACKGROUND ON FEDERATED LEARNING AND THREAT MODEL

Let $S$ denote the set of clients and $\Theta^t$ the global model in the $t$th communication round. Each client $i \in S$ sends its local model update $g_i^t$ to the server, so that its local model is $\Theta_i^t = \Theta^t + g_i^t$. After receiving the local model updates from all clients, the server aggregates them with an aggregation rule $R$ (e.g., FedAvg) to update its global model, i.e., we have:

$$\Theta^{t+1} = \Theta^t + \eta R(g_1^t, g_2^t, \cdots, g_{|S|}^t), \tag{1}$$

where $|S|$ represents the number of clients and $\eta$ is the learning rate of the global model.
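As a minimal sketch of the global update above, the following pure-Python snippet implements one communication round with the aggregation rule $R$ taken to be a plain coordinate-wise mean. Representing models and updates as flat lists of floats, and the names `fedavg` and `server_step`, are illustrative assumptions rather than part of our implementation.

```python
from typing import List

Vector = List[float]  # assumption: a model or update flattened into a list


def fedavg(updates: List[Vector]) -> Vector:
    """Aggregation rule R: coordinate-wise mean of the local model updates."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


def server_step(theta: Vector, updates: List[Vector], eta: float) -> Vector:
    """One global update: theta_{t+1} = theta_t + eta * R(g_1, ..., g_|S|)."""
    aggregated = fedavg(updates)
    return [w + eta * a for w, a in zip(theta, aggregated)]


theta = [0.0, 1.0]
updates = [[1.0, 0.0], [3.0, -2.0]]
print(server_step(theta, updates, eta=0.5))  # [1.0, 0.5]
```

Any other aggregation rule (e.g., a coordinate-wise median) could be substituted for `fedavg` without changing `server_step`.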

3.2. THREAT MODEL

We consider the backdoor attack proposed in previous work (Bagdasaryan et al., 2020; Xie et al., 2019). In particular, we assume an attacker can compromise a set of clients (denoted as $S_a$). To perform the backdoor attack, the attacker first selects a backdoor trigger $\delta$ and a target class $y^{tc}$. For each client $i \in S_a$ in the $t$th ($t = 1, 2, \cdots$) communication round, the attacker can choose an arbitrary fraction (denoted as $r_i^t$) of training examples in the local training dataset of the client, embed the backdoor into them, and use them to augment the local training dataset. In our game-theoretic framework, we will optimize $r_i^t$ for the compromised client $i$ in each communication round to make the backdoor attack more effective under our defense. We consider that the server itself has a small clean training dataset (denoted as $D_s$), which could be collected from the same domain as the local training datasets of clients or from a different one. Moreover, we consider the case in which the server has no information about each client except its local model update in each communication round.
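The data manipulation in this threat model can be sketched as follows. The snippet is only an illustration under stated assumptions: inputs are flat lists of floats, the trigger is a hypothetical mapping from input positions to values (a patch-style trigger), and the function names are ours. It augments a clean local dataset with an $r$ fraction of backdoored copies labeled with the target class.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[List[float], int]  # (flattened input, label)


def embed_trigger(x: List[float], trigger: Dict[int, float]) -> List[float]:
    """x (+) delta: overwrite the positions named in the trigger."""
    out = list(x)
    for position, value in trigger.items():
        out[position] = value
    return out


def augment_with_backdoor(data: List[Example], r: float,
                          trigger: Dict[int, float], target_class: int,
                          seed: int = 0) -> List[Example]:
    """Return the clean data plus round(r * |D|) backdoored examples."""
    rng = random.Random(seed)
    k = int(round(r * len(data)))
    chosen = rng.sample(data, k)
    poisoned = [(embed_trigger(x, trigger), target_class) for x, _ in chosen]
    return data + poisoned


clean = [([0.0, float(i)], i % 2) for i in range(10)]
augmented = augment_with_backdoor(clean, r=0.3, trigger={0: 9.0}, target_class=7)
print(len(clean), len(augmented))  # 10 13
```

With this convention the number of backdoored examples is $r \cdot |D_i|$, so $r = 0$ leaves the local dataset clean.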

4. FLGAME: A GAME-THEORETIC DEFENSE AGAINST BACKDOORS

Overview: Our idea is to formulate FLGAME as a minimax game between the defender and attacker, solving which enables them to respectively optimize their strategies. In particular, the defender computes a genuine score for each client in each communication round. The goal of the defender is to maximize the genuine score for a benign client and minimize it for a compromised one. Given the genuine score for each client, we use a weighted average over all the local model updates to update the global model, i.e., we have

$$\Theta^{t+1} = \Theta^t + \eta \frac{1}{\sum_{i \in S} p_i^t} \sum_{i \in S} p_i^t g_i^t, \tag{2}$$

where $p_i^t$ is the genuine score for client $i$ in the $t$th communication round and $\eta$ is the learning rate of the global model. The goal of the attacker is to maximize its attack effectiveness, which is determined by two components based on Equation 2: the genuine scores and the local models of compromised clients. In our framework, the attacker optimizes the tradeoff between those two components to maximize the effectiveness of its backdoor attacks against our defense.
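The weighted update in Equation 2 can be illustrated with a short pure-Python sketch. The list-based model representation and the choice to leave the model unchanged when all genuine scores are zero are assumptions made for the example only.

```python
from typing import List

Vector = List[float]


def weighted_server_step(theta: Vector, updates: List[Vector],
                         scores: List[float], eta: float) -> Vector:
    """theta_{t+1} = theta_t + eta * sum_i p_i g_i / sum_i p_i."""
    total = sum(scores)
    if total == 0:
        # Assumption: if every client has a zero genuine score, skip the update.
        return list(theta)
    aggregated = [
        sum(p * g[k] for p, g in zip(scores, updates)) / total
        for k in range(len(theta))
    ]
    return [w + eta * a for w, a in zip(theta, aggregated)]


theta = [0.0, 0.0]
updates = [[1.0, 1.0], [10.0, -10.0]]   # second update plays the outlier
scores = [1.0, 0.0]                     # genuine scores p_i
print(weighted_server_step(theta, updates, scores, eta=1.0))  # [1.0, 1.0]
```

A client with genuine score 0 has no influence on the new global model, which is the effect the defender aims for on compromised clients.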

4.1. FORMULATING FLGAME AS A MINIMAX GAME

Computing the genuine score for client $i$: To compute $p_i^t$, our key observation is that the local model of a compromised client is more likely to predict the target class for a trigger-embedded input than that of a benign client. However, the key challenge is that the server does not know the backdoor trigger and target class adopted by the attacker. To overcome this challenge, the server can reverse engineer a backdoor trigger $\delta_{re}$ and target class $y^{tc}_{re}$ (we discuss the details in the next subsection). Since client $i$ sends its local model update $g_i^t$ to the server, the local model of client $i$ can be computed as $\Theta_i^t = \Theta^t + g_i^t$. Then, we can compute $p_i^t$ as follows:

$$p_i^t = 1 - \frac{1}{|D_s|} \sum_{x \in D_s} I\big(G(x \oplus \delta_{re}; \Theta_i^t) = y^{tc}_{re}\big), \tag{3}$$

where $I$ is an indicator function, $D_s$ is the clean training dataset of the server, $x \oplus \delta_{re}$ is a trigger-embedded input, and $G(x \oplus \delta_{re}; \Theta_i^t)$ represents the label predicted by the local model $\Theta_i^t$ for $x \oplus \delta_{re}$. Roughly speaking, the genuine score for a client is small if its local model predicts a large fraction of inputs embedded with the reverse engineered backdoor trigger as the target class.

The optimization problem for the defender: The server aims to reverse engineer the backdoor trigger $\delta_{re}$ and target class $y^{tc}_{re}$ such that the genuine scores for compromised clients are minimized while those for benign clients are maximized. Formally, we have the following optimization problem:

$$\min_{\delta_{re},\, y^{tc}_{re}} \Big( \sum_{i \in S_a} p_i^t - \sum_{j \in S \setminus S_a} p_j^t \Big). \tag{4}$$

The optimization problem for the attacker: The goal of an attacker is to maximize its attack effectiveness. Based on Equation 2, the attacker needs to: 1) maximize the genuine scores for compromised clients while minimizing them for benign ones, i.e., $\max \big( \sum_{i \in S_a} p_i^t - \sum_{j \in S \setminus S_a} p_j^t \big)$, and 2) make the local models of compromised clients predict an input embedded with the attacker-chosen backdoor trigger $\delta$ as the target class $y^{tc}$.
To perform the backdoor attack in the $t$th communication round, the attacker embeds the backdoor into a certain fraction (denoted as $r_i^t$) of training examples in the local training dataset of the client and uses them to augment it. A larger $r_i^t$ makes the local model of client $i$ more likely to predict a trigger-embedded input as the target class, but also makes its genuine score smaller. Therefore, $r_i^t$ controls a tradeoff between the two. Formally, the attacker can find the desired tradeoff by solving the following optimization problem:

$$\max_{R^t} \Big( \sum_{i \in S_a} p_i^t - \sum_{j \in S \setminus S_a} p_j^t + \lambda \sum_{i \in S_a} r_i^t \Big), \tag{5}$$

where $R^t = \{ r_i^t \mid i \in S_a \}$ and $\lambda$ is a hyperparameter that balances the two terms.

Minimax game between the defender and the attacker: Given the optimization problems solved by the defender and attacker, we have the following minimax game:

$$\min_{\delta_{re},\, y^{tc}_{re}} \max_{R^t} \Big( \sum_{i \in S_a} p_i^t - \sum_{j \in S \setminus S_a} p_j^t + \lambda \sum_{i \in S_a} r_i^t \Big). \tag{6}$$

Note that $r_i^t$ ($i \in S_a$) is chosen by the attacker, and thus we can add $r_i^t$ to the objective function in Equation 4 without influencing its solution given the local model updates of clients.
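To make the genuine score in Equation 3 concrete, here is a small self-contained sketch. The models are stand-ins (plain Python callables that return a label), and the toy trigger, data, and function names are hypothetical; only the formula itself comes from the text above.

```python
from typing import Callable, List, Sequence

Input = List[float]


def genuine_score(predict: Callable[[Input], int],
                  server_data: Sequence[Input],
                  embed: Callable[[Input], Input],
                  target_class: int) -> float:
    """p = 1 - (fraction of trigger-embedded server inputs predicted as target)."""
    hits = sum(1 for x in server_data if predict(embed(x)) == target_class)
    return 1.0 - hits / len(server_data)


# Toy setup (assumed for illustration): the reverse engineered trigger sets
# the first feature to 9.0 and the reverse engineered target class is 7.
def embed(x: Input) -> Input:
    return [9.0] + list(x[1:])


def benign_model(x: Input) -> int:
    return 1 if x[1] > 0 else 0          # ignores the trigger position


def backdoored_model(x: Input) -> int:
    return 7 if x[0] == 9.0 else benign_model(x)


server_data = [[0.0, 1.0], [0.0, -1.0], [0.0, 2.0], [0.0, -3.0]]
print(genuine_score(benign_model, server_data, embed, 7))      # 1.0
print(genuine_score(backdoored_model, server_data, embed, 7))  # 0.0
```

In this toy case the benign client keeps full weight in Equation 2, while the backdoored client is weighted by zero.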

4.2. SOLVING THE MINIMAX GAME BY THE DEFENDER

To solve the minimax game in Equation 6 for the defender, our idea is to construct an auxiliary global model and then reverse engineer the backdoor trigger and target class based on it.

Constructing an auxiliary global model: Suppose $g_i^t$ is the local model update from each client $i \in S$. Our auxiliary global model is constructed as follows:

$$\Theta_a^t = \Theta^t + \frac{1}{|S|} \sum_{i \in S} g_i^t.$$

Our intuition is that, under backdoor attacks, such an aggregated global model is very likely to predict an input embedded with the backdoor trigger $\delta$ as the target class $y^{tc}$.

Reverse engineering the backdoor trigger and target class: Given the auxiliary global model, we can use arbitrary methods to reverse engineer the backdoor trigger and target class. Roughly speaking, the goal is to find a backdoor trigger and target class such that the genuine scores are large for benign clients but small for compromised clients. For instance, we can leverage Neural Cleanse (Wang et al., 2019), a state-of-the-art method for reverse engineering a backdoor trigger and target class. Neural Cleanse views each class $c$ ($c = 1, 2, \cdots, C$, where $C$ is the total number of classes in the classification task) as a potential target class and finds a perturbation $\delta_c$ with a small $L_1$-norm such that any input embedded with it is classified as class $c$. We view the trigger with the smallest $L_1$-norm as the backdoor trigger and the corresponding class as the target class. Formally, we have $y^{tc}_{re} = \arg\min_c \|\delta_c\|_1$ and $\delta_{re} = \delta_{y^{tc}_{re}}$. The complete algorithm of FLGAME is shown in Algorithm 1 in Appendix.
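The two defender steps above can be sketched as follows. The snippet assumes that the per-class candidate perturbations $\delta_c$ have already been produced by some reverse-engineering routine such as Neural Cleanse (that optimization is not shown); the flat-list representation and the function names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def auxiliary_model(theta: Vector, updates: List[Vector]) -> Vector:
    """Theta_a = Theta + (1/|S|) * sum_i g_i (unweighted mean of all updates)."""
    n = len(updates)
    return [w + sum(g[k] for g in updates) / n for k, w in enumerate(theta)]


def l1_norm(delta: Vector) -> float:
    return sum(abs(v) for v in delta)


def select_trigger(candidates: Dict[int, Vector]) -> Tuple[Vector, int]:
    """Pick y_re = argmin_c ||delta_c||_1 and delta_re = delta_{y_re}."""
    target = min(candidates, key=lambda c: l1_norm(candidates[c]))
    return candidates[target], target


theta_a = auxiliary_model([0.0, 1.0], [[2.0, 0.0], [0.0, -2.0]])
print(theta_a)  # [1.0, 0.0]

# Hypothetical per-class perturbations found on the auxiliary model.
candidates = {0: [1.0, 1.0], 1: [0.1, 0.0], 2: [0.5, -0.5]}
delta_re, y_re = select_trigger(candidates)
print(y_re, delta_re)  # 1 [0.1, 0.0]
```

The selected pair would then be plugged into Equation 3 to score every client.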

4.3. SOLVING THE MINIMAX GAME BY THE ATTACKER

The goal of the attacker is to find $r_i^t$ for each client $i \in S_a$ such that the objective function in Equation 6 is maximized. As the attacker does not know the genuine scores of benign clients, the attacker can find $r_i^t$ that maximizes $p_i^t + \lambda r_i^t$ for client $i \in S_a$ to approximately solve the optimization problem in Equation 6. However, the key challenge is that the attacker does not know the reverse engineered backdoor trigger $\delta_{re}$ and target class $y^{tc}_{re}$ that the defender uses to compute the genuine score for client $i$. In response, the attacker can use the backdoor trigger $\delta$ and target class $y^{tc}$ chosen by itself. Moreover, the attacker reserves a certain fraction (e.g., 10%) of training data from its local training dataset $D_i$ as the validation dataset (denoted as $D_i^{rev}$) to find the best $r_i^t$.

Estimating a genuine score for a given $r_i^t$: For a given $r_i^t$, client $i$ can embed the backdoor into an $r_i^t$ fraction of its training examples and train a local model $\hat{\Theta}_i^t$ on the augmented dataset. Then, the genuine score can be estimated as

$$\hat{p}_i^t = 1 - \frac{1}{|D_i^{rev}|} \sum_{x \in D_i^{rev}} I\big(G(x \oplus \delta; \hat{\Theta}_i^t) = y^{tc}\big).$$

The attacker then searches over candidate values of $r_i^t$ (by grid search in our experiments, see Section 6.1) for the one that maximizes $\hat{p}_i^t + \lambda r_i^t$.
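The attacker's search can be sketched as a one-dimensional grid search over $r_i^t \in [0, 1]$. In the snippet below, `estimate_score` is a hypothetical callable standing in for "train a local model with poisoning fraction $r$ and evaluate the estimated genuine score on the validation data"; the toy score function is made up purely to exercise the search.

```python
from typing import Callable


def best_poison_fraction(estimate_score: Callable[[float], float],
                         lam: float = 1.0, step: float = 0.1) -> float:
    """Grid search for the r in [0, 1] that maximizes p_hat(r) + lam * r."""
    n = int(round(1.0 / step))
    grid = [k / n for k in range(n + 1)]
    return max(grid, key=lambda r: estimate_score(r) + lam * r)


# Toy stand-in (assumption): the estimated genuine score collapses once
# more than 35% of the local data is backdoored.
def toy_score(r: float) -> float:
    return 1.0 if r <= 0.35 else 0.0


print(best_poison_fraction(toy_score, lam=1.0, step=0.1))  # 0.3
```

In this toy case the attacker poisons as much as it can without losing its genuine score, which is the tradeoff that $\lambda$ controls.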

5. THEORETICAL ANALYSIS OF FLGAME

This section provides a theoretical analysis of FLGAME under backdoor attacks. In particular, we derive an upper bound for the $L_2$-norm of the difference between the parameters of the global models with and without attacks. To analyze the robustness of FLGAME, we make the following assumptions on the loss function used by the clients, which are commonly used in previous analyses of federated learning (Li et al., 2020b; Wang & Joshi, 2021; Fallah et al., 2020; Reisizadeh et al., 2020).

Assumption 1. The loss function is $\mu$-strongly convex with $L$-Lipschitz continuous gradient. Formally, we have the following for arbitrary $\Theta$ and $\Theta'$:

$$(\nabla_{\Theta} \ell(z; \Theta) - \nabla_{\Theta'} \ell(z; \Theta'))^T (\Theta - \Theta') \geq \mu \|\Theta - \Theta'\|_2^2, \tag{7}$$

$$\|\nabla_{\Theta} \ell(z; \Theta) - \nabla_{\Theta'} \ell(z; \Theta')\|_2 \leq L \|\Theta - \Theta'\|_2, \tag{8}$$

where $z$ is an arbitrary training example.

Assumption 2. We assume the gradient $\nabla_{\Theta} \ell(z; \Theta)$ is bounded with respect to the $L_2$-norm for arbitrary $\Theta$ and $z$, i.e., there exists some $M \geq 0$ such that

$$\|\nabla_{\Theta} \ell(z; \Theta)\|_2 \leq M. \tag{9}$$

Suppose $\Theta_c^t$ is the global model trained by FLGAME without any attacks in the $t$th communication round, i.e., each client $i \in S$ uses its clean local training dataset $D_i$ to train a local model. Moreover, we assume gradient descent with a local model learning rate of 1 is used by each client to train its local model. Suppose $q_i^t$ is the genuine score for client $i$ without attacks. Moreover, we denote $\alpha_i^t = \frac{p_i^t}{\sum_{j \in S} p_j^t}$ as the normalized genuine score for client $i$ with attacks in the $t$th communication round. We prove the following robustness guarantee for FLGAME:

Lemma 1 (Robustness Guarantee for One Communication Round). Suppose Assumptions 1 and 2 hold. Moreover, with $\beta_i^t = \frac{q_i^t}{\sum_{j \in S} q_j^t}$, we assume $(1 - r^t)\beta_i^t \leq \alpha_i^t \leq (1 + r^t)\beta_i^t$, where $i \in S$ and $r^t = \sum_{j \in S_a} r_j^t$. Then, we have:

$$\|\Theta^{t+1} - \Theta_c^{t+1}\|_2 \leq \sqrt{1 - \eta\mu + 2\eta\gamma^t + \eta^2 L^2 + 2\eta^2 L \gamma^t}\, \|\Theta^t - \Theta_c^t\|_2 + \sqrt{2\eta\gamma^t (1 + \eta L + 2\eta\gamma^t)} + 2\eta r^t M, \tag{10}$$

where $\eta$ is the learning rate of the global model, $L$ and $\mu$ are defined in Assumption 1, $\gamma^t = \sum_{i \in S_a} \alpha_i^t r_i^t M$, and $M$ is defined in Assumption 2.

Proof sketch. Our idea is to decompose $\|\Theta^{t+1} - \Theta_c^{t+1}\|_2$ into two terms. Then, we derive an upper bound for each term based on the change of the local model updates of clients under backdoor attacks and the properties of the loss function. As a result, our derived upper bound depends on $r_i^t$ for each client $i \in S_a$, the parameters $\mu$, $L$, and $M$ in our assumptions, and the parameter difference of the global models in the previous round, i.e., $\|\Theta^t - \Theta_c^t\|_2$. Our complete proof can be found in Appendix A.1.

In the above lemma, we derive an upper bound of $\|\Theta^{t+1} - \Theta_c^{t+1}\|_2$ with respect to $\|\Theta^t - \Theta_c^t\|_2$ for one communication round. In the next theorem, we derive an upper bound of $\|\Theta^t - \Theta_c^t\|_2$ as $t \to \infty$. We iteratively apply Lemma 1 for successive values of $t$ and obtain the following theorem:

Theorem 1 (Robustness Guarantee). Suppose Assumptions 1 and 2 hold. Moreover, we assume that $(1 - r^t)\beta_i^t \leq \alpha_i^t \leq (1 + r^t)\beta_i^t$ for $i \in S$, that $\gamma^t \leq \gamma$ and $r^t \leq r$ hold for all communication rounds $t$, and that $\mu > 2\gamma$, where $r^t = \sum_{j \in S_a} r_j^t$ and $\gamma^t = \sum_{i \in S_a} \alpha_i^t r_i^t M$. Let the global model learning rate be chosen as $0 < \eta < \frac{\mu - 2\gamma}{L^2 + 2L\gamma}$. Then,

$$\|\Theta^t - \Theta_c^t\|_2 \leq \frac{\sqrt{2\eta\gamma (1 + \eta L + 2\eta\gamma)} + 2\eta r M}{1 - \sqrt{1 - \eta\mu + 2\eta\gamma + \eta^2 L^2 + 2\eta^2 L \gamma}} \tag{11}$$

holds as $t \to \infty$.

Proof sketch. Given the conditions that $\gamma^t \leq \gamma$ and $r^t \leq r$, as well as the fact that the right-hand side of Equation 10 is monotonic with respect to $\gamma^t$ and $r^t$, we can replace $\gamma^t$ and $r^t$ in Equation 10 with $\gamma$ and $r$. Then, we iteratively apply the equation for successive values of $t$. When $0 < \eta < \frac{\mu - 2\gamma}{L^2 + 2L\gamma}$, we have $0 < 1 - \eta\mu + 2\eta\gamma + \eta^2 L^2 + 2\eta^2 L \gamma < 1$, so the recursion is a contraction. By letting $t \to \infty$, we reach the conclusion. The complete proof can be found in Appendix A.2.

When $r_i^t = 0$ for $\forall i, \forall t$, we have $\gamma = 0$ and $r = 0$. Thus, the upper bound in Equation 11 becomes 0.
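As a numerical sanity check of how Lemma 1 leads to Theorem 1, the following sketch iterates the one-round recursion of Equation 10 with constant $\gamma$ and $r$ and compares the result with the limit in Equation 11. The parameter values are arbitrary illustrations, not values from our experiments, and the function names are ours.

```python
import math


def contraction(eta: float, mu: float, L: float, gamma: float) -> float:
    """The quantity under the square root multiplying the previous distance."""
    return 1 - eta * mu + 2 * eta * gamma + eta**2 * L**2 + 2 * eta**2 * L * gamma


def offset(eta: float, L: float, gamma: float, r: float, M: float) -> float:
    """Additive term of Equation 10 with gamma^t = gamma and r^t = r."""
    return math.sqrt(2 * eta * gamma * (1 + eta * L + 2 * eta * gamma)) + 2 * eta * r * M


def asymptotic_bound(eta, mu, L, gamma, r, M) -> float:
    """Right-hand side of Equation 11; checks the theorem's conditions first."""
    if not mu > 2 * gamma:
        raise ValueError("requires mu > 2 * gamma")
    if not 0 < eta < (mu - 2 * gamma) / (L**2 + 2 * L * gamma):
        raise ValueError("learning rate outside the admissible range")
    a = contraction(eta, mu, L, gamma)
    if a <= 0:
        raise ValueError("contraction factor must be positive")
    return offset(eta, L, gamma, r, M) / (1 - math.sqrt(a))


def iterate_lemma1(d0, rounds, eta, mu, L, gamma, r, M) -> float:
    """Apply d <- sqrt(a) * d + b repeatedly, starting from distance d0."""
    a, b = contraction(eta, mu, L, gamma), offset(eta, L, gamma, r, M)
    d = d0
    for _ in range(rounds):
        d = math.sqrt(a) * d + b
    return d


params = dict(eta=0.1, mu=1.0, L=2.0, gamma=0.1, r=0.05, M=1.0)
print(asymptotic_bound(**params))
print(iterate_lemma1(10.0, 5000, **params))  # approaches the value above
print(asymptotic_bound(eta=0.1, mu=1.0, L=2.0, gamma=0.0, r=0.0, M=1.0))  # 0.0
```

Because the recursion is affine with factor $\sqrt{a} < 1$, its fixed point is exactly $b / (1 - \sqrt{a})$, which is the geometric-series limit used in the proof sketch.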

6.1. EXPERIMENTAL SETUP

Datasets and global models: We use two datasets, MNIST (LeCun et al., 2010) and CIFAR10 (Krizhevsky, 2009), for FL tasks. MNIST has 60,000 training and 10,000 testing images, each of which has a size of 28 × 28 and belongs to one of 10 classes. CIFAR10 consists of 50,000 training and 10,000 testing images with a size of 32 × 32. Each image is categorized into one of 10 classes. For each dataset, we randomly sample 90% of the training data for clients, and the remaining 10% is reserved to evaluate our defense when the clean training dataset of the server is from the same domain as those of clients. We use a CNN with two convolution layers (the detailed architecture can be found in Table 5 in Appendix) and a ResNet-18 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) as the global models for MNIST and CIFAR10, respectively.

FL settings: We consider two settings: the local training datasets of clients are independently and identically distributed (IID), or not (non-IID). In the IID setting, we randomly distribute the training data to each client. In the non-IID setting, we follow previous work (Fang et al., 2020) to distribute training data to clients. In particular, they use a parameter $q$ to control the degree of non-IID, which models the probability that training images from a class are distributed to a particular client (or a set of clients). We set $q = 0.5$ following Fang et al. (2020).

Backdoor attack settings: We consider the scaling attack (Bagdasaryan et al., 2020) and DBA (Xie et al., 2019). We use the same backdoor trigger and target class as used in those works. By default, we assume 60% of clients are compromised by an attacker. We set the scaling parameter to be #total clients/(η × #compromised clients) following Bagdasaryan et al. (2020). When the attacker solves the minimax game in Equation 6, we set λ = 1 by default and explore its impact in our experiments. We randomly sample 10% of the local training data of each compromised client as validation data to search for an optimal $r_i^t$.
Moreover, we set the granularity of grid search to be 0.1 when searching for $r_i^t$.

Baselines: We compare our defense with the following methods: FedAvg (McMahan et al., 2017a), Krum (Blanchard et al., 2017), Median (Yin et al., 2018), Norm-Clipping (Sun et al., 2019), Differential Privacy (DP) (Sun et al., 2019), and FLTrust (Cao et al., 2021a). FedAvg is non-robust, while Krum and Median are two Byzantine-robust baselines. Norm-Clipping clips the $L_2$-norm of local model updates to a given threshold $T_N$. We set $T_N = 0.01$ for MNIST and $T_N = 0.1$ for CIFAR10. DP first clips the $L_2$-norm of a local model update to a threshold $T_D$ and then adds Gaussian noise. We set $T_D = 0.05$ for MNIST and $T_D = 0.5$ for CIFAR10. We set the standard deviation of the noise to be 0.01 for both datasets. In FLTrust, the server uses its clean dataset to compute a server model update and assigns a trust score to each client by leveraging the similarity between the server model update and the local model update. In our comparison, FLTrust uses the same clean server training dataset as FLGAME. Note that FLTrust is not applicable when the clean training dataset of the server is from a different domain than those of clients.

Evaluation metrics: We use testing accuracy (TA) and attack success rate (ASR) as evaluation metrics. TA is the fraction of clean testing inputs that are correctly predicted. ASR is the fraction of backdoored testing inputs that are predicted as the target class.

Defense setting: We consider two settings: in-domain and out-of-domain. In the in-domain setting, the clean training dataset of the server is from the same domain as the local training datasets of clients. We use the reserved data as the clean training dataset of the server for each dataset. In the out-of-domain setting, the server has a clean training dataset from a domain different from that of the FL task. In particular, we randomly sample 6,000 images from FashionMNIST (Xiao et al., 2017) for MNIST and 5,000 images from GTSRB (Houben et al., 2013) for CIFAR10 as the clean training dataset of the server. We adopt Neural Cleanse (Wang et al., 2019) to reverse engineer the backdoor trigger and target class.

6.2. EXPERIMENTAL RESULTS

Our FLGAME consistently outperforms existing defenses: Table 1 and Table 2 show the results of FLGAME compared with existing defenses under IID and non-IID settings. We have the following observations from the experimental results. First, FLGAME outperforms all existing defenses in terms of ASR. In particular, FLGAME can reduce ASR to random guessing (i.e., the ASR of FedAvg under no attacks) in both IID and non-IID settings for clients, as well as in both in-domain and out-of-domain settings.

Impact of λ: λ is a hyperparameter used by an attacker when searching for the optimal $r_i^t$ for each compromised client $i$ in each communication round $t$. Figure 1 shows the impact of λ on the ASR of FLGAME. The results show that FLGAME is insensitive to the choice of λ. The reason is that the genuine score for a compromised client is small when λ is large, and the local model of a compromised client is less likely to predict a trigger-embedded input as the target class when λ is small. As a result, backdoor attacks with different values of λ are ineffective under FLGAME.

Impact of the fraction of compromised clients: Figure 2 shows the impact of the fraction of compromised clients on the ASR of FLGAME and FLTrust. As the results show, FLGAME is effective for different fractions of compromised clients in both in-domain and out-of-domain settings. In contrast, FLTrust is ineffective when the fraction of compromised clients is large. For instance, on MNIST, FLGAME achieves 9.84% (in-domain) and 10.12% (out-of-domain) ASR even if 80% of clients are compromised. Under the same setting, the ASR of FLTrust is 99.95%, indicating that the defense fails.

7. CONCLUSION AND FUTURE WORK

In this work, we propose FLGAME, a general game-theoretic defense against adaptive backdoor attacks to federated learning. Our formulated minimax game enables the defender and attacker to dynamically optimize their strategies. Moreover, we design solutions to the minimax game for each of them. Theoretically, we show that the parameters of the global model trained with FLGAME under backdoor attacks are close to those of the global model trained without attacks. Empirically, we perform systematic evaluations on benchmark datasets and compare FLGAME with multiple state-of-the-art baselines. Our results demonstrate the effectiveness of FLGAME under strategic backdoor attacks. Moreover, FLGAME achieves significantly higher robustness than baselines. Interesting future work includes: 1) extending FLGAME to defend against other attacks to federated learning, and 2) improving FLGAME by designing new methods to reverse engineer the backdoor trigger and target class that exploit the historical local model updates sent by each client.

ETHICS STATEMENT

We propose a game-theoretic defense against backdoor attacks to federated learning in this work. One potentially harmful effect is that an attacker may leverage our defense to enhance its attack. However, our defense already considers strategic attacks. Therefore, we do not see any explicit ethical issues with our work.

REPRODUCIBILITY STATEMENT

We discuss the reproducibility of our work from two aspects: theoretical analysis and empirical results. For the theoretical analysis, we explicitly explain the assumptions that we make in Section 5. We also include the complete proofs for our lemmas and theorems in Appendix. For the empirical results, we discuss the details of our experimental setup in Section 6.1, including datasets and global models, federated learning settings, backdoor attack settings, baselines and their parameter settings, as well as our FLGAME settings. The datasets used in this work are all publicly available. We also add the link to the publicly available code used in our experiments. We will release our code upon paper acceptance.

APPENDIX A COMPLETE PROOFS

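A.1 PROOF OF LEMMA 1

We first present a preliminary lemma that will be invoked in proving Lemma 1.

Lemma 2. Given two arbitrary $\Theta$ and $\Theta_c$, let $g_i = \frac{1}{|D_i \cup D_i'|} \nabla_{\Theta} \sum_{z \in D_i \cup D_i'} \ell(z; \Theta)$ and $h_i = \frac{1}{|D_i|} \nabla_{\Theta_c} \sum_{z \in D_i} \ell(z; \Theta_c)$. We then have that

\begin{align}
(\Theta - \Theta_c)^T (g_i - h_i) &\geq (0.5\mu - r_i^t M)\,\|\Theta - \Theta_c\|_2^2 - r_i^t M, \tag{12} \\
\|g_i - h_i\|_2 &\leq L\,\|\Theta - \Theta_c\|_2 + 2 r_i^t M. \tag{13}
\end{align}

Proof. We first prove Equation 12. We have the following relations:

\begin{align}
&(\Theta - \Theta_c)^T (g_i - h_i) \notag \\
&= (\Theta - \Theta_c)^T \Big( \frac{1}{|D_i \cup D_i'|} \sum_{z' \in D_i \cup D_i'} \nabla_{\Theta} \ell(z'; \Theta) - \frac{1}{|D_i|} \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big) && \triangleright\ \text{definition of } g_i \text{ and } h_i \tag{14} \\
&= (\Theta - \Theta_c)^T \Big( \frac{1}{(1 + r_i^t)|D_i|} \sum_{z' \in D_i \cup D_i'} \nabla_{\Theta} \ell(z'; \Theta) - \frac{1}{|D_i|} \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big) && \triangleright\ r_i^t = \tfrac{|D_i'|}{|D_i|} \tag{15} \\
&= \frac{1}{|D_i|(1 + r_i^t)} (\Theta - \Theta_c)^T \Big( \sum_{z' \in D_i \cup D_i'} \nabla_{\Theta} \ell(z'; \Theta) - (1 + r_i^t) \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big) \tag{16} \\
&= \frac{1}{|D_i|(1 + r_i^t)} (\Theta - \Theta_c)^T \Big( \sum_{z' \in D_i} \nabla_{\Theta} \ell(z'; \Theta) - \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) + \sum_{z' \in D_i'} \nabla_{\Theta} \ell(z'; \Theta) - r_i^t \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big) \tag{17} \\
&= \frac{1}{|D_i|(1 + r_i^t)} \Big( \sum_{z \in D_i} (\Theta - \Theta_c)^T \big( \nabla_{\Theta} \ell(z; \Theta) - \nabla_{\Theta_c} \ell(z; \Theta_c) \big) + (\Theta - \Theta_c)^T \Big( \sum_{z' \in D_i'} \nabla_{\Theta} \ell(z'; \Theta) - r_i^t \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big) \Big) \tag{18} \\
&\geq \frac{1}{|D_i|(1 + r_i^t)} \Big( \sum_{z \in D_i} (\Theta - \Theta_c)^T \big( \nabla_{\Theta} \ell(z; \Theta) - \nabla_{\Theta_c} \ell(z; \Theta_c) \big) - \Big\| (\Theta - \Theta_c)^T \Big( \sum_{z' \in D_i'} \nabla_{\Theta} \ell(z'; \Theta) - r_i^t \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big) \Big\|_1 \Big) && \triangleright\ \forall x,\ x \geq -\|x\|_1 \tag{19} \\
&\geq \frac{1}{|D_i|(1 + r_i^t)} \Big( \sum_{z \in D_i} (\Theta - \Theta_c)^T \big( \nabla_{\Theta} \ell(z; \Theta) - \nabla_{\Theta_c} \ell(z; \Theta_c) \big) - \|\Theta - \Theta_c\|_2 \cdot \Big\| \sum_{z' \in D_i'} \nabla_{\Theta} \ell(z'; \Theta) - r_i^t \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big\|_2 \Big) && \triangleright\ \text{Cauchy-Schwarz inequality} \notag \\
&\geq \frac{1}{|D_i|(1 + r_i^t)} \Big( \sum_{z \in D_i} (\Theta - \Theta_c)^T \big( \nabla_{\Theta} \ell(z; \Theta) - \nabla_{\Theta_c} \ell(z; \Theta_c) \big) - \|\Theta - \Theta_c\|_2 \cdot \Big( \sum_{z' \in D_i'} \|\nabla_{\Theta} \ell(z'; \Theta)\|_2 + r_i^t \sum_{z \in D_i} \|\nabla_{\Theta_c} \ell(z; \Theta_c)\|_2 \Big) \Big) && \triangleright\ \text{triangle inequality} \notag \\
&\geq \frac{1}{|D_i|(1 + r_i^t)} \Big( \mu |D_i|\, \|\Theta - \Theta_c\|_2^2 - 2 r_i^t |D_i| M\, \|\Theta - \Theta_c\|_2 \Big) && \triangleright\ \text{Assumption 1} \tag{20} \\
&= \frac{\mu}{1 + r_i^t} \|\Theta - \Theta_c\|_2^2 - \frac{1}{1 + r_i^t}\, 2 r_i^t M\, \|\Theta - \Theta_c\|_2 \tag{21} \\
&\geq 0.5\mu\, \|\Theta - \Theta_c\|_2^2 - 2 r_i^t M\, \|\Theta - \Theta_c\|_2 && \triangleright\ r_i^t \in [0, 1] \tag{22} \\
&\geq 0.5\mu\, \|\Theta - \Theta_c\|_2^2 - r_i^t M\, \|\Theta - \Theta_c\|_2^2 - r_i^t M \tag{23} \\
&= (0.5\mu - r_i^t M)\, \|\Theta - \Theta_c\|_2^2 - r_i^t M, \tag{24}
\end{align}

where Equation 23 holds based on the fact that $-2 r_i^t M \|\Theta - \Theta_c\|_2 \geq -r_i^t M \|\Theta - \Theta_c\|_2^2 - r_i^t M$ for $\forall r_i^t \geq 0$ and $\forall M \geq 0$.

In the following, we prove inequality 13. We have that

\begin{align}
\|g_i - h_i\|_2 &= \frac{1}{|D_i|(1 + r_i^t)} \Big\| \sum_{z' \in D_i \cup D_i'} \nabla_{\Theta} \ell(z'; \Theta) - (1 + r_i^t) \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big\|_2 && \triangleright\ \text{definition of } g_i \text{ and } h_i \tag{25} \\
&= \frac{1}{|D_i|(1 + r_i^t)} \Big\| \sum_{z' \in D_i'} \nabla_{\Theta} \ell(z'; \Theta) + \sum_{z' \in D_i} \nabla_{\Theta} \ell(z'; \Theta) - (1 + r_i^t) \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big\|_2 \tag{26} \\
&\leq \frac{1}{|D_i|(1 + r_i^t)} \Big\| \sum_{z' \in D_i'} \nabla_{\Theta} \ell(z'; \Theta) - r_i^t \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big\|_2 + \frac{1}{|D_i|(1 + r_i^t)} \Big\| \sum_{z' \in D_i} \nabla_{\Theta} \ell(z'; \Theta) - \sum_{z \in D_i} \nabla_{\Theta_c} \ell(z; \Theta_c) \Big\|_2 && \triangleright\ \text{triangle inequality} \tag{27} \\
&\leq \frac{1}{1 + r_i^t} \big( 2 r_i^t M + L \|\Theta - \Theta_c\|_2 \big) \tag{28} \\
&\leq 2 r_i^t M + L \|\Theta - \Theta_c\|_2, && \triangleright\ r_i^t \in [0, 1] \tag{29}
\end{align}

where Equation 28 is due to Assumptions 1 and 2.

Given Lemma 2, we prove Lemma 1 as follows. Recall that we have $\alpha_i^t = \frac{p_i^t}{\sum_{i \in S} p_i^t}$ and $\beta_i^t = \frac{q_i^t}{\sum_{i \in S} q_i^t}$.

\begin{align}
&\|\Theta^{t+1} - \Theta_c^{t+1}\|_2 \tag{30} \\
&= \Big\| \Theta^t - \eta \sum_{i \in S} \alpha_i^t g_i^t - \Big( \Theta_c^t - \eta \sum_{i \in S} \beta_i^t h_i^t \Big) \Big\|_2 && \triangleright\ \text{gradient descent for } \Theta^{t+1} \text{ and } \Theta_c^{t+1} \tag{31} \\
&= \Big\| \Theta^t - \eta \sum_{i \in S} \alpha_i^t g_i^t - \Big( \Theta_c^t - \eta \sum_{i \in S} (\alpha_i^t + \beta_i^t - \alpha_i^t) h_i^t \Big) \Big\|_2 \tag{32} \\
&= \Big\| \Theta^t - \Theta_c^t - \eta \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) + \eta \sum_{i \in S} (\beta_i^t - \alpha_i^t) h_i^t \Big\|_2 && \triangleright\ \text{rearranging Equation 32} \tag{33} \\
&\leq \Big\| \Theta^t - \Theta_c^t - \eta \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big\|_2 + \Big\| \eta \sum_{i \in S} (\beta_i^t - \alpha_i^t) h_i^t \Big\|_2. && \triangleright\ \text{triangle inequality} \tag{34}
\end{align}

Next, we derive upper bounds for the first and second terms in Equation 34, respectively. To derive the upper bound for the first term, we have that

\begin{align}
\Big\| \Theta^t - \Theta_c^t - \eta \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big\|_2^2 &= \|\Theta^t - \Theta_c^t\|_2^2 - 2\eta (\Theta^t - \Theta_c^t)^T \Big( \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big) + \eta^2 \Big\| \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big\|_2^2 \tag{35} \\
&= S_1 + S_2 + S_3, \tag{36}
\end{align}

where $S_1 = \|\Theta^t - \Theta_c^t\|_2^2$, $S_2 = -2\eta (\Theta^t - \Theta_c^t)^T \big( \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \big)$, and $S_3 = \eta^2 \big\| \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \big\|_2^2$. Next, we will bound $S_2$ and $S_3$. We denote $\gamma^t = \sum_{i \in S_a} \alpha_i^t r_i^t M$. Note that we have $\gamma^t = \sum_{i \in S} \alpha_i^t r_i^t M$ since $r_i^t = 0$ for $\forall i \in S \setminus S_a$. We bound $S_2$ as follows.

\begin{align}
S_2 &= -2\eta (\Theta^t - \Theta_c^t)^T \Big( \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big) \tag{37} \\
&= -2\eta \sum_{i \in S} \alpha_i^t (\Theta^t - \Theta_c^t)^T (g_i^t - h_i^t) \tag{38} \\
&\leq -2\eta \sum_{i \in S} \alpha_i^t \big( (0.5\mu - r_i^t M) \|\Theta^t - \Theta_c^t\|_2^2 - r_i^t M \big) \tag{39} \\
&= -2\eta \Big( \Big( 0.5\mu - \sum_{i \in S} \alpha_i^t r_i^t M \Big) \|\Theta^t - \Theta_c^t\|_2^2 - \sum_{i \in S_a} \alpha_i^t r_i^t M \Big) \tag{40} \\
&= (-\eta\mu + 2\eta\gamma^t) \|\Theta^t - \Theta_c^t\|_2^2 + 2\eta\gamma^t, && \triangleright\ \text{definition of } \gamma^t \tag{41}
\end{align}

where inequality 39 holds by Lemma 2 and the fact that $\eta, \alpha_i^t \geq 0$. We bound $S_3$ as follows.

\begin{align}
S_3 &= \eta^2 \Big\| \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big\|_2^2 \tag{42} \\
&\leq \eta^2 \Big( \sum_{i \in S} \alpha_i^t \|g_i^t - h_i^t\|_2 \Big)^2 \tag{43} \\
&\leq \eta^2 \Big( \sum_{i \in S} \alpha_i^t \big( 2 r_i^t M + L \|\Theta^t - \Theta_c^t\|_2 \big) \Big)^2 && \triangleright\ \text{Lemma 2} \tag{44} \\
&= \eta^2 \big( 2\gamma^t + L \|\Theta^t - \Theta_c^t\|_2 \big)^2 \tag{45} \\
&= \eta^2 \big( L^2 \|\Theta^t - \Theta_c^t\|_2^2 + 4\gamma^t L \|\Theta^t - \Theta_c^t\|_2 + 4[\gamma^t]^2 \big) \tag{46} \\
&\leq \eta^2 \big( L^2 \|\Theta^t - \Theta_c^t\|_2^2 + 2\gamma^t L \|\Theta^t - \Theta_c^t\|_2^2 + 2L\gamma^t + 4[\gamma^t]^2 \big) \tag{47} \\
&= \eta^2 \cdot \big( (L^2 + 2L\gamma^t) \cdot \|\Theta^t - \Theta_c^t\|_2^2 + 2L\gamma^t + 4[\gamma^t]^2 \big), \tag{48}
\end{align}

where Equation 47 is based on the fact that $4\gamma^t L \|\Theta^t - \Theta_c^t\|_2 \leq 2\gamma^t L \|\Theta^t - \Theta_c^t\|_2^2 + 2\gamma^t L$ when $\gamma^t L \geq 0$.

Given the upper bounds of $S_2$ and $S_3$, we can bound $\big\| \Theta^t - \Theta_c^t - \eta \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \big\|_2^2$ as follows.

\begin{align}
&\Big\| \Theta^t - \Theta_c^t - \eta \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big\|_2^2 \tag{49} \\
&= S_1 + S_2 + S_3 \tag{50} \\
&\leq \|\Theta^t - \Theta_c^t\|_2^2 + (-\eta\mu + 2\eta\gamma^t) \|\Theta^t - \Theta_c^t\|_2^2 + 2\eta\gamma^t + (\eta^2 L^2 + 2\eta^2 L\gamma^t) \|\Theta^t - \Theta_c^t\|_2^2 + 2\eta^2 L\gamma^t + 4\eta^2 [\gamma^t]^2 \tag{51} \\
&= (1 - \eta\mu + 2\eta\gamma^t + \eta^2 L^2 + 2\eta^2 L\gamma^t) \|\Theta^t - \Theta_c^t\|_2^2 + 2\eta\gamma^t + 2\eta^2 L\gamma^t + 4\eta^2 [\gamma^t]^2. \tag{52}
\end{align}

Next, we will derive an upper bound for $\big\| \eta \sum_{i \in S} (\beta_i^t - \alpha_i^t) h_i^t \big\|_2$. We denote $r^t = \sum_{i \in S_a} r_i^t$. Note that $r^t = \sum_{i \in S} r_i^t$ also holds since $r_i^t = 0$ for $\forall i \in S \setminus S_a$. Given the assumption that $(1 - r^t)\alpha_i^t \leq \beta_i^t \leq (1 + r^t)\alpha_i^t$, we have

\begin{align}
\Big\| \eta \sum_{i \in S} (\beta_i^t - \alpha_i^t) h_i^t \Big\|_2 \leq \eta \sum_{i \in S} |\beta_i^t - \alpha_i^t|\, \|h_i^t\|_2 \leq 2\eta r^t M, \tag{53}
\end{align}

where the first inequality is due to the triangle inequality and the second inequality is based on the assumption that $\|h_i^t\|_2 \leq M$.

Therefore, we have:

\begin{align}
&\|\Theta^{t+1} - \Theta_c^{t+1}\|_2 \notag \\
&\leq \sqrt{\Big\| \Theta^t - \Theta_c^t - \eta \sum_{i \in S} \alpha_i^t (g_i^t - h_i^t) \Big\|_2^2} + \Big\| \eta \sum_{i \in S} (\beta_i^t - \alpha_i^t) h_i^t \Big\|_2 && \triangleright\ \text{Equations 30, 34} \notag \\
&\leq \sqrt{(1 - \eta\mu + 2\eta\gamma^t + \eta^2 L^2 + 2\eta^2 L\gamma^t) \|\Theta^t - \Theta_c^t\|_2^2 + 2\eta\gamma^t (1 + \eta L + 2\eta\gamma^t)} + 2\eta r^t M && \triangleright\ \text{Equations 49, 52, 53} \tag{55} \\
&\leq \sqrt{1 - \eta\mu + 2\eta\gamma^t + \eta^2 L^2 + 2\eta^2 L\gamma^t}\; \|\Theta^t - \Theta_c^t\|_2 + \sqrt{2\eta\gamma^t (1 + \eta L + 2\eta\gamma^t)} + 2\eta r^t M, \notag
\end{align}

where the last inequality holds due to the fact that $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ for $\forall a \geq 0$ and $\forall b \geq 0$, which completes our proof for Lemma 1.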
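The two elementary facts invoked in the proof (the one behind Equations 23 and 47, and the square-root inequality used in the last step) both follow from expanding a square. The short derivation below is included only for reference; the shorthand $x = \|\Theta - \Theta_c\|_2 \geq 0$ and the generic constant $c \geq 0$ are introduced here for illustration and do not appear in the original proof.

```latex
% Fact behind Equations 23 and 47: for any c >= 0 and any real x,
\begin{align*}
0 \le c\,(x - 1)^2 = c\,x^2 - 2c\,x + c
\quad\Longrightarrow\quad
2c\,x \le c\,x^2 + c .
\end{align*}
% Taking c = r_i^t M gives  -2 r_i^t M x >= -r_i^t M x^2 - r_i^t M   (Equation 23).
% Taking c = 2 \gamma^t L gives  4 \gamma^t L x <= 2 \gamma^t L x^2 + 2 \gamma^t L   (Equation 47).

% Fact used in the last step: for any a, b >= 0,
\begin{align*}
\big(\sqrt{a} + \sqrt{b}\big)^2 = a + b + 2\sqrt{ab} \ \ge\ a + b
\quad\Longrightarrow\quad
\sqrt{a + b} \le \sqrt{a} + \sqrt{b} .
\end{align*}
```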

A.2 PROOF OF THEOREM 1

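We denote $A^t = \sqrt{1 - \eta\mu + 2\eta\gamma^t + \eta^2 L^2 + 2\eta^2 L\gamma^t}$, $A = \sqrt{1 - \eta\mu + 2\eta\gamma + \eta^2 L^2 + 2\eta^2 L\gamma}$, $B^t = \sqrt{2\eta\gamma^t (1 + \eta L + 2\eta\gamma^t)} + 2\eta r^t M$, and $B = \sqrt{2\eta\gamma (1 + \eta L + 2\eta\gamma)} + 2\eta r M$. Since $\gamma^t \leq \gamma$ and $r^t \leq r$, we have $A^t \leq A$ and $B^t \leq B$. Thus, based on Lemma 1, we have:

\begin{align}
\|\Theta^t - \Theta_c^t\|_2 \leq A\, \|\Theta^{t-1} - \Theta_c^{t-1}\|_2 + B. \notag
\end{align}

Then, we can iteratively apply the above inequality to prove our theorem. In the chain below, a superscript on the constant $A$ denotes a power of $A$. In particular, we have:

\begin{align}
\|\Theta^t - \Theta_c^t\|_2 &\leq A\, \|\Theta^{t-1} - \Theta_c^{t-1}\|_2 + B \tag{59} \\
&\leq A \big( A\, \|\Theta^{t-2} - \Theta_c^{t-2}\|_2 + B \big) + B \tag{60} \\
&= A^2\, \|\Theta^{t-2} - \Theta_c^{t-2}\|_2 + (A^1 + A^0) B \tag{61} \\
&\leq A^t\, \|\Theta^0 - \Theta_c^0\|_2 + (A^{t-1} + A^{t-2} + \cdots + A^0) B \tag{62} \\
&= A^t\, \|\Theta^0 - \Theta_c^0\|_2 + \frac{1 - A^t}{1 - A} B \tag{63} \\
&= \Big( \sqrt{1 - \eta\mu + 2\eta\gamma + \eta^2 L^2 + 2\eta^2 L\gamma} \Big)^t \|\Theta^0 - \Theta_c^0\|_2 + \frac{1 - \Big( \sqrt{1 - \eta\mu + 2\eta\gamma + \eta^2 L^2 + 2\eta^2 L\gamma} \Big)^t}{1 - \sqrt{1 - \eta\mu + 2\eta\gamma + \eta^2 L^2 + 2\eta^2 L\gamma}} \Big( \sqrt{2\eta\gamma (1 + \eta L + 2\eta\gamma)} + 2\eta r M \Big). \notag
\end{align}

When the learning rate satisfies $0 < \eta < \frac{\mu - 2\gamma}{L^2 + 2L\gamma}$, we have that $0 < 1 - \eta\mu + 2\eta\gamma + \eta^2 L^2 + 2\eta^2 L\gamma < 1$. Therefore, the upper bound becomes 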
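The unrolling used in the proof of Theorem 1 can be sanity-checked numerically. The sketch below is illustrative only: the function names and the parameter values (`eta`, `mu`, `L`, `gamma`, `r`, `M`) are hypothetical choices that satisfy the stated learning-rate condition, not values from the paper. It compares iterating $d \leftarrow A d + B$ for $t$ steps against the closed form $A^t d_0 + \frac{1 - A^t}{1 - A} B$.

```python
import math


def contraction_terms(eta, mu, L, gamma, r, M):
    """Return (A, B) as defined in the proof of Theorem 1."""
    a_sq = 1 - eta * mu + 2 * eta * gamma + eta**2 * L**2 + 2 * eta**2 * L * gamma
    A = math.sqrt(a_sq)
    B = math.sqrt(2 * eta * gamma * (1 + eta * L + 2 * eta * gamma)) + 2 * eta * r * M
    return A, B


def iterated_bound(A, B, d0, t):
    """Apply d <- A * d + B exactly t times, starting from d0."""
    d = d0
    for _ in range(t):
        d = A * d + B
    return d


def unrolled_bound(A, B, d0, t):
    """Closed form A^t * d0 + (1 - A^t) / (1 - A) * B (requires A != 1)."""
    return A**t * d0 + (1 - A**t) / (1 - A) * B


# Hypothetical constants chosen so that 0 < eta < (mu - 2*gamma) / (L^2 + 2*L*gamma).
eta, mu, L, gamma, r, M = 0.1, 1.0, 1.0, 0.05, 0.1, 1.0
A, B = contraction_terms(eta, mu, L, gamma, r, M)
print(A, B, iterated_bound(A, B, 1.0, 50), unrolled_bound(A, B, 1.0, 50))
```

With these values the learning-rate condition holds, so `A` lies strictly between 0 and 1 and the two bounds agree up to floating-point error.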

B.2 COMPLETE ALGORITHM FOR A COMPROMISED CLIENT

Algorithm 2 shows the complete algorithm for a compromised client. In Line 1, we randomly subsample a $\rho_i$ fraction of the training data from $D_i$. In Line 5, the function CREATEBACKDOOREDDATA

Our FLGAME computes a genuine score for each client, which quantifies the extent to which a client is benign in each communication round. Intuitively, FLGAME is effective if the genuine score is small for a compromised client but large for a benign one. FLTrust (Cao et al., 2021a) 



t i ) of training examples from the local training dataset of the client, embed the backdoor trigger $\delta$ into those training inputs, and relabel them as the target class $y^{tc}$. Those backdoored training examples are used to augment the local training dataset of the client.

t i fraction of training examples in D i \ D rev i and then use those backdoored training examples to augment D i \ D rev i to train a local model (denoted as Θt i ).

as the normalized genuine score for client $i$. To perform the backdoor attack, we assume a compromised client $i$ can embed the backdoor trigger into an $r_i^t$ fraction of the training examples in its local training dataset and relabel them as the target class. Those backdoored training examples are used to augment the local training dataset of the client. Suppose $\Theta^t$ is the global model under the backdoor attack in the $t$th communication round with our defense. We denote α

Figure 1: Impact of λ on ASR of FLGAME under Scaling attack. FLGAME is insensitive to various choices of λ.

Suppose $D_i$ is the clean local training dataset of client $i$. An attacker can inject the backdoor trigger into an $r_i^t$ fraction of the training examples in $D_i$ and relabel them as the target class. We use $D_i'$ to denote the set of backdoored training examples, where $r_i^t = \frac{|D_i'|}{|D_i|}$. Given two arbitrary $\Theta$ and $\Theta_c$, we let

Algorithm 1 shows the complete algorithm of FLGAME. In Line 1, we construct an auxiliary global model. In Line 2, the function REVERSEENGINEER is used to reverse engineer the backdoor trigger and target class. In Line 4, we compute the local model of client $i$ based on its local model update. In Line 5, we compute a genuine score for client $i$. In Line 6, we update the global model based on the genuine scores and local model updates of the clients.
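The final step (updating the global model from genuine scores and local model updates) can be sketched as a score-weighted aggregation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `aggregate_with_genuine_scores` is hypothetical, models are flat lists of floats, each update is taken to be the difference between a client's local model and the global model (so it is added), and the weights are the normalized genuine scores $p_i^t / \sum_{i \in S} p_i^t$ used in the appendix. The reverse-engineering and scoring steps are not modeled here.

```python
from typing import List, Sequence


def aggregate_with_genuine_scores(
    global_model: Sequence[float],
    updates: Sequence[Sequence[float]],
    genuine_scores: Sequence[float],
    eta: float = 1.0,
) -> List[float]:
    """Weight each client's update by its normalized genuine score.

    Assumption: updates[i] = local_model_i - global_model, so the weighted
    sum of updates is added to the global model.
    """
    total = sum(genuine_scores)
    new_model = list(global_model)
    if total == 0:
        # No client is trusted at all; leave the global model unchanged.
        return new_model
    for score, update in zip(genuine_scores, updates):
        weight = score / total  # normalized genuine score
        for k, u_k in enumerate(update):
            new_model[k] += eta * weight * u_k
    return new_model


# A client with genuine score 0 has no influence on the new global model.
print(aggregate_with_genuine_scores([0.0, 0.0], [[1.0, 1.0], [10.0, 10.0]], [1.0, 0.0]))
```

Under this weighting, a compromised client that receives a small genuine score contributes correspondingly little to the aggregated update, which is the intuition the algorithm description relies on.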

is used to generate backdoored training examples by embedding the backdoor trigger $\delta$ into $\lfloor \min(j \cdot \zeta, 1)\, |D_i \setminus D_i^{rev}| \rfloor$ training examples in $D_i \setminus D_i^{rev}$ and relabeling them as $y^{tc}$, where $|\cdot|$ measures the number of elements in a set. In Line 6, the function TRAININGLOCALMODEL is used to train the local model on the training dataset $D_i' \cup (D_i \setminus D_i^{rev})$. In Line 7, we estimate a genuine score. In Line 11, we use the function CREATEBACKDOOREDDATA to generate backdoored training examples by embedding the backdoor trigger $\delta$ into $\lfloor \min(o \cdot \zeta, 1)\, |D_i| \rfloor$ training examples in $D_i$ and relabeling them as $y^{tc}$. In Line 12, we use the function TRAININGLOCALMODEL to train a local model on the training dataset $D_i' \cup D_i$.

Algorithm 2: ALGORITHM FOR A COMPROMISED CLIENT
Input: $\Theta^t$ (global model in the $t$th communication round), $D_i$ (local training dataset of client $i$), $\rho_i$ (fraction of reserved data to find optimal $r_i^t$), $\zeta$ (granularity of searching for $r_i^t$), $\delta$ (backdoor trigger), $y^{tc}$ (target class), and $\lambda$ (hyperparameter).
Output: $g_i^t$ (local model update)
1   $D_i^{rev}$ = RANDOMSAMPLING($D_i$, $\rho_i$)
2   count = $\lceil 1 / \zeta \rceil$
3   max_value, $o$ ← 0, 0
4   for $j$ ← 0 to count do
5       $D_i'$ = CREATEBACKDOOREDDATA($D_i \setminus D_i^{rev}$, $\delta$, $y^{tc}$, $\min(j \cdot \zeta, 1)$)
6       $\Theta_{ij}$ = TRAININGLOCALMODEL($\Theta^t$, $D_i' \cup (D_i \setminus D_i^{rev})$)
7       … $x \oplus \delta; \Theta_{ij}) = y^{tc})$
8       if $p_{ij} + \lambda \min(j \cdot \zeta, 1)$ > max_value then
9           $o = j$
10          max_value = $p_{ij} + \lambda \min(j \cdot \zeta, 1)$
11  $D_i'$ = CREATEBACKDOOREDDATA($D_i$, $\delta$, $y^{tc}$, $\min(o \cdot \zeta, 1)$)
12  $\Theta_i^t$ = TRAININGLOCALMODEL($\Theta^t$, $D_i' \cup D_i$)
13  return $\Theta_i^t - \Theta^t$

C ADDITIONAL EXPERIMENTAL SETUP AND RESULTS

C.1 ARCHITECTURE OF GLOBAL MODEL

3.1 FEDERATED LEARNING

Suppose $S$ is a set of clients. We use $D_i$ to denote the local training dataset of client $i \in S$. In the $t$th communication round, the server first sends the current global model (denoted as $\Theta^t$) to each client. Then, each client $i$ trains a local model (denoted as $\Theta_i^t$) by fine-tuning the global model $\Theta^t$ using its local training dataset $D_i$. For simplicity, we use $z = (x, y)$ to denote a training example in $D_i$, where $x$ is the training input (e.g., an image) and $y$ is its ground truth label. Given $D_i$ and the global model $\Theta^t$, we denote a loss function L(D i ; Θ t ) = 1 Θ i (called local model update) to the server. Note that it is equivalent for the client to send a local model or local model update to the server as

After estimating the optimal $r_i^t$, client $i$ can embed the backdoor trigger into an $r_i^t$ fraction of training examples to augment its local training dataset, train a local model, and send the local model update to the server. The complete algorithm for each compromised client is shown in Algorithm 2 in the Appendix.
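The search for $r_i^t$ in Algorithm 2 is a grid search over candidate fractions $\min(j \cdot \zeta, 1)$ that keeps the candidate maximizing $p_{ij} + \lambda \min(j \cdot \zeta, 1)$. The sketch below illustrates only that selection loop. The name `search_backdoor_fraction` and the callable `score_fn` are hypothetical: `score_fn(r)` stands in for "create backdoored data with fraction `r`, train a local model, and compute $p_{ij}$ on the reserved data", which is not implemented here.

```python
import math
from typing import Callable


def search_backdoor_fraction(
    score_fn: Callable[[float], float], zeta: float, lam: float
) -> float:
    """Grid search mirroring Lines 2-10 of Algorithm 2.

    score_fn(r) is a hypothetical stand-in for the quantity p_ij obtained
    after training with a backdoored fraction r of the non-reserved data.
    """
    count = math.ceil(1.0 / zeta)
    max_value, best_j = 0.0, 0
    for j in range(count + 1):  # j = 0, 1, ..., count (inclusive)
        r = min(j * zeta, 1.0)
        value = score_fn(r) + lam * r
        if value > max_value:
            best_j = j
            max_value = value
    return min(best_j * zeta, 1.0)


# Toy score that decreases as more data is backdoored (purely illustrative).
toy_score = lambda r: 1.0 - r * r
print(search_backdoor_fraction(toy_score, zeta=0.25, lam=1.0))
```

In this toy setting the objective $1 - r^2 + r$ peaks at $r = 0.5$, so the search returns an interior fraction; a larger `lam` pushes the choice toward backdooring more data, while a smaller one pushes it toward less.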

Unless otherwise mentioned, we consider the IID setting. Moreover, we train a global model based on 10 clients for 200 iterations with a global model learning rate $\eta = 1.0$. In each communication round, we use SGD to train the local model of each client for two epochs with a local model learning rate of 0.01. We also assume that all clients are selected in each communication round.

Comparison of FLGAME with existing defenses under Scaling attack. The total number of clients is 10 with 60% compromised. The best results for defense are bold.

Comparison of FLGAME with existing defenses under Scaling attack. The total number of clients is 10 with 60% compromised. The local training datasets of clients are non-IID. The best results for defense are bold.

Comparison of FLGAME with existing defenses under Scaling attack. The total number of clients is 30 with 60% compromised. The best results for defense are bold.

Comparison of FLGAME with existing defenses under DBA attack. The total number of clients is 10 with 60% compromised. The best results for defense are bold.

the server. Intrinsically, FLGAME performs better because our game-theoretic defense enables the defender to optimize its strategy against dynamic, adaptive attacks. We note that FLTrust outperforms other defenses (except FLGAME) in most cases since it exploits a clean training dataset from the same domain as the local training datasets of clients. However, FLTrust is not applicable when the server only holds an out-of-domain clean training dataset, whereas FLGAME relaxes this assumption and remains applicable. Moreover, our experimental results indicate that FLGAME achieves comparable performance even if the server holds an out-of-domain clean training dataset. In Appendix C.2, we visualize the average genuine (or trust) scores computed by FLGAME (or FLTrust) for compromised and benign clients to further explain why FLGAME outperforms FLTrust. Second, FLGAME achieves TA comparable to that of existing defenses, indicating that FLGAME preserves the utility of global models.

shows the comparison results of FLGAME with existing defenses when the total number of clients is 30. Table 4 shows the comparison results of FLGAME with existing defenses under DBA attack. Our observations are similar, which indicates that FLGAME consistently outperforms existing defenses under different numbers of clients and backdoor attacks.

Algorithm 1: FLGAME
Input: $\Theta^t$ (global model in the $t$th communication round), $g_i^t$, $i \in S$ (local model updates of clients), $D_s$ (clean training dataset of the server), $\eta$ (learning rate of the global model).
Output: $\Theta^{t+1}$ (global model for the $(t+1)$th communication round)

shows the global model architecture on the MNIST dataset.

C.2 VISUALIZATION OF GENUINE SCORE OF FLGAME AND TRUST SCORE OF FLTRUST (CAO ET AL., 2021A)

