A SIMULATION-BASED FRAMEWORK FOR ROBUST FEDERATED LEARNING AGAINST TRAINING-TIME ATTACKS

Anonymous

Abstract

Well-known robust aggregation schemes in federated learning (FL) have been shown to be vulnerable to an informed adversary who can tailor training-time attacks (Fang et al., 2020; Xie et al., 2020). We frame the robust distributed learning problem as a game between a server and an adversary that is able to optimize strong training-time attacks. We introduce RobustTailor, a simulation-based framework that prevents the adversary from being omniscient. The simulated game we propose enjoys theoretical guarantees through a regret analysis. RobustTailor improves robustness to training-time attacks significantly while preserving almost the same privacy guarantees as standard robust aggregation schemes in FL. Empirical results under challenging attacks show that RobustTailor performs close to an upper bound with perfect knowledge of honest clients. To reduce the information asymmetry between the server and the adversary, we assume every client donates a small amount of honest data to the server as the price of achieving some level of security more proactively and efficiently. Providing such a public dataset to achieve robustness is a common assumption in FL.

1. INTRODUCTION

In federated learning (FL), a global/personalized model is learnt from data distributed across multiple clients without sharing that data (McMahan et al., 2017; Kairouz et al., 2021). Clients compute their (stochastic) gradients using their own local data and send them to a central server, which aggregates them and updates the model. While FL offers improvements in terms of privacy, it creates additional challenges in terms of robustness. Stochastic gradient updates are prone to bias, which comes not only from poor sampling or data noise but also from malicious attacks by Byzantine clients, who may send arbitrary messages to the server instead of correct gradients (Guerraoui et al., 2018). Therefore, in FL, it is essential to guarantee some level of robustness to Byzantine clients that might be compromised by an adversary. Compromised clients are vulnerable to data/model poisoning and tailored attacks (Fang et al., 2020). Byzantine resilience is typically achieved by robust gradient aggregation schemes, e.g., Krum (Blanchard et al., 2017), Comed (Yin et al., 2018), and trimmed mean (Yin et al., 2018). These aggregators are resilient against attacks that are designed in advance. However, such robustness is insufficient in practice, since a powerful adversary could learn the aggregation rule and tailor its training-time attack. It has been shown that well-known Byzantine-resilient gradient aggregation schemes are susceptible to an informed adversary that can tailor its attacks (Fang et al., 2020). Specifically, Fang et al. (2020) and Xie et al. (2020) proposed efficient and nearly optimal training-time attacks that circumvent Krum, Comed, and trimmed mean. A tailored attack is designed with prior knowledge of the robust aggregation rule used by the server, such that the attacker has a provable way to corrupt the training process.
Given the information advantage of the adversary, establishing a successful defense against such tailored attacks is a significant challenge. In this paper, we formulate the robust distributed learning problem under training-time attacks as a game between a server and an adversary. To prevent the adversary from being omniscient, we propose to follow a mixed strategy over the existing robust aggregation rules. In real-world settings, both the server and the adversary have a number of aggregation rules and attack programs at their disposal; using these aggregators efficiently while guaranteeing robustness is a challenging task. We address scenarios where neither the specific attack method is known in advance by the aggregator nor the exact aggregation rule used in each iteration is known in advance by the adversary, while the adversary and the server know the set of the server's aggregation rules and the set of attack programs, respectively. We frame the robust distributed learning problem as a game and consider the bandit feedback model.

2. PROBLEM SETTING

Under a synchronous setting in FL, clients compute updates on their own local data, and the server aggregates the updates from all peers to update the model parameters. Consider a general distributed system consisting of a parameter server and n clients (Chen et al., 2017; Abadi et al., 2016; Li et al., 2014). Suppose that f Byzantine clients are controlled by an adversary and behave arbitrarily. Let x ∈ R^d denote a machine learning model; e.g., it represents all weights and biases of a neural network. We consider minimizing the overall empirical risk of multiple clients, which can be formulated as the finite-sum problem:

$$\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^{n} F_i(x) \tag{FL}$$

where F_i : R^d → R denotes the training error (empirical risk) of x on the local data of client i. At iteration t, honest clients compute and send honest stochastic gradients g_i(x_t) = ∇F_i(x_t) for i ∈ [n − f], while Byzantine clients, controlled by an informed adversary, output attacks b_j ∈ R^d for j ∈ [f]. The server receives all n updates and aggregates them following a particular robust aggregation rule, which outputs an updated model x_{t+1} ∈ R^d. Finally, the server broadcasts x_{t+1} to all clients.
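As a concrete illustration of the setup above, the following Python sketch runs one synchronous round with quadratic local risks F_i(x) = ½‖x − c_i‖²; the client centers `c_i`, the plain-mean aggregator, and all function names are illustrative choices, not part of the paper.

```python
import numpy as np

def honest_gradient(x, c_i):
    """Gradient of the illustrative local risk F_i(x) = 0.5 * ||x - c_i||^2."""
    return x - c_i

def fl_round(x, centers, byzantine_updates, aggregate, lr=0.1):
    """One synchronous round: honest clients send gradients, Byzantine
    clients send arbitrary vectors, the server aggregates and steps."""
    updates = [honest_gradient(x, c) for c in centers] + list(byzantine_updates)
    return x - lr * aggregate(updates)

# n = 5 clients, f = 1 Byzantine, plain-mean aggregation (deliberately non-robust).
rng = np.random.default_rng(0)
centers = [rng.normal(size=3) for _ in range(4)]
x_next = fl_round(np.zeros(3), centers, [np.full(3, 1e3)],
                  lambda u: np.mean(u, axis=0))
```

A single corrupted vector already dominates the plain mean and throws the model far off, which is exactly why robust aggregation rules are needed.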

2.1. GAME CONSTRUCTION

We frame this distributed learning problem under training-time attacks as a game played by the adversary and the server. The informed adversary and training-time attacks are described in Section 2.1.1. The details of the server's aggregators are provided in Section 2.1.2. Though our formulation seems natural and intuitive, to the best of our knowledge, ours is the first work that frames the robust learning problem under training-time tailored attacks as a game. The adversary aims at corrupting training, while the server aims at learning an effective model that achieves a satisfactory overall empirical risk over honest clients.

2.1.1. INFORMED ADVERSARY WITH ATTACKS

The adversary controls f out of n clients; these Byzantine clients collude, aiming to disturb the entire training process by sending training-time attacks (Biggio et al., 2012; Bhagoji et al., 2019; Sun et al., 2019; Bagdasaryan et al., 2020). We assume n ≥ 2f + 1, which is a common assumption in the literature (Guerraoui et al., 2018; Blanchard et al., 2017; Alistarh et al., 2018; Rajput et al., 2019; Karimireddy et al., 2022); otherwise the adversary would be able to provably control the optimization trajectory and set the global model arbitrarily (Lamport et al., 1982). An informed adversary controls the outputs of the compromised clients, e.g., their gradients throughout the course of training. Moreover, the informed adversary has full knowledge of the outputs of the n − f honest clients across the course of training. Having access to the gradients of honest nodes, the adversary can compute the global aggregated gradient of an omniscient aggregation rule, i.e., the empirical mean of all honest updates without an attack:

$$g^* = \frac{1}{n-f} \sum_{i=1}^{n-f} g_i. \tag{1}$$

When an adversary knows a particular aggregation rule used by the server, it is able to design tailored attacks using the n − f honest gradients (Fang et al., 2020).

Definition 1 (Attack algorithm). Let {g_1, ..., g_{n−f}} denote the set of honest updates computed by the n − f honest clients. The adversary designs f Byzantine updates using an attack algorithm AT:

$$\{b_{n-f+1}, \ldots, b_n\} := \mathrm{AT}(g_1, \ldots, g_{n-f}, \mathcal{A}), \tag{2}$$

where A denotes the set of aggregators formally defined in Section 2.1.2. It has been shown that several tailored attacks can be designed efficiently and provably fail well-known aggregation rules with a specific structure, e.g., Krum, Comed, and Krum + resampling (Fang et al., 2020; Xie et al., 2020; Ramezani-Kebrya et al., 2022). Suppose that the adversary has a set of S computationally tractable programs to design tailored attacks:

$$\mathcal{F} = \{\mathrm{AT}_1, \mathrm{AT}_2, \ldots, \mathrm{AT}_S\}. \tag{3}$$
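The omniscient mean in Eq. (1) and the attack interface of Definition 1 can be sketched as follows. The ϵ-reverse rule used as the example AT (scale the honest mean by −ϵ) follows the attack described in Section 5; all identifiers are illustrative.

```python
import numpy as np

def omniscient_mean(honest_grads):
    """g* in Eq. (1): the empirical mean of all honest updates."""
    return np.mean(honest_grads, axis=0)

def eps_reverse_attack(honest_grads, f, eps):
    """f colluding Byzantine updates: the honest mean scaled by -eps,
    following the epsilon-reverse attack described in Section 5."""
    g_star = omniscient_mean(honest_grads)
    return [-eps * g_star for _ in range(f)]

honest = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
byz = eps_reverse_attack(honest, f=2, eps=0.5)
```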
2.1.2. SERVER WITH AGGREGATORS

The server aims at learning an effective model that achieves a satisfactory overall empirical risk over honest clients, comparable to that under no attack. To update the global model, the server aggregates all gradients sent by clients at each iteration.

Definition 2 (Aggregation rule). Let g′_j ∈ R^d denote an update received from client j, which can be either an honest or a compromised client, for j ∈ [n]. The server aggregates all updates from the n clients and outputs a global update g ∈ R^d using an aggregation rule AG:

$$g = \mathrm{AG}(g'_1, \ldots, g'_n, \mathcal{F}), \tag{4}$$

where F denotes the set of attacks defined in Section 2.1.1. We assume that the server knows the number of compromised clients f, or an upper bound on f, which is a common assumption in robust learning (Guerraoui et al., 2018; Blanchard et al., 2017; Alistarh et al., 2018; Rajput et al., 2019; Karimireddy et al., 2022). However, the server does not know which of the n clients are Byzantine, so it cannot compute g* in Eq. (1) directly. To establish some level of robustness against training-time attacks, several Byzantine-resilient aggregation rules have been proposed, e.g., Krum (Blanchard et al., 2017) and Comed (Yin et al., 2018). These methods, inspired by robust statistics, provide rigorous convergence guarantees under specific settings, but have been shown to be vulnerable to tailored attacks (Fang et al., 2020; Xie et al., 2020). The set of M aggregators used by the server is denoted by A = {AG_1, AG_2, ..., AG_M}. Note that the pool of aggregators A and the set of attacks F are known by both the server and the adversary, but the specific AT_t and AG_t chosen at iteration t are unknown. Moreover, this powerful but not omniscient adversary does not have access to the random seed generator at the server.
This is a mild and common assumption in cryptography, which requires a secure source of entropy for generating random numbers at the server (Ramezani-Kebrya et al., 2022). To avoid trivial solutions, we assume each aggregation rule is robust (a formal definition of robustness is provided in Appendix B) against a subset of the attack algorithms in F, while no aggregation rule is immune to all attack algorithms. Similarly, each attack program can provably fail one or more aggregation rules in A, while no attack program can provably fail all aggregation rules.
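For concreteness, below are minimal (unoptimized) Python sketches of the two aggregators named above, following their standard descriptions: Krum scores each update by the summed squared distance to its n − f − 2 nearest neighbours and returns the lowest-scoring update, while Comed takes the coordinate-wise median.

```python
import numpy as np

def krum(updates, f):
    """Krum (Blanchard et al., 2017): pick the update whose summed squared
    distance to its n - f - 2 nearest neighbours is smallest."""
    n = len(updates)
    U = np.stack(updates)
    d2 = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1)  # pairwise sq. dist.
    scores = [np.sum(np.sort(np.delete(d2[i], i))[: n - f - 2]) for i in range(n)]
    return updates[int(np.argmin(scores))]

def comed(updates, f=None):
    """Coordinate-wise median (Yin et al., 2018); f is unused."""
    return np.median(np.stack(updates), axis=0)

honest = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([0.9, 1.1])]
outlier = np.array([100.0, 100.0])
k_out = krum(honest + [outlier], f=1)
c_out = comed(honest + [outlier])
```

Both rules ignore the single large outlier here, yet each can be provably defeated by a tailored attack that knows which rule is in play, which motivates the mixed strategy of this paper.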

2.2. PROBLEM FORMULATION

To evaluate the performance of an updated global model, i.e., the output of AG in Eq. (4), we define a loss function that measures the discrepancy between the output of AG and the omniscient model update under no attack.

Definition 3 (Loss function). The loss function associated with using aggregation rule AG under attack AT is defined as:

$$\ell(\mathrm{AG}, \mathrm{AT}, \{g'_i\}_{i=1}^{n}) = \big\| \mathrm{AG}\big(g_1, \ldots, g_{n-f}, \mathrm{AT}(g_1, \ldots, g_{n-f}, \mathcal{A}), \mathcal{F}\big) - g^* \big\| = \big\| \mathrm{AG}(g_1, \ldots, g_{n-f}, b_{n-f+1}, \ldots, b_n, \mathcal{F}) - g^* \big\|, \tag{5}$$

where g* is the ideal update under no attack, computed in Eq. (1).

To train the global model, the server runs multiple rounds of stochastic gradient descent, aggregating the stochastic gradients from the clients. However, at each round some gradients, sent by compromised clients controlled by the adversary, might be corrupt. We frame this robust distributed learning scenario as a game between the adversary and the server: the server wants to minimize the loss defined in Definition 3, while the adversary aims to maximize it. This game, a minimax optimization problem, is formulated as:

$$\min_{\mathrm{AG} \in \mathcal{A}} \max_{\mathrm{AT} \in \mathcal{F}} \ell(\mathrm{AG}, \mathrm{AT}, \{g'_i\}_{i=1}^{n}). \tag{MinMax}$$

The entire process of model aggregation over T rounds is shown in Algorithm 1. Note that E_G denotes the expectation with respect to the randomness of the stochastic gradients. Ideally, the game in MinMax can reach a Nash equilibrium (NE) (Nash, 1950). However, the loss is not computable for the server, since the server cannot distinguish honest gradients and hence g* is unknown to it. Therefore, we simulate the game in Section 3.

Algorithm 1 Update model from the perspective of the server
for t = 1 to T do
  for i = 1 to n − f do
    Honest client i computes the local gradient g_i(x_t) = ∇F_i(x_t).
  Compromised clients send attacks AT_t({g_i}_{i=1}^{n−f}, A).
  Server receives gradients from all clients {g′_i}_{i=1}^{n}.
  Server chooses AG_t by solving min_{AG_t ∈ A} max_{AT_t ∈ F} E_G[ℓ(AG_t, AT_t, {g′_i}_{i=1}^{n})].
  Server updates the model x_{t+1} = x_t − η_t AG_t({g′_i}_{i=1}^{n}, F).

3. ROBUST AGGREGATION

Because MinMax cannot be solved during the process of updating the model, we propose to simulate it instead and obtain an optimized aggregator for the model updates. As mentioned in Section 2, the informed adversary has an advantage over the server, since it can perfectly estimate g* in Eq. (1) while the server lacks such knowledge and cannot identify honest clients a priori. We assume that each client donates a small amount of data as a public dataset to the server to achieve some level of security by controlling the information gap between the server and the adversary. Let g̃ denote the update computed at the server using the public dataset, which is a rough estimate of g*.

Remark 1. The server may obtain such a public dataset from sources other than the current clients. It suffices that the collected public dataset represents the clients' data distribution to some extent. To guarantee convergence, we only require that the update from the public dataset be a rough estimate of the ideal g*. The quality of this estimate directly impacts the convergence of our proposed algorithm (see Section 4 for details): as the quality of the estimate improves, so does the convergence of the global model to an effective model. The existence of such a public dataset is a valid and common assumption in FL (Fang & Ye, 2022; Huang et al., 2022; Kairouz et al., 2021; Yoshida et al., 2020; Chang et al., 2019; Zhao et al., 2018).

For the simulation, the server generates the simulated gradients {g̃_i}_{i=1}^{n−f} based on the public dataset. The loss function in the simulated game becomes

$$\ell(\mathrm{AG}, \mathrm{AT}, \{\tilde{g}_i\}_{i=1}^{n-f}) = \Big\| \mathrm{AG}\big(\tilde{g}_1, \ldots, \tilde{g}_{n-f}, \mathrm{AT}(\tilde{g}_1, \ldots, \tilde{g}_{n-f}, \mathcal{A}), \mathcal{F}\big) - \frac{1}{n-f} \sum_{i=1}^{n-f} \tilde{g}_i \Big\|. \tag{7}$$

Let L ∈ R_+^{M×S} denote the losses of the M aggregators against the S attacks, where L(AG_i, AT_j) is the loss associated with aggregation rule i in A under attack j in F in the simulation.
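The simulated loss in Eq. (7) is straightforward to compute once the aggregator and attack are callables. Below is a sketch in which `mean_agg` and `flip` stand in for a (non-robust) aggregator and a sign-flip attack; both are illustrative placeholders, not the paper's actual pools.

```python
import numpy as np

def simulated_loss(AG, AT, sim_grads, f):
    """Eq. (7): play AG against AT on the simulated gradients and measure
    the distance to the simulated honest mean."""
    byz = AT(sim_grads, f)                 # f tailored Byzantine updates
    agg = AG(list(sim_grads) + byz, f)     # server-side aggregation
    return np.linalg.norm(agg - np.mean(sim_grads, axis=0))

mean_agg = lambda updates, f: np.mean(updates, axis=0)    # non-robust AG
flip = lambda grads, f: [-np.mean(grads, axis=0)] * f     # sign-flip AT
sim = [np.array([1.0, 0.0]), np.array([1.0, 2.0])]
loss = simulated_loss(mean_agg, flip, sim, f=1)
```

Evaluating this for every (AG_i, AT_j) pair would fill the payoff matrix L; the bandit formulation below avoids computing all M × S entries.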
After the adversary has committed to a probability distribution q over the S attack algorithms, the server chooses a probability distribution p over the M aggregation rules and incurs the loss ℓ(p, q) = p E_G[L] q^⊤. We solve Sim-MinMax below instead of MinMax:

$$\min_{p \in \Delta_M} \max_{q \in \Delta_S} \; p \, \mathbb{E}_G[L] \, q^\top, \tag{Sim-MinMax}$$

where Δ_M and Δ_S denote the probability simplices over [M] and [S], respectively. In practice, it is computationally expensive to compute the full M × S matrix L, and its entries are noisy due to the stochastic gradients. Therefore, we consider a bandit feedback model with limited feedback, in which the server and the adversary can only observe the loss through exploration. To solve Sim-MinMax in the bandit feedback model, one player could run the well-known Exponential-weight Algorithm for Exploration and Exploitation (Exp3) (Seldin et al., 2013), whose detailed description is deferred to Appendix C. Since our model has two players (the server and the adversary), we propose an algorithm that executes two instances of Exp3 simultaneously. We term our proposed robust aggregation scheme RobustTailor; it outputs an optimized AG at each iteration. The algorithm from the perspective of the server is shown in Algorithm 2. Using the public dataset, the server generates n − f noisy stochastic gradients g̃_i = ∇F̃_i(x_t) for i ∈ [n − f] at iteration t. After K rounds of simulation on {g̃_i}_{i=1}^{n−f}, the server obtains a final probability distribution p̄ and selects an aggregation rule by sampling from p̄. The steps of our robust training procedure are summarized in Algorithm 3. Appendix H provides both a theoretical analysis and empirical results on RobustTailor's computational complexity.

Algorithm 2 RobustTailor

Input: updating rates λ1, λ2, simulation rounds K, simulated gradients {g̃_i}_{i=1}^{n−f}, A, F.
Initialize weight vectors w_0(i) = 1 for i ∈ [M] and v_0(j) = 1 for j ∈ [S].
for k = 1 to K do
  Set p_k(AG_i) = (1 − λ1) w_k(i) / Σ_{i=1}^{M} w_k(i) + λ1/M for i ∈ [M].
  Set q_k(AT_j) = (1 − λ2) v_k(j) / Σ_{j=1}^{S} v_k(j) + λ2/S for j ∈ [S].
  Sample AG_k ∼ p_k and AT_k ∼ q_k respectively.
  Estimate the loss ℓ_k = ℓ(AG_k, AT_k, {g̃_i}_{i=1}^{n−f}).
  Set l̂_k^1(i) = I{AG_i = AG_k} / p_k(AG_i) · ℓ_k and w_{k+1}(i) = w_k(i) exp(−λ1 l̂_k^1(i)/M) for i ∈ [M].
  Set l̂_k^2(j) = I{AT_j = AT_k} / q_k(AT_j) · ℓ_k and v_{k+1}(j) = v_k(j) exp(λ2 l̂_k^2(j)/S) for j ∈ [S].
Set p̄(i) = (1/K) Σ_{k=1}^{K} p_k(AG_i) for i ∈ [M].
Sample AG ∼ p̄.
Output: AG.
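Algorithm 2's inner loop can be sketched compactly in Python: two Exp3 players run over a loss oracle `loss_fn(i, j)`, which is assumed to return the simulated loss of aggregator i under attack j scaled to [0, 1]; the oracle's name and interface are illustrative.

```python
import numpy as np

def robust_tailor(loss_fn, M, S, K, rng=None):
    """Sketch of Algorithm 2: simultaneous Exp3 for the server (minimizing
    over M aggregators) and the adversary (maximizing over S attacks)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lam1 = np.sqrt(np.log(M) / (K * M))
    lam2 = np.sqrt(np.log(S) / (K * S))
    w, v = np.ones(M), np.ones(S)
    p_sum = np.zeros(M)
    for _ in range(K):
        p = (1 - lam1) * w / w.sum() + lam1 / M
        q = (1 - lam2) * v / v.sum() + lam2 / S
        i = int(rng.choice(M, p=p))
        j = int(rng.choice(S, p=q))
        ell = loss_fn(i, j)
        w[i] *= np.exp(-lam1 * (ell / p[i]) / M)   # server minimizes
        v[j] *= np.exp(lam2 * (ell / q[j]) / S)    # adversary maximizes
        p_sum += p
    p_bar = p_sum / K
    return int(rng.choice(M, p=p_bar)), p_bar

# Illustrative payoff matrix: aggregator 0 is uniformly better.
L = np.array([[0.1, 0.1],
              [0.9, 0.9]])
ag, p_bar = robust_tailor(lambda i, j: L[i, j], M=2, S=2, K=2000)
```

With this toy matrix, the averaged distribution p̄ concentrates on the uniformly better aggregator, as Lemma 3 predicts.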

Algorithm 3 Server's aggregation

for t = 1 to T do
  Server calculates the simulated gradients {g̃_i}_{i=1}^{n−f} from the public dataset.
  Call Algorithm 2: AG_t = RobustTailor({g̃_i}_{i=1}^{n−f}, A, F).
  Server receives gradients from all clients {g′_i}_{i=1}^{n}.
  Server updates the global model: x_{t+1} = x_t − η_t AG_t({g′_i}_{i=1}^{n}, F).

The adversary can also perform a simulation to optimize its attack at each iteration. Compared to RobustTailor, the main differences in the adversarial simulation are: 1) the adversary can use the true honest stochastic gradients {g_i}_{i=1}^{n−f} instead of noisy estimates; and 2) the output is a probability distribution q̄, computed from the weight vector of the attacks v(j) for j ∈ [S]. The details of the adversarial simulation are provided in Appendix D.
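Putting the pieces together, here is a toy end-to-end run of the outer loop on a quadratic problem, with coordinate-wise median as a one-element pool and one Byzantine client sending a large constant vector; the objectives and names are illustrative, not the paper's experimental setup.

```python
import numpy as np

def comed_agg(updates, f):
    """Coordinate-wise median; f is unused but kept for a uniform API."""
    return np.median(np.stack(updates), axis=0)

def train(x0, client_grads, aggregators, pick, T=50, lr=0.2, f=1):
    """Outer loop sketch: each round, the aggregator chosen by `pick`
    combines the (honest + Byzantine) updates and the server steps."""
    x = x0
    for t in range(T):
        x = x - lr * aggregators[pick(t)](client_grads(x), f)
    return x

# Toy honest objectives F_i(x) = 0.5 * ||x - c_i||^2 (illustrative), plus one
# Byzantine client that always sends a huge constant vector.
centers = [np.array([1.0, 1.0]), np.array([1.2, 0.8]), np.array([0.8, 1.2])]
def client_grads(x):
    return [x - c for c in centers] + [np.full(2, 1e3)]

x_final = train(np.zeros(2), client_grads, [comed_agg], pick=lambda t: 0)
```

Despite the Byzantine vector, the model converges near the honest optimum [1, 1] (with a small bias from taking the median of an even number of updates), in contrast to the plain mean, which diverges.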

4. THEORETICAL GUARANTEES

To show convergence of the inner optimization in Algorithm 2, we first show how two simultaneously played no-regret algorithms for a minimax problem yield convergence to a Nash equilibrium (NE). To make the optimization problem in Algorithm 2 more general, we define a new loss function L : [M] × [S] → R_+. Consider simultaneously running two algorithms on the objective L such that their respective expected regrets are upper bounded by some quantities R_K^i and R_K^j, i.e.,

$$\mathbb{E}\Big[\sum_{k=1}^{K} L(i_k, j_k) - \sum_{k=1}^{K} L(i, j_k)\Big] \le R_K^i, \qquad \mathbb{E}\Big[\sum_{k=1}^{K} L(i_k, j) - \sum_{k=1}^{K} L(i_k, j_k)\Big] \le R_K^j, \tag{8}$$

for any i ∈ [M] and j ∈ [S], where the expectation is taken over the randomness of the algorithms.

Lemma 1 (Folklore). Assume we run two algorithms simultaneously with regrets as in (8) to obtain {(i_k, j_k)}_{k=1}^{K}. By playing ī sampled uniformly from {i_k}_{k=1}^{K}, we can guarantee

$$\mathbb{E}_{\bar{i}}[L(\bar{i}, j)] \le \mathbb{E}_{i^\star \sim p^\star, j^\star \sim q^\star}[L(i^\star, j^\star)] + \frac{1}{K}\big(R_K^i + R_K^j\big), \tag{9}$$

for any j ∈ [S], where (p^⋆, q^⋆) is a Nash equilibrium of E_{i⋆∼p⋆, j⋆∼q⋆}[L(i⋆, j⋆)]. This kind of result is well known in the literature (see, for instance, Dughmi et al. (2017, Cor. 4)). When the algorithms have sublinear regrets, we refer to them as no-regret algorithms; this condition ensures that the error term in (9) vanishes as K → ∞. Exp3 (Auer et al., 2002), employed by both the attacker and the aggregator in Algorithm 2, enjoys such a no-regret property.

Lemma 2 (Hazan et al. 2016, Lemma 6.3). Let K be the horizon, N the number of actions, and L_k : [N] → R_+ non-negative losses for all k. Then Exp3 with stepsize λ = √(log N / (KN)) enjoys the regret bound

$$\mathbb{E}\Big[\sum_{k=1}^{K} L_k(i_k)\Big] - \sum_{k=1}^{K} L_k(i) \le 2\sqrt{KN \log N},$$

for any i ∈ [N], where the expectation is taken over the randomness of the algorithm. Note that any two simultaneously played no-regret algorithms for a minimax problem can be turned into convergence to a NE following Lemmas 1 and 2.
We obtain guarantees for the aggregation rule returned by Algorithm 2 as a direct consequence of Lemmas 1 and 2. In the specific situation of Algorithm 2, L is replaced by the simulation loss ℓ in Eq. (7).

Lemma 3. Let ℓ be the simulation loss in Eq. (7). Sample AG ∼ p̄ as defined in Algorithm 2 with λ1 = √(log M / (KM)) and λ2 = √(log S / (KS)). Then the loss is bounded in expectation, for any attack AT ∈ F, as

$$\mathbb{E}_{\mathrm{AG}, G}\big[\ell(\mathrm{AG}, \mathrm{AT}, \{g'_i\}_{i=1}^{n})\big] \le p^\star \mathbb{E}_G[L] (q^\star)^\top + \frac{2\big(\sqrt{M \log M} + \sqrt{S \log S}\big)}{\sqrt{K}},$$

where (p^⋆, q^⋆) ∈ Δ_M × Δ_S is a Nash equilibrium of the zero-sum game with stochastic payoff matrix E_G[L], as defined in Sim-MinMax. Lemma 3 implies that the simulated loss approaches the NE value even under the worst-case attack. The proofs of Lemma 1 and Lemma 3 are provided in Appendix E and Appendix F, respectively. Importantly, a sufficient condition for almost sure convergence of the outer loop is provided in Appendix B.

5. EXPERIMENTAL EVALUATION

In this section, we evaluate the resilience of RobustTailor against tailored attacks. To provide intuitive results and show the benefits of simulation in terms of robustness, we first construct a simple pool of aggregators including only Krum (Blanchard et al., 2017) and Comed (Yin et al., 2018). For the adversary's tailored attacks, we consider two different types of attacks, which can successfully ruin Krum and Comed, respectively. As described in (Fang et al., 2020; Xie et al., 2020), an adversary with the ϵ-reverse attack computes the average of the honest updates, scales the average by a parameter ϵ, and sends the scaled, sign-reversed updates to the server to push the global model in the direction opposite to the attack-free one. It is known that a small ϵ corrupts Krum, while a large one corrupts Comed (Fang et al., 2020; Xie et al., 2020). We simulate training with a total of 12 clients, 2 of which are compromised by an informed adversary. We train a CNN model on MNIST (Lecun et al., 1998) under both iid and non-iid settings, and CNN models on Fashion-MNIST (FMNIST) (Xiao et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009) under the iid setting. The dataset is shuffled and equally partitioned among clients in the iid settings (Fig. 1). In addition, all honest clients donate 5% of their local training dataset to the server as a public dataset, as specified in Remark 1, and the informed adversary has access to the gradients of the honest clients.

Poisoned data mixed in the public dataset. Byzantine clients may be able to donate poisoned data to the public dataset. We assume 16.7% of the data in the public dataset is poisoned, corresponding to the 16.7% of malicious clients. The two common data poisoning methods we choose are label flipping (LF) (Muñoz-González et al., 2017) and random label (Zhang et al., 2021). Fig. 4 demonstrates that the poisoned data mixed in has little impact on RobustTailor, which also shows that a small gap between the public dataset and true samples does not substantially reduce the effectiveness of RobustTailor.

Unknown attacks for the server. In our setup, the server knows all attacks in the adversary's pool. What happens when an attack falls outside the server's expectation? Fig. 5 gives the results. In particular, ϵ = 0.1 in Fig. 5a and ϵ = 150 in Fig. 5b are attacks of the same type as ϵ = 0.5 and ϵ = 100, while Mimic (Karimireddy et al., 2022) in Fig. 5c and Alittle (Baruch et al., 2019) in Fig. 5d are attacks of a different type. Note that we expand RobustTailor's pool with GM (Pillutla et al., 2022) and Bulyan (Guerraoui et al., 2018) as in Fig. 9 and decrease the learning rate to 0.005 when defending against Alittle and Mimic. The results show that RobustTailor can defend not only against attacks similar to the expected ones but also against totally different ones. As a mixed framework, RobustTailor is hard to ruin, since the adversary can hardly design a tailored attack that ruins several aggregation rules simultaneously.

Aggregators with auxiliary data. The public dataset can be used not only for simulation but also to assist aggregation. Fang et al. (2020) proposed two server-side verification methods, error rate based rejection (ERR) and loss function based rejection (LFR), which reject potentially harmful gradients based on error rates or loss values before aggregation. We run experiments with the setup of Fig. 3: Krum/Comed assisted by ERR/LFR is totally ruined by AttackTailor, with around 10% accuracy, while RobustTailor reaches 90.28%. These results provide further evidence that RobustTailor outperforms existing techniques. In additional experiments, we observe that ERR/LFR helps the aggregator achieve around 97% accuracy against the ϵ = 0.5 attack, while it is totally ruined against ϵ = 100.
In this more sensitive situation, AttackTailor easily breaks single aggregation rules, but RobustTailor still performs well.

Additional experiments. To further validate the performance of RobustTailor, we report additional experiments in Appendix G.2, including: 1) three datasets; 2) non-iid settings; 3) more Byzantine clients; 4) more aggregation rules added to RobustTailor; 5) the impact of the proportion of public data; 6) subsampling by the server; 7) a dynamic adversary strategy; 8) an adversary with partial knowledge.

6. CONCLUSIONS AND FUTURE WORK

We formulate the robust distributed learning problem as a game between a server and an adversary. We propose RobustTailor, which achieves robustness by simulating the server's aggregation rules under different attacks optimized by an informed and powerful adversary. RobustTailor provides theoretical guarantees for the simulated game through a regret analysis. We empirically demonstrate the significant superiority of RobustTailor over baseline robust aggregation rules. Any Byzantine-resilient scheme with a given structure can be added to RobustTailor's framework. Although the increased computational complexity of RobustTailor, analyzed in Appendix H, is acceptable given the gained robustness, developing efficient and secure protocols for deploying RobustTailor, e.g., using multi-party computation (Boneh et al., 2019), remains future work.

A RELATED WORK

In a model poisoning attack, an adversary controls some clients and directly manipulates their outputs, trying to bias the global model in the opposite direction (Kairouz et al., 2021). If Byzantine clients have access to the updates of honest clients, they can tailor their attacks and make them difficult to detect (Fang et al., 2020; Xie et al., 2020; Lamport et al., 1982; Blanchard et al., 2017; Goodfellow et al., 2014; Bagdasaryan et al., 2020).

Notation. We use E[·] to denote expectation and ∥·∥, ∥·∥_0 to denote norms.

Robust aggregation and Byzantine resilience. To improve robustness under general Byzantine clients, a number of robust aggregation schemes have been proposed, mainly inspired by robust statistics, such as median-based aggregators (Yin et al., 2018; Chen et al., 2017), Krum (Blanchard et al., 2017), and trimmed mean (Yin et al., 2018). Krum (Blanchard et al., 2017) and coordinate-wise median (Comed) (Yin et al., 2018; Chen et al., 2017) are the two main aggregation rules used in this paper.
Krum is a squared-distance-based aggregation rule: it selects the gradient that minimizes the sum of squared distances to its n − f − 2 closest vectors, where n denotes the total number of clients and f the number of adversarial ones. Comed is a median-based aggregator that outputs the coordinate-wise median of the received gradients. Besides statistical aggregation rules, there are many related approaches, such as server-side verification and client-side self-clipping. From the perspective of the server, Fang et al. (2020) propose server-side verification methods using auxiliary data. Specifically, Fang et al. (2020) assume the server has a small validation dataset and use error rates to reject harmful gradients. In (Xie et al., 2020; Cao & Lai, 2019), the server asks clients for a small clean dataset and filters out unreliable gradients. Cao et al. (2020; 2021) propose further defenses along these lines. However, the ability of clients is not the focus of our paper, and we will consider it in future work. Past work has shown that these aggregators can defend successfully under specific conditions (Blanchard et al., 2017; Chen et al., 2017; Su & Vaidya, 2016). However, Fang et al. (2020) and Xie et al. (2020) argue that Byzantine-resilient aggregators can fail when an informed adversary tailors a careful attack. Therefore, developing a robust and efficient algorithm under such strong tailored attacks is essential for improving the security of FL, which is the goal of this paper.

Heterogeneous data. In real-world applications, issues such as heterogeneous data become significant (Kairouz et al., 2021; Karimireddy et al., 2020). Karimireddy et al. (2022) find that robust FL algorithms may fail under the non-iid setting. Several algorithms have been proposed to tackle non-iid data (Yoshida et al., 2020; Zhao et al., 2018; Karimireddy et al., 2022; Wang et al., 2020a; Data & Diggavi, 2021; Zhu et al., 2021).
Besides, data heterogeneity facilitates backdoor attacks, which can be viewed as a kind of training-time attack (Xie et al., 2019; Bagdasaryan et al., 2019; Zawad et al., 2021). Therefore, establishing robustness in the non-iid setting is also an important criterion for an aggregator.

Game theory in FL. The Online Convex Optimization (OCO) framework (Zinkevich, 2003) is widely influential in the learning community (Hazan et al., 2016; Shalev-Shwartz et al., 2012), and Bandit Convex Optimization (BCO), an extension of OCO for decision making with limited feedback, was proposed by Awerbuch & Kleinberg (2008). Bandit paradigms paralleling the FL framework are proposed by Shi & Shen (2021), with an extension under Byzantine attacks by Demirel et al. (2022). However, these works account for uncertainties from arm and client sampling rather than robust aggregation, the focus of this paper. In this paper, we frame the robust distributed learning problem as a game and consider the bandit feedback model.

B ROBUSTNESS OF RobustTailor

In this section, we give a general definition of the robustness of an aggregation rule against an attack. Note that our definition covers a broad range of settings, with general pure and mixed aggregation strategies along with general pure and mixed attack strategies. Our robustness notion leads to almost sure convergence guarantees to a local minimum of F in FL, which is equivalent to being immune to training-time attacks.

Definition 4 (Robustness of an aggregator to an attack program). Let x ∈ R^d denote a machine learning model. Let g_i(x) = ∇F_i(x) ∈ R^d be independent honest updates for i ∈ [n]. Let G(x) denote a function that draws an honest client i uniformly at random and outputs an unbiased stochastic gradient of ∇F_i(x) over that client, such that E[G(x)] = ∇F(x), where the expectation is over both the random client and the samples. Let AG denote an arbitrary aggregation rule, which can be a mixed aggregation strategy selecting an aggregator from A = {AG_1, ..., AG_M} based on simulation. The output of AG is g(x) = AG({g′_i}_{i=1}^{n}). Note that {g′_i}_{i=1}^{n} includes both honest and compromised updates; the compromised updates are the output of an attack program AT({g_i}_{i=1}^{n−f}, A), which can be a pure or mixed attack strategy. The mixed aggregation rule AG is Byzantine-resilient to AT if g(x) satisfies

$$\mathbb{E}[g(x)]^\top \nabla F(x) > 0 \quad \text{and} \quad \mathbb{E}\big[\|g(x)\|^r\big] \le K_r \, \mathbb{E}\big[\|G(x)\|^r\big] \ \text{for } r = 2, 3, 4$$

and some constants K_r. Suppose {η_t}_{t=1}^{∞} in Algorithm 3 satisfies Σ_t η_t = ∞ and Σ_t η_t^2 < ∞. For a nonconvex loss function that is three times differentiable with continuous derivatives, bounded from below, and satisfying the global confinement assumption in (Bottou, 1998, Section 5.1), for general pure and mixed aggregation and attack strategies satisfying Definition 4, and for a general non-iid data distribution across clients, we can establish almost sure convergence (∇F(x_t) → 0 a.s.) of the output of AG in Algorithm 3 along the lines of (Bottou, 1998; Fisk, 1965; Métivier, 1982).
Note that to achieve E[g(x)]^⊤ ∇F(x) > 0 as required above, both the angle between ∇F(x) and the public-data estimate g̃ and the angle between g̃ and the expected output of Algorithm 2, i.e., E[g(x)], need to be small. Let θ1 denote the angle between ∇F(x) and g̃, and θ2 the angle between g̃ and E[g(x)]:

$$\theta_1 = \arccos \frac{\tilde{g}^\top \nabla F(x)}{\|\tilde{g}\| \, \|\nabla F(x)\|}, \qquad \theta_2 = \arccos \frac{\tilde{g}^\top \mathbb{E}[g(x)]}{\|\tilde{g}\| \, \|\mathbb{E}[g(x)]\|}.$$

If θ1 + θ2 < π/2, then E[g(x)]^⊤ ∇F(x) > 0, and almost sure convergence of Algorithm 3 is guaranteed following the arguments above. This condition can be satisfied when 1) the public data donated by clients is representative of the underlying data distribution of honest clients, which controls θ1, and 2) the number of Byzantine clients is sufficiently small, which controls θ2. We defer the derivation of an explicit necessary condition for almost sure convergence to future work.
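A quick numerical check of the sufficient condition θ1 + θ2 < π/2, on arbitrary illustrative vectors (none of the values below come from the paper):

```python
import numpy as np

def angle(a, b):
    """Angle between two vectors via the arccos of their normalized dot product."""
    return np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

grad_F = np.array([1.0, 0.0])          # true gradient (illustrative)
g_tilde = np.array([1.0, 0.3])         # public-dataset estimate (illustrative)
g_expected = np.array([0.9, 0.6])      # expected Algorithm 2 output (illustrative)

theta1 = angle(grad_F, g_tilde)
theta2 = angle(g_tilde, g_expected)
descent = float(np.dot(g_expected, grad_F)) > 0   # E[g(x)]^T grad F(x) > 0
```

Here θ1 + θ2 stays below π/2 and the descent condition indeed holds; the condition is sufficient, not necessary.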

C DETAILS OF EXP3

The bandit feedback model considers the following iterative game.

Definition 5 (Bandit setting). The player is given a decision set [N]. At each iteration k = 1, ..., K: 1) the player picks i_k ∈ [N]; 2) the adversary picks a loss vector ℓ_k; 3) the player observes and suffers only the loss at index i_k, i.e., ℓ_k(i_k).

Exp3, shown abstractly in Algorithm 4, enjoys a so-called no-regret property in this setting. We employ Exp3 from the perspectives of both a simulated server and a simulated attacker to find a robust aggregation rule in Algorithm 2. In Appendix E we show how to convert the no-regret properties into a convergence guarantee.

Algorithm 4 Exp3

Input: updating rate λ, iteration rounds K, number of arms N.
Initialize the weight vector w_0(i) = 1 for i = 1, ..., N.
for k = 1 to K do
  Set W_k = Σ_{i=1}^N w_k(i), and set, for i = 1, ..., N, p(i) = (1 − λ) w_k(i)/W_k + λ/N.
  Draw i_k randomly according to the probabilities p.
  Receive the loss ℓ_k(i_k).
  Set, for i = 1, ..., N, l̂_k(i) = ℓ_k(i_k)/p(i_k) if i = i_k, and 0 otherwise.
  Set w_{k+1}(i) = w_k(i) exp(−λ l̂_k(i)/N).
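A minimal Exp3 implementation along the lines of Algorithm 4 might look as follows; the loss interface (losses in [0, 1]) and the fixed updating rate are assumptions of this sketch.

```python
import numpy as np

def exp3(loss_of, N, K, lam=0.1, rng=None):
    """Run Exp3 for K rounds over N arms; loss_of(k, i) returns ℓ_k(i) in [0, 1]."""
    rng = rng or np.random.default_rng(0)
    w = np.ones(N)
    picks = []
    for k in range(K):
        p = (1 - lam) * w / w.sum() + lam / N   # mix in uniform exploration
        i_k = rng.choice(N, p=p)
        loss = loss_of(k, i_k)                  # observe only ℓ_k(i_k)
        l_hat = np.zeros(N)
        l_hat[i_k] = loss / p[i_k]              # importance-weighted estimate
        w *= np.exp(-lam * l_hat / N)           # multiplicative weight update
        picks.append(i_k)
    return picks

# Toy example: arm 0 always has loss 0, the others loss 1, so Exp3 should
# concentrate its picks on arm 0.
picks = exp3(lambda k, i: 0.0 if i == 0 else 1.0, N=3, K=2000)
frac_best = picks[-500:].count(0) / 500
```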

D SIMULATION OF ADVERSARY

In this section, we describe the simulation of the adversary. We term the adversarial simulation AttackTailor; it outputs an appropriate AT at each iteration. The specific steps from the perspective of the adversary are shown in Algorithm 5. After observing the n − f honest gradients, the adversary performs a K-round simulation and obtains a final probability distribution q. By sampling from q, the adversary selects an attack. Then, the f Byzantine clients create the compromised gradients and send them to the server. The steps of the attack procedure are summarized in Algorithm 6. Importantly, the main difference between the adversary's simulation and the server's is that the adversary simulates based on the real honest gradients, while the server has access only to noisy estimates of the true gradients. Hence, unlike typical games and simulation setups, the adversary has an additional advantage over the server due to this information asymmetry.

We are interested in the i-player's performance E[L(ī, j̄)], which we can relate to the mixed-strategy Nash equilibrium defined by E[L(i⋆, j)] ≤ E[L(i⋆, j⋆)] ≤ E[L(i, j⋆)], where i⋆ ∼ p⋆ and j⋆ ∼ q⋆. By picking i ∼ p⋆ in (13) we get

E[L(ī, j̄)] ≤ E[L(i, j̄)] + (R_K^j + R_K^i)/K = E[L(i⋆, j̄)] + (R_K^j + R_K^i)/K ≤ E[L(i⋆, j⋆)] + (R_K^j + R_K^i)/K,

where the last inequality follows from the definition of the Nash equilibrium above. The claim follows by writing the expectation on the RHS in terms of p⋆ and q⋆.

F PROOF OF LEMMA 3

Proof. Let both player i and player j in Lemma 1 employ the no-regret algorithm Exp3, so that Lemma 2 applies and consequently R_K^i and R_K^j in (8) reduce to R_K^i = 2√(KM log M) and R_K^j = 2√(KS log S). Substituting (16) into Lemma 1, we have

E_ī[L(ī, j̄)] ≤ E_{i⋆∼p⋆, j⋆∼q⋆}[L(i⋆, j⋆)] + 2(√(M log M) + √(S log S))/√K.

Notice that Algorithm 2 is an instance of two simultaneously played Exp3 algorithms with i = AG, j = AT, and L(i, j) = ℓ(AG, AT, {g'_i}_{i=1}^n).
It follows from (17) that

E_{AḠ, G}[ℓ(AḠ, AT, {g'_i}_{i=1}^n)] ≤ E_{AG⋆∼p⋆, AT⋆∼q⋆, G}[ℓ(AG⋆, AT⋆, {g'_i}_{i=1}^n)] + 2(√(M log M) + √(S log S))/√K,   (18)

where AḠ is the average iterate as defined in Algorithm 2. We can concisely write the Nash equilibrium on the RHS of (18) in terms of the payoff matrix L from Sim-MinMax, defined componentwise as L(AG, AT) = ℓ(AG, AT, {g'_i}_{i=1}^n). This completes the proof.
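To illustrate how two simultaneously played Exp3 instances approach the equilibrium value, the toy self-play below runs both players on a 2x2 zero-sum payoff matrix (matching pennies, value 0.5, losses in [0, 1]). The matrix and rates are illustrative, not the Sim-MinMax payoff of the paper.

```python
import numpy as np

# Row player (the "server") minimizes L[i, j]; column player (the "adversary")
# maximizes it, so its Exp3 loss is 1 - L[i, j].
L = np.array([[1.0, 0.0],
              [0.0, 1.0]])

rng = np.random.default_rng(1)
K, lam = 5000, 0.05
M, S = L.shape
w_i, w_j = np.ones(M), np.ones(S)
avg_loss = 0.0
for k in range(K):
    p = (1 - lam) * w_i / w_i.sum() + lam / M
    q = (1 - lam) * w_j / w_j.sum() + lam / S
    i, j = rng.choice(M, p=p), rng.choice(S, p=q)
    loss = L[i, j]
    avg_loss += loss / K
    li = np.zeros(M); li[i] = loss / p[i]            # minimizer's loss estimate
    lj = np.zeros(S); lj[j] = (1.0 - loss) / q[j]    # maximizer's loss estimate
    w_i *= np.exp(-lam * li / M)
    w_j *= np.exp(-lam * lj / S)
```

The average realized loss should approach the game value 0.5 at roughly the O(K^{-1/2}) rate of the regret bound.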

G EXPERIMENTAL DETAILS AND ADDITIONAL EXPERIMENTS

In this section, we provide the training hyper-parameters and present a series of additional experiments.

G.1 DETAILS OF IMPLEMENTATION

Both the MNIST (Lecun et al., 1998) and FMNIST (Xiao et al., 2017) datasets contain 60,000 training samples and 10,000 test samples. Each sample is a 28-by-28-pixel grayscale image. The training hyper-parameters are shown in Table 1. The network architecture is a fully connected neural network with two fully connected layers (Leroux et al., 2016), with 100 and 10 neurons in the first and second layer, respectively. All experiments were run on a cluster with Xeon Gold processors and V100 GPUs.
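For concreteness, a sketch of the two-layer fully connected architecture (784 → 100 → 10) is given below; the initialization scale and the ReLU/softmax choices are assumptions not specified above.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.05, size=(784, 100)); b1 = np.zeros(100)
W2 = rng.normal(scale=0.05, size=(100, 10));  b2 = np.zeros(10)

def forward(x):
    """x: batch of flattened 28x28 images, shape (B, 784); returns class probabilities."""
    h = np.maximum(x @ W1 + b1, 0.0)         # hidden layer, 100 units (ReLU assumed)
    logits = h @ W2 + b2                     # 10-way output layer
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

probs = forward(rng.random((4, 784)))
```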

G.2 ADDITIONAL EXPERIMENTS

To further validate the performance of RobustTailor, we set up additional experiments:
• Training on 3 datasets.
• Non-iid settings with 3 different heterogeneity degrees.
• More aggregators (Geomed, Trimmedmean, Bulyan) added against single attacks.
• The impact of the proportion of public data.
• Subsampling by the server.
• A dynamic strategy of the adversary.
• An adversary with partial knowledge.

Different datasets. We train a CNN model on MNIST (Lecun et al., 1998), Fashion-MNIST (FMNIST) (Xiao et al., 2017), and CIFAR10 (Krizhevsky et al., 2009) under the iid setting. We summarize the training results against 3 attacks here; they are consistent with the results shown in Section 5.

Non-iid settings. We also extend our consideration to more realistic settings with a non-iid data distribution across clients. We use the heterogeneity degree µ ∈ [0, 1] to represent the level of disparity among clients' local data. Specifically, we construct a setting in which 100µ% of the local data of each client is drawn in a non-identical but independent manner from the class corresponding to the client index, while the remaining local data is drawn iid from all classes.

More aggregators. When Bulyan is included, each Bulyan aggregator uses a different aggregator from Krum, Comed, TM, and GM for either the selection phase or the aggregation phase. For each class, we generate 16 aggregators, each with a randomly generated ℓ_p norm with p from one to 16. RobustTailor selects one aggregator from the entire pool of 64 aggregators based on the simulation at each iteration. Moreover, centered clipping (CC) (Karimireddy et al., 2021), a recently proposed history-aided aggregator, can also be added to the RobustTailor framework. Fig. 9 shows the results when more aggregators are added into the RobustTailor framework; they perform better.

Proportion of public data. Because RobustTailor asks clients to donate a small amount of data as the public dataset, we want to minimize data leakage as much as possible while maintaining effectiveness.
Therefore, examining the impact of the proportion of public data is necessary. In our experiments, we assume every client donates 5% of its data to the server. Fig. 10 shows the performance of RobustTailor with different proportions of public data under 3 attacks. Note that, except for the proportion of public data, the settings of the 3 figures in Fig. 10 are the same as in Fig. 1a, Fig. 1b, and Fig. 3, respectively. The amount of public data clearly has little impact on RobustTailor: even a very small proportion of data donated by clients (e.g., 0.1%) helps RobustTailor achieve strong performance. This also supports Remark 1, which states that the public dataset only needs to be representative of the clients' data distribution.

Dynamic strategy of the adversary. The adversary can also use a dynamic strategy that changes the number of malicious updates over time. The adversary picks 1-3 clients at random to control at each iteration, while the server still assumes 2 Byzantines among 12 clients. Other settings are the same as in Section 5. Fig. 12 shows the results, and Table 2 compares them with the results without the dynamic attack strategy in Section 5. Although some aggregation rules are slightly affected by this dynamic attack strategy, RobustTailor performs well, consistent with the original results.

Adversary with partial knowledge. In the main text, we demonstrate that RobustTailor maintains strong and stable performance even against a very powerful adversary with full knowledge of all honest clients. In reality, however, it is very hard for the adversary to obtain full knowledge of the model updates of all honest clients. Although Fang et al. (2020) show that partial-knowledge attacks are weaker than full-knowledge attacks, it is still important to consider such more realistic attacks. For the partial-knowledge setting, we assume the adversary only knows the updates of two honest clients and designs compromised gradients based on them.
Note that other settings are the same as in Section 5. We show the empirical results under both the iid and non-iid (heterogeneity degree µ = 0.9) settings in Fig. 13 and compare them against the full-knowledge adversary in Table 3. Most aggregation rules perform at least as well as in the full-knowledge scenario.

Theoretical analysis. For RobustTailor with K simulation rounds, our algorithm approaches a min-max equilibrium over the aggregation rules at the rate O(K^{-1/2}). In addition, the computational cost of RobustTailor is influenced by the server's aggregation rules and the adversary's attacks. If n clients submit d-dimensional vectors, Krum's expected time complexity is O(n^2 d) (Blanchard et al., 2017) and Comed's is O(nd) (Pillutla et al., 2022). For a more fine-grained complexity analysis, let {T_1, ..., T_M} denote the number of elementary operations needed to run each of the M aggregation rules. The worst-case runtime complexity of RobustTailor per simulation round is determined by max_{i∈[M]} T_i. The average complexity per round, however, is the expected number of elementary operations, where the expectation is over how likely each aggregator is to be chosen during simulation; this can be estimated empirically. Let p_i denote the probability of choosing AG_i. The average complexity per round is then T̄ = Σ_{i=1}^M T_i p_i, and the overall cost of a K-round simulation is O(T̄ K). Moreover, the number of elementary operations in simulation can be much smaller than applying the actual aggregator to the model during training, assuming the size of the public dataset is very small, which is typically the case in practice (Yoshida et al., 2020; Zhao et al., 2018).
Note that our algorithm only adds computational complexity at the server; the cost for all clients remains unchanged, as determined by their models and datasets. Therefore, it is worthwhile to trade a slightly longer training time for a significantly more robust training procedure.
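The average per-round cost T̄ = Σ_i p_i T_i described above can be estimated as sketched below. The operation counts and pick probabilities are illustrative placeholders, with the probabilities loosely inspired by the selection frequencies reported in Section 5.

```python
# Assumed problem sizes: n clients, d-dimensional gradients.
n, d = 12, 10_000
costs = {"Krum": n * n * d, "Comed": n * d}   # O(n^2 d) vs O(n d) elementary ops
probs = {"Krum": 0.65, "Comed": 0.35}         # empirical pick frequencies (assumed)

T_bar = sum(probs[a] * costs[a] for a in costs)   # average cost per simulation round
worst = max(costs.values())                       # worst-case cost per round
```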

Empirical results

We show the computation costs and accuracy of different aggregation rules after running 15k iterations; these results are also shown in Fig. 1. RobustTailor maintains stable, high accuracy when facing a powerful adversary, although it needs more computation time. In contrast, Krum cannot reach high accuracy, and Comed shows very unstable performance with large fluctuations when facing a strong adversary using AttackTailor. Compared with the undesirable models produced by Krum and Comed, RobustTailor improves accuracy and stability drastically at the cost of slightly increased training time.



While this assumption is essential to frame our game, we provide experimental results on challenging settings where the server does not know the set of attack programs in Section 5.



Input: learning rate η_t, n clients, f compromised clients, iteration rounds T, A, and F. Initialize the model x_0. for t = 1 to T do: send x_t to all clients.

The model and training hyper-parameters are provided in Appendix G.1; all experiments below, unless stated otherwise, use the same setup as Fig. 1a, Fig. 1b, and Fig. 3.

Single tailored attacks. RobustTailor successfully reduces the adversary's ability to launch tailored attacks. RobustTailor maintains stability in Fig. 1a, where Krum fails catastrophically under a small-ϵ attack. Fig. 1b shows that RobustTailor has much smaller fluctuations in test accuracy compared to Comed when facing a large-ϵ attack. Moreover, on average, RobustTailor chooses Comed with probability 70.68% under the ϵ = 0.5 attack and Krum with probability 65.49% under the ϵ = 100 attack, which shows that the server successfully learns how to defend. Training on FMNIST shows consistent results, as seen in Fig. 1c and Fig. 1d, and further results on CIFAR10 are in Appendix G.2. Note that RobustTailor outperforms both Krum and Comed when there is a larger pool of aggregators to select from; those results are shown in Appendix G.2.

Figure 1: Test accuracy on MNIST and FMNIST under the iid setting. Tailored attacks (ϵ = 0.1/0.5, 100) are applied. RobustTailor selects an aggregator from Krum and Comed based on the simulation at each iteration.

Mixed attacks. We now consider additional and stronger attack strategies beyond vanilla ϵ-reverse attacks. We assume the adversary has a set of attacks including the ϵ = 0.5 and ϵ = 100 attacks. StochasticAttack, shown in Fig. 2, picks an attack from its set uniformly at random at each iteration. AttackTailor, in Fig. 3, optimizes an attack based on simulation at each iteration; the detailed algorithm is in Appendix D. Compared to all previous attacks, including StochasticAttack, AttackTailor is much stronger since it can pick a suitable attack using perfect knowledge of the honest updates. The poisoning by AttackTailor shown in Fig. 3 is almost as effective as the most targeted attack tailored against a single deterministic aggregator. Importantly, RobustTailor shows impressive robustness even when facing an adversary as strong as AttackTailor.
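For intuition, a hedged sketch of an ϵ-reverse attack and of StochasticAttack follows. We assume here that the ϵ attack scales the negated mean of the honest updates; this is an assumption of the sketch, not necessarily the paper's exact construction.

```python
import numpy as np

def eps_reverse_attack(honest_grads, eps, f):
    """Assumed eps-reverse attack: f copies of -eps times the mean honest update."""
    mean = np.mean(honest_grads, axis=0)
    return np.tile(-eps * mean, (f, 1))

def stochastic_attack(honest_grads, f, eps_set=(0.5, 100.0), rng=None):
    """StochasticAttack: pick an attack from the set uniformly at random each round."""
    rng = rng or np.random.default_rng(0)
    eps = rng.choice(eps_set)
    return eps_reverse_attack(honest_grads, eps, f)

honest = np.ones((10, 4))                  # toy honest gradients
byz = eps_reverse_attack(honest, eps=0.5, f=2)
sb = stochastic_attack(honest, f=2)
```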

Figure 2: StochasticAttack. StochasticAttack applies ϵ = 0.5 and ϵ = 100 uniform randomly.

Figure 3: AttackTailor. AttackTailor applies ϵ = 0.5 and ϵ = 100 based on the simulation.

Figure 4: Poisoned data mixed in the public dataset.

Figure 5: Attacks out of the server's expectation.


Figure 6: iid setting on three datasets. RobustTailor includes Krum and Comed. AttackTailor includes ϵ = 0.1/0.5 and ϵ = 100. Panels: (a) MNIST, ϵ = 0.5; (b) MNIST, ϵ = 100; (c) MNIST, AttackTailor; (d) FMNIST, ϵ = 0.1; (e) FMNIST, ϵ = 100; (f) FMNIST, AttackTailor; (g) CIFAR10, ϵ = 0.1; (h) CIFAR10, ϵ = 100; (i) CIFAR10, AttackTailor.

Figure 9: More aggregators added into the RobustTailor structure against single attacks.

Figure 10: The impact of the proportion of public data.

Figure 11: Subsampling by the adversary.

Figure 12: Dynamic attack strategy of the adversary.

Input: initial weight vector x_0, learning rate η_t, iteration rounds T, number of clients n, set of aggregation rules A, set of attack algorithms F, and public dataset. for t = 1 to T do: the server sends x_t to all clients.

We use E[·], ∥·∥, ∥·∥_0, and ∥·∥_* to denote the expectation operator, Euclidean norm, number of nonzero elements of a vector, and dual norm, respectively. We use |·| to denote the length of a binary string, the length of a vector, and the cardinality of a set. We use lower-case bold letters to denote vectors. Sets are typeset in a calligraphic font. The base-2 logarithm is denoted by log, and the set of binary strings is denoted by {0, 1}*. We use [n] to denote {1, ..., n} for an integer n. We use ∆_M to denote the probability simplex in R^M.

A COMPLETE RELATED WORK

Federated learning (FL). FL (McMahan et al., 2017; Konečnỳ et al., 2016) keeps training data decentralized across multiple clients, which collaboratively train a model under the orchestration of a server (Kairouz et al., 2021). For the server, such clients are often more unpredictable and especially more vulnerable to attacks. Secure aggregation protocols (Bonawitz et al., 2017; So et al., 2020) ensure that the server computes aggregated updates without revealing the original data. In this paper, we focus on training-time attacks and the corresponding aggregation rules.

Training Hyper-parameters for Fashion-MNIST and MNIST

Comparison between the adversary with and without dynamic strategy.

Computational Complexity based on MNIST after running 15k iterations

Algorithm 5 AttackTailor

Input: updating rates λ_1 and λ_2, simulation rounds K, gradients of honest clients {g_i}_{i=1}^{n−f}, A, and F.
Initialize the weight vectors w_0(i) = 1 for i ∈ [M] and v_0(j) = 1 for j ∈ [S].
for k = 1 to K do
  Set p_k(AG_i) = (1 − λ_1) w_k(i) / Σ_{i=1}^M w_k(i) + λ_1/M for i = 1, ..., M.
  Set q_k(AT_j) = (1 − λ_2) v_k(j) / Σ_{j=1}^S v_k(j) + λ_2/S for j = 1, ..., S.
  Sample AG_k ∼ p_k and AT_k ∼ q_k, respectively.
  Estimate the loss and update the weight vectors w and v accordingly.
Sample AT ∼ q.
Output: AT.

Algorithm 6 Adversary's attack

Input: learning rate η_t, n workers, f compromised workers, iteration rounds T, A, and F.
for t = 1 to T do
  Observe all gradients of the honest workers {g_i}_{i=1}^{n−f}.
  Call Algorithm 5 to tailor an attack: AT_t = AttackTailor({g_i}_{i=1}^{n−f}, A, F).
  Produce f gradients for the compromised clients: set b_j = AT_t({g_i}_{i=1}^{n−f}, A) for j ∈ [f].
  Send the compromised gradients {b_j}_{j=1}^f to the server.
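The outer loop of Algorithm 6 can be sketched as below, with AttackTailor replaced by a stub returning a fixed scaling attack; the stub and the gradient shapes are illustrative assumptions.

```python
import numpy as np

def attack_tailor_stub(honest_grads):
    """Placeholder for the simulation-based selection of Algorithm 5 (assumed)."""
    return lambda grads: -100.0 * np.mean(grads, axis=0)

def adversary_round(honest_grads, f):
    """One round of Algorithm 6: tailor an attack, emit f compromised gradients."""
    attack = attack_tailor_stub(honest_grads)
    b = attack(honest_grads)
    return np.tile(b, (f, 1))

honest = np.full((10, 3), 0.2)     # toy honest gradients from n - f = 10 workers
byz = adversary_round(honest, f=2)
```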

E PROOF OF LEMMA 1

Proof. Define ī ∼ p̄ to be uniformly sampled from {i_k}_{k=1}^K, and j̄ ∼ q̄ to be uniformly sampled from {j_k}_{k=1}^K. Using the no-regret property from (8) for any i ∈ [M] and j ∈ [S], where the expectation is taken over ī, j̄, and the randomness of the algorithms, and subtracting the two resulting inequalities, we observe, by first invoking the inequality with i ∼ p̄ and then with j ∼ q̄, that (p̄, q̄) is an ε-approximate Nash equilibrium with ε = (R_K^i + R_K^j)/K.

In the non-iid construction, the biased 100µ% share of each client's local data is drawn from the class corresponding to the client index, and the remaining 100(1 − µ)% is drawn iid from all classes. A small µ represents low disparity, while a large µ means significant disparity among clients. Fig. 7 shows three non-iid settings, µ = 0.1, 0.5, 0.9 (e.g., panels (g)-(i) show µ = 0.9 under the ϵ = 0.5, ϵ = 100, and AttackTailor attacks). We observe, somewhat surprisingly, that RobustTailor shows a satisfactory level of robustness even under heterogeneous data settings.

More Byzantines. Fig. 8 shows the results when there are 4 Byzantines among 12 total clients under three different attacks. Except for the number of compromised clients, which is four instead of two, the setting is the same as that in Fig. 1. We observe that both Krum and Comed are sensitive to the number of Byzantine clients, while RobustTailor is much more stable. Specifically, Krum's accuracy drops close to zero, and Comed shows more pronounced fluctuations.

More aggregators against single attacks. To present intuitive results, we construct the server's pool with only two aggregators, Krum and Comed, in the main text. However, RobustTailor can outperform both Krum and Comed simultaneously when additional aggregators are put into the server's pool. Trimmedmean (TM) (Yin et al., 2018), Geomed (GM) (Pillutla et al., 2022), and Bulyan (Guerraoui et al., 2018) are also statistics-based Byzantine-resilient aggregators, and all of them can be added to RobustTailor.
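Minimal sketches of two of the aggregators named above, coordinate-wise trimmed mean and the geometric median via Weiszfeld iterations, are given below; the trim amount and iteration count are assumptions of this sketch.

```python
import numpy as np

def trimmed_mean(grads, trim=2):
    """Coordinate-wise trimmed mean: drop the `trim` largest and smallest per coordinate."""
    s = np.sort(grads, axis=0)
    return s[trim:len(grads) - trim].mean(axis=0)

def geometric_median(grads, iters=50, eps=1e-8):
    """Geometric median via Weiszfeld fixed-point iterations."""
    z = grads.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(grads - z, axis=1) + eps   # distances to current estimate
        z = (grads / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return z

# Ten honest gradients at the origin plus two large Byzantine outliers:
honest = np.zeros((10, 3))
byz = np.full((2, 3), 100.0)
grads = np.vstack([honest, byz])
tm = trimmed_mean(grads, trim=2)
gm = geometric_median(grads)
```

Both estimates remain near the honest gradients despite the outliers, which is the property these aggregators are chosen for.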

H COMPUTATIONAL COMPLEXITY

The computational complexity bound depends on the simulation in the inner loop (including the number of simulation rounds K, the aggregator set A, and the attack set F) and the problem dimensions of the outer loop (including the number of clients n and the dimension of the gradients). We show the theoretical analysis below and use empirical results to show that it is worthwhile to trade a small amount of extra computation for a substantially more robust model.

