GAIN: ENHANCING BYZANTINE ROBUSTNESS IN FEDERATED LEARNING WITH GRADIENT DECOMPOSITION

Abstract

Federated learning provides a privacy-aware learning framework by enabling participants to jointly train models without exposing their private data. However, federated learning has exhibited vulnerabilities to Byzantine attacks, where the adversary aims to destroy the convergence and performance of the global model. Meanwhile, we observe that most existing robust AGgregation Rules (AGRs) fail to stop the aggregated gradient from deviating from the optimal gradient (the average of honest gradients) in the non-IID setting. We attribute the failure of these AGRs to two newly proposed concepts: identification failure and integrity failure. The identification failure mainly comes from the exacerbated curse of dimensionality in the non-IID setting. The integrity failure is a combined result of conservative filtering strategies and gradient heterogeneity. In order to address both failures, we propose GAIN, a gradient decomposition scheme that helps adapt existing robust algorithms to heterogeneous datasets. We also provide convergence analysis for integrating existing robust AGRs into GAIN. Experiments on various real-world datasets verify the efficacy of our proposed GAIN.

1. INTRODUCTION

Federated Learning (FL) (McMahan et al., 2017) is a privacy-aware distributed machine learning paradigm. It has recently attracted widespread attention as a result of emerging data silos and growing privacy awareness. In this paradigm, data owners (clients) repeatedly use their private data to compute local gradients and send them to a central server for aggregation. In this way, clients can collaborate to train a model without exposing their private data.

However, the distributed nature of FL also makes it vulnerable to Byzantine attacks (Blanchard et al., 2017; Guerraoui et al., 2018). During the training phase, Byzantine clients can send arbitrary messages to the central server to bias the global model. Moreover, it is challenging for the central server to identify the Byzantine clients, since the server can neither access clients' training data nor monitor local training processes. In order to defend against Byzantine attacks, the community has proposed a wealth of defenses (Blanchard et al., 2017; Guerraoui et al., 2018; Yin et al., 2018). Most defenses abandon the averaging step adopted by conventional FL frameworks, e.g., FedAvg (McMahan et al., 2017). Instead, they use robust AGgregation Rules (AGRs) to aggregate local gradients in order to defend against Byzantine attacks.

Most existing robust AGRs assume that the data distribution on different clients is independently and identically distributed (IID) (Bernstein et al., 2018; Ghosh et al., 2019). However, the data is usually not independently and identically distributed (non-IID) in real-world FL applications (McMahan et al., 2017; Karimireddy et al., 2020; Kairouz et al., 2021). As a result, in more realistic non-IID settings, most robust AGRs fail to defend against Byzantine attacks and thus suffer from significant performance degradation (Karimireddy et al., 2022; Acharya et al., 2022). To investigate the cause of the degradation, we perform a thorough experimental study on various robust AGRs.
Close inspection reveals that the reason behind the degradation differs across AGRs with different aggregation strategies. Conservative AGRs, which aggregate only a few gradients to get rid of Byzantines, suffer from integrity failure: the AGR can identify only a few honest gradients for aggregation. This failure leads to an aggregated gradient with limited utility due to the gradient heterogeneity (Li et al., 2020; Karimireddy et al., 2020) in the non-IID setting. Radical AGRs, which aggregate as many gradients as possible to avoid such a deviation, suffer from identification failure: the AGR fails to distinguish between honest and Byzantine gradients. This failure is mainly due to the curse of dimensionality (Guerraoui et al., 2018; Diakonikolas et al., 2017) aggravated by the non-IIDness. Both failures cause the aggregated gradient to deviate from the optimal gradient (the average of honest gradients). As a result, most existing AGRs fail to achieve satisfactory performance in the non-IID setting.

Motivated by the above observations, we propose a GrAdient decomposItioN method called GAIN that can handle both failures in various non-IID settings. In particular, to address the identification failure due to the curse of dimensionality, GAIN decomposes each high-dimensional gradient into low-dimensional groups for gradient identification. Then, GAIN incorporates gradients with low identification scores into the final aggregation to tackle the integrity failure. Our contributions in this work are summarized below.
• We reveal the root causes of the performance degradation of current robust AGRs in the non-IID setting by proposing two new concepts: integrity failure and identification failure. Integrity failure originates from the gradient heterogeneity, and identification failure results from the aggravated curse of dimensionality in the non-IID setting.
• We propose a novel and compatible approach called GAIN, which applies robust AGRs to the decomposed gradients, followed by identification before aggregation, rather than directly operating on the original gradients as existing defenses (Multi-Krum (Blanchard et al., 2017), Bulyan (Guerraoui et al., 2018), etc.) do.
• We provide convergence analysis for integrating existing robust AGRs into GAIN. In particular, we provide an upper bound for the sum of gradient norms.
• We offer empirical experiments on three real-world datasets across various settings to validate the effectiveness and superiority of our GAIN.

2. RELATED WORKS

Byzantine robust learning was first introduced by Blanchard et al. (2017). Subsequently, a range of works study robustness against Byzantine attacks by proposing various robust AGgregation Rules (AGRs) under the IID setting. Generally, we can classify current robust AGRs into two categories: conservative AGRs and radical AGRs. Typical conservative AGRs, including Bulyan (Guerraoui et al., 2018), Median (Yin et al., 2018), Trimmed Mean (Yin et al., 2018), etc., aggregate only a few gradients to reduce the risk of introducing Byzantine gradients. Bulyan (Guerraoui et al., 2018) applies a variant of the trimmed mean as a post-processing step to handle the curse of dimensionality. Yin et al. (2018) theoretically analyze the statistical optimality of Median and Trimmed Mean. Radical AGRs, e.g., Multi-Krum (Blanchard et al., 2017) and DnC (Shejwalkar & Houmansadr, 2021), incorporate as many gradients as possible to avoid the deviation caused by excluding honest gradients. Multi-Krum is a distance-based AGR proposed by Blanchard et al. (2017). Pillutla et al. (2019) discuss the Byzantine robustness of the Geometric Median and propose a computationally efficient approximation. Shejwalkar & Houmansadr (2021) propose to perform dimensionality reduction using random sampling, followed by spectral-based outlier removal. Recently, a number of works (Allen-Zhu et al., 2020; Karimireddy et al., 2021; Farhadkhani et al., 2022) discuss the effect of distributed momentum on Byzantine robustness from different perspectives. However, in more realistic FL applications where the data is non-IID, the efficacy of these defenses is quite limited: they fail to obtain high-quality aggregated gradients in the non-IID setting and thus suffer from significant performance degradation.

Recent works have also explored defenses applicable to the non-IID setting. Park et al. (2021) can only achieve Byzantine robustness when the server has a validation set, which compromises the privacy principle of FL (McMahan et al., 2017). Data & Diggavi (2021) adapt a robust mean estimation algorithm to FL to combat Byzantines in the non-IID setting. However, it requires $\Omega(d^2)$ time ($d$ is the number of model parameters), which is unacceptable given the high dimensionality of model parameters. El-Mhamdi et al. (2021) consider Byzantine robustness in asynchronous communication and unconstrained-topology settings. Acharya et al. (2022) propose to apply the geometric median only to sparsified gradients to save computation cost. Karimireddy et al. (2022) perform a bucketing step before aggregation to reduce the gradient heterogeneity. These methods guarantee convergence of SGD in the presence of Byzantines. However, convergence is not enough in the non-convex, high-dimensional setting of neural networks (Guerraoui et al., 2018): these methods lack the guarantee that the aggregated gradient does not deviate from the optimal gradient (the average of honest gradients). As a result, they may lead to convergence towards ineffectual models.

3. NOTATIONS AND PRELIMINARIES

Notations. For any positive integer $n \in \mathbb{N}^+$, we denote the set $\{1, \ldots, n\}$ by $[n]$. The cardinality of a set $S$ is denoted by $\#S$ or $|S|$. For a real number $x \in \mathbb{R}$, we use $|x|$ to denote the absolute value of $x$. We denote the $\ell_2$ norm of a vector $x$ by $\|x\|$. We use $[x]_j$ to represent the $j$-th component of vector $x$. The sub-vector of $x$ indexed by an index set $\mathcal{J}$ is denoted by $[x]_{\mathcal{J}} = ([x]_{j_1}, \ldots, [x]_{j_k})$, where $\mathcal{J} = \{j_1, \ldots, j_k\}$ and $k = |\mathcal{J}|$ is the number of indices. For a random variable $X$, we use $\mathbb{E}[X]$ and $\mathrm{Var}[X]$ to denote the expectation and variance of $X$, respectively.

Federated learning. We consider a federated learning system with a central server and $n$ clients, following Blanchard et al. (2017); Yin et al. (2018); Guerraoui et al. (2018). The objective is to minimize the loss $L(w)$ defined as
$$L(w) = \frac{1}{n} \sum_{i=1}^{n} L_i(w), \quad \text{where } L_i(w) = \mathbb{E}_{\xi_i}[L(w; \xi_i)], \; i \in [n],$$
where $w$ is the model parameter, $L_i$ is the loss function on the $i$-th client, $\xi_i$ is the data distribution on the $i$-th client, and $L(w; \xi)$ is the loss function. In the $t$-th communication round, the server distributes the parameter $w^t$ to the clients. Each client $i$ conducts several epochs of local training on its local data to obtain the updated local parameter $w_i^t$. Then, client $i$ computes the local gradient $g_i^t$ as follows and sends it to the server:
$$g_i^t = w^t - w_i^t.$$
Finally, the server collects the local gradients and uses the average gradient to update the global model:
$$w^{t+1} = w^t - g^t, \quad g^t = \frac{1}{n} \sum_{i=1}^{n} g_i^t.$$
This process is repeated until the global model converges or the number of communication rounds reaches the preset value $T$.

Byzantine threat model. In real-world applications, not all clients in FL systems are honest. In other words, there may exist Byzantine clients in FL systems (Blanchard et al., 2017). Suppose that an adversary controls $f$ Byzantine clients among the total $n$ clients.
Let $\mathcal{B} \subset [n]$ denote the set of Byzantine clients and $\mathcal{H} = [n] \setminus \mathcal{B}$ denote the set of honest clients. In the presence of Byzantine clients, the uploaded message of client $i$ in the $t$-th communication round is
$$g_i^t = \begin{cases} w^t - w_i^t, & i \in \mathcal{H}, \\ *, & i \in \mathcal{B}, \end{cases}$$
where $*$ represents an arbitrary value.

Robust AGRs. Most solutions replace the averaging step by a robust alternative to defend against Byzantine attacks. More specifically, the server aggregates the gradients and updates the global model as follows:
$$w^{t+1} = w^t - \hat{g}^t, \quad \hat{g}^t = \mathcal{A}(g_1^t, \ldots, g_n^t),$$
where $\hat{g}^t$ is the aggregated gradient and $\mathcal{A}$ is a robust AGR, e.g., Multi-Krum (Blanchard et al., 2017) or Bulyan (Guerraoui et al., 2018). For notational simplicity, we omit the superscript $t$ of the gradient symbols in the rest of this paper when there is no ambiguity.
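As a minimal illustration of the round structure above, the sketch below (a NumPy toy, not the paper's implementation; the client gradients are made-up values) shows the averaging update $w^{t+1} = w^t - g^t$ and how a single Byzantine client can destroy it:

```python
import numpy as np

def fl_round(w, gradients, aggregate):
    """One communication round: the server aggregates the uploaded
    pseudo-gradients g_i = w^t - w_i^t and updates the global model."""
    return w - aggregate(gradients)

def average(gradients):
    """Plain FedAvg aggregation: g = (1/n) * sum_i g_i."""
    return np.mean(gradients, axis=0)

w = np.zeros(4)
honest = [np.full(4, -0.1), np.full(4, -0.2), np.full(4, -0.3)]  # g_i = w - w_i
byzantine = [np.full(4, 1e6)]  # a Byzantine client may send an arbitrary value "*"

w_clean = fl_round(w, honest, average)                  # well-behaved update
w_attacked = fl_round(w, honest + byzantine, average)   # averaging is destroyed
```

Replacing `average` with a robust AGR $\mathcal{A}$ is exactly the defense template the rest of the paper builds on.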

4. FAILURES OF EXISTING ROBUST AGRS IN THE NON-IID SETTING

Most robust AGRs focus on Byzantine robustness in the IID setting (Blanchard et al., 2017; Guerraoui et al., 2018). When the data is non-IID, the performance of these robust AGRs drops drastically (Shejwalkar & Houmansadr, 2021; Karimireddy et al., 2022). In order to understand the root cause of this performance drop, we perform an experimental study on various robust AGRs. In particular, we examine the behaviors of robust AGRs under an attack by 20% Byzantines in both IID and non-IID settings on CIFAR-10 (Krizhevsky et al., 2009).

Integrity failure of a conservative AGR. We take a closer look at a representative conservative AGR, Bulyan (Guerraoui et al., 2018). Specifically, we consider an indicator called the honest ratio: the ratio of the number of selected honest clients to the number of all selected clients (# selected honest clients / # selected clients) of a robust AGR in each communication round. A higher honest ratio suggests a higher proportion of honest gradients among the gradients aggregated by the AGR. In particular, an honest ratio of 1 (0) indicates that all gradients the AGR aggregates are honest (Byzantine). In Figure 1 (a), we report the honest ratio of the conservative AGR (Bulyan) in both IID and non-IID settings. The results show that in both settings, all the gradients aggregated by the conservative AGR are honest, which demonstrates the strong Byzantine filtering ability of a conservative AGR. Unfortunately, we find that Byzantine filtering is not enough in the non-IID setting. The results in Figure 1 (b) illustrate that the accuracy is significantly lower in the non-IID setting. This performance degradation implies a sharp deviation of the aggregated gradient from the optimal gradient (the average of honest gradients) in the non-IID setting. Honest gradients are heterogeneous when the data is non-IID (Li et al., 2020; Wang et al., 2021; Karimireddy et al., 2020).
As a result, aggregating only part of the honest gradients deviates the aggregated gradient from the optimal gradient and eventually leads to ineffectual models. Therefore, in addition to Byzantine filtering, it is also crucial to incorporate sufficiently many honest gradients into the aggregation in the non-IID setting.

Identification failure of a radical AGR. We closely examine a typical radical AGR, Multi-Krum (Blanchard et al., 2017). In Figure 1 (c), we show the honest ratio of the radical AGR in both IID and non-IID settings. As shown in the figure, the radical AGR succeeds in identifying honest gradients for aggregation in the IID setting but fails in the non-IID setting. The critical reason is that while the curse of dimensionality (Guerraoui et al., 2018) can be provably addressed in the IID setting, it is aggravated and becomes intractable in the non-IID setting. When the data is non-IID, Byzantines can easily exploit the curse of dimensionality to compromise radical AGRs, thus degrading the utility of the global model as shown in Figure 1 (d). Therefore, it is critical to overcome the aggravated curse of dimensionality in the non-IID setting.

As analyzed above, most robust AGRs suffer from identification failure and integrity failure due to gradient heterogeneity and the aggravated curse of dimensionality in the non-IID setting. As a result, they fail to stop the aggregated gradient from deviating from the optimal gradient, which leads to unsatisfactory performance in the non-IID setting.
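The honest-ratio diagnostic used throughout this section can be computed as below; this is a minimal sketch, and the index-based bookkeeping (client indices for the selected and Byzantine sets) is an assumption for illustration:

```python
def honest_ratio(selected, byzantine):
    """Honest ratio = (# selected honest clients) / (# selected clients).
    `selected`: indices of the clients the AGR aggregated this round;
    `byzantine`: indices of Byzantine clients (known only in simulation)."""
    selected = set(selected)
    honest = selected - set(byzantine)
    return len(honest) / len(selected) if selected else 0.0

# n = 10 clients; clients 8 and 9 are Byzantine.
print(honest_ratio(selected=[0, 1, 2, 3], byzantine=[8, 9]))  # 1.0: perfect filtering
print(honest_ratio(selected=[0, 1, 8, 9], byzantine=[8, 9]))  # 0.5: identification failure
```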

5. PROPOSED METHOD

Our observations in Sec. 4 clearly motivate the need for a more robust defense against Byzantine attacks in the non-IID setting. Inspired by these observations, we propose a novel GrAdient decomposItioN method called GAIN, which consists of the following three stages.

Decomposition. First, GAIN decomposes the gradients for gradient identification. The decomposition is specified by a partition of the set $[d]$, where $d$ is the dimension of the gradients. Let $\{\mathcal{J}_1, \ldots, \mathcal{J}_p\}$ denote the partition, where $p$ is the number of groups. In particular, the partition satisfies
$$\mathcal{J}_q \neq \emptyset, \;\; \forall q \in [p], \qquad \bigcup_{q=1}^{p} \mathcal{J}_q = [d], \qquad \mathcal{J}_q \cap \mathcal{J}_{q'} = \emptyset, \;\; \forall q, q' \in [p], \; q \neq q',$$
where $\emptyset$ denotes the empty set, $\cup$ the union of sets, and $\cap$ the intersection of sets. Each gradient $g_i$ is correspondingly decomposed into $p$ sub-vectors:
$$g_i^{(q)} = [g_i]_{\mathcal{J}_q}, \quad i \in [n], \; q \in [p],$$
where $g_i^{(q)}$ is the $q$-th sub-vector of gradient $g_i$.

Identification. Then, GAIN applies an arbitrary robust AGR $\mathcal{A}$ to each group of sub-vectors corresponding to $\mathcal{J}_q$:
$$\hat{g}^{(q)} = \mathcal{A}(g_1^{(q)}, \ldots, g_n^{(q)}), \quad q \in [p],$$
where $\hat{g}^{(q)}$ is the aggregation result of group $q$. By performing aggregation on groups of low-dimensional sub-vectors, GAIN can circumvent the curse of dimensionality and thus avoid the identification failure discussed in Sec. 4. In other words, $\hat{g}^{(q)}$ can get rid of Byzantines. Note that $\hat{g}^{(q)}$ may still suffer from deviation due to the integrity failure of the AGR $\mathcal{A}$, as illustrated in Sec. 4. Therefore, it is inappropriate to directly use the aggregation results $\{\hat{g}^{(q)}, q \in [p]\}$ as the final output. Instead, we use $\hat{g}^{(q)}$ as an honest reference to compute identification scores for each client:
$$s_i^{(q)} = \|g_i^{(q)} - \hat{g}^{(q)}\|, \quad i \in [n], \; q \in [p].$$
Since the group-wise aggregation result $\hat{g}^{(q)}$ can get rid of Byzantines, the identification score $s_i^{(q)}$ can provably characterize the potential of $g_i^{(q)}$ being a sub-vector of an honest gradient.
Then, GAIN collects the identification scores from all groups and computes the final aggregation result. In particular, the final identification score $s_i$ of each client is the sum of its identification scores over all groups:
$$s_i = \sum_{q=1}^{p} s_i^{(q)}, \quad i \in [n].$$

Aggregation. To avoid integrity failure, GAIN selects the $n - f$ gradients with the lowest identification scores for aggregation. Let $\mathcal{I}$ denote the index set of selected gradients; the average of the selected gradients is output as the final aggregation result:
$$\hat{g} = \frac{1}{n-f} \sum_{i \in \mathcal{I}} g_i.$$

Note that the base AGR $\mathcal{A}$ used in the identification stage of GAIN can be any $c$-resilient AGR (Definition 1). The key difference is that existing robust AGRs (Multi-Krum, Bulyan, etc.) directly operate on the original gradients; instead, we propose to apply robust AGRs to the decomposed gradients, followed by identification before aggregation. In this way, we can enhance the identification ability and integrity of current robust AGRs that satisfy the $c$-resilient property (Definition 1) in the non-IID setting. Detailed theoretical analysis and empirical support are provided in Sec. 6 and Sec. 7, respectively.
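The three stages above can be sketched as follows, with the coordinate-wise Median (one of the $c$-resilient AGRs discussed in the paper) as the plug-in base AGR. The function names, the NumPy representation, and the toy gradients are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gain(gradients, f, partition, agr):
    """GAIN sketch: (1) decompose each gradient into sub-vectors along a
    partition of the coordinate set; (2) apply the base AGR per group and
    score each client by its distance to the group result; (3) average the
    n - f gradients with the lowest total scores."""
    G = np.stack(gradients)                          # shape (n, d)
    n = G.shape[0]
    scores = np.zeros(n)
    for J in partition:                              # identification, group by group
        sub = G[:, J]                                # sub-vectors g_i^{(q)}
        ref = agr(sub)                               # honest reference \hat g^{(q)}
        scores += np.linalg.norm(sub - ref, axis=1)  # s_i^{(q)}
    keep = np.argsort(scores)[: n - f]               # n - f lowest total scores
    return G[keep].mean(axis=0)                      # final aggregation

def median_agr(sub):
    """Coordinate-wise median, used here as the c-resilient base AGR."""
    return np.median(sub, axis=0)

rng = np.random.default_rng(0)
d, p = 8, 4
honest = [np.ones(d) + 0.01 * rng.standard_normal(d) for _ in range(8)]
byz = [np.full(d, 100.0) for _ in range(2)]          # f = 2 attackers
partition = np.array_split(rng.permutation(d), p)    # random equal-size groups
g_hat = gain(honest + byz, f=2, partition=partition, agr=median_agr)
# g_hat stays close to the honest average, i.e. roughly the all-ones vector
```

Any other aggregation rule with the same "list of sub-vectors in, one sub-vector out" signature can be swapped in for `median_agr`.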

6. THEORETICAL ANALYSIS

In this section, we provide a theoretical convergence analysis for our GAIN. We analyze a popular FL model widely considered by Karimireddy et al. (2021; 2022); Acharya et al. (2022). In particular, each local gradient is computed by SGD as
$$g_i^t = \nabla L(w^t; \xi_i^t), \quad i \in [n],$$
where $\xi_i^t$ represents a minibatch uniformly sampled from the local data distribution $\xi_i$ in the $t$-th communication round, and $\nabla L(w^t; \xi_i^t)$ represents the gradient of the loss over the minibatch $\xi_i^t$. We make the following assumptions, which are standard in FL (Karimireddy et al., 2021; 2022; Acharya et al., 2022).

Assumption 1 (Unbiased Estimator). The stochastic gradients sampled from any local data distribution are unbiased estimators of the local gradients over $\mathbb{R}^d$ for all clients, i.e.,
$$\mathbb{E}_{\xi_i^t}[\nabla L(w; \xi_i^t)] = \nabla L_i(w), \quad \forall w \in \mathbb{R}^d, \; i \in [n], \; t \in \mathbb{N}^+.$$

Assumption 2 (Bounded Variance). The variance of the stochastic gradients sampled from any local data distribution is uniformly bounded over $\mathbb{R}^d$ for all clients, i.e., there exists $\sigma \geq 0$ such that
$$\mathbb{E}\|\nabla L(w; \xi_i^t) - \nabla L_i(w)\|^2 \leq \sigma^2, \quad \forall w \in \mathbb{R}^d, \; i \in [n], \; t \in \mathbb{N}^+.$$

Assumption 3 (Gradient Dissimilarity). The difference between the local gradients and the global gradient is uniformly bounded over $\mathbb{R}^d$ for all clients, i.e., there exists $\kappa \geq 0$ such that
$$\|\nabla L_i(w) - \nabla L(w)\|^2 \leq \kappa^2, \quad \forall w \in \mathbb{R}^d, \; i \in [n].$$

We consider an arbitrary non-convex loss function $L(\cdot)$ that satisfies the following Lipschitz condition, which is widely applied in the convergence analysis of Byzantine-robust federated learning (Karimireddy et al., 2022; Allen-Zhu et al., 2020; El-Mhamdi et al., 2021).

Assumption 4 (Lipschitz Smoothness). The loss function is $L$-Lipschitz smooth over $\mathbb{R}^d$, i.e.,
$$\|\nabla L(w) - \nabla L(w')\| \leq L \|w - w'\|, \quad \forall w, w' \in \mathbb{R}^d.$$

Assumption 1 establishes the unbiasedness of the stochastic gradients. Assumption 2 bounds the variance of the stochastic gradients within a client.
Assumption 3 is a common measure of the non-IID level in federated learning (Data & Diggavi, 2021; Karimireddy et al., 2020; 2022). We further formalize the Byzantine resilience of the base AGR $\mathcal{A}$.

Definition 1 ($c$-resilient AGR). Let $\mathcal{A}$ be an AGR. If for any input $\{x_1, \ldots, x_n\}$ such that there exists a set $\mathcal{H} \subseteq [n]$ of size $|\mathcal{H}| > n/2$ satisfying
$$\mathbb{E}\|x_i - x_{i'}\|^2 \leq \rho^2, \quad \forall i, i' \in \mathcal{H},$$
the output of $\mathcal{A}$ satisfies
$$\mathbb{E}\|\mathcal{A}(x_1, \ldots, x_n) - \bar{x}\|^2 \leq c\rho^2, \quad \text{where } \bar{x} = \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} x_h,$$
then the AGR $\mathcal{A}$ is called $c$-resilient.

In fact, most popular AGRs (Blanchard et al., 2017; Guerraoui et al., 2018; Karimireddy et al., 2021; 2022) are shown to satisfy this $c$-resilient definition (Farhadkhani et al., 2022). We show that given any $c$-resilient base AGR $\mathcal{A}$, our GAIN can help the global model reach a better parameter point.

Proposition 1. Suppose Assumptions 1 to 4 hold, and let $\eta = 1/2L$. Given a $c$-resilient robust AGR $\mathcal{A}$, starting from $w^0$ and running GAIN for $T$ communication rounds satisfies
$$L(w^0) \geq \frac{3}{16L} \sum_{t=1}^{T} \left( \|\nabla L(w^t)\|^2 - e^2 \right), \quad \text{where } e^2 = O\!\left( \frac{f^2}{(n-f)^2} (\kappa^2 + \sigma^2) \left( 1 + c^2 + \frac{1}{n-f} \right) \left( 1 + \frac{n}{p} \right) \right).$$

Please refer to Appendix B for the proof. On the one hand, Proposition 1 provides an upper bound for the sum of gradient norms in the presence of Byzantine gradients. The bound indicates that as the number of communication rounds increases, we can find an approximately optimal parameter $w$ with $\|\nabla L(w)\| \leq e$. Furthermore, as the number of sub-vectors $p$ increases, the approximation becomes better, i.e., $e^2$ decreases, which validates the efficacy of our method. On the other hand, Proposition 1 characterizes the fundamental difficulty of Byzantine-robust federated learning in the non-IID setting. The negative term $-e^2$ on the right-hand side implies that FL may never reach a convergence point; instead, the global model may wander among sub-optimal points.
Moreover, even after reaching a convergence point, the global model may step to a sub-optimal point in the next communication round. A detailed comparison of the convergence rate between our method and recent works is presented in Appendix B.2.

For FEMNIST, the data is naturally partitioned into 3,597 clients based on the writer of the digit/character. For each client, we randomly sample a 0.9 portion of the data as training data and keep the remaining 0.1 portion as test data, following Caldas et al. (2018). Intuitively, the data distribution across different clients is non-IID.

7. EXPERIMENTS

Evaluated attacks. We consider six representative attacks: BitFlip (Allen-Zhu et al., 2020), LabelFlip (Allen-Zhu et al., 2020), LIE (Baruch et al., 2019), Min-Max (Shejwalkar & Houmansadr, 2021), Min-Sum (Shejwalkar & Houmansadr, 2021), and IPM (Xie et al., 2020). The detailed hyperparameter settings of the attacks are shown in Table 5 in Appendix D.

Baselines. We consider six robust AGRs: Multi-Krum (Blanchard et al., 2017), Bulyan (Guerraoui et al., 2018), Median (Yin et al., 2018), RFA (Pillutla et al., 2019), DnC (Shejwalkar & Houmansadr, 2021), and RBTM (El-Mhamdi et al., 2021). Among these six defenses, Bulyan, Median, and RBTM are conservative, and Multi-Krum, RFA, and DnC are radical. We compare each AGR with its GAIN-integrated variant. The detailed hyperparameter settings of the robust AGRs are listed in Table 6 in Appendix D.

Evaluation. We use top-1 accuracy, i.e., the proportion of correctly predicted testing samples among all testing samples, to evaluate the performance of global models. We run each experiment five times and report the mean and standard deviation of the highest accuracy during the training process.

Other settings. We utilize AlexNet (Krizhevsky et al., 2017), SqueezeNet (Iandola et al., 2016), ResNet-18 (He et al., 2016), and a four-layer CNN (Caldas et al., 2018) for CIFAR-10, CIFAR-100, ImageNet-12, and FEMNIST, respectively. The number of Byzantine clients for all datasets is set to $f = 0.2n$. For the decomposition, we randomly partition the index set $\{1, \ldots, d\}$ into $p$ disjoint subsets of equal size. Please refer to Table 4 in Appendix D for more details.
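The random equal-size partition of $\{1, \ldots, d\}$ described above can be sketched as follows (a minimal NumPy sketch; it is 0-indexed for convenience, the seed is an assumption, and `np.array_split` yields near-equal group sizes when $p$ does not divide $d$):

```python
import numpy as np

def random_partition(d, p, seed=0):
    """Randomly partition the index set {0, ..., d-1} into p disjoint groups
    of (near-)equal size, as used for GAIN's decomposition stage."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(d), p)

groups = random_partition(d=10, p=3)
assert sum(len(J) for J in groups) == 10                  # covers every index
assert len({int(x) for J in groups for x in J}) == 10     # groups are disjoint
```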

Main results

Table 1 illustrates the results of different defenses against popular attacks on CIFAR-10, CIFAR-100, ImageNet-12, and FEMNIST. From the table, we observe that: (1) Defenses integrated into our GAIN generally outperform their original versions on all datasets, which verifies the efficacy of our proposed GAIN. For example, GAIN improves the accuracy of Median by 15.93% under the Min-Sum attack on CIFAR-10. (2) The improvement of DnC+GAIN over DnC is relatively mild on CIFAR-10. Our interpretation is that when the dataset is relatively small and simple, DnC is capable of obtaining a reasonable gradient estimate. Nevertheless, on larger and more complex datasets, i.e., FEMNIST and ImageNet-12, DnC fails to achieve satisfactory performance under Byzantine attacks. (3) Although RFA collapses on FEMNIST, integrating it into our GAIN still restores satisfactory performance. Our explanation is that although the aggregated gradient of RFA deviates from the optimal gradient, it can still assist in identifying honest gradients when combined with GAIN. As a result, RFA+GAIN remains effective on FEMNIST. (4) The improvement of GAIN on conservative methods is greater. We attribute this phenomenon to the gradient heterogeneity caused by non-IID data. Excluding honest gradients deviates the aggregated gradient from the average of honest gradients, thus degrading the performance of conservative methods. When the non-IID degree increases, the gradient heterogeneity increases; as a result, the impact of excluding honest gradients may even exceed that of incorporating Byzantine gradients. Therefore, the improvement on conservative AGRs is greater.

Number of sub-vectors. We study the influence of the sub-vector number $p$ on the heterogeneous CIFAR-10 dataset. Figure 2 shows the performance and honest ratio of a conservative AGR (Bulyan) and a radical AGR (Multi-Krum) across $p = \{100, 1000\}$.
The results show that for both conservative and radical AGRs, GAIN with a larger sub-vector number $p$ selects a higher proportion of honest clients and achieves better performance. When the sub-vector number $p$ increases, our GAIN can better handle the identification failure and the integrity failure, which corresponds to our theoretical analysis in Sec. 6.

Results on different levels of non-IID. We discuss the impact of the non-IID level of data distributions. We modify the concentration parameter $\beta$ to change the non-IID level; a smaller $\beta$ implies a higher non-IID level. Table 2 reports the accuracy of different defenses under the LIE attack on the CIFAR-10 dataset across $\beta = \{0.3, 0.7\}$. Other setups follow the default setup of the main experiments as described in Sec. 7.1 and Appendix D. As shown in Table 2, all the AGRs combined with GAIN achieve better performance than their original versions, which validates that integrating into our GAIN can effectively defend against Byzantine attacks under different non-IID levels. Moreover, when the non-IID level is higher, the improvement on robust AGRs is more significant. The results further confirm that our GAIN can overcome the failures aggravated under a higher non-IID level.

Results on different numbers of clients. We further analyze the efficacy of our GAIN under different numbers of clients. Detailed setups can be found in Appendix F, and the experimental results are shown in Table 7 in Appendix F. All results demonstrate that AGRs combined with our GAIN consistently outperform their original versions, which validates that integrating with our GAIN can effectively defend against Byzantines across different numbers of clients.

8. CONCLUSION

In this work, we identify two root causes of the performance degradation of robust AGRs in the non-IID setting. The first cause is the integrity failure of conservative AGRs: conservative AGRs aggregate only a few honest gradients, which is unreliable due to the gradient heterogeneity in the non-IID setting. The second cause is the identification failure of radical AGRs: radical AGRs inevitably introduce Byzantine gradients into the aggregation due to the curse of dimensionality aggravated by the non-IIDness. Both failures result in a sharp deviation of the aggregated gradient. Motivated by the above discoveries, we propose a novel GrAdient decomposItioN (GAIN) method that can be combined with most existing defenses to overcome both failures. We also provide convergence analysis for integrating existing robust AGRs into GAIN. Empirical studies on three real-world datasets justify the efficacy of our proposed GAIN.

A SETUPS FOR EXPERIMENTS IN SEC. 4

The experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009). For both the IID and non-IID settings, the number of clients is set to $n = 50$. For the IID data distribution, all 50,000 samples are randomly partitioned into 50 clients, each containing 1,000 samples. For the non-IID data distribution, the samples are partitioned in a Dirichlet manner with concentration parameter $\beta = 0.5$. Please refer to Sec. 7.1 for the details of the Dirichlet partition. The number of Byzantine clients is set to $f = 10$. The LIE (Baruch et al., 2019) attack with $z = 1.5$ is considered. We use AlexNet (Krizhevsky et al., 2017) as the model architecture. The number of communication rounds is set to 500, and in each communication round, all clients participate in the training. For local training, the number of local epochs is set to 1, the batch size is set to 64, and the optimizer is SGD with learning rate 0.1, momentum 0.5, and weight decay coefficient 0.0001. We also adopt gradient clipping with clipping norm 2. Two defenses are considered: a radical AGR, Multi-Krum (Blanchard et al., 2017), and a conservative AGR, Bulyan (Guerraoui et al., 2018).
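The Dirichlet partition described above can be sketched as follows. The function name and the per-class dealing scheme are common practice and an assumption here, not the authors' exact script; for each class, client proportions are drawn from $\mathrm{Dir}(\beta)$, so a smaller $\beta$ yields a more heterogeneous split:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, beta, seed=0):
    """Dirichlet non-IID split: for each class, draw client proportions from
    Dir(beta) and deal that class's samples out accordingly."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        props = rng.dirichlet([beta] * n_clients)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_indices[cid].extend(part.tolist())
    return client_indices

# Toy example: 1,000 samples, 10 classes, 50 clients, beta = 0.5 as above.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, n_clients=50, beta=0.5)
assert sum(len(p) for p in parts) == 1000   # every sample assigned exactly once
```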

B PROOF FOR PROPOSITION 1

Here we restate the assumptions and the proposition for the completeness of this section.

Assumption 1 (Unbiased Estimator). The stochastic gradients sampled from any local data distribution are unbiased estimators of the local gradients over $\mathbb{R}^d$ for all clients, i.e.,
$$\mathbb{E}_{\xi_i^t}[\nabla L(w; \xi_i^t)] = \nabla L_i(w), \quad \forall w \in \mathbb{R}^d, \; i \in [n], \; t \in \mathbb{N}^+.$$

Assumption 2 (Bounded Variance). The variance of the stochastic gradients sampled from any local data distribution is uniformly bounded over $\mathbb{R}^d$ for all clients, i.e., there exists $\sigma \geq 0$ such that
$$\mathbb{E}\|\nabla L(w; \xi_i^t) - \nabla L_i(w)\|^2 \leq \sigma^2, \quad \forall w \in \mathbb{R}^d, \; i \in [n], \; t \in \mathbb{N}^+.$$

Assumption 3 (Gradient Dissimilarity). The difference between the local gradients and the global gradient is uniformly bounded over $\mathbb{R}^d$ for all clients, i.e., there exists $\kappa \geq 0$ such that
$$\|\nabla L_i(w) - \nabla L(w)\|^2 \leq \kappa^2, \quad \forall w \in \mathbb{R}^d, \; i \in [n].$$

Assumption 4 (Lipschitz Smoothness). The loss function is $L$-Lipschitz smooth over $\mathbb{R}^d$, i.e.,
$$\|\nabla L(w) - \nabla L(w')\| \leq L \|w - w'\|, \quad \forall w, w' \in \mathbb{R}^d.$$

Definition 1 ($c$-resilient AGR). Let $\mathcal{A}$ be an AGR. If for any input $\{x_1, \ldots, x_n\}$ such that there exists a set $\mathcal{H} \subseteq [n]$ of size $|\mathcal{H}| > n/2$ satisfying $\mathbb{E}\|x_i - x_{i'}\|^2 \leq \rho^2, \; \forall i, i' \in \mathcal{H}$, the output of $\mathcal{A}$ satisfies $\mathbb{E}\|\mathcal{A}(x_1, \ldots, x_n) - \bar{x}\|^2 \leq c\rho^2$, where $\bar{x} = \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} x_h$, then the AGR $\mathcal{A}$ is called $c$-resilient.

Proposition 1. Suppose Assumptions 1 to 4 hold, and let $\eta = 1/2L$. Given a $c$-resilient robust AGR $\mathcal{A}$, starting from $w^0$ and running GAIN for $T$ communication rounds satisfies
$$L(w^0) \geq \frac{3}{16L} \sum_{t=1}^{T} \left( \|\nabla L(w^t)\|^2 - e^2 \right), \quad \text{where } e^2 = O\!\left( \frac{f^2}{(n-f)^2} (\kappa^2 + \sigma^2) \left( 1 + c^2 + \frac{1}{n-f} \right) \left( 1 + \frac{n}{p} \right) \right).$$

B.1 KEY LEMMA AND PROOF

Before starting the proof of the main proposition, we first state and prove the following lemma.

Lemma 1 (Estimation error). Suppose Assumptions 1 to 3 hold.
Given a c-resilient robust AGR A, for any ε > 0, with probability at least 1 − ε, the aggregated gradient ĝ of GAIN is an unbiased estimator of the optimal gradient ḡ = ∇L(w) with bounded variance:

Eĝ = ḡ, Var[ĝ] ≤ σ²/(n − f),

whenever every Byzantine gradient g_b deviates sufficiently from ḡ, i.e.,

E∥g_b − ḡ∥ = Ω( κ(1 + c + (n/(pε))(1 + √c)) + σ(1 + c + 1/√(n − f) + (n/(pε))(1 + √c + 1/√(n − f))) ),

where p is the number of sub-vectors in GAIN.

We state and prove the following auxiliary lemma for the proof of Lemma 1.

Lemma 2. For any random vector X, we have Var[∥X∥] ≤ E∥X − EX∥².

Proof. From the definition of variance, we have

Var[∥X∥] = E(∥X∥ − E∥X∥)²   (23)
= E(∥X∥ − ∥EX∥)² − (∥EX∥ − E∥X∥)²   (24)
≤ E(∥X∥ − ∥EX∥)²   (25)
≤ E∥X − EX∥².   (26)

The second inequality comes from the triangle inequality.

Equipped with Lemma 2, we start the formal proof of Lemma 1.

Proof. For all honest clients i, j ∈ H and any parameter group q ∈ [p], we have

E∥g_i^(q) − g_j^(q)∥²   (27)
= E∥(g_i^(q) − ḡ_i^(q)) + (ḡ_i^(q) − ḡ^(q)) + (ḡ^(q) − ḡ_j^(q)) + (ḡ_j^(q) − g_j^(q))∥²   (28)
≤ 4E[∥g_i^(q) − ḡ_i^(q)∥² + ∥ḡ_i^(q) − ḡ^(q)∥² + ∥ḡ^(q) − ḡ_j^(q)∥² + ∥ḡ_j^(q) − g_j^(q)∥²]   (29)
≤ 8σ² + 8κ².   (30)

Here the first inequality comes from the Cauchy inequality, and the second inequality follows from Assumptions 2 and 3. Then, according to Definition 1, we have

E∥ĝ^(q) − g̃^(q)∥² ≤ 8c(σ² + κ²),   (31)

where g̃^(q) = (1/(n − f)) Σ_{i∈H} g_i^(q) denotes the average of honest stochastic gradients in group q. For an honest client h, the expectation of the abnormal score s_h^(q) from group q can then be bounded as follows:

E[s_h^(q)] = E∥g_h^(q) − ĝ^(q)∥   (32)
≤ E[∥g_h^(q) − ḡ_h^(q)∥ + ∥ḡ_h^(q) − g̃^(q)∥ + ∥g̃^(q) − ĝ^(q)∥]   (33)
= E∥g_h^(q) − ḡ_h^(q)∥ + E∥ḡ_h^(q) − g̃^(q)∥ + E∥g̃^(q) − ĝ^(q)∥   (34)
≤ √(E∥g_h^(q) − ḡ_h^(q)∥²) + √(E∥ḡ_h^(q) − g̃^(q)∥²) + √(E∥g̃^(q) − ĝ^(q)∥²)   (35)
≤ σ + κ + 2√(2c(σ² + κ²)).   (36)

Here the first inequality is a result of the triangle inequality, the second inequality comes from the Cauchy inequality, and the third inequality is a combined result of Equation (31) and Assumptions 2 and 3.
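Lemma 2 admits a quick Monte Carlo sanity check. The sketch below, with an arbitrary shifted Gaussian standing in for X (the distribution, mean, and sample size are illustrative choices of ours, not part of the proof), verifies Var[∥X∥] ≤ E∥X − EX∥² numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
# X: a 3-dimensional Gaussian with nonzero mean (illustrative choice).
X = rng.normal(size=(100_000, 3)) + np.array([1.0, -2.0, 0.5])

lhs = np.linalg.norm(X, axis=1).var()                  # Var[||X||]
rhs = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()   # E||X - EX||^2
assert lhs <= rhs  # Lemma 2: norm variance bounded by total variance
```

Here `rhs` estimates the trace of the covariance (≈ 3 for this standard-normal noise), so the inequality has visible slack.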

The variance of s_h^(q) can also be bounded as follows:

Var[s_h^(q)] = E[(s_h^(q))²] − (E[s_h^(q)])²   (37)
≤ E[(s_h^(q))²]   (38)
= E∥g_h^(q) − ĝ^(q)∥²   (39)
≤ 4E[∥g_h^(q) − ḡ_h^(q)∥² + ∥ḡ_h^(q) − ḡ^(q)∥² + ∥ḡ^(q) − g̃^(q)∥² + ∥g̃^(q) − ĝ^(q)∥²],   (40)

where ḡ^(q) = (1/(n − f)) Σ_{i∈H} ḡ_i^(q) and g̃^(q) = (1/(n − f)) Σ_{i∈H} g_i^(q) denote the averages of expected and stochastic honest gradients in group q, respectively. Here the second inequality is a result of the Cauchy inequality. We bound E∥ḡ^(q) − g̃^(q)∥² as follows:

E∥ḡ^(q) − g̃^(q)∥² = E∥ (1/(n − f)) Σ_{i∈H} (ḡ_i^(q) − g_i^(q)) ∥²   (41)
= (1/(n − f)²) Σ_{i∈H} E∥ḡ_i^(q) − g_i^(q)∥²   (42)
≤ (1/(n − f)²) Σ_{i∈H} σ²   (43)
= σ²/(n − f).   (44)

Here the second equality comes from the independence of minibatch sampling across different clients, and the inequality is a result of Assumption 2. Applying Assumptions 2 and 3 and Equations (31) and (44) to Equation (40), we have

Var[s_h^(q)] ≤ 4(σ² + κ² + σ²/(n − f) + 8c(σ² + κ²)) = (4 + 32c + 4/(n − f))σ² + (4 + 32c)κ².   (45, 46)

According to Equations (36) and (46), we can bound the expectation and the variance of the total abnormal score s_h of an honest client h:

E[s_h] = E[Σ_{q=1}^{p} s_h^(q)] ≤ p(σ + κ + 2√(2c(σ² + κ²))) := A,   (47)
Var[s_h] = Σ_{q=1}^{p} Var[s_h^(q)] ≤ p((4 + 32c + 4/(n − f))σ² + (4 + 32c)κ²) := B.   (48)

Here the additivity of the variance is a result of the independence of the group abnormal scores {s_h^(q) | q ∈ [p]}, which comes from the independence of components in a gradient (Yang & Schoenholz, 2017). From Chebyshev's inequality, for any Δ_h > 0 and honest client h ∈ [n] \ B, we have

P(s_h < E[s_h] + Δ_h) ≥ 1 − Var[s_h]/Δ_h².   (49)

Consider the expectation of the abnormal score s_b^(q) from group q for a Byzantine client b ∈ B:

E[s_b^(q)] = E∥g_b^(q) − ĝ^(q)∥   (50)
= E∥(g_b^(q) − ḡ^(q)) − (ĝ^(q) − ḡ^(q))∥   (51)
≥ E[∥g_b^(q) − ḡ^(q)∥ − ∥ĝ^(q) − ḡ^(q)∥]   (52)
≥ E[∥g_b^(q) − ḡ^(q)∥ − (∥ĝ^(q) − g̃^(q)∥ + ∥g̃^(q) − ḡ^(q)∥)]   (53)
≥ E∥g_b^(q) − ḡ^(q)∥ − (√(E∥ĝ^(q) − g̃^(q)∥²) + √(E∥g̃^(q) − ḡ^(q)∥²))   (54)
≥ δ_b − 2√(2c(σ² + κ²)) − σ/√(n − f).   (55)

B.2 PROOF FOR THE MAIN PROPOSITION

Proof. According to the Lipschitz property of the loss function L, we have

L(w_t) − L(w_{t+1}) ≥ ⟨∇L(w_t), w_t − w_{t+1}⟩ − (L/2)∥w_t − w_{t+1}∥².   (89)

Since w_t − w_{t+1} = ηĝ_t = η∇L(w_t) + η(ĝ_t − ∇L(w_t)), we can rewrite Equation (89) as

L(w_t) − L(w_{t+1}) ≥ (η − (L/2)η²)∥∇L(w_t)∥² + (η − (L/2)η²)⟨∇L(w_t), ĝ_t − ∇L(w_t)⟩ − (L/2)η²∥ĝ_t − ∇L(w_t)∥².   (90)

Taking the expectation on both sides of Equation (90), we have

E[L(w_t) − L(w_{t+1})] ≥ (η − (L/2)η²)E∥∇L(w_t)∥² + (η − (L/2)η²)E⟨∇L(w_t), ĝ_t − ∇L(w_t)⟩ − (L/2)η²E∥ĝ_t − ∇L(w_t)∥².   (91)

We further bound the terms E⟨∇L(w_t), ĝ_t − ∇L(w_t)⟩ and E∥ĝ_t − ∇L(w_t)∥². First, we bound E∥ĝ_t − ∇L(w_t)∥². For notational simplicity, we define H̃_t = H ∩ I_t and B̃_t = B ∩ I_t, where I_t denotes the set of clients whose gradients are aggregated at round t. Then ĝ_t can be written as

ĝ_t = (1/(n − f)) Σ_{i∈I_t} g_i^t = (1/(n − f)) ( Σ_{h∈H̃_t} g_h^t + Σ_{b∈B̃_t} g_b^t ) = (h̃/(n − f)) ḡ_{H̃}^t + (b̃/(n − f)) ḡ_{B̃}^t,   (92)

where h̃ = |H̃_t|, b̃ = |B̃_t|, ḡ_{H̃}^t = (1/h̃) Σ_{h∈H̃_t} g_h^t, and ḡ_{B̃}^t = (1/b̃) Σ_{b∈B̃_t} g_b^t. The term E∥ĝ_t − ∇L(w_t)∥² is then bounded as follows:

E∥ĝ_t − ∇L(w_t)∥² = E∥ (h̃/(n − f)) ḡ_{H̃}^t + (b̃/(n − f)) ḡ_{B̃}^t − ∇L(w_t) ∥²   (93)
= E∥ (h̃/(n − f)) (ḡ_{H̃}^t − ∇L(w_t)) + (b̃/(n − f)) (ḡ_{B̃}^t − ∇L(w_t)) ∥²   (94)
≤ (2h̃²/(n − f)²) E∥ḡ_{H̃}^t − ∇L(w_t)∥² + (2b̃²/(n − f)²) E∥ḡ_{B̃}^t − ∇L(w_t)∥².   (95)

We further bound E∥ḡ_{H̃}^t − ∇L(w_t)∥² and E∥ḡ_{B̃}^t − ∇L(w_t)∥². First,

E∥ḡ_{H̃}^t − ∇L(w_t)∥² = E∥ (ḡ_{H̃}^t − (1/h̃) Σ_{h∈H̃_t} ∇L_h(w_t)) + ((1/h̃) Σ_{h∈H̃_t} ∇L_h(w_t) − ∇L(w_t)) ∥²   (96)
= E∥ḡ_{H̃}^t − (1/h̃) Σ_{h∈H̃_t} ∇L_h(w_t)∥² + E∥(1/h̃) Σ_{h∈H̃_t} ∇L_h(w_t) − ∇L(w_t)∥²   (97)
≤ σ²/h̃ + κ².   (98)

The cross term in Equation (96) vanishes because the stochastic gradients are unbiased (Assumption 1); the inequality then follows from Assumptions 2 and 3. Next, we bound E∥ḡ_{B̃}^t − ∇L(w_t)∥². According to Lemma 1, Byzantine gradients far from the optimal gradient are directly filtered. Therefore, with probability 1 − ε,

∥g_b^t − ∇L(w_t)∥ ≤ O( (κ + σ)(1 + c + 1/√(n − f))(1 + n/(pε)) ), ∀b ∈ B̃_t.   (99)

Then we have

E∥ḡ_{B̃}^t − ∇L(w_t)∥² ≤ O( (κ² + σ²)(1 + c² + 1/(n − f))(1 + n/p) ) := C_1².   (101)

The elimination of ε is due to the sub-Gaussian property of ḡ_{B̃}^t − ∇L(w_t), which comes from the Gaussian property of benign gradients.
Combining Equations (98) and (101), E∥ĝ_t − ∇L(w_t)∥² is finally bounded as follows:

E∥ĝ_t − ∇L(w_t)∥²   (102)
≤ (2h̃²/(n − f)²)(σ²/h̃ + κ²) + (2b̃²/(n − f)²) C_1²   (103)
≤ (2(n − 2f)²/(n − f)²)(σ²/(n − 2f) + κ²) + (2f²/(n − f)²) C_1²   (104)
:= C².   (105)

Then we bound the inner-product term E⟨∇L(w_t), ĝ_t − ∇L(w_t)⟩:

|E⟨∇L(w_t), ĝ_t − ∇L(w_t)⟩| ≤ E|⟨∇L(w_t), ĝ_t − ∇L(w_t)⟩|   (107)
≤ E[∥∇L(w_t)∥ · ∥ĝ_t − ∇L(w_t)∥]   (108)
≤ E[(1/2)∥∇L(w_t)∥² + 2∥ĝ_t − ∇L(w_t)∥²]   (109)
≤ (1/2)E∥∇L(w_t)∥² + 2C².   (110)

Combining Equations (91), (104) and (110), we have

E[L(w_t) − L(w_{t+1})] ≥ ((1/2)η − (L/4)η²) E∥∇L(w_t)∥² − ((1/2)η − (L/2)η²) C².   (111)

Summing Equation (111) over the T rounds and taking the expectation, we have

E[L(w_0) − L(w_T)] ≥ ((1/2)η − (L/4)η²) Σ_{t=1}^{T} E∥∇L(w_t)∥² − T((1/2)η − (L/2)η²) C².   (112)

Taking η = 1/(2L), and considering that the loss function is generally non-negative (e.g., cross-entropy loss, ℓ2 loss), we obtain

E[L(w_0)] ≥ (3/(16L)) Σ_{t=1}^{T} ( E∥∇L(w_t)∥² − (2/3)C² ),

which completes the proof.

C COMPARISON AGAINST RECENT WORKS

Recent works (Karimireddy et al., 2022; Allen-Zhu et al., 2020; El-Mhamdi et al., 2021) also analyze the convergence of Byzantine-robust FL in the non-IID setting. Like ours, these analyses provide an upper bound on the gradient norms, acknowledge that convergence in the presence of Byzantines may be impossible due to non-IID data (i.e., ∥∇L(w)∥ may never decrease to zero), and show that the non-IID degree plays a key role in the upper bound.

In particular, for each client i, we sample p_i^y ∼ Dir(β) and allocate a p_i^y proportion of the data of label y to client i, where Dir(β) represents the Dirichlet distribution with concentration parameter β. We follow Li et al. (2021) and set the number of clients n = 50 and the concentration parameter β = 0.5 by default.

Other setups. The setups for the datasets FEMNIST (Caldas et al., 2018), CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-12 (Russakovsky et al., 2015) are listed in Table 4 below. We use AlexNet (Krizhevsky et al., 2017) as the model architecture. The number of communication rounds is set to 500, and all clients participate in every round. For local training, the number of local epochs is 1, the batch size is 64, and the optimizer is SGD with learning rate 0.1, momentum 0.5, and weight decay coefficient 0.0001. We also adopt gradient clipping with clipping norm 2. Two defenses are considered: a radical AGR, Multi-Krum (Blanchard et al., 2017), and a conservative AGR, Bulyan (Guerraoui et al., 2018).

E GAIN MITIGATES THE DEVIATION OF AGGREGATED GRADIENTS

In Sec. 6, we claim that our GAIN method can reduce the deviation of the aggregated gradient ĝ from the average of honest gradients g. To verify this, we compare the deviation of the aggregated gradient of different defenses and their GAIN variants in Figure 3. In particular, we use ∥ĝ − g∥, the distance between the aggregated gradient ĝ and the average of honest gradients g, to measure the deviation degree. As shown in Figure 3, the gradient deviation of GAIN-enhanced defenses is much lower than that of their original versions, which validates that GAIN mitigates the gradient deviation.

F RESULTS ON DIFFERENT NUMBERS OF CLIENTS

We also conduct experiments across different numbers of clients. Table 7 shows the results of different defenses under the LIE attack with n ∈ {75, 100} clients on the CIFAR-10 dataset. Note that the number of Byzantine clients is set to f = 0.2n correspondingly. Other setups follow the default configuration.
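The deviation metric ∥ĝ − g∥ used in Appendix E is straightforward to compute; a minimal sketch, assuming gradients are flattened into NumPy vectors (the function name is ours, illustrative only):

```python
import numpy as np

def gradient_deviation(aggregated, honest_grads):
    # ||g_hat - g_bar||: distance between the aggregated gradient and
    # the average of honest gradients. Lower means less deviation.
    g_bar = np.mean(honest_grads, axis=0)
    return np.linalg.norm(aggregated - g_bar)
```

In the experiments of Appendix E, this quantity would be recorded each round for every defense with and without GAIN.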



Figure 1: The behaviors of a conservative AGR (Bulyan) and a radical AGR (Multi-Krum) under the attack of 20% Byzantines in both IID and non-IID settings on CIFAR-10 dataset. More detailed setups are covered in Appendix A. The dotted lines represent the performance without Byzantines.

Figure 2: The behaviors of a conservative AGR (Bulyan) and a radical AGR (Multi-Krum) across sub-vector numbers p ∈ {100, 1000} under the attack of 20% Byzantines in the non-IID setting on CIFAR-10 dataset. More detailed setups are covered in Appendix D.2.

Min-Max: γ_init = 10, τ = 1 × 10⁻⁵, δ: coordinate-wise standard deviation
Min-Sum: γ_init = 10, τ = 1 × 10⁻⁵, δ: coordinate-wise standard deviation
IPM
DnC: c = 4, niters = 1, b = 10000
RBTM: N/A

D.2 SETUP FOR EXPERIMENTS ON THE NUMBER OF SUB-VECTORS IN SECTION 7

The number of clients is set to n = 50. The samples are partitioned in a Dirichlet manner with concentration parameter β = 0.5; please refer to Sec. 7.1 for the details of the Dirichlet partition. The number of Byzantine clients is set to f = 10, and the LIE (Baruch et al., 2019) attack with z = 1.5 is considered.
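As a sketch of how the sub-vector number p enters the pipeline, the snippet below splits flat gradients into p groups and computes the total abnormal score s_i = Σ_q ∥g_i^(q) − ĝ^(q)∥ that appears in Appendix B. This is an illustrative reconstruction under our own assumptions (equal-size splitting, function names), not the authors' implementation:

```python
import numpy as np

def split_into_groups(grad, p):
    # Split a flat gradient into p nearly equal sub-vectors (groups).
    return np.array_split(grad, p)

def abnormal_scores(grads, agg, p):
    # s_i = sum over groups q of || g_i^(q) - g_hat^(q) ||,
    # the total abnormal score of client i w.r.t. the aggregate agg.
    agg_groups = split_into_groups(agg, p)
    scores = []
    for g in grads:
        groups = split_into_groups(g, p)
        scores.append(sum(np.linalg.norm(gq - aq)
                          for gq, aq in zip(groups, agg_groups)))
    return np.array(scores)
```

Clients with the largest scores would then be filtered before re-aggregation; larger p scores each gradient at a finer granularity.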

Figure 3: The gradient deviation ∥ĝ -g∥ of six different defenses w/ and w/o GAIN under LIE attack on CIFAR-10. The lower the better.

Our experiments are conducted on four real-world datasets: CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), a subset of ImageNet (Russakovsky et al., 2015) referred to as ImageNet-12, and FEMNIST (Caldas et al., 2018). The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. The CIFAR-100 dataset consists of 60,000 32×32 color images in 100 classes, with 600 images per class; there are 50,000 training images and 10,000 test images. ImageNet-12 consists of 15,600 color images in 12 classes, with 1,300 images per class; there are 12,480 training images and 3,120 test images in this subset of ImageNet. FEMNIST consists of 817,851 28×28 gray-scale images in 62 classes; there are 772,066 training images and 45,785 test images.

Accuracy (mean±std) of different defenses under 6 attacks on CIFAR-10, ImageNet-12, FEMNIST, and CIFAR-100.
Krum+GAIN: 59.23 ± 0.55 | 61.47 ± 0.26 | 55.66 ± 0.93 | 49.19 ± 0.72 | 53.59 ± 0.96 | 56.94 ± 3.60 | 42.41 ± 0.58 | 42.55 ± 0.12 | 27.81 ± 0.32 | 31.18 ± 1.48 | 41.33 ± 0.50 | 42.62 ± 1.53
Krum+GAIN: 84.29 ± 1.76 | 85.45 ± 0.40 | 74.76 ± 1.74 | 57.46 ± 0.33 | 70.65 ± 1.35 | 81.46 ± 0.18 | 66.79 ± 1.08 | 63.04 ± 0.14 | 57.15 ± 0.19 | 59.94 ± 0.32 | 64.07 ± 1.38 | 61.92 ± 0.04

Accuracy (mean±std) of different defenses against LIE attack under different non-IID levels on CIFAR-10. A smaller β implies a higher non-IID level.

Accuracy (mean±std) of different defenses against LIE attack with different Byzantine client numbers f = {5, 15} on CIFAR-10. The number of total clients is n = 50.

Technically, these works and ours improve convergence in different ways. In particular, Allen-Zhu et al. (2020) show how server momentum or history gradients can help convergence. Karimireddy et al. (2022) consider the combined effect of server momentum and gradient bucketing. El-Mhamdi et al. (2021) consider a decentralized setting and minimize the upper bound from the point of view of robust AGR design. Our method considers how gradient decomposition can help convergence. In this sense, our convergence analysis is orthogonal to the above works and may be combined with them to achieve a better upper bound.

Data distribution. For CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-12, we use the Dirichlet distribution to generate non-IID data, following Yurochkin et al. (2019) and Li et al. (2021).
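The label-wise Dirichlet partition described above can be sketched as follows; a minimal NumPy version in which, for each label y, client proportions are drawn from Dir(β) and that share of the label-y samples is allocated to each client (the function name and seeding are illustrative, not from the paper's code):

```python
import numpy as np

def dirichlet_partition(labels, n_clients=50, beta=0.5, seed=0):
    # For each label y, sample proportions p_y ~ Dir(beta) over clients
    # and allocate that share of the label-y samples to each client.
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for y in np.unique(labels):
        idx = np.where(labels == y)[0]
        rng.shuffle(idx)
        props = rng.dirichlet([beta] * n_clients)
        # Cumulative proportions give the split points into the label-y pool.
        splits = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, splits)):
            client_indices[cid].extend(part.tolist())
    return client_indices
```

A smaller β concentrates each label on fewer clients, i.e., a higher non-IID level, matching the behavior reported in the non-IID-level experiments.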

Default experimental settings for FEMNIST, CIFAR-10, CIFAR-100 and ImageNet-12.

The hyperparameters of six attacks.

Accuracy (mean±std) of different defenses against LIE attack under different client numbers on CIFAR-10.

ETHICS STATEMENT

Our studies do not involve human subjects, dataset-release practices, or discrimination/bias/fairness concerns, and raise no legal compliance or research integrity issues. Our work aims to achieve Byzantine robustness when applying federated learning to real-world applications. Accordingly, as long as federated learning itself is applied for good, we believe our proposed method will not cause ethical problems or pose negative societal impacts.

REPRODUCIBILITY STATEMENT

The implementation code is provided in the Supplementary Materials. All datasets and the code platform (PyTorch) we use are public. Detailed experimental setups are provided in Appendices A and D.

Here δ_b = E∥g_b^(q) − ḡ^(q)∥ is the expected deviation of Byzantine client b from the average of honest gradients. The first and second inequalities come from the triangle inequality, the third inequality is based on the Cauchy inequality, and the fourth inequality is a combined result of Equations (31) and (44). The variance of the abnormal score s_b^(q) can be bounded analogously: the first inequality results from Lemma 2, and the second inequality comes from the Cauchy inequality. We then bound ∥ĝ^(q) − E[ĝ^(q)]∥, and applying Equation (65) to Equation (60) gives the desired bound, where E∥g_b^(q) − E[g_b^(q)]∥² is the variance. Similar to Equations (47) and (48), we utilize Equations (55) and (66) to bound the expectation and the variance of the total abnormal score s_b of a Byzantine client b. Similarly, we apply Chebyshev's inequality to the abnormal score of a Byzantine client b ∈ B. Combining Equations (47) to (49) and taking Δ_h = (C − A)/(1 + D/B), and combining Equations (67) to (69) with the analogous choice of Δ_b, we then consider the probability that all the Byzantines are filtered.

Three types of attacks are considered, based on the adversary's knowledge:
• Agnostic attack: the adversary knows neither the honest gradients nor the AGR.
• Partial-knowledge attack: the adversary only has knowledge of the honest gradients.
• Omniscient attack: the adversary knows both the honest gradients and the AGR.

Among the six attacks considered, BitFlip (Allen-Zhu et al., 2020) and LabelFlip (Allen-Zhu et al., 2020) are agnostic attacks; LIE (Baruch et al., 2019), Min-Max (Shejwalkar & Houmansadr, 2021), and Min-Sum (Shejwalkar & Houmansadr, 2021) are partial-knowledge attacks; and IPM (Xie et al., 2020) is an omniscient attack. The hyperparameters of the six attacks, BitFlip (Allen-Zhu et al., 2020), LabelFlip (Allen-Zhu et al., 2020), LIE (Baruch et al., 2019), Min-Max (Shejwalkar & Houmansadr, 2021), Min-Sum (Shejwalkar & Houmansadr, 2021), and IPM (Xie et al., 2020), are listed in Table 5 below. The hyperparameters of the six defenses, Multi-Krum (Blanchard et al., 2017), Bulyan (Guerraoui et al., 2018), Median (Yin et al., 2018), RFA (Pillutla et al., 2019), DnC (Shejwalkar & Houmansadr, 2021), and RBTM (El-Mhamdi et al., 2021), are listed in Table 5 below.
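Among the attacks above, LIE is the one used in the default setup (z = 1.5). The sketch below shows a common formulation of it, assuming the adversary estimates the honest mean and coordinate-wise standard deviation; the perturbation sign and the selection of z in Baruch et al. (2019) may differ, so this is an illustrative sketch rather than the exact attack used in the experiments:

```python
import numpy as np

def lie_attack(honest_grads, z=1.5):
    # A Little Is Enough (LIE): craft a Byzantine gradient that stays
    # within z coordinate-wise standard deviations of the honest mean,
    # so it is hard to distinguish from benign noise.
    mu = np.mean(honest_grads, axis=0)
    sigma = np.std(honest_grads, axis=0)
    return mu - z * sigma  # the shift direction is an implementation choice
```

All Byzantine clients would submit this same crafted vector, which is what makes the small per-coordinate shift collectively effective against distance-based AGRs.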

