FEDREP: A BYZANTINE-ROBUST, COMMUNICATION-EFFICIENT AND PRIVACY-PRESERVING FRAMEWORK FOR FEDERATED LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Federated learning (FL) has recently become a hot research topic, in which Byzantine robustness, communication efficiency and privacy preservation are three important aspects. However, the tension among these three aspects makes it hard to simultaneously take all of them into account. In view of this challenge, we theoretically analyze the conditions that a communication compression method should satisfy to be compatible with existing Byzantine-robust methods and privacy-preserving methods. Motivated by the analysis results, we propose a novel communication compression method called consensus sparsification (ConSpar). To the best of our knowledge, ConSpar is the first communication compression method that is designed to be compatible with both Byzantine-robust methods and privacy-preserving methods. Based on ConSpar, we further propose a novel FL framework called FedREP, which is Byzantine-robust, communication-efficient and privacy-preserving. We theoretically prove the Byzantine robustness and the convergence of FedREP. Empirical results show that FedREP can significantly outperform communication-efficient privacy-preserving baselines. Furthermore, compared with Byzantine-robust communication-efficient baselines, FedREP can achieve comparable accuracy with an extra advantage of privacy preservation.

1. INTRODUCTION

Federated learning (FL), in which participants (also called clients) collaborate to train a learning model while keeping data privately owned, has recently become a hot research topic (Konečný et al., 2016; McMahan & Ramage, 2017). Compared to traditional data-center based distributed learning (Haddadpour et al., 2019; Jaggi et al., 2014; Lee et al., 2017; Lian et al., 2017; Shamir et al., 2014; Sun et al., 2018; Yu et al., 2019a; Zhang & Kwok, 2014; Zhao et al., 2017; 2018; Zhou et al., 2018; Zinkevich et al., 2010), service providers have less control over clients in FL applications, and the network is usually less stable with smaller bandwidth. Furthermore, participants take the risk of privacy leakage in FL if privacy-preserving methods are not used. Consequently, Byzantine robustness, communication efficiency and privacy preservation have become three important aspects of FL methods (Kairouz et al., 2021) and have attracted much attention in recent years.

Byzantine robustness. In FL applications, failures of clients or network transmission may not be discovered and resolved in time (Kairouz et al., 2021). Moreover, some clients may be attacked by an adversarial party and purposely send incorrect or even harmful information. Clients in failure or under attack are also called Byzantine clients. To obtain robustness against Byzantine clients, there are mainly three approaches, known as redundant computation, server validation and robust aggregation. Redundant computation methods (Chen et al., 2018; Konstantinidis & Ramamoorthy, 2021; Rajput et al., 2019) require different clients to compute gradients for the same training instances. These methods are mostly designed for traditional data-center based distributed learning and are inapplicable in FL due to the privacy principle. In server validation methods (Xie et al., 2019b; 2020b), the server validates clients' updates based on a public dataset.
However, the performance of server validation methods depends on the quantity and quality of the validation instances, and in many scenarios it is hard to obtain a large-scale high-quality public dataset. The third way is to replace the mean aggregation on the server with robust aggregation (Alistarh et al., 2018; Bernstein et al., 2019; Blanchard et al., 2017; Chen et al., 2017; Ghosh et al., 2020; Karimireddy et al., 2021; Li et al., 2019; Sohn et al., 2020; Yin et al., 2018; 2019). Compared to redundant computation and server validation, robust aggregation usually has a wider scope of application, and many Byzantine-robust FL methods (Wang et al., 2020; Xie et al., 2019a) take this way.

Communication efficiency. In many FL applications, the server and clients are connected by a wide area network (WAN), which is usually less stable and has smaller bandwidth than the network in traditional data-center based distributed machine learning. Therefore, communication cost should also be taken into consideration. The local updating technique (Konečný et al., 2016; McMahan et al., 2017; Yu et al., 2019b; Zhao et al., 2017; 2018), where clients locally update models for several iterations before global aggregation, is widely used in FL methods. Communication cost can also be reduced by communication compression techniques, which mainly include quantization (Alistarh et al., 2017; Faghri et al., 2020; Gandikota et al., 2021; Safaryan & Richtárik, 2021; Seide et al., 2014; Wen et al., 2017), sparsification (Aji & Heafield, 2017; Chen et al., 2020; Stich et al., 2018; Wangni et al., 2018) and sketching (Rothchild et al., 2020). The error compensation (also known as error feedback) technique (Gorbunov et al., 2020; Wu et al., 2018; Xie et al., 2020c) has been proposed to alleviate the accuracy decrease caused by communication compression.
Moreover, different techniques can be combined to further reduce communication cost (Basu et al., 2020; Lin et al., 2018).

Privacy preservation. Most existing FL methods send gradients or model parameters during the training process while keeping data decentralized due to the privacy principle. However, sending gradients or model parameters may still cause privacy leakage (Kairouz et al., 2021; Zhu et al., 2019). Random noise is used to hide the true input values in some privacy-preserving techniques such as differential privacy (DP) (Abadi et al., 2016; Jayaraman et al., 2018; McMahan et al., 2018) and sketching (Liu et al., 2019; Zhang & Wang, 2021). Secure aggregation (SecAgg) (Bonawitz et al., 2017; Choi et al., 2020) is proposed to ensure the privacy of computation. Based on secure multiparty computation (MPC) and Shamir's t-out-of-n secret sharing (Shamir, 1979), SecAgg allows the server to obtain only the average value for global model updating, without knowing each client's local model parameters (or gradients). Since noise can simply be added to stochastic gradients in most existing FL methods to provide input privacy, we mainly focus on how to combine SecAgg with Byzantine-robust and communication-efficient methods in this work.

There are also some methods that consider two of the three aspects (Byzantine robustness, communication efficiency and privacy preservation), including RCGD (Ghosh et al., 2021), F²ed-Learning (Wang et al., 2020), SHARE (Velicheti et al., 2021) and SparseSecAgg (Ergun et al., 2021), which we summarize in Table 1. However, the tension among these three aspects makes it hard to simultaneously take all of them into account. In view of this challenge, we theoretically analyze the tension among Byzantine robustness, communication efficiency and privacy preservation, and propose a novel FL framework called FedREP.
The main contributions are listed as follows:
• We theoretically analyze the conditions that a communication compression method should satisfy to be compatible with Byzantine-robust methods and privacy-preserving methods. Motivated by the analysis results, we propose a novel communication compression method called consensus sparsification (ConSpar). To the best of our knowledge, ConSpar is the first communication compression method that is designed to be compatible with both Byzantine-robust methods and privacy-preserving methods.
• Based on ConSpar, we further propose a novel FL framework called FedREP, which is Byzantine-robust, communication-efficient and privacy-preserving.
• We theoretically prove the Byzantine robustness and the convergence of FedREP.
• We empirically show that FedREP can significantly outperform existing communication-efficient privacy-preserving baselines. Furthermore, compared with Byzantine-robust communication-efficient baselines, FedREP can achieve comparable accuracy with an extra advantage of privacy preservation.

2. PRELIMINARY

In this work, we mainly focus on the conventional federated learning setup with $m$ clients and a single server (Kairouz et al., 2021), which collaborate to solve the finite-sum optimization problem:
$$\min_{w \in \mathbb{R}^d} F(w) = \sum_{k=1}^m p_k F_k(w) \quad \text{s.t.} \quad F_k(w) = \frac{1}{|\mathcal{D}_k|} \sum_{i \in \mathcal{D}_k} f_i(w), \quad k = 1, 2, \ldots, m, \tag{1}$$
where $w$ is the model parameter and $d$ is the dimension of the parameter. $f_i(w)$ is the empirical loss of parameter $w$ on the $i$-th training instance. $\mathcal{D}_k$ denotes the index set of instances stored on the $k$-th client and $F_k(w)$ is the local loss function of the $k$-th client. We assume that $\mathcal{D}_k \cap \mathcal{D}_{k'} = \emptyset$ when $k \neq k'$, and treat instances with the same value on different clients as distinct instances. $p_k$ is the weight of the $k$-th client, satisfying $p_k > 0$ and $\sum_{k=1}^m p_k = 1$. A common setting is $p_k = |\mathcal{D}_k| / (\sum_{k'=1}^m |\mathcal{D}_{k'}|)$. For simplicity, we assume $|\mathcal{D}_k| = |\mathcal{D}_{k'}|$ for all $k, k' \in [m]$ and thus $p_k = 1/m$. The analysis in this work can be extended to general cases in a similar way.

Most federated learning methods (Karimireddy et al., 2020; McMahan et al., 2017; 2018) for solving problem (1) are based on distributed stochastic gradient descent and its variants, where clients locally update model parameters according to their own training instances and then communicate with the server for model aggregation in each iteration. However, the size of many widely-used models (Devlin et al., 2018; He et al., 2016) is very large, leading to heavy communication cost. Thus, techniques to reduce communication cost are required in FL. Moreover, FL methods should also be robust to potential Byzantine attacks and privacy attacks in real-world applications.

Byzantine attack. Let $[m] = \{1, 2, \ldots, m\}$ denote the set of clients. $\mathcal{G} \subseteq [m]$ denotes the set of good (non-Byzantine) clients, which execute the algorithm faithfully. The remaining clients $[m] \setminus \mathcal{G}$ are Byzantine, and may act maliciously and send arbitrary values.
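A minimal sketch of the weighted objective in problem (1), with hypothetical quadratic local losses (the function names and toy losses below are ours, for illustration only):

```python
import numpy as np

def client_weights(dataset_sizes):
    """Client weights p_k = |D_k| / sum_j |D_j| from problem (1)."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    return sizes / sizes.sum()

def global_loss(w, local_losses, weights):
    """Global objective F(w) = sum_k p_k F_k(w)."""
    return sum(p * F_k(w) for p, F_k in zip(weights, local_losses))

# Equal-size local datasets recover the uniform weighting p_k = 1/m.
p = client_weights([100, 100, 100, 100])

# Hypothetical quadratic local losses F_k(w) = (w - c_k)^2; the global
# minimizer of the weighted sum is then the weighted mean of the c_k.
centers = [0.0, 1.0, 2.0, 3.0]
losses = [lambda w, c=c: (w - c) ** 2 for c in centers]
w_star = float(np.dot(p, centers))
```
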
The server, which is usually under the service provider's control, faithfully executes the algorithm as well. This Byzantine attack model is consistent with that in many previous works (Karimireddy et al., 2021). Although some works (Burkhalter et al., 2021) focus on another type of attack called backdoor attacks (Kairouz et al., 2021), in this paper we mainly focus on Byzantine attacks, whose purpose is to degrade the model performance. One typical technique to defend against Byzantine attacks is robust aggregation (Kairouz et al., 2021), which guarantees bounded aggregation error even if Byzantine clients send incorrect values.

Privacy attack. In a typical FL method, the server is responsible for using the average of clients' local updating values for global model updating. However, local updating information may be used to recover a client's training instances (Zhu et al., 2019), which increases the risk of privacy leakage. Thus, the requirement of privacy preservation prohibits the server from directly receiving an individual client's updating information (Kairouz et al., 2021). Secure aggregation (Bonawitz et al., 2017) is a typical privacy-preserving method, which only allows the server to have access to the average value for global model updating.

There are mainly two different types of FL settings, called cross-silo FL and cross-device FL (Kairouz et al., 2021). We mainly focus on the cross-silo FL setting in this paper, where the number of clients $m$ is usually not too large and all clients can participate in each training iteration. Meanwhile, in this paper we mainly focus on synchronous FL methods.

3. METHODOLOGY

In this section, we analyze the conditions that a communication compression method should satisfy to be compatible with Byzantine-robust methods and privacy-preserving methods. Based on the analysis, we propose a novel communication compression method called consensus sparsification and a novel federated learning framework called FedREP that is Byzantine-robust, communication-efficient and privacy-preserving. In FedREP, we adopt robust aggregation technique to obtain Byzantine robustness due to its wider scope of application than redundant computation and server validation. For privacy preservation, we mainly focus on secure aggregation, which is a widely used technique in FL to make server only have access to the average of clients' local updating values.

3.1. MOTIVATION

We first analyze the compatibility of existing communication compression methods with secure aggregation (SecAgg) (Bonawitz et al., 2017). SecAgg is usually adopted together with quantization, since it requires operating on a finite field to guarantee privacy preservation. Traditional quantization methods that represent each coordinate in fewer bits can compress gradients stored as floating point numbers (32 bits) only up to 1/32 of the original size. Even with quantization, SecAgg still suffers from heavy communication cost. Thus, sparsification is required to further reduce the communication cost (Ergun et al., 2021). However, if we simply combine traditional sparsification methods (e.g., random-$K$ and top-$K$ sparsification) with SecAgg, the random mask in SecAgg damages the sparsity. Thus, non-Byzantine clients should agree on the non-zero coordinates in order to keep the sparsity in SecAgg.

Then we analyze the compatibility of sparsification with robust aggregation. As previous works (Karimireddy et al., 2021) have shown, obtaining Byzantine robustness requires the distances between compressed updates from different clients (a.k.a. the dissimilarity between clients) to be small. Specifically, we present the definition of a $(\delta, c)$-robust aggregator in Definition 1.

Definition 1 ($(\delta, c)$-robust aggregator (Karimireddy et al., 2021; 2022)). Assume constant $\delta \in [0, \frac{1}{2})$ and index set $\mathcal{G} \subseteq [m]$ satisfies $|\mathcal{G}| \geq (1-\delta)m$. Suppose that we are given $m$ random vectors $v_1, \ldots, v_m \in \mathbb{R}^d$ such that $\mathbb{E}\|v_k - v_{k'}\|^2 \leq \rho^2$ for any fixed $k, k' \in \mathcal{G}$. $v_k$ can be an arbitrary value if $k \in [m] \setminus \mathcal{G}$. Aggregator $\mathrm{Agg}(\cdot)$ is said to be $(\delta, c)$-robust if the aggregation error $e = \mathrm{Agg}(\{v_k\}_{k=1}^m) - \frac{1}{|\mathcal{G}|}\sum_{k \in \mathcal{G}} v_k$ satisfies $\mathbb{E}\|e\|^2 \leq c\delta\rho^2$.
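The incompatibility noted above can be seen directly in a small sketch: an additive SecAgg-style mask over all $d$ coordinates makes a top-$K$-sparsified update dense again, while a mask restricted to an agreed coordinate set preserves sparsity (dimensions and variable names below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 1000, 50

def top_k_sparsify(v, k):
    """Keep the k largest-magnitude coordinates of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

v = rng.normal(size=d)
sparse_v = top_k_sparsify(v, K)
nnz_sparse = np.count_nonzero(sparse_v)          # exactly K

# A SecAgg-style additive mask over the full vector destroys sparsity:
# the masked update is dense.
mask = rng.normal(size=d)
nnz_masked = np.count_nonzero(sparse_v + mask)

# If all clients agree on the non-zero coordinates, the mask can be
# restricted to those coordinates and the update stays K-sparse.
agreed = np.flatnonzero(sparse_v)
restricted = sparse_v.copy()
restricted[agreed] += mask[agreed]
nnz_restricted = np.count_nonzero(restricted)
```
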
As shown in previous works (Karimireddy et al., 2022), many widely-used aggregators, such as Krum (Blanchard et al., 2017), geoMed (Chen et al., 2017) and coordinate-wise median (Yin et al., 2018), combined with averaging in buffers (please refer to Section 3.3), satisfy Definition 1. Moreover, $O(\delta\rho^2)$ is the tightest order (Karimireddy et al., 2021). Thus, a compression method that is compatible with robust aggregation should keep the expected dissimilarity between clients' updates small after compression. Therefore, we theoretically analyze the expected dissimilarity after sparsification. For space saving, we only present the results here. Proof details can be found in Appendix B.

Theorem 1. Let $\{v_k\}_{k=1}^m$ denote random vectors that satisfy $\mathbb{E}\|v_k - v_{k'}\|^2 = (\rho_{k,k'})^2$ and $\mathbb{E}\|v_k\|^2 = (\mu_k)^2$ for any fixed $k, k' \in \mathcal{G}$. More specifically, $\mathbb{E}[(v_k)_j - (v_{k'})_j]^2 = \xi_{k,k',j}(\rho_{k,k'})^2$ and $\mathbb{E}[(v_k)_j^2] = \zeta_{k,j}(\mu_k)^2$, where $\xi_{k,k',j} > 0$, $\zeta_{k,j} > 0$, $\sum_{j \in [d]} \xi_{k,k',j} = 1$ and $\sum_{j \in [d]} \zeta_{k,j} = 1$ for any fixed $k, k' \in \mathcal{G}$. Let $C(\cdot)$ denote any sparsification operator and $\mathcal{N}_k$ denote the set of non-zero coordinates in $C(v_k)$. For any fixed $k, k' \in \mathcal{G}$, we have:
$$\mathbb{E}\|C(v_k) - C(v_{k'})\|^2 = (\rho_{k,k'})^2 \cdot \sum_{j \in [d]} \xi_{k,k',j} \Pr[j \in \mathcal{N}_k \cap \mathcal{N}_{k'}] + (\mu_k)^2 \cdot \sum_{j \in [d]} \zeta_{k,j} \Pr[j \in \mathcal{N}_k \setminus \mathcal{N}_{k'}] + (\mu_{k'})^2 \cdot \sum_{j \in [d]} \zeta_{k',j} \Pr[j \in \mathcal{N}_{k'} \setminus \mathcal{N}_k]. \tag{3}$$

Please note that when the dissimilarity between the $k$-th and the $k'$-th clients is not too large, $(\mu_k)^2$ and $(\mu_{k'})^2$ are usually much larger than $(\rho_{k,k'})^2$. In Equation (3), the terms with $(\mu_k)^2$ and $(\mu_{k'})^2$ vanish if and only if $\mathcal{N}_k \setminus \mathcal{N}_{k'} = \mathcal{N}_{k'} \setminus \mathcal{N}_k = \emptyset$ with probability 1, which is equivalent to $\mathcal{N}_k = \mathcal{N}_{k'}$ with probability 1. Therefore, in order to lower the dissimilarity between any pair of non-Byzantine clients, all non-Byzantine clients should agree on the non-zero coordinates of the sparsified vectors. Motivated by the analysis results in these two aspects, we propose consensus sparsification.
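The requirement in Definition 1 can be illustrated numerically: under a simple large-value attack, the plain mean has an aggregation error that grows with the attack magnitude, while a robust aggregator such as the coordinate-wise median (one of the aggregators listed above; chosen here purely for illustration) keeps the error on the order of the good clients' dissimilarity:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n_byz = 20, 5, 3
# 17 good clients with similar updates; 3 Byzantine clients sending
# arbitrarily large values.
good = rng.normal(loc=1.0, scale=0.1, size=(m - n_byz, d))
updates = np.vstack([good, np.full((n_byz, d), 1e6)])

honest_mean = good.mean(axis=0)
# Plain mean: the aggregation error grows with the attack magnitude.
mean_err = np.linalg.norm(updates.mean(axis=0) - honest_mean)
# Coordinate-wise median: the error stays bounded by the good clients'
# dissimilarity, as Definition 1 requires.
median_err = np.linalg.norm(np.median(updates, axis=0) - honest_mean)
```
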

3.2. CONSENSUS SPARSIFICATION

We introduce the consensus sparsification (ConSpar) method in this section. For simplicity, we assume the hyper-parameter $K$ is a multiple of the client number $m$. We use $u_k^t$ to denote the local memory for error compensation (Stich et al., 2018). After the consensus coordinate set $I^t$ is determined (see Section 3.3 for the detailed procedure), the server computes $(\hat{G}^t)_{I^t} = \mathrm{Agg}(\{(g_k^t)_{I^t}\}_{k=1}^m)$. Please note that $(\hat{G}^t)_{I^t}$ is still a sparsified vector. Moreover, the server is only required to broadcast $(\hat{G}^t)_{I^t}$, since $I^t$ has already been sent to the clients before. Thus, ConSpar is naturally a two-way sparsification method without the need to adopt the DoubleSqueeze technique (Tang et al., 2019).

Then we analyze the dissimilarity between clients after consensus sparsification. Please note that we do not assume the behaviour of Byzantine clients, which may send arbitrary $I_k^t$.

Proposition 1. Let $\{\hat{g}_k^t\}_{k=1}^m$ denote the consensus sparsification results of vectors $\{g_k^t\}_{k=1}^m$. Then we have $\mathbb{E}\|\hat{g}_k^t - \hat{g}_{k'}^t\|^2 \leq \mathbb{E}\|g_k^t - g_{k'}^t\|^2$ for any fixed $k, k' \in \mathcal{G}$.

Proposition 1 indicates that ConSpar will not enlarge the dissimilarity between clients, which is consistent with Theorem 1. Meanwhile, SecAgg can be used in the second communication round, and random masks need to be added on the consensus non-zero coordinates only.

Then we analyze the privacy preservation of the mechanism that we use to generate $I_k^t$ in ConSpar. To begin with, we present the definition of $\epsilon$-differential privacy ($\epsilon$-DP) in Definition 2.

Definition 2. Let $\epsilon > 0$ be a real number. A random mechanism $\mathcal{M}$ is said to provide $\epsilon$-differential privacy if for any two adjacent input datasets $T_1$ and $T_2$ and for any subset of possible outputs $S$: $\Pr[\mathcal{M}(T_1) \in S] \leq \exp(\epsilon) \cdot \Pr[\mathcal{M}(T_2) \in S]$.

In the mechanism that we use to generate $I_k^t$ in ConSpar, $T_1$ and $T_2$ are top-$\frac{K}{m}$ coordinate sets. Definition 2 leaves the definition of adjacent datasets open. In this work, coordinate sets $T_1$ and $T_2$ that satisfy $T_1, T_2 \subseteq [d]$ and $|T_1| = |T_2| = \frac{K}{m}$ are defined to be adjacent if they differ on only one element. Liu et al.
(2020) provide a DP guarantee for sparsification methods with only one selected coordinate. Our definition is more general and includes the one-coordinate special case where $|T_1| = |T_2| = 1$. Now we show that the coordinate generation mechanism provides $\epsilon$-DP.

Theorem 2. For any $\alpha \in (0, 1]$, the mechanism in consensus sparsification that takes the set of top coordinates $T_k^t$ as an input and outputs $I_k^t$ provides $\ln\Big(\frac{(1+\alpha) \cdot \frac{K}{m}(d - \frac{K}{m} + 1)}{2\alpha}\Big)$-differential privacy.

Finally, we analyze the communication complexity of ConSpar. Clients need to send the candidate coordinate set $I_k^t$, receive $I^t$, send the local gradient in the form of $(g_k^t)_{I^t}$, and then receive $(\hat{G}^t)_{I^t}$ in each iteration. Thus, each client needs to communicate no more than $(\frac{K}{m} + K)$ integers and $2K$ floating point numbers in each iteration. When each integer or floating point number is represented by 32 bits (4 bytes), the total communication load of each client is no more than $(96 + \frac{32}{m})K$ bits per iteration. This is not larger than the load of vanilla top-$K$ sparsification, in which $4 \times 32K = 128K$ bits are transmitted in each iteration. Meanwhile, although ConSpar requires two communication rounds, the extra communication round is acceptable. For one thing, there is little computation between the two rounds, so the extra round will not significantly increase the risk of client disconnection during the aggregation process. For another, the cost of the extra communication round is negligible when combined with SecAgg, since SecAgg already requires multiple communication rounds and can deal with offline clients.
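The candidate-set mechanism analyzed in Theorem 2 can be sketched as follows, based on the description in the proof of Theorem 2 (the function name and signature are ours): drop a Binomial(K/m, α) number of the client's top coordinates and refill with coordinates drawn from the rest of [d].

```python
import random

def generate_candidate_set(top_coords, d, alpha, rng=random):
    """Sketch of the ConSpar candidate-coordinate mechanism, following
    the description in the proof of Theorem 2: drop r ~ Binomial(K/m,
    alpha) of the client's top-(K/m) coordinates, then refill with r
    coordinates drawn from the rest of [d] so the output size stays K/m.
    """
    k_m = len(top_coords)
    # r ~ Binomial(K/m, alpha): the number of top coordinates dropped.
    r = sum(rng.random() < alpha for _ in range(k_m))
    kept = set(rng.sample(sorted(top_coords), k_m - r))
    # Refill from [d] \ kept (a pool of size d - K/m + r, as in the proof).
    pool = [j for j in range(d) if j not in kept]
    kept.update(rng.sample(pool, r))
    return kept
```

With $\alpha = 0$ the mechanism deterministically returns the top-$\frac{K}{m}$ set (no randomness, hence no privacy guarantee, matching the restriction $\alpha \in (0, 1]$ in Theorem 2); larger $\alpha$ replaces more top coordinates with random ones.
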

3.3. FEDREP

As we have shown, ConSpar is compatible with each of robust aggregation and SecAgg. However, robust aggregation and SecAgg cannot simply be applied together, since SecAgg is originally designed for linear aggregation (such as summation and averaging) while robust aggregation is usually non-linear. This is also known as the tension between robustness and privacy (Kairouz et al., 2021). Buffers on the server are widely studied in Byzantine-robust machine learning (Karimireddy et al., 2022; Velicheti et al., 2021; Wang et al., 2020; Yang & Li, 2021) and can be used to make such a trade-off between robustness and privacy. We also introduce buffers in FedREP. The details of FedREP are illustrated in Algorithm 1 and Algorithm 2 in Appendix A.

Let integer $s$ denote the buffer size. For simplicity, we assume the client number $m$ is a multiple of the buffer size $s$, and hence there are $\frac{m}{s}$ buffers on the server. At the beginning of the $t$-th global iteration, each client $k$ locally trains the model using optimization algorithm $\mathcal{A}$ and training instances $\mathcal{D}_k$ based on $w^t$, and obtains model parameter $w_k^{t+1} = \mathcal{A}(w^t; \mathcal{D}_k)$. The update to be sent is $g_k^t = u_k^t + (w^t - w_k^{t+1})$, where $u_k^t$ is the local memory for error compensation with $u_k^0 = 0$. Then client $k$ generates the coordinate set $I_k^t$ by consensus sparsification (please see Section 3.2) and sends $I_k^t$ to the server. When the server receives all clients' suggested coordinate sets, it broadcasts $I^t = \cup_{k=1}^m I_k^t$, the set of coordinates to be transmitted in the current iteration, to all clients. In addition, the server randomly assigns a buffer to each client. More specifically, the server randomly picks a permutation $\pi$ of $[m]$ and assigns buffer $b_l$ to clients $\{\pi(ls + k)\}_{k=1}^s$ ($l = 0, 1, \ldots, \frac{m}{s} - 1$). Then each buffer $b_l = \frac{1}{s}\sum_{k=1}^s (g_{\pi(ls+k)}^t)_{I^t}$ is obtained by secure aggregation, and the global update $(\hat{G}^t)_{I^t} = \mathrm{Agg}(\{b_l\}_{l=1}^{m/s})$ is obtained by robust aggregation among the buffers.
During this time, clients update the local memory for error compensation by computing $u_k^{t+1} = g_k^t - \hat{g}_k^t$. Finally, $(\hat{G}^t)_{I^t}$ is broadcast to all clients for global updating by $w^{t+1} = w^t - \hat{G}^t$.

We notice that consensus sparsification is similar to cyclic local top-$K$ sparsification (CLT-K) (Chen et al., 2020), where all clients' non-zero coordinates are decided by one client in each communication round. However, there are significant differences between the two sparsification methods. CLT-K is designed to be compatible with all-reduce, while consensus sparsification is designed to be compatible with robust aggregation and SecAgg in FL. In addition, when there are Byzantine clients, CLT-K does not satisfy the $d'$-contraction property, since Byzantine clients may purposely send wrong non-zero coordinates. Meanwhile, some works (Karimireddy et al., 2022) show that averaging in groups before robust aggregation (as adopted in FedREP) can help to enhance the robustness of learning methods on heterogeneous datasets. We will further explore this aspect in future work since it is beyond the scope of this paper.

Finally, we would like to discuss more about privacy. In the ideal case, the server would learn nothing more than the aggregated result. However, obtaining this ideal privacy-preserving property is itself challenging, and even more so when we attempt to simultaneously guarantee Byzantine robustness and communication efficiency (Kairouz et al., 2021). In FedREP, the server has access to the partially aggregated mean $b_l$ and the coordinate set $I_k^t$. However, as far as we know, the extra risk of privacy leakage introduced by these two kinds of information is limited. The server does not know the momentum sent from each single client, and only has access to the coordinate set $I_k^t$ without knowing the corresponding values or even the signs.
Although further work is required to study how much information can be obtained from the coordinates, to the best of our knowledge, there are almost no existing methods that can recover training data based on the coordinates alone.
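Putting the pieces of Section 3.3 together, one FedREP server round (coordinate union, random buffer assignment, in-buffer averaging, robust aggregation across buffers) can be sketched as follows. Plain averaging stands in for SecAgg and the coordinate-wise median stands in for the generic robust aggregator Agg(·); both substitutions are ours, for illustration.

```python
import numpy as np

def fedrep_server_round(candidate_sets, sparse_updates, s, rng):
    """One FedREP aggregation round (sketch).

    Plain averaging stands in for SecAgg inside each buffer, and the
    coordinate-wise median stands in for the robust aggregator Agg(.).
    """
    m = len(sparse_updates)
    # Broadcast coordinate set: the union of the clients' candidate sets.
    I_t = sorted(set().union(*candidate_sets))
    updates = np.stack([u[I_t] for u in sparse_updates])
    # Randomly assign s clients to each of the m/s buffers, then
    # average within each buffer (the SecAgg stand-in).
    perm = rng.permutation(m)
    buffers = updates[perm].reshape(m // s, s, -1).mean(axis=1)
    # Robust aggregation across the m/s buffer averages.
    return I_t, np.median(buffers, axis=0)
```
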

4. CONVERGENCE

In this section, we theoretically prove the convergence of FedREP. Due to limited space, proof details are in Appendix B. Firstly, we present the definition of a $d'$-contraction operator (Stich et al., 2018).

Definition 3 ($d'$-contraction). $C: \mathbb{R}^d \rightarrow \mathbb{R}^d$ is called a $d'$-contraction operator ($0 < d' \leq d$) if
$$\mathbb{E}\|x - C(x)\|^2 \leq (1 - d'/d)\|x\|^2, \quad \forall x \in \mathbb{R}^d. \tag{4}$$

The $d'$-contraction property of consensus sparsification is shown in Proposition 2.

Proposition 2. If the fraction of Byzantine clients is not larger than $\delta$ ($0 \leq \delta < \frac{1}{2}$), consensus sparsification is a $d'_{\mathrm{cons}}$-contraction operator, where $d'_{\mathrm{cons}} = d\big(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}\big) + \frac{K}{m} e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}$.

Therefore, the existing convergence results for $d'$-contraction operators (Stich et al., 2018) can be directly applied to consensus sparsification when there is no Byzantine attack. Then we theoretically analyze the convergence of FedREP. For simplicity of analysis, we treat the secure aggregation and the robust aggregation on the server as one secure robust aggregator, denoted by $\mathrm{SRAgg}(\cdot)$. Therefore, $\hat{G}^t = \mathrm{SRAgg}(\{g_k^t\}_{k=1}^m)$. The assumptions are listed below.

Assumption 1 (Byzantine setting). The fraction of Byzantine clients is not larger than $\delta$ ($0 \leq \delta < \frac{1}{2}$) and the secure robust aggregator $\mathrm{SRAgg}(\cdot)$ is $(\delta, c)$-robust with constant $c \geq 0$.

Assumption 2 (Lower bound). $F(w)$ is bounded below: $\exists F^* \in \mathbb{R}$, $F(w) \geq F^*$, $\forall w \in \mathbb{R}^d$.

Assumption 3 (L-smoothness). The global loss function $F(w)$ is differentiable and $L$-smooth: $\|\nabla F(w) - \nabla F(w')\| \leq L\|w - w'\|$, $\forall w, w' \in \mathbb{R}^d$.

Assumption 4 (Bounded bias). $\forall k \in \mathcal{G}$, we have $\mathbb{E}[\nabla f_{i_k^t}(w)] = \nabla F_k(w)$ and there exists $B \geq 0$ such that $\|\nabla F_k(w) - \nabla F(w)\| \leq B$, $\forall w \in \mathbb{R}^d$.

Assumption 5 (Bounded gradient). $\forall k \in \mathcal{G}$, the expectation of the stochastic gradient $\nabla f_{i_k^t}(w)$ is bounded: $\exists D \in \mathbb{R}^+$ such that $\|\nabla F_k(w)\| \leq D$, $\forall w \in \mathbb{R}^d$.

Assumption 1 is common in Byzantine-robust distributed machine learning and is consistent with previous works (Karimireddy et al., 2022).
The remaining assumptions are common in distributed stochastic optimization. Assumption 5 is widely used in the analysis of gradient compression methods with error compensation.

We first analyze the convergence for a special case of FedREP where the training algorithm $\mathcal{A}$ is local SGD with learning rate $\eta$ and interval $I$. Specifically, $w_k^{t+1}$ is computed by the following process: (i) $w_k^{t+1,0} = w^t$; (ii) $w_k^{t+1,j+1} = w_k^{t+1,j} - \eta \cdot \nabla f_{i_k^{t,j}}(w_k^{t+1,j})$, $j = 0, 1, \ldots, I-1$; (iii) $w_k^{t+1} = w_k^{t+1,I}$, where $i_k^{t,j}$ is uniformly sampled from $\mathcal{D}_k$. Assumption 6 is made for this case.

Assumption 6 (Bounded variance). The stochastic gradient $\nabla f_{i_k^t}(w)$ is unbiased with bounded variance: $\mathbb{E}[\nabla f_{i_k^t}(w)] = \nabla F_k(w)$ and $\exists \sigma \in \mathbb{R}^+$ such that $\mathbb{E}\|\nabla f_{i_k^t}(w) - \nabla F_k(w)\|^2 \leq \sigma^2$, $\forall w \in \mathbb{R}^d$, $\forall k \in \mathcal{G}$.

According to Assumptions 5 and 6, the second-order moment of the stochastic gradient $\nabla f_{i_k^t}(w)$ is bounded by $(D^2 + \sigma^2)$. Let $\bar{u}^t = \frac{1}{|\mathcal{G}|}\sum_{k \in \mathcal{G}} u_k^t$ and let $e^t = \mathrm{SRAgg}(\{g_k^t\}_{k=1}^m) - \frac{1}{|\mathcal{G}|}\sum_{k \in \mathcal{G}} \hat{g}_k^t$ denote the aggregation error. We first show that $\mathbb{E}\|u_k^t\|^2$ and $\mathbb{E}\|e^t\|^2$ are both bounded above.

Lemma 1. Under Assumptions 1, 2, 3, 4, 5 and 6, let constant $H = d/d'_{\mathrm{cons}}$ and take learning rate $\eta_t = \eta > 0$. Then $\mathbb{E}\|u_k^t\|^2 \leq 4H^2I^2(D^2 + \sigma^2) \cdot \eta^2$, $\forall k \in \mathcal{G}$.

Lemma 2. Under the same conditions as in Lemma 1, we have $\mathbb{E}\|e^t\|^2 \leq 8c\delta I^2(4H^2 + 1)(D^2 + \sigma^2) \cdot \eta^2$.

Based on Lemma 1 and Lemma 2, we have the following theorem.

Theorem 3. For FedREP, under the same conditions as in Lemma 1 and Lemma 2, we have:
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla F(w^t)\|^2 \leq \frac{2[F(\hat{w}^0) - F^*]}{\eta I T} + \eta\gamma_1 + \eta^2\gamma_2 + \Delta,$$
where $\gamma_1 = 2IL \cdot [2(1 - I^{-1})LD + 2HD\sqrt{D^2 + \sigma^2} + (D^2 + \sigma^2) + 8c\delta(4H^2 + 1)(D^2 + \sigma^2)]$, $\gamma_2 = 8H^2I^2L^2(D^2 + \sigma^2)$ and $\Delta = 2BD + 4\sqrt{2c\delta(4H^2 + 1)(D^2 + \sigma^2)}D$.

Assumption 7 (Bounded gradient estimation). There exist constants $\eta_{\mathcal{A}} > 0$, $A_1 \geq 0$ and $A_2 > 0$ such that local training algorithm $\mathcal{A}$ satisfies $\|\mathbb{E}[G_{\mathcal{A}}(w; \mathcal{D}_k)] - \nabla F_k(w)\| \leq A_1$ and $\mathbb{E}\|G_{\mathcal{A}}(w; \mathcal{D}_k)\|^2 \leq (A_2)^2$, where $G_{\mathcal{A}}(w; \mathcal{D}_k) = (w - w')/\eta_{\mathcal{A}}$ with $w' = \mathcal{A}(w; \mathcal{D}_k)$, $\forall k \in [m]$.
In Assumption 7, $G_{\mathcal{A}}(w; \mathcal{D}_k)$ can be deemed an estimate of the gradient $\nabla F_k(w)$ by algorithm $\mathcal{A}$, with bounded bias $A_1$ and bounded second-order moment $(A_2)^2$. The expectation appears due to the randomness in algorithm $\mathcal{A}$. Many widely used algorithms satisfy Assumption 7. For vanilla SGD, let $\eta_{\mathcal{A}}$ be the learning rate; then $G_{\mathcal{A}}(w; \mathcal{D}_k)$ is exactly the stochastic gradient. Thus, we have $A_1 = 0$ and $(A_2)^2 = D^2 + \sigma^2$ under Assumptions 5 and 6. Moreover, previous works (Allen-Zhu et al., 2020; El-Mhamdi et al., 2020; Karimireddy et al., 2021) have shown that using history information such as momentum is necessary in Byzantine-robust machine learning. We show that local momentum SGD also satisfies Assumption 7 in Proposition 3 in Appendix B.

Theorem 4. Let constant $H = d/d'_{\mathrm{cons}}$. For FedREP, under Assumptions 1, 2, 3, 4, 5 and 7, we have:
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla F(w^t)\|^2 \leq \frac{2[F(\hat{w}^0) - F^*]}{\eta_{\mathcal{A}} T} + \eta_{\mathcal{A}}\gamma_{\mathcal{A},1} + (\eta_{\mathcal{A}})^2\gamma_{\mathcal{A},2} + \Delta_{\mathcal{A}},$$
where $\gamma_{\mathcal{A},1} = 2(A_2)^2L + 4HA_2DL + 16c\delta(4H^2 + 1)(A_2)^2L$, $\gamma_{\mathcal{A},2} = 8H^2(A_2)^2L^2$ and $\Delta_{\mathcal{A}} = 2A_1D + 2BD + 4\sqrt{2c\delta(4H^2 + 1)}A_2D$.

Compared to the error $\Delta$ in Theorem 3, there is an extra term $2A_1D$ in $\Delta_{\mathcal{A}}$, which is caused by the bias of gradient estimation in algorithm $\mathcal{A}$. Meanwhile, we would like to point out that Theorem 4 provides a convergence guarantee for general algorithms. For specific algorithms, tighter upper bounds may be obtained by adopting particular analysis techniques. We leave this for future work since we mainly focus on a general framework in this paper.
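Definition 3 can be checked numerically for the plain top-K operator, which is a K-contraction (an illustrative sketch in our own notation; the bound for ConSpar itself is the $d'_{\mathrm{cons}}$ expression in Proposition 2):

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(2)
d, k = 200, 20
# Definition 3 with d' = k: ||x - top_k(x)||^2 <= (1 - k/d) ||x||^2.
# For top-k this even holds deterministically, since the k kept
# coordinates capture at least a k/d fraction of the energy of x.
contraction_holds = all(
    np.linalg.norm(x - top_k(x, k)) ** 2
    <= (1 - k / d) * np.linalg.norm(x) ** 2 + 1e-9
    for x in rng.normal(size=(100, d))
)
```
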

5. EXPERIMENT

In this section, we evaluate the performance of FedREP and baselines on an image classification task. Each method is evaluated on the CIFAR-10 dataset (Krizhevsky et al., 2009) with a widely used deep learning model, ResNet-20 (He et al., 2016). Training instances are equally and uniformly distributed to each client. All experiments in this work are conducted with PyTorch on a distributed platform with dockers. More specifically, we set 32 dockers as clients, among which 7 clients are Byzantine. One extra docker is set to be the server. Each docker is bound to an NVIDIA Tesla K80 GPU. Unless otherwise stated, we set the local training algorithm $\mathcal{A}$ to be local momentum SGD with momentum hyper-parameter $\beta = 0.9$ (see Equation (182) in Appendix B) for FedREP. We run each method in the same environment for 120 epochs. The initial learning rate is chosen from $\{0.1, 0.2, 0.5, 1.0, 2.0, 5.0\}$. At the 80-th epoch, the learning rate is multiplied by 0.1, as suggested in (He et al., 2016). The best top-1 accuracy w.r.t. epoch is used as the final metric.

We test each method under the bit-flipping attack, the 'A Little is Enough' (ALIE) attack (Baruch et al., 2019) and the 'Fall of Empires' (FoE) attack (Xie et al., 2020a). The updates sent by Byzantine clients under the bit-flipping attack are in the opposite direction. ALIE and FoE are two omniscient attacks, where attackers are assumed to know the updates on all clients and use them for the attack. We set the attack magnitude hyper-parameter to 0.5 for the FoE attack. For FedREP, we test the performance when the robust aggregator is geometric median (geoMed) (Chen et al., 2017), coordinate-wise trimmed-mean (TMean) (Yin et al., 2018) and centered-clipping (CClip) (Karimireddy et al., 2021), respectively. More specifically, we adopt Weiszfeld's algorithm (Pillutla et al., 2019) with the iteration number set to 5 for computing geoMed. The trimming fraction in TMean is set to 7/16. For CClip, we set the clipping radius to 0.5 and the iteration number to 5.
The batch size is set to 25. We first empirically evaluate the effect of $\alpha$ on the performance of FedREP. We set $I = 1$, $s = 2$ and $K = 0.05d$, and compare the performance of FedREP when $\alpha = 0, 0.2, 0.5, 0.8, 0.95, 0.99$ and $1$. As illustrated in Figure 1, the performance of FedREP with CClip changes little when $\alpha$ ranges from 0 to 0.95. Final accuracy decreases rapidly when $\alpha$ continues to increase. A possible reason is that some coordinates could grow very large in the local error compensation memory when $\alpha$ is near 1. More results of FedREP with geoMed and TMean are presented in Appendix C.1. Since the effect of $\alpha$ is small when $0 \leq \alpha \leq 0.95$, we set $\alpha = 0$ in the following experiments. In addition, as the empirical results in Appendix C.2 show, Byzantine attacks on coordinates have little effect on the performance of FedREP. Thus, we assume no attacks on the coordinates in the following experiments.

Then we compare FedREP with a Byzantine-robust communication-efficient baseline called Robust Compressed Gradient Descent with Error Feedback (RCGD-EF) (Ghosh et al., 2021). For fairness, we set the compression operator $Q(\cdot)$ in RCGD-EF to top-$K$ sparsification, and set $\Gamma = 0.05$, the ratio of the number of transmitted dimensions to the total number of dimensions, for both FedREP and RCGD-EF. The local updating interval $I$ is set to 5 for each method. The results are illustrated in Figure 2.

Common settings for all the experiments in this work are presented at the beginning of Section 5. More settings for each single experiment are presented along with the empirical results in Section 5 and Appendix C. In addition, we provide the core part of our code in the supplementary material. All the proof details for the theoretical results in this work can be found in Appendix B.
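The bit-flipping attack described above, together with an ALIE-style perturbation, can be sketched as follows. These are illustrative sketches only: the ALIE variant is a simplified stand-in for the construction of Baruch et al. (2019), and `z` is a hypothetical scale parameter of ours.

```python
import numpy as np

def bit_flipping_attack(honest_update):
    """Bit-flipping attack used in Section 5: the Byzantine client
    sends the update in the opposite direction."""
    return -honest_update

def alie_like_attack(good_updates, z=1.0):
    """ALIE-style perturbation (simplified sketch of an omniscient
    attack; z is a hypothetical scale parameter): stay within z
    coordinate-wise standard deviations of the mean of the good
    updates so the malicious vector is hard to single out."""
    mu = good_updates.mean(axis=0)
    sigma = good_updates.std(axis=0)
    return mu - z * sigma
```
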

A DETAILS OF FEDREP

The detailed algorithms of FedREP on the server and on the clients are illustrated in Algorithm 1 and Algorithm 2, respectively. For any $k, k' \in [m]$ with $k \neq k'$,
$$\mathbb{E}\|C(v_k) - C(v_{k'})\|^2 = \mathbb{E}\Big[\sum_{j \in N_k \cap N_{k'}} [(v_k)_j - (v_{k'})_j]^2\Big] + \mathbb{E}\Big[\sum_{j \in N_k \setminus N_{k'}} (v_k)_j^2\Big] + \mathbb{E}\Big[\sum_{j \in N_{k'} \setminus N_k} (v_{k'})_j^2\Big] \quad (7)$$
$$= \mathbb{E}\Big[\sum_{j \in N_k \cap N_{k'}} \xi_{k,k',j}\,(\rho_{k,k'})^2\Big] + \mathbb{E}\Big[\sum_{j \in N_k \setminus N_{k'}} \zeta_{k,j}\,(\mu_k)^2\Big] + \mathbb{E}\Big[\sum_{j \in N_{k'} \setminus N_k} \zeta_{k',j}\,(\mu_{k'})^2\Big] \quad (8)$$
$$= \sum_{j \in [d]} \xi_{k,k',j}(\rho_{k,k'})^2 \cdot \Pr[j \in N_k \cap N_{k'}] + \sum_{j \in [d]} \zeta_{k,j}(\mu_k)^2 \cdot \Pr[j \in N_k \setminus N_{k'}] + \sum_{j \in [d]} \zeta_{k',j}(\mu_{k'})^2 \cdot \Pr[j \in N_{k'} \setminus N_k] \quad (9)$$
$$= (\rho_{k,k'})^2 \cdot \sum_{j \in [d]} \xi_{k,k',j}\Pr[j \in N_k \cap N_{k'}] + (\mu_k)^2 \cdot \sum_{j \in [d]} \zeta_{k,j}\Pr[j \in N_k \setminus N_{k'}] + (\mu_{k'})^2 \cdot \sum_{j \in [d]} \zeta_{k',j}\Pr[j \in N_{k'} \setminus N_k].$$

B.2 PROOF OF PROPOSITION 1

Proof. Let $\mathcal{I}^t$ denote the set of non-zero coordinates after consensus sparsification. In general cases, for any fixed $k, k' \in [m]$, we have:
$$\mathbb{E}\|\tilde{g}^t_k - \tilde{g}^t_{k'}\|^2 = \sum_{j \in \mathcal{I}^t} \mathbb{E}[(g^t_k)_j - (g^t_{k'})_j]^2 \quad (11)$$
$$\le \sum_{j \in [d]} \mathbb{E}[(g^t_k)_j - (g^t_{k'})_j]^2 \quad (12)$$
$$= \mathbb{E}\|g^t_k - g^t_{k'}\|^2.$$

B.3 PROOF OF THEOREM 2

Proof. Let $M$ be the mechanism in consensus sparsification that takes the set of top coordinates $T$ as an input and outputs a random coordinate set. Let $T_1, T_2 \subseteq [d]$ be two arbitrary adjacent input coordinate sets satisfying $|T_1| = |T_2| = \frac{K}{m}$, and let $S$ be any subset of the possible outputs of $M$. When $S$ is empty, $\Pr[M(T_1) \in S] = \Pr[M(T_2) \in S] = 0$. Thus, for any $\epsilon > 0$, we have:
$$0 = \Pr[M(T_1) \in S] \le \exp(\epsilon) \cdot \Pr[M(T_2) \in S] = 0. \quad (14)$$
Without loss of generality, we suppose that $S$ is non-empty. For any $\mathcal{I} \in S$, let $|T_1 \cap \mathcal{I}| = U_1$ and $|T_2 \cap \mathcal{I}| = U_2$. $T_1$ and $T_2$ differ in only one element since they are adjacent. Thus, $U_2 = U_1 - 1$, $U_1$ or $U_1 + 1$. The set $\tilde{T}_1$ is generated by randomly selecting $(\frac{K}{m} - r_1)$ elements from $T_1$, where $r_1$ follows the binomial distribution $B(\frac{K}{m}, \alpha)$. Thus, $\forall i = 0, 1, \ldots, K/m$,
$$\Pr[r_1 = i] = \binom{K/m}{i} \alpha^i (1-\alpha)^{K/m - i}. \quad (15)$$
To obtain $\mathcal{I}$ as the final output, only the elements in $T_1 \cap \mathcal{I}$ can be selected. Thus, $r_1$ must be at least $|T_1 \setminus \mathcal{I}| = \frac{K}{m} - U_1$. Furthermore, for $r_1 \ge \frac{K}{m} - U_1$, the probability that all selected elements come from $T_1 \cap \mathcal{I}$ is $\binom{U_1}{K/m - r_1} / \binom{K/m}{K/m - r_1} = \binom{U_1}{r_1 - (K/m - U_1)} / \binom{K/m}{r_1}$. Finally, the $r_1$ elements in $\mathcal{I} \setminus \tilde{T}_1$ are selected from $[d] \setminus \tilde{T}_1$, which happens with probability $1 / \binom{d - K/m + r_1}{r_1}$, since $|[d] \setminus \tilde{T}_1| = d - (K/m - r_1) = d - K/m + r_1$. Thus, we have:
$$\Pr[M(T_1) = \mathcal{I}] = \sum_{i = K/m - U_1}^{K/m} \Pr[r_1 = i] \times \frac{\binom{U_1}{i - (K/m - U_1)}}{\binom{K/m}{i}} \times \frac{1}{\binom{d - K/m + i}{i}} \quad (16)$$
$$= \sum_{i = K/m - U_1}^{K/m} \binom{K/m}{i} \alpha^i (1-\alpha)^{K/m - i} \times \frac{\binom{U_1}{i - (K/m - U_1)}}{\binom{K/m}{i}} \times \frac{1}{\binom{d - K/m + i}{i}} \quad (17)$$
$$= \sum_{i = K/m - U_1}^{K/m} \alpha^i (1-\alpha)^{K/m - i} \binom{U_1}{i - (K/m - U_1)} \times \frac{1}{\binom{d - K/m + i}{i}} \quad (18)$$
$$= \sum_{i = 0}^{U_1} \alpha^{K/m - i} (1-\alpha)^i \binom{U_1}{U_1 - i} \times \frac{1}{\binom{d - i}{K/m - i}} \quad (19)$$
$$= \alpha^{K/m} \cdot \sum_{i = 0}^{U_1} \Big(\frac{1-\alpha}{\alpha}\Big)^i \binom{U_1}{i} \times \frac{1}{\binom{d - i}{d - K/m}}. \quad (20)$$
Thus, $\Pr[M(T_1) = \mathcal{I}]$ is monotonically increasing with respect to $U_1$. Similarly,
$$\Pr[M(T_2) = \mathcal{I}] = \alpha^{K/m} \cdot \sum_{i = 0}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^i \binom{U_2}{i} \times \frac{1}{\binom{d - i}{d - K/m}}, \quad (21)$$
which is monotonically increasing with respect to $U_2$. Thus, $\frac{\Pr[M(T_1) = \mathcal{I}]}{\Pr[M(T_2) = \mathcal{I}]}$ takes its maximum value when $U_1 = U_2 + 1$.
Therefore,
$$\frac{\Pr[M(T_1) = \mathcal{I}]}{\Pr[M(T_2) = \mathcal{I}]} \le \frac{\alpha^{K/m} \cdot \sum_{i=0}^{U_2+1} (\frac{1-\alpha}{\alpha})^i \binom{U_2+1}{i} \frac{1}{\binom{d-i}{d-K/m}}}{\alpha^{K/m} \cdot \sum_{i=0}^{U_2} (\frac{1-\alpha}{\alpha})^i \binom{U_2}{i} \frac{1}{\binom{d-i}{d-K/m}}} \quad (22)$$
$$= \frac{\sum_{i=0}^{U_2+1} (\frac{1-\alpha}{\alpha})^i \frac{1}{\binom{d-i}{d-K/m}} \binom{U_2+1}{i}}{\sum_{i=0}^{U_2} (\frac{1-\alpha}{\alpha})^i \frac{1}{\binom{d-i}{d-K/m}} \binom{U_2}{i}} \quad (23)$$
$$= \frac{\frac{1}{\binom{d}{d-K/m}} + \sum_{i=1}^{U_2+1} (\frac{1-\alpha}{\alpha})^i \frac{1}{\binom{d-i}{d-K/m}} \binom{U_2+1}{i}}{\frac{1}{\binom{d}{d-K/m}} + \sum_{i=1}^{U_2} (\frac{1-\alpha}{\alpha})^i \frac{1}{\binom{d-i}{d-K/m}} \binom{U_2}{i}} \quad (24)$$
$$= \frac{1 + \frac{1-\alpha}{\alpha} \cdot \sum_{i=1}^{U_2+1} (\frac{1-\alpha}{\alpha})^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2+1}{i}}{1 + \frac{1-\alpha}{\alpha} \cdot \sum_{i=1}^{U_2} (\frac{1-\alpha}{\alpha})^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2}{i}}. \quad (25)$$
Let
$$S_0(\alpha) = \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \times \binom{U_2}{i} \quad (26)$$
and
$$S_1(\alpha) = \sum_{i=1}^{U_2+1} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \times \binom{U_2+1}{i}. \quad (27)$$
We have
$$\frac{\Pr[M(T_1) = \mathcal{I}]}{\Pr[M(T_2) = \mathcal{I}]} \le \frac{1 + \frac{1-\alpha}{\alpha} \cdot S_1(\alpha)}{1 + \frac{1-\alpha}{\alpha} \cdot S_0(\alpha)} \quad (28)$$
$$= 1 + \frac{\frac{1-\alpha}{\alpha} \cdot (S_1(\alpha) - S_0(\alpha))}{1 + \frac{1-\alpha}{\alpha} \cdot S_0(\alpha)} \quad (29)$$
$$\le 1 + \frac{\frac{1-\alpha}{\alpha} \cdot (S_1(\alpha) - S_0(\alpha))}{\frac{1-\alpha}{\alpha} \cdot S_0(\alpha)} \quad (30)$$
$$= \frac{S_1(\alpha)}{S_0(\alpha)}. \quad (31)$$
Since $U_1 = U_2 + 1 \le K/m$, we have $U_2 \le K/m - 1$. Thus,
$$S_1(\alpha) = \sum_{i=1}^{U_2+1} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2+1}{i} \quad (32)$$
$$\le \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2+1}{i} + \sum_{i=2}^{U_2+1} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2+1}{i} \quad (33)$$
$$= \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2+1}{i} + \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i} \frac{\binom{d}{d-K/m}}{\binom{d-i-1}{d-K/m}} \binom{U_2+1}{i+1} \quad (34)$$
$$= \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2}{i} \frac{U_2+1}{U_2+1-i} + \frac{1-\alpha}{\alpha} \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2}{i} \frac{U_2+1}{i+1} \cdot \frac{d-i}{K/m-i} \quad (35)$$
$$\le \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2}{i} \cdot (U_2+1) + \frac{1-\alpha}{\alpha} \sum_{i=1}^{U_2} \Big(\frac{1-\alpha}{\alpha}\Big)^{i-1} \frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}} \binom{U_2}{i} \cdot \frac{U_2+1}{2} \cdot \frac{d-U_2}{K/m-U_2} \quad (36)$$
$$= \Big[(U_2+1) + \frac{1-\alpha}{\alpha} \cdot \frac{U_2+1}{2} \cdot \frac{d-U_2}{K/m-U_2}\Big] \cdot S_0(\alpha) \quad (37)$$
$$\le \Big[\frac{K}{m} + \frac{1-\alpha}{\alpha} \cdot \frac{K/m}{2} \cdot \Big(d - \frac{K}{m} + 1\Big)\Big] \cdot S_0(\alpha) \quad (38)$$
$$= \Big[\frac{K}{m} + \frac{\frac{K}{m}(1-\alpha)(d-\frac{K}{m}+1)}{2\alpha}\Big] \cdot S_0(\alpha) \quad (39)$$
$$\le \Big[\frac{K}{m}\Big(d-\frac{K}{m}+1\Big) + \frac{\frac{K}{m}(1-\alpha)(d-\frac{K}{m}+1)}{2\alpha}\Big] \cdot S_0(\alpha) \quad (40)$$
$$= \frac{(1+\alpha) \cdot \frac{K}{m}(d-\frac{K}{m}+1)}{2\alpha} \cdot S_0(\alpha). \quad (41)$$
Here, (35) uses $\binom{U_2+1}{i} = \binom{U_2}{i}\frac{U_2+1}{U_2+1-i}$, $\binom{U_2+1}{i+1} = \binom{U_2}{i}\frac{U_2+1}{i+1}$ and $\binom{d-i-1}{d-K/m} = \binom{d-i}{d-K/m}\frac{K/m-i}{d-i}$; (36) uses $\frac{U_2+1}{U_2+1-i} \le U_2+1$, $\frac{U_2+1}{i+1} \le \frac{U_2+1}{2}$ and $\frac{d-i}{K/m-i} \le \frac{d-U_2}{K/m-U_2}$ for $1 \le i \le U_2$; and (38) substitutes $U_2 \le K/m - 1$. Therefore,
$$\frac{\Pr[M(T_1)=\mathcal{I}]}{\Pr[M(T_2)=\mathcal{I}]} \le \frac{S_1(\alpha)}{S_0(\alpha)} \le \frac{(1+\alpha)\cdot\frac{K}{m}(d-\frac{K}{m}+1)}{2\alpha}. \quad (42)$$
Consequently,
$$\Pr[M(T_1)=\mathcal{I}] \le \exp\Big(\ln\frac{(1+\alpha)\cdot\frac{K}{m}(d-\frac{K}{m}+1)}{2\alpha}\Big) \cdot \Pr[M(T_2)=\mathcal{I}], \quad (43)$$
which shows that $M$ provides $\ln\frac{(1+\alpha)\cdot\frac{K}{m}(d-\frac{K}{m}+1)}{2\alpha}$-DP.
Then we provide a stronger result for the case where $\alpha$ is close to 1. Specifically, when $\frac{1}{2} \le \alpha \le 1$,
$$\frac{\Pr[M(T_1)=\mathcal{I}]}{\Pr[M(T_2)=\mathcal{I}]} \le \frac{1+\frac{1-\alpha}{\alpha}\cdot(S_1(\alpha)-S_0(\alpha))+\frac{1-\alpha}{\alpha}S_0(\alpha)}{1+\frac{1-\alpha}{\alpha}S_0(\alpha)} \le 1 + \frac{1-\alpha}{\alpha}\cdot(S_1(\alpha)-S_0(\alpha)). \quad (44)$$
Since $U_1 = U_2+1 \le K/m$, we have $U_2 \le K/m-1$. Thus,
$$S_1(\alpha)-S_0(\alpha) = \sum_{i=1}^{U_2+1}\Big(\frac{1-\alpha}{\alpha}\Big)^{i-1}\frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}}\binom{U_2+1}{i} - \sum_{i=1}^{U_2}\Big(\frac{1-\alpha}{\alpha}\Big)^{i-1}\frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}}\binom{U_2}{i} \quad (45)$$
$$= \sum_{i=1}^{U_2}\Big(\frac{1-\alpha}{\alpha}\Big)^{i-1}\frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}}\Big[\binom{U_2+1}{i}-\binom{U_2}{i}\Big] + \Big(\frac{1-\alpha}{\alpha}\Big)^{U_2}\frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (46)$$
$$= \sum_{i=1}^{U_2}\Big(\frac{1-\alpha}{\alpha}\Big)^{i-1}\frac{\binom{d}{d-K/m}}{\binom{d-i}{d-K/m}}\binom{U_2}{i-1} + \Big(\frac{1-\alpha}{\alpha}\Big)^{U_2}\frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (47)$$
$$\le \frac{\binom{d}{d-K/m}}{\binom{d-U_2}{d-K/m}}\sum_{i=1}^{U_2}\Big(\frac{1-\alpha}{\alpha}\Big)^{i-1}\binom{U_2}{i-1} + \Big(\frac{1-\alpha}{\alpha}\Big)^{U_2}\frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (48)$$
$$\le \frac{\binom{d}{d-K/m}}{\binom{d-U_2}{d-K/m}}\sum_{i=0}^{U_2}\Big(\frac{1-\alpha}{\alpha}\Big)^{i}\binom{U_2}{i} + \Big(\frac{1-\alpha}{\alpha}\Big)^{U_2}\frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (49)$$
$$= \frac{\binom{d}{d-K/m}}{\binom{d-U_2}{d-K/m}}\Big(1+\frac{1-\alpha}{\alpha}\Big)^{U_2} + \Big(\frac{1-\alpha}{\alpha}\Big)^{U_2}\frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (50)$$
$$\le \frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}}\Big(\frac{1}{\alpha}\Big)^{U_2} + \Big(\frac{1-\alpha}{\alpha}\Big)^{U_2}\frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (51)$$
$$\le \frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}}\cdot 2^{U_2} + \frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (52)$$
$$= (2^{U_2}+1)\cdot\frac{\binom{d}{d-K/m}}{\binom{d-U_2-1}{d-K/m}} \quad (53)$$
$$\le (2^{K/m}+1)\cdot\binom{d}{d-K/m} \quad (54)$$
$$= (2^{K/m}+1)\cdot\binom{d}{K/m}, \quad (55)$$
where (47) uses Pascal's rule and (52) uses $\frac{1}{\alpha} \le 2$ and $\frac{1-\alpha}{\alpha} \le 1$ when $\frac{1}{2} \le \alpha \le 1$. Consequently,
$$\Pr[M(T_1)=\mathcal{I}] \le \exp\Big(\ln\Big[1+\frac{1-\alpha}{\alpha}\cdot(2^{K/m}+1)\cdot\binom{d}{K/m}\Big]\Big)\cdot\Pr[M(T_2)=\mathcal{I}]. \quad (56)$$
This shows that when $\frac{1}{2} \le \alpha \le 1$, $M$ provides $\epsilon_M$-DP, where $\epsilon_M = \ln[1+\frac{1-\alpha}{\alpha}\cdot(2^{K/m}+1)\cdot\binom{d}{K/m}]$. In particular, when $\alpha = 1$, $\epsilon_M = \ln(1+0) = 0$, which is consistent with the fact that the output coordinate set is completely random when $\alpha = 1$.
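The mechanism $M$ analyzed above can be sketched as follows. This is our own illustration of the sampling procedure only (the function names are ours), together with the DP level from the first part of the proof:

```python
import math
import random

def mechanism_M(T, d, alpha, rng):
    """Randomized coordinate selection: keep K/m - r random elements of the
    top-coordinate set T, with r ~ B(K/m, alpha), then add r coordinates
    drawn uniformly from outside the kept set."""
    km = len(T)                                       # K/m
    r = sum(rng.random() < alpha for _ in range(km))  # r ~ Binomial(K/m, alpha)
    kept = set(rng.sample(sorted(T), km - r))
    pool = [j for j in range(d) if j not in kept]
    kept.update(rng.sample(pool, r))
    return kept

def dp_epsilon(d, km, alpha):
    """epsilon = ln((1 + alpha) * (K/m) * (d - K/m + 1) / (2 * alpha))."""
    return math.log((1 + alpha) * km * (d - km + 1) / (2 * alpha))
```

Note that the output always contains exactly K/m coordinates, and that with α = 1 the output is independent of the input set T.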

B.4 PROOF OF PROPOSITION 2

Proof. $\forall k \in G$, $\forall 0 \le t < T$, we have:
$$\mathcal{I}^t = \bigcup_{k' \in [m]} \mathcal{I}^t_{k'} \supseteq \bigcup_{k' \in G} \mathcal{I}^t_{k'} = \Big(\bigcup_{k' \in G \setminus \{k\}} \mathcal{I}^t_{k'}\Big) \cup \mathcal{I}^t_k. \quad (57)$$
Therefore,
$$\mathbb{E}[\|\tilde{g}^t_k\|^2 \mid \mathcal{I}^t_k] = \mathbb{E}\Big[\sum_{j \in \mathcal{I}^t} (g^t_k)_j^2 \,\Big|\, \mathcal{I}^t_k\Big] \quad (58)$$
$$= \mathbb{E}\Big[\sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2 \,\Big|\, \mathcal{I}^t_k\Big] + \mathbb{E}\Big[\sum_{j \in \mathcal{I}^t \setminus \mathcal{I}^t_k} (g^t_k)_j^2 \,\Big|\, \mathcal{I}^t_k\Big] \quad (59)$$
$$= \sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2 + \mathbb{E}\Big[\sum_{j \in \mathcal{I}^t \setminus \mathcal{I}^t_k} (g^t_k)_j^2 \,\Big|\, \mathcal{I}^t_k\Big] \quad (60)$$
$$= \sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2 + \sum_{j \notin \mathcal{I}^t_k} (g^t_k)_j^2 \cdot \Pr[j \in \mathcal{I}^t \mid \mathcal{I}^t_k]. \quad (61)$$
For any $j \notin \mathcal{I}^t_k$,
$$\Pr[j \in \mathcal{I}^t \mid \mathcal{I}^t_k] \ge \Pr\Big[j \in \bigcup_{k' \in G \setminus \{k\}} \mathcal{I}^t_{k'} \,\Big|\, \mathcal{I}^t_k\Big] \quad (62)$$
$$= \Pr\Big[j \in \bigcup_{k' \in G \setminus \{k\}} \big(\tilde{T}^t_{k'} \cup R^t_{k'}\big) \,\Big|\, \mathcal{I}^t_k\Big] \quad (63)$$
$$= \Pr\Big[j \in \Big(\bigcup_{k' \in G \setminus \{k\}} \tilde{T}^t_{k'}\Big) \cup \Big(\bigcup_{k' \in G \setminus \{k\}} R^t_{k'}\Big) \,\Big|\, \mathcal{I}^t_k\Big] \quad (64)$$
$$= \Pr\Big[j \in \bigcup_{k' \in G \setminus \{k\}} \tilde{T}^t_{k'} \,\Big|\, \mathcal{I}^t_k\Big] + \Pr\Big[j \in \Big(\bigcup_{k' \in G \setminus \{k\}} R^t_{k'}\Big) \setminus \Big(\bigcup_{k' \in G \setminus \{k\}} \tilde{T}^t_{k'}\Big) \,\Big|\, \mathcal{I}^t_k\Big]. \quad (65)$$
For simplicity, let
$$\nu = \Pr\Big[j \in \bigcup_{k' \in G \setminus \{k\}} \tilde{T}^t_{k'} \,\Big|\, \mathcal{I}^t_k\Big] \in [0, 1], \quad (66)$$
and we have:
$$\Pr[j \in \mathcal{I}^t \mid \mathcal{I}^t_k] = \nu + (1-\nu) \cdot \Pr\Big[j \in \bigcup_{k' \in G \setminus \{k\}} R^t_{k'} \,\Big|\, \mathcal{I}^t_k,\ j \notin \bigcup_{k' \in G \setminus \{k\}} \tilde{T}^t_{k'}\Big] \quad (67)$$
$$= \nu + (1-\nu)\cdot\Big(1 - \Pr\Big[j \notin \bigcup_{k' \in G \setminus \{k\}} R^t_{k'} \,\Big|\, \mathcal{I}^t_k,\ j \notin \bigcup_{k' \in G \setminus \{k\}} \tilde{T}^t_{k'}\Big]\Big) \quad (68)$$
$$= \nu + (1-\nu)\cdot\Big(1 - \prod_{k' \in G \setminus \{k\}} \Pr\Big[j \notin R^t_{k'} \,\Big|\, \mathcal{I}^t_k,\ j \notin \bigcup_{k'' \in G \setminus \{k\}} \tilde{T}^t_{k''}\Big]\Big) \quad (69)$$
$$\overset{(i)}{=} \nu + (1-\nu)\cdot\Big(1 - \prod_{k' \in G \setminus \{k\}} \Big[\sum_{i=0}^{K/m} \Pr[r^t_{k'} = i]\cdot\Big(1 - \frac{i}{d - K/m + i}\Big)\Big]\Big) \quad (70)$$
$$\overset{(ii)}{\ge} \nu + (1-\nu)\cdot\Big(1 - \prod_{k' \in G \setminus \{k\}} \Big[\sum_{i=0}^{K/m} \Pr[r^t_{k'} = i]\cdot\Big(1 - \frac{i}{d}\Big)\Big]\Big) \quad (71)$$
$$= \nu + (1-\nu)\cdot\Big(1 - \prod_{k' \in G \setminus \{k\}} \Big[\sum_{i=0}^{K/m} \Pr[r^t_{k'} = i] - \frac{\sum_{i=0}^{K/m} i \cdot \Pr[r^t_{k'} = i]}{d}\Big]\Big) \quad (72)$$
$$= \nu + (1-\nu)\cdot\Big(1 - \prod_{k' \in G \setminus \{k\}} \Big(1 - \frac{\mathbb{E}[r^t_{k'}]}{d}\Big)\Big) \quad (73)$$
$$\overset{(iii)}{=} \nu + (1-\nu)\cdot\Big(1 - \prod_{k' \in G \setminus \{k\}} \Big(1 - \frac{\alpha K}{md}\Big)\Big) \quad (74)$$
$$= \nu + (1-\nu)\cdot\Big(1 - \Big(1 - \frac{\alpha K}{md}\Big)^{|G|-1}\Big) \quad (75)$$
$$\ge \nu + (1-\nu)\cdot\Big(1 - \Big(1 - \frac{\alpha K}{md}\Big)^{(1-\delta)m-1}\Big) \quad (76)$$
$$\overset{(iv)}{\ge} \nu\cdot\Big(1 - \Big(1 - \frac{\alpha K}{md}\Big)^{(1-\delta)m-1}\Big) + (1-\nu)\cdot\Big(1 - \Big(1 - \frac{\alpha K}{md}\Big)^{(1-\delta)m-1}\Big) \quad (77)$$
$$= 1 - \Big(1 - \frac{\alpha K}{md}\Big)^{(1-\delta)m-1}, \quad (78)$$
where (i) holds because, when $r^t_{k'} = i$, the probability that element $j$ is among the $i$ randomly selected elements from $[d] \setminus \tilde{T}^t_{k'}$ is $\frac{i}{d-K/m+i}$, since $|[d] \setminus \tilde{T}^t_{k'}| = d - K/m + i$; inequality (ii) holds because $i \le K/m$; equation (iii) holds because $r^t_{k'}$ follows the binomial distribution $B(\frac{K}{m}, \alpha)$; and inequality (iv) holds because $1 - (1 - \frac{\alpha K}{md})^{(1-\delta)m-1} \le 1$.

Since $0 \le \alpha \le 1$ and $0 < \frac{K}{m} < d$, we have $0 \le \frac{\alpha K}{md} < 1$. Thus,
$$\Big(1 - \frac{\alpha K}{md}\Big)^{(1-\delta)m-1} = \Big[\Big(1 - \frac{\alpha K}{md}\Big)^{-\frac{md}{\alpha K}}\Big]^{-\frac{\alpha K[(1-\delta)m-1]}{md}} \le e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}. \quad (79)$$
Therefore,
$$\Pr[j \in \mathcal{I}^t \mid \mathcal{I}^t_k] \ge 1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}. \quad (80)$$
Substituting it into (61), it is obtained that
$$\mathbb{E}[\|\tilde{g}^t_k\|^2 \mid \mathcal{I}^t_k] \ge \sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2 + \Big(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}\Big) \cdot \sum_{j \notin \mathcal{I}^t_k} (g^t_k)_j^2 \quad (81)$$
$$= \sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2 + \Big(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}\Big) \cdot \Big(\|g^t_k\|^2 - \sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2\Big) \quad (82)$$
$$= \Big(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}\Big) \cdot \|g^t_k\|^2 + e^{-\frac{\alpha K[(1-\delta)m-1]}{md}} \sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2. \quad (83)$$
Taking the total expectation, we have:
$$\mathbb{E}\|\tilde{g}^t_k\|^2 = \mathbb{E}\big[\mathbb{E}[\|\tilde{g}^t_k\|^2 \mid \mathcal{I}^t_k]\big] \ge \Big(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}\Big) \cdot \|g^t_k\|^2 + e^{-\frac{\alpha K[(1-\delta)m-1]}{md}} \cdot \mathbb{E}\Big[\sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2\Big]. \quad (84)$$
Also, conditioning on $r^t_k$ (recall that $\mathcal{I}^t_k = \tilde{T}^t_k \cup R^t_k$),
$$\mathbb{E}\Big[\sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] = \mathbb{E}\Big[\sum_{j \in \tilde{T}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] + \mathbb{E}\Big[\sum_{j \in R^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] \quad (85)$$
$$= \mathbb{E}\Big[\sum_{j \in \tilde{T}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] + \frac{r^t_k}{d - K/m + r^t_k} \cdot \mathbb{E}\Big[\sum_{j \notin \tilde{T}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] \quad (86)$$
$$= \mathbb{E}\Big[\sum_{j \in \tilde{T}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] + \frac{r^t_k}{d - K/m + r^t_k} \cdot \Big(\|g^t_k\|^2 - \mathbb{E}\Big[\sum_{j \in \tilde{T}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big]\Big) \quad (87)$$
$$= \frac{r^t_k}{d - K/m + r^t_k} \cdot \|g^t_k\|^2 + \frac{d - K/m}{d - K/m + r^t_k} \cdot \mathbb{E}\Big[\sum_{j \in \tilde{T}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] \quad (88)$$
$$= \frac{r^t_k}{d - K/m + r^t_k} \cdot \|g^t_k\|^2 + \frac{d - K/m}{d - K/m + r^t_k} \cdot \frac{K/m - r^t_k}{K/m} \cdot \mathbb{E}\Big[\sum_{j \in T^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big] \quad (89)$$
$$\ge \frac{r^t_k}{d - K/m + r^t_k} \cdot \|g^t_k\|^2 + \frac{d - K/m}{d - K/m + r^t_k} \cdot \frac{K/m - r^t_k}{K/m} \cdot \frac{K/m}{d} \cdot \|g^t_k\|^2 \quad (90)$$
$$= \frac{r^t_k}{d - K/m + r^t_k} \cdot \|g^t_k\|^2 + \frac{d - K/m}{d - K/m + r^t_k} \cdot \frac{K/m - r^t_k}{d} \cdot \|g^t_k\|^2 \quad (91)$$
$$= \frac{d r^t_k + (d - K/m)(K/m - r^t_k)}{d(d - K/m + r^t_k)} \cdot \|g^t_k\|^2 \quad (92)$$
$$= \frac{(K/m) \cdot (d - K/m + r^t_k)}{d(d - K/m + r^t_k)} \cdot \|g^t_k\|^2 \quad (93)$$
$$= \frac{K}{md} \|g^t_k\|^2, \quad (94)$$
where (90) holds because $T^t_k$ contains the top $K/m$ coordinates of $g^t_k$, so $\sum_{j \in T^t_k}(g^t_k)_j^2 \ge \frac{K/m}{d}\|g^t_k\|^2$. Thus,
$$\mathbb{E}\Big[\sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2\Big] = \mathbb{E}\Big[\mathbb{E}\Big[\sum_{j \in \mathcal{I}^t_k} (g^t_k)_j^2 \,\Big|\, r^t_k\Big]\Big] \ge \mathbb{E}\Big[\frac{K}{md}\|g^t_k\|^2\Big] = \frac{K}{md}\|g^t_k\|^2. \quad (95)$$
Substituting (95) into (84), we have:
$$\mathbb{E}\|\tilde{g}^t_k\|^2 \ge \Big(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}} + \frac{K}{md} e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}\Big) \cdot \|g^t_k\|^2 \quad (96)$$
$$= \Big(1 - \frac{(d - \frac{K}{m}) e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}}{d}\Big) \cdot \|g^t_k\|^2. \quad (97)$$
Since $\tilde{g}^t_k$ is the consensus sparsification result of $g^t_k$, we have:
$$\mathbb{E}\|\tilde{g}^t_k - g^t_k\|^2 = \mathbb{E}\Big[\sum_{j \in [d] \setminus \mathcal{I}^t} (g^t_k)_j^2\Big] \quad (98)$$
$$= \mathbb{E}\Big[\sum_{j \in [d]} (g^t_k)_j^2 - \sum_{j \in \mathcal{I}^t} (g^t_k)_j^2\Big] \quad (99)$$
$$= \|g^t_k\|^2 - \mathbb{E}\|\tilde{g}^t_k\|^2 \quad (100)$$
$$\le \frac{(d - \frac{K}{m}) e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}}{d} \cdot \|g^t_k\|^2 \quad (101)$$
$$= \Big(1 - \frac{d(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}) + \frac{K}{m} e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}}{d}\Big) \cdot \|g^t_k\|^2. \quad (102)$$
By definition, consensus sparsification is a $d_{\mathrm{cons}}$-contraction operator, where $d_{\mathrm{cons}} = d(1 - e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}) + \frac{K}{m} e^{-\frac{\alpha K[(1-\delta)m-1]}{md}}$.

B.5 PROOF OF LEMMA 1

Proof. When the training algorithm $A$ is $I$-iteration local SGD with learning rate $\eta_t$, we have $w^{t+1,j+1}_k = w^{t+1,j}_k - \eta_t \cdot \nabla f_{i^{t,j}_k}(w^{t+1,j}_k)$, where $i^{t,j}_k$ is uniformly sampled from $\mathcal{D}_k$. Therefore, we have the following inequality for all $k \in G$:
$$\mathbb{E}\|u^{t+1}_k\|^2 = \mathbb{E}\|g^t_k - \tilde{g}^t_k\|^2 \quad (103)$$
$$\overset{(i)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{d}\Big) \mathbb{E}\|g^t_k\|^2 \quad (104)$$
$$= \Big(1 - \frac{d_{\mathrm{cons}}}{d}\Big) \mathbb{E}\|u^t_k + (w^t - w^{t+1}_k)\|^2 \quad (105)$$
$$\overset{(ii)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{d}\Big)\Big[\Big(1 + \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \Big(1 + \frac{2d}{d_{\mathrm{cons}}}\Big)\mathbb{E}\|w^{t+1,0}_k - w^{t+1,I}_k\|^2\Big] \quad (106)$$
$$\overset{(iii)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{2d}{d_{\mathrm{cons}}}\mathbb{E}\|w^{t+1,0}_k - w^{t+1,I}_k\|^2 \quad (107)$$
$$\le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{2Id}{d_{\mathrm{cons}}}\sum_{j=0}^{I-1}\mathbb{E}\|w^{t+1,j}_k - w^{t+1,j+1}_k\|^2 \quad (108)$$
$$= \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{2Id}{d_{\mathrm{cons}}}\sum_{j=0}^{I-1}\mathbb{E}\|\eta_t \cdot \nabla f_{i^{t,j}_k}(w^{t+1,j}_k)\|^2 \quad (109)$$
$$\overset{(iv)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{2I^2 d}{d_{\mathrm{cons}}}(\eta_t)^2(D^2 + \sigma^2), \quad (110)$$
where (i) is derived based on Proposition 2, and (ii) is derived based on the inequality $\|x + y\|^2 \le (1+\theta)\|x\|^2 + (1+\theta^{-1})\|y\|^2$ for any constant $\theta > 0$.
(iii) is derived based on $(1 - \frac{d_{\mathrm{cons}}}{d})(1 + \frac{d_{\mathrm{cons}}}{2d}) < 1 - \frac{d_{\mathrm{cons}}}{2d}$ and $(1 - \frac{d_{\mathrm{cons}}}{d})(1 + \frac{2d}{d_{\mathrm{cons}}}) < \frac{2d}{d_{\mathrm{cons}}}$, and (iv) is derived based on Assumption 5 and Assumption 6. When $\eta_t = \frac{b}{\sqrt{t+\lambda}}$, where constant $b > 0$ and $\lambda = \frac{4d}{d_{\mathrm{cons}}}$, the second term on the RHS satisfies:
$$\frac{2I^2 d}{d_{\mathrm{cons}}}(\eta_t)^2(D^2+\sigma^2) = \frac{2I^2 d}{d_{\mathrm{cons}}}(D^2+\sigma^2) \cdot \frac{b^2}{t+\lambda} \quad (111)$$
$$= \frac{8I^2 d^2 b^2}{(d_{\mathrm{cons}})^2}(D^2+\sigma^2) \cdot \frac{1}{t+\lambda} \cdot \frac{d_{\mathrm{cons}}}{4d} \quad (112)$$
$$= \frac{8I^2 d^2 b^2}{(d_{\mathrm{cons}})^2}(D^2+\sigma^2) \cdot \frac{1}{t+\lambda} \cdot \Big(\frac{d_{\mathrm{cons}}}{2d} - \frac{d_{\mathrm{cons}}}{4d}\Big) \quad (113)$$
$$= \frac{8I^2 d^2 b^2}{(d_{\mathrm{cons}})^2}(D^2+\sigma^2) \cdot \frac{1}{t+\lambda} \cdot \Big(\frac{d_{\mathrm{cons}}}{2d} - \frac{1}{\lambda}\Big) \quad (114)$$
$$\le \frac{8I^2 d^2 b^2}{(d_{\mathrm{cons}})^2}(D^2+\sigma^2) \cdot \frac{1}{t+\lambda} \cdot \Big(\frac{d_{\mathrm{cons}}}{2d} - \frac{1}{t+\lambda+1}\Big) \quad (115)$$
$$= \frac{8I^2 d^2 b^2}{(d_{\mathrm{cons}})^2}(D^2+\sigma^2) \cdot \frac{\frac{d_{\mathrm{cons}}}{2d}(t+\lambda+1) - 1}{(t+\lambda)(t+\lambda+1)} \quad (116)$$
$$= \frac{8I^2 d^2 b^2}{(d_{\mathrm{cons}})^2}(D^2+\sigma^2) \cdot \Big[\frac{1}{t+\lambda+1} - \frac{1 - \frac{d_{\mathrm{cons}}}{2d}}{t+\lambda}\Big]. \quad (117)$$
Combining (110) and (117), we have
$$\mathbb{E}\|u^{t+1}_k\|^2 \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{8I^2 d^2 b^2}{(d_{\mathrm{cons}})^2}(D^2+\sigma^2) \cdot \Big[\frac{1}{t+\lambda+1} - \frac{1 - \frac{d_{\mathrm{cons}}}{2d}}{t+\lambda}\Big]. \quad (118)$$
Therefore,
$$\mathbb{E}\|u^{t+1}_k\|^2 - \frac{8I^2 d^2 b^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2 (t+\lambda+1)} \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\Big[\mathbb{E}\|u^t_k\|^2 - \frac{8I^2 d^2 b^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2 (t+\lambda)}\Big]. \quad (119)$$
Recursively using (119), we have
$$\mathbb{E}\|u^t_k\|^2 - \frac{8I^2 d^2 b^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2 (t+\lambda)} \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)^t \Big[\mathbb{E}\|u^0_k\|^2 - \frac{8I^2 d^2 b^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2 \lambda}\Big] < 0. \quad (120)$$
Thus,
$$\mathbb{E}\|u^t_k\|^2 \le \frac{8I^2 d^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2} \cdot \frac{b^2}{t+\lambda} = \frac{8I^2 d^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2} \cdot (\eta_t)^2. \quad (121)$$
Finally,
$$\mathbb{E}\|\bar{u}^t\|^2 = \mathbb{E}\Big\|\frac{1}{|G|}\sum_{k \in G} u^t_k\Big\|^2 \le \frac{1}{|G|}\sum_{k \in G}\mathbb{E}\|u^t_k\|^2 \le \frac{8I^2 d^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2} \cdot (\eta_t)^2. \quad (122)$$
When $\eta_t = \eta$, by (110), we have
$$\mathbb{E}\|u^{t+1}_k\|^2 \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{2I^2 d}{d_{\mathrm{cons}}}\eta^2(D^2+\sigma^2), \quad (123)$$
$$\mathbb{E}\|u^{t+1}_k\|^2 - \frac{4I^2 d^2}{(d_{\mathrm{cons}})^2}\eta^2(D^2+\sigma^2) \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big) \cdot \Big[\mathbb{E}\|u^t_k\|^2 - \frac{4I^2 d^2}{(d_{\mathrm{cons}})^2}\eta^2(D^2+\sigma^2)\Big]. \quad (124)$$
Recursively using (124), we have
$$\mathbb{E}\|u^t_k\|^2 - \frac{4I^2 d^2}{(d_{\mathrm{cons}})^2}\eta^2(D^2+\sigma^2) \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)^t \cdot \Big[\mathbb{E}\|u^0_k\|^2 - \frac{4I^2 d^2}{(d_{\mathrm{cons}})^2}\eta^2(D^2+\sigma^2)\Big] < 0. \quad (125)$$
Thus,
$$\mathbb{E}\|u^t_k\|^2 \le \frac{4I^2 d^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2} \cdot \eta^2. \quad (126)$$
Finally,
$$\mathbb{E}\|\bar{u}^t\|^2 = \mathbb{E}\Big\|\frac{1}{|G|}\sum_{k \in G} u^t_k\Big\|^2 \le \frac{1}{|G|}\sum_{k \in G}\mathbb{E}\|u^t_k\|^2 \le \frac{4I^2 d^2 (D^2+\sigma^2)}{(d_{\mathrm{cons}})^2} \cdot \eta^2. \quad (127)$$
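The contraction coefficient derived in Proposition 2 and used throughout the proof above can be evaluated numerically. The following is our own illustrative helper (the parameter values in the usage are hypothetical): with α = 0 it recovers plain top-K/m sparsification (d_cons = K/m), and it never exceeds d.

```python
import math

def d_cons(d, K, m, alpha, delta):
    """d_cons = d * (1 - e^{-A}) + (K/m) * e^{-A},
    where A = alpha * K * ((1 - delta) * m - 1) / (m * d)."""
    A = alpha * K * ((1 - delta) * m - 1) / (m * d)
    return d * (1 - math.exp(-A)) + (K / m) * math.exp(-A)
```

Because d > K/m, increasing α (more randomized coordinates, so a larger consensus set across clients) increases d_cons, i.e., the compression error bound 1 − d_cons/d shrinks.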
B.6 PROOF OF LEMMA 2

Proof. Based on Assumption 6 and Assumption 5, we have that $\forall k \in G$,
$$\mathbb{E}\|\tilde{g}^t_k\|^2 \le \mathbb{E}\|g^t_k\|^2 = \mathbb{E}\|u^t_k + (w^t - w^{t+1}_k)\|^2 \quad (128)$$
$$\le 2\mathbb{E}\|u^t_k\|^2 + 2\mathbb{E}\|w^t - w^{t+1}_k\|^2 \quad (129)$$
$$\le 2\mathbb{E}\|u^t_k\|^2 + 2I^2(\eta_t)^2(D^2+\sigma^2). \quad (130)$$
By Lemma 1, if $\eta_t = \frac{b}{\sqrt{t+\lambda}}$ where constant $b > 0$ and $\lambda = \frac{4d}{d_{\mathrm{cons}}}$, we have
$$\mathbb{E}\|\tilde{g}^t_k\|^2 \le 2I^2(8H^2+1)(D^2+\sigma^2) \cdot (\eta_t)^2, \quad \forall k \in G, \quad (131)$$
with $H = d/d_{\mathrm{cons}}$. Then, for any $k \ne k'$,
$$\mathbb{E}\|\tilde{g}^t_k - \tilde{g}^t_{k'}\|^2 \le 2\mathbb{E}\|\tilde{g}^t_k\|^2 + 2\mathbb{E}\|\tilde{g}^t_{k'}\|^2 \le 8I^2(8H^2+1)(D^2+\sigma^2) \cdot (\eta_t)^2. \quad (132)$$
Therefore, by Definition 1 and (132),
$$\mathbb{E}\|e^t\|^2 = \mathbb{E}\Big\|\mathrm{SRAgg}(\{\tilde{g}^t_k\}_{k=1}^m) - \frac{1}{|G|}\sum_{k \in G}\tilde{g}^t_k\Big\|^2 \quad (133)$$
$$\le c\delta \cdot \max_{k \ne k'}\mathbb{E}\|\tilde{g}^t_k - \tilde{g}^t_{k'}\|^2 \quad (134)$$
$$\le 8c\delta I^2(8H^2+1)(D^2+\sigma^2) \cdot (\eta_t)^2. \quad (135)$$
Similarly, if $\eta_t = \eta > 0$, we have
$$\mathbb{E}\|\tilde{g}^t_k\|^2 \le 2I^2(4H^2+1)(D^2+\sigma^2) \cdot \eta^2, \quad \forall k \in G, \quad (136)$$
and
$$\mathbb{E}\|e^t\|^2 \le 8c\delta I^2(4H^2+1)(D^2+\sigma^2) \cdot \eta^2. \quad (137)$$

B.7 PROOF OF THEOREM 3

Proof. Let $\bar{u}^t = \frac{1}{|G|}\sum_{k \in G} u^t_k$ be the average memory of the non-Byzantine clients, and define $\hat{w}^t = w^t - \bar{u}^t$. The iteration rule for $\hat{w}^t$ is derived as follows:
$$\hat{w}^{t+1} = w^{t+1} - \bar{u}^{t+1} \quad (138)$$
$$= w^t - \mathrm{SRAgg}(\{\tilde{g}^t_k\}_{k=1}^m) - \frac{1}{|G|}\sum_{k \in G}\big[u^t_k + (w^t - w^{t+1}_k) - \tilde{g}^t_k\big] \quad (139)$$
$$= w^t - \mathrm{SRAgg}(\{\tilde{g}^t_k\}_{k=1}^m) - \bar{u}^t - \frac{1}{|G|}\sum_{k \in G}(w^t - w^{t+1}_k) + \frac{1}{|G|}\sum_{k \in G}\tilde{g}^t_k \quad (140)$$
$$= (w^t - \bar{u}^t) - \frac{1}{|G|}\sum_{k \in G}(w^t - w^{t+1}_k) - \Big[\mathrm{SRAgg}(\{\tilde{g}^t_k\}_{k=1}^m) - \frac{1}{|G|}\sum_{k \in G}\tilde{g}^t_k\Big] \quad (141)$$
$$= \hat{w}^t - \Big(w^t - \frac{1}{|G|}\sum_{k \in G} w^{t+1}_k\Big) - e^t, \quad (142)$$
where $e^t = \mathrm{SRAgg}(\{\tilde{g}^t_k\}_{k=1}^m) - \frac{1}{|G|}\sum_{k \in G}\tilde{g}^t_k$ is the estimation error of $\frac{1}{|G|}\sum_{k \in G}\tilde{g}^t_k$. Let $\bar{G}^t = (w^t - \frac{1}{|G|}\sum_{k \in G} w^{t+1}_k)/(\eta I)$. Then we have
$$\bar{G}^t = \frac{1}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\nabla f_{i^{t,j}_k}(w^{t+1,j}_k), \quad t = 0, 1, \ldots, T-1, \quad (143)$$
and
$$\hat{w}^{t+1} = \hat{w}^t - \eta I \cdot \bar{G}^t - e^t, \quad t = 0, 1, \ldots, T-1. \quad (144)$$
This equation can be interpreted as follows: $\hat{w}^{t+1}$ is obtained by performing an SGD step on $\hat{w}^t$ with learning rate $\eta I$, gradient approximation $\bar{G}^t$ and error $e^t$.
Based on Assumption 3 and the inequality $\|x+y\|^2 \le 2\|x\|^2 + 2\|y\|^2$,
$$F(\hat{w}^{t+1}) = F(\hat{w}^t - \eta I \cdot \bar{G}^t - e^t) \quad (145)$$
$$\le F(\hat{w}^t) - \nabla F(\hat{w}^t)^T(\eta I \cdot \bar{G}^t + e^t) + \frac{L}{2}\|\eta I \cdot \bar{G}^t + e^t\|^2 \quad (146)$$
$$\le F(\hat{w}^t) - \eta I \cdot \nabla F(\hat{w}^t)^T\bar{G}^t - \nabla F(\hat{w}^t)^Te^t + \eta^2 I^2 L\|\bar{G}^t\|^2 + L\|e^t\|^2 \quad (147)$$
$$= F(\hat{w}^t) - \eta I \cdot \big(\|\nabla F(\hat{w}^t)\|^2 + \nabla F(\hat{w}^t)^T[\bar{G}^t - \nabla F(\hat{w}^t)]\big) - \nabla F(\hat{w}^t)^Te^t + \eta^2 I^2 L\|\bar{G}^t\|^2 + L\|e^t\|^2 \quad (148)$$
$$= F(\hat{w}^t) - \eta I \cdot \|\nabla F(\hat{w}^t)\|^2 - \eta I \cdot \nabla F(\hat{w}^t)^T[\bar{G}^t - \nabla F(\hat{w}^t)] - \nabla F(\hat{w}^t)^Te^t + \eta^2 I^2 L\|\bar{G}^t\|^2 + L\|e^t\|^2. \quad (149)$$
Taking expectation on both sides, we have
$$\mathbb{E}[F(\hat{w}^{t+1})\,|\,w^t, \bar{u}^t] \le F(\hat{w}^t) - \eta I \cdot \|\nabla F(\hat{w}^t)\|^2 - \eta I \cdot \mathbb{E}\big[\nabla F(\hat{w}^t)^T[\bar{G}^t - \nabla F(\hat{w}^t)]\,\big|\,w^t, \bar{u}^t\big] - \mathbb{E}[\nabla F(\hat{w}^t)^Te^t\,|\,w^t, \bar{u}^t] + \eta^2 I^2 L \cdot \mathbb{E}[\|\bar{G}^t\|^2\,|\,w^t, \bar{u}^t] + L \cdot \mathbb{E}[\|e^t\|^2\,|\,w^t, \bar{u}^t]. \quad (150)$$
Based on Assumption 3 and the inequality $-\|x\|^2 \le -\frac{1}{2}\|y\|^2 + \|x - y\|^2$, we have:
$$-\|\nabla F(\hat{w}^t)\|^2 = -\|\nabla F(w^t) + [\nabla F(\hat{w}^t) - \nabla F(w^t)]\|^2 \quad (151)$$
$$\le -\frac{1}{2}\|\nabla F(w^t)\|^2 + \|\nabla F(\hat{w}^t) - \nabla F(w^t)\|^2 \quad (152)$$
$$= -\frac{1}{2}\|\nabla F(w^t)\|^2 + \|\nabla F(w^t - \bar{u}^t) - \nabla F(w^t)\|^2 \quad (153)$$
$$\le -\frac{1}{2}\|\nabla F(w^t)\|^2 + L^2\|\bar{u}^t\|^2. \quad (154)$$
In addition, using Assumption 3, Assumption 4, Assumption 5 and Equation (143), we have:
$$-\mathbb{E}\big[\nabla F(\hat{w}^t)^T[\bar{G}^t - \nabla F(\hat{w}^t)]\,\big|\,w^t, \bar{u}^t\big] = -\nabla F(\hat{w}^t)^T \cdot \mathbb{E}[\bar{G}^t - \nabla F(\hat{w}^t)\,|\,w^t, \bar{u}^t] \quad (155)$$
$$\le \|\nabla F(\hat{w}^t)\| \cdot \big\|\mathbb{E}[\bar{G}^t - \nabla F(\hat{w}^t)\,|\,w^t, \bar{u}^t]\big\| \le D \cdot \Big\|\mathbb{E}\Big[\nabla F(\hat{w}^t) - \frac{1}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\nabla f_{i^{t,j}_k}(w^{t+1,j}_k)\,\Big|\,w^t, \bar{u}^t\Big]\Big\| \quad (156)$$
$$\le D \cdot \frac{1}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\mathbb{E}\big[\|\nabla F(w^t - \bar{u}^t) - \nabla F_k(w^{t+1,j}_k)\|\,\big|\,w^t, \bar{u}^t\big] \quad (157)$$
$$\le \frac{D}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\mathbb{E}\big[\|\nabla F(w^{t+1,0}_k - \bar{u}^t) - \nabla F(w^{t+1,j}_k)\|\,\big|\,w^t, \bar{u}^t\big] + \frac{D}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\mathbb{E}\big[\|\nabla F(w^{t+1,j}_k) - \nabla F_k(w^{t+1,j}_k)\|\,\big|\,w^t, \bar{u}^t\big] \quad (158)$$
$$\le \frac{D}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1} L \cdot \mathbb{E}\big[\|w^{t+1,0}_k - \bar{u}^t - w^{t+1,j}_k\|\,\big|\,w^t, \bar{u}^t\big] + BD \quad (159)$$
$$\le \frac{DL}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\big(\|\bar{u}^t\| + \mathbb{E}[\|w^{t+1,0}_k - w^{t+1,j}_k\|\,|\,w^t, \bar{u}^t]\big) + BD \quad (160)$$
$$\le \frac{DL}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\sum_{j'=0}^{j-1}\mathbb{E}[\|w^{t+1,j'}_k - w^{t+1,j'+1}_k\|\,|\,w^t, \bar{u}^t] + DL \cdot \|\bar{u}^t\| + BD. \quad (161)$$
With Assumption 5, we have:
$$\mathbb{E}[\|w^{t+1,j'}_k - w^{t+1,j'+1}_k\|\,|\,w^t, \bar{u}^t] = \mathbb{E}[\|\eta \cdot \nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k)\|\,|\,w^t, \bar{u}^t] \le \eta D. \quad (162)$$
Therefore,
$$-\mathbb{E}\big[\nabla F(\hat{w}^t)^T[\bar{G}^t - \nabla F(\hat{w}^t)]\,\big|\,w^t, \bar{u}^t\big] \le \frac{DL}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\sum_{j'=0}^{j-1}\eta D + DL \cdot \|\bar{u}^t\| + BD \quad (163)$$
$$= \frac{4L^2}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1} j\eta D + DL \cdot \|\bar{u}^t\| + BD \quad (164)$$
$$= \frac{4L^2}{I|G|} \cdot |G| \cdot \frac{I(I-1)}{2}\eta D + DL \cdot \|\bar{u}^t\| + BD \quad (165)$$
$$= 2(I-1)\eta D L^2 + DL \cdot \|\bar{u}^t\| + BD. \quad (166)$$
Note that $\mathbb{E}[XY] \le \sqrt{\mathbb{E}[X^2] \cdot \mathbb{E}[Y^2]}$. Using Assumption 5 and Lemma 2, we have:
$$-\mathbb{E}[\nabla F(\hat{w}^t)^Te^t\,|\,w^t, \bar{u}^t] \le \mathbb{E}[\|\nabla F(\hat{w}^t)\| \cdot \|e^t\|\,|\,w^t, \bar{u}^t] \quad (167)$$
$$\le \sqrt{\mathbb{E}[\|\nabla F(\hat{w}^t)\|^2\,|\,w^t, \bar{u}^t]} \cdot \sqrt{\mathbb{E}[\|e^t\|^2\,|\,w^t, \bar{u}^t]} \quad (168)$$
$$\le \sqrt{8c\delta I^2(4H^2+1)D^2(D^2+\sigma^2) \cdot \eta^2} \quad (169)$$
$$= \eta I\sqrt{8c\delta(4H^2+1)(D^2+\sigma^2)}\,D. \quad (170)$$
According to Assumption 5 and 6,
$$\mathbb{E}[\|\bar{G}^t\|^2\,|\,w^t, \bar{u}^t] = \mathbb{E}\Big[\Big\|\frac{1}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\nabla f_{i^{t,j}_k}(w^{t+1,j}_k)\Big\|^2\,\Big|\,w^t, \bar{u}^t\Big] \quad (171)$$
$$\le \frac{1}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}\mathbb{E}\big[\|\nabla f_{i^{t,j}_k}(w^{t+1,j}_k)\|^2\,\big|\,w^t, \bar{u}^t\big] \quad (172)$$
$$\le \frac{1}{I|G|}\sum_{k \in G}\sum_{j=0}^{I-1}(D^2+\sigma^2) \quad (173)$$
$$= D^2+\sigma^2. \quad (174)$$
Substituting (137), (154), (166), (170) and (174) into (150), we have:
$$\mathbb{E}[F(\hat{w}^{t+1})\,|\,w^t, \bar{u}^t] \le F(\hat{w}^t) - \frac{\eta I}{2}\|\nabla F(w^t)\|^2 + \eta I L^2\|\bar{u}^t\|^2 + \eta I\big[2(I-1)\eta D L^2 + DL\|\bar{u}^t\| + BD\big] + \eta I\sqrt{8c\delta(4H^2+1)(D^2+\sigma^2)}\,D + \eta^2 I^2 L(D^2+\sigma^2) + L \cdot \big[8c\delta I^2(4H^2+1)(D^2+\sigma^2) \cdot \eta^2\big]. \quad (175\text{--}176)$$
Note that $\mathbb{E}\|\bar{u}^t\| = \sqrt{[\mathbb{E}\|\bar{u}^t\|]^2} \le \sqrt{\mathbb{E}\|\bar{u}^t\|^2}$. Taking total expectation on both sides and using $\mathbb{E}\|\bar{u}^t\|^2 \le 4H^2 I^2(D^2+\sigma^2) \cdot \eta^2$, we have:
$$\mathbb{E}[F(\hat{w}^{t+1})] \le \mathbb{E}[F(\hat{w}^t)] - \frac{\eta I}{2}\mathbb{E}\|\nabla F(w^t)\|^2 + \eta I L^2\big[4H^2 I^2(D^2+\sigma^2) \cdot \eta^2\big] + \eta I\big[2(I-1)\eta D L^2 + DL \cdot 2HI\sqrt{D^2+\sigma^2}\,\eta + BD\big] + \eta I\sqrt{8c\delta(4H^2+1)(D^2+\sigma^2)}\,D + \eta^2 I^2 L(D^2+\sigma^2) + L \cdot \big[8c\delta I^2(4H^2+1)(D^2+\sigma^2) \cdot \eta^2\big]. \quad (177)$$
Namely,
$$\mathbb{E}[F(\hat{w}^{t+1})] \le \mathbb{E}[F(\hat{w}^t)] - \frac{\eta I}{2}\mathbb{E}\|\nabla F(w^t)\|^2 + (\eta I)^2 L\Big[2(1-I^{-1})DL + 2HD\sqrt{D^2+\sigma^2} + (D^2+\sigma^2) + 8c\delta(4H^2+1)(D^2+\sigma^2)\Big] + (\eta I)^3 \cdot 4H^2 L^2(D^2+\sigma^2) + (\eta I)\Big[BD + \sqrt{8c\delta(4H^2+1)(D^2+\sigma^2)}\,D\Big]. \quad (178)$$
By taking summation from $t = 0$ to $T-1$, we have:
$$\mathbb{E}[F(\hat{w}^T)] \le \mathbb{E}[F(\hat{w}^0)] - \frac{\eta I}{2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w^t)\|^2 + T(\eta I)^2 L\Big[2(1-I^{-1})DL + 2HD\sqrt{D^2+\sigma^2} + (D^2+\sigma^2) + 8c\delta(4H^2+1)(D^2+\sigma^2)\Big] + T(\eta I)^3 \cdot 4H^2 L^2(D^2+\sigma^2) + T(\eta I)\Big[BD + \sqrt{8c\delta(4H^2+1)(D^2+\sigma^2)}\,D\Big]. \quad (179)$$
Note that $\hat{w}^0 = w^0$ and $F(\hat{w}^T) \ge F^*$. Thus,
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w^t)\|^2 \le \frac{2[F(\hat{w}^0) - F^*]}{\eta I T} + \eta \cdot 2IL\Big[2(1-I^{-1})DL + 2HD\sqrt{D^2+\sigma^2} + (D^2+\sigma^2) + 8c\delta(4H^2+1)(D^2+\sigma^2)\Big] + \eta^2 \cdot 8H^2 I^2 L^2(D^2+\sigma^2) + 2\Big[BD + \sqrt{8c\delta(4H^2+1)(D^2+\sigma^2)}\,D\Big]. \quad (180)$$
In summary,
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w^t)\|^2 \le \frac{2[F(\hat{w}^0) - F^*]}{\eta I T} + \eta\gamma_1 + \eta^2\gamma_2 + \Delta, \quad (181)$$
where $\gamma_1 = 2IL \cdot [2(1-I^{-1})LD + 2HD\sqrt{D^2+\sigma^2} + (D^2+\sigma^2) + 8c\delta(4H^2+1)(D^2+\sigma^2)]$, $\gamma_2 = 8H^2 I^2 L^2(D^2+\sigma^2)$ and $\Delta = 2BD + 4\sqrt{2c\delta(4H^2+1)(D^2+\sigma^2)}\,D$.

B.8 ANALYSIS FOR LOCAL MOMENTUM SGD

We present the following proposition, which shows that Assumption 7 holds when $A$ is set to local momentum SGD.

Proposition 3. Under Assumptions 3, 5 and 6, local momentum SGD satisfies Assumption 7. Moreover, for local momentum SGD with learning rate $\eta > 0$, update interval $I \in \mathbb{N}^+$ and momentum hyper-parameter $\beta \in [0,1)$, we have $\eta_A = \eta I$,
$$A_1 = \frac{\beta(1-\beta^I)}{I(1-\beta)}D + \sqrt{D^2+\sigma^2} + \Big(\frac{I-1}{2} + \frac{\beta^2(1-\beta^{I-1})}{I(1-\beta)^2} - \frac{\beta(I-1)}{I(1-\beta)}\Big) \cdot L\sqrt{D^2+\sigma^2}$$
and $(A_2)^2 = D^2+\sigma^2$.

Proof. When $A$ is set to local momentum SGD with learning rate $\eta$, update interval $I$ and momentum hyper-parameter $\beta$, let $m^{0,j}_k = 0$ be the initial momentum, and $w^{t+1}_k$ ($t = 0, 1, \ldots, T-1$) is computed by the following process:
$$\begin{cases} m^{t+1,0}_k = m^{t,I}_k; \\ w^{t+1,0}_k = w^t; \\ m^{t+1,j+1}_k = \beta \cdot m^{t+1,j}_k + (1-\beta) \cdot \nabla f_{i^{t,j}_k}(w^{t+1,j}_k), & j = 0, 1, \ldots, I-1; \\ w^{t+1,j+1}_k = w^{t+1,j}_k - \eta \cdot m^{t+1,j+1}_k, & j = 0, 1, \ldots, I-1; \\ w^{t+1}_k = w^{t+1,I}_k. \end{cases} \quad (182)$$
Let $\eta_A = \eta I$; we have
$$G_A(w^t; \mathcal{D}_k) = (w^t - w^{t+1}_k)/(\eta I) = \frac{1}{\eta I}\sum_{j=0}^{I-1}(w^{t+1,j}_k - w^{t+1,j+1}_k) = \frac{1}{I}\sum_{j=0}^{I-1} m^{t+1,j+1}_k. \quad (183)$$
In addition,
$$m^{t+1,j+1}_k = \beta \cdot m^{t+1,j}_k + (1-\beta) \cdot \nabla f_{i^{t,j}_k}(w^{t+1,j}_k) \quad (184)$$
$$= \beta \cdot \big(\beta \cdot m^{t+1,j-1}_k + (1-\beta) \cdot \nabla f_{i^{t,j-1}_k}(w^{t+1,j-1}_k)\big) + (1-\beta) \cdot \nabla f_{i^{t,j}_k}(w^{t+1,j}_k) \quad (185)$$
$$= \beta^2 \cdot m^{t+1,j-1}_k + \beta(1-\beta) \cdot \nabla f_{i^{t,j-1}_k}(w^{t+1,j-1}_k) + (1-\beta) \cdot \nabla f_{i^{t,j}_k}(w^{t+1,j}_k) \quad (186)$$
$$= \cdots = \beta^{j+1} m^{t+1,0}_k + (1-\beta)\sum_{j'=0}^{j}\beta^{j-j'}\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k). \quad (187)$$
Now we prove that $\mathbb{E}\|m^{t,j}_k\|^2 \le D^2+\sigma^2$ ($j = 0, 1, \ldots, I$) by induction on $t$.

Step 1. When $t = 0$, we have $\mathbb{E}\|m^{0,j}_k\|^2 = 0 \le D^2+\sigma^2$. (188)

Step 2 (induction). Suppose $\mathbb{E}\|m^{t,j}_k\|^2 \le D^2+\sigma^2$. Then $\mathbb{E}\|m^{t+1,0}_k\|^2 = \mathbb{E}\|m^{t,I}_k\|^2 \le D^2+\sigma^2$ and $\mathbb{E}\|\nabla f_{i^{t,j}_k}(w^{t+1,j}_k)\|^2 \le D^2+\sigma^2$. Since
$$\beta^{j+1} + (1-\beta)\sum_{j'=0}^{j}\beta^{j-j'} = \beta^{j+1} + (1-\beta)\cdot\frac{1-\beta^{j+1}}{1-\beta} = 1,$$
by (187) and the convexity of $\|\cdot\|^2$, we have $\mathbb{E}\|m^{t+1,j}_k\|^2 \le D^2+\sigma^2$, $j = 1, 2, \ldots, I$. (189) By mathematical induction, $\forall t = 0, 1, \ldots, T$, we have $\mathbb{E}\|m^{t,j}_k\|^2 \le D^2+\sigma^2$, $j = 0, 1, 2, \ldots, I$. (190) Therefore,
$$\mathbb{E}\|G_A(w^t; \mathcal{D}_k)\|^2 = \mathbb{E}\Big\|\frac{1}{I}\sum_{j=0}^{I-1} m^{t+1,j+1}_k\Big\|^2 \le D^2+\sigma^2. \quad (191)$$
Substituting (187) into (183), we have:
$$G_A(w^t; \mathcal{D}_k) = \frac{1}{I}\sum_{j=0}^{I-1}\Big[\beta^{j+1} m^{t+1,0}_k + (1-\beta)\sum_{j'=0}^{j}\beta^{j-j'}\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k)\Big] \quad (192)$$
$$= \frac{\beta(1-\beta^I)}{I(1-\beta)} m^{t+1,0}_k + \frac{1-\beta}{I}\sum_{j=0}^{I-1}\Big[\sum_{j'=0}^{j}\beta^{j-j'}\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k)\Big] \quad (193)$$
$$= \frac{\beta(1-\beta^I)}{I(1-\beta)} m^{t+1,0}_k + \frac{1-\beta}{I}\sum_{j'=0}^{I-1}\Big[\sum_{j=j'}^{I-1}\beta^{j-j'}\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k)\Big] \quad (194)$$
$$= \frac{\beta(1-\beta^I)}{I(1-\beta)} m^{t+1,0}_k + \frac{1-\beta}{I}\sum_{j'=0}^{I-1}\frac{1-\beta^{I-j'}}{1-\beta}\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k) \quad (195)$$
$$= \frac{1}{I}\Big[\frac{\beta(1-\beta^I)}{1-\beta} m^{t+1,0}_k + \sum_{j'=0}^{I-1}(1-\beta^{I-j'})\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k)\Big]. \quad (196)$$
Therefore,
$$\mathbb{E}[G_A(w^t; \mathcal{D}_k) - \nabla F_k(w^t)] = \frac{1}{I}\Big\{\frac{\beta(1-\beta^I)}{1-\beta} \cdot \mathbb{E}[m^{t+1,0}_k - \nabla F_k(w^t)] + \sum_{j'=0}^{I-1}(1-\beta^{I-j'}) \cdot \mathbb{E}[\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k) - \nabla F_k(w^t)]\Big\}. \quad (197)$$
Since $\mathbb{E}\|m^{t+1,0}_k - \nabla F_k(w^t)\| \le \sqrt{D^2+\sigma^2} + D$ and
$$\mathbb{E}\|\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k) - \nabla F_k(w^t)\| \le \mathbb{E}\|\nabla f_{i^{t,j'}_k}(w^{t+1,j'}_k) - \nabla F_k(w^{t+1,j'}_k)\| + \mathbb{E}\|\nabla F_k(w^{t+1,j'}_k) - \nabla F_k(w^t)\| \quad (198)$$
$$\le \sqrt{D^2+\sigma^2} + L \cdot \mathbb{E}\|w^{t+1,j'}_k - w^t\| \quad (199)$$
$$\le \sqrt{D^2+\sigma^2} + L\sum_{j''=0}^{j'-1}\mathbb{E}\|m^{t+1,j''+1}_k\| \quad (200)$$
$$\le \sqrt{D^2+\sigma^2} + j'L\sqrt{D^2+\sigma^2}, \quad (201)$$
we have
$$\|\mathbb{E}[G_A(w^t; \mathcal{D}_k)] - \nabla F_k(w^t)\| \le \frac{1}{I}\Big\{\frac{\beta(1-\beta^I)}{1-\beta}\big[\sqrt{D^2+\sigma^2} + D\big] + \sum_{j'=0}^{I-1}(1-\beta^{I-j'})\big[\sqrt{D^2+\sigma^2} + j'L\sqrt{D^2+\sigma^2}\big]\Big\} \quad (202)$$
$$= \frac{1}{I}\Big\{\frac{\beta(1-\beta^I)}{1-\beta}D + I\sqrt{D^2+\sigma^2} + L\sqrt{D^2+\sigma^2} \cdot \sum_{j'=0}^{I-1}(j' - j'\beta^{I-j'})\Big\} \quad (203)$$
$$= \frac{1}{I}\Big\{\frac{\beta(1-\beta^I)}{1-\beta}D + I\sqrt{D^2+\sigma^2} + L\sqrt{D^2+\sigma^2} \cdot \Big[\frac{I(I-1)}{2} + \frac{\beta^2(1-\beta^{I-1})}{(1-\beta)^2} - \frac{\beta(I-1)}{1-\beta}\Big]\Big\} \quad (204)$$
$$= \frac{\beta(1-\beta^I)}{I(1-\beta)}D + \sqrt{D^2+\sigma^2} + \Big[\frac{I-1}{2} + \frac{\beta^2(1-\beta^{I-1})}{I(1-\beta)^2} - \frac{\beta(I-1)}{I(1-\beta)}\Big] \cdot L\sqrt{D^2+\sigma^2}. \quad (205)$$

B.9 PROOF OF THEOREM 4

Proof.
Similar to Lemma 1 and Lemma 2, we have the following inequalities bounding the local memory and the aggregation error, respectively, for a general training algorithm $A$ that satisfies Assumption 7:
$$\mathbb{E}\|u^{t+1}_k\|^2 = \mathbb{E}\|g^t_k - \tilde{g}^t_k\|^2 \quad (206)$$
$$\overset{(i)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{d}\Big)\mathbb{E}\|g^t_k\|^2 \quad (207)$$
$$= \Big(1 - \frac{d_{\mathrm{cons}}}{d}\Big)\mathbb{E}\|u^t_k + (w^t - w^{t+1}_k)\|^2 \quad (208)$$
$$\overset{(ii)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{d}\Big)\Big[\Big(1 + \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \Big(1 + \frac{2d}{d_{\mathrm{cons}}}\Big)\mathbb{E}\|\eta_A \cdot G_A(w^t; \mathcal{D}_k)\|^2\Big] \quad (209)$$
$$\overset{(iii)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{2d}{d_{\mathrm{cons}}}(\eta_A)^2 \cdot \mathbb{E}\|G_A(w^t; \mathcal{D}_k)\|^2 \quad (210)$$
$$\overset{(iv)}{\le} \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)\mathbb{E}\|u^t_k\|^2 + \frac{2d}{d_{\mathrm{cons}}}(\eta_A)^2(A_2)^2, \quad (211)$$
where (i) is derived based on Proposition 2; (ii) is derived based on $\|x+y\|^2 \le (1+\theta)\|x\|^2 + (1+\theta^{-1})\|y\|^2$ for any constant $\theta > 0$; (iii) is derived based on $(1 - \frac{d_{\mathrm{cons}}}{d})(1 + \frac{d_{\mathrm{cons}}}{2d}) < 1 - \frac{d_{\mathrm{cons}}}{2d}$ and $(1 - \frac{d_{\mathrm{cons}}}{d})(1 + \frac{2d}{d_{\mathrm{cons}}}) < \frac{2d}{d_{\mathrm{cons}}}$; and (iv) is derived based on Assumption 7. Therefore,
$$\mathbb{E}\|u^{t+1}_k\|^2 - \frac{4d^2}{(d_{\mathrm{cons}})^2}(\eta_A)^2(A_2)^2 \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big) \cdot \Big[\mathbb{E}\|u^t_k\|^2 - \frac{4d^2}{(d_{\mathrm{cons}})^2}(\eta_A)^2(A_2)^2\Big]. \quad (212)$$
Recursively using (212), we have
$$\mathbb{E}\|u^t_k\|^2 - \frac{4d^2}{(d_{\mathrm{cons}})^2}(\eta_A)^2(A_2)^2 \le \Big(1 - \frac{d_{\mathrm{cons}}}{2d}\Big)^t \cdot \Big[\mathbb{E}\|u^0_k\|^2 - \frac{4d^2}{(d_{\mathrm{cons}})^2}(\eta_A)^2(A_2)^2\Big] < 0. \quad (213)$$
Thus,
$$\mathbb{E}\|u^t_k\|^2 \le \frac{4d^2(A_2)^2}{(d_{\mathrm{cons}})^2} \cdot (\eta_A)^2. \quad (214)$$
Let $H = d/d_{\mathrm{cons}}$. Finally, it is obtained that
$$\mathbb{E}\|\bar{u}^t\|^2 = \mathbb{E}\Big\|\frac{1}{|G|}\sum_{k \in G} u^t_k\Big\|^2 \le \frac{1}{|G|}\sum_{k \in G}\mathbb{E}\|u^t_k\|^2 \le \frac{4d^2(A_2)^2}{(d_{\mathrm{cons}})^2} \cdot (\eta_A)^2 = 4H^2(A_2)^2(\eta_A)^2. \quad (215)$$
Based on Assumption 7, we have that $\forall k \in G$,
$$\mathbb{E}\|\tilde{g}^t_k\|^2 \le \mathbb{E}\|g^t_k\|^2 = \mathbb{E}\|u^t_k + (w^t - w^{t+1}_k)\|^2 \quad (216)$$
$$\le 2\mathbb{E}\|u^t_k\|^2 + 2\mathbb{E}\|w^t - w^{t+1}_k\|^2 \quad (217)$$
$$\le 2\mathbb{E}\|u^t_k\|^2 + 2(\eta_A)^2(A_2)^2 \quad (218)$$
$$\le 2(4H^2+1)(A_2)^2 \cdot (\eta_A)^2. \quad (219)$$
Thus, for any $k \ne k'$,
$$\mathbb{E}\|\tilde{g}^t_k - \tilde{g}^t_{k'}\|^2 \le 2\mathbb{E}\|\tilde{g}^t_k\|^2 + 2\mathbb{E}\|\tilde{g}^t_{k'}\|^2 \le 8(4H^2+1)(A_2)^2 \cdot (\eta_A)^2. \quad (220)$$
Therefore, by Definition 1 and (220),
$$\mathbb{E}\|e^t\|^2 = \mathbb{E}\Big\|\mathrm{SRAgg}(\{\tilde{g}^t_k\}_{k=1}^m) - \frac{1}{|G|}\sum_{k \in G}\tilde{g}^t_k\Big\|^2 \quad (221)$$
$$\le c\delta \cdot \max_{k \ne k'}\mathbb{E}\|\tilde{g}^t_k - \tilde{g}^t_{k'}\|^2 \quad (222)$$
$$\le 8c\delta(4H^2+1)(A_2)^2 \cdot (\eta_A)^2. \quad (223)$$
Let $\bar{w}^{t+1} = \frac{1}{|G|}\sum_{k \in G} w^{t+1}_k$. Combining with Equation (142), we have:
$$\hat{w}^{t+1} = \hat{w}^t - (w^t - \bar{w}^{t+1}) - e^t. \quad (224)$$
This equation can be interpreted as follows: $\hat{w}^{t+1}$ is obtained by adding the small term $-(w^t - \bar{w}^{t+1})$ to $\hat{w}^t$, with error $e^t$. Therefore,
$$F(\hat{w}^{t+1}) = F(\hat{w}^t - (w^t - \bar{w}^{t+1}) - e^t) \quad (225)$$
$$\le F(\hat{w}^t) - \nabla F(\hat{w}^t)^T(w^t - \bar{w}^{t+1} + e^t) + \frac{L}{2}\|w^t - \bar{w}^{t+1} + e^t\|^2 \quad (226)$$
$$\le F(\hat{w}^t) - \nabla F(\hat{w}^t)^T(w^t - \bar{w}^{t+1}) - \nabla F(\hat{w}^t)^Te^t + L\|w^t - \bar{w}^{t+1}\|^2 + L\|e^t\|^2 \quad (227)$$
$$= F(\hat{w}^t) - \frac{1}{|G|}\sum_{k \in G}\nabla F(\hat{w}^t)^T(w^t - w^{t+1}_k) - \nabla F(\hat{w}^t)^Te^t + L\Big\|w^t - \frac{1}{|G|}\sum_{k \in G} w^{t+1}_k\Big\|^2 + L\|e^t\|^2 \quad (228)$$
$$\le F(\hat{w}^t) - \eta_A\|\nabla F(\hat{w}^t)\|^2 - \frac{1}{|G|}\sum_{k \in G}\nabla F(\hat{w}^t)^T\big[w^t - w^{t+1}_k - \eta_A \cdot \nabla F(\hat{w}^t)\big] - \nabla F(\hat{w}^t)^Te^t + \frac{L}{|G|}\sum_{k \in G}\|w^t - w^{t+1}_k\|^2 + L\|e^t\|^2 \quad (229)$$
$$= F(\hat{w}^t) - \eta_A\|\nabla F(\hat{w}^t)\|^2 - \frac{\eta_A}{|G|}\sum_{k \in G}\nabla F(\hat{w}^t)^T\big[G_A(w^t; \mathcal{D}_k) - \nabla F(\hat{w}^t)\big] - \nabla F(\hat{w}^t)^Te^t + \frac{L}{|G|}\sum_{k \in G}\|\eta_A \cdot G_A(w^t; \mathcal{D}_k)\|^2 + L\|e^t\|^2. \quad (230)$$
Taking expectation on both sides, we have
$$\mathbb{E}[F(\hat{w}^{t+1})\,|\,w^t, \bar{u}^t] \le F(\hat{w}^t) - \eta_A\|\nabla F(\hat{w}^t)\|^2 - \frac{\eta_A}{|G|}\sum_{k \in G}\mathbb{E}\big[\nabla F(\hat{w}^t)^T[G_A(w^t; \mathcal{D}_k) - \nabla F(\hat{w}^t)]\,\big|\,w^t, \bar{u}^t\big] - \mathbb{E}[\nabla F(\hat{w}^t)^Te^t\,|\,w^t, \bar{u}^t] + \frac{(\eta_A)^2 L}{|G|}\sum_{k \in G}\mathbb{E}\big[\|G_A(w^t; \mathcal{D}_k)\|^2\,\big|\,w^t, \bar{u}^t\big] + L \cdot \mathbb{E}[\|e^t\|^2\,|\,w^t, \bar{u}^t]. \quad (231)$$
By using Assumption 3, Assumption 4 and Assumption 7, we have:
$$-\mathbb{E}\big[\nabla F(\hat{w}^t)^T[G_A(w^t; \mathcal{D}_k) - \nabla F(\hat{w}^t)]\,\big|\,w^t, \bar{u}^t\big] = -\nabla F(\hat{w}^t)^T\big(\mathbb{E}[G_A(w^t; \mathcal{D}_k)\,|\,w^t, \bar{u}^t] - \nabla F(\hat{w}^t)\big) \quad (232)$$
$$\le \|\nabla F(\hat{w}^t)\| \cdot \big\|\mathbb{E}[G_A(w^t; \mathcal{D}_k)\,|\,w^t, \bar{u}^t] - \nabla F(\hat{w}^t)\big\| \quad (233)$$
$$\le \|\nabla F(\hat{w}^t)\| \cdot \big(\|\mathbb{E}[G_A(w^t; \mathcal{D}_k)\,|\,w^t, \bar{u}^t] - \nabla F_k(w^t)\| + \|\nabla F_k(w^t) - \nabla F(w^t)\| + \|\nabla F(w^t) - \nabla F(\hat{w}^t)\|\big) \quad (234)$$
$$\le D \cdot (A_1 + B + L\|w^t - \hat{w}^t\|) \quad (235)$$
$$= A_1 D + BD + DL\|\bar{u}^t\|. \quad (236)$$
Note that $\mathbb{E}[XY] \le \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]}$. Based on Assumption 5, Assumption 7 and (223), we have:
$$-\mathbb{E}[\nabla F(\hat{w}^t)^Te^t\,|\,w^t, \bar{u}^t] \le \mathbb{E}[\|\nabla F(\hat{w}^t)\| \cdot \|e^t\|\,|\,w^t, \bar{u}^t] \quad (237)$$
$$\le \sqrt{\mathbb{E}[\|\nabla F(\hat{w}^t)\|^2\,|\,w^t, \bar{u}^t]} \cdot \sqrt{\mathbb{E}[\|e^t\|^2\,|\,w^t, \bar{u}^t]} \quad (238)$$
$$\le \sqrt{D^2 \cdot 8c\delta(4H^2+1)(A_2)^2(\eta_A)^2} \quad (239)$$
$$= \eta_A \cdot \sqrt{8c\delta(4H^2+1)}\,A_2 D. \quad (240)$$
According to Assumption 7,
$$\mathbb{E}[\|G_A(w^t; \mathcal{D}_k)\|^2\,|\,w^t, \bar{u}^t] \le (A_2)^2. \quad (241)$$
Substituting (154), (223), (236), (240) and (241) into (231), we have:
$$\mathbb{E}[F(\hat{w}^{t+1})\,|\,w^t, \bar{u}^t] \le F(\hat{w}^t) - \frac{\eta_A}{2}\|\nabla F(w^t)\|^2 + \eta_A L^2\|\bar{u}^t\|^2 + \eta_A\big[A_1 D + BD + DL\|\bar{u}^t\|\big] + \eta_A \cdot \sqrt{8c\delta(4H^2+1)}\,A_2 D + (\eta_A)^2 L(A_2)^2 + L \cdot \big[8c\delta(4H^2+1)(A_2)^2 \cdot (\eta_A)^2\big]. \quad (242)$$
Note that $\mathbb{E}\|\bar{u}^t\| = \sqrt{[\mathbb{E}\|\bar{u}^t\|]^2} \le \sqrt{\mathbb{E}\|\bar{u}^t\|^2}$ and that $\mathbb{E}\|\bar{u}^t\|^2 \le 4H^2(A_2)^2(\eta_A)^2$. Taking total expectation on both sides, we have:
$$\mathbb{E}[F(\hat{w}^{t+1})] \le \mathbb{E}[F(\hat{w}^t)] - \frac{\eta_A}{2}\mathbb{E}\|\nabla F(w^t)\|^2 + \eta_A L^2\big[4H^2(A_2)^2(\eta_A)^2\big] + \eta_A\big(A_1 D + BD + 2HA_2 DL\eta_A\big) + \eta_A \cdot \sqrt{8c\delta(4H^2+1)}\,A_2 D + (\eta_A)^2 L(A_2)^2 + L \cdot \big[8c\delta(4H^2+1)(A_2)^2 \cdot (\eta_A)^2\big]. \quad (243)$$
Namely,
$$\mathbb{E}[F(\hat{w}^{t+1})] \le \mathbb{E}[F(\hat{w}^t)] - \frac{\eta_A}{2}\mathbb{E}\|\nabla F(w^t)\|^2 + \eta_A\big[A_1 D + BD + \sqrt{8c\delta(4H^2+1)}\,A_2 D\big] + (\eta_A)^3\big[4H^2(A_2)^2 L^2\big] + (\eta_A)^2\big[(A_2)^2 L + 2HA_2 DL + 8c\delta(4H^2+1)(A_2)^2 L\big]. \quad (244)$$
By taking summation from $t = 0$ to $T-1$, we have:
$$\mathbb{E}[F(\hat{w}^T)] \le \mathbb{E}[F(\hat{w}^0)] - \frac{\eta_A}{2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w^t)\|^2 + T\eta_A\big[A_1 D + BD + \sqrt{8c\delta(4H^2+1)}\,A_2 D\big] + T(\eta_A)^3\big[4H^2(A_2)^2 L^2\big] + T(\eta_A)^2\big[(A_2)^2 L + 2HA_2 DL + 8c\delta(4H^2+1)(A_2)^2 L\big]. \quad (245)$$
Note that $\hat{w}^0 = w^0$ and $F(\hat{w}^T) \ge F^*$. Thus,
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w^t)\|^2 \le \frac{2[F(\hat{w}^0) - F^*]}{\eta_A T} + \big[2A_1 D + 2BD + 4\sqrt{2c\delta(4H^2+1)}\,A_2 D\big] + (\eta_A)^2 \cdot \big[8H^2(A_2)^2 L^2\big] + \eta_A \cdot \big[2(A_2)^2 L + 4HA_2 DL + 16c\delta(4H^2+1)(A_2)^2 L\big]. \quad (246)$$
In summary,
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w^t)\|^2 \le \frac{2[F(\hat{w}^0) - F^*]}{\eta_A T} + \eta_A\gamma_{A,1} + (\eta_A)^2\gamma_{A,2} + \Delta_A,$$
where $\gamma_{A,1} = 2(A_2)^2 L + 4HA_2 DL + 16c\delta(4H^2+1)(A_2)^2 L$, $\gamma_{A,2} = 8H^2(A_2)^2 L^2$ and $\Delta_A = 2A_1 D + 2BD + 4\sqrt{2c\delta(4H^2+1)}\,A_2 D$.

C MORE EXPERIMENTAL RESULTS

In this section, we present more empirical results, which are consistent with those in the main text of this paper and further support our conclusions.

C.1 MORE EXPERIMENTS ABOUT THE EFFECT OF ALPHA

In this section, we present more empirical results for FedREP with the aggregators geoMed and TMean. The experimental settings are the same as those in the main text. As illustrated in Figure 3, the empirical results are consistent with those in the main text. In addition, we have noticed that the performance of FedREP with TMean is not stable enough under the ALIE attack. A possible reason is that the TMean aggregator is not robust enough against the ALIE attack, since FedREP with each of the other two aggregators (geoMed and CClip) yields relatively stable empirical results. We will further study this phenomenon in future work.

C.2 EXPERIMENTS ABOUT BYZANTINE ATTACKS ON COORDINATES

In each iteration of FedREP, clients send the coordinate set I^t_k to the server. However, Byzantine clients may send arbitrary coordinates. Although the theoretical analysis in the main text covers this case, we also provide empirical results on Byzantine behaviour in sending coordinates. We set α = 0 for non-Byzantine clients and consider four different Byzantine settings, in which Byzantine clients send the correct coordinates (noAtk), send the coordinates with the K/m smallest absolute values (minAtk), send random coordinates (randAtk), and send the same coordinates as a non-Byzantine client (sameAtk), respectively. We set K = 0.065d, while the other settings are the same as those in Section 5 in the main text. As illustrated in Figure 4, although Byzantine attacks on coordinates slightly change the communication cost, they have little effect on the convergence rate and the final top-1 accuracy. The main reason is that the top-K/m coordinates of each non-Byzantine client are always sent to the server in FedREP, no matter what is sent by the Byzantine clients.
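The four coordinate-reporting behaviours tested above can be sketched as follows (an illustrative helper of ours, not the experiment code; `honest_sets` stands for the coordinate sets of non-Byzantine clients):

```python
import numpy as np

def byzantine_coordinates(v, km, mode, rng, honest_sets=None):
    """Coordinate set reported by a Byzantine client under one of the four settings."""
    a = np.abs(v)
    if mode == "noAtk":    # the correct top-K/m coordinates
        return set(np.argpartition(a, -km)[-km:].tolist())
    if mode == "minAtk":   # the K/m smallest absolute values
        return set(np.argpartition(a, km)[:km].tolist())
    if mode == "randAtk":  # K/m random coordinates
        return set(rng.choice(len(v), size=km, replace=False).tolist())
    if mode == "sameAtk":  # copy a non-Byzantine client's set
        return set(honest_sets[0])
    raise ValueError(mode)
```

Because the server takes the union of all reported sets, none of these behaviours can remove an honest client's top coordinates from the consensus set; they can only enlarge it, which matches the small effect on communication cost observed above.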

C.3 EXPERIMENTS ABOUT LOCAL MOMENTUM

Previous works (Karimireddy et al., 2021) have shown that using momentum can help to reduce the variance of stochastic gradients and thus obtain stronger Byzantine robustness. We also provide empirical results about momentum in this section. The experimental settings are kept the same as in Section 5 in the main text. As illustrated in Figure 5, using local momentum makes FedREP more robust to the ALIE attack, which is consistent with previous works (Karimireddy et al., 2021).
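The local momentum SGD update of Equation (182) can be sketched as below, with `grad_fn` standing in for the stochastic gradient on a sampled training instance (this is an illustration of the update rule, not the experiment code):

```python
import numpy as np

def local_momentum_sgd(w, grad_fn, eta, beta, I, m):
    """Run I local steps of momentum SGD:
    m <- beta * m + (1 - beta) * g;  w <- w - eta * m."""
    for _ in range(I):
        m = beta * m + (1.0 - beta) * grad_fn(w)
        w = w - eta * m
    return w, m
```

Averaging the stochastic gradients through `m` is what reduces their variance, which is the mechanism behind the improved robustness to ALIE observed above.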

C.4 COMPARISON WITH SPARSESECAGG

We first empirically compare the performance of FedREP with the communication-efficient privacy-preserving FL baseline SparseSecAgg (Ergun et al., 2021) when there is no attack. We test the performance of FedREP with buffer size s = 4, 8 and 16, respectively. We set Γ = 0.05 and 0.1 for SparseSecAgg in the two experiments, respectively. Correspondingly, we set K = 0.05d and K = 0.1d in the two experiments for FedREP, since the number of transmitted dimensions in FedREP is uncertain but never larger than K. Thus, we have Γ ≤ K/d for FedREP. The top-1 accuracy w.r.t. epochs is illustrated in Figure 6. The results show that FedREP can significantly outperform the existing communication-efficient privacy-preserving baseline SparseSecAgg when there is no Byzantine attack. In addition, we tried different learning rates for SparseSecAgg; it performs best when the learning rate equals 5. As illustrated in Figure 7, FedREP significantly outperforms SparseSecAgg on top-1 accuracy when the communication cost is similar, and the communication cost of SparseSecAgg is much higher than that of FedREP when the top-1 accuracy is comparable. One reason is that FedREP is based on top-K sparsification while SparseSecAgg is based on random-K sparsification. Another reason is that FedREP adopts the error-compensation technique while SparseSecAgg does not.



The sketching technique can be used in different ways, either to reduce communication cost or to protect privacy. Thus, sketching appears in both communication-efficient methods and privacy-preserving methods.

We use a tilde to denote sparse vectors in this paper for easy distinction.

Random quantization can simply be adopted before secure aggregation to map the values onto a finite field for stronger privacy preservation. However, for simplicity, we do not include it in the description here.



Under review as a conference paper at ICLR 2023

When taking $\eta = O(1/\sqrt{T})$, Theorem 3 guarantees that FedREP has a convergence rate of $O(1/\sqrt{T})$ with an extra error $\Delta$, which consists of two terms. The first term $2BD$ comes from the bias of stochastic gradients, which reflects the degree of heterogeneity between clients. This term vanishes in i.i.d. cases, where $B = 0$. The second term $4\sqrt{2c\delta(4H^2+1)(D^2+\sigma^2)}\,D$ comes from the aggregation error. This term vanishes when there is no Byzantine client ($\delta = 0$). Namely, the extra error $\Delta$ vanishes in i.i.d. cases without Byzantine clients. Then we analyze the convergence of FedREP with general local training algorithms that satisfy Assumption 7, which characterizes two important properties of a training algorithm. Assumption 7. Let $w' = \mathcal{A}(w; \mathcal{D}_k)$. There exist constants $\eta$
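Collecting the terms discussed above, the guarantee can be summarized as follows (the precise left-hand side and constants are those of Theorem 3 in the main text; this display only restates the structure of the bound):

```latex
\min_{0 \le t \le T-1} \mathbb{E}\,\big\|\nabla F(w^t)\big\|^2
\;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right) + \Delta,
\qquad
\Delta \;=\; \underbrace{2BD}_{\text{gradient bias}}
\;+\; \underbrace{4\sqrt{2c\delta(4H^2+1)(D^2+\sigma^2)}\,D}_{\text{aggregation error}} .
```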

Figure 1: Top-1 accuracy w.r.t. epochs of FedREP with CClip when there are 7 Byzantine clients under bit-flipping attack (left), ALIE attack (middle) and FoE attack (right).

Figure 2: Top-1 accuracy w.r.t. epochs of FedREP and RCGD-EF when there are 7 Byzantine clients under bit-flipping attack (left), ALIE attack (middle) and FoE attack (right).

When training algorithm $\mathcal{A}$ is $I$-iteration local SGD with learning rate $\eta$, we have $w^{t+1}$

$d_{\mathrm{cons}}$. (iv) is derived based on Assumption 5 and Assumption 6. When $\eta_t = \frac{b}{\sqrt{t+\lambda}}$, where constant $b > 0$ and $\lambda = \frac{4d}{d_{\mathrm{cons}}}$

$\mathbb{E}\|m_k^{t,j}\|^2 \le D^2 + \sigma^2$, $j = 1, 2, \ldots, I$. (189) By mathematical induction, $\forall t = 0, 1, \ldots, T$, we have $\mathbb{E}\|m_k^{t,j}\|^2 \le D^2 + \sigma^2$, $j = 0, 1, 2, \ldots, I$.

Figure 3: Top-1 accuracy w.r.t. epochs of FedREP with geoMed (top row) and TMean (bottom row) under bit-flipping attack (left column), ALIE attack (middle column) and FoE attack (right column).

Figure 4: Top-1 accuracy w.r.t. epochs of FedREP with different Byzantine behaviour on sending coordinates when there are 7 Byzantine clients with bit-flipping attack (left), ALIE attack (middle) and FoE attack (right), respectively.

Figure 5: Top-1 accuracy w.r.t. epochs when there are 7 Byzantine clients with ALIE attack. β is the hyper-parameter of local momentum. Local momentum is not used when β = 0. The robust aggregator in FedREP is set to be geoMed (left), TMean (middle) and CClip (right), respectively.

Table: Comparison among different methods in terms of the three aspects of federated learning (column headers: Method, Byzt.-robust, Comm.-efficient, Privacy-preserving; the first listed method is RCGD).

on client $k$ at the $t$-th iteration. Initially, $u_k^0 = 0$.

FedREP has comparable performance to RCGD-EF under the bit-flipping and FoE attacks, but outperforms RCGD-EF under the ALIE attack. Meanwhile, FedREP is naturally a two-way sparsification method while RCGD-EF is not. Moreover, FedREP provides extra privacy preservation compared to RCGD-EF. Due to limited space, more empirical results are presented in the appendices. Empirical results in Appendix C.3 show the effect of the momentum hyper-parameter β. Empirical results in Appendix C.4 show that FedREP can significantly outperform the communication-efficient privacy-preserving baseline SparseSecAgg (Ergun et al., 2021). Empirical results in Appendix C.5 show that compared with the Byzantine-robust privacy-preserving baseline SHARE (Velicheti et al., 2021), FedREP has a comparable convergence rate and accuracy with much smaller communication cost.

Reproducibility Statement. In this work, we empirically test the performance of FedREP and the baselines on the public dataset CIFAR-10 (Krizhevsky et al., 2009) with a widely used deep learning model

Algorithm 1: FedREP.
Server: for each iteration $t$, receive the sparse updates into buffers $\{b_l\}$ via the SecAgg protocol; compute $(\tilde{G}^t)_{\mathcal{I}^t} = \mathrm{Agg}(\{b_l\})$ and broadcast it together with $\mathcal{I}^t$.
Client $k$: receive the coordinate set $\mathcal{I}^t$ and the assigned buffer number $l$ from the server; compute $(g_k^t)_{\mathcal{I}^t}$ and send it to the assigned buffer $b_l$ via the SecAgg protocol; receive $(\tilde{G}^t)_{\mathcal{I}^t}$ from the server and recover $\tilde{G}^t$ according to $\mathcal{I}^t$; /* Update model parameters */ update parameters: $w^{t+1} = w^t - \tilde{G}^t$.
Output model parameter $w^T$.

B PROOF DETAILS

In this section, we present the proof details of the theoretical results in the paper.
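To make the FedREP round described above concrete, here is a hedged sketch of one round in plain Python. SecAgg is abstracted away (each buffer simply sums the sparse updates assigned to it, which is the view the server obtains after secure aggregation), `coord_median` stands in for a generic robust aggregator `Agg`, and the round-robin buffer assignment is an illustrative assumption:

```python
import statistics

def coord_median(buffers):
    """Coordinate-wise median over buffers, standing in for Agg()."""
    return [statistics.median(col) for col in zip(*buffers)]

def fedrep_round(w, client_grads, coord_sets, num_buffers, agg=coord_median):
    # Server: consensus coordinate set I_t (union of clients' coordinates).
    I_t = sorted(set().union(*coord_sets))
    # Clients: restrict updates to I_t and send them to assigned buffers;
    # under SecAgg the server only sees each buffer's sum.
    buffers = [[0.0] * len(I_t) for _ in range(num_buffers)]
    for k, g in enumerate(client_grads):
        l = k % num_buffers                  # assigned buffer number
        for j, c in enumerate(I_t):
            buffers[l][j] += g[c]
    # Server: robust aggregation over buffers, then broadcast (G_t)_{I_t}.
    G_t = agg(buffers)
    # Clients: recover the sparse update and apply w_{t+1} = w_t - G_t.
    w = list(w)
    for j, c in enumerate(I_t):
        w[c] -= G_t[j]
    return w
```

The sketch omits learning rates, error feedback and the actual cryptographic aggregation; it only shows how sparsity (restriction to $\mathcal{I}^t$), buffering and robust aggregation compose in one round.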

C.5 COMPARISON WITH SHARE

We empirically compare the performance of FedREP and SHARE (Velicheti et al., 2021). The results are illustrated in Figure 9, Figure 10 and Figure 11, respectively. As we can see from the empirical results, compared to SHARE, there is almost no loss in convergence rate and final accuracy when Γ is about 0.079 for FedREP. In addition, there is only a small loss in final accuracy when Γ is as low as about 0.017. Interestingly, under the ALIE attack, the empirical results show that FedREP achieves an even higher final accuracy than SHARE. We conduct an extra experiment to compare the performance of FedREP and SHARE when there are 7 Byzantine clients with the ALIE attack. We set the local training algorithm to be momentum SGD with β = 0.9 and the buffer size to s = 2 in the extra experiment. The other settings are the same. As illustrated in Figure 12, FedREP still outperforms SHARE in this setting. A possible reason is that the consensus sparsification in FedREP can lower the dissimilarity between the updates of different clients and thus lower the aggregation error (please see Definition 1 in the main text for more details). However, it requires more effort to further explore this aspect, and we leave it for future work. In summary, FedREP has performance comparable to SHARE in convergence rate and final accuracy, but with much less communication cost.
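The speculative dissimilarity argument can be illustrated with a toy computation (purely illustrative numbers, not evidence; the helper names are our own): when two clients' dense updates disagree mostly on coordinates outside the consensus set, projecting onto the consensus set shrinks their pairwise distance.

```python
import math

def dist(u, v):
    """Euclidean distance between two update vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def project(u, coords):
    """Keep only the consensus coordinates, zeroing the rest."""
    return [ui if i in coords else 0.0 for i, ui in enumerate(u)]

u = [2.0, 0.1, 1.9, -0.8]      # client 1's update (toy numbers)
v = [2.1, 0.9, 2.0, 0.7]       # client 2's update (toy numbers)
I_t = {0, 2}                   # consensus coordinates (shared large entries)
# After consensus sparsification, the two updates are closer to each other.
assert dist(project(u, I_t), project(v, I_t)) < dist(u, v)
```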

