BASGD: BUFFERED ASYNCHRONOUS SGD FOR BYZANTINE LEARNING

Abstract

Distributed learning has become a hot research topic due to its wide application in cluster-based large-scale learning, federated learning, edge computing and so on. Most traditional distributed learning methods typically assume no failure or attack on workers. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which makes them impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD incurs less risk of privacy leakage. BASGD is proved to be convergent and able to resist failure or attack. Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there are failures or attacks on workers.

1. INTRODUCTION

Due to its wide application in cluster-based large-scale learning, federated learning (Konečný et al., 2016; Kairouz et al., 2019), edge computing (Shi et al., 2016) and so on, distributed learning has recently become a hot research topic (Zinkevich et al., 2010; Yang, 2013; Jaggi et al., 2014; Shamir et al., 2014; Zhang & Kwok, 2014; Ma et al., 2015; Lee et al., 2017; Lian et al., 2017; Zhao et al., 2017; Sun et al., 2018; Wangni et al., 2018; Zhao et al., 2018; Zhou et al., 2018; Yu et al., 2019a;b; Haddadpour et al., 2019). Most traditional distributed learning methods are based on stochastic gradient descent (SGD) and its variants (Bottou, 2010; Xiao, 2010; Duchi et al., 2011; Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Zhang et al., 2013; Lin et al., 2014; Schmidt et al., 2017; Zheng et al., 2017; Zhao et al., 2018), and typically assume no failure or attack on workers. However, in real distributed learning applications with multiple networked machines (nodes), different kinds of hardware or software failure may happen. Representative failures include bit-flipping in the communication media and in the memory of some workers (Xie et al., 2019). In such cases, a small failure on some machines (workers) might cause a distributed learning method to fail. In addition, malicious attack should not be neglected in an open network where the manager (or server) generally has little control over the workers, as in edge computing and federated learning. Some malicious workers may behave arbitrarily or even adversarially. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention (Diakonikolas et al., 2017; Chen et al., 2017; Blanchard et al., 2017; Alistarh et al., 2018; Damaskinos et al., 2018; Xie et al., 2019; Baruch et al., 2019; Diakonikolas & Kane, 2019).
Existing BL methods can be divided into two main categories: synchronous BL (SBL) methods and asynchronous BL (ABL) methods. In SBL methods, the learning information, such as the gradient in SGD, of all workers is aggregated in a synchronous way. On the contrary, in ABL methods the learning information of workers is aggregated in an asynchronous way. Existing SBL methods mainly take two different ways to achieve resilience against Byzantine workers, which refer to those workers with failure or attack. One way is to replace the simple averaging aggregation operation with a more robust aggregation operation, such as median and trimmed-mean (Yin et al., 2018). Krum (Blanchard et al., 2017) and ByzantinePGD (Yin et al., 2019) take this way. The other way is to filter out suspicious learning information (gradients) before averaging. Representative examples include ByzantineSGD (Alistarh et al., 2018) and Zeno (Xie et al., 2019). The advantage of SBL methods is that they are relatively simple and easy to implement. But SBL methods result in slow convergence when there exist heterogeneous workers. Furthermore, in some applications like federated learning and edge computing, synchronization cannot even be performed most of the time due to offline workers (clients or edge servers). Hence, ABL is preferred in these cases. To the best of our knowledge, there exist only two ABL methods: Kardam (Damaskinos et al., 2018) and Zeno++ (Xie et al., 2020). Kardam introduces two filters to drop out suspicious learning information (gradients), and can still achieve good performance when the communication delay is heavy. However, when facing malicious attack, it has been found that Kardam also drops most correct gradients in order to filter out all faulty (failure) gradients. Hence, Kardam cannot resist malicious attack (Xie et al., 2020). Zeno++ scores each received gradient, and determines whether to accept it according to the score.
But Zeno++ needs to store some training instances on the server for scoring. In practical applications, storing data on the server increases the risk of privacy leakage, or may even pose legal risk. Therefore, under the general setting where the server has no access to any training instances, there has been no ABL method that can resist malicious attack. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. The main contributions of this work are listed as follows:

• To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD incurs less risk of privacy leakage.

• BASGD is theoretically proved to be convergent and able to resist failure or attack.

• Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there exist failures or malicious attacks on workers. In particular, BASGD can still converge under malicious attack, when ASGD and other ABL methods fail.

2. PRELIMINARY

This section presents the preliminaries of this paper, including the distributed learning framework we use and the definition of Byzantine worker.

2.1. DISTRIBUTED LEARNING FRAMEWORK

Many machine learning models, such as logistic regression and deep neural networks, can be formulated as the following finite-sum optimization problem:
$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i), \qquad (1)$$
where $w$ is the parameter to learn, $d$ is the dimension of the parameter, $n$ is the number of training instances, and $f(w; z_i)$ is the empirical loss on the training instance $z_i$. The goal of distributed learning is to solve the problem in (1) by designing learning algorithms based on multiple networked machines. Although many distributed learning frameworks have been proposed, in this paper we focus on the widely used Parameter Server (PS) framework (Li et al., 2014). In a PS framework, there are several workers and one or more servers. Each worker can only communicate with server(s). There may exist more than one server in a PS framework, but for the problem studied in this paper the servers can be logically treated as a single unit. Without loss of generality, we assume there is only one server in this paper. Training instances are disjointly distributed across $m$ workers. Let $\mathcal{D}_k$ denote the index set of training instances on worker $k$; we have $\cup_{k=1}^{m} \mathcal{D}_k = \{1, 2, \ldots, n\}$ and $\mathcal{D}_k \cap \mathcal{D}_{k'} = \emptyset$ if $k \neq k'$. In this paper, we assume that the server has no access to any training instances. If two instances have the same value, they are still deemed two distinct instances; namely, $z_i$ may equal $z_{i'}$ ($i \neq i'$). One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD (Dean et al., 2012) (see Algorithm 1 in Appendix A). In this paper, we assume each worker samples one instance for gradient computation each time, and do not separately discuss the mini-batch case. In PS-based ASGD, the server is responsible for updating and maintaining the latest parameter. The number of iterations that the server has already executed is used as the global logical clock of the server. At the beginning, the iteration number $t = 0$. Each time an SGD step is executed, $t$ increases by 1 immediately.
The parameter after $t$ iterations is denoted as $w_t$. If the server sends parameters to worker $k$ at iteration $t'$, some SGD steps may have been executed before the server next receives a gradient from worker $k$ at iteration $t$. Thus, we define the delay of worker $k$ at iteration $t$ as $\tau_k^t = t - t'$. Worker $k$ is heavily delayed at iteration $t$ if $\tau_k^t > \tau_{\max}$, where $\tau_{\max}$ is a pre-defined non-negative constant.
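As a minimal illustration of this bookkeeping (our own sketch, not part of the ASGD algorithm itself; all names are illustrative), the server only needs to remember at which iteration it last sent parameters to each worker:

```python
# Minimal sketch of per-worker delay tracking: the server records the iteration
# t' at which it sent parameters to worker k, and on receiving a gradient at
# iteration t computes tau_k^t = t - t' and checks it against tau_max.

class DelayTracker:
    def __init__(self, tau_max):
        self.tau_max = tau_max
        self.sent_at = {}  # worker id -> server iteration t' when params were sent

    def on_send(self, worker, t):
        # Server sends the current parameters to `worker` at iteration t'.
        self.sent_at[worker] = t

    def on_receive(self, worker, t):
        # A gradient from `worker` arrives at iteration t; its delay is t - t'.
        delay = t - self.sent_at[worker]
        return delay, delay > self.tau_max  # (tau_k^t, heavily delayed?)

tracker = DelayTracker(tau_max=2)
tracker.on_send(worker=0, t=3)
print(tracker.on_receive(worker=0, t=7))  # (4, True): tau = 4 > tau_max
```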

2.2. BYZANTINE WORKER

For workers that have sent gradients (one or more) to the server at iteration $t$, we call worker $k$ a loyal worker if it has finished all its tasks without any fault and each sent gradient is correctly received by the server. Otherwise, worker $k$ is called a Byzantine worker. If worker $k$ is a Byzantine worker, the gradient received from worker $k$ is not credible and can be an arbitrary value. In ASGD, one gradient is received at a time. Formally, we denote the gradient received from worker $k$ at iteration $t$ as $g_k^t$. Then we have:
$$g_k^t = \begin{cases} \nabla f(w_{t'}; z_i), & \text{if worker } k \text{ is loyal at iteration } t; \\ \text{arbitrary value}, & \text{if worker } k \text{ is Byzantine at iteration } t, \end{cases}$$
where $0 \leq t' \leq t$, and $i$ is randomly sampled from $\mathcal{D}_k$. Our definition of Byzantine worker is consistent with most previous works (Blanchard et al., 2017; Xie et al., 2019; 2020). Either accidental failure or malicious attack will result in Byzantine workers.

3. BUFFERED ASYNCHRONOUS SGD

In synchronous BL, gradients from all workers are received at each iteration. During this process, we can compare the gradients with each other and filter out suspicious ones, or use more robust aggregation rules such as median and trimmed-mean for updating. However, in asynchronous BL, only one gradient is received by the server at a time. Without any training instances stored on the server, it is difficult for the server to identify whether a received gradient is credible or not. In order to deal with this problem in asynchronous BL, we propose a novel method called buffered asynchronous SGD (BASGD). BASGD introduces $B$ buffers ($0 < B \leq m$) on the server, and the gradient used for updating parameters is aggregated from these buffers. The details of the learning procedure of BASGD are presented in Algorithm 2 in Appendix A. In this section, we introduce the two key components of BASGD: the buffer and the aggregation function.

3.1. BUFFER

In BASGD, the $m$ workers do the same job as in ASGD, while the updating rule on the server is modified. More specifically, there are $B$ buffers ($0 < B \leq m$) on the server. When a gradient $g$ from worker $s$ is received, it is temporarily stored in buffer $b$, where $b = s \bmod B$, as illustrated in Figure 1. Only when each buffer has stored at least one gradient will a new SGD step be executed. Please note that no matter whether an SGD step is executed or not, the server immediately sends the latest parameters back to the worker after receiving a gradient. Hence, BASGD introduces no barrier and is an asynchronous algorithm. For each buffer $b$, more than one gradient may have been received at iteration $t$. We store the average of these gradients (denoted by $h_b$) in buffer $b$. Assume that $(N-1)$ gradients $g_1, g_2, \ldots, g_{N-1}$ have already been assigned to buffer $b$, with $h_b^{old} = \frac{1}{N-1}\sum_{i=1}^{N-1} g_i$. When the $N$-th gradient $g_N$ is received, the new average value in buffer $b$ should be:
$$h_b^{new} = \frac{1}{N}\sum_{i=1}^{N} g_i = \frac{N-1}{N} \cdot h_b^{old} + \frac{1}{N} \cdot g_N.$$
This is the updating rule for each buffer $b$ when a gradient is received. We use $N_b^t$ to denote the total number of gradients stored in buffer $b$ at the $t$-th iteration. After the parameter $w$ is updated, all buffers are zeroed out at once. With the benefit of buffers, the server has access to $B$ candidate gradients when updating the parameter. Thus, a more reliable (robust) gradient can be aggregated from the $B$ gradients in the buffers, if a proper aggregation function $\mathrm{Aggr}(\cdot)$ is chosen.

Figure 1: An example of buffers. Each circle represents a worker, and the number is the worker ID. There are 15 workers and 5 buffers. The gradient received from worker $s$ is stored in buffer $(s \bmod 5)$.
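The buffer mechanism above can be sketched in a few lines of Python (a minimal single-process sketch, assuming gradients are plain lists; class and method names are ours, not the paper's implementation):

```python
# Server-side buffers as in Section 3.1: worker s maps to buffer s mod B, each
# buffer keeps a running average h_b of the gradients assigned to it, and all
# buffers are zeroed out after a parameter update.

class BufferedServer:
    def __init__(self, num_buffers, dim):
        self.B = num_buffers
        self.h = [[0.0] * dim for _ in range(num_buffers)]  # running averages h_b
        self.N = [0] * num_buffers                          # gradient counts N_b

    def receive(self, worker_id, grad):
        b = worker_id % self.B          # worker s is mapped to buffer s mod B
        self.N[b] += 1
        n = self.N[b]
        # h_b(new) = (n-1)/n * h_b(old) + 1/n * g_n  (running average update)
        self.h[b] = [(n - 1) / n * hj + gj / n for hj, gj in zip(self.h[b], grad)]
        return all(c > 0 for c in self.N)  # True when every buffer is non-empty

    def zero_out(self):
        # Called right after an SGD step: all buffers are cleared at once.
        self.h = [[0.0] * len(hb) for hb in self.h]
        self.N = [0] * self.B

server = BufferedServer(num_buffers=2, dim=1)
server.receive(0, [1.0])           # buffer 0
server.receive(2, [3.0])           # buffer 0 again: average becomes 2.0
ready = server.receive(1, [5.0])   # buffer 1; now every buffer holds a gradient
print(server.h, ready)             # [[2.0], [5.0]] True
```

When `receive` returns `True`, the server would aggregate over the buffers, take an SGD step, and call `zero_out`.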

3.2. AGGREGATION FUNCTION

When an SGD step is ready to be executed, there are $B$ buffers providing candidate gradients. An aggregation function is needed to produce the final gradient for updating. A naive way is to take the mean of all candidate gradients. However, the mean value is sensitive to outliers, which are common in BL. For designing proper aggregation functions, we first define the q-Byzantine Robust (q-BR) condition to quantitatively describe the Byzantine resilience of an aggregation function.

Definition 1 (q-Byzantine Robust). For an aggregation function $\mathrm{Aggr}(\cdot)$ with $\mathrm{Aggr}([h_1, \ldots, h_B]) = G$, where $G = [G_1, \ldots, G_d]^T$ and $h_b = [h_{b1}, \ldots, h_{bd}]^T, \forall b \in [B]$, we call $\mathrm{Aggr}(\cdot)$ q-Byzantine Robust ($q \in \mathbb{Z}$, $0 < q < B/2$) if it satisfies the following two properties:
(a) $\mathrm{Aggr}([h_1 + h', \ldots, h_B + h']) = \mathrm{Aggr}([h_1, \ldots, h_B]) + h'$, $\forall h_1, \ldots, h_B \in \mathbb{R}^d$, $\forall h' \in \mathbb{R}^d$;
(b) $\min_{s \in S}\{h_{sj}\} \leq G_j \leq \max_{s \in S}\{h_{sj}\}$, $\forall j \in [d]$, $\forall S \subset [B]$ with $|S| = B - q$.

Intuitively, property (a) in Definition 1 says that if the same vector $h'$ is added to all candidate gradients $h_b$, the aggregated gradient is also shifted by $h'$. Property (b) says that for each coordinate $j$, the aggregated value $G_j$ lies between the $(q+1)$-th smallest value and the $(q+1)$-th largest value among the $j$-th coordinates of all candidate gradients. Thus, the gradient aggregated by a q-BR function is insensitive to at least $q$ outliers. We can find that the q-BR condition gets stronger when $q$ increases. In other words, if $\mathrm{Aggr}(\cdot)$ is q-BR, then for any $0 < q' < q$, $\mathrm{Aggr}(\cdot)$ is also $q'$-BR.

Remark 1. It is not hard to find that when $B > 1$, the mean function is not q-Byzantine Robust for any $q > 0$. We illustrate this with a one-dimensional example: $h_1, \ldots, h_{B-1} \in [0, 1]$, and $h_B = 10 \times B$. Then $\frac{1}{B}\sum_{b=1}^{B} h_b \geq \frac{h_B}{B} = 10 \notin [0, 1]$. Namely, the mean is larger than any of the first $B - 1$ values. We find that the following two aggregation functions satisfy the q-BR condition.
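Remark 1 is easy to check numerically (the honest values below are our own illustrative choices):

```python
# Remark 1 in numbers: with B = 5 buffers, the honest values lie in [0, 1], but
# one Byzantine value h_B = 10 * B drags the mean far outside that range, so
# the mean cannot be q-BR for any q > 0. The median stays inside [0, 1].

B = 5
honest = [0.1, 0.4, 0.5, 0.9]   # h_1, ..., h_{B-1} in [0, 1]
byzantine = 10.0 * B            # h_B = 50
values = honest + [byzantine]

mean = sum(values) / B
median = sorted(values)[B // 2]

print(mean)    # about 10.4 -- outside [0, 1], larger than every honest value
print(median)  # 0.5 -- between the honest minimum and maximum
```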
Definition 2 (Coordinate-wise median (Yin et al., 2018)). For candidate gradients $h_1, h_2, \ldots, h_B \in \mathbb{R}^d$ with $h_b = [h_{b1}, h_{b2}, \ldots, h_{bd}]^T, \forall b = 1, 2, \ldots, B$, the coordinate-wise median is defined as: $\mathrm{Med}([h_1, \ldots, h_B]) = [\mathrm{Med}(h_{\cdot 1}), \ldots, \mathrm{Med}(h_{\cdot d})]^T$, where $\mathrm{Med}(h_{\cdot j})$ is the scalar median of the $j$-th coordinates, $\forall j = 1, 2, \ldots, d$.

Definition 3 (Coordinate-wise q-trimmed-mean (Yin et al., 2018)). For any positive integer $q < B/2$ and candidate gradients $h_1, h_2, \ldots, h_B \in \mathbb{R}^d$ with $h_b = [h_{b1}, h_{b2}, \ldots, h_{bd}]^T, \forall b = 1, 2, \ldots, B$, the coordinate-wise q-trimmed-mean is defined as: $\mathrm{Trm}([h_1, \ldots, h_B]) = [\mathrm{Trm}(h_{\cdot 1}), \ldots, \mathrm{Trm}(h_{\cdot d})]^T$, where $\mathrm{Trm}(h_{\cdot j})$ is the scalar q-trimmed-mean: $\mathrm{Trm}(h_{\cdot j}) = \frac{1}{B-2q} \sum_{b \in M_j} h_{bj}$. Here $M_j$ is the subset of $\{h_{bj}\}_{b=1}^{B}$ obtained by removing the $q$ largest elements and the $q$ smallest elements.

In the following content, coordinate-wise median and coordinate-wise q-trimmed-mean are also called median and trmean, respectively. Proposition 1 shows the q-BR property of these two functions.

Proposition 1. Coordinate-wise q-trimmed-mean is q-BR, and coordinate-wise median is $\lfloor \frac{B-1}{2} \rfloor$-BR. Here, $\lfloor x \rfloor$ is the maximum integer not larger than $x$.

According to Proposition 1, both median and trmean are proper choices for the aggregation function in BASGD. The proof can be found in Appendix B. Now we define another class of aggregation functions, which is also important for the analysis in Section 4.

Definition 4 (Stable aggregation function). An aggregation function $\mathrm{Aggr}(\cdot)$ is said to be stable provided that $\forall h_1, \ldots, h_B, \tilde{h}_1, \ldots, \tilde{h}_B \in \mathbb{R}^d$, letting $\delta = (\sum_{b=1}^{B} \|h_b - \tilde{h}_b\|^2)^{\frac{1}{2}}$, we have: $\|\mathrm{Aggr}(h_1, \ldots, h_B) - \mathrm{Aggr}(\tilde{h}_1, \ldots, \tilde{h}_B)\| \leq \delta$. If $\mathrm{Aggr}(\cdot)$ is a stable aggregation function, it means that when there is a disturbance with $L_2$-norm $\delta$ on the buffers, the disturbance of the aggregated result will not be larger than $\delta$.

Definition 5 (Effective aggregation function).
A stable aggregation function $\mathrm{Aggr}(\cdot)$ is called an $(A_1, A_2)$-effective aggregation function provided that, when there are at most $r$ Byzantine workers and $\tau_k^t = 0$ for each loyal worker $k$ ($\forall t = 0, 1, \ldots, T-1$), it satisfies the following two properties:
(i) $\mathbb{E}[\nabla F(w_t)^T G_{syn}^t \mid w_t] \geq \|\nabla F(w_t)\|^2 - A_1$, $\forall w_t \in \mathbb{R}^d$;
(ii) $\mathbb{E}[\|G_{syn}^t\|^2 \mid w_t] \leq (A_2)^2$, $\forall w_t \in \mathbb{R}^d$;
where $A_1, A_2 \in \mathbb{R}^+$ are two non-negative constants, and $G_{syn}^t$ is the gradient aggregated by $\mathrm{Aggr}(\cdot)$ at the $t$-th iteration in cases without delay ($\tau_{\max} = 0$).

For different aggregation functions, the constants $A_1$ and $A_2$ may differ. $A_1$ and $A_2$ are also related to the loss function $F(\cdot)$, the distribution of instances, the buffer number $B$, the maximum Byzantine worker number $r$, and so on. Inequalities (i) and (ii) in Definition 5 are two important properties in the convergence proofs of synchronous Byzantine learning methods. As revealed in (Yang et al., 2020), many existing synchronous Byzantine learning methods satisfy them: Krum, median, and trimmed-mean are proved to satisfy these two properties (Blanchard et al., 2017; Yin et al., 2018). SignSGD (Bernstein et al., 2019) can be seen as a combination of 1-bit quantization and median aggregation, and median satisfies the properties. Bulyan (Guerraoui et al., 2018) uses an existing aggregation rule to obtain a new one, and the property of Bulyan is difficult to analyze alone. Zeno (Xie et al., 2019) has an asynchronous version called Zeno++ (Xie et al., 2020), and it is meaningless to check the properties for Zeno. Please note that a too large $B$ will slow down the updating frequency and damage the performance, which is supported by both theoretical (Appendix B) and empirical (Section 5) results. In practical applications, we can estimate the Byzantine worker number $r$ in advance, and set $B$ to make $\mathrm{Aggr}(\cdot)$ r-BR. Specifically, $B$ is suggested to be $(2r + 1)$ for median, since median is $\lfloor \frac{B-1}{2} \rfloor$-BR.
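To make Definitions 2 and 3 and the buffer-number suggestion concrete, here is a small pure-Python sketch (illustrative only; function names are ours, and gradients are plain lists of length $d$):

```python
# Coordinate-wise median (Definition 2), coordinate-wise q-trimmed-mean
# (Definition 3), and the rule B = 2r + 1 for median from the text above.

def coordinate_wise_median(grads):
    B, d = len(grads), len(grads[0])
    out = []
    for j in range(d):
        col = sorted(g[j] for g in grads)
        mid = B // 2
        # scalar median of the j-th coordinates
        out.append(col[mid] if B % 2 == 1 else 0.5 * (col[mid - 1] + col[mid]))
    return out

def coordinate_wise_trimmed_mean(grads, q):
    B, d = len(grads), len(grads[0])
    assert 0 < q < B / 2
    out = []
    for j in range(d):
        col = sorted(g[j] for g in grads)
        kept = col[q:B - q]            # drop the q smallest and q largest values
        out.append(sum(kept) / (B - 2 * q))
    return out

def min_buffers_for_median(r):
    # median is floor((B - 1) / 2)-BR, so the smallest B that is r-BR is 2r + 1
    return 2 * r + 1

grads = [[1.0, 0.0], [2.0, 1.0], [3.0, -1.0], [100.0, 50.0], [0.0, 2.0]]
print(coordinate_wise_median(grads))           # [2.0, 1.0]
print(coordinate_wise_trimmed_mean(grads, 1))  # [2.0, 1.0]
print(min_buffers_for_median(3))               # 7
```

Note how the single outlying candidate `[100.0, 50.0]` has no effect on either aggregated result, in line with the q-BR property.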

4. CONVERGENCE

In this section, we theoretically prove the convergence and resilience of BASGD against failure or attack. There are two main theorems. The first presents a relatively loose but general bound for all q-BR aggregation functions. The second presents a relatively tight bound for each specific $(A_1, A_2)$-effective aggregation function. Since the definition of an $(A_1, A_2)$-effective aggregation function is usually more difficult to verify than the q-BR property, the general bound is also useful. Here we only present the results; proof details are in Appendix B. We first make the following assumptions, which have been widely used in stochastic optimization.

Assumption 1. The global loss function $F(w)$ is bounded below: $\exists F^* \in \mathbb{R}$ such that $F(w) \geq F^*, \forall w \in \mathbb{R}^d$.

Assumption 2 (Bounded bias). Any loyal worker can use its locally stored training instances to estimate the global gradient with bias bounded by $\kappa$: $\|\mathbb{E}[\nabla f(w; z_i)] - \nabla F(w)\| \leq \kappa, \forall w \in \mathbb{R}^d$.

Assumption 3 (Bounded gradient). $\nabla F(w)$ is bounded: $\exists D \in \mathbb{R}^+$ such that $\|\nabla F(w)\| \leq D, \forall w \in \mathbb{R}^d$.

Assumption 4 (Bounded variance). $\mathbb{E}[\|\nabla f(w; z_i) - \mathbb{E}[\nabla f(w; z_i) \mid w]\|^2 \mid w] \leq \sigma^2, \forall w \in \mathbb{R}^d$.

Assumption 5 (L-smoothness). The global loss function $F(w)$ is differentiable and L-smooth: $\|\nabla F(w) - \nabla F(w')\| \leq L \|w - w'\|, \forall w, w' \in \mathbb{R}^d$.

Remark 2. Please note that we do not make any assumption about convexity. The analysis in this section is suitable for both convex and non-convex models in machine learning, such as logistic regression and deep neural networks. Also, we do not make any assumption about the behavior of Byzantine workers, which may behave arbitrarily.

Let $N^{(t)}$ be the $(q+1)$-th smallest value in $\{N_b^t\}_{b \in [B]}$, where $N_b^t$ is the total number of gradients stored in buffer $b$ at the $t$-th iteration. We define the constant $\Lambda_{B,q,r} = \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}}$, which will appear in Lemma 1 and Lemma 2.

Lemma 1.
If $\mathrm{Aggr}(\cdot)$ is q-BR and there are at most $r$ Byzantine workers ($r \leq q$), we have:
$$\mathbb{E}[\|G_t\|^2 \mid w_t] \leq \Lambda_{B,q,r} d \cdot (D^2 + \sigma^2 / N^{(t)}).$$

Lemma 2. If $\mathrm{Aggr}(\cdot)$ is q-BR and the total number of heavily delayed workers and Byzantine workers is not larger than $r$ ($r \leq q$), we have:
$$\|\mathbb{E}[G_t - \nabla F(w_t) \mid w_t]\| \leq \Lambda_{B,q,r} d \cdot \big(\tau_{\max} L \cdot [\Lambda_{B,q,r} d (D^2 + \sigma^2 / N^{(t)})]^{\frac{1}{2}} + \sigma + \kappa\big).$$

Theorem 1. Let $\bar{D} = \frac{1}{T} \sum_{t=0}^{T-1} (D^2 + \sigma^2 / N^{(t)})^{\frac{1}{2}}$. If $\mathrm{Aggr}(\cdot)$ is q-BR, $B = O(r)$, and the total number of heavily delayed workers and Byzantine workers is not larger than $r$ ($r \leq q$), then setting the learning rate $\eta = O(\frac{1}{L\sqrt{T}})$, we have:
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla F(w_t)\|^2] \leq O\Big(\frac{L[F(w_0) - F^*]}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{r d \bar{D}}{T^{\frac{1}{2}} (q-r+1)^{\frac{1}{2}}}\Big) + O\Big(\frac{r \bar{D} d \sigma}{(q-r+1)^{\frac{1}{2}}}\Big) + O\Big(\frac{r \bar{D} d \kappa}{(q-r+1)^{\frac{1}{2}}}\Big) + O\Big(\frac{r^{\frac{3}{2}} L D \bar{D} d^{\frac{3}{2}} \tau_{\max}}{(q-r+1)^{\frac{3}{4}}}\Big).$$

Please note that the convergence rate of vanilla ASGD is $O(T^{-\frac{1}{2}})$. Hence, Theorem 1 indicates that BASGD has a theoretical convergence rate as fast as vanilla ASGD, with an extra constant variance. The term $O(r\bar{D}d\sigma(q-r+1)^{-\frac{1}{2}})$ is caused by the aggregation function, and can be deemed a sacrifice for Byzantine resilience. The term $O(r\bar{D}d\kappa(q-r+1)^{-\frac{1}{2}})$ is caused by the differences of training instances among different workers. In independent and identically distributed (i.i.d.) cases, $\kappa = 0$ and this term vanishes. The term $O(r^{\frac{3}{2}} L D \bar{D} d^{\frac{3}{2}} \tau_{\max} (q-r+1)^{-\frac{3}{4}})$ is caused by the delay, and is related to the parameter $\tau_{\max}$. This term is also related to the buffer size: when $N_b^t$ increases, $N^{(t)}$ may increase, and thus $\bar{D}$ will decrease. Namely, a larger buffer size results in a smaller $\bar{D}$. Besides, the factor $(q-r+1)^{-\frac{1}{2}}$ or $(q-r+1)^{-\frac{3}{4}}$ decreases as $q$ increases, and increases as $r$ increases. Although general, the bound presented in Theorem 1 is relatively loose in high-dimensional cases, since $d$ appears in all three extra constant terms. To obtain a tighter bound, we introduce Theorem 2 for BASGD with an $(A_1, A_2)$-effective aggregation function (Definition 5).

Theorem 2.
If the total number of heavily delayed workers and Byzantine workers is not larger than $r$, $B = O(r)$, and $\mathrm{Aggr}(\cdot)$ is an $(A_1, A_2)$-effective aggregation function in this case, then setting the learning rate $\eta = O(\frac{1}{\sqrt{LT}})$, in general asynchronous cases we have:
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla F(w_t)\|^2] \leq O\Big(\frac{L^{\frac{1}{2}}[F(w_0) - F^*]}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{1}{2}} \tau_{\max} D A_2 r^{\frac{1}{2}}}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{1}{2}} (A_2)^2}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{5}{2}} (A_2)^2 \tau_{\max}^2 r}{T^{\frac{3}{2}}}\Big) + A_1.$$

Theorem 2 indicates that if $\mathrm{Aggr}(\cdot)$ makes a synchronous BL method converge (i.e., satisfies Definition 5), BASGD converges when using $\mathrm{Aggr}(\cdot)$ as its aggregation function. Hence, BASGD can also be seen as a technique of asynchronization. That is to say, new asynchronous methods can be obtained from synchronous ones by using BASGD. The extra constant term $A_1$ is caused by gradient bias. When there are no Byzantine workers ($r = 0$) and instances are i.i.d. across workers, letting $B = 1$ and $\mathrm{Aggr}(h_1, \ldots, h_B) = \mathrm{Aggr}(h_1) = h_1$, BASGD degenerates to vanilla ASGD. Under this circumstance, there is no gradient bias ($A_1 = 0$), and the extra constant term vanishes. In general cases, Theorem 2 guarantees that BASGD finds a point such that the squared $L_2$-norm of its gradient is not larger than $A_1$ (but not necessarily around a stationary point), in expectation. Please note that Assumption 3 already guarantees that the gradient's squared $L_2$-norm is not larger than $D^2$. We introduce Proposition 2 to show that $A_1$ is guaranteed to be smaller than $D^2$ under a mild condition.

Proposition 2. Suppose $\mathrm{Aggr}(\cdot)$ is an $(A_1, A_2)$-effective aggregation function, and $G_{syn}^t$ is aggregated by $\mathrm{Aggr}(\cdot)$ in the synchronous setting. If $\mathbb{E}[\|G_{syn}^t - \nabla F(w_t)\| \mid w_t] \leq D, \forall w_t \in \mathbb{R}^d$, then $A_1 \leq D^2$.

$G_{syn}^t$ is the aggregated result of $\mathrm{Aggr}(\cdot)$, and is a robust estimator of $\nabla F(w_t)$ used for updating. Since $\|\nabla F(w_t)\| \leq D$, $\nabla F(w_t)$ lies in a ball with radius $D$.
$\mathbb{E}[\|G_{syn}^t - \nabla F(w_t)\| \mid w_t] \leq D$ means that the bias of $G_{syn}^t$ is not larger than the radius $D$, which is a mild condition for $\mathrm{Aggr}(\cdot)$. As many existing works have indicated (Assran et al., 2020; Nokleby et al., 2020), speed-up is also an important aspect of distributed learning methods. In BASGD, different workers can compute gradients concurrently, making each buffer fill more quickly and thus speeding up model updating. However, we mainly focus on Byzantine resilience in this work; speed-up will be thoroughly studied in future work. Besides, heavily delayed workers are treated as Byzantine in the current analysis. We will analyze the behavior of heavily delayed workers more finely to obtain better results in future work.

5. EXPERIMENT

In this section, we empirically evaluate the performance of BASGD and baselines in both image classification (IC) and natural language processing (NLP) applications. Our experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU (in IC) or an NVIDIA Tesla K80 GPU (in NLP). Please note that different GPU cards do not affect the reported metrics in the experiment. We choose 30 dockers as workers in IC, and 8 dockers in NLP. An extra docker is chosen as server. All algorithms are implemented with PyTorch 1.3. 

5.1. EXPERIMENTAL SETTING

We compare the performance of different methods under two types of attack: negative gradient attack (NG-attack) and random disturbance attack (RD-attack). Byzantine workers with NG-attack send $\tilde{g}_{NG} = -k_{atk} \cdot g$ to the server, where $g$ is the true gradient and $k_{atk} \in \mathbb{R}^+$ is a parameter. Byzantine workers with RD-attack send $\tilde{g}_{RD} = g + g_{rnd}$ to the server, where $g_{rnd}$ is a random vector sampled from the normal distribution $\mathcal{N}(0, \sigma_{atk}^2 \|g\|^2 \cdot I)$. Here, $\sigma_{atk}$ is a parameter and $I$ is an identity matrix. NG-attack is a typical kind of malicious attack, while RD-attack can be seen as an accidental failure with expectation 0. Besides, each worker is manually set to have a delay, which is $k_{del}$ times the computing time. The training set is randomly and equally distributed to different workers. We use the average top-1 test accuracy (in IC) or average perplexity (in NLP) on all workers w.r.t. epochs as the final metrics. For BASGD, we use median and trimmed-mean as aggregation functions. Because BASGD is an ABL method, SBL methods cannot be directly compared with BASGD. The ABL method Zeno++ cannot be directly compared with BASGD either, because Zeno++ needs to store some instances on the server, and the number of stored instances affects its performance (Xie et al., 2020). Hence, we compare BASGD with ASGD and Kardam in our experiments. We set the dampening function $\Lambda(\tau) = \frac{1}{1+\tau}$ for Kardam, as suggested in (Damaskinos et al., 2018).
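For illustration, the two attacks can be sketched as follows (our own sketch; in particular, we read the RD-attack covariance $\sigma_{atk}^2 \|g\|^2 \cdot I$ as per-coordinate noise with standard deviation $\sigma_{atk}\|g\|$, and all function names are ours):

```python
# NG-attack flips and scales the true gradient; RD-attack adds zero-mean
# Gaussian noise whose standard deviation scales with ||g||.
import random

def ng_attack(g, k_atk=10.0):
    # NG-attack: send -k_atk * g instead of the true gradient g
    return [-k_atk * x for x in g]

def rd_attack(g, sigma_atk=0.2, rng=None):
    # RD-attack: g plus a random disturbance with expectation 0
    rng = rng or random.Random(0)
    norm = sum(x * x for x in g) ** 0.5
    return [x + rng.gauss(0.0, sigma_atk * norm) for x in g]

g = [0.3, -0.4]       # true gradient, ||g|| = 0.5
print(ng_attack(g))   # [-3.0, 4.0]
print(rd_attack(g))   # g plus a small random disturbance
```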

5.2. IMAGE CLASSIFICATION EXPERIMENT

In the IC experiment, algorithms are evaluated on CIFAR-10 (Krizhevsky et al., 2009) with the deep learning model ResNet-20 (He et al., 2016). Cross-entropy is used as the loss function. We set $k_{atk} = 10$ for NG-attack and $\sigma_{atk} = 0.2$ for RD-attack. $k_{del}$ is randomly sampled from the standard normal distribution truncated to $[0, +\infty)$. As suggested in (He et al., 2016), the learning rate $\eta$ is set to 0.1 initially for each algorithm, and multiplied by 0.1 at the 80-th epoch and the 120-th epoch, respectively. The weight decay is set to $10^{-4}$. We run each algorithm for 160 epochs. The batch size is set to 25. Firstly, we compare the performance of different methods when there are no Byzantine workers. Experimental results with the median and trmean aggregation functions are illustrated in Figure 2(a) and Figure 2(b), respectively. ASGD achieves the best performance. BASGD ($B > 1$) and Kardam have convergence rates similar to ASGD, but both sacrifice a little accuracy. Besides, the performance of BASGD gets worse when the buffer number $B$ increases, which is consistent with the theoretical results. Please note that ASGD is a degenerate case of BASGD with $B = 1$ and $\mathrm{Aggr}(h_1) = h_1$. Hence, BASGD can achieve the same performance as ASGD when there is no failure or attack. Then, for each type of attack, we conduct two experiments in which there are 3 and 6 Byzantine workers, respectively. We set 10 and 15 buffers for BASGD in these two experiments, respectively. For space saving, we only present the average top-1 test accuracy in Figure 2. Moreover, we count the ratio of filtered gradients in Kardam, which is shown in Table 1. We can find that, in order to filter Byzantine gradients, Kardam also filters an approximately equal ratio of loyal gradients. This explains why Kardam performs poorly under malicious attack.

5.3. NATURAL LANGUAGE PROCESSING EXPERIMENT

In the NLP experiment, the algorithms are evaluated on the WikiText-2 dataset with LSTM networks. We only use the training set and test set; the validation set is not used in our experiment. For the LSTM, we adopt 2 layers with 100 units in each. The word embedding size is set to 100, and the sequence length is set to 35. The gradient clipping size is set to 0.25. Cross-entropy is used as the loss function. We run each algorithm for 40 epochs. The initial learning rate $\eta$ is chosen from {1, 2, 5, 10, 20}, and is divided by 4 every 10 epochs. The best test result is adopted as the final one. The performance of ASGD under no attack is used as the gold standard. We set $k_{atk} = 10$ and $\sigma_{atk} = 0.1$. One of the eight workers is Byzantine. $k_{del}$ is randomly sampled from an exponential distribution with parameter $\lambda = 1$. Each experiment is carried out 3 times, and the average perplexity is reported in Figure 3. We can find that BASGD converges under each kind of attack, with only a little loss in perplexity compared to the gold standard (ASGD without attack). On the other hand, ASGD and Kardam both fail, even when we set the largest $\gamma$ ($\gamma = 3$) for Kardam.

6. CONCLUSION

In this paper, we propose a novel method called BASGD for asynchronous Byzantine learning. To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD incurs less risk of privacy leakage. BASGD is proved to be convergent and able to resist failure or attack. Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there are failures or attacks on workers.

Algorithm 1 Asynchronous SGD (ASGD)

Server:
  Initialization: initial parameter $w_0$, learning rate $\eta$;
  Send initial $w_0$ to all workers;
  for $t = 0$ to $t_{\max} - 1$ do
    Wait until a new gradient $g_k^t$ is received from an arbitrary worker $k$;
    Execute SGD step: $w_{t+1} \leftarrow w_t - \eta \cdot g_k^t$;
    Send $w_{t+1}$ back to worker $k$;
  end for
  Notify all workers to stop;

Worker $k$: ($k = 0, 1, \ldots, m - 1$)
  repeat
    Wait until receiving the latest parameter $w$ from the server;
    Randomly sample an index $i$ from $\mathcal{D}_k$;
    Compute $\nabla f(w; z_i)$;
    Send $\nabla f(w; z_i)$ to the server;
  until receiving the server's notification to stop

A ALGORITHM DETAILS

A.1 ASYNCHRONOUS SGD (ASGD)

One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD (Dean et al., 2012), which is presented in Algorithm 1.
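The server/worker loop of Algorithm 1 can be mimicked by a toy single-process simulation (a sketch under our own simplifications: one-dimensional quadratic losses $f(w; z) = \frac{1}{2}(w - z)^2$, and staleness modeled by cached parameter copies; not the paper's implementation):

```python
# Toy serial simulation of Algorithm 1. Each worker's stochastic gradient is
# (w - z); asynchrony is mimicked by letting workers compute gradients at the
# stale parameter copy they last received from the server.
import random

def asgd_simulation(data_per_worker, eta=0.1, t_max=200, seed=0):
    rng = random.Random(seed)
    m = len(data_per_worker)
    w = 0.0
    stale_w = [w] * m                 # parameter copy each worker last received
    for _ in range(t_max):
        k = rng.randrange(m)          # an arbitrary worker's gradient arrives
        z = rng.choice(data_per_worker[k])
        g = stale_w[k] - z            # gradient computed at the stale parameter
        w -= eta * g                  # SGD step on the server
        stale_w[k] = w                # server sends w_{t+1} back to worker k
    return w

# Instances cluster around 1.0, so w should settle near 1.0.
w = asgd_simulation([[0.9, 1.1], [0.8, 1.2], [1.0]])
print(w)  # settles near 1.0
```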

A.2 BUFFERED ASYNCHRONOUS SGD (BASGD)

The details of the learning procedure of BASGD are presented in Algorithm 2.

B PROOF DETAILS B.1 PROOF OF PROPOSITION 1

Proof. Firstly, we prove that coordinate-wise q-trimmed-mean is q-BR. It is not hard to check that trmean satisfies property (a) in the definition of q-BR; we now prove that it also satisfies property (b). Without loss of generality, we assume $h_{1j}, \ldots, h_{Bj}$ are already in descending order. By definition, $\mathrm{Trm}(h_{\cdot j})$ is the average value of $M_j$, which is obtained by removing the $q$ largest values and $q$ smallest values of $\{h_{ij}\}_{i=1}^{B}$. Therefore,
$$h_{(q+1)j} = \max_{x \in M_j}\{x\} \geq \mathrm{Trm}(h_{\cdot j}) \geq \min_{x \in M_j}\{x\} = h_{(B-q)j}.$$
For any $S \subset [B]$ with $|S| = B - q$, by the pigeonhole principle, $S$ includes at least one of $h_{1j}, \ldots, h_{(q+1)j}$, and at least one of $h_{(B-q)j}, \ldots, h_{Bj}$. Therefore,
$$\max_{s \in S}\{h_{sj}\} \geq h_{(q+1)j}; \qquad \min_{s \in S}\{h_{sj}\} \leq h_{(B-q)j}.$$
Combining these inequalities, we have: $\max_{s \in S}\{h_{sj}\} \geq \mathrm{Trm}(h_{\cdot j}) \geq \min_{s \in S}\{h_{sj}\}$. Thus, coordinate-wise q-trimmed-mean is q-BR. By definition, coordinate-wise median can be seen as $\lfloor \frac{B-1}{2} \rfloor$-trimmed-mean, and thus is $\lfloor \frac{B-1}{2} \rfloor$-BR.

Proof. Denote the probability density function (PDF) and cumulative distribution function (CDF) of $\mathcal{D}$ by $p(x)$ and $P(x)$, respectively. Then the PDF of $X_{(K)}$ is:
$$p_{(K)}(x) = \frac{M!}{(K-1)!(M-K)!} [1 - P(x)]^{K-1} P(x)^{M-K} p(x).$$
Thus,
$$\mathbb{E}[X_{(K)}] = \int_0^{+\infty} x \cdot p_{(K)}(x)\, dx = \int_0^{+\infty} \Big[\frac{M!}{(K-1)!(M-K)!} \cdot [1 - P(x)]^{K-1} P(x)^{M-K}\Big] \cdot x p(x)\, dx$$
$$\overset{(a)}{\leq} \int_0^{+\infty} \Big[\frac{M!}{(K-1)!(M-K)!} \cdot \frac{(K-1)^{K-1}(M-K)^{M-K}}{(M-1)^{M-1}}\Big] \cdot x p(x)\, dx = \frac{M!(K-1)^{K-1}(M-K)^{M-K}}{(K-1)!(M-K)!(M-1)^{M-1}} \cdot \mathbb{E}[X].$$
Inequality (a) is derived from $[1 - P(x)]^{K-1} P(x)^{M-K} \leq \frac{(K-1)^{K-1}(M-K)^{M-K}}{(M-1)^{M-1}}$, which is obtained by the following process. Let $\theta(x) = (1-x)^{K-1} x^{M-K}$, $x \in [0, 1]$. Then $\theta'(x) = (1-x)^{K-2} x^{M-K-1} [(M-K)(1-x) - (K-1)x]$. Setting $\theta'(x) = 0$ and solving, we obtain $x = \frac{M-K}{M-1}$, $0$, or $1$. Also, we have $\theta(0) = \theta(1) = 0$, and $\theta(\frac{M-K}{M-1}) = \frac{(K-1)^{K-1}(M-K)^{M-K}}{(M-1)^{M-1}}$. Then we have $\max_{x \in [0,1]} \theta(x) = \theta(\frac{M-K}{M-1}) = \frac{(K-1)^{K-1}(M-K)^{M-K}}{(M-1)^{M-1}}$.
Thus, $[1-P(x)]^{K-1}P(x)^{M-K} = \theta(P(x)) \le \frac{(K-1)^{K-1}(M-K)^{M-K}}{(M-1)^{M-1}}$.

Proposition 3. $\forall B, q, r \in \mathbb{Z}^{+}$ with $0 \le r \le q < \frac{B}{2}$,
$$C_{B-r,q-r+1} \le \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}}.$$

Proof. By Stirling's approximation, we have $\sqrt{2\pi n} \cdot n^n e^{-n} \le n! \le e\sqrt{n} \cdot n^n e^{-n}$, $\forall n \in \mathbb{Z}^{+}$. Therefore,
$$\sqrt{2\pi n} \cdot e^{-n} \le \frac{n!}{n^n} \le e\sqrt{n} \cdot e^{-n}, \quad \forall n \in \mathbb{Z}^{+}. \tag{2}$$
By the definition of $C_{M,K}$,
$$C_{M,K} = \frac{M!(K-1)^{K-1}(M-K)^{M-K}}{(K-1)!(M-K)!(M-1)^{M-1}} = M \cdot \frac{(M-1)!}{(M-1)^{M-1}} \cdot \frac{(K-1)^{K-1}}{(K-1)!} \cdot \frac{(M-K)^{M-K}}{(M-K)!}$$
$$\le M \cdot \big[e\sqrt{M-1} \cdot e^{-(M-1)}\big] \cdot \frac{e^{K-1}}{\sqrt{2\pi(K-1)}} \cdot \frac{e^{M-K}}{\sqrt{2\pi(M-K)}} = \frac{e}{2\pi} \cdot \frac{M\sqrt{M-1}}{\sqrt{(M-K)(K-1)}},$$
where the inequality uses Inequality (2).

Case (i). When $r < q$,
$$C_{B-r,q-r+1} \le \frac{e}{2\pi} \cdot \frac{(B-r)\sqrt{B-r-1}}{\sqrt{(B-q-1)(q-r)}} \le \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}}.$$

Case (ii). When $r = q$, by the definition of $C_{M,K}$, we have
$$C_{B-r,q-r+1} = C_{B-q,1} = B-q \le \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}}.$$

In conclusion, when $r \le q$, we have
$$C_{B-r,q-r+1} \le \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}}.$$

When $B$ and $q$ are fixed, the upper bound of $C_{B-r,q-r+1}$ increases as $r$ (the number of Byzantine workers) increases. Namely, the upper bound is larger when there are more Byzantine workers. When $B$ and $r$ are fixed, $q$ measures the Byzantine-robustness degree of the aggregation function $Aggr(\cdot)$. The factor $[(B-q-1)(q-r)]^{-\frac{1}{2}}$ is monotonically decreasing with respect to $q$ when $q < \frac{B-1+r}{2}$. Since $r \le q < \frac{B}{2}$, the upper bound decreases as $q$ increases. Also, $B-q$ decreases as $q$ increases. Namely, the upper bound is smaller when $Aggr(\cdot)$ has a stronger q-BR property. In the worst case ($q = r$), the upper bound of $C_{B-r,q-r+1}$ is linear in $B$. Even in the best case ($r = 0$, $q = \frac{B-1}{2}$), the denominator is about $\frac{B}{2}$ and the upper bound of $C_{B-r,q-r+1}$ is linear in $\sqrt{B}$. Thus, a larger $B$ might result in a larger error. Hence, the buffer number is not supposed to be set too large.

Now we prove Lemma 1.

Proof.
$$\mathbb{E}[\|G^t\|^2 \mid w^t] = \mathbb{E}[\|Aggr([h_1, \ldots, h_B])\|^2 \mid w^t] = \sum_{j=1}^{d}\mathbb{E}[Aggr([h_1, \ldots, h_B])_j^2 \mid w^t],$$
where $Aggr([h_1, \ldots, h_B])_j$ represents the $j$-th coordinate of the aggregated gradient. We use $\mathcal{H}^t$ to denote the credible buffer index set, which is composed of the indices of the buffers in which all stored gradients are from loyal workers. For each $b \in \mathcal{H}^t$, buffer $b$ has stored $N_b^t$ gradients $g_1, \ldots, g_{N_b^t}$ at iteration $t$, and $h_b = \frac{1}{N_b^t}\sum_{i=1}^{N_b^t} g_i$. Then,
$$\mathbb{E}[\|h_b\|^2 \mid w^t] = \mathbb{E}[\|h_b - \mathbb{E}[h_b \mid w^t]\|^2 \mid w^t] + \|\mathbb{E}[h_b \mid w^t]\|^2$$
$$= \mathbb{E}\Big[\Big\|\frac{1}{N_b^t}\sum_{i=1}^{N_b^t}(g_i - \mathbb{E}[g_i \mid w^t])\Big\|^2 \,\Big|\, w^t\Big] + \Big\|\frac{1}{N_b^t}\sum_{i=1}^{N_b^t}\mathbb{E}[g_i \mid w^t]\Big\|^2$$
$$\overset{(a)}{\le} \frac{\sigma^2}{N_b^t} + \frac{1}{(N_b^t)^2}\Big\|\sum_{i=1}^{N_b^t}\mathbb{E}[g_i \mid w^t]\Big\|^2 \overset{(b)}{\le} \frac{\sigma^2}{N_b^t} + \frac{1}{(N_b^t)^2} \cdot N_b^t \cdot \sum_{i=1}^{N_b^t}\|\mathbb{E}[g_i \mid w^t]\|^2 \overset{(c)}{\le} \frac{\sigma^2}{N_b^t} + D^2.$$
Inequality (a) is derived from Assumption 4 and the fact that the $g_i$ are mutually uncorrelated. Inequality (b) is derived by the following process:
$$\Big\|\sum_{i=1}^{N_b^t}\mathbb{E}[g_i \mid w^t]\Big\|^2 = \sum_{i=1}^{N_b^t}\|\mathbb{E}[g_i \mid w^t]\|^2 + \sum_{1 \le i < i' \le N_b^t} 2\,\mathbb{E}[g_i \mid w^t]^{\mathrm{T}}\mathbb{E}[g_{i'} \mid w^t]$$
$$\le \sum_{i=1}^{N_b^t}\|\mathbb{E}[g_i \mid w^t]\|^2 + \sum_{1 \le i < i' \le N_b^t}\big(\|\mathbb{E}[g_i \mid w^t]\|^2 + \|\mathbb{E}[g_{i'} \mid w^t]\|^2\big) = N_b^t \cdot \sum_{i=1}^{N_b^t}\|\mathbb{E}[g_i \mid w^t]\|^2.$$
Inequality (c) is derived from Assumption 3. Because there are no more than $r$ Byzantine workers at iteration $t$, no more than $r$ buffers contain Byzantine gradients. Thus, the credible buffer index set $\mathcal{H}^t$ has at least $B-r$ elements. For each coordinate $j$, we take the indices of the smallest $B-q$ elements in $\{h_{bj}\}_{b \in \mathcal{H}^t}$ to compose $\mathcal{H}_j^t$, so that $|\mathcal{H}_j^t| = B-q$. Note that $Aggr(\cdot)$ is q-BR, and by definition we have
$$\min_{b \in \mathcal{H}_j^t} h_{bj} \le Aggr([h_1, \ldots, h_B])_j \le \max_{b \in \mathcal{H}_j^t} h_{bj}.$$
Therefore,
$$\sum_{j=1}^{d}\mathbb{E}[Aggr([h_1, \ldots, h_B])_j^2 \mid w^t] \le \sum_{j=1}^{d}\mathbb{E}\big[\max_{b \in \mathcal{H}_j^t} h_{bj}^2 \mid w^t\big].$$
There are at least $B-r$ credible buffers, and we choose the smallest $B-q$ of them to compose $\mathcal{H}_j^t$.
Therefore, for all $b \in \mathcal{H}_j^t$, $h_{bj}$ is not larger than the $(q-r+1)$-th largest value in $\{h_{bj}\}_{b \in \mathcal{H}^t}$. Let $N^{(t)}$ be the $(q+1)$-th smallest value in $\{N_b^t\}_{b \in [B]}$. Using Lemma 3, we have
$$\mathbb{E}\big[\max_{b \in \mathcal{H}_j^t} h_{bj}^2 \mid w^t\big] \le \mathbb{E}\big[\max_{b \in \mathcal{H}_j^t} \|h_b\|^2 \mid w^t\big] \le \mathbb{E}\Big[\max_{b \in \mathcal{H}_j^t}\Big\{D^2 + \frac{\sigma^2}{N_b^t}\Big\} \,\Big|\, w^t\Big] \le C_{B-r,q-r+1} \cdot \Big(D^2 + \frac{\sigma^2}{N^{(t)}}\Big).$$
Thus,
$$\mathbb{E}[\|G^t\|^2 \mid w^t] \le \sum_{j=1}^{d}\mathbb{E}\big[\max_{b \in \mathcal{H}_j^t} h_{bj}^2 \mid w^t\big] \le C_{B-r,q-r+1}\, d \cdot \Big(D^2 + \frac{\sigma^2}{N^{(t)}}\Big).$$
By Proposition 3, we have
$$\mathbb{E}[\|G^t\|^2 \mid w^t] \le \frac{d(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}} \cdot \Big(D^2 + \frac{\sigma^2}{N^{(t)}}\Big).$$

B.3 PROOF OF LEMMA 2

Proof.
$$\mathbb{E}[G^t - \nabla F(w^t) \mid w^t] = \mathbb{E}[Aggr([h_1, \ldots, h_B]) - \nabla F(w^t) \mid w^t] = \mathbb{E}[Aggr([h_1 - \nabla F(w^t), \ldots, h_B - \nabla F(w^t)]) \mid w^t], \tag{3}$$
where the second equation is derived from property (b) in the definition of q-BR. For each $b \in \mathcal{H}^t$, buffer $b$ has stored $N_b^t$ gradients $g_1, \ldots, g_{N_b^t}$ at iteration $t$, and
$$h_b - \nabla F(w^t) = \frac{1}{N_b^t}\sum_{k=1}^{N_b^t} g_k - \nabla F(w^t) = \frac{1}{N_b^t}\sum_{k=1}^{N_b^t}\big[\nabla f(w^{t_k}; z_{i_k}) - \nabla F(w^t)\big],$$
where $0 \le t - t_k \le \tau_{\max}$, $\forall k = 1, 2, \ldots, N_b^t$. Taking expectation on both sides, we have
$$\mathbb{E}[\|h_b - \nabla F(w^t)\| \mid w^t] \le \frac{1}{N_b^t}\sum_{k=1}^{N_b^t}\mathbb{E}[\|\nabla f(w^{t_k}; z_{i_k}) - \nabla F(w^t)\| \mid w^t]$$
$$\overset{(a)}{\le} \frac{1}{N_b^t}\sum_{k=1}^{N_b^t}\Big\{\mathbb{E}[\|\nabla F(w^{t_k}) - \nabla F(w^t)\| \mid w^t] + \mathbb{E}[\|\nabla f(w^{t_k}; z_{i_k}) - \mathbb{E}[\nabla f(w^{t_k}; z_{i_k})]\| \mid w^t] + \mathbb{E}[\|\mathbb{E}[\nabla f(w^{t_k}; z_{i_k})] - \nabla F(w^{t_k})\| \mid w^t]\Big\},$$
where (a) is derived from the triangle inequality. The first part:
$$\mathbb{E}[\|\nabla F(w^{t_k}) - \nabla F(w^t)\| \mid w^t] \overset{(b)}{\le} L \cdot \mathbb{E}[\|w^{t_k} - w^t\| \mid w^t] = \eta L \cdot \mathbb{E}\Big[\Big\|\sum_{t'=t_k}^{t-1} G^{t'}\Big\| \,\Big|\, w^t\Big] \le \sum_{t'=t_k}^{t-1} \eta L \cdot \mathbb{E}[\|G^{t'}\| \mid w^t]$$
$$\le \sum_{t'=t_k}^{t-1} \eta L \cdot \sqrt{\mathbb{E}[\|G^{t'}\|^2 \mid w^t]} \overset{(c)}{\le} \sum_{t'=t_k}^{t-1} \eta L \cdot \sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})} \overset{(d)}{\le} \eta\,\tau_{\max} L \cdot \sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})},$$
where (b) is derived from Assumption 5, (c) is derived from Lemma 1, and (d) is derived from $t - t_k \le \tau_{\max}$.
The second part:
$$\mathbb{E}[\|\nabla f(w^{t_k}; z_{i_k}) - \mathbb{E}[\nabla f(w^{t_k}; z_{i_k})]\| \mid w^t] \le \sqrt{\mathbb{E}[\|\nabla f(w^{t_k}; z_{i_k}) - \mathbb{E}[\nabla f(w^{t_k}; z_{i_k})]\|^2 \mid w^t]} \overset{(e)}{\le} \sigma,$$
where (e) is derived from Assumption 4. By Assumption 2, we have the following estimate for the third part:
$$\mathbb{E}[\|\mathbb{E}[\nabla f(w^{t_k}; z_{i_k})] - \nabla F(w^{t_k})\| \mid w^t] \le \kappa.$$
Therefore,
$$\mathbb{E}[\|h_b - \nabla F(w^t)\| \mid w^t] \le \frac{1}{N_b^t}\sum_{k=1}^{N_b^t}\Big(\eta\,\tau_{\max} L\sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})} + \sigma + \kappa\Big) = \eta\,\tau_{\max} L\sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})} + \sigma + \kappa. \tag{4}$$
Similar to the proof of Lemma 1, $\forall j \in [d]$, we have
$$\min_{b \in \mathcal{H}_j^t}\{h_{bj} - \nabla F(w^t)_j\} \le Aggr([h_1 - \nabla F(w^t), \ldots, h_B - \nabla F(w^t)])_j \le \max_{b \in \mathcal{H}_j^t}\{h_{bj} - \nabla F(w^t)_j\},$$
where $\mathcal{H}_j^t$ is composed of the indices of the smallest $B-q$ elements in $\{h_{bj} - \nabla F(w^t)_j\}_{b \in \mathcal{H}^t}$. Therefore,
$$\|\mathbb{E}[Aggr([h_1 - \nabla F(w^t), \ldots, h_B - \nabla F(w^t)]) \mid w^t]\| \le \sum_{j=1}^{d}|\mathbb{E}[Aggr([h_1 - \nabla F(w^t), \ldots, h_B - \nabla F(w^t)])_j \mid w^t]|$$
$$\le \sum_{j=1}^{d}\mathbb{E}[|Aggr([h_1 - \nabla F(w^t), \ldots, h_B - \nabla F(w^t)])_j| \mid w^t] \overset{(f)}{\le} \sum_{j=1}^{d}\mathbb{E}\big[\max_{b \in \mathcal{H}_j^t}|h_{bj} - \nabla F(w^t)_j| \,\big|\, w^t\big]$$
$$\overset{(g)}{\le} \sum_{j=1}^{d} C_{B-r,q-r+1}\,\mathbb{E}[|h_{bj} - \nabla F(w^t)_j| \mid w^t] \le \sum_{j=1}^{d} C_{B-r,q-r+1}\,\mathbb{E}[\|h_b - \nabla F(w^t)\| \mid w^t]$$
$$\overset{(h)}{\le} C_{B-r,q-r+1}\, d \cdot \Big(\eta\,\tau_{\max} L\sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})} + \sigma + \kappa\Big), \tag{5}$$
where (f) is derived from the definition of q-BR, (g) is derived from Lemma 3, and (h) is derived from Inequality (4). Combining Equation (3) and Inequality (5), we obtain
$$\|\mathbb{E}[G^t - \nabla F(w^t) \mid w^t]\| \le C_{B-r,q-r+1}\, d \cdot \Big(\eta\,\tau_{\max} L\sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})} + \sigma + \kappa\Big).$$
By Proposition 3, we have
$$\|\mathbb{E}[G^t - \nabla F(w^t) \mid w^t]\| \le \frac{d(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}} \cdot \Bigg(\eta\,\tau_{\max} L\sqrt{\frac{d(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}} \cdot \Big(D^2 + \frac{\sigma^2}{N^{(t)}}\Big)} + \sigma + \kappa\Bigg).$$
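As a quick sanity check on the constants above (not code from the paper), one can evaluate $C_{M,K}$ directly, compare it against the closed-form bound of Proposition 3 over a grid of valid $(B, q, r)$, and estimate $\mathbb{E}[X_{(K)}]$ by Monte Carlo. The use of Exp(1) samples is an arbitrary choice of a non-negative distribution.

```python
import math
import random

def C(M, K):
    """The constant C_{M,K} from the proof of Lemma 3 (0**0 evaluates to 1)."""
    return (math.factorial(M) * (K - 1) ** (K - 1) * (M - K) ** (M - K)
            / (math.factorial(K - 1) * math.factorial(M - K) * (M - 1) ** (M - 1)))

def prop3_bound(B, q, r):
    """The upper bound on C_{B-r, q-r+1} claimed in Proposition 3."""
    return (B - r) * math.sqrt(B - r + 1) / math.sqrt((B - q - 1) * (q - r + 1))

# Proposition 3: the closed-form bound dominates C_{B-r, q-r+1}
# for all 0 <= r <= q < B/2.
for B in range(3, 30):
    for q in range((B - 1) // 2 + 1):
        for r in range(q + 1):
            assert C(B - r, q - r + 1) <= prop3_bound(B, q, r) + 1e-9

# Lemma 3 (Monte Carlo): E[X_(K)] <= C_{M,K} * E[X] for i.i.d. Exp(1) samples
# (E[X] = 1), where X_(K) is the K-th largest of M draws.
random.seed(0)
M, K, n = 9, 3, 20000
est = sum(
    sorted((random.expovariate(1.0) for _ in range(M)), reverse=True)[K - 1]
    for _ in range(n)
) / n
assert est <= C(M, K)
```

The bound is loose in this regime (the Monte Carlo estimate is well below $C_{9,3} \cdot \mathbb{E}[X]$), which is consistent with the Stirling-based relaxations used in the proof.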

B.4 PROOF OF THEOREM 1

Proof. By Assumption 5,
$$\mathbb{E}[F(w^{t+1}) \mid w^t] = \mathbb{E}[F(w^t - \eta G^t) \mid w^t] \overset{(a)}{\le} \mathbb{E}\Big[F(w^t) - \eta \nabla F(w^t)^{\mathrm{T}} G^t + \frac{L}{2}\eta^2\|G^t\|^2 \,\Big|\, w^t\Big]$$
$$= F(w^t) - \eta\,\nabla F(w^t)^{\mathrm{T}}\mathbb{E}[G^t \mid w^t] + \frac{\eta^2 L}{2}\mathbb{E}[\|G^t\|^2 \mid w^t]$$
$$= F(w^t) - \eta\|\nabla F(w^t)\|^2 + \frac{\eta^2 L}{2}\mathbb{E}[\|G^t\|^2 \mid w^t] - \eta\,\nabla F(w^t)^{\mathrm{T}}\mathbb{E}[G^t - \nabla F(w^t) \mid w^t]$$
$$\le F(w^t) - \eta\|\nabla F(w^t)\|^2 + \frac{\eta^2 L}{2}\mathbb{E}[\|G^t\|^2 \mid w^t] + \eta\|\nabla F(w^t)\| \cdot \|\mathbb{E}[G^t - \nabla F(w^t) \mid w^t]\|, \tag{6}$$
where (a) is derived from Assumption 5. Using Lemma 1 and Lemma 2, we have
$$\mathbb{E}[F(w^{t+1}) \mid w^t] \le F(w^t) - \eta\|\nabla F(w^t)\|^2 + \frac{\eta^2 L}{2} C_{B-r,q-r+1}\, d \cdot \Big(D^2 + \frac{\sigma^2}{N^{(t)}}\Big)$$
$$+ \eta \cdot C_{B-r,q-r+1}\, d \cdot \Big(\eta\,\tau_{\max} L\sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})} + \sigma + \kappa\Big) \cdot \|\nabla F(w^t)\|.$$
Also, by Assumption 3, $\|\nabla F(w^t)\| \le D$. Taking total expectation and combining $\|\nabla F(w^t)\| \le D$, we have
$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \eta \cdot \mathbb{E}[\|\nabla F(w^t)\|^2] + \frac{\eta^2 L}{2} C_{B-r,q-r+1}\, d \cdot \Big(D^2 + \frac{\sigma^2}{N^{(t)}}\Big) + \eta \cdot C_{B-r,q-r+1} D d \cdot \Big(\eta\,\tau_{\max} L\sqrt{C_{B-r,q-r+1}\, d \cdot (D^2 + \sigma^2/N^{(t)})} + \sigma + \kappa\Big).$$
Let $\bar{D} = \sqrt{\frac{1}{T}\sum_{t=0}^{T-1}(D^2 + \sigma^2/N^{(t)})}$. By telescoping, we have
$$\eta \cdot \sum_{t=0}^{T-1}\mathbb{E}[\|\nabla F(w^t)\|^2] \le \{F(w^0) - \mathbb{E}[F(w^T)]\} + \frac{\eta^2 T L}{2} C_{B-r,q-r+1}\, d \cdot \frac{1}{T}\sum_{t=0}^{T-1}\Big(D^2 + \frac{\sigma^2}{N^{(t)}}\Big) + \eta T \cdot C_{B-r,q-r+1} D d \cdot \big(\eta\,\tau_{\max} L \bar{D}\sqrt{C_{B-r,q-r+1}\, d} + \sigma + \kappa\big).$$
Note that $\mathbb{E}[F(w^T)] \ge F^*$. Dividing both sides by $\eta T$ and letting $\eta = O(\frac{1}{\sqrt{LT}})$, when $B = O(r)$ we have
$$C_{B-r,q-r+1} \le \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}} = O\Big(\frac{r}{(q-r+1)^{\frac{1}{2}}}\Big),$$
and hence
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla F(w^t)\|^2] \le O\Big(\frac{\sqrt{L}\,[F(w^0) - F^*]}{\sqrt{T}}\Big) + O\Big(\frac{r D d \sigma}{(q-r+1)^{\frac{1}{2}}}\Big) + O\Big(\frac{r D d \kappa}{(q-r+1)^{\frac{1}{2}}}\Big).$$

We next prove the convergence of BASGD with a general $(A_1, A_2)$-effective aggregation function. We have
$$\mathbb{E}[\|G^t\|^2 \mid w^t] \le \Big(\alpha^{t+1} + \frac{2}{1-\alpha}\Big) \cdot (A_2)^2. \tag{12}$$
Also, $\|\mathbb{E}[G^t \mid w^t]\|^2 + \mathrm{Var}[G^t \mid w^t] = \mathbb{E}[\|G^t\|^2 \mid w^t]$. Therefore,
$$\|\mathbb{E}[G^t \mid w^t]\| = \sqrt{\|\mathbb{E}[G^t \mid w^t]\|^2} \le \sqrt{\alpha^{t+1} + \frac{2}{1-\alpha}} \cdot A_2. \tag{13}$$
We have
$$\eta \cdot \mathbb{E}[\nabla F(w^t)^{\mathrm{T}} G^t \mid w^t] = \eta \cdot \mathbb{E}[\nabla F(w^t)^{\mathrm{T}} G^t_{syn} \mid w^t] + \eta \cdot \mathbb{E}[\nabla F(w^t)^{\mathrm{T}}(G^t - G^t_{syn}) \mid w^t]$$
$$\overset{(l)}{\ge} \eta \cdot \big(\|\nabla F(w^t)\|^2 - A_1\big) + \eta \cdot \mathbb{E}[\nabla F(w^t)^{\mathrm{T}}(G^t - G^t_{syn}) \mid w^t]$$
$$\ge \eta \cdot \|\nabla F(w^t)\|^2 - \eta A_1 - \eta\|\nabla F(w^t)\| \cdot \|\mathbb{E}[G^t - G^t_{syn} \mid w^t]\|$$
$$\overset{(m)}{\ge} \eta \cdot \|\nabla F(w^t)\|^2 - \eta A_1 - \eta D \cdot \|\mathbb{E}[G^t - G^t_{syn} \mid w^t]\| \overset{(n)}{\ge} \eta \cdot \|\nabla F(w^t)\|^2 - \eta A_1 - \eta D \cdot \sqrt{\frac{1}{2}\alpha^{t+1} + \frac{\alpha}{1-\alpha}} \cdot A_2, \tag{14}$$
where (l) is derived from the definition of an $(A_1, A_2)$-effective aggregation function, (m) is derived from Assumption 3, and (n) is derived from Inequality (11). Combining Inequalities (6), (12), (14) and taking total expectation, we have
$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \eta \cdot \mathbb{E}[\|\nabla F(w^t)\|^2] + \eta A_1 + \eta D\sqrt{\frac{1}{2}\alpha^{t+1} + \frac{\alpha}{1-\alpha}} \cdot A_2 + \frac{1}{2}\eta^2 L\Big(\alpha^{t+1} + \frac{2}{1-\alpha}\Big) \cdot (A_2)^2.$$
By telescoping, we have
$$\eta \cdot \sum_{t=0}^{T-1}\mathbb{E}[\|\nabla F(w^t)\|^2] \le \{F(w^0) - \mathbb{E}[F(w^T)]\} + \frac{1}{2}\eta^2 T L\Big(\alpha + \frac{2}{1-\alpha}\Big)(A_2)^2 + \eta T A_1 + \eta T D \cdot \sqrt{\frac{1}{2}\alpha + \frac{\alpha}{1-\alpha}} \cdot A_2.$$
Dividing both sides by $\eta T$, and letting $\eta = O(\frac{1}{\sqrt{LT}})$:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla F(w^t)\|^2] \le \frac{F(w^0) - \mathbb{E}[F(w^T)]}{\eta T} + \frac{1}{2}\eta L\Big(\alpha + \frac{2}{1-\alpha}\Big)(A_2)^2 + A_1 + D\sqrt{\frac{1}{2}\alpha + \frac{\alpha}{1-\alpha}} \cdot A_2$$
$$\le \frac{\sqrt{L}\,[F(w^0) - F^*]}{\sqrt{T}} + \frac{\sqrt{L}\,(\frac{1}{2}\alpha + \frac{1}{1-\alpha})(A_2)^2}{\sqrt{T}} + A_1 + \alpha^{\frac{1}{2}}\Big[\frac{3-\alpha}{2(1-\alpha)}\Big]^{\frac{1}{2}} \cdot D A_2.$$
Note that $\alpha = 2\eta^2 L^2 \tau_{\max}^2(B-r) = O\big(\frac{L\tau_{\max}^2(B-r)}{T}\big)$. Finally, we have
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla F(w^t)\|^2] \le O\Big(\frac{\sqrt{L}\,[F(w^0) - F^*]}{\sqrt{T}}\Big) + O\Big(\frac{\sqrt{L}\,(A_2)^2(1+\alpha)}{\sqrt{T}}\Big) + O\big(\alpha^{\frac{1}{2}} D A_2\big) + A_1$$
$$= O\Big(\frac{L^{\frac{1}{2}}[F(w^0) - F^*]}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{1}{2}}(A_2)^2}{T^{\frac{1}{2}}}\Big) + O\Big(\frac{L^{\frac{1}{2}}\tau_{\max}(B-r)^{\frac{1}{2}} D A_2}{T^{\frac{1}{2}}}\Big) + A_1.$$
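The algebraic simplification used in the last displayed step, $\sqrt{\frac{1}{2}\alpha + \frac{\alpha}{1-\alpha}} = \alpha^{\frac{1}{2}}\big[\frac{3-\alpha}{2(1-\alpha)}\big]^{\frac{1}{2}}$ for $\alpha \in (0, 1)$, can be confirmed numerically. This is only a quick check of the identity, not part of the paper's code:

```python
import math
import random

# Check: sqrt(a/2 + a/(1-a)) == sqrt(a) * sqrt((3-a) / (2*(1-a))) for a in (0, 1),
# since a/2 + a/(1-a) = a * (3 - a) / (2 * (1 - a)).
random.seed(1)
for _ in range(1000):
    a = random.uniform(1e-6, 1 - 1e-6)
    lhs = math.sqrt(a / 2 + a / (1 - a))
    rhs = math.sqrt(a) * math.sqrt((3 - a) / (2 * (1 - a)))
    assert math.isclose(lhs, rhs, rel_tol=1e-9)
```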



Figure 2: Average top-1 test accuracy w.r.t. epochs when there are no Byzantine workers (the first row), 3 Byzantine workers (the second row) and 6 Byzantine workers (the last row), respectively. Subfigures (c) and (e) are for RD-attack, while Subfigures (d) and (f) are for NG-attack.

Results under attack are shown in Figure 2(c) and Figure 2(d) (3 Byzantine workers), and in Figure 2(e) and Figure 2(f) (6 Byzantine workers). Results about training loss are in Appendix C. We can find that BASGD significantly outperforms ASGD and Kardam.

Figure 3: Average perplexity w.r.t. epochs with 1 Byzantine worker. Subfigures (a) and (b) are for RD-attack, while Subfigures (c) and (d) are for NG-attack. Due to the differences in magnitude of perplexity, the y-axes of Subfigures (a) and (c) are in log-scale. In addition, Subfigures (b) and (d) illustrate that BASGD converges with only a small loss in perplexity compared to the gold standard.

Filtered ratio of received gradients in Kardam under NG-attack (3 Byzantine workers)

Proof. Let $\bar{h}_b$ be the value of the $b$-th buffer if all received loyal gradients were computed based on $w^t$, and note that $G^t_{syn} = Aggr([\bar{h}_1, \ldots, \bar{h}_B])$. Since there are at most $r$ Byzantine workers, at most $r$ buffers may contain Byzantine gradients. Without loss of generality, suppose only the first $r$ buffers may contain Byzantine gradients. That is, $h_1, \ldots, h_r$ may contain Byzantine gradients and take arbitrary values, while $h_{r+1}, \ldots, h_B$ each stores loyal gradients computed based on $w^t$.

We prove the claimed bound by induction on $t$.

Step 1. When $t = 0$, all gradients are computed according to $w^0$, and we have $G^0 = G^0_{syn}$, so the bound holds.

Step 2. Suppose the claimed bound holds for all $t' = 0, 1, \ldots, t-1$ (induction hypothesis). Then the bound at iteration $t$ follows from a chain of inequalities in which (b) is derived from the definition of a stable aggregation function, (c) from Cauchy's inequality, (d) from Assumption 5, (e) again from Cauchy's inequality, (f) from the induction hypothesis, (g) from $t - t_k \le \tau_{\max}$, and (h) from $\alpha = 2\eta^2 L^2 \tau_{\max}^2(B-r)$. In the subsequent estimate, (i) is derived from $\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2$, $\forall x, y \in \mathbb{R}^d$, (j) from the definition of an $(A_1, A_2)$-effective aggregation function, and (k) from Inequality (9). By Inequalities (9) and (10), the claimed property also holds for iteration $t$.

In conclusion, the bound holds for all $t = 0, 1, \ldots, T-1$.

Proof. Under the condition in the statement, combining with property (i) of the $(A_1, A_2)$-effective aggregation function, we have $A_1 \le D^2$.

C MORE EXPERIMENTAL RESULTS

Figure 4, Figure 5 and Figure 6 illustrate the average training loss w.r.t. epochs when there are no Byzantine workers, 3 Byzantine workers and 6 Byzantine workers, respectively. Please note that in Figure 5 and Figure 6, some curves do not appear because the value of the loss function is extremely large, or even exceeds the range of floating-point numbers, due to the Byzantine attack. γ is the hyper-parameter specifying the assumed number of Byzantine workers in Kardam. The experimental results about training loss give further support to the experimental summary in Section 5.

