BYZANTINE-ROBUST LEARNING ON HETEROGENEOUS DATASETS VIA RESAMPLING

Abstract

In Byzantine robust distributed optimization, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages to the server. While this problem has received significant attention recently, most current defenses assume that the workers have identical data. For realistic cases when the data across workers are heterogeneous (non-iid), we design new attacks which circumvent these defenses leading to significant loss of performance. We then propose a simple resampling scheme that adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We theoretically and experimentally validate our approach, showing that combining resampling with existing robust algorithms is effective against challenging attacks.

1. INTRODUCTION

Distributed or federated machine learning, where the data is distributed across multiple workers, has become an increasingly important learning paradigm, both because of growing dataset sizes and because of privacy and security concerns. In such a setting, the workers collaborate to train a single model without transmitting their data directly over the network (McMahan et al., 2016; Bonawitz et al., 2019; Kairouz et al., 2019). Because of actively malicious agents in the network, or simply because of system and network failures, some workers may disobey the protocol and send arbitrary messages; such workers are known as Byzantine workers (Lamport et al., 2019). Byzantine-robust optimization algorithms combine the gradients received from all workers using robust aggregation rules, to ensure that training is not impacted by the malicious workers. While this problem has received significant recent attention (Alistarh et al., 2018; Blanchard et al., 2017; Yin et al., 2018a), most current approaches assume that the data on every worker follows an identical distribution. In this work, we show that existing Byzantine-robust methods fail catastrophically in the realistic setting where the data is distributed heterogeneously across the workers. We then propose a simple resampling scheme which can be readily combined with existing aggregation rules to allow robust training on heterogeneous data.

Contribution. Concretely, our contributions in this work are:

• We show that when the data across workers is heterogeneous, existing robust rules might not converge, even without any Byzantine adversaries.
• We propose two new attacks, normalized gradient and mimic, which take advantage of data heterogeneity and circumvent median and sign-based defenses (Blanchard et al., 2017; Pillutla et al., 2019; Li et al., 2019).
• We propose a simple new resampling step which can be used before any existing robust aggregation rule. We instantiate our scheme with KRUM and theoretically prove that the resampling generalizes it to the setting of heterogeneous data.
• Our experiments evaluate the proposed resampling scheme against known and new attacks and show that it drastically improves the performance of 3 existing schemes on realistic heterogeneously distributed datasets.

Setup and notations. We study the general distributed optimization problem

$$ L^\star = \min_{x \in \mathbb{R}^d} \Big\{ L(x) := \frac{1}{n} \sum_{i=1}^{n} L_i(x) \Big\} \tag{1} $$

where $L_i : \mathbb{R}^d \to \mathbb{R}$ are the individual loss functions distributed among $n$ workers, each having its own (heterogeneous) data distribution $\{D_i\}_{i=1}^{n}$. The case of empirical risk minimization with $m_i$ datapoints $\xi_i \sim D_i$ on worker $i$ is obtained when using $L_i(x) := \frac{1}{m_i} \sum_{j=1}^{m_i} L_i(x, \xi_i^j)$. The (stochastic) gradient computed by a good node $i$ with sample $j$ is given as $g_i(x) := \nabla L_i(x, \xi_i^j)$, with mean $\mu_i$ and variance $\sigma_i^2$. We also assume that the heterogeneity (variance across good workers) is bounded, i.e.

$$ \mathbb{E}_i \|\nabla L_i(x) - \nabla L(x)\|^2 \le \bar\sigma^2, \quad \forall x. $$

We write $g_i$ instead of $g_i(x_t)$ when there is no ambiguity. A distributed training step using an aggregation rule is given as

$$ x_{t+1} := x_t - \gamma_t \, \mathrm{Aggr}(\{g_i(x_t) : i \in [n]\}) \tag{2} $$

If the aggregation rule is the arithmetic mean, then (2) recovers standard minibatch SGD.

Byzantine attack model. In each iteration, there is a set Byz of at most $f$ Byzantine workers. The remaining workers are good and thus follow the described protocol. A Byzantine worker $j \in$ Byz can deviate from the protocol and send an arbitrary vector to the server. In addition, we allow Byzantine workers to collude with each other and to know every state of the system. Unlike martingale-based approaches such as Alistarh et al. (2018), we allow the set Byz to change over time (Blanchard et al., 2017; Chen et al., 2017; Mhamdi et al., 2018).
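The training step with a pluggable aggregation rule can be sketched in a few lines. This is a minimal illustration using plain Python lists; the names `training_step` and `mean_aggregate` are hypothetical and do not come from the authors' implementation. With the arithmetic mean as the aggregation rule, the step reduces to ordinary minibatch SGD.

```python
from typing import Callable, List

Vector = List[float]


def mean_aggregate(grads: List[Vector]) -> Vector:
    """Arithmetic mean of the worker gradients (no robustness at all)."""
    n = len(grads)
    return [sum(coords) / n for coords in zip(*grads)]


def training_step(x: Vector,
                  grads: List[Vector],
                  aggregate: Callable[[List[Vector]], Vector],
                  lr: float) -> Vector:
    """One server step: x_{t+1} = x_t - lr * Aggr({g_i(x_t)})."""
    update = aggregate(grads)
    return [xi - lr * ui for xi, ui in zip(x, update)]


# Toy usage (illustrative numbers): three workers, two-dimensional model.
x0 = [1.0, -1.0]
worker_grads = [[0.5, 0.0], [1.5, -1.0], [1.0, -2.0]]
x1 = training_step(x0, worker_grads, mean_aggregate, lr=0.1)
```

Any robust rule with the same signature (a list of vectors in, one vector out) can be passed in place of `mean_aggregate`.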

2. RELATED WORK

There has been significant recent work on the case where the workers have identical data distributions (Blanchard et al., 2017; Chen et al., 2017; Mhamdi et al., 2018; Alistarh et al., 2018; Yin et al., 2018a;b; Su & Xu, 2018; Damaskinos et al., 2019). We discuss the most pertinent of these methods next. Blanchard et al. (2017) formalize the Byzantine-robust setup and propose a distance-based approach, KRUM, which selects a worker whose gradient is very close to those of at least half of the other workers. A different approach uses the median and its variants (Blanchard et al., 2017; Pillutla et al., 2019; Yin et al., 2018a). Yin et al. (2018a) propose and analyze the coordinate-wise median method (CM). Pillutla et al. (2019) use a smoothed version of Weiszfeld's algorithm to iteratively compute an approximate geometric median of the input gradients. In a third approach, Bernstein et al. (2018) propose to take the signs of the gradients and aggregate them by majority vote; however, Karimireddy et al. (2019) show that this may not always converge. Finally, Alistarh et al. (2018) use a martingale-based aggregation rule which gives a sample-complexity-optimal algorithm for iid data. The distance-based approach of KRUM was later extended by Mhamdi et al. (2018), who propose BULYAN to overcome the dimensional leeway attack. This property is the so-called strong Byzantine resilience and is orthogonal to the question of non-iid-ness we study here. Recently, Peng & Ling (2020) and Yang & Bajwa (2019a;b) studied Byzantine-resilient algorithms in the decentralized setting where there is no central server available. Extending our techniques to the decentralized setting is an important direction for future work.
In a different line of work, Lai et al. (2016) and Diakonikolas et al. (2019) develop sophisticated spectral techniques to robustly estimate the mean of a high-dimensional multivariate standard Gaussian distribution, where samples are evenly distributed in all directions and the attackers are concentrated in one direction. Very recent work (Data & Diggavi, 2020) extends the theoretical analysis to non-convex, strongly convex and non-iid setups under a gradient dissimilarity assumption and proposes a gradient compression scheme on top of it. Our resampling trick can be combined with it to further reduce gradient dissimilarity. Many attacks have been devised for distributed training. For the iid setting, the state-of-the-art attacks are (Baruch et al., 2019; Xie et al., 2019b). The latter attack is very strong when the fraction of adversaries is large (nearly half), but in this work we focus on settings where this fraction is quite small (e.g. ≤ 0.2). Further, our normalized mean attack (Section 3.2) is inspired by (Xie et al., 2019b). The former work focuses on attacks which are coordinated across time steps. Developing strong practical defenses against such time-coordinated attacks, even in the iid case, remains an open problem. In this work, we sidestep this issue by restricting ourselves to new attacks made possible by non-iid data and studying how to overcome them. We focus on schemes which work in the iid setting but fail with non-iid data. Once a new method which can defend against (Baruch et al., 2019) is developed, our proposed scheme shows how to adapt such a method to the important non-iid case. For the non-iid setting, backdoor attacks are designed to take advantage of heavy-tailed data and manipulate model inference on a specific subtask, rather than to lower the overall training accuracy (Bagdasaryan et al., 2018; Bhagoji et al., 2018).
In contrast, this paper is not intended to address the aforementioned challenges but rather to defend against attacks that lower the training accuracy in the non-iid setting. As far as we are aware, only (Li et al., 2019; Ghosh et al., 2019; Sattler et al., 2020) explicitly investigate Byzantine robustness with non-iid workers. Li et al. (2019) propose an SGD variant (RSA) which modifies the original objective by adding an $\ell_1$ penalty. Miura & Harada (2015). However, none of these methods are applicable in the standard federated learning setup we consider here. We aim to minimize the original loss function over workers while respecting the non-iid data locality, i.e. the partition of the given heterogeneous dataset over the workers, without data transfer.

3. ATTACKS AGAINST EXISTING AGGREGATION SCHEMES

In this section we show that when the data across the workers is heterogeneous (non-iid), we can design new attacks which take advantage of the heterogeneity, leading to the failure of existing aggregation schemes. We study three classes of robust aggregation schemes: i) schemes which select a representative worker in each round (e.g. KRUM (Blanchard et al., 2017)), ii) schemes which use normalized means (e.g. RSA (Li et al., 2019)), and iii) those which use the median (e.g. RFA (Pillutla et al., 2019)). We show realistic settings under which each of these classes would fail when faced with heterogeneous data.

3.1. FAILURE OF REPRESENTATIVE WORKER SCHEMES ON NON-IID DATA

Algorithms like KRUM select workers who are representative of a majority of the workers, by relying on statistics such as pairwise differences between the various worker updates. Let $(g_1, \dots, g_n)$ be the gradients sent by the workers, $f$ of which are Byzantine (e.g. $n \ge 2f + 3$ for KRUM). For $i \ne j$, let $i \to j$ denote that $g_j$ belongs to the $n - f - 2$ closest vectors to $g_i$. Then KRUM is defined as follows:

$$ \mathrm{KRUM}(g_1, \dots, g_n) := \arg\min_{i} \sum_{i \to j} \|g_i - g_j\|^2 \tag{3} $$

However, when the data across the workers is heterogeneous, there is no 'representative' worker. This is because each worker computes its local gradient over vastly different local data. Hence, for convergence it is important not only to select a good (non-Byzantine) worker, but also to ensure that each of the good workers is selected with roughly equal frequency. KRUM therefore suffers a significant loss in performance with heterogeneous data, even when there are no Byzantine workers. For example, when KRUM is used on iid datasets without an adversary ($f = 0$, see left of Figure 1a), the test accuracy is close to that of simple averaging, and the gap can be filled by MULTI-KRUM (Blanchard et al., 2017). The right plot of Figure 1a also shows that KRUM's selection of gradients is biased towards certain nodes. When KRUM is applied to non-iid datasets (the middle of Figure 1a), it performs poorly even without any attack. This is because KRUM mostly selects gradients from a few nodes whose distribution is closer to the others (the right of Figure 1a). This is an example of how robust aggregation rules may fail on realistic non-iid datasets.
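The selection rule in (3) can be sketched as follows. This is an unoptimized illustration, not the authors' PyTorch code; the function names and the toy gradients are assumptions made for the example. Each worker is scored by the sum of squared distances to its $n - f - 2$ nearest other gradients, and the lowest score wins.

```python
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def krum(grads: List[Vector], f: int) -> Tuple[int, Vector]:
    """Return (index, gradient) of the worker with the smallest KRUM score."""
    n = len(grads)
    if n < 2 * f + 3:
        raise ValueError("KRUM as stated here needs n >= 2f + 3")
    k = n - f - 2  # number of closest neighbours entering each score
    best_idx, best_score = -1, float("inf")
    for i in range(n):
        dists = sorted(sq_dist(grads[i], grads[j]) for j in range(n) if j != i)
        score = sum(dists[:k])
        if score < best_score:
            best_idx, best_score = i, score
    return best_idx, grads[best_idx]


# Toy usage: four similar gradients and one far-away (Byzantine-like) vector.
grads = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.1], [100.0, 100.0]]
idx, chosen = krum(grads, f=1)
```

Because the rule always returns a single input vector, repeated calls on heterogeneous data can keep returning the same few workers, which is the selection bias discussed above.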

3.2. ATTACKS ON NORMALIZED AGGREGATION SCHEMES

Instead of simply averaging the gradients, some methods first normalize them and then average. This limits the influence of the Byzantine workers, since they cannot output extremely large gradients, and hence is more robust. For example, RFA (Pillutla et al., 2019) with T = 1 uses the following aggregation rule:

$$ \mathrm{NM}(g_1, \dots, g_n) = \sum_{i=1}^{n} \frac{g_i}{\|g_i\|_2} \tag{4} $$

Other methods such as RSA (Li et al., 2019) or signum (Bernstein et al., 2018) normalize entries coordinate-wise before taking a majority vote. RSA, for instance, updates the server model $x_0$ on the server using the local model $x_i$ from each node $i$ (not the gradient) via

$$ \mathrm{RSA}(x_0; x_1, \dots, x_n) := \nabla f_0(x_0) + \lambda \sum_{i=1}^{n} \mathrm{sign}(x_0 - x_i) \tag{5} $$

where $f_0$ is a strongly convex penalty term and $\lambda > 0$ is a relaxation parameter. However, a Byzantine worker can still craft an "omniscient" attack to foil robust aggregations, using an approach similar to the negative sum for the arithmetic mean (Blanchard et al., 2017; Li et al., 2019):

$$ v := -\sum_{i \in \text{good}} \frac{g_i}{\|g_i\|_2} \tag{6} $$

On the right side of Figure 1b, we can see that this attack lowers the accuracy of RFA-T1 significantly as the number of Byzantine workers increases. Compared to its iid counterpart, the normalized mean attack is even more impactful in the non-iid setting.
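The normalized mean rule and the omniscient attack vector built from the good gradients can be written down directly. The sketch below uses hypothetical helper names and toy two-dimensional gradients; it only illustrates the two formulas and is not taken from the authors' code.

```python
import math
from typing import List

Vector = List[float]


def l2_norm(g: Vector) -> float:
    return math.sqrt(sum(c * c for c in g))


def normalize(g: Vector) -> Vector:
    """Scale g to unit L2 norm (a zero vector is returned unchanged)."""
    norm = l2_norm(g)
    return [c / norm for c in g] if norm > 0 else list(g)


def normalized_mean(grads: List[Vector]) -> Vector:
    """NM rule: sum of the unit-norm versions of all received gradients."""
    unit = [normalize(g) for g in grads]
    return [sum(coords) for coords in zip(*unit)]


def normalized_mean_attack(good_grads: List[Vector]) -> Vector:
    """Byzantine vector v = -sum_{i in good} g_i / ||g_i||_2."""
    return [-c for c in normalized_mean(good_grads)]


# Toy usage: three good gradients and one Byzantine worker sending v.
good = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]
v = normalized_mean_attack(good)
clean = normalized_mean(good)
attacked = normalized_mean(good + [v])
```

In this toy case the server normalizes `v` as well, so a single Byzantine input shortens the aggregate by one unit of length along the clean direction; several colluding workers sending the same `v` would shrink or reverse it further.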

3.3. ATTACKS ON MEDIAN-BASED SCHEMES

The geometric median and its variants are popular in robust learning research (Blanchard et al., 2017; Chen et al., 2017; Pillutla et al., 2019; Yin et al., 2018a; Mhamdi et al., 2018). Given gradients $\{g_1, \dots, g_n\}$, we use the estimator

$$ \mathrm{GM}(g_1, \dots, g_n) := \arg\min_{v} \sum_{i=1}^{n} \|v - g_i\| \tag{7} $$

If the vectors $\{g_1, \dots, g_n\}$ are drawn independently from the same distribution, intuitively most of them concentrate around their mean. Then, even if there are some Byzantine outputs, the median ignores those as outliers and outputs a 'central' point close to the mean. However, when $\{g_1, \dots, g_n\}$ are gradients over heterogeneous data, they may be vastly different from each other and need not concentrate around the mean. In such a scenario, a median such as (7) can be even less robust than simply taking the mean. Suppose that worker 0 is Byzantine and the remaining workers $\{1, \dots, 2n\}$ are good, for a total of $2n + 1$ workers. Now suppose that $g_i = (-1)^i$ for all the workers, so that half of the good workers send $-1$, the other half send $+1$, and the Byzantine worker 0 sends $+1$. The true mean of the good gradients is 0; however, the median estimator (7) outputs 1, because $n + 1$ of the $2n + 1$ inputs equal $+1$.

Mimic attack. This motivates our mimic attack, in which all Byzantine workers collude and agree to always send the gradients of the same worker. We define a specialized attack, called mimic2, where half of the good workers have the same dataset and send $g_1$ while the remaining good workers send $g_2$; all Byzantine workers then send $v = g_1$, so that the geometric median of the gradients received by the server is always $g_1$. This attack therefore breaks geometric-median-based robust aggregation rules by leading them to wrong solutions. The left plot of Figure 1c shows the impact of the mimic2 attack. Test accuracies of CM and RFA both drop drastically, to around 50%.
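The one-dimensional counterexample above is easy to reproduce. In one dimension with an odd number of inputs, the geometric median coincides with the ordinary median, so the standard library suffices; the value of `n` below is an arbitrary illustrative choice.

```python
from statistics import mean, median

n = 5  # the example uses 2n good workers; this value is illustrative

# Workers 0, 1, ..., 2n send g_i = (-1)^i; worker 0 is the Byzantine one.
grads = [(-1) ** i for i in range(2 * n + 1)]
good = grads[1:]

good_mean = mean(good)       # average over the 2n good workers only
agg_median = median(grads)   # 1-D geometric median of all 2n + 1 inputs
agg_mean = mean(grads)       # plain averaging of all 2n + 1 inputs
```

Here the good workers average to 0, the median of all inputs jumps to 1, and the plain mean moves only to 1/(2n + 1). This is the same mechanism that mimic2 exploits when the Byzantine workers copy $g_1$.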

Algorithm 1 Robust Learning with Resampling

Setup: $n$ workers, $f$ of which are Byzantine; resampling $T$ times, each time sampling $s$ gradients; a robust learning algorithm AGGR for iid datasets; learning rate $\gamma$.

Workers:
1. Each good worker $i$ randomly samples a datapoint $j$ and computes a stochastic gradient $g_i := \nabla F_i(x, \xi_i^j)$ where $\xi_i^j \sim D_i$; each Byzantine worker $i$ sends an arbitrary vector $g_i$.
2. Send $g_i$ to the server.

Server:
1. Receive $\{g_i\}_{i=1}^{n}$ from all workers.
2. $S, I_S$ = Resampling($\{g_i : i \in [n]\}$, $f$, $T$, $s$); see Algorithm 2.
3. Compute $x := x - \gamma \, \mathrm{AGGR}(S)$.
4. Broadcast $x$ to all workers.

Algorithm 2 Resampling with s-replacement

Input: $\{g_i : i \in [n]\}$, $T := n$, $s$, $\{c[i] := 0 : i \in [n]\}$
for $t := 1, \dots, T$ do
    for $i := 1, \dots, s$ do
        while select $j_i \sim \mathrm{Uniform}([n])$ do
            if $c[j_i] < s$ then
                $c[j_i]$ += 1
                break
            (if $c[j_i] == s$, the index is exhausted and a new one is drawn)
    Compute the average $\bar g_t := \frac{1}{s} \sum_{i=1}^{s} g_{j_i}$
Return $\{\bar g_t : t \in [T]\}$, $\{j_i^t : t \in [T], i \in [s]\}$

4. ROBUST AGGREGATION ON NON-IID DATA

In Section 3 we have demonstrated how existing robust aggregation rules can fail in realistic non-iid scenarios, with attackers (Sections 3.2 and 3.3) and without (Section 3.1). To overcome this problem, we propose a simple new resampling-based aggregation rule for training, shown in Algorithm 1. More specifically, we choose s-resampling without replacement in Algorithm 2, where each gradient can be sampled at most $s$ times. The key property of our rule is that after resampling, the resulting set of averaged gradients $\{\bar g_t : t \in [T]\}$ is much more homogeneous (lower variance). These averaged gradients are then fed to existing Byzantine-robust aggregation schemes, such as KRUM; see Section 5. Given an existing aggregation rule AGGR, we denote by AGGR ∘ Resampling the resulting new robust aggregation rule for non-iid input gradients. In the following proposition, we list the desired properties of Algorithm 2.

Proposition I. Given a population $\{g_i : i \in [n]\} \subset \mathbb{R}^d$ of mean $\mu := \frac{1}{n} \sum_{i=1}^{n} g_i$ and variance $\sigma^2 := \frac{1}{n} \sum_{i=1}^{n} \|g_i - \mu\|^2$, let $\{\bar g_t : t \in [T]\}$ be the output of Algorithm 2 on $\{g_i : i \in [n]\}$. Then
• If there are no Byzantine workers, then $\{\bar g_t : t \in [T]\}$ are identically distributed with
$$ \mathbb{E}[\bar g_t] = \mu, \qquad \mathrm{var}(\bar g_t) = \frac{n-1}{sn-1} \sigma^2 \qquad \forall\, t \in [T] \tag{8} $$
• If $f$ of the $n$ inputs are Byzantine, then at least $T - sf$ gradients in $\{\bar g_t : t \in [T]\}$ are good; that is, a good $\bar g_t$ is the average of gradients $\{g_{j_i^t} : i \in [s]\} \subset \text{good} \subset [n]$.
Then such good $\{\bar g_t\}$ are identically distributed with
$$ \mathbb{E}[\bar g_t] = \bar\mu, \qquad \mathrm{var}(\bar g_t) = \frac{n-1}{sn-1} \bar\sigma^2 \tag{9} $$
where $\bar\mu := \frac{1}{|\text{good}|} \sum_{i \in \text{good}} g_i$ and $\bar\sigma^2 := \frac{1}{|\text{good}|} \sum_{i \in \text{good}} \|g_i - \mathbb{E}[\bar g_t]\|^2$.

Proof. Since Algorithm 2 resamples $s$ gradients to estimate a population of $sn$ samples, we can use sampling theory (Middleton, 1988, Ch. Survey Sampling) to compute the sample mean
$$ \mathbb{E}\, \mathrm{RS}(g_1, \dots, g_n) = \mu \tag{10} $$
and the sample variance
$$ \mathbb{E}\, \|\mathrm{RS}(g_1, \dots, g_n) - \mu\|^2 = \frac{1}{s} \Big( 1 - \frac{s-1}{sn-1} \Big) \sigma^2 = \frac{n-1}{sn-1} \sigma^2. \tag{11} $$
Since each gradient is sampled at most $s$ times, at most $sf$ out of the $T$ averaged gradients are affected by a Byzantine worker. The mean and variance of the good averaged gradients can be calculated in the same way as shown above.

Remark 1. For $s = 1$, resampling simply becomes a shuffling of the input elements, and $\mathrm{var}(\bar g_t) = \sigma^2$ is unchanged. For $s > 1$, the resampling scheme reduces the heterogeneity (variance) to approximately a $1/s$ fraction of its original value. Thus, increasing $s$ makes the resampled gradients a better estimator of the population mean, which improves training convergence speed. On the other hand, increasing $s$ also increases the number of resampled gradients which can be affected by a Byzantine worker. In particular, if $f$ workers are Byzantine, then up to $fs$ resampled gradients can be incorrect, which has to be taken into account by the employed robust aggregation rule. In practice, we found that a small value $s = 2$ was already sufficient to overcome heterogeneity.

Remark 2. A natural question to ask is what happens if we resample with replacement but do not limit the number of replacements. We discuss this additional algorithm variant in Appendix C. Note that the $\{\bar g_t : t \in [T]\}$ are identically distributed but not independent. This does not directly fit the original assumptions of Byzantine-robust algorithms like KRUM, and hence the robustness has to be reproved for our more general setting.
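The resampling step can be sketched as below, under the reading that an index is redrawn whenever its counter has already reached $s$. The function names and the toy population are assumptions for illustration, not the authors' implementation. With $T := n$ there are exactly $ns$ draws and $ns$ available slots, so the inner loop always finds a free index and every gradient ends up used exactly $s$ times.

```python
import random
from typing import List, Tuple

Vector = List[float]


def resample(grads: List[Vector], s: int,
             rng: random.Random) -> Tuple[List[Vector], List[List[int]]]:
    """s-resampling: T = n averages of s gradients, each input used <= s times."""
    n = len(grads)
    dim = len(grads[0])
    counts = [0] * n  # c[i] in Algorithm 2
    averaged, indices = [], []
    for _ in range(n):  # T := n rounds
        picked = []
        for _ in range(s):
            while True:
                j = rng.randrange(n)  # j ~ Uniform([n])
                if counts[j] < s:     # accept only indices with free capacity
                    counts[j] += 1
                    picked.append(j)
                    break
        averaged.append([sum(grads[j][k] for j in picked) / s
                         for k in range(dim)])
        indices.append(picked)
    return averaged, indices


def variance_factor(n: int, s: int) -> float:
    """Factor (n - 1) / (sn - 1) multiplying the variance in Proposition I."""
    return (n - 1) / (s * n - 1)


# Toy usage: 10 two-dimensional "gradients", s = 2, fixed seed.
population = [[float(i), -float(i)] for i in range(10)]
out, idx = resample(population, s=2, rng=random.Random(0))
factor = variance_factor(10, 2)  # 9 / 19, roughly one half
```

Since every input is consumed exactly $s$ times when $T = n$, the average of the outputs equals the population mean exactly, which gives a simple sanity check for any implementation.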

5. CONVERGENCE ANALYSIS WITH KRUM

In this section, we analyze the convergence of SGD with robust aggregation on non-iid data. Since the definition of robustness and other conditions vary from paper to paper, it is not possible to give a uniform proof that fits all methods. For example, Yin et al. (2018a) assume that the gradients have bounded variance and skewness, whereas others like KRUM, RFA and BULYAN do not. Thus we only analyze KRUM, for its simplicity and popularity, and show that the analysis is only slightly different from the original version. For other algorithms, we show by experiments that resampling helps them achieve better performance on heterogeneous data; see Section 6. Definition A generalizes the Byzantine resilience of (Blanchard et al., 2017, Definition 1) to the case of non-iid data. Let $G$ be an estimator of the good gradients.

Definition A ($(\alpha, f)$-Byzantine Resilience). Let $0 \le \alpha < \pi/2$ be any angular value, and $f$ any integer with $0 \le f \le n$. Let $B = \{j_1, \dots, j_f : 1 \le j_1 < \dots < j_f \le n\}$ be the indices of the Byzantine workers. Let $\{V_i \sim D_i : i \in [n] \setminus B\}$ be independent random vectors in $\mathbb{R}^d$. Let $G = G(\xi)$ be an independent random variable which randomly selects a good worker $i$ and samples a vector from $D_i$, with $\mathbb{E}\, G = g$. Let $B_1, \dots, B_f$ be any Byzantine vectors in $\mathbb{R}^d$, possibly dependent on the $V_i$'s. An aggregation rule $F$ is said to be $(\alpha, f)$-Byzantine resilient if $F = F(V_1, \dots, \underbrace{B_1}_{j_1}, \dots, \underbrace{B_f}_{j_f}, \dots, V_n)$ satisfies (i) $\langle \mathbb{E} F, g \rangle \ge (1 - \sin\alpha) \cdot \|g\|^2 > 0$ and (ii) for $r = 2, 3, 4$, $\mathbb{E} \|F\|^r \le \mathbb{E} \|G\|^r$.

Then we can conclude almost sure convergence, similarly to (Blanchard et al., 2017, Proposition 2).

Theorem II (Resampling KRUM). We assume that (i) the cost function $L$ is three times differentiable with continuous derivatives and non-negative, $L(x) \ge 0$; (ii) the learning rates satisfy $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$. Let the good workers have stochastic gradients $G_i(x, \xi)$ for $i \in \text{good} \subset [n]$. We assume that for a uniformly chosen $j \in \text{good}$, the following is true: (iii) $\mathbb{E}_{j,\xi}[G_j(x, \xi)] = \nabla L(x)$ and, for all $r \in \{2, 3, 4\}$, $\mathbb{E}_{j,\xi} \|G_j(x, \xi)\|^r \le A_r + B_r \|x\|^r$ for some constants $A_r, B_r$; (iv) there exists a constant $0 \le \alpha < \pi/2$ such that for all $x$ we have $\eta(T, sf) \cdot \sqrt{d} \cdot \sigma(x) \le \|\nabla L(x)\| \cdot \sin\alpha$, where $\sigma^2(x) := \frac{n-1}{sn-1} \bar\sigma^2(x)$; (v) finally, beyond a certain horizon, $\|x\|^2 \ge D$, there exist $\varepsilon > 0$ and $0 \le \beta < \pi/2 - \alpha$ such that $\|\nabla L(x)\| \ge \varepsilon > 0$ and $\langle x, \nabla L(x) \rangle \ge \cos\beta \cdot \|x\| \cdot \|\nabla L(x)\|$. If $s > 1$ and $2sf + 3 \le n$, then
• KRUM ∘ Resampling is $(\alpha, sf)$-Byzantine resilient, where $0 \le \alpha < \pi/2$ is defined by
$$ \sin\alpha = \frac{\eta(T, sf) \cdot \sqrt{d} \cdot \sigma}{\|\nabla L(x)\|}, \qquad \eta(n, f) := \sqrt{2 \Big( n - f + \frac{f \cdot (n - f - 2) + f^2 \cdot (n - f - 1)}{n - 2f - 2} \Big)} \tag{12} $$
• the sequence of gradients $\nabla L(x_t)$ converges almost surely to zero.

We defer the proof to Appendix A. The above convergence result for heterogeneous data is nearly identical to (Blanchard et al., 2017, Proposition 2) for iid data, except for the slightly stronger restriction on the number of Byzantine workers, $2sf + 3 \le n$.
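To get a feel for the constants in Theorem II, the following sketch evaluates $\eta$ and checks the condition $2sf + 3 \le n$. It assumes that $\eta$ has the square-root form of (Blanchard et al., 2017, Proposition 1); the function names and the example values of $n$, $f$ and $s$ are illustrative choices, not settings taken from the paper.

```python
import math
from typing import Optional, Tuple


def eta(n: int, f: int) -> float:
    """eta(n, f) = sqrt(2 (n - f + (f (n-f-2) + f^2 (n-f-1)) / (n - 2f - 2)))."""
    denom = n - 2 * f - 2
    if denom <= 0:
        raise ValueError("eta(n, f) requires n > 2f + 2")
    inner = n - f + (f * (n - f - 2) + f ** 2 * (n - f - 1)) / denom
    return math.sqrt(2 * inner)


def resampled_krum_params(n: int, f: int,
                          s: int) -> Optional[Tuple[int, int, float]]:
    """Return (T, sf, eta(T, sf)) if 2sf + 3 <= n holds, otherwise None."""
    T, f_tilde = n, s * f  # after resampling: T = n inputs, up to sf bad ones
    if 2 * f_tilde + 3 > n:
        return None
    return T, f_tilde, eta(T, f_tilde)


# Illustrative values: 25 workers, 2 Byzantine, s = 2.
params = resampled_krum_params(n=25, f=2, s=2)
```

The effective number of Byzantine inputs grows from $f$ to $sf$, which raises $\eta$, while the variance shrinks by the factor $(n-1)/(sn-1)$; the product $\eta \cdot \sigma$ in condition (iv) captures this trade-off.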

6. EXPERIMENTS

In this section, we demonstrate the effect of resampling on datasets distributed in a non-iid fashion. Throughout the section, we illustrate the challenge, attacks, and defense using the example of training an MLP on the MNIST dataset (LeCun et al., 1998). In Appendix D, we present the results of similar experiments on Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009). The dataset is sorted by labels and sequentially divided into equal parts among the good workers; Byzantine workers have access to the datasets on all good workers. Implementations are based on PyTorch (Paszke et al., 2019) and will be made publicly available.

6.1. RESAMPLING AGAINST THE ATTACKS ON NON-IID DATA

In Section 3 we have presented how heterogeneous data can lead to the failure of existing robust aggregation rules. Here we apply our proposed resampling with T = n, s = 2 to the same aggregation rules, showing that resampling overcomes the described failures. Results are presented in Figure 2. In Figure 2a, we show that resampling helps KRUM achieve better test accuracy on non-iid data. Since resampling KRUM with s = 2 effectively averages 2 gradients, we compare it with MULTI-KRUM with m = 2. The middle of Figure 2a shows that MULTI-KRUM with m = 2 performs better than KRUM, but KRUM with resampling is even better, which suggests that the resampling step improves the performance on non-iid data. The selection histogram in the rightmost part of Figure 2a shows that after resampling, KRUM's selection is much more evenly distributed among the good workers. In Figure 2b, we show that resampling fixes RFA with T = 1 and allows it to defend against the normalized mean attack. The resampling-based aggregation reaches almost the same accuracy in the iid and non-iid setups. In Figure 2c, while the mimic attack does not work against median-based rules in the iid setting, resampling still slightly improves the performance due to variance reduction. In the non-iid setting, resampling drastically improves the accuracy, to the same level as in the iid setting.

6.2. RESAMPLING AGAINST GENERAL BYZANTINE ATTACKS

In Figure 3, we present thorough experiments on non-iid data over 10 workers with 2 Byzantine workers. In each subfigure, we compare an aggregation rule with its variant with resampling. Three aggregation rules are compared: KRUM, CM, RFA. In particular, we compare to RFA with both T = 1 (normalized mean) and T = 8 (geometric median). Columns show each aggregation rule applied without (red) and with resampling (blue). Dashed lines, for comparison, show the same method without any Byzantine workers (f = 0). For RFA, T1 and T8 refer to the number of inner iterations of Weiszfeld's algorithm.

Attacks. Five different kinds of attacks are applied (one per row in the figure): bitflipping, labelflipping, the Gaussian attack, and the mimic and mimic2 attacks.
• Bitflipping: A Byzantine worker flips the sign bits and sends $-\nabla f(x)$ instead of $\nabla f(x)$, for example because of hardware failures.
• Labelflipping: The datasets on Byzantine workers have corrupted labels. For the MNIST dataset, we let Byzantine workers transform labels by $T(y) := 9 - y$.
• Gaussian: A Byzantine worker sends a Gaussian random vector with zero mean and isotropic covariance matrix with standard deviation 200 (Xie et al., 2018).
• mimic & mimic2: Explained in Section 3.3.

From Figure 3 we can see that resampling improves the accuracy on most of the tasks. The final accuracies achieved vary with the aggregation rule used. Notice that RFA-T1 is more robust to the mimic attack than RFA-T8 in Figure 3, because more inner iterations lead to a better approximation of the geometric median and to less robustness against normalized mean attacks. The normalized mean attack has been addressed in Section 3.2.
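The attacks listed above are simple to state in code. The sketch below uses plain Python lists and hypothetical function names; it mirrors the descriptions in the text rather than the authors' PyTorch implementation, and the `mimic` helper simply copies the gradient of one fixed worker.

```python
import random
from typing import List

Vector = List[float]


def bitflip(grad: Vector) -> Vector:
    """Bitflipping: send -grad instead of grad."""
    return [-g for g in grad]


def flip_label(y: int) -> int:
    """Labelflipping on MNIST-style labels 0..9: T(y) = 9 - y."""
    if not 0 <= y <= 9:
        raise ValueError("label must be in 0..9")
    return 9 - y


def gaussian_attack(dim: int, rng: random.Random,
                    std: float = 200.0) -> Vector:
    """Gaussian attack: zero-mean isotropic noise with standard deviation std."""
    return [rng.gauss(0.0, std) for _ in range(dim)]


def mimic(grads: List[Vector], target: int) -> Vector:
    """Mimic: every Byzantine worker copies the gradient of worker `target`."""
    return list(grads[target])


# Toy usage with made-up gradients.
rng = random.Random(0)
honest = [[0.2, -0.1], [0.4, 0.3]]
byzantine_inputs = [bitflip(honest[0]), gaussian_attack(2, rng), mimic(honest, 0)]
```

Labelflipping differs from the others in that the Byzantine worker still runs the honest gradient computation, only on relabeled data.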

7. CONCLUSION

In this paper, we initiated a study of the robust distributed learning problem under realistic heterogeneous data. We showed that many existing Byzantine-robust aggregation rules fail under simple new attacks, and sometimes even without any Byzantine workers. As a solution, we propose a resampling scheme which effectively adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We believe robustness under heterogeneous conditions has been an overlooked direction of research thus far, and we hope to inspire more work on this topic. Extending our approach to the decentralized setting and to stronger Byzantine adversaries, as well as obtaining optimal algorithms, are other challenging directions for future work.

A CONVERGENCE ANALYSIS OF KRUM WITH RESAMPLING

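Theorem II (Resampling KRUM). We assume that (i) the cost function $L$ is three times differentiable with continuous derivatives and non-negative, $L(x) \ge 0$; (ii) the learning rates satisfy $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$. Let the good workers have stochastic gradients $G_i(x, \xi)$ for $i \in \text{good} \subset [n]$. We assume that for a uniformly chosen $j \in \text{good}$, the following is true: (iii) $\mathbb{E}_{j,\xi}[G_j(x, \xi)] = \nabla L(x)$ and, for all $r \in \{2, 3, 4\}$, $\mathbb{E}_{j,\xi} \|G_j(x, \xi)\|^r \le A_r + B_r \|x\|^r$ for some constants $A_r, B_r$; (iv) there exists a constant $0 \le \alpha < \pi/2$ such that for all $x$ we have $\eta(T, sf) \cdot \sqrt{d} \cdot \sigma(x) \le \|\nabla L(x)\| \cdot \sin\alpha$, where $\sigma^2(x) := \frac{n-1}{sn-1} \bar\sigma^2(x)$; (v) finally, beyond a certain horizon, $\|x\|^2 \ge D$, there exist $\varepsilon > 0$ and $0 \le \beta < \pi/2 - \alpha$ such that $\|\nabla L(x)\| \ge \varepsilon > 0$ and $\langle x, \nabla L(x) \rangle \ge \cos\beta \cdot \|x\| \cdot \|\nabla L(x)\|$. If $s > 1$ and $2sf + 3 \le n$, then
• KRUM ∘ Resampling is $(\alpha, sf)$-Byzantine resilient, where $0 \le \alpha < \pi/2$ is defined by
$$ \sin\alpha = \frac{\eta(T, sf) \cdot \sqrt{d} \cdot \sigma}{\|\nabla L(x)\|}, \qquad \eta(n, f) := \sqrt{2 \Big( n - f + \frac{f \cdot (n - f - 2) + f^2 \cdot (n - f - 1)}{n - 2f - 2} \Big)} \tag{12} $$
• the sequence of gradients $\nabla L(x_t)$ converges almost surely to zero.

Proof. We only prove the first statement. The second one follows from applying (Blanchard et al., 2017, Proposition 2) directly. After resampling, we have $\tilde n := T$ gradients, and at most $\tilde f := sf$ of them are Byzantine. Without loss of generality, we assume that the Byzantine vectors $\tilde B_1, \dots, \tilde B_{\tilde f}$ occupy the last $\tilde f$ positions in the arguments of KRUM. We write $\mathrm{KRUM} := \mathrm{KRUM}(\tilde V_1, \dots, \tilde V_{\tilde n - \tilde f}, \tilde B_1, \dots, \tilde B_{\tilde f})$. For each index $i$, we denote by $\delta_c(i)$ (resp. $\delta_b(i)$) the number of correct (resp. Byzantine) indices $j$ such that $i \to j$ (recall that $i \to j$ denotes that $\tilde V_j$ belongs to the $\tilde n - \tilde f - 2$ closest vectors to $\tilde V_i$). We have
$$ \delta_c(i) + \delta_b(i) = \tilde n - \tilde f - 2, \qquad \tilde n - 2\tilde f - 2 \le \delta_c(i) \le \tilde n - \tilde f - 2, \qquad \delta_b(i) \le \tilde f. $$
We focus first on condition (i) of $(\alpha, f)$-Byzantine resilience as in Definition A. We determine an upper bound on the squared distance $\|\mathbb{E}\,\mathrm{KRUM} - g\|^2$. Note that, for any correct $j$, $\mathbb{E}\, \tilde V_j = g$. Let $i_*$ be the index chosen by KRUM. Then
$$ \|\mathbb{E}\,\mathrm{KRUM} - g\|^2 \le \Big\| \mathbb{E} \Big( \mathrm{KRUM} - \frac{1}{\delta_c(i_*)} \sum_{i_* \to \text{correct } j} \tilde V_j \Big) \Big\|^2 \le \mathbb{E} \Big\| \mathrm{KRUM} - \frac{1}{\delta_c(i_*)} \sum_{i_* \to \text{correct } j} \tilde V_j \Big\|^2 $$
$$ \le \sum_{\text{correct } i} \mathbb{E} \Big\| \tilde V_i - \frac{1}{\delta_c(i)} \sum_{i \to \text{correct } j} \tilde V_j \Big\|^2 \mathbb{1}(i_* = i) + \sum_{\text{byz } k} \mathbb{E} \Big\| \tilde B_k - \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \tilde V_j \Big\|^2 \mathbb{1}(i_* = k) $$
where $\mathbb{1}$ is the indicator function. We first examine the case $i_* = i$ for a correct index $i$:
$$ \Big\| \tilde V_i - \frac{1}{\delta_c(i)} \sum_{i \to \text{correct } j} \tilde V_j \Big\|^2 = \Big\| \frac{1}{\delta_c(i)} \sum_{i \to \text{correct } j} (\tilde V_i - \tilde V_j) \Big\|^2 \le \frac{1}{\delta_c(i)} \sum_{i \to \text{correct } j} \|\tilde V_i - \tilde V_j\|^2, $$
$$ \mathbb{E} \Big\| \tilde V_i - \frac{1}{\delta_c(i)} \sum_{i \to \text{correct } j} \tilde V_j \Big\|^2 \le \frac{1}{\delta_c(i)} \sum_{i \to \text{correct } j} \mathbb{E} \|\tilde V_i - \tilde V_j\|^2 \le 2 d \sigma^2 $$
where $\sigma^2 := \frac{n-1}{sn-1} \bar\sigma^2$ as in Proposition I. We now examine the case $i_* = k$ for some Byzantine index $k$. The fact that $k$ minimizes the score implies that, for all correct $i$,
$$ \sum_{k \to \text{correct } j} \|\tilde B_k - \tilde V_j\|^2 + \sum_{k \to \text{byz } l} \|\tilde B_k - \tilde B_l\|^2 \le \sum_{i \to \text{correct } j} \|\tilde V_i - \tilde V_j\|^2 + \sum_{i \to \text{byz } l} \|\tilde V_i - \tilde B_l\|^2 \tag{13} $$
Then, for all correct indices $i$,
$$ \Big\| \tilde B_k - \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \tilde V_j \Big\|^2 \le \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \|\tilde B_k - \tilde V_j\|^2 \le \frac{1}{\delta_c(k)} \sum_{i \to \text{correct } j} \|\tilde V_i - \tilde V_j\|^2 + \frac{1}{\delta_c(k)} \sum_{i \to \text{byz } l} \|\tilde V_i - \tilde B_l\|^2. $$
We focus on $D^2(i) := \sum_{i \to \text{byz } l} \|\tilde V_i - \tilde B_l\|^2$. Each correct worker $i$ has $\tilde n - \tilde f - 2$ neighbors and $\tilde f + 1$ non-neighbors. Thus there exists a correct worker $\zeta(i)$ which is farther from $i$ than any of the neighbors of $i$. In particular, for each Byzantine index $l$ such that $i \to l$, $\|\tilde V_i - \tilde B_l\|^2 \le \|\tilde V_i - \tilde V_{\zeta(i)}\|^2$. Whence
$$ \Big\| \tilde B_k - \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \tilde V_j \Big\|^2 \le \frac{1}{\delta_c(k)} \sum_{i \to \text{correct } j} \|\tilde V_i - \tilde V_j\|^2 + \frac{\delta_b(i)}{\delta_c(k)} \|\tilde V_i - \tilde V_{\zeta(i)}\|^2, $$
$$ \mathbb{E} \Big\| \tilde B_k - \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \tilde V_j \Big\|^2 \le \frac{\delta_c(i)}{\delta_c(k)} 2 d \sigma^2 + \frac{\delta_b(i)}{\delta_c(k)} \sum_{\text{correct } j \ne i} \mathbb{E} \|\tilde V_i - \tilde V_j\|^2 \mathbb{1}(\zeta(i) = j) $$
$$ \le \Big( \frac{\delta_c(i)}{\delta_c(k)} + \frac{\delta_b(i)}{\delta_c(k)} (\tilde n - \tilde f - 1) \Big) 2 d \sigma^2 \le \Big( \frac{\tilde n - \tilde f - 2}{\tilde n - 2\tilde f - 2} + \frac{\tilde f}{\tilde n - 2\tilde f - 2} (\tilde n - \tilde f - 1) \Big) 2 d \sigma^2. $$
Putting everything together, we have
$$ \|\mathbb{E}\,\mathrm{KRUM} - g\|^2 \le (\tilde n - \tilde f)\, 2 d \sigma^2 + \tilde f \cdot \Big( \frac{\tilde n - \tilde f - 2}{\tilde n - 2\tilde f - 2} + \frac{\tilde f}{\tilde n - 2\tilde f - 2} (\tilde n - \tilde f - 1) \Big) 2 d \sigma^2 \le \eta^2(\tilde n, \tilde f)\, d \sigma^2. $$
By the assumption $\eta^2(\tilde n, \tilde f)\, d \sigma^2 < \|g\|^2$, the vector $\mathbb{E}\,\mathrm{KRUM}$ lies in a ball of radius $\eta(\tilde n, \tilde f) \sqrt{d} \sigma$ centered at $g$, so we know that
$$ \langle \mathbb{E}\,\mathrm{KRUM}, g \rangle \ge \big( \|g\| - \eta(\tilde n, \tilde f) \cdot \sqrt{d} \cdot \sigma \big) \cdot \|g\| = (1 - \sin\alpha) \cdot \|g\|^2. $$
To sum up, condition (i) of Byzantine resilience holds. Now we focus on condition (ii):
$$ \mathbb{E} \|\mathrm{KRUM}\|^r = \sum_{\text{correct } i} \mathbb{E} \|\tilde V_i\|^r \mathbb{1}(i_* = i) + \sum_{\text{byz } k} \mathbb{E} \|\tilde B_k\|^r \mathbb{1}(i_* = k). $$
We think of $G$ as the estimator of gradients among the good workers. More specifically, $G$ is a random variable which uniformly samples one worker among the good workers and then samples one gradient from that worker. Then
$$ P(G = v) = \frac{1}{n - f} \sum_{i=1}^{n-f} P(V_i = v), \qquad \mathbb{E} \|G\|^r = \sum_v P(G = v) \|v\|^r = \frac{1}{n - f} \sum_{i=1}^{n-f} \sum_v P(V_i = v) \|v\|^r = \frac{1}{n - f} \sum_{i=1}^{n-f} \mathbb{E} \|V_i\|^r. $$
Thus
$$ \sum_{\text{correct } i} \mathbb{E} \|\tilde V_i\|^r \le \sum_{i=1}^{n-f} \mathbb{E} \|V_i\|^r = (\tilde n - \tilde f)\, \mathbb{E} \|G\|^r \tag{16} $$
We have
$$ \mathbb{E} \|\mathrm{KRUM}\|^r \le (\tilde n - \tilde f)\, \mathbb{E} \|G\|^r + \sum_{\text{byz } k} \mathbb{E} \|\tilde B_k\|^r \mathbb{1}(i_* = k). $$
Denoting by $C$ a generic constant, when $i_* = k$ we have, for all correct indices $i$,
$$ \Big\| \tilde B_k - \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \tilde V_j \Big\| \le \sqrt{ \frac{1}{\delta_c(k)} \sum_{i \to \text{correct } j} \|\tilde V_i - \tilde V_j\|^2 + \frac{\delta_b(i)}{\delta_c(k)} \|\tilde V_i - \tilde V_{\zeta(i)}\|^2 } $$
$$ \le C \cdot \Big( \sqrt{\frac{1}{\delta_c(k)}} \sum_{i \to \text{correct } j} \|\tilde V_i - \tilde V_j\| + \sqrt{\frac{\delta_b(i)}{\delta_c(k)}} \|\tilde V_i - \tilde V_{\zeta(i)}\| \Big) \le C \cdot \sum_{\text{correct } j} \|\tilde V_j\|. $$
Now
$$ \|\tilde B_k\| \le \Big\| \tilde B_k - \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \tilde V_j \Big\| + \Big\| \frac{1}{\delta_c(k)} \sum_{k \to \text{correct } j} \tilde V_j \Big\| \le C \sum_{\text{correct } j} \|\tilde V_j\|, $$
$$ \|\tilde B_k\|^r \le C \sum_{r_1 + \dots + r_{\tilde n - \tilde f} = r} \|\tilde V_1\|^{r_1} \cdots \|\tilde V_{\tilde n - \tilde f}\|^{r_{\tilde n - \tilde f}}. $$
Taking expectations on both sides and applying the generalized Young's inequality $a_1^{p_1} \cdots a_m^{p_m} \le p_1 a_1 + \dots + p_m a_m$, where $p_1 + \dots + p_m = 1$, we have
$$ \mathbb{E} \|\tilde B_k\|^r \le C \sum_{r_1 + \dots + r_{\tilde n - \tilde f} = r} \Big( \frac{r_1}{r} \mathbb{E} \|\tilde V_1\|^r + \dots + \frac{r_{\tilde n - \tilde f}}{r} \mathbb{E} \|\tilde V_{\tilde n - \tilde f}\|^r \Big) \le C\, \mathbb{E} \|G\|^r $$
where the last inequality comes from (16). Thus we have proven property (ii) of $(\alpha, f)$-Byzantine resilience.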

A.1 ALTERNATIVE CONVERGENCE PROOF OF KRUM WITH RESAMPLING

For completeness we also provide an alternative proof for a slight variation of the Byzantine Resilience definition, under the same algorithm of KRUM with resampling. Definition B ((α, f )-Byzantine Resilience Alternative.). Let 0 ≤ α < π/2 be any angular value, and any integer 0 ≤ f ≤ n. Let V 1 , . . ., V n be any identically distributed random vectors in R d , V i ∼ G, with E G = g. Let B 1 , . . ., B f be any Byzantine vectors in R d , possibly dependent on the V i 's. An aggregation rule F is said to be (α, f )-Byzantine resilient if, for any 1 ≤ j 1 < • • • < j f ≤ n, then F = F (V 1 , . . . , B 1 j1 , . . . , B f j f , . . . V n ) satisfies (i) E F, g ≥ (1 -sin α) • g 2 > 0 and (ii) for r = 2, 3, 4, E F r ≤ E G r . Again, similarly as for the above Theorem II we can obtain almost sure convergence analogous to (Blanchard et al., 2017 , Proposition 2) Theorem III (Resampling KRUM, Alternative). We assume that (i) the cost function L is three times differentiable with continuous derivatives and non-negative L(x) ≥ 0; (ii) the learning rates satisfy t γ t = ∞ and t γ 2 t < ∞; (iii) the gradient estimator satisfies E G(x, ξ) = ∇L(x) and ∀r ∈ {2, 3, 4}, E G(x, ξ) r ≤ A r + B r x r for some constants A r , B r ; (iv) there exists a constant 0 ≤ α < π/2 such that for all x we have η(T, sf ) • √ d • σ(x) ≤ ∇L(x) • sin α; (v) finally, beyond a certain horizon, x 2 ≥ D, there exist ε > 0 and 0 ≤ β < π/2 -α such that ∇L(x) ≥ ε > 0 and x, ∇L(x) ≥ cos β x • ∇L(x) . If s > 1 and 2sf + 3 ≤ n, then • KRUM •Resampling is (α, sf )-Byzantine resilient as in Definition B where 0 ≤ α < π/2 is defined by sin α = η(T,sf )• √ d•σ ∇L(x) , η(n, f ) := 2 n -f + f •(n-f -2)+f 2 •(n-f -1) n-2f -2 • the sequence of gradients ∇L(x t ) converges almost surely to zero. Proof. A key difference between Definition B and (Blanchard et al., 2017 , Definition 1) is that Definition B removes the independence requirement for the input good gradients {V i } n-f i=1 . 
In (Blanchard et al., 2017), there are two propositions: (Blanchard et al., 2017, Proposition 1) proves that KRUM satisfies (Blanchard et al., 2017, Definition 1), and (Blanchard et al., 2017, Proposition 2) shows almost sure convergence of the gradient $\nabla L$. In our case, we only need to show that KRUM ∘ Resampling satisfies Definition B, because the convergence of $\nabla L$ follows exactly as in the proof of (Blanchard et al., 2017, Proposition 2). Since the gradients after resampling are identically distributed by Proposition I, we can keep all of the proof of (Blanchard et al., 2017, Proposition 1) except for the last inequality, where independence is used:

$$\|B_k\|^r \le C \sum_{r_1 + \dots + r_{n-f} = r} \|V_1\|^{r_1} \cdots \|V_{n-f}\|^{r_{n-f}}. \tag{20}$$

By applying expectation to both sides of the inequality and using the independence of $\{V_i\}_{i=1}^{n-f}$, they conclude that

$$\mathbb{E}\|B_k\|^r \le C \sum_{r_1 + \dots + r_{n-f} = r} \mathbb{E}\|V_1\|^{r_1} \cdots \mathbb{E}\|V_{n-f}\|^{r_{n-f}} = C \sum_{r_1 + \dots + r_{n-f} = r} \mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-f}}. \tag{21}$$

Without independence, we can still show that $\mathbb{E}\|B_k\|^r \le C\,\mathbb{E}\|G\|^r$ with the help of a general form of Young's inequality, which we derive from the standard one. Young's inequality states that for $p > 1$, $q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$ and $a, b \ge 0$, we have

$$ab \le \frac{a^p}{p} + \frac{b^q}{q}. \tag{22}$$

Let $a = \|V_1\|^{r_1}$, $b = \|V_2\|^{r_2} \cdots \|V_{n-f}\|^{r_{n-f}}$, $p = \frac{r}{r_1}$, and $q = \frac{r}{r - r_1}$. Young's inequality gives

$$\|V_1\|^{r_1} \cdots \|V_{n-f}\|^{r_{n-f}} \le \frac{r_1}{r}\|V_1\|^{r} + \frac{r - r_1}{r}\,\|V_2\|^{\frac{r r_2}{r - r_1}} \cdots \|V_{n-f}\|^{\frac{r r_{n-f}}{r - r_1}}.$$

We can apply Young's inequality again to the second term on the right-hand side. Let $a = \|V_2\|^{\frac{r r_2}{r - r_1}}$, $b = \|V_3\|^{\frac{r r_3}{r - r_1}} \cdots \|V_{n-f}\|^{\frac{r r_{n-f}}{r - r_1}}$, $p = \frac{r - r_1}{r_2}$, and $q = \frac{r - r_1}{r - r_1 - r_2}$. Then

$$\|V_1\|^{r_1} \cdots \|V_{n-f}\|^{r_{n-f}} \le \frac{r_1}{r}\|V_1\|^{r} + \frac{r_2}{r}\|V_2\|^{r} + \frac{r - r_1 - r_2}{r}\,\|V_3\|^{\frac{r r_3}{r - r_1 - r_2}} \cdots \|V_{n-f}\|^{\frac{r r_{n-f}}{r - r_1 - r_2}} \le \sum_{i=1}^{n-f} \frac{r_i}{r}\|V_i\|^{r},$$

where the second inequality results from recursively applying Young's inequality.

Taking expectations of both sides and using that the $V_i$ are identically distributed with $\sum_i r_i = r$ gives

$$\mathbb{E}\left[\|V_1\|^{r_1} \cdots \|V_{n-f}\|^{r_{n-f}}\right] \le \sum_{i=1}^{n-f} \frac{r_i}{r}\,\mathbb{E}\|V_i\|^{r} = \mathbb{E}\|G\|^{r}.$$

Then applying expectation to Equation (20) gives

$$\mathbb{E}\|B_k\|^r \le C \sum_{r_1 + \dots + r_{n-f} = r} \mathbb{E}\left[\|V_1\|^{r_1} \cdots \|V_{n-f}\|^{r_{n-f}}\right] \le C\,\mathbb{E}\|G\|^r,$$

where $C$ is a generic coefficient whose value may change from one expression to the next.
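The product bound obtained above by recursively applying Young's inequality, $\prod_i \|V_i\|^{r_i} \le \sum_i \frac{r_i}{r}\|V_i\|^r$ with $r = \sum_i r_i$, can be sanity-checked numerically. The following is only a minimal sketch: the helper name `weighted_power_bound` is hypothetical, and the norms are replaced by arbitrary non-negative numbers.

```python
def weighted_power_bound(norms, exponents):
    """Return (lhs, rhs) of  prod_i v_i^{r_i} <= sum_i (r_i / r) * v_i^r.

    `norms` stand in for the values ||V_i|| >= 0 and `exponents` for the
    non-negative integers r_i, with r = sum_i r_i > 0.
    """
    r = sum(exponents)
    lhs = 1.0
    for v, r_i in zip(norms, exponents):
        lhs *= v ** r_i
    rhs = sum((r_i / r) * v ** r for v, r_i in zip(norms, exponents))
    return lhs, rhs


# Example with r = 4 split as 2 + 1 + 1 over three vectors.
lhs, rhs = weighted_power_bound([1.5, 0.5, 2.0], [2, 1, 1])
print(lhs <= rhs)
```

The check does not replace the proof; it only illustrates that the right-hand side depends on each $\|V_i\|$ separately, which is what allows the expectation to be taken without independence.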

B CONVERGENCE OF BYZANTINE RESILIENT SGD

In this section, we obtain a finite-time convergence guarantee for any algorithm which satisfies $(\alpha, f)$-Byzantine resilience. As far as we are aware, this is the first convergence result for KRUM. Suppose that the following condition holds at every iteration $k$ for some constants $\delta \in (0, 1]$ and $\beta \ge 0$:

$$\|\mathbb{E} F - g\|^2 \le (1 - \delta)\|g\|^2, \quad \text{and} \quad \mathbb{E}\|F - \mathbb{E} F\|^2 \le \beta^2, \tag{24}$$

where $F$ is the output of the robust aggregation algorithm and $g := \nabla f(x_k)$. The first condition bounds the bias, whereas the second bounds the variance of $F$.

Convergence.

Theorem IV. Given any biased stochastic estimator $F$ satisfying (24), the following holds for the update $x_{k+1} = x_k - \eta F$ and an $L$-smooth, potentially non-convex $f(x)$ lower-bounded by $f^\star$:

$$\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\|\nabla f(x_k)\|^2 \le \frac{2L(f(x_0) - f^\star)}{\delta K} + \sqrt{\frac{8L\beta^2 (f(x_0) - f^\star)}{\delta^2 K}}.$$

Proof. The following holds for any $\eta \le \frac{1}{L}$:

$$\begin{aligned}
\mathbb{E}_k[f(x_{k+1})] &\le f(x_k) - \eta\,\mathbb{E}\langle g, F\rangle + \frac{\eta^2 L}{2}\mathbb{E}\|F\|^2 \\
&\le f(x_k) - \eta\,\langle g, \mathbb{E} F\rangle + \frac{\eta^2 L}{2}\|\mathbb{E} F\|^2 + \frac{\eta^2 \beta^2 L}{2} \\
&\le f(x_k) - \frac{\eta}{2}\|g\|^2 + \frac{\eta}{2}\|\mathbb{E} F - g\|^2 + \frac{\eta^2 \beta^2 L}{2} \\
&\le f(x_k) - \frac{\eta\delta}{2}\|g\|^2 + \frac{\eta^2 \beta^2 L}{2}.
\end{aligned}$$

The first inequality uses the smoothness of $f$, the second separates out the variance of $F$ and bounds it by $\beta^2$, the third follows from rearranging terms and using $\eta \le \frac{1}{L}$, and the final inequality uses our bound on the bias of $F$. Rearranging the terms and averaging over $k$ gives

$$\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\|g_k\|^2 \le \frac{2(f(x_0) - \mathbb{E} f(x_K))}{\eta\delta K} + \frac{\eta L \beta^2}{\delta} \le \frac{2(f(x_0) - f^\star)}{\eta\delta K} + \frac{\eta L \beta^2}{\delta}.$$

Picking $\eta = \min\left(\frac{1}{L}, \sqrt{\frac{2(f(x_0) - f^\star)}{L K \beta^2}}\right)$ gives the desired rate.

Relation to other conditions. If $F$ is an unbiased stochastic gradient, $\mathbb{E}[F \mid x_k] = g$, with variance $\sigma^2$, then (24) holds with $\delta = 1$ and $\beta = \sigma$. In this case, Thm. IV recovers the standard convergence of SGD for smooth non-convex functions. If instead we have $\langle \mathbb{E} F, g\rangle \ge (1 - \sin(\alpha))\|g\|^2$ and $\mathbb{E}\|F\|^2 \le \|g\|^2 \le G^2$, then

$$\|\mathbb{E}[F] - g\|^2 = \|g\|^2 - 2\langle \mathbb{E}[F], g\rangle + \|\mathbb{E}[F]\|^2 \le \|g\|^2 - 2\langle \mathbb{E}[F], g\rangle + \mathbb{E}\|F\|^2 \le 2\sin(\alpha)\|g\|^2.$$

Thus, in this case (24) holds with $\delta = 1 - 2\sin(\alpha)$ and $\beta = G$.
Here, we only get convergence if $\sin(\alpha) < \frac{1}{2}$, since $\delta = 1 - 2\sin(\alpha)$ must be positive. Further, the assumption that $\mathbb{E}\|F\|^2 \le \|g\|^2 \le G^2$ is extremely strong.

… $\mathbb{E}|S| = T(1 - f/n)^s$.

Remark 4. The averaged gradient $\bar{g}_t$ has the same expectation as in minibatch SGD. We denote the expected number of Byzantine gradients in $\{\bar{g}_t : t \in [T]\}$ by

$$\tilde{f} := T - \mathbb{E}|S| = T\left(1 - (1 - f/n)^s\right).$$

Since the vectors in $S$ are identically distributed, we can apply a robust aggregation rule $A$, such as Multi-KRUM, to $\{\bar{g}_t : t \in [T]\}$. The convergence of Algorithm 1 for Multi-KRUM is stated below.

Theorem V (Resampled KRUM). Assume the dataset is stored in a decentralized manner, and that the other conditions in (Blanchard et al., 2017, Prop. 1 & 2) hold. Let $K$ be the number of aggregations. If $2\tilde{f} + 2 < T$, then with probability $p^K$, KRUM ∘ RSWR is Byzantine robust and the sequence of gradients almost surely converges to zero. Here $p$ is defined as

$$p := \sum_{i=0}^{\lfloor T/2 \rfloor} q^{\lceil T/2 \rceil + i}(1 - q)^{\lfloor T/2 \rfloor - i}, \quad \text{where } q = (1 - f/n)^s.$$

Proof. The probability that $\bar{g}_t$ is good is $q = (1 - f/n)^s$. The probability of a non-faulty majority after RSWR is thus

$$p := \sum_{i=0}^{\lfloor T/2 \rfloor} q^{\lceil T/2 \rceil + i}(1 - q)^{\lfloor T/2 \rfloor - i}.$$

The outputs of RSWR are identically distributed by Lemma 3. Thus the robust aggregation rule converges with probability $p^K$.

Remark 5. Consider the admissible range of $T$, $s$, $f$. Since Theorem V requires $2\tilde{f} + 2 < T$, we need $s < \frac{\log(1/2 + 1/T)}{\log(1 - f/n)}$. Furthermore, since $s \ge 1$, this bound forces $T > \frac{2n}{n - 2f}$, which only requires $2f < n$.

Remark 6. Notice that the cardinality $|S|$ is stochastic, which means it is possible to have a faulty majority in any round. If the Byzantine gradients are very large, as in the Gaussian attack, the model diverges as soon as one Byzantine gradient is selected. On the other hand, in the experiment section we show that for many attacks (label flipping, bit flipping, etc.) the error introduced by faulty-majority rounds is amortized over time. To mitigate this issue, the server can normalize all gradients by their norm so that a faulty-majority round does not lead to catastrophic consequences.
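To make the constraint in Remark 5 concrete, the sketch below evaluates the probability $(1 - f/n)^s$ that an averaged gradient contains no Byzantine input, the expected number $T(1 - (1 - f/n)^s)$ of contaminated averages, and the largest integer $s$ for which twice that expected number plus 2 stays below $T$. The function names are hypothetical and the code only restates the formulas of this section; it is not part of the authors' implementation.

```python
import math


def good_probability(n, f, s):
    """Chance (1 - f/n)^s that all s gradients drawn with replacement are good."""
    return (1.0 - f / n) ** s


def expected_byzantine(n, f, s, T):
    """Expected number of the T averages containing a Byzantine gradient."""
    return T * (1.0 - good_probability(n, f, s))


def max_resampling_size(n, f, T):
    """Largest integer s >= 1 with 2 * expected_byzantine + 2 < T, else None."""
    if not 0 < f < n:
        raise ValueError("this sketch expects 0 < f < n")
    s = 0
    # The expected Byzantine count grows with s, so the loop terminates.
    while 2 * expected_byzantine(n, f, s + 1, T) + 2 < T:
        s += 1
    return s if s >= 1 else None


# Example: n = 12 workers, f = 2 Byzantine, T = 12 resampled averages.
n, f, T = 12, 2, 12
closed_form = math.log(0.5 + 1.0 / T) / math.log(1.0 - f / n)
print(max_resampling_size(n, f, T), closed_form)
```

For these illustrative values the closed-form bound lies between 2 and 3, so the search returns $s = 2$, in agreement with Remark 5.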

D.2 CIFAR-10

We run experiments on CIFAR-10 (Krizhevsky et al., 2009) with ResNet-20 (He et al., 2015). We train our model on 12 nodes, which include 2 Byzantine attackers using the mimic attack. We use KRUM as the aggregation rule for demonstration. We choose a learning rate of 0.1 and a batch size of 120 per node. The CIFAR-10 dataset is split across good nodes such that 50% of the samples on each good node follow the overall distribution, while the remaining 50% are drawn from a single class that differs between good workers. We present our results in Figure 5. Note that the accuracy in this example is lower than in the normal setting because KRUM may be biased towards certain nodes. In addition, batch normalization may be influenced by the heterogeneous distribution of data.
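The heterogeneous split described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' data loader: the function name is hypothetical, each good worker is assigned one distinct class for its single-class half, and the other half is drawn from the shuffled remainder, which matches the overall distribution only approximately (exactly so when classes are balanced and there is one worker per class).

```python
import random


def split_half_iid_half_single_class(labels, num_workers, seed=0):
    """Return one list of sample indices per good worker (illustrative)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    if num_workers > len(classes):
        raise ValueError("sketch assumes one distinct class per worker")
    per_worker = len(labels) // num_workers
    n_single = per_worker // 2          # half from the worker's own class
    n_iid = per_worker - n_single       # half from the shuffled remainder
    taken = set()
    single_parts = []
    for w in range(num_workers):
        candidates = [i for i, y in enumerate(labels) if y == classes[w]]
        rng.shuffle(candidates)
        chosen = candidates[:n_single]
        taken.update(chosen)
        single_parts.append(chosen)
    pool = [i for i in range(len(labels)) if i not in taken]
    rng.shuffle(pool)
    return [single_parts[w] + pool[w * n_iid:(w + 1) * n_iid]
            for w in range(num_workers)]


# Toy usage: 10 balanced classes, 10 good workers.
toy_labels = [c for c in range(10) for _ in range(100)]
parts = split_half_iid_half_single_class(toy_labels, num_workers=10)
print(len(parts), len(parts[0]))
```

Drawing the single-class half first keeps every worker's shard the same size; other orderings are equally plausible readings of the description.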

D.3 RESAMPLING OR FIXED GROUPING

In (Chen et al., 2017), workers are grouped at the beginning of training, and they are trained on i.i.d. datasets. In contrast, resampling is performed every round and applies to non-iid datasets. If a Byzantine worker can predict the random bits on the server, resampling becomes grouping in each round, which is still stronger than (Chen et al., 2017). In Figure 7, we compare KRUM ∘ Resampling with vanilla KRUM and KRUM with fixed grouping. As expected, fixed grouping achieves better accuracy than vanilla KRUM but worse accuracy than resampling.

D.4 RESAMPLING HYPERPARAMETER

The resampling hyperparameter s controls the variance reduction, as stated in Proposition I. In Figure 8, we compare the performance of no resampling and resampling with s = 2, 3, 4 on the heterogeneous MNIST dataset. There are 10 workers in total and no Byzantine workers. Each experiment is run 5 times. As we can see from Figure 8, a higher s leads to faster convergence. This matches Proposition I, which states that a higher s leads to greater variance reduction.
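The effect of s on the spread of the resampled gradients can be illustrated with a small simulation. The sketch below makes several assumptions that are not taken from the paper: `resample` and `spread` are hypothetical helper names, each of the T outputs averages s distinct gradients drawn uniformly at random, and the worker gradients are one-dimensional toy values.

```python
import random


def resample(gradients, s, T, rng):
    """Return T vectors, each the average of s distinct input gradients."""
    d = len(gradients[0])
    out = []
    for _ in range(T):
        picked = rng.sample(gradients, s)
        out.append([sum(g[k] for g in picked) / s for k in range(d)])
    return out


def spread(vectors):
    """Mean squared distance of the vectors to their average."""
    d = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]
    return sum(sum((v[k] - mean[k]) ** 2 for k in range(d))
               for v in vectors) / len(vectors)


rng = random.Random(0)
# Toy heterogeneous "gradients" of 10 workers with no Byzantine workers.
grads = [[float(i)] for i in range(10)]
for s in (1, 2, 3, 4):
    print(s, spread(resample(grads, s, T=2000, rng=rng)))
```

In this toy setting the printed spread shrinks as s grows, which is the qualitative behaviour Figure 8 attributes to a larger s; the exact numbers depend on the sampling assumption made here.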



Left & middle: Comparing arithmetic mean with KRUM on iid and non-iid datasets, without any Byzantine workers. Right: Histogram of selected gradients. Comparing normalized mean (RFA with T=1) under the normalized mean attack with f = 0, 1, 2 attackers. Comparing coordinate-wise median (CM) and geometric median (RFA with T=8) under the mimic2 attack on iid and non-iid datasets.

Figure 1: Failures of existing aggregation rules on the non-iid MNIST dataset. In all experiments, there are 8 good and f Byzantine workers.

Left & middle: Comparing arithmetic mean with KRUM on iid and non-iid datasets, without any Byzantine workers. Right: Histogram of selected gradients. Comparing normalized mean (RFA with T=1) under the normalized mean attack with f = 0, 1, 2 attackers. Comparing coordinate-wise median (CM) and geometric median (RFA with T=8) under mimic2 attack on iid and non-iid datasets.

Figure 2: Combining resampling with existing aggregation rules on the non-iid MNIST dataset. In all experiments, there are 8 good and f Byzantine workers. For each aggregation, we resample and average s gradients T = n times.

Figure 3: Test accuracies of KRUM, CM, and RFA under 5 kinds of attacks (and without attack) on non-iid datasets. There are 10 workers, 2 of which are Byzantine and follow the attack of the corresponding row. Columns show each aggregation rule applied without (red) and with (blue) resampling. Dashed lines show, for comparison, the same method without any Byzantine workers (f = 0). For RFA, T1 and T8 refer to the number of inner iterations of Weiszfeld's algorithm.

Figure 5: Comparing KRUM and KRUM ∘ Resampling for training ResNet-20 on the CIFAR-10 dataset. There are 10 good workers and 2 Byzantine workers.

Figure 6: Comparing 3 aggregation rules under 5 kinds of attacks on non-iid datasets. There are 10 workers and 2 of them are Byzantine. In the grid of experiments, the same aggregation rule is used within each column and the same attack is applied within each row. The aggregation rules are KRUM (Blanchard et al., 2017), CM (Yin et al., 2018a), and RFA (Pillutla et al., 2019). RFA-T1, T3, and T8 refer to the number of inner iterations.

Figure 7: Comparison of resampling with no resampling and with fixed grouping for KRUM on non-iid datasets.

Figure 8: Comparing no resampling with resampling for s = 2, 3, 4 on MNIST data. There are 10 workers and no Byzantine workers.

… $(1 - f/n)$ chance that a sampled gradient $g_{j_i}$ is good and $(1 - f/n)^s$ chance that all of the $s$ sampled gradients are good. Repeating the resampling $T$ times gives $\mathbb{E}|S| = T(1 - f/n)^s$.

D ADDITIONAL EXPERIMENTS

D.1 FASHION-MNIST

In this subsection, we demonstrate that our algorithm also works on a modern dataset such as Fashion-MNIST (Xiao et al., 2017). Since Fashion-MNIST is designed to be a drop-in replacement for MNIST, we conduct experiments on Fashion-MNIST with the same setup as in Figure 3. The results are presented in Figure 4.

