BYZANTINE-ROBUST DECENTRALIZED LEARNING VIA CLIPPEDGOSSIP

Abstract

In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning, where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus and benefit from collaborative training. To address these issues, we propose CLIPPEDGOSSIP, an algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to an $O(\delta_{\max}\zeta^2/\gamma^2)$ neighborhood of a stationary point for non-convex objectives under standard assumptions. Finally, we demonstrate the encouraging empirical performance of CLIPPEDGOSSIP under a large number of attacks.

1. INTRODUCTION

"Divide et impera". Distributed training arises as an important topic due to privacy constraints of decentralized data storage (McMahan et al., 2017; Kairouz et al., 2019). As the server-worker paradigm suffers from a single point of failure, there is a growing body of work on training in the absence of a server (Lian et al., 2017; Nedic, 2020; Koloskova et al., 2020b). We are particularly interested in decentralized scenarios where direct communication may be unavailable due to physical constraints. For example, devices in a sensor network can only communicate with devices within short physical distances.
Failures, whether from malfunctioning or even malicious participants, are ubiquitous in all kinds of distributed computing. A Byzantine adversarial worker can deviate from the prescribed algorithm, send arbitrary messages, and is assumed to have knowledge of the whole system (Lamport et al., 2019). This means that Byzantine workers not only collude, but also know the data, algorithm, and models of all regular workers. However, they cannot directly modify the states on regular workers, nor compromise messages sent between two connected regular workers.
Defending against Byzantine attacks in a communication-constrained graph is challenging. As secure broadcast protocols are no longer available (Pease et al., 1980; Dolev & Strong, 1983; Hirt & Raykov, 2014), regular workers can only utilize information from their own neighbors, who have heterogeneous data distributions or are malicious, making it very difficult to reach global consensus. While some works attempt to solve this problem (Su & Vaidya, 2016a; Sundaram & Gharesifard, 2018), their strategies suffer from serious drawbacks: 1) they require regular workers to be very densely connected; 2) they only show asymptotic convergence or give no convergence proof; 3) there is no evidence that their algorithms are better than training alone.
In this work, we study Byzantine-robust decentralized training in a constrained topology and address the aforementioned issues. The main contributions of our paper are summarized as follows:
• We identify a novel network robustness criterion, characterized in terms of the spectral gap of the topology ($\gamma$) and the number of attackers ($\delta$), for consensus and decentralized training, applying to a much broader spectrum of graphs than (Su & Vaidya, 2016a; Sundaram & Gharesifard, 2018).
• We propose CLIPPEDGOSSIP as the defense strategy and provide, for the first time, precise rates of robust convergence to an $O(\delta_{\max}\zeta^2/\gamma^2)$ neighborhood of a stationary point for stochastic objectives under standard assumptions. We also empirically demonstrate the advantages of CLIPPEDGOSSIP over previous works.
• Along the way, we also obtain the fastest convergence rates for standard non-robust (Byzantine-free) decentralized stochastic non-convex optimization by using local worker momentum.

2. RELATED WORK

Recently there have been extensive works on Byzantine-resilient distributed learning with a trustworthy server. The statistics-based robust aggregation methods cover a wide spectrum of works including median (Chen et al., 2017; Blanchard et al., 2017; Yin et al., 2018; Mhamdi et al., 2018; Xie et al., 2018; Yin et al., 2019), geometric median (Pillutla et al., 2019), signSGD (Bernstein et al., 2019; Li et al., 2019; yong Sohn et al., 2020), clipping (Karimireddy et al., 2021a; b), and concentration filtering (Alistarh et al., 2018; Allen-Zhu et al., 2020; Data & Diggavi, 2021). Other works explore special settings where the server owns the entire training dataset (Xie et al., 2020a; Regatti et al., 2020; Su & Vaidya, 2016b; Chen et al., 2018; Rajput et al., 2019; Gupta et al., 2021). The state-of-the-art attacks take advantage of the variance of good gradients and accumulate bias over time (Baruch et al., 2019; Xie et al., 2019). A few strategies have been proposed to provably defend against such attacks, including momentum (Karimireddy et al., 2021a; El Mhamdi et al., 2021) and concentration filtering (Allen-Zhu et al., 2021).
Decentralized machine learning has been extensively studied in the past few years (Lian et al., 2017; Koloskova et al., 2020b; Li et al., 2021; Ying et al., 2021b; Lin et al., 2021; Kong et al., 2021; Yuan et al., 2021; Kovalev et al., 2021). The state-of-the-art convergence rate, established in (Koloskova et al., 2020b), is $O\big(\frac{\sigma^2}{n\varepsilon^2} + \frac{\sigma}{\sqrt{\gamma}\varepsilon^{3/2}}\big)$, where the leading term $\frac{\sigma^2}{n\varepsilon^2}$ is optimal. In this paper we improve this rate to $O\big(\frac{\sigma^2}{n\varepsilon^2} + \frac{\sigma^{2/3}}{\gamma^{2/3}\varepsilon^{4/3}}\big)$ using local momentum.
Decentralized machine learning with certified Byzantine-robustness is less studied. When the communication is unconstrained, there exist secure broadcast protocols that guarantee all regular workers have identical copies of each other's update (Gorbunov et al., 2021; El-Mhamdi et al., 2021).
We are interested in a more challenging scenario where not all workers have direct communication links. In this case, regular workers may behave very differently depending on their neighbors in the topology. One line of work constructs a Public-Key Infrastructure (PKI) so that the message from each worker can be authenticated using digital signatures. However, this is very inefficient, requiring quadratic communication (Abraham et al., 2020). Further, it also requires every worker to have a globally unique identifier which is known to every other worker. This assumption is rendered impossible on general communication graphs, motivating our work to explicitly address the graph topology in decentralized training. Sybil attacks are an important orthogonal issue where a single Byzantine node can create innumerable "fake nodes" overwhelming the network (cf. the recent overview by Ford (2021)). Truly decentralized solutions to this are challenging and sometimes rely on heavy machinery, e.g. blockchains (Poupko et al., 2021) or Proof-of-Personhood (Borge et al., 2017).
More related to the approaches we study, Su & Vaidya (2016a); Sundaram & Gharesifard (2018); Yang & Bajwa (2019b; a) use the trimmed mean at each worker to aggregate the models of its neighbors. This approach only works when all regular workers have an honest majority among their neighbors and are densely connected. Guo et al. (2021) evaluate the incoming models of a good worker with its local samples and only keep the well-performing models for its local update step. However, this method only works for IID data. Peng & Ling (2020) reformulate the original problem by adding TV-regularization and propose a GossipSGD type algorithm which works for strongly convex and non-IID objectives. However, its convergence guarantees are inferior to non-parallel SGD. In this work, we address all of the above issues and are able to provably relate the communication graph (spectral gap) with the fraction of Byzantine workers.
In addition, most works do not consider attacks that exploit the communication topology, except Peng & Ling (2020), who propose the zero-sum attack. We defer detailed comparisons and more related works to § F.

3.1. DECENTRALIZED THREAT MODEL

Consider an undirected graph $G = (V, E)$ where $V = \{1, \dots, n\}$ denotes the set of workers and $E$ denotes the set of edges. Let $\mathcal{N}_i \subset V$ be the neighbors of node $i$ and $\overline{\mathcal{N}}_i := \mathcal{N}_i \cup \{i\}$. In addition, we assume there are no self-loops and the system is synchronous. Let $V_B \subset V$ be the set of Byzantine workers with $b = |V_B|$, and let the set of regular (non-Byzantine) workers be $V_R := V \setminus V_B$. Let $G_R$ be the subgraph of $G$ induced by the regular nodes $V_R$, i.e. the graph obtained by removing all Byzantine nodes and their associated edges. If the reduced graph $G_R$ is disconnected, then there exist two regular workers who cannot reliably exchange information. In this setting, training on the combined data of all the good workers is impossible. Hence, we make the following necessary assumption.
(A1) Connectivity. $G_R$ is connected.
Remark 1. In contrast, Su & Vaidya (2016a); Sundaram & Gharesifard (2018) impose a much stronger assumption that the subgraph $G_R$ of the regular workers remains connected even after additionally removing any $|V_B|$ edges. For example, the graph in Fig. 1 with 1 Byzantine worker $V_1$ satisfies (A1) but does not satisfy their assumption, as removing an additional edge at $A_1$ or $B_1$ may discard the graph cut.
In decentralized learning, each regular worker $i \in V_R$ locally stores a vector $\{W_{ij}\}_{j=1}^{n}$ of mixing weights that determine how to aggregate model updates received from neighbors. We make the following assumption on the weight vectors.
(A2) Mixing weights. The weight vectors on regular workers satisfy the following properties:
• Each regular worker $i \in V_R$ stores non-negative $\{W_{ij}\}_{j=1}^{n}$ with $W_{ij} > 0$ iff $j \in \overline{\mathcal{N}}_i$;
• The adjacent weights to each regular worker $i \in V_R$ sum up to 1, i.e. $\sum_{j=1}^{n} W_{ij} = 1$;
• For $i, j \in V_R$, $W_{ij} = W_{ji}$.
We can construct such weights even in the presence of Byzantine workers, using algorithms that only rely on communication with local neighbors, e.g. Metropolis-Hastings (Hastings, 1970).
We defer details of the construction to § C.2. Note that the Byzantine workers $V_B$ might also obtain such weights; however, they can use arbitrary different weights during training. We define $\delta_i := \sum_{j \in V_B} W_{ij}$ to be the total weight of adjacent Byzantine edges around a regular worker $i$, and define the maximum Byzantine weight as $\delta_{\max} := \max_{i \in V_R} \delta_i$.
Remark 2. In the decentralized setting, the total fraction of Byzantine nodes $|V_B|/n$ is irrelevant. Instead, what matters is the fraction of the edge weights they control which are adjacent to regular nodes (as defined by $\delta_i$ and $\delta_{\max}$). This is because a Byzantine worker can send different messages along each edge. Thus, a single Byzantine worker connected to all other workers with large edge weights can have a large influence on all the other workers. Similarly, a potentially very large number of Byzantine workers may overall have very little effect if the edges they control towards good nodes have little weight. When we have a uniform fully connected graph (such as in the centralized setting), the two notions of bad nodes and bad edges become equivalent.
To facilitate our analysis of the convergence rate, we define a hypothetical mixing matrix $\widetilde{W} \in \mathbb{R}^{(n-b) \times (n-b)}$ for the subgraph $G_R$ of regular workers, with entry $i, j \in V_R$ defined as
$\widetilde{W}_{ij} = \begin{cases} W_{ij} & \text{if } i \neq j, \\ W_{ii} + \delta_i & \text{if } i = j. \end{cases}$ (1)
By the construction of this hypothetical matrix $\widetilde{W}$, the following property directly follows.
Lemma 3. Given (A2), $\widetilde{W}$ is symmetric and doubly stochastic, i.e. $\widetilde{W}_{ij} = \widetilde{W}_{ji}$, $\sum_{i=1}^{n-b} \widetilde{W}_{ij} = 1$, $\sum_{j=1}^{n-b} \widetilde{W}_{ij} = 1$ for all $i, j \in [n-b]$.
Further, the spectral gap of the matrix $\widetilde{W}$ is positive.
Lemma 4. By (A1) and (A2), there exists $\gamma \in (0, 1]$ such that for all $x \in \mathbb{R}^{n-b}$ and $\bar{x} := \frac{\mathbf{1}^\top x}{n-b}\mathbf{1} \in \mathbb{R}^{n-b}$,
$\|\widetilde{W}x - \bar{x}\|_2 \le (1 - \gamma)\|x - \bar{x}\|_2$. (2)
Here $\gamma(\widetilde{W})$ is the spectral gap of the subgraph of regular workers $G_R$. We have $\gamma = 0$ if and only if $G_R$ is disconnected, and $\gamma = 1$ if and only if $G_R$ is fully connected.
In summary, $\gamma$ measures the connectivity of the regular subgraph $G_R$ formed after removing the Byzantine nodes, whereas $\delta_i$ and $\delta_{\max}$ measure the influence of the Byzantine nodes.
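To make the quantities of this section concrete, the following minimal Python sketch builds mixing weights on a small graph, computes $\delta_i$, forms the hypothetical matrix of Eq. (1), and estimates $\gamma$. It rests on assumptions that are not taken from the paper: the weights use the common Metropolis rule $W_{ij} = 1/(1 + \max(d_i, d_j))$ (the construction in § C.2 may differ), the spectral gap is estimated by power iteration, and all function names are hypothetical.

```python
def metropolis_weights(n, edges):
    """Assumed Metropolis rule: W_ij = 1 / (1 + max(deg_i, deg_j)) on edges."""
    nbrs = [set() for _ in range(n)]
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in nbrs[i]:
            W[i][j] = 1.0 / (1 + max(len(nbrs[i]), len(nbrs[j])))
        W[i][i] = 1.0 - sum(W[i])  # self-weight makes the row sum to 1
    return W


def regular_mixing_matrix(W, byzantine):
    """Return (W_tilde, delta): Eq. (1) restricted to regular workers."""
    regular = [i for i in range(len(W)) if i not in byzantine]
    delta = {i: sum(W[i][j] for j in byzantine) for i in regular}
    W_tilde = [
        [W[i][j] + (delta[i] if i == j else 0.0) for j in regular]
        for i in regular
    ]
    return W_tilde, delta


def spectral_gap(W_tilde, iters=1000):
    """Estimate gamma = 1 - |lambda_2| by power iteration on mean-free vectors."""
    m = len(W_tilde)
    x = [float(k) for k in range(m)]
    lam = 0.0
    for _ in range(iters):
        mean = sum(x) / m
        x = [v - mean for v in x]            # project out the consensus direction
        norm = sum(v * v for v in x) ** 0.5
        if norm == 0.0:
            return 1.0                       # one mixing step reaches consensus
        x = [v / norm for v in x]
        x = [sum(W_tilde[r][c] * x[c] for c in range(m)) for r in range(m)]
        lam = sum(v * v for v in x) ** 0.5
    return 1.0 - lam


# Toy example: regular path 0 - 1 - 2, Byzantine node 3 attached to node 0.
W = metropolis_weights(4, [(0, 1), (1, 2), (0, 3)])
W_tilde, delta = regular_mixing_matrix(W, {3})
delta_max = max(delta.values())
gamma = spectral_gap(W_tilde)
```

In this toy graph only node 0 has a Byzantine neighbor, so $\delta_{\max} = \delta_0$, and the rows of the hypothetical matrix still sum to one as Lemma 3 states.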

3.2. OPTIMIZATION ASSUMPTIONS

We study the general distributed optimization problem
$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{|V_R|} \sum_{i \in V_R} \big\{ f_i(x) := \mathbb{E}_{\xi_i \sim D_i} F_i(x; \xi_i) \big\}$
on heterogeneous (non-IID) data, where $f_i$ is the local objective on worker $i$ with data distribution $D_i$ and independent noise $\xi_i$. We assume that the gradients computed over these data distributions satisfy the following standard properties.
(A3) Bounded noise and heterogeneity. Assume that for all $i \in V_R$ and $x \in \mathbb{R}^d$, we have
$\mathbb{E}_{\xi \sim D_i} \|\nabla F_i(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2$, $\mathbb{E}_{j \sim V_R} \|\nabla f_j(x) - \nabla f(x)\|^2 \le \zeta^2$. (4)
(A4) L-smoothness. For $i \in V_R$, $f_i : \mathbb{R}^d \to \mathbb{R}$ is differentiable and there exists a constant $L \ge 0$ such that for each $x, y \in \mathbb{R}^d$:
$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$. (5)
We denote by $x_i^t \in \mathbb{R}^d$ the state of worker $i \in V_R$ at time $t$.

4. ROBUST DECENTRALIZED CONSENSUS

Agreeing on one value (consensus) among regular workers is one of the fundamental questions in distributed computing. Gossip averaging is a common consensus algorithm in the Byzantine-free case ($\delta = 0$). Applying gossip averaging steps iteratively to all nodes can be written as
$x_i^{t+1} := \sum_{j=1}^{n} W_{ij} x_j^t$, $t = 0, 1, \dots$ (GOSSIP)
Suppose each worker $i \in [n]$ initially owns a different $x_i^0$ and (A1) and (A2) hold. Then each worker's iterate $x_i^t$ asymptotically converges to $x_i^\infty = \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j^0$ for all $i \in [n]$, which is also known as average consensus (Boyd et al., 2006). Reaching consensus in the presence of Byzantine workers is more challenging, with a long history of study (LeBlanc et al., 2013; Su & Vaidya, 2016a).
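As a small illustration of (GOSSIP) in the Byzantine-free case, the sketch below repeatedly applies one gossip step on a three-node path. The mixing matrix and initial values are hypothetical toy choices, and `gossip_step` is an illustrative helper rather than code from the paper.

```python
def gossip_step(W, xs):
    """One (GOSSIP) step: x_i <- sum_j W_ij * x_j for every worker i."""
    n, d = len(xs), len(xs[0])
    return [
        [sum(W[i][j] * xs[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


# Symmetric, doubly stochastic weights on the path 0 - 1 - 2 (toy values).
W = [
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1 / 3, 2 / 3],
]
xs = [[0.0], [3.0], [6.0]]  # initial values x_i^0, with average 3
for _ in range(200):
    xs = gossip_step(W, xs)
```

Because the weights are doubly stochastic, each step preserves the average of the iterates while shrinking their spread, so all three workers approach the initial average.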

4.1. THE CLIPPED GOSSIP ALGORITHM

We introduce a novel decentralized gossip-based aggregator, termed CLIPPEDGOSSIP, for Byzantine-robust consensus. CLIPPEDGOSSIP uses its local reference model as the center and clips all received neighbor model weights. Formally, for $\mathrm{CLIP}(z, \tau) := \min(1, \tau / \|z\|) \cdot z$, we define for node $i$
$x_i^{t+1} := \sum_{j=1}^{n} W_{ij} \big( x_i^t + \mathrm{CLIP}(x_j^t - x_i^t, \tau_i) \big)$, $t = 0, 1, \dots$ (CLIPPEDGOSSIP)
Theorem I. Let $\bar{x}^t := \frac{1}{|V_R|} \sum_{i \in V_R} x_i^t$ be the average iterate over the unknown set of regular nodes. If the initial consensus distance is bounded as $\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^t - \bar{x}^t\|^2 \le \rho^2$, then for all $i \in V_R$, the output $x_i^{t+1}$ of CLIPPEDGOSSIP with an appropriate choice of clipping radius satisfies
$\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^{t+1} - \bar{x}^t\|^2 \le \big(1 - \gamma + c\sqrt{\delta_{\max}}\big)^2 \rho^2$ and $\mathbb{E} \|\bar{x}^{t+1} - \bar{x}^t\|^2 \le c^2 \delta_{\max} \rho^2$,
where the expectation is over the random variable $\{x_i^t\}_{i \in V_R}$ and $c > 0$ is a constant.
We inspect Theorem I on corner cases. If regular workers have already reached consensus before aggregation ($\rho = 0$), then Theorem I shows that we retain consensus even in the face of Byzantine agents. In this case, we can use a simple majority, which corresponds to setting the clipping threshold $\tau_i = 0$. Further, if there is no Byzantine worker ($\delta_{\max} = 0$), then the robust aggregator must improve the consensus distance by a factor of $(1 - \gamma)^2$, which matches standard gossiping analysis (Boyd et al., 2006). Finally, for the complete graph ($\gamma = 1$), CLIPPEDGOSSIP satisfies the centralized notion of a $(\delta_{\max}, c^2)$-robust aggregator in (Karimireddy et al., 2021a, Definition C). Thus, CLIPPEDGOSSIP recovers all past optimal aggregation methods as special cases.
Note that if the topology is poorly connected and there are Byzantine attackers with $\gamma < c\sqrt{\delta_{\max}}$, then Theorem I gives no guarantee that the consensus distance will reduce after aggregation. This is unfortunately not possible to improve upon, as we will show in § 4.2: if the connectivity is poor, then the effect of Byzantine workers can be significantly amplified.
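The update (CLIPPEDGOSSIP) can be sketched in a few lines of Python. This is a minimal illustration under assumptions: vectors are plain lists, the weights of node $i$ are passed as a dictionary over its closed neighborhood, and the names `clip` and `clipped_gossip` are hypothetical. Since the weights of node $i$ sum to 1, the update equals $x_i^t$ plus the weighted sum of clipped differences.

```python
import math


def clip(z, tau):
    """CLIP(z, tau) = min(1, tau / ||z||) * z; vectors with ||z|| <= tau pass unchanged."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= tau:
        return list(z)
    return [(tau / norm) * v for v in z]


def clipped_gossip(i, weights, models, tau):
    """One CLIPPEDGOSSIP step for node i.

    weights: dict j -> W_ij over node i and its neighbors (assumed to sum to 1).
    models:  dict j -> model vector received from j (models[i] is node i's own).
    """
    x_i = models[i]
    out = list(x_i)
    for j, w in weights.items():
        if j == i:
            continue  # CLIP(x_i - x_i, tau) = 0 contributes nothing
        step = clip([a - b for a, b in zip(models[j], x_i)], tau)
        out = [o + w * s for o, s in zip(out, step)]
    return out


# Node 0 hears from a regular neighbor (1) and a far-away Byzantine neighbor (2).
models = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [1000.0, 0.0]}
weights = {0: 0.5, 1: 0.25, 2: 0.25}
robust = clipped_gossip(0, weights, models, tau=2.0)
plain = clipped_gossip(0, weights, models, tau=float("inf"))  # no clipping
```

With a finite radius, a neighbor with weight $W_{ij}$ can move node $i$ by at most $W_{ij}\tau_i$ in one step, whereas an infinite radius recovers plain gossip averaging and lets the outlier dominate.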

4.2. LOWER BOUNDS DUE TO COMMUNICATION CONSTRAINTS

Not all pairs of workers have direct communication links due to constraints such as physical distances in a sensor network. It is common that a subset of sensors are clustered within a small physical space while only a few of them have communication links to the rest of the sensors. Such links form a cut-set of the communication topology and are crucial for information diffusion. On the other hand, attackers can increase consensus errors in the presence of these critical links.
Theorem II. Consider networks of $n$ nodes satisfying (A1), each node holding a number in $\{0, 1\}$, where only $O(1/n^2)$ of the edges are adjacent to attackers. For any robust consensus algorithm $\mathcal{A}$, there exists a network such that the output of $\mathcal{A}$ has an average consensus error of at least $\Omega(1)$. Further, the performance of CLIPPEDGOSSIP is best explained by the magnitude of $\delta/\gamma^2$: it is excellent when the ratio is less than a threshold and degrades as the ratio increases.
Proof. Consider two cliques A and B with $n$ nodes each, connected by an edge to each other and to a Byzantine node $V_2$, cf. Fig. 1. Suppose that we know all nodes have values in $\{0, 1\}$. Let all nodes in A have value 0. Now consider two settings:
World 1. All B nodes have value 0. However, Byzantine node $V_2$ pretends to be part of a clique identical to B which it simulates, except that all nodes have value 1. The true consensus average is 0.
World 2. All B nodes have value 1. This time the Byzantine node $V_2$ simulates clique B with value 0. The true consensus average here is 0.5.
From the perspective of clique A, the two worlds are identical: it seems to be connected to one clique with value 0 and another with value 1. Thus, it must make $\Omega(1)$ error in at least one of the worlds. This proves that consensus is impossible in this setting.
While the arguments above are similar to classical lower bounds in decentralized consensus which show that we need $\delta \le 1/3$ (Fischer et al., 1986), in our case there is only 1 Byzantine node (out of $2n + 1$ nodes in total) which controls only 2 edges, i.e. $\delta = O(1/n^2)$. This impossibility result thus drives home the additional impact of the restricted communication topology. Further, past impossibility results about robust decentralized consensus such as (Sundaram & Gharesifard, 2018; Su & Vaidya, 2016a) use combinatorial concepts such as the number of node-disjoint paths between the good nodes. However, such notions cannot account for the edge weights easily and cannot give finite-time convergence guarantees. Instead, our theory shows that the ratio $\delta_{\max}/\gamma^2$ accurately captures the difficulty of the problem. We next verify this empirically. In Fig. 3, we show the final consensus error of three defenses under Byzantine attacks. TM and MEDIAN have a large error even for small $\delta_{\max}$ and large $\gamma$. The consensus error of CLIPPEDGOSSIP increases almost linearly with $\delta_{\max}/\gamma^2$. However, this phenomenon is not observed by looking at $\gamma^{-2}$ or $\delta_{\max}$ alone, validating our theoretical analysis in Theorem I. Details are deferred to § D.1.

Algorithm 1 Byzantine-Resilient Decentralized Optimization with CLIPPEDGOSSIP
Input: $x^0 \in \mathbb{R}^d$, $\alpha$, $\eta$, $\{\tau_i^t\}$, $m_i^0 = g_i(x^0)$
1: for $t = 0, 1, \dots$ do
2:   for $i = 1, \dots, n$ in parallel
3:     $m_i^{t+1} = (1 - \alpha) m_i^t + \alpha g_i(x_i^t)$
4:     $x_i^{t+1/2} = x_i^t - \eta m_i^{t+1}$ if $i \in V_R$ else $*$
5:     Exchange $x_i^{t+1/2}$ with $\mathcal{N}_i$
6:     $x_i^{t+1} = \mathrm{CLIPPEDGOSSIP}_i(x_1^{t+1/2}, \dots, x_n^{t+1/2}; \tau_i^{t+1})$
7: end for

Table 1: Comparison with prior work of convergence rates for non-convex objectives to an $O(\delta\zeta^2)$-neighborhood of stationary points. We recover comparable or improved rates as special cases.
Reference | Setting | Convergence to $\varepsilon$-accuracy
Regular ($\delta = 0$), decentralized:
Koloskova et al. (2020b) | - | $O\big(\frac{\sigma^2}{n\varepsilon^2} + \frac{\zeta}{\gamma\varepsilon^{3/2}} + \frac{\sigma}{\sqrt{\gamma}\varepsilon^{3/2}} + \frac{1}{\gamma\varepsilon}\big)$
This work | $\delta = 0$ | $O\big(\frac{\sigma^2}{n\varepsilon^2} + \frac{\zeta}{\gamma\varepsilon^{3/2}} + \frac{\sigma^{2/3}}{\gamma^{2/3}\varepsilon^{4/3}} + \frac{1}{\gamma\varepsilon}\big)$
Byzantine-robust, fully connected ($\gamma = 1$), IID ($\zeta = 0$):
Guo et al. (2021) | - | -
Gorbunov et al. (2021) | $\delta$ known | $O\big(\frac{\sigma^2}{n\varepsilon^2} + \frac{n\delta\sigma^2}{m\varepsilon} + \frac{1}{\varepsilon}\big)$ †
Gorbunov et al. (2021) | $\delta$ unknown | $O\big(\frac{\sigma^2}{n\varepsilon^2} + \frac{n^2\delta\sigma^2}{m\varepsilon} + \frac{1}{\varepsilon}\big)$ †
This work | $\gamma = 1$, $\zeta = 0$ | $O\big(\frac{\sigma^2}{n\varepsilon^2} + \frac{\delta\sigma^2}{\varepsilon^2} + \frac{1}{\varepsilon}\big)$
Byzantine-robust federated learning:
Karimireddy et al. (2021b) | - | $O\big(\frac{\sigma^2}{\varepsilon^2}(\delta + \frac{1}{n}) + \frac{1}{\varepsilon}\big)$
This work | $\gamma = 1$ | $O\big(\frac{\sigma^2}{\varepsilon^2}(\delta + \frac{1}{n}) + \frac{\zeta}{\varepsilon^{3/2}} + \frac{\sigma^{2/3}}{\varepsilon^{4/3}} + \frac{1}{\varepsilon}\big)$
† This method does not generalize to constrained communication topologies.

5. ROBUST DECENTRALIZED OPTIMIZATION

The general decentralized training algorithm can be formulated as
$x_i^{t+1/2} := \begin{cases} x_i^t - \eta g_i(x_i^t) & i \in V_R \\ * & i \in V_B \end{cases}$, $x_i^{t+1} := \mathrm{AGG}_i\big(\{x_k^{t+1/2} : k \in \overline{\mathcal{N}}_i\}\big)$,
where $\eta$ is the learning rate, $g_i(x) := \nabla F_i(x; \xi_i^t)$ is a stochastic gradient, and $\xi_i^t \sim D_i$ is the random batch at time $t$ on worker $i$. The received message $x_k^{t+1/2}$ can be arbitrary for Byzantine nodes $k \in V_B$. Replacing AGG with plain gossip averaging (GOSSIP) recovers standard gossip SGD (Koloskova et al., 2019). In the presence of Byzantine workers, which is the main interest of our work, we will show that we can replace AGG with CLIPPEDGOSSIP and use local worker momentum to achieve Byzantine robustness (Karimireddy et al., 2021a). The full procedure is described in Algorithm 1.
Theorem III. Suppose Assumptions (A1)-(A4) hold and $\delta_{\max} = O(\gamma^2)$. Then for $\alpha := 3\eta L$, Algorithm 1 reaches
$\frac{1}{T+1} \sum_{t=0}^{T} \|\nabla f(\bar{x}^t)\|_2^2 \le \frac{\delta_{\max}\zeta^2}{\gamma^2} + \varepsilon$
in iteration complexity
$O\Big(\frac{\sigma^2}{\varepsilon^2}\big(\frac{1}{n} + \delta_{\max}\big) + \frac{\zeta}{\gamma\varepsilon^{3/2}} + \frac{\sigma^{2/3}}{\gamma^{2/3}\varepsilon^{4/3}} + \frac{1}{\gamma\varepsilon}\Big)$.
Furthermore, the consensus distance satisfies the upper bound
$\frac{1}{|V_R|} \sum_{i \in V_R} \|x_i^T - \bar{x}^T\|_2^2 \le O\big(\frac{\zeta^2}{\gamma^2 (T+1)}\big)$.
We compare our analysis with existing works for non-convex objectives in Table 1.
Regular decentralized training. Even if there are no Byzantine workers ($\delta_{\max} = 0$), our convergence rate is slightly faster than that of standard gossip SGD (Koloskova et al., 2020b). The difference is that our third term $O\big(\frac{\sigma^{2/3}}{\gamma^{2/3}\varepsilon^{4/3}}\big)$ is faster than their $O\big(\frac{\sigma}{\sqrt{\gamma}\varepsilon^{3/2}}\big)$ for large $\sigma$ and small $\varepsilon$. This is because we use local momentum, which reduces the effect of the variance $\sigma$. Thus momentum has a double use in this paper: achieving robustness as well as accelerating optimization.
Byzantine-robust federated learning. Federated learning uses a fully connected graph ($\gamma = 1$). We compare the state-of-the-art federated learning method (Karimireddy et al., 2021b) with our rate when $\gamma = 1$.
Both algorithms converge to a $\Theta(\delta\zeta^2)$-neighborhood of a stationary point and share the same leading term. This neighborhood can be circumvented with a strong growth condition and overparameterized models (Karimireddy et al., 2021b, Theorem III). We incur additional higher-order terms $O\big(\frac{\zeta}{\gamma\varepsilon^{3/2}} + \frac{\sigma^{2/3}}{\gamma^{2/3}\varepsilon^{4/3}}\big)$ as a penalty for the generality of our analysis. This shows that the trusted server in federated learning can be removed without significant slowdowns.
Byzantine-robust decentralized SGD with fully connected topology. If we limit our analysis to the special case of a fully connected graph ($\gamma = 1$) and IID data ($\zeta = 0$), then our rate has the same leading term as (Gorbunov et al., 2021), which enjoys the scaling of the total number of regular nodes. The second term $O\big(\frac{n\delta\sigma^2}{m\varepsilon}\big)$ of (Gorbunov et al., 2021) is better than our $O\big(\frac{\delta\sigma^2}{\varepsilon^2}\big)$ for small $\varepsilon$ because they additionally validate $m$ random updates in each step. However, (Gorbunov et al., 2021) relies on secure protocols which do not easily generalize to constrained communication.
Byzantine-robust decentralized SGD with constrained communication. MOZI (Guo et al., 2021) does not provide a theoretical analysis of convergence, and TM (Sundaram & Gharesifard, 2018; Su & Vaidya, 2016a; Yang & Bajwa, 2019a) only proves the asymptotic convergence of full gradient under a very strong assumption on connectivity and local honest majority. Peng & Ling (2020) do not prove a rate for non-convex objectives, but Gorbunov et al. (2021) show that the method of Peng & Ling (2020) converges on strongly convex objectives at a rate inferior to parallel SGD. In contrast, our convergence rate matches the standard stochastic analysis under much weaker assumptions than Sundaram & Gharesifard (2018); Su & Vaidya (2016a); Yang & Bajwa (2019a). Unlike these prior works, our guarantees hold even if some subsets of nodes are surrounded by a majority of Byzantine attackers.
This can also be observed in practice, as we show in § D.2.3.
Consensus for Byzantine-robust decentralized optimization. Theorem III gives a non-trivial result that regular workers reach consensus under the CLIPPEDGOSSIP aggregator. In Fig. 2 we demonstrate the consensus behavior of robust aggregators on the CIFAR-10 dataset on a dumbbell topology, without attackers ($\delta = 0$). We compare the accuracies of the models averaged within cliques A and B with that of the model averaged over all workers. In the IID setting, the clique-averaged models of GM and TM reach over 80% accuracy but the globally-averaged models reach less than 30% accuracy. This means that clique A and clique B are converging to two different critical points, and GM and TM fail to reach consensus within the entire network. In contrast, the globally-averaged model of CLIPPEDGOSSIP is as good as or better than the clique-averaged models, both in the IID and non-IID setting.
Finally, we point out some avenues for further improvement: our results depend on the worst-case $\delta_{\max}$. We believe it is possible to replace it with a (weighted) average of the $\{\delta_i\}$ instead. Also, extending our protocols to time-varying topologies would greatly increase their practicality.
Remark 5 (Adaptive choice of clipping radius $\tau_i^t$). In § D.5, we give an adaptive rule to choose the clipping radius $\tau_i^t$ for all $i \in V_R$ and times $t$, based on the top percentile of close neighbors. This adaptive rule results in a value $\tau_i^t$ slightly smaller than the required theoretical value to preserve Byzantine robustness. In experiments, we found that the performance of optimization is robust to small perturbations of the clipping radius and that the adaptive rule performs well in all cases.
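To illustrate how momentum and clipped aggregation interact in Algorithm 1, here is a minimal one-dimensional Python sketch. Everything about the setup is an assumption made for illustration: noiseless quadratic objectives $f_i(x) = \frac{1}{2}(x - a_i)^2$, a complete graph with uniform weights, a single Byzantine worker that always sends the same large value, and a fixed clipping radius instead of the adaptive rule of § D.5. In one dimension, CLIP reduces to clamping to $[-\tau, \tau]$.

```python
def clip1d(z, tau):
    """CLIP in one dimension: clamp z to the interval [-tau, tau]."""
    return max(-tau, min(tau, z))


def run(targets, byz_value, tau, eta=0.1, alpha=0.3, steps=500):
    """Toy version of Algorithm 1 with one Byzantine worker (hypothetical setup).

    Regular worker i minimizes 0.5 * (x - targets[i])**2, so g_i(x) = x - targets[i].
    All n regular workers and the Byzantine worker form a complete graph with
    uniform weights w = 1 / (n + 1), including the self-weight.
    """
    n = len(targets)
    w = 1.0 / (n + 1)
    x = [0.0] * n
    m = [x[i] - targets[i] for i in range(n)]           # m_i^0 = g_i(x^0)
    for _ in range(steps):
        m = [(1 - alpha) * m[i] + alpha * (x[i] - targets[i]) for i in range(n)]
        half = [x[i] - eta * m[i] for i in range(n)]     # local momentum step
        received = half + [byz_value]                    # Byzantine message appended
        x = [
            half[i] + sum(w * clip1d(v - half[i], tau) for v in received)
            for i in range(n)
        ]
    return x


targets = [0.0, 1.0, 2.0, 3.0]                           # minimizer of f is 1.5
robust = run(targets, byz_value=1e6, tau=0.5)
plain = run(targets, byz_value=1e6, tau=float("inf"))   # plain gossip averaging
robust_avg = sum(robust) / len(robust)
plain_avg = sum(plain) / len(plain)
```

In this toy run the clipped variant keeps the regular average within a bounded distance of the minimizer, since the Byzantine worker can shift each regular worker by at most $w\tau$ per step, while the unclipped variant is dragged far away by the single attacker. The choice $\alpha = 3\eta L$ with $L = 1$ mirrors Theorem III, but the sketch is not a substitute for the analysis.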

6. EXPERIMENTS

In this section, we empirically demonstrate successes and failures of decentralized training in the presence of Byzantine workers, and compare the performance of CLIPPEDGOSSIP with existing robust aggregators: 1) geometric median GM (Pillutla et al., 2019); 2) coordinate-wise trimmed mean TM (Yang & Bajwa, 2019a); 3) MOZI (Guo et al., 2020). Coordinate-wise median (Yin et al., 2018) and Krum (Blanchard et al., 2017) usually perform worse than GM, so we exclude them from the experiments. All implementations are based on PyTorch (Paszke et al., 2019) and evaluated on different graph topologies, with a distributed MNIST dataset (LeCun & Cortes, 2010). We defer the experiments on CIFAR-10 (Krizhevsky et al., 2009) to § D.3. We defer details of robust aggregators to § A, attacks to § B, topologies and mixing matrix to § C, and experiment setups and additional experiments to § D.
Figure 4: Accuracy of the averaged model in clique A for the dumbbell topology. In the plot titles, "B." stands for bucketing (aggregating means of bucketed values) and "R." stands for adding 1 additional random edge between the two cliques. We see that i) CLIPPEDGOSSIP is consistently the best, matching ideal averaging performance, ii) performance mildly improves by using bucketing, and iii) performance significantly improves when adding a single random edge (thereby improving connectivity).

6.1. DECENTRALIZED DEFENSES WITHOUT ATTACKERS

Challenging topologies and data distributions may prevent existing robust aggregators from reaching consensus even when there is no Byzantine worker ($\delta = 0$). In this part, we consider the "dumbbell" topology, cf. Fig. 1. As a non-IID data distribution, we split the training dataset by labels such that workers in clique A are training on digits 0 to 4 while workers in clique B are training on digits 5 to 9. This entanglement of topology and data distribution is motivated by realistic geographic constraints such as continents with dense intra-connectivity but sparse inter-connection links, e.g. through an undersea cable. In Fig. 4 we compare CLIPPEDGOSSIP with the existing robust aggregators GM, TM, and MOZI in terms of the accuracy of the averaged model in clique A. The ideal communication refers to aggregation with gossip averaging.
Existing robust aggregators impede information diffusion. When cliques A and B have distinct data distributions (non-IID), workers in clique A rely on the graph cut to access the full spectrum of data and attain good performance. However, existing robust aggregators in clique A completely discard information from clique B because: 1) clique B model updates are outliers to clique A due to data heterogeneity; 2) clique B updates are outnumbered by clique A updates, since clique A can only observe 1 update from B due to constrained communication. The 2nd plot in Fig. 4 shows that GM, TM, and MOZI only reach 50% accuracy in the non-IID setting, supporting that they impede information diffusion. This is in contrast to the 1st plot, where cliques A and B have identical data distributions (IID) and the information in clique A alone is enough to attain good performance. However, reaching local models does not imply reaching consensus, cf. Fig. 2. On the other hand, CLIPPEDGOSSIP is the only robust aggregator that preserves the information diffusion rate of ideal gossip averaging.
Techniques that improve information diffusion.
To address these issues, we locally employ the bucketing technique of (Karimireddy et al., 2021b) for the non-IID case in the 3rd subplot. Plots 4 and 5 demonstrate the impact of one additional edge between the cliques to improve the spectral gap.
• The bucketing technique randomly assigns received vectors to buckets of equal size, averages the vectors in each bucket, and finally feeds the averaged vectors to the aggregator. While bucketing helps TM to overcome 50% accuracy, TM is still behind CLIPPEDGOSSIP. GM only improves by 1% while MOZI remains at almost the same accuracy.
• Adding one more random edge between the two cliques improves the spectral gap $\gamma$ from 0.0154 to 0.0286. CLIPPEDGOSSIP and gossip averaging converge faster, as the theory predicts. However, TM, GM, and MOZI are still stuck at 50% for the same heterogeneity reason.
• Bucketing and adding a random edge help all aggregators exceed 50% accuracy.
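The bucketing step described above can be sketched as follows. This is a minimal illustration, assuming vectors are plain Python lists and that the last bucket may be smaller when the number of vectors is not divisible by the bucket size; the function name and the explicit random generator argument are hypothetical.

```python
import random


def bucketing(vectors, bucket_size, rng):
    """Randomly partition vectors into buckets and return the bucket means."""
    idx = list(range(len(vectors)))
    rng.shuffle(idx)                                    # random assignment to buckets
    buckets = [idx[k:k + bucket_size] for k in range(0, len(idx), bucket_size)]
    d = len(vectors[0])
    return [
        [sum(vectors[i][c] for i in b) / len(b) for c in range(d)]
        for b in buckets
    ]


rng = random.Random(0)                                  # seeded for reproducibility
received = [[float(i), 2.0 * i] for i in range(6)]
bucket_means = bucketing(received, 2, rng)
# bucket_means would then be passed to a robust aggregator such as TM or GM.
```

Averaging within buckets reduces the spread of the inputs seen by the aggregator, and with equal-size buckets the overall mean of the inputs is unchanged.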

6.2. DECENTRALIZED LEARNING UNDER MORE ATTACKS AND TOPOLOGIES.

In this section, we compare robust aggregators over more topologies and Byzantine attacks in the non-IID setting. We consider two topologies: randomized small world ($\gamma = 0.084$) and torus ($\gamma = 0.131$). They are much less restrictive than the dumbbell topology ($\gamma = 0.043$), where all existing aggregators fail to reach consensus even when $\delta = 0$. For attacks, we implement the state-of-the-art federated attacks inner product manipulation (IPM) (Xie et al., 2019) and a little is enough (ALIE) (Baruch et al., 2019), as well as label-flipping (LF) and bit-flipping (BF). Details about topologies and the adaptation of FL attacks to the decentralized setup are provided in § C.1 and § B.
The results in Fig. 5 show that CLIPPEDGOSSIP has consistently superior performance under all topologies and attacks, with the geometric median (GM) coming second. All robust aggregators generally perform better on easier topologies (large $\gamma$). GM performs very well on these two topologies but, as we have demonstrated on the dumbbell topology, GM does not work in more challenging topologies. Therefore, CLIPPEDGOSSIP is recommended for a general constrained topology.

6.3. LOWER BOUND OF OPTIMIZATION

We empirically investigate the lower bound of optimization $O(\delta_{\max}\zeta^2\gamma^{-2})$ in Theorem III. In this experiment, we fix the spectral gap $\gamma$ and the heterogeneity $\zeta^2$, and vary the fraction $\delta_{\max}$ of Byzantine edges in the dumbbell topology. The Byzantine workers are added to $V_1$ in clique A and its mirror node in clique B. We define the following dissensus attack for decentralized optimization.
Definition A (DISSENSUS attack). For $i \in V_R$ and $\varepsilon_i > 0$, a dissensus attacker $j \in \mathcal{N}_i \cap V_B$ sends
$x_j := x_i - \varepsilon_i \frac{\sum_{k \in \mathcal{N}_i \cap V_R} W_{ik}(x_k - x_i)}{\sum_{j \in \mathcal{N}_i \cap V_B} W_{ij}}$.
The resulting Figure 6 shows that with increasing $\delta_{\max}$ the model quality drops significantly. This is in line with our proven robust convergence rate in terms of $\delta_{\max}$. Notice that for large $\delta_{\max}$, the model averaged over all workers performs even worse than those averaged within cliques. This means that the models in the two cliques are essentially disconnected and are converging to different local minima or stationary points of a non-convex landscape. See § D.2.2 for details.
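The message of Definition A can be computed directly from the quantities a Byzantine worker is assumed to know. The Python sketch below is illustrative only: the function name, the dictionary-based inputs, and the toy values are hypothetical.

```python
def dissensus_message(x_i, weights, models, regular_nbrs, byz_nbrs, eps):
    """Message sent to regular worker i by each of its Byzantine neighbors.

    x_i:          current model of the attacked regular worker i
    weights:      dict j -> W_ij for the neighbors of i
    models:       dict k -> model of regular neighbor k
    regular_nbrs: regular neighbors of i; byz_nbrs: Byzantine neighbors of i
    """
    d = len(x_i)
    # Weighted pull of the regular neighbors on worker i under plain gossip.
    pull = [
        sum(weights[k] * (models[k][c] - x_i[c]) for k in regular_nbrs)
        for c in range(d)
    ]
    byz_weight = sum(weights[j] for j in byz_nbrs)
    return [x_i[c] - eps * pull[c] / byz_weight for c in range(d)]


x_i = [0.0, 0.0]
weights = {1: 0.25, 2: 0.25}          # node 1 is regular, node 2 is Byzantine
models = {1: [4.0, 0.0]}
msg = dissensus_message(x_i, weights, models, [1], [2], eps=1.0)
```

Under plain gossip averaging, the Byzantine contribution to worker $i$ is $-\varepsilon_i$ times the regular neighbors' pull, so the net movement is scaled by $(1 - \varepsilon_i)$; with $\varepsilon_i = 1$ the attacked worker does not move towards its regular neighbors at all.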

7. DISCUSSION

The main takeaway from our work is that ill-connected communication topologies can vastly magnify the effect of bad actors. As long as the communication topology is reasonably well connected (say γ = 0.35) and the fraction of attackers is mild (say δ = 10%), clipped gossip provably ensures robustness. Under more extreme conditions, however, no algorithm can guarantee robust convergence. Given that decentralized consensus has been proposed as a backbone for digital democracy (Bulteau et al., 2021), and that decentralized learning is touted as an alternative to current centralized training paradigms, our findings are significant. A simple strategy we recommend (along with using CLIPPEDGOSSIP) is adding random edges to improve the connectivity and robustify the network.

A EXISTING ROBUST AGGREGATORS

In this section, we describe the existing robust aggregators mentioned in this paper. Regular nodes can replace gossip averaging (GOSSIP) with robust aggregators from federated learning. Let us take the geometric median and the trimmed mean as examples.

• Geometric median (GM). Pillutla et al. (2019) implement the geometric median

$$\mathrm{GM}(x_1, \dots, x_n) := \arg\min_{v} \sum_{i=1}^{n} \|v - x_i\|_2 .$$

• Coordinate-wise trimmed mean (TM). Yin et al. (2018); Yang & Bajwa (2019a) compute the k-th coordinate of TM as

$$[\mathrm{TM}(x_1, \dots, x_n)]_k := \frac{1}{(1-2\beta)n} \sum_{i \in U_k} [x_i]_k$$

where $U_k$ is a subset of $[n]$ obtained by removing the largest and smallest β-fraction of its elements.

These aggregators do not take advantage of the trusted local information and treat all models equally. The MOZI algorithm (Guo et al., 2021) leverages local information to filter outliers.

• Mozi. Guo et al. (2021) apply two screening steps on worker $i \in V_R$

$$N_i^s := \arg\min_{\substack{N^* \subset N_i \\ |N^*| = \delta_i |N_i|}} \sum_{j \in N^*} \|x_i - x_j\|, \qquad N_i^r := N_i^s \cap \{ j \in [n] : \ell(x_j, \xi_i) \le \ell(x_i, \xi_i) \}$$

where $\xi_i \sim D_i$ is a random sample. If $N_i^r = \emptyset$, then redefine $N_i^r := \{\arg\min_j \ell(x_j, \xi_i)\}$. Then they update the model with

$$x_i^{t+1} := \alpha x_i^t + \frac{1-\alpha}{|N_i^r|} \sum_{j \in N_i^r} x_j^t - \eta \nabla F_i(x_i^t; \xi_i^t)$$

where $\alpha \in [0, 1]$ is a hyperparameter.
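The GM and TM definitions above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments: the function names are hypothetical, the number of trimmed values is rounded down (an assumption), and the geometric median is only approximated with Weiszfeld iterations, a standard solver that the text does not prescribe.

```python
import math


def trimmed_mean(vectors, beta):
    """Coordinate-wise trimmed mean of a list of equal-length vectors."""
    n = len(vectors)
    k = int(beta * n)  # values dropped at each end (rounding down is an assumption)
    if 2 * k >= n:
        raise ValueError("beta is too large: no values left to average")
    result = []
    for c in range(len(vectors[0])):
        column = sorted(v[c] for v in vectors)
        kept = column[k:n - k]  # values surviving the trim in coordinate c
        result.append(sum(kept) / len(kept))
    return result


def geometric_median(vectors, iters=200, eps=1e-12):
    """Approximate arg min_v sum_i ||v - x_i||_2 with Weiszfeld iterations."""
    dim = len(vectors[0])
    v = [sum(x[c] for x in vectors) / len(vectors) for c in range(dim)]
    for _ in range(iters):
        # eps guards against division by zero when v lands on a data point
        weights = [1.0 / max(math.dist(v, x), eps) for x in vectors]
        total = sum(weights)
        v = [sum(w * x[c] for w, x in zip(weights, vectors)) / total
             for c in range(dim)]
    return v


# Four nearby models and one outlier (made-up numbers).
models = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [100.0, -50.0]]
print(trimmed_mean(models, beta=0.2))
print(geometric_median(models))
```

Both functions treat every input vector identically, which matches the remark above that these aggregators do not use the trusted local model.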

B BYZANTINE ATTACKS IN THE DECENTRALIZED ENVIRONMENT

In this section, we first describe how to transform attacks from the federated learning to the decentralized environment. Then we introduce the dissensus attack for decentralized environment.

B.1 EXISTING ATTACKS IN FEDERATED LEARNING

A little is enough (ALIE). The attackers estimate the mean $\mu_{N_i}$ and standard deviation $\sigma_{N_i}$ of the regular models, and send $\mu_{N_i} - z\sigma_{N_i}$ to regular worker $i$, where $z$ is a small constant controlling the strength of the attack (Baruch et al., 2019). The hyperparameter $z$ for ALIE is computed according to (Baruch et al., 2019)

$$z = \max_z \left( \phi(z) < \frac{n - b - s}{n - b} \right)$$

where $s = \lfloor \frac{n}{2} + 1 \rfloor - b$ and $\phi$ is the cumulative standard normal function.

Inner product manipulation attack (IPM). The inner product manipulation attack, proposed in (Xie et al., 2019), lets all attackers send the same corrupted gradient $u$ based on the good gradients

$$u_j = -\varepsilon\, \mathrm{AVG}(\{v_i : i \in V_R\}) \quad \forall\, j \in V_B .$$

If ε is small enough, then $u_j$ can be detected as good by the defense, thereby circumventing it. IPM needs to be adapted to the decentralized environment in 3 main respects:
1. Byzantine workers may not be connected to the same good worker.
2. The model vectors are transmitted instead of gradients.
3. The AVG should be replaced by its equivalent gossip form.
This motivates our dissensus attack in the next section.

B.2 DISSENSUS ATTACK

In this section, we introduce a novel dissensus attack inspired by our impossibility construction in Theorem II and the IPM attack described above. The dissensus attack aims to prevent regular worker models from reaching consensus. Roughly speaking, dissensus attackers around worker $i$ send model weights that are symmetric to the weighted average of the regular neighbors around $i$. After the gossip averaging step, the consensus distance then drops more slowly or even grows, which motivates the name "dissensus". We parameterize the attack through the hyperparameter $\varepsilon_i$ and summarize it in Definition A:

$$x_j := x_i - \varepsilon_i \frac{\sum_{k \in N_i \cap V_R} W_{ik} (x_k - x_i)}{\sum_{j \in N_i \cap V_B} W_{ij}}. \tag{6}$$

The parameter $\varepsilon_i$ determines the behavior of the attack. With a smaller $\varepsilon_i$, the Byzantine model weights are closer to the target $x_i$ and harder to detect. On the other hand, a larger $\varepsilon_i$ pulls the model further away from the consensus.
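To make the dissensus definition concrete, the following is a small pure-Python sketch of the message a dissensus attacker sends to worker i and of one plain gossip averaging step on i. The helper names and the toy numbers are illustrative assumptions, not part of the paper's code. With ε_i = 1 the Byzantine contribution exactly cancels the pull of the regular neighbors, so the gossip step leaves the model of worker i unchanged.

```python
def dissensus_message(x_i, regular, byzantine_weights, eps):
    """Vector that every Byzantine neighbor sends to regular worker i.

    regular: list of (W_ik, x_k) pairs for the regular neighbors k of i.
    byzantine_weights: list of W_ij for the Byzantine neighbors j of i.
    """
    dim = len(x_i)
    # sum_k W_ik (x_k - x_i): the progress honest gossip would make on worker i
    pull = [sum(w * (x_k[c] - x_i[c]) for w, x_k in regular) for c in range(dim)]
    byz_mass = sum(byzantine_weights)
    return [x_i[c] - eps * pull[c] / byz_mass for c in range(dim)]


def gossip_step(x_i, neighbors):
    """Plain gossip averaging: x_i + sum_j W_ij (x_j - x_i) over all neighbors."""
    dim = len(x_i)
    return [x_i[c] + sum(w * (x_j[c] - x_i[c]) for w, x_j in neighbors)
            for c in range(dim)]


# Toy example: two regular neighbors, two Byzantine neighbors (made-up values).
x_i = [1.0, 2.0]
regular = [(0.25, [3.0, 0.0]), (0.25, [5.0, 4.0])]
byz_w = [0.125, 0.125]

x_byz = dissensus_message(x_i, regular, byz_w, eps=1.0)
attacked = gossip_step(x_i, regular + [(w, x_byz) for w in byz_w])
print(x_byz)     # Byzantine message
print(attacked)  # model of worker i after one attacked gossip step
```

Choosing a smaller `eps` in this sketch moves `x_byz` closer to `x_i`, which corresponds to the stealthier regime described above.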
Note that this attack requires omniscience since it exploits model information from across the network. If the attackers can in addition choose which nodes to attack, then they can either spread the attack across the network or focus on a targeted graph cut, that is, the min-cut of the graph.

Effect of the dissensus attack. The dissensus attack enjoys the following properties.

Proposition IV. (i) For all $i \in V_R$, under the dissensus attack with $\varepsilon_i = 1$, the gossip averaging step (GOSSIP) is equivalent to no communication on $i$, that is, $x_i^{t+1} = x_i^t$. (ii) If the graph is fully connected, gossip averaging recovers the correct consensus even in the presence of the dissensus attack.

The above proposition illustrates two interesting aspects of the attack. Firstly, dissensus works by negating the progress that would be made by gossip. The attack in (Peng & Ling, 2020) also satisfies this property (see Appendix for additional discussion). Secondly, it is a uniquely decentralized attack and has no effect in the centralized setting. Hence, its effect can be used to measure the additional difficulty posed by the restricted communication topology.

Proof. For the first part, by the definition of (GOSSIP) we know that

$$x_i^{t+1} = \sum_{j=1}^{n} W_{ij} x_j^t = x_i^t + \sum_{j \in N_i} W_{ij} (x_j^t - x_i^t).$$

By setting $\varepsilon_i = 1$ in the attack (6), the second term is 0 and therefore $x_i^{t+1} = x_i^t$. For part (ii), note that in a fully connected graph the gossip average is the same as the standard average. Averaging all the perturbations introduced by the dissensus attack gives

$$-\varepsilon \sum_{i,j \in V_R} W_{i,j} (x_j^t - x_i^t) = 0 .$$

All terms cancel and sum to 0 by symmetry. Thus, in a fully connected graph the dissensus perturbations cancel out and the gossip average returns the correct consensus.

Relation with zero-sum attack and dissensus. Peng & Ling (2020) propose the "zero-sum" attack, which achieves a similar effect to Proposition IV part (i). This attack is defined for $j \in V_B$ as

$$x_j := -\frac{\sum_{k \in N_i \cap V_R} x_k}{|N_i \cap V_B|} .$$
The difference between the zero-sum attack and our proposed attack is three-fold. First, the zero-sum attack ensures $\sum_{j \in N_i} x_j = 0$, which means the Byzantine models have to be far away from $x_i^t$ and are therefore easy to detect. This attack pulls the aggregated model to 0. In contrast, our attack ensures

$$\frac{1}{\sum_{j \in N_i} W_{ij}} \sum_{j \in N_i} W_{ij} x_j^t = x_i^t ,$$

so the Byzantine updates can be very close to $x_i^t$ and are more difficult to detect. Second, our proposed attack considers gossip averaging, which is prevalent in decentralized training (Koloskova et al., 2020b), while the zero-sum attack only targets the simple average. Third, our attack has an additional parameter ε controlling the strength of the attack, where ε > 1 further compromises the model quality, while the effect of the zero-sum attack is fixed to that of training alone.

C TOPOLOGIES AND MIXING MATRICES

C.1 CONSTRAINED TOPOLOGIES

Topologies that do not satisfy the robust network assumption in (LeBlanc et al., 2013; Sundaram & Gharesifard, 2018; Su & Vaidya, 2016a). The robust network assumption requires at least b + 1 paths between any two regular workers when there are b Byzantine workers in the network (LeBlanc et al., 2013; Sundaram & Gharesifard, 2018; Su & Vaidya, 2016a). The topology in Figure 8 has only 1 path between regular workers in the two cliques while having 2 Byzantine workers in the network. Therefore, this topology does not satisfy the robust network assumption. However, the graph cut is not adjacent to the Byzantine workers and, intuitively, an ideal robust aggregator should be able to help reach consensus. The experimental results are given in Appendix D.4.

(Randomized) Small-world topology. The small-world topology is a random graph generated with the Watts-Strogatz model (Watts & Strogatz, 1998). The topology is created using the NetworkX package (Hagberg et al., 2008) with 10 regular workers, each connected to its 2 nearest neighbors, and a rewiring probability of 0.15 per edge. Two additional Byzantine workers are linked to 2 random regular workers. There are 12 workers in total.

Figure 8: Example topology that does not satisfy the robust network assumptions in (Sundaram & Gharesifard, 2018; Su & Vaidya, 2016a). The figure labels Clique A, Clique B, and the cut between nodes $A_1$ and $B_1$.

Torus topology. The regular workers form a torus grid $T_{3,3}$, and two additional Byzantine workers are linked to 2 random regular workers. There are 11 workers in total.

The mixing matrices for these topologies are constructed with the Metropolis-Hastings algorithm introduced in Appendix C.2. The spectral gaps of the small-world and torus topologies are 0.084 and 0.131, respectively. In contrast, the dumbbell topology in Figure 16 is more challenging, with a spectral gap of 0.043. The data distribution is non-IID.

C.2 CONSTRUCTING MIXING MATRICES

In this section, we introduce a few possible ways to construct the mixing weight vectors in the presence of Byzantine workers. The constructed weight vectors satisfy (A2) in Section 3.

• Metropolis-Hastings weight (Hastings, 1970). The Metropolis-Hastings algorithm locally constructs the mixing weights by exchanging degree information ($d_i$ and $d_j$) between two nodes $i$ and $j$. The mixing weight vector on regular worker $i \in V_R$ is computed as follows

$$W_{ij} = \begin{cases} \frac{1}{\max\{d_i, d_j\} + 1} & j \in N_i, \\ 1 - \sum_{l \in N_i} W_{il} & j = i, \\ 0 & \text{otherwise.} \end{cases}$$

If worker $j \in V_B$ is Byzantine, then the only way for $j$ to maximize its weight $W_{ij}$ to regular worker $i$ is to report a smaller degree $d_j$. However, such Byzantine behavior of node $j$ has limited influence on worker $i$'s weight $W_{ij}$ because it cannot be greater than $\frac{1}{d_i + 1}$.

• Equal-weight. Let $d_{\max}$ be the maximum degree of nodes in a graph. Such an upper bound $d_{\max}$ can be public information; for example, a Bluetooth device can connect to at most $d_{\max}$ other devices due to physical constraints. The Byzantine worker cannot change the value of $d_{\max}$. Then we use the following naive construction

$$W_{ij} = \begin{cases} \frac{1}{d_{\max} + 1} & j \in N_i, \\ 1 - \frac{|N_i|}{d_{\max} + 1} & j = i, \\ 0 & \text{otherwise.} \end{cases}$$

Note that these construction schemes are not proven to be optimal. In this work, we focus on the Byzantine attacks given a topology and associated mixing weights. We leave it as future work to explore the best strategy to construct mixing weights.
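The two constructions above can be sketched in a few lines of pure Python. The function names, the dictionary-based graph representation, and the toy graph are assumptions made for illustration. The example has a Byzantine node under-report its degree, which shows the cap of 1/(d_i + 1) on its weight mentioned above.

```python
def metropolis_hastings_row(i, neighbors, degree):
    """Row i of W under the Metropolis-Hastings rule.

    neighbors: dict mapping node -> set of neighbor nodes.
    degree: dict mapping node -> reported degree (a Byzantine node may lie).
    """
    row = {j: 1.0 / (max(degree[i], degree[j]) + 1) for j in neighbors[i]}
    row[i] = 1.0 - sum(row.values())  # self weight keeps the row sum at 1
    return row


def equal_weight_row(i, neighbors, d_max):
    """Row i of W under the equal-weight rule with public degree bound d_max."""
    row = {j: 1.0 / (d_max + 1) for j in neighbors[i]}
    row[i] = 1.0 - len(neighbors[i]) / (d_max + 1)
    return row


# Path 0 - 1 - 2 of regular workers, plus Byzantine node 3 attached to worker 1.
neighbors = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
# Node 3 reports degree 0 (it really has degree 1) to try to inflate its weight.
degree = {0: 1, 1: 3, 2: 1, 3: 0}

mh = metropolis_hastings_row(1, neighbors, degree)
eq = equal_weight_row(1, neighbors, d_max=3)
print(mh)
print(eq)
```

In this toy graph both rules happen to give worker 1 the same row, because its true degree equals the assumed `d_max`.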

D EXPERIMENTS

We summarize the hardware and software for experiments in Table 2. We list the setups and results of experiments for consensus in Appendix D.1 and optimization in Appendix D.2.

D.1 BYZANTINE-ROBUST CONSENSUS

In this section, we provide detailed setups for Figure 3. Figure 9 shows the topology for the experiment. The 4 regular workers are connected, with two of them holding value 0 and the others holding 200. The average consensus is then 100, with an initial mean square error of 10000. Two Byzantine workers are connected to two regular workers in the middle. We can tune the weight of each edge to change the mixing matrix and γ. Then we can decide the weight δ on the Byzantine edge. The γ and δ used in the experiments are

• $p := 1 - (1 - \gamma)^2 \in [0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0.005, 0.0014, 3.7\mathrm{e}{-4}, 1\mathrm{e}{-4}, 1\mathrm{e}{-5}]$
• $\delta \in [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]$

where incompatible combinations of γ and δ are omitted in Figure 3. The dissensus attack is applied with ε = 0.05. The hyperparameter β of trimmed mean (TM) is set to the actual number of Byzantine workers around the regular worker. The clipping radius of CLIPPEDGOSSIP is chosen according to (27). In Figure 10, we show the iteration-to-error curves for all possible combinations of γ and δ. In addition, we provide versions of TM and MEDIAN which take the mixing weights into account (TM* and MEDIAN*). As we can see, the naive TM, MEDIAN, and MEDIAN* cannot bring workers closer because of the data distribution we constructed. TM* performs better than the other baselines but worse than CLIPPEDGOSSIP, especially in the challenging cases where γ is small and δ is large. For CLIPPEDGOSSIP, the results match our intuition: for a fixed γ the convergence is worse with increasing δ, while for a fixed δ the convergence is worse with decreasing γ.

D.2 BYZANTINE-ROBUST DECENTRALIZED OPTIMIZATION

In this section, we provide detailed hyperparameters and setups for the experiments in the main text and then provide additional experiments. For all MNIST tasks, we use the default setup listed in Table 3 unless specifically stated. The default hyperparameters of the robust aggregators: 1) For GM, we this setting. For the bucketing experiment, we choose a bucket size of s = 2. This means we randomly put at most two updates into one bucket, average within each bucket, and then apply the robust aggregators to the averaged updates.
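The bucketing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the use of Python's `random.Random` for the shuffle, and the toy updates are not from the paper, and the robust aggregator that would be applied to the bucket averages is left out.

```python
import random


def bucketing(updates, s=2, rng=None):
    """Randomly group updates into buckets of at most s and average each bucket."""
    rng = rng or random.Random()
    shuffled = list(updates)
    rng.shuffle(shuffled)  # random assignment of updates to buckets
    buckets = [shuffled[k:k + s] for k in range(0, len(shuffled), s)]
    dim = len(updates[0])
    return [[sum(u[c] for u in bucket) / len(bucket) for c in range(dim)]
            for bucket in buckets]


# Five toy updates with s = 2 give three bucket averages (the last bucket has one update).
updates = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0], [9.0, 4.0]]
averaged = bucketing(updates, s=2, rng=random.Random(0))
print(len(averaged))
print(averaged)
```

A robust aggregator such as GM or TM would then be applied to `averaged` instead of to the raw updates.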

D.2.2 SETUP FOR "EFFECTS OF THE NUMBER OF BYZANTINE WORKERS"

The Fig. 6 

D.2.3 SETUP FOR "DEFENSE WITHOUT HONEST MAJORITY"

Fig. 12 uses the ring topology of 5 regular workers shown in Fig. 13. 11 Byzantine workers are added to the ring so that 1 regular worker does not have an honest majority. The experiments run for 900 iterations. We use $\varepsilon_i = 1.5$ for dissensus attacks and clipping radius τ = 0.1 for CLIPPEDGOSSIP. In the decentralized environment, the common honest majority assumption of the federated learning setup can be strengthened to honest majority everywhere, meaning all regular workers have an honest majority of neighbors (Su & Vaidya, 2016b; Yang & Bajwa, 2019b;a). Consider a ring of 5 regular workers with IID data: adding 2 Byzantine workers to each node still satisfies the honest majority assumption everywhere, whereas adding one more Byzantine worker to a node breaks the assumption. Figure 12 shows that while TM and GM can sometimes counter the attack under the honest majority assumption, adding one more Byzantine worker always corrupts the entire training. CLIPPEDGOSSIP defends against attacks successfully even beyond the assumption, because it leverages the fact that local updates are trustworthy. This suggests that existing statistics-based aggregators, which take no advantage of local information, are vulnerable under this realistic decentralized threat model.

D.2.4 SETUP FOR "MORE TOPOLOGIES AND ATTACKS."

In Figure 5, we use the small-world and torus topologies described in Appendix C.1. More specifically, we create a randomized small-world topology using the NetworkX package (Hagberg et al., 2008) with 10 regular workers, each connected to its 2 nearest neighbors, and a rewiring probability of 0.15 per edge. Two additional Byzantine workers are linked to 2 random regular workers. There are 12 workers in total. For the torus topology, we let the regular workers form a torus grid $T_{3,3}$ where all 9 regular workers are connected to 3 other workers. Two additional Byzantine workers are linked to 2 random regular workers. There are 11 workers in total. The mixing matrices for these topologies are constructed with the Metropolis-Hastings algorithm in Appendix C.2. The spectral gaps of the small-world and torus topologies are 0.084 and 0.131, respectively. In contrast, the dumbbell topology in Figure 16 is more challenging, with a spectral gap of 0.043. The data distribution is non-IID.

D.3 EXPERIMENT: CIFAR-10 TASK

In this section, we conduct experiments on the CIFAR-10 dataset (Krizhevsky et al., 2009). The running environment of this experiment is the same as for the MNIST experiment (Table 2). The default setup for the CIFAR-10 experiment is summarized in Table 4. We compare the performance of 5 aggregators on the dumbbell topology with 10 nodes in each clique (no attackers). The results are shown in Figure 14. In order to investigate whether consensus has been reached among the workers, we average the worker nodes in 3 different categories ("Global", Clique A, and Clique B) and compare their performance on IID and NonIID datasets. The "IID-Global" result shows that GM and TM are much worse than CLIPPEDGOSSIP and Gossip, in contrast to the MNIST experiment in Figure 4 where they have matching results. This is because the workers within each clique are converging to different stationary points: "IID-Clique A" and "IID-Clique B" show that GM and TM in each clique can reach over 80% accuracy, which is close to Gossip. This demonstrates that GM and TM fail to reach consensus even in this Byzantine-free case and are therefore vulnerable to attacks. The NonIID experiment also supports that CLIPPEDGOSSIP performs much better than all other robust aggregators. Notice that CLIPPEDGOSSIP's "NonIID-Global" performance is better than "NonIID-Clique A" and "NonIID-Clique B", while the results of GM and TM are the opposite. This is because CLIPPEDGOSSIP allows effective communication in this topology, so the clique models stay close to each other in the same local minimum basin and their average (the global model) is better than both of them. The clique models of GM and TM converge to different local minima, making their averaged model underperform.

D.4 EXPERIMENT FOR "WEAKER TOPOLOGY ASSUMPTION"

As mentioned in Remark 1 and Appendix C.1, the topology assumption in this work is weaker than the robust network assumption in Su & Vaidya (2016a); Sundaram & Gharesifard (2018). We use the topology in Figure 8, which consists of 10 regular workers and 2 dissensus attack workers. While this topology does not satisfy the robust network assumption, it should intuitively allow communication between the two cliques, as no Byzantine workers are attached to the cut. However, both GM and TM discard the graph cut due to data heterogeneity. This shows that GM and TM impede information diffusion. In contrast, CLIPPEDGOSSIP is the only robust aggregator which helps the two cliques reach consensus in the NonIID case. CLIPPEDGOSSIP theoretically applies to more topologies and empirically performs better.

D.5 EXPERIMENT: CHOOSING CLIPPING RADIUS

In Figure 16 we show the sensitivity to the clipping radius. We use the dumbbell topology with 5 regular workers in each clique and add 1 more Byzantine worker to each clique. The clipping radius is searched over a grid of [0.1, 0.5, 1, 2, 10]. The Byzantine attacks are chosen to be Bit-Flipping, Label-Flipping, and ALIE. We also give an adaptive clipping strategy for different $i \in V_R$ and time $t$. After the communication step at time $t$, the value of $x_i^{t+1/2}$ is available. Therefore we can sort the values of $\|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2$ for all $j \in N_i$. We denote by $S_i^t$ the set of indices of the workers that have the smallest distances to worker $i$. Then the adaptive strategy picks the clipping radius as follows

$$S_i^t = \arg\min_{S:\ \sum_{j \in S} W_{ij} \le 1 - \delta_{\max}} \sum_{j \in S} \|x_i^{t+1/2} - x_j^{t+1/2}\|, \qquad \tau_i^{t+1} = \sqrt{\sum_{j \in S_i^t} W_{ij} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2} .$$

Note that this adaptive choice of clipping radius is generally a bit smaller than the theoretical value (27). It guarantees that the Byzantine workers have limited influence at the cost of a small slowdown in convergence. As we can see from Figure 16, the performance of CLIPPEDGOSSIP is similar across the constant choices of τ, which shows that it is not very sensitive to the choice of τ. The adaptive strategy performs well in all cases. Therefore, the adaptive choice of τ is recommended in general.

E ANALYSIS

We restate the core equations in Algorithm 1 at time $t$ on worker $i$ as follows

$$m_i^{t+1} = (1 - \alpha) m_i^t + \alpha g_i(x_i^t) \tag{11}$$

$$x_i^{t+1/2} = x_i^t - \eta m_i^{t+1} \tag{12}$$

$$z_{j \to i}^{t+1} = x_i^{t+1/2} + \mathrm{CLIP}(x_j^{t+1/2} - x_i^{t+1/2}, \tau_i^t) \tag{13}$$

$$x_i^{t+1} = \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} \tag{14}$$

In addition, we define the following virtual iterates on the set of good nodes $V_R$
• $\bar{x}^t = \frac{1}{|V_R|} \sum_{i \in V_R} x_i^t$, the average (over time) of good iterates.
• $\bar{m}^t = \frac{1}{|V_R|} \sum_{i \in V_R} m_i^t$, the average (over time) of momentum iterates.

In this proof, we define $p := 1 - (1 - \gamma)^2 \in (0, 1]$ for convenience. In this section, we analyze the convergence behavior of the virtual iterates $\bar{x}^t$. The structure of this section is as follows:
• In Appendix E.1, we give common quantities, simplified notations and list common equalities/inequalities used in the proof.
• In Appendix E.2, we provide all auxiliary lemmas necessary for the proof. Among these lemmas, Lemma 8 is the key sufficient descent lemma.
• In Appendix E.3, we provide the proof of the main theorem.
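The clipping and gossip updates restated at the start of this section can be sketched in pure Python as below. This is an illustrative sketch, not the authors' code: the function names are hypothetical, CLIP is assumed to be the usual norm clipping CLIP(v, τ) = min(1, τ/‖v‖₂)·v, and the mixing weights of each row are assumed to sum to 1 so that the self term contributes no clipped difference.

```python
import math


def clip(v, tau):
    """Assumed norm clipping: scale v so that its 2-norm is at most tau."""
    norm = math.sqrt(sum(c * c for c in v))
    scale = 1.0 if norm <= tau else tau / norm
    return [scale * c for c in v]


def clipped_gossip_step(x_half_i, received, tau):
    """One aggregation step on worker i.

    x_half_i: the local half-step model of worker i.
    received: list of (W_ij, x_half_j) pairs for neighbors j != i.
    Because sum_j W_ij = 1 and CLIP(0) = 0, the sum over j of W_ij z_{j->i}
    equals x_half_i plus the weighted clipped differences.
    """
    out = list(x_half_i)
    for w, x_half_j in received:
        diff = clip([a - b for a, b in zip(x_half_j, x_half_i)], tau)
        for c, d in enumerate(diff):
            out[c] += w * d
    return out


# One nearby neighbor and one neighbor sending a huge vector (made-up values).
x_half_i = [0.0, 0.0]
received = [(0.25, [0.6, 0.8]), (0.25, [300.0, 400.0])]
print(clipped_gossip_step(x_half_i, received, tau=1.0))
print(clipped_gossip_step(x_half_i, received, tau=1e9))  # effectively no clipping
```

With τ = 1 the far-away message can move worker i by at most its weight times τ, whereas without clipping it dominates the average.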

E.1 DEFINITIONS, AND INEQUALITIES

Notations for the proof. We use the following variables to simplify the notation
• Optimization sub-optimality: $r^t := f(\bar{x}^t) - f^\star$
• Consensus distance: $\Xi^t := \frac{1}{|V_R|} \sum_{i \in V_R} \|x_i^t - \bar{x}^t\|_2^2$
• The distance between the ideal gradient and the actual averaged momentum: $e_1^{t+1} := \mathbb{E} \|\nabla f(\bar{x}^t) - \bar{m}^{t+1}\|_2^2$
• Similarly, the distance between the ideal gradient and individual momentums: $\tilde{e}_1^{t+1} := \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|\nabla f(\bar{x}^t) - m_i^{t+1}\|_2^2$
• Similarly, the distance between individual ideal gradients and individual momentums, weighted by the mixing matrix: $\bar{e}_1^{t+1} := \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \bigl\| \sum_{j \in V_R} W_{ij} (\nabla f_j(\bar{x}^t) - m_j^{t+1}) \bigr\|_2^2$
• Similarly, the distance between individual ideal gradients and individual momentums: $e_I^{t+1} := \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|m_i^{t+1} - \nabla f_i(\bar{x}^t)\|_2^2$
• Let $e_2^{t+1}$ be the averaged squared error introduced by clipping and Byzantine workers:

$$e_2^{t+1} := \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} W_{ij} (z_{j \to i}^{t+1} - x_j^{t+1/2}) + \sum_{j \in V_B} W_{ij} (z_{j \to i}^{t+1} - x_i^{t+1/2}) \Bigr\|_2^2 .$$

Lemma 6 (Common equalities and inequalities). We use the following equalities and inequalities
• The cosine theorem: for all $x, y \in \mathbb{R}^d$,
$$\langle x, y \rangle = -\tfrac{1}{2} \|x - y\|_2^2 + \tfrac{1}{2} \|x\|_2^2 + \tfrac{1}{2} \|y\|_2^2 \tag{15}$$
• Young's inequality: for $\varepsilon > 0$ and $x, y \in \mathbb{R}^d$,
$$\|x + y\|_2^2 \le (1 + \varepsilon) \|x\|_2^2 + (1 + \varepsilon^{-1}) \|y\|_2^2 \tag{16}$$
• If $f$ is convex, then for $\alpha \in [0, 1]$ and $x, y \in \mathbb{R}^d$,
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) \tag{17}$$
• Cauchy-Schwarz inequality:
$$\langle x, y \rangle \le \|x\|_2 \|y\|_2 \tag{18}$$
• Let $\{x_i : i \in [m]\}$ be independent random variables with $\mathbb{E} x_i = 0$ and $\mathbb{E} \|x_i\|^2 = \sigma^2$. Then
$$\mathbb{E} \Bigl\| \frac{1}{m} \sum_{i=1}^{m} x_i \Bigr\|_2^2 = \frac{\sigma^2}{m} \tag{19}$$

E.2 LEMMAS

The following lemma establishes the update rule for $\bar{x}^t$.

Lemma 7. Assume Lemma 3. Let $\Delta^{t+1}$ be the error incurred by clipping and $V_B$

$$\Delta^{t+1} := \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_R} W_{ij} (z_{j \to i}^{t+1} - x_j^{t+1/2}) + \sum_{j \in V_B} W_{ij} (z_{j \to i}^{t+1} - x_i^{t+1/2}) \Bigr). \tag{20}$$

Then the virtual iterate updates as

$$\bar{x}^{t+1} = \bar{x}^t - \eta \bar{m}^{t+1} + \Delta^{t+1}. \tag{21}$$

Proof.
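Expanding $\bar{x}^{t+1}$ with the definition of $x_i^{t+1}$ in (14) yields

$$\begin{aligned}
\bar{x}^{t+1} &= \frac{1}{|V_R|} \sum_{i \in V_R} x_i^{t+1} = \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_R} W_{ij} z_{j \to i}^{t+1} + \sum_{j \in V_B} W_{ij} z_{j \to i}^{t+1} \Bigr) \\
&= \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_R} W_{ij} (z_{j \to i}^{t+1} - x_j^{t+1/2}) + \sum_{j \in V_R} W_{ij} x_j^{t+1/2} \Bigr) \\
&\quad + \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_B} W_{ij} (z_{j \to i}^{t+1} - x_i^{t+1/2}) + \sum_{j \in V_B} W_{ij} x_i^{t+1/2} \Bigr).
\end{aligned}$$

Reorganizing the terms to form $\Delta^{t+1}$ gives

$$\begin{aligned}
\bar{x}^{t+1} &= \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_R} W_{ij} x_j^{t+1/2} + \sum_{j \in V_B} W_{ij} x_i^{t+1/2} \Bigr) + \Delta^{t+1} \\
&= \frac{1}{|V_R|} \sum_{j \in V_R} (1 - \delta_j) x_j^{t+1/2} + \frac{1}{|V_R|} \sum_{i \in V_R} \delta_i x_i^{t+1/2} + \Delta^{t+1} \\
&= \frac{1}{|V_R|} \sum_{i \in V_R} x_i^{t+1/2} + \Delta^{t+1} \\
&= \frac{1}{|V_R|} \sum_{i \in V_R} (x_i^t - \eta m_i^{t+1}) + \Delta^{t+1} \\
&= \bar{x}^t - \eta \bar{m}^{t+1} + \Delta^{t+1}.
\end{aligned}$$

Note that $\Delta^{t+1}$ can be written as follows

$$\Delta^{t+1} = \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl( x_i^{t+1} - \sum_{j \in V_R} W_{ij} x_j^{t+1/2} \Bigr) = \bar{x}^{t+1} - \frac{1}{|V_R|} \sum_{i \in V_R} x_i^{t+1/2},$$

which measures the error introduced to $\bar{x}^{t+1}$ by the Byzantine workers and clipping. Therefore, when $V_B = \emptyset$ and τ is sufficiently large, $\Delta^{t+1} = 0$ and $\bar{x}^{t+1}$ converges at the same rate as centralized SGD with momentum.

Recall that $e_1^{t+1} := \mathbb{E} \|\nabla f(\bar{x}^t) - \bar{m}^{t+1}\|_2^2$. The key descent lemma is stated as follows.

Lemma 8 (Sufficient decrease). Assume (A4) and $\eta \le \frac{1}{2L}$, then

$$\mathbb{E} f(\bar{x}^{t+1}) \le f(\bar{x}^t) - \frac{\eta}{2} \|\nabla f(\bar{x}^t)\|_2^2 - \frac{\eta}{4} \mathbb{E} \Bigl\| \bar{m}^{t+1} - \frac{1}{\eta} \Delta^{t+1} \Bigr\|_2^2 + \eta e_1^{t+1} + \frac{1}{\eta} e_2^{t+1}.$$

Proof. Use smoothness (A4) and expand it with (21)

$$f(\bar{x}^{t+1}) \le f(\bar{x}^t) - \langle \nabla f(\bar{x}^t), \eta \bar{m}^{t+1} - \Delta^{t+1} \rangle + \frac{L}{2} \|\eta \bar{m}^{t+1} - \Delta^{t+1}\|_2^2 .$$

Applying the cosine theorem (15) to the inner product $\eta \langle \nabla f(\bar{x}^t), \bar{m}^{t+1} - \frac{1}{\eta} \Delta^{t+1} \rangle$ yields

$$\mathbb{E} f(\bar{x}^{t+1}) \le f(\bar{x}^t) - \frac{\eta}{2} \|\nabla f(\bar{x}^t)\|_2^2 - \frac{\eta - L\eta^2}{2} \mathbb{E} \Bigl\| \bar{m}^{t+1} - \frac{1}{\eta} \Delta^{t+1} \Bigr\|_2^2 + \frac{\eta}{2} \mathbb{E} \Bigl\| \nabla f(\bar{x}^t) - \bar{m}^{t+1} + \frac{1}{\eta} \Delta^{t+1} \Bigr\|_2^2 .$$

If the step size satisfies $\eta \le \frac{1}{2L}$, then $-\frac{\eta - L\eta^2}{2} \le -\frac{\eta}{4}$. Applying inequality (16) to the last term gives

$$\frac{\eta}{2} \mathbb{E} \Bigl\| \nabla f(\bar{x}^t) - \bar{m}^{t+1} + \frac{1}{\eta} \Delta^{t+1} \Bigr\|_2^2 \le \eta\, \mathbb{E} \|\nabla f(\bar{x}^t) - \bar{m}^{t+1}\|_2^2 + \frac{1}{\eta} \mathbb{E} \|\Delta^{t+1}\|_2^2 .$$

Since $e_1^{t+1} := \mathbb{E} \|\nabla f(\bar{x}^t) - \bar{m}^{t+1}\|_2^2$ and $\mathbb{E} \|\Delta^{t+1}\|_2^2 \le e_2^{t+1}$, we have

$$\mathbb{E} f(\bar{x}^{t+1}) \le f(\bar{x}^t) - \frac{\eta}{2} \|\nabla f(\bar{x}^t)\|_2^2 - \frac{\eta}{4} \mathbb{E} \Bigl\| \bar{m}^{t+1} - \frac{1}{\eta} \Delta^{t+1} \Bigr\|_2^2 + \eta e_1^{t+1} + \frac{1}{\eta} e_2^{t+1}.$$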
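In the next lemma, we establish the recursion for the distance between momentums and gradients.

Lemma 9. Assume (A3), (A4), and Lemma 3. For any doubly stochastic mixing matrix $A \in \mathbb{R}^{n \times n}$, define

$$e_A^{t+1} = \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} A_{ij} (m_j^{t+1} - \nabla f_j(\bar{x}^t)) \Bigr\|_2^2 .$$

Then we have the following recursion

$$e_A^{t+1} \le (1 - \alpha) e_A^t + \frac{\alpha^2 \sigma^2}{|V_R|} \|A\|_{F,V_R}^2 + 2\alpha L^2 \Xi^t + \frac{2 L^2 \eta^2}{\alpha} \Bigl\| \bar{m}^t - \frac{1}{\eta} \Delta^t \Bigr\|_2^2 \tag{22}$$

where we define $\|A\|_{F,V_R}^2 := \sum_{i \in V_R} \sum_{j \in V_R} A_{ij}^2$. Therefore,
• If $A_{ij} = \frac{1}{|V_R|}$ for all $i, j \in V_R$, then $e_A^{t+1} = e_1^{t+1}$ and $\|A\|_{F,V_R}^2 = 1$.
• If $A = W$, then $e_A^{t+1} = \bar{e}_1^{t+1}$ and $\|A\|_{F,V_R}^2 = \sum_{i \in V_R} \sum_{j \in V_R} W_{ij}^2 \le |V_R|$.
• If $A = I$, then $\|A\|_{F,V_R}^2 = |V_R|$.
In addition, $\tilde{e}_1^{t+1} \le 2 e_I^{t+1} + 2\zeta^2$, where $e_I^{t+1}$ corresponds to $A = I$.

Proof. We can expand $e_A^{t+1}$ by expanding $m_j^{t+1}$ with (11)

$$\begin{aligned}
e_A^{t+1} &= \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} A_{ij} \bigl( (1 - \alpha) m_j^t + \alpha g_j(x_j^t) - \nabla f_j(\bar{x}^t) \bigr) \Bigr\|_2^2 \\
&= \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} A_{ij} \bigl( (1 - \alpha) m_j^t + \alpha (g_j(x_j^t) \pm \nabla f_j(x_j^t)) - \nabla f_j(\bar{x}^t) \bigr) \Bigr\|_2^2 .
\end{aligned}$$

Extract the stochastic term $g_j(x_j^t) - \nabla f_j(x_j^t)$ inside the norm and use that $\mathbb{E}\, g_j(x_j^t) = \nabla f_j(x_j^t)$:

$$\begin{aligned}
e_A^{t+1} &= \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl\| \sum_{j \in V_R} A_{ij} \bigl( (1 - \alpha) m_j^t + \alpha \nabla f_j(x_j^t) - \nabla f_j(\bar{x}^t) \bigr) \Bigr\|_2^2 + \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} A_{ij}\, \alpha (g_j(x_j^t) - \nabla f_j(x_j^t)) \Bigr\|_2^2 \\
&\le \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl\| \sum_{j \in V_R} A_{ij} \bigl( (1 - \alpha) m_j^t + \alpha \nabla f_j(x_j^t) - \nabla f_j(\bar{x}^t) \bigr) \Bigr\|_2^2 + \frac{\alpha^2}{|V_R|} \sum_{i \in V_R} \sum_{j \in V_R} A_{ij}^2\, \mathbb{E} \|g_j(x_j^t) - \nabla f_j(x_j^t)\|_2^2 .
\end{aligned}$$

Then we can use (A3) for the last term to get

$$e_A^{t+1} \le \frac{1}{|V_R|} \sum_{i \in V_R} \Bigl\| \sum_{j \in V_R} A_{ij} \bigl( (1 - \alpha) m_j^t + \alpha \nabla f_j(x_j^t) - \nabla f_j(\bar{x}^t) \bigr) \Bigr\|_2^2 + \frac{\alpha^2 \sigma^2}{|V_R|} \|A\|_{F,V_R}^2 .$$

Then we insert $\pm (1 - \alpha) \nabla f_j(\bar{x}^{t-1})$ inside the first norm and expand using (17)

$$\begin{aligned}
e_A^{t+1} &\le \frac{1 - \alpha}{|V_R|} \sum_{i \in V_R} \Bigl\| \sum_{j \in V_R} A_{ij} (m_j^t - \nabla f_j(\bar{x}^{t-1})) \Bigr\|_2^2 + \frac{\alpha^2 \sigma^2}{|V_R|} \|A\|_{F,V_R}^2 \\
&\quad + \frac{\alpha}{|V_R|} \sum_{i \in V_R} \Bigl\| \sum_{j \in V_R} A_{ij} \Bigl( \nabla f_j(x_j^t) - \nabla f_j(\bar{x}^t) + \frac{1 - \alpha}{\alpha} (\nabla f_j(\bar{x}^{t-1}) - \nabla f_j(\bar{x}^t)) \Bigr) \Bigr\|_2^2 .
\end{aligned}$$

Note that the first term is $e_A^t$. By the convexity of $\|\cdot\|_2^2$ for the last term we have

$$e_A^{t+1} \le (1 - \alpha) e_A^t + \frac{\alpha^2 \sigma^2}{|V_R|} \|A\|_{F,V_R}^2 + \frac{\alpha}{|V_R|} \sum_{j \in V_R} \Bigl\| \nabla f_j(x_j^t) - \nabla f_j(\bar{x}^t) + \frac{1 - \alpha}{\alpha} (\nabla f_j(\bar{x}^{t-1}) - \nabla f_j(\bar{x}^t)) \Bigr\|_2^2 .$$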
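Then we can further expand the last term

$$e_A^{t+1} \le (1 - \alpha) e_A^t + \frac{\alpha^2 \sigma^2}{|V_R|} \|A\|_{F,V_R}^2 + \frac{2\alpha}{|V_R|} \sum_{j \in V_R} \|\nabla f_j(x_j^t) - \nabla f_j(\bar{x}^t)\|_2^2 + \frac{2 (1 - \alpha)^2}{\alpha |V_R|} \sum_{j \in V_R} \|\nabla f_j(\bar{x}^{t-1}) - \nabla f_j(\bar{x}^t)\|_2^2 .$$

Then we can apply smoothness (A4) and use $(1 - \alpha)^2 \le 1$

$$e_A^{t+1} \le (1 - \alpha) e_A^t + \frac{\alpha^2 \sigma^2}{|V_R|} \|A\|_{F,V_R}^2 + 2\alpha L^2 \Xi^t + \frac{2 L^2 \eta^2}{\alpha} \Bigl\| \bar{m}^t - \frac{1}{\eta} \Delta^t \Bigr\|_2^2 .$$

Besides, consider $\tilde{e}_1^{t+1}$:

$$\begin{aligned}
\tilde{e}_1^{t+1} &= \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|m_i^{t+1} - \nabla f(\bar{x}^t)\|_2^2 = \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|m_i^{t+1} \pm \nabla f_i(\bar{x}^t) - \nabla f(\bar{x}^t)\|_2^2 \\
&\le \frac{2}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|m_i^{t+1} - \nabla f_i(\bar{x}^t)\|_2^2 + \frac{2}{|V_R|} \sum_{i \in V_R} \|\nabla f_i(\bar{x}^t) - \nabla f(\bar{x}^t)\|_2^2 \\
&\le 2 e_I^{t+1} + 2\zeta^2 .
\end{aligned}$$

Since $\|\Delta^{t+1}\|_2^2 \le e_2^{t+1}$, it remains to bound $e_2^{t+1}$.

Lemma 10 (Bound on $e_2^{t+1}$). For $\delta_{\max} := \max_{i \in V_R} \delta_i$, if

$$\tau_i^{t+1} = \sqrt{\frac{1}{\delta_i} \sum_{j \in V_R} W_{ij}\, \mathbb{E} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2},$$

then $e_2^{t+1} \le 32 \delta_{\max} \bigl( 2\eta^2 (e_I^{t+1} + \zeta^2) + \Xi^t \bigr)$.

Proof. By Young's inequality (16) with ε = 1,

$$\begin{aligned}
e_2^{t+1} &= \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} W_{ij} (z_{j \to i}^{t+1} - x_j^{t+1/2}) + \sum_{j \in V_B} W_{ij} (z_{j \to i}^{t+1} - x_i^{t+1/2}) \Bigr\|_2^2 \\
&\le \underbrace{\frac{2}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} W_{ij} (z_{j \to i}^{t+1} - x_j^{t+1/2}) \Bigr\|_2^2}_{=: A_1} + \underbrace{\frac{2}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_B} W_{ij} (z_{j \to i}^{t+1} - x_i^{t+1/2}) \Bigr\|_2^2}_{=: A_2} .
\end{aligned}$$

For the first term, use the triangle inequality of $\|\cdot\|$ and the definition of $\tau_i^{t+1}$

$$A_1 \le \frac{2}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_R} W_{ij}\, \mathbb{E} \|z_{j \to i}^{t+1} - x_j^{t+1/2}\|_2 \Bigr)^2 \le \frac{2}{|V_R|} \sum_{i \in V_R} \Bigl( \frac{1}{\tau_i^{t+1}} \sum_{j \in V_R} W_{ij}\, \mathbb{E} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2 \Bigr)^2 .$$

The second inequality holds because we can consider two cases of $z_{j \to i}^{t+1}$ for all $j \in V_R$:
• If $\|x_i^{t+1/2} - x_j^{t+1/2}\|_2 \le \tau_i^{t+1}$, then CLIP has no effect and therefore $z_{j \to i}^{t+1} = x_j^{t+1/2}$, so
$$0 = \|z_{j \to i}^{t+1} - x_j^{t+1/2}\|_2 \le \frac{1}{\tau_i^{t+1}} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2 .$$
• If $\|x_i^{t+1/2} - x_j^{t+1/2}\|_2 > \tau_i^{t+1}$, then $z_{j \to i}^{t+1}$ sits between $x_j^{t+1/2}$ and $x_i^{t+1/2}$ with
$$\|z_{j \to i}^{t+1} - x_j^{t+1/2}\|_2 + \tau_i^{t+1} = \|x_i^{t+1/2} - x_j^{t+1/2}\|_2 .$$
Therefore, using the inequality $a - \tau \le \frac{a^2}{\tau}$ for $a > 0$, we have
$$\|z_{j \to i}^{t+1} - x_j^{t+1/2}\|_2 = \|x_i^{t+1/2} - x_j^{t+1/2}\|_2 - \tau_i^{t+1} \le \frac{1}{\tau_i^{t+1}} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2 .$$
This justifies the second inequality.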
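On the other hand,

$$A_2 \le \frac{2}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_B} W_{ij}\, \mathbb{E} \|z_{j \to i}^{t+1} - x_i^{t+1/2}\|_2 \Bigr)^2 \le \frac{2}{|V_R|} \sum_{i \in V_R} \Bigl( \sum_{j \in V_B} W_{ij}\, \tau_i^{t+1} \Bigr)^2 = \frac{2}{|V_R|} \sum_{i \in V_R} \delta_i^2 (\tau_i^{t+1})^2 .$$

Then we minimize the right-hand side of the bound on $e_2^{t+1}$ by tuning the clipping radius

$$\tau_i^{t+1} = \sqrt{\frac{1}{\delta_i} \sum_{j \in V_R} W_{ij}\, \mathbb{E} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2} .$$

Then we come to the following bound

$$e_2^{t+1} \le \frac{4}{|V_R|} \sum_{i \in V_R} \delta_i \sum_{j \in V_R} W_{ij}\, \mathbb{E} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2 .$$

Then we expand the norm as follows

$$\begin{aligned}
\mathbb{E} \|x_i^{t+1/2} - x_j^{t+1/2}\|_2^2 &= \mathbb{E} \|x_i^t - \eta m_i^{t+1} - x_j^t + \eta m_j^{t+1}\|_2^2 \\
&= \mathbb{E} \|x_i^t \pm \bar{x}^t - x_j^t + \eta m_j^{t+1} \pm \eta \nabla f(\bar{x}^t) - \eta m_i^{t+1}\|_2^2 \\
&\le 4\eta^2\, \mathbb{E} \|m_i^{t+1} - \nabla f(\bar{x}^t)\|_2^2 + 4\eta^2\, \mathbb{E} \|m_j^{t+1} - \nabla f(\bar{x}^t)\|_2^2 + 4 \|x_i^t - \bar{x}^t\|_2^2 + 4 \|x_j^t - \bar{x}^t\|_2^2 .
\end{aligned} \tag{23}$$

Using the fact that $\sum_{j \in V_R} W_{ij} = 1 - \delta_i$, we have

$$\begin{aligned}
e_2^{t+1} &\le \frac{16\eta^2}{|V_R|} \sum_{i \in V_R} \delta_i (1 - \delta_i)\, \mathbb{E} \|m_i^{t+1} - \nabla f(\bar{x}^t)\|_2^2 + \frac{16\eta^2}{|V_R|} \sum_{j \in V_R} \sum_{i \in V_R} \delta_i W_{ij}\, \mathbb{E} \|m_j^{t+1} - \nabla f(\bar{x}^t)\|_2^2 \\
&\quad + \frac{16}{|V_R|} \sum_{i \in V_R} \delta_i (1 - \delta_i) \|x_i^t - \bar{x}^t\|_2^2 + \frac{16}{|V_R|} \sum_{j \in V_R} \sum_{i \in V_R} \delta_i W_{ij} \|x_j^t - \bar{x}^t\|_2^2 .
\end{aligned}$$

Using the fact that $\delta_i \le \delta_{\max}$ and $1 - \delta_i \le 1$ for all $i \in V_R$,

$$e_2^{t+1} \le 32 \delta_{\max} \bigl( 2\eta^2 (e_I^{t+1} + \zeta^2) + \Xi^t \bigr).$$

Theorem I. Let $\bar{x} := \frac{1}{|V_R|} \sum_{i \in V_R} x_i$ be the average iterate over the unknown set of regular nodes, with

$$\tau_i = \sqrt{\frac{1}{\delta_i} \sum_{j \in V_R} W_{ij}\, \mathbb{E} \|x_i - x_j\|_2^2} .$$

If the initial consensus distance is bounded as $\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i - \bar{x}\|^2 \le \rho^2$, then for all $i \in V_R$, the output $\hat{x}_i$ of CLIPPEDGOSSIP satisfies

$$\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|\hat{x}_i - \bar{x}\|^2 \le \bigl( 1 - \gamma + c \sqrt{\delta_{\max}} \bigr)^2 \rho^2$$

where the expectation is over the random variable $\{x_i\}_{i \in V_R}$ and $c > 0$ is a constant.

Proof. We can consider the 1-step consensus problem as 1 step of the optimization problem with $\rho^2 = \Xi^t$ and η = 0. Then we look for the upper bound of $\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^{t+1} - \bar{x}^t\|_2^2$ in terms of $\rho^2$, $p$, and $\delta_{\max}$:

$$\begin{aligned}
\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^{t+1} - \bar{x}^t\|_2^2 &= \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} - \bar{x}^t \Bigr\|_2^2 \\
&= \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \Bigl( \sum_{j \in V_R} W_{ij} x_j^t - \bar{x}^t \Bigr) + \Bigl( \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} - \sum_{j \in V_R} W_{ij} x_j^t \Bigr) \Bigr\|_2^2 .
\end{aligned}$$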
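Apply (16) with ε > 0 and use the expected improvement Lemma 4

$$\begin{aligned}
\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^{t+1} - \bar{x}^t\|_2^2 &\le \frac{1 + \varepsilon}{|V_R|} \sum_{i \in V_R} \Bigl\| \sum_{j \in V_R} W_{ij} x_j^t - \bar{x}^t \Bigr\|_2^2 + \frac{1 + \frac{1}{\varepsilon}}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} - \sum_{j \in V_R} W_{ij} x_j^t \Bigr\|_2^2 \\
&\le \frac{(1 + \varepsilon)(1 - p)}{|V_R|} \sum_{i \in V_R} \|x_i^t - \bar{x}^t\|_2^2 + \frac{1 + \frac{1}{\varepsilon}}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} - \sum_{j \in V_R} W_{ij} x_j^t \Bigr\|_2^2 \\
&\le (1 + \varepsilon)(1 - p) \Xi^t + \frac{1 + \frac{1}{\varepsilon}}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} - \sum_{j \in V_R} W_{ij} x_j^t \Bigr\|_2^2 .
\end{aligned}$$

Replace $x_j^t = x_j^{t+1/2} + \eta m_j^{t+1}$ using (12), then apply (18) and η = 0

$$\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^{t+1} - \bar{x}^t\|_2^2 \le (1 + \varepsilon)(1 - p) \Xi^t + \frac{1 + \frac{1}{\varepsilon}}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} - \sum_{j \in V_R} W_{ij} x_j^{t+1/2} \Bigr\|_2^2 .$$

Recall the definition of $e_2^{t+1}$

$$e_2^{t+1} := \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j=1}^{n} W_{ij} z_{j \to i}^{t+1} - \sum_{j \in V_R} W_{ij} x_j^{t+1/2} \Bigr\|_2^2 .$$

Then use Lemma 9 with the case $A = W$ and apply Lemma 10 with η = 0

$$\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^{t+1} - \bar{x}^t\|_2^2 \le (1 + \varepsilon)(1 - p) \Xi^t + \Bigl(1 + \frac{1}{\varepsilon}\Bigr) e_2^{t+1} \le (1 + \varepsilon)(1 - p) \Xi^t + \Bigl(1 + \frac{1}{\varepsilon}\Bigr) 32 \delta_{\max} \Xi^t .$$

We minimize the right-hand side of the above inequality by taking ε such that $\varepsilon (1 - p) = \frac{32 \delta_{\max}}{\varepsilon}$, which leads to $\varepsilon = \sqrt{\frac{32 \delta_{\max}}{1 - p}}$. The above inequality then becomes

$$\frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \|x_i^{t+1} - \bar{x}^t\|_2^2 \le \bigl( 1 - p + 32 \delta_{\max} + 2 \sqrt{32 \delta_{\max} (1 - p)} \bigr) \Xi^t = \bigl( \sqrt{1 - p} + \sqrt{32 \delta_{\max}} \bigr)^2 \Xi^t .$$

The consensus distance to the average consensus is only guaranteed to reduce if $\sqrt{1 - p} + \sqrt{32 \delta_{\max}} < 1$, that is, $\delta_{\max} < \frac{1}{32} (1 - \sqrt{1 - p})^2$. Finally, we complete the proof by simplifying the notation with the spectral gap $\gamma := 1 - \sqrt{1 - p}$.

Recall that

$$e_2^{t+1} := \frac{1}{|V_R|} \sum_{i \in V_R} \mathbb{E} \Bigl\| \sum_{j \in V_R} W_{ij} (z_{j \to i}^{t+1} - x_j^{t+1/2}) + \sum_{j \in V_B} W_{ij} (z_{j \to i}^{t+1} - x_i^{t+1/2}) \Bigr\|_2^2 .$$

Next we consider the bound on the consensus distance $\Xi^t$.

Lemma 11 (Bound consensus distance $\Xi^t$). Assume Lemma 4, then $\Xi^t$ has the following iteration

$$\Xi^{t+1} \le (1 + \varepsilon)(1 - p) \Xi^t + c_2 \Bigl(1 + \frac{1}{\varepsilon}\Bigr) \Bigl( e_2^{t+1} + \eta^2 \bar{e}_1^{t+1} + \eta^2 \zeta^2 + \eta^2 \|\nabla f(\bar{x}^t)\|_2^2 + \eta^2\, \mathbb{E} \Bigl\| \bar{m}^{t+1} - \frac{1}{\eta} \Delta^{t+1} \Bigr\|_2^2 \Bigr)$$

where ε > 0 is determined later such that $(1 + \varepsilon)(1 - p) < 1$ and $c_2 = 5$.

Proof.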
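Expand the consensus distance at time $t+1$:

$$
\begin{aligned}
\Xi_{t+1} &= \frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\big\|x^{t+1}_i-\bar{x}^{t+1}\big\|_2^2
= \frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\bar{x}^{t+1}\Big\|_2^2 \\
&= \frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\bar{x}^t+\bar{x}^t-\bar{x}^{t+1}\Big\|_2^2 \\
&= \frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\Big(\sum_{j\in\mathcal{V}_R} W_{ij}x^t_j-\bar{x}^t\Big)+\Big(\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\sum_{j\in\mathcal{V}_R} W_{ij}x^t_j\Big)+\bar{x}^t-\bar{x}^{t+1}\Big\|_2^2 .
\end{aligned}
$$

Apply Young's inequality (16) with coefficient $\varepsilon$, as in the proof of Theorem I, and use the expected improvement Lemma 4:

$$
\begin{aligned}
\Xi_{t+1} &\le \frac{1+\varepsilon}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\Big\|\sum_{j\in\mathcal{V}_R} W_{ij}x^t_j-\bar{x}^t\Big\|_2^2
+ \frac{1+\varepsilon}{\varepsilon|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\sum_{j\in\mathcal{V}_R} W_{ij}x^t_j+\bar{x}^t-\bar{x}^{t+1}\Big\|_2^2 \\
&\le \frac{(1+\varepsilon)(1-p)}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\big\|x^t_i-\bar{x}^t\big\|_2^2
+ \frac{1+\varepsilon}{\varepsilon|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\sum_{j\in\mathcal{V}_R} W_{ij}x^t_j+\bar{x}^t-\bar{x}^{t+1}\Big\|_2^2 \\
&\le (1+\varepsilon)(1-p)\,\Xi_t
+ \underbrace{\frac{1+\varepsilon}{\varepsilon|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\Big(\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\sum_{j\in\mathcal{V}_R} W_{ij}x^t_j\Big)+\bar{x}^t-\bar{x}^{t+1}\Big\|_2^2}_{=:T_1} .
\end{aligned}
$$

Replace $x^t_j = x^{t+1/2}_j+\eta m^{t+1}_j$ using (12), then apply (18):

$$
\begin{aligned}
T_1 &= \frac{1+\varepsilon}{\varepsilon|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\sum_{j\in\mathcal{V}_R} W_{ij}x^{t+1/2}_j-\eta\sum_{j\in\mathcal{V}_R} W_{ij}m^{t+1}_j+\bar{x}^t-\bar{x}^{t+1}\Big\|_2^2 \\
&\le 5\,\frac{1+\varepsilon}{\varepsilon}\Bigg(\frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\sum_{j\in\mathcal{V}_R} W_{ij}x^{t+1/2}_j\Big\|_2^2
+ \frac{\eta^2}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j\in\mathcal{V}_R} W_{ij}\big(m^{t+1}_j-\nabla f_j(\bar{x}^t)\big)\Big\|_2^2 \\
&\qquad\qquad + \frac{\eta^2}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\Big\|\sum_{j\in\mathcal{V}_R} W_{ij}\nabla f_j(\bar{x}^t)-\nabla f(\bar{x}^t)\Big\|_2^2
+ \eta^2\big\|\nabla f(\bar{x}^t)\big\|_2^2 + \mathbb{E}\big\|\bar{x}^t-\bar{x}^{t+1}\big\|_2^2\Bigg).
\end{aligned}
$$

Recall the definition of $e^{t+1}_2$:

$$
\begin{aligned}
e^{t+1}_2 &:= \frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j\in\mathcal{V}_R} W_{ij}\big(z^{t+1}_{j\to i}-x^{t+1/2}_j\big)+\sum_{j\in\mathcal{V}_B} W_{ij}\big(z^{t+1}_{j\to i}-x^{t+1/2}_i\big)\Big\|_2^2 \\
&= \frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\mathbb{E}\Big\|\sum_{j=1}^n W_{ij}z^{t+1}_{j\to i}-\sum_{j\in\mathcal{V}_R} W_{ij}x^{t+1/2}_j\Big\|_2^2 .
\end{aligned}
$$

Then use Lemma 9 with the case $A = W$:

$$
T_1 \le 5\Big(1+\frac{1}{\varepsilon}\Big)\Bigg(e^{t+1}_2+\eta^2\bar{e}^{t+1}_1+\frac{\eta^2}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\Big\|\sum_{j\in\mathcal{V}_R} W_{ij}\nabla f_j(\bar{x}^t)-\nabla f(\bar{x}^t)\Big\|_2^2+\eta^2\big\|\nabla f(\bar{x}^t)\big\|_2^2+\mathbb{E}\big\|\bar{x}^t-\bar{x}^{t+1}\big\|_2^2\Bigg).
$$

Using the convexity of $\|\cdot\|_2^2$ and (A3), we have

$$
T_1 \le 5\Big(1+\frac{1}{\varepsilon}\Big)\Big(e^{t+1}_2+\eta^2\bar{e}^{t+1}_1+\eta^2\zeta^2+\eta^2\big\|\nabla f(\bar{x}^t)\big\|_2^2+\mathbb{E}\big\|\bar{x}^t-\bar{x}^{t+1}\big\|_2^2\Big).
$$

Use (21) for the last term:

$$
T_1 \le 5\Big(1+\frac{1}{\varepsilon}\Big)\Big(e^{t+1}_2+\eta^2\bar{e}^{t+1}_1+\eta^2\zeta^2+\eta^2\big\|\nabla f(\bar{x}^t)\big\|_2^2+\eta^2\,\mathbb{E}\Big\|\bar{m}^{t+1}-\frac{1}{\eta}\Delta^{t+1}\Big\|_2^2\Big).
$$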
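Finally, by the definition of $\tilde{e}^{t+1}_1$, we have

$$
\Xi_{t+1} \le (1+\varepsilon)(1-p)\,\Xi_t + 5\Big(1+\frac{1}{\varepsilon}\Big)\Big(e^{t+1}_2+\eta^2\bar{e}^{t+1}_1+\eta^2\zeta^2+\eta^2\big\|\nabla f(\bar{x}^t)\big\|_2^2+\eta^2\,\mathbb{E}\Big\|\bar{m}^{t+1}-\frac{1}{\eta}\Delta^{t+1}\Big\|_2^2\Big).
$$

Lemma 12 (Tuning stepsize). Suppose the following holds for any step size $\eta \le \frac{1}{d}$:

$$
\Psi_T \le \frac{r_0}{\eta(T+1)} + b\eta + e\eta^2 + f\eta^3 .
$$

Then there exists a step size $\eta \le \frac{1}{d}$ such that

$$
\Psi_T \le 2\Big(\frac{br_0}{T+1}\Big)^{1/2} + 2e^{1/3}\Big(\frac{r_0}{T+1}\Big)^{2/3} + 2f^{1/4}\Big(\frac{r_0}{T+1}\Big)^{3/4} + \frac{dr_0}{T+1} .
$$

Proof. Choosing

$$
\eta = \min\Big\{\Big(\frac{r_0}{b(T+1)}\Big)^{1/2},\ \Big(\frac{r_0}{e(T+1)}\Big)^{1/3},\ \Big(\frac{r_0}{f(T+1)}\Big)^{1/4},\ \frac{1}{d}\Big\} \le \frac{1}{d},
$$

we have four cases.

• $\eta = \frac{1}{d}$ and is smaller than $\big(\frac{r_0}{b(T+1)}\big)^{1/2}$, $\big(\frac{r_0}{e(T+1)}\big)^{1/3}$, $\big(\frac{r_0}{f(T+1)}\big)^{1/4}$. Then

$$
\Psi_T \le \frac{dr_0}{T+1}+\frac{b}{d}+\frac{e}{d^2}+\frac{f}{d^3}
\le \frac{dr_0}{T+1}+\Big(\frac{br_0}{T+1}\Big)^{1/2}+e^{1/3}\Big(\frac{r_0}{T+1}\Big)^{2/3}+f^{1/4}\Big(\frac{r_0}{T+1}\Big)^{3/4} .
$$

• $\eta = \big(\frac{r_0}{b(T+1)}\big)^{1/2} < \min\big\{\big(\frac{r_0}{e(T+1)}\big)^{1/3}, \big(\frac{r_0}{f(T+1)}\big)^{1/4}\big\}$. Then

$$
\Psi_T \le 2\Big(\frac{br_0}{T+1}\Big)^{1/2}+e\,\frac{r_0}{b(T+1)}+f\Big(\frac{r_0}{b(T+1)}\Big)^{3/2}
\le 2\Big(\frac{br_0}{T+1}\Big)^{1/2}+e^{1/3}\Big(\frac{r_0}{T+1}\Big)^{2/3}+f^{1/4}\Big(\frac{r_0}{T+1}\Big)^{3/4} .
$$

• $\eta = \big(\frac{r_0}{e(T+1)}\big)^{1/3} < \min\big\{\big(\frac{r_0}{b(T+1)}\big)^{1/2}, \big(\frac{r_0}{f(T+1)}\big)^{1/4}\big\}$. Then

$$
\Psi_T \le 2e^{1/3}\Big(\frac{r_0}{T+1}\Big)^{2/3}+b\Big(\frac{r_0}{e(T+1)}\Big)^{1/3}+f\,\frac{r_0}{e(T+1)}
\le \Big(\frac{br_0}{T+1}\Big)^{1/2}+2e^{1/3}\Big(\frac{r_0}{T+1}\Big)^{2/3}+f^{1/4}\Big(\frac{r_0}{T+1}\Big)^{3/4} .
$$

• $\eta = \big(\frac{r_0}{f(T+1)}\big)^{1/4} < \min\big\{\big(\frac{r_0}{b(T+1)}\big)^{1/2}, \big(\frac{r_0}{e(T+1)}\big)^{1/3}\big\}$. Then

$$
\Psi_T \le 2f^{1/4}\Big(\frac{r_0}{T+1}\Big)^{3/4}+b\Big(\frac{r_0}{f(T+1)}\Big)^{1/4}+e\Big(\frac{r_0}{f(T+1)}\Big)^{1/2}
\le \Big(\frac{br_0}{T+1}\Big)^{1/2}+e^{1/3}\Big(\frac{r_0}{T+1}\Big)^{2/3}+2f^{1/4}\Big(\frac{r_0}{T+1}\Big)^{3/4} .
$$

Taking the uniform upper bound over the four cases gives the result.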
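The step-size choice in the proof of Lemma 12 can be checked numerically. The following is a minimal sketch, not code from the paper: the function names `tuned_step_size`, `psi_upper`, and `lemma12_bound` are hypothetical, and the sketch assumes that all of $r_0, b, e, f, d$ are strictly positive. It evaluates the right-hand side $\frac{r_0}{\eta(T+1)} + b\eta + e\eta^2 + f\eta^3$ at the chosen $\eta$ and compares it with the bound stated in the lemma.

```python
def tuned_step_size(r0, b, e, f, d, T):
    """Step size from the proof of Lemma 12: the minimum of four candidates.

    Assumes r0, b, e, f, d > 0 (an assumption of this sketch).
    """
    n = T + 1
    candidates = [
        (r0 / (b * n)) ** (1 / 2),
        (r0 / (e * n)) ** (1 / 3),
        (r0 / (f * n)) ** (1 / 4),
        1 / d,
    ]
    return min(candidates)


def psi_upper(eta, r0, b, e, f, T):
    """Right-hand side of the assumed inequality on Psi_T for a given eta."""
    return r0 / (eta * (T + 1)) + b * eta + e * eta**2 + f * eta**3


def lemma12_bound(r0, b, e, f, d, T):
    """Bound claimed by Lemma 12 for the tuned step size."""
    s = r0 / (T + 1)
    return (
        2 * (b * s) ** (1 / 2)
        + 2 * e ** (1 / 3) * s ** (2 / 3)
        + 2 * f ** (1 / 4) * s ** (3 / 4)
        + d * s
    )


if __name__ == "__main__":
    # Illustrative constants only; they are not taken from the paper.
    r0, b, e, f, d, T = 1.0, 1.0, 1.0, 1.0, 1.0, 99
    eta = tuned_step_size(r0, b, e, f, d, T)
    print("eta:", eta)
    print("rhs at eta:", psi_upper(eta, r0, b, e, f, T))
    print("lemma bound:", lemma12_bound(r0, b, e, f, d, T))
```

In Theorem III the same recipe is applied with $b$, $e$, $f$ read off from the coefficients of $\eta$, $\eta^2$, $\eta^3$ in the bound on $C_1$.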

E.3 PROOF OF THE MAIN THEOREM

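Theorem III. Suppose Assumptions 1-4 hold and $\delta_{\max} = O(\gamma^2)$. Define the clipping radius as

$$
\tau^{t+1}_i = \sqrt{\frac{1}{\delta_i}\sum_{j\in\mathcal{V}_R} W_{ij}\,\mathbb{E}\big\|x^{t+1/2}_i-x^{t+1/2}_j\big\|_2^2}\, .
$$

Then for $\alpha := 3\eta L$, the iterates of Algorithm 1 satisfy

$$
\begin{aligned}
\frac{1}{T+1}\sum_{t=0}^{T}\big\|\nabla f(\bar{x}^t)\big\|_2^2 \le{}& \frac{200c_1c_2}{\gamma^2}\delta_{\max}\zeta^2
+ 2\Big(\frac{3}{2|\mathcal{V}_R|}+\frac{320c_1c_2}{\gamma^2}\delta_{\max}\Big)^{1/2}\Big(\frac{3L\sigma^2r_0}{T+1}\Big)^{1/2} \\
&+ 2\Big(\frac{48c_2}{\gamma^2}\zeta^2\Big)^{1/3}\Big(\frac{r_0L}{T+1}\Big)^{2/3}
+ 2\Big(\frac{144c_2}{\gamma^2}\sigma^2\Big)^{1/4}\Big(\frac{r_0L}{T+1}\Big)^{3/4}
+ \frac{d_0r_0}{T+1},
\end{aligned}
$$

where $r_0 := f(x^0)-f^\star$, $c_1 = 32$, and $c_2 = 5$. Furthermore, the consensus distance has the upper bound

$$
\frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\big\|x^t_i-\bar{x}^t\big\|_2^2 = O\Big(\frac{\zeta^2}{\gamma^2(T+1)}\Big).
$$

Remark 13. The requirement $\delta_{\max} = O(\gamma^2)$ suggests that $\delta_{\max}$ and $\gamma^2$ are of the same order. The exact constants are determined in the proof and can be tightened simply through better constants in inequalities like (23), (26). In practice CLIPPEDGOSSIP allows a high number of attackers. For example, in Figure 15, 1/6 of the workers are Byzantine and CLIPPEDGOSSIP still performs well in the non-IID setting.

Proof. Denote the averages over $t$ from $0$ to $T$ as follows:

$$
C_1 := \frac{1}{1+T}\sum_{t=0}^{T}\big\|\nabla f(\bar{x}^t)\big\|_2^2,\qquad
C_2 := \frac{1}{1+T}\sum_{t=0}^{T}\Big\|\bar{m}^{t+1}-\frac{1}{\eta}\Delta^{t+1}\Big\|_2^2,\qquad
D_1 := \frac{1}{1+T}\sum_{t=0}^{T}\Xi_{t+1} .
$$

Then we rewrite the key Lemma 8 as

$$
\big\|\nabla f(\bar{x}^t)\big\|_2^2+\frac{1}{2}\,\mathbb{E}\Big\|\bar{m}^{t+1}-\frac{1}{\eta}\Delta^{t+1}\Big\|_2^2 \le \frac{2}{\eta}(r_t-r_{t+1})+2e^{t+1}_1+\frac{2}{\eta^2}e^{t+1}_2,
$$

and further average over time $t$:

$$
C_1+\frac{1}{2}C_2 \le \frac{2r_0}{\eta(T+1)}+2E_1+\frac{2}{\eta^2}E_2,
$$

where we use $-f(x^{T+1}) \le -f^\star$. Combined with (28), this gives

$$
C_1+\frac{1}{2}C_2 \le \frac{2r_0}{\eta(T+1)}+2E_1+4c_1\delta_{\max}E_I+4c_1\delta_{\max}\zeta^2+\frac{2c_1\delta_{\max}}{\eta^2}D_1 .
\tag{29}
$$

Averaging Lemma 9 for $e^{t+1}_1$ over $t$ gives

$$
\begin{aligned}
\frac{1}{1+T}\sum_{t=0}^{T}e^{t+1}_1
&\le \frac{1-\alpha}{1+T}\sum_{t=0}^{T}e^{t}_1+2\alpha L^2D_1+\frac{\alpha^2\sigma^2}{|\mathcal{V}_R|}+\frac{2L^2\eta^2}{\alpha}\,\frac{1}{1+T}\sum_{t=0}^{T}\Big\|\bar{m}^{t}-\frac{1}{\eta}\Delta^{t}\Big\|_2^2 \\
&\le \frac{1-\alpha}{1+T}\sum_{t=0}^{T}e^{t+1}_1+2\alpha L^2D_1+\frac{\alpha^2\sigma^2}{|\mathcal{V}_R|}+\frac{2L^2\eta^2}{\alpha}C_2,
\end{aligned}
$$

where we use $\Xi_0 = e^0_1 = 0$ and $\bar{m}^0 = \Delta^0 = 0$. Then, letting $\beta_1 := \frac{2L^2\eta^2}{\alpha^2}$,

$$
E_1 \le 2L^2D_1+\frac{\alpha\sigma^2}{|\mathcal{V}_R|}+\beta_1C_2 .
\tag{30}
$$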
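Under review as a conference paper at ICLR 2023

Similarly, we apply Lemma 9 for $e^{t+1}_I$; the only difference is that we do not have the factor $\frac{1}{n}$ on $\sigma^2$:

$$
E_I \le 2L^2D_1+\alpha\sigma^2+\beta_1C_2 .
\tag{31}
$$

Similarly, with $\beta_2 := \frac{1}{|\mathcal{V}_R|}\sum_{i\in\mathcal{V}_R}\sum_{j\in\mathcal{V}_R}W_{ij}^2 \le 1$,

$$
\bar{E}_1 \le 2L^2D_1+\beta_2\alpha\sigma^2+\beta_1C_2 .
\tag{32}
$$

The consensus distance Lemma 11 gives

$$
\begin{aligned}
D_1 &\le \frac{(1+\varepsilon)(1-p)}{1+T}\sum_{t=0}^{T}\Xi_t+c_2\Big(1+\frac{1}{\varepsilon}\Big)E_2+c_2\Big(1+\frac{1}{\varepsilon}\Big)\eta^2\big(\bar{E}_1+\zeta^2+C_1+C_2\big) \\
&\le (1+\varepsilon)(1-p)D_1+c_2\Big(1+\frac{1}{\varepsilon}\Big)E_2+c_2\Big(1+\frac{1}{\varepsilon}\Big)\eta^2\big(\bar{E}_1+\zeta^2+C_1+C_2\big).
\end{aligned}
$$

Replacing $E_2$ using (28) gives

$$
\begin{aligned}
D_1 &\le (1+\varepsilon)(1-p)D_1+c_2\Big(1+\frac{1}{\varepsilon}\Big)\Big(c_1\delta_{\max}\big(2\eta^2(E_I+\zeta^2)+D_1\big)\Big)+c_2\Big(1+\frac{1}{\varepsilon}\Big)\eta^2\big(\bar{E}_1+\zeta^2+C_1+C_2\big) \\
&\le \Big((1+\varepsilon)(1-p)+c_1c_2\Big(1+\frac{1}{\varepsilon}\Big)\delta_{\max}\Big)D_1+c_2\Big(1+\frac{1}{\varepsilon}\Big)\eta^2\big(2c_1\delta_{\max}E_I+\bar{E}_1+(1+2c_1\delta_{\max})\zeta^2+C_1+C_2\big).
\end{aligned}
$$

Now replace $\bar{E}_1$ and $E_I$ with (32) and (31):

$$
\begin{aligned}
D_1 \le{}& \Big((1+\varepsilon)(1-p)+c_2\Big(1+\frac{1}{\varepsilon}\Big)\big(c_1\delta_{\max}(1+4L^2\eta^2)+2L^2\eta^2\big)\Big)D_1 \\
&+c_2\Big(1+\frac{1}{\varepsilon}\Big)\eta^2\Big((2c_1\delta_{\max}+\beta_2)\alpha\sigma^2+(2c_1\delta_{\max}+1)\zeta^2+\big((2c_1\delta_{\max}+1)\beta_1+1\big)C_2+C_1\Big).
\end{aligned}
$$

By enforcing $\eta \le \frac{\gamma}{9L}$ and $\delta_{\max} \le \frac{\gamma^2}{10c_1c_2}$ we have

$$
2c_2L^2\eta^2 \le \frac{\gamma^2}{8},\qquad c_1c_2\delta_{\max}(1+4L^2\eta^2) \le \frac{\gamma^2}{8},
$$

so that $c_1c_2\delta_{\max}(1+4L^2\eta^2)+2c_2L^2\eta^2 \le \frac{\gamma^2}{4}$. Then

$$
D_1 \le \underbrace{\Big((1+\varepsilon)(1-p)+\Big(1+\frac{1}{\varepsilon}\Big)\frac{\gamma^2}{4}\Big)}_{=:T_2}D_1+c_2\Big(1+\frac{1}{\varepsilon}\Big)\eta^2\Big((2c_1\delta_{\max}+\beta_2)\alpha\sigma^2+(2c_1\delta_{\max}+1)\zeta^2+\big((2c_1\delta_{\max}+1)\beta_1+1\big)C_2+C_1\Big).
$$

Let us minimize the coefficient of $D_1$ on the right-hand side by choosing $\varepsilon(1-p) = \frac{1}{\varepsilon}\frac{\gamma^2}{4}$, that is, $\varepsilon = \sqrt{\frac{\gamma^2}{4(1-p)}}$. Then the coefficient becomes

$$
T_2 = (1+\varepsilon)(1-p)+\Big(1+\frac{1}{\varepsilon}\Big)\frac{\gamma^2}{4} = \Big(\sqrt{1-p}+\frac{\gamma}{2}\Big)^2 = \Big(1-\frac{\gamma}{2}\Big)^2 .
$$

Then we use $\frac{1}{\varepsilon} = \sqrt{\frac{4(1-p)}{\gamma^2}} \le \frac{2}{\gamma}$ and $1+\frac{1}{\varepsilon} \le \frac{3}{\gamma}$ to obtain

$$
D_1 \le \frac{4c_2\eta^2}{\gamma^2}\Big((2c_1\delta_{\max}+\beta_2)\alpha\sigma^2+(2c_1\delta_{\max}+1)\zeta^2+\big((2c_1\delta_{\max}+1)\beta_1+1\big)C_2+C_1\Big).
$$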
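The condition on $\delta_{\max}$ leads to $2c_1\delta_{\max} \le \frac{\gamma^2}{5c_2} \le 1$; together with $\beta_2 \le 1$, we know

$$
D_1 \le \frac{4c_2\eta^2}{\gamma^2}\big(2\alpha\sigma^2+2\zeta^2+C_1+(1+2\beta_1)C_2\big).
\tag{33}
$$

Finally, we combine (29), (30), (32):

$$
\begin{aligned}
C_1+\frac{1}{2}C_2 \le{}& \frac{2r_0}{\eta(T+1)}+2E_1+4c_1\delta_{\max}E_I+4c_1\delta_{\max}\zeta^2+\frac{2c_1\delta_{\max}}{\eta^2}D_1 \\
\le{}& \frac{2r_0}{\eta(T+1)}+\Big(4L^2D_1+\frac{2\alpha\sigma^2}{|\mathcal{V}_R|}+2\beta_1C_2\Big)+2c_1\delta_{\max}\big(4L^2D_1+2\beta_2\alpha\sigma^2+2\beta_1C_2\big)+4c_1\delta_{\max}\zeta^2+\frac{2c_1\delta_{\max}}{\eta^2}D_1 \\
\le{}& \frac{2r_0}{\eta(T+1)}+\Big(4L^2+8c_1\delta_{\max}L^2+\frac{2c_1\delta_{\max}}{\eta^2}\Big)D_1+\Big(\frac{1}{|\mathcal{V}_R|}+2c_1\delta_{\max}\Big)2\alpha\sigma^2+4\beta_1C_2+4c_1\delta_{\max}\zeta^2 .
\end{aligned}
$$

Then we replace $D_1$ with (33):

$$
\begin{aligned}
C_1+\frac{1}{2}C_2 \le{}& \frac{2r_0}{\eta(T+1)}+\Big(\frac{1}{|\mathcal{V}_R|}+2c_1\delta_{\max}\Big)2\alpha\sigma^2+4\beta_1C_2+4c_1\delta_{\max}\zeta^2 \\
&+\big(4L^2\eta^2+8c_1\delta_{\max}L^2\eta^2+2c_1\delta_{\max}\big)\frac{4c_2}{\gamma^2}\big(2\alpha\sigma^2+2\zeta^2+C_1+(1+2\beta_1)C_2\big).
\end{aligned}
\tag{34}
$$

To have a valid bound on $C_1$, there are two constraints on the coefficients of $C_1$ and $C_2$ on the RHS:

$$
\begin{aligned}
\big(4L^2\eta^2+8c_1\delta_{\max}L^2\eta^2+2c_1\delta_{\max}\big)\frac{4c_2}{\gamma^2} &< 1, \\
\big(4L^2\eta^2+8c_1\delta_{\max}L^2\eta^2+2c_1\delta_{\max}\big)\frac{4c_2}{\gamma^2}(1+2\beta_1)+4\beta_1 &\le \frac{1}{2} .
\end{aligned}
$$

We can strengthen the first requirement to

$$
\big(4L^2\eta^2+8c_1\delta_{\max}L^2\eta^2+2c_1\delta_{\max}\big)\frac{4c_2}{\gamma^2} \le \frac{1}{4} .
\tag{35}
$$

Then, applying this inequality to the second inequality gives $\frac{1}{4}+\frac{1}{2}\beta_1+4\beta_1 \le \frac{1}{2}$, which requires $\eta \le \frac{\alpha}{3L}$. Next, (35) can be achieved by requiring $\delta_{\max} \le \frac{\gamma^2}{64c_1c_2}$ and

$$
(4+8c_1\delta_{\max})L^2\eta^2+2c_1\delta_{\max} \le 8L^2\eta^2+2c_1\delta_{\max} \le \frac{\gamma^2}{16c_2},
$$

which requires $8\eta^2L^2 \le \frac{\gamma^2}{32c_2}$, and we can simplify it to $\eta \le \frac{\gamma}{40L}$. Now we can simplify (34) with (35):

$$
\frac{3}{4}C_1 \le \frac{2r_0}{\eta(T+1)}+\Big(\frac{1}{|\mathcal{V}_R|}+2c_1\delta_{\max}\Big)2\alpha\sigma^2+4c_1\delta_{\max}\zeta^2+\big(4L^2\eta^2+8c_1\delta_{\max}L^2\eta^2+2c_1\delta_{\max}\big)\frac{4c_2}{\gamma^2}\big(2\alpha\sigma^2+2\zeta^2\big).
$$

Multiply both sides by $\frac{4}{3}$ and relax the constant $\frac{4}{3}\cdot 2 \le 3$. Then by taking $\eta \le \frac{1}{2L}$ we have

$$
C_1 \le \frac{3r_0}{\eta(T+1)}+\Big(\frac{1}{|\mathcal{V}_R|}+\frac{151}{\gamma^2}2c_1\delta_{\max}\Big)3\alpha\sigma^2+\frac{200c_1c_2}{\gamma^2}\delta_{\max}\zeta^2+\frac{48c_2}{\gamma^2}\big(\alpha\sigma^2+\zeta^2\big)L^2\eta^2 .
$$

By taking $\alpha := 3\eta L$ and relaxing the constants we have

$$
C_1 \le \frac{3r_0}{\eta(T+1)}+\Big(\frac{3}{2|\mathcal{V}_R|}+\frac{320c_1}{\gamma^2}\delta_{\max}\Big)L\sigma^2\eta+\frac{48c_2}{\gamma^2}\big(\alpha\sigma^2+\zeta^2\big)L^2\eta^2+\frac{200c_1c_2}{\gamma^2}\delta_{\max}\zeta^2 .
$$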
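Minimizing the right-hand side by tuning the step size with Lemma 12, we have

$$
\frac{1}{T+1}\sum_{t=0}^{T}\big\|\nabla f(\bar{x}^t)\big\|_2^2 \le \frac{200c_1c_2}{\gamma^2}\delta_{\max}\zeta^2+2\Big(\Big(\frac{3}{2|\mathcal{V}_R|}+\frac{320c_1}{\gamma^2}\delta_{\max}\Big)\frac{3L\sigma^2r_0}{T+1}\Big)^{1/2}+2\Big(\frac{48c_2}{\gamma^2}\zeta^2\Big)^{1/3}\Big(\frac{r_0L}{T+1}\Big)^{2/3}+2\Big(\frac{144c_2}{\gamma^2}\sigma^2
$$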



In a previous version, we referred to CLIPPEDGOSSIP as self-centered clipping.
MOZI is renamed to UBAR in the latest version.
The code is available at this anonymous repository.

Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication, 2019.
Anastasia Koloskova, Tao Lin, Sebastian U. Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression, 2020a.



Figure 1: A dumbbell topology of two cliques A and B of regular workers connected by an edge (graph cut). Byzantine workers (red) may attack the graph at different places.

Figure 5: Robust aggregators on randomized small-world (10 regular nodes) and torus (9 regular nodes) topologies under Byzantine attacks (2 attackers). We observe that across all attacks and networks, clipped gossip has excellent performance, with the geometric median (GM) coming second.

Figure 6: Effect of the number of attackers on the accuracy of CLIPPEDGOSSIP under dissensus attack with varying δ max and fixed γ, ζ 2 . The solid (resp. dashed) lines denote models averaged over all (resp. clique A or B) regular workers. The right figure shows the performance of the last iterates of curves in the left figure.

Figure 9: The topology for the attacks on consensus. The grey and red nodes denote regular and Byzantine workers respectively.

Figure 11: Dumbbell variant where Byzantine workers may be added to the central worker.

uses a dumbbell topology variant in Fig. 11. The experiments run for 1500 iterations. In this experiment we choose $n-b = 11$ and $b = 0, 1, 2, 3$. We choose the edge weights of the Byzantine workers such that $W$ and $p$ remain the same for all these $b$. Then we can easily investigate the relation between $\delta_{\max} \in [0, \frac{b}{b+3}]$ and $p$ by varying $b$. The hyperparameter of the dissensus attack is set to $\varepsilon_i = 1.5$ for all workers and all experiments.

Figure 13: Ring topology without honest majority.

Figure 14: Training models on the dumbbell topology with IID and non-IID datasets. The three figures in each row correspond to the same experiment, with "Global", "Clique A", and "Clique B" denoting the performance of the globally averaged model, the within-Clique A averaged model, and the within-Clique B averaged model.

Figure 15: Compare robust aggregators under dissensus attacks over dumbbell topology Figure 5.

Figure 16: Tuning the clipping radius on the dumbbell topology against Byzantine attacks. The y-axis is the test accuracy averaged over all regular workers.

First we average Lemma 10 over $t$:

$$
E_2 \le c_1\delta_{\max}\big(2\eta^2(E_I+\zeta^2)+D_1\big).
\tag{28}
$$

R | + 320c1 γ 2 δ max )Lσ 2 (T + 1) 1 /2 , 2r 0 γ 2 48c 2 ζ 2 L 2 (T + 1) 1 /3 , 2r 0 γ 2 L 3 σ 2 (T + 1)

Existing attacks in federated learning
B.2 Dissensus attack and other attacks in the decentralized environment
Constrained topologies
C.2 Constructing mixing matrices
Byzantine-robust consensus
D.2 Byzantine-robust decentralized optimization
D.2.1 Setup for "Decentralized defenses without attackers"
D.2.2 Setup for "Effects of the number of Byzantine workers"
D.2.3 Setup for "Defense without honest majority"
D.2.4 Setup for "More topologies and attacks."
D.3 Experiment: CIFAR-10 task
D.4 Experiment for "Weaker topology assumption"
D.5 Experiment: choosing clipping radius

Runtime hardware and software.

Default experimental settings for CIFAR-10


Fig. 4 uses the dumbbell topology in Fig. 1 with 10 regular workers in each clique. There are no Byzantine workers. The experiments run for 900 iterations. MOZI uses $\alpha = 0.5$ and $\rho_i = 0.99$ in

Bound on the consensus distance $D_1$. Since $\beta_1 = \frac{2}{9}$, we can relax (33) to For significantly large $T$, we know that ) and find the upper bound of where higher order terms of $1/T$ are dropped. Therefore, the upper bound on the consensus distance

F OTHER RELATED WORKS AND DISCUSSIONS

In this section, we add more related works and discussions.

Byzantine resilient learning with constraints. Byzantine-robustness is challenging when the training is combined with other constraints, such as asynchrony (Damaskinos et al., 2018; Xie et al., 2020b; Yang & Li, 2021), data heterogeneity (Karimireddy et al., 2021b; Peng & Ling, 2020; Li et al., 2019; Data & Diggavi, 2021), and privacy (He et al., 2020; Burkhalter et al., 2021). These works all assume the existence of a central server which can communicate with all regular workers. In this paper, we consider the decentralized setting and focus on the constraint that not all regular workers can communicate with each other.

More works on decentralized learning. Many works focus on compression techniques (Koloskova et al., 2019; 2020a; Vogels et al., 2020), data heterogeneity (Tang et al., 2018; Vogels et al., 2021; Koloskova et al., 2021), and communication topology (Assran et al., 2019; Ying et al., 2021a).

Detailed comparison with one line of work. Among all the works on robust decentralized training, Sundaram & Gharesifard (2018), Su & Vaidya (2016a), and their follow-up works (Yang & Bajwa, 2019b;a) have the setup most similar to ours. They all use the trimmed mean (TM) as the aggregator and rely on assumptions on the graph. We illustrate our advantages over these methods as follows.

1. Their methods (TM) make unrealistic assumptions about the graph while our method is much more relaxed. Their main assumption on the graph has 2 parts: 1) each good node should have at least 2b + 1 neighbors, where b is the maximum number of Byzantine workers in the whole network; 2) after removing any b edges, the good nodes should remain connected. This assumption essentially requires that the good workers have an honest majority everywhere and, additionally, that they are well connected. This can hardly be enforced in the decentralized environment. In contrast, our method has a weaker condition relating the spectral gap and δ. Our method also works without an honest majority (Figure 12). The second part of their assumption excludes common topologies like the dumbbell.

2. TM fails to reach consensus even in some Byzantine-free graphs (e.g. the dumbbell) while CLIPPEDGOSSIP converges as fast as gossip. For example, TM fails to reach consensus in the non-IID setting for the MNIST dataset (Figure 4) and even fails in the IID setting for the CIFAR-10 dataset (Figure 14).

3. We have a clear convergence rate for SGD while they only show asymptotic convergence for GD. In fact, we even improve the state-of-the-art decentralized SGD analysis (Koloskova et al., 2020b).

4. Our work reveals how the quantitative relation between the percentage of Byzantine workers (δ) and the information bottleneck (γ) influences the consensus (see Figure 3 and Theorem I).

5. We propose a novel dissensus attack that utilizes topology information.

6. Impossibility results. Sundaram & Gharesifard (2018) and Su & Vaidya (2016a) give impossibility results in terms of the number of nodes, while we give a novel result in terms of the spectral gap (γ).

(a) Before aggregation. clipped to the circle (e.g. $z^{t+1}_{j\to i}$) while nodes inside the circle (e.g. $x^{t+1}_j$) remain the same after clipping (e.g. $z^{t+1}_{j\to i}$). In the right figure (c), worker $i$ updates its model to $x^{t+1}_i$ using gossip averaging over the clipped models.

Other related works and discussions. Zhao et al. (2019) assume that some users are trusted and then adopt the trimmed mean as a robust aggregator. But this assumption is incompatible with our setting where every node only trusts itself. Peng & Ling (2020) propose a "zero-sum" attack that exploits the topology: Byzantine worker $j$ constructs its model so as to manipulate the good worker $i$'s model to 0. However, this also makes the constructed Byzantine model very far away from the good workers' models, making it easy to detect. In contrast, our dissensus attack (6) simply amplifies the existing disagreement amongst the good workers, which keeps the attack much less detectable. In addition, we take the mixing matrix into consideration and use $\varepsilon_i$ to parameterize the attack, which makes it more flexible.

Clarifications about our method. We make the following clarifications regarding our method:

• Ideally we would like to replace $\delta_{\max} = \max_j \delta_j$ with the average $\bar{\delta} = \frac{1}{n}\sum_j \delta_j$. However, the requirement that $\delta_{\max}$ be small may be achieved by the good workers increasing their weights on themselves. Note that Byzantine workers cannot alter the good workers' local behavior.

• Theorem III does not tell us what happens if the percentage of Byzantine workers δ is large relative to the spectral gap (γ), but it does not necessarily mean that CLIPPEDGOSSIP diverges. Instead, it means that reaching global consensus is not possible, as Byzantine workers effectively block the information bottleneck. We conjecture that within each connected good component not blocked by the Byzantine workers, the good workers still reach component-level consensus, which could be shown by applying the analysis of Theorem III to only this component. We leave such a component-wise analysis for future work.
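The clipping-then-averaging step described above (neighbor models outside the circle around worker $i$ are pulled back onto it, then worker $i$ gossip-averages the clipped models $z_{j\to i}$) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `clip` and `clipped_gossip_step` are hypothetical, clipping is assumed to be a rescaling of the difference $x_j - x_i$ to norm at most $\tau$, and the self-weight of worker $i$ is assumed to be $1 - \sum_{j \ne i} W_{ij}$, so that $\sum_j W_{ij} z_{j\to i} = x_i + \sum_{j\ne i} W_{ij}\,\mathrm{clip}(x_j - x_i, \tau)$.

```python
import math


def clip(v, tau):
    """Rescale vector v so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= tau or norm == 0.0:
        return list(v)
    scale = tau / norm
    return [scale * c for c in v]


def clipped_gossip_step(x_i, neighbor_models, weights, tau):
    """One aggregation step for worker i (hypothetical helper).

    x_i: the worker's own model, a list of floats.
    neighbor_models: models received from neighbors; some may be Byzantine.
    weights: mixing weights W_ij for those neighbors (sum assumed <= 1).
    tau: clipping radius around x_i.
    """
    out = list(x_i)
    for x_j, w in zip(neighbor_models, weights):
        # Clip the difference to the ball of radius tau centered at x_i.
        delta = clip([a - b for a, b in zip(x_j, x_i)], tau)
        out = [o + w * d for o, d in zip(out, delta)]
    return out


if __name__ == "__main__":
    # One nearby neighbor and one far-away (possibly Byzantine) model.
    x_new = clipped_gossip_step(
        [0.0, 0.0], [[1.0, 0.0], [100.0, 0.0]], [0.25, 0.25], tau=2.0
    )
    print(x_new)
```

Under these assumptions, a single Byzantine neighbor with weight $W_{ij}$ can move worker $i$ by at most $W_{ij}\tau$, which is the mechanism behind the $\delta_i^2(\tau_i^{t+1})^2$ term in the bound on $A_2$.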

