VARIANCE REDUCTION IS AN ANTIDOTE TO BYZANTINE WORKERS: BETTER RATES, WEAKER ASSUMPTIONS AND COMMUNICATION COMPRESSION AS A CHERRY ON THE TOP

Abstract

Byzantine-robustness has been gaining a lot of attention due to the growing interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA, a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively, while communication compression is a bonus that makes the process more communication efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA outperforming the previous state-of-the-art for general non-convex and Polyak-Łojasiewicz loss functions. Unlike concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.

1. INTRODUCTION

Distributed optimization algorithms play a vital role in the training of modern machine learning models. In particular, some tasks require training deep neural networks with billions of parameters on large datasets (Brown et al., 2020; Kolesnikov et al., 2020). Such problems could take years of computation to solve if executed on a single, even powerful, machine (Li, 2020). To circumvent this issue, it is natural to use distributed optimization algorithms that tremendously reduce the training time (Goyal et al., 2017; You et al., 2020). In the context of speeding up training, distributed methods are usually applied in data centers (Mikami et al., 2018). More recently, similar ideas have been applied to train models using open collaborations (Kijsipongse et al., 2018; Diskin et al., 2021), where each participant (e.g., a small company/university or an individual) has very limited computing power but can donate it to jointly solve computationally hard problems. Moreover, in Federated Learning (FL) applications (McMahan et al., 2017; Konečný et al., 2016; Kairouz et al., 2021), distributed algorithms are natural and, in fact, the only possible choice, since in such problems the data is privately distributed across multiple devices. In the optimization problems arising in collaborative and federated learning, there is a high risk that some participants deviate from the prescribed protocol, either on purpose or not. In this paper, we call such participants Byzantine workers.¹ For example, such peers can maliciously send incorrect gradients to slow down or even destroy the training. Indeed, these attacks can break the convergence of naïve methods such as Parallel-SGD (Zinkevich et al., 2010). Therefore, it is crucial to use secure (a.k.a. Byzantine-robust/Byzantine-tolerant) distributed methods for solving such problems. However, designing distributed methods with provable Byzantine-robustness is not an easy task.
The non-triviality of this problem comes from the fact that the stochastic gradients of good/honest/regular workers are naturally different due to their stochasticity and possible data heterogeneity. At the same time, malicious workers can send vectors that look like the stochastic gradients of good peers or create small but time-coupled shifts. Therefore, as shown in (Baruch et al., 2019; Xie et al., 2020; Karimireddy et al., 2021), Byzantine workers can circumvent popular defences based on applying robust aggregation rules (Blanchard et al., 2017; Yin et al., 2018; Damaskinos et al., 2019; Guerraoui et al., 2018; Pillutla et al., 2022) with Parallel-SGD. Moreover, in a broad class of problems with heterogeneous data, it is provably impossible to achieve any predefined accuracy of the solution (Karimireddy et al., 2022; El-Mhamdi et al., 2021). Nevertheless, as becomes evident from the further discussion, several works have provable Byzantine tolerance and a rigorous theoretical analysis. In particular, Wu et al. (2020) propose a natural yet elegant solution to the problem of Byzantine-robustness based on the usage of variance-reduced methods (Gower et al., 2020) and design the first variance-reduced Byzantine-robust method, called Byrd-SAGA, which combines the celebrated SAGA method (Defazio et al., 2014) with the geometric median aggregation rule. As a result, reducing the stochastic noise of the estimators used by good workers makes it easier to filter out Byzantine workers (especially in the case of homogeneous data). However, Wu et al. (2020) derive their results only for strongly convex objectives, and the obtained convergence guarantees are significantly worse than the best-known convergence rates for SAGA, i.e., their results are not tight, even when there are no Byzantine workers and all peers have homogeneous data. It is crucial to bypass these limitations, since the majority of modern, practically interesting problems are non-convex.
Furthermore, it is hard to develop the field without tight convergence guarantees. All in all, the above leads to the following question: Q1: Is it possible to design variance-reduced methods with provable Byzantine-robustness and tight theoretical guarantees for general non-convex optimization problems? In addition to Byzantine-robustness, one has to take into account that naïve distributed algorithms suffer from the so-called communication bottleneck: a situation when communication is much more expensive than local computations on the devices. This issue is especially evident in the training of models with a vast number of parameters (e.g., millions or trillions) or when the number of workers is large (which is often the case in FL). One of the most popular approaches to reducing the communication bottleneck is to use communication compression (Seide et al., 2014; Konečný et al., 2016; Suresh et al., 2017), i.e., instead of transmitting dense vectors (stochastic gradients/Hessians/higher-order tensors), workers apply some compression/sparsification operator to these vectors and send the compressed results to the server. Distributed learning with compression is a relatively well-developed field, e.g., see (Vogels et al., 2019; Gorbunov et al., 2020b; Richtárik et al., 2021; Philippenko & Dieuleveut, 2021) and references therein for the recent advances. Perhaps surprisingly, there are not many methods with compressed communication in the context of Byzantine-robust learning. In particular, we are only aware of the following works: (Bernstein et al., 2018; Ghosh et al., 2020; 2021; Zhu & Ling, 2021). Bernstein et al. (2018) propose signSGD to reduce communication costs and study the majority vote to cope with Byzantine workers under some additional assumptions about the adversaries. However, it is known that signSGD is not guaranteed to converge (Karimireddy et al., 2019). Next, Ghosh et al. 
(2020; 2021) apply aggregation based on the selection of the norms of the update vectors. In this case, Byzantine workers can successfully hide in the noise by applying SOTA attacks (Baruch et al., 2019). Zhu & Ling (2021) study Byzantine-robust versions of compressed SGD (BR-CSGD) and SAGA (BR-CSAGA) and also propose a combination of DIANA (Mishchenko et al., 2019; Horváth et al., 2019b) with BR-CSAGA called BROADCAST. However, the derived convergence results for these methods have several limitations. First of all, the analysis is given only for strongly convex problems. In addition, it relies on restrictive assumptions. Namely, Zhu & Ling (2021) assume uniform boundedness of the second moment of the stochastic gradient in the analysis of BR-CSGD and BR-CSAGA. This assumption rarely holds in practice, and it also implies boundedness of the gradients, which contradicts the strong convexity assumption. Next, although the bounded second-moment assumption is not used in the analysis of BROADCAST, Zhu & Ling (2021) derive the rates of BROADCAST under the assumption that the compression operator is very accurate, which implies that in theory the workers apply almost no compression to the communicated messages (see remark (5) under Table 2). Finally, even if there are no Byzantine workers and no compression, similarly to the guarantees for Byrd-SAGA, the rates obtained for BR-CSGD, BR-CSAGA, and BROADCAST are outperformed by a large margin by the known rates for SGD and SAGA.

Table 1: Comparison of the state-of-the-art (in theory) Byzantine-tolerant distributed methods. Columns: "NC" = does the theory work for general smooth non-convex functions?; "PL" = does the theory work for functions satisfying the PŁ condition (As. 2.5)?; "Tight?" = does the theory recover the tight best-known results for the version of the method with δ = 0 (no Byzantines)?; "Compr.?" = does the method use communication compression?; "VR?" = is the method variance-reduced?; "No UBV?" = does the theory work without assuming uniformly bounded variance of the stochastic gradients, i.e., without the assumption that for all x ∈ R^d the good workers have access to unbiased estimators g_i(x) of ∇f_i(x) such that E||g_i(x) − ∇f_i(x)||² ≤ σ² for all i ∈ G and some σ ≥ 0?; "No BG?" = does the theory work without assuming a uniformly bounded second moment of the stochastic gradients, i.e., without the assumption that for all x ∈ R^d the good workers have access to unbiased estimators g_i(x) of ∇f_i(x) such that E||g_i(x)||² ≤ D² for all i ∈ G and some D > 0?; "Non-US?" = does the theory support non-uniform sampling of the stochastic gradients?; "Het.?" = does the theory work under the ζ²-heterogeneity assumption (As. 2.2)? Rows: BR-SGDm (Karimireddy et al., 2021; 2022), BTARD-SGD (Gorbunov et al., 2021a)(1), Byrd-SAGA (Wu et al., 2020)(1), BR-MVR (Karimireddy et al., 2021), BR-CSGD (Zhu & Ling, 2021)(1), BR-CSAGA (Zhu & Ling, 2021)(1), BROADCAST (Zhu & Ling, 2021)(1), and Byz-VR-MARINA [this work]. [The per-column entries of the table are not recoverable from the extracted text.] (1) Strong convexity of f is assumed.

All of these limitations lead to the following question: Q2: Is it possible to design distributed methods with compression, provable Byzantine-robustness, and tight theoretical guarantees without making strong assumptions? In this paper, we give affirmative answers to Q1 and Q2 by proposing and rigorously analyzing a new Byzantine-tolerant variance-reduced method with compression called Byz-VR-MARINA. A detailed related-work overview is deferred to Appendix A. Our Contributions. Before we proceed, we need to specify the targeted problem. We consider centralized distributed learning in the possible presence of malicious, so-called Byzantine, peers. We assume that there are n clients consisting of two groups: [n] = G ⊔ B, where G denotes the set of good clients and B is the set of bad/malicious/Byzantine workers. The goal is to solve the following optimization problem:

min_{x ∈ R^d} f(x) = (1/G) Σ_{i∈G} f_i(x),   f_i(x) = (1/m) Σ_{j=1}^m f_{i,j}(x)  ∀i ∈ G,   (1)

where G = |G| and the functions f_{i,j}(x) are assumed to be smooth, but not necessarily convex.
Here each good client has its own dataset of size m, and f_{i,j}(x) is the loss of the model, parameterized by the vector x ∈ R^d, on the j-th sample from the dataset of the i-th client. Following the classical convention (Lyu et al., 2020), we make no assumptions on the malicious workers B, i.e., Byzantine workers are allowed to be omniscient.

Table 2: Complexity bounds for the state-of-the-art Byzantine-tolerant methods in three regimes: homogeneous data without compression, heterogeneous data without compression, and heterogeneous data with compression. The compared methods are BR-SGDm (Karimireddy et al., 2021; 2022), BR-MVR (Karimireddy et al., 2021), BTARD-SGD (Gorbunov et al., 2021a)(1), Byrd-SAGA(2) (Wu et al., 2020), BR-CSGD(2),(3), BR-CSAGA(2),(3) and BROADCAST(2),(3),(5) (Zhu & Ling, 2021), and Byz-VR-MARINA (Cor. E.1–E.3 and Cor. E.5–E.7, under As. 2.4). [The individual complexity entries are not recoverable from the extracted text.] (1) Gorbunov et al. (2021a) additionally assume that the tails of the noise distribution of the stochastic gradients are sub-quadratic. (2) Although the analyses by Wu et al. (2020); Zhu & Ling (2021) support inexact geometric median computation, for simplicity of presentation, we assume that the geometric median is computed exactly. (3) BR-SGDm: ε² = Ω(cδζ²); Byrd-SAGA: ε = Ω(ζ²/(µ²(1−2δ)²)); Byz-VR-MARINA: ε² = Ω(max{m/b, 1+ω}cδζ²) in the general non-convex case and ε = Ω(max{m/b, 1+ω}cδζ²/µ) in the case of PŁ functions (with ω = 0 when there is no compression); BR-CSGD: ε = Ω((σ² + ζ² + ωD²)/(µ²(1−2δ)²)) (positive even when ζ² = 0); BR-CSAGA: ε = Ω((ζ² + ωD²)/(µ²(1−2δ)²)) (positive even when ζ² = 0); BROADCAST: ε = Ω((1+ω)ζ²/(µ²(1−2δ)²)). (4) The term m√(cδ)/(bε²) is proportional to a much smaller Lipschitz constant than the term m√(cδ)/(b^{3/2}ε²) is. A similar statement holds in the PŁ case as well. (5) For this result, Zhu & Ling (2021) assume that ω ≤ µ²(1−2δ)²/(56L²(2−2δ²)), which is a very restrictive assumption even when δ = 0. For example, even for well-conditioned problems with µ/L ∼ 10⁻³ and δ = 0 (no Byzantine workers), this bound implies that ω should be no larger than 10⁻⁷. Such a value of ω corresponds to almost non-compressed communication. (6) The term (1 + √(cδ(1+ω)))/(pε²) + √(1+ω)/(√(pn)ε²) is proportional to a much smaller Lipschitz constant than the term (1 + √(cδ(1+ω)))/(√(bp)ε²) + √(1+ω)/(√(pnb)ε²) is. A similar statement holds in the PŁ case as well. (7) The rate is derived under the strong convexity assumption. Strong convexity implies the PŁ condition, but not vice versa: there exist non-convex PŁ functions (Karimi et al., 2016).

Our main contributions are summarized below. New method: Byz-VR-MARINA. We propose a new Byzantine-robust variance-reduced method with compression called Byz-VR-MARINA (Alg. 1). In particular, we make VR-MARINA (Gorbunov et al., 2021b), a variance-reduced method with compression, applicable in the context of Byzantine-tolerant distributed learning via the recent tool of robust agnostic aggregation of Karimireddy et al. (2022). As Tbl. 1 shows, Byz-VR-MARINA and our analysis of the method lead to several important improvements upon the previously best-known methods. New SOTA results. Under the quite general assumptions listed in Section 2, we prove theoretical convergence results for Byz-VR-MARINA in the cases of smooth non-convex (Thm. 2.1) and Polyak-Łojasiewicz (Thm. 2.2) functions. As Tbl. 2 shows, our complexity bounds in the non-convex case are always better than the previously known ones when the target accuracy ε is small enough.
In the PŁ case, our results improve upon previously known guarantees when the problem is badly conditioned or when ε is small enough. Moreover, we provide the first theoretical convergence guarantees for Byzantine-tolerant methods with compression in the non-convex case for arbitrary adversaries. Byzantine-tolerant variance-reduced method with tight rates. Our results are tight, i.e., when there are no Byzantine workers, our rates recover the rates of VR-MARINA, and when additionally no compression is applied, we recover the optimal rates of Geom-SARAH (Horváth et al., 2022)/PAGE (Li et al., 2021). In contrast, this is not the case for previously known variance-reduced Byzantine-robust methods such as Byrd-SAGA, BR-CSAGA, and BROADCAST, which in the homogeneous data scenario have worse rates than single-machine SAGA. Support of compression without strong assumptions. As we point out in Tbl. 2, the analysis of BR-CSGD and BR-CSAGA relies on the bounded second-moment assumption, which contradicts strong convexity, and the rates for BROADCAST are derived under the assumption that the compression operator almost coincides with the identity operator, meaning that in practice the workers essentially do not use any compression. In contrast, our analysis does not have such substantial limitations. Enabling non-uniform sampling. In contrast to the existing works on Byzantine-robustness, our analysis supports non-uniform sampling of stochastic gradients. Considering the dependencies on smoothness constants, one can quickly notice the even more significant superiority of our rates compared to the previous SOTA results.

Robust aggregation. Definition 2.1 ((δ, c)-Robust Aggregator). Assume that the vectors {x^1, ..., x^n} are such that there exists a subset G ⊆ [n] of good vectors with G = |G| ≥ (1 − δ)n and there exists σ ≥ 0 such that (1/(G(G−1))) Σ_{i,l∈G} E[||x^i − x^l||²] ≤ σ², where the expectation is taken w.r.t. the randomness of {x^i}_{i∈G}. We say that the quantity x̂ is a (δ, c)-Robust Aggregator ((δ, c)-RAgg) and write x̂ = RAgg(x^1, ..., x^n) for some c > 0 if the following inequality holds:

E||x̂ − x̄||² ≤ cδσ²,   (2)

where x̄ = (1/|G|) Σ_{i∈G} x^i. If, additionally, x̂ is computed without the knowledge of σ², we say that x̂ is a (δ, c)-Agnostic Robust Aggregator ((δ, c)-ARAgg) and write x̂ = ARAgg(x^1, ..., x^n). In fact, Karimireddy et al. (2021; 2022) propose a slightly different definition, where they assume that E||x^i − x^l||² ≤ σ² for all fixed good workers i, l ∈ G, which is marginally stronger than what we assume. Karimireddy et al. (2021) prove the tightness of their definition, i.e., up to the constant c one cannot improve bound (2), and prove that popular "middle-seekers" such as Krum (Blanchard et al., 2017), Robust Federated Averaging (RFA) (Pillutla et al., 2022), and Coordinate-wise Median (CM) (Chen et al., 2017) do not satisfy their definition. However, there is a trick called bucketing (Karimireddy et al., 2022) that provably robustifies Krum/RFA/CM. Nevertheless, the difference between our definition and the original one from (Karimireddy et al., 2021; 2022) is very subtle, and it turns out that Krum/RFA/CM with bucketing fit Definition 2.1 as well (see Appendix D). Compression. We consider unbiased compression operators, i.e., quantizations. Definition 2.2 (Unbiased compression (Horváth et al., 2019b)). A stochastic mapping Q : R^d → R^d is called an unbiased compressor/compression operator if there exists ω ≥ 0 such that for any x ∈ R^d

E[Q(x)] = x,   E||Q(x) − x||² ≤ ω||x||².   (3)

For a given unbiased compressor Q(x), one can define the expected density as ζ_Q = sup_{x∈R^d} E[||Q(x)||_0], where ||y||_0 is the number of non-zero components of y ∈ R^d. The above definition covers many popular compression operators, such as RandK sparsification (Stich et al., 2018), random dithering (Goodall, 1951; Roberts, 1962), and natural compression (Horváth et al., 2019a) (see also the summary of various compression operators in (Beznosikov et al., 2020)).
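To make Definition 2.2 concrete, the following sketch implements RandK sparsification: K uniformly chosen coordinates are kept and scaled by d/K, which makes the estimator unbiased, and (3) holds with ω = d/K − 1. This is an illustrative snippet rather than any implementation used in the paper; the Monte Carlo check at the end empirically verifies both properties.

```python
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """RandK sparsification: keep k uniformly chosen coordinates, scaled by d/k."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)  # the d/k scaling makes E[Q(x)] = x
    return out

rng = np.random.default_rng(0)
d, k = 10, 2
omega = d / k - 1  # for RandK, (3) holds with omega = d/k - 1 (here, 4)
x = rng.standard_normal(d)
samples = np.stack([rand_k(x, k, rng) for _ in range(200_000)])
mean_err = np.abs(samples.mean(axis=0) - x).max()   # ~0: unbiasedness
var = np.mean(np.sum((samples - x) ** 2, axis=1))   # ~omega * ||x||^2
assert mean_err < 0.05
assert abs(var - omega * np.sum(x ** 2)) < 0.05 * omega * np.sum(x ** 2)
```

For RandK the variance bound in (3) is in fact an equality, E||Q(x) − x||² = (d/K − 1)||x||², which is what the last assertion checks up to sampling noise.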
There also exist other classes of compression operators, such as δ-contractive compressors (Stich et al., 2018) and absolute compressors (Tang et al., 2019; Sahu et al., 2021). However, these types of compressors are out of the scope of this work. Assumptions. The first assumption is quite standard in the literature on non-convex optimization. Assumption 2.1. We assume that the function f : R^d → R is L-smooth, i.e., for all x, y ∈ R^d we have ||∇f(x) − ∇f(y)|| ≤ L||x − y||. Moreover, we assume that f is uniformly lower bounded by f* ∈ R, i.e., f* = inf_{x∈R^d} f(x). Next, we need to restrict the data heterogeneity of the regular workers. Indeed, in an arbitrarily heterogeneous scenario, it is impossible to distinguish regular workers from Byzantine ones. Therefore, we use a quite standard assumption about the heterogeneity of the local loss functions. Assumption 2.2 (ζ²-heterogeneity). We assume that the good clients have ζ²-heterogeneous local loss functions for some ζ ≥ 0, i.e.,

(1/G) Σ_{i∈G} ||∇f_i(x) − ∇f(x)||² ≤ ζ²  ∀x ∈ R^d.   (4)

We emphasize here that the homogeneous data case (ζ = 0) is realistic in collaborative learning. This typically means that the workers have access to the entire data. For example, this can be implemented using so-called dataset streaming, when the data is received just in time in chunks (Diskin et al., 2021; Kijsipongse et al., 2018) (this can also be implemented without using the server via special protocols similar to BitTorrent). The following assumption is a refinement of the standard assumption that f_i is L_i-smooth for all i ∈ G. Assumption 2.3 (Global Hessian variance assumption (Szlendak et al., 2021)). We assume that there exists L± ≥ 0 such that for all x, y ∈ R^d

(1/G) Σ_{i∈G} ||∇f_i(x) − ∇f_i(y)||² − ||∇f(x) − ∇f(y)||² ≤ L±² ||x − y||².   (5)

If f_i is L_i-smooth for all i ∈ G, then the above assumption is always valid for some L± ≥ 0 such that L²_avg − L² ≤ L±² ≤ L²_avg, where L²_avg = (1/G) Σ_{i∈G} L_i² (Szlendak et al., 2021). Moreover, Szlendak et al. (2021) show that there exist problems with heterogeneous functions on the workers such that (5) holds with L± = 0, while L_avg > 0. We propose a generalization of the above assumption for samplings of stochastic gradients. Assumption 2.4 (Local Hessian variance assumption). We assume that there exists L± ≥ 0 such that for all x, y ∈ R^d

(1/G) Σ_{i∈G} E||∆̂_i(x, y) − ∆_i(x, y)||² ≤ (L±²/b) ||x − y||²,   (6)

where ∆_i(x, y) = ∇f_i(x) − ∇f_i(y) and ∆̂_i(x, y) is an unbiased mini-batched estimator of ∆_i(x, y) with batch size b. We note that the above assumption covers a wide range of samplings of mini-batched stochastic gradient differences, e.g., standard uniform sampling or importance sampling. We provide the examples in Appendix E.1. We also note that all previous works on Byzantine-robustness focus on standard uniform sampling only. However, uniform sampling can give an m times worse constant L±² than importance sampling. This difference significantly affects the complexity bounds. New Method: Byz-VR-MARINA. Now we are ready to present our new method: Byzantine-tolerant Variance-Reduced MARINA (Byz-VR-MARINA). Our algorithm is based on the recently proposed variance-reduced method with compression (VR-MARINA) from (Gorbunov et al., 2021b). At each iteration of Byz-VR-MARINA, the good workers update their parameters as x^{k+1} = x^k − γg^k using the estimator g^k received from the parameter server (line 7). Next (line 8), with (typically small) probability p each good worker i ∈ G computes its full gradient, and with (typically large) probability 1 − p this worker computes a compressed mini-batched stochastic gradient difference Q(∆̂_i(x^{k+1}, x^k)), where ∆̂_i(x^{k+1}, x^k) satisfies Assumption 2.4. After that, the server gathers the results of the computations from the workers and applies the (δ, c)-ARAgg to compute the next estimator g^{k+1} (line 10).
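The update just described (lines 7–10) can be sketched in a minimal single-process simulation. This is a toy illustration only, not our implementation: it uses homogeneous quadratic losses f_i(x) = ½||x − a||² (so ζ = 0), identity compression Q(x) = x, and a plain coordinate-wise median as a stand-in for a certified (δ, c)-ARAgg.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_good = 5, 8
a = rng.standard_normal(d)  # homogeneous data: every good worker holds f_i(x) = 0.5*||x - a||^2

x = np.zeros(d)
g = x - a                   # g^0 = grad f(x^0)
gamma, p = 0.5, 0.2
for k in range(300):
    x_new = x - gamma * g                                 # line 7: step with the aggregated estimator
    full = rng.random() < p                               # c_k ~ Bernoulli(p)
    if full:
        msgs = [x_new - a for _ in range(n_good)]         # line 8: full local gradients
    else:
        msgs = [g + (x_new - x) for _ in range(n_good)]   # line 8: g^k + gradient difference, Q = identity
    msgs.append(-100.0 * np.ones(d))                      # one Byzantine worker sends garbage
    g = np.median(np.stack(msgs), axis=0)                 # line 10: toy robust aggregation
    x = x_new

assert np.linalg.norm(x - a) < 1e-9  # linear convergence to the minimizer despite the attacker
```

In this homogeneous toy setting the good messages coincide, so the median discards the Byzantine vector exactly and the iterates contract linearly toward a, mirroring the ζ = 0 behavior established in Theorem 2.2.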
Let us elaborate on several important parts of the proposed algorithm. First, we point out that with large probability 1 − p the good workers need to send just the compressed vectors Q(∆̂_i(x^{k+1}, x^k)), i ∈ G. Indeed, since the server knows when the workers compute full gradients and when they compute compressed stochastic gradients, it only needs to add g^k to all received vectors to perform the robust aggregation from line 10. Moreover, since the server knows the type of compression operator that the good workers apply, it can typically easily filter out those Byzantine workers who try to slow down the training by sending dense vectors instead of compressed ones (e.g., if the compression operator is RandK sparsification, then Byzantine workers cannot send more than K components; otherwise they will be easily detected and can be banned). Next, the right choice of the probability p allows equalizing the communication cost of the steps when good workers send dense gradients and the steps when they send compressed gradient differences. The same is true for the oracle complexity: if p ≤ b/m, then the computational cost of the full-batch computations is no larger than that of the stochastic gradients.

Algorithm 1 Byz-VR-MARINA: Byzantine-tolerant VR-MARINA
1: Input: starting point x^0, stepsize γ, minibatch size b, probability p ∈ (0, 1], number of iterations K, (δ, c)-ARAgg
2: Initialize g^0 = ∇f(x^0)
3: for k = 0, 1, ..., K − 1 do
4:   Sample c_k ~ Bernoulli(p)
5:   Broadcast g^k to all workers
6:   for i ∈ G in parallel do
7:     x^{k+1} = x^k − γg^k
8:     Set g_i^{k+1} = ∇f_i(x^{k+1}) if c_k = 1, and g_i^{k+1} = g^k + Q(∆̂_i(x^{k+1}, x^k)) otherwise, where ∆̂_i(x^{k+1}, x^k) is a minibatched estimator of ∇f_i(x^{k+1}) − ∇f_i(x^k); Q(·) for i ∈ G are computed independently
9:   end for
10:  g^{k+1} = ARAgg(g_1^{k+1}, ..., g_n^{k+1})
11: end for
12: Return: x̂^K chosen uniformly at random from {x^k}_{k=0}^{K−1}

Challenges in designing a variance-reduced algorithm with tight rates and provable Byzantine-robustness.
In the introduction, we explain why variance reduction is a natural way to handle Byzantine attacks (see the discussion before Q1). At first glance, it seems that one can take any variance-reduced method and combine it with some robust aggregation rule to get the result. However, this is not as straightforward as it may appear. As one can see from Table 2, the combination of SAGA with the geometric median estimator (Byrd-SAGA) gives the rate Õ(m²/(b²(1−2δ)µ²)) (smoothness constants and logarithmic factors are omitted) in the smooth strongly convex case; this rate is in fact O(m²/(b²µ²)) times worse than the rate of SAGA even when δ = 0. Therefore, it becomes clear that the full potential of variance reduction in Byzantine-robust learning is not revealed by Byrd-SAGA. The key reason for that is the sensitivity of SAGA (and SAGA-based methods) to the unbiasedness of the stochastic estimator in the analysis. Since Byrd-SAGA uses the geometric median for the aggregation, which is necessarily biased, it is natural that it has a much worse convergence rate than SAGA even in the δ = 0 case. Moreover, one cannot solve this issue by simply changing one robust estimator for another, since all known robust estimators are generally biased. To circumvent this issue, we consider a Geom-SARAH/PAGE-based estimator (Horváth & Richtárik, 2019; Li et al., 2021) and study how it interacts with robust aggregation. In particular, we observe that the averaged pair-wise variance of the stochastic gradients of the good workers can be upper bounded by a constant multiplied by E||x^{k+1} − x^k||² plus some additional terms appearing due to heterogeneity (see Lemma E.2). Then, we notice that the robust aggregation only leads to an additional term proportional to E||x^{k+1} − x^k||² (plus additional terms due to heterogeneity). We show that this term can be directly controlled using another term proportional to −E||x^{k+1} − x^k||², which appears in the original analysis of PAGE/VR-MARINA.
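The bias of robust estimators mentioned above can be seen numerically: even without Byzantine workers, the geometric median of unbiased stochastic gradients is, in general, a biased estimate of the mean gradient. A toy illustration with Weiszfeld iterations and skewed noise (all constants illustrative):

```python
import numpy as np

def geometric_median(points: np.ndarray, iters: int = 50) -> np.ndarray:
    """Weiszfeld's algorithm for argmin_z sum_i ||z - points[i]||."""
    z = points.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - z, axis=1), 1e-12)  # avoid division by zero
        w = 1.0 / dist
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0])
trials, workers = 2000, 20
gm_avg, mean_avg = np.zeros(2), np.zeros(2)
for _ in range(trials):
    # unbiased but skewed stochastic gradients: E[sample] = true_grad exactly
    samples = true_grad + rng.exponential(1.0, size=(workers, 2)) - 1.0
    gm_avg += geometric_median(samples) / trials
    mean_avg += samples.mean(axis=0) / trials

# averaging the sample mean recovers the true gradient; averaging the geometric median does not
assert np.linalg.norm(mean_avg - true_grad) < 0.02
assert np.linalg.norm(gm_avg - true_grad) > 0.1
```

Under skewed noise, the geometric median systematically drifts toward the high-density side of the distribution, which is exactly the kind of bias that SAGA-style analyses cannot absorb.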
These facts imply that although the difference between Byz-VR-MARINA and VR-MARINA is only in the choice of the aggregation rule, it is not obvious beforehand that such a combination should be considered and that it would lead to better rates. Moreover, as we show next, we obtain vast improvements upon the previously best-known theoretical results for Byzantine-tolerant learning. General Non-Convex Functions. Our main convergence result for general non-convex functions follows. All proofs are deferred to Appendix E. Theorem 2.1. Let Assumptions 2.1, 2.2, 2.3, 2.4 hold. Assume that 0 < γ ≤ 1/(L + √A), where

A = (6(1−p)/p) (4cδ/p + 1/(2G)) (ωL² + (1+ω)L±²/b) + (6(1−p)/p) (4cδ(1+ω)/p + ω/(2G)) L±².

Then for all K ≥ 0 the point x̂^K, chosen uniformly at random from the iterates x^0, x^1, ..., x^K produced by Byz-VR-MARINA, satisfies

E||∇f(x̂^K)||² ≤ 2Φ_0/(γ(K+1)) + 24cδζ²/p,   (7)

where Φ_0 = f(x^0) − f* + (γ/p)||g^0 − ∇f(x^0)||² and E[·] denotes the full expectation.² We highlight here several important properties of the derived result. First of all, this is the first theoretical result for the convergence of Byzantine-tolerant methods with compression in the non-convex case with arbitrary adversaries. Next, when ζ > 0, the theorem above does not guarantee that E[||∇f(x̂^K)||²] can be made arbitrarily small. However, this is not a drawback of our analysis but rather an inevitable limitation of all algorithms in the heterogeneous case: Karimireddy et al. (2022) proved a lower bound showing that in the presence of Byzantine workers, all algorithms satisfy E[||∇f(x̂^K)||²] = Ω(δζ²), i.e., the constant term from (7) is tight up to the factor of 1/p. However, when ζ = 0, Byz-VR-MARINA can achieve any predefined accuracy of the solution if δ is such that ARAgg is (δ, c)-robust (see Theorem D.1). Finally, as Table 2 shows, Byz-VR-MARINA achieves E[||∇f(x̂^K)||²] ≤ ε² faster than all previously known Byzantine-tolerant methods for small enough ε.
Moreover, unlike virtually all other results in the non-convex case, Theorem 2.1 does not rely on the uniformly bounded variance assumption, which is known to be very restrictive (Nguyen et al., 2018). For further discussion, we refer to Appendix E.5. Functions Satisfying the Polyak-Łojasiewicz (PŁ) Condition. We extend our theory to functions satisfying the Polyak-Łojasiewicz condition (Polyak, 1963; Łojasiewicz, 1963). This assumption generalizes regular strong convexity and holds for several non-convex problems (Karimi et al., 2016). Moreover, a very similar assumption appears in over-parameterized deep learning (Liu et al., 2022). Assumption 2.5 (PŁ condition). We assume that the function f satisfies the Polyak-Łojasiewicz (PŁ) condition with parameter µ, i.e., for all x ∈ R^d there exists x* ∈ argmin_{x∈R^d} f(x) such that ||∇f(x)||² ≥ 2µ(f(x) − f(x*)). Under this and the previously introduced assumptions, we derive the following result. Theorem 2.2. Let Assumptions 2.1, 2.2, 2.3, 2.4, 2.5 hold. Assume that 0 < γ ≤ min{1/(L + √(2A)), p/(4µ)}, where A is defined as in Theorem 2.1. Then for all K ≥ 0 the iterates produced by Byz-VR-MARINA satisfy

E[f(x^K) − f(x*)] ≤ (1 − γµ)^K Φ_0 + 24cδζ²/µ,

where Φ_0 = f(x^0) − f* + (2γ/p)||g^0 − ∇f(x^0)||². Similarly to the general non-convex case, in the PŁ setting Byz-VR-MARINA is able to achieve E[f(x^K) − f(x*)] = O(cδζ²/µ) accuracy, which matches (up to the factor of 1/p) the lower bound from Karimireddy et al. (2022) derived for µ-strongly convex objectives. Next, when ζ = 0, Byz-VR-MARINA asymptotically converges linearly to the exact solution. Moreover, as Table 2 shows, our convergence result in the PŁ setting outperforms the known rates in the more restrictive strongly convex setting. In particular, when ε is small enough, Byz-VR-MARINA has better complexity than BTARD-SGD.
When the conditioning of the problem is bad (i.e., L/µ ≫ 1), our rate dominates the results of BR-CSGD, BR-CSAGA, and BROADCAST. Furthermore, both BR-CSGD and BR-CSAGA rely on the uniformly bounded second moment assumption (contradicting strong convexity), and the rate of the BROADCAST algorithm is based on the assumption that ω = O(µ²/L²), implying that Q(x) ≈ x (no compression) even for well-conditioned problems.
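For completeness, the standard one-line argument behind the claim that strong convexity implies Assumption 2.5 reads as follows:

```latex
f(y) \;\ge\; f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{\mu}{2}\|y - x\|^2
     \;\ge\; f(x) - \tfrac{1}{2\mu}\|\nabla f(x)\|^2,
```

where the second inequality follows by minimizing the right-hand side over y. Taking y = x* gives ||∇f(x)||² ≥ 2µ(f(x) − f(x*)), i.e., the PŁ condition with the same parameter µ; the converse fails, as the non-convex PŁ examples of Karimi et al. (2016) show.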

3. NUMERICAL EXPERIMENTS

In this section, we demonstrate the practical performance of the proposed method. The main goal of our experimental evaluation is to showcase the benefits of employing SOTA variance reduction to remedy the presence of Byzantine workers. For the task, we consider the standard logistic regression model with ℓ₂-regularization: f_{i,j}(x) = −y_{i,j} log(h(x, a_{i,j})) − (1 − y_{i,j}) log(1 − h(x, a_{i,j})) + λ||x||², where y_{i,j} ∈ {0, 1} is the label, a_{i,j} ∈ R^d is the feature vector, λ is the regularization parameter, and h(x, a) = 1/(1 + e^{−a^T x}). One can show that this objective is smooth and, for λ > 0, also strongly convex; therefore, it satisfies the PŁ condition. We consider the a9a LIBSVM dataset (Chang & Lin, 2011) and set λ = 0.01. In the experiments, we focus on an important feature of Byz-VR-MARINA: it guarantees linear convergence for homogeneous datasets across clients even in the presence of Byzantine workers, as shown in Theorem 2.2. To demonstrate this experimentally, we consider a setup with four good workers and one Byzantine worker, where each worker can access the entire dataset and the server uses the coordinate-wise median with bucketing as the aggregator (see the details in Appendix D). We consider five different attacks: • No Attack (NA): clean training; • Label Flipping (LF): labels are flipped, i.e., y_{i,j} → 1 − y_{i,j}; • Bit Flipping (BF): a Byzantine worker sends an update with a flipped sign; • A Little Is Enough (ALIE) (Baruch et al., 2019): the Byzantine workers estimate the mean µ_G and standard deviation σ_G of the good updates and send µ_G − zσ_G to the server, where z is a small constant controlling the strength of the attack; • Inner Product Manipulation (IPM) (Xie et al., 2020): the attackers send −(ε/G) Σ_{i∈G} ∇f_i(x), where ε controls the strength of the attack. For bucketing, we use s = 2, i.e., we partition the updates into groups of two, as recommended by Karimireddy et al. (2022).
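To make the aggregation and attack setup concrete, here is a sketch of the coordinate-wise median with bucketing (s = 2) together with the ALIE perturbation. Dimensions, constants, and the random data are illustrative; this is not our experimental code.

```python
import numpy as np

def cm_with_bucketing(updates: np.ndarray, s: int, rng: np.random.Generator) -> np.ndarray:
    """Bucketing (Karimireddy et al., 2022): randomly permute the n updates,
    average them in groups of s, then aggregate the bucket means (here with CM)."""
    perm = rng.permutation(updates.shape[0])
    buckets = [updates[perm[i:i + s]].mean(axis=0) for i in range(0, updates.shape[0], s)]
    return np.median(np.stack(buckets), axis=0)

def alie(good_updates: np.ndarray, z: float = 1.0) -> np.ndarray:
    """A Little Is Enough: a perturbation hidden inside the empirical spread of good updates."""
    return good_updates.mean(axis=0) - z * good_updates.std(axis=0)

rng = np.random.default_rng(0)
good = rng.normal(loc=5.0, scale=1.0, size=(4, 3))  # 4 good workers, d = 3
byz = alie(good)                                    # 1 Byzantine worker
agg = cm_with_bucketing(np.vstack([good, byz]), s=2, rng=rng)
# the aggregate stays within the coordinate-wise range spanned by the good updates
assert np.all(agg >= good.min(axis=0) - 1e-9) and np.all(agg <= good.max(axis=0) + 1e-9)
```

With a single attacker among five workers, at most one of the three bucket means is contaminated, so the coordinate-wise median of the bucket means cannot leave the range of the clean bucket means, which is what the final assertion checks.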
We compare our Byz-VR-MARINA with baselines without compression (SGD, BR-SGDm (Karimireddy et al., 2021)) and with baselines using random sparsification (compressed SGD and compressed DIANA (BR-DIANA)). We do not compare against Byrd-SAGA (and BR-CSAGA, BROADCAST from Zhu & Ling (2021)), which consumes a large amount of memory that scales linearly with the number of local data points and is not well suited for memory-efficient batched gradient computation (e.g., as used in PyTorch). Our implementation is based on PyTorch (Paszke et al., 2019). Figure 1 showcases that, indeed, we observe linear convergence of our method, while no baseline achieves this fast rate. In the first row, we display methods with no compression, and in the second row, each algorithm uses random sparsification.

A DETAILED RELATED WORK

Byzantine-robustness. Classical approaches to Byzantine-tolerant optimization are based on applying special aggregation rules to Parallel-SGD (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Damaskinos et al., 2019; Guerraoui et al., 2018; Pillutla et al., 2022). It turns out that such defenses are vulnerable to special types of attacks (Baruch et al., 2019; Xie et al., 2020). A subsequent work (2022) proposes an extension to decentralized optimization over fixed networks. Gorbunov et al. (2021a) propose an alternative approach based on the usage of AllReduce (Patarasuk & Yuan, 2009) with additional verifications of correctness and show that their algorithm has complexity not worse than Parallel-SGD when the target accuracy is small enough. Wu et al. (2020) were the first to apply a variance reduction mechanism to tolerate Byzantine attacks (see the discussion above Q1). We also refer the reader to (Chen et al., 2018; Rajput et al., 2019; Rodríguez-Barroso et al., 2020; Xu & Lyu, 2020; Alistarh et al., 2018; Allen-Zhu et al., 2021; Regatti et al., 2020; Yang & Bajwa, 2019a;b; Gupta et al., 2021; Gupta & Vaidya, 2021; Peng et al., 2021) for other advances in Byzantine-robustness (see the detailed summaries in (Lyu et al., 2020; Gorbunov et al., 2021a)). We further progress the field by obtaining new theoretical SOTA convergence results in our work.
Compressed communications. Methods with compression are relatively well studied in the literature. The first theoretical results were derived in (Alistarh et al., 2017; Wen et al., 2017; Stich et al., 2018; Mishchenko et al., 2019). Over the last several years, the field has developed significantly.
In particular, compressed methods have been analyzed in conjunction with variance reduction (Horváth et al., 2019b; Gorbunov et al., 2020b; Danilova & Gorbunov, 2022), acceleration (Li et al., 2020; Li & Richtárik, 2021; Qian et al., 2021b), decentralized communications (Koloskova et al., 2019; Kovalev et al., 2021), local steps (Basu et al., 2019; Haddadpour et al., 2021), adaptive compression (Faghri et al., 2020), second-order methods (Islamov et al., 2021; Safaryan et al., 2021), and min-max optimization (Beznosikov et al., 2021; 2022). However, to our knowledge, only one work studies communication compression in the context of Byzantine-robustness (Zhu & Ling, 2021) (see the discussion above Q2). Our work makes a further step towards closing this significant gap in the literature.
Variance reduction is a powerful tool allowing to speed up the convergence of stochastic methods (especially when one needs to achieve a good approximation of the solution). The first variance-reduced methods were proposed by Schmidt et al. (2017); Johnson & Zhang (2013); Defazio et al. (2014). Optimal variance-reduced methods are proposed for (strongly) convex problems in (Lan & Zhou, 2018; Allen-Zhu, 2017; Lan et al., 2019) and for non-convex optimization in (Nguyen et al., 2017; Fang et al., 2018; Li et al., 2021). Despite the noticeable attention to these kinds of methods (Gower et al., 2020), only a few papers study Byzantine-robustness in conjunction with variance reduction (Wu et al., 2020; Zhu & Ling, 2021; Karimireddy et al., 2021). Moreover, as we mentioned before, the results from Wu et al. (2020); Zhu & Ling (2021) are not better than the known ones for non-parallel variance-reduced methods, and Karimireddy et al. (2021) rely on the uniformly bounded variance assumption, which is hard to achieve in practice. In our work, we circumvent these limitations.
Non-uniform sampling.
Originally proposed for randomized coordinate methods (Nesterov, 2012; Richtárik & Takáč, 2016; Qu & Richtárik, 2016), non-uniform sampling has been extended in multiple ways to stochastic optimization, e.g., see (Horváth & Richtárik, 2019; Gower et al., 2019; Qian et al., 2019; Gorbunov et al., 2020b;a; Qian et al., 2021a). Typically, non-uniform sampling of stochastic gradients allows a better dependence on smoothness constants in the theoretical results. Inspired by these advances, we propose the first Byzantine-robust optimization method supporting non-uniform sampling of stochastic gradients.
Hardware and software. Our experiments are run on the following setup:
• 24 CPUs: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz,
• GPU: NVIDIA TITAN Xp with CUDA version 11.3,
• PyTorch version: 1.11.0.

B.2 EXPERIMENTAL SETUP

For each experiment, we tune the step size using the set of candidates {0.5, 0.05, 0.005}. The step size is fixed; we do not use learning rate warmup or decay. We use batches of size 32 for all methods. Each experiment is run with three different random seeds, and we report the mean optimality gap with one standard error. The optimal value is obtained by running gradient descent (GD) on the complete dataset for 1000 epochs. Our implementation of attacks and robust aggregation schemes is based on the public implementation from (Karimireddy et al., 2022) available at https://github.com/epfml/byzantine-robust-noniid-optimizer. Our code is available via an anonymized repository at https://github.com/SamuelHorvath/VR_Byzantine. We select the same set of hyperparameters as (Karimireddy et al., 2022), i.e.,
• RFA: the number of steps of the smoothed Weiszfeld algorithm T = 8; see Appendix D for details,
• ALIE: the small constant z that controls the strength of the attack is chosen according to (Baruch et al., 2019),
• IPM: the small constant that controls the strength of the attack is ε = 0.1.

B.3.1 HETEROGENEOUS DATA

In this case, we randomly shuffle the dataset and sequentially distribute it among 15 good workers, so that each worker has approximately the same amount of data and there is no overlap. We include five Byzantine workers who have access to the entire dataset and to the exact updates computed at each client. For the aggregation, we consider three rules: standard averaging (AVG), coordinate-wise median (CM) with bucketing, and robust federated averaging (RFA) with bucketing (see the details in Appendix D). Discussion. In Figure 2, we can see that the momentum (BR-SGDm) and variance reduction (Byz-VR-MARINA) techniques consistently outperform the SGD baseline, while neither of them dominates across all the attacks. Byz-VR-MARINA is particularly useful in the clean data regime and against the ALIE and IPM attacks, while the BR-SGDm algorithm provides the best performance for the label and bit flipping attacks. It would be interesting to automatically select the technique, e.g., momentum or VR-MARINA, that provides the best defense against any given attack. We leave this for future work.
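The data partitioning described above (shuffle, then split sequentially into near-equal, non-overlapping shards) can be sketched as follows; the helper name is hypothetical and chosen for illustration.

```python
import numpy as np

def partition_heterogeneous(n_samples, n_workers, rng=None):
    # Shuffle the sample indices and split them sequentially into
    # near-equal, disjoint shards, one shard per good worker.
    rng = rng or np.random.default_rng()
    idx = rng.permutation(n_samples)
    return np.array_split(idx, n_workers)
```

Because the shards come from a single random shuffle of one pooled dataset, the resulting local datasets are only mildly heterogeneous; stronger heterogeneity would require, e.g., label-sorted splits.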

B.3.2 COMPRESSION

In this section, we consider the same setup as in the previous experiment, with the difference that we employ communication compression. We choose random unbiased sparsification with a sparsity level of 10%. We compare our Byz-VR-MARINA algorithm to compressed SGD and compressed DIANA (BR-DIANA). Discussion. In Figure 3, we can see that Byz-VR-MARINA consistently outperforms both baselines except for the bit flipping attack. However, even in this case, it seems that Byz-VR-MARINA only needs more epochs to provide a better solution, while SGD cannot improve further regardless of the number of epochs.
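The random unbiased sparsification (RandK) used here keeps K random coordinates and rescales them by d/K, so the compressor is unbiased with variance parameter ω = d/K − 1; a sparsity level of 10% corresponds to K = 0.1d. A minimal sketch (the function name is ours):

```python
import numpy as np

def rand_k(x, k, rng=None):
    # Unbiased RandK sparsification: keep k uniformly chosen coordinates
    # and scale them by d/k, so that E[Q(x)] = x.  The variance parameter
    # of this compressor is omega = d/k - 1.
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    q = np.zeros_like(x)
    q[idx] = (d / k) * x[idx]
    return q
```

Only the k surviving coordinates (and their indices) need to be transmitted, which is the source of the communication savings.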

B.3.3 EXTRA DATASET: W8A

In Figures 4, 5, and 6, we perform the same experiments but for a different LIBSVM dataset: w8a. We note that the obtained results are consistent with our observations for the a9a dataset.

B.4 COMPARISON WITH Byrd-SVRG

As we note in the main part of the paper, Byrd-SAGA is not well suited for PyTorch due to the large memory consumption of SAGA-based methods. Nevertheless, one can use the SVRG estimator (Johnson & Zhang, 2013) as a proxy for the SAGA estimator due to the similarities between SAGA and SVRG. We call the resulting method Byrd-SVRG and compare its performance with Byz-VR-MARINA on the logistic regression task with non-convex regularization: an instance of (1) with f_{i,j}(x) = -y_{i,j} log(h(x, a_{i,j})) - (1 - y_{i,j}) log(1 - h(x, a_{i,j})) + λ Σ_{i=1}^d x_i²/(1 + x_i²). The results are presented in Figure 7. One can see that Byz-VR-MARINA converges to the exact solution asymptotically, while the other methods converge only to some neighborhood of the solution.
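For concreteness, the SVRG estimator used as a proxy above combines a per-sample gradient at the current point, the same per-sample gradient at a snapshot point w, and the full gradient at w. A toy single-sample sketch on a quadratic (all names and the toy objective are our own illustration):

```python
import numpy as np

def svrg_estimator(grad_fn, full_grad_at_snapshot, x, w, j):
    # SVRG gradient estimator (Johnson & Zhang, 2013): unbiased for the
    # full gradient at x, with variance vanishing as x and w approach x*.
    return grad_fn(j, x) - grad_fn(j, w) + full_grad_at_snapshot

# Toy problem: f(x) = (1/m) * sum_j (x - a_j)^2 / 2, so grad f_j(x) = x - a_j.
a = np.array([1.0, 2.0, 3.0])
grad_fn = lambda j, x: x - a[j]
w = 0.0
full_grad_w = np.mean(w - a)  # full gradient at the snapshot w
x = 0.5
# Here the estimator equals the exact full gradient at x for every j,
# because each per-sample gradient differs from the mean by a constant shift.
g = svrg_estimator(grad_fn, full_grad_w, x, w, 1)
```

Unlike SAGA, only the snapshot w and its full gradient are stored, not a table of per-sample gradients, which is why the memory footprint stays constant in the number of local data points.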

B.5 EFFECT OF COMPRESSION

In this experiment, we illustrate the effect of compression in Byz-VR-MARINA on its communication efficiency. We compare the performance of Byz-VR-MARINA with and without compression in terms of the number of bits communicated between the workers and the server. The results are shown in Figure 8. One can see that communication compression does speed up the training (in terms of the number of transmitted bits).
Figure 8: The optimality gap f(x^k) - f(x*) under the ALIE attack on the a9a dataset, where each worker has access to the full dataset, with 4 good workers and 1 Byzantine worker. The figure illustrates the effect of compression in Byz-VR-MARINA. RandK sparsification with K = 0.1d is used as the compression operator. "Relative comm. compression" = the number of transmitted bits divided by the number of transmitted bits per iteration of the full-precision algorithm and then divided by the size of the dataset (when no compression is applied, this quantity equals the number of epochs).

C USEFUL FACTS

For all a, b ∈ R^d and α > 0, p ∈ (0, 1] the following relations hold:
2⟨a, b⟩ = ‖a‖² + ‖b‖² - ‖a - b‖², (10)
‖a + b‖² ≤ (1 + α)‖a‖² + (1 + α⁻¹)‖b‖², (11)
-‖a - b‖² ≤ -(1/(1 + α))‖a‖² + (1/α)‖b‖², (12)
(1 - p)(1 + p/2) ≤ 1 - p/2. (13)
Lemma C.1 (Lemma 5 from Richtárik et al. (2021)). Let a, b > 0. If 0 ≤ γ ≤ 1/(√a + b), then aγ² + bγ ≤ 1. The bound is tight up to a factor of 2, since 1/(√a + b) ≤ min{1/√a, 1/b} ≤ 2/(√a + b).
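These relations are easy to verify numerically; the following sketch checks (10)-(13) and the bound of Lemma C.1 on random inputs (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
a_vec, b_vec = rng.normal(size=5), rng.normal(size=5)
alpha, p = 0.7, 0.3

# (10): polarization identity, 2<a,b> = |a|^2 + |b|^2 - |a-b|^2
lhs10 = 2 * a_vec.dot(b_vec)
rhs10 = a_vec.dot(a_vec) + b_vec.dot(b_vec) - (a_vec - b_vec).dot(a_vec - b_vec)

# (11): Young's inequality for the squared norm of a sum
ok11 = ((a_vec + b_vec).dot(a_vec + b_vec)
        <= (1 + alpha) * a_vec.dot(a_vec) + (1 + 1 / alpha) * b_vec.dot(b_vec))

# (12): the companion lower bound for -|a-b|^2
ok12 = (-(a_vec - b_vec).dot(a_vec - b_vec)
        <= -a_vec.dot(a_vec) / (1 + alpha) + b_vec.dot(b_vec) / alpha)

# (13): scalar inequality used to contract the Lyapunov function
ok13 = (1 - p) * (1 + p / 2) <= 1 - p / 2

# Lemma C.1: gamma <= 1/(sqrt(a) + b) implies a*gamma^2 + b*gamma <= 1
a_c, b_c = 4.0, 3.0
gamma = 1 / (np.sqrt(a_c) + b_c)
ok_lemma = a_c * gamma ** 2 + b_c * gamma <= 1 + 1e-12
```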

D FURTHER DETAILS ON ROBUST AGGREGATION

In Section 2, we consider robust aggregation rules satisfying Definition 2.1. As we noted, this definition slightly differs from the original one introduced by Karimireddy et al. (2022). In particular, we assume
(1/(G(G - 1))) Σ_{i,l∈G} E‖x_i - x_l‖² ≤ σ², (14)
while Karimireddy et al. (2022) use E[‖x_i - x_l‖²] ≤ σ² for any fixed i, l ∈ G. As we show next, this difference is very subtle, and condition (14) also allows one to achieve robustness. We consider robust aggregation via bucketing as proposed by Karimireddy et al. (2022) (see Algorithm 2). This algorithm can robustify some non-robust aggregation rules Aggr.
Algorithm 2 Bucketing: Robust Aggregation using bucketing (Karimireddy et al., 2022)
1: Input: {x_1, . . . , x_n}, s ∈ N - bucket size, Aggr - aggregation rule
2: Sample a random permutation π = (π(1), . . . , π(n)) of [n]
3: Compute y_i = (1/s) Σ_{k=s(i-1)+1}^{min{si,n}} x_{π(k)} for i = 1, . . . , ⌈n/s⌉
4: Return: x̂ = Aggr(y_1, . . . , y_{⌈n/s⌉})
In particular, Karimireddy et al. (2022) show that Algorithm 2 makes Krum (Blanchard et al., 2017), Robust Federated Averaging (RFA) (Pillutla et al., 2022) (also known as the geometric median), and Coordinate-wise Median (CM) (Chen et al., 2017) robust, in view of the definition from Karimireddy et al. (2022). Our main goal in this section is to show that Krum • Bucketing, RFA • Bucketing, and CM • Bucketing satisfy Definition 2.1. Before we prove this fact, we need to introduce Krum, RFA, and CM.
Krum. Let S_i ⊆ {x_1, . . . , x_n} be the subset of n - |B| - 2 closest vectors to x_i. Then, the Krum estimator is defined as
Krum(x_1, . . . , x_n) := argmin_{x_i ∈ {x_1,...,x_n}} Σ_{j∈S_i} ‖x_j - x_i‖². (15)
Krum requires computing all pairwise distances between the vectors {x_1, . . . , x_n}, resulting in O(n²) computation cost for the server. Therefore, Krum is computationally expensive when the number of workers n is large.
Robust Federated Averaging. The RFA estimator finds a geometric median:
RFA(x_1, . . . , x_n) := argmin_{x∈R^d} Σ_{i=1}^n ‖x - x_i‖.
(16) The above problem has no closed-form solution. However, one can approximate RFA using several steps of the smoothed Weiszfeld algorithm, which has O(n) computation cost per iteration (Weiszfeld, 1937; Pillutla et al., 2022).
Coordinate-wise Median. The CM estimator computes the median of each component separately. That is, for the t-th coordinate it is defined as
[CM(x_1, . . . , x_n)]_t := Median([x_1]_t, . . . , [x_n]_t) = argmin_{u∈R} Σ_{i=1}^n |u - [x_i]_t|,
where [x]_t is the t-th coordinate of the vector x ∈ R^d (Chen et al., 2017; Yin et al., 2018). CM has O(n) computation cost.
Robustness via bucketing. The following lemma is the key to showing the robustness of Krum • Bucketing, RFA • Bucketing, and CM • Bucketing. Define
G̃ = {i ∈ [N] | B_i ⊆ G},
where y_i = (1/|B_i|) Σ_{j∈B_i} x_j and B_i denotes the i-th bucket, i.e., B_i = {π((i - 1)·s + 1), . . . , π(min{i·s, n})}. Then, |G̃| = G̃ ≥ (1 - δs)N and for any fixed i, l ∈ G̃ we have
E[y_i] = E[x̄] and E‖y_i - y_l‖² ≤ σ²/s, (18)
where x̄ = (1/G) Σ_{i∈G} x_i.
Proof. The proof is almost identical to the proof of Lemma 1 from (Karimireddy et al., 2022). Nevertheless, for the sake of mathematical rigor, we provide a complete proof. Since each Byzantine peer is contained in no more than 1 bucket and |B| ≤ δn (here B = [n] \ G), the number of "bad" buckets is not greater than δn ≤ δsN, i.e., G̃ ≥ (1 - δs)N. Next, for any fixed i ∈ G̃ we have
E_π[y_i | i ∈ G̃] = (1/|B_i|) Σ_{j∈B_i} E_π[x_j | j ∈ G] = (1/G) Σ_{j∈G} x_j = x̄,
where E_π[·] denotes the expectation w.r.t. the randomness coming from the permutation. Taking the full expectation, we obtain the first part of (18). To derive the second part, we introduce the notation for the workers from B_i and B_l: B_i = {k_{i,1}, k_{i,2}, . . . , k_{i,s}} and B_l = {k_{l,1}, k_{l,2}, . . . , k_{l,s}}. Then, for any fixed i, l ∈ G̃, the (ordered) pairs {(x_{k_{i,t}}, x_{k_{l,t}})}_{t=1}^s are identically distributed random variables, and for any fixed t = 1, . . . , s the vectors x_{k_{i,t}}, x_{k_{l,t}} are identically distributed.
Therefore, we have
E‖y_i - y_l‖² = E‖(1/s) Σ_{t=1}^s (x_{k_{i,t}} - x_{k_{l,t}})‖²
= (1/s²) Σ_{t=1}^s E‖x_{k_{i,t}} - x_{k_{l,t}}‖² + (2/s²) Σ_{1≤t_1<t_2≤s} E⟨x_{k_{i,t_1}} - x_{k_{l,t_1}}, x_{k_{i,t_2}} - x_{k_{l,t_2}}⟩
= (1/s) E‖x_{k_{i,1}} - x_{k_{l,1}}‖² + ((s - 1)/s) E⟨x_{k_{i,1}} - x_{k_{l,1}}, x_{k_{i,2}} - x_{k_{l,2}}⟩
= (1/(sG(G - 1))) Σ_{i_1,i_2∈G, i_1≠i_2} E‖x_{i_1} - x_{i_2}‖² + S
≤ σ²/s + S,
where the last inequality follows from (14) and
S = ((s - 1)/(sG(G - 1)(G - 2)(G - 3))) Σ E[⟨x_{i_1} - x_{i_2}, x_{i_3} - x_{i_4}⟩],
the sum being taken over pairwise distinct i_1, i_2, i_3, i_4 ∈ G. Finally, we notice that S = 0: for any fixed i_1, i_2, i_3, i_4 the sum contains one term proportional to E[⟨x_{i_1} - x_{i_2}, x_{i_3} - x_{i_4}⟩] and one term proportional to E[⟨x_{i_2} - x_{i_1}, x_{i_3} - x_{i_4}⟩], and these terms cancel out. This concludes the proof.
Using the above lemma, we get the following result.
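The aggregation pipeline analyzed in this section, bucketing on top of a base rule such as coordinate-wise median, can be sketched as follows (a direct transcription of Algorithm 2; function names are ours):

```python
import numpy as np

def bucketing(updates, s, aggr, rng=None):
    # Algorithm 2: randomly permute the n updates, average them in
    # buckets of size s (the last bucket may be smaller), then apply
    # the base aggregation rule Aggr to the bucket means.
    rng = rng or np.random.default_rng()
    n = len(updates)
    perm = rng.permutation(n)
    buckets = [np.mean([updates[j] for j in perm[i:i + s]], axis=0)
               for i in range(0, n, s)]
    return aggr(buckets)

def coordinate_wise_median(vectors):
    # CM: the median of each coordinate, taken separately
    return np.median(np.stack(vectors), axis=0)
```

Composition, e.g. CM • Bucketing, is then `bucketing(updates, 2, coordinate_wise_median)`; with one Byzantine update among five, at most one bucket mean is contaminated, so the coordinate-wise median of the bucket means stays at the honest value.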

E MISSING PROOFS AND DETAILS FROM SECTION 2 E.1 EXAMPLES OF SAMPLINGS

Below we provide two examples of situations when Assumption 2.4 holds. In both cases, we assume that f i,j is L i,j -smooth for all i ∈ G, j ∈ [m]. Example E.1 (Uniform sampling with replacement). Consider ∆ i (x, y) = 1 b j∈I i,k ∆ i,j (x, y), where ∆ i,j (x, y) = ∇f i,j (x) -∇f i,j (y) and I i,k is the set of b i.i.d. samples from the uniform distribution on [m]. Then, Assumption 2.4 holds with L 2 ± = L 2 ±,US , where L 2 ±,US ≤ 1 G i∈G L 2 i,±,US and L 2 i,±,US is such that 1 m m j=1 ∆ i,j (x, y) -∆ i (x, y) 2 ≤ L 2 i,±,US x -y 2 . Lemma 2 from Szlendak et al. (2021) implies that L 2 i,US -L 2 i ≤ L 2 i,±,US ≤ L 2 i,US , where L 2 i,US is such that 1 m m j=1 ∆ i,j (x, y) 2 ≤ L 2 i,US x -y 2 . We point out that in the worst case L 2 i,US = 1 m m j=1 L 2 i,j . Example E.2 (Importance sampling with replacement). Consider ∆ k i = 1 b j∈I i,k Li Li,j ∆ i,j (x, y), where ∆ i,j (x, y) = ∇f i,j (x) -∇f i,j (y), L i = 1 m m j=1 L i,j , and I i,k is the set of b i.i.d. samples from the distribution D i,IS on [m] such that for j ∼ D i,IS we have P{j = t} = Li,t mLi . Then, Assumption 2.4 holds with L 2 ± = L 2 ±,IS such that 1 mG i∈G m j=1 L i L i,j ∆ i,j (x, y) 2 - 1 mG i∈G ∆ i (x, y) 2 ≤ L 2 ±,IS x -y 2 . Lemma 2 from Szlendak et al. (2021) implies that 1 G i∈G (L 2 i -L 2 i ) ≤ L 2 ±,IS ≤ 1 G i∈G L 2 i . We point out that L 2 i ≤ L 2 i,US and in the worst case mL 2 i = L 2 i,US . Therefore, typically L 2 ±,IS < L 2 ±,US . Next, we show that Assumption 2.4 holds whenever f i,j is L i,j -smooth for all i ∈ G, j ∈ [m] and ∆ i (x, y) can be written as ∆ i (x, y) = 1 m m j=1 ξ i,j (∇f i,j (x) -∇f i,j (y)) , for some random variables ξ i,j such that E[ξ i,j ] = 1, i ∈ G, j ∈ [m] and max i∈G,j∈[m] E[ξ 2 i,j ] = E 2 < ∞ 5 . 
Indeed, in this case, we have
(1/G) Σ_{i∈G} E‖∆̂_i(x, y) - ∆_i(x, y)‖² ≤ (1/G) Σ_{i∈G} E‖∆̂_i(x, y)‖²
= (1/G) Σ_{i∈G} E‖(1/m) Σ_{j=1}^m ξ_{i,j}(∇f_{i,j}(x) - ∇f_{i,j}(y))‖²
≤ (1/(Gm)) Σ_{i∈G} Σ_{j=1}^m ‖∇f_{i,j}(x) - ∇f_{i,j}(y)‖² E[ξ²_{i,j}]
≤ (E²/(Gm)) Σ_{i∈G} Σ_{j=1}^m L²_{i,j} ‖x - y‖² ≤ E² max_{i∈G,j∈[m]} L²_{i,j} ‖x - y‖²,
meaning that Assumption 2.4 holds with L_± ≤ √b E max_{i∈G,j∈[m]} L_{i,j}. However, as we show in Examples E.1 and E.2, the constant L_± can be much smaller than the derived upper bound.
Published as a conference paper at ICLR 2023
E.2 KEY LEMMAS
Our theory works for a slightly more general setting than the one we discussed in the main part of the paper. In particular, instead of Assumption 2.2 we consider a more general assumption on the heterogeneity.
Assumption E.1 ((B, ζ²)-heterogeneity). We assume that the good clients have (B, ζ²)-heterogeneous local loss functions for some B ≥ 0, ζ ≥ 0, i.e.,
(1/G) Σ_{i∈G} ‖∇f_i(x) - ∇f(x)‖² ≤ B‖∇f(x)‖² + ζ² ∀x ∈ R^d. (19)
When B = 0, the above assumption recovers Assumption 2.2. Since we allow B to be positive, it may reduce the value of ζ in some applications (for example, for over-parameterized models). As we show further, the best possible optimization error that Byz-VR-MARINA can achieve in this case is proportional to ζ². We refer to Karimireddy et al. (2022) for a study of typical values of the parameter B for some over-parameterized models. In the proofs, we need the following lemma, which is often used for analyzing SGD-like methods in the non-convex case.
Lemma E.1 (Lemma 2 from Li et al. (2021)). Assume that the function f is L-smooth and x^{k+1} = x^k - γg^k. Then
f(x^{k+1}) ≤ f(x^k) - (γ/2)‖∇f(x^k)‖² - (1/(2γ) - L/2)‖x^{k+1} - x^k‖² + (γ/2)‖g^k - ∇f(x^k)‖². (20)
To estimate the "quality" of the robust aggregation at iteration k + 1, we derive an upper bound for the averaged pairwise variance of the estimators obtained by the good peers (see also Definition 2.1). Lemma E.2 (Bound on the variance). Let Assumptions 2.1, E.1, 2.3, 2.4 hold.
Then for all k ≥ 0 the iterates produced by Byz-VR-MARINA satisfy 1 G(G -1) i,l∈G E g k+1 i -g k+1 l 2 ≤ A E x k+1 -x k 2 + 8BpE ∇f (x k ) 2 + 4pζ 2 , ( ) where A = 8BpL 2 + 4(1 -p) ωL 2 + (1 + ω)L 2 ± + (1+ω)L 2 ± b . Proof. For the compactness, we introduce new notation: ∆ k = ∇f (x k+1 ) -∇f (x k ). Let E c k [•] denote the expectation w.r.t. c k . Then, by definition of g k+1 i for i ∈ G we have 1 G(G -1) i,l∈G E c k g k+1 i -g k+1 l 2 = p G(G -1) i,l∈G ∇f i (x k+1 ) -∇f l (x k+1 ) 2 T1 + 1 -p G(G -1) i,l∈G Q( ∆ k i ) -Q( ∆ k l ) 2 T2 . Taking the full expectation and using the tower property E[E c k [•]] = E[•], we derive 1 G(G -1) i,l∈G E g k+1 i -g k+1 l 2 = E[T 1 ] + E[T 2 ]. Term E[T 1 ] can be bounded via Assumption 19: E[T 1 ] = p G(G -1) i,l∈G i =l E ∇f i (x k+1 ) -∇f l (x k+1 ) 2 (11) ≤ p G(G -1) i,l∈G i =l E 2 ∇f i (x k+1 ) -∇f (x k+1 ) 2 + 2 ∇f l (x k+1 ) -∇f (x k+1 ) 2 = 4p G i∈G E ∇f i (x k+1 ) -∇f (x k+1 ) 2 (19) ≤ 4BpE ∇f (x k+1 ) 2 + 4pζ 2 (11) ≤ 8BpE ∇f (x k ) 2 + 8BpE ∇f (x k+1 ) -∇f (x k ) 2 + 4pζ 2 As. 2.1 ≤ 8BpE ∇f (x k ) 2 + 8BpL 2 E x k+1 -x k 2 + 4pζ 2 . To estimate E[T 2 ] we first derive an upper bound for E k [T 2 ], where E k [•] denotes expectation w.r.t. 
the all randomness (compression and stochasticity of the the gradients) coming from the step k + 1 of the algorithm: E k [T 2 ] = 1 -p G(G -1) i,l∈G i =l E k Q( ∆ k i ) -Q( ∆ k l ) 2 = 1 -p G(G -1) i,l∈G i =l E k Q( ∆ k i ) -∆ k i -(Q( ∆ k l ) -∆ k l ) 2 + 1 -p G(G -1) i,l∈G i =l ∆ k i -∆ k l 2 (11) ≤ 1 -p G(G -1) i,l∈G i =l E k 2 Q( ∆ k i ) -∆ k i 2 + 2 Q( ∆ k l ) -∆ k l 2 + 1 -p G(G -1) i,l∈G i =l 2 ∆ k i -∆ k 2 + 2 ∆ k l -∆ k 2 = 1 -p G(G -1) i,l∈G i =l E k E Q k 2 Q( ∆ k i ) -∆ k i 2 + 2 Q( ∆ k l ) -∆ k l 2 + 1 -p G(G -1) i,l∈G i =l 2 ∆ k i -∆ k 2 + 2 ∆ k l -∆ k 2 = 4(1 -p) G i∈G E k E Q k Q( ∆ k i ) 2 -∆ k i 2 + 4(1 -p) G i∈G ∆ k i 2 -∆ k 2 (3) ≤ 4(1 -p) G i∈G E k (1 + ω) ∆ k i 2 -∆ k i 2 + 4(1 -p) G i∈G ∆ k i 2 -∆ k 2 = 4(1 -p)(1 + ω) G i∈G E k ∆ k i -∆ k i 2 + 4(1 -p)(1 + ω) G i∈G ∆ k i -∆ k 2 +4(1 -p)ω ∆ k 2 , where E Q k [•] denotes the expectation w.r.t. the randomness coming from the compression at step k + 1. Applying Assumptions 2.4, 2.3, and 2.1 and taking the full expectation, we get E[T 2 ] ≤ 4(1 -p) ωL 2 + (1 + ω) L 2 ± + L 2 ± b E x k+1 -x k 2 . Plugging the upper bounds for E[T 1 ] and E[T 2 ] in ( 22), we obtain the result. Using the above lemma, we derive the following technical result, which we rely on in the proofs of the main results. Lemma E.3 (Bound on the distortion). Let Assumptions 2.1, E.1, 2.3, 2.4 hold. Then for all k ≥ 0 the iterates produced by Byz-VR-MARINA satisfy E g k+1 -∇f (x k+1 ) 2 ≤ 1 - p 2 E g k -∇f (x k ) 2 + 24BcδE ∇f (x k ) 2 + 12cδζ 2 + Ap 4 E x k+1 -x k 2 , ( ) where A = 48BL 2 cδ p + 6(1-p) p 4cδ p + 1 2G ωL 2 + (1+ω)L 2 ± b + 6(1-p) p 4cδ(1+ω) p + ω 2G L 2 ± . Proof. For convenience, we intoduce the following notation: g k+1 = 1 G i∈G g k+1 i = ∇f (x k+1 ), if c k = 1, g k + 1 G i∈G Q( ∆ k i ), otherwise. Using the introduced notation, we derive E g k+1 -∇f (x k+1 ) 2 (11) ≤ 1 + p 2 E g k+1 -∇f (x k+1 ) 2 + 1 + 2 p E g k+1 -g k+1 2 . ( ) Next, we need to upper-bound the terms from the right-hand side of (25). 
Let E c k [•] denote the expectation w.r.t. c k . Then, in view of (24), we have E c k g k+1 -∇f (x k+1 ) 2 = (1 -p) g k + 1 G i∈G Q( ∆ k i ) -∇f (x k+1 ) 2 . Taking expectation E k [•] w.r.t. the all randomness (compression and stochasticity of the the gradients) coming from the step k + 1 of the algorithm and applying the variance decomposition and independence of mini-batch and compression computations on different workers, we get E k g k+1 -∇f (x k+1 ) 2 = (1 -p)E k   g k + 1 G i∈G Q( ∆ k i ) -∇f (x k+1 ) 2   = (1 -p) g k -∇f (x k ) 2 +(1 -p)E k   1 G i∈G (Q( ∆ k i ) -∆ k i ) 2   = (1 -p) g k -∇f (x k ) 2 + 1 -p G 2 i∈G E k Q( ∆ k i ) -∆ k i 2 . Let E Q k [•] denote the expectation w.r.t. the randomness coming from the compression at step k + 1. The definition of the unbiased compression operator (Definition 2.2) implies E k g k+1 -∇f (x k+1 ) 2 = (1 -p) g k -∇f (x k ) 2 + 1 -p G 2 i∈G E k Q( ∆ k i ) 2 - 1 -p G 2 i∈G ∆ k i 2 = (1 -p) g k -∇f (x k ) 2 + 1 -p G 2 i∈G E k E Q k Q( ∆ k i ) 2 - 1 -p G 2 i∈G ∆ k i 2 (3) ≤ (1 -p) g k -∇f (x k ) 2 + (1 -p)(1 + ω) G 2 i∈G E k ∆ k i 2 - 1 -p G 2 i∈G ∆ k i 2 = (1 -p) g k -∇f (x k ) 2 + (1 -p)(1 + ω) G 2 i∈G E k ∆ k i -∆ k i 2 + (1 -p)ω G 2 i∈G ∆ k i 2 = (1 -p) g k -∇f (x k ) 2 + (1 -p)(1 + ω) G 2 i∈G E k ∆ k i -∆ k i 2 + (1 -p)ω G 2 i∈G ∆ k i -∆ k 2 + (1 -p)ω G ∆ k 2 . Using Assumptions 2.4, 2.3, and 2.1 and taking the full expectation, we arrive at E g k+1 -∇f (x k+1 ) 2 ≤ (1 -p)E g k -∇f (x k ) 2 (26) + 1 -p G ωL 2 + ωL 2 ± + (1 + ω)L 2 ± b E x k+1 -x k 2 . That is, we obtained an upper bound for the first term in the right-hand side of (25). To bound the second term, we use the definition of (δ, c)-ARAgg (Definition 2.1) and Lemma E.2: E g k+1 -g k+1 2 = E E k g k+1 -g k+1 2 (2) ≤ E   cδ G(G -1) i,l∈G E k g k+1 i -g k+1 l 2   = cδ G(G -1) i,l∈G E g k+1 i -g k+1 l 2 (21) ≤ 8Bpcδ + 4pζ 2 cδ + A cδE x k+1 -x k 2 , ( ) where A = 8BpL 2 + 4(1 -p) ωL 2 + (1 + ω)L 2 ± + (1+ω)L 2 ± b . 
Plugging ( 26) and ( 27) in (25) and using p ≤ 1, we obtain E g k+1 -∇f (x k+1 ) 2 ≤ (1 -p) 1 + p 2 E g k -∇f (x k ) 2 + 3(1 -p) 2G ωL 2 + ωL 2 ± + (1 + ω)L 2 ± b E x k+1 -x k 2 + 3 p 8Bpcδ + 4pζ 2 cδ + A cδE x k+1 -x k 2 (13) ≤ 1 - p 2 E g k -∇f (x k ) 2 + 24Bcδ + 12ζ 2 cδ + Ap 2 E x k+1 -x k 2 , where A = 2 p 3(1 -p) 2G ωL 2 + ωL 2 ± + (1 + ω)L 2 ± b + 3 p A cδ = 2 p 24BL 2 cδ + 3(1 -p) 4cδ p + 1 2G ωL 2 + (1 + ω)L 2 ± b + 2 p • 3(1 -p) 4cδ(1 + ω) p + ω 2G L 2 ± = 48BL 2 cδ p + 6(1 -p) p 4cδ p + 1 2G ωL 2 + (1 + ω)L 2 ± b + 6(1 -p) p 4cδ(1 + ω) p + ω 2G L 2 ± . This concludes the proof.

E.3 GENERAL NON-CONVEX FUNCTIONS

Theorem E.1 (Generalized version of Theorem 2.1). Let Assumptions 2.1, E.1, 2.3, 2.4 hold. Assume that 0 < γ ≤ 1 L + √ A , δ < p 48cB , where A = 48BL 2 cδ p + 6(1-p) p 4cδ p + 1 2G ωL 2 + (1+ω)L 2 ± b + 6(1-p) p 4cδ(1+ω) p + ω 2G L 2 ± . Then for all K ≥ 0 the iterates produced by Byz-VR-MARINA satisfy E ∇f ( x K ) 2 ≤ 2Φ 0 γ 1 -48Bcδ p (K + 1) + 24cδζ 2 p -48Bcδ , where x K is choosen uniformly at random from x 0 , x 1 , . . . , x K , and Φ 0 = f (x 0 ) -f * + γ p g 0 -∇f (x 0 ) 2 . The result of Theorem 2.1 is a special case of the statement above with B = 0, since for B = 0 we have A = 48BL 2 cδ p + 6(1-p) p 4cδ p + 1 2G ωL 2 + (1+ω)L 2 ± b + 6(1-p) p 4cδ(1+ω) p + ω 2G L 2 ± = 6(1-p) p 4cδ p + 1 2G ωL 2 + (1+ω)L 2 ± b + 6(1-p) p 4cδ(1+ω) p + ω 2G L 2 ± , 1 -48Bcδ p = 1, p -48Bcδ = p, and the second condition from (28) always holds. Proof. For all k ≥ 0 we introduce Φ k = f (x k ) -f * + γ p g k -∇f (x k ) 2 . Using the results of Lemmas E.1 and E.3, we derive E[Φ k+1 ] (20),(23) ≤ E f (x k ) -f * - 1 2γ - L 2 x k+1 -x k 2 + γ 2 g k -∇f (x k ) 2 - γ 2 E ∇f (x k ) 2 + γ p 1 - p 2 E g k -∇f (x k ) 2 + 24Bcδγ p E ∇f (x k ) 2 + 12cδζ 2 γ p + γA 2 E x k+1 -x k 2 = E [Φ k ] - γ 2 1 - 48Bcδ p E ∇f (x k ) 2 + 12cδζ 2 γ p - 1 2γ 1 -Lγ -Aγ 2 E x k+1 -x k 2 ≤ E [Φ k ] - γ 2 1 - 48Bcδ p E ∇f (x k ) 2 + 12cδζ 2 γ p , where in the last step we use Lemma C.1 and our choice of γ from (28). Next, in view of (28), we have γ 2 1 -48Bcδ p > 0. Therefore, summing up the above inequality for k = 0, 1, . . . , K and rearranging the terms, we get 1 K + 1 K k=0 E ∇f (x k ) 2 ≤ 2 γ 1 -48Bcδ p (K + 1) K k=0 (E[Φ k ] -E[Φ k+1 ]) + 24cδζ 2 p -48Bcδ = 2 (E[Φ 0 ] -E[Φ K+1 ]) γ 1 -48Bcδ p (K + 1) + 24cδζ 2 p -48Bcδ Φ K+1 ≥0 ≤ 2E[Φ 0 ] γ 1 -48Bcδ p (K + 1) + 24cδζ 2 p -48Bcδ . It remains to notice, that the lef-hand side equals E[ ∇f ( x K ) 2 ], where x K is choosen uniformly at random from x 0 , x 1 , . . . , x K . 
As shown by Karimireddy et al. (2021), all permutation-invariant algorithms cannot converge to any predefined accuracy even in the homogeneous case. In our paper, we provide a different perspective on this problem: it turns out that a method can be variance-reduced, Byzantine-robust, and permutation-invariant at the same time.

On the differences between

Let us first refine what we mean by permutation-invariance, since our definition slightly differs from the one used by Karimireddy et al. (2021). That is, consider the homogeneous setup and assume that there are no Byzantine workers. We say that an algorithm is permutation-invariant if one can arbitrarily permute the results of the stochastic gradient computations (not necessarily one stochastic gradient computation) between workers at any aggregation step without changing the output of the method. Then, Byz-VR-MARINA is permutation-invariant, since the output in line 10 depends only on g^k and the set {∆̂_i^k}_{i∈[n]} (note that Q(x) = x, since we assume ω = 0), not on their order. Our results do not contradict the ones from Karimireddy et al. (2021), since Karimireddy et al. (2021) assume that the variance of the stochastic gradient is bounded, while we apply variance reduction, implying that the variance goes to zero and Byzantine workers cannot successfully "hide in the noise" via time-coupled attacks anymore. Before we move on to the corollaries, we elaborate on the derived upper bound. In particular, it is important to estimate E[Φ_0]. By definition, Φ_0 = f(x^0) - f* + (γ/p)‖g^0 - ∇f(x^0)‖², i.e., Φ_0 depends on the choice of g^0. For example, one can ask the good workers to compute h_i = ∇f_i(x^0), i ∈ G, and send it to the server. Then, the server can set g^0 = ARAgg(h_1, . . . , h_n). This gives us
E[Φ_0] = f(x^0) - f* + (γ/p) E‖g^0 - ∇f(x^0)‖²
(2) ≤ f(x^0) - f* + (γcδ/(pG(G - 1))) Σ_{i,l∈G, i≠l} ‖∇f_i(x^0) - ∇f_l(x^0)‖²
(11) ≤ f(x^0) - f* + (2γcδ/(pG(G - 1))) Σ_{i,l∈G, i≠l} (‖∇f_i(x^0) - ∇f(x^0)‖² + ‖∇f_l(x^0) - ∇f(x^0)‖²)
= f(x^0) - f* + (4γcδ/(pG)) Σ_{i∈G} ‖∇f_i(x^0) - ∇f(x^0)‖²
(19) ≤ f(x^0) - f* + (4γcδB/p)‖∇f(x^0)‖² + 4γcδζ²/p.
Since the function f is L-smooth, we have ‖∇f(x^0)‖² ≤ 2L(f(x^0) - f*).
Using this and δ < p /(48cB) and γ ≤ 1 /L, we derive E[Φ 0 ] ≤ 1 + 8γcδBL p f (x 0 ) -f * + 4γcδζ 2 p ≤ 1 + γL 6 f (x 0 ) -f * + 4γcδζ 2 p ≤ 2 f (x 0 ) -f * + 4γcδζ 2 p . Plugging this upper bound in (29), we get E ∇f ( x K ) 2 ≤ 4 f (x 0 ) -f * γ 1 -48Bcδ p (K + 1) + 32cδζ 2 p -48Bcδ . Based on this inequality we derive following corollaries. Corollary E.1 (Homogeneous data, no compression (ω = 0)). Let the assumptions of Theorem E.1 hold, Q(x) ≡ x for all x ∈ R d (no compression, ω = 0), p = b /m, B = 0, ζ = 0, and γ = 1 L + L ± 6 4cδm 2 b 3 + m b 2 G Then for all K ≥ 0 we have E ∇f ( x K ) 2 of the order O     L + L ± cδm 2 b 3 + m b 2 G ∆ 0 K     , where x K is choosen uniformly at random from the iterates x 0 , x 1 , . . . , x K produced by Byz-VR-MARINA and ∆ 0 = f (x 0 ) -f * . That is, to guarantee E ∇f ( x K ) 2 ≤ ε 2 for ε 2 > 0 Byz-VR-MARINA requires O     L + L ± cδm 2 b 3 + m b 2 G ∆ 0 ε 2     , (32) communication rounds and O     bL + L ± cδm 2 b + m G ∆ 0 ε 2     , oracle calls per worker. Corollary E.2 (No compression (ω = 0)). Let the assumptions of Theorem E.1 hold, Q(x) ≡ x for all x ∈ R d (no compression, ω = 0), p = b /m and γ = 1 L + 48L 2 Bcδm b + 24cδm 2 b 2 L 2 ± + 6 4cδm 2 b 2 + m bG L 2 ± b Then for all K ≥ 0 we have E ∇f ( x K ) 2 of the order O     L + L 2 Bcδm b + cδm 2 b 2 L 2 ± + cδm 2 b 2 + m bG L 2 ± b ∆ 0 1 -48Bcδm b K + cδζ 2 b m -48Bcδ     , where x K is choosen uniformly at random from the iterates x 0 , x 1 , . . . , x K produced by Byz-VR-MARINA and ∆ 0 = f (x 0 ) -f * . That is, to guarantee E ∇f ( x K ) 2 ≤ ε 2 for ε 2 ≥ 12cδζ 2 p-48Bcδ Byz-VR-MARINA requires O     L + L 2 Bcδm b + cδm 2 b 2 L 2 ± + cδm 2 b 2 + m bG L 2 ± b ∆ 0 1 -48Bcδm b ε 2     , communication rounds and O     bL + L 2 Bcδmb + cδm 2 L 2 ± + cδm 2 + mb G L 2 ± b ∆ 0 1 -48Bcδm b ε 2     , oracle calls per worker. Corollary E.3. 
Let the assumptions of Theorem E.1 hold, p = min{ b /m, 1 /1+ω} and γ = 1 L + √ A , A = 48L 2 Bcδ max m b , 1 + ω +6 4cδ max m 2 b 2 , (1 + ω) 2 + max m b , 1 + ω 2G ωL 2 + (1 + ω)L 2 ± b +6 4cδ(1 + ω) max m 2 b 2 , (1 + ω) 2 + ω max m b , 1 + ω 2G L 2 ± Then for all K ≥ 0 we have E ∇f ( x K ) 2 of the order O   L + √ A ∆ 0 1 -48Bcδ max m b , 1 + ω K + cδζ 2 min b m , 1 1+ω -48Bcδ   , where x K is choosen uniformly at random from the iterates x 0 , x 1 , . . . , x K produced by Byz-VR-MARINA and ∆ 0 = f (x 0 ) -f * . That is, to guarantee E ∇f ( x K ) 2 ≤ ε 2 for ε 2 ≥ 32cδζ 2 p-48Bcδ Byz-VR-MARINA requires O   L + √ A ∆ 0 1 -48Bcδ max m b , 1 + ω ε 2   , communication rounds and O   bL + b √ A ∆ 0 1 -48Bcδ max m b , 1 + ω ε 2   , oracle calls per worker. Corollary E.4 (Homogeneous data). Let the assumptions of Theorem E.1 hold, p = min{ b /m, 1 /1+ω}, B = 0, ζ = 0, and γ = 1 L + √ A , A = 6 3cδ max m 2 b 2 , (1 + ω) 2 + max m b , 1 + ω 2G ωL 2 + (1 + ω)L 2 ± b Then for all K ≥ 0 we have E ∇f ( x K ) 2 of the order O   L + √ A ∆ 0 K   , where x K is choosen uniformly at random from the iterates x 0 , x 1 , . . . , x K produced by Byz-VR-MARINA and ∆ 0 = f (x 0 ) -f * . That is, to guarantee E ∇f ( x K ) 2 ≤ ε 2 for ε 2 > 0 Byz-VR-MARINA requires O   L + √ A ∆ 0 ε 2   , communication rounds and O   bL + b √ A ∆ 0 ε 2   , oracle calls per worker.

E.4 FUNCTIONS SATISFYING POLYAK-ŁOJASIEWICZ CONDITION

Theorem E.2 (Generalized version of Theorem 2.2). Let Assumptions 2.1, E.1, 2.3, 2.4, and 2.5 hold. Assume that
\[
0 < \gamma \le \min\left\{\frac{1}{L + \sqrt{2A}}, \frac{p}{4\mu\left(1 - \frac{96Bc\delta}{p}\right)}\right\}, \qquad \delta < \frac{p}{96cB}, \tag{43}
\]
where
\[
A = \frac{48BL^2 c\delta}{p} + \frac{6(1-p)}{p}\left(\frac{4c\delta}{p} + \frac{1}{2G}\right)\left(\omega L^2 + \frac{(1+\omega)L_{\pm}^2}{b}\right) + \frac{6(1-p)}{p}\left(\frac{4c\delta(1+\omega)}{p} + \frac{\omega}{2G}\right)L_{\pm}^2.
\]
Then for all $K \ge 0$ the iterates produced by Byz-VR-MARINA satisfy
\[
\mathbb{E}\left[f(x^K) - f(x^*)\right] \le \left(1 - \gamma\mu\left(1 - \frac{96Bc\delta}{p}\right)\right)^K \Phi_0 + \frac{24c\delta\zeta^2}{\mu(p - 96Bc\delta)}, \tag{44}
\]
where $\Phi_0 = f(x^0) - f(x^*) + \frac{2\gamma}{p}\|g^0 - \nabla f(x^0)\|^2$.

The result of Theorem 2.2 is a special case of the statement above with $B = 0$, since for $B = 0$ we have
\[
A = \frac{6(1-p)}{p}\left(\frac{4c\delta}{p} + \frac{1}{2G}\right)\left(\omega L^2 + \frac{(1+\omega)L_{\pm}^2}{b}\right) + \frac{6(1-p)}{p}\left(\frac{4c\delta(1+\omega)}{p} + \frac{\omega}{2G}\right)L_{\pm}^2,
\]
$1 - \frac{96Bc\delta}{p} = 1$, $p - 96Bc\delta = p$, and the second condition from (43) always holds.

Proof. For all $k \ge 0$ we introduce $\Phi_k = f(x^k) - f^* + \frac{2\gamma}{p}\|g^k - \nabla f(x^k)\|^2$. Using the results of Lemmas E.1 and E.3, we derive

E[Φ k+1 ]

(20),( 23) ≤ E f (x k ) -f (x * ) - 1 2γ - L 2 x k+1 -x k 2 + γ 2 g k -∇f (x k ) 2 - γ 2 E ∇f (x k ) 2 + 2γ p 1 - p 2 E g k -∇f (x k ) 2 + 48Bcδγ p E ∇f (x k ) 2 + 24cδζ 2 γ p + γAE x k+1 -x k 2 = E f (x k ) -f (x * ) + 2γ p 1 - p 4 E g k -∇f (x k ) 2 - γ 2 1 - 96Bcδ p E ∇f (x k ) 2 + 24cδζ 2 γ p - 1 2γ 1 -Lγ -2Aγ 2 E x k+1 -x k 2 (8) ≤ 1 -γµ 1 - 96Bcδ p E f (x k ) -f (x * ) + 2γ p 1 - p 4 E g k -∇f (x k ) 2 + 24cδζ 2 γ p (43) ≤ 1 -γµ 1 - 96Bcδ p E [Φ k ] + 24cδζ 2 γ p where in the last step we use Lemma C.1 and our choice of γ from (43). Unrolling the recurrence, we obtain E[Φ K ] ≤ 1 -γµ 1 - 96Bcδ p K E [Φ 0 ] + 24cδζ 2 γ p K-1 k=0 1 -γµ 1 - 96Bcδ p k ≤ 1 -γµ 1 - 96Bcδ p K E [Φ 0 ] + 24cδζ 2 γ p ∞ k=0 1 -γµ 1 - 96Bcδ p k = 1 -γµ 1 - 96Bcδ p K E [Φ 0 ] + 24cδζ 2 µ(p -96Bcδ) . Taking into account Φ k ≥ f (x k ) -f (x * ), we get the result. As in the case of general non-convex smooth functions, we need to estimate Φ 0 to derive complexity results. Following exactly the same reasoning as in the derivation of (30), we get E[Φ 0 ] ≤ 2 f (x 0 ) -f (x * ) + 8γcδζ 2 p . Plugging this upper bound in (44), we get E f (x K ) -f (x * ) ≤ 2 1 -γµ 1 - 96Bcδ p K f (x 0 ) -f (x * ) + 1 -γµ 1 - 96Bcδ p K • 8γcδζ 2 p + 24cδζ 2 µ(p -96Bcδ) ≤ 2 1 -γµ 1 - 96Bcδ p K f (x 0 ) -f (x * ) + ∞ k=0 1 -γµ 1 - 96Bcδ p k • 8γcδζ 2 p + 24cδζ 2 µ(p -96Bcδ) ≤ 2 1 -γµ 1 - 96Bcδ p K f (x 0 ) -f (x * ) + 32cδζ 2 µ(p -96Bcδ) . Based on this inequality we derive following corollaries.  + 3m 2b 2 G , b m    K   ∆ 0   , ( ) where ∆ 0 = f (x 0 ) -f (x * ). That is, to guarantee E f (x K ) -f (x * ) ≤ ε for ε > 0 Byz-VR- MARINA requires O   max    L + L ± cδm 2 b 3 + m b 2 G µ , m b    log ∆ 0 ε   , communication rounds and O   max    bL + L ± cδm 2 b + m G µ , m    log ∆ 0 ε   , oracle calls per worker.  oracle calls per worker.
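The unrolling step in the proof above can be sanity-checked numerically: iterating the (tight) recursion $\Phi_{k+1} = (1-\rho)\Phi_k + \tau$ with $\rho = \gamma\mu(1 - 96Bc\delta/p)$ and $\tau = 24c\delta\zeta^2\gamma/p$ never exceeds the closed-form bound $(1-\rho)^K\Phi_0 + \tau/\rho$, and the iterates approach the neighborhood $\tau/\rho = 24c\delta\zeta^2/(\mu(p - 96Bc\delta))$. A minimal sketch with illustrative parameter values (not taken from the paper):

```python
# Numeric sanity check of the unrolled recursion:
# Phi_{k+1} <= (1 - rho) * Phi_k + tau  implies  Phi_K <= (1 - rho)^K * Phi_0 + tau / rho.
rho, tau, phi0 = 0.05, 0.01, 10.0    # illustrative values only

phi = phi0
K = 200
for _ in range(K):
    phi = (1 - rho) * phi + tau      # one step of the recursion, taken with equality

bound = (1 - rho) ** K * phi0 + tau / rho
assert phi <= bound                   # the closed-form bound dominates the recursion
assert abs(phi - tau / rho) < 1e-2    # iterates settle into the neighborhood tau/rho = 0.2
```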

E.5 FURTHER DETAILS ON THE OBTAINED RESULTS AND THE COMPARISON FROM TABLE 2

In this part, we discuss additional details about the obtained results and about the comparison of the methods' complexities given in Table 2. We mostly focus on the results for general smooth non-convex functions; similar observations are valid for smooth PŁ functions as well.

Comparison of the assumptions on the stochastic gradient noise. Many existing works rely on the uniformly bounded variance (UBV) assumption: for all $x \in \mathbb{R}^d$, the good workers are assumed to have access to unbiased estimators $g_i(x)$ of $\nabla f_i(x)$ such that $\mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 \le \sigma^2$ for all $i \in G$ and some $\sigma \ge 0$. This assumption fails in many practical situations, even for simple convex finite-sum problems such as sums of quadratic functions with non-identical Hessians. Moreover, even in the situations when this assumption holds, the value of $\sigma^2$ can be huge. On the other hand, the UBV assumption does not require the individual stochastic realizations, i.e., the summands $f_{i,j}$, to be smooth. In contrast, we use Assumption 2.4, which holds in many situations where the UBV assumption does not. For example, Assumption 2.4 holds whenever all functions $f_{i,j}$, $i \in G$, $j \in [m]$, are $L_{i,j}$-smooth (see Appendix E.1). These facts allow us to cover a large class of problems that do not fit the setup considered in (Karimireddy et al., 2021; 2022; Gorbunov et al., 2021a). Moreover, since Assumption 2.4 is more general than smoothness of all $f_{i,j}$, our analysis covers the setup considered in (Wu et al., 2020; Zhu & Ling, 2021). However, it is worth mentioning that there exist problems for which the UBV assumption holds and Assumption 2.4 does not, e.g., when the gradient noise is additive: $\Delta_i(x, y) = \nabla f_i(x) - \nabla f_i(y) + \xi_i$, where $\mathbb{E}\xi_i = 0$ and $\mathbb{E}\|\xi_i\|^2 = \sigma^2$.

On the choice of $p$. Our analysis is valid for any choice of $p \in (0, 1]$.
As we explain in footnote 2, the choice $p = \min\{b/m, 1/(1+\omega)\}$ leads to a fair comparison with other results, since it implies that the total expected (communication and oracle) cost of the steps with full-gradient computations/uncompressed communication coincides with the total cost of the remaining iterations. Indeed, to measure the communication efficiency, one can use the expected density $\zeta_Q$ (see Definition 2.2). The expected number of components that each worker sends to the server at each step is upper-bounded by $\zeta_Q(1-p) + pd$, meaning that $p = \zeta_Q/d$ makes this expected number equal to $O(\zeta_Q)$. In the case when $1 + \omega = \Theta(d/\zeta_Q)$ (which is the case for RandK sparsification and $\ell_2$-quantization, see (Beznosikov et al., 2020)), one can choose $p = 1/(1+\omega)$. On the other hand, the expected number of oracle calls per iteration is $2b(1-p) + mp$, meaning that $p = b/m$ makes the expected oracle cost of each iteration equal to $O(b)$, as for SGD. Hence, the best $p$ for the oracle complexity and the best $p$ for the communication efficiency are different in general. When $b/m < 1/(1+\omega)$, the choice $p = \min\{b/m, 1/(1+\omega)\}$ implies that the algorithm could use uncompressed vectors more often without sacrificing the communication cost, and when $b/m > 1/(1+\omega)$, it implies that the algorithm could use full gradients more often without sacrificing the oracle cost. That is, the choice $p \approx b/m$ yields a better oracle complexity, while $p \approx 1/(1+\omega)$ leads to better communication efficiency. Depending on how important these two aspects are for a particular application, one can choose $p$ between these two values. The effect of compression. The above result shows that the communication complexity becomes worse as $\omega$ grows. A larger $\omega$ means that the compressor $Q$ is coarser, i.e., less information is communicated. This is
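The cost accounting in the discussion of the choice of $p$ can be sketched numerically. The helper `expected_costs` and the parameter values below are our own illustrative assumptions, not quantities from the paper:

```python
def expected_costs(p, d, b, m, zeta_Q):
    """Expected per-iteration costs for one worker (illustrative sketch).
    With probability p the worker sends a full d-dimensional vector and makes m
    oracle calls; otherwise it sends a compressed difference of expected density
    zeta_Q and makes 2b oracle calls."""
    comm = p * d + (1 - p) * zeta_Q      # expected number of components sent
    oracle = p * m + (1 - p) * 2 * b     # expected number of stochastic-gradient computations
    return comm, oracle

# RandK with K = zeta_Q components gives 1 + omega = d / K, so 1/(1+omega) = K/d.
d, K, b, m = 1000, 100, 10, 10_000
omega = d / K - 1
p = min(b / m, 1 / (1 + omega))
comm, oracle = expected_costs(p, d, b, m, zeta_Q=K)
assert p == 0.001          # here b/m is the binding term in the min
assert comm < 2 * K        # communication stays O(zeta_Q) per iteration
assert oracle < 3 * b      # oracle cost stays O(b) per iteration
```

With these numbers, $b/m < 1/(1+\omega)$, so the oracle-friendly choice binds; swapping, e.g., `K = 5` makes $1/(1+\omega)$ bind instead.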



Footnotes:
• This term is standard in the distributed learning literature (Lamport et al., 1982; Su & Vaidya, 2016; Lyu et al., 2020). By using it, we follow standard terminology and do not intend to offend any group. It would be great if the community found and agreed on a more neutral term to denote such workers.
• In all the results and proofs, $\mathbb{E}[\cdot]$ denotes the full expectation unless specified otherwise.
• To have a fair comparison, we take $p = \min\{b/m, 1/(1+\omega)\}$, since in this case each worker sends $O(\zeta_Q)$ components at each iteration when $\omega + 1 = \Theta(d/\zeta_Q)$ (which is the case for RandK sparsification and $\ell_2$-quantization, see (Beznosikov et al., 2020)) and makes $O(b)$ oracle calls (computations of $\nabla f_{i,j}(x)$) in expectation. With such a choice of $p$, the total expected (communication and oracle) cost of the steps with full-gradient computations/uncompressed communication coincides with the total cost of the remaining iterations.
• For simplicity, we assume that $n$ is divisible by $s$.
• We note that this assumption on the form of $\Delta_i(x, y)$ is very mild and holds for standard sampling strategies, including uniform and importance sampling. We refer to (Gower et al., 2019) for more examples.
• By communication complexity we mean the total number of communication rounds needed for the algorithm to find a point $x$ such that $\mathbb{E}\|\nabla f(x)\|^2 \le \varepsilon^2$.



Notation: $\varepsilon$ = desired accuracy; $\delta$ = ratio of Byzantine workers; $c$ = parameter of the robust aggregator; $n$ = total number of workers; $b$ = batchsize; $\sigma^2$ = uniform bound on the variance of the stochastic gradients; $D^2$ = uniform bound on the second moment of the stochastic gradients; $C$ = the number of workers used by BTARD-SGD for the checks of computations after each step; $\mu$ = parameter from As. 2.5 (strong convexity parameter in the case of BTARD-SGD, Byrd-SAGA, BR-CSGD, BR-CSAGA, BROADCAST); $m$ = size of the local dataset on each worker; $p = \min\{b/m, 1/(1+\omega)\}$ = probability of communication in Byz-VR-MARINA.

Algorithm 1 (fragment): sample $c^k$ from the Bernoulli distribution with parameter $p$, $c^k \sim \mathrm{Be}(p)$; broadcast $g^k$, $c^k$ to all workers.

Figure 1: The optimality gap $f(x^k) - f(x^*)$ for 3 aggregation rules (AVG, CM, RFA) under 5 attacks (NA, LF, BF, ALIE, IPM) on the a9a dataset, where each worker has access to the full dataset, with 4 good workers and 1 Byzantine worker. In the first row, we do not use any compression; in the second row, each method uses RandK sparsification with $K = 0.1d$.

Figure 2: The optimality gap $f(x^k) - f(x^*)$ for 3 aggregation rules (AVG, CM, RFA) under 5 attacks (NA, LF, BF, ALIE, IPM) on the a9a dataset split uniformly over 15 workers, 5 of which are Byzantine. The top row displays the best performance in hindsight for a given attack.

Figure 6: The optimality gap $f(x^k) - f(x^*)$ for 3 aggregation rules (AVG, CM, RFA) under 5 attacks (NA, LF, BF, ALIE, IPM) on the w8a dataset, where each worker has access to the full dataset, with 4 good workers and 1 Byzantine worker. In the first row, we do not use any compression; in the second row, each method uses RandK sparsification with $K = 0.1d$.

Figure 7: The optimality gap $f(x^k) - f(x^*)$ under 2 attacks (NA, ALIE) on the a9a dataset, where each worker has access to the full dataset, with 4 good workers and 1 Byzantine worker. No compression is applied.

(Modification of Theorem I from (Karimireddy et al., 2022)). Let $\{x_1, x_2, \ldots, x_n\}$ satisfy the conditions of Lemma D.1 for some $\delta \le \delta_{\max}$. If Algorithm 2 is run with $s = \delta_{\max}/\delta$, then
• Krum ∘ Bucketing satisfies Definition 2.1 with $c = O(1)$ and $\delta_{\max} < 1/4$,
• RFA ∘ Bucketing satisfies Definition 2.1 with $c = O(1)$ and $\delta_{\max} < 1/2$,
• CM ∘ Bucketing satisfies Definition 2.1 with $c = O(d)$ and $\delta_{\max} < 1/2$.
Proof. The proof is identical to the proof of Theorem I from (Karimireddy et al., 2022), since Karimireddy et al. (2022) rely only on the general properties of Krum/RFA/CM and (18) to get the result.
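As an illustration, the bucketing procedure composed with a base aggregator can be sketched as follows. The helpers `bucketing_aggregate` and `coordinate_wise_median` are our own toy implementations in the spirit of Algorithm 2 and the CM rule, not the paper's code:

```python
import random
import statistics

def bucketing_aggregate(vectors, s, base_aggregator, seed=0):
    """Bucketing sketch: randomly permute the n input vectors, average them
    within buckets of size s, and feed the n/s bucket means to a base robust
    aggregator. Assumes n is divisible by s (as in the paper's footnote)."""
    vecs = vectors[:]
    random.Random(seed).shuffle(vecs)
    buckets = [vecs[i:i + s] for i in range(0, len(vecs), s)]
    means = [[sum(coord) / len(bucket) for coord in zip(*bucket)]
             for bucket in buckets]
    return base_aggregator(means)

def coordinate_wise_median(vectors):
    # CM: take the median of each coordinate across the input vectors.
    return [statistics.median(coord) for coord in zip(*vectors)]

# 5 honest workers near (1, 1) and 1 Byzantine outlier: the outlier lands in a
# single bucket, so the bucket means are majority-honest and CM recovers (1, 1).
grads = [[1.0, 1.0]] * 5 + [[100.0, -100.0]]
agg = bucketing_aggregate(grads, s=2, base_aggregator=coordinate_wise_median)
assert agg == [1.0, 1.0]
```

The point of bucketing is visible here: after averaging within buckets, at most a $\delta s$ fraction of the bucket means is contaminated, which lets plain CM/RFA/Krum satisfy the robustness definition.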

Byz-VR-MARINA and momentum-based methods. Karimireddy et al. (2021) use momentum and momentum-based variance reduction in order to prevent the algorithm from being permutation-invariant, since in the setup considered by Karimireddy et al. (

(Homogeneous data, no compression ($\omega = 0$)). Let the assumptions of Theorem E.2 hold, $Q(x) \equiv x$ for all $x \in \mathbb{R}^d$ (no compression, $\omega = 0$), $p = b/m$, $B = 0$, and $\zeta = 0$. Then for all $K \ge 0$ we have $\mathbb{E}[f(x^K) - f(x^*)]$ of the order

(No compression ($\omega = 0$)). Let the assumptions of Theorem E.2 hold, $Q(x) \equiv x$ for all $x \in \mathbb{R}^d$ (no compression, $\omega = 0$), and $p = b/m$. Then for all $K \ge 0$ we have $\mathbb{E}[f(x^K) - f(x^*)]$ of the order (48), where $\Delta_0 = f(x^0) - f(x^*)$. That is, to guarantee $\mathbb{E}[f(x^K) - f(x^*)] \le \varepsilon$ for $\varepsilon \ge \frac{32c\delta\zeta^2}{\mu\left(\frac{b}{m} - 96Bc\delta\right)}$, Byz-VR-MARINA requires
\[
O\left(\max\left\{\frac{bL + \sqrt{L^2 B c\delta m b + c\delta m^2 L_{\pm}^2 + \left(c\delta m^2 + \frac{mb}{G}\right)\frac{L_{\pm}^2}{b}}}{\mu}, m\right\}\log\frac{\Delta_0}{\varepsilon}\right)
\]
oracle calls per worker.

For simplicity, consider the homogeneous case ($B = 0$, $\zeta = 0$) and let $b = 1$; similar arguments are valid for the general case. As Corollary E.4 states, the communication complexity of Byz-VR-MARINA in this case equals $O\left(\frac{(L + \sqrt{A})\Delta_0}{\varepsilon^2}\right)$, where
\[
A = 6\left(3c\delta \max\left\{m^2, (1+\omega)^2\right\} + \frac{\max\{m, 1+\omega\}}{2G}\right)\left(\omega L^2 + (1+\omega)L_{\pm}^2\right).
\]

Comparison of the state-of-the-art complexity results for Byzantine-tolerant distributed methods. Columns: "Assumptions" = additional assumptions on top of smoothness of all $f_i(x)$, $i \in G$ (although our results require the more refined As. 2.3); "Complexity (NC)" and "Complexity (PŁ)" = the number of communication rounds required to find $x$ such that $\mathbb{E}\|\nabla f(x)\|^2 \le \varepsilon^2$ in the general non-convex case and such that $\mathbb{E}[f(x) - f(x^*)] \le \varepsilon$ in the PŁ case, respectively. Dependencies on numerical constants (and logarithms in the PŁ setting), smoothness constants, and the initial suboptimality are omitted in the complexity bounds. Although BR-SGDm, BR-MVR, BTARD-SGD, Byrd-SAGA, BR-CSGD, BR-CSAGA, and BROADCAST are analyzed for unit batchsize only ($b = 1$), one can easily generalize them to the case of $b > 1$, and we show these generalizations in the table.

defer further details and additional experiments with heterogeneous data to Appendix B.

. Moreover, Karimireddy et al. (2021) propose a reasonable formalism for describing robust aggregation rules (see Def. 2.1) and show that almost all previously known defences are not robust according to this formalism. In addition, they propose and analyze new Byzantine-tolerant methods based on the usage of Polyak's momentum (Polyak, 1964) (BR-SGDm) and momentum variance reduction (Cutkosky & Orabona, 2019) (BR-MVR). This approach is extended to the case of heterogeneous data and aggregators agnostic to the noise level by Karimireddy et al. (2022), and He et al. (

∘ Bucketing, RFA ∘ Bucketing, and CM ∘ Bucketing in terms of Definition 2.1. Lemma D.1 (Modification of Lemma 1 from (Karimireddy et al., 2022)). Assume that $\{x_1, x_2, \ldots, x_n\}$ is such that there exist a subset $G \subseteq [n]$, $|G| = G \ge (1-\delta)n$, and $\sigma \ge 0$ such that (14) holds. Let the vectors $\{y_1, \ldots, y_N\}$, $N = n/s$, be generated by Algorithm 2 and

ACKNOWLEDGEMENTS

We thank anonymous reviewers for useful suggestions regarding additional experiments and discussion of the derived results. The work of E. Gorbunov was partially supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138. The work of P. Richtárik was partially supported by the KAUST Baseline Research Fund Scheme and by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence.


Corollary E.7. Let the assumptions of Theorem E.2 hold, $p = \min\{b/m, 1/(1+\omega)\}$, and
\[
\gamma = \min\left\{\frac{1}{L + \sqrt{2A}}, \frac{p}{4\mu\left(1 - \frac{96Bc\delta}{p}\right)}\right\},
\]
where $A$ is as in Theorem E.2 with this choice of $p$. Then, to guarantee $\mathbb{E}[f(x^K) - f(x^*)] \le \varepsilon$, Byz-VR-MARINA requires
\[
O\left(\max\left\{\frac{L + \sqrt{A}}{\mu\left(1 - 96Bc\delta\max\left\{\frac{m}{b}, 1+\omega\right\}\right)}, \max\left\{\frac{m}{b}, 1+\omega\right\}\right\}\log\frac{\Delta_0}{\varepsilon}\right)
\]
communication rounds and $b$ times as many oracle calls per worker.

Corollary E.8 (Homogeneous data). Let the assumptions of Theorem E.2 hold, $p = \min\{b/m, 1/(1+\omega)\}$, $B = 0$, $\zeta = 0$, and $\gamma = \min\left\{\frac{1}{L + \sqrt{2A}}, \frac{p}{4\mu}\right\}$, where $A$ is as in Corollary E.4. Then, to guarantee $\mathbb{E}[f(x^K) - f(x^*)] \le \varepsilon$, Byz-VR-MARINA requires
\[
O\left(\max\left\{\frac{L + \sqrt{A}}{\mu}, \max\left\{\frac{m}{b}, 1+\omega\right\}\right\}\log\frac{\Delta_0}{\varepsilon}\right)
\]
communication rounds and $b$ times as many oracle calls per worker.

a common phenomenon for methods with communication compression (Horváth et al., 2019b; Gorbunov et al., 2021b). However, when the compression is not too severe, e.g., $1 + \omega \le m$, the resulting complexity bound is only $O(\sqrt{\omega})$ times worse than the complexity of Byz-VR-MARINA without compression, while the number of communicated bits/components becomes $O(d/\zeta_Q)$ times smaller. For example, for RandK sparsification we have $1 + \omega = d/K = d/\zeta_Q$, and the condition $1 + \omega \le m$ still allows quite strong compression: e.g., when $m \ge 1000$, i.e., each local dataset has at least 1000 samples, workers can send just 0.1% of the components. In this case, the communication cost of each iteration becomes $\sim 1000$ times cheaper, while the number of communication rounds increases only $\sqrt{1+\omega} \sim 30$ times. If communication is the bottleneck, the algorithm therefore converges much faster with compression than without it in this setup.

On the batchsizes. First, we note that our analysis is valid for any choice of $b \ge 1$. For simplicity of the discussion of the role of the batchsize in the complexities, consider the homogeneous case ($B = 0$, $\zeta = 0$) without compression ($\omega = 0$). As Corollary E.1 states, the communication complexity of Byz-VR-MARINA in this case equals $O\left(\left(L + L_{\pm}\sqrt{\frac{c\delta m^2}{b^3} + \frac{m}{b^2 G}}\right)\frac{\Delta_0}{\varepsilon^2}\right)$. Note that the term depending on the ratio of Byzantine workers $\delta$ scales as $b^{-3/2}$ with the batchsize, while the term depending on $1/G$ scales as $b^{-1}$.
Table 2 illustrates that the previous SOTA results in this case scale as $b^{-1}$ or $b^{-1/2}$, so the complexity bound for Byz-VR-MARINA scales with $b$ no worse than the concurrent bounds. Next, for SARAH-based variance-reduced methods there is typically no need to take $b$ larger than $\sqrt{m}$ (Horváth et al., 2022; Li et al., 2021): the oracle complexity is always the same (neglecting differences in the smoothness constants), while the iteration complexity stops improving once $b$ becomes larger than $\sqrt{m}$. However, the complexity bound for Byz-VR-MARINA contains the non-standard term $O\left(\frac{L_{\pm} m\sqrt{c\delta}}{b^{3/2}\varepsilon^2}\right)$ appearing due to the presence of Byzantine workers. For simplicity, we assume that $L = \Theta(L_{\pm})$ (though $L_{\pm}$ can be both smaller and larger than $L$). Then, when we increase the batchsize $b$, the communication complexity stops improving once $b$ becomes larger than $\max\{\sqrt[3]{c\delta m^2}, \sqrt{m}\}$. Interestingly, $\sqrt[3]{c\delta m^2}$ can be larger than the standard value $\sqrt{m}$: this is the case when $m > \frac{1}{c^2\delta^2}$. In this case, the communication complexity of Byz-VR-MARINA benefits from slightly larger batchsizes than in the classical case. This phenomenon has a natural explanation: when we increase the batchsize, the variance of the gradient noise decreases, and it becomes even harder for Byzantine workers to significantly shift the updates of the method.
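The batchsize saturation effect described above can be sketched numerically. The helper `comm_complexity_terms` and the parameter values are our own illustrative assumptions (we set $G = 1$ and $L = L_{\pm} = 1$ and drop all constants):

```python
def comm_complexity_terms(b, m, c_delta, L=1.0, L_pm=1.0):
    """Dominant terms of the homogeneous, no-compression communication
    complexity (Corollary E.1), up to constants: L plus the Byzantine term
    L_pm * sqrt(c_delta * m^2 / b^3) and the sampling term L_pm * sqrt(m / b^2),
    with G set to 1 purely for illustration."""
    byz = L_pm * (c_delta * m ** 2 / b ** 3) ** 0.5    # scales as b^(-3/2)
    samp = L_pm * (m / b ** 2) ** 0.5                  # scales as b^(-1)
    return L + byz + samp

m, c_delta = 10_000, 0.01
# Saturation point: b ~ max{(c*delta)^(1/3) * m^(2/3), sqrt(m)} = 100 here.
assert abs(comm_complexity_terms(25, m, c_delta) - 13.0) < 1e-6    # Byzantine term dominates
assert abs(comm_complexity_terms(100, m, c_delta) - 3.0) < 1e-6    # terms balance near b*
assert abs(comm_complexity_terms(400, m, c_delta) - 1.375) < 1e-6  # L dominates; little further gain
```

Increasing $b$ past the saturation point only inflates the per-round oracle cost while the round count barely improves, which is the trade-off the paragraph above describes.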

