DASHA: DISTRIBUTED NONCONVEX OPTIMIZATION WITH COMMUNICATION COMPRESSION AND OPTIMAL ORACLE COMPLEXITY

Abstract

We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems. When the local functions at the nodes have a finite-sum or an expectation form, our new methods, DASHA-PAGE, DASHA-MVR and DASHA-SYNC-MVR, improve the theoretical oracle and communication complexity of the previous state-of-the-art method MARINA by Gorbunov et al. (2021). In particular, to achieve an $\varepsilon$-stationary point, and considering the random sparsifier RandK as an example, our methods compute the optimal number of gradients $\mathcal{O}\!\left(\sqrt{m}/(\varepsilon\sqrt{n})\right)$ and $\mathcal{O}\!\left(\sigma/(\varepsilon^{3/2}n)\right)$ in the finite-sum and expectation form cases, respectively, while maintaining the SOTA communication complexity $\mathcal{O}\!\left(d/(\varepsilon\sqrt{n})\right)$. Furthermore, unlike MARINA, the new methods DASHA, DASHA-PAGE and DASHA-MVR send compressed vectors only, which makes them more practical for federated learning. We extend our results to the case when the functions satisfy the Polyak-Łojasiewicz condition. Finally, our theory is corroborated in practice: we see a significant improvement in experiments with nonconvex classification and training of deep learning models.

1. INTRODUCTION

Nonconvex optimization problems are widespread in modern machine learning tasks, especially with the rise in popularity of deep neural networks (Goodfellow et al., 2016). In the past years, the dimensionality of such problems has increased because this leads to better quality (Brown et al., 2020) and robustness (Bubeck & Sellke, 2021) of the deep neural networks trained this way. Such huge-dimensional nonconvex problems need special treatment and efficient optimization methods (Danilova et al., 2020). Because of their high dimensionality, training such models is a computationally intensive undertaking that requires massive training datasets (Hestness et al., 2017) and parallelization among several compute nodes (Ramesh et al., 2021). The distributed learning paradigm is also a necessity in federated learning (Konečný et al., 2016), where, among other things, there is an explicit desire to secure the private data of each client. Unlike classical optimization problems, where the performance of algorithms is measured by their computational complexity (Nesterov, 2018), distributed optimization algorithms are typically measured in terms of the communication overhead between the nodes, since such communication is often the bottleneck in practice (Konečný et al., 2016; Wang et al., 2021). Many approaches tackle this problem, including managing communication delays (Vogels et al., 2021), mitigating stragglers (Li et al., 2020a), and optimization over time-varying directed graphs (Nedić & Olshevsky, 2014). Another popular way to alleviate the communication bottleneck is lossy compression of the communicated messages (Alistarh et al., 2017; Mishchenko et al., 2019; Gorbunov et al., 2021; Szlendak et al., 2021). In this paper, we focus on this last approach.

1.1. PROBLEM FORMULATION

In this work, we consider the optimization problem

$$\min_{x\in\mathbb{R}^d}\left\{ f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)\right\},\qquad (1)$$

where $f_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth nonconvex function for all $i \in [n] := \{1, \dots, n\}$. Moreover, we assume that the problem is solved by $n$ compute nodes, with the $i$-th node having access to the function $f_i$ only, via an oracle. Communication is facilitated by an orchestrating server able to communicate with all nodes. Our goal is to find an $\varepsilon$-solution ($\varepsilon$-stationary point) of (1): a (possibly random) point $\widehat{x} \in \mathbb{R}^d$ such that $\mathbb{E}\|\nabla f(\widehat{x})\|^2 \le \varepsilon$.
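A minimal numerical sketch of this setup (the toy local functions and all names below are our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 4  # toy dimension and number of nodes

# Each node i holds a smooth nonconvex local function; here
# f_i(x) = 0.5*||A_i x||^2 + sum(cos(x)) is a toy instance of our own.
A = [rng.standard_normal((d, d)) for _ in range(n)]

def grad_f_i(i, x):
    # gradient of the toy f_i
    return A[i].T @ (A[i] @ x) - np.sin(x)

def grad_f(x):
    # gradient of f(x) = (1/n) * sum_i f_i(x)
    return sum(grad_f_i(i, x) for i in range(n)) / n

def is_eps_stationary(x, eps):
    # epsilon-stationary point: ||grad f(x)||^2 <= eps
    g = grad_f(x)
    return float(g @ g) <= eps
```

At $x = 0$ the toy gradient vanishes, so the check succeeds there and fails at a generic point.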

1.2. GRADIENT ORACLES

We consider all of the following structural assumptions about the functions $\{f_i\}_{i=1}^n$, each with its own natural gradient oracle:

1. Gradient Setting. The $i$-th node has access to the gradient $\nabla f_i : \mathbb{R}^d \to \mathbb{R}^d$ of the function $f_i$.

2. Finite-Sum Setting. The functions $\{f_i\}_{i=1}^n$ have the finite-sum form

$$f_i(x) = \frac{1}{m}\sum_{j=1}^m f_{ij}(x), \quad \forall i \in [n], \qquad (2)$$

where $f_{ij} : \mathbb{R}^d \to \mathbb{R}$ is a smooth nonconvex function for all $j \in [m]$. For all $i \in [n]$, the $i$-th node has access to a mini-batch of $B$ gradients, $\frac{1}{B}\sum_{j\in I_i}\nabla f_{ij}(\cdot)$, where $I_i$ is a multiset of i.i.d. samples from $[m]$ with $|I_i| = B$.

3. Stochastic Setting. The function $f_i$ is an expectation of a stochastic function,

$$f_i(x) = \mathbb{E}_{\xi}\left[f_i(x; \xi)\right], \quad \forall i \in [n], \qquad (3)$$

where $f_i : \mathbb{R}^d \times \Omega_\xi \to \mathbb{R}$. For a fixed $x \in \mathbb{R}^d$, $f_i(x; \xi)$ is a random variable over some distribution $\mathcal{D}_i$, and, for a fixed $\xi \in \Omega_\xi$, $f_i(\cdot\,; \xi)$ is a smooth nonconvex function. The $i$-th node has access to a mini-batch of $B$ stochastic gradients $\frac{1}{B}\sum_{j=1}^B \nabla f_i(\cdot\,; \xi_{ij})$ of the function $f_i$ through the distribution $\mathcal{D}_i$, where $\{\xi_{ij}\}_{j=1}^B$ is a collection of i.i.d. samples from $\mathcal{D}_i$.

1.3. ORACLE COMPLEXITY

In this paper, the oracle complexity of a method is the number of (stochastic) gradient calculations per node needed to achieve an $\varepsilon$-solution. Every considered method performs some number $T$ of communication rounds to get an $\varepsilon$-solution; thus, if every node (on average) calculates $B$ gradients in each communication round, then the oracle complexity equals $\mathcal{O}(B_{\mathrm{init}} + BT)$, where $B_{\mathrm{init}}$ is the number of gradient calculations in the initialization phase of the method.

1.4. UNBIASED COMPRESSORS

The methods proposed in this paper are based on unbiased compressors: a family of stochastic mappings with special properties that we define now.

Definition 1.1. A stochastic mapping $C : \mathbb{R}^d \to \mathbb{R}^d$ is an unbiased compressor if there exists $\omega \in \mathbb{R}$ such that

$$\mathbb{E}\left[C(x)\right] = x, \qquad \mathbb{E}\|C(x) - x\|^2 \le \omega\|x\|^2, \quad \forall x \in \mathbb{R}^d.$$

We denote this class of unbiased compressors by $\mathbb{U}(\omega)$. More information about unbiased compressors can be found in (Beznosikov et al., 2020; Horváth et al., 2019). The purpose of such compressors is to quantize or sparsify the communicated vectors in order to increase the communication speed between the nodes and the server. Our methods work with a collection of stochastic mappings $\{C_i\}_{i=1}^n$ satisfying the following assumption.

Assumption 1.2. $C_i \in \mathbb{U}(\omega)$ for all $i \in [n]$, and the compressors are independent.

Table 1: General Nonconvex Case. The number of communication rounds (iterations) and the oracle complexity of algorithms to get an $\varepsilon$-solution ($\mathbb{E}\|\nabla f(\widehat{x})\|^2 \le \varepsilon$), and the necessity (or not) of algorithms to send non-compressed vectors periodically (see Section 3).
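As a concrete instance of Definition 1.1, the RandK sparsifier used throughout the paper belongs to $\mathbb{U}(d/K - 1)$; a minimal sketch with an empirical check of both properties (the implementation details are our own):

```python
import numpy as np

def rand_k(x, k, rng):
    """RandK sparsifier: keep k uniformly chosen coordinates of x,
    scaled by d/k so that the mapping is unbiased; zero out the rest."""
    d = x.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]
    return out

# Empirical check of Definition 1.1: RandK belongs to U(d/K - 1).
rng = np.random.default_rng(1)
d, k = 10, 2
omega = d / k - 1
x = rng.standard_normal(d)
samples = np.stack([rand_k(x, k, rng) for _ in range(200_000)])
mean_err = np.linalg.norm(samples.mean(axis=0) - x)  # should be close to 0
var = np.mean(np.sum((samples - x) ** 2, axis=1))    # ~= omega * ||x||^2
```

For RandK the variance bound holds with equality, which is why the Monte-Carlo estimate lands on $\omega\|x\|^2$.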


[Recovered fragment of Table 1; the remaining rows were lost in extraction.]
Stochastic (3) | DASHA-SYNC-MVR (Cor. 6.10) | $T = \frac{1+\omega/\sqrt{n}}{\varepsilon} + \frac{\sigma^2}{\varepsilon n B} + \frac{\sigma}{\varepsilon^{3/2} n B}$ | Oracle: $B\omega$ (c) $+ BT$ | Full: Yes
(a) Only dependencies w.r.t. the following variables are shown: $\omega$ = quantization parameter, $n$ = # of nodes, $m$ = # of local functions (only in the finite-sum case (2)), $\sigma^2$ = variance of stochastic gradients (only in the stochastic case (3)), $B$ = batch size (only in the finite-sum and stochastic cases). To simplify the bounds, we assume that $\omega + 1 = \Theta(d/\zeta_C)$, where $d$ is the dimension of $x$ in (1) and $\zeta_C$ is the expected number of nonzero coordinates that each compressor $C_i$ returns (see Definition 1.3).
(b) Does the algorithm periodically send full (non-compressed) vectors? (see Section 3)
(c) One can always choose the parameter of RandK such that this term does not dominate (see Section 6.5).

1.5. COMMUNICATION COMPLEXITY

The quantity below characterizes the number of nonzero coordinates that a compressor $C$ returns. This notion is useful in the case of sparsification compressors.

Definition 1.3. For a compressor $C$, let $\zeta_C$ denote the expected number of nonzero coordinates of $C(x)$.

In this paper, the communication complexity of a method is the number of coordinates sent to the server per node to achieve an $\varepsilon$-solution. If every node (on average) sends $\zeta$ coordinates in each communication round, then the communication complexity equals $\mathcal{O}(\zeta_{\mathrm{init}} + \zeta T)$, where $T$ is the number of communication rounds and $\zeta_{\mathrm{init}}$ is the number of coordinates sent in the initialization phase. We note that the established communication complexities are compared to previous upper bounds from (Gorbunov et al., 2021; Szlendak et al., 2021; Mishchenko et al., 2019; Alistarh et al., 2017), and in this line of work, the comparisons of communication complexities are made with respect to the number of sent coordinates. As far as we know, no lower bounds are proved in this sense, and such bounds deserve a separate piece of work. However, Korhonen & Alistarh (2021) proved lower bounds on the communication complexity with respect to the number of sent bits in the constrained optimization setting $x \in [0, 1]^d$, so our upper bounds cannot be directly compared to their result because we operate on a different level of abstraction.

2. RELATED WORK

• Uncompressed communication. This line of work is characterized by methods in which the nodes send messages (vectors) to the server without any compression. In the finite-sum setting, the current state-of-the-art methods were proposed by Sharma et al. (2019); Li et al. (2021b), showing that after $\mathcal{O}(1/\varepsilon)$ communication rounds and

$$\mathcal{O}\!\left(m + \frac{\sqrt{m}}{\varepsilon\sqrt{n}}\right) \qquad (5)$$

calculations of $\nabla f_{ij}$ per node, these methods can return an $\varepsilon$-solution. Moreover, Sharma et al. (2019) show that the same can be done in the stochastic setting after

$$\mathcal{O}\!\left(\frac{\sigma^2}{\varepsilon n} + \frac{\sigma}{\varepsilon^{3/2} n}\right) \qquad (6)$$

stochastic gradient calculations per node. Note that the complexities (5) and (6) are optimal (Arjevani et al., 2019; Fang et al., 2018; Li et al., 2021a). An adaptive variant was proposed by Khanduri et al. (2020) based on the work of Cutkosky & Orabona (2019). See also (Khanduri et al., 2021; Murata & Suzuki, 2021).

Table 2: Polyak-Łojasiewicz Case. The number of communication rounds (iterations) and the oracle complexity of algorithms to get an $\varepsilon$-solution ($\mathbb{E}[f(\widehat{x})] - f^* \le \varepsilon$), and the necessity (or not) of algorithms to send non-compressed vectors periodically.

• Compressed communication. In practice, it is rarely affordable to send uncompressed messages (vectors) from the nodes to the server due to limited communication bandwidth. Because of this, researchers started to develop methods with the communication complexity in mind: the total number of coordinates/floats/bits that the nodes send to the server to find an $\varepsilon$-solution. Two important families of compressors are investigated in the literature to reduce the communication bottleneck: biased and unbiased compressors. While unbiased compressors are superior in theory (Mishchenko et al., 2019; Li et al., 2020b; Gorbunov et al., 2021), biased compressors often enjoy better performance in practice (Beznosikov et al., 2020; Xu et al., 2020). Recently, Richtárik et al. (2021) developed EF21, the first method capable of working with biased compressors and having the theoretical iteration complexity of gradient descent (GD), up to constant factors.

• Unbiased compressors. The theory around unbiased compressors is much more optimistic. Alistarh et al. (2017) developed the QSGD method, providing convergence rates of the stochastic gradient method with quantized vectors. However, the nonstrongly convex case was analyzed under the strong assumptions that all nodes have identical functions and that the stochastic gradients have bounded second moment. Next, Mishchenko et al. (2019); Horváth et al. (2019) proposed the DIANA method and proved convergence rates without these restrictive assumptions. Distributed nonconvex optimization methods with compression were also developed by Haddadpour et al. (2021); Das et al. (2020). Finally, Gorbunov et al. (2021) proposed MARINA, the current state-of-the-art distributed method in terms of theoretical communication complexity, inspired by the PAGE method of Li et al. (2021a).

3. CONTRIBUTIONS

We develop a new family of distributed optimization methods, DASHA, for nonconvex optimization problems with unbiased compressors. Compared to MARINA, our methods make more practical and simpler optimization steps. In particular, in MARINA, all nodes simultaneously send either compressed vectors, with some probability $p$, or the gradients of the functions $\{f_i\}_{i=1}^n$ (uncompressed vectors), with probability $1-p$. In other words, the server periodically synchronizes all nodes. In federated learning, where some nodes can be inaccessible for a long time, such periodic synchronization is intractable. Our method DASHA solves both problems: i) the nodes always send compressed vectors, and ii) the server never synchronizes all nodes in the gradient setting. Further, a simple tweak in the compressors (see Appendix D) results in support for partial participation in the gradient setting, which makes DASHA more practical for federated learning tasks. Let us summarize our most important theoretical and practical contributions:
• New theoretical SOTA complexity in the finite-sum setting. Using our novel approach to compressing gradients, we improve the theoretical complexities of VR-MARINA (see Tables 1 and 2) in the finite-sum setting. Indeed, if the number of functions $m$ is large, our algorithm DASHA-PAGE needs $\sqrt{\omega+1}$ times fewer communication rounds, while communicating compressed vectors only.
• New theoretical SOTA complexity in the stochastic setting. We develop a new method, DASHA-SYNC-MVR, improving upon the previous state of the art (see Table 1). When $\varepsilon$ is small, the number of communication rounds is reduced by a factor of $\sqrt{\omega+1}$. Indeed, we improve the dominant term, which depends on $\varepsilon^{3/2}$ (the other terms depend on $\varepsilon$ only). However, DASHA-SYNC-MVR needs to periodically send uncompressed vectors at the same rate as VR-MARINA (online). Nevertheless, we show that DASHA-MVR also improves the dominant term when $\varepsilon$ is small, and this method sends compressed vectors only.
• Experiments. Moreover, we provide detailed experiments on practical machine learning tasks: training nonconvex generalized linear models and deep neural networks, showing the improvements predicted by our theory. See Appendix A.
• Closing the gap between uncompressed and compressed methods. In Section 2, we mentioned that the optimal oracle complexities of methods without compression in the finite-sum and stochastic settings are (5) and (6), respectively. Considering the RandK compressor (see Definition F.1), we show that DASHA-PAGE, DASHA-MVR and DASHA-SYNC-MVR attain these optimal oracle complexities while attaining the same state-of-the-art communication complexity as MARINA, which needs to use the stronger gradient oracle! Therefore, our new methods close the gap between the results of (Gorbunov et al., 2021) and (Sharma et al., 2019; Li et al., 2021b).

4. ALGORITHM DESCRIPTION

We now describe our proposed family of optimization methods, DASHA (see Algorithm 1). DASHA is inspired by MARINA and by momentum variance reduction (MVR) methods (Cutkosky & Orabona, 2019; Tran-Dinh et al., 2021; Liu et al., 2020): the general structure repeats MARINA except for the variance reduction strategy, which we borrow from MVR. Unlike MARINA, our algorithm never sends uncompressed vectors, and the number of bits that every node sends is always the same. Moreover, we reduce the variance from the oracle and from the compressor separately, which helps us improve the theoretical convergence rates in the stochastic and finite-sum cases. First, using the gradient estimator $g^t$, the server in each communication round calculates the next point $x^{t+1}$ and broadcasts it to the nodes. Subsequently, all nodes in parallel calculate vectors $h^{t+1}_i$ in one of three ways, depending on the available oracle: for the gradient, finite-sum, and stochastic settings, we use GD-like, PAGE-like, and MVR-like strategies, respectively. Next, each node compresses its message and uploads it to the server. Finally, the server aggregates all received messages and calculates the next vector $g^{t+1}$. We refer to Appendix H for more intuition about DASHA. We note that in the stochastic setting, our analysis of DASHA-MVR (Algorithm 1) provides a suboptimal oracle complexity w.r.t. $\omega$ (see Tables 1 and 2); in Appendix J we provide experimental evidence that our analysis is tight. For this reason, we developed DASHA-SYNC-MVR (see Algorithm 2 in Appendix C), which improves the previous state-of-the-art results and sends non-compressed vectors at the same rate as VR-MARINA (online). Note that DASHA-MVR still enjoys the optimal oracle complexity and SOTA communication complexity (see Section 6.5), as can be seen in the experiments.
Algorithm 1 DASHA
1: Input: starting point $x^0 \in \mathbb{R}^d$, stepsize $\gamma > 0$, momentum $a \in (0, 1]$, momentum $b \in (0, 1]$ (only in DASHA-MVR), probability $p \in (0, 1]$ (only in DASHA-PAGE), batch size $B$ (only in DASHA-PAGE and DASHA-MVR), number of iterations $T \ge 1$
2: Initialize $g^0_i \in \mathbb{R}^d$, $h^0_i \in \mathbb{R}^d$ on the nodes and $g^0 = \frac{1}{n}\sum_{i=1}^n g^0_i$ on the server
3: for $t = 0, 1, \dots, T-1$ do
4:   $x^{t+1} = x^t - \gamma g^t$
5:   Flip a coin $c^{t+1} \in \{0, 1\}$: $c^{t+1} = 1$ with probability $p$, $0$ with probability $1-p$ (only in DASHA-PAGE)
6:   Broadcast $x^{t+1}$ to all nodes
7:   for $i = 1, \dots, n$ in parallel do
8:     $h^{t+1}_i = \nabla f_i(x^{t+1})$ (DASHA);
       $h^{t+1}_i = \nabla f_i(x^{t+1})$ if $c^{t+1} = 1$, and $h^{t+1}_i = h^t_i + \frac{1}{B}\sum_{j \in I^t_i}\left(\nabla f_{ij}(x^{t+1}) - \nabla f_{ij}(x^t)\right)$ if $c^{t+1} = 0$ (DASHA-PAGE);
       $h^{t+1}_i = \frac{1}{B}\sum_{j=1}^B \nabla f_i(x^{t+1}; \xi^{t+1}_{ij}) + (1-b)\left(h^t_i - \frac{1}{B}\sum_{j=1}^B \nabla f_i(x^t; \xi^{t+1}_{ij})\right)$ (DASHA-MVR)
9:     $m^{t+1}_i = C_i\!\left(h^{t+1}_i - h^t_i - a\left(g^t_i - h^t_i\right)\right)$
10:    $g^{t+1}_i = g^t_i + m^{t+1}_i$
11:    Send $m^{t+1}_i$ to the server
12:   end for
13:   $g^{t+1} = g^t + \frac{1}{n}\sum_{i=1}^n m^{t+1}_i$
14: end for
15: Output: $\widehat{x}^T$ chosen uniformly at random from $\{x^t\}_{t=0}^{T-1}$ (or $x^T$ under the PŁ-condition)
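The steps above can be sketched as a single-machine simulation of the gradient setting (DASHA), with RandK compressors and toy quadratic local functions of our own choosing; the stepsize is a tuned toy value, not the theoretical one:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 20, 4, 5
omega = d / k - 1          # RandK with K = k satisfies C in U(d/K - 1)
a = 1.0 / (2 * omega + 1)  # momentum as in Theorem 6.1
gamma = 0.02               # stepsize; a hand-picked toy value
T = 1000

# Toy smooth local objectives f_i(x) = 0.5 * x^T Q_i x (our own illustration)
Q = [np.diag(rng.uniform(0.5, 2.0, d)) for _ in range(n)]
grad = lambda i, z: Q[i] @ z
full_grad = lambda z: sum(grad(i, z) for i in range(n)) / n

def rand_k(v):
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = (d / k) * v[idx]
    return out

x = rng.standard_normal(d)
g0_norm = np.linalg.norm(full_grad(x))
h = [grad(i, x) for i in range(n)]      # h_i^0 = grad f_i(x^0)
g_i = [h[i].copy() for i in range(n)]   # g_i^0 = h_i^0
g = sum(g_i) / n                        # g^0 on the server

for _ in range(T):
    x = x - gamma * g                   # line 4: server step, then broadcast
    m = []
    for i in range(n):
        h_new = grad(i, x)              # line 8, gradient setting (DASHA)
        m_i = rand_k(h_new - h[i] - a * (g_i[i] - h[i]))  # line 9: compress
        g_i[i] = g_i[i] + m_i           # line 10
        h[i] = h_new
        m.append(m_i)
    g = g + sum(m) / n                  # line 13: aggregate compressed messages
```

On this strongly convex toy instance the full gradient norm shrinks by orders of magnitude, even though only compressed vectors are ever "sent".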

5. ASSUMPTIONS

We now provide the assumptions used throughout our paper.

Assumption 5.1. There exists $f^* \in \mathbb{R}$ such that $f(x) \ge f^*$ for all $x \in \mathbb{R}^d$.

Assumption 5.2. The function $f$ is $L$-smooth, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Assumption 5.3. For all $i \in [n]$, the function $f_i$ is $L_i$-smooth.² We define $\widehat{L}^2 := \frac{1}{n}\sum_{i=1}^n L_i^2$.

The next assumption is used in the finite-sum setting (2).

Assumption 5.4. For all $i \in [n]$, $j \in [m]$, the function $f_{ij}$ is $L_{ij}$-smooth. Let $L_{\max} := \max_{i \in [n], j \in [m]} L_{ij}$.

The two assumptions below are provided for the stochastic setting (3).

Assumption 5.5. For all $i \in [n]$ and all $x \in \mathbb{R}^d$, the stochastic gradient $\nabla f_i(x; \xi)$ is unbiased and has bounded variance, i.e., $\mathbb{E}_{\xi}[\nabla f_i(x; \xi)] = \nabla f_i(x)$ and $\mathbb{E}_{\xi}\|\nabla f_i(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2$, where $\sigma^2 \ge 0$.

Assumption 5.6. For all $i \in [n]$ and all $x, y \in \mathbb{R}^d$, the stochastic gradient $\nabla f_i(x; \xi)$ satisfies the mean-squared smoothness property, i.e., $\mathbb{E}_{\xi}\|\nabla f_i(x; \xi) - \nabla f_i(y; \xi) - (\nabla f_i(x) - \nabla f_i(y))\|^2 \le L_\sigma^2\|x - y\|^2$.

² Note that one can always take $L^2 = \widehat{L}^2 := \frac{1}{n}\sum_{i=1}^n L_i^2$. However, the optimal constant $L$ can be much smaller, because $L^2 \le \left(\frac{1}{n}\sum_{i=1}^n L_i\right)^2 \le \frac{1}{n}\sum_{i=1}^n L_i^2$.

6. THEORETICAL CONVERGENCE RATES

Now, we provide convergence rate theorems for DASHA, DASHA-PAGE and DASHA-MVR. All three methods are listed in Algorithm 1 and differ in Line 8 only. At the end of the section, we provide a theorem for DASHA-SYNC-MVR.

6.1. GRADIENT SETTING (DASHA)

Theorem 6.1. Suppose that Assumptions 5.1, 5.2, 5.3 and 1.2 hold. Let us take $a = 1/(2\omega + 1)$,

$$\gamma \le \left(L + \sqrt{\frac{16\omega(2\omega+1)}{n}}\,\widehat{L}\right)^{-1},$$

and $g^0_i = h^0_i = \nabla f_i(x^0)$ for all $i \in [n]$ in Algorithm 1 (DASHA). Then $\mathbb{E}\|\nabla f(\widehat{x}^T)\|^2 \le \frac{2(f(x^0) - f^*)}{\gamma T}$.

The corollary below simplifies the previous theorem and reveals the communication complexity of DASHA.

Corollary 6.2. Suppose that the assumptions of Theorem 6.1 hold, and $g^0_i = h^0_i = \nabla f_i(x^0)$ for all $i \in [n]$. Then DASHA needs

$$T := \mathcal{O}\!\left(\frac{1}{\varepsilon}\left(f(x^0) - f^*\right)\left(L + \frac{\omega}{\sqrt{n}}\widehat{L}\right)\right)$$

communication rounds.

Corollary 6.3. Suppose that, in addition, $\omega + 1 = \Theta(d/\zeta_C)$ with $\zeta_C \le d/\sqrt{n}$. Then the communication complexity equals $\mathcal{O}\!\left(d + \frac{\widehat{L}(f(x^0) - f^*)d}{\varepsilon\sqrt{n}}\right)$.

6.2. FINITE-SUM SETTING (DASHA-PAGE)

Next, we provide the complexity bounds for DASHA-PAGE.

Theorem 6.4. Suppose that Assumptions 5.1, 5.2, 5.3, 5.4, and 1.2 hold. Let us take $a = 1/(2\omega + 1)$, probability $p \in (0, 1]$,

$$\gamma \le \left(L + \sqrt{\frac{48\omega(2\omega+1)}{n}\left(\frac{(1-p)L_{\max}^2}{B} + \widehat{L}^2\right)} + \sqrt{\frac{2(1-p)L_{\max}^2}{pnB}}\right)^{-1},$$

and $g^0_i = h^0_i = \nabla f_i(x^0)$ for all $i \in [n]$ in Algorithm 1 (DASHA-PAGE). Then $\mathbb{E}\|\nabla f(\widehat{x}^T)\|^2 \le \frac{2(f(x^0) - f^*)}{\gamma T}$.

Let us simplify the statement of Theorem 6.4 by choosing particular parameters.

Corollary 6.5. Let the assumptions of Theorem 6.4 hold, $p = B/(m+B)$, and $g^0_i = h^0_i = \nabla f_i(x^0)$ for all $i \in [n]$. Then DASHA-PAGE needs

$$T := \mathcal{O}\!\left(\frac{1}{\varepsilon}\left(f(x^0) - f^*\right)\left(L + \frac{\omega}{\sqrt{n}}\widehat{L} + \sqrt{\frac{\omega\sqrt{n} + m}{nB}}\cdot\frac{L_{\max}}{\sqrt{B}}\right)\right)$$

communication rounds. If, in addition, $\omega + 1 = \Theta(d/\zeta_C)$ with $\zeta_C \le d/\sqrt{n}$, then the communication complexity equals

$$\mathcal{O}\!\left(d + \frac{L_{\max}\left(f(x^0) - f^*\right)d}{\varepsilon\sqrt{n}}\right), \qquad (7)$$

and the expected # of gradient calculations per node equals

$$\mathcal{O}\!\left(m + \frac{L_{\max}\left(f(x^0) - f^*\right)\sqrt{m}}{\varepsilon\sqrt{n}}\right). \qquad (8)$$

Up to Lipschitz-constant factors, the bound (8) is optimal (Fang et al., 2018; Li et al., 2021a), and, unlike VR-MARINA, we recover the optimal bound with compression! At the same time, the communication complexity (7) is the same as in DASHA (see Corollary 6.3) or MARINA.

6.3. STOCHASTIC SETTING (DASHA-MVR)

Let $\bar{h}^t := \frac{1}{n}\sum_{i=1}^n h^t_i$. This vector is not used in Algorithm 1, but it appears in the theoretical results.

Theorem 6.7. Suppose that Assumptions 5.1, 5.2, 5.3, 5.5, 5.6 and 1.2 hold. Let us take $a = \frac{1}{2\omega+1}$, $b \in (0, 1]$,

$$\gamma \le \left(L + \sqrt{\frac{96\omega(2\omega+1)}{n}\left(\frac{(1-b)^2 L_\sigma^2}{B} + \widehat{L}^2\right)} + \sqrt{\frac{4(1-b)^2 L_\sigma^2}{bnB}}\right)^{-1},$$

and $g^0_i = h^0_i$ for all $i \in [n]$ in Algorithm 1 (DASHA-MVR). Then

$$\mathbb{E}\|\nabla f(\widehat{x}^T)\|^2 \le \frac{1}{T}\left(\frac{2\left(f(x^0) - f^*\right)}{\gamma} + \frac{2}{b}\left\|\bar{h}^0 - \nabla f(x^0)\right\|^2 + \frac{32 b\,\omega(2\omega+1)}{n}\cdot\frac{1}{n}\sum_{i=1}^n\left\|h^0_i - \nabla f_i(x^0)\right\|^2\right) + \left(\frac{96\omega(2\omega+1)}{nB} + \frac{4}{bnB}\right)b^2\sigma^2.$$

Corollary 6.8. Suppose that the assumptions of Theorem 6.7 hold, momentum $b = \Theta\!\left(\min\left\{\frac{1}{\omega}\sqrt{\frac{n\varepsilon B}{\sigma^2}},\ \frac{n\varepsilon B}{\sigma^2}\right\}\right)$, $g^0_i = h^0_i = \frac{1}{B_{\mathrm{init}}}\sum_{k=1}^{B_{\mathrm{init}}}\nabla f_i(x^0; \xi^0_{ik})$ for all $i \in [n]$, and batch size $B_{\mathrm{init}} = \Theta(B/b)$. Then Algorithm 1 (DASHA-MVR) needs

$$T := \mathcal{O}\!\left(\frac{1}{\varepsilon}\left(f(x^0) - f^*\right)\left(L + \frac{\omega}{\sqrt{n}}\widehat{L} + \sqrt{\frac{\omega\sqrt{n} + \nicefrac{\sigma^2}{\varepsilon n}}{nB}}\cdot\frac{L_\sigma}{\sqrt{B}}\right) + \frac{\sigma^2}{n\varepsilon B}\right)$$

communication rounds. If, in addition, $\omega + 1 = \Theta(d/\zeta_C)$ with $\zeta_C = \Theta\!\left(\frac{Bd\sqrt{\varepsilon n}}{\sigma}\right)$, and $\widetilde{L} := \max\{L, L_\sigma, \widehat{L}\}$, then the communication complexity equals

$$\mathcal{O}\!\left(\frac{d\sigma}{\sqrt{n\varepsilon}} + \frac{\widetilde{L}\left(f(x^0) - f^*\right)d}{\varepsilon\sqrt{n}}\right), \qquad (9)$$

and the expected # of stochastic gradient calculations per node equals

$$\mathcal{O}\!\left(\frac{\sigma^2}{n\varepsilon} + \frac{\widetilde{L}\left(f(x^0) - f^*\right)\sigma}{\varepsilon^{3/2} n}\right). \qquad (10)$$

Up to Lipschitz-constant factors, the bound (10) is optimal (Arjevani et al., 2019; Sharma et al., 2019), and, unlike VR-MARINA (online), we recover the optimal bound with compression! At the same time, the communication complexity (9) is the same as in DASHA (see Corollary 6.3) or MARINA for small enough $\varepsilon$.

6.4. STOCHASTIC SETTING (DASHA-SYNC-MVR)

We now provide the complexities of Algorithm 2 (see Appendix C).

Corollary 6.10. Suppose that the assumptions of the corresponding convergence theorem hold, batch size $B = \Theta\!\left(\frac{\sigma^2}{n\varepsilon}\right)$, $h^0_i = g^0_i = \frac{1}{B_{\mathrm{init}}}\sum_{k=1}^{B_{\mathrm{init}}}\nabla f_i(x^0; \xi^0_{ik})$ for all $i \in [n]$, and initial batch size $B_{\mathrm{init}} = \Theta\!\left(\max\left\{\frac{\sigma^2}{n\varepsilon},\ B\frac{d}{\zeta_C}\right\}\right)$. Then DASHA-SYNC-MVR needs

$$T := \mathcal{O}\!\left(\frac{1}{\varepsilon}\left(f(x^0) - f^*\right)\left(L + \frac{\omega}{\sqrt{n}}\widehat{L} + \sqrt{\frac{\omega\sqrt{n} + \nicefrac{d}{\zeta_C} + \nicefrac{\sigma^2}{\varepsilon n}}{nB}}\cdot\frac{L_\sigma}{\sqrt{B}}\right) + \frac{\sigma^2}{n\varepsilon B}\right)$$

communication rounds. If, in addition, $\omega + 1 = \Theta(d/\zeta_C)$ with $\zeta_C = \Theta\!\left(\frac{Bd\sqrt{\varepsilon n}}{\sigma}\right)$, and $\widetilde{L} := \max\{L, L_\sigma, \widehat{L}\}$, then the communication complexity equals

$$\mathcal{O}\!\left(\frac{d\sigma}{\sqrt{n\varepsilon}} + \frac{\widetilde{L}\left(f(x^0) - f^*\right)d}{\varepsilon\sqrt{n}}\right), \qquad (11)$$

and the expected # of stochastic gradient calculations per node equals

$$\mathcal{O}\!\left(\frac{\sigma^2}{n\varepsilon} + \frac{\widetilde{L}\left(f(x^0) - f^*\right)\sigma}{\varepsilon^{3/2} n}\right). \qquad (12)$$

Up to Lipschitz-constant factors, the bound (12) is optimal (Arjevani et al., 2019; Sharma et al., 2019), and, unlike VR-MARINA (online), we recover the optimal bound with compression! At the same time, the communication complexity (11) is the same as in DASHA (see Corollary 6.3) or MARINA for small enough $\varepsilon$.

A EXPERIMENTS

We have tested all developed algorithms on practical machine learning problems. Note that the goal of our experiments is to justify the theoretical convergence rates from our paper. We compare the new methods with MARINA on LIBSVM datasets (Chang & Lin, 2011) (under the 3-clause BSD license) because MARINA is the only previous state-of-the-art method for problem (1). Moreover, we show the advantage of our methods on an image recognition task with CIFAR10 (Krizhevsky et al., 2009) and a deep neural network. In all experiments, we take the parameters of the algorithms predicted by theory (stated in the convergence rate theorems of our paper and in (Gorbunov et al., 2021)), except for the step sizes, which we fine-tune over the powers of two $\{2^i \mid i \in [-10, 10]\}$, and use the RandK compressor. We evaluate communication complexity; thus, each plot shows the relation between the norm of a gradient or a function value (vertical axis) and the total number of transmitted bits per node (horizontal axis).

A.1 GRADIENT SETTING

We consider the nonconvex functions

$$f_i(x) := \frac{1}{m}\sum_{j=1}^m \left(1 - \frac{1}{1 + \exp(y_{ij} a_{ij}^\top x)}\right)^2$$

to solve a classification problem. Here, $a_{ij} \in \mathbb{R}^d$ is the feature vector of a sample on the $i$-th node, $y_{ij} \in \{-1, 1\}$ is the corresponding label, and $m$ is the number of samples on the $i$-th node. All nodes calculate full gradients. We take the mushrooms dataset (dimension $d = 112$, 8124 samples) from LIBSVM, randomly split it between 5 nodes, and take $K = 10$ in RandK. The results are shown in Figure 1. When $K = 100$, the improvement is not significant because the term $\frac{1+\omega/\sqrt{n}}{\varepsilon}$ dominates $\frac{\sqrt{m}}{\varepsilon\sqrt{nB}}$ (see Table 1).

A.3 STOCHASTIC SETTING

In this experiment, we consider the following logistic regression functions with a nonconvex regularizer $\{f_i\}_{i=1}^n$ to solve a classification problem:

$$f_i(x_1, x_2) := \mathbb{E}_{j \sim [m]}\left[-\log\frac{\exp\left(a_{ij}^\top x_{y_{ij}}\right)}{\sum_{y \in \{1,2\}}\exp\left(a_{ij}^\top x_y\right)} + \lambda\sum_{y \in \{1,2\}}\sum_{k=1}^d\frac{\{x_y\}_k^2}{1 + \{x_y\}_k^2}\right],$$

where $x_1, x_2 \in \mathbb{R}^d$, $\{\cdot\}_k$ denotes the $k$-th coordinate, $a_{ij} \in \mathbb{R}^d$ is the feature vector of a sample on the $i$-th node, $y_{ij} \in \{1, 2\}$ is the corresponding label, $m$ is the number of samples on the $i$-th node, and $\lambda = 0.001$. We take batch size $B = 1$ and compare VR-MARINA (online), DASHA-MVR, and DASHA-SYNC-MVR, which depend on the common ratio $\sigma^2/(n\varepsilon B)$. We fix $\sigma^2/(n\varepsilon B) \in \{10^4, 10^5\}$ and $K \in \{200, 2000\}$ in the RandK compressors. We consider the real-sim dataset from LIBSVM, split between 5 nodes. When we increase $\sigma^2/(n\varepsilon B)$ from $10^4$ to $10^5$, we implicitly decrease $\varepsilon$ because the other parameters are fixed. In Figure 3, when $\varepsilon$ is small, DASHA-MVR and DASHA-SYNC-MVR converge faster than VR-MARINA (online). (Figure 3: results for $\sigma^2/(n\varepsilon B) \in \{10^4, 10^5\}$ and $K \in \{200, 2000\}$ in RandK in the stochastic setting.)
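For reference, the two nonconvex loss components used above can be sketched as follows (the vectorized forms are our own; the sigmoid-squared loss is transcribed as printed in A.1):

```python
import numpy as np

def sigmoid_sq_loss(x, A, y):
    # Nonconvex classification loss from A.1, transcribed as printed:
    # mean over samples of (1 - 1/(1 + exp(y_j * a_j^T x)))^2
    margins = y * (A @ x)
    return np.mean((1.0 - 1.0 / (1.0 + np.exp(margins))) ** 2)

def nonconvex_reg(x, lam=0.001):
    # Bounded nonconvex regularizer from A.3: lam * sum_k x_k^2 / (1 + x_k^2)
    return lam * np.sum(x ** 2 / (1.0 + x ** 2))
```

At $x = 0$ the per-sample loss is $(1 - 1/2)^2 = 1/4$, and the regularizer is bounded by $\lambda d$, which is what makes it a mild nonconvex penalty.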

A.4 DEEP NEURAL NETWORK TRAINING

Finally, we test our algorithms on an image recognition task, CIFAR10 (Krizhevsky et al., 2009), with the ResNet-18 (He et al., 2016) deep neural network (the number of parameters is $d \approx 10^7$). We split CIFAR10 among 5 nodes and take $K \approx 2 \cdot 10^6$ in RandK. In all methods we fine-tune two parameters: the step size $\gamma \in \{0.05, 0.01, 0.005, 0.001\}$ and the ratio $\sigma^2/(n\varepsilon B) \in \{2, 10, 20, 100\}$. Moreover, we trained the neural network with SGD without compression as a baseline, with step size $\gamma \in \{1.0, 0.5, 0.1, 0.05, 0.01, 0.001\}$. All nodes have batch size $B = 25$. The results are provided in Figure 4. We see that DASHA-MVR converges significantly faster than the other algorithms in terms of communication complexity. Moreover, DASHA-SYNC-MVR works better than VR-MARINA (online) and SGD.

B EXPERIMENTS DETAILS

The code was written in Python 3.6.8 using PyTorch 1.9 (Paszke et al., 2019) . A distributed environment was emulated on a machine with Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz and 64 cores. Deep learning experiments were conducted with NVIDIA A100 GPU with 40GB memory (each deep learning experiment uses at most 5GB of this memory). When the number of nodes n does not divide the number of samples N in a dataset, we randomly ignore N mod n samples from a dataset (up to 4 when n = 5).

C DESCRIPTION OF DASHA-SYNC-MVR

In this section, we provide a description of DASHA-SYNC-MVR (see Algorithm 2). This algorithm is closely related to DASHA-MVR (Algorithm 1), but DASHA-SYNC-MVR synchronizes all nodes with some probability $p$. This synchronization procedure enables us to fix the convergence rate suboptimality of DASHA-MVR w.r.t. $\omega$.

Algorithm 2 DASHA-SYNC-MVR
1: Input: starting point $x^0 \in \mathbb{R}^d$, stepsize $\gamma > 0$, momentum $a \in (0, 1]$, probability $p \in (0, 1]$, batch size $B$, number of iterations $T \ge 1$
2: Initialize $g^0_i$, $h^0_i$ on the nodes and $g^0 = \frac{1}{n}\sum_{i=1}^n g^0_i$ on the server
3: for $t = 0, 1, \dots, T-1$ do
4:   $x^{t+1} = x^t - \gamma g^t$
5:   $c^{t+1} = 1$ with probability $p$, $0$ with probability $1-p$
6:   Broadcast $x^{t+1}$ to all nodes
7:   for $i = 1, \dots, n$ in parallel do
8:     if $c^{t+1} = 1$ then
9:       $h^{t+1}_i = \frac{1}{B}\sum_{k=1}^B \nabla f_i(x^{t+1}; \xi^{t+1}_{ik})$
10:      $m^{t+1}_i = g^{t+1}_i = h^{t+1}_i$
11:    else
12:      $h^{t+1}_i = \frac{1}{B}\sum_{j=1}^B \nabla f_i(x^{t+1}; \xi^{t+1}_{ij}) + h^t_i - \frac{1}{B}\sum_{j=1}^B \nabla f_i(x^t; \xi^{t+1}_{ij})$
13:      $m^{t+1}_i = C_i\!\left(h^{t+1}_i - h^t_i - a\left(g^t_i - h^t_i\right)\right)$
14:      $g^{t+1}_i = g^t_i + m^{t+1}_i$
15:    end if
16:    Send $m^{t+1}_i$ to the server
17:  end for
18:  if $c^{t+1} = 1$ then
19:    $g^{t+1} = \frac{1}{n}\sum_{i=1}^n m^{t+1}_i$
20:  else
21:    $g^{t+1} = g^t + \frac{1}{n}\sum_{i=1}^n m^{t+1}_i$
22:  end if
23: end for
24: Output: $\widehat{x}^T$ chosen uniformly at random from $\{x^t\}_{t=0}^{T-1}$

D PARTIAL PARTICIPATION

A partial participation mechanism, important for federated learning applications, can easily be implemented in DASHA. Assume that the $i$-th node either participates in a communication round with probability $p'$ or sends nothing. From the viewpoint of unbiased compressors, this means that instead of using a compressor $C$, we use the following new stochastic mapping $C_{p'}$:

$$C_{p'}(x) = \begin{cases} \frac{1}{p'}C(x), & \text{with probability } p', \\ 0, & \text{with probability } 1 - p'. \end{cases} \qquad (13)$$

The following simple result states that the new mapping $C_{p'}$ is also an unbiased compressor, which means that our theory applies to this choice as well.

Theorem D.1. If $C \in \mathbb{U}(\omega)$, then $C_{p'} \in \mathbb{U}\!\left(\frac{\omega+1}{p'} - 1\right)$.
In the case of partial participation, all theorems from Section 6 hold with $\omega$ replaced by $\frac{\omega+1}{p'} - 1$.
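A sketch of the mapping (13) together with an empirical check of Theorem D.1 (using RandK as the base compressor is our own choice for the demo):

```python
import numpy as np

def c_p(x, base_compressor, p, rng):
    """Partial-participation mapping (13): with probability p the node
    sends (1/p)*C(x); otherwise it sends nothing (the zero vector)."""
    if rng.random() < p:
        return base_compressor(x) / p
    return np.zeros_like(x)

# Empirical check of Theorem D.1 with C = RandK in U(d/K - 1):
rng = np.random.default_rng(3)
d, k, p = 8, 2, 0.25

def rand_k(x):
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]
    return out

x = rng.standard_normal(d)
omega_p = (d / k) / p - 1   # (omega + 1)/p - 1 with omega = d/k - 1
samples = np.stack([c_p(x, rand_k, p, rng) for _ in range(400_000)])
bias = np.linalg.norm(samples.mean(axis=0) - x)    # should be ~0
var = np.mean(np.sum((samples - x) ** 2, axis=1))  # ~= omega_p * ||x||^2
```

For RandK the variance bound of Theorem D.1 holds with equality, so the Monte-Carlo estimate matches $\left(\frac{\omega+1}{p'}-1\right)\|x\|^2$ closely.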

E AUXILIARY FACTS

In this section, we recall well-known auxiliary facts that we use in the proofs.

1. For all $x, y \in \mathbb{R}^d$, we have

$$\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2. \qquad (14)$$

2. For a random vector $\xi \in \mathbb{R}^d$,

$$\mathbb{E}\|\xi\|^2 = \mathbb{E}\|\xi - \mathbb{E}[\xi]\|^2 + \|\mathbb{E}[\xi]\|^2. \qquad (15)$$

F COMPRESSOR FACTS

Definition F.1. Let $S$ be a random subset of $[d]$ chosen uniformly with $|S| = K$, $K \in [d]$. We say that a stochastic mapping $C : \mathbb{R}^d \to \mathbb{R}^d$ is RandK if $C(x) = \frac{d}{K}\sum_{j \in S} x_j e_j$, where $\{e_j\}_{j=1}^d$ is the standard unit basis. Informally, RandK randomly keeps $K$ coordinates, scaled by $d/K$, and zeroes out the others.

Theorem F.2. If $C$ is RandK, then $C \in \mathbb{U}\!\left(\frac{d}{K} - 1\right)$. See the proof in (Beznosikov et al., 2020).

In the next theorem, we show that $C_{p'}(x)$ from (13) is an unbiased compressor.

Theorem D.1. If $C \in \mathbb{U}(\omega)$, then $C_{p'} \in \mathbb{U}\!\left(\frac{\omega+1}{p'} - 1\right)$.

Proof. First, we prove unbiasedness:

$$\mathbb{E}\left[C_{p'}(x)\right] = p' \cdot \frac{1}{p'}\mathbb{E}\left[C(x)\right] + (1 - p') \cdot 0 = x, \quad \forall x \in \mathbb{R}^d.$$

Next, we bound the variance:

$$\mathbb{E}\left\|C_{p'}(x) - x\right\|^2 = p'\,\mathbb{E}\left\|\frac{1}{p'}C(x) - x\right\|^2 + (1 - p')\|x\|^2 = p'\,\mathbb{E}\left[\frac{1}{p'^2}\|C(x)\|^2 - \frac{2}{p'}\langle C(x), x\rangle + \|x\|^2\right] + (1 - p')\|x\|^2 = \frac{1}{p'}\mathbb{E}\|C(x)\|^2 - (2 - p')\|x\|^2 + (1 - p')\|x\|^2 = \frac{1}{p'}\mathbb{E}\|C(x)\|^2 - \|x\|^2.$$

From $C \in \mathbb{U}(\omega)$, we have

$$\mathbb{E}\left\|C_{p'}(x) - x\right\|^2 \le \frac{\omega+1}{p'}\|x\|^2 - \|x\|^2 = \left(\frac{\omega+1}{p'} - 1\right)\|x\|^2.$$

G POLYAK-ŁOJASIEWICZ CONDITION

In this section, we discuss our convergence rates under the Polyak-Łojasiewicz (PŁ) condition:

Assumption G.1. A function $f$ satisfies the PŁ-condition if

$$\|\nabla f(x)\|^2 \ge 2\mu\left(f(x) - f^*\right), \quad \forall x \in \mathbb{R}^d,$$

where $f^* = \inf_{x \in \mathbb{R}^d} f(x) > -\infty$.

Here we use a different notion of an $\varepsilon$-solution: a (random) point $\widehat{x}$ such that $\mathbb{E}[f(\widehat{x})] - f^* \le \varepsilon$. Under this assumption, Algorithm 1 achieves a linear convergence rate $\mathcal{O}(\ln(1/\varepsilon))$ instead of the sublinear convergence rate $\mathcal{O}(1/\varepsilon)$ in the gradient and finite-sum settings. Moreover, in the stochastic setting, Algorithms 1 and 2 also improve the dependence on $\varepsilon$. The related Theorems I.9, I.12, I.15 and I.20 are stated in Appendix I. Note that in the finite-sum and stochastic settings, Theorems I.12 and I.20 provide new SOTA theoretical convergence rates (see Table 2).
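As a quick sanity check of Assumption G.1, least squares $f(x) = \frac{1}{2}\|Ax - b\|^2$ satisfies the PŁ-condition with $\mu = \lambda_{\min}(A^\top A)$ when $A$ has full column rank (a standard fact; the instance below is our own toy example):

```python
import numpy as np

rng = np.random.default_rng(4)
m_rows, d = 30, 5
A = rng.standard_normal((m_rows, d))
b = rng.standard_normal(m_rows)

# f(x) = 0.5*||Ax - b||^2 satisfies the PL-condition with
# mu = lambda_min(A^T A), which is > 0 when A has full column rank.
H = A.T @ A
mu = float(np.linalg.eigvalsh(H).min())
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda x: 0.5 * float(np.sum((A @ x - b) ** 2))
grad = lambda x: A.T @ (A @ x - b)
f_star = f(x_star)

# Check ||grad f(x)||^2 >= 2*mu*(f(x) - f*) at random test points:
ok = all(
    float(grad(x) @ grad(x)) >= 2 * mu * (f(x) - f_star) - 1e-8
    for x in rng.standard_normal((20, d))
)
```

Note that PŁ functions need not be convex; least squares is merely the simplest instance for which $\mu$ is computable in closed form.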

H INTUITION BEHIND DASHA

In this section, we outline the intuition behind the differences between the proofs of DASHA and MARINA that help us improve the convergence rates.

H.1 DIFFERENT SOURCES OF CONTRACTIONS

In both algorithms, the proofs analyze $\mathbb{E}_C\|g^{t+1} - \nabla f(x^{t+1})\|^2$, the squared norm of the difference between the gradient $\nabla f(x^{t+1})$ and the gradient estimator $g^{t+1}$. For simplicity, we assume that $n = 1$. For MARINA, we have

$$\mathbb{E}_C\left\|g^{t+1} - \nabla f(x^{t+1})\right\|^2 = p\left\|\nabla f(x^{t+1}) - \nabla f(x^{t+1})\right\|^2 + (1-p)\,\mathbb{E}_C\left\|g^t + C\!\left(\nabla f(x^{t+1}) - \nabla f(x^t)\right) - \nabla f(x^{t+1})\right\|^2$$
$$= (1-p)\,\mathbb{E}_C\left\|g^t + C\!\left(\nabla f(x^{t+1}) - \nabla f(x^t)\right) - \nabla f(x^{t+1})\right\|^2$$
$$\overset{(4),(15)}{=} (1-p)\left\|g^t - \nabla f(x^t)\right\|^2 + (1-p)\,\mathbb{E}_C\left\|C\!\left(\nabla f(x^{t+1}) - \nabla f(x^t)\right) - \left(\nabla f(x^{t+1}) - \nabla f(x^t)\right)\right\|^2$$
$$\overset{(4)}{\le} (1-p)\left\|g^t - \nabla f(x^t)\right\|^2 + (1-p)\,\omega\left\|\nabla f(x^{t+1}) - \nabla f(x^t)\right\|^2.$$

In order to get a contraction, i.e., $\mathbb{E}_C\|g^{t+1} - \nabla f(x^{t+1})\|^2 \le (1-p)\|g^t - \nabla f(x^t)\|^2 + \cdots$, MARINA has to send the full gradient $\nabla f(x^{t+1})$ with probability $p > 0$. Now, let us look at how we get a contraction in DASHA:

$$\mathbb{E}_C\left\|g^{t+1} - \nabla f(x^{t+1})\right\|^2 = \mathbb{E}_C\left\|g^t + C\!\left(\nabla f(x^{t+1}) - \nabla f(x^t) - a\left(g^t - \nabla f(x^t)\right)\right) - \nabla f(x^{t+1})\right\|^2$$
$$\overset{(4),(15)}{=} (1-a)^2\left\|g^t - \nabla f(x^t)\right\|^2 + \mathbb{E}_C\left\|C\!\left(\nabla f(x^{t+1}) - \nabla f(x^t) - a\left(g^t - \nabla f(x^t)\right)\right) - \left(\nabla f(x^{t+1}) - \nabla f(x^t) - a\left(g^t - \nabla f(x^t)\right)\right)\right\|^2$$
$$\overset{(4)}{\le} (1-a)^2\left\|g^t - \nabla f(x^t)\right\|^2 + \omega\left\|\nabla f(x^{t+1}) - \nabla f(x^t) - a\left(g^t - \nabla f(x^t)\right)\right\|^2$$
$$\overset{(14)}{\le} \left((1-a)^2 + 2\omega a^2\right)\left\|g^t - \nabla f(x^t)\right\|^2 + 2\omega\left\|\nabla f(x^{t+1}) - \nabla f(x^t)\right\|^2$$
$$\le (1-a)\left\|g^t - \nabla f(x^t)\right\|^2 + 2\omega\left\|\nabla f(x^{t+1}) - \nabla f(x^t)\right\|^2.$$

In the last inequality we use $a \le \frac{1}{2\omega+1}$. One can see that we get exactly the same recursion and contraction. The source of the contraction is the correction $-a\left(g^t - \nabla f(x^t)\right)$ inside the compressor $C$.
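Since $\mathbb{E}\|C(v) - v\|^2 = \omega\|v\|^2$ holds with equality for RandK, the final DASHA bound above can be checked numerically in closed form (a toy verification of ours, with arbitrarily chosen RandK parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 12, 3
omega = d / k - 1          # RandK compressor parameter (Theorem F.2)
a = 1.0 / (2 * omega + 1)

# For RandK, E||C(v) - v||^2 = omega * ||v||^2 exactly, so the left-hand side
# E_C||g^{t+1} - grad f(x^{t+1})||^2 = (1-a)^2*||e||^2 + omega*||v - a*e||^2,
# with e = g^t - grad f(x^t) and v = grad f(x^{t+1}) - grad f(x^t).
for _ in range(100):
    e = rng.standard_normal(d)
    v = rng.standard_normal(d)
    lhs = (1 - a) ** 2 * float(e @ e) + omega * float((v - a * e) @ (v - a * e))
    rhs = (1 - a) * float(e @ e) + 2 * omega * float(v @ v)
    assert lhs <= rhs + 1e-9   # the contraction bound derived above
```

The key algebraic step, $(1-a)^2 + 2\omega a^2 \le 1-a$, holds with equality at $a = 1/(2\omega+1)$, which is why this is the momentum chosen in the theorems.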

H.2 THE SOURCE OF IMPROVEMENTS IN THE CONVERGENCE RATES

Let us briefly explain why we get the improvements in the convergence rates of DASHA in the finite-sum setting; the same intuition applies to the stochastic setting. In DASHA, we reduce the variances from the compressors $C$ and the random sampling $I_j^t$ separately: we have two different control variables, $h_i^t$ and $g_i^t$, and two different parameters, the probability $p$ and the momentum $a$. For simplicity, let us assume that the number of nodes $n = 1$, and consider the Lyapunov function from our proofs:
\[
\mathbb{E}\left[f(x^t)\right] - f^* + \gamma\left(4\omega + 1\right)\mathbb{E}\left[\left\|g^t - h^t\right\|^2\right] + \gamma\left(\frac{1}{p} + 16\omega\left(2\omega+1\right)\right)\mathbb{E}\left[\left\|h^t - \nabla f(x^t)\right\|^2\right].
\]
In contrast, MARINA (VR-MARINA) has only one control variable $g_i^t$ and one parameter $p$. The Lyapunov function of MARINA is
\[
\mathbb{E}\left[f(x^t)\right] - f^* + \frac{\gamma}{2p}\,\mathbb{E}\left[\left\|g^t - \nabla f(x^t)\right\|^2\right].
\]
MARINA has a simpler Lyapunov function, which leads to a suboptimal convergence rate: intuitively, one control variable and one parameter are not enough to reduce the variances from two different sources of randomness. So in DASHA the parameter $p = \frac{B}{m+B}$, while in MARINA $p = \min$

I THEOREMS WITH PROOFS

Lemma I.1. Suppose that Assumption 5.2 holds and let $x^{t+1} = x^t - \gamma g^t$. Then for any $g^t \in \mathbb{R}^d$ and $\gamma > 0$, we have
\[
f(x^{t+1}) \leq f(x^t) - \frac{\gamma}{2}\left\|\nabla f(x^t)\right\|^2 - \left(\frac{1}{2\gamma} - \frac{L}{2}\right)\left\|x^{t+1}-x^t\right\|^2 + \frac{\gamma}{2}\left\|g^t - \nabla f(x^t)\right\|^2.
\]
The proof of Lemma I.1 is provided in (Li et al., 2021a).
Lemma I.2. For the gradient estimators from Algorithm 1, we have
\[
\mathbb{E}_C\left[\left\|g^{t+1}-h^{t+1}\right\|^2\right] \leq \frac{2\omega}{n^2}\sum_{i=1}^n \left\|h_i^{t+1}-h_i^t\right\|^2 + \frac{2a^2\omega}{n^2}\sum_{i=1}^n \left\|g_i^t-h_i^t\right\|^2 + (1-a)^2\left\|g^t-h^t\right\|^2, \tag{18}
\]
and
\[
\mathbb{E}_C\left[\left\|g_i^{t+1}-h_i^{t+1}\right\|^2\right] \leq 2\omega\left\|h_i^{t+1}-h_i^t\right\|^2 + \left(2a^2\omega + (1-a)^2\right)\left\|g_i^t-h_i^t\right\|^2, \quad \forall i \in [n]. \tag{19}
\]
Proof. First, we estimate $\mathbb{E}_C\left[\|g^{t+1}-h^{t+1}\|^2\right]$:
\begin{align*}
\mathbb{E}_C\left[\left\|g^{t+1}-h^{t+1}\right\|^2\right]
&= \mathbb{E}_C\left[\left\|g^t + \frac{1}{n}\sum_{i=1}^n C_i\left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right) - h^{t+1}\right\|^2\right] \\
&\overset{(4),(15)}{=} \mathbb{E}_C\left[\left\|\frac{1}{n}\sum_{i=1}^n \left(C_i\left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right) - \left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right)\right)\right\|^2\right] + (1-a)^2\left\|g^t-h^t\right\|^2.
\end{align*}
Using the independence of the compressors and (4), we get
\begin{align*}
\mathbb{E}_C\left[\left\|g^{t+1}-h^{t+1}\right\|^2\right]
&= \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}_C\left[\left\|C_i\left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right) - \left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right)\right\|^2\right] + (1-a)^2\left\|g^t-h^t\right\|^2 \\
&\leq \frac{\omega}{n^2}\sum_{i=1}^n \left\|h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right\|^2 + (1-a)^2\left\|g^t-h^t\right\|^2 \\
&\leq \frac{2\omega}{n^2}\sum_{i=1}^n \left\|h_i^{t+1}-h_i^t\right\|^2 + \frac{2a^2\omega}{n^2}\sum_{i=1}^n \left\|g_i^t-h_i^t\right\|^2 + (1-a)^2\left\|g^t-h^t\right\|^2.
\end{align*}
Analogously, we can get the bound for $\mathbb{E}_C\left[\|g_i^{t+1}-h_i^{t+1}\|^2\right]$:
\begin{align*}
\mathbb{E}_C\left[\left\|g_i^{t+1}-h_i^{t+1}\right\|^2\right]
&= \mathbb{E}_C\left[\left\|g_i^t + C_i\left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right) - h_i^{t+1}\right\|^2\right] \\
&= \mathbb{E}_C\left[\left\|C_i\left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right) - \left(h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right)\right\|^2\right] + (1-a)^2\left\|g_i^t-h_i^t\right\|^2 \\
&\leq \omega\left\|h_i^{t+1}-h_i^t - a\left(g_i^t-h_i^t\right)\right\|^2 + (1-a)^2\left\|g_i^t-h_i^t\right\|^2 \\
&\leq 2\omega\left\|h_i^{t+1}-h_i^t\right\|^2 + \left(2a^2\omega + (1-a)^2\right)\left\|g_i^t-h_i^t\right\|^2.
\end{align*}
Lemma I.3. Suppose that Assumptions 5.2 and 1.2 hold and let us take $a = 1/(2\omega+1)$. Then
\begin{align*}
&\mathbb{E}\left[f(x^{t+1})\right] + \gamma\left(2\omega+1\right)\mathbb{E}\left[\left\|g^{t+1}-h^{t+1}\right\|^2\right] + \frac{2\gamma\omega}{n}\,\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \left\|g_i^{t+1}-h_i^{t+1}\right\|^2\right] \\
&\leq \mathbb{E}\left[f(x^t) - \frac{\gamma}{2}\left\|\nabla f(x^t)\right\|^2 - \left(\frac{1}{2\gamma}-\frac{L}{2}\right)\left\|x^{t+1}-x^t\right\|^2 + \gamma\left\|h^t-\nabla f(x^t)\right\|^2\right] \\
&\quad + \gamma\left(2\omega+1\right)\mathbb{E}\left[\left\|g^t-h^t\right\|^2\right] + \frac{2\gamma\omega}{n}\,\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \left\|g_i^t-h_i^t\right\|^2\right] + \frac{8\gamma\omega\left(2\omega+1\right)}{n}\,\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \left\|h_i^{t+1}-h_i^t\right\|^2\right].
\end{align*}
Proof.
Due to Lemma I.1 and the update step from Line 4 in Algorithm 1, we have E f (x t+1 ) ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ 2 g t -∇f (x t ) 2 = E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ 2 g t -h t + h t -∇f (x t ) 2 (20) ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ g t -h t 2 + h t -∇f (x t ) 2 . In the last inequality we use Jensen's inequality ( 14). Let us fix some constants κ, η ∈ [0, ∞) that we will define later. Combining bounds (20), ( 18), ( 19) and using the law of total expectation, we get E f (x t+1 ) + κE g t+1 -h t+1 2 + ηE 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ g t -h t 2 + h t -∇f (x t ) + κE 2ω n 2 n i=1 h t+1 i -h t i 2 + 2a 2 ω n 2 n i=1 g t i -h t i 2 + (1 -a) 2 g t -h t 2 + ηE 2ω n n i=1 h t+1 i -h t i 2 + 2a 2 ω + (1 -a) 2 1 n n i=1 g t i -h t i 2 = E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + γ + κ (1 -a) 2 E g t -h t 2 + 2κa 2 ω n + η 2a 2 ω + (1 -a) 2 E 1 n n i=1 g t i -h t i 2 + 2κω n + 2ηω E 1 n n i=1 h t+1 i -h t i 2 . ( ) Now, by taking κ = γ a , we can see that γ + κ (1 -a) 2 ≤ κ, and thus E f (x t+1 ) + γ a E g t+1 -h t+1 2 + ηE 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + γ a E g t -h t 2 + 2γaω n + η 2a 2 ω + (1 -a) 2 E 1 n n i=1 g t i -h t i 2 + 2γω an + 2ηω E 1 n n i=1 h t+1 i -h t i 2 . Next, by taking η = 2γω n and considering the choice of a, one can show that 2γaω n + η 2a 2 ω + (1 -a) 2 ≤ η. 
Thus E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + 2γω (2ω + 1) n + 4γω 2 n E 1 n n i=1 h t+1 i -h t i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + 8γω (2ω + 1) n E 1 n n i=1 h t+1 i -h t i 2 . The following lemma almost repeats the previous one. We will use it in the theorems with Assumption G.1. Lemma I.4. Suppose that Assumptions 5.2, 1.2 and G.1 hold and let us take a = 1/ (2ω + 1) and γ ≤ a 2µ , then E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 20γω(2ω + 1) n E 1 n n i=1 h t+1 i -h t i 2 . Proof. Up to (21) we can follow the proof of Lemma I.3 to get E f (x t+1 ) + κE g t+1 -h t+1 2 + ηE 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + γ + κ (1 -a) 2 E g t -h t 2 + 2κa 2 ω n + η 2a 2 ω + (1 -a) 2 E 1 n n i=1 g t i -h t i 2 + 2κω n + 2ηω E 1 n n i=1 h t+1 i -h t i 2 . Now, by taking κ = 2γ a , we can see that γ + κ (1 -a) 2 ≤ 1 -a 2 κ, and thus E f (x t+1 ) + 2γ a E g t+1 -h t+1 2 + ηE 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + 1 - a 2 2γ a E g t -h t 2 + 4γaω n + η 2a 2 ω + (1 -a) 2 E 1 n n i=1 g t i -h t i 2 + 4γω an + 2ηω E 1 n n i=1 h t+1 i -h t i 2 . Next, by taking η = 8γω n and considering the choice of a, one can show that 4γaω n + η 2a 2 ω + (1 -a) 2 ≤ 1 -a 2 η. 
Thus E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + 1 - a 2 2γ(2ω + 1)E g t -h t 2 + 1 - a 2 8γω n E 1 n n i=1 g t i -h t i 2 + 4γω(2ω + 1) n + 16γω 2 n E 1 n n i=1 h t+1 i -h t i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + 1 - a 2 2γ(2ω + 1)E g t -h t 2 + 1 - a 2 8γω n E 1 n n i=1 g t i -h t i 2 + 20γω(2ω + 1) n E 1 n n i=1 h t+1 i -h t i 2 . Finally, the assumption γ ≤ a 2µ implies an inequality 1 -a 2 ≤ 1 -γµ. Lemma I.5. Suppose that Assumption 5.1 holds and E f (x t+1 ) + γΨ t+1 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γΨ t + γC, where Ψ t is a sequence of numbers, Ψ t ≥ 0 for all t ∈ [T ], constant C ≥ 0, and constant γ > 0. Then E ∇f ( x T ) 2 ≤ 2 f (x 0 ) -f * γT + 2Ψ 0 T + 2C, ( ) where a point x T is chosen uniformly from a set of points {x t } T -1 t=0 . Proof. By unrolling (22) for t from 0 to T -1, we obtain γ 2 T -1 t=0 E ∇f (x t ) 2 + E f (x T ) + γΨ T ≤ f (x 0 ) + γΨ 0 + γT C. We subtract f * , divide inequality by γT 2 , and take into account that f (x) ≥ f * for all x ∈ R, and Ψ t ≥ 0 for all t ∈ [T ], to get the following inequality: 1 T T -1 t=0 E ∇f (x t ) 2 ≤ 2 f (x 0 ) -f * γT + 2Ψ 0 T + 2C. It is left to consider the choice of a point x T to complete the proof of the lemma. Lemma I.6. Suppose that Assumptions 5.1 and G.1 hold and E f (x t+1 ) + γΨ t+1 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ)γΨ t + γC, where Ψ t is a sequence of numbers, Ψ t ≥ 0 for all t ∈ [T ], constant C ≥ 0, constant µ > 0, and constant γ ∈ (0, 1/µ). Then E f (x T ) -f * ≤ (1 -γµ) T f (x 0 ) -f * + γΨ 0 + C µ . ( ) Proof. We subtract f * and use PŁ-condition ( 16) to get E f (x t+1 ) -f * + γΨ t+1 ≤ E f (x t ) -f * - γ 2 E ∇f (x t ) 2 + γΨ t + γC ≤ (1 -γµ)E f (x t ) -f * + (1 -γµ)γΨ t + γC = (1 -γµ) E f (x t ) -f * + γΨ t + γC. 
Unrolling the inequality, we have E f (x t+1 ) -f * + γΨ t+1 ≤ (1 -γµ) t+1 f (x 0 ) -f * + γΨ 0 + γC t i=0 (1 -γµ) i ≤ (1 -γµ) t+1 f (x 0 ) -f * + γΨ 0 + C µ . It is left to note that Ψ t ≥ 0 for all t ∈ [T ]. Lemma I.7. If 0 < γ ≤ (L + √ A) -1 , L > 0, and A ≥ 0, then 1 2γ - L 2 - γA 2 ≥ 0. It is easy to verify with a direct calculation.
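Lemma I.5 is the standard conversion of a per-step descent inequality into a bound on the average squared gradient norm. A minimal numeric sketch (a toy quadratic with exact gradients, so $\Psi^t = 0$ and $C = 0$; not from the paper): for $\gamma \leq 1/L$, the descent lemma gives $f(x^{t+1}) \leq f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2$, and unrolling yields $\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla f(x^t)\|^2 \leq \frac{2(f(x^0)-f^*)}{\gamma T}$:

```python
import numpy as np

# Toy quadratic: f(x) = 0.5 * x^T diag(lams) x with f* = 0, L = max(lams).
lams = np.array([1.0, 3.0, 9.0])
L = lams.max()
gamma = 1.0 / L
x = np.array([1.0, 1.0, 1.0])
T = 50
f0 = 0.5 * np.sum(lams * x**2)

grad_norms = []
for _ in range(T):
    g = lams * x                    # exact gradient, so g^t - grad f(x^t) = 0
    grad_norms.append(np.sum(g**2))
    x = x - gamma * g               # the update step x^{t+1} = x^t - gamma g^t

avg = np.mean(grad_norms)
# Lemma I.5 with Psi^t = 0, C = 0: average squared gradient <= 2(f(x^0)-f*)/(gamma*T)
assert avg <= 2 * f0 / (gamma * T) + 1e-12
```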

I.1 CASE OF DASHA

Despite the triviality of the following lemma, we provide it for consistency with Lemma I.14 and Lemma I.11. Lemma I.8. Suppose that Assumption 5.3 holds. Assuming that h 0 i = ∇f i (x 0 ) for all i ∈ [n], for h t+1 i from Algorithm 1 (DASHA) we have 1. E h h t+1 -∇f (x t+1 ) 2 = 0. 2. E h h t+1 i -∇f i (x t+1 ) 2 = 0, ∀i ∈ [n]. 3. E h h t+1 i -h t i 2 ≤ L 2 i x t+1 -x t 2 , ∀i ∈ [n]. Theorem 6.1. Suppose that Assumptions 5.1, 5.2, 5.3 and 1.2 hold. Let us take a = 1/ (2ω + 1) and γ ≤ L + 16ω(2ω+1) n L -1 , and h 0 i = ∇f i (x 0 ) for all i ∈ [n] in Algorithm 1 (DASHA), then E ∇f ( x T ) 2 ≤ 1 T 2 f (x 0 ) -f * L + 16ω (2ω + 1) n L + 2 (2ω + 1) g 0 -∇f (x 0 ) 2 + 4ω n 1 n n i=1 g 0 i -∇f i (x 0 ) 2 . Proof. Considering Lemma I.3, Lemma I.8, and the law of total expectation, we obtain E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + 8γω (2ω + 1) n L 2 x t+1 -x t 2 = E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 - 1 2γ - L 2 - 8γω (2ω + 1) n L 2 E x t+1 -x t 2 . Using assumption about γ, we can show that 1 2γ -L 2 -8γω(2ω+1) n L 2 ≥ 0 (see Lemma I.7), thus E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 . In the view of Lemma I.5 with Ψ t = (2ω + 1) E g t -h t 2 + 2ω n E 1 n n i=1 g t i -h t i 2 we can conclude the proof. Corollary 6.2. Suppose that assumptions from Theorem 6.1 hold, and Corollary 6.3. Suppose that assumptions of Corollary 6.2 hold. We take the unbiased compressor g 0 i = h 0 i = ∇f i (x 0 ) for all i ∈ [n], then DASHA needs T := O 1 ε f (x 0 ) -f * L + ω RandK with K = ζ C ≤ d/ √ n, then the communication complexity equals O d + L(f (x 0 )-f * )d ε √ n . Proof. In the view of Theorem F.2, we have ω + 1 = d/K. 
Combining this and an inequality L ≤ L, the communication complexity equals O (d + ζ C T ) = O d + 1 ε f (x 0 ) -f * KL + K ω √ n L = O d + 1 ε f (x 0 ) -f * d √ n L + d √ n L = O d + 1 ε f (x 0 ) -f * d √ n L . I.2 CASE OF DASHA UNDER PŁ-CONDITION Theorem I.9. Suppose that Assumption 5.1, 5.2, 5.3, 1.2 and G.1 hold. Let us take a = 1/ (2ω + 1) , γ ≤ min L + 40ω(2ω+1) n L -1 , a , and h 0 i = ∇f i (x 0 ) for all i ∈ [n] in Algorithm 1 (DASHA), then E f (x T ) -f * ≤ (1 -γµ) T f (x 0 ) -f * + 2γ(2ω + 1) g 0 -∇f (x 0 ) 2 + 8γω n 1 n n i=1 g 0 i -∇f i (x 0 ) 2 . Proof. Considering Lemma I.4, Lemma I.8, and the law of total expectation, we obtain E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 20γω(2ω + 1) n L 2 x t+1 -x t 2 = E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 - 1 2γ - L 2 - 20γω(2ω + 1) n L 2 x t+1 -x t 2 . Using the assumption about γ, we can show that 1 2γ -L 2 -20γω(2ω+1) n L 2 ≥ 0 (see Lemma I.7), thus E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 . In the view of Lemma I.6 with Ψ t = 2(2ω + 1)E g t -h t 2 + 8ω n E 1 n n i=1 g t i -h t i 2 we can conclude the proof. We use O (•), when we provide a bound up to logarithmic factors. Corollary I.10. Suppose that assumptions from Theorem I.9 hold, and g 0 i = 0 for all i ∈ [n], then DASHA needs Proof. Clearly, using Theorem I.9, one can show that Algorithm 1 returns an ε-solution after (25) communication rounds. At each communication round of Algorithm 1, each node sends ζ C coordinates, thus the total communication complexity would be O (ζ C T ) per node. 
Unlike Corollary 6.2, in this corollary we can initialize $g_i^0$, for instance, with zeros, because the corresponding initialization error $\Psi^0$ from the proof of Theorem I.9 appears under the logarithm. The resulting bound is
\[
T := O\left(\omega + \frac{L}{\mu} + \frac{\omega L}{\mu\sqrt{n}}\right). \tag{25}
\]
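Throughout these proofs, the stepsize is chosen so that Lemma I.7 applies: if $\gamma \leq (L + \sqrt{A})^{-1}$, then $\frac{1}{2\gamma} - \frac{L}{2} - \frac{\gamma A}{2} \geq 0$. A small sketch with a hypothetical helper `dasha_stepsize` (the name and the split of Lipschitz constants into `L` and `L_hat` are our illustration, following the pattern of Theorem 6.1 where $A = \frac{16\omega(2\omega+1)}{n}\widehat{L}^2$):

```python
import math

def dasha_stepsize(L, L_hat, omega, n):
    """Hypothetical helper: stepsize bound in the style of Theorem 6.1,
    gamma <= (L + sqrt(16*omega*(2*omega+1)/n) * L_hat)^{-1}."""
    A = 16 * omega * (2 * omega + 1) / n * L_hat**2
    return 1.0 / (L + math.sqrt(A))

# RandK in R^d has omega = d/K - 1 (Theorem F.2: omega + 1 = d/K);
# K ~ d/sqrt(n) keeps the omega-dependent term of the same order as L.
d, K, n = 1000, 31, 100
omega = d / K - 1
gamma = dasha_stepsize(L=1.0, L_hat=1.0, omega=omega, n=n)

# Lemma I.7 guarantee: 1/(2*gamma) - L/2 - gamma*A/2 >= 0
A = 16 * omega * (2 * omega + 1) / n
assert 1 / (2 * gamma) - 0.5 - gamma * A / 2 >= -1e-12
```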

I.3 CASE OF DASHA-PAGE

Lemma I.11. Suppose that Assumptions 5.3 and 5.4 hold. For h t+1 i from Algorithm 1 (DASHA-PAGE) we have 1. E h h t+1 -∇f (x t+1 ) 2 ≤ (1 -p) L 2 max nB x t+1 -x t 2 + (1 -p) h t -∇f (x t ) 2 . 2. E h h t+1 i -∇f i (x t+1 ) 2 ≤ (1 -p) L 2 max B x t+1 -x t 2 + (1 -p) h t i -∇f i (x t ) 2 , ∀i ∈ [n]. 3. E h h t+1 i -h t i 2 ≤ (1 -p)L 2 max B + 2L 2 i x t+1 -x t 2 + 2p h t i -∇f i (x t ) 2 , ∀i ∈ [n]. Proof. Using the definition of h t+1 , we obtain E h h t+1 -∇f (x t+1 ) 2 = (1 -p) E h    h t + 1 n n i=1 1 B j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -∇f (x t+1 ) 2    (15) = (1 -p) E h    1 n n i=1 1 B j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -∇f (x t+1 ) -∇f (x t ) 2    + (1 -p) h t -∇f (x t ) 2 . From the unbiasedness and independence of mini-batch samples, we get E h h t+1 -∇f (x t+1 ) 2 ≤ (1 -p) n 2 B 2 n i=1 E h   j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -∇f i (x t+1 ) -∇f i (x t ) 2   + (1 -p) h t -∇f (x t ) 2 = (1 -p) n 2 B n i=1   1 m m j=1 ∇f ij (x t+1 ) -∇f ij (x t ) -∇f i (x t+1 ) -∇f i (x t ) 2   + (1 -p) h t -∇f (x t ) 2 ≤ (1 -p) n 2 B n i=1   1 m m j=1 ∇f ij (x t+1 ) -∇f ij (x t ) 2   + (1 -p) h t -∇f (x t ) 2 ≤ (1 -p) L 2 max nB x t+1 -x t 2 + (1 -p) h t -∇f (x t ) 2 . In the last inequality, we use Assumption 5.4. Using the same reasoning, we have E h h t+1 i -∇f i (x t+1 ) = (1 -p) E h    h t i + 1 B j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -∇f i (x t+1 ) 2    = (1 -p) E h    1 B j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -∇f (x t+1 ) -∇f (x t ) 2    + (1 -p) h t i -∇f i (x t ) 2 ≤ (1 -p) L 2 max B x t+1 -x t 2 + (1 -p) h t i -∇f i (x t ) 2 . Finally, we consider the last ineqaulity of the lemma: E h h t+1 i -h t i 2 = p ∇f i (x t+1 ) -h t i 2 + (1 -p)E h    h t i + 1 B j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -h t i 2    (15) = p ∇f i (x t+1 ) -h t i 2 + (1 -p)E h    1 B j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -∇f i (x t+1 ) -∇f i (x t ) 2    + (1 -p) ∇f i (x t+1 ) -∇f i (x t ) 2 . 
Using the unbiasedness and independence of the gradients, we obtain E h h t+1 i -h t i 2 ≤ p ∇f i (x t+1 ) -h t i 2 + (1 -p) B 2 E h   j∈I t i ∇f ij (x t+1 ) -∇f ij (x t ) -∇f i (x t+1 ) -∇f i (x t ) 2   + (1 -p) ∇f i (x t+1 ) -∇f i (x t ) 2 = p ∇f i (x t+1 ) -h t i 2 + (1 -p) B   1 m m j=1 ∇f ij (x t+1 ) -∇f ij (x t ) -∇f i (x t+1 ) -∇f i (x t ) 2   + (1 -p) ∇f i (x t+1 ) -∇f i (x t ) 2 ≤ p ∇f i (x t+1 ) -h t i 2 + (1 -p) B   1 m m j=1 ∇f ij (x t+1 ) -∇f ij (x t ) 2   + (1 -p) ∇f i (x t+1 ) -∇f i (x t ) 2 . From Assumptions 5.3 and 5.4, we can conclude that E h h t+1 i -h t i 2 ≤ p ∇f i (x t+1 ) -h t i 2 + (1 -p) L 2 max B + L 2 i x t+1 -x t 2 = p ∇f i (x t+1 ) -∇f i (x t ) + ∇f i (x t ) -h t i 2 + (1 -p) L 2 max B + L 2 i x t+1 -x t 2 (14) ≤ 2p ∇f i (x t+1 ) -∇f i (x t ) 2 + 2p h t i -∇f i (x t ) 2 + (1 -p) L 2 max B + L 2 i x t+1 -x t 2 ≤ 2pL 2 i x t+1 -x t 2 + 2p h t i -∇f i (x t ) 2 + (1 -p) L 2 max B + L 2 i x t+1 -x t 2 ≤ (1 -p)L 2 max B + 2L 2 i x t+1 -x t 2 + 2p h t i -∇f i (x t ) 2 . Theorem 6.4. Suppose that Assumptions 5.1, 5.2, 5.3, 5.4, and 1.2 hold. Let us take a = 1/ (2ω + 1), probability p ∈ (0, 1], and γ ≤ L + 48ω (2ω + 1) n (1 -p)L 2 max B + L 2 + 2 (1 -p) L 2 max pnB -1 in Algorithm 1 (DASHA-PAGE) then E ∇f ( x T ) 2 ≤ 1 T     2 f (x 0 ) -f * × L + 48ω (2ω + 1) n (1 -p)L 2 max B + L 2 + 2 (1 -p) L 2 max pnB + 2 (2ω + 1) g 0 -h 0 2 + 4ω n 1 n n i=1 g 0 i -h 0 i 2 + 2 p h 0 -∇f (x 0 ) 2 + 32ω (2ω + 1) n 1 n n i=1 h 0 i -∇f i (x 0 ) 2     . Proof. Let us fix constants ν, ρ ∈ [0, ∞) that we will define later. 
Considering Lemma I.3, Lemma I.11, and the law of total expectation, we obtain E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + 8γω (2ω + 1) n E (1 -p)L 2 max B + 2 L 2 x t+1 -x t 2 + 2p 1 n n i=1 h t i -∇f i (x t ) 2 + νE (1 -p) L 2 max nB x t+1 -x t 2 + (1 -p) h t -∇f (x t ) 2 + ρE (1 -p) L 2 max B x t+1 -x t 2 + (1 -p) 1 n n i=1 h t i -∇f i (x t ) 2 . After rearranging the terms, we get E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 -   1 2γ - L 2 - 8γω (2ω + 1) (1-p)L 2 max B + 2 L 2 n -ν (1 -p) L 2 max nB -ρ (1 -p) L 2 max B   E x t+1 -x t 2 + (γ + ν(1 -p)) E h t -∇f (x t ) 2 + 16γpω (2ω + 1) n + ρ(1 -p) E 1 n n i=1 h t i -∇f i (x t ) 2 . Next, let us fix ν = γ p , to get E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ p E h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ p E h t -∇f (x t ) 2 -   1 2γ - L 2 - 8γω (2ω + 1) (1-p)L 2 max B + 2 L 2 n - γ (1 -p) L 2 max pnB -ρ (1 -p) L 2 max B   E x t+1 -x t 2 + 16γpω (2ω + 1) n + ρ(1 -p) E 1 n n i=1 h t i -∇f i (x t ) 2 . 
By taking ρ = 16γω(2ω+1) n , we obtain E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ p E h t+1 -∇f (x t+1 ) 2 + 16γω (2ω + 1) n E 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ p E h t -∇f (x t ) 2 + 16γω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) -   1 2γ - L 2 - 8γω (2ω + 1) (1-p)L 2 max B + 2 L 2 n - γ (1 -p) L 2 max pnB - 16γω (2ω + 1) (1 -p) L 2 max nB E x t+1 -x t 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ p E h t -∇f (x t ) 2 + 16γω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 -   1 2γ - L 2 - 24γω (2ω + 1) (1-p)L 2 max B + L 2 n - γ (1 -p) L 2 max pnB   E x t+1 -x t 2 . Next, considering the choice of γ and Lemma I.7, we get E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ p E h t+1 -∇f (x t+1 ) 2 + 16γω (2ω + 1) n E 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ p E h t -∇f (x t ) 2 + 16γω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 . Finally, in the view of Lemma I.5 with Ψ t = (2ω + 1) E g t -h t 2 + 2ω n E 1 n n i=1 g t i -h t i 2 + 1 p E h t -∇f (x t ) 2 + 16ω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 , we can conclude the proof. Corollary 6.5. Let the assumptions from Theorem 6.4 hold, p = B /(m+B), and  g 0 i = h 0 i = ∇f i (x 0 ) for all i ∈ [n]. Then DASHA-PAGE needs T := O     1 ε     f (x 0 ) -f * L + ω √ n L + ω √ n + m nB L max √ B         O d + L max f (x 0 ) -f * d ε √ n , and the expected # of gradient calculations per node equals O m + L max f (x 0 ) -f * √ m ε √ n . ( ) Proof. In the view of Theorem F.2, we have ω + 1 = d/K. 
Combining this, inequalities L ≤ L ≤ L max , and K = Θ Bd √ m = O d √ n , we can show that the communication complexity equals O (d + ζ C T ) = O     d + 1 ε     f (x 0 ) -f * KL + K ω √ n L + K ω √ n + m nB L max √ B         = O     d + 1 ε     f (x 0 ) -f * d √ n L + d √ n L + d √ n L max         = O     d + 1 ε     f (x 0 ) -f * d √ n L max         . And the expected number of gradient calculations per node equals O (m + BT ) = O     m + 1 ε     f (x 0 ) -f * BL + B ω √ n L + B ω √ n + m nB L max √ B         = O     m + 1 ε     f (x 0 ) -f * m n L + m n L + m n L max         = O     m + 1 ε     f (x 0 ) -f * m n L max         .
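The PAGE-style control variable analyzed in Lemma I.11 takes the full local gradient with probability $p$ and otherwise applies a minibatch difference correction. A toy sketch (hypothetical one-dimensional components $f_j(x) = \frac{1}{2} a_j x^2$, batch size $B = 1$, not from the paper) verifying by exact enumeration that the update is unbiased: $\mathbb{E}_h\left[h^{t+1}\right] = \nabla f(x^{t+1})$ whenever $h^t = \nabla f(x^t)$:

```python
import numpy as np

# Toy components: grad f_j(x) = a_j * x, grad f(x) = mean(a) * x.
# PAGE-style update:
#   h^{t+1} = grad f(x^{t+1})                                  with prob p,
#   h^{t+1} = h^t + grad f_j(x^{t+1}) - grad f_j(x^t)          (j uniform, B = 1).
a = np.array([1.0, 2.0, 5.0])        # m = 3 component functions
m = len(a)

def grad_j(x, j):
    return a[j] * x

def grad_f(x):
    return np.mean(a) * x

x_old, x_new, p = 1.0, 0.7, 0.25
h_old = grad_f(x_old)                # assume h^t = grad f(x^t)

# exact expectation over the two branches and the uniform index j
exp_h = p * grad_f(x_new) + (1 - p) * np.mean(
    [h_old + grad_j(x_new, j) - grad_j(x_old, j) for j in range(m)]
)
assert abs(exp_h - grad_f(x_new)) < 1e-12   # unbiased at x^{t+1}
```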

I.4 CASE OF DASHA-PAGE UNDER PŁ-CONDITION

Theorem I.12. Suppose that Assumption 5.1, 5.2, 5.3, 1.2, 5.4, and G.1 hold. Let us take a = 1/ (2ω + 1) , probability p ∈ (0, 1], batch size B ∈ [m], and γ ≤ min L + 200ω(2ω+1) n (1-p)L 2 max B + 2 L 2 + 4(1-p)L 2 max pnB -1 , a 2µ , p 2µ in Algorithm 1 (DASHA-PAGE), then E f (x T ) -f * ≤ (1 -γµ) T     (f (x 0 ) -f * ) + 2γ(2ω + 1) g 0 -h 0 2 + 8γω n 1 n n i=1 g 0 i -h 0 i 2 + 2γ p h 0 -∇f (x 0 ) 2 + 80γω (2ω + 1) n 1 n n i=1 h 0 i -∇f i (x 0 ) 2     . Proof. Let us fix constants ν, ρ ∈ [0, ∞) that we will define later. Considering Lemma I.4, Lemma I.11, and the law of total expectation, we obtain E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 20γω(2ω + 1) n E (1 -p)L 2 max B + 2 L 2 x t+1 -x t 2 + 2p 1 n n i=1 h t i -∇f i (x t ) 2 + νE (1 -p) L 2 max nB x t+1 -x t 2 + (1 -p) h t -∇f (x t ) 2 + ρE (1 -p) L 2 max B x t+1 -x t 2 + (1 -p) 1 n n i=1 h t i -∇f i (x t ) 2 . After rearranging the terms, we get E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 - 1 2γ - L 2 - 20γω(2ω + 1) n (1 -p)L 2 max B + 2 L 2 -ν (1 -p) L 2 max nB -ρ (1 -p) L 2 max B E x t+1 -x t 2 + (γ + ν(1 -p)) E h t -∇f (x t ) 2 + 40pγω (2ω + 1) n + ρ(1 -p) E 1 n n i=1 h t i -∇f i (x t ) 2 . 
By taking ν = 2γ p and ρ = 80γω(2ω+1) n , one can see that γ +ν(1-p) ≤ 1 -p 2 ν and 40pγω(2ω+1) n + ρ(1 -p) ≤ 1 -p 2 ρ, thus E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 1 - p 2 2γ p E h t -∇f (x t ) 2 + 1 - p 2 80γω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 - 1 2γ - L 2 - 20γω(2ω + 1) n (1 -p)L 2 max B + 2 L 2 - 2γ (1 -p) L 2 max pnB - 80γω (2ω + 1) (1 -p) L 2 max nB E x t+1 -x t 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 1 - p 2 2γ p E h t -∇f (x t ) 2 + 1 - p 2 80γω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 - 1 2γ - L 2 - 100γω(2ω + 1) n (1 -p)L 2 max B + 2 L 2 - 2γ (1 -p) L 2 max pnB E x t+1 -x t 2 . Next, considering the choice of γ and Lemma I.7, we get E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + (1 -γµ) 2γ p E h t -∇f (x t ) 2 + (1 -γµ) 80γω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 . In the view of Lemma I.6 with Ψ t = 2(2ω + 1)E g t -h t 2 + 8ω n E 1 n n i=1 g t i -h t i 2 + 2 p E h t -∇f (x t ) 2 + 80ω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 , we can conclude the proof of the theorem. Corollary I.13. Suppose that assumptions from Theorem I.12 hold, probability p = B/(m + B), and h 0 i = g 0 i = 0 for all i ∈ [n], then DASHA-PAGE needs Proof. Clearly, using Theorem I.12, one can show that Algorithm 1 returns an ε-solution after (26) communication rounds. At each communication round of Algorithm 1, each node sends ζ C coordinates, thus the total communication complexity would be O (ζ C T ) . 
Moreover, the expected number of gradient computations at each communication round equals $pm + (1-p)B = \frac{2mB}{m+B} \leq 2B$; thus the total expected number of gradients that each node calculates is $O(BT)$. Unlike Corollary 6.5, in this corollary we can initialize $h_i^0$ and $g_i^0$, for instance, with zeros, because the corresponding initialization error $\Psi^0$ from the proof of Theorem I.12 appears under the logarithm. The resulting bound is
\[
T := O\left(\omega + \frac{m}{B} + \frac{L}{\mu} + \frac{\omega L}{\mu\sqrt{n}} + \left(\frac{\omega}{\sqrt{n}} + \sqrt{\frac{m}{nB}}\right)\frac{L_{\max}}{\mu\sqrt{B}}\right).
\]
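The per-round cost identity used above can be checked directly: with $p = \frac{B}{m+B}$, the expected number of gradient computations per round is the harmonic-mean expression $pm + (1-p)B = \frac{2mB}{m+B}$, which never exceeds $2B$. A minimal arithmetic sketch:

```python
# With p = B/(m+B), the expected per-round gradient cost of DASHA-PAGE is
#   p*m + (1-p)*B = 2*m*B/(m+B) <= 2B.
for m in [10, 100, 10_000]:
    for B in [1, 8, 64]:
        p = B / (m + B)
        cost = p * m + (1 - p) * B
        assert abs(cost - 2 * m * B / (m + B)) < 1e-9
        assert cost <= 2 * B
```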

I.5 CASE OF DASHA-MVR

We introduce new notations: ∇f i (x t+1 ; ξ t+1 i ) = 1 B B j=1 ∇f i (x t+1 ; ξ t+1 ij ) and ∇f (x t+1 ; ξ t+1 ) = 1 n n i=1 ∇f i (x t+1 ; ξ t+1 i ). Lemma I.14. Suppose that Assumptions 5.3, 5.5 and 5.6 hold. For h t+1 i from Algorithm 1 (DASHA-MVR) we have 1. E h h t+1 -∇f (x t+1 ) 2 ≤ 2b 2 σ 2 nB + 2 (1 -b) 2 L 2 σ nB x t+1 -x t 2 + (1 -b) 2 h t -∇f (x t ) 2 . 2. E h h t+1 i -∇f i (x t+1 ) 2 ≤ 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B x t+1 -x t 2 + (1 -b) 2 h t i -∇f i (x t ) 2 , ∀i ∈ [n]. 3. E h h t+1 i -h t i 2 ≤ 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B + L 2 i x t+1 -x t 2 + 2b 2 h t i -∇f i (x t ) 2 , ∀i ∈ [n]. Proof. First, let us proof the bound for E h h t+1 -∇f (x t+1 ) 2 : E h h t+1 -∇f (x t+1 ) 2 = E h ∇f (x t+1 ; ξ t+1 ) + (1 -b) h t -∇f (x t ; ξ t+1 ) -∇f (x t+1 ) 2 (15) = E h b ∇f (x t+1 ; ξ t+1 ) -∇f (x t+1 ) + (1 -b) ∇f (x t+1 ; ξ t+1 ) -∇f (x t+1 ) + ∇f (x t ) -∇f (x t ; ξ t+1 ) 2 + (1 -b) 2 h t -∇f (x t ) 2 (14) ≤ 2b 2 E h ∇f (x t+1 ; ξ t+1 ) -∇f (x t+1 ) 2 + 2 (1 -b) 2 E h ∇f (x t+1 ; ξ t+1 ) -∇f (x t+1 ) + ∇f (x t ) -∇f (x t ; ξ t+1 ) 2 + (1 -b) 2 h t -∇f (x t ) 2 = 2b 2 n 2 n i=1 E h ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t+1 ) 2 + 2 (1 -b) 2 n 2 n i=1 E h ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t ; ξ t+1 i ) -∇f i (x t+1 ) -∇f i (x t ) 2 + (1 -b) 2 h t -∇f (x t ) 2 = 2b 2 n 2 B 2 n i=1 B j=1 E h ∇f i (x t+1 ; ξ t+1 ij ) -∇f i (x t+1 ) 2 + 2 (1 -b) 2 n 2 B 2 n i=1 B j=1 E h ∇f i (x t+1 ; ξ t+1 ij ) -∇f i (x t ; ξ t+1 ij ) -∇f i (x t+1 ) -∇f i (x t ) + (1 -b) 2 h t -∇f (x t ) 2 . Using Assumptions 5.5 and 5.6, we obtain E h h t+1 -∇f (x t+1 ) 2 ≤ 2b 2 σ 2 nB + 2 (1 -b) 2 L 2 σ nB x t+1 -x t 2 + (1 -b) 2 h t -∇f (x t ) 2 . 
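The momentum estimator $h^{t+1} = \nabla f(x^{t+1}; \xi^{t+1}) + (1-b)\left(h^t - \nabla f(x^t; \xi^{t+1})\right)$ analyzed in Lemma I.14 admits a simple exact check on a toy model (our illustration, not from the paper): for the linear stochastic gradient $\nabla f(x; \xi) = x - \xi$ with $\xi$ uniform on $\{-1, +1\}$, the $L_\sigma$-term vanishes and the error $e^t = h^t - \nabla f(x^t)$ satisfies exactly $\mathbb{E}\left[(e^{t+1})^2\right] = (1-b)^2 (e^t)^2 + b^2\sigma^2$:

```python
import numpy as np

# Toy MVR check: grad f(x; xi) = x - xi, xi uniform on {-1, +1},
# so grad f(x) = x, sigma^2 = 1, and the one-step MSE recursion is exact.
b, sigma2 = 0.3, 1.0
x, x_new, h = 2.0, 1.5, 2.3
e = h - x                             # current error e = h - grad f(x)

vals = []
for xi in (-1.0, 1.0):                # enumerate the two equally likely samples
    h_new = (x_new - xi) + (1 - b) * (h - (x - xi))   # the MVR update
    vals.append((h_new - x_new) ** 2)                 # squared error at x_new
mse = np.mean(vals)

assert abs(mse - ((1 - b) ** 2 * e ** 2 + b ** 2 * sigma2)) < 1e-12
```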
Similarly, we can get the bound for E h h t+1 i -∇f i (x t+1 ) 2 : E h h t+1 i -∇f i (x t+1 ) 2 = E h ∇f i (x t+1 ; ξ t+1 i ) + (1 -b) h t i -∇f i (x t ; ξ t+1 i ) -∇f i (x t+1 ) 2 = E b ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t+1 ) + (1 -b) ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t+1 ) + ∇f (x t ) -∇f i (x t ; ξ t+1 i ) 2 + (1 -b) 2 h t i -∇f i (x t ) 2 ≤ 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B x t+1 -x t 2 + (1 -b) 2 h t i -∇f i (x t ) 2 . Now, we proof the last inequality of the lemma: E h h t+1 i -h t i 2 = E h ∇f i (x t+1 ; ξ t+1 i ) + (1 -b) h t i -∇f i (x t ; ξ t+1 i ) -h t i 2 (15) = E h ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t+1 ) + (1 -b) ∇f i (x t ) -∇f i (x t ; ξ t+1 i ) 2 + ∇f i (x t+1 ) -∇f i (x t ) -b h t i -∇f i (x t ) 2 = E h b ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t+1 ) + (1 -b) ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t ; ξ t+1 i ) -(∇f i (x t+1 ) -∇f i (x t )) 2 + ∇f i (x t+1 ) -∇f i (x t ) -b h t i -∇f i (x t ) 2 (14) ≤ 2b 2 E h ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t+1 ) 2 + 2 (1 -b) 2 E h ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t ; ξ t+1 i ) -(∇f i (x t+1 ) -∇f i (x t )) 2 + 2 ∇f i (x t+1 ) -∇f i (x t ) 2 + 2b 2 h t i -∇f i (x t ) 2 = 2b 2 B 2 B j=1 E h ∇f i (x t+1 ; ξ t+1 ij ) -∇f i (x t+1 ) 2 + 2 (1 -b) 2 B 2 B j=1 E h ∇f i (x t+1 ; ξ t+1 ij ) -∇f i (x t ; ξ t+1 ij ) -(∇f i (x t+1 ) -∇f i (x t )) 2 + 2 ∇f i (x t+1 ) -∇f i (x t ) 2 + 2b 2 h t i -∇f i (x t ) 2 In the view of Assumptions 5.3, 5.5 and 5.6, we obtain E h h t+1 i -h t i 2 ≤ 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B x t+1 -x t 2 + 2L 2 i x t+1 -x t 2 + 2b 2 h t i -∇f i (x t ) 2 . Theorem 6.7. Suppose that Assumptions 5.1, 5.2, 5.3, 5.5, 5.6 and 1.2 hold. Let us take a = 1 2ω+1 , b ∈ (0, 1], and γ ≤ L + 96ω(2ω+1) n (1-b) 2 L 2 σ B + L 2 + 4(1-b) 2 L 2 σ bnB -1 , in Algorithm 1 (DASHA-MVR). 
Then E ∇f ( x T ) 2 ≤ 1 T     2 f (x 0 ) -f * ×   L + 96ω (2ω + 1) n (1 -b) 2 L 2 σ B + L 2 + 4 (1 -b) 2 L 2 σ bnB   + 2 (2ω + 1) g 0 -h 0 2 + 4ω n 1 n n i=1 g 0 i -h 0 i 2 + 2 b h 0 -∇f (x 0 ) 2 + 32bω (2ω + 1) n 1 n n i=1 h 0 i -∇f i (x 0 ) 2     + 96ω (2ω + 1) nB + 4 bnB b 2 σ 2 . Proof. Let us fix constants ν, ρ ∈ [0, ∞) that we will define later. Considering Lemma I.3, Lemma I.14, and the law of total expectation, we obtain E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + 8γω (2ω + 1) n E 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B + L 2 x t+1 -x t 2 + 2b 2 1 n n i=1 h t i -∇f i (x t ) 2 + νE 2b 2 σ 2 nB + 2 (1 -b) 2 L 2 σ nB x t+1 -x t 2 + (1 -b) 2 h t -∇f (x t ) 2 + ρE 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B x t+1 -x t 2 + (1 -b) 2 1 n n i=1 h t i -∇f i (x t ) 2 After rearranging the terms, we get E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 -   1 2γ - L 2 - 16γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 2ν (1 -b) 2 L 2 σ nB - 2ρ (1 -b) 2 L 2 σ B   E x t+1 -x t 2 + γ + ν(1 -b) 2 E h t -∇f (x t ) 2 + 16b 2 γω (2ω + 1) n + ρ(1 -b) 2 E 1 n n i=1 h t i -∇f i (x t ) + 2 8γω (2ω + 1) nB + ν nB + ρ B b 2 σ 2 . 
By taking ν = γ b , one can see that γ + ν(1 -b) 2 ≤ ν, and E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ b E h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ b E h t -∇f (x t ) 2 -   1 2γ - L 2 - 16γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 2γ (1 -b) 2 L 2 σ bnB - 2ρ (1 -b) 2 L 2 σ B   E x t+1 -x t 2 + 16b 2 γω (2ω + 1) n + ρ(1 -b) 2 E 1 n n i=1 h t i -∇f i (x t ) 2 + 2 8γω (2ω + 1) nB + γ bnB + ρ B b 2 σ 2 . Next, we fix ρ = 16bγω(2ω+1)

/n

. With this choice of ρ and for all b ∈ [0, 1], we can show that 16b 2 γω(2ω+1) n + ρ(1 -b) 2 ≤ ρ, thus E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ b E h t+1 -∇f (x t+1 ) 2 + 16bγω (2ω + 1) n E 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ b E h t -∇f (x t ) 2 + 16bγω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 -   1 2γ - L 2 - 16γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 2γ (1 -b) 2 L 2 σ bnB - 32bγω (2ω + 1) (1 -b) 2 L 2 σ nB   E x t+1 -x t 2 + 2 8γω (2ω + 1) nB + γ bnB + 16bγω (2ω + 1) nB b 2 σ 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ b E h t -∇f (x t ) 2 + 16bγω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) -   1 2γ - L 2 - 48γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 2γ (1 -b) 2 L 2 σ bnB   E x t+1 -x t 2 + 48γω (2ω + 1) nB + 2γ bnB b 2 σ 2 . In the last inequality we use b ∈ (0, 1]. Next, considering the choice of γ and Lemma I.7, we get E f (x t+1 ) + γ (2ω + 1) E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ b E h t+1 -∇f (x t+1 ) 2 + 16bγω (2ω + 1) n E 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ (2ω + 1) E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ b E h t -∇f (x t ) 2 + 16bγω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 + 48γω (2ω + 1) nB + 2γ bnB b 2 σ 2 . In the view of Lemma I.5 with Ψ t = (2ω + 1) E g t -h t 2 + 2ω n E 1 n n i=1 g t i -h t i 2 + 1 b E h t -∇f (x t ) 2 + 16bω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) 2 and C = 48ω(2ω+1) nB + 2 bnB b 2 σ 2 , we can conclude the proof. Corollary 6.8. Suppose that assumptions from Theorem 6.7 hold, momentum b = Θ min 1 ω nεB σ 2 , nεB σ 2 , and Proof. 
In the view of Theorem 6.7, we have g 0 i = h 0 i = 1 Binit Binit k=1 ∇f i (x 0 ; ξ 0 ik ) for all i ∈ [n], and batch size B init = Θ ( B /b) , then Algorithm 1 (DASHA-MVR) needs T := O     1 ε     f (x 0 ) -f * L + ω √ n L + ω √ n + σ 2 εn 2 B L σ √ B     + σ 2 nεB     E ∇f ( x T ) 2 = O     1 T     f (x 0 ) -f *   L + ω √ n (1 -b) 2 L 2 σ B + L 2 + (1 -b) 2 bn L σ √ B   + 1 b h 0 -∇f (x 0 ) 2 + bω 2 n 1 n n i=1 h 0 i -∇f i (x 0 ) 2     + ω 2 n + 1 bn b 2 σ 2 B     . Note, that 1 b = Θ max ω σ 2 nεB , σ 2 nεB ≤ Θ max ω 2 , σ 2 nεB , thus E ∇f ( x T ) 2 = O     1 T     f (x 0 ) -f * L + ω √ n L + L σ √ B + σ 2 εn 2 B L σ √ B + 1 b h 0 -∇f (x 0 ) 2 + bω 2 n 1 n n i=1 h 0 i -∇f i (x 0 ) 2     + ε     . Thus we can take T = O     1 ε     f (x 0 ) -f * L + ω √ n L + L σ √ B + σ 2 εn 2 B L σ √ B + 1 b h 0 -∇f (x 0 ) 2 + bω 2 n 1 n n i=1 h 0 i -∇f i (x 0 ) 2         . Note, that h 0 i = g 0 i = 1 Binit Binit k=1 ∇f i (x 0 ; ξ 0 ik ) for all i ∈ [n]. Let us bound E h 0 -∇f (x 0 ) 2 : E h 0 -∇f (x 0 ) 2 = E   1 n n i=1 1 B init Binit k=1 ∇f i (x 0 ; ξ 0 ik ) -∇f (x 0 ) 2   = 1 n 2 B 2 init n i=1 Binit k=1 E ∇f i (x 0 ; ξ 0 ik ) -∇f i (x 0 ) 2 ≤ σ 2 nB init . Likewise, 1 n n i=1 E h 0 i -∇f i (x 0 ) 2 ≤ σ 2 Binit . All in all, we have T = O     1 ε     f (x 0 ) -f * L + ω √ n L + L σ √ B + σ 2 εn 2 B L σ √ B + σ 2 bnB init + bω 2 σ 2 nB init         = O     1 ε     f (x 0 ) -f * L + ω √ n L + L σ √ B + σ 2 εn 2 B L σ √ B + σ 2 nB + b 2 ω 2 σ 2 nB         = O     1 ε     f (x 0 ) -f * L + ω √ n L + L σ √ B + σ 2 εn 2 B L σ √ B     + σ 2 nεB     . In the view of Algorithm 1 and the fact that we use a mini-batch of stochastic gradients, the number of stochastic gradients that each node calculates equals B init + 2BT = O(B init + BT ). Corollary 6.9. 
Suppose that assumptions of Corollary 6.8 hold, batch size B ≤ σ √ εn , we take RandK with K = ζ C = Θ Bd √ εn σ , and L := max{L, L σ , L}. Then the communication complexity equals O dσ √ nε + L f (x 0 ) -f * d √ nε , and the expected # of stochastic gradient calculations per node equals  O σ 2 nε + L f (x 0 ) -f * σ ε 3 /2 n . ( O (d + ζ C T ) = O     d + 1 ε     f (x 0 ) -f * KL + K ω √ n L + L σ √ B + K σ 2 εn 2 B L σ √ B     + K σ 2 nεB     = O     d + 1 ε     f (x 0 ) -f * d √ n L + d √ n L + L σ √ B + d √ n L σ     + dσ √ nε     = O     d + dσ √ nε + 1 ε     f (x 0 ) -f * d √ n L         = O     dσ √ nε + 1 ε     f (x 0 ) -f * d √ n L         . And the expected number of stochastic gradient calculations per node equals  O (B init + BT ) = O     B σ 2 Bnε + Bω σ 2 nεB + 1 ε     f (x 0 ) -f * BL + B ω √ n L + L σ √ B + B σ 2 εn 2 B L σ √ B         = O     σ 2 nε + σ 2 nε √ B + 1 ε     f (x 0 ) -f * σ √ εn L + σ √ εn L + L σ √ B + σ √ εn L σ         = O     σ 2 nε + 1 ε     f (x 0 ) -f * σ √ εn L         . I.          L + 400ω(2ω+1) (1-b) 2 L 2 σ B + L 2 n + 8(1-b) 2 L 2 σ bnB    -1 , a 2µ , b 2µ        in Algorithm 1 (DASHA- MVR), then E f (x T ) -f * ≤ (1 -γµ) T     f (x 0 ) -f * + 2γ(2ω + 1) g 0 -h 0 2 + 8γω n 1 n n i=1 g 0 i -h 0 i 2 + 2γ b h 0 -∇f (x 0 ) 2 + 80bγω (2ω + 1) n 1 n n i=1 h 0 i -∇f i (x 0 ) 2     + 1 µ 200ω (2ω + 1) nB + 4 bnB b 2 σ 2 . Proof. Let us fix constants ν, ρ ∈ [0, ∞) that we will define later. 
Considering Lemma I.4, Lemma I.14, and the law of total expectation, we obtain E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ h t -∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 20γω(2ω + 1) n E 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B + L 2 x t+1 -x t 2 + 2b 2 1 n n i=1 h t i -∇f i (x t ) 2 + νE 2b 2 σ 2 nB + 2 (1 -b) 2 L 2 σ nB x t+1 -x t 2 + (1 -b) 2 h t -∇f (x t ) 2 + ρE 2b 2 σ 2 B + 2 (1 -b) 2 L 2 σ B x t+1 -x t 2 + (1 -b) 2 1 n n i=1 h t i -∇f i (x t ) 2 . E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 -   1 2γ - L 2 - 40γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 2ν (1 -b) 2 L 2 σ nB - 2ρ (1 -b) 2 L 2 σ B   E x t+1 -x t 2 + γ + ν(1 -b) 2 E h t -∇f (x t ) 2 + 40b 2 γω (2ω + 1) n + ρ(1 -b) 2 E 1 n n i=1 h t i -∇f i (x t ) 2 + 2 20γω (2ω + 1) nB + ν nB + ρ B b 2 σ 2 . By taking ν = 2γ b , one can see that γ + ν(1 -b) 2 ≤ 1 -b 2 ν , and E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + 2γ b E h t+1 -∇f (x t+1 ) 2 + ρE 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 1 - b 2 2γ b E h t -∇f (x t ) 2 -   1 2γ - L 2 - 40γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 4γ (1 -b) 2 L 2 σ bnB - 2ρ (1 -b) 2 L 2 σ B   E x t+1 -x t 2 + 40b 2 γω (2ω + 1) n + ρ(1 -b) 2 E 1 n n i=1 h t i -∇f i (x t ) 2 + 2 20γω (2ω + 1) nB + 2γ bnB + ρ B b 2 σ 2 . Next, we fix ρ = 80bγω(2ω+1) . 
With this choice of ρ and for all b ∈ (0, 1], we can show that 40b 2 γω(2ω+1) n + ρ(1 -b) 2 ≤ 1 -b 2 ρ, thus E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + 2γ b E h t+1 -∇f (x t+1 ) 2 + 80bγω (2ω + 1) n E 1 n n i=1 h t+1 i -∇f i (x t+1 ) ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 1 - b 2 2γ b E h t -∇f (x t ) 2 + 1 - b 2 80bγω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) -   1 2γ - L 2 - 40γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 4γ (1 -b) 2 L 2 σ bnB - 160bγω (2ω + 1) (1 -b) 2 L 2 σ nB E x t+1 -x t 2 + 2 20γω (2ω + 1) nB + 2γ bnB + 80bγω (2ω + 1) nB b 2 σ 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + 1 - b 2 2γ b E h t -∇f (x t ) 2 + 1 - b 2 80bγω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) -   1 2γ - L 2 - 200γω (2ω + 1) (1-b) 2 L 2 σ B + L 2 n - 4γ (1 -b) 2 L 2 σ bnB   E x t+1 -x t 2 + 200γω (2ω + 1) nB + 4γ bnB b 2 σ 2 . In the last inequality we use b ∈ (0, 1]. Next, considering the choice of γ and Lemma I.7, we get E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + 2γ b E h t+1 -∇f (x t+1 ) 2 + 80bγω (2ω + 1) n E 1 n n i=1 h t+1 i -∇f i (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + (1 -γµ) 2γ(2ω + 1)E g t -h t 2 + (1 -γµ) 8γω n E 1 n n i=1 g t i -h t i 2 + (1 -γµ) 2γ b E h t -∇f (x t ) 2 + (1 -γµ) 80bγω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) + 200γω (2ω + 1) nB + 4γ bnB b 2 σ 2 . In the view of Lemma I.6 with , and h 0 i = g 0 i = 0 for all i ∈ [n], then Algorithm 1 needs  Ψ t = 2(2ω + 1)E g t -h t 2 + 8ω n E 1 n n i=1 g t i -h t i 2 + 2 b E h t -∇f (x t ) 2 + 80bω (2ω + 1) n E 1 n n i=1 h t i -∇f i (x t ) T := O ω + ω σ 2 µnεB + σ 2 µnεB + L µ + ω L µ √ n + ω √ n + σ n √ Bµε L σ µ √ B + 4 bnB b 2 σ 2 = O (ε) . Therefore, is it enough to take the number of communication rounds equals ( 27) to get an ε-solution. 
In view of Algorithm 1 and the fact that we use a mini-batch of stochastic gradients, the communication complexity is equal to O (ζ C T ) and the number of stochastic gradients that each node calculates equals O(BT ). Unlike in Corollary 6.8, in this corollary we can initialize h 0 i and g 0 i with, for instance, zeros, because the corresponding initialization error Ψ 0 from the proof of Theorem I.15 appears under the logarithm.
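For concreteness, the two recursions at the heart of DASHA-MVR analyzed in this section can be sketched in a few lines of NumPy. This is a minimal illustration under our own (hypothetical) function names, not the authors' implementation; `compressor` stands for any unbiased compressor C_i.

```python
import numpy as np

def mvr_estimator(grad_new, grad_old, h, b):
    # Momentum variance reduction (MVR) local estimator:
    # h^{t+1} = grad(x^{t+1}; xi) + (1 - b) * (h^t - grad(x^t; xi)),
    # where both gradients use the same minibatch xi.
    return grad_new + (1.0 - b) * (h - grad_old)

def compressed_update(g, h_new, h, a, compressor):
    # Compressed correction step of the DASHA family:
    # g^{t+1} = g^t + C(h^{t+1} - h^t - a * (g^t - h^t)).
    return g + compressor(h_new - h - a * (g - h))
```

With the identity compressor and a = 1, the update collapses to g^{t+1} = h^{t+1}, i.e., the uncompressed MVR method, which is a quick sanity check of the recursion.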

I.7 CASE OF DASHA-SYNC-MVR

Comparing Algorithm 1 and Algorithm 2, one can see that Algorithm 2 has the third source of randomness from c t+1 . In this section, we define E p [•] to be a conditional expectation w.r.t. c t+1 conditioned on all previous randomness. And we define E t+1 [•] to be a conditional expectation w .r.t. c t+1 , {C i } n i=1 , {h t+1 i } n i=1 conditioned on all previous randomness. Note, that E t+1 [•] = E h [E C [E p [•]]] . Lemma I.17. Suppose that Assumptions 5.3, 5.5 and 1.2 hold and let us consider sequences {g t+1 i } n i=1 and {h t+1 i } n i=1 from Algorithm 2, then E t+1 g t+1 -h t+1 2 ≤ 2ω(1 -p) L 2 σ B + L 2 n x t+1 -x t 2 + 2a 2 ω(1 -p) n 2 n i=1 g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 , E t+1 g t+1 i -h t+1 i 2 ≤ 2ω(1 -p) L 2 σ B + L 2 i x t+1 -x t 2 + (1 -p) 2a 2 ω + (1 -a) 2 g t i -h t i 2 , ∀i ∈ [n]. Proof. First, we estimate E t+1 g t+1 -h t+1 2 . Let us denote h t+1 i,0 = 1 B B j=1 ∇f i (x t+1 ; ξ t+1 ij ) + h t i -1 B B j=1 ∇f i (x t ; ξ t+1 ij ). E t+1 g t+1 -h t+1 2 = E t+1 E p g t+1 -h t+1 2 = (1 -p)E t+1   g t + 1 n n i=1 C i h t+1 i,0 -h t i -a g t i -h t i - 1 n n i=1 h t+1 i,0 2   (4),(15) = (1 -p)E h   E C   1 n n i=1 C i h t+1 i,0 -h t i -a g t i -h t i - 1 n n i=1 h t+1 i,0 -h t i -a g t i -h t i 2     + (1 -p) (1 -a) 2 g t -h t 2 . 
Using the independence of compressors and (4), we get E t+1 g t+1 -h t+1 2 = (1 -p) n 2 n i=1 E h E C C i h t+1 i,0 -h t i -a g t i -h t i -h t+1 i,0 -h t i -a g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 ≤ ω(1 -p) n 2 n i=1 E h h t+1 i,0 -h t i -a g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 ≤ 2ω(1 -p) n 2 n i=1 E h h t+1 i,0 -h t i 2 + 2a 2 ω(1 -p) n 2 n i=1 g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 = 2ω(1 -p) n 2 n i=1 E h    1 B B j=1 ∇f i (x t+1 ; ξ t+1 ij ) - 1 B B j=1 ∇f i (x t ; ξ t+1 ij ) 2    + 2a 2 ω(1 -p) n 2 n i=1 g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 (15) = 2ω(1 -p) n 2 n i=1 E h    1 B B j=1 ∇f i (x t+1 ; ξ t+1 ij ) -∇f i (x t ; ξ t+1 ij ) -∇f i (x t+1 ) -∇f i (x t ) 2    + ∇f i (x t+1 ) -∇f i (x t ) 2 + 2a 2 ω(1 -p) n 2 n i=1 g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 = 2ω(1 -p) n 2 n i=1 1 B 2 B j=1 E h ∇f i (x t+1 ; ξ t+1 ij ) -∇f i (x t ; ξ t+1 ij ) -∇f i (x t+1 ) -∇f i (x t ) 2 + ∇f i (x t+1 ) -∇f i (x t ) 2 + 2a 2 ω(1 -p) n 2 n i=1 g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 ≤ 2ω(1 -p) L 2 σ B + L 2 n x t+1 -x t 2 + 2a 2 ω(1 -p) n 2 n i=1 g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2 , where in the inequalities we use Assumptions 1.2, 5.5 and 5.3, and ( 14). Analogously, we can get the bound for E t+1 g t+1 i -h t+1 i 2 for all i ∈ [n]: E t+1 g t+1 i -h t+1 i 2 = E t+1 E p g t+1 i -h t+1 i 2 = (1 -p)E t+1 g t i + C i h t+1 i,0 -h t i -a g t i -h t i -h t+1 i,0 2 ≤ 2ω(1 -p) L 2 σ B + L 2 i x t+1 -x t 2 + 2a 2 ω(1 -p) g t i -h t i 2 + (1 -p) (1 -a) 2 g t i -h t i 2 . We introduce new notations: ∇f i (x t+1 ; ξ t+1 i ) = 1 B B j=1 ∇f i (x t+1 ; ξ t+1 ij ) and ∇f (x t+1 ; ξ t+1 ) = 1 n n i=1 ∇f i (x t+1 ; ξ t+1 i ). Lemma I.18. Suppose that Assumptions 5.5 and 5.6 hold and let us consider sequence {h t+1 i } n i=1 from Algorithm 2, then E t+1 h t+1 -∇f (x t+1 ) 2 ≤ pσ 2 nB + (1 -p)L 2 σ nB x t+1 -x t 2 + (1 -p) h t -∇f (x t ) 2 . Proof. 
E t+1 h t+1 -∇f (x t+1 ) 2 = pE h    1 n n i=1 1 B B k=1 ∇f i (x t+1 ; ξ t+1 ik ) -∇f (x t+1 ) 2    + (1 -p)E h ∇f (x t+1 ; ξ t+1 ) + h t -∇f (x t ; ξ t+1 ) -∇f (x t+1 ) 2 ≤ pσ 2 nB + (1 -p)E h ∇f (x t+1 ; ξ t+1 ) + h t -∇f (x t ; ξ t+1 ) -∇f (x t+1 ) 2 , where we use Assumption 5.5. Next, using Assumption 5.6 and (15), we have E t+1 h t+1 -∇f (x t+1 ) 2 ≤ pσ 2 nB + (1 -p)E h ∇f (x t+1 ; ξ t+1 ) + h t -∇f (x t ; ξ t+1 ) -∇f (x t+1 ) 2 = pσ 2 nB + (1 -p)E h ∇f (x t+1 ; ξ t+1 ) -∇f (x t ; ξ t+1 ) -∇f (x t+1 ) -∇f (x t ) 2 + (1 -p) h t -∇f (x t ) 2 = pσ 2 nB + (1 -p) n 2 n i=1 E h ∇f i (x t+1 ; ξ t+1 i ) -∇f i (x t ; ξ t+1 i ) -∇f i (x t+1 ) -∇f i (x t ) 2 + (1 -p) h t -∇f (x t ) 2 = pσ 2 nB + (1 -p) n 2 B 2 n i=1 B j=1 E h ∇f i (x t+1 ; ξ t+1 ij ) -∇f i (x t ; ξ t+1 ij ) -∇f i (x t+1 ) -∇f i (x t ) 2 + (1 -p) h t -∇f (x t ) 2 ≤ pσ 2 nB + (1 -p)L 2 σ nB x t+1 -x t 2 + (1 -p) h t -∇f (x t ) 2 . Theorem I.19. Suppose that Assumptions 5.1, 5.2, 5.3, 5.5, 5.6 and 1.2 hold. Let us take a = 1 2ω+1 , probability p ∈ (0, 1], batch size B ≥ 1 and γ ≤ L + 12ω(2ω + 1)(1 -p) n L 2 σ B + L 2 + 2(1 -p)L 2 σ pnB -1 , in Algorithm 2. Then E ∇f ( x T ) 2 ≤ 1 T     2 f (x 0 ) -f * × L + 12ω(2ω + 1)(1 -p) n L 2 σ B + L 2 + 2(1 -p)L 2 σ pnB + 2 (2ω + 1) g 0 -h 0 2 + 4ω n 1 n n i=1 g 0 i -h 0 i 2 + 2 p h 0 -∇f (x 0 ) 2     + 2σ 2 nB . Proof. Let us fix constants κ, η, ν ∈ [0, ∞) that we will define later. Using Lemma I.1, we can get (20). 
Considering (20), Lemma I.17, Lemma I.18, and the law of total expectation, we obtain E f (x t+1 ) + κE g t+1 -h t+1 2 + ηE 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t 2 + γ g t -h t 2 + γ h t -∇f (x t ) 2 + κE   2ω(1 -p) L 2 σ B + L 2 n x t+1 -x t 2 + 2a 2 ω(1 -p) n 2 n i=1 g t i -h t i 2 + (1 -p) (1 -a) 2 g t -h t 2   + ηE 2ω(1 -p) L 2 σ B + L 2 x t+1 -x t 2 + (1 -p) 2a 2 ω + (1 -a) 2 1 n n i=1 g t i -h t i 2 + νE pσ 2 nB + (1 -p)L 2 σ nB x t+1 -x t 2 + (1 -p) h t -∇f (x t ) 2 . After rearranging the terms, we get E f (x t+1 ) + κE g t+1 -h t+1 2 + ηE 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 -   1 2γ - L 2 - 2κω(1 -p) L 2 σ B + L 2 n -2ηω(1 -p) L 2 σ B + L 2 - ν(1 -p)L 2 σ nB   E x t+1 -x t 2 + γ + κ(1 -p)(1 -a) 2 E g t -h t 2 + 2κa 2 ω(1 -p) n + η(1 -p) 2a 2 ω + (1 -a) 2 E 1 n n i=1 g t i -h t i 2 + (γ + ν(1 -p)) E h t -∇f (x t ) 2 + νpσ 2 nB . Let us take ν = γ p , κ = γ a , a = 1 2ω+1 , and η = 2γω n . Thus γ+κ(1-p)(1-a) 2 ≤ κ, γ+ν(1-p) = ν, 2κa 2 ω(1-p) n + η(1 -p) 2a 2 ω + (1 -a) 2 ≤ η, E f (x t+1 ) + γ(2ω + 1)E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ p E h t+1 -∇f (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 -   1 2γ - L 2 - 2γω(2ω + 1)(1 -p) L 2 σ B + L 2 n - 4γω 2 (1 -p) L 2 σ B + L 2 n - γ(1 -p)L 2 σ pnB   E x t+1 -x t 2 + γ(2ω + 1)E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ p E h t -∇f (x t ) 2 + γσ 2 nB ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 -   1 2γ - L 2 - 6γω(2ω + 1)(1 -p) L 2 σ B + L 2 n - γ(1 -p)L 2 σ pnB   E x t+1 -x t 2 + γ(2ω + 1)E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ p E h t -∇f (x t ) 2 + γσ 2 nB . In the view of the choice of γ, we obtain E f (x t+1 ) + γ(2ω + 1)E g t+1 -h t+1 2 + 2γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + γ p E h t+1 -∇f (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 + γ(2ω + 1)E g t -h t 2 + 2γω n E 1 n n i=1 g t i -h t i 2 + γ p E h t -∇f (x t ) 2 + γσ 2 nB . 
Finally, using Lemma I.5 with Ψ t = (2ω + 1)E g t -h t 2 + 2ω n E 1 n n i=1 g t i -h t i 2 + 1 p E h t -∇f (x t ) 2 and C = σ 2 nB , we can conclude the proof. Corollary 6.10. Suppose that assumptions from Theorem I.19 hold, probability p = min ζ C d , nεB σ 2 , batch size B = Θ σ 2 nε and h 0 i = g 0 i = 1 Binit Binit k=1 ∇f i (x 0 ; ξ 0 ik ) for all i ∈ [n], initial batch size B init = Θ max σ 2 nε , B d ζ C , then DASHA-SYNC-MVR needs  T := O     1 ε     f (x 0 ) -f * L + ω √ n L + ω √ n + d ζ C n + σ 2 εn 2 B L σ √ B     + σ 2 nεB     E ∇f ( x T ) 2 ≤ 1 T     2 f (x 0 ) -f *     L + 12ω(2ω + 1)(1 -p) L 2 σ B + L 2 n + 2(1 -p)L 2 σ pnB     + 2 p h 0 -∇f (x 0 ) 2     + 2σ 2 nB ≤ 1 T     2 f (x 0 ) -f *     L + 12ω(2ω + 1)(1 -p) L 2 σ B + L 2 n + 2(1 -p)L 2 σ pnB     + 2 p h 0 -∇f (x 0 ) 2     + 2 3 ε. Due to p = min ζ C d , nεB σ 2 , we have E ∇f ( x T ) 2 ≤ O     1 T     2 f (x 0 ) -f *     L + 12ω(2ω + 1)(1 -p) L 2 σ B + L 2 n + 2d(1 -p)L 2 σ ζ C nB + 2σ 2 (1 -p)L 2 σ εn 2 B 2     + 2 d ζ C + σ 2 nεB h 0 -∇f (x 0 ) 2         + 2 3 ε ≤ O     1 T     2 f (x 0 ) -f *     L + ω 2 (1 -p) L 2 σ B + L 2 n + d(1 -p)L 2 σ ζ C nB + 2σ 2 (1 -p)L 2 σ εn 2 B 2     + 2 d ζ C + σ 2 nεB h 0 -∇f (x 0 ) 2         + 2 3 ε. Therefore, we can take T = O     1 ε     f (x 0 ) -f * L + ω √ n L + L σ √ B + d ζ C n L σ √ B + σ 2 εn 2 B L σ √ B + d ζ C + σ 2 nεB h 0 -∇f (x 0 ) 2         . Note, that E h 0 -∇f (x 0 ) 2 = E   1 n n i=1 1 B init Binit k=1 ∇f i (x 0 ; ξ 0 ik ) -∇f (x 0 ) 2   = 1 n 2 B 2 init n i=1 Binit k=1 E ∇f i (x 0 ; ξ 0 ik ) -∇f i (x 0 ) 2 ≤ σ 2 nB init . 
Next, by taking B init = max σ 2 nε , B d ζ C and using the last ineqaulity, we have  T = O     1 ε     f (x 0 ) -f * L + ω √ n L + L σ √ B + d ζ C n L σ √ B + σ 2 εn 2 B L σ √ B + d ζ C + σ 2 nεB min σ 2 ζ C ndB , ε         = O     1 ε     f (x 0 ) -f * L + ω √ n L + L σ √ B + d ζ C n L σ √ B + σ 2 εn 2 B L σ √ B     + σ 2 nεB     . O (d + ζ C T ) = O     d + 1 ε     f (x 0 ) -f * KL + K ω √ n L + L σ √ B + K ω n L σ √ B + K σ 2 εn 2 B L σ √ B     + K σ 2 nεB     = O     d + 1 ε     f (x 0 ) -f * d √ n L + d √ n L + L σ √ B + d √ n L σ     + dσ √ nε     = O     d + dσ √ nε + 1 ε     f (x 0 ) -f * d √ n L         = O     dσ √ nε + 1 ε     f (x 0 ) -f * d √ n L         . And the expected number of stochastic gradient calculations per node equals Proof. Let us fix constants κ, η, ν ∈ [0, ∞) that we will define later. Using Lemma I.1, we can get (20). Considering (20), Lemma I.17, Lemma I.18, and the law of total expectation, we obtain  O (B init + BT ) = O     σ 2 nε + B d ζ C + 1 ε     f (x 0 ) -f * BL + B ω √ n L + L σ √ B + B ω n L σ √ B + B σ 2 εn 2 B L σ √ B         = O     σ 2 nε + σ √ nε + 1 ε     f (x 0 ) -f * σ √ εn L + σ √ εn L + L σ √ B + σ √ εn L σ         = O     σ 2 nε + 1 ε     f (x 0 ) -f * E f (x t+1 ) + κE g t+1 -h t+1 2 + ηE 1 n n i=1 g t+1 i -h t+1 i 2 + νE h t+1 -∇f (x t+1 ) 2 ≤ E f (x t ) - γ 2 ∇f (x t ) 2 - 1 2γ - L 2 x t+1 -x t + νE h t+1 -∇f (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 -   1 2γ - L 2 - 2κω(1 -p) L 2 σ B + L 2 n -2ηω(1 -p) L 2 σ B + L 2 - ν(1 -p)L 2 σ nB   E x t+1 -x t 2 + γ + κ(1 -p)(1 -a) 2 E g t -h t 2 + 2κa 2 ω(1 -p) n + η(1 -p) 2a 2 ω + (1 -a) 2 E 1 n n i=1 g t i -h t i 2 + (γ + ν(1 -p)) E h t -∇f (x t ) 2 + νpσ 2 nB . Let us take ν = 2γ p , κ = 2γ a , a = 1 2ω+1 , and η = 8γω n . 
Thus γ + κ(1 -p)(1 -a) 2 ≤ 1 -a 2 κ, γ + ν(1 -p) = 1 -p 2 ν, 2κa 2 ω(1-p) n + η(1 -p) 2a 2 ω + (1 -a) 2 ≤ 1 -a 2 η, and , batch size B = Θ σ 2 µnε , and h 0 i = g 0 i = 0 for all i ∈ [n], then DASHA-SYNC-MVR needs Considering the fact that we use a mini-batch of stochastic gradients, on average, the number of stochastic gradients that each node calculates at each communication round equals pB +(1-p)2B = O µnεB σ 2 • σ 2 µnε + 2B = O (B) , thus the number of stochastic gradients that each node calculates equals O(BT ). Unlike Corollary 6.10, in this corollary, we can initialize h 0 i and g 0 i , for instance, with zeros because the corresponding initialization error Ψ 0 from the proof of Theorem I.20 would be under the logarithm. E f (x t+1 ) + 2γ(2ω + 1)E g t+1 -h t+1 2 + 8γω n E 1 n n i=1 g t+1 i -h t+1 i 2 + 2γ p E h t+1 -∇f (x t+1 ) 2 ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 -   1 2γ - L 2 - 4γω(2ω + 1)(1 -p) L 2 σ B + L 2 n - 16γω 2 (1 -p) L 2 σ B + L 2 n - 2γ(1 -p)L 2 σ pnB   E x t+1 -x t 2 + 1 - a 2 2γ(2ω + 1)E g t -h t 2 + 1 - a 2 8γω n E 1 n n i=1 g t i -h t i 2 + 1 - p 2 2γ p E h t -∇f (x t ) 2 + 2γσ 2 nB ≤ E f (x t ) - γ 2 E ∇f (x t ) 2 -   1 2γ - L 2 - 20γω(2ω + 1)(1 -p) L 2 σ B + L 2 n - 2γ(1 -p)L 2 σ pnB   E x t+1 -x t T := O ω + d ζ C + σ 2 µnεB + L µ + ω L µ √ n + ω √ n + d ζ C n + σ n √ Bµε L σ µ √ B



Alternatively, we sometimes use the terms machines, workers, and clients.
Code: https://github.com/mysteryresearcher/dasha
Indeed, in DASHA-SYNC-MVR and MARINA, the probability p = min{ K /d, nεB /σ 2 }, while in DASHA-MVR the momentum b = min{ ( K /d) • ( nεB /σ 2 ), nεB /σ 2 }.



The expected density of the compressor C i is ζ Ci := sup x∈R d E[ C i (x) 0 ], where x 0 is the number of nonzero components of x ∈ R d . Let ζ C = max i∈[n] ζ Ci .
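As an illustration of this definition, the RandK sparsifier used throughout the paper can be sketched as follows (`rand_k` is our name for it): the scaling d/K makes the compressor unbiased, with ω = d/K − 1 and expected density ζ_C = K.

```python
import numpy as np

def rand_k(x, k, rng):
    # RandK sparsifier: keep k uniformly chosen coordinates,
    # scaled by d/k so that E[C(x)] = x (unbiased), with
    # omega = d/k - 1 and expected density zeta_C = k.
    d = x.shape[0]
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out
```

Averaging many independent compressions of the same vector recovers the vector itself, which is exactly the unbiasedness property the analysis relies on.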

Published as a conference paper at ICLR 2023

H.1 Different Sources of Contractions
H.2 The Source of Improvements in the Convergence Rates
I Theorems with Proofs
I.1 Case of DASHA
I.2 Case of DASHA under PŁ-condition
I.3 Case of DASHA-PAGE
I.4 Case of DASHA-PAGE under PŁ-condition
I.5 Case of DASHA-MVR
I.6 Case of DASHA-MVR under PŁ-condition
I.7 Case of DASHA-SYNC-MVR
I.8 Case of DASHA-SYNC-MVR under PŁ-condition

Figure 1: Classification task with the mushrooms dataset and gradient oracle.

Figure 2: Classification task with the real-sim dataset and K ∈ {100; 500; 2, 000} in RandK in the finite-sum setting.

Figure 3: Classification task with the real-sim dataset, σ 2 /nεB ∈ {10 4 , 10 5 }, and K ∈ {200, 2000} in RandK in the stochastic setting.

Figure 4: Classification task with CIFAR10 dataset and ResNet-18 deep neural network. Dimension d ≈ 10 7 and K ≈ 2 • 10 6 in RandK.

p = min{ B /(m+B) , ζ C /d }, because the parameter p of MARINA helps to reduce the variance from the compressors C.

ε-solution and the communication complexity is equal to O (d + ζ C T ) , where ζ C is the expected density from Definition 1.3. Proof. The communication complexities can be easily derived using Theorem 6.1. At each communication round of Algorithm 1, each node sends ζ C coordinates. In view of g 0 i = ∇f i (x 0 ) for all i ∈ [n], the nodes additionally have to send d coordinates to the server, thus the total communication complexity is O (d + ζ C T ) .

communication rounds to get an ε-solution and the communication complexity is equal to O (ζ C T ) , where ζ C is the expected density from Definition 1.3.

communication rounds to get an ε-solution, the communication complexity is equal to O (d + ζ C T ) , and the expected # of gradient calculations per node equals O (m + BT ) , where ζ C is the expected density from Definition 1.3. Proof. Corollary 6.5 can be proved in the same way as Corollary 6.2. One should only note that the expected number of gradient calculations at each communication round equals pm + (1 -p)B = 2mB /(m+B) ≤ 2B. Corollary 6.6. Suppose that assumptions of Corollary 6.5 hold, B ≤ m /n, and we use the unbiased compressor RandK with K = ζ C = Θ ( Bd / √ m) . Then the communication complexity of Algorithm 1 is

(26) communication rounds to get an ε-solution, the communication complexity is equal to O (ζ C T ) , and the expected number of gradient calculations per node equals O (BT ) , where ζ C is the expected density from Definition 1.3.

communication rounds to get an ε-solution, the communication complexity is equal to O (d + ζ C T ) , and the number of stochastic gradient calculations per node equals O(B init + BT ), where ζ C is the expected density from Definition 1.3.

Proof. In view of Theorem F.2, we have ω + 1 = d/K. Moreover, K = Θ Bd

b 2 σ 2 , we can conclude the proof.Corollary I.16. Suppose that assumptions from Theorem I.15 hold, momentum b =

(27) communication rounds to get an ε-solution, the communication complexity is equal to O (ζ C T ) , and the number of stochastic gradient calculations per node equals O(BT ), where ζ C is the expected density from Definition 1.3. Proof. Considering the choice of b, we have 1 /µ • 200ω(2ω+1) /nB

, it is left to estimate the communication and oracle complexity. On average, the number of coordinates that each node in Algorithm 2 sends at each communication round equals pd + (1 -p)ζ C ≤ ( ζ C /d) • d + (1 -ζ C /d) • ζ C ≤ 2ζ C . Therefore, the communication complexity is equal to O (d + ζ C T ) . Considering the fact that we use a mini-batch of stochastic gradients, on average, the number of stochastic gradients that each node calculates at each communication round equals pB + (1 -p) • 2B ≤ O ( nεB /σ 2 • σ 2 /(nε) ) + 2B = O (B) . Considering the initial batch size B init , the number of stochastic gradients that each node calculates equals O(B init + BT ). Corollary 6.11. Suppose that assumptions of Corollary 6.10 hold, batch size B ≤ σ / √ εn , we take RandK with K = ζ C = Θ ( Bd √ εn /σ ) , and L := max{L, L σ , L}. Then the communication complexity is ... In view of Theorem F.2, we have ω + 1 = d/K. Moreover, K = Θ ( Bd √ εn /σ )
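The averaging argument above is elementary arithmetic; the following helper (our own, not from the paper) checks that with p = ζ_C/d the expected number of coordinates a node sends per round is at most 2ζ_C.

```python
def coords_per_round(d, zeta_c):
    # Expected coordinates sent per round in DASHA-SYNC-MVR-style methods:
    # with probability p the node sends all d coordinates (full sync),
    # otherwise it sends a compressed vector of expected density zeta_c.
    # Here we take p = zeta_c / d, the choice used in the corollaries.
    p = zeta_c / d
    return p * d + (1 - p) * zeta_c
```

For example, with d = 10^4 and ζ_C = 100, the expected cost per round is 100 + 0.99·100 = 199 coordinates, below the 2ζ_C = 200 bound.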

x 0 ) -f * + 2γ(2ω + 1) g 0 -h 0

(28) communication rounds to get an ε-solution, the communication complexity is equal to O (ζ C T ) , and the number of stochastic gradient calculations per node equals O(BT ), where ζ C is the expected density from Definition 1.3. Proof. Considering the choice of B, we have 2σ 2 /(nµB) = O (ε) . Therefore, it is enough to take the number of communication rounds equal to (28) to get an ε-solution. It is left to estimate the communication and oracle complexity. On average, in Algorithm 2, at each communication round the number of coordinates that each node sends equals pd + (1 -p)ζ C ≤ ( ζ C /d) • d + (1 -ζ C /d) • ζ C ≤ 2ζ C . Therefore, the communication complexity is equal to O (ζ C T ) .

communication rounds to get an ε-solution and the communication complexity is equal to O (d + ζ C T ) , where ζ C is the expected density from Definition 1.3.

communication rounds to get an ε-solution, the communication complexity is equal to O (d + ζ C T ) , and the expected # of gradient calculations per node equals O (m + BT ) , where ζ C is the expected density from Definition 1.3. Suppose that assumptions of Corollary 6.5 hold, B ≤ m /n, and we use the unbiased compressor RandK with K = ζ C = Θ ( Bd /

to get an ε-solution, the communication complexity is equal to O (d + ζ C T ) , and the number of stochastic gradient calculations per node equals O(B init + BT ), where ζ C is the expected density from Definition 1.3.

communication rounds to get an ε-solution, the communication complexity is equal to O (d + ζ C T ) , and the number of stochastic gradient calculations per node equals O(B init + BT ), where ζ C is the expected density from Definition 1.3.

OF DASHA-MVR AND DASHA-SYNC-MVR
Problem formulation

A.2 FINITE-SUM SETTING

Now, we conduct the same experiments as in Section A.1 with the real-sim dataset (dimension d = 20,958; 72,309 samples) from LIBSVM in the finite-sum setting; moreover, we compare VR-MARINA against DASHA-PAGE with batch size B = 1 in both algorithms. The results in Figure 2 are consistent with Table 1: our new method DASHA-PAGE converges faster than VR-MARINA.

I.6 CASE OF DASHA-MVR UNDER PŁ-CONDITION

communication rounds to get an ε-solution, the communication complexity is equal to O (d + ζ C T ) , and the number of stochastic gradient calculations per node equals O(B init + BT ), where ζ C is the expected density from Definition 1.3. Proof. Considering Theorem I.19 and the choice of B , we have

2 + γ g t -h t 2 + γ h t -∇f (x t ) t+1 -x t 2 + (1 -p) h t -∇f (x t )After rearranging the terms, we getE f (x t+1 ) + κE g t+1 -h t+1 2 + ηE 1 n

In the view of the choice of γ and Lemma I.7, one can show that 1 2γ -L 2 -= 2σ 2 nB , we can conclude the proof.

ACKNOWLEDGEMENTS

The work of P. Richtárik was partially supported by the KAUST Baseline Research Fund Scheme and by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence. The work of A. Tyurin was supported by the Extreme Computing Research Center (ECRC) at KAUST.

J EXTRA EXPERIMENTS

DASHA-MVR improves on VR-MARINA (online) when ε is small (see Tables 1 and 2 and the experiments in Section A). However, our analysis shows that DASHA-MVR incurs a term Bω √ σ 2 /(εnB) in the oracle complexity and a term ω √ σ 2 /(µεnB) in the number of communication rounds in the general nonconvex and PŁ settings, respectively. Both terms can be a bottleneck in some regimes; we now verify this dependence in the PŁ setting. We take a synthetically generated stochastic quadratic optimization problem with one node (n = 1). We generate A in such a way that µ ≈ 1.0 ≤ L ≈ 2.0, and take d = 10 4 , σ 2 = 1.0, RandK with K = 1 (ω ≈ d), batch size B = 1, and σ 2 /(µεnB) = 10 4 . With this particular choice of parameters, ω ≈ σ 2 /(µεnB). The results are provided in Figure 5. We consider DASHA-MVR with the momentum b from Corollary I.16 and with b = min{ 1 /ω , µnεB /σ 2 }. With the latter choice of momentum b, DASHA-MVR converges at the same rate as DASHA-SYNC-MVR or VR-MARINA (online), but to an ε-solution with a smaller ε. On the other hand, the former choice of momentum b guarantees convergence to the correct ε-solution, but at a slower rate. Overall, the experiment provides evidence that our choice of b is correct and that our analysis in Theorem I.15 is tight. If we decrease ω from 10 4 to 10 3 (see Figure 6), or σ 2 from 1.0 to 0.1 (see Figure 7), or µ from 1.0 to 0.1 (see Figure 8), then the gap between the algorithms closes.
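The exact generator of A is not specified here, so the following is a hypothetical sketch of such a synthetic problem: a quadratic whose Hessian has eigenvalues spread in [µ, L], together with a stochastic gradient oracle perturbed by noise of norm σ (all function names are ours).

```python
import numpy as np

def make_quadratic(d, mu, L, seed=0):
    # Build a symmetric matrix A with eigenvalues spread in [mu, L],
    # so f(x) = 0.5 * x^T A x is mu-strongly convex and L-smooth.
    rng = np.random.default_rng(seed)
    eigs = np.linspace(mu, L, d)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal basis
    return Q @ np.diag(eigs) @ Q.T

def stoch_grad(A, x, sigma, rng):
    # Stochastic gradient oracle: exact gradient A x plus a noise
    # vector rescaled to have norm sigma, so E||noise||^2 = sigma^2.
    noise = rng.standard_normal(x.shape)
    noise *= sigma / np.linalg.norm(noise)
    return A @ x + noise
```

This gives a controlled testbed where µ, L, σ², and (via RandK) ω can each be varied independently, matching the ablations in Figures 5 through 8.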

