EF21-P AND FRIENDS: IMPROVED THEORETICAL COMMUNICATION COMPLEXITY FOR DISTRIBUTED OPTIMIZATION WITH BIDIRECTIONAL COMPRESSION

Abstract

The starting point of this paper is the discovery of a novel and simple error-feedback mechanism, which we call EF21-P, for dealing with the error introduced by a contractive compressor. Unlike all prior works on error feedback, where compression and correction operate in the dual space of gradients, our mechanism operates in the primal space of models. While we believe that EF21-P may be of interest in many situations where it is advantageous to perform model perturbation prior to the computation of the gradient (e.g., randomized smoothing and generalization), in this work we focus our attention on its use as a key building block in the design of communication-efficient distributed optimization methods supporting bidirectional compression. In particular, we employ EF21-P as the mechanism for compressing and subsequently error-correcting the model broadcast by the server to the workers. By combining EF21-P with suitable methods performing workers-to-server compression, we obtain novel methods supporting bidirectional compression and enjoying new state-of-the-art theoretical communication complexity for convex and nonconvex problems. For example, our bounds are the first that manage to decouple the variance/error coming from the workers-to-server and server-to-workers compression, transforming a multiplicative dependence into an additive one. In the convex regime, we obtain the first bounds that match the theoretical communication complexity of gradient descent. Even in this convex regime, our algorithms work with biased gradient estimators, which is nonstandard and requires new proof techniques that may be of independent interest. Finally, our theoretical results are corroborated through suitable experiments.

1. INTRODUCTION: ERROR FEEDBACK IN THE PRIMAL SPACE

The key moment which ultimately enabled the main results of this paper was our discovery of a new and simple error-feedback technique, which we call EF21-P, that operates in the primal space of the iterates/models, in contrast to the prevalent approach to error feedback (Stich & Karimireddy, 2019; Karimireddy et al., 2019; Gorbunov et al., 2020b; Beznosikov et al., 2020; Richtárik et al., 2021), which operates in the dual space of gradients. To describe EF21-P, consider solving the optimization problem

min_{x ∈ R^d} f(x), (1)

where f : R^d → R is a smooth but not necessarily convex function. Given a contractive compression operator C : R^d → R^d, i.e., a (possibly) randomized mapping satisfying the inequality

E‖C(x) − x‖² ≤ (1 − α)‖x‖², ∀x ∈ R^d, (2)

for some constant α ∈ (0, 1], our EF21-P method aims to solve (1) via the iterative process

x^{t+1} = x^t − γ∇f(w^t),  w^{t+1} = w^t + C^t(x^{t+1} − w^t), (3)

where γ > 0 is a stepsize, x^0 ∈ R^d is the initial iterate, w^0 = x^0 ∈ R^d is the initial iterate shift, and C^t is an instantiation of a randomized contractive compressor satisfying (2), sampled at time t. Note that when C is the identity mapping (α = 1), then w^t = x^t for all t, and hence EF21-P reduces to vanilla gradient descent (GD). Otherwise, EF21-P is a new optimization method. Note that the {x^t} iteration of EF21-P can be equivalently written in the form of perturbed gradient descent

x^{t+1} = x^t − γ∇f(x^t + ζ^t),  ζ^t = C^{t−1}(x^t − w^{t−1}) − (x^t − w^{t−1}).

Note that the model perturbation ζ^t is not a zero-mean random variable, and that, in view of (2), the size of the perturbation can be bounded via

E[‖ζ^t‖² | x^t, w^{t−1}] ≤ (1 − α)‖x^t − w^{t−1}‖². (4)

From now on, we will write C ∈ B(α) to mean that C is a compressor satisfying (2).

1.1. EF21-P THEORY

If f is L-smooth and µ-strongly convex, we prove that both x^t and w^t converge to x^* = arg min f at a linear rate, in O((L/(αµ)) log(1/ε)) iterations in expectation (see Section D).
Intuitively speaking, this happens because the error-feedback mechanism embedded in EF21-P makes sure that the quantity on the right-hand side of (4) converges to zero, which forces the size of the perturbation error ζ^t to converge to zero as well. EF21-P can be analyzed in the smooth nonconvex regime as well, in which case it finds an ε-approximate stationary point. The precise convergence result and proof, as well as an extension that allows one to replace ∇f(w^t) with a stochastic gradient under the general ABC inequality introduced by Khaled & Richtárik (2020) (which provably holds for various sources of stochasticity, including subsampling and gradient compression), can be found in Section E.
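For concreteness, the recursion (3) can be sketched in a few lines of NumPy. This is an illustrative example, not the paper's implementation: the quadratic test problem and the helper names (`top_k`, `ef21_p`) are ours, and the generic C ∈ B(α) of the theory is instantiated as a TopK sparsifier.

```python
import numpy as np

# Minimal sketch of the EF21-P recursion from Section 1 on a strongly convex
# quadratic f(x) = 0.5 x^T A x - b^T x (problem and names are ours):
#   x^{t+1} = x^t - gamma * grad f(w^t),   w^{t+1} = w^t + C^t(x^{t+1} - w^t).

def top_k(v, k):
    """TopK compressor: keep the k largest-in-magnitude entries (in B(k/d))."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_p(grad, x0, gamma, k, T):
    x, w = x0.copy(), x0.copy()      # w^0 = x^0 (initial model shift)
    for _ in range(T):
        x = x - gamma * grad(w)      # gradient step at the *shifted* model w^t
        w = w + top_k(x - w, k)      # error-feedback update of the shift
    return x, w

rng = np.random.default_rng(0)
d = 50
A = np.diag(np.linspace(1.0, 10.0, d))   # L = 10, mu = 1
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)

x, w = ef21_p(lambda z: A @ z - b, np.zeros(d), gamma=2e-3, k=5, T=100_000)
print(np.linalg.norm(x - x_star), np.linalg.norm(w - x_star))
```

With α = k/d = 0.1, both x^t and w^t approach the minimizer, and with k = d the method is exactly GD, matching the discussion above.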

1.2. SUMMARY OF CONTRIBUTIONS

We believe that EF21-P and its analysis could be useful in various optimization and machine learning contexts in which some kind of iterate perturbation plays an important role, including randomized smoothing (Duchi et al., 2012), perturbed SGD (Vardhan & Stich, 2022), and generalization (Orvieto et al., 2022). In this work we do not venture into these potential application areas, and instead focus all our attention on a single important use case where, as we found out, EF21-P leads to new state-of-the-art methods and theory: the design of communication-efficient distributed optimization methods supporting bidirectional (i.e., workers-to-server and server-to-workers) compression. In particular, we use EF21-P as the mechanism for compressing and subsequently error-correcting the model broadcast by the server to the workers. By combining EF21-P with suitable methods ("friends" in the title of the paper) performing workers-to-server compression, in particular DIANA (Mishchenko et al., 2019; Horváth et al., 2022) or DCGD (Alistarh et al., 2017; Khirirat et al., 2018), we obtain novel methods, suggestively named EF21-P + DIANA (Algorithm 1) and EF21-P + DCGD (Algorithm 2), both supporting bidirectional compression, and both enjoying new state-of-the-art theoretical communication complexity for convex and nonconvex problems. While DIANA and DCGD were not designed to work with compressors from B(α) for the workers-to-server communication, and can in principle diverge if used that way, they work well with the smaller class of randomized compression mappings C : R^d → R^d characterized by

E[C(x)] = x,  E‖C(x) − x‖² ≤ ω‖x‖², ∀x ∈ R^d, (5)

where ω ≥ 0 is a constant. We will write C ∈ U(ω) to mean that C satisfies (5). It is well known that if C ∈ U(ω), then C/(ω+1) ∈ B(1/(ω+1)), which means that the class U(ω) is indeed more narrow.

Convex setting.
EF21-P + DIANA provides new state-of-the-art convergence rates for distributed optimization in the strongly convex (see Table 1) and general convex regimes. This is the first method enabling bidirectional compression whose server-to-workers and workers-to-server communication complexities are both better than those of vanilla GD. When the workers compute stochastic gradients (see Section 3.1), we prove that EF21-P + DIANA improves the rates of the existing methods. We also prove that EF21-P + DCGD has an even better convergence rate than EF21-P + DIANA in the interpolation regime (see Section 3.2).

Nonconvex setting. In the nonconvex setting (see Section 4), EF21-P + DCGD is the first method using bidirectional compression whose convergence rate decouples the noises coming from the workers-to-server and server-to-workers compression, transforming a multiplicative dependence into an additive one (see Table 2). Moreover, EF21-P + DCGD provides a new state-of-the-art convergence rate in the low accuracy regimes (when ε is not too small or the number of workers n is large). Further, we provide examples of optimization problems where EF21-P + DCGD outperforms previous state-of-the-art methods even in the high accuracy regime.

Unified SGD analysis framework with the EF21-P mechanism. Khaled & Richtárik (2020) provide a unified framework for the analysis of SGD-type methods for smooth nonconvex problems. Their framework covers SGD and DCGD under various assumptions, including i) strong and weak growth conditions, and ii) sampling strategies, e.g., importance sampling. Unfortunately, their theory relies heavily on the unbiasedness of the stochastic gradients, as a result of which it is not applicable to our methods (in EF21-P + DCGD, E[g^t] = ∇f(w^t) ≠ ∇f(x^t)). Therefore, we decided to rebuild the theory from scratch.
Our results inherit all previous achievements of ( Khaled & Richtárik, 2020) , and further generalize the unified framework to make it suitable for optimization methods where the iterates are perturbed using the EF21-P mechanism. We believe that this is a contribution with potential applications beyond the focus of this work (distributed optimization with bidirectional compression). This development is presented in Section E; our main results from Section 4.1-4.3 which cater to the nonconvex setting are simple corollaries of our general theory.

2. DISTRIBUTED OPTIMIZATION AND BIDIRECTIONAL COMPRESSION

In this paper, we consider distributed optimization problems in the strongly convex, convex and nonconvex settings. Such problems arise in federated learning (Konečný et al., 2016; McMahan et al., 2017) and in deep learning (Ramesh et al., 2021). In federated learning, a large number of workers/devices/nodes hold local data and communicate with a parameter server that optimizes a function in a distributed fashion (Ramaswamy et al., 2019). Due to privacy concerns and the potentially large number of workers, the communication between the workers and the server is a bottleneck, and requires specialized algorithms capable of reducing the communication overhead. Popular algorithms dealing with these kinds of problems are based on communication compression (Mishchenko et al., 2019; Richtárik et al., 2021; Tang et al., 2019). We consider the distributed optimization problem of the form

min_{x ∈ R^d} { f(x) := (1/n) Σ_{i=1}^n f_i(x) }, (6)

where n is the number of workers and f_i : R^d → R are smooth (possibly nonconvex) functions for all i ∈ [n] := {1, …, n}. We assume that the functions f_i are stored on n workers, each of which is directly connected to a server that orchestrates the work of the devices (Kairouz et al., 2021): the workers perform some calculations and send the results to the server, after which the server performs its own calculations and sends the results back to the workers, and the whole process repeats. Throughout the work we will refer to a subset of the following assumptions:

Assumption 2.1. The function f is L-smooth, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.

Assumption 2.2. The functions f_i are L_i-smooth for all i ∈ [n]. L̃ is a constant such that (1/n) Σ_{i=1}^n ‖∇f_i(x) − ∇f_i(y)‖² ≤ L̃²‖x − y‖² for all x, y ∈ R^d, and L_max := max_{i∈[n]} L_i.

Assumption 2.3. The functions f_i are convex, and the function f is µ-strongly convex with µ ≥ 0 and attains its minimum at some point x^* ∈ R^d.
To avoid ambiguity, the constants L, L̃, and L_i are the smallest such numbers.

Lemma 2.4. If Assumptions 2.1, 2.2 and 2.3 hold, then L ≤ L_max ≤ nL and L ≤ L̃ ≤ √n L.
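As a quick numerical sanity check of Lemma 2.4 (the quadratic example is ours), consider f_i(x) = ½ x^T A_i x with A_i positive semidefinite: then L_i = λ_max(A_i), L = λ_max((1/n) Σ_i A_i), and the constant L̃ of Assumption 2.2 satisfies L̃² = λ_max((1/n) Σ_i A_i²).

```python
import numpy as np

# Verify L <= L_max <= nL and L <= Ltilde <= sqrt(n) L from Lemma 2.4
# on random PSD quadratics (illustrative example, ours).

rng = np.random.default_rng(1)
n, d = 8, 20
As = []
for _ in range(n):
    M = rng.standard_normal((d, d))
    As.append(M @ M.T)                             # random PSD matrix A_i

lam_max = lambda A: np.linalg.eigvalsh(A)[-1]      # eigvalsh: ascending order
L_max = max(lam_max(A) for A in As)                # L_max = max_i lam_max(A_i)
L = lam_max(sum(As) / n)                           # L = lam_max(mean A_i)
L_tilde = np.sqrt(lam_max(sum(A @ A for A in As) / n))

assert L <= L_max <= n * L + 1e-9
assert L <= L_tilde <= np.sqrt(n) * L + 1e-9
print(L, L_tilde, L_max)
```

The inequalities hold with large margins for generic random matrices, illustrating the remark in Section 3.3 that the worst case L_max = nL is atypical.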

2.1. COMMUNICATION COMPLEXITY OF VANILLA GRADIENT DESCENT

Solving the aforementioned optimization problem involves two key communication steps: i) the workers send their results to the server (workers-to-server communication), and ii) the server sends its results to the workers (server-to-workers communication). Let us first consider how this procedure works in the case of GD:

x^{t+1} = x^t − γ∇f(x^t) = x^t − γ (1/n) Σ_{i=1}^n ∇f_i(x^t).

It is well known that if the function f is L-smooth and µ-strongly convex (see Assumptions 2.1 and 2.3), then GD with stepsize γ = 1/L returns an ε-solution after O((L/µ) log(1/ε)) steps. In the distributed setting, GD would require i) the workers to send ∇f_i(x^t) to the server, and ii) the server to send x^{t+1} to the workers or, alternatively, to send (1/n) Σ_{i=1}^n ∇f_i(x^t) to the workers, depending on whether the iterates x^t are updated on the server or on the workers. Assuming that the communication complexity is proportional to the number of communicated coordinates, the server-to-workers and workers-to-server communication complexities are both equal to O((dL/µ) log(1/ε)).

2.2. WORKERS-TO-SERVER (=UPLINK) COMPRESSION

We now move on to more advanced algorithms that aim to improve the workers-to-server communication complexity. These algorithms assume that the server-to-workers communication cost is negligible and focus exclusively on sending messages from the devices to the server. Such an approach can be justified by the fact that the broadcast operation may in some systems be much faster than the gather operation (Mishchenko et al., 2019; Kairouz et al., 2021). Moreover, the server can be considered to be just an abstraction representing "all other nodes", in which case server-to-workers communication does not exist at all. The primary tools that help reduce the communication cost are compression operators, such as vector sparsification and quantization (Beznosikov et al., 2020). However, compression injects error/noise into the process, as formalized in (2) and (5). Two canonical examples of compressors belonging to these two classes are the TopK ∈ B(K/d) and RandK ∈ U(d/K − 1) sparsifiers. The former retains the K largest-in-magnitude entries of the input vector, while the latter retains K entries of this vector chosen uniformly at random, scaled by d/K (Beznosikov et al., 2020). Further examples of compressors belonging to B(α) and U(ω) can be found in (Beznosikov et al., 2020). The theory of methods supporting workers-to-server compression is reasonably well developed. In the convex and strongly convex setting, the current state-of-the-art methods are DIANA (Mishchenko et al., 2019), ADIANA (Li et al., 2020), and CANITA (Li & Richtárik, 2021). In the nonconvex setting, the current state-of-the-art methods are DCGD (Khaled & Richtárik, 2020) (in the low accuracy regime) and MARINA, DASHA, FRECON, and EF21 (Gorbunov et al., 2021; Tyurin & Richtárik, 2022b; a; Zhao et al., 2021; Richtárik et al., 2021) (in the high accuracy regime). To see that these types of algorithms can achieve workers-to-server communication complexity that is no worse than that of GD, let us consider the DIANA method.
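The two canonical sparsifiers are easy to implement and to check against definitions (2) and (5). The following sketch (helper names ours) verifies the TopK contraction deterministically and the RandK unbiasedness and variance by Monte Carlo:

```python
import numpy as np

def top_k(v, K):
    """Keep the K largest-in-magnitude entries; TopK is in B(K/d)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-K:]
    out[idx] = v[idx]
    return out

def rand_k(v, K, rng):
    """Keep K uniformly chosen entries, scaled by d/K; RandK is in U(d/K - 1)."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=K, replace=False)
    out[idx] = v[idx] * (v.size / K)
    return out

rng = np.random.default_rng(0)
d, K = 100, 10
x = rng.standard_normal(d)

# TopK satisfies (2) with alpha = K/d, even without the expectation:
assert np.sum((top_k(x, K) - x) ** 2) <= (1 - K / d) * np.sum(x ** 2) + 1e-12

# RandK satisfies (5) with omega = d/K - 1; check by Monte Carlo:
samples = np.array([rand_k(x, K, rng) for _ in range(20000)])
assert np.linalg.norm(samples.mean(axis=0) - x) < 0.5        # unbiasedness
mse = np.mean(np.sum((samples - x) ** 2, axis=1))
omega = d / K - 1
assert 0.9 * omega * np.sum(x ** 2) < mse < 1.1 * omega * np.sum(x ** 2)
```

For RandK the variance bound in (5) in fact holds with equality, which is why the Monte Carlo estimate lands close to ω‖x‖².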
In the strongly convex case, DIANA (Khaled et al., 2020) has iteration complexity O( ((1 + ω/n)(L_max/µ) + ω) log(1/ε) ). Using the RandK compression operator with K = d/n, the workers-to-server communication complexity is not greater than

O( (d/n) × ((1 + ω/n)(L_max/µ) + ω) log(1/ε) ) = O( (dL_max/(nµ) + d) log(1/ε) ),

meaning that DIANA's complexity is better than GD's complexity O((dL/µ) log(1/ε)) (recall that L_max ≤ nL). The same reasoning applies to other algorithms in the convex and nonconvex worlds.

2.3. BIDIRECTIONAL COMPRESSION

In the previous section, we showed that it is possible to improve the workers-to-server communication complexity of GD. But what about server-to-workers compression? Does there exist a method that also compresses the information sent from the server to the workers, and obtains workers-to-server and server-to-workers communication complexities at least as good as those of the vanilla GD method? As far as we know, the current answer to this question is NO! Bidirectional compression has been considered in many papers, including (Horváth et al., 2019; Tang et al., 2019; Liu et al., 2020; Philippenko & Dieuleveut, 2020; 2021; Fatkhullin et al., 2021). In Table 1, we provide a comparison of methods applying this type of compression in the strongly convex setting. Let us now take a closer look at the MCM method of Philippenko & Dieuleveut (2021). For simplicity, we assume that the server and the workers use RandK compressors with parameters K_s and K_w, respectively. The server-to-workers communication complexity of MCM is not less than

Ω( K_s × (1 + ω_s^{3/2} + ω_s ω_w^{1/2}/√n + ω_w/n)(L_max/µ) log(1/ε) ) = Ω( (d^{3/2}/K_s^{1/2})(L_max/µ) log(1/ε) ).

Thus, for any K_s ∈ [1, d], the server-to-workers communication complexity is worse than GD's complexity O((dL/µ) log(1/ε)) by a factor of d^{1/2}/K_s^{1/2}. The same reasoning applies to Dore (Liu et al., 2020) and Artemis (Philippenko & Dieuleveut, 2020):

Ω( K_s × (ω_s ω_w/n)(L_max/µ) log(1/ε) ) = Ω( (d²/(K_w n))(L_max/µ) log(1/ε) ).

It turns out that one can find an example of problem (6) with L_max = nL. Therefore, in the worst-case scenario, the server-to-workers communication complexity can be up to d/K_w times worse than GD's complexity for any K_w ∈ [1, d].

Algorithm 1 EF21-P + DIANA
1: Parameters: learning rates γ > 0 (for learning the model) and β > 0 (for learning the gradient shifts); initial model x^0 ∈ R^d (stored on the server and the workers); initial gradient shifts h_1^0, …, h_n^0 ∈ R^d (stored on the workers); average of the initial gradient shifts h^0 = (1/n) Σ_{i=1}^n h_i^0 (stored on the server); initial model shift w^0 = x^0 ∈ R^d (stored on the server and the workers)
2: for t = 0, 1, …, T − 1 do
3:   for i = 1, …, n in parallel do
4:     m_i^t = C_i^D(∇f_i(w^t) − h_i^t)   ▷ Worker i compresses the shifted gradient via the dual compressor C_i^D ∈ U(ω)
5:     Send the compressed message m_i^t to the server
6:     h_i^{t+1} = h_i^t + β m_i^t   ▷ Worker i updates its local gradient shift with stepsize β
7:   end for
8:   m^t = (1/n) Σ_{i=1}^n m_i^t   ▷ Server averages the n messages received from the workers
9:   h^{t+1} = h^t + β m^t   ▷ Server updates the average gradient shift so that h^t = (1/n) Σ_{i=1}^n h_i^t
10:  g^t = h^t + m^t   ▷ Server computes the gradient estimator
11:  x^{t+1} = x^t − γ g^t   ▷ Server takes a gradient-type step with stepsize γ
12:  p^{t+1} = C^P(x^{t+1} − w^t)   ▷ Server compresses the shifted model via the primal compressor C^P ∈ B(α)
13:  w^{t+1} = w^t + p^{t+1}   ▷ Server updates the model shift
14:  Broadcast the compressed message p^{t+1} to all n workers
15:  for i = 1, …, n in parallel do
16:    w^{t+1} = w^t + p^{t+1}   ▷ Worker i updates its local copy of the model shift
17:  end for
18: end for
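To make the data flow of Algorithm 1 concrete, here is a minimal single-process simulation on quadratic workers. This is an illustrative sketch, not the paper's implementation: the test problem and the helper names (`top_k`, `rand_k`, `ef21_p_diana`) are ours, with RandK playing the role of the dual compressors C_i^D ∈ U(ω) and TopK the role of the primal compressor C^P ∈ B(α).

```python
import numpy as np

# Single-process simulation of EF21-P + DIANA (Algorithm 1) on n quadratic
# workers f_i(x) = 0.5 x^T A_i x - b_i^T x (problem and names are ours).

def top_k(v, K):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-K:]
    out[idx] = v[idx]
    return out

def rand_k(v, K, rng):
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=K, replace=False)
    out[idx] = v[idx] * (v.size / K)     # d/K scaling makes RandK unbiased
    return out

def ef21_p_diana(grads, x0, gamma, beta, Kw, Ks, T, rng):
    n, d = len(grads), x0.size
    x, w = x0.copy(), x0.copy()          # w^0 = x^0
    h = [np.zeros(d) for _ in range(n)]  # gradient shifts h_i^0 = 0
    h_avg = np.zeros(d)                  # server's average of the h_i
    for _ in range(T):
        m = [rand_k(grads[i](w) - h[i], Kw, rng) for i in range(n)]  # line 4
        for i in range(n):
            h[i] += beta * m[i]                                      # line 6
        m_avg = sum(m) / n                                           # line 8
        g = h_avg + m_avg                    # line 10, using the old h^t
        h_avg += beta * m_avg                                        # line 9
        x = x - gamma * g                                            # line 11
        w = w + top_k(x - w, Ks)                              # lines 12-13, 16
    return x

rng = np.random.default_rng(2)
n, d = 4, 30
As = [np.diag(rng.uniform(1.0, 4.0, d)) for _ in range(n)]
bs = [rng.standard_normal(d) for _ in range(n)]
grads = [lambda z, A=A, b=b: A @ z - b for A, b in zip(As, bs)]
x_star = np.linalg.solve(sum(As) / n, sum(bs) / n)

omega = d / 10 - 1                       # RandK with Kw = 10, so omega = 2
x = ef21_p_diana(grads, np.zeros(d), gamma=5e-3, beta=1 / (omega + 1),
                 Kw=10, Ks=6, T=40_000, rng=rng)
print(np.linalg.norm(x - x_star))
```

Despite both directions being compressed, the iterates converge to the exact solution: the DIANA shifts h_i learn ∇f_i(x^*), so the dual compression noise vanishes, while EF21-P drives w^t toward x^t.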

2.4. NEW METHODS

We are now ready to present our main method EF21-P + DIANA (see Algorithm 1), which is a combination of our EF21-P mechanism described in Section 1 (and analyzed in Sections D and E) and the DIANA method of Mishchenko et al. (2019); Horváth et al. (2022); Gorbunov et al. (2020a). The pseudocode of Algorithm 1 should be self-explanatory. If the gradient shifts {h_i^t} employed by DIANA are initialized to zeros, and we choose β = 0, then DIANA reduces to DCGD, and EF21-P + DIANA thus reduces to EF21-P + DCGD (see Algorithm 2). If we further choose the dual/gradient compressors C_i^D to be identity mappings, then EF21-P + DCGD further reduces to EF21-P.

3. ANALYSIS IN THE CONVEX SETTING

Let us first state the convergence theorem.

Theorem 3.1. Suppose that Assumptions 2.1, 2.2 and 2.3 hold, β = 1/(ω+1), x^0 = w^0, and let

γ ≤ min{ n/(160ωL_max), √n α/(20√ω L̃), α/(100L), 1/((ω+1)µ) }.

Then Algorithm 1 returns x^T such that

(1/(2γ)) E‖x^T − x^*‖² + E[f(x^T) − f(x^*)] ≤ (1 − γµ/2)^T V^0,

where V^0 := (1/(2γ)) E‖x^0 − x^*‖² + f(x^0) − f(x^*) + (8γω(ω+1)/n²) Σ_{i=1}^n ‖h_i^0 − ∇f_i(x^*)‖².

Table 1: Strongly Convex Case. The number of communication rounds needed to get an ε-solution (E[‖x − x^*‖²] ≤ ε), up to logarithmic factors. To make the comparison easier, if a method works with a biased compressor, we assume that the biased compressor is formed from unbiased compressors and that ω_w + 1 = 1/α_w and ω_s + 1 = 1/α_s, where ω_w and ω_s are the parameters of the workers-to-server and server-to-workers compressors, respectively. The rows for our new methods read:

EF21-P + DIANA (new) (Theorem 3.1): (1 + ω_s)L/µ + √((ω_s+1)ω_w/n) L̃/µ + (ω_w/n)L_max/µ + ω_w, or, presented a bit less accurately, (1 + ω_s + ω_w/n)L_max/µ + ω_w; no limitations.
EF21-P + DCGD (new) (Theorem G.3): (1 + ω_s)L/µ + √((ω_s+1)ω_w/n) L̃/µ + (ω_w/n)L_max/µ, or, presented a bit less accurately, (1 + ω_s + ω_w/n)L_max/µ; limitation: interpolation regime, ∇f_i(x^*) = 0.

The above result means that EF21-P + DIANA guarantees an ε-solution after

T_NEW := O( ( L/(αµ) + √(ω/(αn)) L̃/µ + (ω/n) L_max/µ + ω ) log(1/ε) )

steps. Noting that L ≤ L̃ ≤ L_max and √(ω/(αn)) ≤ 1/(2α) + ω/(2n), this gives T_NEW = O( ((1/α + ω/n) L_max/µ + ω) log(1/ε) ). Comparing this rate with the rates achieved by the existing algorithms (see Table 1), our method is the first one to guarantee the decoupling of the noises α and ω coming from the server-to-workers and workers-to-server compressors. Moreover, it is more general, as the server-to-workers compression can use biased compressors, including TopK and RankK (Safaryan et al., 2021), which can in practice perform better than unbiased ones (Beznosikov et al., 2020; Vogels et al., 2019). As promised, let us now show that EF21-P + DIANA has a communication complexity better than that of GD.
For simplicity, we assume that the server and the workers use TopK and RandK compressors, respectively. Since under this assumption ω = d/K − 1 and α = K/d, the server-to-workers and workers-to-server communication complexities equal

O( K × ( L/(αµ) + √(ω/(αn)) L̃/µ + (ω/n) L_max/µ + ω ) log(1/ε) ) = O( ( dL/µ + (d/√n) L̃/µ + (d/n) L_max/µ ) log(1/ε) ).

Note that L_max ≤ nL and L̃ ≤ √n L, so this complexity is no worse than GD's complexity for any K ∈ [1, d]. The general convex case is discussed in Section F.1.

3.1. STOCHASTIC GRADIENTS

In this section, we assume that the workers in EF21-P + DIANA compute stochastic gradients instead of exact gradients:

Assumption 3.2 (Stochastic gradients). For all x ∈ R^d, the stochastic gradients ∇̂f_i(x) are unbiased and have bounded variance, i.e., E[∇̂f_i(x)] = ∇f_i(x) and E[‖∇̂f_i(x) − ∇f_i(x)‖²] ≤ σ² for all i ∈ [n], where σ² ≥ 0.

We now provide a generalization of Theorem 3.1:

Theorem 3.3. Consider Algorithm 1 using stochastic gradients ∇̂f_i instead of the exact gradients ∇f_i for all i ∈ [n]. Let Assumptions 2.1, 2.2, 2.3 and 3.2 hold, β = 1/(ω+1), x^0 = w^0, and

γ ≤ min{ n/(160ωL_max), √n α/(20√ω L̃), α/(100L), 1/((ω+1)µ) }.

Then Algorithm 1 returns x^T such that

(1/(2γ)) E‖x^T − x^*‖² + E[f(x^T) − f(x^*)] ≤ (1 − γµ/2)^T V^0 + 24(ω+1)σ²/(µn),

where V^0 := (1/(2γ)) E‖x^0 − x^*‖² + f(x^0) − f(x^*) + (8γω(ω+1)/n²) Σ_{i=1}^n ‖h_i^0 − ∇f_i(x^*)‖². For the general convex case, we refer to Theorem F.4. Note that Theorem 3.3 yields the same convergence rate as Theorem 3.1, except for the statistical term O((ω+1)σ²/(µn)), which is the same as in DIANA (Gorbunov et al., 2020a; Khaled et al., 2020) and does not depend on α.

3.2. EF21-P + DCGD AND INTERPOLATION REGIME

We also analyze a second method, EF21-P + DCGD, which is based on DCGD (Khaled & Richtárik, 2020; Alistarh et al., 2017). One can think of DCGD as DIANA with parameter β = 0. On the one hand, EF21-P + DCGD converges faster (see Theorem G.3) than EF21-P + DIANA (see Theorem 3.1). On the other hand, we can only guarantee convergence to a O((1/n) Σ_{i=1}^n ‖∇f_i(x^*)‖²) "neighborhood" of the solution. However, this "neighborhood" disappears in the interpolation regime, i.e., when ∇f_i(x^*) = 0 for all i ∈ [n]. The interpolation regime is very common in modern deep learning tasks (Brown et al., 2020; Bubeck & Sellke, 2021).

3.3. WHY DO BIDIRECTIONAL METHODS WORK MUCH BETTER THAN GD?

Our analysis of EF21-P + DIANA covers the worst-case scenario for the values of L_max and α. Although L_max can be equal to nL, in practice it tends to be much smaller. Similarly, the assumed bound α = K/d for the TopK compressor is also very conservative, and the "effective" α is typically much larger (Beznosikov et al., 2020; Vogels et al., 2019; Xu et al., 2021). Note that Algorithm 1 does not depend on α! Our claims are also supported by the experiments in Section 5.

4. ANALYSIS IN THE NONCONVEX SETTING

In the nonconvex case, existing bidirectional methods suffer from the same problem as those used in the convex case (see Section 2.3): they either do not provide server-to-workers compression at all, or the compressor errors/noises are coupled in a multiplicative fashion (see ω_w and ω_s in Table 2). Instead of convexity (see Assumption 2.3), we will need the following assumption:

Assumption 4.1 (Lower boundedness). There exist f^* ∈ R and f_1^*, …, f_n^* ∈ R such that f(x) ≥ f^* and f_i(x) ≥ f_i^* for all x ∈ R^d and all i ∈ [n].

As in the convex setting, the theory of methods that only use workers-to-server compression is well examined. In the high accuracy regimes, the current state-of-the-art methods are MARINA and DASHA (Gorbunov et al., 2021; Tyurin & Richtárik, 2022b); both return an ε-stationary point after O( ∆^0 L/ε + ∆^0 ω L̃/(√n ε) ) iterations, where ∆^0 := f(x^0) − f^*. In the low accuracy regimes, the current state-of-the-art method is DCGD (Khaled & Richtárik, 2020), with iteration complexity O( ∆^0 L/ε + ∆^0(∆^0 + ∆^*)(1 + ω) L L_max/(nε²) ), where ∆^* := f^* − (1/n) Σ_{i=1}^n f_i^*. Note that DCGD has a worse dependence on ε, but it scales much better with the number of workers n.

We now investigate how EF21-P can help us in the general nonconvex case. Recall that in the convex case, the decoupling of the noises coming from the two compression schemes was achieved by combining EF21-P with DIANA. In the nonconvex setting, we successfully combine EF21-P and DCGD. Moreover, we provide an analysis of some particular cases where EF21-P + DCGD can be the method of choice in the high accuracy regimes. Whether or not it is possible to achieve the decoupling by combining our method with MARINA or DASHA is not yet known, and we leave this for future work.

Table 2: Nonconvex Case. The number of communication rounds needed to get an ε-stationary point (E[‖∇f(x)‖²] ≤ ε). For simplicity, we assume that f_i^* = f^* for all i ∈ [n], and only the terms involving ω_w and ω_s are shown. The parameters ω_w and ω_s have the same meaning as in Table 1.

Method | # Communication Rounds | Limitations
DCGD (Khaled & Richtárik, 2020) | (∆^0)² ω_w L L_max/(nε²) | No server-to-workers compression.
MARINA, DASHA (Gorbunov et al., 2021; Tyurin & Richtárik, 2022b) | ∆^0 ω_w L̃/(√n ε) | No server-to-workers compression.
MCM (Philippenko & Dieuleveut, 2021) | ∆^0 ( ω_s^{3/2}/ε + ω_s ω_w^{1/2}/(√n ε) + ω_w/(nε) ) L_max | Only homogeneous case, i.e., f_i = f for all i ∈ [n].
CD-Adam (Wang et al., 2022) | Ω( √d max{ω_s, ω_w}⁴/ε² ) | Bounded gradient assumption.
EF21-BC (Fatkhullin et al., 2021) | ∆^0 ω_w ω_s L/ε | None.
NEOLITHIC (Huang et al., 2022) | ∆^0 L_max/ε | Does not compress vectors (a). Bounded gradient assumption.
EF21-P + DCGD (new) | (∆^0)² ω_w L L_max/(nε²) + ∆^0 ω_s L/ε | None.
EF21-P + DCGD (new) | ∆^0 D ω_w L/(nε) + ∆^0 ω_s L/ε | Strong-growth assumption with parameter D.

(a) In each communication round, NEOLITHIC sends a number of compressed vectors proportional to 1/α, where α is the parameter of a biased compressor. For TopK or RandK, this means NEOLITHIC sends Θ(d/K) sparsified vectors with K nonzero elements each, meaning that, in total, Θ(d) values are sent in each communication round.

4.1. EF21-P + DCGD IN THE GENERAL NONCONVEX CASE

Without any restrictive assumptions, we can prove the following convergence result:

Theorem 4.2. Consider Algorithm 2, let Assumptions 2.1, 2.2 and 4.1 hold, x^0 = w^0, and

γ = min{ α/(8L), √n/√(2ω L L_max T), nε/(32∆^* ω L L_max) }.

Then

T ≥ (48∆^0 L/ε) max{ 8/α, 96∆^0 ω L_max/(nε), 32∆^* ω L_max/(nε) } ⇒ min_{0≤t≤T−1} E‖∇f(x^t)‖² ≤ ε.

(The proof follows from Theorem E.3 and Proposition E.4 (Part 1).) We recover the rate of DCGD (Khaled & Richtárik, 2020) plus an additional O(∆^0 L/(αε)) term, thus obtaining the first method with bidirectional compression in which the noises from the two compressors are decoupled. Moreover, as noted before, this method provides the state-of-the-art rates when ε is not too small or the number of workers n is large.

4.2. STRONG GROWTH CONDITION

Here we analyze EF21-P + DCGD under the strong-growth condition (Schmidt & Roux, 2013).

Assumption 4.3. There exists D > 0 such that (1/n) Σ_{i=1}^n ‖∇f_i(x)‖² ≤ D‖∇f(x)‖² for all x ∈ R^d.

While this assumption is restrictive and does not even hold for quadratic optimization problems, there exist numerous practical applications where it is reasonable. These include, for example, deep learning models whose number of parameters d is so large that they can interpolate the training dataset (Schmidt & Roux, 2013; Vaswani et al., 2019; Meng et al., 2020). Such models are trained in distributed environments, in which case communication becomes the main bottleneck (Ramesh et al., 2021). For these problems, our method is suitable and can be successfully applied.

Theorem 4.4. Consider Algorithm 2, let Assumptions 2.1, 2.2, 4.1 and 4.3 hold, and choose x^0 = w^0 and γ = min{ α/(8L), n/(4DωL) }. Then

T ≥ (48∆^0 L/ε) max{ 8/α, 4Dω/n } ⇒ min_{0≤t≤T−1} E‖∇f(x^t)‖² ≤ ε.

(The proof follows from Theorem E.3 and Proposition E.4 (Part 2).) Compared with Section 4.1, the above result shows an improved dependence on ε under the strong growth assumption.
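To see what Assumption 4.3 does and does not allow, one can numerically estimate the ratio (1/n) Σ_i ‖∇f_i(x)‖² / ‖∇f(x)‖² (the quadratic example and all names are ours). With identical workers the ratio is exactly 1, while for heterogeneous quadratics it blows up near the minimizer, so no finite D exists, consistent with the remark above that the assumption can fail for quadratic problems:

```python
import numpy as np

# Estimate D(x) = (1/n) sum_i ||grad f_i(x)||^2 / ||grad f(x)||^2
# for quadratics f_i(x) = 0.5 ||x - c_i||^2 (illustrative example, ours).

rng = np.random.default_rng(0)
n, d = 5, 10

def ratio(cs, x):
    grads = [x - c for c in cs]          # grad f_i(x) = x - c_i
    g = sum(grads) / len(cs)             # grad f(x)
    return np.mean([np.sum(gi ** 2) for gi in grads]) / np.sum(g ** 2)

c = rng.standard_normal(d)
homog = [c.copy() for _ in range(n)]                  # identical workers
heterog = [rng.standard_normal(d) for _ in range(n)]  # heterogeneous workers
x_star = sum(heterog) / n                             # minimizer of f

x = rng.standard_normal(d)
r_homog = ratio(homog, x)                             # exactly 1
r_near_opt = ratio(heterog, x_star + 1e-3 * rng.standard_normal(d))
print(r_homog, r_near_opt)
```

Near x^* the denominator ‖∇f(x)‖² vanishes while the numerator stays bounded away from zero, so `r_near_opt` is enormous; this is exactly the interpolation-vs-heterogeneity distinction drawn in Section 3.2.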

4.3. HOMOGENEOUS FUNCTIONS

Another important setting where our method can be useful is distributed optimization with homogeneous (identical) functions and stochastic gradients. In particular, we consider the case when f_i = f for all i ∈ [n] and stochastic gradients are used instead of exact gradients. This setting arises, for instance, in distributed machine learning problems where every worker samples mini-batches from a large shared dataset (Recht et al., 2011; Goyal et al., 2017).

Theorem 4.5. Consider Algorithm 2 with stochastic gradients ∇̂f instead of the exact gradients ∇f. Suppose that Assumptions 2.1, 2.2, 3.2 and 4.1 hold and f_i = f for all i ∈ [n]. Set x^0 = w^0 and let γ = min{ α/(8L), 1/(4(ω/n + 1)L), nε/(16(ω+1)σ²L) }. Then

T ≥ (48∆^0 L/ε) max{ 8/α, 4(ω/n + 1), 16(ω+1)σ²/(nε) } ⇒ min_{0≤t≤T−1} E‖∇f(x^t)‖² ≤ ε.

(The proof follows from Theorem E.3 and Proposition E.4 (Part 4).) Under exactly the same assumptions, the MCM method of Philippenko & Dieuleveut (2021) with bidirectional compression guarantees the convergence rate

Θ( (∆^0 L/ε) max{ ω_w/n + 1, ω_s^{3/2}, ω_s ω_w^{1/2}/√n, (ω_w+1)σ²/(nε) } ).

Comparing this with our result, the last statistical term (ω+1)σ²/(nε) is the same in both cases, but we significantly improve the other communication terms (take ω = ω_w and α = (ω_s + 1)^{−1} in Theorem 4.5).

5. EXPERIMENTAL HIGHLIGHTS

We now provide a few highlights from our experiments. For more details and further experiments, we refer to Section A, where we compare our algorithms with the previous state-of-the-art method MCM and also solve a nonconvex task. We consider a logistic regression task with the real-sim dataset (# of features = 20,958, # of samples = 72,309) from the LIBSVM collection (Chang & Lin, 2011). Each plot shows the function value against the total number of coordinates transmitted from and to the server. In all algorithms, the RandK compressor is used to compress the information sent from the workers to the server. In EF21-P + DIANA and EF21-P + DCGD, we use the TopK compressor to compress the information sent from the server to the workers. The results are presented in Figure 1. The main conclusion from these experiments is that EF21-P + DIANA and EF21-P + DCGD converge to a solution no slower than DIANA, even though DIANA does not compress the vectors sent from the server to the workers! This means that EF21-P + DIANA and EF21-P + DCGD can send 400× fewer values from the server to the workers for free. Moreover, we see that EF21-P + DCGD converges faster than its competitors.

A FURTHER EXPERIMENTS

We now provide the results of our experiments on practical machine learning tasks with LIBSVM datasets (Chang & Lin, 2011) (under the 3-clause BSD license). Each plot shows the function value against the total number of coordinates transmitted from and to the server. The parameters of the algorithms are as suggested by the theory, except for the stepsizes γ, which we fine-tune over the set {2^i | i ∈ [−10, 10]}. We solve the logistic regression problem

f_i(x_1, …, x_c) := −(1/m) Σ_{j=1}^m log( exp(a_{ij}^⊤ x_{y_{ij}}) / Σ_{y=1}^c exp(a_{ij}^⊤ x_y) ),

where x_1, …, x_c ∈ R^d, c is the number of unique labels, a_{ij} ∈ R^d is a feature vector of a sample on the i-th worker, y_{ij} is the corresponding label, and m is the number of samples located on the i-th worker. In all algorithms, the RandK compressor is used to compress the information sent from the workers to the server. In EF21-P + DIANA and EF21-P + DCGD, we use the TopK compressor to compress the information sent from the server to the workers. The performance of the algorithms is compared on the w8a (# of features = 300, # of samples = 49,749), CIFAR10 (Krizhevsky et al., 2009) (# of features = 3,072, # of samples = 50,000), and real-sim (# of features = 20,958, # of samples = 72,309) datasets. The results are presented in Figures 1, 2 and 3. The conclusions are the same as in Section 5: EF21-P + DIANA and EF21-P + DCGD converge to a solution no slower than DIANA, even though DIANA does not compress the vectors sent from the server to the workers, while sending 100× to 1000× fewer values from the server to the workers! We also compare our algorithms to MCM. Since MCM does not support the contractive compressors defined in (2), we use RandK instead of the TopK compressor for the server-to-workers compression. Figure 4 shows that our new algorithms converge faster.
Finally, we provide experiments for the nonconvex setting and compare EF21-P + DCGD against EF21-BC (Fatkhullin et al., 2021) and DASHA (Tyurin & Richtárik, 2022b). We consider logistic regression with a nonconvex regularizer, where $[\cdot]_k$ denotes the $k$-th coordinate of a vector and $\lambda = 0.001$. We use the RandK and TopK compressors for the workers-to-server and server-to-workers compression, respectively. Note that in these experiments, server-to-workers compression is only supported by EF21-P + DCGD and EF21-BC. In Figure 5, one can see that EF21-P + DCGD converges faster than the other algorithms and outperforms DASHA, which does not compress the vectors transmitted from the server to the workers.

B USEFUL IDENTITIES AND INEQUALITIES

For all $x, y, x_1, \ldots, x_n \in \mathbb{R}^d$, $s > 0$ and $\alpha \in (0, 1]$, we have:
$$\|x + y\|^2 \le (1 + s)\|x\|^2 + (1 + s^{-1})\|y\|^2, \quad (7)$$
$$\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2, \quad (8)$$
$$\langle x, y \rangle \le \frac{\|x\|^2}{2s} + \frac{s\|y\|^2}{2}, \quad (9)$$
$$(1 - \alpha)\left(1 + \frac{\alpha}{2}\right) \le 1 - \frac{\alpha}{2}, \quad (10)$$
$$(1 - \alpha)\left(1 + \frac{2}{\alpha}\right) \le \frac{2}{\alpha}, \quad (11)$$
$$\langle a, b \rangle = \frac{1}{2}\left(\|a\|^2 + \|b\|^2 - \|a - b\|^2\right). \quad (12)$$
Tower property: for any random variables $X$ and $Y$, we have
$$\mathbb{E}\left[\mathbb{E}\left[X \mid Y\right]\right] = \mathbb{E}\left[X\right]. \quad (13)$$
Variance decomposition: for any random vector $X \in \mathbb{R}^d$ and any non-random $c \in \mathbb{R}^d$, we have
$$\mathbb{E}\|X - c\|^2 = \mathbb{E}\|X - \mathbb{E}[X]\|^2 + \|\mathbb{E}[X] - c\|^2. \quad (14)$$
Lemma B.1 (Nesterov (2018)). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a function for which Assumptions 2.1 and 2.3 are satisfied. Then for all $x, y \in \mathbb{R}^d$ we have $\|\nabla f(x) - \nabla f(y)\|^2 \le 2L\left(f(x) - f(y) - \langle \nabla f(y), x - y \rangle\right)$.
Lemma B.2 (Khaled & Richtárik (2020)). Let $f$ be a function for which Assumptions 2.1 and 4.1 are satisfied. Then for all $x \in \mathbb{R}^d$ we have $\|\nabla f(x)\|^2 \le 2L\left(f(x) - f^*\right)$.

C PROOF OF LEMMA 2.4

Lemma 2.4. If Assumptions 2.1, 2.2 and 2.3 hold, then $L \le L_{\max} \le nL$ and $L \le \widetilde{L} \le \sqrt{n}L$.

Proof. One can show (see Nesterov (2003)) that a convex function $f$ is $L$-smooth if and only if either of the two conditions below holds:
$$0 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle \le L\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d,$$
$$\|\nabla f(x) - \nabla f(y)\|^2 \le L \langle \nabla f(x) - \nabla f(y), x - y \rangle, \quad \forall x, y \in \mathbb{R}^d.$$
For any fixed $i \in [n]$, we have
$$\langle \nabla f_i(x) - \nabla f_i(y), x - y \rangle \le \sum_{i=1}^n \langle \nabla f_i(x) - \nabla f_i(y), x - y \rangle = n\left\langle \frac{1}{n}\sum_{i=1}^n \left(\nabla f_i(x) - \nabla f_i(y)\right), x - y \right\rangle = n\langle \nabla f(x) - \nabla f(y), x - y \rangle \le n\|\nabla f(x) - \nabla f(y)\|\|x - y\| \overset{(2.1)}{\le} nL\|x - y\|^2.$$
Thus $L_i \le nL$ and $L_{\max} \le nL$. Next,
$$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(y)\|^2 \le \frac{1}{n}\sum_{i=1}^n L_i \langle \nabla f_i(x) - \nabla f_i(y), x - y \rangle \le L_{\max}\,\frac{1}{n}\sum_{i=1}^n \langle \nabla f_i(x) - \nabla f_i(y), x - y \rangle = L_{\max} \langle \nabla f(x) - \nabla f(y), x - y \rangle \le L_{\max}\|\nabla f(x) - \nabla f(y)\|\|x - y\| \overset{(2.1)}{\le} L_{\max} L \|x - y\|^2 \le nL^2\|x - y\|^2,$$
and hence $\widetilde{L} \le \sqrt{n}L$. Using Jensen's inequality, we have
$$\|\nabla f(x) - \nabla f(y)\|^2 \le \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(y)\|^2 \le \widetilde{L}^2\|x - y\|^2.$$
Thus $L \le \widetilde{L}$. Finally, $\widetilde{L} \le L_{\max}$ (and hence $L \le L_{\max}$) follows from
$$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f_i(y)\|^2 \le \frac{1}{n}\sum_{i=1}^n L_i^2 \|x - y\|^2 \le L_{\max}^2 \|x - y\|^2.$$
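A few of the facts above can be checked numerically. The following minimal NumPy sketch verifies inequality (7), identity (12), and the variance decomposition (14) (for an empirical distribution, where the sample mean plays the role of the expectation):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
s = 0.7

# (7): ||x + y||^2 <= (1 + s)||x||^2 + (1 + 1/s)||y||^2
lhs = np.sum((x + y) ** 2)
rhs = (1 + s) * np.sum(x ** 2) + (1 + 1 / s) * np.sum(y ** 2)
assert lhs <= rhs

# (12): <x, y> = (1/2)(||x||^2 + ||y||^2 - ||x - y||^2)
assert np.isclose(np.dot(x, y),
                  0.5 * (np.sum(x ** 2) + np.sum(y ** 2) - np.sum((x - y) ** 2)))

# (14): E||X - c||^2 = E||X - E X||^2 + ||E X - c||^2, exact for the sample mean
X = rng.normal(size=(10000, 3))
c = np.array([1.0, -2.0, 0.5])
m = X.mean(axis=0)
lhs = np.mean(np.sum((X - c) ** 2, axis=1))
rhs = np.mean(np.sum((X - m) ** 2, axis=1)) + np.sum((m - c) ** 2)
assert np.isclose(lhs, rhs)
```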

D CONVERGENCE OF EF21-P IN THE STRONGLY CONVEX REGIME

We now provide the convergence rate of EF21-P from Section 1 in the strongly convex case.

Theorem D.1. Let Assumptions 2.1 and 2.3 hold, set $w^0 = x^0$ and choose $\gamma \le \frac{\alpha}{16L}$. Then EF21-P returns $x^T$ such that
$$\frac{1}{2\gamma}\mathbb{E}\|x^T - x^*\|^2 + \mathbb{E}\left[f(x^T)\right] - f(x^*) \le \left(1 - \frac{\gamma\mu}{2}\right)^T \left(\frac{1}{2\gamma}\|x^0 - x^*\|^2 + f(x^0) - f(x^*)\right).$$
Moreover, $\mathbb{E}\|w^t - x^*\|^2 \to 0$ as $t \to \infty$.

Theorem D.1 states that EF21-P returns an $\varepsilon$-solution after $O\left(\frac{L}{\alpha\mu}\log\frac{1}{\varepsilon}\right)$ steps. Compared to GD's rate $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$, EF21-P converges at most $\frac{1}{\alpha}$ times slower.

Proof. First, let us note that
$$\|x^t - x^*\|^2 - \|x^{t+1} - x^*\|^2 - \|x^{t+1} - x^t\|^2 = \langle x^t - x^{t+1}, x^t + x^{t+1} - 2x^* \rangle - \langle x^{t+1} - x^t, x^{t+1} - x^t \rangle = 2\langle x^t - x^{t+1}, x^{t+1} - x^* \rangle = 2\gamma \langle \nabla f(w^t), x^{t+1} - x^* \rangle. \quad (17)$$
Using $L$-smoothness of $f$ (Assumption 2.1) and $\mu$-strong convexity (Assumption 2.3), we obtain
$$f(x^{t+1}) \le f(w^t) + \langle \nabla f(w^t), x^{t+1} - w^t \rangle + \frac{L}{2}\|x^{t+1} - w^t\|^2 \le f(x^*) + \langle \nabla f(w^t), x^{t+1} - x^* \rangle - \frac{\mu}{2}\|w^t - x^*\|^2 + \frac{L}{2}\|x^{t+1} - w^t\|^2 \overset{(17)}{\le} f(x^*) + \frac{1}{2\gamma}\|x^t - x^*\|^2 - \frac{1}{2\gamma}\|x^{t+1} - x^*\|^2 - \frac{1}{2\gamma}\|x^{t+1} - x^t\|^2 - \frac{\mu}{2}\|w^t - x^*\|^2 + \frac{L}{2}\|x^{t+1} - w^t\|^2.$$
Using (8), we have $\frac{L}{2}\|x^{t+1} - w^t\|^2 \le L\|x^{t+1} - x^t\|^2 + L\|w^t - x^t\|^2$ and $\frac{\mu}{4}\|x^t - x^*\|^2 \le \frac{\mu}{2}\|w^t - x^*\|^2 + \frac{\mu}{2}\|w^t - x^t\|^2 \le \frac{\mu}{2}\|w^t - x^*\|^2 + L\|w^t - x^t\|^2$, where we used the fact that $\mu \le L$. Hence
$$f(x^{t+1}) \le f(x^*) + \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\|x^t - x^*\|^2 - \frac{1}{2\gamma}\|x^{t+1} - x^*\|^2 - \left(\frac{1}{2\gamma} - L\right)\|x^{t+1} - x^t\|^2 + 2L\|w^t - x^t\|^2 \le f(x^*) + \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\|x^t - x^*\|^2 - \frac{1}{2\gamma}\|x^{t+1} - x^*\|^2 + 2L\|w^t - x^t\|^2,$$
where the last inequality follows from the fact that $\gamma \le \frac{1}{2L}$. Denoting by $\mathbb{E}_{t+1}[\cdot]$ the expectation conditioned on iterations $\{0, \ldots, t\}$, we get
$$\mathbb{E}_{t+1}\left[f(x^{t+1})\right] \le f(x^*) + \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\|x^t - x^*\|^2 - \frac{1}{2\gamma}\mathbb{E}_{t+1}\|x^{t+1} - x^*\|^2 + 2L\|w^t - x^t\|^2. \quad (18)$$
It remains to bound $\mathbb{E}_{t+1}\|w^{t+1} - x^{t+1}\|^2$:
$$\mathbb{E}_{t+1}\|w^{t+1} - x^{t+1}\|^2 = \mathbb{E}_{t+1}\|w^t + \mathcal{C}^t(x^{t+1} - w^t) - x^{t+1}\|^2 \overset{(2)}{\le} (1 - \alpha)\|x^{t+1} - w^t\|^2 = (1 - \alpha)\|x^t - \gamma\nabla f(w^t) - w^t\|^2 \overset{(7),(10),(11)}{\le} \left(1 - \frac{\alpha}{2}\right)\|w^t - x^t\|^2 + \frac{2\gamma^2}{\alpha}\|\nabla f(w^t)\|^2 \overset{(8)}{\le} \left(1 - \frac{\alpha}{2}\right)\|w^t - x^t\|^2 + \frac{4\gamma^2}{\alpha}\|\nabla f(w^t) - \nabla f(x^t)\|^2 + \frac{4\gamma^2}{\alpha}\|\nabla f(x^t) - \nabla f(x^*)\|^2 \overset{(2.1),(\mathrm{B.1})}{\le} \left(1 - \frac{\alpha}{2} + \frac{4\gamma^2 L^2}{\alpha}\right)\|w^t - x^t\|^2 + \frac{8\gamma^2 L}{\alpha}\left(f(x^t) - f(x^*)\right) \le \left(1 - \frac{\alpha}{4}\right)\|w^t - x^t\|^2 + \frac{8\gamma^2 L}{\alpha}\left(f(x^t) - f(x^*)\right),$$
where in the last step we used $\gamma \le \frac{\alpha}{4L}$. Adding a $\frac{16L}{\alpha}$ multiple of the above inequality to (18), we obtain
$$\mathbb{E}_{t+1}\left[f(x^{t+1})\right] + \frac{16L}{\alpha}\mathbb{E}_{t+1}\|w^{t+1} - x^{t+1}\|^2 \le f(x^*) + \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\|x^t - x^*\|^2 - \frac{1}{2\gamma}\mathbb{E}_{t+1}\|x^{t+1} - x^*\|^2 + \frac{16L}{\alpha}\left(1 - \frac{\alpha}{8}\right)\|w^t - x^t\|^2 + \frac{128\gamma^2 L^2}{\alpha^2}\left(f(x^t) - f(x^*)\right).$$
Thus, taking full expectation on both sides and using $\gamma \le \frac{\alpha}{16L} \le \frac{\alpha}{4\mu}$ gives
$$\mathbb{E}\left[f(x^{t+1})\right] - f(x^*) + \frac{1}{2\gamma}\mathbb{E}\|x^{t+1} - x^*\|^2 + \frac{16L}{\alpha}\mathbb{E}\|w^{t+1} - x^{t+1}\|^2 \le \left(1 - \frac{\gamma\mu}{2}\right)\left(\mathbb{E}\left[f(x^t)\right] - f(x^*) + \frac{1}{2\gamma}\mathbb{E}\|x^t - x^*\|^2 + \frac{16L}{\alpha}\mathbb{E}\|w^t - x^t\|^2\right).$$
Applying this inequality recursively and using the assumption $w^0 = x^0$ proves the result.
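As a sanity check of Theorem D.1, the EF21-P recursion $x^{t+1} = x^t - \gamma\nabla f(w^t)$, $w^{t+1} = w^t + \mathcal{C}^t(x^{t+1} - w^t)$ can be run on a small strongly convex quadratic with a TopK primal compressor and the theoretical stepsize $\gamma = \alpha/(16L)$. The sketch below is ours (function names are illustrative, not from the paper's code):

```python
import numpy as np

def top_k(x, k):
    """Contractive TopK compressor: satisfies (2) with alpha = k/d."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def ef21p(grad, x0, gamma, k, T):
    """EF21-P: x^{t+1} = x^t - gamma*grad(w^t), w^{t+1} = w^t + C(x^{t+1} - w^t)."""
    x, w = x0.copy(), x0.copy()
    for _ in range(T):
        x = x - gamma * grad(w)   # gradient step at the model estimate w^t
        w = w + top_k(x - w, k)   # primal error-feedback update
    return x

# f(x) = (1/2) * sum_i i * x_i^2: L = d, mu = 1, minimizer x* = 0
d = 10
diag = np.arange(1.0, d + 1)
grad = lambda x: diag * x
alpha = 3 / d                     # TopK keeps k = 3 of d = 10 coordinates
x_T = ef21p(grad, np.ones(d), gamma=alpha / (16 * d), k=3, T=20000)
```

With the identity compressor (k = d, so α = 1) this reduces to vanilla gradient descent, matching the observation in Section 1.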

E CONVERGENCE OF EF21-P IN THE SMOOTH NONCONVEX REGIME

E.1 GENERAL CONVERGENCE THEORY

We now move on to study how the EF21-P method can be used in the nonconvex regime. The analysis relies on the expected smoothness assumption introduced by Khaled & Richtárik (2020). In their work, they study SGD methods performing iterations of the form $x^{t+1} = x^t - \gamma g^t$, where $g^t$ is an unbiased estimator of the true gradient $\nabla f(x^t)$. Following Khaled & Richtárik (2020), we shall assume that $\mathbb{E}[g(x)] = \nabla f(x)$. However, in our case, gradients are evaluated at perturbed points, resulting in biased stochastic gradient estimators. In particular, we consider the following general update rule, where the stochastic gradients are calculated at points evolving according to the EF21-P mechanism, rather than at the current iterate:
$$x^{t+1} = x^t - \gamma g(w^t), \quad w^{t+1} = w^t + \mathcal{C}^P(x^{t+1} - w^t). \quad (19)$$
Our result covers a wide range of sources of stochasticity that may be present in $g$. For a detailed discussion of the topic, we refer the reader to the original paper (Khaled & Richtárik, 2020). Throughout this section, we will rely on the following assumptions:

Assumption E.1. The stochastic gradient $g(x)$ is an unbiased estimator of the true gradient $\nabla f(x)$, i.e., $\mathbb{E}[g(x)] = \nabla f(x)$ for all $x \in \mathbb{R}^d$.

Assumption E.2 (From Khaled & Richtárik (2020)). There exist constants $A, B, C \ge 0$ such that $\mathbb{E}\|g(x)\|^2 \le 2A(f(x) - f^*) + B\|\nabla f(x)\|^2 + C$ for all $x \in \mathbb{R}^d$.

We are ready to state the main theorem:

Theorem E.3. Let Assumptions 2.1, 4.1, E.1 and E.2 hold and set $w^0 = x^0$. Fix $\varepsilon > 0$ and choose the stepsize $\gamma = \min\left\{\frac{\alpha}{8L}, \frac{1}{4BL}, \frac{1}{\sqrt{2ALT}}, \frac{\varepsilon}{16CL}\right\}$. Then
$$T \ge \frac{48\Delta^0 L}{\varepsilon}\max\left\{\frac{8}{\alpha}, 4B, \frac{96\Delta^0 A}{\varepsilon}, \frac{16C}{\varepsilon}\right\} \;\Rightarrow\; \min_{0 \le t \le T-1}\mathbb{E}\|\nabla f(x^t)\|^2 \le \varepsilon. \quad (20)$$
Note that by taking $A = C = 0$ and $B = 1$, one recovers the convergence of EF21-P with exact gradients in the nonconvex setting. Namely, under Assumptions 2.1 and 4.1, for $x^0 = w^0$ and $0 < \gamma \le \frac{\alpha}{8L}$, we have $\min_{0 \le t \le T-1}\mathbb{E}\|\nabla f(x^t)\|^2 \le \varepsilon$ as soon as $T \ge \frac{384\Delta^0 L}{\alpha\varepsilon}$.
We now apply the above result to the combination of the EF21-P model perturbation with DCGD (Khaled & Richtárik, 2020), yielding EF21-P + DCGD. Suppose that the iterates follow the update (19) (see also Algorithm 2), where
$$g(x) = \frac{1}{n}\sum_{i=1}^n \mathcal{C}_i(g_i(x)) \quad (21)$$
and each stochastic gradient $g_i(x)$ is an unbiased estimator of the true gradient $\nabla f_i(x)$, i.e., $\mathbb{E}[g_i(x)] = \nabla f_i(x)$.

Proposition E.4. Suppose that the gradient estimator $g(x)$ is constructed via (21) and that Assumption 2.2 holds. Let $\Delta^* := \frac{1}{n}\sum_{i=1}^n (f^* - f_i^*)$. Then:
1. For $g_i(x) = \nabla f_i(x)$, Assumption E.2 is satisfied with $A = \frac{\omega L_{\max}}{n}$, $B = 1$ and $C = 2A\Delta^*$.
2. In the same setting as in part 1, assuming additionally that Assumption 4.3 holds, Assumption E.2 is satisfied with $A = C = 0$ and $B = \frac{D\omega}{n} + 1$.
3. Assume that each stochastic gradient $g_i$ has bounded variance, i.e., $\mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 \le \sigma^2$. Then Assumption E.2 is satisfied with $A = \frac{\omega L_{\max}}{n}$, $B = 1$ and $C = 2A\Delta^* + \frac{\omega + 1}{n}\sigma^2$.
4. Suppose that $\mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 \le \sigma^2$ and $f_i = f$ for all $i \in [n]$. Then Assumption E.2 is satisfied with $A = 0$, $B = \frac{\omega}{n} + 1$ and $C = \frac{\omega + 1}{n}\sigma^2$.

In Section 4, we apply Proposition E.4 and state the corresponding theorems.
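The estimator (21) averages independently compressed worker gradients, and remains unbiased because each compressor is unbiased. A minimal sketch of this construction with RandK workers-to-server compressors (function names ours; here we take $g_i = \nabla f_i$ as in part 1 of Proposition E.4):

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased RandK sparsifier: keep k random coordinates, scale by d/k."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

def dcgd_estimator(grads, w, k, rng):
    """g(w) = (1/n) * sum_i C_i(grad_i(w)); unbiased since each RandK is unbiased."""
    return np.mean([rand_k(g(w), k, rng) for g in grads], axis=0)
```

In EF21-P + DCGD, this $g(w^t)$ is plugged into the update (19) in place of the exact gradient.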

E.2 PROOF OF THE CONVERGENCE RESULT

We will need the following two lemmas:

Lemma E.5. Consider sequences $(\delta_t)_t$, $(r_t)_t$ and $(s_t)_t$ such that $\delta_t, r_t, s_t \ge 0$ for all $t \ge 0$ and $s_0 = 0$. Suppose that
$$\delta_{t+1} + a s_{t+1} \le b\delta_t + a s_t - c r_t + d, \quad (22)$$
where $a, b, c, d$ are non-negative constants and $b \ge 1$. Then for any $T \ge 1$,
$$\min_{0 \le t \le T-1} r_t \le \frac{b^T}{cT}\delta_0 + \frac{d}{c}.$$
Proof. The proof follows similar steps to the proof of Lemma 2 of Khaled & Richtárik (2020), and we provide it for completeness. Let us fix $w_{-1} > 0$ and define $w_t = \frac{w_{t-1}}{b}$. Multiplying (22) by $w_t$ gives
$$w_t\delta_{t+1} + a w_t s_{t+1} \le b w_t \delta_t + a w_t s_t - c w_t r_t + d w_t \le w_{t-1}\delta_t + a w_{t-1}s_t - c w_t r_t + d w_t,$$
where we used $b w_t = w_{t-1}$ and $w_t \le w_{t-1}$ (since $b \ge 1$). Summing both sides of the inequality for $t = 0, \ldots, T-1$ and telescoping, we obtain
$$w_{T-1}\delta_T + a w_{T-1} s_T \le w_{-1}\delta_0 + a w_{-1}s_0 - c\sum_{t=0}^{T-1} w_t r_t + d\sum_{t=0}^{T-1} w_t.$$
Rearranging and using the assumption $s_0 = 0$ and the non-negativity of $s_t$ gives
$$c\sum_{t=0}^{T-1} w_t r_t + w_{T-1}\delta_T \le w_{-1}\delta_0 + d\sum_{t=0}^{T-1} w_t.$$
Next, using the non-negativity of $\delta_t$ and $w_t$, we have
$$c\sum_{t=0}^{T-1} w_t r_t \le c\sum_{t=0}^{T-1} w_t r_t + w_{T-1}\delta_T \le w_{-1}\delta_0 + d\sum_{t=0}^{T-1} w_t.$$

Letting $W_T := \sum_{t=0}^{T-1} w_t$ and dividing both sides of the inequality by $W_T$, we obtain
$$c\min_{0 \le t \le T-1} r_t \le \frac{c}{W_T}\sum_{t=0}^{T-1} w_t r_t \le \frac{w_{-1}}{W_T}\delta_0 + d.$$
Using the fact that $W_T = \sum_{t=0}^{T-1} w_t \ge T\min_{0 \le t \le T-1} w_t = T w_{T-1} = \frac{T w_{-1}}{b^T}$, we can finish the proof.

Lemma E.6. Let Assumptions 2.1, 4.1, E.1 and E.2 hold, set $w^0 = x^0$, and choose $\gamma \le \min\left\{\frac{1}{4A}, \frac{1}{4BL}, \frac{\alpha}{8L}\right\}$. Then
$$\min_{0 \le t \le T-1}\mathbb{E}\|\nabla f(x^t)\|^2 \le \frac{8\left(1 + 2AL\gamma^2\right)^T}{\gamma T}\Delta^0 + 8CL\gamma. \quad (23)$$
Proof. First, $L$-smoothness of $f$ implies that
$$f(w^t) \le f(x^t) + \langle \nabla f(x^t), w^t - x^t \rangle + \frac{L}{2}\|w^t - x^t\|^2 \overset{(9)}{\le} f(x^t) + \frac{1}{2L}\|\nabla f(x^t)\|^2 + L\|w^t - x^t\|^2 \quad (24)$$
and
$$f(x^{t+1}) \le f(x^t) + \langle \nabla f(x^t), x^{t+1} - x^t \rangle + \frac{L}{2}\|x^{t+1} - x^t\|^2 = f(x^t) - \gamma\langle \nabla f(x^t), g(w^t) \rangle + \frac{L\gamma^2}{2}\|g(w^t)\|^2.$$
Using the fact that $g(x)$ is an unbiased estimator of the true gradient, subtracting $f^*$ from both sides of the latter inequality and taking the expectation conditioned on iterations $\{0, \ldots, t\}$, we obtain
$$\mathbb{E}_{t+1}\left[f(x^{t+1})\right] - f^* \le f(x^t) - f^* - \gamma\langle \nabla f(x^t), \nabla f(w^t) \rangle + \frac{L\gamma^2}{2}\mathbb{E}_{t+1}\|g(w^t)\|^2 \overset{(\mathrm{E.2}),(12)}{\le} f(x^t) - f^* - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \frac{\gamma}{2}\|\nabla f(w^t)\|^2 + \frac{\gamma}{2}\|\nabla f(x^t) - \nabla f(w^t)\|^2 + \frac{L\gamma^2}{2}\left(2A(f(w^t) - f^*) + B\|\nabla f(w^t)\|^2 + C\right) \overset{(2.1)}{\le} f(x^t) - f^* - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \frac{\gamma}{2}(1 - BL\gamma)\|\nabla f(w^t)\|^2 + \frac{\gamma L^2}{2}\|x^t - w^t\|^2 + AL\gamma^2\left(f(w^t) - f^*\right) + \frac{CL\gamma^2}{2} \overset{(24)}{\le} (1 + AL\gamma^2)\left(f(x^t) - f^*\right) - \frac{\gamma}{2}(1 - A\gamma)\|\nabla f(x^t)\|^2 - \frac{\gamma}{2}(1 - BL\gamma)\|\nabla f(w^t)\|^2 + L^2\gamma\left(\frac{1}{2} + A\gamma\right)\|w^t - x^t\|^2 + \frac{CL\gamma^2}{2}.$$
Hence, taking full expectation, for $\gamma \le \frac{1}{4A}$ we have
$$\mathbb{E}\left[f(x^{t+1})\right] - f^* \le (1 + AL\gamma^2)\mathbb{E}\left[f(x^t) - f^*\right] - \frac{\gamma}{4}\mathbb{E}\|\nabla f(x^t)\|^2 - \frac{\gamma}{2}(1 - BL\gamma)\mathbb{E}\|\nabla f(w^t)\|^2 + L^2\gamma\,\mathbb{E}\|w^t - x^t\|^2 + \frac{CL\gamma^2}{2}. \quad (25)$$
Next, variance decomposition and Assumption E.2 gives E g(w t ) -∇f (w t ) 2 (14) = E g(w t ) 2 -∇f (w t ) 2 (E.2) ≤ 2A(f (w t ) -f * ) + (B -1) ∇f (w t ) 2 + C (24) ≤ 2A f (x t ) + 1 2L ∇f (x t ) 2 + L w t -x t 2 -f * +(B -1) ∇f (w t ) 2 + C = 2A f (x t ) -f * + A L ∇f (x t ) 2 + 2AL w t -x t 2 +(B -1) ∇f (w t ) 2 + C. Therefore, using the unbiasedness of g(x), we can bound the expected distance between w t+1 and x t+1 as 10),( 11) E w t+1 -x t+1 2 = E w t + C p (x t+1 -w t ) -x t+1 2 (2) ≤ (1 -α)E x t+1 -w t 2 = (1 -α) E x t -γg t -w t 2 (14) = (1 -α) γ 2 E g t -∇f (w t ) 2 + (1 -α) E x t -γ∇f (w t ) -w t 2 (7),( ≤ (1 -α) γ 2 E g t -∇f (w t ) 2 + 1 - α 2 E x t -w t 2 + 2γ 2 α E ∇f (w t ) 2 (26) ≤ 2A (1 -α) γ 2 f (x t ) -f * + A (1 -α) γ 2 L ∇f (x t ) 2 +2AL (1 -α) γ 2 w t -x t 2 + (B -1) (1 -α) γ 2 ∇f (w t ) 2 +C (1 -α) γ 2 + 1 - α 2 E x t -w t 2 + 2γ 2 α E ∇f (w t ) 2 . Hence, taking expectation, for γ ≤ α 8AL(1-α) E w t+1 -x t+1 2 ≤ 2A (1 -α) γ 2 E f (x t ) -f * + A (1 -α) γ 2 L E ∇f (x t ) 2 + γ 2 2 α + (B -1) (1 -α) E ∇f (w t ) 2 + 1 - α 2 + 2AL (1 -α) γ 2 E x t -w t 2 + C (1 -α) γ 2 ≤ 2A (1 -α) γ 2 E f (x t ) -f * + A (1 -α) γ 2 L E ∇f (x t ) 2 + γ 2 2 α + (B -1) (1 -α) E ∇f (w t ) 2 + 1 - α 4 E x t -w t 2 + C (1 -α) γ 2 . ( ) Adding a 4L 2 γ α multiple of ( 27) to ( 25), we obtain E f (x t+1 ) -f * + 4L 2 γ α E w t+1 -x t+1 2 ≤ 1 + ALγ 2 + 8AL 2 (1 -α)γ 3 α E f (x t ) -f * - γ 4 1 - 16AL (1 -α) γ 2 α E ∇f (x t ) 2 - γ 2 1 -BLγ - 8L 2 γ 2 α 2 α + (B -1) (1 -α) E ∇f (w t ) 2 + 4L 2 γ α E w t -x t 2 + CLγ 2 2 + 4CL 2 (1 -α)γ 3 α . Then, provided that γ ≤ min 1 4BL , α 8L , α 32(B -1)(α -1)L 2 , α 32AL(1 -α) , = min 1 4BL , α 8L , α 32AL(1 -α) , , (where we used min{a, b} ≤ √ ab for all a, b ∈ R + ), this gives E f (x t+1 ) -f * + 4L 2 γ α E w t+1 -x t+1 2 ≤ 1 + 2ALγ 2 E f (x t ) -f * - γ 8 E ∇f (x t ) 2 + 4L 2 γ α E w t -x t 2 + CLγ 2 . 
Denoting a := 4L 2 γ α , b := 1 + 2ALγ 2 , c := γ 8 and d := CLγ 2 , this is equivalent to δ t+1 +as t+1 ≤ bδ t + as t -cr t + d, where δ t := E [f (x t ) -f * ], r t := E ∇f (x t ) 2 and s t := E w t -x t 2 . Hence, using Lemma E.5, for any T ≥ 1 min 0≤t≤T -1 r t ≤ b T cT δ 0 + d c , which proves (23). In the proof, we have the following constraints on γ: γ ≤ min 1 4A , 1 4BL , α 8L , α 32AL(1 -α) . Using the inequality min{a, b} ≤ √ ab for all a, b ∈ R + , this can be simplified to γ ≤ min 1 4A , 1 4BL , α . Theorem E.3. Let Assumptions 2.1, 4.1, E.1 and E.2 hold and set w 0 = x 0 . Fix ε > 0 and choose the stepsize γ = min α 8L , 1 4BL , 1 √ 2ALT , ε . Then T ≥ 48∆ 0 L ε max 8 α , 4B, 96∆ 0 A ε , 16C ε ⇒ min 0≤t≤T -1 E ∇f (x t ) 2 ≤ ε. ( ) Proof. By Lemma E.6, we have min 0≤t≤T -1 E ∇f (x t ) 2 ≤ 8 1 + 2ALγ 2 T γT ∆ 0 + 8CLγ provided that γ ≤ min 1 4A , 1 4BL , α 8L . Now, using the fact that 1 + x ≤ e x and the assumption γ ≤ 1 √ 2ALT , we obtain 1 + 2ALγ 2 T ≤ exp 2ALT γ 2 ≤ exp(1) < 3. Hence min 0≤t≤T -1 E ∇f (x t ) 2 ≤ 24 γT ∆ 0 + 8CLγ. In order to obtain 24 γT ∆ 0 + 8CLγ ≤ ε, we require that both terms are no larger than ε 2 , which is equivalent to T ≥ 48∆ 0 γε , ( ) γ ≤ ε 16CL . ( ) We thus require that: γ ≤ min 1 4A , 1 4BL , α 8L , 1 √ 2ALT , ε . which, combined with (29) gives: T ≥ 48∆ 0 ε max 4A, 4BL, 8L α , 96∆ 0 AL ε , 16CL ε . It remains to notice that the term 4A can be dropped, thus simplifying the constraints to γ ≤ min 1 4BL , α 8L , 1 √ 2ALT , ε . and T ≥ 48∆ 0 ε max 4BL, 8L α , 96∆ 0 AL ε , 16CL ε . Indeed, if ∇f (x 0 ) 2 ≤ ε, then (20) holds for any γ > 0. Let us now assume that ∇f (x 0 ) 2 > ε. The above constraints imply that 1 √ 2ALT ≤ ε 96∆0AL . Moreover, from Lemma B.2, we know that ε < ∇f (x 0 ) 2 ≤ 2L∆ 0 . Thus 1 √ 2ALT ≤ 1 48A . Similarly, we see that 96∆0AL ε ≥ 48A.
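Lemma E.5 can be sanity-checked numerically: generate a sequence satisfying the recursion (22) and compare $\min_t r_t$ with the bound $\frac{b^T}{cT}\delta_0 + \frac{d}{c}$. The sketch below (all names ours) takes $s_t \equiv 0$ and, at each step, the largest $r_t$ keeping $\delta_{t+1} \ge 0$, so the recursion holds with equality:

```python
def lemma_e5_bound(delta0, a, b, c, d, T):
    """Upper bound on min_t r_t from Lemma E.5: (b**T / (c*T)) * delta0 + d/c."""
    return (b ** T / (c * T)) * delta0 + d / c

def simulate(delta0, a, b, c, d, T):
    """Sequence satisfying delta_{t+1} + a*s_{t+1} = b*delta_t + a*s_t - c*r_t + d
    with s_t = 0 throughout and the largest feasible r_t (so delta_{t+1} = 0)."""
    delta, rs = delta0, []
    for _ in range(T):
        r = (b * delta + d) / c   # largest r_t keeping delta_{t+1} = 0 >= 0
        rs.append(r)
        delta = 0.0
    return min(rs)
```

In this extreme case $\min_t r_t = d/c$ for $T \ge 2$, which sits just below the lemma's bound; the $b^T/(cT)$ term absorbs the contribution of $\delta_0$.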

E.3 PROOF OF PROPOSITION E.4

Proof. 1. Using independence of C 1 , . . . , C n , we have E g(x) 2 = E   1 n n i=1 C i (∇f i (x)) 2   (14) = E   1 n n i=1 (C i (∇f i (x)) -∇f i (x)) 2   + ∇f (x) 2 = 1 n 2 n i=1 E C i (∇f i (x)) -∇f i (x) 2 + ∇f (x) 2 (5) ≤ 1 n 2 n i=1 ω ∇f i (x) 2 + ∇f (x) 2 (B.2) ≤ ω n 2 n i=1 2L i (f i (x) -f * i ) + ∇f (x) ≤ 2ωL max n 2 n i=1 (f i (x) -f * i ) + ∇f (x) 2 = 2A(f (x) -f * ) + ∇f (x) 2 + 2A∆ * , where A := ωLmax n . 2. Starting as in part 1 of the proof, we obtain E g(x) 2 ≤ 1 n 2 n i=1 ω ∇f i (x) 2 + ∇f (x) 2 (4.3) ≤ Dω n + 1 ∇f (x) 2 . 3. First let us note that E g i (x) 2 (14) = E g i (x) -∇f i (x) 2 + ∇f i (x) 2 ≤ σ 2 + ∇f i (x) 2 . Following steps similar to the proof of Proposition 4 of Khaled & Richtárik (2020) , unbiasedness of the stochastic gradients gives E g(x) 2 (13) = E   E   1 n n i=1 C i (g i (x)) 2 | g 1 (x), . . . , g n (x)     (14) = E   E   1 n n i=1 (C i (g i (x)) -g i (x)) 2 | g 1 (x), . . . , g n (x)   + 1 n n i=1 g i (x) 2   (14) = E 1 n 2 n i=1 E C i (g i (x)) -g i (x) 2 | g 1 (x), . . . , g n (x) +E   1 n n i=1 (g i (x) -∇f i (x)) 2   + ∇f (x) 2 ≤ ω n 2 n i=1 E g i (x) 2 + E   1 n n i=1 (g i (x) -∇f i (x)) 2   + ∇f (x) 2 ≤ ω n 2 n i=1 ∇f i (x) 2 + σ 2 + 1 n 2 n i=1 E g i (x) -∇f i (x) 2 + ∇f (x) 2 (B.2) ≤ ω n 2 n i=1 2L i (f i (x) -f * i ) + σ 2 + σ 2 n + ∇f (x) 2 = 2A (f (x) -f * ) + ∇f (x) 2 + C, where A := 1 n ωL max and C := 2A∆ * + ω+1 n σ 2 . 4. Starting as in part 3 and using the assumption f i = f , we have: E g(x) 2 ≤ ω n 2 n i=1 E g i (x) 2 + E   1 n n i=1 g i (x) -∇f (x) 2   + ∇f (x) 2 (14) = ω n 2 n i=1 E g i (x) -∇f (x) 2 + ∇f (x) 2 + 1 n 2 n i=1 E g i (x) -∇f (x) 2 + ∇f (x) 2 ≤ ω + 1 n σ 2 + ω n + 1 ∇f (x) 2 .

F PROOFS FOR EF21-P + DIANA IN THE CONVEX CASE

First, we prove an auxiliary theorem: Theorem F.1. Let us assume that Assumptions 2.1, 2.2 and 2.3 hold, β ∈ 0, 1 ω+1 , and γ ≤ min n 160ωL max , √ nα 20 √ ω L , α 100L , β µ . ( ) Then Algorithm 1 guarantees that 1 2γ E x t+1 -x * 2 + E f (x t+1 ) -f (x * ) + κ 1 n n i=1 E h t+1 i -∇f i (x * ) 2 + νE w t+1 -x t+1 2 ≤ 1 2γ 1 - γµ 2 E x t -x * 2 + 1 2 E f (x t ) -f (x * ) + κ 1 - γµ 2 1 n n i=1 E h t i -∇f i (x * ) 2 + ν 1 - γµ 2 E w t -x t 2 , ( ) where κ ≤ 8γω nβ and ν ≤ 192γω L 2 nα + 32L α . Proof. From L-smoothness (Assumption 2.1) of the function f , we have f (x t+1 ) ≤ f (w t ) + ∇f (w t ), x t+1 -w t + L 2 x t+1 -w t 2 conv-ty ≤ f (x * ) + ∇f (w t ), x t+1 -x * - µ 2 w t -x * 2 + L 2 x t+1 -w t 2 = f (x * ) + g t , x t+1 -x * + ∇f (w t ) -g t , x t+1 -x * + L 2 x t+1 -w t 2 - µ 2 w t -x * 2 . We now reprove a well-known equality from the convex world. Noting that x t+1 = x t -γg t , we obtain x t -x * 2 -x t+1 -x * 2 -x t+1 -x t 2 = x t -x t+1 , x t -2x * + x t+1 -x t+1 -x t , x t+1 -x t = 2 x t -x t+1 , x t+1 -x * = 2γ g t , x t+1 -x * . ( ) Substituting ( 33) in the inequality gives f (x t+1 ) ≤ f (x * ) + ∇f (w t ) -g t , x t+1 -x * + 1 2γ x t -x * 2 - 1 2γ x t+1 -x * 2 - 1 2γ x t+1 -x t 2 + L 2 x t+1 -w t 2 - µ 2 w t -x * 2 . Next, by (8), we have L 2 x t+1 -w t 2 ≤ L x t+1 -x t 2 + L w t -x t 2 and µ 4 x t -x * 2 ≤ µ 2 w t -x * 2 + µ 2 w t -x t 2 ≤ µ 2 w t -x * 2 + L w t -x t 2 , where we used L ≥ µ. Thus f (x t+1 ) ≤ f (x * ) + ∇f (w t ) -g t , x t+1 -x * + 1 2γ x t -x * 2 - 1 2γ x t+1 -x * 2 - 1 2γ x t+1 -x t 2 + L x t+1 -x t 2 + L w t -x t 2 - µ 4 x t -x * 2 + L w t -x t 2 = f (x * ) + ∇f (w t ) -g t , x t+1 -x * + 1 2γ 1 - γµ 2 x t -x * 2 - 1 2γ x t+1 -x * 2 - 1 2γ -L x t+1 -x t 2 + 2L w t -x t 2 ≤ f (x * ) + ∇f (w t ) -g t , x t+1 -x * + 1 2γ 1 - γµ 2 x t -x * 2 - 1 2γ x t+1 -x * 2 + 2L w t -x t 2 , where we used the fact that γ ≤ 1 2L . Then, taking expectation conditioned on previous iterations {0, . . . 
, t}, we obtain E t+1 f (x t+1 ) ≤ f (x * ) + E t+1 ∇f (w t ) -g t , x t+1 -x * + 1 2γ 1 - γµ 2 x t -x * 2 - 1 2γ E t+1 x t+1 -x * 2 + 2L w t -x t 2 . From the unbiasedness of the compressors C D i , we have E t+1 g t = ∇f (w t ) and E t+1 ∇f (w t ) -g t , x t+1 -x * = E t+1 ∇f (w t ) -g t , x t -γg t -x * = -γE t+1 ∇f (w t ) -g t , g t = γE t+1 g t 2 -γ ∇f (w t ) 2 (14) = γE t+1 g t -∇f (w t ) 2 . Therefore E t+1 f (x t+1 ) ≤ f (x * ) + γE t+1 g t -∇f (w t ) 2 + 1 2γ 1 - γµ 2 x t -x * 2 - 1 2γ E t+1 x t+1 -x * 2 + 2L w t -x t 2 . ( ) Now, we separately consider E t+1 g t -∇f (w t ) 2 . From the independence of compressors, we have E t+1 g t -∇f (w t ) 2 = E t+1   h t + 1 n n i=1 C D i (∇f i (w t ) -h t i ) -∇f (w t ) 2   = 1 n 2 n i=1 E t+1 C D i (∇f i (w t ) -h t i ) -∇f i (w t ) -h t i 2 ≤ ω n 2 n i=1 ∇f i (w t ) -h t i 2 ≤ 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 2ω n 2 n i=1 ∇f i (w t ) -∇f i (x * ) ≤ 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 4ω n 2 n i=1 ∇f i (w t ) -∇f i (x t ) 2 + 4ω n 2 n i=1 ∇f i (x t ) -∇f i (x * ) 2 , where in the last three inequalities, we used ( 5) and ( 8). Next, using Assumption 2.2 and Lemma B.1, we obtain E t+1 g t -∇f (w t ) 2 ≤ 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 4ω L 2 n w t -x t 2 + 8ωL max n f (x t ) -f (x * ) . 
To construct a Lyapunov function, it remains to bound 1 n n i=1 h t+1 i -∇f i (x * ) 2 and w t+1 -z t+1 2 : 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 = 1 n n i=1 E t+1 h t i + βC D i (∇f i (w t ) -h t i ) -∇f i (x * ) 2 = 1 n n i=1 h t i -∇f i (x * ) 2 + 2β n n i=1 h t i -∇f i (x * ), E t+1 C D i (∇f i (w t ) -h t i ) + β 2 n n i=1 E t+1 C D i (∇f i (w t ) -h t i ) 2 (5) ≤ 1 n n i=1 h t i -∇f i (x * ) 2 + 2β n n i=1 h t i -∇f i (x * ), ∇f i (w t ) -h t i + β 2 (ω + 1) n n i=1 ∇f i (w t ) -h t i 2 (12) = (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + β n n i=1 ∇f i (w t ) -∇f i (x * ) 2 + β (β(ω + 1) -1) n n i=1 ∇f i (w t ) -h t i 2 ≤ (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + β n n i=1 ∇f i (w t ) -∇f i (x * ) 2 , where we use that β ∈ 0, 1 ω+1 . Thus, using (8), Assumption 2.2 and Lemma B.1, we have 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 ≤ (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + 2β L 2 w t -x t 2 + 4βL max f (x t ) -f (x * ) . (36) It remains to bound E t+1 w t+1 -x t+1 2 : E t+1 w t+1 -x t+1 2 = E t+1 w t + C p (x t+1 -w t ) -x t+1 2 (2) ≤ (1 -α)E t+1 x t+1 -w t 2 = (1 -α)E t+1 x t -γg t -w t 2 (14) = (1 -α)γ 2 E t+1 g t -∇f (w t ) 2 + (1 -α) x t -γ∇f (w t ) -w t 2 (7) ≤ γ 2 E t+1 g t -∇f (w t ) 2 + 1 - α 2 w t -x t 2 + 2γ 2 α ∇f (w t ) 2 (8) ≤ γ 2 E t+1 g t -∇f (w t ) 2 + 1 - α 2 w t -x t 2 + 4γ 2 α ∇f (w t ) -∇f (x t ) 2 + 4γ 2 α ∇f (x t ) -∇f (x * ) 2 . Using Assumption 2.1 and Lemma B.1, we obtain E t+1 w t+1 -x t+1 2 ≤ γ 2 E t+1 g t -∇f (w t ) 2 + 1 - α 2 + 4γ 2 L 2 α w t -x t 2 + 8γ 2 L α f (x t ) -f (x * ) (35) ≤ γ 2 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 4ω L 2 n w t -x t 2 + 8ωL max n f (x t ) -f (x * ) + 1 - α 2 + 4γ 2 L 2 α w t -x t 2 + 8γ 2 L α f (x t ) -f (x * ) = 1 - α 2 + 4γ 2 L 2 α + 4γ 2 ω L 2 n w t -x t 2 + 2γ 2 ω n 2 n i=1 h t i -∇f i (x * ) 2 + 8γ 2 ωL max n + 8γ 2 L α f (x t ) -f (x * ) ≤ 1 - α 4 w t -x t 2 + 2γ 2 ω n 2 n i=1 h t i -∇f i (x * ) 2 + 8γ 2 ωL max n + 8γ 2 L α f (x t ) -f (x * ) , where we assume that γ ≤ α √ 32L and γ ≤ √ αn √ 32ω L . 
Let us fix some constants κ ≥ 0 and ν ≥ 0. We now combine the above inequality with (34), ( 35) and ( 36) to obtain E t+1 f (x t+1 ) + κ 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 + νE t+1 w t+1 -x t+1 2 ≤ f (x * ) + γ 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 4ω L 2 n w t -x t 2 + 8ωL max n f (x t ) -f (x * ) + 1 2γ 1 - γµ 2 x t -x * 2 - 1 2γ E t+1 x t+1 -x * 2 + 2L w t -x t 2 + κ (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + 2β L 2 w t -x t 2 + 4βL max f (x t ) -f (x * ) + ν 1 - α 4 w t -x t 2 + 2γ 2 ω n 2 n i=1 h t i -∇f i (x * ) 2 + 8γ 2 ωL max n + 8γ 2 L α f (x t ) -f (x * ) . Rearranging the last inequality, one can get 1 2γ E t+1 x t+1 -x * 2 + E t+1 f (x t+1 ) -f (x * ) + κ 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 + νE t+1 w t+1 -x t+1 2 ≤ 1 2γ 1 - γµ 2 x t -x * 2 + 8γωL max n + κ4βL max + ν 8γ 2 ωL max n + 8γ 2 L α f (x t ) -f (x * ) + 2γω n + ν 2γ 2 ω n + κ (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + 4γω L 2 n + 2L + κ2β L 2 + ν 1 - α 4 w t -x t 2 . ( ) Our final goal is to find κ and ν such that 2γω n + ν 2γ 2 ω n + κ (1 -β) = κ 1 - β 2 and 4γω L 2 n + 2L + κ2β L 2 + ν 1 - α 4 ≤ ν 1 - α 8 . The last inequality is equivalent to 32γω L 2 nα + 16L α + κ 16β L 2 α ≤ ν. From the first equality we get κ = 4γω nβ + ν 4γ 2 ω nβ . Thus 32γω L 2 nα + 16L α + κ 16β L 2 α = 32γω L 2 nα + 16L α + 4γω nβ + ν 4γ 2 ω nβ 16β L 2 α = 96γω L 2 nα + 16L α + ν 64γ 2 ω L 2 nα ≤ 96γω L 2 nα + 16L α + ν 1 2 , where we used that γ ≤ √ nα √ 128ω L . It means that we can take ν = 192γω L 2 nα + 32L α to ensure that (38) holds. Thus κ = 4γω nβ + 192γω L 2 nα + 32L α 4γ 2 ω nβ = 4γω nβ + 768γ 3 ω 2 L 2 n 2 αβ + 128γ 2 ωL nβα . 
Let us now substitute these values of κ and ν in inequality (37): 1 2γ E t+1 x t+1 -x * 2 + E t+1 f (x t+1 ) -f (x * ) + κ 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 + νE t+1 w t+1 -x t+1 2 ≤ 1 2γ 1 - γµ 2 x t -x * 2 + κ 1 - β 2 1 n n i=1 h t i -∇f i (x * ) 2 + ν 1 - α 8 w t -x t 2 + 8γωL max n + 4γω nβ + 768γ 3 ω 2 L 2 n 2 αβ + 128γ 2 ωL nβα 4βL max + 192γω L 2 nα + 32L α 8γ 2 ωL max n + 8γ 2 L α f (x t ) -f (x * ) = 1 2γ 1 - γµ 2 x t -x * 2 + κ 1 - β 2 1 n n i=1 h t i -∇f i (x * ) 2 + ν 1 - α 8 w t -x t 2 + 24γωL max n + 4608γ 3 ω 2 L 2 L max n 2 α + 768γ 2 ωLL max nα + 1536γ 3 ωL L 2 nα 2 + 256γ 2 L 2 α 2 f (x t ) -f (x * ) . Using the assumptions on γ, we have 24γωL max n ≤ 1 10 , 4608γ 3 ω 2 L 2 L max n 2 α ≤ 20γ 2 ω L 2 nα ≤ 1 10 , 768γ 2 ωLL max nα ≤ 4γL α ≤ 1 10 , 1536γ 3 ωL L 2 nα 2 ≤ 40γ 2 ω L 2 nα ≤ 1 10 , 256γ 2 L 2 α 2 ≤ 1 10 . Finally, considering γ ≤ β µ and γ ≤ α 4µ gives 1 2γ E t+1 x t+1 -x * 2 + E t+1 f (x t+1 ) -f (x * ) + κ 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 + νE t+1 w t+1 -x t+1 2 ≤ 1 2γ 1 - γµ 2 x t -x * 2 + κ 1 - γµ 2 1 n n i=1 h t i -∇f i (x * ) 2 + ν 1 - γµ 2 w t -x t 2 + 1 2 f (x t ) -f (x * ) . Note that κ = 4γω nβ + 768γ 3 ω 2 L 2 n 2 αβ + 128γ 2 ωL nβα ≤ 8γω nβ . We now prove a theorem for the general convex case: Theorem F.2. Let us assume that Assumptions 2.1, 2.2 and 2.3 hold, the strong convexity parameter satisfies µ = 0, β = 1 ω+1 , x 0 = w 0 and γ ≤ min n 160ωL max , √ nα 20 √ ω L , α . Then Algorithm 1 guarantees a convergence rate f 1 T T t=1 x t -f (x * ) ≤ 1 γT x 0 -x * 2 + f (x 0 ) -∇f (x * ) T + 16γω(ω + 1) T n 2 n i=1 h 0 i -∇f i (x * ) 2 . ( ) Proof. Let us bound (32): 1 2γ E x t+1 -x * 2 + E f (x t+1 ) -f (x * ) + κ 1 n n i=1 E h t+1 i -∇f i (x * ) 2 + νE w t+1 -x t+1 2 ≤ 1 2γ 1 - γµ 2 E x t -x * 2 + 1 2 E f (x t ) -f (x * ) + κ 1 - γµ 2 1 n n i=1 E h t i -∇f i (x * ) 2 + ν 1 - γµ 2 E w t -x t 2 ≤ 1 2γ E x t -x * 2 + 1 2 E f (x t ) -f (x * ) + κ 1 n n i=1 E h t i -∇f i (x * ) 2 + νE w t -x t 2 . 
We now sum the inequality for t ∈ {0, . . . , T -1} and obtain 1 2γ E x T -x * 2 + 1 2 E f (x T ) -f (x * ) + 1 2 T t=1 E f (x t ) -f (x * ) + κ 1 n n i=1 E h T i -∇f i (x * ) 2 + νE w T -x T 2 ≤ 1 2γ x 0 -x * 2 + 1 2 f (x 0 ) -f (x * ) + κ 1 n n i=1 h 0 i -∇f i (x * ) 2 + ν w 0 -x 0 2 ≤ 1 2γ x 0 -x * 2 + 1 2 f (x 0 ) -f (x * ) + 8γω n 2 β n i=1 h 0 i -∇f i (x * ) 2 , where we used the assumption x 0 = w 0 and the bound on κ. Using nonnegativity of the terms and convexity, we then have f 1 T T t=1 x t -f (x * ) ≤ 1 γT x 0 -x * 2 + f (x 0 ) -∇f (x * ) T + 16γω T n 2 β n i=1 h 0 i -∇f i (x * ) 2 . We now prove a theorem for the strongly convex case: Theorem 3.1. Suppose that Assumptions 2.1, 2.2 and 2.3 hold, β = 1 ω+1 , set x 0 = w 0 and let γ ≤ min n 160ωLmax , √ nα 20 √ ω L , α 100L , 1 (ω+1)µ . Then Algorithm 1 returns x T such that 1 2γ E x T -x * 2 + E f (x T ) -f (x * ) ≤ 1 -γµ 2 T V 0 , where V 0 := 1 2γ E x 0 -x * 2 + f (x 0 ) -f (x * ) + 8γω(ω+1) n 2 n i=1 h 0 i -∇f i (x * ) 2 . Proof. Using γ ≤ α 100L ≤ 1 µ , let us bound (32): 1 2γ E x t+1 -x * 2 + E f (x t+1 ) -f (x * ) + κ 1 n n i=1 E h t+1 i -∇f i (x * ) 2 + νE w t+1 -x t+1 2 ≤ 1 2γ 1 - γµ 2 E x t -x * 2 + 1 2 E f (x t ) -f (x * ) + κ 1 - γµ 2 1 n n i=1 E h t i -∇f i (x * ) 2 + ν 1 - γµ 2 E w t -x t 2 ≤ 1 2γ 1 - γµ 2 E x t -x * 2 + 1 - γµ 2 E f (x t ) -f (x * ) + κ 1 - γµ 2 1 n n i=1 E h t i -∇f i (x * ) 2 + ν 1 - γµ 2 E w t -x t 2 = 1 - γµ 2 1 2γ E x t -x * 2 + E f (x t ) -f (x * ) + κ 1 n n i=1 E h t i -∇f i (x * ) 2 + νE w t -x t 2 . Recursively applying the last inequality and using x 0 = w 0 , one can get that 1 2γ E x T -x * 2 + E f (x T ) -f (x * ) + κ 1 n n i=1 E h T i -∇f i (x * ) 2 + νE w T -x T 2 ≤ 1 - γµ 2 T 1 2γ E x 0 -x * 2 + f (x 0 ) -f (x * ) + κ 1 n n i=1 h 0 i -∇f i (x * ) 2 . Using the nonnegativity of the terms and the bound on κ, we obtain 1 2γ E x T -x * 2 + E f (x T ) -f (x * ) ≤ 1 - γµ 2 T 1 2γ E x 0 -x * 2 + f (x 0 ) -f (x * ) + 8γω n 2 β n i=1 h 0 i -∇f i (x * ) 2 .

F.1 COMMUNICATION COMPLEXITIES IN THE GENERAL CONVEX CASE

We now derive the communication complexities for the general convex case. From Theorem F.2, we know that EF21-P + DIANA has the following convergence rate: f 1 T T t=1 x t -f (x * ) ≤ 1 γT x 0 -x * 2 + f (x 0 ) -∇f (x * ) T + 16γω(ω + 1) T n 2 n i=1 h 0 i -∇f i (x * ) 2 . Let us take h 0 i = ∇f i (x 0 ) for all i ∈ [n]. Using Assumptions 2.1 and 2.2, we have f 1 T T t=1 x t -f (x * ) ≤ 1 γT x 0 -x * 2 + L x 0 -x * 2 2T + 16γω(ω + 1) T n 2 n i=1 ∇f i (x 0 ) -∇f i (x * ) 2 ≤ 1 γT x 0 -x * 2 + L x 0 -x * 2 2T + 16γω(ω + 1) L 2 x 0 -x * 2 T n . Using the bound on γ, we obtain that EF21-P + DIANA returns an ε-solution after O ωL max nε + √ ω L √ nαε + L αε + L ε + γω(ω + 1) L 2 nε steps. For simplicity, we assume that the server and the workers use TopK and RandK compressors, respectively. Thus the server-to-workers and the workers-to-server communication complexities equal O K × ωL max nε + √ ω L √ nαε + L αε + L ε + γω(ω + 1) L 2 nε = O dL max nε + d L √ nε + dL ε + KL ε + dγω L 2 nε . Note that γ ≤ √ nα 20 √ ω L = √ n 20 √ ω(ω+1) L . Thus O K × ωL max nε + √ ω L √ nαε + L αε + L ε + γω(ω + 1) L 2 nε = O dL max nε + d L √ nε + dL ε + KL ε + d L √ nε = O dL max nε + d L √ nε + dL ε . Since L max ≤ nL and L ≤ √ nL, this complexity is not worse than the Then Algorithm 1 guarantees that 1 2γ E x t+1 -x * 2 + E f (x t+1 ) -f (x * ) + κ 1 n n i=1 E h t+1 i -∇f i (x * ) 2 + νE w t+1 -x t+1 2 ≤ 1 2γ 1 - γµ 2 E x t -x * 2 + 1 2 E f (x t ) -f (x * ) + κ 1 - γµ 2 1 n n i=1 E h t i -∇f i (x * ) 2 + ν 1 - γµ 2 E w t -x t 2 + 12γ(ω + 1)σ 2 n , where κ ≤ 8γω nβ and ν ≤ 192γω L 2 nα + 32L α . Proof. First, we bound E t+1 g t -∇f (w t ) 2 , 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 and E t+1 w t+1 -x t+1 2 . 
Using the independence of compressors, we have E t+1 g t -∇f (w t ) 2 = E t+1   h t + 1 n n i=1 C D i ( ∇f i (w t ) -h t i ) -∇f (w t ) 2   = 1 n 2 n i=1 E t+1 C D i ( ∇f i (w t ) -h t i ) -∇f i (w t ) -h t i 2 (14) = 1 n 2 n i=1 E t+1 C D i ( ∇f i (w t ) -h t i ) -∇f i (w t ) -h t i 2 + E t+1 ∇f i (w t ) -∇f i (w t ) 2 ≤ ω n 2 n i=1 E t+1 ∇f i (w t ) -h t i 2 + 1 n 2 n i=1 E t+1 ∇f i (w t ) -∇f i (w t ) 2 (14) = ω n 2 n i=1 ∇f i (w t ) -h t i 2 + ω + 1 n 2 n i=1 E t+1 ∇f i (w t ) -∇f i (w t ) 2 ≤ ω n 2 n i=1 ∇f i (w t ) -h t i 2 + (ω + 1)σ 2 n ≤ 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 2ω n 2 n i=1 ∇f i (w t ) -∇f i (x * ) 2 + (ω + 1)σ 2 n ≤ 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 4ω n 2 n i=1 ∇f i (w t ) -∇f i (x t ) 2 + 4ω n 2 n i=1 ∇f i (x t ) -∇f i (x * ) 2 + (ω + 1)σ 2 n , where in the last three inequalities, we used ( 5) and ( 8). Using Assumption 2.2 and Lemma B.1, we obtain E t+1 g t -∇f (w t ) 2 ≤ 2ω n 2 n i=1 h t i -∇f i (x * ) 2 + 4ω L 2 n w t -x t 2 + 8ωL max n f (x t ) -f (x * ) + (ω + 1)σ 2 n . 
Next, we bound 1 n n i=1 h t+1 i -∇f i (x * ) 2 to construct a Lyapunov function: 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 = 1 n n i=1 E t+1 h t i + βC D i ( ∇f i (w t ) -h t i ) -∇f i (x * ) 2 = 1 n n i=1 h t i -∇f i (x * ) 2 + 2β n n i=1 h t i -∇f i (x * ), E t+1 C D i ( ∇f i (w t ) -h t i ) + β 2 n n i=1 E t+1 C D i ( ∇f i (w t ) -h t i ) 2 (5) ≤ 1 n n i=1 h t i -∇f i (x * ) 2 + 2β n n i=1 h t i -∇f i (x * ), ∇f i (w t ) -h t i + β 2 (ω + 1) n n i=1 E t+1 ∇f i (w t ) -h t i 2 (14) = 1 n n i=1 h t i -∇f i (x * ) 2 + 2β n n i=1 h t i -∇f i (x * ), ∇f i (w t ) -h t i + β 2 (ω + 1) n n i=1 E t+1 ∇f i (w t ) -h t i 2 + β 2 (ω + 1) n n i=1 E t+1 ∇f i (w t ) -∇f i (w t ) 2 ≤ 1 n n i=1 h t i -∇f i (x * ) 2 + 2β n n i=1 h t i -∇f i (x * ), ∇f i (w t ) -h t i + β 2 (ω + 1) n n i=1 E t+1 ∇f i (w t ) -h t i 2 + β 2 (ω + 1)σ 2 (12) = (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + β n n i=1 ∇f i (w t ) -∇f i (x * ) 2 + β (β(ω + 1) -1) n n i=1 ∇f i (w t ) -h t i 2 + β 2 (ω + 1)σ 2 ≤ (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + β n n i=1 ∇f i (w t ) -∇f i (x * ) 2 + β 2 (ω + 1)σ 2 , where we use the assumption β ∈ 0, 1 ω+1 . Using (8), Assumption 2.2 and Lemma B.1, we have 1 n n i=1 E t+1 h t+1 i -∇f i (x * ) 2 ≤ (1 -β) 1 n n i=1 h t i -∇f i (x * ) 2 + 2β L 2 w t -x t 2 + 4βL max f (x t ) -f (x * ) + β 2 (ω + 1)σ 2 . It remains to bound E t+1 w t+1 -x t+1 2 : E t+1 w t+1 -x t+1 2 = E t+1 w t + C p (x t+1 -w t ) -x t+1 2 (2) ≤ (1 -α)E t+1 x t+1 -w t 2 = (1 -α)E t+1 x t -γg t -w t 2 (14) = (1 -α)γ 2 E t+1 g t -∇f (w t ) 2 + (1 -α) x t -γ∇f (w t ) -w t 2 (7) ≤ γ 2 E t+1 g t -∇f (w t ) 2 + 1 - α 2 w t -x t 2 + 2γ 2 α ∇f (w t ) 2 (8) ≤ γ 2 E t+1 g t -∇f (w t ) 2 + 1 - α 2 w t -x t 2 + 4γ 2 α ∇f (w t ) -∇f (x t ) 2 + 4γ 2 α ∇f (x t ) -∇f (x * ) 2 . 
Using Assumption 2.1 and Lemma B.1, we obtain
\begin{align*}
\mathbb{E}_{t+1}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right]
&\le \gamma^2\,\mathbb{E}_{t+1}\left[\left\|g^t - \nabla f(w^t)\right\|^2\right] + \left(1 - \frac{\alpha}{2} + \frac{4\gamma^2 L^2}{\alpha}\right)\left\|w^t - x^t\right\|^2 + \frac{8\gamma^2 L}{\alpha}\left(f(x^t) - f(x^\ast)\right) \\
&\le \gamma^2\left(\frac{2\omega}{n^2}\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \frac{4\omega\widetilde L^2}{n}\left\|w^t - x^t\right\|^2 + \frac{8\omega L_{\max}}{n}\left(f(x^t) - f(x^\ast)\right) + \frac{(\omega+1)\sigma^2}{n}\right) \\
&\quad + \left(1 - \frac{\alpha}{2} + \frac{4\gamma^2 L^2}{\alpha}\right)\left\|w^t - x^t\right\|^2 + \frac{8\gamma^2 L}{\alpha}\left(f(x^t) - f(x^\ast)\right) \\
&= \frac{2\gamma^2\omega}{n^2}\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \left(1 - \frac{\alpha}{2} + \frac{4\gamma^2 L^2}{\alpha} + \frac{4\gamma^2\omega\widetilde L^2}{n}\right)\left\|w^t - x^t\right\|^2 \\
&\quad + \left(\frac{8\gamma^2\omega L_{\max}}{n} + \frac{8\gamma^2 L}{\alpha}\right)\left(f(x^t) - f(x^\ast)\right) + \frac{\gamma^2(\omega+1)\sigma^2}{n} \\
&\le \left(1 - \frac{\alpha}{4}\right)\left\|w^t - x^t\right\|^2 + \frac{2\gamma^2\omega}{n^2}\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \left(\frac{8\gamma^2\omega L_{\max}}{n} + \frac{8\gamma^2 L}{\alpha}\right)\left(f(x^t) - f(x^\ast)\right) + \frac{\gamma^2(\omega+1)\sigma^2}{n},
\end{align*}
where we assume that $\gamma \le \frac{\alpha}{\sqrt{32}\,L}$ and $\gamma \le \frac{\sqrt{\alpha n}}{\sqrt{32\omega}\,\widetilde L}$.

Let us fix some constants $\kappa \ge 0$ and $\nu \ge 0$. In the proof of (34) in Theorem F.1, we do not use the structure of $g^t$. Hence we can reuse (34) here and combine it with the above inequalities to obtain
\begin{align*}
&\mathbb{E}_{t+1}\left[f(x^{t+1})\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}_{t+1}\left[\left\|h_i^{t+1} - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}_{t+1}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le f(x^\ast) + \gamma\left(\frac{2\omega}{n^2}\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \frac{4\omega\widetilde L^2}{n}\left\|w^t - x^t\right\|^2 + \frac{8\omega L_{\max}}{n}\left(f(x^t) - f(x^\ast)\right) + \frac{(\omega+1)\sigma^2}{n}\right) \\
&\quad + \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\left\|x^t - x^\ast\right\|^2 - \frac{1}{2\gamma}\mathbb{E}_{t+1}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + 2L\left\|w^t - x^t\right\|^2 \\
&\quad + \kappa\left((1-\beta)\frac1n\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + 2\beta\widetilde L^2\left\|w^t - x^t\right\|^2 + 4\beta L_{\max}\left(f(x^t) - f(x^\ast)\right) + \beta^2(\omega+1)\sigma^2\right) \\
&\quad + \nu\left(\left(1 - \frac{\alpha}{4}\right)\left\|w^t - x^t\right\|^2 + \frac{2\gamma^2\omega}{n^2}\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \left(\frac{8\gamma^2\omega L_{\max}}{n} + \frac{8\gamma^2 L}{\alpha}\right)\left(f(x^t) - f(x^\ast)\right) + \frac{\gamma^2(\omega+1)\sigma^2}{n}\right).
\end{align*}
Rearranging the last inequality, one can get
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}_{t+1}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}_{t+1}\left[f(x^{t+1}) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}_{t+1}\left[\left\|h_i^{t+1} - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}_{t+1}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\left\|x^t - x^\ast\right\|^2 + \left(\frac{8\gamma\omega L_{\max}}{n} + 4\kappa\beta L_{\max} + \nu\left(\frac{8\gamma^2\omega L_{\max}}{n} + \frac{8\gamma^2 L}{\alpha}\right)\right)\left(f(x^t) - f(x^\ast)\right) \\
&\quad + \left(\frac{2\gamma\omega}{n} + \nu\,\frac{2\gamma^2\omega}{n} + \kappa(1-\beta)\right)\frac1n\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \left(\frac{4\gamma\omega\widetilde L^2}{n} + 2L + 2\kappa\beta\widetilde L^2 + \nu\left(1 - \frac{\alpha}{4}\right)\right)\left\|w^t - x^t\right\|^2 \\
&\quad + \frac{\gamma(\omega+1)\sigma^2}{n} + \kappa\beta^2(\omega+1)\sigma^2 + \nu\,\frac{\gamma^2(\omega+1)\sigma^2}{n}.
\end{align*}
Using the same reasoning as in the proof of Theorem F.1, we have
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}_{t+1}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}_{t+1}\left[f(x^{t+1}) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}_{t+1}\left[\left\|h_i^{t+1} - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}_{t+1}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\left\|x^t - x^\ast\right\|^2 + \kappa\left(1 - \frac{\gamma\mu}{2}\right)\frac1n\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \nu\left(1 - \frac{\gamma\mu}{2}\right)\left\|w^t - x^t\right\|^2 \\
&\quad + \frac12\left(f(x^t) - f(x^\ast)\right) + \frac{\gamma(\omega+1)\sigma^2}{n} + \kappa\beta^2(\omega+1)\sigma^2 + \nu\,\frac{\gamma^2(\omega+1)\sigma^2}{n}
\end{align*}
for some $\kappa \le \frac{8\gamma\omega}{n\beta}$ and $\nu \le \frac{192\gamma\omega\widetilde L^2}{n\alpha} + \frac{32L}{\alpha}$. Thus
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}_{t+1}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}_{t+1}\left[f(x^{t+1}) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}_{t+1}\left[\left\|h_i^{t+1} - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}_{t+1}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\left\|x^t - x^\ast\right\|^2 + \kappa\left(1 - \frac{\gamma\mu}{2}\right)\frac1n\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \nu\left(1 - \frac{\gamma\mu}{2}\right)\left\|w^t - x^t\right\|^2 + \frac12\left(f(x^t) - f(x^\ast)\right) \\
&\quad + \frac{\gamma(\omega+1)\sigma^2}{n} + \frac{8\gamma\beta\omega(\omega+1)\sigma^2}{n} + \frac{192\gamma^3\omega(\omega+1)\widetilde L^2\sigma^2}{n^2\alpha} + \frac{32\gamma^2(\omega+1)L\sigma^2}{n\alpha} \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\left\|x^t - x^\ast\right\|^2 + \kappa\left(1 - \frac{\gamma\mu}{2}\right)\frac1n\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2 + \nu\left(1 - \frac{\gamma\mu}{2}\right)\left\|w^t - x^t\right\|^2 + \frac12\left(f(x^t) - f(x^\ast)\right) + \frac{12\gamma(\omega+1)\sigma^2}{n},
\end{align*}
where we used the bounds on $\gamma$ and $\beta$.

Theorem 3.3. Let us consider Algorithm 1 using stochastic gradients $\hat\nabla f_i$ instead of exact gradients $\nabla f_i$ for all $i \in [n]$. Let Assumptions 2.1, 2.2, 2.3 and 3.2 hold, $\beta = \frac{1}{\omega+1}$, $x^0 = w^0$, and
\[
\gamma \le \min\left\{\frac{n}{160\,\omega L_{\max}},\ \frac{\sqrt{n\alpha}}{20\sqrt{\omega}\,\widetilde L},\ \frac{\alpha}{100L},\ \frac{1}{(\omega+1)\mu}\right\}.
\]
Then Algorithm 1 returns $x^T$ such that
\[
\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^T) - f(x^\ast)\right] \le \left(1 - \frac{\gamma\mu}{2}\right)^T V^0 + \frac{24(\omega+1)\sigma^2}{\mu n},
\]
where $V^0 := \frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast) + \frac{8\gamma\omega(\omega+1)}{n^2}\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2$.

Proof. Using $\gamma \le \frac{\alpha}{100L} \le \frac{1}{\mu}$, we can bound (40) as follows:
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^{t+1}) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^{t+1} - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \kappa\left(1 - \frac{\gamma\mu}{2}\right)\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2\right] + \nu\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{12\gamma(\omega+1)\sigma^2}{n} \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \kappa\left(1 - \frac{\gamma\mu}{2}\right)\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2\right] + \nu\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{12\gamma(\omega+1)\sigma^2}{n} \\
&= \left(1 - \frac{\gamma\mu}{2}\right)\left(\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right]\right) + \frac{12\gamma(\omega+1)\sigma^2}{n}.
\end{align*}
Recursively applying the last inequality and using the assumption $x^0 = w^0$, one can get that
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^T) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^T - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}\left[\left\|w^T - x^T\right\|^2\right] \\
&\le \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast) + \kappa\,\frac1n\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2\right) + \sum_{j=0}^{T-1}\left(1 - \frac{\gamma\mu}{2}\right)^j\frac{12\gamma(\omega+1)\sigma^2}{n} \\
&\le \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast) + \kappa\,\frac1n\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2\right) + \frac{24(\omega+1)\sigma^2}{n\mu}.
\end{align*}
Using the nonnegativity of the terms and the bound on $\kappa$, we obtain
\[
\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^T) - f(x^\ast)\right] \le \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast) + \frac{8\gamma\omega}{n^2\beta}\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2\right) + \frac{24(\omega+1)\sigma^2}{n\mu}.
\]

Theorem F.4. Let us consider Algorithm 1 using stochastic gradients $\hat\nabla f_i$ instead of the exact gradients $\nabla f_i$ for all $i \in [n]$. Let us assume that Assumptions 2.1, 2.2, 2.3 and 3.2 hold, the strong convexity parameter satisfies $\mu = 0$, $\beta = \frac{1}{\omega+1}$, $x^0 = w^0$, and $\gamma \le \min\left\{\frac{n}{160\,\omega L_{\max}},\ \frac{\sqrt{n\alpha}}{20\sqrt{\omega}\,\widetilde L},\ \frac{\alpha}{100L}\right\}$. Then Algorithm 1 guarantees the following convergence rate:
\[
f\!\left(\frac1T\sum_{t=1}^T x^t\right) - f(x^\ast) \le \frac{\left\|x^0 - x^\ast\right\|^2}{\gamma T} + \frac{f(x^0) - f(x^\ast)}{T} + \frac{16\gamma\omega(\omega+1)}{Tn^2}\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2 + \frac{24\gamma(\omega+1)\sigma^2}{n}.
\]

Proof. Let us bound (40):
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^{t+1}) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^{t+1} - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \kappa\left(1 - \frac{\gamma\mu}{2}\right)\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2\right] + \nu\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{12\gamma(\omega+1)\sigma^2}{n} \\
&\le \frac{1}{2\gamma}\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{12\gamma(\omega+1)\sigma^2}{n}.
\end{align*}
Summing the inequality for $t \in \{0, \dots, T-1\}$ gives
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^T) - f(x^\ast)\right] + \frac12\sum_{t=1}^T\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \kappa\,\frac1n\sum_{i=1}^n\mathbb{E}\left[\left\|h_i^T - \nabla f_i(x^\ast)\right\|^2\right] + \nu\,\mathbb{E}\left[\left\|w^T - x^T\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + \frac12\left(f(x^0) - f(x^\ast)\right) + \kappa\,\frac1n\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2 + \nu\left\|w^0 - x^0\right\|^2 + \frac{12T\gamma(\omega+1)\sigma^2}{n} \\
&\le \frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + \frac12\left(f(x^0) - f(x^\ast)\right) + \frac{8\gamma\omega}{n^2\beta}\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2 + \frac{12T\gamma(\omega+1)\sigma^2}{n},
\end{align*}
where we used the fact that $x^0 = w^0$ and the bound on $\kappa$.
Using nonnegativity of the terms and convexity, we have
\[
f\!\left(\frac1T\sum_{t=1}^T x^t\right) - f(x^\ast) \le \frac{\left\|x^0 - x^\ast\right\|^2}{\gamma T} + \frac{f(x^0) - f(x^\ast)}{T} + \frac{16\gamma\omega}{Tn^2\beta}\sum_{i=1}^n\left\|h_i^0 - \nabla f_i(x^\ast)\right\|^2 + \frac{24\gamma(\omega+1)\sigma^2}{n}.
\]
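Two elementary steps recur in the proofs above and below; for completeness, here they are spelled out (standard algebra, not new material):

```latex
% Geometric series: for 0 < \gamma\mu/2 < 1,
\sum_{j=0}^{T-1}\left(1-\frac{\gamma\mu}{2}\right)^{j}
\;\le\; \sum_{j=0}^{\infty}\left(1-\frac{\gamma\mu}{2}\right)^{j}
\;=\; \frac{1}{1-\left(1-\frac{\gamma\mu}{2}\right)}
\;=\; \frac{2}{\gamma\mu},
% so a per-step noise term of the form 12\gamma(\omega+1)\sigma^2/n accumulates
% to at most \frac{2}{\gamma\mu}\cdot\frac{12\gamma(\omega+1)\sigma^2}{n}
% = \frac{24(\omega+1)\sigma^2}{n\mu}.
%
% Jensen step: by convexity of f,
f\!\left(\frac{1}{T}\sum_{t=1}^{T} x^{t}\right)
\;\le\; \frac{1}{T}\sum_{t=1}^{T} f(x^{t}),
% which, after multiplying the telescoped bound
% \frac12\sum_{t=1}^{T}\mathbb{E}[f(x^t)-f(x^\ast)] \le (\text{RHS})
% by 2/T, yields the stated averaged-iterate rates.
```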

G PROOFS FOR EF21-P + DCGD IN THE CONVEX CASE

As mentioned before, EF21-P + DCGD arises as a special case of EF21-P + DIANA if we do not attempt to learn any local gradient shifts $h_i^t$ and instead set them to $0$ throughout. This can be achieved by setting $\beta = 0$.

Algorithm 2 EF21-P + DCGD
1: Parameters: learning rate $\gamma > 0$; initial iterate $x^0 \in \mathbb{R}^d$ (stored on the server and the workers); initial iterate shift $w^0 = x^0 \in \mathbb{R}^d$ (stored on the server and the workers)
2: for $t = 0, 1, \dots, T-1$ do
3: for $i = 1, \dots, n$ in parallel do
4: $g_i^t = \mathcal{C}^D_i(\nabla f_i(w^t))$ ⊳ Compress the gradient on worker $i$

Then Algorithm 2 guarantees that
\begin{align}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^{t+1}) - f(x^\ast)\right] + \nu\,\mathbb{E}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \nonumber \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \nu\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2, \tag{41}
\end{align}
where $\nu \le \frac{32\gamma\omega\widetilde L^2}{n\alpha} + \frac{16L}{\alpha}$.

Proof. Note that EF21-P + DCGD is EF21-P + DIANA with $\beta = 0$ and $h_i^t = 0$ for all $i \in [n]$ and $t \ge 0$. Up to (37), we can reuse the proof of Theorem F.1. From the assumptions on $\gamma$, we have
\[
\frac{2\gamma\omega}{n} + \nu\,\frac{2\gamma^2\omega}{n} \le \frac{2\gamma\omega}{n} + \left(\frac{32\gamma\omega\widetilde L^2}{\alpha n} + \frac{16L}{\alpha}\right)\frac{2\gamma^2\omega}{n} \le \frac{4\gamma\omega}{n},
\]
and hence
\[
\frac{1}{2\gamma}\mathbb{E}_{t+1}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}_{t+1}\left[f(x^{t+1}) - f(x^\ast)\right] + \nu\,\mathbb{E}_{t+1}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\left\|x^t - x^\ast\right\|^2 + \frac12\left(f(x^t) - f(x^\ast)\right) + \nu\left(1 - \frac{\gamma\mu}{2}\right)\left\|w^t - x^t\right\|^2 + \frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|h_i^t - \nabla f_i(x^\ast)\right\|^2.
\]
Taking the full expectation, we obtain the same bound with full expectations on the right-hand side. It remains to use (36) with $\beta = 0$ to finish the proof of the theorem.

Theorem G.2. Let us assume that Assumptions 2.1, 2.2 and 2.3 hold, the strong convexity parameter satisfies $\mu = 0$, $x^0 = w^0$, and $\gamma \le \min\left\{\frac{n}{160\,\omega L_{\max}},\ \frac{\sqrt{n\alpha}}{20\sqrt{\omega}\,\widetilde L},\ \frac{\alpha}{100L}\right\}$. Then Algorithm 2 guarantees that
\[
f\!\left(\frac1T\sum_{t=1}^T x^t\right) - f(x^\ast) \le \frac{\left\|x^0 - x^\ast\right\|^2}{\gamma T} + \frac{f(x^0) - f(x^\ast)}{T} + \frac{8\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2.
\]

Proof.
Let us bound (41):
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^{t+1}) - f(x^\ast)\right] + \nu\,\mathbb{E}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \nu\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2 \\
&\le \frac{1}{2\gamma}\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \nu\,\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2.
\end{align*}
We now sum the inequality for $t \in \{0, \dots, T-1\}$ and obtain
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^T) - f(x^\ast)\right] + \frac12\sum_{t=1}^T\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \nu\,\mathbb{E}\left[\left\|w^T - x^T\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + \frac12\left(f(x^0) - f(x^\ast)\right) + \nu\left\|w^0 - x^0\right\|^2 + T\,\frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2 \\
&= \frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + \frac12\left(f(x^0) - f(x^\ast)\right) + T\,\frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2,
\end{align*}
where we used the assumption $x^0 = w^0$. Non-negativity of the terms and convexity give
\[
f\!\left(\frac1T\sum_{t=1}^T x^t\right) - f(x^\ast) \le \frac{\left\|x^0 - x^\ast\right\|^2}{\gamma T} + \frac{f(x^0) - f(x^\ast)}{T} + \frac{8\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2.
\]

Theorem G.3. Let us assume that Assumptions 2.1, 2.2 and 2.3 hold, $x^0 = w^0$, and $\gamma \le \min\left\{\frac{n}{160\,\omega L_{\max}},\ \frac{\sqrt{n\alpha}}{20\sqrt{\omega}\,\widetilde L},\ \frac{\alpha}{100L}\right\}$. Then Algorithm 2 guarantees that
\[
\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^T) - f(x^\ast)\right] \le \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast)\right) + \frac{8\omega}{n\mu}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2.
\]

Proof. Using $\gamma \le \frac{\alpha}{100L} \le \frac{1}{\mu}$, let us bound (41):
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^{t+1} - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^{t+1}) - f(x^\ast)\right] + \nu\,\mathbb{E}\left[\left\|w^{t+1} - x^{t+1}\right\|^2\right] \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \frac12\,\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \nu\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2 \\
&\le \frac{1}{2\gamma}\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \nu\left(1 - \frac{\gamma\mu}{2}\right)\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right] + \frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2 \\
&= \left(1 - \frac{\gamma\mu}{2}\right)\left(\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^t - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^t) - f(x^\ast)\right] + \nu\,\mathbb{E}\left[\left\|w^t - x^t\right\|^2\right]\right) + \frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2.
\end{align*}
Recursively applying the last inequality and using $x^0 = w^0$, one obtains
\begin{align*}
&\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^T) - f(x^\ast)\right] + \nu\,\mathbb{E}\left[\left\|w^T - x^T\right\|^2\right] \\
&\le \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast)\right) + \sum_{j=0}^{T-1}\left(1 - \frac{\gamma\mu}{2}\right)^j\frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2 \\
&\le \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast)\right) + \sum_{j=0}^{\infty}\left(1 - \frac{\gamma\mu}{2}\right)^j\frac{4\gamma\omega}{n}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2 \\
&= \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast)\right) + \frac{8\omega}{n\mu}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2.
\end{align*}
Non-negativity of $\mathbb{E}\left[\|w^T - x^T\|^2\right]$ gives
\[
\frac{1}{2\gamma}\mathbb{E}\left[\left\|x^T - x^\ast\right\|^2\right] + \mathbb{E}\left[f(x^T) - f(x^\ast)\right] \le \left(1 - \frac{\gamma\mu}{2}\right)^T\left(\frac{1}{2\gamma}\left\|x^0 - x^\ast\right\|^2 + f(x^0) - f(x^\ast)\right) + \frac{8\omega}{n\mu}\cdot\frac1n\sum_{i=1}^n\left\|\nabla f_i(x^\ast)\right\|^2.
\]
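For intuition, EF21-P + DCGD can be simulated end-to-end in a few lines. The sketch below is our own toy instantiation (synthetic quadratic data, Rand-K uplink, Top-K downlink); it is not the paper's experimental setup, but it follows the structure of Algorithm 2: workers compress gradients taken at their model copy $w^t$, and the server broadcasts a Top-K-compressed model shift, which is the EF21-P correction.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased Rand-K uplink compressor (variance parameter omega = d/k - 1)."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]
    return out

def top_k(x, k):
    """Contractive Top-K downlink compressor (alpha = k/d)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

# Toy problem (our choice): f(x) = (1/n) * sum_i 0.5 * ||A_i x - b_i||^2.
rng = np.random.default_rng(0)
d, n = 20, 5
A = rng.standard_normal((n, d, d)) / np.sqrt(d)
b = rng.standard_normal((n, d))

def grad_i(i, w):
    return A[i].T @ (A[i] @ w - b[i])

def f(x):
    return 0.5 * np.mean([np.sum((A[i] @ x - b[i]) ** 2) for i in range(n)])

gamma, T, k_up, k_down = 0.02, 1500, 10, 5
x = np.zeros(d)   # model kept by the server
w = x.copy()      # model copy kept by the workers (the EF21-P shift)
for t in range(T):
    # Uplink: each worker compresses its gradient at w^t with unbiased Rand-K.
    g = np.mean([rand_k(grad_i(i, w), k_up, rng) for i in range(n)], axis=0)
    x = x - gamma * g
    # Downlink (EF21-P): broadcast compressed model shift, w^{t+1} = w^t + C^P(x^{t+1} - w^t).
    w = w + top_k(x - w, k_down)

f0, fT = f(np.zeros(d)), f(x)
```

In line with the theory above, the iterates converge to a neighborhood of the minimizer whose size is governed by $\frac{\gamma\omega}{n}\cdot\frac1n\sum_i\|\nabla f_i(x^\ast)\|^2$, so the loss drops from its initial value and then plateaus.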



Footnotes.

1. Our method is inspired by the recently proposed error-feedback mechanism, EF21, of Richtárik et al. (2021), which compresses the dual vectors, i.e., the gradients. EF21 is currently the state-of-the-art error-feedback mechanism in terms of its theoretical properties and practical performance (Fatkhullin et al., 2021). If we wished to explicitly highlight its dual nature, we could instead meaningfully call their method EF21-D.

2. In fact, this only happens in the non-interesting case when $\mathcal{C}^{t-1}$ is the identity with probability 1.

3. We did not try to derive the convergence rate of EF21-P + DIANA in the nonconvex regime because it is well known that DIANA is a suboptimal method in the nonconvex case (Gorbunov et al., 2021).



Figure 1: Logistic regression with the real-sim dataset. Number of workers: n = 100. Sparsification level was set to K = 100 for all compressors.

Figure 2: Logistic regression with the w8a dataset. Number of workers: n = 10. Sparsification level was set to K = 10 for all compressors.

Figure 4: Logistic regression with the real-sim dataset. Number of workers: n = 100. The parameters of the workers-to-server and server-to-workers compressors are K_w = 100 and K_s = 2000.

Figure 5: Logistic regression with the nonconvex regularizer and the real-sim dataset. Number of workers: n = 100. Sparsification level was set to K = 100 for all compressors.

General nonconvex case. The number of communication rounds to get an $\varepsilon$-stationary point ($\mathbb{E}[\|\nabla f(\hat x)\|$

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 24, 2011.

Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. arXiv preprint arXiv:2106.05203, 2021.

Rustem Islamov, Xun Qian, and Peter Richtárik. FedNL: Making Newton-type methods applicable to federated learning. arXiv preprint arXiv:2106.02969, 2021.

Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.

Sebastian Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.

F.2 PROOFS FOR EF21-P + DIANA WITH STOCHASTIC GRADIENTS

First, we prove the following auxiliary theorem:

Theorem F.3. Let us consider Algorithm 1 using the stochastic gradients $\hat\nabla f_i$ instead of the exact gradients $\nabla f_i$ for all $i \in [n]$.

9: $p^{t+1} = \mathcal{C}^P(x^{t+1} - w^t)$ ⊳ Compress shifted model on the server via $\mathcal{C}^P \in B(\alpha)$
10: $w^{t+1} = w^t + p^{t+1}$

The proofs in this section almost repeat the proofs from Section F.

Theorem G.1. Let us assume that Assumptions 2.1, 2.2 and 2.3 hold and choose $\gamma \le \min\left\{\frac{n}{160\,\omega L_{\max}},\ \frac{\sqrt{n\alpha}}{20\sqrt{\omega}\,\widetilde L},\ \frac{\alpha}{100L}\right\}$.

Due to $\beta = 0$, the $\kappa$-term drops from the Lyapunov function in the resulting recursion.

H FUTURE WORK AND POSSIBLE EXTENSIONS

In this paper, many important features of distributed and federated learning were not investigated in detail. These include variance reduction of stochastic gradients (Horváth et al., 2022; Tyurin & Richtárik, 2022b), acceleration (Li & Richtárik, 2021; Li et al., 2020), local steps (Murata & Suzuki, 2021), partial participation (McMahan et al., 2017; Tyurin & Richtárik, 2022a), and asynchronous SGD (Koloskova et al., 2022). While some of these extensions are simple exercises and can easily be added to our methods, many deserve further investigation and separate work. Further, note that several authors, including Szlendak et al. (2021), Richtárik et al. (2022), and Condat et al. (2022), considered somewhat different families of compressors than those we consider here. We believe that the results and discussion from our paper can be adapted to these families.

