A BETTER ALTERNATIVE TO ERROR FEEDBACK FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

Abstract

Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information (e.g., stochastic gradients) across the workers. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors which are not unbiased, such as Top-K or PowerSGD. In this paper, we propose a new alternative to EF for dealing with contractive compressors which is better both in theory and in practice. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings.

1. INTRODUCTION

We consider distributed optimization problems of the form

min_{x∈R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),

where x ∈ R^d represents the weights of a statistical model we wish to train, n is the number of nodes, and f_i : R^d → R is a smooth differentiable loss function composed of data stored on worker i. In a classical distributed machine learning scenario, f_i(x) := E_{ζ∼D_i}[f_ζ(x)] is the expected loss of model x with respect to the local data distribution D_i, and f_ζ : R^d → R is the loss on the single data point ζ. This definition allows for different distributions D_1, ..., D_n on each node, which means that the functions f_1, ..., f_n can have different minimizers. This framework covers Stochastic Optimization when either n = 1 or all D_i are identical, Empirical Risk Minimization (ERM) when f_i(x) can be expressed as a finite average, i.e., f_i(x) = (1/m_i) Σ_{j=1}^{m_i} f_{ij}(x) for some f_{ij} : R^d → R, and Federated Learning (FL) (Kairouz et al., 2019), where each node represents a client.

Communication Bottleneck. In distributed training, model updates (or gradient vectors) have to be exchanged in each iteration. Due to the size of the communicated messages for commonly considered deep models (Alistarh et al., 2016), this represents a significant bottleneck of the whole optimization procedure. To reduce the amount of data that has to be transmitted, several strategies were proposed. One of the most popular strategies is to incorporate local steps and communicate updates only every few iterations (Stich, 2019a; Lin et al., 2018a; Stich & Karimireddy, 2020; Karimireddy et al., 2019a; Khaled et al., 2020). Unfortunately, despite their practical success, local methods are poorly understood and their theoretical foundations are currently lacking: almost all existing error guarantees are dominated by a simple baseline, minibatch SGD (Woodworth et al., 2020). In this work, we focus on another popular approach: gradient compression.
In this approach, instead of transmitting the full dimensional (gradient) vector g ∈ R^d, one transmits a compressed vector C(g), where C : R^d → R^d is a (possibly random) operator chosen such that C(g) can be represented using fewer bits, for instance by using limited bit representation (quantization) or by enforcing sparsity. A particularly popular class of quantization operators is based on random dithering (Goodall, 1951; Roberts, 1962); see (Alistarh et al., 2016; Wen et al., 2017; Zhang et al., 2017; Horváth et al., 2019a; Ramezani-Kebrya et al., 2019). Much sparser vectors can be obtained by random sparsification techniques that randomly mask the input vectors and only preserve a constant number of coordinates (Wangni et al., 2018; Konečný & Richtárik, 2018; Stich et al., 2018; Mishchenko et al., 2019b; Vogels et al., 2019). There is also a line of work (Horváth et al., 2019a; Basu et al., 2019) combining sparsification and quantization to obtain a more aggressive effect. We will not further distinguish between sparsification and quantization approaches, and refer to all of them as compression operators hereafter. Considering both practice and theory, compression operators can be split into two groups: biased and unbiased. For unbiased compressors, C(g) is required to be an unbiased estimator of the update g. Once this requirement is lifted, extra tricks are necessary for Distributed Compressed Stochastic Gradient Descent (DCSGD) (Alistarh et al., 2016; 2018; Khirirat et al., 2018) to work with such a compressor, even if the full gradient is computed by each node. Indeed, the naive approach can lead to exponential divergence (Beznosikov et al., 2020), and Error Feedback (EF) (Seide et al., 2014; Karimireddy et al., 2019b) is the only known mechanism able to remedy the situation.
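To make the two classes concrete, here is a minimal sketch (ours, in NumPy; not taken from any of the cited implementations) of an unbiased Rand-K compressor and a biased Top-K compressor:

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased Rand-K: keep k uniformly random coordinates and scale by d/k,
    so that E[C(x)] = x (at the cost of higher variance)."""
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

def top_k(x, k):
    """Biased Top-K: keep the k largest-magnitude coordinates with no scaling,
    so E[C(x)] != x in general (it is deterministic and greedy)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out
```

Averaging rand_k over many independent draws recovers x, while top_k systematically drops small coordinates; this bias is exactly what EF, and later our induced compressor, must correct.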

Contributions. Our contributions can be summarized as follows:

• Induced Compressor. When used within the stabilizing EF framework, biased compressors (e.g., Top-K) can often achieve superior performance when compared to their unbiased counterparts (e.g., Rand-K). This is often attributed to their low variance. However, despite ample research in this area, EF remains the only known mechanism that allows the use of these powerful biased compressors. Our key contribution is the development of a simple but remarkably effective alternative, the only alternative we know of, which we argue leads to better and more versatile methods both in theory and practice. In particular, we propose a general construction that can transform any biased compressor, such as Top-K, into an unbiased one, for which we coin the name induced compressor (Section 3). Instead of using the desired biased compressor within EF, our proposal is to use the induced compressor within an appropriately chosen existing method designed for unbiased compressors, such as distributed compressed SGD (DCSGD) (Khirirat et al., 2018), variance reduced DCSGD (DIANA) (Mishchenko et al., 2019a) or accelerated DIANA (ADIANA) (Li et al., 2020). While EF can be seen as a version of DCSGD which can work with biased compressors, neither variance-reduced nor accelerated variants of EF were known at the time of writing this paper.
• Better Theory for DCSGD. As a secondary contribution, we provide a new and tighter theoretical analysis of DCSGD under weaker assumptions. If f is µ-quasi convex (not necessarily convex) and the local functions f_i are (L, σ²)-smooth (a weaker version of L-smoothness combined with the strong growth condition), we obtain the rate
O( δ_n L r^0 exp[-µT/(4δ_n L)] + ((δ_n - 1)D + δσ²/n)/(µT) ),
where δ_n = 1 + (δ-1)/n, δ ≥ 1 is the parameter which bounds the second moment of the compression operator, and T is the number of iterations.
This rate has a linearly decreasing dependence on the number of nodes n, which is strictly better than the best-known rate for DCSGD with EF, whose convergence does not improve as the number of nodes increases; this is one of the main disadvantages of using EF. Moreover, EF requires extra assumptions. In addition, while the best-known rates for EF (Karimireddy et al., 2019b; Beznosikov et al., 2020) are expressed in terms of functional values, our theory guarantees convergence in both iterates and functional values. Another practical implication of our findings is the reduction of the memory requirements by half; this is because in DCSGD one does not need to store the error vector.
• Partial Participation. We further extend our results to obtain the first convergence guarantee for partial participation with arbitrary distributions over nodes, which plays a key role in Federated Learning (FL).
• Experimental Validation. Finally, we provide an experimental evaluation on an array of classification tasks with the CIFAR10 dataset corroborating our theoretical findings.

Algorithm 1: DCSGD
1: Input: stepsizes {η^k}_{k=0}^T > 0, initial point x^0
2: for k = 0, 1, ..., T do
3:   Parallel: Worker side
4:   for i = 1, ..., n do
5:     obtain g_i^k
6:     send Δ_i^k = C^k(g_i^k) to master
7:     [no need to keep track of errors]
8:   end for
9:   Master side
10:  aggregate Δ^k = (1/n) Σ_{i=1}^n Δ_i^k
11:  broadcast Δ^k to each worker
12:  Parallel: Worker side
13:  for i = 1, ..., n do
14:    x^{k+1} = x^k - η^k Δ^k
15:  end for
16: end for

Algorithm 2: DCSGD with Error Feedback
1: Input: stepsizes {η^k}_{k=0}^T > 0, initial point x^0, errors e_i^0 = 0 ∀i ∈ [n]
2: for k = 0, 1, ..., T do
3:   Parallel: Worker side
4:   for i = 1, ..., n do
5:     obtain g_i^k
6:     send Δ_i^k = C^k(η^k g_i^k + e_i^k) to master
7:     e_i^{k+1} = η^k g_i^k + e_i^k - Δ_i^k
8:   end for
9:   Master side
10:  aggregate Δ^k = (1/n) Σ_{i=1}^n Δ_i^k
11:  broadcast Δ^k to each worker
12:  Parallel: Worker side
13:  for i = 1, ..., n do
14:    x^{k+1} = x^k - Δ^k
15:  end for
16: end for
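The difference between the two algorithms is confined to the worker-side step; the following single-process simulation (our own sketch, with the workers' gradients passed as a list) mirrors the listings above:

```python
import numpy as np

def dcsgd_step(x, grads, compress, eta):
    """Algorithm 1: each worker compresses its gradient directly;
    no per-worker state is needed."""
    delta = np.mean([compress(g) for g in grads], axis=0)
    return x - eta * delta

def ef_dcsgd_step(x, grads, errors, compress, eta):
    """Algorithm 2: each worker compresses the stepsize-scaled gradient plus
    its accumulated error, and keeps the residual for the next round."""
    deltas = []
    for i, g in enumerate(grads):
        corrected = eta * g + errors[i]
        d = compress(corrected)
        errors[i] = corrected - d  # error vector kept in worker memory
        deltas.append(d)
    return x - np.mean(deltas, axis=0), errors
```

With the identity compressor, both steps reduce to plain distributed SGD; the extra `errors` list is precisely the doubled memory footprint of EF discussed later.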

2. ERROR FEEDBACK IS NOT A GOOD IDEA WHEN USING UNBIASED COMPRESSORS

In this section we first introduce the notions of unbiased and general compression operators, and then compare Distributed Compressed SGD (DCSGD) without (Algorithm 1) and with (Algorithm 2) Error Feedback.

Unbiased vs General Compression Operators. We start with the definitions of unbiased and general compression operators (Cordonnier, 2018; Stich et al., 2018; Koloskova et al., 2019).

Definition 1 (Unbiased Compression Operator). A randomized mapping C : R^d → R^d is an unbiased compression operator (unbiased compressor) if there exists δ ≥ 1 such that
E[C(x)] = x,  E||C(x)||² ≤ δ||x||²,  ∀x ∈ R^d.
If this holds, we will for simplicity write C ∈ U(δ).

Definition 2 (General Compression Operator). A (possibly) randomized mapping C : R^d → R^d is a general compression operator (general compressor) if there exist λ > 0 and δ ≥ 1 such that
E||λC(x) - x||² ≤ (1 - 1/δ)||x||²,  ∀x ∈ R^d.
If this holds, we will for simplicity write C ∈ C(δ).

The following lemma provides a link between these notions (see, e.g., Beznosikov et al. (2020)).

Lemma 1. If C ∈ U(δ), then the inequality in Definition 2 holds with λ = 1/δ, i.e., C ∈ C(δ). That is, U(δ) ⊂ C(δ).

Note that the opposite inclusion does not hold. For instance, the Top-K operator belongs to C(d/K), but does not belong to U(δ) for any δ. In the next section we develop a procedure for transforming any mapping C : R^d → R^d (and in particular, any general compressor) into a closely related induced unbiased compressor.

Distributed SGD with vs without Error Feedback. In the rest of this section, we compare the convergence rates of DCSGD (Algorithm 1) and DCSGD with EF (Algorithm 2). We do this comparison under standard assumptions (Karimi et al., 2016; Bottou et al., 2018; Necoara et al., 2019; Gower et al., 2019; Stich, 2019b; Stich & Karimireddy, 2020), listed next. First, we assume throughout that f has a unique minimizer x*, and let f* = f(x*) > -∞.

Assumption 1 (µ-quasi convexity).
The function f is µ-quasi convex, i.e.,
f* ≥ f(x) + ⟨∇f(x), x* - x⟩ + (µ/2)||x* - x||²,  ∀x ∈ R^d.

Assumption 2 (unbiased gradient oracle). The stochastic gradients used in Algorithms 1 and 2 satisfy E[g_i^k | x^k] = ∇f_i(x^k) for all i and k. Note that this assumption implies E[(1/n) Σ_{i=1}^n g_i^k | x^k] = ∇f(x^k).

Assumption 3 ((L, σ²)-expected smoothness). The function f is (L, σ²)-smooth if there exist constants L > 0 and σ² ≥ 0 such that for all i ∈ [n] and all x^k ∈ R^d
E||g_i^k||² ≤ 2L(f_i(x^k) - f_i*) + σ²,
E||(1/n) Σ_{i=1}^n g_i^k||² ≤ 2L(f(x^k) - f*) + σ²/n,
where f_i* is the minimum functional value of f_i and [n] = {1, 2, ..., n}. This assumption generalizes the standard smoothness and bounded variance assumptions; for more details and discussion, see the works of Gower et al. (2019) and Stich (2019b).

Equipped with these assumptions, we are ready to proceed with the convergence theory.

Theorem 2 (Convergence of DCSGD). Consider the DCSGD algorithm with n ≥ 1 nodes. Let Assumptions 1-3 hold and C ∈ U(δ), where δ_n = (δ-1)/n + 1. Let D := (2L/n) Σ_{i=1}^n (f_i(x*) - f_i*). Then there exist stepsizes η^k ≤ 1/(2δ_n L) and weights w^k ≥ 0 such that for all T ≥ 1 we have
E[f(x̄^T)] - f* + µ E||x^T - x*||² ≤ 64 δ_n L r^0 exp[-µT/(4δ_n L)] + 36 ((δ_n - 1)D + δσ²/n)/(µT),
where r^0 = ||x^0 - x*||², W^T = Σ_{k=0}^T w^k, and x̄^T is sampled from the iterates with Prob(x̄^T = x^k) = w^k/W^T.

If δ = 1 (no compression), Theorem 2 recovers the optimal rate of Distributed SGD (Stich, 2019b). If δ > 1, there is an extra term (δ_n - 1)D in the convergence rate, which appears due to heterogeneity of the data (Σ_{i=1}^n ∇f_i(x*) = 0, but Σ_{i=1}^n C(∇f_i(x*)) ≠ 0 in general). In addition, the rate is negatively affected by the extra variance due to the presence of compression, which leads to L → δ_n L and σ²/n → δσ²/n. Next we compare our rate to the best-known result for Error Feedback (Stich & Karimireddy, 2020) (n = 1), (Beznosikov et al., 2020) (n ≥ 1) used with C ∈ U(δ) ⊂ C(δ):
E[f(x̄^T)] - f* = Õ( δL r^0 exp[-µT/(δL)] + (δD + σ²)/(µT) ).
One can note several disadvantages of Error Feedback (Alg.
2) with respect to plain DCSGD (Alg. 1). The first major drawback is that the effect of compression δ is not reduced with an increasing number of nodes. Secondly, Theorem 2 guarantees convergence of both the functional values and the last iterate, whereas the guarantee for EF covers functional values only. On top of that, our rate for DCSGD as captured by Theorem 2 does not hide any polylogarithmic factors, in contrast to EF. A further practical advantage of DCSGD is that there is no need to store an extra vector for the error, which reduces the storage costs by a factor of two, making Algorithm 1 a viable choice for Deep Learning models with millions of parameters. Finally, one does not need to assume standard L-smoothness in order to prove convergence in Theorem 2, while, on the other hand, L-smoothness is an important building block of the convergence proofs for general compressors due to the presence of bias (Stich & Karimireddy, 2020; Beznosikov et al., 2020). The only term in which EF might outperform plain DCSGD is O(σ²/(µT)), for which the corresponding DCSGD term is O(δσ²/(nµT)). This is due to the fact that EF compensates for the error, while standard compression introduces extra variance. Note that this is not a major issue, as it is reasonable to assume δ/n = O(1); moreover, σ² = 0 if the weak growth condition holds (Vaswani et al., 2019), which is a standard assumption, and one can remove the effect of σ² by either computing full gradients locally or by incorporating variance reduction such as SVRG (Johnson & Zhang, 2013). In Section 4, we also discuss how to remove the effect of D in Theorem 2. Putting everything together, this suggests that standard DCSGD (Algorithm 1) is strongly preferable, in theory, to DCSGD with Error Feedback (Algorithm 2) for C ∈ U(δ).
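As a numerical sanity check of the definitions in this section (a sketch of ours, not reference code), one can verify that the deterministic Top-K compressor satisfies the contraction inequality of Definition 2 with λ = 1 and δ = d/K:

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates of x (a biased compressor)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(1)
d, k = 20, 5
for _ in range(100):
    x = rng.standard_normal(d)
    err = np.linalg.norm(top_k(x, k) - x) ** 2
    # Definition 2 with lambda = 1 and delta = d/k: the dropped mass is at
    # most a (1 - k/d) fraction of ||x||^2, since the d-k smallest squared
    # coordinates cannot exceed the average squared coordinate.
    assert err <= (1 - k / d) * np.linalg.norm(x) ** 2 + 1e-12
```

The bound holds deterministically, with no expectation needed; the failure of unbiasedness (E[C(x)] ≠ x) is what keeps Top-K out of U(δ).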

3. INDUCED COMPRESSOR: FIXING BIAS WITH ERROR-COMPRESSION

In the previous section, we showed that DCSGD is theoretically preferable to DCSGD with Error Feedback for C ∈ U(δ). Unfortunately, C(δ) ⊄ U(δ), an example being the Top-K compressor (Alistarh et al., 2018; Stich et al., 2018): this compressor belongs to C(d/K), but does not belong to U(δ) for any δ. On the other hand, multiple unbiased alternatives to Top-K have been proposed in the literature, including gradient sparsification (Wangni et al., 2018) and adaptive random sparsification (Beznosikov et al., 2020).

Induced Compressor. We now propose a general mechanism for constructing an unbiased compressor C ∈ U from any biased compressor C_1 ∈ C. We shall argue that using this induced compressor within DCSGD is preferable, in both theory and practice, to using the original biased compressor C_1 within DCSGD with Error Feedback.

Theorem 3. For C_1 ∈ C(δ_1) with λ = 1, choose C_2 ∈ U(δ_2) and define the induced compressor via
C(x) := C_1(x) + C_2(x - C_1(x)).

The induced compression operator satisfies

C ∈ U(δ) with δ = δ_2(1 - 1/δ_1) + 1/δ_1. To get some intuition about this procedure, recall the structure used in Error Feedback. The gradient estimator is first compressed as C_1(g), and the error e = g - C_1(g) is stored in memory and used to modify the gradient in the next iteration. In our proposed approach, instead of storing the error e, we compress it with an unbiased compressor C_2 (which can be seen as a parameter allowing flexibility in the design of the induced compressor) and communicate both of these compressed vectors. Note that this procedure results in extra variance, as we do not work with the exact error, but with its unbiased estimate only. On the other hand, there is no bias and no error accumulation that one needs to correct for. In addition, due to our construction, at least the same amount of information is sent to the master as in the case of plain C_1(g): indeed, we send both C_1(g) and C_2(e). The drawback is the necessity to send more bits per iteration. However, Theorem 3 provides freedom in generating the induced compressor through the choice of the unbiased compressor C_2. In theory, it makes sense to choose C_2 with a compression factor similar to that of the compressor C_1 we are transforming, as this way the total number of communicated bits per iteration is preserved, up to a factor of two.

Remark: The rtop_{k1,k2}(x, y) operator proposed by Elibol et al. (2020) can be seen as a special case of our induced compressor with x = y, C_1 = Top-k_1 and C_2 = Rand-k_2.

Benefits of Induced Compressor. In the light of the results in Section 2, we argue that one should always prefer unbiased compressors to biased ones as long as their variances δ and communication complexities are the same, e.g., Rand-K over Top-K. In practice, biased/greedy compressors are in some settings observed to perform better due to their lower empirical variance (Beznosikov et al., 2020).
These considerations give practical significance to Theorem 3, as we demonstrate on the following example. Consider two compressors with identical communication complexity: a biased C_1 ∈ C(δ_1) and an unbiased C_2 ∈ U(δ_2) with δ_1 = δ_2 = δ, e.g., Top-K and Rand-K. The induced compressor C(x) := C_1(x) + C_2(x - C_1(x)) belongs to U(δ_3), where δ_3 = δ - (1 - 1/δ) < δ. While the size of the transmitted message is doubled, one can use Algorithm 1 since C is unbiased, which provides better convergence guarantees than Algorithm 2. Based on the construction of the induced compressor, one might expect that extra memory is needed, as "the error" e = g - C_1(g) has to be stored; however, this is only during the computation of the compressed message. This is not an issue, as compressors for DNNs are always applied layer-wise (Dutta et al., 2019), and hence the size of this extra memory is negligible. The same argument does not help EF, where the error needs to be stored at all times for each layer.
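A direct implementation of the induced compressor is short; the sketch below (ours) instantiates Theorem 3 with C_1 = Top-K and C_2 = Rand-K, and its unbiasedness can be checked by averaging over many draws:

```python
import numpy as np

def top_k(x, k):
    """Biased Top-K: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Unbiased Rand-K: keep k random coordinates, scaled by d/k."""
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

def induced(x, k1, k2, rng):
    """Induced compressor C(x) = C1(x) + C2(x - C1(x)): compress the greedy
    part with Top-k1, then send an unbiased Rand-k2 estimate of the residual
    instead of storing it as an EF error vector."""
    c1 = top_k(x, k1)
    return c1 + rand_k(x - c1, k2, rng)
```

Both calls transmit a sparse vector, so the message is roughly twice that of plain Top-K, matching the factor-of-two discussion above; unbiasedness follows since E[rand_k(e)] = e for the residual e = x - C_1(x).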

4. EXTENSIONS

We now develop several extensions of Algorithm 1 relevant to distributed optimization in general, and to Federated Learning in particular. This is all possible due to the simplicity of our approach. Note that in the case of Error Feedback, these extensions have either not been obtained yet, or, similarly to Section 2, the results are worse when compared to our derived bounds for unbiased compressors.

Partial Participation with Arbitrary Distribution over Nodes. In this section, we extend our results to a variant of DCSGD utilizing partial participation, which is of key relevance to Federated Learning. In this framework, only a subset of all nodes communicates with the master node in each communication round. Such a framework was analyzed before, but only for the case of uniform subsampling (Sattler et al., 2019; Reisizadeh et al., 2020). In our work, we consider a more general partial participation framework: we assume that the subset of participating clients is determined by a fixed but otherwise arbitrary random set-valued mapping S (a "sampling") with values in 2^[n], where [n] = {1, 2, ..., n}. To the best of our knowledge, this is the first partial participation result for FL where an arbitrary distribution over the nodes is considered. On the other hand, this is not the first work which makes use of the arbitrary sampling paradigm; it was used before in other contexts, e.g., for obtaining importance sampling guarantees for coordinate descent (Qu et al., 2015), primal-dual methods (Chambolle et al., 2018), and variance reduction (Horváth & Richtárik, 2019). Note that the sampling S is uniquely defined by assigning probabilities to all 2^n subsets of [n]. With each sampling S we associate a probability matrix P ∈ R^{n×n} defined by P_ij := Prob({i, j} ⊆ S). The probability vector associated with S is the vector composed of the diagonal entries of P: p = (p_1, ..., p_n) ∈ R^n, where p_i := Prob(i ∈ S). We say that S is proper if p_i > 0 for all i.
It is easy to show that b := E[|S|] = Trace(P) = Σ_{i=1}^n p_i, and hence b can be seen as the expected number of clients participating in each communication round. There are two algorithmic changes due to this extension: line 4 of Algorithm 1 does not iterate over every node, but only over the nodes i ∈ S^k, where S^k ∼ S, and the aggregation step on the master is adjusted to yield an unbiased estimator of the gradient, which gives Δ^k = Σ_{i∈S^k} (1/(n p_i)) Δ_i^k. To prove convergence, we exploit the following lemma.

Lemma 4. Let ζ_1, ..., ζ_n ∈ R^d, let ζ̄ := (1/n) Σ_{i=1}^n ζ_i, and let v ∈ R^n be such that
P - p p^⊤ ⪯ Diag(p_1 v_1, p_2 v_2, ..., p_n v_n).
Then, if S ∼ S,
E|| Σ_{i∈S} ζ_i/(n p_i) - ζ̄ ||² ≤ (1/n²) Σ_{i=1}^n (v_i/p_i) ||ζ_i||².

The following theorem establishes the convergence rate of Algorithm 1 with partial participation.

Theorem 5. Let Assumptions 1-3 hold and C ∈ U(δ). Then there exist stepsizes η^k ≤ 1/(2δ_S L) and weights w^k ≥ 0 such that
E[f(x̄^T)] - f* + µ E||x^T - x*||² ≤ 64 δ_S L r^0 exp[-µT/(4δ_S L)] + 36 ((δ_S - 1)D + (1 + a_S) δσ²/n)/(µT),
where r^0, W^T, x̄^T, and D are defined as in Theorem 2, a_S = max_{i∈[n]} {v_i/p_i}, and δ_S = (δ a_S + δ - 1)/n + 1.

For the case S = [n] with probability 1, one can show that Lemma 4 holds with v = 0, and hence we exactly recover the results of Theorem 2. In addition, we can quantify the slowdown factor with respect to the full participation regime (Theorem 2), which is governed by δ max_{i∈[n]} v_i/p_i. While in our framework we assume the distribution S to be fixed, our analysis can easily be extended to several proper distributions S_j, and we can even handle a block-cyclic structure with each block having an arbitrary proper distribution S_j over the given block j, by combining our analysis with the results of Eichner et al. (2019).

Obtaining Linear Convergence. Note that in all the previous theorems, we can only guarantee a sublinear O(1/T) convergence rate. A linear rate is obtained in the special case when D = 0 and σ² = 0. The first condition is satisfied when f_i* = f_i(x*) for all i ∈ [n], i.e., when x* is also a minimizer of every local function f_i.
Furthermore, the effect of D can be removed using compression of gradient differences, as pioneered in the DIANA algorithm (Mishchenko et al., 2019a). Note that σ² = 0 if the weak growth condition holds (Vaswani et al., 2019). Moreover, one can remove the effect of σ² by either computing full gradients locally or by incorporating variance reduction such as SVRG (Johnson & Zhang, 2013). It was shown by Horváth et al. (2019b) that both σ² and D can be removed in the setting of Theorem 2. These results can be easily extended to partial participation using our proof technique for Theorem 5. Note that this reduction is not possible for Error Feedback, as the analysis of the DIANA algorithm depends heavily on the unbiasedness property. This points to another advantage of the induced compressor framework introduced in Section 3.

Acceleration. We now comment on the combination of compression and acceleration/momentum. This setting is very important to consider, as essentially all state-of-the-art methods for training deep learning models, including Adam (Kingma & Ba, 2015; Reddi et al., 2018), rely on the use of momentum in one form or another. One can treat the unbiased compressed gradient as a stochastic gradient (Gorbunov et al., 2020), and the theory for momentum SGD (Yang et al., 2016; Gadat et al., 2018; Loizou & Richtárik, 2017) is then applicable under an extra smoothness assumption. Moreover, it is possible to remove the variance caused by stochasticity and obtain linear convergence with an accelerated rate, which leads to the Accelerated DIANA method (Li et al., 2020). Similarly to our previous discussion, both of these techniques depend heavily on the unbiasedness property. It is an intriguing question, but out of the scope of this paper, to investigate the combined effect of momentum and Error Feedback and see whether these techniques are compatible theoretically.
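For concreteness, the reweighted aggregation Δ^k = Σ_{i∈S^k} Δ_i^k/(n p_i) used in the partial participation extension above can be sketched as follows (our own simulation; independent Bernoulli participation is just one example of a proper sampling S):

```python
import numpy as np

def pp_aggregate(deltas, participating, p):
    """Unbiased aggregation under partial participation: only workers in
    `participating` contribute, each reweighted by 1/(n * p_i), so the
    expectation over the sampling equals the full average (1/n) sum_i delta_i."""
    n = len(deltas)
    agg = np.zeros_like(deltas[0])
    for i in participating:
        agg += deltas[i] / (n * p[i])
    return agg
```

Unbiasedness is immediate: each term δ_i/(n p_i) is included with probability p_i, so its expectation is δ_i/n regardless of how the participation events are correlated.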

5. EXPERIMENTS

In this section, we compare Algorithms 1 and 2 for several compression operators. If a method is labeled " + EF", Error Feedback is applied, i.e., Algorithm 2 is used; otherwise, Algorithm 1 is used. To be fair, we always compare methods with the same communication complexity per iteration. All experimental details can be found in the Appendix.

Failure of DCSGD with biased Top-1. In this experiment, we present the example considered in Beznosikov et al. (2020), which was used as a counterexample to show that some form of error correction is needed in order for biased compressors to work/provably converge. We run experiments on their construction and show that while Error Feedback fixes the divergence, it is still significantly dominated by unbiased non-uniform sparsification (NU Rand-1), which keeps a single non-zero coordinate i sampled with probability |x_i| / Σ_{j=1}^d |x_j|, as can be seen in Figure 1. The details can be found in the Appendix.

Error Feedback for Unbiased Compression Operators. In our second experiment, we compare the effect of Error Feedback in the case when an unbiased compressor is used. Note that unbiased compressors are theoretically guaranteed to work with both Algorithms 1 and 2. We can see from Figure 2 that adding Error Feedback can hurt the performance; we use TernGrad (Wen et al., 2017) (which coincides with QSGD (Alistarh et al., 2016) and natural dithering (Horváth et al., 2019a) with the infinity norm and one level) as the compressor. This agrees with our theoretical findings. In addition, for sparsification techniques such as Random Sparsification or Gradient Sparsification (Wangni et al., 2018), we observed that when sparsity is set to 10%, Algorithm 1 converges for all the selected step-sizes, but Algorithm 2 diverges and a smaller step-size needs to be used.
This is an important observation, as many practical works (Li et al., 2014; Wei et al., 2015; Aji & Heafield, 2017; Hsieh et al., 2017; Lin et al., 2018b; Lim et al., 2018) use the sparsification techniques mentioned in this section but propose to use EF, while our work shows that exploiting unbiasedness leads not only to better convergence but also to memory savings.

Unbiased Alternatives to Biased Compression. In this section, we investigate candidates for unbiased compressors that can compete with Top-K, one of the most frequently used compressors. Theoretically, Top-K is not guaranteed to work by itself and might lead to divergence (Beznosikov et al., 2020) unless Error Feedback is applied. One would usually compare the performance of Top-K with EF to Rand-K, which keeps K randomly selected coordinates and then scales the output by d/K to preserve unbiasedness. Rather than naively comparing to Rand-K, we propose to use more nuanced unbiased approaches. The first one is Gradient Sparsification as proposed by Wangni et al. (2018), which we refer to here as Rand-K (Wangni et al.), where the probability of keeping each coordinate scales with its magnitude and the communication budget. As the second alternative, we propose to use our induced compressor, where C_1 is Top-a and the unbiased part C_2 is Rand-(K - a) (Wangni et al.) with communication budget K - a. Note that a can be considered a hyperparameter to tune; for our experiment, we chose it to be K/2 for simplicity. Figure 3 suggests that our induced compressor outperforms all of its competitors, as can be seen for both VGG11 and Resnet18. Moreover, neither the induced compressor nor Rand-K requires extra memory to store the error vector. Finally, Top-K without EF suffers a significant decrease in performance, which stresses the necessity of error correction.
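The non-uniform sparsifier NU Rand-1 used in Figure 1 admits a very short implementation (our sketch): sample a single coordinate with probability proportional to its magnitude and rescale so the estimator stays unbiased.

```python
import numpy as np

def nu_rand1(x, rng):
    """NU Rand-1: keep one coordinate i with probability p_i = |x_i| / ||x||_1
    and rescale it by 1/p_i, so that E[C(x)] = x."""
    probs = np.abs(x) / np.sum(np.abs(x))
    i = rng.choice(x.size, p=probs)
    out = np.zeros_like(x)
    out[i] = x[i] / probs[i]
    return out
```

Note that the surviving entry always equals sign(x_i) ||x||_1, so the second moment is E||C(x)||² = ||x||_1²; this low variance is what makes the sampler competitive with biased Top-1.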

6. CONCLUSION

In this paper, we argue that if compressed communication is required for distributed training due to communication overhead, it is better to use unbiased compressors. We show that this leads to strictly better convergence guarantees under fewer assumptions. In addition, we propose a new construction for transforming any compressor into an unbiased one using a compressed EF-like approach. Besides its theoretical superiority, the use of unbiased compressors enjoys lower memory requirements. Our theoretical findings are corroborated with empirical evaluation. As future work, we plan to investigate the question of the appropriate choice of the inducing compressor C. Our preliminary studies show that there is much to be discovered here, both in theory and in terms of developing further practical guidelines beyond those already contained in this work. The question of (theoretically) optimizing for C_1 and C_2 is difficult, as it necessitates a deeper theoretical understanding of biased compressors, which is currently missing. An alternative is to impose some assumptions on the structure of the gradients encountered during the iterative process, or to perform an extensive experimental evaluation on the desired tasks to provide guidelines for practitioners.

APPENDIX A EXPERIMENTAL DETAILS

To be fair, we always compare methods with the same communication complexity per iteration. We report the number of epochs (passes over the dataset) against training loss and test accuracy. The test accuracy is obtained by evaluating the best model in terms of validation accuracy, where the validation accuracy is computed on 10% of the training data, selected at random. We tune the step-size based on the training loss. For every experiment, we randomly distribute the training dataset among 8 workers; each worker computes its local gradient based on its own data. We use a local batch size of 32. All the provided figures display the mean performance with one standard error over 5 independent runs. For a fair comparison, we use the same random seed for the compared methods. Our experimental results are based on a Python implementation of all the methods running in PyTorch. All reported quantities are independent of the system architecture and network bandwidth.

Dataset and Models. We evaluate on the CIFAR10 dataset. We consider the VGG11 (Simonyan & Zisserman, 2015) and ResNet18 (He et al., 2016) models and step-sizes 0.1, 0.05 and 0.01.

A.1 EXTRA EXPERIMENTS

Momentum. In this extra experiment, we look at the effect of momentum on Algorithms 1 and 2. We set the momentum to 0.9. Similarly to Figure 2, we work with an unbiased compressor, concretely TernGrad (Wen et al., 2017) (which coincides with QSGD (Alistarh et al., 2016) and natural dithering (Horváth et al., 2019a) with the infinity norm and one level), to see the effect of adding Error Feedback. We can see from Figure 4 that adding Error Feedback can hurt the performance, which agrees with our theoretical findings.

B EXAMPLE 1, BEZNOSIKOV ET AL. (2020)

Published as a conference paper at ICLR 2021

In this section, we present the example considered in Beznosikov et al. (2020), which was used as a counterexample to show that some form of error correction is needed in order for biased compressors to work/provably converge. In addition, we run experiments on their construction and show that while Error Feedback fixes the divergence, it is still significantly dominated by unbiased non-uniform sparsification, as can be seen in Figure 1. The construction follows. Consider n = d = 3 and define the following smooth and strongly convex quadratic functions:
f_1(x) = ⟨a, x⟩² + (1/4)||x||²,  f_2(x) = ⟨b, x⟩² + (1/4)||x||²,  f_3(x) = ⟨c, x⟩² + (1/4)||x||²,
where a = (-3, 2, 2), b = (2, -3, 2), c = (2, 2, -3). Then, with the initial point x^0 = (t, t, t), t > 0,
∇f_1(x^0) = (t/2)(-11, 9, 9),  ∇f_2(x^0) = (t/2)(9, -11, 9),  ∇f_3(x^0) = (t/2)(9, 9, -11).
Using the Top-1 compressor, we get
C(∇f_1(x^0)) = (t/2)(-11, 0, 0),  C(∇f_2(x^0)) = (t/2)(0, -11, 0),  C(∇f_3(x^0)) = (t/2)(0, 0, -11).
The next iterate of DCGD is
x^1 = x^0 - (η/3) Σ_{i=1}^3 C(∇f_i(x^0)) = (1 + 11η/6) x^0.
Repeated application gives x^k = (1 + 11η/6)^k x^0, which diverges exponentially fast since η > 0. As the initial point, we use (1, 1, 1) in our experiments, and we choose the step size 1/L, where L is the smoothness parameter of f = (1/3)(f_1 + f_2 + f_3). Note that the zero vector is the unique minimizer of f.
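The geometric divergence above is easy to reproduce numerically; the following sketch (ours) runs DCGD with Top-1 on the construction and checks the per-step blow-up factor 1 + 11η/6:

```python
import numpy as np

A = np.array([[-3.0, 2.0, 2.0],
              [2.0, -3.0, 2.0],
              [2.0, 2.0, -3.0]])  # rows are the vectors a, b, c

def grad(i, x):
    """Gradient of f_i(x) = <A[i], x>^2 + ||x||^2 / 4."""
    return 2.0 * (A[i] @ x) * A[i] + 0.5 * x

def top1(v):
    """Top-1 compressor: keep only the largest-magnitude coordinate."""
    out = np.zeros_like(v)
    j = np.argmax(np.abs(v))
    out[j] = v[j]
    return out

eta = 0.1
x = np.ones(3)  # initial point (1, 1, 1)
for _ in range(5):
    x = x - eta * np.mean([top1(grad(i, x)) for i in range(3)], axis=0)
# each step multiplies the iterate by (1 + 11 * eta / 6) > 1, so DCGD diverges
```

The iterates stay proportional to (1, 1, 1), so the same blow-up repeats at every step, matching the closed-form expression x^k = (1 + 11η/6)^k x^0 derived above.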

C PROOFS

C.1 PROOF OF LEMMA 1

We follow (2), which holds for C ∈ U(δ):

E‖(1/δ)C(x) − x‖² = (1/δ²)E‖C(x)‖² − (2/δ)E⟨C(x), x⟩ + ‖x‖² ≤ (1/δ − 2/δ + 1)‖x‖² = (1 − 1/δ)‖x‖²,

which concludes the proof.
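As a concrete sanity check of Lemma 1, consider the scaled Rand-1 compressor C(x) = d·xᵢeᵢ with i chosen uniformly, which is unbiased with E‖C(x)‖² = d‖x‖² (so δ = d). The snippet below (our own illustration, not from the paper's code) enumerates all outcomes exactly and confirms that (1/δ)C attains the bound (1 − 1/δ)‖x‖² with equality for this compressor.

```python
import numpy as np

# Exact (enumerated) check of Lemma 1 for scaled Rand-1, where delta = d.
x = np.array([1.0, -2.0, 3.0, 0.5])
d = x.size

sq_err = 0.0
for i in range(d):
    compressed = np.zeros(d)
    compressed[i] = d * x[i]            # Rand-1, scaled by d for unbiasedness
    sq_err += np.sum((compressed / d - x) ** 2) / d  # each outcome has prob 1/d

lemma_bound = (1 - 1 / d) * np.sum(x ** 2)
```

Here `sq_err` equals E‖(1/δ)C(x) − x‖², and it matches (1 − 1/δ)‖x‖² exactly.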

C.2 PROOF OF THEOREM 2

We use the update of Algorithm 1 to bound the following quantity:

E[‖xᵏ⁺¹ − x*‖² | xᵏ] = ‖xᵏ − x*‖² − (2ηᵏ/n) Σᵢ₌₁ⁿ E[⟨Cᵏ(gᵢᵏ), xᵏ − x*⟩ | xᵏ] + (ηᵏ/n)² E[‖Σᵢ₌₁ⁿ Cᵏ(gᵢᵏ)‖² | xᵏ]

(2)+(5): ≤ ‖xᵏ − x*‖² − 2ηᵏ⟨∇f(xᵏ), xᵏ − x*⟩ + (ηᵏ/n)² E[Σᵢ₌₁ⁿ ‖Cᵏ(gᵢᵏ) − gᵢᵏ‖² + ‖Σᵢ₌₁ⁿ gᵢᵏ‖² | xᵏ]

(2): ≤ ‖xᵏ − x*‖² − 2ηᵏ⟨∇f(xᵏ), xᵏ − x*⟩ + (ηᵏ/n)² E[(δ − 1) Σᵢ₌₁ⁿ ‖gᵢᵏ‖² + ‖Σᵢ₌₁ⁿ gᵢᵏ‖² | xᵏ]

(6)+(7): ≤ ‖xᵏ − x*‖² − 2ηᵏ⟨∇f(xᵏ), xᵏ − x*⟩ + 2L(ηᵏ)² [δₙ(f(xᵏ) − f*) + (δₙ − 1)(1/n) Σᵢ₌₁ⁿ (fᵢ(x*) − fᵢ*)] + (ηᵏ)² δσ²/n

(4): ≤ (1 − µηᵏ)‖xᵏ − x*‖² − 2ηᵏ(1 − ηᵏδₙL)(f(xᵏ) − f*) + (ηᵏ)²[(δₙ − 1)D + δσ²/n].

Taking full expectation and using ηᵏ ≤ 1/(2δₙL), we obtain

E‖xᵏ⁺¹ − x*‖² ≤ (1 − µηᵏ)E‖xᵏ − x*‖² − ηᵏ E[f(xᵏ) − f*] + (ηᵏ)²[(δₙ − 1)D + δσ²/n].

The rest of the analysis is closely related to that of Stich (2019b). We would like to point out that results similar to Stich (2019b) were also present in (Lacoste-Julien et al., 2012; Stich et al., 2018; Grimmer, 2019). We first rewrite the previous inequality in the form

rᵏ⁺¹ ≤ (1 − aηᵏ)rᵏ − ηᵏsᵏ + (ηᵏ)²c,    (10)

where rᵏ = E‖xᵏ − x*‖², sᵏ = E[f(xᵏ) − f*], a = µ, and c = (δₙ − 1)D + δσ²/n. We proceed with lemmas that establish a convergence guarantee for every recursion of type (10).

Lemma 6. Let {rᵏ}ₖ≥₀, {sᵏ}ₖ≥₀ be as in (10) with a > 0 and constant stepsizes ηᵏ ≡ η := 1/d for all k ≥ 0. Then it holds for all T ≥ 0:

r^T ≤ r⁰ exp(−aT/d) + c/(ad).

Proof. This follows by relaxing (10) using E[f(xᵏ) − f*] ≥ 0 and unrolling the recursion:

r^T ≤ (1 − aη)r^{T−1} + cη² ≤ (1 − aη)^T r⁰ + cη² Σₖ₌₀^{T−1} (1 − aη)ᵏ ≤ (1 − aη)^T r⁰ + cη/a ≤ r⁰ exp(−aT/d) + c/(ad).

Lemma 7. Let {rᵏ}ₖ≥₀, {sᵏ}ₖ≥₀ be as in (10) with a > 0, decreasing stepsizes ηᵏ := 2/(a(κ + k)) for all k ≥ 0 with parameter κ := 2d/a, and weights wₖ := κ + k. Then

(1/W_T) Σₖ₌₀^T sᵏwₖ + ar^{T+1} ≤ 2aκ²r⁰/T² + 2c/(aT),

where W_T := Σₖ₌₀^T wₖ.

Proof.
We start by re-arranging (10) and multiplying both sides by wₖ:

sᵏwₖ ≤ wₖ(1 − aηᵏ)rᵏ/ηᵏ − wₖr^{k+1}/ηᵏ + cηᵏwₖ = a(κ + k)(κ + k − 2)rᵏ − a(κ + k)²r^{k+1} + c/a ≤ a(κ + k − 1)²rᵏ − a(κ + k)²r^{k+1} + c/a,

where the equality follows from the definitions of ηᵏ and wₖ, and the inequality from (κ + k)(κ + k − 2) = (κ + k − 1)² − 1 ≤ (κ + k − 1)². We again have a telescoping sum:

(1/W_T) Σₖ₌₀^T sᵏwₖ + a(κ + T)²r^{T+1}/W_T ≤ aκ²r⁰/W_T + c(T + 1)/(aW_T),

with
• W_T = Σₖ₌₀^T wₖ = Σₖ₌₀^T (κ + k) = (2κ + T)(T + 1)/2 ≥ T(T + 1)/2 ≥ T²/2,
• and W_T = (2κ + T)(T + 1)/2 ≤ 2(κ + T)(1 + T)/2 ≤ (κ + T)² for κ = 2d/a ≥ 1.

Applying these two estimates concludes the proof.

The convergence rate can be obtained as a combination of these two lemmas.

Lemma 8. Let {rᵏ}ₖ≥₀, {sᵏ}ₖ≥₀ be as in (10) with a > 0. Then there exist stepsizes ηᵏ ≤ 1/d and weights wₖ ≥ 0, W_T := Σₖ₌₀^T wₖ, such that

(1/W_T) Σₖ₌₀^T sᵏwₖ + ar^{T+1} ≤ 32dr⁰ exp(−aT/(2d)) + 36c/(aT).

Proof of Lemma 8. For an integer T ≥ 0, we choose stepsizes and weights as follows:

• if T ≤ d/a: ηᵏ = 1/d, wₖ = (1 − aηᵏ)^{−(k+1)} = (1 − a/d)^{−(k+1)};
• if T > d/a and k < t₀: ηᵏ = 1/d, wₖ = 0;
• if T > d/a and k ≥ t₀: ηᵏ = 2/(a(κ + k − t₀)), wₖ = κ + k − t₀,

for κ = 2d/a and t₀ = ⌈T/2⌉. We now show that these choices imply the claimed result. We start with the case T ≤ d/a. For this case, the choice η = 1/d gives

(1/W_T) Σₖ₌₀^T sᵏwₖ + ar^{T+1} ≤ (1 − aη)^{T+1}r⁰/η + cη ≤ (r⁰/η) exp[−aη(T + 1)] + cη ≤ dr⁰ exp(−aT/d) + c/(aT).

If T > d/a, then we obtain from Lemma 6 that

r^{t₀} ≤ r⁰ exp(−aT/(2d)) + c/(ad).

From Lemma 7 we have, for the second half of the iterates,

(1/W_T) Σₖ₌₀^T sᵏwₖ + ar^{T+1} = (1/W_T) Σₖ₌ₜ₀^T sᵏwₖ + ar^{T+1} ≤ 8aκ²r^{t₀}/T² + 4c/(aT).

Now we observe that the restart term r^{t₀} satisfies

aκ²r^{t₀}/T² ≤ aκ²r⁰ exp(−aT/(2d))/T² + κ²c/(dT²) ≤ 4ar⁰ exp(−aT/(2d)) + 4c/(aT),

because T > d/a. This concludes the proof.
Equipped with these general convergence lemmas for recursions of the form (10), the proof of the theorem follows directly from Lemmas 6 and 8 with a = µ, c = (δₙ − 1)D + δσ²/n, and d = 2δₙL. It is easy to check that the condition ηᵏ ≤ 1/d = 1/(2δₙL) is satisfied.
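As a quick numerical sanity check of Lemma 6, the snippet below (with arbitrarily chosen constants a, c, d, r⁰, not tied to any particular problem) runs the tightest admissible recursion rᵏ⁺¹ = (1 − aη)rᵏ + cη² with η = 1/d and compares it against the claimed bound.

```python
import numpy as np

# Lemma 6: for r_{k+1} <= (1 - a*eta) r_k + c*eta^2 with eta = 1/d,
# the iterates satisfy r_T <= r_0 * exp(-a*T/d) + c/(a*d).
a, c, d, r0, T = 0.5, 2.0, 10.0, 100.0, 200
eta = 1.0 / d

r = r0
for _ in range(T):
    r = (1 - a * eta) * r + c * eta ** 2   # worst case: recursion with equality

bound = r0 * np.exp(-a * T / d) + c / (a * d)
```

The simulated `r` stays below `bound`, as the two relaxations in the proof, (1 − aη)^T ≤ exp(−aηT) and the geometric-series bound, only lose slack.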

C.3 PROOF OF THEOREM 3

We have to show that our new compressor is unbiased and has bounded variance. We start with the first property, with λ = 1:

E[C₁(x) + C₂(x − C₁(x))] = E_{C₁}[E_{C₂}[C₁(x) + C₂(x − C₁(x)) | C₁(x)]] = E_{C₁}[C₁(x) + x − C₁(x)] = x,

where the first equality follows from the tower property and the second from the unbiasedness of C₂. For the second property, we also use the tower property:

E‖C₁(x) − x + C₂(x − C₁(x))‖² = E_{C₁}[E_{C₂}[‖C₁(x) − x + C₂(x − C₁(x))‖² | C₁(x)]] ≤ (δ₂ − 1)E_{C₁}‖C₁(x) − x‖² ≤ (δ₂ − 1)(1 − 1/δ₁)‖x‖²,

where the first and second inequalities follow directly from (2) and (3).

C.4 PROOF OF LEMMA 4 (HORVÁTH & RICHTÁRIK, 2019)

For the first part of the claim, it was shown that P − pp⊤ is positive semidefinite (Richtárik & Takáč, 2016), thus we can bound P − pp⊤ ⪯ n Diag(P − pp⊤) = Diag(p ∘ v), where vᵢ = n(1 − pᵢ), which implies that (8) holds for this choice of v. For the second part of the claim, let 1_{i∈S} = 1 if i ∈ S and 1_{i∈S} = 0 otherwise. Likewise, let 1_{i,j∈S} = 1 if i, j ∈ S and 1_{i,j∈S} = 0 otherwise. Note that E[1_{i∈S}] = pᵢ and E[1_{i,j∈S}] = p_{ij}. Next, let us compute the mean of X := Σ_{i∈S} ζᵢ/(npᵢ):

E[X] = E[Σ_{i∈S} ζᵢ/(npᵢ)] = E[Σᵢ₌₁ⁿ (ζᵢ/(npᵢ)) 1_{i∈S}] = Σᵢ₌₁ⁿ (ζᵢ/(npᵢ)) E[1_{i∈S}] = (1/n) Σᵢ₌₁ⁿ ζᵢ = ζ̄.

Let A = [a₁, …, aₙ] ∈ R^{d×n}, where aᵢ = ζᵢ/pᵢ, and let e be the vector of all ones in Rⁿ. We now write the variance of X in a form which will be convenient for establishing a bound:

E‖X − E[X]‖² = E‖X‖² − ‖E[X]‖² = E[Σ_{i,j} ⟨ζᵢ/(npᵢ), ζⱼ/(npⱼ)⟩ 1_{i,j∈S}] − ‖ζ̄‖² = Σ_{i,j} p_{ij} ⟨ζᵢ/(npᵢ), ζⱼ/(npⱼ)⟩ − Σ_{i,j} pᵢpⱼ ⟨ζᵢ/(npᵢ), ζⱼ/(npⱼ)⟩ = (1/n²) Σ_{i,j} (p_{ij} − pᵢpⱼ) ⟨aᵢ, aⱼ⟩ = (1/n²) e⊤((P − pp⊤) ∘ (A⊤A))e.    (13)

Since by assumption we have P − pp⊤ ⪯ Diag(p ∘ v), we can further bound

e⊤((P − pp⊤) ∘ (A⊤A))e ≤ e⊤(Diag(p ∘ v) ∘ (A⊤A))e = Σᵢ₌₁ⁿ pᵢvᵢ‖aᵢ‖².

To obtain (9), it remains to combine this with (13).
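The unbiasedness of the induced compressor C(x) = C₁(x) + C₂(x − C₁(x)) established in the proof of Theorem 3 can be checked numerically. The sketch below uses Top-1 as the contractive C₁ and scaled Rand-1 as the unbiased C₂ (one concrete choice; the test vector is arbitrary), and enumerates the outcomes of C₂ exactly rather than sampling.

```python
import numpy as np

# Exact check that E[C1(x) + C2(x - C1(x))] = x when C2 is unbiased.
x = np.array([3.0, -1.0, 0.5, 2.0])
d = x.size

def top1(v):
    # deterministic, biased, contractive compressor
    out = np.zeros_like(v)
    i = np.argmax(np.abs(v))
    out[i] = v[i]
    return out

c1 = top1(x)
err = x - c1                     # compression error handed to C2

# Enumerate the d equally likely outcomes of scaled Rand-1 applied to err.
mean_c2 = np.zeros(d)
for i in range(d):
    out = np.zeros(d)
    out[i] = d * err[i]          # scaling by d makes Rand-1 unbiased
    mean_c2 += out / d

induced_mean = c1 + mean_c2      # equals E[C1(x) + C2(x - C1(x))]
```

Since the enumeration of C₂ is exact, `induced_mean` recovers x exactly, mirroring the tower-property argument above.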
C.5 PROOF OF THEOREM 5

Similarly to the proof of Theorem 2, we use the update of Algorithm 1 to bound the following quantity:

E[‖xᵏ⁺¹ − x*‖² | xᵏ] = ‖xᵏ − x*‖² − 2ηᵏ E[⟨Σ_{i∈Sᵏ} (1/(npᵢ))Cᵏ(gᵢᵏ), xᵏ − x*⟩ | xᵏ] + (ηᵏ)² E[‖Σ_{i∈Sᵏ} (1/(npᵢ))Cᵏ(gᵢᵏ)‖² | xᵏ]

(2)+(5): ≤ ‖xᵏ − x*‖² − 2ηᵏ⟨∇f(xᵏ), xᵏ − x*⟩ + (ηᵏ)² (E[‖Σ_{i∈Sᵏ} (1/(npᵢ))Cᵏ(gᵢᵏ) − (1/n) Σᵢ₌₁ⁿ Cᵏ(gᵢᵏ)‖² | xᵏ] + E[‖(1/n) Σᵢ₌₁ⁿ Cᵏ(gᵢᵏ)‖² | xᵏ])

(2)+(5)+(9): ≤ ‖xᵏ − x*‖² − 2ηᵏ⟨∇f(xᵏ), xᵏ − x*⟩ + (ηᵏ/n)² E[Σᵢ₌₁ⁿ (δvᵢ/pᵢ + δ − 1)‖gᵢᵏ‖² + ‖Σᵢ₌₁ⁿ gᵢᵏ‖² | xᵏ]

(4)+(6)+(7): ≤ (1 − µηᵏ)‖xᵏ − x*‖² − 2ηᵏ(1 − ηᵏδ_S L)(f(xᵏ) − f*) + (ηᵏ)²[(δ_S − 1)D + (1 + a_S)δσ²/n].

Taking full expectation and using ηᵏ ≤ 1/(2δ_S L), we obtain

E‖xᵏ⁺¹ − x*‖² ≤ (1 − µηᵏ)E‖xᵏ − x*‖² − ηᵏ E[f(xᵏ) − f*] + (ηᵏ)²[(δ_S − 1)D + (1 + a_S)δσ²/n].

The rest of the analysis is identical to the proof of Theorem 2, with the only difference being c = (δ_S − 1)D + (1 + a_S)δσ²/n.
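The estimator X = Σ_{i∈S} ζᵢ/(npᵢ) underlying Lemma 4 and the proof of Theorem 5 can be verified to be unbiased by exact enumeration. The sketch below uses one proper sampling, independent (Bernoulli) participation, with made-up probabilities and vectors, and sums over all 2ⁿ participation patterns.

```python
import numpy as np
from itertools import product

# Exact check that X = sum_{i in S} zeta_i / (n p_i) has mean zeta_bar
# when node i joins S independently with probability p_i.
zeta = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]])   # n = 3 vectors in R^2
p = np.array([0.9, 0.5, 0.25])                           # participation probabilities
n = len(p)

mean_X = np.zeros(2)
for mask in product([0, 1], repeat=n):                   # all 2^n patterns of S
    prob = np.prod([p[i] if mask[i] else 1 - p[i] for i in range(n)])
    X = sum(zeta[i] / (n * p[i]) for i in range(n) if mask[i])
    mean_X += prob * X
```

Since P(i ∈ S) = pᵢ, the weighting 1/(npᵢ) cancels the participation probability and `mean_X` equals ζ̄ = (1/n)Σᵢζᵢ, as in the computation of E[X] above.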



Figure 1: Comparison of Top-1 (+ EF) and NU Rand-1 on Example 1 from Beznosikov et al. (2020).

(Lemma 1, Horváth & Richtárik (2019)). Let ζ₁, ζ₂, …, ζₙ be vectors in R^d and let ζ̄ := (1/n) Σᵢ₌₁ⁿ ζᵢ be their average. Let S be a proper sampling. Then there exists v ∈ Rⁿ such

Figure 2: Algorithm 1 vs. Algorithm 2 on CIFAR10 with ResNet18 (bottom), VGG11 (top) and TernGrad as a compression.

Figure 3: Comparison of different sparsification techniques with and without the use of Error Feedback on CIFAR10 with ResNet18 (top) and VGG11 (bottom). K = 5% · d; for the Induced compressor, C₁ is Top-K/2 and C₂ is Rand-K/2 (Wangni et al.).

Figure 4: Algorithm 1 vs. Algorithm 2 on CIFAR10 with ResNet18 (bottom), VGG11 (top) and TernGrad as a compression.


