A BETTER ALTERNATIVE TO ERROR FEEDBACK FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

Abstract

Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information (e.g., stochastic gradients) across the workers. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors which are not unbiased, such as Top-K or PowerSGD. In this paper, we propose a new alternative to EF for dealing with contractive compressors which is better both in theory and in practice. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings.

1. INTRODUCTION

We consider distributed optimization problems of the form
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$$
where $x \in \mathbb{R}^d$ represents the weights of a statistical model we wish to train, $n$ is the number of nodes, and $f_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth differentiable loss function composed of data stored on worker $i$. In a classical distributed machine learning scenario, $f_i(x) := \mathbb{E}_{\zeta \sim D_i}[f_\zeta(x)]$ is the expected loss of model $x$ with respect to the local data distribution $D_i$, and $f_\zeta : \mathbb{R}^d \to \mathbb{R}$ is the loss on the single data point $\zeta$. This definition allows for different distributions $D_1, \ldots, D_n$ on each node, which means that the functions $f_1, \ldots, f_n$ can have different minimizers. This framework covers Stochastic Optimization when either $n = 1$ or all $D_i$ are identical, Empirical Risk Minimization (ERM) when $f_i(x)$ can be expressed as a finite average, i.e., $f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} f_{ij}(x)$ for some $f_{ij} : \mathbb{R}^d \to \mathbb{R}$, and Federated Learning (FL) (Kairouz et al., 2019), where each node represents a client.

Communication Bottleneck. In distributed training, model updates (or gradient vectors) have to be exchanged in each iteration. Due to the size of the communicated messages for commonly considered deep models (Alistarh et al., 2016), this represents a significant bottleneck of the whole optimization procedure. To reduce the amount of data that has to be transmitted, several strategies were proposed. One of the most popular strategies is to incorporate local steps and communicate updates only every few iterations (Stich, 2019a; Lin et al., 2018a; Stich & Karimireddy, 2020; Karimireddy et al., 2019a; Khaled et al., 2020). Unfortunately, despite their practical success, local methods are poorly understood and their theoretical foundations are currently lacking: almost all existing error guarantees are dominated by a simple baseline, minibatch SGD (Woodworth et al., 2020). In this work, we focus on another popular approach: gradient compression.
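As a toy illustration of this heterogeneous setting (our own sketch, not code from the paper), the following builds $n$ quadratic local losses $f_i(x) = \frac{1}{2}\|x - b_i\|^2$ with different minimizers $b_i$ and runs uncompressed distributed gradient descent on their average; the iterates converge to the mean of the $b_i$, the minimizer of $f$, even though no single worker's minimizer coincides with it.

```python
import numpy as np

# Toy instance of min_x f(x) = (1/n) sum_i f_i(x) with heterogeneous data:
# f_i(x) = 0.5 * ||x - b_i||^2 has minimizer b_i, but the minimizer of the
# average f is the mean of the b_i. (The targets b_i are illustrative.)
rng = np.random.default_rng(0)
n, d = 4, 3
b = rng.normal(size=(n, d))          # worker i's local "data" is b[i]

def grad_i(x, i):
    """Local gradient of f_i at x."""
    return x - b[i]

x = np.zeros(d)
for _ in range(200):                 # uncompressed distributed GD
    g = np.mean([grad_i(x, i) for i in range(n)], axis=0)
    x -= 0.5 * g                     # server step with the averaged gradient
# x is now (numerically) the global minimizer mean(b_i)
```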
In this approach, instead of transmitting the full dimensional (gradient) vector $g \in \mathbb{R}^d$, one transmits a compressed vector $C(g)$, where $C : \mathbb{R}^d \to \mathbb{R}^d$ is a (possibly random) operator chosen such that $C(g)$ can be represented using fewer bits, for instance by using a limited bit representation (quantization) or by enforcing sparsity. A particularly popular class of quantization operators is based on random dithering (Goodall, 1951; Roberts, 1962); see (Alistarh et al., 2016; Wen et al., 2017; Zhang et al., 2017; Horváth et al., 2019a; Ramezani-Kebrya et al., 2019). Much sparser vectors can be obtained by random sparsification techniques that randomly mask the input vectors and only preserve a constant number of coordinates (Wangni et al., 2018; Konečný & Richtárik, 2018; Stich et al., 2018; Mishchenko et al., 2019b; Vogels et al., 2019). There is also a line of work (Horváth et al., 2019a; Basu et al., 2019) in which a combination of sparsification and quantization was proposed to achieve a more aggressive compression effect. We will not further distinguish between sparsification and quantization approaches, and refer to all of them as compression operators hereafter. Considering both practice and theory, compression operators can be split into two groups: biased and unbiased. For the unbiased compressors, $C(g)$ is required to be an unbiased estimator of the update $g$. Once this requirement is lifted, extra tricks are necessary for Distributed Compressed Stochastic Gradient Descent (DCSGD) (Alistarh et al., 2016; 2018; Khirirat et al., 2018) to work with such a compressor, even if the full gradient is computed by each node. Indeed, the naive approach can lead to exponential divergence (Beznosikov et al., 2020), and Error Feedback (EF) (Seide et al., 2014; Karimireddy et al., 2019b) is the only known mechanism able to remedy the situation.
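To make the biased/unbiased distinction concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of two canonical sparsifiers: Top-K, which is biased, and Rand-K, which is rescaled by $d/k$ so that it is unbiased.

```python
import numpy as np

def top_k(g, k):
    """Biased (contractive) compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]   # indices of the k largest |g_j|
    out[idx] = g[idx]
    return out

def rand_k(g, k, rng):
    """Unbiased compressor: keep k uniformly random entries, rescaled by d/k
    so that E[rand_k(g)] = g."""
    d = g.size
    out = np.zeros_like(g)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = g[idx] * (d / k)
    return out
```

Top-K has low (in fact zero) randomness but a systematic bias toward the large coordinates, while Rand-K is correct in expectation at the cost of higher variance; this trade-off is exactly what the biased/unbiased split above captures.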

Contributions. Our contributions can be summarized as follows:

• Induced Compressor. When used within the stabilizing EF framework, biased compressors (e.g., Top-K) can often achieve superior performance compared to their unbiased counterparts (e.g., Rand-K). This is often attributed to their low variance. However, despite ample research in this area, EF remains the only known mechanism that allows the use of these powerful biased compressors. Our key contribution is the development of a simple but remarkably effective alternative, and this is the only alternative we know of, which we argue leads to better and more versatile methods both in theory and practice. In particular, we propose a general construction that can transform any biased compressor, such as Top-K, into an unbiased one, for which we coin the name induced compressor (Section 3). Instead of using the desired biased compressor within EF, our proposal is to instead use the induced compressor within an appropriately chosen existing method designed for unbiased compressors, such as distributed compressed SGD (DCSGD) (Khirirat et al., 2018), variance-reduced DCSGD (DIANA) (Mishchenko et al., 2019a) or accelerated DIANA (ADIANA) (Li et al., 2020). While EF can be seen as a version of DCSGD which can work with biased compressors, neither variance-reduced nor accelerated variants of EF were known at the time of writing this paper.

• Better Theory for DCSGD. As a secondary contribution, we provide a new and tighter theoretical analysis of DCSGD under weaker assumptions. If $f$ is $\mu$-quasi convex (not necessarily convex) and the local functions $f_i$ are $(L, \sigma^2)$-smooth (a weaker version of $L$-smoothness with the strong growth condition), we obtain the rate
$$O\left(\delta_n L r^0 \exp\left[-\frac{\mu T}{4 \delta_n L}\right] + \frac{(\delta_n - 1) D + \delta \sigma^2 / n}{\mu T}\right),$$
where $\delta_n = 1 + \frac{\delta - 1}{n}$, $\delta \ge 1$ is the parameter which bounds the second moment of the compression operator, and $T$ is the number of iterations.
This rate has a linearly decreasing dependence on the number of nodes $n$, which is strictly better than the best-known rate for DCSGD with EF, whose convergence does not improve as the number of nodes increases; this is one of the main disadvantages of using EF. Moreover, EF requires extra assumptions. In addition, while the best-known rates for EF (Karimireddy et al., 2019b; Beznosikov et al., 2020) are expressed in terms of functional values, our theory guarantees convergence in both the iterates and the functional values. Another practical implication of our findings is a reduction of the memory requirements by half: in DCSGD one does not need to store the error vector.

• Partial Participation. We further extend our results to obtain the first convergence guarantee for partial participation with an arbitrary distribution over the nodes, which plays a key role in Federated Learning (FL).
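The induced-compressor idea from the first contribution can be sketched as follows. Assuming the construction compensates the residual of the biased compressor with an unbiased compressor applied on top (the helper names `top_k`, `rand_k`, and `induced` are ours), unbiasedness follows because the deterministic Top-K part is corrected in expectation by an unbiased compression of exactly what Top-K dropped.

```python
import numpy as np

def top_k(g, k):
    """Biased (contractive) compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def rand_k(g, k, rng):
    """Unbiased compressor: keep k random entries, rescaled by d/k so E[C(g)] = g."""
    out = np.zeros_like(g)
    idx = rng.choice(g.size, size=k, replace=False)
    out[idx] = g[idx] * (g.size / k)
    return out

def induced(g, k1, k2, rng):
    """Induced unbiased compressor built from biased Top-K:
    compress with Top-K, then add an unbiased compression of the residual,
    so that E[induced(g)] = top_k(g) + E[rand_k(g - top_k(g))] = g."""
    c1 = top_k(g, k1)
    return c1 + rand_k(g - c1, k2, rng)
```

The output of `induced` can then be fed to any method analyzed for unbiased compressors (DCSGD, DIANA, ADIANA), which is precisely the proposal above; note that, unlike EF, no error vector has to be stored between iterations.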

