EF21-P AND FRIENDS: IMPROVED THEORETICAL COMMUNICATION COMPLEXITY FOR DISTRIBUTED OPTIMIZATION WITH BIDIRECTIONAL COMPRESSION

Abstract

The starting point of this paper is the discovery of a novel and simple error-feedback mechanism, which we call EF21-P, for dealing with the error introduced by a contractive compressor. Unlike all prior works on error feedback, where compression and correction operate in the dual space of gradients, our mechanism operates in the primal space of models. While we believe that EF21-P may be of interest in many situations where it is advantageous to perform model perturbation prior to the computation of the gradient (e.g., randomized smoothing and generalization), in this work we focus our attention on its use as a key building block in the design of communication-efficient distributed optimization methods supporting bidirectional compression. In particular, we employ EF21-P as the mechanism for compressing and subsequently error-correcting the model broadcast by the server to the workers. By combining EF21-P with suitable methods performing workers-to-server compression, we obtain novel methods supporting bidirectional compression and enjoying new state-of-the-art theoretical communication complexity for convex and nonconvex problems. For example, our bounds are the first that manage to decouple the variance/error coming from the workers-to-server and server-to-workers compression, transforming a multiplicative dependence into an additive one. In the convex regime, we obtain the first bounds that match the theoretical communication complexity of gradient descent. Even in this convex regime, our algorithms work with biased gradient estimators, which is nonstandard and requires new proof techniques that may be of independent interest. Finally, our theoretical results are corroborated through suitable experiments.

1. INTRODUCTION: ERROR FEEDBACK IN THE PRIMAL SPACE

The key moment which ultimately enabled the main results of this paper was our discovery of a new and simple error-feedback technique, which we call EF21-P, that operates in the primal space of the iterates/models, in contrast to the prevalent approach to error feedback (Stich & Karimireddy, 2019; Karimireddy et al., 2019; Gorbunov et al., 2020b; Beznosikov et al., 2020; Richtárik et al., 2021), which operates in the dual space of gradients. To describe EF21-P, consider solving the optimization problem

\min_{x \in \mathbb{R}^d} f(x), \qquad (1)

where f : \mathbb{R}^d \to \mathbb{R} is a smooth but not necessarily convex function. Given a contractive compression operator \mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d, i.e., a (possibly) randomized mapping satisfying the inequality

\mathbb{E}\left[\|\mathcal{C}(x) - x\|^2\right] \le (1 - \alpha) \|x\|^2, \quad \forall x \in \mathbb{R}^d, \qquad (2)

for some constant \alpha \in (0, 1], our EF21-P method aims to solve (1) via the iterative process

x^{t+1} = x^t - \gamma \nabla f(w^t), \qquad w^{t+1} = w^t + \mathcal{C}^t(x^{t+1} - w^t), \qquad (3)

where \gamma > 0 is a stepsize, x^0 \in \mathbb{R}^d is the initial iterate, w^0 = x^0 \in \mathbb{R}^d is the initial iterate shift, and \mathcal{C}^t is an instantiation of a randomized contractive compressor satisfying (2), sampled at time t. Note that when \mathcal{C} is the identity mapping (\alpha = 1), then w^t = x^t for all t, and hence EF21-P reduces to vanilla gradient descent (GD). Otherwise, EF21-P is a new optimization method. The \{x^t\} iteration of EF21-P can be equivalently written in the form of perturbed gradient descent

x^{t+1} = x^t - \gamma \nabla f(x^t + \zeta^t), \qquad \zeta^t = \mathcal{C}^{t-1}(x^t - w^{t-1}) - (x^t - w^{t-1}).

Note that the model perturbation \zeta^t is not a zero-mean random variable, and that in view of (2), the size of the perturbation can be bounded via

\mathbb{E}\left[\|\zeta^t\|^2 \mid x^t, w^{t-1}\right] \le (1 - \alpha) \|x^t - w^{t-1}\|^2. \qquad (4)

From now on, we will write \mathcal{C} \in \mathbb{B}(\alpha) to mean that \mathcal{C} is a compressor satisfying (2).
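For concreteness, the Top-K sparsifier (keep the k largest-magnitude coordinates, zero out the rest) is a standard example of a contractive compressor: it satisfies (2) with \alpha = k/d, in fact deterministically, since the discarded d - k smallest squared coordinates account for at most a (d - k)/d fraction of \|x\|^2. The following minimal numerical check (our own illustrative helper, not code from the paper) verifies this:

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsifier: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]  # indices of the k largest-magnitude entries
    out[idx] = x[idx]
    return out

# Verify the contraction inequality (2): ||C(x) - x||^2 <= (1 - k/d) ||x||^2
rng = np.random.default_rng(0)
d, k = 100, 10
for _ in range(1000):
    x = rng.standard_normal(d)
    err = np.linalg.norm(top_k(x, k) - x) ** 2
    assert err <= (1 - k / d) * np.linalg.norm(x) ** 2 + 1e-12
print("Top-k satisfies (2) with alpha = k/d")
```

Note that for Top-K the inequality holds without the expectation, so Top-K is in \mathbb{B}(k/d) even as a deterministic mapping.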

1.1. EF21-P THEORY

If f is L-smooth and \mu-strongly convex, we prove that both x^t and w^t converge to x^* = \arg\min f at a linear rate, in O((L/(\alpha\mu)) \log(1/\varepsilon)) iterations in expectation (see Section D). Intuitively, this happens because the error-feedback mechanism embedded in EF21-P ensures that the quantity on the right-hand side of (4) converges to zero, which forces the size of the perturbation error \zeta^t to converge to zero as well. EF21-P can be analyzed in the smooth nonconvex regime as well, in which case it finds an \varepsilon-approximate stationary point. The precise convergence result and proof, as well as an extension that allows one to replace \nabla f(w^t) with a stochastic gradient under the general ABC inequality introduced by Khaled & Richtárik (2020) (which provably holds for various sources of stochasticity, including subsampling and gradient compression), can be found in Section E.
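The linear convergence described above can be observed on a toy strongly convex quadratic. The sketch below is our own illustrative implementation of the EF21-P iteration with a Top-K compressor and a conservative stepsize, not the authors' code:

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsifier, a contractive compressor with alpha = k/d."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef21_p(grad, x0, gamma, k, T):
    """EF21-P: x^{t+1} = x^t - gamma * grad f(w^t); w^{t+1} = w^t + C^t(x^{t+1} - w^t)."""
    x, w = x0.copy(), x0.copy()  # w^0 = x^0
    for _ in range(T):
        x = x - gamma * grad(w)      # gradient is taken at the compressed model estimate
        w = w + top_k(x - w, k)      # primal error-feedback update of the estimate
    return x

# Strongly convex quadratic f(x) = 0.5 * ||A x - b||^2
rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((50, d))
b = rng.standard_normal(50)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)  # smoothness constant

x = ef21_p(grad, np.zeros(d), gamma=0.1 / L, k=5, T=5000)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_star))  # distance to the minimizer; close to zero
```

Even though only 5 of the 20 coordinates of the model shift are kept per step, the iterates converge to the minimizer, consistent with the O((L/(\alpha\mu)) \log(1/\varepsilon)) rate.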

1.2. SUMMARY OF CONTRIBUTIONS

We believe that EF21-P and its analysis could be useful in various optimization and machine learning contexts in which some kind of iterate perturbation plays an important role, including randomized smoothing (Duchi et al., 2012), perturbed SGD (Vardhan & Stich, 2022), and generalization (Orvieto et al., 2022). In this work we do not venture into these potential application areas and instead focus all our attention on a single and important use case where, as we found out, EF21-P leads to new state-of-the-art methods and theory: the design of communication-efficient distributed optimization methods supporting bidirectional (i.e., workers-to-server and server-to-workers) compression. In particular, we use EF21-P as the mechanism for compressing and subsequently error-correcting the model broadcast by the server to the workers. By combining EF21-P with suitable methods ("friends" in the title of the paper) performing workers-to-server compression, in particular DIANA (Mishchenko et al., 2019; Horváth et al., 2022) or DCGD (Alistarh et al., 2017; Khirirat et al., 2018), we obtain novel methods, suggestively named EF21-P + DIANA (Algorithm 1) and EF21-P + DCGD (Algorithm 2), both supporting bidirectional compression, and both enjoying new state-of-the-art theoretical communication complexity for convex and nonconvex problems. While DIANA and DCGD were not designed to work with compressors from \mathbb{B}(\alpha) to compress the workers-to-server communication, and can in principle diverge if used that way, they work well with the smaller class of randomized compression mappings \mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d characterized by

\mathbb{E}\left[\mathcal{C}(x)\right] = x, \qquad \mathbb{E}\left[\|\mathcal{C}(x) - x\|^2\right] \le \omega \|x\|^2, \quad \forall x \in \mathbb{R}^d, \qquad (5)

where \omega \ge 0 is a constant. We will write \mathcal{C} \in \mathbb{U}(\omega) to mean that \mathcal{C} satisfies (5). It is well known that if \mathcal{C} \in \mathbb{U}(\omega), then \mathcal{C}/(\omega+1) \in \mathbb{B}(1/(\omega+1)), which means that the class \mathbb{U}(\omega) is indeed more narrow.

Convex setting.
EF21-P + DIANA provides new state-of-the-art convergence rates for distributed optimization tasks in the strongly convex (see Table 1) and general convex regimes. It is the first method supporting bidirectional compression whose server-to-workers and workers-to-server communication complexities are both better than those of vanilla GD. When the workers compute stochastic gradients (see Section 3.1), we prove that EF21-P + DIANA improves upon the rates of existing methods.
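To make the bidirectional pipeline concrete, here is a simplified single-process simulation in the spirit of EF21-P + DCGD (Algorithm 2): workers send unbiased Rand-K-compressed gradients (class \mathbb{U}(\omega) with \omega = d/k - 1) to the server, and the server maintains the broadcast model estimate via the EF21-P primal error-feedback with a Top-K compressor (class \mathbb{B}(\alpha)). This is our own illustrative sketch under strong simplifications (quadratic losses, synchronous rounds, a small conservative stepsize), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k(x, k):
    """Contractive compressor in B(k/d): used on the server-to-workers link."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k):
    """Unbiased compressor in U(d/k - 1): used on the workers-to-server link."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)  # rescaling makes the estimator unbiased
    return out

# n workers, each holding a local quadratic f_i(x) = 0.5 * ||A_i x - b_i||^2
n, d, rows = 10, 20, 10
As = [rng.standard_normal((rows, d)) for _ in range(n)]
bs = [rng.standard_normal(rows) for _ in range(n)]
grads = [lambda x, A=A, b=b: A.T @ (A @ x - b) for A, b in zip(As, bs)]
L = max(np.linalg.norm(A.T @ A, 2) for A in As)

gamma, k_up, k_down, T = 0.01 / L, 10, 5, 20000
x = np.zeros(d)   # server model
w = x.copy()      # model estimate shared by server and workers (w^0 = x^0)

g0 = np.linalg.norm(sum(g(x) for g in grads))  # initial gradient norm
for _ in range(T):
    # workers-to-server: each worker sends an unbiased compressed gradient at w^t
    g = sum(rand_k(gi(w), k_up) for gi in grads) / n
    x = x - gamma * g
    # server-to-workers: EF21-P update of the broadcast model estimate
    w = w + top_k(x - w, k_down)

print(np.linalg.norm(sum(g(x) for g in grads)) / g0)  # much smaller than 1
```

Because the uplink compressor is unbiased (U(\omega)) while the downlink uses contractive compression corrected by EF21-P (B(\alpha)), only compressed vectors ever cross the network in either direction, yet the gradient norm is driven down by several orders of magnitude on this toy problem.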



Our method is inspired by the recently proposed error-feedback mechanism EF21 of Richtárik et al. (2021), which compresses the dual vectors, i.e., the gradients. EF21 is currently the state-of-the-art error-feedback mechanism in terms of its theoretical properties and practical performance (Fatkhullin et al., 2021). If we wished to explicitly highlight its dual nature, we could instead meaningfully call their method EF21-D.

The perturbation \zeta^t has zero mean only in the non-interesting case when \mathcal{C}^{t-1} is the identity with probability 1.

