NATURAL COMPRESSION FOR DISTRIBUTED DEEP LEARNING

Abstract

Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: natural compression (C_nat). Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that compared to no compression, C_nat increases the second moment of the compressed vector by no more than the tiny factor 9/8, which means that the effect of C_nat on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by C_nat are substantial, leading to 3-4× improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize C_nat to natural dithering, which we prove is exponentially better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer a new state of the art both in theory and practice.

1. INTRODUCTION

Modern deep learning models (He et al., 2016) are almost invariably trained in parallel or distributed environments, which is necessitated by the enormous size of the data sets and the dimension and complexity of the models required to obtain state-of-the-art performance. In our work, the focus is on the data-parallel paradigm, in which the training data is split across several workers capable of operating in parallel (Bekkerman et al., 2011; Recht et al., 2011). Formally, we consider optimization problems of the form

min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),    (1)

where x ∈ R^d represents the parameters of the model, n is the number of workers, and f_i : R^d → R is a loss function composed of data stored on worker i. Typically, f_i is modeled as a function of the form f_i(x) := E_{ζ∼D_i}[f_ζ(x)], where D_i is the distribution of data stored on worker i, and f_ζ : R^d → R is the loss of model x on data point ζ. The distributions D_1, ..., D_n can be different on every node, which means that the functions f_1, ..., f_n may have different minimizers. This framework covers i) stochastic optimization when either n = 1 or all D_i are identical, and ii) empirical risk minimization when f_i(x) can be expressed as a finite average, i.e., f_i(x) = (1/m_i) Σ_{j=1}^{m_i} f_{ij}(x) for some f_{ij} : R^d → R.

Distributed Learning. Typically, problem (1) is solved by distributed stochastic gradient descent (SGD) (Robbins & Monro, 1951), which works as follows: stochastic gradients g_i(x^k) are computed locally and sent to a master node, which performs the update aggregation g^k = Σ_i g_i(x^k). The aggregated gradient g^k is sent back to the workers, and each performs a single step of SGD: x^{k+1} = x^k − (η^k/n) g^k, where η^k > 0 is a step size.
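The aggregation loop above can be illustrated with a minimal single-process simulation. This is an illustrative sketch only (synthetic quadratic losses and Gaussian gradient noise are our assumptions, not the paper's experimental setup):

```python
import numpy as np

# Minimal simulation of the distributed SGD loop described above (no
# compression yet): n workers hold quadratics f_i(x) = ||x - a_i||^2 / 2,
# so the minimizer of f = (1/n) sum_i f_i is the mean of the a_i.
rng = np.random.default_rng(0)
n, d, eta, T = 4, 10, 0.1, 500
a = rng.standard_normal((n, d))          # data defining each worker's loss
x = np.zeros(d)

for k in range(T):
    # each worker computes a stochastic gradient g_i(x) = x - a_i + noise
    grads = [(x - a[i]) + 0.01 * rng.standard_normal(d) for i in range(n)]
    g = np.sum(grads, axis=0)            # master aggregates: g^k = sum_i g_i(x^k)
    x = x - (eta / n) * g                # worker step: x^{k+1} = x^k - (eta/n) g^k

print(np.linalg.norm(x - a.mean(axis=0)))   # near zero
```

It is the communication of the dense vectors g_i(x^k) and g^k in every round of this loop that the compression operators of this paper target.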
A key bottleneck of the above algorithm, and of its many variants (e.g., variants utilizing minibatching (Goyal et al., 2017), importance sampling (Horváth & Richtárik, 2019), momentum (Nesterov, 2013), or variance reduction (Johnson & Zhang, 2013)), is the cost of communication of the typically dense gradient vector g_i(x^k), and in a parameter-server implementation with a master node, also the cost of broadcasting the aggregated gradient g^k. These are d-dimensional vectors of floats, with d being very large in modern deep learning. It is well known (Seide et al., 2014; Alistarh et al., 2017; Zhang et al., 2017; Lin et al., 2018; Lim et al., 2018) that in many practical applications with common computing architectures, communication takes much more time than computation, creating a bottleneck in the entire training system. Communication Reduction. Several solutions have been suggested in the literature as a remedy to this problem. In one strand of work, the issue is addressed by giving each worker "more work" to do, which results in a better communication-to-computation ratio. For example, one may use minibatching to construct more powerful gradient estimators (Goyal et al., 2017), define local problems for each worker to be solved by a more advanced local solver (Shamir et al., 2014; Richtárik & Takáč, 2016; Reddi et al., 2016), or reduce communication frequency (e.g., by communicating only once (McDonald et al., 2009; Zinkevich et al., 2010) or once every few iterations (Stich, 2018)). An orthogonal approach aims to reduce the size of the communicated vectors instead (Seide et al., 2014; Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018; Hubara et al., 2017) using various lossy (and often randomized) compression mechanisms, commonly known in the literature as quantization techniques.
In their most basic form, these schemes decrease the number of bits used to represent the floating point numbers forming the communicated d-dimensional vectors (Gupta et al., 2015; Na et al., 2017), thus reducing the size of the communicated message by a constant factor. Another possibility is to apply randomized sparsification masks to the gradients (Suresh et al., 2017; Konečný & Richtárik, 2018; Alistarh et al., 2018; Stich et al., 2018), or to rely on coordinate/block descent update rules, which are sparse by design (Fercoq et al., 2014). One of the most important considerations in the area of compression operators is the compression-variance trade-off (Konečný & Richtárik, 2018; Alistarh et al., 2017; Horváth et al., 2019). For instance, while random dithering approaches attain up to O(d^{1/2}) compression (Seide et al., 2014; Alistarh et al., 2017; Wen et al., 2017), the most aggressive schemes reach O(d) compression by sending only a constant number of bits per iteration (Suresh et al., 2017; Konečný & Richtárik, 2018; Alistarh et al., 2018; Stich et al., 2018). However, the more compression is applied, the more information is lost, and the more the quantized vector differs from the original vector we want to communicate, increasing its statistical variance. Higher variance implies slower convergence (Alistarh et al., 2017; Mishchenko et al., 2019), i.e., more communication rounds. So, ultimately, compression approaches offer a trade-off between the communication cost per iteration and the number of communication rounds. Outside of optimization for machine learning, compression operators are also very relevant to optimal quantization theory and control theory (Elia & Mitter, 2001; Sun & Goyal, 2011; Sun et al., 2012).

Summary of Contributions.

The key contributions of this work are the following:

• New compression operators. We construct a new "natural compression" operator (C_nat; see Sec. 2) based on a randomized rounding scheme in which each float of the compressed vector is rounded to a (positive or negative) power of 2. This compression has a provably small variance, at most 1/8 (see Thm 1), which implies that the theoretical convergence results of SGD-type methods are essentially unaffected (see Thm 6). At the same time, substantial savings are obtained in the amount of communicated bits per iteration (3.56× less for float32 and 5.82× less for float64). In addition, we utilize these insights to develop a new random dithering operator, natural dithering (D^{p,s}_nat; see Sec. 3), which is exponentially better than the very popular "standard" random dithering operator (see Thm 5). We remark that C_nat and the identity operator arise as limits of D^{p,s}_nat and D^{p,s}_sta as s → ∞, respectively. Importantly, our new compression techniques can be combined with existing compression and sparsification operators for a more dramatic effect, as we argued before.

• State-of-the-art compression. When compared to previous state-of-the-art compressors such as (any variant of) sparsification and dithering, techniques used in methods such as Deep Gradient Compression (Lin et al., 2018), QSGD (Alistarh et al., 2017) and TernGrad (Wen et al., 2017), our compression operators offer provable and often large improvements in practice, thus leading to a new state of the art. In particular, given a budget on the second moment ω + 1 (see Eq (3)) of a compression operator, which is the main factor influencing the increase in the number of communications when communication compression is applied compared to no compression, our compression operators offer the largest compression factor, resulting in the fewest bits transmitted (see Fig 1).

(Figure 1: communication budget vs. second moment for several state-of-the-art compressors applied to a gradient of size d = 10^6. Our methods (C_nat and D^{p,s}_nat) are depicted with a square marker. For any fixed communication budget, natural dithering offers an exponential improvement on standard dithering, and when used in composition with sparsification, it offers an order of magnitude improvement.)

• Lightweight & simple low-level implementation. We show that, apart from a randomization procedure (which is inherent in all unbiased compression operators), natural compression is computation-free. Indeed, natural compression essentially amounts to trimming the mantissa and possibly increasing the exponent by one. This is the first compression mechanism with such a "natural" compatibility with binary floating point types.

• Proof-of-concept system with in-network aggregation (INA). The recently proposed SwitchML (Sapio et al., 2019) system alleviates the communication bottleneck via in-network aggregation (INA) of gradients. Since current programmable network switches are only capable of adding integers, new update compression methods are needed which can supply outputs in an integer format. Our natural compression mechanism is the first that is provably able to operate in the SwitchML framework, as it communicates integers only: the sign, plus the bits forming the exponent of a float. Moreover, having bounded (and small) variance, it is compatible with existing distributed training methods.

• Bidirectional compression for SGD. We provide convergence theory for distributed SGD which allows for compression both at the worker and at the master side (see Algorithm 1). The compression operators compatible with our theory form a large family (operators C ∈ B(ω) for some finite ω ≥ 0; see Definition 2). This enables safe experimentation with existing compression operators and facilitates the development of new ones, fine-tuned to specific deep learning model architectures. Our convergence result (Thm 6) applies to smooth and non-convex functions, and our rates predict a linear speed-up with respect to the number of machines.

• Better total complexity. Most importantly, we are the first to prove that the increase in the number of iterations caused by (a carefully designed) compression is more than compensated by the savings in communication, which leads to an overall provable speedup in training time. See Thm 6, the discussion following the theorem, and Table 1 for more details. To the best of our knowledge, standard dithering (QSGD (Alistarh et al., 2017)) is the only previously known compression technique able to achieve this with our distributed SGD with bidirectional compression. Importantly, our natural dithering is exponentially better than standard dithering, and hence provides state-of-the-art performance in connection with Algorithm 1.

• Experiments. We show that C_nat significantly reduces the training time compared to no compression. We provide empirical evidence in the form of scaling experiments, showing that C_nat does not hurt convergence as the number of workers grows. We also show that popular compression methods such as random sparsification and random dithering are enhanced by combination with natural compression or natural dithering (see Appendix A). The combined compression technique reduces the number of communication rounds without any noticeable impact on convergence, providing a solution of the same quality.

2. NATURAL COMPRESSION

We define a new (randomized) compression technique, which we call natural compression. This is fundamentally a function mapping t ∈ R to a random variable C_nat(t) ∈ R. In case of vectors x = (x_1, ..., x_d) ∈ R^d we apply it in an element-wise fashion: (C_nat(x))_i = C_nat(x_i). Natural compression C_nat performs a randomized logarithmic rounding of its input t ∈ R. Given nonzero t, let α ∈ R be such that |t| = 2^α (i.e., α = log_2 |t|). Then 2^⌊α⌋ ≤ |t| = 2^α ≤ 2^⌈α⌉, and we round t to either sign(t)·2^⌊α⌋ or to sign(t)·2^⌈α⌉. When t = 0, we set C_nat(0) = 0. The probabilities are chosen so that C_nat(t) is an unbiased estimator of t, i.e., E[C_nat(t)] = t for all t. For instance, t = −2.75 will be rounded to either −4 or −2 (since −2^2 ≤ −2.75 ≤ −2^1), and t = 0.75 will be rounded to either 1/2 or 1 (since 2^{−1} ≤ 0.75 ≤ 2^0). As a consequence, if t is an integer power of 2, then C_nat leaves t unchanged; see Fig. 2.

Definition 1 (Natural compression). Natural compression is a random function C_nat : R → R defined as follows. We set C_nat(0) = 0. If t ≠ 0, we let

C_nat(t) := sign(t) · 2^⌊log_2 |t|⌋ with probability p(t), and sign(t) · 2^⌈log_2 |t|⌉ with probability 1 − p(t),    (2)

where p(t) := (2^⌈log_2 |t|⌉ − |t|) / 2^⌊log_2 |t|⌋. Alternatively, (2) can be written as C_nat(t) = sign(t) · 2^⌊log_2 |t|⌋ (1 + λ(t)), where λ(t) ∼ Bernoulli(1 − p(t)); that is, λ(t) = 1 with probability 1 − p(t) and λ(t) = 0 with probability p(t).

The key properties of any (unbiased) compression operator are its variance, ease of implementation, and compression level. In the rest of this section, we characterize the remarkably low variance of C_nat, describe an (almost) effortless and natural implementation, and quantify the compression it offers. C_nat has a negligible variance: ω = 1/8. We identify natural compression as belonging to a large class of unbiased compression operators with a bounded second moment (Jiang & Agrawal, 2018; Khirirat et al., 2018; Horváth et al., 2019), defined below.
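Before formalizing that class, the rounding rule of Definition 1 can be sketched in NumPy. This is a vectorized illustration with names of our choosing; the bit-level implementation discussed later in this section avoids the logarithms entirely:

```python
import numpy as np

def natural_compression(x, rng):
    """C_nat applied elementwise (Definition 1): each nonzero entry t is
    rounded to sign(t) * 2^floor(log2|t|) with probability
    p(t) = (2^ceil(log2|t|) - |t|) / 2^floor(log2|t|),
    and to sign(t) * 2^ceil(log2|t|) otherwise, so that E[C_nat(t)] = t."""
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    t = x[nz]
    lo = 2.0 ** np.floor(np.log2(np.abs(t)))   # 2^floor(log2|t|)
    hi = 2.0 * lo                              # 2^ceil(log2|t|) unless |t| = lo
    p = (hi - np.abs(t)) / lo                  # p = 1 when |t| is a power of two
    down = rng.random(t.shape) < p
    out[nz] = np.sign(t) * np.where(down, lo, hi)
    return out
```

For example, −2.75 is mapped to −2 with probability 0.625 and to −4 with probability 0.375, so its mean is exactly −2.75; powers of two (and zero) pass through unchanged.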
Definition 2 (Compression operators). A function C : R^d → R^d mapping a deterministic input to a random vector is called a compression operator (on R^d). We say that C is unbiased with second moment bounded by ω ≥ 0 if

E[C(x)] = x,   E[‖C(x)‖²] ≤ (ω + 1)‖x‖²   for all x ∈ R^d.    (3)

If C satisfies (3), we will write C ∈ B(ω). Note that ω = 0 implies C(x) = x almost surely. It is easy to see that the variance of C ∈ B(ω) is bounded as E[‖C(x) − x‖²] ≤ ω‖x‖². If this holds, we say that "C has variance ω". The importance of B(ω) stems from two observations. First, operators from this class are known to be compatible with several optimization algorithms (Khirirat et al., 2018; Horváth et al., 2019). Second, this class includes most compression operators used in practice (Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018; Mishchenko et al., 2019). In general, the larger ω is, the higher the compression level that might be achievable, and the worse the impact compression has on convergence speed. The main result of this section says that the natural compression operator C_nat has variance 1/8.

Theorem 1. C_nat ∈ B(1/8).

Consider now an unbiased randomized rounding operator similar to C_nat, but one that rounds to one of the nearest integers (as opposed to integer powers of 2). We call it C_int. At first sight, this may seem like a reasonable alternative to C_nat. However, as we show next, C_int does not have a finite second moment and is hence incompatible with existing optimization methods.

Theorem 2. There is no ω ≥ 0 such that C_int ∈ B(ω).

From 32 to 9 bits, with lightning speed. We now explain why performing natural compression of a real number in a binary floating point format is computationally cheap. In particular, excluding the randomization step, C_nat amounts to simply dispensing with the mantissa in the binary representation. The most common computer format for real numbers, binary32 (resp.
binary64) of the IEEE 754 standard, represents each number with 32 (resp. 64) bits, where the first bit represents the sign, 8 (resp. 11) bits are used for the exponent, and the remaining 23 (resp. 52) bits are used for the mantissa. A scalar t ∈ R is represented in the form (s, e_7, e_6, ..., e_0, m_1, m_2, ..., m_23), where s, e_i, m_j ∈ {0, 1} are bits, via the relationship t = (−1)^s × 2^{e−127} × (1 + m), where e = Σ_{i=0}^{7} e_i 2^i and m = Σ_{j=1}^{23} m_j 2^{−j}. For example, t = −2.75 has s = 1, e = 128 and m = 2^{−2} + 2^{−3}, since t = (−1)^1 × 2^{128−127} × (1 + 2^{−2} + 2^{−3}) = −2.75. It is clear that 0 ≤ m < 1, and hence 2^{e−127} ≤ |t| < 2^{e−126}. Moreover, p(t) = (2^{e−126} − |t|) / 2^{e−127} = 2 − |t|·2^{127−e} = 1 − m. Hence, natural compression of t represented in binary32 is given as follows:

C_nat(t) = (−1)^s × 2^{e−127} with probability 1 − m, and (−1)^s × 2^{e−126} with probability m.

Observe that (−1)^s × 2^{e−127} is obtained from t by setting the mantissa m to zero and keeping both the sign s and exponent e unchanged. Similarly, (−1)^s × 2^{e−126} is obtained from t by setting the mantissa m to zero, keeping the sign s, and increasing the exponent by one. Hence, both values can be computed from t essentially without any computation.

Communication savings. In summary, in the case of binary32, the output C_nat(t) of natural compression is encoded using the 8 bits of the exponent and an extra bit for the sign. This is 3.56× less communication. In the case of binary64, we only need the 11 bits of the exponent and 1 bit for the sign, which is 5.82× less communication.

Compatibility with other compression techniques. We start with a simple but useful observation about the composition of compression operators.

Theorem 3. If C_1 ∈ B(ω_1) and C_2 ∈ B(ω_2), then C_1 ∘ C_2 ∈ B(ω_12), where ω_12 = ω_1 ω_2 + ω_1 + ω_2, and C_1 ∘ C_2 is the composition defined by (C_1 ∘ C_2)(x) = C_1(C_2(x)). Combining this result with Thm 1, we observe that for any C ∈ B(ω), we have C_nat ∘ C ∈ B(9ω/8 + 1/8).
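Thm 3 in action: composing C_nat with random-q sparsification S_q (kept coordinates scaled by d/q, so S_q ∈ B(d/q − 1)) yields an operator in B(9d/(8q) − 1), since ω_12 = (1/8)(d/q − 1) + 1/8 + (d/q − 1) = 9d/(8q) − 1. A self-contained sketch under these assumptions (S_q here is our own illustrative implementation of the sparsifier):

```python
import numpy as np

def c_nat(x, rng):
    """Elementwise natural compression: unbiased rounding to powers of two."""
    out = np.zeros_like(x)
    nz = x != 0
    t = x[nz]
    lo = 2.0 ** np.floor(np.log2(np.abs(t)))    # 2^floor(log2|t|)
    p = (2 * lo - np.abs(t)) / lo               # prob. of rounding down
    out[nz] = np.sign(t) * np.where(rng.random(t.shape) < p, lo, 2 * lo)
    return out

def rand_sparsify(x, q, rng):
    """Random-q sparsification S_q: unbiased, with S_q in B(d/q - 1)."""
    d = x.size
    keep = rng.choice(d, size=q, replace=False)
    out = np.zeros_like(x)
    out[keep] = (d / q) * x[keep]
    return out

def c_nat_of_sparsify(x, q, rng):
    """C_nat o S_q: by Thm 3 this composition lies in B(9d/(8q) - 1)."""
    return c_nat(rand_sparsify(x, q, rng), rng)
```

The output is both sparse (at most q nonzeros) and made of signed powers of two, so it is cheap to encode and remains integer-friendly for in-network aggregation.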
Since C_nat offers substantial communication savings with only a negligible effect on the variance of C, a key use for natural compression, beyond applying it as the sole compression strategy, is to deploy it with other effective techniques as a final compression mechanism (e.g., with sparsifiers (Stich et al., 2018)), boosting the performance of the system even further. Moreover, our technique is also useful as a post-compression mechanism for compressions that do not belong to B(ω) (e.g., the TopK sparsifier (Alistarh et al., 2018)). The same comments apply to the natural dithering operator D^{p,s}_nat, defined in the next section.
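To make the mantissa-trimming description from this section concrete, here is a bit-level sketch for a single binary32 value. This is our illustrative code, not the authors' implementation; u plays the role of the uniform random sample, and overflow of the maximal exponent is ignored:

```python
import struct

def natural_compression_binary32(t: float, u: float) -> float:
    """Bit-level C_nat for one binary32 float, given a uniform u in [0, 1).

    Zeroing the 23 mantissa bits gives (-1)^s * 2^(e-127); additionally
    adding 1 to the exponent field gives (-1)^s * 2^(e-126). No arithmetic
    on the value itself is needed, only bit masking."""
    if t == 0.0:
        return 0.0
    bits = struct.unpack("<I", struct.pack("<f", t))[0]
    mantissa = bits & 0x007FFFFF            # low 23 bits: m scaled by 2^23
    down = bits & 0xFF800000                # sign + exponent, mantissa zeroed
    m = mantissa / (1 << 23)                # p(round down) = 1 - m
    out_bits = down if u < 1.0 - m else down + (1 << 23)  # bump exponent by 1
    return struct.unpack("<f", struct.pack("<I", out_bits))[0]
```

For t = −2.75 (s = 1, e = 128, m = 0.375), this returns −2.0 when u < 0.625 and −4.0 otherwise, matching the worked example in the text.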

3. NATURAL DITHERING

Motivated by the natural compression introduced in Sec 2, here we propose a new random dithering operator which we call natural dithering. However, it will be useful to introduce a more general dithering operator, one generalizing both the natural and the standard dithering operators. For 1 ≤ p ≤ +∞, let ‖x‖_p be the p-norm: ‖x‖_p := (Σ_i |x_i|^p)^{1/p}.

Definition 3 (General dithering). The general dithering operator with respect to the p-norm and with s levels 0 = l_s < l_{s−1} < l_{s−2} < ... < l_1 < l_0 = 1, denoted D^{C,p,s}_gen, is defined as follows. Let x ∈ R^d. If x = 0, we let D^{C,p,s}_gen(x) = 0. If x ≠ 0, we let y_i := |x_i| / ‖x‖_p for all i ∈ [d]. Assuming l_{u+1} ≤ y_i ≤ l_u for some u ∈ {0, 1, ..., s − 1}, we let

(D^{C,p,s}_gen(x))_i = C(‖x‖_p) × sign(x_i) × ξ(y_i),

where C ∈ B(ω) for some ω ≥ 0 and ξ(y_i) is a random variable equal to l_u with probability (y_i − l_{u+1})/(l_u − l_{u+1}), and to l_{u+1} with probability (l_u − y_i)/(l_u − l_{u+1}). Note that E[ξ(y_i)] = y_i.

Standard (random) dithering, D^{p,s}_sta (Goodall, 1951; Roberts, 1962), is obtained as a special case of general dithering (which is also novel) for a linear partition of the unit interval, l_{s−1} = 1/s, l_{s−2} = 2/s, ..., l_1 = (s−1)/s, and C equal to the identity operator. The D^{2,s}_sta operator was used in QSGD (Alistarh et al., 2017) and D^{∞,1}_sta in TernGrad (Wen et al., 2017). Natural dithering, a novel compression operator introduced in this paper, arises as a special case of general dithering for C being the identity operator and a binary geometric partition of the unit interval: l_{s−1} = 2^{1−s}, l_{s−2} = 2^{2−s}, ..., l_1 = 2^{−1}. For the INA application, we apply C = C_nat so that the output is always a power of 2, which introduces an extra factor of 9/8 in the second moment. A comparison of the ξ operators for standard and natural dithering with s = 3 levels applied to t = 3/8 can be found in Fig 3.
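A sketch of D^{p,s}_nat as just defined (C = identity; a loop form for clarity, with names of our choosing):

```python
import numpy as np

def natural_dithering(x, s, p=2, rng=None):
    """Natural dithering D^{p,s}_nat: general dithering (Def. 3) with
    C = identity and binary geometric levels 1, 1/2, ..., 2^(1-s), 0."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, ord=p)
    if norm == 0:
        return np.zeros_like(x)
    # levels l_0 = 1 > l_1 = 1/2 > ... > l_{s-1} = 2^(1-s) > l_s = 0
    levels = np.array([2.0 ** -k for k in range(s)] + [0.0])
    y = np.abs(x) / norm
    out = np.empty_like(x)
    for i, yi in enumerate(y):
        u = int(np.searchsorted(-levels, -yi)) - 1   # l_{u+1} <= y_i <= l_u
        u = min(max(u, 0), s - 1)
        lu, lu1 = levels[u], levels[u + 1]
        prob_up = (yi - lu1) / (lu - lu1)            # P(xi = l_u); E[xi] = y_i
        out[i] = lu if rng.random() < prob_up else lu1
    return norm * np.sign(x) * out
```

With s = 3 and p = ∞, a normalized entry y_i = 3/8 lands between l_2 = 1/4 and l_1 = 1/2 and is rounded to each with probability 1/2, as in the Fig 3 example.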
Table 1: The overall speedup of distributed SGD with compression on the workers via C_Wi over a Baseline variant without compression. Speed is measured by multiplying the number of communication rounds (i.e., iterations T(ω_W)) by the number of bits sent from worker i to master (W_i → M) per iteration. We neglect the M → W_i communication, as in practice it is often much faster (see e.g. Mishchenko et al. (2019); for other cost/speed models, see Appendix D.7). We assume binary32 representation. The relative number of iterations sufficient to guarantee ε-optimality is T(ω_W) := (ω_W + 1)^θ, where θ ∈ (0, 1] (see Thm 6). Note that in the big-n regime the iteration bound T(ω_W) is better due to θ ≈ 0 (however, this is not very practical as n is usually small), while for small n we have θ ≈ 1. For dithering, r = min{p, 2} and κ = min{1, √d 2^{1−s}}. The lower bound on the speedup factor is obtained for θ = 1, and the upper bound for θ = 0. The speedup factors, T(0)·32d / (T(ω_W)·#Bits), were calculated for d = 10^6, q = 0.1d (10% sparsity), p = 2, and the optimal choice of s with respect to speedup.

Approach       | C_Wi              | No. iterations T(ω_W) = O((ω_W+1)^θ) | Bits per iter (W_i → M) | Speedup factor
Baseline       | identity          | 1                                    | 32d                     | 1×
New            | C_nat             | (9/8)^θ                              | 9d                      | 3.2×-3.6×
Sparsification | S_q               | (d/q)^θ                              | (33 + log_2 d) q        | 0.6×-6.0×
New            | C_nat ∘ S_q       | (9d/(8q))^θ                          | (10 + log_2 d) q        | 1.0×-10.7×
Dithering      | D^{p,2^{s−1}}_sta | (1 + κ d^{1/r} 2^{1−s})^θ            | 31 + d(2 + s)           | 1.8×-15.9×
New            | D^{p,s}_nat       | (9/8 + κ d^{1/r} 2^{1−s})^θ          | 31 + d(2 + log_2 s)     | 4.1×-16.0×

When D^{C,p,s}_gen is used to compress gradients, each worker communicates the norm (1 float), the vector of signs (d bits), and an efficient encoding of the effective levels for each entry i = 1, 2, ..., d. Note that D^{p,s}_nat is essentially an application of C_nat to all normalized entries of x, with two differences: i) we can also communicate the compressed norm ‖x‖_p; ii) in C_nat the interval [0, 2^{1−s}] is subdivided further, to machine precision, and in this
sense D^{p,s}_nat can be seen as a limited-precision variant of C_nat. As is the case with C_nat, the mantissa is ignored, and one communicates exponents only. The norm compression is particularly useful on the master side, since multiplication by a naturally compressed norm amounts to a summation of exponents. The main result of this section establishes that natural dithering belongs to the class B(ω):

Theorem 4. D^{p,s}_nat ∈ B(ω), where ω = 1/8 + d^{1/r} 2^{1−s} min{1, d^{1/r} 2^{1−s}} and r = min{p, 2}.

To illustrate the strength of this result, we now compare natural dithering D^{p,s}_nat to standard dithering D^{p,s}_sta and show that natural dithering is exponentially better. In particular, for the same level of variance, D^{p,s}_nat uses only s levels while D^{p,u}_sta uses u = 2^{s−1} levels. Note also that the levels used by D^{p,s}_nat form a subset of the levels used by D^{p,u}_sta (see Fig 22). We also confirm this empirically (see Appendix A.4).

Theorem 5. Fixing s, natural dithering D^{p,s}_nat has O(2^{s−1}/s) times smaller variance than standard dithering D^{p,s}_sta. Fixing ω, if u = 2^{s−1}, then D^{p,u}_sta ∈ B(ω) implies D^{p,s}_nat ∈ B(9/8 (ω + 1) − 1).

4. DISTRIBUTED SGD

There are several stochastic gradient-type methods (Robbins & Monro, 1951; Bubeck et al., 2015; Ghadimi & Lan, 2013; Mishchenko et al., 2019) for solving (1) that are compatible with compression operators C ∈ B(ω), and hence also with our natural compression (C_nat) and natural dithering (D^{p,s}_nat) techniques. However, as none of them support compression at the master node, we propose a distributed SGD algorithm that allows for bidirectional compression (Algorithm 1 in Appendix D.1). We note that there are two papers concurrent to ours (all appeared online in the same month and year) proposing the use of bidirectional compression, albeit in conjunction with different underlying algorithms, such as SGD with error feedback or local updates (Tang et al., 2019; Zheng et al., 2019). Since we instead focus on vanilla distributed SGD with bidirectional compression, the algorithmic part of our paper is complementary to theirs. Moreover, our key contribution, the highly efficient natural compression and dithering compressors, can be used within their algorithms as well, which expands their impact further. We assume repeated access to unbiased stochastic gradients g_i(x^k) with bounded variance σ_i² for every worker i. We also assume node similarity represented by constants ζ_i², and that f is L-smooth (its gradient is L-Lipschitz). Formal definitions as well as a detailed explanation of Algorithm 1 can be found in Appendix D. We denote ζ² = (1/n) Σ_{i=1}^n ζ_i², σ² = (1/n) Σ_{i=1}^n σ_i², and

α = ((ω_M + 1)(ω_W + 1)/n) σ² + ((ω_M + 1) ω_W/n) ζ²,   β = 1 + ω_M + (ω_M + 1) ω_W/n,    (4)

where C_M ∈ B(ω_M) denotes the compression applied at the master. (Fig 4 compares the setups C_Wi = D^{2,2^7}_sta with C_M = identity, and, in blue, C_Wi = D^{2,8}_nat with C_M = C_nat.)

Theorem 6. Let C_M ∈ B(ω_M), C_Wi ∈ B(ω_Wi) and η^k ≡ η ∈ (0, 2/(βL)), where α, β are as in (4). If a is picked uniformly at random from {0, 1, ..., T − 1}, then

E[‖∇f(x^a)‖²] ≤ 2(f(x^0) − f(x*)) / (η(2 − βLη)T) + αLη / (2 − βLη),    (5)

where x* is an optimal solution of (1).
In particular, if we fix any ε > 0 and choose η = ε/(L(α + εβ)) and T ≥ 2L(f(x^0) − f(x*))(α + εβ)/ε², then E[‖∇f(x^a)‖²] ≤ ε. The above theorem has some interesting consequences. First, notice that (5) posits an O(1/T) convergence of the gradient norm to the value αLη/(2 − βLη), which depends linearly on α. In view of (4), the more compression we perform, the larger this value becomes. More interestingly, assume now that the same compression operator is used at each worker: C_W = C_Wi. Let C_W ∈ B(ω_W), and let C_M ∈ B(ω_M) be the compression on the master side. Then T(ω_M, ω_W) := 2L(f(x^0) − f(x*)) ε^{−2} (α + εβ) is the iteration complexity. In the special case of equal data on all nodes, i.e., ζ = 0, we get α = (ω_M + 1)(ω_W + 1)σ²/n and β = (ω_M + 1)(1 + ω_W/n). If no compression is used, then ω_W = ω_M = 0 and α + εβ = σ²/n + ε. So, the relative slowdown of Algorithm 1 used with compression compared to Algorithm 1 used without compression is given by

T(ω_M, ω_W) / T(0, 0) = (ω_M + 1) · ((ω_W + 1)σ²/n + (1 + ω_W/n)ε) / (σ²/n + ε) ∈ (ω_M + 1, (ω_M + 1)(ω_W + 1)].

The upper bound is achieved for n = 1 (or for any n as ε → 0), and the lower bound is approached in the limit as n → ∞. So, the slowdown caused by compression on the worker side decreases with n. More importantly, the savings in communication due to compression can outweigh the iteration slowdown, which leads to an overall speedup! See Table 1 for the computation of the overall worker-to-master speedup achieved by our compression techniques (also see Appendix D.7 for additional similar comparisons under different cost/speed models). Notice, however, that standard sparsification does not necessarily improve the overall running time: it can make it worse. Our methods have the desirable property of significantly uplifting the minimal speedup compared to their "non-natural" versions. The minimal speedup is the more important one, as the number of nodes n is usually not very big.
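The worker-to-master speedup figures in Table 1 for the C_nat and sparsification rows can be reproduced directly from the stated cost model, speedup = T(0)·32d / (T(ω_W)·#bits) with T(ω) = (ω + 1)^θ. A minimal sketch assuming those Table 1 formulas (the dithering rows, which involve κ and s, are omitted):

```python
import math

d, q = 10**6, 10**5          # gradient size and sparsifier budget from Table 1
log2d = math.log2(d)

def speedup(theta, omega_plus_1, bits):
    """Speedup factor from Table 1: baseline cost T(0)*32d over T(w)*bits,
    where T(w) = (w + 1)^theta."""
    return (32 * d) / (omega_plus_1 ** theta * bits)

for name, w1, bits in [
    ("C_nat",       9 / 8,           9 * d),
    ("S_q",         d / q,           (33 + log2d) * q),
    ("C_nat o S_q", 9 * d / (8 * q), (10 + log2d) * q),
]:
    # worst case theta = 1, best case theta = 0
    print(f"{name}: {speedup(1, w1, bits):.1f}x to {speedup(0, w1, bits):.1f}x")
```

The printed ranges recover the 3.2×-3.6×, 0.6×-6.0× and 1.0×-10.7× entries of Table 1, and make visible why sparsification alone can slow training down (its θ = 1 factor dips below 1×) while the natural variants do not.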

5. EXPERIMENTS

To showcase the properties of our approach in practice, we built a proof-of-concept system and provide evaluation results. We focus on illustrating convergence behavior, training throughput improvement, and transmitted data reduction. The experimental setup is presented in Appendix B. Results. We first discuss the microbenchmark experiments measuring aggregated tensor elements (ATE) per second. We collect time measurements for aggregating 200 tensors of size 100MB, and present violin plots which show the median, min, and max values among workers. Fig 9 shows the results as we vary the number of workers between 4 and 8. The performance difference observed for C_nat, along with the similar performance of deterministic C_nat, indicates that the overhead of stochastic rounding at the aggregator is a bottleneck. We then illustrate the convergence behavior by training the ResNet110 and AlexNet models on CIFAR10. Fig 5 shows the training loss and test accuracy over time. We note that natural compression lowers training time by ∼26% for ResNet110 (17% more than QSGD in the same setup; see Alistarh et al. (2017), Table 1) and by 66% for AlexNet, compared to using no compression, while the final accuracy matches the results in He et al. (2016) under the same hyperparameter settings and the training loss is not affected by compression. In addition, when combining C_nat with other compression operators, we see no effect on convergence but a significant reduction in communication, e.g., 16× fewer levels for D^{p,s}_nat w.r.t. standard dithering. We further break down the speedup by showing the relative speedup of In-Network Aggregation, which performs no compression but reduces the volume of data transferred (shown below). We also show the effects of deterministic rounding on throughput. Because deterministic rounding does not compute random numbers, it provides some additional speedup. However, it may affect convergence.
These results represent potential speedups in case the overheads of randomization were low, for instance, when using a simple lookup of pre-computed randomness. We observe that the communication-intensive models (VGG, AlexNet) benefit more from quantization than the computation-intensive models (GoogleNet, Inception, ResNet). These observations are consistent with prior work (Alistarh et al., 2017). To quantify the data reduction benefits of natural compression, we measure the total volume of data transferred during training. In order to validate that C_nat does not incur any loss in performance, we trained various DNNs from the TensorFlow CNN Benchmark on the CIFAR10 dataset with and without C_nat for the same number of epochs, and compared the test set accuracy and training loss. As mentioned earlier, the baseline for comparison is the default NCCL setting. We did not tune the hyperparameters. In all of the experiments, we used Batch Normalization, but no Dropout. Looking into Figures 11, 12 and 13, one can see that C_nat achieves a significant speed-up without incurring any accuracy loss. As expected, the communication-intensive AlexNet (62.5M parameters) benefits more from the compression than the computation-intensive ResNets (< 1.7M parameters) and DenseNet40 (1M parameters).

A.1.1 DENSENET HYPERPARAMETERS:

We trained DenseNet40 (k = 12) and followed the same training procedure as described in Huang et al. (2017). We used a weight decay of 10^{−4} and vanilla SGD as the optimizer. We trained for a total of 300 epochs. The initial learning rate was 0.1, decreased by a factor of 10 at epochs 150 and 225.

A.1.2 ALEXNET HYPERPARAMETERS:

For AlexNet, we chose SGD with momentum 0.9 as the optimizer. We trained with three minibatch sizes, 256, 512 and 1024, for 200 epochs. The learning rate was initially set to 0.001 and decreased by a factor of 10 every 30 epochs.

A.1.3 RESNET HYPERPARAMETERS:

All the ResNets followed the training procedure described in He et al. (2016). We used a weight decay of 10^{−4} and vanilla SGD as the optimizer. The minibatch size was fixed to 128 for ResNet20 and 256 for all the others. We train for a total of 64K iterations. We start with an initial learning rate of 0.1 and multiply it by 0.1 at 32K and 48K iterations.

A.4 EMPIRICAL VARIANCE OF NATURAL AND STANDARD DITHERING

In this section, we perform experiments to confirm that the level selection of D^{p,s}_nat brings not just a theoretical but also a practical performance speedup in comparison to D^{p,u}_sta. We measure the empirical variance of D^{p,u}_sta and D^{p,s}_nat. For D^{p,s}_nat, we do not compress the norm, so that we compare just the variance introduced by level selection. Our experimental setup is the following. We first generate a random vector x of size d = 10^5 with independent entries drawn from the Gaussian distribution with zero mean and unit variance (we tried other distributions; the results were similar, so we report just this one) and then measure the normalized empirical variance ω(x) := ‖C(x) − x‖² / ‖x‖². We provide boxplots, each for 100 randomly generated vectors x using the above procedure. We perform this for p = 1, p = 2 and p = ∞. We first report our findings for D^{p,u}_sta with u = s, i.e., we use the same number of levels for both compression strategies. In each of the three plots we generated vectors x with a different norm. We find that natural dithering has a dramatically smaller variance, as predicted by Thm 5. Next, we set u = 2^{s−1}; that is, we give standard dithering an exponential advantage in terms of the number of levels (which also means that it will need more bits for communication). We now study the effect of this change on the variance. We observe that the empirical variance is essentially the same for both, as predicted by Thm 5. We now remark on the situation when the number of levels s is chosen to be very large (see Fig 18).
While this is not a practical setting, as it does not provide sufficient compression, it serves to illustrate a fundamental theoretical difference between D p,s sta and D p,s nat in the s → ∞ limit which we want to highlight. Note that while D p,s sta converges to the identity operator as s → ∞, which enjoys zero variance, D p,s nat converges to C nat instead, whose variance cannot drop below ω = 1/8. Hence, for large enough s, one would expect, based on our theory, the variance of D p,s nat to be around 1/8, while the variance of D p,s sta should be closer to zero. In particular, this means that D p,s sta can, in a practically meaningless regime, outperform D p,s nat. In Fig 18 we choose p = ∞ and s = 32 (which is large). Note that, as expected, the empirical variance of both compression techniques is small, and that, indeed, D p,s sta outperforms D p,s nat.
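The empirical-variance measurement described above can be sketched in a few lines. The code below is our own illustrative reimplementation (function names such as `dithering_sta`, `dithering_nat` and the rounding helper are ours, not from the paper's code), run on a smaller d for speed; it reproduces the qualitative finding that natural dithering has much smaller variance than standard dithering with the same number of levels.

```python
import numpy as np

rng = np.random.default_rng(0)

def round_between(y, lo, hi):
    # Unbiased randomized rounding of each y[i] in [lo[i], hi[i]] to an endpoint.
    p_hi = (y - lo) / (hi - lo)
    return np.where(rng.random(y.shape) < p_hi, hi, lo)

def dithering_sta(x, u, p=2):
    # Standard random dithering: u uniform levels k/u, k = 0, ..., u.
    norm = np.linalg.norm(x, ord=p)
    y = np.abs(x) / norm                       # normalized magnitudes in [0, 1]
    lo = np.floor(y * u) / u
    return norm * np.sign(x) * round_between(y, lo, lo + 1.0 / u)

def dithering_nat(x, s, p=2):
    # Natural dithering: s levels 2^(1-s), ..., 2^-1, 1, plus 0.
    norm = np.linalg.norm(x, ord=p)
    y = np.abs(x) / norm
    e = np.clip(np.ceil(np.log2(np.maximum(y, 1e-300))), 1 - s, 0)
    hi = 2.0 ** e
    lo = np.where(e == 1 - s, 0.0, hi / 2)     # lowest interval is [0, 2^(1-s)]
    return norm * np.sign(x) * round_between(y, lo, hi)

def omega_emp(C, x, reps=20):
    # Normalized empirical variance ||C(x) - x||^2 / ||x||^2, averaged over reps.
    sq = np.mean([np.linalg.norm(C(x) - x) ** 2 for _ in range(reps)])
    return sq / np.linalg.norm(x) ** 2

d, s = 10_000, 8
x = rng.standard_normal(d)
w_sta = omega_emp(lambda v: dithering_sta(v, u=s), x)   # same number of levels
w_nat = omega_emp(lambda v: dithering_nat(v, s=s), x)
print(f"standard (u={s}): {w_sta:.2f}, natural (s={s}): {w_nat:.2f}")
```

The gap arises because most normalized entries of a Gaussian vector are far below the lowest uniform level 1/u, while natural dithering's lowest level 2^{1-s} is exponentially smaller.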

A.4.4 COMPRESSING GRADIENTS

We also performed experiments identical to those reported above, but with a different generation technique for the vectors x. In particular, instead of synthetic Gaussian generation, we used gradients generated by our optimization procedure as applied to the problem of training several deep learning models. The results were essentially the same as those reported above, and hence we do not include them.

A.5 DIFFERENT COMPRESSION OPERATORS

We report additional experiments in which we compare our compression operators to previously proposed ones. These results are based on a Python implementation of our methods running in PyTorch, as this enabled rapid direct comparisons against the prior methods. We compare against no compression, random sparsification, and random dithering. We compare on the MNIST and CIFAR10 datasets. For MNIST, we use a two-layer fully connected neural network with ReLU activations. For CIFAR10, we use VGG11 with one fully connected layer as the classifier. We run these experiments with 4 workers and batch size 32 for MNIST and 64 for CIFAR10. The results are averaged over 3 runs. We tune the step size for SGD for a given "non-natural" compression, and then use the same step size for the corresponding "natural" method. Step sizes and parameters are listed alongside the results in Figures 19 and 20.

B EXPERIMENTAL SETUP

Our experiments execute the standard CNN benchmark. We summarize the hyperparameter settings in Appendix A.1.2. We further present results for two more variations of our implementation: one without compression (providing the baseline for In-Network Aggregation (Sapio et al., 2019)), and the other with deterministic rounding to the nearest power of 2, to highlight that sampling in natural compression carries a performance overhead. We implement the natural compression operator within the Gloo communication library, as a drop-in replacement for the ring all-reduce routine. Our implementation is in C++. We integrate our communication library with Horovod and, in turn, with TensorFlow. We follow the same communication strategy introduced in SwitchML (Sapio et al., 2019), which aggregates the deep learning model's gradients using In-Network Aggregation on programmable network switches. We choose this strategy because natural compression is a good fit for the capabilities of this class of modern hardware, which only supports basic integer arithmetic, simple logical operations and limited storage. A worker applies the natural compression operator to quantize gradient values and sends them to the aggregator component. As in SwitchML, an aggregator is capable of aggregating a fixed-length array of gradient values at a time. Thus, the worker sends a stream of network packets, each carrying a chunk of compressed values. For a given chunk, the aggregator awaits all values from every worker; then, it restores the compressed values as integers, aggregates them, and applies compression to quantize the aggregated values. Finally, the aggregator multicasts a packet of aggregated values back to the workers. For implementation expedience, we prototype the In-Network Aggregation as a server-based program implemented atop DPDK for fast I/O performance.
We leave a complete P4 implementation for programmable switches to future work; however, we note that all operations needed for our compression operator (bit shifting, masking, and random bit generation) are available on programmable switches. Implementation optimization. We carefully optimize our implementation using modern x86 vector instructions (AVX2) to minimize the overhead of compression. To fit the byte length and access memory more efficiently, we compress 32-bit floating point numbers to an 8-bit representation, where 1 bit is for the sign and 7 bits are for the exponent. The aggregator uses 64-bit integers to store the intermediate results, and we choose to clip the exponents to the range -50 ∼ 10. As a result, we only use 6 bits for the exponent; the remaining bit is used to represent zeros. Note that it is possible to implement 128-bit integers using two 64-bit integers, but we found that, in practice, the exponent values never exceed the range -50 ∼ 10 (Figure 21). Despite the optimization effort, we identify a non-negligible 10 ∼ 15% overhead in the random number generation used for stochastic rounding, which was also reported in Hubara et al. (2017). We include the experimental results of our compression operator without stochastic rounding as a reference. There could be more efficient ways to handle stochastic rounding, but we observe that deterministic rounding gives nearly the same training curve in practice, meaning that the computational speed-up is not neutralized by slower convergence due to the biased compression operator. Hardware setup. We run the workers on 8 machines configured with 1 NVIDIA P100 GPU, dual Intel Xeon E5-2630 v4 CPUs at 2.20 GHz, and 128 GB of RAM. The machines run Ubuntu (Linux kernel 4.4.0-122) and CUDA 9.0. Following Sapio et al. (2019), we balance the workers with 8 aggregators (4 aggregators in the case of 4 workers) running on machines configured with dual Intel Xeon Silver 4108 CPUs at 1.80 GHz.
Each machine uses a 10 GbE network interface and has CPU frequency scaling disabled. The chunks of compressed gradients sent by workers are uniformly distributed across all aggregators. This setup ensures that workers can fully utilize their network bandwidth and match the performance of a programmable switch. We leave the switch-based implementation for future work.

C DETAILS AND PROOFS FOR SECTIONS 2 AND 3

C.1 PROOF OF THEOREM 1

By linearity of expectation, the unbiasedness condition and the second moment condition (3) take the form $\mathbb{E}[(C(x))_i] = x_i$ for all $x \in \mathbb{R}^d$, $i \in [d]$, and $\sum_{i=1}^d \mathbb{E}[(C(x))_i^2] \leq (\omega+1) \sum_{i=1}^d x_i^2$ for all $x \in \mathbb{R}^d$. Recall that $C_{\mathrm{nat}}(t)$ can be written in the form
$$C_{\mathrm{nat}}(t) = \mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor} (1 + \lambda(t)), \quad (8)$$
where $\lambda(t)$ is a Bernoulli random variable with $\mathbb{E}[\lambda(t)] = 1 - p(t)$ and $p(t) = \left(2^{\lceil \log_2 |t| \rceil} - |t|\right)/2^{\lfloor \log_2 |t| \rfloor}$. Hence,
$$\mathbb{E}[C_{\mathrm{nat}}(t)] \overset{(8)}{=} \mathbb{E}\left[\mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor}(1+\lambda(t))\right] = \mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor}(1 + \mathbb{E}[\lambda(t)]) = \mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor}(1 + 1 - p(t)) = t.$$
This establishes unbiasedness (6). In order to establish (7), it suffices to show that $\mathbb{E}[(C_{\mathrm{nat}}(x))_i^2] \leq (\omega+1) x_i^2$ for all $x_i \in \mathbb{R}$. Since by definition $(C_{\mathrm{nat}}(x))_i = C_{\mathrm{nat}}(x_i)$ for all $i \in [d]$, it suffices to show that
$$\mathbb{E}[(C_{\mathrm{nat}}(t))^2] \leq (\omega+1) t^2, \quad \forall t \in \mathbb{R}. \quad (9)$$
If $t = 0$ or $t = \mathrm{sign}(t) 2^{\alpha}$ with $\alpha$ an integer, then $C_{\mathrm{nat}}(t) = t$, (9) holds as an identity with $\omega = 0$, and hence inequality (9) holds for $\omega = 1/8$. Otherwise $t = \mathrm{sign}(t) 2^{\alpha}$ where $a := \lfloor \alpha \rfloor < \alpha < \lceil \alpha \rceil = a + 1$. With this notation, we can write
$$\mathbb{E}[(C_{\mathrm{nat}}(t))^2] = 2^{2a} \, \frac{2^{a+1} - |t|}{2^a} + 2^{2(a+1)} \, \frac{|t| - 2^a}{2^a} = 2^a \left(3|t| - 2^{a+1}\right).$$
So,
$$\frac{\mathbb{E}[(C_{\mathrm{nat}}(t))^2]}{t^2} = \frac{2^a(3|t| - 2^{a+1})}{t^2} \leq \sup_{2^a < t < 2^{a+1}} \frac{2^a(3t - 2^{a+1})}{t^2} = \sup_{1 < \theta < 2} \frac{2^a(3 \cdot 2^a \theta - 2^{a+1})}{(2^a \theta)^2} = \sup_{1 < \theta < 2} \frac{3\theta - 2}{\theta^2}.$$
The optimal solution of the last maximization problem is $\theta = 4/3$, with optimal objective value $9/8$. This implies that (9) holds with $\omega = 1/8$.
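The scalar computation in this proof is easy to check numerically. The sketch below (our own helper, not part of the paper's code) evaluates the exact mean and second moment of C nat for a positive scalar, confirming unbiasedness and the 9/8 bound, with the supremum attained at θ = 4/3.

```python
import math

def cnat_moments(t):
    # Exact mean and second moment of C_nat(t) for t > 0, t not a power of two:
    # round down to 2^a with probability (2^(a+1) - t) / 2^a, else up to 2^(a+1).
    a = math.floor(math.log2(t))
    lo, hi = 2.0 ** a, 2.0 ** (a + 1)
    p_lo = (hi - t) / lo
    return lo * p_lo + hi * (1 - p_lo), lo**2 * p_lo + hi**2 * (1 - p_lo)

mean, second = cnat_moments(2.5)
print(mean, second / 2.5**2)            # 2.5 1.12 -- unbiased, and 1.12 <= 9/8

worst = cnat_moments(4 / 3)[1] / (4 / 3) ** 2
print(worst)                            # ~1.125, the supremum 9/8 at theta = 4/3
```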

C.2 PROOF OF THEOREM 2

Assume that there exists some $\omega < \infty$ for which $C_{\mathrm{int}}$ is an $\omega$-quantization. Unbiased rounding to the nearest integer can be defined in the following way:
$$C_{\mathrm{int}}(x_i) := \begin{cases} \lceil x_i \rceil & \text{with probability } p(x_i), \\ \lfloor x_i \rfloor & \text{with probability } 1 - p(x_i), \end{cases} \qquad \text{where } p(x_i) = x_i - \lfloor x_i \rfloor.$$
Consider the one-dimensional example $x \in (0,1)$. Then $\mathbb{E}[C_{\mathrm{int}}(x)^2] = (1-x) \cdot 0^2 + x \cdot 1^2 = x \leq (\omega + 1) x^2$, which implies $\omega + 1 \geq 1/x$. Taking $x \to 0^+$, one obtains $\omega \to \infty$, which contradicts the existence of a finite $\omega$.
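The blow-up in this argument can be illustrated with a one-line computation (our own sketch; the function name is ours):

```python
def cint_second_moment_ratio(x):
    # For x in (0, 1), unbiased rounding to {0, 1} rounds up with probability x,
    # so E[C_int(x)^2] = x and the ratio E[C_int(x)^2] / x^2 = 1 / x.
    return (x * 1.0**2) / x**2

for x in (0.5, 0.1, 0.01):
    print(x, cint_second_moment_ratio(x))   # grows like 1/x: ~2, ~10, ~100
```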

C.3 PROOF OF THEOREM 3

The main building block of the proof is the tower property of mathematical expectation: if $X$ and $Y$ are random variables, then $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$. Applying it to the composite compression operator $C_1 \circ C_2$, we get
$$\mathbb{E}[(C_1 \circ C_2)(x)] = \mathbb{E}\left[\mathbb{E}[C_1(C_2(x)) \mid C_2(x)]\right] \overset{(3)}{=} \mathbb{E}[C_2(x)] \overset{(3)}{=} x.$$
For the second moment, we have
$$\mathbb{E}\|(C_1 \circ C_2)(x)\|^2 = \mathbb{E}\left[\mathbb{E}\left[\|C_1(C_2(x))\|^2 \mid C_2(x)\right]\right] \overset{(3)}{\leq} (\omega_1 + 1)\, \mathbb{E}\|C_2(x)\|^2 \overset{(3)}{\leq} (\omega_1 + 1)(\omega_2 + 1)\|x\|^2,$$
which concludes the proof.

C.4 PROOF OF THEOREM 4

Unbiasedness of $D^{p,s}_{\mathrm{nat}}$ is a direct consequence of the unbiasedness of $D^{C,p,s}_{\mathrm{gen}}$. For the second part, we first establish a bound on the second moment of $\xi$:
$$\mathbb{E}\left[\xi\left(\tfrac{x_i}{\|x\|_p}\right)^2\right] \leq \mathbb{1}\left(\tfrac{|x_i|}{\|x\|_p} \geq 2^{1-s}\right) \tfrac{9}{8} \tfrac{|x_i|^2}{\|x\|_p^2} + \mathbb{1}\left(\tfrac{|x_i|}{\|x\|_p} < 2^{1-s}\right) \tfrac{|x_i|}{\|x\|_p} 2^{1-s} \leq \tfrac{9}{8} \tfrac{|x_i|^2}{\|x\|_p^2} + \mathbb{1}\left(\tfrac{|x_i|}{\|x\|_p} < 2^{1-s}\right) \tfrac{|x_i|}{\|x\|_p} 2^{1-s}. \quad (10)$$
Using this bound, we have
$$\mathbb{E}\|D^{p,s}_{\mathrm{nat}}(x)\|^2 = \|x\|_p^2 \sum_{i=1}^d \mathbb{E}\left[\xi\left(\tfrac{x_i}{\|x\|_p}\right)^2\right] \overset{(10)}{\leq} \|x\|_p^2 \left(\tfrac{9\|x\|^2}{8\|x\|_p^2} + \sum_{i=1}^d \mathbb{1}\left(\tfrac{|x_i|}{\|x\|_p} < 2^{1-s}\right)\tfrac{|x_i|}{\|x\|_p}\, 2^{1-s}\right) \leq \tfrac{9}{8}\|x\|^2 + \min\left\{2^{1-s}\|x\|_p\|x\|_1,\; 2^{2-2s} d \|x\|_p^2\right\} \leq \tfrac{9}{8}\|x\|^2 + \min\left\{d^{1/2}\, 2^{1-s}\, \|x\|_p \|x\|,\; 2^{2-2s} d \|x\|_p^2\right\} \leq \left(\tfrac{9}{8} + d^{1/\min\{p,2\}}\, 2^{1-s} \min\left\{1,\; d^{1/\min\{p,2\}}\, 2^{1-s}\right\}\right)\|x\|^2,$$
where the second inequality follows from $\sum_i \min\{a_i, b_i\} \leq \min\{\sum_i a_i, \sum_i b_i\}$, and the last two inequalities follow from the following consequence of Hölder's inequality: $\|x\|_p \leq d^{1/p - 1/2}\|x\|$ for $1 \leq p < 2$, and from the fact that $\|x\|_p \leq \|x\|$ for $p \geq 2$. This concludes the proof.

C.5 PROOF OF THEOREM 5

The main building block of the proof is a useful connection between $D^{p,s}_{\mathrm{nat}}$ and $D^{p,2^{s-1}}_{\mathrm{sta}}$, which can be formally written as
$$D^{p,s}_{\mathrm{nat}}(x) \overset{D}{=} \|x\|_p \cdot \mathrm{sign}(x) \cdot C_{\mathrm{nat}}(\xi(x)),$$
where $(\xi(x))_i = \xi(x_i/\|x\|_p)$ with levels $0, \tfrac{1}{2^{s-1}}, \tfrac{2}{2^{s-1}}, \ldots, 1$. A graphical visualization can be found in Fig 22. Equipped with this, we can proceed with
$$\mathbb{E}\|D^{p,s}_{\mathrm{nat}}(x)\|^2 = \mathbb{E}\left\|\|x\|_p \cdot \mathrm{sign}(x) \cdot C_{\mathrm{nat}}(\xi(x))\right\|^2 = \|x\|_p^2\, \mathbb{E}\left[\mathbb{E}\left[\|C_{\mathrm{nat}}(\xi(x))\|^2 \mid \xi(x)\right]\right] \overset{\text{Thm. 1}}{\leq} \tfrac{9}{8}\, \mathbb{E}\left\|\|x\|_p\, \mathrm{sign}(x)\, \xi(x)\right\|^2 = \tfrac{9}{8}\, \mathbb{E}\|D^{p,2^{s-1}}_{\mathrm{sta}}(x)\|^2 \leq \tfrac{9}{8}(\omega + 1)\|x\|^2,$$
which concludes the proof.
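The level correspondence underlying this connection can be checked directly. The toy snippet below (ours, for illustration) verifies that for s = 4 the natural levels form a subset of the u = 2^{s-1} uniform standard levels, as depicted in Fig 22.

```python
s = 4
u = 2 ** (s - 1)                                     # standard dithering: u = 8 levels
nat_levels = {0.0} | {2.0 ** -k for k in range(s)}   # 0, 1/8, 1/4, 1/2, 1
sta_levels = {k / u for k in range(u + 1)}           # 0, 1/8, 2/8, ..., 1
print(sorted(nat_levels))                            # [0.0, 0.125, 0.25, 0.5, 1.0]
print(nat_levels <= sta_levels)                      # True: natural levels are a subset
```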

C.6 NATURAL COMPRESSION AND DITHERING ALLOW FOR FAST AGGREGATION

Besides communication savings, our new compression operators C nat (natural compression) and D p,s nat (natural dithering) bring another advantage: ease of aggregation. Firstly, our updates allow in-network aggregation on a primitive switch, which by itself can speed up training by up to 300% (Sapio et al., 2019). Moreover, our updates are so simple that if one uses an integer format on the master side for update aggregation, then each update has just one non-zero bit, which leads to additional speed-up. For this reason, one needs to operate with at least 64 bits during the aggregation step, which is why we also apply C nat compression on the master side; hence we need to transmit just the exponent to the workers. Moreover, the translation from floats to integers and back is computation-free due to the structure of our updates. Lastly, for D p,s nat compression we obtain an additional speed-up with respect to standard randomized dithering D p,s sta, as our levels are computationally less expensive due to their natural compatibility with floating point formats. In addition, for efficient communication one needs to communicate signs, the norm and levels as a tuple for both D p,s nat and D p,s sta, which then need to be multiplied back together on the master side. For D p,s nat, this is just a summation of exponents rather than an actual multiplication, as is the case for D p,s sta.

D DETAILS AND PROOFS FOR SECTION 4

D.1 ALGORITHM

Algorithm 1: Distributed SGD with bidirectional compression
Input: learning rates {η_k}_{k=0}^T > 0, initial vector x^0
for k = 0, 1, . . . , T do
  Parallel: Worker side
  for i = 1, . . . , n do
    compute a stochastic gradient g_i(x^k) (of f_i at x^k)
    compress it: Δ_i^k = C_{W_i}(g_i(x^k))
  end for
  Master side
    aggregate Δ^k = Σ_{i=1}^n Δ_i^k
    compress g^k = C_M(Δ^k) and broadcast to each worker
  Parallel: Worker side
  for i = 1, . . . , n do
    x^{k+1} = x^k − (η_k/n) g^k
  end for
end for
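Algorithm 1 can be simulated in a few lines. The following is our own toy sketch (the quadratic objective, step size, and names like `c_nat` and `sgd_step` are illustrative assumptions, not the paper's code); all workers are emulated in a single process.

```python
import numpy as np

rng = np.random.default_rng(0)

def c_nat(v):
    # Natural compression: randomized rounding of each entry to a power of two.
    sign, a = np.sign(v), np.abs(v)
    out = np.zeros_like(v)
    nz = a > 0
    lo = 2.0 ** np.floor(np.log2(a[nz]))
    p_up = a[nz] / lo - 1.0                  # P(round up to 2*lo); makes it unbiased
    out[nz] = np.where(rng.random(lo.shape) < p_up, 2.0 * lo, lo)
    return sign * out

def sgd_step(x, grads, eta, c_worker=c_nat, c_master=c_nat):
    # One iteration of Algorithm 1 with bidirectional compression.
    n = len(grads)
    deltas = [c_worker(g) for g in grads]    # workers compress and send Delta_i
    g_k = c_master(sum(deltas))              # master aggregates, compresses, broadcasts
    return x - (eta / n) * g_k               # every worker applies the same update

# Toy problem: f_i(x) = ||x - b_i||^2 / 2, so grad f_i(x) = x - b_i, x* = mean(b_i).
b = rng.standard_normal((4, 10))
x = np.zeros(10)
for _ in range(200):
    x = sgd_step(x, [x - bi for bi in b], eta=0.2)
print(np.linalg.norm(x - b.mean(axis=0)))    # small residual error
```

Since the per-worker gradients x − b_i do not vanish at the optimum, the worker-side compression noise leaves a small residual floor; this is exactly the ζ² term in Assumption 2.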

D.2 ASSUMPTIONS AND DEFINITIONS

Formal definitions of some concepts used in Section 4 follow.

Definition 4. Let $f_i : \mathbb{R}^d \to \mathbb{R}$ be a fixed function. A stochastic gradient for $f_i$ is a random function $g_i(x)$ such that $\mathbb{E}[g_i(x)] = \nabla f_i(x)$.

In order to obtain the rate, we introduce additional assumptions on $g_i(x)$ and $\nabla f_i(x)$.

Assumption 1 (Bounded Variance). We say the stochastic gradient has variance at most $\sigma_i^2$ if $\mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 \leq \sigma_i^2$ for all $x \in \mathbb{R}^d$. Moreover, let $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$.

Assumption 2 (Similarity). We say the variance of the gradient among nodes is at most $\zeta_i^2$ if $\|\nabla f_i(x) - \nabla f(x)\|^2 \leq \zeta_i^2$ for all $x \in \mathbb{R}^d$. Moreover, let $\zeta^2 = \frac{1}{n}\sum_{i=1}^n \zeta_i^2$.

Moreover, we assume that $f$ is $L$-smooth (its gradient is $L$-Lipschitz). These are classical assumptions for non-convex SGD (Ghadimi & Lan, 2013; Jiang & Agrawal, 2018; Mishchenko et al., 2019), and in contrast to some previous works (Alistarh et al., 2017), our analysis requires neither bounded iterates nor a bounded second moment of the stochastic gradient. Assumption 2 is automatically satisfied with $\zeta^2 = 0$ if every worker has access to the whole dataset. If one wants to avoid Assumption 2, one can use the DIANA algorithm (Horváth et al., 2019) as the base algorithm instead of SGD; then there is no need for this assumption. For simplicity, we decided to pursue just the SGD analysis, and we keep Assumption 2.

D.3 DESCRIPTION OF ALGORITHM 1

Let us describe Algorithm 1. First, each worker computes its own stochastic gradient $g_i(x^k)$; this is then compressed using a compression operator $C_{W_i}$ (this can be different for every node; for simplicity, one can assume that they are all the same) and sent to the master node. The master node then aggregates the updates from all the workers, compresses the result with its own operator $C_M$, and broadcasts the update back to the workers, which update their local copy of the solution parameter $x$. Note that the communication of the updates can also be done in an all-to-all fashion, which implicitly results in $C_M$ being the identity operator. Another application, which is one of the key motivations of our natural compression and natural dithering operators, is in-network aggregation (Sapio et al., 2019). In this setup, the master node is a network switch. However, current network switches can only perform addition (not even averaging) of integers.

D.4 THREE LEMMAS NEEDED FOR THE PROOF OF THEOREM 6

Before we proceed with the theoretical guarantees for Algorithm 1 in the smooth non-convex setting, we first state three lemmas which are used to bound the variance of $g^k$ as a stochastic estimator of the true gradient $\nabla f(x^k)$. In this sense, compression at the master node has the effect of injecting additional variance into the gradient estimator. Unlike in SGD, where stochasticity is used to speed up computation, here we use it to reduce communication.

Lemma 7 (Tower property + Compression). If $C \in \mathbb{B}(\omega)$ and $z$ is a random vector independent of $C$, then
$$\mathbb{E}\|C(z) - z\|^2 \leq \omega\, \mathbb{E}\|z\|^2; \qquad \mathbb{E}\|C(z)\|^2 \leq (\omega + 1)\, \mathbb{E}\|z\|^2. \quad (11)$$

Proof. Recall from the discussion following Definition 2 that the variance of a compression operator $C \in \mathbb{B}(\omega)$ can be bounded as $\mathbb{E}\|C(x) - x\|^2 \leq \omega \|x\|^2$ for all $x \in \mathbb{R}^d$. Applied to the random vector $z$, this can be written in the form
$$\mathbb{E}\left[\|C(z) - z\|^2 \mid z\right] \leq \omega \|z\|^2, \quad (12)$$
which we can use in our argument:
$$\mathbb{E}\|C(z) - z\|^2 = \mathbb{E}\left[\mathbb{E}\left[\|C(z) - z\|^2 \mid z\right]\right] \overset{(12)}{\leq} \mathbb{E}\left[\omega\|z\|^2\right] = \omega\, \mathbb{E}\|z\|^2.$$
The second inequality can be proved in exactly the same way.

Lemma 8 (Local compression variance). Suppose $x$ is fixed, $C \in \mathbb{B}(\omega)$, and $g_i(x)$ is an unbiased estimator of $\nabla f_i(x)$. Then
$$\mathbb{E}\|C(g_i(x)) - \nabla f_i(x)\|^2 \leq (\omega + 1)\sigma_i^2 + \omega \|\nabla f_i(x)\|^2. \quad (13)$$

Proof.
$$\mathbb{E}\|C(g_i(x)) - \nabla f_i(x)\|^2 \overset{\text{Def. 4}+(3)}{=} \mathbb{E}\|C(g_i(x)) - g_i(x)\|^2 + \mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 \overset{(11)}{\leq} \omega\, \mathbb{E}\|g_i(x)\|^2 + \mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 \overset{\text{Def. 4}+(3)}{=} (\omega + 1)\, \mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 + \omega\|\nabla f_i(x)\|^2 \overset{\text{Assum. 1}}{\leq} (\omega + 1)\sigma_i^2 + \omega\|\nabla f_i(x)\|^2.$$

Lemma 9 (Global compression variance). Suppose $x$ is fixed, $C_{W_i} \in \mathbb{B}(\omega_{W_i})$ for all $i$, $C_M \in \mathbb{B}(\omega_M)$, and $g_i(x)$ is an unbiased estimator of $\nabla f_i(x)$ for all $i$. Then
$$\mathbb{E}\left\|\tfrac{1}{n}\, C_M\!\left(\textstyle\sum_{i=1}^n C_{W_i}(g_i(x))\right)\right\|^2 \leq \alpha + \beta \|\nabla f(x)\|^2, \quad (14)$$
where $\omega_W = \max_{i \in [n]} \omega_{W_i}$ and
$$\alpha = \tfrac{(\omega_M + 1)(\omega_W + 1)}{n}\,\sigma^2 + \tfrac{(\omega_M + 1)\omega_W}{n}\,\zeta^2, \qquad \beta = 1 + \omega_M + \tfrac{(\omega_M + 1)\omega_W}{n}.$$

Proof. For added clarity, let us denote $\Delta = \sum_{i=1}^n C_{W_i}(g_i(x))$. Using this notation, the proof proceeds as follows:
$$\mathbb{E}\left\|\tfrac{1}{n} C_M(\Delta)\right\|^2 \overset{\text{Def. 4}+(3)}{=} \mathbb{E}\left\|\tfrac{1}{n} C_M(\Delta) - \nabla f(x)\right\|^2 + \|\nabla f(x)\|^2 \overset{\text{Def. 4}+(3)}{=} \tfrac{1}{n^2}\,\mathbb{E}\|C_M(\Delta) - \Delta\|^2 + \mathbb{E}\left\|\tfrac{1}{n}\Delta - \nabla f(x)\right\|^2 + \|\nabla f(x)\|^2 \overset{(11)}{\leq} \tfrac{\omega_M}{n^2}\,\mathbb{E}\|\Delta\|^2 + \mathbb{E}\left\|\tfrac{1}{n}\Delta - \nabla f(x)\right\|^2 + \|\nabla f(x)\|^2 \overset{\text{Def. 4}+(3)}{=} (\omega_M + 1)\,\mathbb{E}\left\|\tfrac{1}{n}\Delta - \nabla f(x)\right\|^2 + (\omega_M + 1)\|\nabla f(x)\|^2 = \tfrac{\omega_M + 1}{n^2}\sum_{i=1}^n \mathbb{E}\|C_{W_i}(g_i(x)) - \nabla f_i(x)\|^2 + (\omega_M + 1)\|\nabla f(x)\|^2 \overset{(13)}{\leq} \tfrac{(\omega_M+1)(\omega_W+1)}{n}\,\sigma^2 + \tfrac{(\omega_M+1)\omega_W}{n}\cdot\tfrac{1}{n}\sum_{i=1}^n \|\nabla f_i(x)\|^2 + (\omega_M+1)\|\nabla f(x)\|^2 = \tfrac{(\omega_M+1)(\omega_W+1)}{n}\,\sigma^2 + \tfrac{(\omega_M+1)\omega_W}{n}\cdot\tfrac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f(x)\|^2 + \left(1 + \omega_M + \tfrac{(\omega_M+1)\omega_W}{n}\right)\|\nabla f(x)\|^2 \overset{\text{Assum. 2}}{\leq} \tfrac{(\omega_M+1)(\omega_W+1)}{n}\,\sigma^2 + \tfrac{(\omega_M+1)\omega_W}{n}\,\zeta^2 + \left(1 + \omega_M + \tfrac{(\omega_M+1)\omega_W}{n}\right)\|\nabla f(x)\|^2.$$

D.5 PROOF OF THEOREM 6

Using the $L$-smoothness of $f$ and then applying Lemma 9, we get
$$\mathbb{E} f(x^{k+1}) \leq \mathbb{E} f(x^k) + \mathbb{E}\langle \nabla f(x^k), x^{k+1} - x^k\rangle + \tfrac{L}{2}\,\mathbb{E}\|x^{k+1} - x^k\|^2 \leq \mathbb{E} f(x^k) - \eta_k\, \mathbb{E}\|\nabla f(x^k)\|^2 + \tfrac{L}{2}\eta_k^2\, \mathbb{E}\left\|\tfrac{g^k}{n}\right\|^2 \overset{(14)}{\leq} \mathbb{E} f(x^k) - \left(\eta_k - \tfrac{L}{2}\beta\eta_k^2\right)\mathbb{E}\|\nabla f(x^k)\|^2 + \tfrac{L}{2}\alpha\eta_k^2.$$
Summing these inequalities for $k = 0, \ldots, T-1$, we obtain
$$\sum_{k=0}^{T-1}\left(\eta_k - \tfrac{L}{2}\beta\eta_k^2\right)\mathbb{E}\|\nabla f(x^k)\|^2 \leq f(x^0) - f(x^\star) + \tfrac{TL\alpha\eta_k^2}{2}.$$
Taking $\eta_k = \eta$ and assuming $\eta < \tfrac{2}{L\beta}$, (16) one obtains
$$\mathbb{E}\|\nabla f(x^a)\|^2 \leq \tfrac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\|\nabla f(x^k)\|^2 \leq \tfrac{2(f(x^0) - f(x^\star))}{T\eta(2 - L\beta\eta)} + \tfrac{L\alpha\eta}{2 - L\beta\eta} =: \delta(\eta, T).$$
It is easy to check that if we choose $\eta = \tfrac{\varepsilon}{L(\alpha + \varepsilon\beta)}$ (which satisfies (16) for every $\varepsilon > 0$), then for any $T \geq \tfrac{2L(f(x^0) - f(x^\star))(\alpha + \varepsilon\beta)}{\varepsilon^2}$ we have $\delta(\eta, T) \leq \varepsilon$. Alternatively, choosing $\eta_k = \eta = \sqrt{\tfrac{2(f(x^0) - f(x^\star))}{LT\alpha}}$ with $T \geq \tfrac{L\beta^2(f(x^0) - f(x^\star))}{\alpha}$ (number of iterations), we obtain a rate of $O\big(\sqrt{(\omega_W + 1)(\omega_M + 1)/(Tn)}\big)$, which is essentially the same as performing no compression on the master and using the composition $C_W \circ C_M$ on the workers' side.

Which combination of compression operators is preferable in practice depends on:
• the relative speed of communication (per bit) from the workers to the master and from the master to the workers,
• the intelligence of the master, i.e., its ability (or lack thereof) to perform aggregation of real numbers (e.g., a switch can only perform integer aggregation),
• the variability of various resources (speed, memory, etc.) among the workers.
For simplicity, we will consider four situations/regimes only, summarized in Table 2.
Direct consequences of Theorem 6. Notice that (5) posits a $O(1/T)$ convergence of the gradient norm to the value $\tfrac{\alpha L \eta}{2 - \beta L \eta}$, which depends linearly on $\alpha$. In view of (4), the more compression we perform, the larger this value. More interestingly, assume now that the same compression operator is used at each worker: $C_W = C_{W_i}$. Let $C_W \in \mathbb{B}(\omega_W)$, and let $C_M \in \mathbb{B}(\omega_M)$ be the compression on the master side. Then $T(\omega_M, \omega_W) := 2L(f(x^0) - f(x^\star))\varepsilon^{-2}(\alpha + \varepsilon\beta)$ is its iteration complexity. In the special case of equal data on all nodes, i.e., $\zeta = 0$, we get $\alpha = (\omega_M + 1)(\omega_W + 1)\sigma^2/n$ and $\beta = (\omega_M + 1)(1 + \omega_W/n)$. If no compression is used, then $\omega_W = \omega_M = 0$ and $\alpha + \varepsilon\beta = \sigma^2/n + \varepsilon$.

Under review as a conference paper at ICLR 2021

So, the relative slowdown of Algorithm 1 used with compression compared to Algorithm 1 used without compression is given by
$$\frac{T(\omega_M, \omega_W)}{T(0,0)} = \frac{(\omega_W + 1)\sigma^2/n + (1 + \omega_W/n)\varepsilon}{\sigma^2/n + \varepsilon}\,(\omega_M + 1) \in \left(\omega_M + 1,\; (\omega_M + 1)(\omega_W + 1)\right]. \quad (17)$$
The upper bound is achieved for $n = 1$ (or for any $n$ as $\varepsilon \to 0$), and the lower bound is achieved in the limit as $n \to \infty$. So, the slowdown caused by compression on the worker side decreases with $n$. More importantly, the savings in communication due to compression can outweigh the iteration slowdown, which leads to an overall speedup!

D.7.1 MODEL 1

First, we start with the comparison where we assume that transmitting one bit from a worker to the master takes the same amount of time as transmitting one bit from the master to a worker.
Compression C ∈ B(ω) | No. iterations T(ω) = O((ω + 1)^{1+θ}) | Bits per iteration (W_i → M + M → W_i) | Speedup T(0)B(0) / T(ω)B(ω)
None | 1 | 2 · 32d | 1
C_nat | (9/8)^{1+θ} | 2 · 9d | 2.81×–3.16×
S_q | (d/q)^{1+θ} | 2 · (33 + log_2 d)q | 0.06×–0.60×
S_q ∘ C_nat | (9d/(8q))^{1+θ} | 2 · (10 + log_2 d)q | 0.09×–0.98×
D_sta^{p,2^{s-1}} | (1 + √d 2^{1-s} κ)^{1+θ} | 2 · (32 + d(s + 2)) | 1.67×–1.78×
D_nat^{p,s} | (81/64 + (9/8)√d 2^{1-s} κ)^{1+θ} | 2 · (8 + d(log_2 s + 2)) | 3.19×–4.10×

Table 3: Our compression techniques can speed up the overall runtime (number of iterations T(ω) times the bits sent per iteration) of distributed SGD. We assume binary32 floating point representation, bi-directional compression using C, and the same speed of communication from worker to master (W_i → M) and back (M → W_i). The relative number of iterations (communications) sufficient to guarantee ε-optimality is T(ω) := (ω + 1)^{1+θ}, where θ ∈ [0, 1] (see Theorem 6). Note that the big-n regime leads to a better iteration bound T(ω), since for big n we have θ ≈ 0, while for small n we have θ ≈ 1. For dithering, κ = min{1, √d 2^{1-s}}. The 2.81× speedup for C_nat is obtained for θ = 1, and the 3.16× speedup for θ = 0. The speedup figures were calculated for d = 10^6, p = 2 (dithering), the optimal choice of s (dithering), and q = 0.1d (sparsification).
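The C nat speedup range in Table 3 can be reproduced with a two-line calculation (our own sketch; the helper name is ours): 32 bits per entry drop to 9, while the iteration count grows by (9/8)^{1+θ}.

```python
def speedup(bits_ratio, omega, theta):
    # T(0)B(0) / (T(omega)B(omega)) with T(omega) proportional to (omega + 1)^(1 + theta).
    return bits_ratio / (omega + 1) ** (1 + theta)

# C_nat: 2*32d bits down to 2*9d bits per iteration, omega = 1/8.
print(round(speedup(32 / 9, 1 / 8, 0.0), 2))   # 3.16 (big-n regime, theta ~ 0)
print(round(speedup(32 / 9, 1 / 8, 1.0), 2))   # 2.81 (small-n regime, theta ~ 1)
```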

D.7.2 MODEL 2

For the second model, we assume that the master communicates much faster than the workers; thus communication from the workers is the bottleneck and we do not need to compress updates after aggregation, so C_M is the identity operator with ω_M = 0. This is the case we mention in the main paper. For completeness, we provide the same table here.

D.7.3 MODEL 3

Similarly to the previous sections, we also perform the comparison for methods that might be used for In-Network Aggregation. Note that for INA, it is useful to compress also from the master back to the workers, as the master works just with integers; hence, in order to be compatible with floats, it needs to use a bigger integer format. Moreover, C nat compression guarantees free translation to floats. For the third model, we make the same assumptions on communication as for Model 1. As a baseline, we take SGD with C nat, as this is the simplest analyzable method which supports INA.

D.7.4 MODEL 4

Here, we do the same comparison as for Model 3. In contrast, for communication we use the same assumptions as for Model 2.

Table 4: Overall speedup (number of iterations T(ω_W) times the bits sent from worker to master (W_i → M) per iteration) of distributed SGD. We neglect M → W_i communication as in practice this is much faster. We assume binary32 representation. The relative number of iterations sufficient to guarantee ε-optimality is T(ω_W) := (ω_W + 1)^θ, where θ ∈ (0, 1] (see Theorem 6). Note that in the big-n regime the iteration bound T(ω_W) is better due to θ ≈ 0 (however, this is not very practical as n is usually small), while for small n we have θ ≈ 1. For dithering, r = min{p, 2} and κ = min{1, √d 2^{1-s}}. The lower bound for the speedup factor is obtained for θ = 1, and the upper bound for θ = 0. The speedup factor T(ω_W) · #Bits / (T(0) · 32d) figures were calculated for d = 10^6, q = 0.1d, p = 2 and the optimal choice of s with respect to speedup.

Table 5: Overall speedup (number of iterations T times the bits sent per iteration (W_i → M + M → W_i)) of distributed SGD. We assume binary32 floating point representation and bi-directional compression using the same compression C. The relative number of iterations (communications) sufficient to guarantee ε-optimality is displayed in the third column, where θ ∈ (0, 1] (see Theorem 6). Note that the big-n regime leads to a smaller slowdown, since for big n we have θ ≈ 0, while for small n we have θ ≈ 1. For dithering, we chose p = 2 and κ = min{1, √d 2^{1-s}}. The speedup factor figures were calculated for d = 10^6, p = 2 (dithering), the optimal choice of s (dithering), and q = 0.1d (sparsification).

Table 6: Overall speedup (number of iterations T times the bits sent per iteration (W_i → M)) of distributed SGD. We assume binary32 floating point representation and bi-directional compression using C_{W_i}, C_M. The relative number of iterations (communications) sufficient to guarantee ε-optimality is displayed in the third column, where θ ∈ (0, 1] (see Theorem 6). Note that the big-n regime leads to a smaller slowdown, since for big n we have θ ≈ 0, while for small n we have θ ≈ 1. For dithering, we chose p = 2 and κ = min{1, √d 2^{1-s}}. The speedup factor figures were calculated for d = 10^6, p = 2 (dithering), the optimal choice of s (dithering), and q = 0.1d (sparsification).

D.7.5 COMMUNICATION STRATEGIES USED IN TABLES 1, 3, 5, 6

No compression or C nat. Each worker has to communicate a (possibly dense) d-dimensional vector of scalars, each represented by 32 or 9 bits, respectively.

Sparsification S_q with or without C nat. Each worker has to communicate a sparse vector of q entries with full 32-bit or reduced 9-bit precision. We assume that q is small; hence one would prefer to transmit the positions of the non-zeros, which takes q(log_2(d) + 1) additional bits for each worker.


Dithering (D p,s sta or D p,s nat). Each worker has to communicate 31 bits (8 bits for D p,s nat) for the norm (the norm is always positive, so its sign does not need to be communicated), and, for every coordinate, log_2(s) + 1 bits for level encoding (assuming uniform encoding) and 1 bit for the sign.

D.8 SPARSIFICATION -FORMAL DEFINITION

Here we give a formal definition of the sparsification operator $S_q$ used in Tables 1, 3, 5 and 6.

Definition 5 (Random sparsification). Let $1 \leq q \leq d$ be an integer, and let $\circ$ denote the Hadamard (element-wise) product. The random sparsification operator $S_q : \mathbb{R}^d \to \mathbb{R}^d$ is defined as $S_q(x) = \tfrac{d}{q} \cdot \xi \circ x$, where $\xi \in \mathbb{R}^d$ is a random vector chosen uniformly from the collection of all binary vectors $y \in \{0,1\}^d$ with exactly $q$ nonzero entries (i.e., $\|y\|_0 = q$).

The next result describes the variance of $S_q$:

Theorem 10. $S_q \in \mathbb{B}(d/q - 1)$.

Notice that in the special case $q = d$, $S_q$ reduces to the identity operator (i.e., no compression is applied), and Theorem 10 yields a tight variance estimate: $d/d - 1 = 0$.

Proof. See e.g. Stich et al. (2018) (Lemma A.1).

Let us now compute the variance of the composition $C_{\mathrm{nat}} \circ S_q$. Since $C_{\mathrm{nat}} \in \mathbb{B}(1/8)$ (Theorem 1) and $S_q \in \mathbb{B}(d/q - 1)$ (Theorem 10), in view of our composition result (Theorem 3) we have $C_W = C_{\mathrm{nat}} \circ S_q \in \mathbb{B}(\omega_W)$, where
$$\omega_W = \tfrac{1}{8}\left(\tfrac{d}{q} - 1\right) + \tfrac{1}{8} + \tfrac{d}{q} - 1 = \tfrac{9d}{8q} - 1. \quad (18)$$
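A minimal sketch of Definition 5 with an empirical check of Theorem 10 (our own illustrative code, not the paper's implementation): the second moment of $S_q(x)$ should concentrate around $(d/q)\|x\|^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparsify(x, q):
    # Random sparsification S_q: keep q uniformly chosen entries, scale by d/q.
    d = x.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=q, replace=False)] = 1.0
    return (d / q) * mask * x

d, q = 1000, 100
x = rng.standard_normal(d)
# E||S_q(x)||^2 = (d/q)||x||^2, i.e. S_q is in B(d/q - 1); check the ratio empirically.
ratio = np.mean([np.linalg.norm(sparsify(x, q)) ** 2 for _ in range(500)])
ratio /= np.linalg.norm(x) ** 2
print(ratio)                                    # close to d/q = 10
```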

E LIMITATIONS AND EXTENSIONS

Quantization techniques can be divided into two categories: biased (Alistarh et al., 2018; Stich et al., 2018) and unbiased (Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018). While the focus of this paper was on unbiased quantizations, it is possible to combine our natural quantization mechanisms with biased techniques, such as the TopK sparsifier proposed in Dryden et al. (2016) and Aji & Heafield (2017) and recently analyzed in Alistarh et al. (2018) and Stich et al. (2018), and still obtain convergence guarantees.



https://github.com/tensorflow/benchmarks
https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/RN50v1.5
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF
https://github.com/tensorflow/benchmarks
https://github.com/facebookincubator/gloo
https://www.dpdk.org
Jiang & Agrawal (2018) allow compression on the worker side only.



Figure 1: Communication (in bits) vs. the second moment ω + 1 (see Eq (3))

Figure 2: An illustration of natural compression applied to t = 2.5: Cnat(2.5) = 2 with probability (4 − 2.5)/2 = 0.75, and Cnat(2.5) = 4 with probability 0.25.

A binary32 number t is represented as t = (−1)^s × 2^{e−127} × (1 + Σ_{j=1}^{23} m_j 2^{−j}), where s is the sign bit, e is the exponent and m is the mantissa. A binary32 representation of t = −2.75 is visualized in Fig 4. In this case, s = 1, e_7 = 1, m_2 = m_3 = 1, and hence t = −(1 + 1/4 + 1/8) × 2^1 = −2.75.
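This decomposition of t = −2.75 can be verified with Python's standard struct module (a small check we added; not part of the paper):

```python
import struct

def binary32_fields(t):
    # Split the IEEE 754 binary32 encoding of t into sign, exponent, mantissa bits.
    bits = struct.unpack(">I", struct.pack(">f", t))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

s, e, m = binary32_fields(-2.75)
print(s, e - 127, 1 + m / 2**23)    # 1 1 1.375, i.e. -2.75 = -(1.375) * 2^1
```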

Figure 3: Randomized rounding for natural (left) and standard (right) dithering (s = 3 levels).

Figure 4: IEEE 754 single-precision binary floating-point format: binary32.

Figure 5: Train loss and test accuracy of ResNet110 and AlexNet on CIFAR10. Speed-up is displayed with respect to the time to execute a fixed number of epochs, 320 and 200, respectively.

Figure 8: Train loss and test accuracy of VGG11 on CIFAR10. Green line: C Wi = D 2,2^7 sta, C M = identity. Blue line: C Wi = D 2,8 nat, C M = C nat.

Figure 10: Convergence comparison for weak scaling.

w.r.t. D p,s sta; see Fig 8, and see Figs 19 and 20 in the Appendix for other compressions. Next, we report the speedup measured as average training throughput while training benchmark CNN models on the ImageNet dataset for one epoch. The throughput is calculated as the total number of images processed divided by the time elapsed. Fig 6 shows the speedup normalized by the training throughput of the baseline, that is, TensorFlow + Horovod using the NCCL communication library.

Fig 7 shows that data transferred grows linearly over time, as expected. Natural compression saves 84% of data, which greatly reduces communication time. Fig 10 studies weak scaling for training ResNet50 on ImageNet showing that C nat in itself does not have a negative effect on weak scaling. Further details and additional experiments including convergence experiments for Neural Collaborative Filtering (He et al., 2017) are presented in Appendix A.

Figure 11: DenseNet40 (k = 12)

Figure 13: ResNet (#layers: 20, 44 and 56)

Figure 14: ResNet50 on ImageNet


Figure 16: D p,s nat vs. D p,u sta with u = s.

Figure 18: When p = ∞ and s is very large, the empirical variance of D p,s sta can be smaller than that of D p,s nat. However, in this case, the variance of D p,s nat is already negligible.

Random sparsification, step size 0.04, sparsity 10%. (c) Random sparsification with non-uniform probabilities (Wangni et al., 2018), step size 0.04, sparsity 10%. Random dithering, step size 0.08, s = 8, u = 2^7, second norm.

Figure 19: CIFAR10 with VGG11.

Figure 20: MNIST with 2 fully connected layers.

Figure 21: Histogram of exponents of gradients exchanged during the entire training process for ResNet110 (left) and Alexnet (right). Red lines denote the minimum and maximum exponent values of all gradients.


Figure 22: 1D visualization of the workings of natural dithering D p,s nat and standard dithering D p,u sta with u = 2^{s-1}, for s = 4. Notice that the numbers standard dithering rounds to, i.e., 0, 1/8, 2/8, . . . , 7/8, 1, form a superset of the numbers natural dithering rounds to, i.e., 0, 2^{-3}, 2^{-2}, 2^{-1}, 1. Importantly, while standard dithering uses u = 2^{4-1} = 8 levels (i.e., intervals) to achieve a certain fixed variance, natural dithering only needs s = 4 levels to achieve the same variance. This is an exponential improvement in compression (see Theorem 5 for the formal statement).


Violin plot of Aggregated Tensor Elements (ATE) per second. Dashed lines denote the maximum ATE/s under line rate.

A.2 CONVERGENCE TESTS ON IMAGENET

To further demonstrate the convergence behavior of C nat , we run experiments which conform to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). We follow a publicly available benchmark 2 and apply C nat to it without modifying any hyperparameter. The model trains for 50 epochs on 8 and 16 workers, with the default ResNet50 setup: SGD optimizer with 0.875 momentum, and a cosine learning rate schedule with initial learning rate 0.256 and linear warmup during the first 8 epochs. The weight decay is set to 1/32768 and is not applied to Batch Norm trainable parameters. Furthermore, 0.1 label smoothing is used. As shown in Fig 14, C nat does not incur any accuracy loss even when applied to large distributed tasks.

A.3 CONVERGENCE TESTS FOR NEURAL COLLABORATIVE FILTERING

We also train Neural Collaborative Filtering (NCF) (He et al., 2017) on the MovieLens-20M dataset using C nat and compare its convergence to no compression. Neural Collaborative Filtering is a large recommendation model with ∼32 million parameters. We use a publicly available benchmark 3 and apply C nat to it.
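For concreteness, the learning rate schedule described above (linear warmup for 8 epochs followed by cosine decay from 0.256) can be sketched as follows. The function `lr_schedule` is our own epoch-level illustration and may differ in granularity from the benchmark's exact per-step implementation.

```python
import math

def lr_schedule(epoch, total_epochs=50, base_lr=0.256, warmup_epochs=8):
    """Cosine learning-rate schedule with linear warmup, using the
    hyperparameters quoted in the text above."""
    if epoch < warmup_epochs:
        # linear warmup from base_lr/warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

The schedule peaks at 0.256 when warmup ends and decays smoothly toward zero by epoch 50.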

One can see that in terms of epochs, we obtain almost the same training loss and test accuracy, sometimes even better. On the other hand, our approach greatly reduces the number of bits transmitted from workers to master, which is the main speedup factor, together with the speedup in aggregation if we use In-Network Aggregation (INA). Moreover, with INA we also compress updates from master to workers, and hence send fewer bits in that direction as well. Together, these factors bring significant speedups, as illustrated in Fig 6, which strongly suggests a similar speed-up in training time as observed for C nat ; see, e.g., Section 5.

concluding the proof.

Four theoretical models.

workers' side. Our rate generalizes the rate of Ghadimi & Lan (2013) without compression, and the dependency on the compression operator is better compared to the linear one in Jiang & Agrawal (2018) 7 . Moreover, our rate enjoys linear speed-up in the number of workers n, the same as Ghadimi & Lan (2013). In addition, if one introduces mini-batching of size b on each worker and assumes each worker has access to the whole data, then σ 2 → σ 2 /b and ζ 2 → 0, which implies
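The replacement σ 2 → σ 2 /b follows from the standard variance bound for a mini-batch average of b independent, unbiased stochastic gradients (a routine calculation, stated here for completeness):

```latex
\mathbb{E}\left\|\frac{1}{b}\sum_{j=1}^{b} g_j - \nabla f(x)\right\|^2
= \frac{1}{b^2}\sum_{j=1}^{b}\mathbb{E}\left\|g_j - \nabla f(x)\right\|^2
\le \frac{1}{b^2}\cdot b\,\sigma^2
= \frac{\sigma^2}{b},
```

where the first equality uses independence and unbiasedness, \(\mathbb{E}[g_j] = \nabla f(x)\). The term ζ 2 vanishes because, when each worker samples from the whole data, all local objectives f i coincide.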

The overall speedup of distributed SGD with compression on nodes via C Wi over a baseline variant without compression. Speedup is measured by multiplying the number of communication rounds (i.e., iterations T (ω W

Appendix

For easy navigation through the paper and the appendices, we provide a table of contents.

