NATURAL COMPRESSION FOR DISTRIBUTED DEEP LEARNING

Abstract

Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: natural compression ($\mathcal{C}_{\mathrm{nat}}$). Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that compared to no compression, $\mathcal{C}_{\mathrm{nat}}$ increases the second moment of the compressed vector by not more than the tiny factor $9/8$, which means that the effect of $\mathcal{C}_{\mathrm{nat}}$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by $\mathcal{C}_{\mathrm{nat}}$ are substantial, leading to a 3-4× improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize $\mathcal{C}_{\mathrm{nat}}$ to natural dithering, which we prove is exponentially better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer a new state of the art both in theory and in practice.
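
As an illustration of the rounding rule described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function names `natural_compress_scalar` and `natural_compress` are hypothetical, and the sketch assumes finite inputs that are not at the very top of the floating-point range. Each nonzero entry $t$ with $2^{e-1} \le |t| < 2^{e}$ is rounded up to $2^{e}$ with probability $p = |t|/2^{e-1} - 1$ and down to $2^{e-1}$ otherwise, which makes the output unbiased. Under this rule the per-entry second moment grows by the factor $(1+3p)/(1+p)^2$, which is maximized at $p = 1/3$, where it equals $9/8$.

```python
import math
import random


def natural_compress_scalar(t, rng):
    """Randomly round t to a signed power of two so that the result is unbiased.

    Assumption: t is finite and small enough that 2 * low does not overflow.
    """
    if t == 0.0:
        return 0.0
    # frexp gives |t| = m * 2**e with m in [0.5, 1), hence 2**(e-1) <= |t| < 2**e.
    m, e = math.frexp(abs(t))
    low = math.ldexp(1.0, e - 1)  # largest power of two not exceeding |t|
    # Probability of rounding up: (|t| - low) / low = 2m - 1.
    p_up = 2.0 * m - 1.0
    magnitude = 2.0 * low if rng.random() < p_up else low
    return math.copysign(magnitude, t)


def natural_compress(x, rng):
    """Apply the scalar rule independently to every entry of the vector x."""
    return [natural_compress_scalar(t, rng) for t in x]
```

For example, the entry 3.0 is mapped to 2.0 or 4.0 with probability 1/2 each, so its mean is preserved, while an entry that is already a power of two is left unchanged. In this sketch the receiver would only need the sign and the exponent of each output, since the mantissa carries no information.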

1. INTRODUCTION

Modern deep learning models (He et al., 2016) are almost invariably trained in parallel or distributed environments, which is necessitated by the enormous size of the data sets and the dimension and complexity of the models required to obtain state-of-the-art performance. In our work, the focus is on the data-parallel paradigm, in which the training data is split across several workers capable of operating in parallel (Bekkerman et al., 2011; Recht et al., 2011). Formally, we consider optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$

where $x \in \mathbb{R}^d$ represents the parameters of the model, $n$ is the number of workers, and $f_i : \mathbb{R}^d \to \mathbb{R}$ is a loss function composed of data stored on worker $i$. Typically, $f_i$ is modeled as a function of the form $f_i(x) := \mathbb{E}_{\zeta \sim \mathcal{D}_i}[f_\zeta(x)]$, where $\mathcal{D}_i$ is the distribution of data stored on worker $i$, and $f_\zeta : \mathbb{R}^d \to \mathbb{R}$ is the loss of model $x$ on data point $\zeta$. The distributions $\mathcal{D}_1, \dots, \mathcal{D}_n$ can be different on every node, which means that the functions $f_1, \dots, f_n$ may have different minimizers. This framework covers i) stochastic optimization when either $n = 1$ or all $\mathcal{D}_i$ are identical, and ii) empirical risk minimization when $f_i(x)$ can be expressed as a finite average, i.e., $\frac{1}{m_i} \sum_{j=1}^{m_i} f_{ij}(x)$ for some $f_{ij} : \mathbb{R}^d \to \mathbb{R}$.

Distributed Learning. Typically, problem (1) is solved by distributed stochastic gradient descent (SGD) (Robbins & Monro, 1951), which works as follows: stochastic gradients $g_i(x^k)$ are computed locally and sent to a master node, which performs update aggregation $g^k = \sum_i g_i(x^k)$. The aggregated gradient $g^k$ is sent back to the workers, and each performs a single step of SGD: $x^{k+1} = x^k - \frac{\eta^k}{n} g^k$, where $\eta^k > 0$ is a step size.
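
The aggregation and update steps of distributed SGD described above can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not a real distributed implementation: the function name `distributed_sgd`, the toy losses $f_i(x) = \frac{1}{2}\|x - b_i\|^2$, the constant step size, and the Gaussian gradient noise are all hypothetical choices made only for illustration. For these losses, the minimizer of $f$ is the average of the vectors $b_i$.

```python
import random


def distributed_sgd(targets, steps=100, eta=0.5, noise=0.0, seed=0):
    """Simulate distributed SGD on toy losses f_i(x) = 0.5 * ||x - b_i||^2.

    targets: list of n vectors b_i, one per simulated worker (an assumption).
    """
    rng = random.Random(seed)
    n, d = len(targets), len(targets[0])
    x = [0.0] * d
    for _ in range(steps):
        # Each worker i computes a stochastic gradient g_i(x) = x - b_i + noise.
        grads = [
            [x[j] - b[j] + noise * rng.gauss(0.0, 1.0) for j in range(d)]
            for b in targets
        ]
        # The master aggregates: g = sum_i g_i(x).
        g = [sum(g_i[j] for g_i in grads) for j in range(d)]
        # Every worker applies the same step: x <- x - (eta / n) * g.
        x = [x[j] - (eta / n) * g[j] for j in range(d)]
    return x
```

In every iteration of this sketch, each worker sends one $d$-dimensional vector to the master and receives one back, which is the traffic that compression aims to reduce.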
A key bottleneck of the above algorithm, and of its many variants (e.g., variants utilizing minibatching (Goyal et al., 2017), importance sampling (Horváth & Richtárik, 2019), momentum (Nesterov, 2013), or variance reduction (Johnson & Zhang, 2013)), is the cost of communicating the typically dense gradient vector $g_i(x^k)$ and, in a parameter-server implementation with a master node, also the cost of broadcasting the aggregated gradient $g^k$. These are $d$-dimensional vectors of floats, with $d$ being very large in modern deep learning. It is well known (Seide et al., 2014; Alistarh et al., 2017; Zhang et al., 2017; Lin et al., 2018; Lim et al., 2018) that in many practical applications with common computing architectures, communication takes much more time than computation, creating a bottleneck for the entire training system.

Communication Reduction. Several solutions have been suggested in the literature as a remedy to this problem. In one strand of work, the issue is addressed by giving each worker "more work" to do, which results in a better communication-to-computation ratio. For example, one may use minibatching to construct more powerful gradient estimators (Goyal et al., 2017), define local problems for each worker to be solved by a more advanced local solver (Shamir et al., 2014; Richtárik & Takáč, 2016; Reddi et al., 2016), or reduce communication frequency (e.g., by communicating only once (McDonald et al., 2009; Zinkevich et al., 2010) or once every few iterations (Stich, 2018)). An orthogonal approach to the above efforts aims to reduce the size of the communicated vectors instead (Seide et al., 2014; Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018; Hubara et al., 2017) using various lossy (and often randomized) compression mechanisms, commonly known in the literature as quantization techniques.
In their most basic form, these schemes decrease the number of bits used to represent the floating-point numbers forming the communicated $d$-dimensional vectors (Gupta et al., 2015; Na et al., 2017), thus reducing the size of the communicated message by a constant factor. Another possibility is to apply randomized sparsification masks to the gradients (Suresh et al., 2017; Konečný & Richtárik, 2018; Alistarh et al., 2018; Stich et al., 2018), or to rely on coordinate/block descent update rules, which are sparse by design (Fercoq et al., 2014).

One of the most important considerations in the area of compression operators is the compression-variance trade-off (Konečný & Richtárik, 2018; Alistarh et al., 2017; Horváth et al., 2019). For instance, while random dithering approaches attain up to $O(d^{1/2})$ compression (Seide et al., 2014; Alistarh et al., 2017; Wen et al., 2017), the most aggressive schemes reach $O(d)$ compression by sending only a constant number of bits per iteration (Suresh et al., 2017; Konečný & Richtárik, 2018; Alistarh et al., 2018; Stich et al., 2018). However, the more compression is applied, the more information is lost, and the more the quantized vector will differ from the original vector we want to communicate, increasing its statistical variance. Higher variance implies slower convergence (Alistarh et al., 2017; Mishchenko et al., 2019), i.e., more communication rounds. So, ultimately, compression approaches offer a trade-off between the communication cost per iteration and the number of communication rounds. Outside of optimization for machine learning, compression operators are very relevant to optimal quantization theory and control theory (Elia & Mitter, 2001; Sun & Goyal, 2011; Sun et al., 2012).
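
The compression-variance trade-off can be made concrete with a standard random-$k$ sparsifier, used here purely as an illustration and not as an operator proposed in this work. The sketch below is a minimal Python version under stated assumptions: the name `rand_k` and its parameters are hypothetical. It keeps $k$ of the $d$ coordinates chosen uniformly at random and rescales them by $d/k$, which makes the output unbiased. The price is that the expected squared norm of the output is $\frac{d}{k}\|x\|^2$, so sending fewer coordinates directly inflates the second moment.

```python
import random


def rand_k(x, k, rng):
    """Keep k uniformly chosen coordinates of x, scaled by d / k for unbiasedness."""
    d = len(x)
    kept = rng.sample(range(d), k)  # k distinct coordinate indices
    out = [0.0] * d
    for j in kept:
        # Each coordinate survives with probability k / d, so scaling by d / k
        # gives E[out[j]] = x[j].
        out[j] = x[j] * d / k
    return out
```

With $d = 8$ and $k = 2$, for instance, only a quarter of the coordinates are sent, while the second moment of the compressed vector is four times that of the original.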

Summary of Contributions.

The key contributions of this work are the following:

• New compression operators. We construct a new "natural compression" operator ($\mathcal{C}_{\mathrm{nat}}$; see Sec. 2) based on a randomized rounding scheme in which each float of the compressed vector is rounded to a (positive or negative) power of 2. This compression has a provably small variance, at most $1/8$ (see Thm 1), which implies that theoretical convergence results of SGD-type methods are essentially unaffected (see Thm 6). At the same time, substantial savings are obtained in the amount of communicated bits per iteration (3.56× less for float32 and 5.82× less for float64). In addition, we utilize these insights and develop a new random dithering operator, natural dithering ($\mathcal{D}^{p,s}_{\mathrm{nat}}$; see Sec. 3), which is exponentially better than the very popular "standard" random dithering operator (see Thm 5). We remark that $\mathcal{C}_{\mathrm{nat}}$ and the identity operator arise as limits of $\mathcal{D}^{p,s}_{\mathrm{nat}}$ and $\mathcal{D}^{p,s}_{\mathrm{sta}}$ as $s \to \infty$, respectively. Importantly, our new compression techniques can be combined with existing compression and sparsification operators for a more dramatic effect, as we argued before. In particular, given a budget on the second moment $\omega + 1$ (see Eq. (3)) of a compression operator, which is the main factor influencing the increase in the number of communications when communication compression is applied compared to no compression, our compression operators offer the largest compression factor, resulting in the fewest bits transmitted (see Fig. 1).

• Lightweight & simple low-level implementation. We show that apart from a randomization procedure (which is inherent in all unbiased compression operators), natural compression is computation-free. Indeed, natural compression essentially amounts to the trimming of the mantissa and possibly

