NATURAL COMPRESSION FOR DISTRIBUTED DEEP LEARNING

Abstract

Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: natural compression ($\mathcal{C}_{\mathrm{nat}}$). Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that, compared to no compression, $\mathcal{C}_{\mathrm{nat}}$ increases the second moment of the compressed vector by at most the tiny factor $9/8$, which means that the effect of $\mathcal{C}_{\mathrm{nat}}$ on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by $\mathcal{C}_{\mathrm{nat}}$ are substantial, leading to a 3-4× improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize $\mathcal{C}_{\mathrm{nat}}$ to natural dithering, which we prove is exponentially better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer a new state of the art both in theory and in practice.
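As an illustration of the rounding rule described above, the following is a minimal NumPy sketch of natural compression applied entrywise. It assumes the rounding probabilities are chosen so that the compressed value is unbiased: an entry with magnitude between $2^a$ and $2^{a+1}$ is rounded up with probability $|t|/2^a - 1$ and down otherwise. The function name `natural_compression` and its `rng` parameter are hypothetical, introduced only for this example, and the sketch operates on full floating-point values rather than on the bit representation.

```python
import numpy as np

def natural_compression(x, rng=None):
    """Randomly round each nonzero entry of x to a neighbouring power of two.

    Sketch under the assumption that rounding is unbiased, i.e. the
    expectation of the output equals the input. Zeros are left unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)

    nz = x != 0
    mag = np.abs(x[nz])

    # frexp writes mag = mantissa * 2**exponent with mantissa in [0.5, 1),
    # so the largest power of two not exceeding mag is 2**(exponent - 1).
    _, exponent = np.frexp(mag)
    low = np.ldexp(1.0, exponent - 1)

    # Probability of rounding up to 2 * low; lies in [0, 1) and is 0
    # exactly when mag is already a power of two.
    p_up = mag / low - 1.0
    round_up = rng.random(mag.shape) < p_up

    out[nz] = np.sign(x[nz]) * np.where(round_up, 2.0 * low, low)
    return out
```

For example, an entry equal to 3 is mapped to 2 or 4 with probability 1/2 each, so its mean is preserved and its second moment grows from 9 to 10, within the factor $9/8$ stated above. Entries that are already powers of two pass through unchanged.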

1. INTRODUCTION

Modern deep learning models (He et al., 2016) are almost invariably trained in parallel or distributed environments, which is necessitated by the enormous size of the data sets and the dimension and complexity of the models required to obtain state-of-the-art performance. In our work, the focus is on the data-parallel paradigm, in which the training data is split across several workers capable of operating in parallel (Bekkerman et al., 2011; Recht et al., 2011). Formally, we consider optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$

where $x \in \mathbb{R}^d$ represents the parameters of the model, $n$ is the number of workers, and $f_i : \mathbb{R}^d \to \mathbb{R}$ is a loss function determined by the data stored on worker $i$. Typically, $f_i$ is modeled as a function of the form $f_i(x) := \mathbb{E}_{\zeta \sim \mathcal{D}_i}[f_\zeta(x)]$, where $\mathcal{D}_i$ is the distribution of data stored on worker $i$, and $f_\zeta : \mathbb{R}^d \to \mathbb{R}$ is the loss of model $x$ on data point $\zeta$. The distributions $\mathcal{D}_1, \dots, \mathcal{D}_n$ can be different on every node, which means that the functions $f_1, \dots, f_n$ may have different minimizers. This framework covers i) stochastic optimization, when either $n = 1$ or all $\mathcal{D}_i$ are identical, and ii) empirical risk minimization, when $f_i(x)$ can be expressed as a finite average, i.e., $\frac{1}{m_i} \sum_{j=1}^{m_i} f_{ij}(x)$ for some $f_{ij} : \mathbb{R}^d \to \mathbb{R}$.

Distributed Learning. Typically, problem (1) is solved by distributed stochastic gradient descent (SGD) (Robbins & Monro, 1951), which works as follows: stochastic gradients $g_i(x^k)$ are computed locally and sent to a master node, which performs update aggregation $g^k = \sum_{i=1}^{n} g_i(x^k)$. The aggregated gradient $g^k$ is sent back to the workers, and each performs a single step of SGD: $x^{k+1} = x^k - \frac{\eta^k}{n} g^k$, where $\eta^k > 0$ is a step size.
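The distributed SGD loop described above can be sketched in a single process as follows. This is a toy simulation under stated assumptions, not the paper's implementation: each worker holds a hypothetical quadratic loss $\frac{1}{2}\|x - b_i\|^2$, its stochastic gradient is the exact gradient plus optional Gaussian noise, the step size is constant, and the "master" is simply a sum over a Python list. The function name `distributed_sgd` and all of its parameters are introduced only for this example.

```python
import numpy as np

def distributed_sgd(targets, steps=200, eta=0.1, noise=0.0, seed=0):
    """Simulate distributed SGD on n workers with toy quadratic losses.

    Worker i holds b_i = targets[i] and the (assumed) loss
    f_i(x) = 0.5 * ||x - b_i||^2, whose gradient is x - b_i.
    """
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets, dtype=np.float64)
    n, d = targets.shape
    x = np.zeros(d)

    for _ in range(steps):
        # Each worker computes a stochastic gradient locally.
        local_grads = [(x - b) + noise * rng.standard_normal(d) for b in targets]

        # The master aggregates the updates by summing them.
        g = np.sum(local_grads, axis=0)

        # The aggregate is sent back and every worker takes the same step,
        # so a single shared iterate x is enough in this simulation.
        x = x - (eta / n) * g

    return x
```

With `noise=0.0` the iterates converge to the average of the `targets` rows, which is the minimizer of the averaged objective in this toy setting. In a real deployment, the two communication steps (workers to master, master to workers) are where update compression would be applied.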
A key bottleneck of the above algorithm, and of its many variants (e.g., variants utilizing minibatching (Goyal et al., 2017), importance sampling (Horváth & Richtárik, 2019), momentum (Nesterov, 2013), or variance reduction (Johnson & Zhang, 2013)), is the cost of communication of the

