COMPRESSING GRADIENTS IN DISTRIBUTED SGD BY EXPLOITING THEIR TEMPORAL CORRELATION

Anonymous authors
Paper under double-blind review

Abstract

We propose SignXOR, a novel compression scheme that exploits the temporal correlation of gradients for the purpose of gradient compression. Sign-based schemes such as Scaled-sign and SignSGD (Bernstein et al., 2018; Karimireddy et al., 2019) compress gradients by storing only the sign of each gradient entry. These methods, however, ignore temporal correlations between gradients. Whether the signs of corresponding gradient entries agree in two consecutive iterations can be represented by a binary vector, which can be compressed further depending on its entropy. By implementing a rate-distortion encoder we increase the temporal correlation of gradients, lowering the entropy and improving compression. We establish theoretical convergence of SignXOR by employing the two-way error-feedback approach introduced by Zheng et al. (2019), who show that two-way compression with error-feedback achieves the same asymptotic convergence rate as SGD, although convergence is slower by a constant factor. We strengthen their analysis to show that the convergence rate of two-way compression with error-feedback asymptotically matches that of SGD without the constant-factor slowdown. As a corollary we prove that two-way SignXOR compression with error-feedback achieves the same asymptotic rate of convergence as SGD. We numerically evaluate the proposed method on the CIFAR-100 and ImageNet datasets and show that SignXOR requires less than 50% of the communication traffic needed to send the raw signs of gradients. To the best of our knowledge, we are the first to present a gradient compression scheme that exploits the temporal correlation of gradients.

1. INTRODUCTION

Distributed optimization has become the norm for training machine learning models on large datasets. With the need to train bigger models on ever-growing datasets, scalability of distributed optimization has become a key focus in the research community. While an obvious solution to growing dataset size is to increase the number of workers, the communication among workers has proven to be a bottleneck. For popular benchmark models such as AlexNet, ResNet and BERT, communication can account for a significant portion of the overall training time (Alistarh et al., 2017; Seide et al., 2014; Lin et al., 2018). The BERT ("Bidirectional Encoder Representations from Transformers") architecture for language models (Devlin et al., 2018) comprises about 340 million parameters. If a 32-bit floating-point representation is used, one gradient update from a worker amounts to communicating around 1.3 GB (340 × 10^6 parameters × 32 bits per parameter × 2^-33 gigabytes per bit ≈ 1.3 GB). Frequently communicating such large payloads can easily overwhelm the network, resulting in prolonged training times. In addition, large payloads may increase other forms of cost in distributed optimization. Novel approaches such as federated learning employ mobile devices as worker nodes. Exchanging information with mobile devices is heavily constrained due to communication bandwidth and budget limitations. Therefore, communication remains an important bottleneck in distributed optimization, and reducing communication is of utmost importance. Gradient compression alleviates the communication bottleneck. The idea is to apply a compression scheme to gradients before sending them over the network. There has been an increasing amount of literature on gradient compression within the last few years (Seide et al., 2014; Aji & Heafield, 2017; Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018; Wu et al., 2018; Lin et al., 2018; Wang et al., 2018).
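The payload estimate above can be reproduced with a few lines of arithmetic (an illustrative back-of-the-envelope calculation; the 340-million parameter count is the figure quoted in the text):

```python
# Back-of-the-envelope payload size for one BERT gradient update,
# assuming 32-bit floating-point entries.
num_params = 340_000_000       # ~340 million parameters
bits_per_param = 32            # single-precision float
gb_per_bit = 2 ** -33          # 1 GB = 2^30 bytes = 2^33 bits

payload_gb = num_params * bits_per_param * gb_per_bit
print(f"{payload_gb:.2f} GB")  # ≈ 1.27 GB per gradient update
```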
Such compression schemes have been demonstrated to work well with distributed stochastic gradient descent (SGD) and its variants. However, SGD with arbitrary compression schemes may not converge; Karimireddy et al. (2019) give one example of non-convergence. The recently proposed error-feedback based algorithms (Stich et al., 2018; Karimireddy et al., 2019) circumvent the convergence issue. Error-feedback methods accumulate the compression error and feed it back to the input of the compression scheme so that the error gets transmitted over subsequent iterations. The dist-EF-SGD algorithm proposed by Zheng et al. (2019) applies error-feedback to two-way compression, in which both worker-to-master and master-to-worker communications are compressed. The theoretical guarantees of dist-EF-SGD are valid for all compression schemes that fall under the definition of 'δ-approximate compressors', also referred to as δ-compressors. The authors prove that error-feedback with two-way compression asymptotically achieves the O(1/√T) convergence rate of SGD. However, the analysis by Zheng et al. (2019) suggests that dist-EF-SGD converges slower than SGD by a constant factor. Our contributions in this paper are as follows. We propose SignXOR, a novel compression scheme that exploits the temporal correlation of gradients. We prove that SignXOR is a δ-compressor, and we provide convergence guarantees for SignXOR by employing dist-EF-SGD. We strengthen the convergence bound of Zheng et al. (2019) to show that dist-EF-SGD asymptotically converges at the same O(1/√T) rate as SGD. Consequently, we show that the proposed method asymptotically achieves the SGD convergence rate. We empirically validate the proposed method on the CIFAR-100 and ImageNet datasets and demonstrate that the ratio between the total communication budgets of SignXOR and Scaled-sign is less than 50%.

Notation: For x ∈ R^d, x[j] denotes the jth entry of x, ‖x‖₁ denotes the ℓ₁-norm, and ‖x‖ denotes the ℓ₂-norm. For vector inputs, the sgn(·) function outputs the sign of the input element-wise. The index set {1, . . . , n} is denoted by [n], and ⊙ denotes element-wise multiplication.
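To illustrate the temporal-correlation idea behind SignXOR, the sketch below compares the empirical entropy of raw sign bits with that of the XOR of sign bits from two consecutive, correlated "gradients". This is a toy illustration on synthetic Gaussian vectors, not the paper's rate-distortion encoder; the helper names (`sign_bits`, `empirical_entropy`) and the correlation level are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_bits(x):
    """Map the sign of each entry to a bit: positive -> 1, otherwise -> 0."""
    return (x > 0).astype(np.uint8)

def empirical_entropy(bits):
    """Entropy (bits per entry) of a binary vector; lower means more compressible."""
    p = float(bits.mean())
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Two consecutive "gradients" with strong temporal correlation:
# g_next mostly keeps the sign pattern of g_prev.
g_prev = rng.standard_normal(100_000)
g_next = g_prev + 0.3 * rng.standard_normal(100_000)

xor_bits = sign_bits(g_prev) ^ sign_bits(g_next)  # 1 where the sign flipped

# Raw sign bits are roughly fifty-fifty, costing ~1 bit per entry,
# while the XOR stream is heavily skewed toward 0 and entropy-codes well.
print(empirical_entropy(sign_bits(g_next)))  # close to 1.0
print(empirical_entropy(xor_bits))           # well below 1.0
```

Transmitting the XOR stream (plus one reference sign vector) therefore needs fewer bits per entry than transmitting raw signs whenever consecutive gradients are correlated, which is the effect SignXOR exploits.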

2. RELATED WORK

The most common gradient compression schemes can be categorized into those based on sparsification and those based on quantization. The methods based on sparsification, such as Top-k, Rand-k (Stich et al., 2018; Lin et al., 2018) and Spectral-ATOMO (Wang et al., 2018), preserve only the most significant gradient elements, effectively reducing the quantity of information-carrying gradient components. On the other hand, the methods based on quantization, such as QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017) and SignSGD (Bernstein et al., 2018), reduce the overall floating-point precision of the gradient. Therefore, these two classes of methods can be respectively thought of as approaches that reduce the quantity versus the quality of the gradient. One can think of this in analogy to image compression: JPEG compression, which is based on the discrete cosine transform, determines both which transform coefficients to store (the quantity) and at what level of resolution to store those coefficients (the quality). Sign-based compression schemes such as Scaled-sign, SignSGD and Signum (Bernstein et al., 2018) sit at the far end of quantization-based algorithms. Such schemes quantize real values to only two levels, +1 and -1. For example, the compressing function of Scaled-sign takes in a vector x ∈ R^d, and the decompressing function outputs the vector (‖x‖₁/d) sgn(x). This means that the compressed representation needs to store only the sign of each entry x[j], along with the scaling constant ‖x‖₁/d. In practice one can avoid the 'zero' output of sgn by mapping it to +1 or -1. This allows the two outcomes +1 and -1 to be represented using one bit per entry, making the size of the compressed representation d + 32 bits in total (assuming a 32-bit single-precision representation of the scaling constant). As per Shannon's source coding theorem (MacKay, 2003, p. 81), the sequence of +1 and -1 can be further compressed without any information loss if the probability of encountering +1 differs from that of -1. However, in our experiments on Scaled-sign compression we observe that both outputs are equally likely across all iterations. Any lossy gradient compression scheme introduces noise, also known as distortion, in addition to the measurement noise that is already present in the stochastic gradients computed by the workers. It is reasonable to expect that the additional compression error hurts the convergence rate of the algorithm. However, it has been empirically observed that significant compression ratios can be achieved before observing any impact on convergence (Seide et al., 2014; Alistarh et al., 2017). One can achieve even greater compression while keeping the convergence rate nearly the same by employing error-feedback (Stich et al., 2018; Karimireddy et al., 2019; Zheng et al., 2019). Algorithms based on error-feedback accumulate the compression error in past iterations and add it to the input of the

