COMPRESSING GRADIENTS IN DISTRIBUTED SGD BY EXPLOITING THEIR TEMPORAL CORRELATION

Anonymous authors
Paper under double-blind review

Abstract

We propose SignXOR, a novel gradient compression scheme that exploits the temporal correlation of gradients. Sign-based schemes such as Scaled-sign and SignSGD (Bernstein et al., 2018; Karimireddy et al., 2019) compress gradients by storing only the sign of each gradient entry. These methods, however, ignore temporal correlations between gradients. Whether the sign of each gradient entry agrees or differs across two consecutive iterations can be represented by a binary vector, which can be further compressed depending on its entropy. By implementing a rate-distortion encoder we increase the temporal correlation of gradients, lowering the entropy of this vector and improving compression. We establish theoretical convergence of SignXOR by employing the two-way error-feedback approach introduced by Zheng et al. (2019), who show that two-way compression with error-feedback achieves the same asymptotic convergence rate as SGD, although convergence is slower by a constant factor. We strengthen their analysis to show that the rate of convergence of two-way compression with error-feedback is asymptotically the same as that of SGD. As a corollary we prove that two-way SignXOR compression with error-feedback achieves the same asymptotic rate of convergence as SGD. We numerically evaluate our proposed method on the CIFAR-100 and ImageNet datasets and show that SignXOR requires less than 50% of the communication traffic of sending only the signs of gradients. To the best of our knowledge, we are the first to present a gradient compression scheme that exploits the temporal correlation of gradients.
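As a minimal illustrative sketch (not the authors' implementation; all names are hypothetical), the sign-agreement encoding described above can be written as follows: the XOR of the sign bits of two consecutive gradients yields a binary vector that is mostly zeros when gradients are temporally correlated, and whose empirical entropy bounds how far an entropy coder can compress it.

```python
import numpy as np

def signxor_bits(prev_grad, grad):
    """Indicator vector of sign changes between consecutive gradients:
    0 = sign unchanged, 1 = sign flipped. Illustrative sketch only."""
    return np.logical_xor(prev_grad >= 0, grad >= 0).astype(np.uint8)

def empirical_entropy(bits):
    """Binary entropy (bits per entry) of the indicator vector; lower
    entropy means an entropy coder can compress it further."""
    p = float(bits.mean())
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

# Temporally correlated gradients flip few signs, so the XOR vector
# is sparse and its entropy falls well below 1 bit per entry.
rng = np.random.default_rng(0)
g_prev = rng.standard_normal(10_000)
g_curr = g_prev + 0.1 * rng.standard_normal(10_000)  # small update
bits = signxor_bits(g_prev, g_curr)
print(empirical_entropy(bits))
```

Under these synthetic assumptions the printed entropy is far below the 1 bit per entry that plain sign compression needs, which is the gap SignXOR exploits.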

1. INTRODUCTION

Distributed optimization has become the norm for training machine learning models on large datasets. With the need to train bigger models on ever-growing datasets, the scalability of distributed optimization has become a key focus in the research community. While an obvious solution to growing dataset size is to increase the number of workers, communication among workers has proven to be a bottleneck. For popular benchmark models such as AlexNet, ResNet and BERT, communication can account for a significant portion of the overall training time (Alistarh et al., 2017; Seide et al., 2014; Lin et al., 2018).

The BERT ("Bidirectional Encoder Representations from Transformers") architecture for language models (Devlin et al., 2018) comprises about 340 million parameters. If a 32-bit floating-point representation is used, one gradient update from a worker amounts to communicating around 1.3 GB (340 × 10^6 parameters × 32 bits per parameter × 2^-33 gigabytes per bit ≈ 1.3 GB). Frequently communicating such large payloads can easily overwhelm the network, resulting in prolonged training times. In addition, large payloads may increase other forms of cost in distributed optimization. Novel approaches such as federated learning employ mobile devices as worker nodes, and exchanging information with mobile devices is heavily constrained by communication bandwidth and budget limitations. Communication therefore remains an important bottleneck in distributed optimization, and reducing it is of utmost importance.

Gradient compression alleviates the communication bottleneck. The idea is to apply a compression scheme to gradients before sending them over the network. There has been an increasing amount of literature on gradient compression within the last few years (Seide et al., 2014; Aji & Heafield, 2017; Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018; Wu et al., 2018; Lin et al., 2018; Wang et al., 2018).
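The payload arithmetic for BERT above can be checked directly (a minimal sketch; the parameter count is the figure quoted in the text, and 1 GB is taken as 2^33 bits):

```python
# Verify the per-update gradient payload estimate for BERT:
# 340 million parameters at 32 bits each, converted to gigabytes
# (1 GB = 2**30 bytes = 2**33 bits).
params = 340 * 10**6
bits_per_param = 32
total_bits = params * bits_per_param
gigabytes = total_bits / 2**33
print(round(gigabytes, 2))  # → 1.27, i.e. roughly 1.3 GB per update
```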
Such compression schemes have been demonstrated to work well with distributed stochastic gradient descent (SGD) and its variants. However, SGD with arbitrary compres-

