DQSGD: DYNAMIC QUANTIZED STOCHASTIC GRADIENT DESCENT FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

Abstract

Gradient quantization is widely adopted to mitigate communication costs in distributed learning systems. Existing gradient quantization algorithms often rely on design heuristics and/or empirical evidence to tune the quantization strategy for different learning problems. To the best of our knowledge, there is no theoretical framework characterizing the trade-off between communication cost and model accuracy under dynamic gradient quantization strategies. This paper addresses this issue by proposing a novel dynamic quantized SGD (DQSGD) framework, which enables us to optimize the quantization strategy for each gradient descent step by exploring the trade-off between communication cost and modeling error. In particular, we derive an upper bound, tight in some cases, on the modeling error for arbitrary dynamic quantization strategies. By minimizing this upper bound, we obtain an enhanced quantization algorithm with significantly improved modeling error under given communication overhead constraints. Moreover, we show that our quantization scheme achieves a strengthened communication cost and model accuracy trade-off for a wide range of optimization models. Finally, through extensive experiments on large-scale computer vision and natural language processing tasks on the CIFAR-10, CIFAR-100, and AG-News datasets, we demonstrate that our quantization scheme significantly outperforms state-of-the-art gradient quantization methods in terms of communication costs.

1. INTRODUCTION

Recently, with the rapid growth of Artificial Intelligence (AI), 5G wireless communications, and Cyber-Physical Systems (CPS), distributed learning plays an increasingly important role in improving the efficiency and accuracy of learning, scaling to large input data sizes, and bridging different wireless computing resources (Dean et al., 2012; Bekkerman et al., 2011; Chilimbi et al., 2014; Chaturapruek et al., 2015; Zhu et al., 2020; Mills et al., 2019). Distributed Stochastic Gradient Descent (SGD) is at the core of the vast majority of distributed learning algorithms (e.g., various distributed deep neural networks): distributed nodes compute local gradients, and an aggregated gradient is obtained via communication among the nodes and/or a parameter server. However, due to limited bandwidth in practical networks, the communication overhead of transferring gradients often becomes the performance bottleneck. Several approaches to communication-efficient distributed learning have been proposed, including compressing gradients (Stich et al., 2018; Alistarh et al., 2017) and updating local models less frequently (McMahan et al., 2017). Gradient quantization reduces communication overhead by using a few bits to approximate each original real value, and is considered one of the most effective approaches to reducing communication overhead (Seide et al., 2014; Alistarh et al., 2017; Bernstein et al., 2018; Wu et al., 2018; Suresh et al., 2017). Lossy quantization inevitably introduces gradient noise, which affects the convergence of the model. Hence, a key question is how to effectively select the number of quantization bits to balance the trade-off between communication cost and convergence performance. Existing algorithms often quantize parameters into a fixed number of bits, which has been shown to be inefficient in balancing the communication-convergence trade-off (Seide et al., 2014; Alistarh et al., 2017; Bernstein et al., 2018).
An efficient scheme should be able to dynamically adjust the number of quantization bits for each gradient descent step.
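To make the fixed-bit baseline concrete, the following is a minimal sketch of stochastic uniform gradient quantization in the style of QSGD (Alistarh et al., 2017): each entry is scaled by the vector norm, mapped onto 2^b - 1 uniform levels, and rounded up or down at random so that the quantized vector is unbiased. The function name `stochastic_quantize` and the exact level scheme are illustrative assumptions, not the scheme proposed in this paper.

```python
import numpy as np

def stochastic_quantize(g, num_bits=4, rng=None):
    """Stochastically quantize gradient vector g using num_bits per entry.

    QSGD-style sketch (an assumption, not this paper's exact method):
    magnitudes are scaled by the vector's L2 norm onto s = 2**num_bits - 1
    uniform levels, then rounded up with probability equal to the
    fractional part, which makes the quantizer unbiased: E[q] = g.
    """
    rng = np.random.default_rng() if rng is None else rng
    s = 2 ** num_bits - 1               # number of quantization levels
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    scaled = np.abs(g) / norm * s       # map |g_i| into [0, s]
    lower = np.floor(scaled)
    frac = scaled - lower               # distance to the lower level
    levels = lower + (rng.random(g.shape) < frac)  # unbiased rounding
    return np.sign(g) * levels / s * norm

# Example: quantize a random gradient to 4 bits per entry.
rng = np.random.default_rng(0)
g = rng.standard_normal(1000)
q = stochastic_quantize(g, num_bits=4, rng=rng)
```

A dynamic scheme, as motivated above, would vary `num_bits` across iterations instead of fixing it; the quantization noise here scales with the step size `norm / s`, so fewer bits trade accuracy for communication.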

