DQSGD: DYNAMIC QUANTIZED STOCHASTIC GRADIENT DESCENT FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

Abstract

Gradient quantization is widely adopted to mitigate communication costs in distributed learning systems. Existing gradient quantization algorithms often rely on design heuristics and/or empirical evidence to tune the quantization strategy for different learning problems. To the best of our knowledge, there is no theoretical framework characterizing the trade-off between communication cost and model accuracy under dynamic gradient quantization strategies. This paper addresses this issue by proposing a novel dynamic quantized SGD (DQSGD) framework, which enables us to optimize the quantization strategy for each gradient descent step by exploring the trade-off between communication cost and modeling error. In particular, we derive an upper bound, tight in some cases, on the modeling error for an arbitrary dynamic quantization strategy. By minimizing this upper bound, we obtain an enhanced quantization algorithm with significantly improved modeling error under given communication overhead constraints. Moreover, we show that our quantization scheme achieves a strengthened communication cost and model accuracy trade-off for a wide range of optimization models. Finally, through extensive experiments on large-scale computer vision and natural language processing tasks on the CIFAR-10, CIFAR-100, and AG-News datasets, we demonstrate that our quantization scheme significantly outperforms state-of-the-art gradient quantization methods in terms of communication costs.

1. INTRODUCTION

Recently, with the booming of Artificial Intelligence (AI), 5G wireless communications, and Cyber-Physical Systems (CPS), distributed learning plays an increasingly important role in improving the efficiency and accuracy of learning, scaling to large input data sizes, and bridging different wireless computing resources (Dean et al., 2012; Bekkerman et al., 2011; Chilimbi et al., 2014; Chaturapruek et al., 2015; Zhu et al., 2020; Mills et al., 2019). Distributed Stochastic Gradient Descent (SGD) lies at the core of the vast majority of distributed learning algorithms (e.g., various distributed deep neural networks): distributed nodes compute local gradients, and an aggregated gradient is obtained via communication among the nodes and/or a parameter server. However, due to the limited bandwidth of practical networks, the communication overhead of transferring gradients often becomes the performance bottleneck. Several approaches toward communication-efficient distributed learning have been proposed, including compressing gradients (Stich et al., 2018; Alistarh et al., 2017) and updating local models less frequently (McMahan et al., 2017). Gradient quantization reduces the communication overhead by using a few bits to approximate each original real value, and is considered one of the most effective approaches to this end (Seide et al., 2014; Alistarh et al., 2017; Bernstein et al., 2018; Wu et al., 2018; Suresh et al., 2017). Lossy quantization inevitably introduces gradient noise, which affects the convergence of the model. Hence, a key question is how to select the number of quantization bits to balance the trade-off between communication cost and convergence performance. Existing algorithms often quantize parameters into a fixed number of bits, which has been shown to be inefficient in balancing the communication-convergence trade-off (Seide et al., 2014; Alistarh et al., 2017; Bernstein et al., 2018).
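To make the quantization step concrete, the following is a minimal sketch of unbiased stochastic gradient quantization in the style of fixed-bit schemes such as QSGD; the function name and the max-norm scaling are illustrative choices, not the paper's exact scheme. Each coordinate is rounded at random to one of 2^b - 1 uniform levels so that the quantized gradient equals the original in expectation.

```python
import numpy as np

def stochastic_quantize(g, num_bits):
    """Uniform stochastic quantization of a gradient vector (sketch).

    Each coordinate is scaled by the vector's max-norm and rounded at
    random to one of 2**num_bits - 1 levels, so that the quantized
    vector is an unbiased estimate of the original gradient.
    """
    s = 2 ** num_bits - 1              # number of quantization levels
    norm = np.max(np.abs(g))
    if norm == 0:
        return np.zeros_like(g)
    scaled = np.abs(g) / norm * s      # position on the [0, s] grid
    lower = np.floor(scaled)
    # round up with probability equal to the fractional part (unbiased)
    levels = lower + (np.random.rand(*g.shape) < (scaled - lower))
    return np.sign(g) * norm * levels / s

g = np.array([0.3, -1.2, 0.7])
q = stochastic_quantize(g, num_bits=4)   # transmit sign, norm, and levels
```

Only the per-coordinate level indices (num_bits each), the signs, and one scalar norm need to be transmitted, which is the source of the communication savings; the price is the quantization noise discussed above.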
An efficient scheme should dynamically adjust the number of quantization bits according to the state of the current learning model at each gradient descent step, balancing communication overhead against model accuracy. Several studies attempt to construct adaptive quantization schemes through design heuristics and/or empirical evidence, but they lack a solid theoretical analysis (Guo et al., 2020; Cui et al., 2018; Oland & Raj, 2015), which even leads to contradictory conclusions. More specifically, MQGrad (Cui et al., 2018) and AdaQS (Guo et al., 2020) suggest using few quantization bits in early epochs and gradually increasing the number of bits in later epochs, whereas the scheme of Oland & Raj (2015) states that more quantization bits should be used for gradients with larger root-mean-squared (RMS) values, and thus uses more bits in the early training stage and fewer bits in the later stage. One of this paper's key contributions is to develop a theoretical framework that crystallizes the design trade-off in dynamic gradient quantization and settles this contradiction. In this paper, we propose a novel dynamic quantized SGD (DQSGD) framework for minimizing communication overhead in distributed learning while maintaining the desired learning accuracy. We study this dynamic quantization problem in both the strongly convex and the non-convex optimization settings. In the strongly convex setting, we first derive an upper bound on the difference between the loss after N iterations and the optimal loss (which we term the strongly convex convergence error), characterizing the error caused by sampling, a limited number of iteration steps, and quantization. In addition, we identify particular cases in which this upper bound is tight for the part of the convergence error caused by quantization.
In the non-convex optimization framework, we derive an upper bound on the mean square of the gradient norms over the iteration steps, which we term the non-convex convergence error. Based on this theoretical analysis, we design a dynamic quantization algorithm that minimizes the strongly convex/non-convex convergence error bound under communication cost constraints. Our dynamic quantization algorithm adjusts the number of quantization bits adaptively by taking into account the norm of the gradients, the communication budget, and the remaining number of iterations. We validate our theoretical analysis through extensive experiments on large-scale Computer Vision (CV) and Natural Language Processing (NLP) tasks, including image classification on CIFAR-10 and CIFAR-100 and text classification on AG-News. Numerical results show that our proposed DQSGD significantly outperforms the baseline quantization methods. To summarize, our key contributions are as follows:
• We propose a novel framework to characterize the trade-off between communication cost and modeling error when dynamically quantizing gradients in distributed learning.
• We derive an upper bound on the convergence error for strongly convex and non-convex objectives; the bound is shown to be tight in particular cases.
• We develop a dynamic quantized SGD strategy, which is shown to achieve a smaller convergence error upper bound than fixed-bit quantization methods.
• We validate the proposed DQSGD on a variety of real-world datasets and machine learning models, demonstrating that it significantly outperforms state-of-the-art gradient quantization methods in terms of mitigating communication costs.
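The bit-allocation idea can be illustrated with a simple heuristic sketch; the rule below is an assumption for exposition only, not the bound-minimizing allocation derived in the paper. It starts from the per-step average of the remaining bit budget and adds or removes bits when the current gradient norm is above or below its running average, so steps with larger gradients are quantized more finely.

```python
import numpy as np

def choose_bits(grad_norm, avg_norm, remaining_budget, remaining_steps,
                min_bits=2, max_bits=8):
    """Heuristic dynamic bit allocation (illustrative sketch only).

    Inputs: the current gradient norm, a running average of past norms,
    the remaining communication budget (in bits per coordinate summed
    over steps), and the number of remaining iterations.
    """
    base = remaining_budget / max(remaining_steps, 1)
    # one extra bit per doubling of the gradient norm relative to average
    if avg_norm > 0 and grad_norm > 0:
        adjust = np.log2(grad_norm / avg_norm)
    else:
        adjust = 0.0
    bits = int(round(base + adjust))
    return int(np.clip(bits, min_bits, max_bits))

# e.g., 40 bits left over 10 steps, gradient twice its running average:
b = choose_bits(grad_norm=2.0, avg_norm=1.0,
                remaining_budget=40, remaining_steps=10)  # -> 5 bits
```

The caller would subtract the chosen bit count from the budget after each step, so the scheme automatically becomes more frugal if early steps overspend.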

2. RELATED WORK

To solve large-scale machine learning problems, distributed SGD methods have attracted wide attention (Dean et al., 2012; Bekkerman et al., 2011; Chilimbi et al., 2014; Chaturapruek et al., 2015). To mitigate the communication bottleneck in distributed SGD, gradient quantization has been investigated. 1BitSGD uses a single bit to quantize each dimension of the gradients and achieves the desired goal in speech recognition applications (Seide et al., 2014). TernGrad quantizes gradients to the ternary levels {-1, 0, 1} to reduce the communication overhead (Wen et al., 2017). Furthermore, QSGD belongs to a family of compression schemes that use a fixed number of bits to quantize gradients, allowing the user to smoothly trade off communication and convergence time (Alistarh et al., 2017). However, these fixed-bit quantization methods may not be communication-efficient. To further reduce the communication overhead, some empirical studies dynamically adjust the quantization bits according to the current model parameters during training, such as the gradient's mean-to-standard-deviation ratio (Guo et al., 2020), the training loss (Cui et al., 2018), and the gradient's root-mean-squared value (Oland & Raj, 2015). Though these empirical heuristics of adaptive quan-

