DQSGD: DYNAMIC QUANTIZED STOCHASTIC GRADIENT DESCENT FOR COMMUNICATION-EFFICIENT DISTRIBUTED LEARNING

Abstract

Gradient quantization is widely adopted to mitigate communication costs in distributed learning systems. Existing gradient quantization algorithms often rely on design heuristics and/or empirical evidence to tune the quantization strategy for different learning problems. To the best of our knowledge, there is no theoretical framework characterizing the trade-off between communication cost and model accuracy under dynamic gradient quantization strategies. This paper addresses this issue by proposing a novel dynamic quantized SGD (DQSGD) framework, which enables us to optimize the quantization strategy for each gradient descent step by exploring the trade-off between communication cost and modeling error. In particular, we derive an upper bound, tight in some cases, on the modeling error for an arbitrary dynamic quantization strategy. By minimizing this upper bound, we obtain an enhanced quantization algorithm with significantly improved modeling error under given communication overhead constraints. Moreover, we show that our quantization scheme achieves a strengthened communication cost and model accuracy trade-off for a wide range of optimization models. Finally, through extensive experiments on large-scale computer vision and natural language processing tasks on the CIFAR-10, CIFAR-100, and AG-News datasets, we demonstrate that our quantization scheme significantly outperforms state-of-the-art gradient quantization methods in terms of communication costs.

1. INTRODUCTION

Recently, with the boom of Artificial Intelligence (AI), 5G wireless communications, and Cyber-Physical Systems (CPS), distributed learning plays an increasingly important role in improving the efficiency and accuracy of learning, scaling to large input data sizes, and bridging different wireless computing resources (Dean et al., 2012; Bekkerman et al., 2011; Chilimbi et al., 2014; Chaturapruek et al., 2015; Zhu et al., 2020; Mills et al., 2019). Distributed Stochastic Gradient Descent (SGD) is at the core of the vast majority of distributed learning algorithms (e.g., various distributed deep neural networks), where distributed nodes calculate local gradients and an aggregated gradient is obtained via communication among distributed nodes and/or a parameter server. However, due to the limited bandwidth of practical networks, the communication overhead of transferring gradients often becomes the performance bottleneck. Several approaches towards communication-efficient distributed learning have been proposed, including compressing gradients (Stich et al., 2018; Alistarh et al., 2017) and updating local models less frequently (McMahan et al., 2017). Gradient quantization reduces the communication overhead by using few bits to approximate the original real values, and is considered one of the most effective approaches to reducing communication overhead (Seide et al., 2014; Alistarh et al., 2017; Bernstein et al., 2018; Wu et al., 2018; Suresh et al., 2017). Lossy quantization inevitably introduces gradient noise, which affects the convergence of the model. Hence, a key question is how to effectively select the number of quantization bits to balance the trade-off between communication cost and convergence performance. Existing algorithms often quantize parameters into a fixed number of bits, which has been shown to be inefficient in balancing the communication-convergence trade-off (Seide et al., 2014; Alistarh et al., 2017; Bernstein et al., 2018).
An efficient scheme should dynamically adjust the number of quantization bits according to the state of the current learning model at each gradient descent step, so as to balance communication overhead and model accuracy. Several studies have tried to construct adaptive quantization schemes through design heuristics and/or empirical evidence. However, they lack a solid theoretical analysis (Guo et al., 2020; Cui et al., 2018; Oland & Raj, 2015), which even leads to contradictory conclusions. More specifically, MQGrad (Cui et al., 2018) and AdaQS (Guo et al., 2020) suggest using few quantization bits in early epochs and gradually increasing the number of bits in later epochs; the scheme proposed in (Oland & Raj, 2015) instead states that more quantization bits should be used for gradients with a larger root-mean-squared (RMS) value, and thus uses more bits in the early training stage and fewer bits in the later stage. One of this paper's key contributions is to develop a theoretical framework that crystallizes the design trade-off in dynamic gradient quantization and settles this contradiction. In this paper, we propose a novel dynamic quantized SGD (DQSGD) framework for minimizing communication overhead in distributed learning while maintaining the desired learning accuracy. We study this dynamic quantization problem in both the strongly convex and the non-convex optimization settings. In the strongly convex setting, we first derive an upper bound on the difference (which we term the strongly convex convergence error) between the loss after N iterations and the optimal loss, characterizing the convergence error caused by sampling, a limited number of iteration steps, and quantization. In addition, we identify particular cases in which this upper bound on the quantization part of the convergence error is tight.
In the non-convex optimization framework, we derive an upper bound on the mean square of the gradient norms over all iteration steps, which we term the non-convex convergence error. Based on the above theoretical analysis, we design a dynamic quantization algorithm by minimizing the strongly convex/non-convex convergence error bound under communication cost constraints. Our dynamic quantization algorithm adjusts the number of quantization bits adaptively by taking into account the norm of the gradients, the communication budget, and the remaining number of iterations. We validate our theoretical analysis through extensive experiments on large-scale Computer Vision (CV) and Natural Language Processing (NLP) tasks, including image classification on CIFAR-10 and CIFAR-100 and text classification on AG-News. Numerical results show that our proposed DQSGD significantly outperforms the baseline quantization methods. To summarize, our key contributions are as follows:
• We propose a novel framework to characterize the trade-off between communication cost and modeling error by dynamically quantizing gradients in distributed learning.
• We derive an upper bound on the convergence error for strongly convex and non-convex objectives. The upper bound is shown to be tight in particular cases.
• We develop a dynamic quantized SGD strategy, which is shown to achieve a smaller convergence error upper bound than fixed-bit quantization methods.
• We validate the proposed DQSGD on a variety of real-world datasets and machine learning models, demonstrating that it significantly outperforms state-of-the-art gradient quantization methods in terms of mitigating communication costs.

2. RELATED WORK

To solve large-scale machine learning problems, distributed SGD methods have attracted wide attention (Dean et al., 2012; Bekkerman et al., 2011; Chilimbi et al., 2014; Chaturapruek et al., 2015). To mitigate the communication bottleneck in distributed SGD, gradient quantization has been investigated. 1BitSGD uses 1 bit to quantize each dimension of the gradients and achieves the desired goal in speech recognition applications (Seide et al., 2014). TernGrad quantizes gradients to ternary levels {-1, 0, 1} to reduce the communication overhead (Wen et al., 2017). Furthermore, QSGD provides a family of compression schemes that use a fixed number of bits to quantize gradients, allowing the user to smoothly trade off communication and convergence time (Alistarh et al., 2017). However, these fixed-bit quantization methods may not be communication-efficient. To further reduce the communication overhead, some empirical studies dynamically adjust the quantization bits according to current model parameters during training, such as the gradient's mean-to-standard-deviation ratio (Guo et al., 2020), the training loss (Cui et al., 2018), or the gradient's root-mean-squared value (Oland & Raj, 2015). Though these heuristic adaptive quantization methods show good performance on certain tasks, their imprecise conjectures and the lack of theoretical guidelines have limited their generalization to a broad range of machine learning models and tasks.

3. PROBLEM FORMULATION

We consider minimizing the objective function F : ℝ^d → ℝ with parameter x:

min_{x ∈ ℝ^d} F(x) = E_{ξ∼D}[l(x; ξ)],   (1)

where the data point ξ is drawn from an unknown distribution D, and the loss function l(x; ξ) measures the loss of model x at data point ξ. Vanilla gradient descent (GD) solves this problem by updating the model parameters iteratively: x^{(n+1)} = x^{(n)} − η∇F(x^{(n)}), where x^{(n)} is the model parameter at iteration n, η is the learning rate, and ∇F(x^{(n)}) is the gradient of F at x^{(n)}. A modification of the GD scheme, minibatch SGD, uses mini-batches of random samples A_K = {ξ_0, ..., ξ_{K−1}} of size K to calculate the stochastic gradient g(x) = (1/K) Σ_{i=0}^{K−1} ∇l(x; ξ_i). In distributed learning, to reduce the communication overhead, we quantize the minibatch stochastic gradients:

x^{(n+1)} = x^{(n)} − η Q_{s_n}[g(x^{(n)})],   (2)

where Q_{s_n}[·] is the quantization operation applied to each dimension of g(x^{(n)}). The i-th component of the stochastic gradient vector g is quantized as

Q_s(g_i) = ‖g‖_p · sgn(g_i) · ζ(g_i, s),   (3)

where ‖g‖_p is the l_p norm of g; sgn(g_i) ∈ {+1, −1} is the sign of g_i; s is the quantization level; and ζ(g_i, s) is an unbiased stochastic function that maps the scalar |g_i|/‖g‖_p to one of the values in {0, 1/s, 2/s, ..., s/s}: if |g_i|/‖g‖_p ∈ [l/s, (l+1)/s], then

ζ(g_i, s) = l/s with probability 1 − p, and (l+1)/s with probability p = s|g_i|/‖g‖_p − l.   (4)

Note that the quantization level is roughly exponential in the number of quantized bits: if we use B bits to quantize g_i, one bit represents the sign and the other B − 1 bits represent ζ(g_i, s), resulting in a quantization level s = 2^{B−1} − 1. In total, we use B_pre + dB bits for gradient quantization at each iteration: B_pre bits of precision to encode ‖g‖_p and dB bits to express the d components of g.
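The stochastic quantizer of Eq. (3) can be sketched in a few lines of NumPy. This is an illustrative helper (not the authors' code); the l_∞ norm is used for scaling, as in our experiments:

```python
import numpy as np

def quantize(g, bits, rng=None):
    """Unbiased stochastic quantizer Q_s of Eq. (3), with s = 2**(bits-1) - 1.

    One bit encodes sgn(g_i); the remaining bits index the level l/s.
    Each |g_i|/||g|| is rounded up to the next level with probability
    p = s|g_i|/||g|| - l, which makes the quantizer unbiased.
    """
    rng = rng or np.random.default_rng()
    s = 2 ** (bits - 1) - 1
    norm = np.abs(g).max()          # ||g||_inf
    if norm == 0.0:
        return np.zeros_like(g)
    r = np.abs(g) / norm            # |g_i| / ||g||, in [0, 1]
    lower = np.floor(r * s)         # level index l
    p = r * s - lower               # probability of rounding up
    zeta = (lower + (rng.random(g.shape) < p)) / s
    return norm * np.sign(g) * zeta
```

Averaging many independent quantizations of the same gradient recovers the gradient itself, which is the unbiasedness property the analysis below relies on.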
Given a total number of training iterations N and an overall communication budget C for uploading all stochastic gradients, we would like to design a gradient quantization scheme that maximizes the learning performance. To measure the learning performance under gradient quantization, we follow the commonly adopted convex/non-convex convergence error δ(F, N, C) (Alistarh et al., 2017):

δ(F, N, C) = F(x^{(N)}) − F(x*) for strongly convex F,   δ(F, N, C) = (1/N) Σ_{n=0}^{N−1} ‖∇F(x^{(n)})‖²₂ for non-convex F,   (5)

where x* is the minimizer of F. In general, this error δ(F, N, C) is hard to determine exactly; instead, we aim to lower and upper bound this error and design corresponding quantization schemes.

4. DYNAMIC QUANTIZED SGD

In this section, we derive upper bounds on the strongly convex/non-convex convergence error δ(F, N, C) and a lower bound on the strongly convex convergence error. By minimizing the upper bound on this convergence error, we propose dynamic quantized SGD strategies for strongly convex and non-convex objective functions.

4.1. PRELIMINARIES

We first state some assumptions.

Assumption 1 (Smoothness). The objective function F(x) is L-smooth if ∀x, y ∈ ℝ^d, ‖∇F(x) − ∇F(y)‖₂ ≤ L‖x − y‖₂. This implies that ∀x, y ∈ ℝ^d,

F(y) ≤ F(x) + ∇F(x)ᵀ(y − x) + (L/2)‖y − x‖²₂,   (6)
‖∇F(x)‖²₂ ≤ 2L[F(x) − F(x*)].   (7)

Assumption 2 (Strong convexity). The objective function F(x) is µ-strongly convex if ∃µ > 0 such that F(x) − (µ/2)xᵀx is convex. From Assumption 2, ∀x, y ∈ ℝ^d,

F(y) ≥ F(x) + ∇F(x)ᵀ(y − x) + (µ/2)‖y − x‖²₂.   (8)

Assumption 3 (Variance bound). The stochastic gradient oracle gives an independent unbiased estimate ∇l(x; ξ) with bounded variance:

E_{ξ∼D}[∇l(x; ξ)] = ∇F(x),   (9)
E_{ξ∼D}[‖∇l(x; ξ) − ∇F(x)‖²₂] ≤ σ².   (10)

From Assumption 3, the minibatch stochastic gradient g(x) = (1/K) Σ_{i=0}^{K−1} ∇l(x; ξ_i) satisfies

E_{ξ∼D}[g(x)] = ∇F(x),   (11)
E_{ξ∼D}[‖g(x)‖²₂] ≤ ‖∇F(x)‖²₂ + σ²/K.   (12)

The gradients before and after quantization are related by Q_s[g(x)] = g(x) + ε̂, where ε̂ is the quantization noise, whose distribution is given in Proposition 1. The proof of Proposition 1 is in Appendix A.

Proposition 1 (Quantization noise magnitude). For the stochastic gradient vector g with quantization level s, the i-th component of the quantization noise has density

p(ε̂_i) = s/‖g‖_p − (s²/‖g‖²_p) ε̂_i,  for 0 < ε̂_i ≤ ‖g‖_p/s,
p(ε̂_i) = s/‖g‖_p + (s²/‖g‖²_p) ε̂_i,  for −‖g‖_p/s ≤ ε̂_i ≤ 0.   (13)

Following Proposition 1, we get E[Q_s[g]] = g and E[‖Q_s[g] − g‖²₂] = (d/(6s²))‖g‖²_p. This indicates that the quantization operation is unbiased, and that the variance of Q_s[g] is directly proportional to ‖g‖²_p and inversely proportional to s², which means that gradients with a larger norm should be quantized with more bits to keep E[‖Q_s[g] − g‖²₂] below a given noise level. Therefore, we have the following lemma characterizing the quantized gradient Q_s[g].

Lemma 1. For the quantized gradient vector Q_s[g], we have

E[Q_s[g]] = ∇F(x),   (14)
E[‖Q_s[g]‖²₂] ≤ ‖∇F(x)‖²₂ + σ²/K + (d/(6s²))‖g‖²_p.   (15)

The noise variance of Q_s[g] thus contains two parts: the sampling noise σ²/K and the quantization noise (d/(6s²))‖g‖²_p.
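The quantization-noise variance in Lemma 1 can be checked numerically. The following Monte-Carlo sketch (illustrative, not from the paper) compares the empirical E‖Q_s[g] − g‖² against the exact per-coordinate variance (‖g‖/s)² p_i(1 − p_i); averaging p(1 − p) over p uniform on [0, 1] gives the 1/6 factor in the d/(6s²)‖g‖²_p bound:

```python
import numpy as np

def quantize(g, s, rng):
    """Unbiased stochastic quantizer with quantization level s (l_inf scaling)."""
    norm = np.abs(g).max()
    r = np.abs(g) / norm
    lower = np.floor(r * s)
    lower += rng.random(g.shape) < (r * s - lower)
    return norm * np.sign(g) * lower / s

rng = np.random.default_rng(0)
d, s = 32, 7                      # s = 7 corresponds to 4-bit quantization
g = rng.standard_normal(d)
norm = np.abs(g).max()

# Exact variance: coordinate i rounds up with probability p_i = frac(s|g_i|/||g||),
# contributing (||g||/s)^2 * p_i * (1 - p_i) to E||Q[g] - g||^2.
p = (np.abs(g) / norm * s) % 1.0
exact = (norm / s) ** 2 * np.sum(p * (1 - p))

draws = np.stack([quantize(g, s, rng) for _ in range(20000)])
empirical = ((draws - g) ** 2).sum(axis=1).mean()
```

The empirical value matches the exact variance closely, and both are dominated by the worst-case bound d‖g‖²/(4s²) (since p(1 − p) ≤ 1/4).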

4.2. CONVERGENCE ERROR OF STRONGLY CONVEX OBJECTIVES

First, we consider a strongly convex optimization problem. Applying the quantized SGD update (2) to smooth, strongly convex functions yields the following result, with proof given in Appendix B.

Theorem 1 (Convergence Error Bound of Strongly Convex Objectives). For the problem in Eq. (1) under Assumptions 1 and 2 with initial parameter x^{(0)}, using the quantized gradients in Eq. (2) for iteration, the convergence error is bounded by

E[F(x^{(N)}) − F(x*)] ≤ α^N [F(x^{(0)}) − F(x*)] + (Lη²σ²(1 − α^N))/(2K(1 − α)) + (Ldη²/12) Σ_{n=0}^{N−1} α^{N−1−n} (1/s_n²) ‖g(x^{(n)})‖²_p,

E[F(x^{(N)}) − F(x*)] ≥ β^N [F(x^{(0)}) − F(x*)] + (µη²σ²(1 − β^N))/(2K(1 − β)) + (µdη²/12) Σ_{n=0}^{N−1} β^{N−1−n} (1/s_n²) ‖g(x^{(n)})‖²_p,

where α = 1 − 2µη + Lµη² and β = 1 − 2Lη + Lµη².

The convergence error consists of three parts: the error of the gradient descent method, which tends to 0 as the number of iterations N increases and also depends on the learning rate η (from the expression of α, we can see that when η ≤ 1/L, α decreases as η increases, and the convergence of the model is accelerated); the sampling error, which can be reduced by increasing the batch size K or decaying the learning rate; and the convergence error due to quantization, which we want to minimize. Note that the upper bound on the convergence error due to quantization is positively correlated with the variance of the quantization noise, and the contribution of the quantization noise to the error is larger in the late stage of training. Therefore, noise reduction helps improve model accuracy; in other words, more quantization bits should be used in the later training period. In addition, the upper and lower bounds match each other in some particular cases. As a simple example, consider a quadratic problem F(x) = (1/2)xᵀHx + Aᵀx + B, where the Hessian matrix is isotropic, H = λI, A ∈ ℝ^d, and B is a constant. Clearly L = µ, so α = β and the upper bound equals the lower bound.

Theorem 2 (Convergence Error of Quadratic Functions). For the quadratic optimization problem F(x) = (1/2)xᵀHx + Aᵀx + B, consider the Gaussian noise model

x^{(n+1)} = x^{(n)} − η∇F(x^{(n)}) − ηε^{(n)},   ε^{(n)} ∼ N(0, Σ(x^{(n)})).   (16)

Then

E[F(x^{(N)}) − F(x*)] = (1/2)(x^{(0)} − x*)ᵀ (ρ^N)ᵀ H ρ^N (x^{(0)} − x*) + (η²/2) Σ_{n=0}^{N−1} Tr[ρ^{N−1−n} Σ(x^{(n)}) H (ρ^{N−1−n})ᵀ],   (17)

where ρ = I − ηH and H is the Hessian matrix. The detailed proof is in Appendix C.
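In the noiseless special case Σ(x) = 0, the equality of Theorem 2 reduces to F(x^{(N)}) − F(x*) = (1/2)(x^{(0)} − x*)ᵀ(ρ^N)ᵀHρ^N(x^{(0)} − x*), which can be checked directly on a random quadratic. The following verification sketch (illustrative, not from the paper) runs plain gradient descent and compares against the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, N = 5, 0.01, 50
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)             # symmetric positive definite Hessian
A = rng.standard_normal(d)
F = lambda x: 0.5 * x @ H @ x + A @ x   # F(x) = (1/2) x^T H x + A^T x (+ const)
x_star = np.linalg.solve(H, -A)         # stationary point: H x* + A = 0

# N steps of noiseless gradient descent (the Sigma = 0 case of Theorem 2).
x0 = rng.standard_normal(d)
x = x0.copy()
for _ in range(N):
    x = x - eta * (H @ x + A)

# Closed form: x^(N) - x* = rho^N (x^(0) - x*), with rho = I - eta * H.
rho_N = np.linalg.matrix_power(np.eye(d) - eta * H, N)
predicted = 0.5 * (x0 - x_star) @ rho_N.T @ H @ rho_N @ (x0 - x_star)
```

The iteratively computed F(x^{(N)}) − F(x*) agrees with the closed form up to floating-point error, since the recursion x^{(n+1)} − x* = ρ(x^{(n)} − x*) is exact for quadratics.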

4.3. DQSGD FOR STRONGLY CONVEX OBJECTIVES

We determine the dynamic quantization strategy by minimizing the upper bound on the convergence error due to quantization. The optimization problem is

min_{B_n} Σ_{n=0}^{N−1} α^{N−1−n} (1/(2^{B_n−1} − 1)²) ‖g(x^{(n)})‖²_p,   s.t. Σ_{n=0}^{N−1} (dB_n + B_pre) = C.

Solving this optimization problem yields

B_n = log₂[k α^{(N−n)/2} ‖g(x^{(n)})‖_p + 1] + 1,   (18)

where k depends on the total communication budget C, and α is related to the convergence rate of the model: the larger the total communication budget C, the greater k; the faster the model converges, the smaller α. In Appendix E, we prove that our scheme outperforms fixed-bit schemes in terms of the convergence error.
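The constant k is pinned down by the budget constraint. Since the total bit count is monotone in k, it can be found by bisection; the sketch below (a hypothetical helper with an assumed real-valued relaxation of B_n, not the authors' code) allocates bits for a given sequence of gradient norms:

```python
import math

def allocate_bits(grad_norms, alpha, d, C, B_pre=32):
    """Choose k so that B_n = log2(k * alpha**((N-n)/2) * ||g_n|| + 1) + 1
    exhausts the budget C (real-valued relaxation, bisection on k)."""
    N = len(grad_norms)

    def total(k):
        # Total bits: sum over iterations of d*B_n + B_pre.
        return sum(d * (math.log2(k * alpha ** ((N - n) / 2) * gn + 1) + 1) + B_pre
                   for n, gn in enumerate(grad_norms))

    lo, hi = 1e-12, 1.0
    while total(hi) < C:        # grow the bracket until it covers the budget
        hi *= 2
    for _ in range(100):        # bisection: total(k) is monotone increasing
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if total(mid) < C else (lo, mid)
    k = (lo + hi) / 2
    return [math.log2(k * alpha ** ((N - n) / 2) * gn + 1) + 1
            for n, gn in enumerate(grad_norms)]
```

With constant gradient norms and α < 1, the factor α^{(N−n)/2} grows with n, so the allocation assigns more bits to later iterations, matching the discussion after Theorem 1.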

4.4. DQSGD FOR NON-CONVEX OBJECTIVES

In general, for non-convex smooth objective functions, we have the following theorem, with proof given in Appendix D.

Theorem 3 (Convergence Error Bound of Non-Convex Objectives). For the problem in Eq. (1) under Assumption 1, with initial parameter x^{(0)}, using the quantized gradients in Eq. (2) for iteration, the convergence error is upper bounded by

(1/N) Σ_{n=0}^{N−1} E[‖∇F(x^{(n)})‖²₂] ≤ (2/(2Nη − LNη²))[F(x^{(0)}) − F(x*)] + Lησ²/((2 − Lη)K) + (Ldη/(6(2 − Lη)N)) Σ_{n=0}^{N−1} (1/s_n²) ‖g(x^{(n)})‖²_p.   (19)

Similarly, the convergence error consists of three parts: the error of the gradient descent method, which tends to 0 as the number of iterations N increases; the sampling error, which can be reduced by increasing the batch size K or decaying the learning rate; and the convergence error due to quantization, which we want to minimize. Thus, the optimization problem is

min_{B_n} Σ_{n=0}^{N−1} (1/s_n²) ‖g(x^{(n)})‖²_p,   s.t. Σ_{n=0}^{N−1} (dB_n + B_pre) = C.

Solving this optimization problem yields

B_n = log₂[t ‖g(x^{(n)})‖_p + 1] + 1,   (20)

where t depends on the total communication budget C. In Appendix E, we also give a detailed comparison of the convergence error upper bound of our scheme with that of fixed-bit schemes.

4.5. DQSGD IN DISTRIBUTED LEARNING

Next, we consider the deployment of our proposed DQSGD algorithm in the distributed learning setting. A set of W workers proceed in synchronous steps, and each worker has a complete copy of the model. In each communication round, workers compute their local gradients and communicate them to the parameter server, while the server aggregates these gradients and updates the model parameters. If ĝ_l(x^{(n)}) is the quantized stochastic gradient of the l-th worker and x^{(n)} is the model parameter the workers hold at iteration n, then the updated value of x by the end of this iteration is x^{(n+1)} = x^{(n)} − η Ĝ(x^{(n)}), where Ĝ(x^{(n)}) = (1/W) Σ_{l=1}^{W} ĝ_l(x^{(n)}). The pseudocode is given in Algorithm 2 in Appendix E.
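One synchronous round of this worker/server protocol can be sketched as follows (an illustrative simulation with an assumed `server_step` helper, not the authors' implementation):

```python
import numpy as np

def quantize(g, bits, rng):
    """Unbiased stochastic quantizer of Eq. (3) with s = 2**(bits-1) - 1."""
    s = 2 ** (bits - 1) - 1
    norm = np.abs(g).max()
    if norm == 0.0:
        return g.copy()
    r = np.abs(g) / norm
    lower = np.floor(r * s)
    lower += rng.random(g.shape) < (r * s - lower)
    return norm * np.sign(g) * lower / s

def server_step(x, worker_grads, bits, eta, rng):
    """One synchronous round: each of the W workers quantizes its local
    gradient; the server averages the quantized gradients and updates x."""
    G = np.mean([quantize(g, bits, rng) for g in worker_grads], axis=0)
    return x - eta * G
```

On a toy quadratic F(x) = ½‖x‖² with W = 8 workers, each holding a noisy copy of the true gradient x, iterating `server_step` drives ‖x‖ toward a small noise floor, since the averaged quantized gradient is an unbiased estimate of ∇F(x).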

5. EXPERIMENTS

In this section, we conduct experiments on CV and NLP tasks on three datasets: AG-News (Zhang et al., 2015), CIFAR-10, and CIFAR-100 (Krizhevsky et al., 2009), to validate the effectiveness of our proposed DQSGD method. We use the testing accuracy to measure the learning performance and the compression ratio to measure the communication cost. We compare our proposed DQSGD with the following baselines: SignSGD (Seide et al., 2014), TernGrad (Wen et al., 2017), QSGD (Alistarh et al., 2017), Adaptive (Oland & Raj, 2015), and AdaQS (Guo et al., 2020). We conduct experiments with W = 8 workers and use canonical networks to evaluate the performance of the different algorithms: BiLSTM for text classification on AG-News, ResNet18 for image classification on CIFAR-10, and ResNet34 for image classification on CIFAR-100. A detailed description of the three datasets, the baseline algorithms, and the experimental setting is given in Appendix F.

Test Accuracy vs. Compression Ratio. In Table 1, we compare the testing accuracy and compression ratio of the different algorithms on the different tasks. Although SignSGD, TernGrad, and QSGD (4 bits) achieve compression ratios greater than 8, they cannot reach test accuracies above 0.8895, 0.8545, and 0.6840 on the AG-News, CIFAR-10, and CIFAR-100 tasks, respectively. In contrast, QSGD (6 bits), Adaptive, AdaQS, and DQSGD achieve test accuracies above 0.8986, 0.8785, and 0.6939, respectively. Among them, our proposed DQSGD saves 4.11%-21.73%, 22.36%-25%, and 11.89%-24.07% in communication cost compared with the other three algorithms.

Fixed Bits vs. Adaptive Bits. Figure 1 shows the comparison of the fixed-bit algorithm QSGD and our proposed DQSGD on CIFAR-10. Although QSGD (2 bits) and QSGD (4 bits) incur less communication cost, they suffer up to about 14% and 2.7% accuracy degradation, respectively, compared with vanilla SGD.
The accuracy of QSGD (6 bits) and DQSGD is similar to that of vanilla SGD, but the communication overhead of DQSGD is reduced by up to 25% compared with that of QSGD (6 bits). This shows that our dynamic quantization strategy can effectively reduce the communication cost compared with fixed quantization schemes. Figure 2 shows the accuracy of QSGD and DQSGD under different compression ratios. It can be seen that DQSGD achieves higher accuracy than QSGD under the same communication cost.

A PROOF OF PROPOSITION 1

Suppose g_i/‖g‖_p ∼ U(l/s, (l+1)/s) and let ε_i = g_i/‖g‖_p − ζ(g_i, s). For 0 < ε_0 < 1/s, we have

p{ε_i = ε_0} = p{g_i/‖g‖_p = l/s + ε_0} · p{ζ(g_i, s) = l/s | g_i/‖g‖_p = l/s + ε_0} = s · (1/s − ε_0)/(1/s) = s − s²ε_0.

Similarly, for −1/s < ε_0 < 0, we have

p{ε_i = ε_0} = p{g_i/‖g‖_p = (l+1)/s + ε_0} · p{ζ(g_i, s) = (l+1)/s | g_i/‖g‖_p = (l+1)/s + ε_0} = s · (1/s + ε_0)/(1/s) = s + s²ε_0.

Considering that Q_s(g_i) = ‖g‖_p · sgn(g_i) · ζ(g_i, s) and letting ε̂_i = g_i − Q_s(g_i), we have

p(ε̂_i) = s/‖g‖_p − (s²/‖g‖²_p) ε̂_i,  for 0 < ε̂_i ≤ ‖g‖_p/s,
p(ε̂_i) = s/‖g‖_p + (s²/‖g‖²_p) ε̂_i,  for −‖g‖_p/s ≤ ε̂_i ≤ 0.

B PROOF OF THEOREM 1

Since F is L-smooth, Assumption 1 gives

F(x^{(n+1)}) ≤ F(x^{(n)}) + ∇F(x^{(n)})ᵀ(x^{(n+1)} − x^{(n)}) + (L/2)‖x^{(n+1)} − x^{(n)}‖²₂.

For quantized SGD, x^{(n+1)} = x^{(n)} − η Q_{s_n}[g(x^{(n)})], so

F(x^{(n+1)}) ≤ F(x^{(n)}) − η ∇F(x^{(n)})ᵀ Q_{s_n}[g(x^{(n)})] + (Lη²/2)‖Q_{s_n}[g(x^{(n)})]‖²₂.

Taking total expectations and using Lemma 1 yields

E[F(x^{(n+1)})] ≤ F(x^{(n)}) + (−η + Lη²/2)‖∇F(x^{(n)})‖²₂ + Lη²σ²/(2K) + (Lη²d/(12 s_n²))‖g(x^{(n)})‖²_p.

Since F is µ-strongly convex, Assumption 2 implies ‖∇F(x)‖²₂ ≥ 2µ[F(x) − F(x*)]; as the coefficient (−η + Lη²/2) is negative for η ≤ 2/L, we obtain

E[F(x^{(n+1)})] ≤ F(x^{(n)}) − (2µη − Lµη²)[F(x^{(n)}) − F(x*)] + Lη²σ²/(2K) + (Lη²d/(12 s_n²))‖g(x^{(n)})‖²_p.

Subtracting F(x*) from both sides and letting α = 1 − 2µη + Lµη²,

E[F(x^{(n+1)}) − F(x*)] ≤ α[F(x^{(n)}) − F(x*)] + Lη²σ²/(2K) + (Lη²d/(12 s_n²))‖g(x^{(n)})‖²_p.

Applying this recursively,

E[F(x^{(N)}) − F(x*)] ≤ α^N[F(x^{(0)}) − F(x*)] + (Lη²σ²(1 − α^N))/(2K(1 − α)) + (Ldη²/12) Σ_{n=0}^{N−1} α^{N−1−n}(1/s_n²)‖g(x^{(n)})‖²_p.

Similarly, letting β = 1 − 2Lη + Lµη², we obtain the lower bound

E[F(x^{(N)}) − F(x*)] ≥ β^N[F(x^{(0)}) − F(x*)] + (µη²σ²(1 − β^N))/(2K(1 − β)) + (µdη²/12) Σ_{n=0}^{N−1} β^{N−1−n}(1/s_n²)‖g(x^{(n)})‖²_p.

C PROOF OF THEOREM 2

Both SGD and QSGD can be viewed as a general kind of optimization dynamics, namely gradient descent with unbiased noise. Based on the central limit theorem, we assume the noise caused by sampling and quantization is Gaussian, i.e., Q_{s_n}[g(x^{(n)})] = ∇F(x^{(n)}) + ε^{(n)} with ε^{(n)} ∼ N(0, Σ(x^{(n)})). We can therefore consider Eq. (2) as the discretization of a Gaussian process:

x^{(n+1)} = x^{(n)} − η∇F(x^{(n)}) − ηε^{(n)},   ε^{(n)} ∼ N(0, Σ(x^{(n)})).

The error of general Gaussian processes is hard to analyze due to the intractability of the integrals, so we only consider the quadratic problem, for which

x^{(n+1)} = x^{(n)} − η[Hx^{(n)} + A] − ηε^{(n)} = (I − ηH)x^{(n)} − ηA − ηε^{(n)}.

Considering that ∇F(x*) = Hx* + A = 0, subtracting x* from both sides and rearranging yields

x^{(n+1)} − x* = (I − ηH)x^{(n)} − ηA − x* − ηε^{(n)} = (I − ηH)(x^{(n)} − x*) − η(Hx* + A) − ηε^{(n)} = (I − ηH)(x^{(n)} − x*) − ηε^{(n)}.

Applying this recursion repeatedly and computing E[F(x^{(N)}) − F(x*)] term by term yields the equality stated in Theorem 2.

E ALGORITHM

Algorithm 1 Dynamic quantized SGD
1: Input: Learning rate η, initial point x^{(0)} ∈ ℝ^d, hyperparameters k, α
2: for each iteration n = 0, 1, ..., N − 1 do
3:   g(x^{(n)}) ← compute gradient of a batch of data
4:   ‖g(x^{(n)})‖ ← calculate the norm of g(x^{(n)})
5:   B_n ← determine the quantization bits
6:   ĝ(x^{(n)}) ← quantize(g(x^{(n)}), B_n)
7:   Update the parameter: x^{(n+1)} = x^{(n)} − η ĝ(x^{(n)})
8: end for
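Algorithm 1 can be sketched end-to-end on a toy strongly convex problem. This is an illustrative implementation with assumed hyperparameters k and α (not the authors' released code); the bit rule is the strongly convex allocation B_n = log₂(k α^{(N−n)/2}‖g‖ + 1) + 1, floored and clipped to at least 2 bits:

```python
import numpy as np

def quantize(g, bits, rng):
    """Unbiased stochastic quantizer of Eq. (3) with s = 2**(bits-1) - 1."""
    s = 2 ** (bits - 1) - 1
    norm = np.abs(g).max()
    if norm == 0.0:
        return np.zeros_like(g)
    r = np.abs(g) / norm
    lower = np.floor(r * s)
    lower += rng.random(g.shape) < (r * s - lower)
    return norm * np.sign(g) * lower / s

def dqsgd(grad_fn, x0, eta, N, k, alpha, rng):
    """Algorithm 1 sketch: per-step bits from the strongly convex rule (18)."""
    x = x0.copy()
    for n in range(N):
        g = grad_fn(x, rng)                          # minibatch gradient
        norm = np.abs(g).max()                       # ||g||_inf
        bits = max(2, int(np.log2(k * alpha ** ((N - n) / 2) * norm + 1)) + 1)
        x = x - eta * quantize(g, bits, rng)         # quantized update
    return x

# Toy problem: F(x) = 0.5 ||x||^2 with noisy gradients g = x + xi.
rng = np.random.default_rng(0)
grad = lambda x, rng: x + 0.05 * rng.standard_normal(x.shape)
x_final = dqsgd(grad, x0=5.0 * np.ones(20), eta=0.1, N=300,
                k=50.0, alpha=0.95, rng=rng)
```

Because the quantizer is unbiased, the iterates contract toward the optimum at the usual SGD rate, with the quantization noise floor shrinking as the allocated bits grow in the later iterations.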

We make the following assumption.

Assumption 4 (Second moment bound). If F(x) is L-smooth, the l_p norm of the minibatch stochastic gradient g(x) satisfies

‖g(x)‖²_p ≤ 2L[F(x) − F(x*)]^γ.

Note that Assumption 4 is a generalization of Eq. (7). Based on quantization scheme (18) and Assumption 4, the quantization error satisfies

δ_DQSGD ≤ (L²η²d α^{N−1}[F(x^{(0)}) − F(x*)]^γ) / (6 × 4^{(C−32N−dN)/(dN)}) × N α^{(γ−1)(N−1)/2}.   (23)

Accordingly, if we fix the number of bits, the quantization error satisfies

δ_Fixed ≤ (L²η²d α^{N−1}[F(x^{(0)}) − F(x*)]^γ) / (6 × 4^{(C−32N−dN)/(dN)}) × Σ_{n=0}^{N−1} α^{(γ−1)n}.   (24)

Comparing (23) and (24), we can see that our scheme reduces the error bound by approximately

δ_Fixed − δ_DQSGD ≈ (L²η²d α^{N−1}[F(x^{(0)}) − F(x*)]^γ) / (6 × 4^{(C−32N−dN)/(dN)}) × λ₁,   (25)

where λ₁ = Σ_{n=0}^{N−1} α^{(γ−1)n} − N α^{(γ−1)(N−1)/2} is the difference between the arithmetic and geometric means of the terms α^{(γ−1)n}. When γ = 1, λ₁ = 0; when γ ≠ 1, λ₁ > 0.
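The sign of λ₁ follows from the AM-GM inequality: the subtracted term is N times the geometric mean of the N positive summands, since their exponents average to (γ − 1)(N − 1)/2. A quick numerical check (illustrative, not from the paper):

```python
def lam1(alpha, gamma, N):
    """lambda_1 = sum_n alpha^((gamma-1)n) - N * alpha^((gamma-1)(N-1)/2).

    The subtracted term is N times the geometric mean of the summands
    (the exponents 0, ..., (gamma-1)(N-1) average to (gamma-1)(N-1)/2),
    so AM-GM gives lam1 >= 0, with equality iff all summands coincide.
    """
    terms = [alpha ** ((gamma - 1) * n) for n in range(N)]
    geo_mean = alpha ** ((gamma - 1) * (N - 1) / 2)
    return sum(terms) - N * geo_mean
```

For γ = 1 all summands equal 1 and the gap vanishes; for any γ ≠ 1 (with 0 < α < 1) the summands differ and the gap, hence the error-bound reduction of DQSGD over fixed bits, is strictly positive.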
Based on quantization scheme (20) and Assumption 4, the quantization error satisfies

δ_DQSGD ≤ (L²ηd[F(x^{(0)}) − F(x*)]^γ) / (3N(2 − Lη) × 4^{(C−32N−dN)/(dN)}) × N α^{γ(N−1)/2}.   (26)

Accordingly, if we fix the number of bits, the quantization error satisfies

δ_Fixed ≤ (L²ηd[F(x^{(0)}) − F(x*)]^γ) / (3N(2 − Lη) × 4^{(C−32N−dN)/(dN)}) × Σ_{n=0}^{N−1} α^{γn}.   (27)

Comparing (26) and (27), we can see that our scheme reduces the error bound by approximately

δ_Fixed − δ_DQSGD ≈ (L²ηd[F(x^{(0)}) − F(x*)]^γ) / (3N(2 − Lη) × 4^{(C−32N−dN)/(dN)}) × λ₂,

where λ₂ = Σ_{n=0}^{N−1} α^{γn} − N α^{γ(N−1)/2} is the difference between the arithmetic and geometric means of the terms α^{γn}. Since γ ≠ 0, λ₂ > 0.

F EXPERIMENTS F.1 DATASETS AND BASELINE

We evaluate our method DQSGD on three datasets: AG-News, CIFAR-10, and CIFAR-100. The AG-News dataset (Zhang et al., 2015) contains news articles in four categories, with 30,000 training samples and 1,900 testing samples per class. The CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) each contain 60,000 32 × 32 RGB images, divided into 10 and 100 classes, respectively. We compare DQSGD with the following gradient quantization methods:
• SignSGD (Seide et al., 2014): takes the sign of each coordinate of the stochastic gradient vector.
• TernGrad (Wen et al., 2017): quantizes gradients to ternary levels {−1, 0, 1}.
• QSGD (Alistarh et al., 2017): a family of compression schemes; the specific quantization operation is shown in Eq. (3). In our experiments, we replace the ‖g‖₂ of the original paper with ‖g‖_∞.
• Adaptive (Oland & Raj, 2015): a dynamic scheme that uses more quantization bits for gradients with a larger root-mean-squared (RMS) value.
• AdaQS (Guo et al., 2020): an adaptive quantization scheme that uses few quantization bits in early epochs and gradually increases the number of bits in later epochs.

Table 2: Baselines
Method                        | Unbiased | Basis for determining bits
SignSGD (Seide et al., 2014)  | No       | Fixed bits
TernGrad (Wen et al., 2017)   | Yes      | Fixed bits
QSGD (Alistarh et al., 2017)  | Yes      | Fixed bits
Adaptive (Oland & Raj, 2015)  | Yes      | Gradient's root-mean-squared value
AdaQS (Guo et al., 2020)      | Yes      | Gradient's mean-to-standard-deviation ratio, iteration number

F.2 EXPERIMENTAL SETUP

We conduct simulations with W = 8 workers. For AG-News, we use 300-dimensional embeddings pre-trained with GloVe; each word is then encoded sequentially by a two-layer bidirectional LSTM (BiLSTM), and a self-attention mechanism is used to obtain the sentence embedding. The classifier consists of two fully connected layers with 128 and 4 neurons, respectively. We train CIFAR-10 on ResNet18 (He et al., 2016) and CIFAR-100 on ResNet34 (He et al., 2016). Other parameter information is shown in Table 3. All results are averaged over four random runs.



Figure 1 (a) and Figure 1 (b) show the testing accuracy curves and the training loss curves, respectively. Figure 1 (c) shows the bits allocation of each iteration of DQSGD, and Figure 1 (d) represents the communication overhead used in the training process of different quantization schemes.

Figure 1: The comparison results of QSGD and DQSGD on CIFAR-10.

Figure 2: Testing accuracy of QSGD and DQSGD under different compression ratios on CIFAR-10.

Algorithm 2 Dynamic QSGD in Distributed Learning
1: Input: Learning rate η, initial point x^{(0)} ∈ ℝ^d, hyperparameters k, α
2: for each iteration n = 0, 1, ..., N − 1 do
3:   On each worker l = 1, ..., W:
4:     g_l(x^{(n)}) ← compute gradient w.r.t. a batch of data
5:     ĝ_l(x^{(n)}) ← quantize(g_l(x^{(n)}), B_n)
6:     send ĝ_l(x^{(n)}) to server
7:     receive Ĝ(x^{(n)}) and B_{n+1} from server
8:   On server:
9:     collect all W gradients ĝ_l(x^{(n)}) from workers
10:    average: Ĝ(x^{(n)}) = (1/W) Σ_{l=1}^{W} ĝ_l(x^{(n)})
11:    ‖Ĝ(x^{(n)})‖ ← calculate the norm of Ĝ(x^{(n)})
12:    B_{n+1} ← determine the quantization bits for the next iteration (‖Ĝ(x^{(n)})‖)
13:    send Ĝ(x^{(n)}) and B_{n+1} to all workers
14: end for




C PROOF OF THEOREM 2 (CONTINUED)

Applying the recursion x^{(n+1)} − x* = (I − ηH)(x^{(n)} − x*) − ηε^{(n)} repeatedly and letting ρ = I − ηH, we have

x^{(N)} − x* = ρ^N (x^{(0)} − x*) − η Σ_{n=0}^{N−1} ρ^{N−1−n} ε^{(n)}.

(In continuous time, the accumulated noise can equivalently be written as an Itô integral driven by a standard d-dimensional Wiener process W.) Since F(x) − F(x*) = (1/2)(x − x*)ᵀ H (x − x*), subtracting F(x*) from both sides, taking total expectations, and using the independence and zero mean of the ε^{(n)} (the isometry property of the Itô integral), the cross terms vanish and we obtain

E[F(x^{(N)}) − F(x*)] = (1/2)(x^{(0)} − x*)ᵀ (ρ^N)ᵀ H ρ^N (x^{(0)} − x*) + (η²/2) Σ_{n=0}^{N−1} Tr[ρ^{N−1−n} Σ(x^{(n)}) H (ρ^{N−1−n})ᵀ],

which completes the proof of Theorem 2.

D PROOF OF THEOREM 3

Since F is L-smooth, using the per-step bound derived in Appendix B, we have

E[F(x^{(n+1)})] ≤ F(x^{(n)}) + (−η + Lη²/2)‖∇F(x^{(n)})‖²₂ + Lη²σ²/(2K) + (Lη²d/(12 s_n²))‖g(x^{(n)})‖²_p.

Subtracting F(x^{(n)}) from both sides, taking total expectations, and summing over n = 0, ..., N − 1 yields

E[F(x^{(N)}) − F(x^{(0)})] ≤ −(η − Lη²/2) Σ_{n=0}^{N−1} E[‖∇F(x^{(n)})‖²₂] + LNη²σ²/(2K) + (Ldη²/12) Σ_{n=0}^{N−1} (1/s_n²)‖g(x^{(n)})‖²_p.

Considering that F(x^{(N)}) ≥ F(x*) and rearranging, we obtain

(1/N) Σ_{n=0}^{N−1} E[‖∇F(x^{(n)})‖²₂] ≤ (2/(2Nη − LNη²))[F(x^{(0)}) − F(x*)] + Lησ²/((2 − Lη)K) + (Ldη/(6(2 − Lη)N)) Σ_{n=0}^{N−1} (1/s_n²)‖g(x^{(n)})‖²_p.

