BINSGDM: EXTREME ONE-BIT QUANTIZATION FOR COMMUNICATION EFFICIENT LARGE-SCALE DISTRIBUTED TRAINING

Abstract

To alleviate the communication bottleneck of large-scale distributed training, a rich body of communication-compression optimizers has been proposed. These methods focus mainly on achieving a high compression ratio in the expectation of acceleration. However, some recent works have pointed out that, when run with distributed training frameworks (e.g., DistributedDataParallel in PyTorch), these methods provide no acceleration over off-the-shelf uncompressed SGD/Adam in typical settings, because of heavy compression/decompression computation, incompatibility with efficient communication primitives, or the requirement of an uncompressed warmup at the early stage. For these reasons, we propose a novel extreme one-bit quantization optimizer, dubbed BinSGDM. The quantization in BinSGDM is computationally light, and it does not need to resort to uncompressed optimizers for warmup. We also theoretically prove that it promises the same convergence speed as the original Adam. Moreover, we present a hierarchical 1-bit All-Reduce technique to further lower the communication volume. Extensive experiments are conducted on 8 to 64 GPUs (1 to 8 nodes) for distributed training with DistributedDataParallel, and the experimental results demonstrate that BinSGDM with this communication scheme achieves up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base, compared to the full-precision optimizers.

* This step follows the technique in AMSGrad (Reddi et al. (2018)). It is mainly of theoretical significance, and we commonly do not implement it in practice.

1. INTRODUCTION

With the rapid development of computational power, ever bigger deep neural network (DNN) models are being proposed in pursuit of better performance, from early classical models such as AlexNet (61M parameters) (Krizhevsky et al. (2017)) and ResNet (ResNet-50: 20.5M parameters) (He et al. (2016)) to current foundation models such as BERT (BERT-Large: 340M parameters) (Devlin et al. (2018)) and GPT (GPT-3: 175B parameters) (Brown et al. (2020)). Scalable parallelism across distributed computing workers becomes a necessity for training these large-scale models. During training, millions to billions of parameters need to be communicated among workers at each iteration, and the expensive communication cost becomes a bottleneck.

To address the communication bottleneck, a wide variety of lossy gradient compression optimizers have been proposed to lower the communication volume. These algorithms typically fall into three groups: low-precision approximation (e.g., 1-bit SGD (Seide et al. (2014)), SignSGD (Bernstein et al. (2018)), TernGrad (Wen et al. (2017)), QSGD (Alistarh et al. (2017)), and 1-bit Adam (Tang et al. (2021))), low-rank simplification (e.g., ATOMO (Wang et al. (2018)), PowerSGD (Vogels et al. (2019)), and GradZip (Cho et al. (2019))), and sparsification (e.g., Random-k (Stich et al. (2018)), Top-k (Aji & Heafield (2017)), and MSTop-k (Shi et al. (2021))).

While much of the research on gradient compression algorithms has focused on achieving a high compression ratio, a more important yet underexplored problem is how to decrease the actual system-level runtime and increase the distributed scaling efficiency. Some recent works (Xu et al. (2020), Agarwal et al. (2022)) pointed out that, when training typical models (e.g., ResNet-50 and BERT-Base) in a distributed manner with off-the-shelf DistributedDataParallel (DDP) at typical bandwidths (e.g., 10 Gbps), existing gradient compression algorithms with high compression ratios are still slower than the original uncompressed optimizers. This is because they exhibit one or more of the following weaknesses (Xu et al. (2020), Agarwal et al. (2022)): (i) some gradient compression algorithms must perform compression/decompression and communication within a limited time frame, and the time cost of compression/decompression is, in some cases, close to or even larger than the savings from the reduced communication; (ii) some gradient compression algorithms cannot take full advantage of overlapping gradient computation with communication, because if gradient computation and compression/decompression overlap, their intensive computations compete for GPU resources, which can result in an overall slowdown; (iii) owing to their inherent structures, some algorithms can only use inefficient collective communication primitives, such as All-Gather; (iv) some gradient compression algorithms need uncompressed optimizers to warm up at the early stage, and the warm-up time is commonly nontrivial, which to some extent renders their high compression ratios vacuous.

Therefore, from a system-level perspective, a system-efficient communication-compression algorithm should guarantee that its compression/decompression is computationally light and takes little time, and that the corresponding communication is friendly to efficient collective communication primitives. Additionally, it should not need to resort to an uncompressed optimizer for warm-up.
To this end, we propose a communication-compression optimization algorithm, referred to as Binary SGD-Momentum (BinSGDM), whose core updating rule is $x_{t+1} = x_t - \alpha_t Q\left(\frac{m_t}{b_t}\right)$, where $m_t = \beta m_{t-1} + (1-\beta) g_t$, $b_t = \beta b_{t-1} + (1-\beta)|g_t|$, $g_t$ is the gradient, and $Q(\cdot)$ is a binary quantization operator. The main difference between BinSGDM and existing gradient-quantization algorithms is that we directly quantize the entire update $\frac{m_t}{b_t}$ rather than the gradient $g_t$ or the momentum $m_t$. Since $-1 \le \frac{(m_t)_j}{(b_t)_j} \le 1$, where $(m_t)_j$ and $(b_t)_j$ are the $j$-th elements of $m_t$ and $b_t$, each element of $\frac{m_t}{b_t}$ is easily quantized at random to $1$ or $-1$, so the quantization is computationally light. Another advantage of BinSGDM is that it does not need a full-precision optimizer to warm up at the early stage to ensure stable convergence. Besides, we theoretically demonstrate that BinSGDM's convergence rate can match that of the original Adam. Moreover, exploiting the nature of BinSGDM, we devise an efficient hierarchical communication scheme to further speed up communication, which leverages the ultra-high intra-node bandwidth among GPUs within the same node and communication primitives more efficient than All-Gather. In particular, we make the following key contributions:

• We propose a novel communication-compression distributed optimizer, dubbed BinSGDM. To the best of our knowledge, it is the first algorithm that quantizes the entire model update of an adaptive optimizer and does not need uncompressed optimizers to warm up to address the convergence issue, which makes compression/decompression computationally light and lets the extreme quantization ratio take full effect (Section 2).

• We theoretically prove that, even though extreme 1-bit quantization is employed, BinSGDM still promises the same convergence speed as the full-precision Adam (Section 3).
• We present a new hierarchical communication scheme for 1-bit communication, called Hierarchical 1-bit All-Reduce, which harnesses the ultra-fast intra-node connections to accelerate local communication and utilizes more efficient communication primitives to further reduce the communication overhead (Section 4).

• We perform extensive distributed training experiments to demonstrate the effectiveness of the proposed algorithm. As far as we know, our algorithm is the first to consistently outperform the uncompressed optimizers running on the highly system-level optimized DDP in overall running time at no cost in inference performance, reaching up to 2.47× speedup for ResNet-50 and 6.26× speedup for BERT-Base on 64 GPUs. The better scalability makes BinSGDM promising for training larger-scale models (Section 5).

2. EXTREMELY ONE-BIT QUANTIZED BINSGDM

In this section, we focus on solving the following problem when training a DNN model in a distributed manner:

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x; \xi^{(i)}) \qquad (1)$$

where $x$ is the $d$-dimensional model parameter and $n$ is the number of distributed workers. $\xi^{(i)}$ is the sampled mini-batch data on the $i$-th worker. The sampled mini-batch data on all the workers are independent and identically distributed (i.i.d.). $f_i(x; \xi^{(i)})$ is the loss function, which is abbreviated as $f_i(x)$ in the following.

When we directly employ vanilla full-precision, dense-computation optimizers to train a large-scale DNN model on distributed workers, the gradient communication among workers at each iteration becomes a bottleneck. SignSGD was proposed to alleviate this bottleneck by merely taking the sign of each coordinate of the gradients. Although it can substantially reduce the communication overhead, its practical performance is still inferior to popular optimizers such as Adam. Fortunately, we observe that the mathematical formulations of SignSGD and Adam are closely connected, which leaves us an opportunity to propose a new optimizer that combines their merits, i.e., considerably reducing the communication volume with light computation while maintaining fast convergence speed and high inference performance.

The update step of SignSGD can be formulated as

$$x_{t+1} \leftarrow x_t - \alpha_t \,\mathrm{Sign}(g_t) = x_t - \alpha_t \frac{g_t}{|g_t|} \qquad (2)$$

where $\alpha_t$ is the learning rate and $g_t$ denotes the unbiased noisy estimate of the gradient of $f(x_t)$ computed from random samples. Note that the division here is element-wise. The updating rule of vanilla Adam, in turn, can be expressed as

$$m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \quad x_{t+1} \leftarrow x_t - \alpha_t \frac{m_t}{\sqrt{v_t}}, \qquad (3)$$

where $\beta_1$ and $\beta_2$ represent the exponential moving average factors. If we take the exponential moving average factors to zero, $\beta_1, \beta_2 \to 0$, in Eq. (3), Adam becomes equal to SignSGD.
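The limiting relation between Adam and SignSGD stated above can be checked on a small example. The following is a minimal sketch, assuming an Adam step without bias correction or an epsilon term, as in the update rule written in this section; the function name `adam_direction` and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def adam_direction(g, m_prev, v_prev, beta1, beta2):
    """Element-wise Adam direction m_t / sqrt(v_t) (no bias correction, no epsilon)."""
    m = [beta1 * mp + (1.0 - beta1) * gj for mp, gj in zip(m_prev, g)]
    v = [beta2 * vp + (1.0 - beta2) * gj * gj for vp, gj in zip(v_prev, g)]
    return [mj / math.sqrt(vj) for mj, vj in zip(m, v)]

def sign(x):
    """Sign of a scalar: -1, 0, or 1."""
    return (x > 0) - (x < 0)

# Hypothetical gradient (no zero entries) and arbitrary previous optimizer state.
g = [0.3, -2.0, 0.05, -0.7]
m_prev = [1.0, 1.0, -1.0, 0.5]
v_prev = [4.0, 0.25, 1.0, 9.0]

# With beta1 = beta2 = 0 the previous state is ignored and m / sqrt(v) = g / |g|.
direction = adam_direction(g, m_prev, v_prev, beta1=0.0, beta2=0.0)
sign_direction = [float(sign(gj)) for gj in g]
```

With both factors at zero the previous state drops out, so `direction` agrees with `sign_direction` up to floating-point rounding.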
Given the observations above, we propose a new optimizer that is intermediate between SignSGD and Adam, referred to as BinSGDM, i.e.,

$$m_t \leftarrow \beta m_{t-1} + (1-\beta) g_t, \quad b_t \leftarrow \beta b_{t-1} + (1-\beta)|g_t|, \quad x_{t+1} \leftarrow x_t - \alpha_t Q\left(\frac{m_t}{b_t}\right), \qquad (4)$$

where the $j$-th elements of $m_t$ and $b_t$ rigorously satisfy $-1 \le \frac{(m_t)_j}{(b_t)_j} \le 1$, and $Q(\cdot)$ is an element-wise quantization operator that quantizes the $j$-th element of $\frac{m_t}{b_t}$ as follows:

$$Q\left(\frac{(m_t)_j}{(b_t)_j}\right) = \begin{cases} 1, & \text{with probability } p = \frac{1}{2}\left(\frac{(m_t)_j}{(b_t)_j} + 1\right) \\ -1, & \text{with probability } 1 - p \end{cases} \qquad (5)$$

where $\mathbb{E}\left[Q\left(\frac{m_t}{b_t}\right)\right] = \frac{m_t}{b_t}$, so $Q(\cdot)$ is unbiased. The detailed implementation of BinSGDM in a parameter-server model is illustrated in Algorithm 1.

Algorithm 1. BinSGDM
1: Input: all workers' model parameters $x_0, x_1$; the $i$-th worker's momentum $m_0^{(i)} = 0$, $b_0^{(i)} = 0$; the $i$-th worker's local error $e_0^{(i)} = 0$; the server's global error $\bar{e}_0 = 0$; the exponential moving average factor $\beta$; the threshold $T_0$; and the learning rate sequence $\{\alpha_t\}$.
2: for $t = 1, ..., T$ do
3: (On the $i$-th worker)
4: Randomly sample $\xi_t^{(i)}$ and compute the local gradient: $g_t^{(i)} = \nabla f_i(x_t; \xi_t^{(i)})$
5: Update the local $m_t^{(i)}$: $m_t^{(i)} = \beta m_{t-1}^{(i)} + (1-\beta) g_t^{(i)}$
6: Update the local $\hat{b}_t^{(i)}$: $\hat{b}_t^{(i)} = \beta \hat{b}_{t-1}^{(i)} + (1-\beta)|g_t^{(i)}|$
7: Update the local $b_t^{(i)}$: if $t > T_0$ { $b_t^{(i)} = \max(b_{t-1}^{(i)}, \hat{b}_t^{(i)})$ } else { $b_t^{(i)} = \hat{b}_t^{(i)}$ } *
8: Quantize the local update: $u_t^{(i)} = Q\left(\frac{m_t^{(i)}}{b_t^{(i)}} + e_{t-1}^{(i)}\right)$
9: Update the local error feedback $e_t^{(i)}$: $e_t^{(i)} = e_{t-1}^{(i)} + \frac{m_t^{(i)}}{b_t^{(i)}} - u_t^{(i)}$
10: Send $u_t^{(i)}$ to the server
11: (On server)
12: Average all received $u_t^{(i)}$ and quantize the result: $\bar{u}_t = Q\bigl(\frac{1}{n}\sum_{i=1}^{n} u_t^{(i)} + \cdots\bigr)$

Some appealing characteristics of BinSGDM are summarized in the following:

• All the existing communication-efficient optimizers are built upon gradient compression. In contrast, to the best of our knowledge, we are the first to directly quantize the entire model update, which streamlines the quantization. Moreover, each element of $\frac{m_t}{b_t}$ is bounded in the range $[-1, 1]$ and is then extremely quantized to binary $1$ or $-1$. Thus, the quantization in BinSGDM is computationally light compared to that of the existing gradient-quantized optimizers (see footnote 2).

• Unlike SignSGD, Adam adaptively preconditions the gradients with $v_t$, which is the key to ensuring a fast convergence rate in practice. BinSGDM inherits this adaptive preconditioning from Adam to accelerate convergence. The difference is that we keep the exponential moving average factor for $m_t$ and $b_t$ the same, so that each element of $\frac{m_t}{b_t}$ is strictly guaranteed to lie in the range $[-1, 1]$, which is crucial for light quantization.

• When the model parameter $x_t$ is close to a local optimum during training, an ideal optimizer should make sure the updates gradually decay to zero; otherwise, $x_t$ will oscillate around the optimum and cannot actually approach it. SignSGD and Adam do not have this appealing property. For unquantized BinSGDM (which we refer to as SoftSignSGD; see Algorithm 2 in the Appendix for implementation details), in contrast, the gradient $g_t$ continually changes its sign around the optimum, where the gradient is zero, so the update $\frac{m_t}{b_t}$ damps to zero. An example illustrating this phenomenon is provided in Section C of the Appendix.

Remark. We have noticed that the prior work 1-bit Adam (Tang et al. (2021)) and its variants (Li et al. (2021), Lu et al. (2022)) also quantize the communication data to 1 bit. However, the design ethos of 1-bit Adam and the proposed BinSGDM are completely different. 1-bit Adam is still built on compressing the gradient rather than the entire update. The motivation of 1-bit Adam is to harness the error feedback (EF) technique (Seide et al. (2014), Stich et al. (2018)) to compensate for the lost gradient information and thereby alleviate the convergence issue. However, unlike in SGD, the parameter update in Adam no longer depends linearly on the gradient, so EF cannot be directly employed. Having observed that Adam's variance (the non-linear term) becomes stable after the early stage, the authors of 1-bit Adam run full-precision Adam at the beginning (warmup phase) and utilize the resulting variance as a preconditioner for SGDM during the rest of training (compression phase); EF in the compression phase then helps 1-bit Adam converge as rapidly as uncompressed Adam. Two aspects limit how much 1-bit Adam actually accelerates communication. First, the warmup steps commonly make up 15%-25% of the total steps in 1-bit Adam, which to some extent discounts the high quantization ratio. Second, in the compression phase, 1-bit Adam should communicate the signs of the gradients of each layer as well as the average scale of all the gradients in that layer, whereas in DDP, to raise communication efficiency, the gradients from many layers are flattened to one dimension, concatenated, and packed into one bucket for communication, which means the scale factor of each layer cannot be computed. Therefore, 1-bit Adam is not compatible with DDP, which also lowers its communication efficiency.

2 The typical gradient-quantized optimizer QSGD quantizes the gradient as follows:

$$Q((g_t)_j) = \begin{cases} \|g_t\|_2 \,\mathrm{sign}((g_t)_j) \cdot \frac{r}{s}, & \text{with probability } p_i = \frac{s|(g_t)_j|}{\|g_t\|_2} - r \\ \|g_t\|_2 \,\mathrm{sign}((g_t)_j) \cdot \frac{r+1}{s}, & \text{with probability } 1 - p_i \end{cases}$$

where $0 \le r < s$ ($r, l \in \mathbb{N}$) and $\frac{s|(g_t)_j|}{\|g_t\|_2} \in \left[\frac{r}{s}, \frac{r+1}{s}\right]$.
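The core update of Section 2 can be sketched for a single worker. This is a simplified illustration rather than the authors' implementation: it omits the error feedback, the AMSGrad-style max step, and the server-side re-quantization of Algorithm 1. The helper names `quantize` and `binsgdm_step`, the toy objective, and the zero-update convention for $b_j = 0$ are all assumptions made for the example.

```python
import random

def quantize(u, rng):
    """Stochastic 1-bit quantizer: +1 with probability (u_j + 1) / 2, otherwise -1."""
    return [1.0 if rng.random() < 0.5 * (uj + 1.0) else -1.0 for uj in u]

def binsgdm_step(x, g, m, b, beta, lr, rng):
    """One simplified single-worker step: update m and b, quantize m / b, move x."""
    m = [beta * mj + (1.0 - beta) * gj for mj, gj in zip(m, g)]
    b = [beta * bj + (1.0 - beta) * abs(gj) for bj, gj in zip(b, g)]
    # |m_j| <= b_j by construction; b_j == 0 only if every gradient so far was 0
    # (assumed convention: treat the ratio as 0 in that case).
    u = [mj / bj if bj > 0 else 0.0 for mj, bj in zip(m, b)]
    q = quantize(u, rng)
    x = [xj - lr * qj for xj, qj in zip(x, q)]
    return x, m, b, u, q

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself (hypothetical example).
rng = random.Random(0)
x, m, b = [1.0, -2.0, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
history = []
for _ in range(5):
    g = list(x)
    x, m, b, u, q = binsgdm_step(x, g, m, b, beta=0.9, lr=0.01, rng=rng)
    history.append((u, q))

# Empirical check of unbiasedness for a fixed ratio u = 0.5.
samples = quantize([0.5] * 20000, random.Random(1))
empirical_mean = sum(samples) / len(samples)
```

Because the same factor `beta` is used for `m` and `b`, every ratio in `u` stays within $[-1, 1]$, so the quantizer only needs one comparison against a uniform random number per coordinate.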

3. THEORETICAL ANALYSIS

In this section, we present the theoretical convergence guarantee for BinSGDM (Algorithm 1). We first introduce some necessary assumptions.

Assumption 1. [Bounded infimum] For any $x$ and a constant $f^*$, we have the objective value $f(x) \ge f^*$.

Assumption 2. [Lipschitz continuous gradient] The gradient $\nabla f(\cdot)$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|_2$, $\forall x, y \in \mathbb{R}^d$.

Assumption 3. [Unbiased and independent noisy gradient] The gradient with respect to the random samples on each worker and at a different time is independent and identically distributed (i.i.d.), i.e., $\mathbb{E}[g_t^{(i)}] = \nabla f(x_t)$, $\forall t \ge 1$, $g_t^{(i)}$ is independent of $g_t^{(j)}$ for $i \ne j$, and $g_{t_1}^{(i)}$ is independent of $g_{t_2}^{(j)}$ for $t_1 \ne t_2$.

Assumption 4. [Bounded gradient] The noisy gradient and the full-set gradient are bounded, i.e., $\|g_t^{(i)}\| \le G$, $\|\nabla f(x_t)\| \le G$, $\forall t \ge 1$.

Under the assumptions above, we then present the theoretical convergence for BinSGDM in Algorithm 1.

Theorem 1. For BinSGDM in Algorithm 1, under Assumptions 1-4, assuming $(b_t^{(i)})_j \ge \rho > 0$, $\forall j \in [1, 2, ..., d]$ $^{3}$, choosing $\alpha_t = \frac{c}{\sqrt{t}}$, $\forall t \in [1, 2, ..., T]$ and $\alpha_0 = \alpha_1$, and defining $z_1 = x_1 + \alpha_1(\delta_1 - e_1)$, where

$$\delta_1 = \frac{1}{n}\sum_{i=1}^{n} \frac{m_1^{(i)}}{b_1^{(i)}} - \frac{\sum_{i=1}^{n} m_1^{(i)}}{\sum_{i=1}^{n} b_1^{(i)}} \quad \text{and} \quad e_1 = \frac{1}{n}\sum_{i=1}^{n} e_1^{(i)} + \bar{e}_1,$$

we then have

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} \|\nabla f(x_t)\|^2\right] \le \frac{C_1}{\sqrt{T}} + \frac{C_2(1 + \log T)}{\sqrt{T}},$$

where

$$C_1 = cG\left(\mathbb{E}[f(z_1) - f^*] + \frac{3c^2 dL}{16} + \frac{\beta c d G^2}{(1-\beta)\rho} + \frac{4cdG^2}{\rho} + \frac{c^2\beta^2 L G^2 d}{\rho^2(1-\beta)^2}\right),$$

$$C_2 = c^3 G\left(\frac{(8\beta^2 + 10\beta + 5)L^2 d}{(1-\beta)^2} + \frac{G^2(1+L)}{2\rho^2} + 2dL\right).$$
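The $(1 + \log T)$ factor in the bound is consistent with the step-size choice $\alpha_t = c/\sqrt{t}$. The following short worked step is a standard calculation added for orientation; it is not taken from the original proof text.

```latex
\sum_{t=1}^{T} \alpha_t^2
  = c^2 \sum_{t=1}^{T} \frac{1}{t}
  \le c^2 \Bigl( 1 + \int_{1}^{T} \frac{\mathrm{d}s}{s} \Bigr)
  = c^2 (1 + \log T),
\qquad
\sum_{t=1}^{T} \alpha_t \ge T \alpha_T = c \sqrt{T}.
```

Terms of the form $\sum_t \alpha_t^2$ therefore grow only logarithmically in $T$, while the weights $\sum_t \alpha_t$ on the gradient norms grow at least like $\sqrt{T}$, which is how a rate of order $(1 + \log T)/\sqrt{T}$ can arise.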

4. HIERARCHICAL 1-BIT ALL-REDUCE

The data to communicate in BinSGDM is one-bit, so it cannot be directly aggregated through the efficient All-Reduce. Moreover, the intra-node bandwidth and inter-node bandwidth are severely imbalanced. If we aggregate the data uniformly within and across nodes, the communication will be slowed down by the inter-node data exchanges. In light of the problems above, we propose a hierarchical communication scheme, called Hierarchical 1-bit All-Reduce, to efficiently aggregate our 1-bit data, which can hierarchically take advantage …

(ii) Each GPU performs BinSGDM to quantize the data, after which the volume becomes $\frac{P}{32m}$ on each GPU.

(iii) Each GPU conducts 1-bit All-Reduce to aggregate data across nodes. This step includes two sub-steps: 1) each GPU performs All-to-All to collect the data of the corresponding GPUs in other nodes, and the communication volume is $\frac{(n-1)P}{32mn}$; 2) each GPU averages and re-quantizes the data, and then conducts All-Gather to gather the data, and the communication volume is also $\frac{(n-1)P}{32mn}$.

(iv) Each GPU performs All-Gather to aggregate data within the node, and the communication volume is $\frac{(m-1)P}{32m}$.

Compared to the time cost of inter-node communication, the time cost of intra-node communication is trivial. Hence, when leveraging Hierarchical 1-bit All-Reduce, most of the communication cost comes from the 1-bit All-Reduce in Step (iii), and the communication volume across nodes for all GPUs is approximately $\frac{2(n-1)P}{32}$. In contrast, if we simply utilize the original All-Gather to aggregate data, the communication volume across nodes for all GPUs is approximately $\frac{m^2 n(n-1)P}{32}$. Therefore, Hierarchical 1-bit All-Reduce is considerably more efficient than the original All-Gather.

Some recent works (Xu et al. (2020), Agarwal et al. (2022)) have demonstrated that, when running with system-level optimized distributed data-parallel frameworks (e.g., DDP), the existing typical communication-compression optimizers (not including 1-bit Adam) still run slower than the original full-precision SGD/Adam (see the Introduction for the reasons). Hence, we only evaluate the performance of BinSGDM, the original uncompressed SGDM/AdamW, and the closely related 1-bit Adam, by performing distributed training experiments with the benchmark models ResNet-50 (CNN) and BERT-Base (Transformer). We show that, running with the distributed data-parallel framework DDP in PyTorch, BinSGDM with the proposed communication scheme is up to 2.47 times faster for ResNet-50 and 6.26 times faster for BERT-Base than the uncompressed optimizers with highly system-level optimized All-Reduce, at no accuracy cost.

… with momentum of 0.9 and weight decay of 0.0001. When employing 1-bit Adam and BinSGDM, the learning rate starts at $0.001 \times \frac{\text{batch size}}{256}$ with weight decay of 0.0001, $[\beta_1, \beta_2]$ for 1-bit Adam is set to [0.9, 0.999], and $\beta$ for BinSGDM is set to 0.95. Then, the learning rate is divided by 10 after 30, 60 and 90 epochs, and training is terminated after 100 epochs. Specifically, the first 15 epochs are used as the warmup stage for 1-bit Adam.

For the experiments on BERT-Base, we assess the convergence and performance of AdamW (baseline), 1-bit Adam and BinSGDM on the SQuAD 1.1 fine-tuning task using a pre-trained BERT-Base model checkpoint from HuggingFace. The batch size per GPU is set to 3. We perform fine-tuning for 2 epochs. The learning rate linearly increases to 0.3× steps in the early 500 steps and then linearly decreases to 0 over the remaining iterations. Specifically, the first 0.2× steps are used as the warmup stage for 1-bit Adam. $[\beta_1, \beta_2]$ for AdamW and 1-bit Adam is set to [0.9, 0.999], and $\beta$ for BinSGDM is set to 0.9.

5.2. EXPERIMENTAL RESULTS

Figure 2 shows the sample-wise and time-wise training convergence behaviors of SGDM/AdamW (baseline), 1-bit Adam and BinSGDM with ResNet-50 and BERT-Base running on 64 GPUs. The experimental results demonstrate that BinSGDM achieves a sample-wise convergence rate similar to the baseline, while its actual speed is substantially faster than the baseline and 1-bit Adam, with up to about 2.5× speedup for ResNet-50 with batch size 32 per GPU and about 6.3× speedup for BERT-Base fine-tuning on SQuAD 1.1.

Figure 3 presents the system throughput of the different optimizers running with ResNet-50 and BERT-Base on 8 to 64 GPUs. When training on 8 GPUs, since computation rather than communication dominates the running time, the throughput of BinSGDM is slightly inferior to that of SGDM and AdamW. As the number of GPUs grows, however, BinSGDM consistently outperforms its counterparts, and the advantage becomes more pronounced with more GPUs. Furthermore, the system throughput of SGDM, AdamW and 1-bit Adam even decreases as the number of GPUs increases in some cases, whereas the throughput of BinSGDM steadily grows, which indicates that BinSGDM offers better scaling efficiency.

… and exact-match score after fine-tuning on SQuAD 1.1. As shown in Table 1, when the batch size is set to 32 samples per GPU, the accuracy of BinSGDM is slightly lower than that of SGDM. For CNN architectures, some works (Keskar & Socher (2017), Zhou et al. (2020)) pointed out that adaptive optimizers commonly generalize worse than SGDM. However, when the batch size becomes larger (Table 2), 1-bit Adam and BinSGDM achieve better accuracy. The reason is that a certain level of noise can be helpful for generalization (Smith & Le (2018)), biasing the optimizer towards wider valleys. A large batch size reduces the randomness, while the quantization errors in BinSGDM increase it. Table 2 shows that BinSGDM obtains similar or higher F1-score and exact-match score compared to AdamW and 1-bit Adam, which validates the inference effectiveness of the proposed BinSGDM.

5.3. COMMUNICATION EFFICIENCY ANALYSIS

As illustrated in Figure 4, when we perform training on a single node, the baseline full-precision SGDM and Adam are slightly faster than BinSGDM and 1-bit Adam. Since the inter-GPU bandwidth within a node is ultra-high, the communication time becomes negligible, and the compression/decompression newly introduced by communication-compression optimizers takes up extra time. Owing to its computationally light quantization, BinSGDM takes only about 15 ms and 8 ms for compression/decompression with ResNet-50 and BERT-Base, respectively. When we perform distributed training on two nodes, the bandwidth between nodes is limited (10 Gbps in our experimental testbed), and the communication time has to be reckoned with. For the uncompressed SGDM and Adam, the communication time considerably exceeds the computation time for ResNet with 32 samples per GPU and for BERT-Base, which makes the system throughput even lower than that on a single node (shown in Figure 2). Since extreme quantization in BinSGDM substantially reduces the communication overhead (up to 32× reduction), the communication time for BinSGDM increases only slightly. When the number of nodes increases further, a good communication scheme becomes significant. With our proposed Hierarchical 1-bit All-Reduce, the overall inter-node communication volume exchanged is proportional to the number of nodes, whereas for the CompressedAllreduce utilized by 1-bit Adam, the overall communication volume exchanged among nodes is proportional to the number of GPUs (8 times the number of nodes in our experiment). Therefore, as the number of nodes increases, the communication time for BinSGDM rises gently, but the communication time for 1-bit Adam grows abruptly.
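The scaling behavior described above can be reproduced from the per-step volumes quoted in Section 4. The sketch below is illustrative only: it assumes that $m$ is the number of GPUs per node, $n$ is the number of nodes, and $P$ is the uncompressed data volume, and the function names and the value of `P` are hypothetical.

```python
from fractions import Fraction

def per_gpu_volumes(P, m, n):
    """Per-GPU volumes of steps (ii)-(iv) in Section 4, in the same unit as P."""
    return {
        "quantized_shard": Fraction(P, 32 * m),                 # step (ii)
        "inter_all_to_all": Fraction((n - 1) * P, 32 * m * n),  # step (iii), sub-step 1
        "inter_all_gather": Fraction((n - 1) * P, 32 * m * n),  # step (iii), sub-step 2
        "intra_all_gather": Fraction((m - 1) * P, 32 * m),      # step (iv)
    }

def hierarchical_inter_node_total(P, m, n):
    """Inter-node volume summed over all m * n GPUs (both sub-steps of step (iii))."""
    v = per_gpu_volumes(P, m, n)
    return m * n * (v["inter_all_to_all"] + v["inter_all_gather"])

P = 25_000_000   # hypothetical uncompressed volume
m, n = 8, 8      # assumed: 8 GPUs per node, 8 nodes
total = hierarchical_inter_node_total(P, m, n)
closed_form = Fraction(2 * (n - 1) * P, 32)
```

Summing the two inter-node sub-steps over all GPUs gives $\frac{2(n-1)P}{32}$, which depends on the number of nodes but not on the number of GPUs per node.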

6. CONCLUSION

In this work, we present a novel communication-compression optimizer for distributed training. The optimizer is not only computationally light but also quantizes the communication data to an extreme one bit. We also theoretically demonstrate that BinSGDM can converge as fast as the original Adam. To further accelerate communication, we present a novel communication scheme for BinSGDM to replace the inefficient naive All-Gather. Extensive experiments on training the benchmark models ResNet-50 and BERT-Base have validated the effectiveness and efficiency of BinSGDM over the uncompressed SGD and Adam and the most closely related 1-bit Adam.

A THEORETICAL ANALYSIS FOR ALGORITHM 1

In practice, we implement BinSGDM in a non-parameter-server model to further reduce the communication overhead, but the data exchange is essentially equivalent to that in a parameter-server prototype. Hence, we provide the theoretical analysis for BinSGDM in a parameter-server model as shown in Algorithm 1. According to Algorithm 1, the update $\bar{u}_t$ can be recursively formulated as

$$\begin{aligned} \bar{u}_t &= Q\left(\frac{1}{n}\sum_{i=1}^{n} u_t^{(i)} + \bar{e}_t\right) = \frac{1}{n}\sum_{i=1}^{n} u_t^{(i)} + \bar{e}_t - \bar{e}_{t+1} \\ &= \frac{1}{n}\sum_{i=1}^{n} Q\left(\frac{m_t^{(i)}}{b_t^{(i)}} + e_t^{(i)}\right) + \bar{e}_t - \bar{e}_{t+1} \\ &= \frac{1}{n}\sum_{i=1}^{n} \left(\frac{m_t^{(i)}}{b_t^{(i)}} + e_t^{(i)} - e_{t+1}^{(i)}\right) + \bar{e}_t - \bar{e}_{t+1} \\ &= \frac{1}{n}\sum_{i=1}^{n} \frac{m_t^{(i)}}{b_t^{(i)}} + \frac{1}{n}\sum_{i=1}^{n} \left(e_t^{(i)} - e_{t+1}^{(i)}\right) + \bar{e}_t - \bar{e}_{t+1}. \end{aligned}$$

Denote

$$g_t = \frac{1}{n}\sum_{i=1}^{n} g_t^{(i)}, \quad m_t = \frac{1}{n}\sum_{i=1}^{n} m_t^{(i)} = \beta m_{t-1} + (1-\beta) g_t, \quad b_t = \frac{1}{n}\sum_{i=1}^{n} b_t^{(i)},$$

$$\delta_t = \frac{1}{n}\sum_{i=1}^{n} \frac{m_t^{(i)}}{b_t^{(i)}} - \frac{m_t}{b_t}, \qquad (11)$$

$$e_t = \frac{1}{n}\sum_{i=1}^{n} e_t^{(i)} + \bar{e}_t. \qquad (12)$$

Hence, the updating rule can be summarized as

$$x_{t+1} = x_t - \alpha_t \bar{u}_t = x_t - \alpha_t\left(\frac{m_t}{b_t} + \delta_t + e_t - e_{t+1}\right). \qquad (14)$$

A.1 AUXILIARY LEMMAS

Lemma 1. Let $u_t = \frac{m_t}{b_t}$. The element-wise quantization function defined in Eq. (5) can be reformulated as

$$Q((u_t)_j) = \begin{cases} 1, & \text{with probability } p = \frac{(u_t)_j + 1}{2} \\ -1, & \text{with probability } 1 - p \end{cases} \quad (j \in \{1, 2, ..., d\},\ -1 \le (u_t)_j \le 1). \qquad (15)$$

Let $e_t = u_t - Q(u_t)$. Then the following holds true:

$$\mathbb{E}[e_t] = 0, \quad \mathbb{E}\|e_t\|^2 \le d.$$

Proof.
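From Eq. (15), we know

$$\mathbb{E}[(e_t)_j] = \mathbb{E}[(u_t)_j - Q((u_t)_j)] = \frac{1}{2}((u_t)_j + 1)((u_t)_j - 1) + \left(1 - \frac{1}{2}((u_t)_j + 1)\right)((u_t)_j + 1) = 0,$$

and

$$\mathbb{E}[(e_t)_j^2] = \mathbb{E}[((u_t)_j - Q((u_t)_j))^2] = \frac{1}{2}((u_t)_j + 1)((u_t)_j - 1)^2 + \left(1 - \frac{1}{2}((u_t)_j + 1)\right)((u_t)_j + 1)^2 = 1 - ((u_t)_j)^2 \le 1.$$

Hence, $\mathbb{E}[e_t] = 0$ and $\mathbb{E}\|e_t\|^2 \le d$.

Lemma 2. Let $x_0 = x_1$ and $\alpha_0 = \alpha_1$ in Algorithm 1, and define the sequence

$$z_1 = x_1 + \alpha_1(\delta_1 - e_1), \qquad z_t = x_t + \frac{\beta}{1-\beta}(x_t - x_{t-1}) + \frac{\alpha_{t-1}}{1-\beta}(\delta_{t-1} + \beta e_{t-1} - e_t), \quad \forall t \ge 2.$$

Then the following equality holds:

$$z_{t+1} = z_t + \frac{\beta}{1-\beta}\left(\frac{\alpha_{t-1}}{b_{t-1}} - \frac{\alpha_t}{b_t}\right) m_{t-1} - \alpha_t \frac{g_t}{b_t} - \alpha_{t-1}\delta_{t-1} - (\alpha_t - \alpha_{t-1}) e_t.$$

Proof. For $t = 1$, we have

$$\begin{aligned} z_2 - z_1 &= x_2 + \frac{\beta}{1-\beta}(x_2 - x_1) + \frac{\alpha_1}{1-\beta}(\delta_1 + \beta e_1 - e_2) - (x_1 + \alpha_1(\delta_1 - e_1)) \\ &= \left(\frac{\beta}{1-\beta} + 1\right)(x_2 - x_1) + \frac{\alpha_1}{1-\beta}(\delta_1 + \beta e_1 - e_2) - \alpha_1(\delta_1 - e_1) \\ &= -\frac{\alpha_1}{1-\beta}\left(\frac{(1-\beta) g_1}{b_1} + \delta_1 + e_1 - e_2\right) + \frac{\alpha_1}{1-\beta}(\delta_1 + \beta e_1 - e_2) - \alpha_1(\delta_1 - e_1) \\ &= -\alpha_1 \frac{g_1}{b_1} - \alpha_0 \delta_1, \end{aligned} \qquad (23)$$

where the second equality follows the updating rule in Eq. (14). For $t \ge 2$, following the updating rule in Eq. (14), we have

$$\begin{aligned} x_{t+1} - x_t + \alpha_t(\delta_t + e_t - e_{t+1}) &= -\alpha_t \frac{m_t}{b_t} = -\alpha_t \frac{\beta m_{t-1} + (1-\beta) g_t}{b_t} \\ &= \beta\left(x_t - x_{t-1} + \alpha_{t-1}(\delta_{t-1} + e_{t-1} - e_t)\right) + \beta\left(\frac{\alpha_{t-1}}{b_{t-1}} - \frac{\alpha_t}{b_t}\right) m_{t-1} - (1-\beta)\alpha_t \frac{g_t}{b_t}. \end{aligned} \qquad (24)$$

We know that

$$x_{t+1} - x_t + \alpha_t(\delta_t + e_t - e_{t+1}) = (1-\beta)\left(x_{t+1} + \alpha_t(\delta_t - e_{t+1})\right) - (1-\beta)(x_t - \alpha_t e_t) + \beta\left(x_{t+1} - x_t + \alpha_t(\delta_t + e_t - e_{t+1})\right),$$

so Eq. (24) can be rearranged as

$$\begin{aligned} &(1-\beta)\left(x_{t+1} + \alpha_t(\delta_t - e_{t+1})\right) + \beta\left(x_{t+1} - x_t + \alpha_t(\delta_t + e_t - e_{t+1})\right) \\ &= (1-\beta)(x_t - \alpha_t e_t) + \beta\left(x_t - x_{t-1} + \alpha_{t-1}(\delta_{t-1} + e_{t-1} - e_t)\right) + \beta\left(\frac{\alpha_{t-1}}{b_{t-1}} - \frac{\alpha_t}{b_t}\right) m_{t-1} - (1-\beta)\alpha_t \frac{g_t}{b_t}. \end{aligned} \qquad (25)$$

Dividing both sides by $1 - \beta$, we obtain

$$\begin{aligned} &x_{t+1} + \alpha_t(\delta_t - e_{t+1}) + \frac{\beta}{1-\beta}\left(x_{t+1} - x_t + \alpha_t(\delta_t + e_t - e_{t+1})\right) \\ &= x_t + \alpha_{t-1}(\delta_{t-1} - e_t) + \frac{\beta}{1-\beta}\left(x_t - x_{t-1} + \alpha_{t-1}(\delta_{t-1} + e_{t-1} - e_t)\right) \\ &\quad + \frac{\beta}{1-\beta}\left(\frac{\alpha_{t-1}}{b_{t-1}} - \frac{\alpha_t}{b_t}\right) m_{t-1} - \alpha_t \frac{g_t}{b_t} - \alpha_{t-1}\delta_{t-1} - (\alpha_t - \alpha_{t-1}) e_t. \end{aligned} \qquad (26)$$

Rearranging Eq. (26), we have

$$\begin{aligned} &x_{t+1} + \frac{\beta}{1-\beta}(x_{t+1} - x_t) + \frac{\alpha_t}{1-\beta}(\delta_t + \beta e_t - e_{t+1}) \\ &= x_t + \frac{\beta}{1-\beta}(x_t - x_{t-1}) + \frac{\alpha_{t-1}}{1-\beta}(\delta_{t-1} + \beta e_{t-1} - e_t) \\ &\quad + \frac{\beta}{1-\beta}\left(\frac{\alpha_{t-1}}{b_{t-1}} - \frac{\alpha_t}{b_t}\right) m_{t-1} - \alpha_t \frac{g_t}{b_t} - \alpha_{t-1}\delta_{t-1} - (\alpha_t - \alpha_{t-1}) e_t. \end{aligned} \qquad (27)$$

Under review as a conference paper at ICLR 2023

Define the sequence

$$z_t = x_t + \frac{\beta}{1-\beta}(x_t - x_{t-1}) + \frac{\alpha_{t-1}}{1-\beta}(\delta_{t-1} + \beta e_{t-1} - e_t). \qquad (28)$$

We finally obtain

$$z_{t+1} = z_t + \frac{\beta}{1-\beta}\left(\frac{\alpha_{t-1}}{b_{t-1}} - \frac{\alpha_t}{b_t}\right) m_{t-1} - \alpha_t \frac{g_t}{b_t} - \alpha_{t-1}\delta_{t-1} - (\alpha_t - \alpha_{t-1}) e_t. \qquad (29)$$

Recalling $x_1 = x_0$ and $\alpha_1 = \alpha_0$, we have $\frac{\alpha_1}{b_1} = \frac{\alpha_0}{b_0}$. Then, combining Eq. (23) and Eq. (29), we obtain the conclusion.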
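The moments in Lemma 1 can be checked numerically for a single coordinate. The sketch below evaluates the two-outcome expectation exactly instead of sampling; the function name `quantization_error_moments` and the grid of test values are hypothetical choices for illustration.

```python
def quantization_error_moments(u):
    """Mean and second moment of e = u - Q(u) for one coordinate u in [-1, 1].

    Q(u) = +1 with probability p = (u + 1) / 2 and -1 with probability 1 - p.
    """
    p = 0.5 * (u + 1.0)
    err_plus = u - 1.0    # error when Q(u) = +1
    err_minus = u + 1.0   # error when Q(u) = -1
    mean = p * err_plus + (1.0 - p) * err_minus
    second = p * err_plus ** 2 + (1.0 - p) * err_minus ** 2
    return mean, second

# Evaluate on a grid of ratios covering [-1, 1].
grid = [k / 10.0 for k in range(-10, 11)]
moments = {u: quantization_error_moments(u) for u in grid}
```

For every `u` in the grid the mean is zero and the second moment equals $1 - u^2$ up to floating-point rounding, so summing over $d$ coordinates gives the bound $\mathbb{E}\|e_t\|^2 \le d$.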

A.2 PROOF OF THEOREM 1

Proof. By the the gradient Lipschitz continuous in Assumption 2 and Lemma 2, we obtain E[f (z t+1 ) -f (z t )] ≤ E ∇f (z t ), z t+1 -z t + L 2 E z t+1 -z t 2 =E β 1 -β ∇f (z t ), α t-1 b t-1 - α t b t m t-1 -E ∇f (z t ), α t g t b t -E [ ∇f (z t ), α t-1 δ t-1 ] -E [ ∇f (z t ), (α t -α t )e t-1 ] + E L 2 β 1 -β α t-1 b t-1 - α t b t m t-1 -α t g t b t -α t-1 δ t-1 -(α t -α t-1 )e t-1 2 =E β 1 -β ∇f (z t ), α t-1 b t-1 - α t b t m t-1 -E ∇f (z t ), α t g t b t + E L 2 β 1 -β α t-1 b t-1 - α t b t m t-1 -α t g t b t -α t-1 δ t-1 -(α t -α t-1 )e t-1 2 ≤E β 1 -β ∇f (z t ), α t-1 b t-1 - α t b t m t-1 -E ∇f (z t ), α t g t b t + LE β 1 -β α t-1 b t-1 - α t b t m t-1 2 + LE α 2 t g t b t 2 + L 2 E α t-1 δ t-1 2 + L 2 E (α t-1 -α t )e t 2 (30) where the second equality holds due to E [δ t-1 ] = 0 and E [e t-1 ] = 0. The last inequality holds owing to E[ a + b 2 ] = E[ a 2 ] + E[ b 2 ] if E[a] = 0 or E[b] = 0, and E[ a + b 2 ] ≤ 2E[ a 2 ] + 2E[ b 2 ] if E[a] = 0 and E[b] = 0. Taking telescope sum from 1 to T on the both sides of Eq.( 30) , we then have E[f (z T ) -f (z 1 )] ≤ β 1 -β E T t=1 ∇f (z t ), α t-1 b t-1 - α t b t m t-1 T1 -E T t=1 ∇f (z t ), α t g t b t T2 + LE T t=1 β 1 -β α t-1 b t-1 - α t b t m t-1 2 T3 + LE T t=1 α 2 t g t b t 2 + L 2 E[ T t=1 α t-1 δ t-1 2 ] + L 2 E[ T t=1 (α t-1 -α t )e t 2 ] T4 Now we focus on bounding T 1 below. From Assumption 4, we know g t ≤ G (t = 1, 2, ..., T ) and ∇f (z t ) ≤ G . Due to m t = βm t-1 + (1 -β)g t and m 1 = g 1 , it is easy to obtain m t ≤ G by complete induction. 
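Since $\|\nabla f(z_t)\| \le G$ and $\|m_t\| \le G$, we have

$$
\begin{aligned}
T_1 &= \frac{\beta}{1-\beta}\,\mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(z_t),\ \Big(\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big)\odot m_{t-1}\Big\rangle\Big] \\
&\overset{(i)}{\le} \frac{\beta}{1-\beta}\,\mathbb{E}\Big[\sum_{t=1}^{T}\|\nabla f(z_t)\|\,\|m_{t-1}\|\,\Big\|\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big\|_1\Big] \\
&\overset{(ii)}{\le} \frac{\beta}{1-\beta}\,G^2\,\mathbb{E}\Big[\sum_{t=1}^{T}\Big\|\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big\|_1\Big] \\
&\overset{(iii)}{=} \frac{\beta}{1-\beta}\,G^2\,\mathbb{E}\Big[\sum_{t=1}^{T}\Big(\Big\|\frac{\alpha_{t-1}}{b_{t-1}}\Big\|_1-\Big\|\frac{\alpha_t}{b_t}\Big\|_1\Big)\Big]
\le \frac{\beta}{1-\beta}\,G^2\,\mathbb{E}\Big[\Big\|\frac{\alpha_0}{b_0}\Big\|_1\Big] \\
&\overset{(iv)}{\le} \frac{\alpha_0\beta d}{(1-\beta)\rho}\,G^2,
\end{aligned}
$$

where (i) holds since $\|a\odot b\| \le \|a\|\max_j|(b)_j| \le \|a\|\,\|b\|_1$, (ii) holds due to $\|\nabla f(z_t)\| \le G$ and $\|m_t\| \le G$, (iii) holds because $\frac{\alpha_{t-1}}{(b_{t-1})_j}-\frac{\alpha_t}{(b_t)_j} \ge 0$ for any $j \in [1, 2, ..., d]$, and (iv) holds due to $\min_j (b_t)_j \ge \rho > 0$ for any $j \in [1, 2, ..., d]$.

Let us turn to bound $T_2$,

$$
T_2 = -\mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(z_t),\ \frac{\alpha_t g_t}{b_t}\Big\rangle\Big]
= \underbrace{-\mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(z_t)-\nabla f(x_t),\ \frac{\alpha_t g_t}{b_t}\Big\rangle\Big]}_{T_5}
\ \underbrace{-\ \mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(x_t),\ \frac{\alpha_t g_t}{b_t}\Big\rangle\Big]}_{T_6}.
$$

We now analyze $T_5$ below,

$$
\begin{aligned}
T_5 &= -\mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(z_t)-\nabla f(x_t),\ \frac{\alpha_t g_t}{b_t}\Big\rangle\Big] \\
&\overset{(i)}{\le} \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\nabla f(z_t)-\nabla f(x_t)\|^2\Big] + \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] \\
&\overset{(ii)}{\le} \frac{L^2}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\|z_t-x_t\|^2\Big] + \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] \\
&\overset{(iii)}{=} \frac{L^2}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\Big\|\frac{\beta}{1-\beta}(x_t-x_{t-1})+\frac{\alpha_{t-1}}{1-\beta}(\delta_{t-1}+\beta e_{t-1}-e_t)\Big\|^2\Big] + \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] \\
&\overset{(iv)}{\le} \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|x_t-x_{t-1}\|^2\Big] + \frac{L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}\delta_{t-1}\|^2\Big] + \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_{t-1}\|^2\Big] \\
&\qquad + \frac{L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_t\|^2\Big] + \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] \\
&\overset{(v)}{=} \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_{t-1}^2\Big\|\frac{m_{t-1}}{b_{t-1}}+\delta_{t-1}+e_{t-1}-e_t\Big\|^2\Big] + \frac{L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}\delta_{t-1}\|^2\Big] \\
&\qquad + \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_{t-1}\|^2\Big] + \frac{L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_t\|^2\Big] + \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] \\
&= \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\Big\|\alpha_{t-1}\frac{m_{t-1}}{b_{t-1}}\Big\|^2\Big] + \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}\delta_{t-1}\|^2\Big] + \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_{t-1}\|^2\Big] \\
&\qquad + \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_t\|^2\Big] + \frac{L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}\delta_{t-1}\|^2\Big] + \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_{t-1}\|^2\Big] \\
&\qquad + \frac{L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}e_t\|^2\Big] + \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] \\
&= \frac{\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_{t-1}^2\Big\|\frac{m_{t-1}}{b_{t-1}}\Big\|^2\Big] + \frac{(1+\beta^2)L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_{t-1}^2\|\delta_{t-1}\|^2\Big] + \frac{2\beta^2L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_{t-1}^2\|e_{t-1}\|^2\Big] \\
&\qquad + \frac{(1+\beta^2)L^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_{t-1}^2\|e_t\|^2\Big] + \frac{1}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] \\
&\overset{(vi)}{\le} \Big(\frac{\beta^2L^2d}{(1-\beta)^2}+\frac{4(1+\beta^2)L^2d}{(1-\beta)^2}+\frac{2\beta^2L^2d}{(1-\beta)^2}+\frac{(1+\beta^2)L^2d}{(1-\beta)^2}+\frac{G^2}{2\rho^2}\Big)\sum_{t=1}^{T}\alpha_{t-1}^2 \\
&= \Big(\frac{(8\beta^2+10\beta+5)L^2d}{(1-\beta)^2}+\frac{G^2}{2\rho^2}\Big)\sum_{t=1}^{T}\alpha_{t-1}^2,
\end{aligned}
$$

where (i) holds by following $\langle a, b\rangle \le \frac{1}{2}\|a\|^2+\frac{1}{2}\|b\|^2$, (ii) holds due to Assumption 1, (iii) holds owing to Eq. (21), (iv) holds since $\mathbb{E}[\|a+b\|^2] = \mathbb{E}[\|a\|^2]+\mathbb{E}[\|b\|^2]$ if $\mathbb{E}[a] = 0$ or $\mathbb{E}[b] = 0$, (v) holds resulting from the updating rule in Eq. (14), and (vi) holds due to $\big|\frac{(m_t)_j}{(b_t)_j}\big| \le 1$, $|(\delta_t)_j| \le 2$ (the definition of $\delta_t$ in Eq. (11)), $\mathbb{E}[\|e_t\|^2] \le d$ in Lemma 1, $\|g_t\| \le G$ in Assumption 2 and $\min_j (b_t)_j \ge \rho > 0$.

We then bound $T_6$,

$$
\begin{aligned}
T_6 &= -\mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(x_t),\ \frac{\alpha_t g_t}{b_t}\Big\rangle\Big] \\
&= -\mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(x_t),\ \frac{\alpha_t \nabla f(x_t)}{b_t}\Big\rangle\Big] - \mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(x_t),\ \alpha_t\frac{g_t-\nabla f(x_t)}{b_t}\Big\rangle\Big] \\
&\overset{(i)}{\le} -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + \mathbb{E}\Big[\sum_{t=1}^{T}\Big\langle \nabla f(x_t),\ \alpha_t\frac{\nabla f(x_t)-g_t}{b_t}\Big\rangle\Big] \\
&= -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + \mathbb{E}\Big[\Big\langle \nabla f(x_1),\ \alpha_1\frac{\nabla f(x_1)-g_1}{b_1}\Big\rangle\Big] \\
&\qquad + \mathbb{E}\Big[\sum_{t=2}^{T}\Big\langle \nabla f(x_t),\ (\nabla f(x_t)-g_t)\odot\Big(\frac{\alpha_t}{b_t}-\frac{\alpha_{t-1}}{b_{t-1}}\Big)\Big\rangle\Big] + \mathbb{E}\Big[\sum_{t=2}^{T}\Big\langle \nabla f(x_t),\ \alpha_{t-1}\frac{\nabla f(x_t)-g_t}{b_{t-1}}\Big\rangle\Big] \\
&\overset{(ii)}{=} -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + \mathbb{E}\Big[\Big\langle \nabla f(x_1),\ \alpha_1\frac{\nabla f(x_1)-g_1}{b_1}\Big\rangle\Big] \\
&\qquad + \mathbb{E}\Big[\sum_{t=2}^{T}\Big\langle \nabla f(x_t),\ (\nabla f(x_t)-g_t)\odot\Big(\frac{\alpha_t}{b_t}-\frac{\alpha_{t-1}}{b_{t-1}}\Big)\Big\rangle\Big] \\
&\overset{(iii)}{\le} -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + \mathbb{E}\Big[\|\nabla f(x_1)\|\,\|\nabla f(x_1)-g_1\|\,\Big\|\frac{\alpha_1}{b_1}\Big\|_1\Big] \\
&\qquad + \mathbb{E}\Big[\sum_{t=2}^{T}\|\nabla f(x_t)\|\,\|\nabla f(x_t)-g_t\|\,\Big\|\frac{\alpha_t}{b_t}-\frac{\alpha_{t-1}}{b_{t-1}}\Big\|_1\Big] \\
&\overset{(iv)}{\le} -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + 2G^2\,\mathbb{E}\Big[\Big\|\frac{\alpha_1}{b_1}\Big\|_1+\sum_{t=2}^{T}\Big\|\frac{\alpha_t}{b_t}-\frac{\alpha_{t-1}}{b_{t-1}}\Big\|_1\Big] \\
&\overset{(v)}{=} -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + 2G^2\,\mathbb{E}\Big[\Big\|\frac{\alpha_1}{b_1}\Big\|_1+\sum_{t=2}^{T}\Big(\Big\|\frac{\alpha_{t-1}}{b_{t-1}}\Big\|_1-\Big\|\frac{\alpha_t}{b_t}\Big\|_1\Big)\Big] \\
&\le -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + 4G^2\,\mathbb{E}\Big[\Big\|\frac{\alpha_1}{b_1}\Big\|_1\Big] \\
&\overset{(vi)}{\le} -\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] + \frac{4G^2\alpha_1 d}{\rho},
\end{aligned}
$$

where (i) holds due to $\max_j (b_t)_j \le \|b_t\| \le G$, (ii) holds owing to $\mathbb{E}[\nabla f(x_t)-g_t] = 0$ in Assumption 2 and the independence of $g_t$ and $b_{t-1}$, (iii) holds since $\|a\odot b\| \le \|a\|\max_j|(b)_j| \le \|a\|\,\|b\|_1$, (iv) holds resulting from $\|\nabla f(x_t)\| \le G$ and $\|\nabla f(x_t)-g_t\| \le \|\nabla f(x_t)\|+\|g_t\| \le 2G$, (v) holds because $\frac{\alpha_{t-1}}{(b_{t-1})_j}-\frac{\alpha_t}{(b_t)_j} \ge 0$ for any $j \in [1, 2, ..., d]$, and (vi) holds due to $\min_j (b_t)_j \ge \rho > 0$ for any $j \in [1, 2, ..., d]$.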
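Then, we pay attention to $T_3$,

$$
\begin{aligned}
T_3 &= L\,\mathbb{E}\Big[\sum_{t=1}^{T}\Big\|\frac{\beta}{1-\beta}\Big(\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big)\odot m_{t-1}\Big\|^2\Big] \\
&\overset{(i)}{\le} \frac{\beta^2L}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\Big\|\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big\|^2\|m_{t-1}\|^2\Big] \\
&\overset{(ii)}{\le} \frac{\beta^2LG^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\Big\|\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big\|^2\Big] \\
&\overset{(iii)}{\le} \frac{\beta^2LG^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\max_j\Big|\frac{\alpha_{t-1}}{(b_{t-1})_j}-\frac{\alpha_t}{(b_t)_j}\Big|\,\Big\|\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big\|_1\Big] \\
&\overset{(iv)}{\le} \frac{\beta^2LG^2}{(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\max_j\frac{\alpha_{t-1}}{(b_{t-1})_j}\,\Big\|\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big\|_1\Big] \\
&\overset{(v)}{\le} \frac{\alpha_0\beta^2LG^2}{\rho(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\Big\|\frac{\alpha_{t-1}}{b_{t-1}}-\frac{\alpha_t}{b_t}\Big\|_1\Big] \\
&\overset{(vi)}{\le} \frac{\alpha_0\beta^2LG^2}{\rho(1-\beta)^2}\mathbb{E}\Big[\sum_{t=1}^{T}\Big(\Big\|\frac{\alpha_{t-1}}{b_{t-1}}\Big\|_1-\Big\|\frac{\alpha_t}{b_t}\Big\|_1\Big)\Big] \\
&\overset{(vii)}{\le} \frac{\alpha_0\beta^2LG^2}{\rho(1-\beta)^2}\mathbb{E}\Big[\Big\|\frac{\alpha_0}{b_0}\Big\|_1-\Big\|\frac{\alpha_T}{b_T}\Big\|_1\Big] \\
&\overset{(viii)}{\le} \frac{\alpha_0^2\beta^2LG^2d}{\rho^2(1-\beta)^2},
\end{aligned}
\tag{36}
$$

where (i) holds due to $\|a\odot b\| \le \|a\|\,\|b\|$, (ii) holds owing to $\|m_{t-1}\| \le G$, (iii) holds due to $\|a\|^2 \le \max_j|(a)_j|\,\|a\|_1$, (iv) holds due to $\frac{\alpha_{t-1}}{(b_{t-1})_j}-\frac{\alpha_t}{(b_t)_j} \ge 0$ and $\frac{\alpha_t}{(b_t)_j} > 0$ for any $j \in [1, 2, ..., d]$, (v) holds resulting from $\min_j (b_t)_j \ge \rho > 0$ for any $j$ and the fact that $\alpha_t$ is non-increasing, (vi) holds resulting from $\frac{\alpha_{t-1}}{(b_{t-1})_j}-\frac{\alpha_t}{(b_t)_j} \ge 0$ for any $j \in [1, 2, ..., d]$, (vii) holds due to the telescoping sum, and (viii) holds due to $\min_j (b_t)_j \ge \rho > 0$ for any $j \in [1, 2, ..., d]$.

Now we turn attention to $T_4$,

$$
\begin{aligned}
T_4 &= L\,\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t^2\Big\|\frac{g_t}{b_t}\Big\|^2\Big] + \frac{L}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\|\alpha_{t-1}\delta_{t-1}\|^2\Big] + \frac{L}{2}\mathbb{E}\Big[\sum_{t=1}^{T}\|(\alpha_{t-1}-\alpha_t)e_t\|^2\Big] \\
&\le \Big(\frac{LG^2}{\rho^2}+2dL\Big)\sum_{t=1}^{T}\alpha_t^2 + \frac{dL}{2}\sum_{t=1}^{T}(\alpha_{t-1}-\alpha_t)^2,
\end{aligned}
\tag{37}
$$

where the inequality holds owing to $\|g_t\| \le G$, $\min_j (b_t)_j \ge \rho > 0$, $|(\delta_{t-1})_j| \le 2$, and $\mathbb{E}[\|e_t\|^2] \le d$.

Combining Eq. (31-37), we can obtain

$$
\begin{aligned}
\mathbb{E}[f(z_T)-f(z_1)] &\le \frac{\alpha_0\beta d}{(1-\beta)\rho}G^2 + \Big(\frac{(8\beta^2+10\beta+5)L^2d}{(1-\beta)^2}+\frac{G^2}{2\rho^2}\Big)\sum_{t=1}^{T}\alpha_{t-1}^2 - \frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] \\
&\quad + \frac{4G^2\alpha_1 d}{\rho} + \frac{\alpha_0^2\beta^2LG^2d}{\rho^2(1-\beta)^2} + \Big(\frac{LG^2}{\rho^2}+2dL\Big)\sum_{t=1}^{T}\alpha_t^2 + \frac{dL}{2}\sum_{t=1}^{T}(\alpha_{t-1}-\alpha_t)^2.
\end{aligned}
\tag{38}
$$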
Reformulating Eq. (38), we then have

$$
\begin{aligned}
\frac{1}{G}\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] &\le \mathbb{E}[f(z_1)-f(z_T)] + \Big(\frac{(8\beta^2+10\beta+5)L^2d}{(1-\beta)^2}+\frac{G^2(1+L)}{2\rho^2}+2dL\Big)\sum_{t=1}^{T}\alpha_{t-1}^2 \\
&\quad + \frac{dL}{2}\sum_{t=1}^{T}(\alpha_{t-1}-\alpha_t)^2 + \frac{\alpha_0\beta d}{(1-\beta)\rho}G^2 + \frac{4G^2\alpha_1 d}{\rho} + \frac{\alpha_0^2\beta^2LG^2d}{\rho^2(1-\beta)^2}.
\end{aligned}
\tag{39}
$$

Recall that the learning rate satisfies $\alpha_t = \frac{c}{\sqrt{t}}, \forall t \ge 1$ and $\alpha_0 = \alpha_1 = c$. Utilizing the non-increasing $\alpha_t$ and the Cauchy-Schwarz inequality, we know

$$
\mathbb{E}\Big[\sum_{t=1}^{T}\alpha_t\|\nabla f(x_t)\|^2\Big] \ge T\alpha_T\,\mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2\Big] = \sqrt{T}\,c\,\mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2\Big],
$$

$$
\sum_{t=1}^{T}\alpha_{t-1}^2 = \sum_{t=1}^{T}\frac{c^2}{t} \le c^2\Big(1+\int_{1}^{T-1}\frac{1}{t}\,dt\Big) \le c^2(1+\log T),
$$

and

$$
\sum_{t=1}^{T}(\alpha_{t-1}-\alpha_t)^2 = \sum_{t=2}^{T}(\alpha_{t-1}-\alpha_t)^2 \le \sum_{t=2}^{T}\frac{c^2}{4(t-1)^3} \le \frac{c^2}{4}\Big(1+\int_{1}^{T-2}t^{-3}\,dt\Big) = \frac{c^2}{4}\Big(\frac{3}{2}-\frac{1}{2(T-2)^2}\Big) \le \frac{3c^2}{8}.
$$

We further have

$$
\mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2\Big] \le \frac{C_1}{\sqrt{T}} + \frac{C_2(1+\log T)}{\sqrt{T}},
$$

where we define

$$
C_1 = cG\Big(\mathbb{E}[f(z_1)-f^*] + \frac{3c^2dL}{16} + \frac{\beta c d G^2}{(1-\beta)\rho} + \frac{4cdG^2}{\rho} + \frac{c^2\beta^2LG^2d}{\rho^2(1-\beta)^2}\Big),
\qquad
C_2 = c^3G\Big(\frac{(8\beta^2+10\beta+5)L^2d}{(1-\beta)^2} + \frac{G^2(1+L)}{2\rho^2} + 2dL\Big).
$$

B UNQUANTIZED BINSGDM

We refer to BinSGDM without quantization as SoftSignSGD. The implementation details for SoftSignSGD are shown in Algorithm 2. The gradients need to be aggregated among different workers before SoftSignSGD is performed, just like full-precision SGD and Adam. Compared to Adam, the first difference is that we use the exponential moving average of the absolute gradient, i.e., $b_t = \beta b_{t-1} + (1-\beta)|g_t|$, as the denominator of the update for SoftSignSGD, rather than the conventional square root of the exponential moving average of the squared gradient, i.e., $\sqrt{v_t}$ with $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$. Another difference is that the exponential moving factors for the numerator $m_t$ and the denominator $b_t$ are the same in SoftSignSGD. Together, these two distinctions make each element of the update in SoftSignSGD satisfy $-1 \le \big(\frac{m_t}{b_t}\big)_j \le 1, \forall j \in [1, 2, ..., d]$.
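The element-wise bound on the update can be checked by unrolling the two moving averages of Algorithm 2. The short derivation below is a sketch that assumes both averages start from zero, $m_0 = b_0 = 0$ (an assumption made here for illustration; the initialization is not restated in this section):

$$
(m_t)_j = (1-\beta)\sum_{k=1}^{t}\beta^{t-k}(g_k)_j,
\qquad
(b_t)_j = (1-\beta)\sum_{k=1}^{t}\beta^{t-k}|(g_k)_j|,
$$

$$
|(m_t)_j| \le (1-\beta)\sum_{k=1}^{t}\beta^{t-k}|(g_k)_j| = (b_t)_j
\quad\Longrightarrow\quad
-1 \le \Big(\frac{m_t}{b_t}\Big)_j \le 1 \ \text{ whenever } (b_t)_j > 0.
$$

The inequality is the triangle inequality applied to a sum with identical non-negative weights, which is exactly why the same factor $\beta$ must be used for $m_t$ and $b_t$. Equality holds only when all past gradients in coordinate $j$ share the same sign, so the ratio shrinks as sign changes become more frequent.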

B.1 EXPERIMENTAL RESULTS FOR TRAINING VGG16

We assess the performance of Adam, SoftSignSGD and BinSGDM with VGG-16 on CIFAR100. We sample a set of 128 examples with replacement for each batch. β for SoftSignSGD and BinSGDM is set to 0.95, and β1, β2 for Adam are set to 0.9 and 0.999, respectively. The weight decay is uniformly set to 0.05. To simplify the tuning process and ensure fair comparisons, in each case we start with the same learning rate of 0.005, divide the learning rate by 10 after 75 and 130 epochs, and terminate the procedure after 150 epochs. As illustrated in Figure 5, the convergence speed and the test accuracy of SoftSignSGD and BinSGDM are comparable to those of Adam for training VGG-16 on CIFAR100.

C THE POTENTIAL OSCILLATION FOR Adam

Compared to SoftSignSGD (please refer to Algorithm 2 in Section B), Adam may oscillate around the optimum and fail to approach it. For SoftSignSGD, in the final stage of optimization, the elements of $g_t$ frequently change their signs in the neighborhood of the optimal gradient $g^* = 0$. The more frequently the sign of an element $(g_t)_j$, $\forall j \in [1, 2, ..., d]$, changes, the smaller the corresponding $|(m_t)_j|$ becomes relative to $(b_t)_j$. Hence, when the loss approaches a local optimum, $\frac{m_t}{b_t}$ gradually approaches 0. For Adam, the update is $\frac{m_t}{\sqrt{v_t}}$, where $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$. Although the elements of the gradient also change their signs frequently in the final stage, the relative magnitude of $(m_t)_j$ and $(\sqrt{v_t})_j$ for any $j \in [1, 2, ..., d]$ is uncertain. Hence, when the loss is close to a local optimum, the update $\frac{m_t}{\sqrt{v_t}}$ may not decay to 0, which may cause the loss to oscillate rather than converge to a local minimum. We provide an example in Figure 8 to visually illustrate the convergence behaviors of Adam and SoftSignSGD.
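The toy problem of Figure 8 can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' script: it uses the settings stated in the caption of Figure 8 (f(x) = 0.5x², learning rate 0.5, x0 = 1.0, β1 = 0.9, β2 = 0.99, β = 0.95), omits Adam's bias correction as in the footnote, and assumes zero-initialized moving averages, a step count of 200 and a stabilizing constant `eps`, all of which are illustrative choices.

```python
import math


def adam_trajectory(x0=1.0, lr=0.5, beta1=0.9, beta2=0.99, steps=200, eps=1e-8):
    """Adam without bias correction on f(x) = 0.5 * x**2 (gradient is x)."""
    x, m, v = x0, 0.0, 0.0
    xs = [x]
    for _ in range(steps):
        g = x
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        x -= lr * m / (math.sqrt(v) + eps)
        xs.append(x)
    return xs


def softsign_trajectory(x0=1.0, lr=0.5, beta=0.95, steps=200, eps=1e-8):
    """SoftSignSGD on the same problem; also records the ratio m_t / b_t."""
    x, m, b = x0, 0.0, 0.0
    xs, ratios = [x], []
    for _ in range(steps):
        g = x
        m = beta * m + (1.0 - beta) * g
        b = beta * b + (1.0 - beta) * abs(g)
        ratio = m / (b + eps)  # always within [-1, 1]
        ratios.append(ratio)
        x -= lr * ratio
        xs.append(x)
    return xs, ratios


if __name__ == "__main__":
    adam_xs = adam_trajectory()
    soft_xs, soft_ratios = softsign_trajectory()
    print("Adam        |x| over last 5 steps:", [abs(x) for x in adam_xs[-5:]])
    print("SoftSignSGD |x| over last 5 steps:", [abs(x) for x in soft_xs[-5:]])
    print("largest |m/b| observed:", max(abs(r) for r in soft_ratios))
```

Plotting `adam_xs` against `soft_xs` gives a picture comparable in spirit to Figure 8; the recorded ratios make it easy to see how sign changes of the gradient shrink the SoftSignSGD step.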

D EXPERIMENTS WITH INFINIBAND CONNECTIONS

To further evaluate the communication efficiency of SGDM/Adam, SoftSignSGD and BinSGDM with high-bandwidth connections, we run experiments for training ResNet-50 and BERT-Base on distributed nodes connected with 200Gbps InfiniBand. All other experimental settings are the same as in the Ethernet experiments in Subsection 5.1, and the experimental results are listed in Table 3 and Table 4. Compared with the baseline SGDM/Adam, BinSGDM can still reach up to 1.45× speedup for ResNet-50 on ILSVRC2012 and 2.85× speedup for BERT-Base on SQuAD 1.1, although the speed advantage is less pronounced than with lower-bandwidth Ethernet connections. An interesting phenomenon is that the system throughput of BinSGDM with 10Gbps Ethernet can match that of SGDM/Adam with 200Gbps InfiniBand.

The experimental results in Table 3 and Table 4 also show that as the number of GPUs increases, the scaling efficiency of SGDM/Adam, SoftSignSGD and BinSGDM becomes lower. The reason is as follows. When the number of GPUs doubles, the number of communication trips also multiplies. Take the communication scheme All-Reduce as an example: if the number of GPUs is n, each GPU requires 2(n − 1) trips across the network connections. When n is non-trivial, the computation time of the communication primitives may exceed the time of the pure communication itself and dominate the overall communication time, since the total communication overhead does not change with the number of GPUs. Notably, All-Reduce is more efficient than All-to-All, which is the core of our Hierarchical 1-bit All-Reduce. Hence, as shown in Table 3 and Table 4, the scaling efficiency of BinSGDM decreases more quickly than that of SGDM/Adam as the number of GPUs grows.
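The trip count can be made concrete with a small cost model. The sketch below assumes a ring-style All-Reduce, in which a message is cut into n chunks and each GPU performs n − 1 reduce-scatter steps followed by n − 1 all-gather steps; the ring algorithm and the helper name `ring_allreduce_cost` are illustrative assumptions, chosen because they match the 2(n − 1) trips mentioned above.

```python
def ring_allreduce_cost(num_gpus, message_bytes):
    """Return (trips, bytes_sent_per_gpu) for an idealized ring All-Reduce.

    Each of the 2 * (n - 1) trips moves one chunk of size message_bytes / n.
    """
    if num_gpus < 2:
        return 0, 0.0
    trips = 2 * (num_gpus - 1)
    bytes_per_trip = message_bytes / num_gpus
    return trips, trips * bytes_per_trip


if __name__ == "__main__":
    message_bytes = 100 * 1024 * 1024  # hypothetical 100 MiB gradient buffer
    for n in (8, 16, 32, 64):
        trips, sent = ring_allreduce_cost(n, message_bytes)
        print(f"n={n:2d}  trips={trips:3d}  sent/message={sent / message_bytes:.3f}")
```

In this model the number of trips grows linearly with n while the bytes sent per GPU stay below twice the message size, so per-trip fixed costs rather than bandwidth become the dominant term as n grows.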

E DISCUSSION

The original 1-bit Adam paper (Tang et al. (2021)) reports that 1-bit Adam runs significantly (up to 3.8×) faster than full-precision Adam, and that the speed advantage becomes more obvious as the number of GPUs grows. In contrast, in our experiments, 1-bit Adam does not exhibit clear speed advantages over the original Adam. When running on 64 GPUs, 1-bit Adam is not only slower than the original Adam, but its throughput is even lower than on 32 GPUs. The reasons can be summarized as follows. First, in Tang et al. (2021), the speedup of 1-bit Adam is obtained by comparing the throughput at the compression phase with the throughput at the warm-up phase, so the warm-up phase is excluded when assessing throughput. In our experiments, we evaluate the overall average throughput of the warm-up phase and the compression phase for 1-bit Adam. Second, the baseline Adam in Tang et al. (2021) does not run with the system-level efficient DDP. Third, in Tang et al. (2021), the authors customized highly efficient communication primitives for 1-bit Adam. For the sake of fairness, we only use the off-the-shelf communication primitives in Pytorch for all the optimizers.

As shown in Figure 4, when the number of GPUs continually increases, the communication time for BinSGDM also grows superlinearly. One of the reasons is that the communication primitive All-to-All accounts for more and more of the communication time, and the native All-to-All in Step (iii) of Hierarchical 1-bit All-Reduce is less efficient than the native All-Reduce. Hence, we will further optimize All-to-All and All-Gather to further accelerate BinSGDM.

When training large-scale DNNs, the mixed-precision technique is used to reduce memory, which allows us to further increase the model size. In optimizers, however, we still use full-precision states and full-precision computation, which commonly account for 33-75% of the total memory footprint (Tim Dettmers, et al. 8-bit Optimizer via Block-wise Quantization, ICLR 2022). BinSGDM does not need full-precision states or full-precision computation. Moreover, since the update is randomly quantized to 1 or -1, BinSGDM may leverage gradients of lower precision than FP16 to estimate the update. Therefore, BinSGDM is promising for further applications in reducing memory.
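To illustrate what randomly quantizing the update to 1 or -1 can look like in code, the following NumPy sketch binarizes a vector with entries in [-1, 1] and packs the result into one bit per element. The quantizer itself is defined earlier in the paper and is not restated in this section, so the rule used here, P(+1) = (1 + u)/2, is an assumption chosen because it is unbiased; the function names are hypothetical.

```python
import numpy as np


def stochastic_binarize(u, rng):
    """Map each u_j in [-1, 1] to +1 with probability (1 + u_j) / 2, else -1.

    Under this (assumed) rule E[q_j] = u_j, i.e. the quantizer is unbiased.
    """
    p_plus = (1.0 + u) / 2.0
    return np.where(rng.random(u.shape) < p_plus, 1.0, -1.0)


def pack_signs(q):
    """Store one bit per element: bit 1 encodes +1, bit 0 encodes -1."""
    return np.packbits(q > 0)


def unpack_signs(packed, n):
    """Inverse of pack_signs for the first n elements."""
    bits = np.unpackbits(packed)[:n]
    return bits.astype(np.float32) * 2.0 - 1.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = np.full(200_000, 0.3)           # stand-in for the ratio m_t / b_t
    q = stochastic_binarize(u, rng)
    packed = pack_signs(q)
    print("empirical mean of q:", q.mean())
    print("FP32 bytes:", u.size * 4, " packed bytes:", packed.nbytes)
```

Packing eight signs into each byte gives a 32× reduction relative to FP32 storage of the same vector, which is the kind of saving the memory argument above relies on.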



For simplicity, we omit the bias correction for mt and vt and the small constant in the denominator.
We commonly add a small constant to bt to avoid zero denominators for numerical stability, which guarantees this assumption holds in practice.
https://github.com/huggingface/transformers
https://github.com/pytorch/vision/tree/main/references/classification
https://github.com/juntang-zhuang/Adabelief-Optimizer



Figure 1: Paradigm of Hierarchical 1-bit All-Reduce

(a) Epoch-wise, ResNet-50, batch size=32 × 64 (b) Time-wise, ResNet-50, batch size=32 × 64 (c) Epoch-wise, ResNet-50, batch size=128 × 64 (d) Time-wise, ResNet-50, batch size=128 × 64 (e) Epoch-wise, BERT-Base, batch size=3 × 64 (f) Time-wise, BERT-Base, batch size=3 × 64

Figure 2: Epoch-wise and time-wise convergence speed for training ResNet-50 with 32 samples per GPU, ResNet-50 with 128 samples per GPU, and fine tuning BERT-Base with 3 samples per GPU with 64 GPUs.

Figure 3: System throughput of optimizers for training (a) ResNet-50 with 32 samples per GPU, (b) ResNet-50 with 128 samples per GPU, and (c) fine tuning BERT-Base with 3 samples per GPU with 8, 16, 32, 64 GPUs.

5.1 EXPERIMENTAL SETTINGS

Testbed. Our experiments are implemented on {1, 2, 4, 8} nodes connected with 10Gbps Ethernet, and each node is equipped with 8 Nvidia Tesla A100-80GB GPUs. The hardware and software are the same in all instances. The operating system in each node is Ubuntu 20.04.4 LTS. Our experiments are performed in Pytorch 1.11.0, and other related libraries are CUDA-11.6, cuDNN-8.2 and NCCL-2.10.3. Notably, to be compatible with DDP of Pytorch, parts of BinSGDM and our hierarchical communication scheme are implemented in the customized communication hook of DDP in Pytorch.

Training details. For the experiments over ResNet-50, we evaluate the convergence and performance of SGDM, 1-bit Adam and BinSGDM on ILSVRC2012. The batch size per GPU is set to 32 or 128 with the standard input resolution 224 × 224. When employing SGDM (baseline), the learning rate starts at 0.1 × batch size / 256

Figure 4: Computation time, communication time and compression/decompression time per iteration of optimizers for training (a) ResNet-50 with 32 samples per GPU, (b) ResNet-50 with 128 samples per GPU, and (c) fine tuning BERT-Base with 3 samples per GPU with 8, 16, 32, 64 GPUs.

Algorithm 2: SoftSignSGD
1: Input: model parameter x0, x1, the momentum m, the exponential moving average factor β, the learning rate sequence {αt}
2: for t = 1, ..., T do
3:   Randomly sample ξt and compute the gradient: gt = ∇f(xt; ξt)
4:   Update the momentum mt: mt = βmt-1 + (1 − β)gt
5:   Update the momentum bt: bt = βbt-1 + (1 − β)|gt|
6:   Update the model parameter xt+1: xt+1 = xt − αt mt / bt
7: end for
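The steps of Algorithm 2 translate almost line by line into NumPy. The following is a minimal single-worker sketch rather than the authors' implementation: the function name `softsign_sgd`, the zero initialization of m and b, and the small constant `eps` added to the denominator (mentioned in the footnote on numerical stability) are assumptions made for illustration.

```python
import numpy as np


def softsign_sgd(grad_fn, x0, lr_schedule, beta=0.95, steps=100, eps=1e-8):
    """Run SoftSignSGD for `steps` iterations and return the final iterate.

    grad_fn(x)      -> (stochastic) gradient at x, same shape as x
    lr_schedule(t)  -> learning rate alpha_t for iteration t = 1, 2, ...
    """
    x = np.asarray(x0, dtype=np.float64).copy()
    m = np.zeros_like(x)  # moving average of gradients (numerator)
    b = np.zeros_like(x)  # moving average of absolute gradients (denominator)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta * m + (1.0 - beta) * g
        b = beta * b + (1.0 - beta) * np.abs(g)
        update = m / (b + eps)  # every entry lies in [-1, 1]
        x = x - lr_schedule(t) * update
    return x


if __name__ == "__main__":
    # Quadratic f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    x_final = softsign_sgd(lambda x: x, [1.0, -2.0, 3.0],
                           lambda t: 0.1 / np.sqrt(t), steps=500)
    print("final iterate:", x_final)
```

While the gradient signs stay constant the ratio m/b equals ±1, so each coordinate moves by exactly the learning rate per step; the step only shrinks once the signs start to alternate.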

Figure 5: Training loss and test accuracy for VGG-16 on CIFAR100.

B.2 EXPERIMENTAL RESULTS FOR TRAINING VIT

We train ViT-B with Adam, SoftSignSGD and BinSGDM on ILSVRC2012. We use the Pytorch official implementation for ViT (footnote 5), and all experimental settings follow the recommended ones, except that β is set to 0.95 for SoftSignSGD and BinSGDM and the total number of epochs for all optimizers is set to 150 rather than 300. As shown in Figure 6, the convergence speed and the test accuracy of SoftSignSGD and BinSGDM can match Adam for training ViT-B-16 on ILSVRC2012.

Figure 6: Training loss and test accuracy for ViT-B-16 on ILSVRC2012.

B.3 EXPERIMENTAL RESULTS FOR TRAINING LSTM

We perform experiments for training a 3-layer LSTM on the Penn TreeBank dataset to validate the effectiveness of SoftSignSGD. Our implementations are built upon the code of the AdaBelief paper (footnote 6), and we use the default experimental settings for adaptive optimizers in the code, except that we set β to 0.99 for SoftSignSGD and BinSGDM and the weight decay to 0.3 for all the optimizers. Figure 7 indicates that the convergence speed and the inference performance of SoftSignSGD and BinSGDM are competitive with the widely-used Adam.


Figure 7: Training loss and test perplexity (the lower, the better) for 3-layer LSTM on Penn TreeBank.

Figure 8: The convergence behaviors of Adam and SoftSignSGD. The loss function is f(x) = 0.5x². The learning rate is set to 0.5, and x0 is initialized to 1.0. β1 and β2 for Adam are set to 0.9 and 0.99, and β for SoftSignSGD is set to 0.95.

System throughput and Test Accuracy of SGDM, 1-bit Adam and BinSGDM for training ResNet-50 on ILSVRC2012 from scratch with 8, 16, 32, 64 GPUs.


Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 1-bit Adam: Communication efficient large-scale training with Adam's convergence speed. In International Conference on Machine Learning, pp. 10118-10129, 2021.

