BINSGDM: EXTREME ONE-BIT QUANTIZATION FOR COMMUNICATION-EFFICIENT LARGE-SCALE DISTRIBUTED TRAINING

Abstract

To alleviate the communication bottleneck of large-scale distributed training, a rich body of communication-compression optimizers has been proposed. These methods focus mainly on achieving a high compression ratio in the hope of acceleration. However, as some recent works have pointed out, when running with distributed training frameworks (e.g., DistributedDataParallel in PyTorch), these methods provide no acceleration over the off-the-shelf uncompressed SGD/Adam in typical settings, due to heavy compression/decompression computation, incompatibility with efficient communication primitives, or the requirement of an uncompressed warm-up stage at the beginning of training. For these reasons, we propose a novel extreme one-bit quantization optimizer, dubbed BinSGDM. The quantization of BinSGDM is computationally light, and it does not need to resort to an uncompressed optimizer for warm-up. We also theoretically prove that it attains the same convergence speed as the original Adam. Moreover, we present a hierarchical 1-bit All-Reduce technique to further lower the communication volume. Extensive experiments are conducted on 8 to 64 GPUs (1 to 8 nodes) for distributed training with DistributedDataParallel, and the experimental results demonstrate that BinSGDM with the proposed communication scheme can achieve up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base, compared to the full-precision optimizers.

1. INTRODUCTION

With the rapid development of computational power, ever bigger deep neural network (DNN) models are being proposed in pursuit of better performance, from early classical models, such as AlexNet (61M parameters) (Krizhevsky et al. (2017)) and ResNet (ResNet-50: 25.5M parameters) (He et al. (2016)), to the current foundation models, such as BERT (BERT-Large: 340M parameters) (Devlin et al. (2018)) and GPT (GPT-3: 175B parameters) (Brown et al. (2020)). Scalable parallelism across distributed computing workers has become a necessity for training these large-scale models. During training, millions to billions of parameters need to be communicated among workers at each iteration, and this expensive communication cost becomes a bottleneck. To address the communication bottleneck, a wide variety of lossy gradient-compression optimizers have been proposed to lower the communication volume. These algorithms can typically be divided into three groups: low-precision approximation (e.g., 1-bit SGD (Seide et al. (2014)), SignSGD (Bernstein et al. (2018)), TernGrad (Wen et al. (2017)), QSGD (Alistarh et al. (2017)), and 1-bit Adam (Tang et al. (2021))), low-rank simplification (e.g., ATOMO (Wang et al. (2018)), PowerSGD (Vogels et al. (2019)), and GradZip (Cho et al. (2019))), and sparsification (e.g., Random-k (Stich et al. (2018)), Top-k (Aji & Heafield (2017)), and MSTop-k (Shi et al. (2021))).

While much of the research on gradient-compression algorithms has focused mainly on the compression ratio, a more important yet underexplored problem is how to decrease the actual system-level runtime and increase the distributed scaling efficiency. Indeed, some recent works (Xu et al. (2020); Agarwal et al. (2022)) pointed out that, when distributedly training typical models (e.g., ResNet-50 and BERT-Base) with off-the-shelf DistributedDataParallel (DDP) at typical bandwidths (e.g., 10 Gbps), these existing gradient-compression algorithms with high compression ratios are still slower than the original uncompressed optimizers. This is because they exhibit one or more of the following weaknesses (Xu et al. (2020); Agarwal et al. (2022)): (i) some gradient-compression algorithms must perform compression/decompression and communication within a limited time frame, and the time cost of compression/decompression is, in some cases, close to or even larger than the savings from the reduced communication; (ii) some gradient-compression algorithms cannot take full advantage of overlapping gradient computation with communication, because if gradient computation and compression/decompression overlap, their intensive computations compete with each other for GPU resources, which can result in an overall slowdown; (iii) due to their inherent structure, some algorithms can only use inefficient collective communication primitives, such as All-Gather; (iv) some gradient-compression algorithms need an uncompressed optimizer to warm up at the early stage, and the warm-up time is commonly nontrivial, which to some extent renders their high compression ratios vacuous. Therefore, from a system-level perspective, the design ethos of a system-efficient communication-compression algorithm is that the compression/decompression should be computationally light and fast, the corresponding communication should be friendly to efficient collective communication primitives, and there should be no need to resort to an uncompressed optimizer for warm-up. To this end, we propose a communication-compression optimization algorithm, referred to as Binary SGD-Momentum (BinSGDM), in which the core update rule is

x_{t+1} = x_t - α_t Q(m_t / b_t),  where  m_t = β m_{t-1} + (1 - β) g_t,  b_t = β b_{t-1} + (1 - β) |g_t|,

g_t is the gradient, and Q(·) is a binary quantization operator. The main difference between BinSGDM and existing gradient-quantization algorithms is that we directly quantize the entire update m_t / b_t rather than quantizing the gradient g_t or the momentum m_t.
Since -1 ≤ (m_t)_j / (b_t)_j ≤ 1, where (m_t)_j and (b_t)_j are the j-th elements of m_t and b_t, each element of m_t / b_t can easily be stochastically quantized to 1 or -1, so the quantization is computationally light. Another advantage of BinSGDM is that it does not need a full-precision optimizer to warm up at the early stage to ensure stable convergence. Besides, we theoretically demonstrate that BinSGDM's convergence rate can match that of the original Adam. Moreover, exploiting the structure of BinSGDM, we devise an efficient hierarchical communication scheme to further speed up communication, which fully leverages the ultra-high intra-node bandwidth among GPUs within the same node and uses more efficient communication primitives than All-Gather. In particular, we make the following key contributions:

• We propose a novel communication-compression distributed optimizer, dubbed BinSGDM. To the best of our knowledge, it is the first algorithm that quantizes the entire model update of an adaptive optimizer and does not need an uncompressed optimizer to warm up to address convergence issues, which makes compression/decompression computationally light and lets the extreme quantization ratio exert its full effect (Section 2).

• We theoretically prove that, even though extreme 1-bit quantization is employed, BinSGDM still promises the same convergence speed as the full-precision Adam (Section 3).

• We present a new hierarchical communication scheme for 1-bit communication, called Hierarchical 1-bit All-Reduce, which fully harnesses the ultra-fast intra-node interconnects to accelerate local communication and utilizes more efficient communication primitives to further reduce the communication overhead (Section 4).

• We perform extensive distributed training experiments to demonstrate the effectiveness of the proposed algorithm. As far as we know, our algorithm is the first to consistently outperform the uncompressed optimizers running with the highly system-optimized DDP in overall running time at no inference-performance cost, reaching up to 2.47× speedup for ResNet-50 and 6.26× speedup for BERT-Base on 64 GPUs. This better scalability makes BinSGDM promising for training even larger-scale models (Section 5).
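To make the update rule concrete, the following is a minimal NumPy sketch of one BinSGDM step with the unbiased stochastic binary quantizer: each coordinate u_j of u = m_t / b_t lies in [-1, 1] and is mapped to +1 with probability (1 + u_j)/2 and to -1 otherwise, so E[Q(u_j)] = u_j. The small `eps` added to the denominator is our own numerical safeguard, not part of the update rule stated above; `binsgdm_step` and `binary_quantize` are hypothetical names for illustration.

```python
import numpy as np

def binary_quantize(u, rng):
    """Stochastically quantize each entry of u in [-1, 1] to +1 or -1.

    Q(u_j) = +1 with probability (1 + u_j) / 2, and -1 otherwise,
    so the quantizer is unbiased: E[Q(u_j)] = u_j.
    """
    return np.where(rng.random(u.shape) < (1.0 + u) / 2.0, 1.0, -1.0)

def binsgdm_step(x, m, b, g, lr=0.01, beta=0.9, eps=1e-8, rng=None):
    """One BinSGDM update following the rule in the text:
        m_t = beta * m_{t-1} + (1 - beta) * g_t
        b_t = beta * b_{t-1} + (1 - beta) * |g_t|
        x_{t+1} = x_t - lr * Q(m_t / b_t)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    m = beta * m + (1.0 - beta) * g
    b = beta * b + (1.0 - beta) * np.abs(g)
    u = m / (b + eps)  # each entry lies in [-1, 1] since |m_t| <= b_t
    x = x - lr * binary_quantize(u, rng)
    return x, m, b
```

Because the communicated quantity is only the sign pattern Q(m_t / b_t), each parameter costs a single bit on the wire, which is what the hierarchical 1-bit All-Reduce scheme exploits.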

2. EXTREME ONE-BIT QUANTIZED BINSGDM

In this section, we focus on solving the following problem when training a DNN model in a distributed manner:

min_{x ∈ R^d} f(x) = (1/n) Σ_{i=1}^{n} f_i(x; ξ^{(i)})
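The objective above averages the local losses f_i computed on each worker's data shard ξ^(i); in data-parallel training, the gradient of f is therefore the mean of the n per-worker gradients, which is exactly what an All-Reduce computes. Below is a minimal NumPy illustration under assumed quadratic local losses f_i(x) = ½‖x - c_i‖² (so ∇f_i(x) = x - c_i); the loss form and all variable names are hypothetical, chosen only to make the averaging step explicit.

```python
import numpy as np

# Toy instance of the data-parallel objective
#   f(x) = (1/n) * sum_i f_i(x; xi^(i))
# with assumed quadratic local losses f_i(x) = 0.5 * ||x - c_i||^2,
# whose gradients are grad f_i(x) = x - c_i.
n, d = 4, 3
rng = np.random.default_rng(0)
centers = rng.standard_normal((n, d))  # stand-in for each worker's data shard
x = np.zeros(d)                        # current model parameters

# Each worker computes its local gradient independently...
local_grads = np.stack([x - c for c in centers])
# ...and an All-Reduce averages them into the gradient of the global f.
global_grad = local_grads.mean(axis=0)
```

In full-precision training this averaged gradient is communicated at 32 bits per parameter; BinSGDM replaces the communicated tensor with a 1-bit sign pattern instead.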


