BINSGDM: EXTREME ONE-BIT QUANTIZATION FOR COMMUNICATION-EFFICIENT LARGE-SCALE DISTRIBUTED TRAINING

Abstract

To alleviate the communication bottleneck of large-scale distributed training, a rich body of communication-compression optimizers has been proposed. These methods focus mainly on achieving a high compression ratio in the hope of acceleration. However, as some recent works have pointed out, when run with distributed training frameworks (e.g., DistributedDataParallel in PyTorch), these methods provide no acceleration over the off-the-shelf uncompressed SGD/Adam in typical settings, due to heavy compression/decompression computation, incompatibility with efficient communication primitives, or the requirement of an uncompressed warmup stage early in training. For these reasons, we propose a novel extreme one-bit quantization optimizer, dubbed BinSGDM. The quantization in BinSGDM is computed easily and cheaply, and it does not need to resort to an uncompressed optimizer for warmup. We also theoretically prove that it promises the same convergence speed as the original Adam. Moreover, we present a hierarchical 1-bit All-Reduce technique to further lower the communication volume. Extensive experiments on distributed training with DistributedDataParallel are conducted on 8 to 64 GPUs (1 to 8 nodes), and the results demonstrate that BinSGDM with this communication scheme achieves up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base, compared to the full-precision optimizers.

1. INTRODUCTION

With the rapid development of computational power, "bigger" and "bigger" deep neural network (DNN) models have been proposed in pursuit of better performance, from the early classical models, such as AlexNet (61M parameters) (Krizhevsky et al. (2017)) and ResNet (ResNet-50: 25.6M parameters) (He et al. (2016)), to the current foundation models, such as BERT (BERT-Large: 340M parameters) (Devlin et al. (2018)) and GPT (GPT-3: 175B parameters) (Brown et al. (2020)). Scalable parallelism across distributed computing workers becomes a necessity for training these large-scale models. During training, millions to billions of parameters need to be communicated among workers at each iteration, and this expensive communication becomes a bottleneck.

To address the communication bottleneck, a wide variety of lossy gradient compression optimizers have been proposed to lower the communication volume. These algorithms can typically be divided into three groups: low-precision approximation (e.g., 1-bit SGD (Seide et al. (2014)), SignSGD (Bernstein et al. (2018)), TernGrad (Wen et al. (2017)), QSGD (Alistarh et al. (2017)), and 1-bit Adam (Tang et al. (2021))), low-rank simplification (e.g., ATOMO (Wang et al. (2018)), PowerSGD (Vogels et al. (2019)), and GradZip (Cho et al. (2019))), and sparsification (e.g., Random-k (Stich et al. (2018)), Top-k (Aji & Heafield (2017)), and MSTop-k (Shi et al. (2021))).

While much of the research on gradient compression algorithms has focused mainly on the compression ratio, a more important yet underexplored problem is how to decrease the actual system-level runtime and increase the distributed scaling efficiency. Indeed, some recent works (Xu et al. (2020), Agarwal et al. (2022)) pointed out that, when distributedly training typical models (e.g., ResNet-50 and BERT-Base) with off-the-shelf DistributedDataParallel (DDP) at typical bandwidths (e.g., 10Gbps), these existing gradient compression algorithms with high compression ratios provide no actual speedup over the uncompressed baselines.
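To make the low-precision group concrete, the sketch below shows a generic sign-based 1-bit compressor in the spirit of SignSGD-style methods: each gradient element is reduced to its sign bit, packed 8 signs per byte, plus a single per-tensor scale. This is an illustrative scheme, not BinSGDM's exact quantization rule; the function names and the mean-absolute-value scale are our own assumptions for the example.

```python
import numpy as np

def one_bit_compress(grad):
    """Compress a float32 gradient to sign bits plus one scale.

    A generic SignSGD-style 1-bit scheme (illustrative only, not
    BinSGDM's exact rule): ~32x less traffic per element.
    """
    scale = np.float32(np.mean(np.abs(grad)))  # one float per tensor
    signs = grad >= 0                          # boolean sign mask
    packed = np.packbits(signs)                # 1 bit/element on the wire
    return packed, scale, grad.size

def one_bit_decompress(packed, scale, n):
    """Reconstruct +/- scale from the packed sign bits."""
    signs = np.unpackbits(packed)[:n].astype(np.float32)
    return scale * (2.0 * signs - 1.0)         # map {0,1} -> {-scale,+scale}

# 4 float32 values (16 bytes) compress to 1 byte of signs + 1 scale.
g = np.array([0.3, -1.2, 0.05, -0.4], dtype=np.float32)
packed, scale, n = one_bit_compress(g)
g_hat = one_bit_decompress(packed, scale, n)
```

Note that a bare sign compressor like this is biased; practical 1-bit optimizers (including 1-bit Adam and BinSGDM) additionally carry the quantization residual forward via error feedback or momentum so that the dropped magnitude information is not lost.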

