MAXIMIZING COMMUNICATION EFFICIENCY FOR LARGE-SCALE TRAINING VIA 0/1 ADAM

Abstract

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question for Adam-based large model pre-training (e.g., BERT and GPT). In this paper, we demonstrate that the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are applied individually. To alleviate this limitation, we propose 0/1 Adam, which linearizes each Adam step by approximating its optimizer states with their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve adaptivity, while its linearity allows utilizing 1-bit compression and local steps simultaneously for wall-clock time speed-up. We provide a convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks, including BERT-Base, BERT-Large, and GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam reduces data volume by up to 87% and communication rounds by up to 54%, and achieves up to 2× higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam, while enjoying the same statistical convergence speed and end-task model accuracy on the GLUE benchmark and the ImageNet validation set.

1. INTRODUCTION

Over the past few years, we have witnessed the outstanding performance of foundation models on many applications. However, these models, including BERT (Devlin et al., 2018) and GPT (Radford et al., 2019a; Brown et al., 2020), usually have hundreds of millions or even billions of parameters and must be trained on massive numbers of GPUs. For example, the largest dense transformer model, the 530B-parameter MT-NLG (Smith et al., 2022), was trained on over 4000 GPUs for more than a month. At this scale, the expensive communication overhead across computing processors and servers hinders scalability (Alistarh et al., 2017).

1-bit gradient compression and local steps are two representative methods to mitigate this communication bottleneck. 1-bit compression drastically reduces the communication volume by quantizing each value in the gradient with ultra-low bits (i.e., as low as 1 bit) (Seide et al., 2014; Bernstein et al., 2018a); local steps alternatively save bandwidth by periodically skipping communication rounds (Stich, 2018). While these techniques have demonstrated tremendous success on distributed SGD, their benefits for large-scale Adam-based model training (Kingma and Ba, 2014; Wang et al., 2019a), such as BERT and GPT pre-training, remain an open question.

Compared to SGD, where the model parameters depend linearly on the gradients, the non-linearity in Adam updates (Kingma and Ba, 2014) limits the direct use of compression or local steps. In particular, this non-linearity incurs two challenges: 1) when the gradient is compressed aggressively, such as with a 1-bit quantizer, all coordinate-wise effective learning rates collapse to the same value, so that Adam no longer enjoys adaptive and fast convergence; 2) to ensure all parallel workers reach consensus on the optimizer states, which is critical for convergence, the non-linearity incurs the overhead of iteratively synchronizing those states when using local steps.
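To make the first challenge concrete, the following NumPy sketch (an illustration added here, not code from the paper) applies a scaled 1-bit sign quantizer in the spirit of Seide et al. (2014) to synthetic gradients and tracks Adam's second-moment state v. Because every coordinate of the compressed gradient has the same magnitude, v becomes identical across coordinates, so the per-coordinate step size proportional to 1/sqrt(v) is no longer adaptive.

```python
import numpy as np

def onebit_compress(grad):
    # Scaled 1-bit quantizer: transmit only the sign of each coordinate,
    # plus a single shared scale preserving the mean gradient magnitude.
    scale = np.abs(grad).mean()
    return scale * np.sign(grad)

rng = np.random.default_rng(0)
beta2 = 0.999
stds = np.array([0.1, 1.0, 5.0, 10.0])   # coordinates with very different gradient scales
v_raw, v_1bit = np.zeros(4), np.zeros(4)

for _ in range(2000):
    g = rng.normal(0.0, stds)
    v_raw = beta2 * v_raw + (1 - beta2) * g**2      # Adam's variance state, raw gradients
    c = onebit_compress(g)
    v_1bit = beta2 * v_1bit + (1 - beta2) * c**2    # same state, 1-bit compressed gradients

# With raw gradients, v tracks the per-coordinate scale (adaptive step sizes).
# After 1-bit compression, every coordinate of c has the same magnitude, so v is
# identical across coordinates: the effective learning rates collapse to one value.
print(v_raw)
print(v_1bit)
```

Running this shows v_raw spanning several orders of magnitude across coordinates, while all entries of v_1bit are equal, which is exactly the loss of adaptivity described above.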

Tang et al. (2021) undertook the first investigation of fixing this non-linearity to enable compression and proposed 1-bit Adam. The algorithm follows a two-stage training paradigm: first run Adam

