MAXIMIZING COMMUNICATION EFFICIENCY FOR LARGE-SCALE TRAINING VIA 0/1 ADAM

Abstract

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question for Adam-based large model pre-training (e.g., BERT and GPT). In this paper, we demonstrate that the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are individually applied. To alleviate this limitation, we propose 0/1 Adam, which linearizes each Adam step by approximating its optimizer states using their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve adaptivity, while its linearity allows utilizing 1-bit compression and local steps simultaneously for wall-clock time speedup. We provide a convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks, including BERT-Base, BERT-Large, and GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam reduces data volume by up to 87% and communication rounds by up to 54%, and achieves up to 2× higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam, while enjoying the same statistical convergence speed and end-task model accuracy on the GLUE dataset and the ImageNet validation set.

1. INTRODUCTION

Over the past few years, we have witnessed the outstanding performance of foundation models on many applications. However, these models, including BERT (Devlin et al., 2018) and GPT (Radford et al., 2019a; Brown et al., 2020), usually have hundreds of millions or even billions of parameters and need to be trained on massive numbers of GPUs. For example, the largest dense transformer model, the 530B MT-NLG (Smith et al., 2022), was trained on over 4000 GPUs for more than a month. At this scale, the expensive communication overhead across computing processors and servers hinders scalability (Alistarh et al., 2017). 1-bit gradient compression and local steps are two representative methods to mitigate this communication bottleneck: 1-bit compression drastically reduces the communication volume by quantizing each value in the gradients with ultra-low bits, i.e., as low as 1 bit (Seide et al., 2014; Bernstein et al., 2018a), while local steps alternatively save bandwidth by periodically skipping communication rounds (Stich, 2018). While these techniques demonstrate tremendous success on distributed SGD, their benefits for large-scale Adam-based model training, such as BERT and GPT pre-training, remain an open question (Kingma and Ba, 2014; Wang et al., 2019a).

Compared to SGD, where the model parameters depend linearly on the gradients, the non-linearity in Adam updates (Kingma and Ba, 2014) limits the direct usage of compression or local steps. In particular, this non-linearity incurs two challenges: 1) when aggressively compressing the gradients, such as with a 1-bit quantizer, all the coordinate-wise effective learning rates collapse to the same value, so that Adam no longer enjoys adaptive and fast convergence; 2) to ensure that all parallel workers reach consensus on the optimizer states, which is critical for convergence, the non-linearity incurs the overhead of iteratively synchronizing the states when using local steps.
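To make challenge 1) concrete, consider the following minimal sketch (not the implementation used in this paper; the mean-magnitude 1-bit quantizer follows common sign-compression schemes such as Seide et al. (2014) and Bernstein et al. (2018a)). It shows how feeding 1-bit-compressed gradients into Adam makes the second moment, and hence the per-coordinate learning rates, degenerate:

```python
import torch

def adam_step(x, m, v, g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam (bias correction omitted for brevity). The division by
    # sqrt(v) makes the update a NON-linear function of the gradients.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    x = x - lr * m / (v.sqrt() + eps)
    return x, m, v

def one_bit(g):
    # A common 1-bit scheme: transmit signs plus one shared scalar scale.
    return g.abs().mean() * torch.sign(g)

torch.manual_seed(0)
x, m, v = torch.zeros(4), torch.zeros(4), torch.zeros(4)
g = torch.randn(4)

# Challenge 1): after 1-bit compression, g * g is a constant vector, so v,
# and with it the per-coordinate adaptive learning rate lr / (sqrt(v) + eps),
# collapses to a single shared value across all coordinates.
_, _, v_compressed = adam_step(x, m, v, one_bit(g))
print(v_compressed)  # all coordinates identical -> adaptivity is lost
print(g * g)         # genuinely different per coordinate
```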

Tang et al. (2021) undertook the first investigation of fixing this non-linearity for compression and proposed 1-bit Adam. The algorithm follows a two-stage training paradigm: it first runs Adam with full-precision communication (the full-precision stage¹), and then switches to 1-bit communication once the variance becomes stable (the compression stage). While this paradigm avoids compressing non-linear information by freezing the variance once, the experimental results in (Tang et al., 2021) indicate that the full-precision stage still incurs non-trivial overhead. Furthermore, 1-bit Adam is restricted to the scope of gradient compression and cannot be trivially adapted to other techniques such as local steps. Moreover, the empirical success of (Tang et al., 2021) was not substantiated on GPT-style models, for instance, the 175B GPT-3 (Brown et al., 2020) and the 530B MT-NLG (Smith et al., 2022).
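The toy simulation below illustrates this two-stage schedule. It is a sketch under stated assumptions, not the actual 1-bit Adam implementation: the switch point `freeze_step`, the 2-worker loop, and the mean-magnitude quantizer are illustrative, and in 1-bit Adam it is the momentum, rather than the raw gradient, that is compressed in the second stage.

```python
import torch

def one_bit_with_error_feedback(t, err):
    # 1-bit compression with error feedback (Seide et al., 2014): carry the
    # quantization residual into the next step so errors do not accumulate.
    corrected = t + err
    compressed = corrected.abs().mean() * torch.sign(corrected)
    return compressed, corrected - compressed

torch.manual_seed(0)
num_workers, dim, total_steps, freeze_step = 2, 4, 6, 3
errors = [torch.zeros(dim) for _ in range(num_workers)]

for step in range(total_steps):
    tensors = [torch.randn(dim) for _ in range(num_workers)]  # per-worker updates
    if step < freeze_step:
        # Full-precision stage: exact averaging; the variance state can
        # still be updated consistently on every worker.
        avg = torch.stack(tensors).mean(0)
    else:
        # Compression stage: each worker sends 1 bit per coordinate plus one
        # scalar scale, cutting communication volume by roughly 32x vs fp32.
        sent = []
        for i, t in enumerate(tensors):
            c, errors[i] = one_bit_with_error_feedback(t, errors[i])
            sent.append(c)
        avg = torch.stack(sent).mean(0)
```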

In this paper, we address this gap by proposing 0/1 Adam. 0/1 Adam breaks the barrier of non-linearity in two ways: 1) it adaptively freezes the variance, so that, given agreement on a stale variance state, the parallel workers only need to communicate the momentum, on which the model update depends linearly; this reduces the previous two-stage compression scheme to a unified single stage. 2) It leverages the insight that over adjacent Adam steps the changes to the optimizer states are generally bounded, so that, with frozen variance, parallel workers can linearly approximate the momentum and parameter updates locally without additional synchronization (a minimal sketch of this linearity follows the contribution list below). This further pushes the limit of communication reduction towards its extreme, achieving state-of-the-art speedups for large-scale model training. To summarize, our contributions are as follows:

• We propose 0/1 Adam, which addresses the limitations of the previously proposed 1-bit Adam when applying aggressive 1-bit quantization and local steps (Section 4).

• We provide a convergence guarantee for 0/1 Adam on smooth, non-convex objectives (Section 5).

• We conduct experiments on a wide range of large-scale model training tasks, including BERT-Base, BERT-Large, GPT-2 pre-training and ImageNet. We demonstrate on up to 128 GPUs that 0/1 Adam is able to reduce up to 87% of the data volume and 54% of the communication rounds, and to achieve up to 2× higher throughput and training time reduction compared to the state-of-the-art 1-bit Adam without compromising end-to-end model accuracy (Section 6).

• The 0/1 Adam optimizer and the corresponding experimental scripts (e.g., BERT pre-training and GLUE fine-tuning) have been open sourced in a deep learning optimization library called DeepSpeed².
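To see why freezing the variance linearizes the update, consider the minimal sketch below (the worker count, β₁, learning rate, and `v_frozen` values are illustrative assumptions, not the DeepSpeed implementation). With a frozen variance, the momentum recursion commutes with averaging, which is what allows workers to compress or defer communication and still agree on the state:

```python
import torch

beta1, lr, eps = 0.9, 1e-3, 1e-8
torch.manual_seed(0)
g_workers = [torch.randn(4) for _ in range(8)]  # per-worker gradients
m_prev = torch.zeros(4)
v_frozen = torch.ones(4)                        # stale, frozen variance state

# Path A: average gradients first, then update the momentum (one sync/step).
m_a = beta1 * m_prev + (1 - beta1) * torch.stack(g_workers).mean(0)

# Path B: each worker updates its momentum locally, then momenta are averaged.
m_b = torch.stack([beta1 * m_prev + (1 - beta1) * g for g in g_workers]).mean(0)

# Because the recursion is linear in the gradients, both paths coincide, so
# with a frozen variance workers can compress the momentum or skip rounds
# without the optimizer states drifting apart.
assert torch.allclose(m_a, m_b)
update = lr * m_a / (v_frozen.sqrt() + eps)  # Adam-like step with frozen v
```

The same linearity underpins idea 2): since the optimizer states change in a bounded way over adjacent steps, locally approximated momenta stay close to their synchronized counterparts.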



2. RELATED WORK

Communication-efficient distributed training. The closest works to ours are (Tang et al., 2021; Li et al., 2021a), which propose using two-stage training to enable 1-bit Adam and 1-bit LAMB, respectively. Different from those two works, 0/1 Adam addresses the non-linearity challenges in adaptive optimizers by considering both extreme quantization and local steps. Furthermore, we also study how to apply extreme communication compression to GPT-style models, which to the best of our knowledge is still under-explored.

Adaptive learning rate optimizers. One of the most popular adaptive optimizers is Adam, first introduced in (Kingma and Ba, 2014). It uses both first- and second-moment information of the stochastic gradient to perform optimizer steps and has shown significant benefits for training deep neural networks.
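For reference, the Adam recursion (omitting bias correction; notation follows Kingma and Ba (2014), with stochastic gradient $g_t$, learning rate $\gamma$, and small constant $\epsilon$) makes the non-linearity explicit in the $\sqrt{v_t}$ denominator:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
x_t &= x_{t-1} - \gamma\, \frac{m_t}{\sqrt{v_t} + \epsilon}.
\end{aligned}
$$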



¹ In the original 1-bit Adam paper, this stage is referred to as the warmup stage. We use a slightly different term to avoid confusion with learning rate warmup.
² https://github.com/microsoft/DeepSpeed


