ADAM ACCUMULATION TO REDUCE MEMORY FOOTPRINTS OF BOTH ACTIVATIONS AND GRADIENTS FOR LARGE-SCALE DNN TRAINING

Anonymous

Abstract

Running out of GPU memory has become a major bottleneck for large-scale DNN training, and reducing the memory footprint during training has received intensive research attention. We find that conventional gradient accumulation reduces activation memory but is incompatible with gradient memory reduction, because one method must preserve gradients while the other must release them. To address this issue, we propose a novel optimizer accumulation method for Adam, named Adam Accumulation (AdamA), which reduces both activation and gradient memory. Specifically, AdamA directly integrates gradients into optimizer states and accumulates optimizer states over micro-batches, so that gradients can be released immediately after use. We mathematically and experimentally demonstrate that AdamA yields the same convergence properties as Adam. Evaluated on transformer-based models, AdamA achieves up to 23% memory reduction compared to gradient accumulation, with less than 2% degradation in training throughput. Notably, AdamA can work together with memory reduction methods for optimizer states to fit 1.26× to 3.14× larger models than the PyTorch and DeepSpeed baselines on GPUs with different memory capacities.

1. INTRODUCTION

The past few years have witnessed remarkable achievements of large-scale DNN models across domains from computer vision to natural language processing (Devlin et al., 2018; Radford et al., 2019; Dosovitskiy et al., 2020; Smith et al., 2022). Training such large models requires many powerful GPUs with large memory capacity, which is prohibitively expensive and inaccessible to most researchers. Even when fine-tuning a large pre-trained model, where computational power is a less critical factor, running out of memory is increasingly becoming the foremost limitation (Ren et al., 2021; Rajbhandari et al., 2021).

Recently, there has been an explosion of interest in methods that reduce the memory footprint during model training (Sohoni et al., 2019; Rajbhandari et al., 2020; Pudipeddi et al., 2020; Chen et al., 2016; Shazeer & Stern, 2018). However, there is hardly a one-size-fits-all solution to the out-of-memory issue, for two reasons. First, many memory reduction methods come at the cost of convergence (Mostafa & Wang, 2019; Micikevicius et al., 2017) or training throughput (Chen et al., 2016; Pudipeddi et al., 2020). Before testing, it remains unclear how significant the cost of one method or a combination of methods would be for different models. Second, the ratio of the memory footprint of the various parts (e.g., weights, gradients, optimizer states, activations) varies with the model and the training configuration. No single method performs best in all cases.

Among memory reduction methods, gradient accumulation and gradient release are two effective methods to reduce activation memory and gradient memory, respectively (Huang et al., 2019; Pudipeddi et al., 2020). Neither method has a negative impact on model convergence or training throughput. Unfortunately, these two methods are inherently mutually exclusive.
Gradient accumulation reduces the activation memory by splitting a mini-batch into a sequence of micro-batches and accumulating the gradients of all micro-batches. Gradient release reduces the gradient memory by freeing the gradient-occupied space in a layer-by-layer manner. The contradiction that prevents the two from being used together is that one must preserve the accumulated gradients until the last micro-batch, while the other releases the gradients immediately after use. When choosing between saving activation memory and saving gradient memory, previous works prefer the former, as activations usually consume the most memory during training, while the gradient memory can be ignored when models are small. However, with ever-increasing model sizes, the gradient memory consumption can no longer be ignored.

To address this issue, we propose Adam Accumulation (AdamA), which integrates gradients into optimizer states immediately after the gradients are produced, and accumulates optimizer states sequentially over micro-batches, as shown in Figure 1. This subtle change of directly integrating gradients into optimizer states makes the memory space for whole-model gradients unnecessary, eliminating the aforementioned contradiction between preserving gradients and releasing gradients. Consequently, AdamA can reduce the gradient memory to 1/M of the original (M is the number of layers), and the activation memory to 1/N of the original (N is the number of micro-batches). We further demonstrate, mathematically and experimentally, that AdamA performs the same as standard Adam in terms of convergence properties and final model accuracy, although the optimizer update of AdamA deviates slightly from that of standard Adam. Notably, AdamA is complementary to previous methods that reduce weight and optimizer state memory, making an even higher memory reduction rate possible.

[Figure 1, panel "Gradient accumulation with Adam": FWD 0 & BWD 0, FWD 1 & BWD 1.]

We evaluate AdamA on both language and vision tasks, with typical transformer and convolution architectures. Our contributions can be summarized as follows:

• We propose AdamA, a novel optimizer accumulation method that reduces the memory footprints of activations and gradients simultaneously. Compared with the gradient accumulation baseline, AdamA can save up to 23% of the memory footprint.

• We conduct a convergence analysis for AdamA. Mathematical and experimental results on real workloads show that AdamA has the same convergence properties as Adam.

• We implement the training pipeline of AdamA with PyTorch and DeepSpeed. The system is easy to use and incurs less than 2% effect on training throughput.

2. BACKGROUND AND RELATED WORK

The memory footprint during model training can be categorized into four parts: weights, gradients, optimizer states and activations. As different models, optimizers, or batch sizes lead to different ratios of the four parts, many works have been proposed to reduce them accordingly.

Reducing weight and optimizer state memory. Across training iterations, weights and optimizer states inherently have a temporal dependency, i.e., the values at time step t are updated on the basis of their values at time step t-1. Hence, the training system must maintain weights and optimizer states between two consecutive iterations. To reduce the weight and optimizer state memory, many compression-based methods (e.g., sparsification, quantization and matrix approximation) have been proposed, but they often sacrifice convergence or final model accuracy (Mostafa & Wang, 2019; Micikevicius et al., 2017; Shazeer & Stern, 2018).

Reducing activation and gradient memory. Activations and gradients are computed and used only inside each training iteration, so their memory can potentially be released once they are no longer needed. Gradient accumulation and gradient release are effective methods to reduce activation and gradient memory, respectively (Huang et al., 2019; Pudipeddi et al., 2020). The key idea behind gradient accumulation is to split a mini-batch into several micro-batches. This method computes the gradients of the micro-batches sequentially and accumulates them, which reduces the memory footprint of activations while keeping the same convergence properties as the original mini-batch. Gradient release executes the backward pass in a layer-by-layer manner and releases the gradient-occupied memory immediately after the weight update is finished, so that the memory allocated for gradients can be reduced from the size of the whole model to the size of the largest layer.

The contradiction between gradient accumulation and gradient release. Unfortunately, gradient accumulation to save activation memory and gradient release to save gradient memory are mutually exclusive, because one must maintain all gradients for accumulation until the last micro-batch, while the other frees the gradients immediately after use. Our proposed AdamA resolves this contradiction and enables saving both activation and gradient memory. Note that our method is complementary to previous memory reduction methods (e.g., checkpointing (Chen et al., 2016), Adafactor (Shazeer & Stern, 2018), offloading (Rajbhandari et al., 2021; Ren et al., 2021; Pudipeddi et al., 2020)), and can be applied together with these methods to achieve an even higher memory reduction rate.

3.1. ADAM ACCUMULATION (ADAMA)

As mentioned, gradient accumulation to save activation memory contradicts gradient release to save gradient memory. The core reason is that gradient accumulation accumulates gradients until the last micro-batch, so the gradient memory of the whole model must be preserved. Intuitively, as gradients are eventually used to update the optimizer states ($m$ and $v$ in Adam), if we can integrate gradients into the optimizer states in advance, the gradient memory can be released, thus resolving this dilemma. Inspired by this insight, we propose, for the first time, an optimizer accumulation method, namely AdamA, that integrates gradients into optimizer states immediately after they are produced and then accumulates optimizer states sequentially over micro-batches.

Algorithm 1: Adam vs. AdamA with micro-batches

    Initialize $\theta_0$, $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $t \leftarrow 0$, $N \leftarrow$ # of micro-batches
    while $\theta_t$ not converged do
        $t \leftarrow t + 1$
        for each micro-batch $i$ in a mini-batch do
            $g_{t,i} \leftarrow \frac{1}{N} \nabla_\theta f_{t,i}(\theta_{t-1})$
        end
        $m_t = \beta_1 m_{t-1} + (1-\beta_1) \sum_{i=0}^{N-1} g_{t,i}$
        $v_t = \beta_2 v_{t-1} + (1-\beta_2) \big[ \big(\sum_{i=0}^{N-1} g_{t,i}\big)^2 \text{ (Adam)} \;\; \text{v.s.} \;\; \sum_{i=0}^{N-1} g_{t,i}^2 \text{ (AdamA)} \big]$
        Update $\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$, $\theta_t \leftarrow \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$
    end

Algorithm 1 illustrates the difference between standard Adam and our proposed AdamA in the case of micro-batching. Standard Adam first accumulates the gradients of all micro-batches, then updates $m_t$ with the accumulated gradients and $v_t$ with the square of the accumulated gradients (the first alternative in the $v_t$ update). In contrast, AdamA updates $v_t$ by accumulating the squares of the gradients generated from each micro-batch (the second alternative). This slight change allows gradients to be used and released immediately once they are generated, leading to a significant reduction in gradient memory during training.
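The two state updates in Algorithm 1 can be illustrated on a single scalar parameter. The following is a minimal sketch, not the authors' implementation: the function names and the toy gradient values are hypothetical, and plain Python floats stand in for tensors.

```python
import math


def adam_step_accumulated(m, v, micro_grads, beta1=0.9, beta2=0.999):
    """Adam with gradient accumulation: v uses the square of the summed gradient."""
    n = len(micro_grads)
    g_sum = sum(g / n for g in micro_grads)  # accumulated gradient must be kept
    m_new = beta1 * m + (1 - beta1) * g_sum
    v_new = beta2 * v + (1 - beta2) * g_sum ** 2
    return m_new, v_new


def adama_step(m, v, micro_grads, beta1=0.9, beta2=0.999):
    """AdamA: fold every micro-batch gradient into m and v as soon as it exists."""
    n = len(micro_grads)
    m_new, v_new = beta1 * m, beta2 * v
    for g in micro_grads:
        g_i = g / n
        m_new += (1 - beta1) * g_i
        v_new += (1 - beta2) * g_i ** 2
        # g_i is not needed after this point and could be freed
    return m_new, v_new


def apply_update(theta, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected parameter update shared by both variants."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


micro_grads = [0.5, -0.2, 0.1, 0.4]  # hypothetical raw micro-batch gradients
m_adam, v_adam = adam_step_accumulated(0.0, 0.0, micro_grads)
m_adama, v_adama = adama_step(0.0, 0.0, micro_grads)
theta_adam = apply_update(1.0, m_adam, v_adam, t=1)
theta_adama = apply_update(1.0, m_adama, v_adama, t=1)
```

In this sketch the first moment is identical for both variants, while the second moment differs, so the two parameter updates point in the same direction but use a different scaling.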
To analyze AdamA's impact on model convergence, we mathematically prove that AdamA yields the same convergence rate as Adam (Section 3.2), and experimentally demonstrate that AdamA performs the same as Adam on vision and language tasks (Section 4.1).

In Algorithm 2, we show the detailed training pipeline of AdamA to reduce both activation and gradient memory. Similar to gradient accumulation, AdamA divides a mini-batch of training data into several micro-batches to reduce the activation memory to $1/N$ of that without micro-batches. During the backward pass of each micro-batch, once the gradients of a layer ($g_{t,i,j}$) are produced, $g_{t,i,j}$ and $g_{t,i,j}^2$ are accumulated into the optimizer states of this layer ($m_{t,j}$ and $v_{t,j}$), respectively. The memory of $g_{t,i,j}$ is released after the accumulation. As a result, the peak memory allocated for gradients can be reduced to only $1/M$ of the full model gradient size.
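The layer-by-layer pipeline described above can be sketched in plain Python to make the peak gradient memory visible. This is an illustrative sketch under stated assumptions: `layer_grad_fn` is a hypothetical callback standing in for the backward pass of one layer, lists of floats stand in for tensors, and the peak is counted in number of gradient values rather than bytes.

```python
def adama_minibatch(m, v, layer_grad_fn, num_micro, beta1=0.9, beta2=0.999):
    """Run one AdamA mini-batch over per-layer states m[j], v[j] (lists of floats).

    layer_grad_fn(i, j) is a hypothetical callback returning the raw gradient
    of layer j on micro-batch i. Returns the peak number of gradient values
    that were alive at the same time.
    """
    num_layers = len(m)
    # Decay the optimizer states once per mini-batch.
    for j in range(num_layers):
        m[j] = [beta1 * x for x in m[j]]
        v[j] = [beta2 * x for x in v[j]]

    peak = 0
    for i in range(num_micro):
        for j in reversed(range(num_layers)):  # backward pass visits last layer first
            g = [x / num_micro for x in layer_grad_fn(i, j)]  # gradient of one layer only
            peak = max(peak, len(g))
            m[j] = [mj + (1 - beta1) * gj for mj, gj in zip(m[j], g)]
            v[j] = [vj + (1 - beta2) * gj * gj for vj, gj in zip(v[j], g)]
            del g  # release the layer gradient before the next layer is processed
    return peak


layer_sizes = [3, 5, 2]  # hypothetical toy model with three layers
m = [[0.0] * s for s in layer_sizes]
v = [[0.0] * s for s in layer_sizes]


def toy_layer_grad(i, j):
    # Hypothetical gradient: every entry of micro-batch i equals i + 1.
    return [float(i + 1)] * layer_sizes[j]


peak = adama_minibatch(m, v, toy_layer_grad, num_micro=2)
total = sum(layer_sizes)
```

Here `peak` equals the size of the largest layer while `total` is the size of the full model gradient, which matches the 1/M figure only when all layers have the same size.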

Algorithm 2: The training pipeline using AdamA to reduce both activation and gradient memory

    Initialize $\theta_0$, $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $t \leftarrow 0$, $N \leftarrow$ # of micro-batches, $M \leftarrow$ # of layers
    while $\theta_t$ not converged do
        $t \leftarrow t + 1$, $m_t \leftarrow \beta_1 m_{t-1}$, $v_t \leftarrow \beta_2 v_{t-1}$
        for each micro-batch $i$ in a mini-batch do    // Reduce activation memory to 1/N of the original without micro-batches
            for each layer $j$ in backward computing do
                $g_{t,i,j} \leftarrow \frac{1}{N} \nabla_\theta f_{t,i,j}(\theta_{t-1})$
                Assign memory for $g_{t,i,j}$    // Reduce gradient memory to 1/M of full model gradients
                $m_{t,j} \leftarrow m_{t,j} + (1-\beta_1) g_{t,i,j}$
                $v_{t,j} \leftarrow v_{t,j} + (1-\beta_2) g_{t,i,j}^2$
                Release memory for $g_{t,i,j}$    // The $g_{t,i,j}$ values are not needed any more
            end
        end
        Update $\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$, $\theta_t \leftarrow \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$
    end

3.2. CONVERGENCE ANALYSIS

In this section, we demonstrate the convergence properties of AdamA. Adam (Kingma & Ba, 2014) is an optimization method that adaptively rescales the update vector with the second moment of the gradient. Compared with Adam, AdamA has the same update direction (i.e., $m_t$), but a different adaptive scaling length (i.e., $1/\sqrt{v}$). We follow Adam's proof methods to show that AdamA has the same theoretical convergence properties as Adam.

Following the analysis method of the online learning framework (Zinkevich, 2003), we define $f_t$ as the convex cost function at time $t$, and $\theta_t$ as the parameter we predict. We evaluate the convergence properties of AdamA using the regret

$$R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^*) \right],$$

which is the sum of all previous differences between the online prediction $f_t(\theta_t)$ and the best fixed-point parameter $f_t(\theta^*)$ (Kingma & Ba, 2014). In Theorem 1, we guarantee that AdamA has the same regret bound $O(\sqrt{T})$ as Adam; the detailed proof is given in the appendix. We define the vector $g_{1:T,i,b} \in \mathbb{R}^{T}$ as the $i$-th dimension of the gradients from the $b$-th micro-batch of each mini-batch up to step $T$.
Following the Adam paper (Kingma & Ba, 2014), Theorem 1 holds when the learning rate $\alpha_t$ decays at a rate of $t^{-1/2}$ and the first-moment running average coefficient $\beta_{1,t}$ decays exponentially with $\lambda$.

Theorem 1. Assume $\beta_1, \beta_2 \in [0, 1)$ satisfy $\gamma = \frac{\beta_1^2}{\sqrt{\beta_2}} < 1$. $N$ is the number of micro-batches in a mini-batch. The function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$. For any $m, n \in \{1, ..., T\}$, the distance between any $\theta_t$ generated by AdamA is bounded, i.e., $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_n - \theta_m\|_\infty \le D_\infty$. AdamA achieves the following guarantee, for all $T \ge 1$:

$$R(T) \le \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha(\beta_1+1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{i=1}^{d} \sum_{b=1}^{N} \|g_{1:T,i,b}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}$$

In Corollary 1, we show that the average regret of AdamA is $O(\frac{1}{\sqrt{T}})$, which is the same as Adam. Hence the average regret tends to 0 as $T$ grows.

Corollary 1. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$. For any $m, n \in \{1, ..., T\}$, the distance between any $\theta_t$ generated by AdamA is bounded, i.e., $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_n - \theta_m\|_\infty \le D_\infty$. Combining Theorem 1 with the upper bound $\sum_{i=1}^{d} \sum_{b=1}^{N} \|g_{1:T,i,b}\|_2 \le d G_\infty \sqrt{T}$, we obtain the average regret of AdamA:

$$\frac{R(T)}{T} = O\left(\frac{1}{\sqrt{T}}\right)$$
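The step from the regret bound to the average regret can be written out explicitly. The following is a sketch of that substitution, assuming in addition that $\sqrt{\hat{v}_{T,i}} \le G_\infty$ for every coordinate $i$ (an assumption made here for illustration; it is not stated in the text above) and using the upper bound on the gradient norms quoted in Corollary 1.

```latex
\[
\begin{aligned}
\frac{R(T)}{T}
&\le \frac{1}{T}\left[
  \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}}
  + \frac{\alpha(\beta_1+1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}}
    \sum_{i=1}^{d}\sum_{b=1}^{N} \|g_{1:T,i,b}\|_2
  + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}
\right] \\
&\le \frac{1}{T}\left[
  \frac{D^2\, d\, G_\infty \sqrt{T}}{2\alpha(1-\beta_1)}
  + \frac{\alpha(\beta_1+1)\, d\, G_\infty^2 \sqrt{T}}{(1-\beta_1)\sqrt{1-\beta_2}}
  + \frac{d\, D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}
\right] \\
&= O\!\left(\frac{1}{\sqrt{T}}\right).
\end{aligned}
\]
```

The first two terms inside the brackets grow as $\sqrt{T}$ and the third is constant in $T$, so after dividing by $T$ the dominant contribution decays as $1/\sqrt{T}$.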

3.3. SYSTEM IMPLEMENTATION TO REDUCE MEMORY FOOTPRINT

Our system design aims to reduce memory while incurring little impact on training throughput. The challenge is to achieve efficient memory allocation and low communication cost.

In the single-device scenario, the forward pass is done as normal. During the backward pass, we interleave back-propagation with optimizer state updates. For each model layer, gradients are accumulated into the corresponding optimizer states and immediately released. We implement this mechanism in PyTorch with backward hooks that insert the optimizer state accumulation and layer gradient release operations. Frequent memory allocation and free operations are costly in general. However, the deep learning framework maintains a memory pool for tensor allocation and release, which avoids the heavy overhead of system calls.

In the distributed data parallel scenario, the same operations apply, except that gradients need to be all-reduced among the distributed devices. When training with AdamA, the straightforward implementation is to insert a layer gradient all-reduce operation in each micro-batch to update the optimizer states. However, compared with the standard Adam procedure, where the all-reduce only needs to be done once after all micro-batches, the communication cost would increase from O(1) to O(N) per mini-batch (N is the number of accumulation steps). To reduce the communication volume, we choose to all-reduce optimizer states instead of gradients. In this way, local optimizer states are updated on each device and synchronized at the end of the mini-batch with the PyTorch all-reduce API. Therefore, the communication volume stays constant per mini-batch. The details of our system design can be found in the Appendix.
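The communication trade-off described above can be made concrete with a small counting sketch. It is only illustrative: the function name is hypothetical, and it assumes that $m$ and $v$ each hold one value per model parameter, as in Adam, and that every all-reduce moves the full tensor once.

```python
def values_allreduced_per_minibatch(num_params, num_micro_batches):
    """Count values entering all-reduce per mini-batch on one device.

    Assumes m and v each have num_params entries (one per parameter).
    """
    return {
        # Standard Adam with gradient accumulation: gradients once per mini-batch.
        "adam_gradient_accumulation": num_params,
        # Naive AdamA: gradients are all-reduced in every micro-batch.
        "adama_gradient_allreduce": num_micro_batches * num_params,
        # AdamA as described: m and v are all-reduced once per mini-batch.
        "adama_state_allreduce": 2 * num_params,
    }


# Example with a 340M-parameter model (the BERT-Large size) and N = 8.
counts = values_allreduced_per_minibatch(num_params=340_000_000, num_micro_batches=8)
```

Under these assumptions the state all-reduce moves twice as many values as standard Adam but stays independent of N, whereas the naive variant grows linearly with N.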

4. EXPERIMENTS

In this section, we evaluate AdamA on both vision and language tasks. We show that AdamA does no harm to the convergence properties compared with Adam. Then, we demonstrate the memory reduction result of AdamA and its impact on training throughput. Finally, we include a case study where AdamA is combined with DeepSpeed ZeRO-DP to further explore the ability to train large-scale models and push the limit to reduce memory of gradients, activations and optimizer states.

4.1. CONVERGENCE BEHAVIOR

To verify the convergence properties of AdamA, we experiment on both NLP (transformer-based) and CV (convolution-based) models. We use the same mini-batch size when training with Adam and AdamA, and set the accumulation steps N to 2, 4 and 8 when training with AdamA.

For the NLP model, we follow the RoBERTa method (Liu et al., 2019) to pre-train BERT-Large (L = 24, H = 1024, A = 16, 340M parameters) on a DGX A100 with a sequence length of 128 and a mini-batch size of 1024. We use the implementation of the BERT model from Microsoft DeepSpeed (Rasley et al., 2020). For the pre-training corpus, we use English Wikipedia and BooksCorpus downloaded from the NVIDIA GitHub (NVIDIA). Note that the corpus we use is different from that used in the original BERT (Devlin et al., 2018), because the BERT corpus is not publicly available at this time. For the pre-training hyper-parameters, we follow the method in RoBERTa. Regardless of the number of micro-batches in a mini-batch, we find that the convergence curve of AdamA coincides with that of Adam (Figure 2). To further evaluate the convergence of the BERT-Large models trained by Adam and AdamA, we fine-tune the models on all tasks from the GLUE benchmark (Wang et al., 2018). We fine-tune for 3 epochs on all tasks and select the best fine-tuning learning rate (among 2e-5, 3e-5, 4e-5, 5e-5) on the Dev set. Table 1 shows the fine-tuning results. The model pre-trained with AdamA provides accuracy similar to that of the model pre-trained with Adam.

For the CV model, we train ResNet-50 with 4 A100 GPUs on the ImageNet (Deng et al., 2009) dataset to evaluate the convergence properties of AdamA. Following the training setting provided by MMClassification (mmlab), we train ResNet-50 with a mini-batch size of 1024. We initialize the learning rate to 1e-3 and cosine-decay it to 1e-5. Figure 3 presents the training loss curve and the test accuracy of ResNet-50, from which we conclude that AdamA has almost the same convergence properties as Adam in CV tasks.
Since Batch Normalization (BN) (Ioffe & Szegedy, 2015) is used in ResNet, we also examine the effect on model accuracy that may arise from the difference between the normalization statistics of a micro-batch and those of the entire mini-batch. As noted by Sohoni et al. (2019), the influence of BN on model convergence tends to be constant once the micro-batch size exceeds a certain threshold. Therefore, we make no effort to keep exactly the same BN algorithm between micro-batched and non-micro-batched training. In the experimental results shown in Figure 3, we also find that the impact on convergence can be ignored.

To help understand the difference between AdamA and Adam during training, we collect the following statistics. From the update equation $\theta_t \leftarrow \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$, it is clear that our adaptive scaling length differs from that of standard Adam by a coefficient $\sqrt{v_t}/\sqrt{v'_t}$. We track this coefficient when training ResNet-50 on the CIFAR-100 dataset. In Figure 4, we plot the mean value of $\sqrt{v_t}/\sqrt{v'_t}$ at each training step together with its value range. The coefficient generally stays around 1.0, and its deviation range is within 1%. We think this minimal deviation in each iteration might contribute to the same convergence properties of Adam and AdamA during training.

As AdamA eliminates the contradiction between gradient accumulation and gradient release, we first show the improvement of AdamA over gradient accumulation. Compared with gradient accumulation, AdamA can reduce the memory footprint of both activations and gradients. We measure the memory footprint when training BERT-Large with AdamA on a DGX A100 (8 A100 GPUs) with a mini-batch size of 256 and a sequence length of 128. As shown in Figure 5, AdamA saves 1.6 GB more memory than gradient accumulation, regardless of the number of accumulation steps in a mini-batch.
To further show the memory saving effect of AdamA, we expand the BERT model to BERT-4B with 4 billion weights, using the scaling method of GPT-3 (Brown et al., 2020). We set the mini-batch size to 64 and the accumulation steps to 8 in this experiment. In Figure 6(a), we train BERT-4B with gradient accumulation and with AdamA using the PyTorch framework. AdamA saves 23.2% of the memory footprint compared with gradient accumulation when the number of model weights reaches 4 billion. Compared with other memory-efficient optimizers, e.g., Adafactor (Shazeer & Stern, 2018) and SM3 (Anil et al., 2019), the memory reduction of our proposal is larger under the same experimental setting. The comparison with the Adam baseline is shown in Table 2. AdamA reaches a more significant memory reduction because it targets the memory usage of both activations and gradients, while the other works only aim to reduce optimizer state memory. At the same time, AdamA can work together with these previous works to obtain further memory reduction.

To show the compatibility of AdamA with existing methods, we combine AdamA with ZeRO-DP (Rajbhandari et al., 2020), a popular memory reduction method for optimizer states. ZeRO-DP $P_{os}$ partitions the optimizer states across GPUs when training with data parallelism. In Figure 6(b), we combine AdamA with ZeRO-DP $P_{os}$ to further reduce gradient and activation memory. AdamA with ZeRO-DP $P_{os}$ saves 20.1 GB more memory than ZeRO-DP $P_{os}$ alone. Even compared with ZeRO-DP $P_{os+g}$, which partitions both optimizer states and gradients, our combined method saves 7.6 GB more memory.

In this section, we show that AdamA has negligible impact on training throughput. During training, it is reasonable to set the micro-batch size as large as the device memory allows, in order to saturate the GPUs and achieve maximal training throughput.
Therefore, the micro-batch size is fixed in this section.

Single-GPU Scenario. As mentioned in Section 3.3, our single-GPU implementation of AdamA is designed to incur no extra throughput overhead. In Figure 7(a), we compare the throughput with standard Adam when training ResNet-50 on one A100 GPU. We keep the micro-batch size at 256 and set the accumulation steps to 2, 4 and 8. We conclude that training with AdamA has little impact on throughput in the single-GPU scenario.

Distributed Data Parallel Scenario. As explained in Section 3.3, our system design keeps the number of communication rounds constant by synchronizing the optimizer states. Although this may incur a larger communication volume than standard Adam, which synchronizes the gradients, the impact on throughput is minimal. In Figure 7(b)(c), we conduct multi-GPU experiments with two models: BERT-Base with 4 A100 GPUs and BERT-Large with 8 A100 GPUs. The micro-batch size of all the models is set to 1024. The experiments show that the training throughput difference is within 2%. The throughput gap between AdamA and Adam gradually decreases as the number of gradient accumulation steps increases. This is because the communication volume per mini-batch is constant, so the proportion of communication overhead becomes smaller in a mini-batch with more accumulation steps.

Figure 2: Sample-wise convergence properties for BERT-Large pre-training with sequence length 128 using a DGX A100. AdamA has almost the same training loss curve as Adam.

Figure 3: The training loss curve and the test accuracy of ResNet-50 on ImageNet.

Figure 4: Statistics on the value of $\sqrt{v_t}/\sqrt{v'_t}$.

Figure 6: The memory reduction of AdamA when training BERT-4B using PyTorch and DeepSpeed.

Figure 7: AdamA has less than 2% effect on the training throughput compared with gradient accumulation using Adam: (a) training ResNet-50 with a single GPU; (b) training BERT-Base with 4 A100 GPUs; (c) training BERT-Large with 8 A100 GPUs.

When training BERT-Large, AdamA achieves less memory usage than Adafactor (Shazeer & Stern, 2018) and SM3 (Anil et al., 2019).

The largest model size that can fit on different DGX systems with AdamA.

We explore the largest transformer-based model that can fit on DGX systems with various memory capacities using AdamA. At present, the mainstream DGX systems on the market include DGX-1 (8 V100-16GB GPUs), DGX-2 (16 V100-32GB GPUs), and DGX A100 (8 A100-80GB GPUs). To keep the same experimental settings, we set the number of GPUs to 8. The mini-batch size and accumulation steps are set to 256 and 8, respectively. With the PyTorch framework, the largest model AdamA can train is 1.26x to 1.33x larger than the largest model gradient accumulation can train. Combined with DeepSpeed ZeRO-DP, AdamA can train a model with 18.2 billion weights on a DGX A100, which is 3.14x larger than the model the system can train with only ZeRO-DP $P_{os}$.

A CONVERGENCE PROOF

In this section, we present the convergence analysis of AdamA. Compared with the proof in the appendix of the original Adam paper (Kingma & Ba, 2014), AdamA follows almost the same analysis. The most obvious difference is that Adam does not take micro-batches into consideration. Here we use $N$ for the number of micro-batches in AdamA and $b$ as the subscript for the micro-batch index, and we highlight the conclusions that differ between AdamA and the Adam paper's appendix.

First, we build our proof on the fact that a convex function can be lower bounded by a hyperplane, which is expressed mathematically in Lemma 2.

Definition 1. A function $f: \mathbb{R}^d \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^d$ and all $\lambda \in [0, 1]$,

$$\lambda f(x) + (1-\lambda) f(y) \ge f(\lambda x + (1-\lambda) y)$$

Lemma 2. If a function $f: \mathbb{R}^d \to \mathbb{R}$ is convex, then for all $x, y \in \mathbb{R}^d$,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x)$$

Lemma 3 and Lemma 4 support the proof of Theorem 5.

Lemma 3. Let $g_t = \nabla f_t(\theta_t)$ and $g_{1:t}$ be defined as above and bounded, $\|g_t\|_2 \le G$, $\|g_t\|_\infty \le G_\infty$. Then

$$\sum_{t=1}^{T} \sqrt{\frac{g_{t,i}^2}{t}} \le 2 G_\infty \|g_{1:T,i}\|_2$$

Proof. The proof is the same as that of Lemma 10.3 in the Adam appendix (Kingma & Ba, 2014) and hence is omitted here.

Lemma 4. For $\beta_1, \beta_2 \in [0, 1)$ that satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$ and bounded $g_t$, $\|g_t\|_2 \le G$, $\|g_t\|_\infty \le G_\infty$, and the micro-batch number equal to $N$, the following inequality holds: [...]

Proof.

$$
\begin{aligned}
\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}}
&= \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \frac{\left(\sum_{k=1}^{T} (1-\beta_1)\beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T \sum_{j=1}^{T} (1-\beta_2)\beta_2^{T-j} g_{j,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \sum_{k=1}^{T} \frac{T \left((1-\beta_1)\beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T (1-\beta_2)\beta_2^{T-k} g_{k,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \frac{(1-\beta_1)^2}{\sqrt{T(1-\beta_2)}} \sum_{k=1}^{T} T \left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k} \|g_{k,i}\|_2 \\
&= \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \frac{(1-\beta_1)^2}{\sqrt{T(1-\beta_2)}} \sum_{k=1}^{T} T \left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k} \Big\| \sum_{b=1}^{N} g_{k,i,b} \Big\|_2 \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \frac{(1-\beta_1)^2}{\sqrt{T(1-\beta_2)}} \sum_{k=1}^{T} T \left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k} \sum_{b=1}^{N} \|g_{k,i,b}\|_2 \\
&\le \cdots
\end{aligned}
$$

Theorem 5. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$, that the distance between any $\theta_t$ generated by AdamA is bounded, $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_n - \theta_m\|_\infty \le D_\infty$ for any $m, n \in \{1, ..., T\}$, that $\beta_1, \beta_2 \in [0, 1)$ satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$, and that the micro-batch number equals $N$. Let $\alpha_t = \frac{\alpha}{\sqrt{t}}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. AdamA achieves the following guarantee, for all $T \ge 1$:

$$R(T) \le \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha(\beta_1+1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{i=1}^{d} \sum_{b=1}^{N} \|g_{1:T,i,b}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}$$

Proof. Following the proof of Theorem 10.5 in the Adam appendix and applying Lemma 4, we obtain the final convergence bound by substituting [...] the same as Adam does.

B MORE DETAILS ABOUT ADAMA IN THE DISTRIBUTED DATA PARALLEL SCENARIO

In the distributed data parallel scenario, we make the update of $m_t$ and $v_t$ in AdamA (number of GPUs = M, number of micro-batches per mini-batch = N) consistent with the update of AdamA (number of micro-batches per mini-batch = NM) in the single-device scenario. To achieve this, we propose a new update method for $m$ and $v$ across GPUs.

As mentioned in Section 3.3, we all-reduce optimizer states instead of gradients at the end of each mini-batch. $M$ is the number of GPUs, and $N$ is the number of micro-batches per mini-batch. Other symbols follow the definitions in Algorithm 1, with $g_{t,i} \leftarrow \frac{1}{N} \nabla_\theta f_{t,i}(\theta_{t-1})$. Note that we multiply $v_{t-1}$ by $M\beta_2$ instead of $\beta_2$ before the start of each mini-batch. Before the optimizer states are all-reduced, the values of $m_t$ and $v_t$ on GPU $k$ are as follows, where the superscript $(k)$ indexes the GPU:

$$m_t^{(k)} = \beta_1 m_{t-1} + (1-\beta_1) \sum_{i=0}^{N-1} g_{t,i}^{(k)}, \qquad v_t^{(k)} = M\beta_2 v_{t-1} + (1-\beta_2) \sum_{i=0}^{N-1} \big(g_{t,i}^{(k)}\big)^2$$

During the all-reduce operation for $m_t$, we take the average of $m_t$ over the GPUs (add them together and divide by $M$). For $v_t$, we divide by $M^2$ instead of $M$ after summing $v_t$ over the GPUs. After that, the values of $m_t$ and $v_t$ on each GPU are:

$$m_t = \beta_1 m_{t-1} + \frac{1-\beta_1}{M} \sum_{k=1}^{M} \sum_{i=0}^{N-1} g_{t,i}^{(k)}, \qquad v_t = \beta_2 v_{t-1} + \frac{1-\beta_2}{M^2} \sum_{k=1}^{M} \sum_{i=0}^{N-1} \big(g_{t,i}^{(k)}\big)^2$$

These $m_t$ and $v_t$ are consistent with the $m_t$ and $v_t$ from Algorithm 1, as long as we replace $N$ with $NM$ in line 5, "$g_{t,i} \leftarrow \frac{1}{N} \nabla_\theta f_{t,i}(\theta_{t-1})$". In this way, the update of $m_t$ and $v_t$ in the distributed scenario (number of GPUs = M, number of micro-batches per mini-batch = N) is consistent with the update of AdamA in the single-device scenario (number of micro-batches per mini-batch = NM). As the convergence properties of AdamA have been proven to be the same as those of Adam in the single-device scenario, its convergence properties are also guaranteed in the distributed scenario.
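The consistency claim above can be checked numerically with a small simulation. This is a minimal sketch, not the authors' system: GPUs are simulated as Python lists, the all-reduce is replaced by an explicit sum, scalars stand in for tensors, and all function names and the random toy gradients are hypothetical.

```python
import random


def single_device_adama(m_prev, v_prev, grads, beta1=0.9, beta2=0.999):
    """AdamA state update on one device; grads are raw micro-batch gradients."""
    n = len(grads)
    m, v = beta1 * m_prev, beta2 * v_prev
    for g in grads:
        g_i = g / n
        m += (1 - beta1) * g_i
        v += (1 - beta2) * g_i ** 2
    return m, v


def distributed_adama(m_prev, v_prev, grads_per_gpu, beta1=0.9, beta2=0.999):
    """Simulate M devices that synchronize optimizer states once per mini-batch."""
    num_gpus = len(grads_per_gpu)
    local_m, local_v = [], []
    for grads in grads_per_gpu:
        n = len(grads)
        m = beta1 * m_prev
        v = num_gpus * beta2 * v_prev  # decay v by M * beta2 instead of beta2
        for g in grads:
            g_i = g / n
            m += (1 - beta1) * g_i
            v += (1 - beta2) * g_i ** 2
        local_m.append(m)
        local_v.append(v)
    # Simulated all-reduce: average m over devices, divide the summed v by M^2.
    m_t = sum(local_m) / num_gpus
    v_t = sum(local_v) / num_gpus ** 2
    return m_t, v_t


rng = random.Random(0)
M, N = 4, 3  # hypothetical number of GPUs and micro-batches per GPU
grads_per_gpu = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
all_grads = [g for grads in grads_per_gpu for g in grads]

m_dist, v_dist = distributed_adama(0.2, 0.05, grads_per_gpu)
m_single, v_single = single_device_adama(0.2, 0.05, all_grads)  # N * M micro-batches
```

Up to floating-point rounding, `m_dist` and `v_dist` match `m_single` and `v_single`, which is the equivalence between the M-GPU update and the single-device update with NM micro-batches.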

