MAXIMIZING COMMUNICATION EFFICIENCY FOR LARGE-SCALE TRAINING VIA 0/1 ADAM

Abstract

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question on Adam-based large model pre-training (e.g. BERT and GPT). In this paper, we demonstrate the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are individually applied. To alleviate this limitation, we propose 0/1 Adam that linearizes each Adam step via approximating its optimizer states using their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve the adaptivity, while its linearity allows utilizing 1-bit compression and local steps simultaneously for wall-clock time speed up. We provide convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks such as BERT-Base, BERT-Large, GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam is able to reduce up to 87% of data volume, 54% of communication rounds, and achieve up to 2× higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam; while enjoying the same statistical convergence speed and end task model accuracy on GLUE dataset and ImageNet validation set.

1. INTRODUCTION

Over the past few years, we have witnessed outstanding performance of foundation models on many applications. However, these models, including BERT (Devlin et al., 2018) and GPT (Radford et al., 2019a; Brown et al., 2020), usually have hundreds of millions or even billions of parameters and must be trained on a massive number of GPUs. For example, the largest dense transformer model, 530B MT-NLG (Smith et al., 2022), was trained over 4000 GPUs in more than a month. At this scale, the expensive communication overhead across computing processors and servers hinders scalability (Alistarh et al., 2017). 1-bit gradient compression and local steps are two representative methods to mitigate the communication bottleneck. 1-bit compression drastically reduces the communication volume by quantizing each value in the gradients with ultra-low bits (i.e., as low as 1 bit) (Seide et al., 2014; Bernstein et al., 2018a), while local steps save bandwidth by periodically skipping communication rounds (Stich, 2018). While these techniques demonstrate tremendous success on distributed SGD, their benefits for large-scale Adam-based model training, such as BERT and GPT pre-training, remain an open question (Kingma and Ba, 2014; Wang et al., 2019a). Compared to SGD, where the model parameters are linearly dependent on the gradients, the non-linearity in Adam updates (Kingma and Ba, 2014) limits the direct usage of compression or local steps. In particular, this non-linearity incurs two challenges: 1) when aggressively compressing the gradient, such as with a 1-bit quantizer, all the coordinate-wise effective learning rates become the same value, so that Adam no longer enjoys adaptive and fast convergence; 2) to ensure all parallel workers reach consensus on the optimizer states, which is critical for convergence, the non-linearity incurs the overhead of iteratively synchronizing the states when using local steps. Tang et al. (2021) undertook the first investigation of fixing this non-linearity for compression and proposed 1-bit Adam. The algorithm follows a two-stage training paradigm: first run Adam [...]

[Figure 1 caption: [...] and $v_t$ denote the variance term computed via the local gradient on worker-0 and via the gradient from full-precision AllReduce, respectively. We also profile the variance difference in adjacent steps, $\|v_t - v_{t-1}\|$. Similarly, we profile the same two metrics for the momentum.]

[...] learning models. Reddi et al. (2019) spot the issue of Adam convergence and provide a variant called AMSGrad, while Zaheer et al. (2018) argue that Adam only converges with large batch sizes. Multiple lines of theoretical study on Adam are given in (Fang and Klabjan, 2019; Alacaoglu et al., 2020; Défossez et al., 2020). Additionally, Chen et al. (2018); Zhou et al. (2018a); Lu et al. (2020); Danilova et al. (2020); Zou et al. (2019) provide more general analysis on Adam-type optimizers. Subsequently, other variants of Adam are proposed in (Luo et al., 2019; Chen et al., 2019b; Huang et al., 2018; Wang et al., 2019b; Zhou et al., 2018b; Zhuang et al., 2021; 2020). Unlike these methods, which focus on improving the convergence of generic optimization for DNN models, our work studies how to maximize the communication efficiency of Adam in large-scale distributed training settings.

3. A CLOSER LOOK AT NON-LINEARITY IN ADAM

In this section, we provide a more formal description of the problem setting and illustrate the limitations of the original Adam and the state-of-the-art 1-bit Adam (Tang et al., 2021).

Problem Formulation. In this paper, we consider the following optimization problem:

$$\min_{x\in\mathbb{R}^d} f(x) = \mathbb{E}_{\zeta\sim\mathcal{D}}\, f(x;\zeta). \tag{1}$$

The non-linearity in Adam. At step $t \ge 0$, denote $x_t$ and $g_t$ as the model parameters and the stochastic gradient computed at step $t$, respectively. The update formulas of SGD and Adam can be summarized as:

$$\text{SGD update:}\quad x_{t+1} \leftarrow x_t - \gamma g_t. \tag{2}$$

$$\text{Adam update:}\quad m_{t+1} \leftarrow \beta_1 m_t + (1-\beta_1) g_t, \quad v_{t+1} \leftarrow \beta_2 v_t + (1-\beta_2)(g_t)^2, \quad x_{t+1} \leftarrow x_t - \underbrace{\frac{\gamma}{\sqrt{v_t+\epsilon}}}_{\text{effective learning rate}} \cdot\, m_t, \tag{3}$$

where $\gamma$ is the learning rate, $\epsilon$ is a small constant to prevent division by zero, and $\beta_1$ and $\beta_2$ are tunable decaying factors. The linearity in the SGD update implies that when using compression or local steps, the potential noise from (accumulated) gradients is in the order of $O(\gamma)$, which approaches zero when the learning rate is decaying or set to be small. By comparison, the two auxiliary optimizer states in Adam, momentum ($m$) and variance ($v$), introduce non-linearity in the model update.

Equation (3) gives the formula of Adam when running it sequentially. In a distributed setting with $n$ workers, $g_t$ in Equation (3) is often computed in parallel on different workers. Mathematically, if we denote $g_t^{(i)}$ as the stochastic gradient computed on the $i$-th worker at step $t$, then distributed Adam can be written by replacing $g_t$ with $\frac{1}{n}\sum_{i=1}^{n} g_t^{(i)}$ in Equation (3) as follows:

$$m_{t+1} \leftarrow \beta_1 m_t + (1-\beta_1)\,\frac{1}{n}\sum_{i=1}^{n} g_t^{(i)}, \qquad v_{t+1} \leftarrow \beta_2 v_t + (1-\beta_2)\left(\frac{1}{n}\sum_{i=1}^{n} g_t^{(i)}\right)^2.$$

Published as a conference paper at ICLR 2023

Issue with non-linearity on 1-bit compression. The main bottleneck in running distributed Adam is the accumulation of $\frac{1}{n}\sum_{i=1}^{n} g_t^{(i)}$, since the gradients are usually high-dimensional. 
Based on the profiling results from (Tang et al., 2021; Li et al., 2021a), the communication of gradients can take up to 94% of the total training time on modern clusters. Gradient compression mitigates this issue by sending and averaging gradients with fewer bits. However, in Adam this causes a loss of learning rate adaptivity. Consider the aggressive 1-bit compression (Liu et al., 2018), which sends each gradient as signs only, plus a shared magnitude (usually the average over all the coordinates). More specifically, denote $C[\cdot]$ as the 1-bit compression; then

$$C[a] = \frac{\|a\|_1}{d}\cdot \mathrm{sign}(a), \quad \forall a\in\mathbb{R}^d.$$

It is straightforward to observe that naively applying 1-bit compression to the gradients in the original Adam loses coordinate-wise adaptivity, since the shared magnitude makes all the coordinate-wise learning rates $\gamma/\sqrt{v_t+\epsilon}$ the same value. This makes Adam no different from momentum SGD.

Issue with non-linearity on local steps. In SGD (Equation (2)), the model update is linearly dependent on the gradients and has no additional states. This implies that with local steps, the parallel workers can entirely reach consensus after a single round of synchronization, even with compression (Basu et al., 2020). However, in Adam, simply synchronizing the model can still leave the momentum and variance out of sync. This makes parallel workers fail to capture the global adaptivity when the system scales up. To give a more concrete example, we profile a full run of BERT-Large pre-training with original Adam, and summarize different metrics of momentum and variance in Figure 1. As shown in Figure 1(d) and 1(b), the difference between the local and global optimizer states (momentum and variance) remains roughly constant and does not decrease to zero.

1-bit Adam and its limitations. 1-bit Adam (Tang et al., 2021) is a state-of-the-art solution that addresses non-linearity in 1-bit compression. 1-bit Adam adopts a pre-conditioned variance state obtained by first running original Adam for $T_0$ steps. 
The intuition there is that at a later stage of training, the variance state becomes stable, so that $v_{T_0}$ can be a good approximation of the variance state for the remaining steps. As partially illustrated in Section 1, the full-precision stage of 1-bit Adam still presents non-trivial overhead. For instance, as illustrated in (Tang et al., 2021), when training BERT-Large on 64 GPUs using Ethernet, while the full-precision stage contains 15% of the total steps, it can take more than 50% of the entire training in terms of wall-clock time. Additionally, 1-bit Adam is restricted to the scope of compression; how it handles other techniques such as local steps remains an open question.

4. 0/1 ADAM

In this section, we give the full description of 0/1 Adam. To maximize the communication efficiency, ideally we want an algorithm that enables adaptive convergence like Adam, while allowing aggressive compression (e.g. 1 bit) and requiring no additional synchronization of the optimizer states when using local steps. 0/1 Adam solves this problem from two aspects.

Adaptive Variance Freezing. To begin with, 0/1 Adam creates a linear environment that freezes the variance adaptively. The intuition comes from the observation in Figure 1(a): the change of variance over steps in Adam is generally smooth. While 1-bit Adam captures a reasonable variance estimate via one-time freezing, it is reasonable to also presume that before its freezing point, the variance within several adjacent steps stays close due to its smoothness. This motivates us to extend the one-time freezing policy in 1-bit Adam into an adaptive one, by letting workers agree upon the freezing points from a given step index set $\mathcal{T}_v \subseteq \{0,\cdots,T-1\}$. The frozen variance creates multiple intervals over training, during which the workers agree on the denominator (Equation (3)), and the only uncertainty left is in the numerator, which enters the model update linearly, just like in SGD. 
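The loss of adaptivity under naive 1-bit compression, discussed in Section 3, can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: `one_bit_compress` is a hypothetical helper implementing $C[a] = \frac{\|a\|_1}{d}\cdot\mathrm{sign}(a)$, and the gradient values and hyperparameters are made up for illustration.

```python
import numpy as np

def one_bit_compress(a):
    """C[a] = (||a||_1 / d) * sign(a): one sign per coordinate plus a shared magnitude."""
    return (np.abs(a).sum() / a.size) * np.sign(a)

# Made-up gradient whose coordinates differ by orders of magnitude.
g = np.array([0.5, -2.0, 0.01, 4.0])
c = one_bit_compress(g)

# Feeding the compressed gradient into Adam's variance update gives every
# coordinate the same second-moment estimate, hence the same step size.
beta2, eps, gamma = 0.999, 1e-8, 1e-3
v = (1 - beta2) * c**2                 # one variance step starting from v_0 = 0
effective_lr = gamma / np.sqrt(v + eps)
```

All entries of `effective_lr` coincide here, which is the sense in which Adam with naively compressed gradients behaves like momentum SGD. Freezing the variance at a value computed from uncompressed gradients avoids this collapse.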
Algorithm 1 Proposed 0/1 Adam Algorithm
Require: local model on the $i$-th node $x_0^{(i)}$, learning rate $\{\gamma_t\}_{t=1}^{T}$, $m_0 = 0$, $v_0 = 0$, auxiliary buffer $u_0 = 0$, total number of iterations $T$, decaying factors $\beta_1$, $\beta_2$ from Adam, numerical constant $\epsilon$, variance update step index set $\mathcal{T}_v$, synchronization step index set $\mathcal{T}_u$, the most recent step with synchronization $t' = 0$.
1: for $t = 0,\cdots,T-1$ do
2:   Compute local stochastic gradient $g_t^{(i)}$.
3:   Update momentum: $m_{t+\frac12}^{(i)} = \beta_1 m_t^{(i)} + (1-\beta_1) g_t^{(i)}$.
4:   Update model: $x_{t+\frac12}^{(i)} = x_t^{(i)} - \gamma_t m_t^{(i)} / \sqrt{v_t+\epsilon}$.
5:   Update buffer: $u_{t+\frac12}^{(i)} = u_t^{(i)} + \gamma_t m_t^{(i)}$.
6:   if $t \in \mathcal{T}_u$ then
7:     Perform 1-bit AllReduce: $\bar{u}_{t+\frac12} = \text{1bit-AllReduce}\big(u_{t+\frac12}^{(i)}\big)$.
8:     Approximate momentum with compressed buffer: $m_{t+1}^{(i)} = \bar{u}_{t+\frac12} / \sum_{h=t'}^{t}\gamma_h$.
9:     Update model with compressed buffer: $x_{t+1}^{(i)} = x_{t'}^{(i)} - \bar{u}_{t+\frac12} / \sqrt{v_t+\epsilon}$.
10:    Reset the auxiliary buffer: $u_{t+1}^{(i)} = 0$.
11:    Update the synchronization step: $t' = t$.
12:  else
13:    $x_{t+1}^{(i)} = x_{t+\frac12}^{(i)}$; $m_{t+1}^{(i)} = m_{t+\frac12}^{(i)}$; $u_{t+1}^{(i)} = u_{t+\frac12}^{(i)}$.
[...]

Including 1-bit Compression and Local Steps. With frozen variance, we make another observation based on Equation (3): the model difference on workers is linearly dependent on the momentum. Therefore, the momentum can be approximated locally from the communicated model difference rather than synchronized additionally, given the premise that the change of momentum is not abrupt within close steps. Formally, denote $x_t^{(i)}$, $m_t^{(i)}$, $v_t^{(i)}$ as the model, momentum, and variance on worker $i$ at step $t$, respectively. Suppose all the workers are synchronized at step $t'$; then with frozen variance $v$ over all the workers,

$$u_t^{(i)} = \sum_{k=t'}^{t} \gamma_k m_k^{(i)} \qquad \text{(actual sent tensors in the communication)}$$

$$x_{t+1}^{(i)} = x_{t'}^{(i)} - \frac{\frac{1}{n}\sum_{i=1}^{n} u_t^{(i)}}{\sqrt{v+\epsilon}} \qquad \text{(sync model parameters with the sent tensors)}$$

$$m_{t+1}^{(i)} \approx \frac{\frac{1}{n}\sum_{i=1}^{n} u_t^{(i)}}{\sum_{k=t'}^{t}\gamma_k} \qquad \text{(approximate momentum via linear estimates from the sent tensors)}$$

where we omit the compression part for brevity. Combined with compression, we provide the full description of 0/1 Adam in Algorithm 1. Note that here we defer the details of 1-bit compression to Appendix A and treat it as a black-box procedure named 1bit-AllReduce, while the original full-precision AllReduce is referred to as AllReduce. We also remark that although both techniques appear to be natural, to the best of our knowledge, we are the first to apply them to address the non-linearity challenges in 1-bit compression and local steps for maximizing the communication efficiency of the Adam optimizer.
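The linearity argument of this section can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the authors' code: it runs one interval of local steps for several simulated workers with a frozen variance, replaces 1bit-AllReduce with a plain average (no compression), and uses synthetic gradients that do not depend on the model. The function names `local_interval` and `synchronous_reference` are hypothetical.

```python
import numpy as np

def local_interval(grads, gammas, v, beta1=0.9, eps=1e-8):
    """Run one interval of local steps on n workers, then synchronize once.

    grads has shape (steps, n_workers, d); v is the frozen variance.
    Returns the model change since the last sync and the momentum estimate.
    """
    _, n, d = grads.shape
    m = np.zeros((n, d))   # per-worker momentum
    u = np.zeros((n, d))   # per-worker auxiliary buffer
    for g, gamma in zip(grads, gammas):
        u += gamma * m                      # buffer accumulates gamma_t * m_t
        m = beta1 * m + (1 - beta1) * g     # local momentum update
    u_bar = u.mean(axis=0)                  # uncompressed stand-in for 1bit-AllReduce
    return -u_bar / np.sqrt(v + eps), u_bar / np.sum(gammas)

def synchronous_reference(grads, gammas, v, beta1=0.9, eps=1e-8):
    """Same interval, but gradients are averaged across workers at every step."""
    d = grads.shape[2]
    m, x = np.zeros(d), np.zeros(d)
    for g, gamma in zip(grads, gammas):
        x -= gamma * m / np.sqrt(v + eps)
        m = beta1 * m + (1 - beta1) * g.mean(axis=0)
    return x

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4, 5))          # 8 local steps, 4 workers, d = 5
gammas = np.full(8, 1e-3)
v = rng.uniform(0.1, 1.0, size=5)           # frozen variance
delta_x, m_est = local_interval(grads, gammas, v)
```

Under these assumptions, `delta_x` matches the model change produced by `synchronous_reference`, because with a frozen denominator the accumulated update is linear in the per-worker momenta. In real training the gradients depend on the diverging local models, so the match is only approximate.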

5. CONVERGENCE ANALYSIS

In this section, we provide the convergence guarantee for 0/1 Adam (Algorithm 1) under an arbitrary freezing policy $\mathcal{T}_v$ and local steps policy $\mathcal{T}_u$. In the main paper, we provide the convergence rate in the general case. However, specific choices of $\mathcal{T}_v$ or $\mathcal{T}_u$ give us the opportunity to obtain tighter bounds. We leave this discussion to the appendix. We start by making the following assumptions.

Assumption 1. Lipschitzian gradient: $f(\cdot)$ is assumed to have $L$-Lipschitzian gradients, which means $\|\nabla f(x) - \nabla f(y)\| \le L\|x-y\|$, $\forall x, \forall y$.

Assumption 2. Bounded variance: The stochastic gradient computed on each worker is unbiased and has bounded variance: $\mathbb{E}_{\zeta\sim\mathcal{D}}\|\nabla f(x;\zeta) - \nabla f(x)\|^2 \le \sigma^2$, $\forall x$.

Assumption 3. Bounded gradient: The infinity norm of the stochastic gradient is bounded by a constant $G_\infty > 0$ such that $\|g_t\|_\infty \le G_\infty$, $\forall t$.

Assumption 4. Compression error in Algorithm 1: For arbitrary $x\in\mathbb{R}^d$, there exists a constant $\Delta$ such that the output of the compressor $C[\cdot]$ has the following error bound: $\mathbb{E}\|C[x]-x\|^2 \le \Delta^2$.

Assumption 5. Given the ordered set $\mathcal{T}_u$, denote $t_j$ as the $j$-th element in $\mathcal{T}_u$. We assume there exists a constant $H \ge 0$ such that $\max_{1\le j<|\mathcal{T}_u|}(t_{j+1}-t_j) \le H$.

Remarks on the assumptions. Assumptions 1, 2 and 3 are standard in the domain of non-convex optimization. Compared with the 1-bit Adam paper (Tang et al., 2021), we do not explicitly assume a uniform lower bound on the variance coordinates, i.e., $e_j^\top v > v_{\min} > 0$, $\forall j$, for some constant $v_{\min}$. Instead, we assume an infinity-norm bound on the gradient as in Assumption 3, which is more realistic. Assumption 4 is also assumed in (Tang et al., 2021); in the appendix we discuss a variant of 0/1 Adam that converges under a weaker condition on $C$. The convergence of Algorithm 1 is then given in the following theorem.

Theorem 1. Under Assumptions 1 to 5, let $m = |\mathcal{T}_v|$ and select $\beta_1, \beta_2 \in [0,1)$ such that $m \le \log(1-\beta_1)/\log(\beta_2)$. If we run Algorithm 1 with a constant learning rate: for all $t \ge 0$,

$$\gamma_t = \min\left\{ \sqrt{\frac{n}{\sigma^2 T}},\ \frac{1}{4L\sqrt{G_\infty^2+\epsilon}},\ \frac{2\sqrt{G_\infty^2+\epsilon}}{L},\ \frac{1}{6} \right\},$$

then it holds that

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2 \le O\left( \frac{\sigma}{\sqrt{nT}} + \frac{H^2\Delta^2(m+n)}{T} + \frac{1}{T} \right),$$

where $\bar{x}_t = \frac{1}{n}\sum_{i=1}^{n} x_t^{(i)}$ and we omit $f(0)-\inf_{x\in\mathbb{R}^d} f(x)$, $G_\infty$, $d$, $\epsilon$, $\beta_1$, $\beta_2$ and $L$ as constants.

Theorem 1 shows that 0/1 Adam (Algorithm 1) essentially admits the same convergence rate as distributed SGD, in the sense that it achieves linear speed up, at rate $O(1/\sqrt{nT})$. The effect of compression ($\Delta$) and local steps ($H$) only appears in non-dominating terms.
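To get a feel for the condition $m \le \log(1-\beta_1)/\log(\beta_2)$ in Theorem 1, one can evaluate it for the decaying factors reported in Appendix C ($\beta_1 = 0.9$, $\beta_2 = 0.999$). The helper below is a hypothetical illustration, not part of the paper; it simply computes the largest integer $m$ the condition admits.

```python
import math

def max_variance_updates(beta1, beta2):
    """Largest integer m with m <= log(1 - beta1) / log(beta2)."""
    return math.floor(math.log(1 - beta1) / math.log(beta2))

# Decaying factors from Appendix C.
m_max = max_variance_updates(0.9, 0.999)
```

For these values the bound evaluates to 2301 variance updates. The condition is equivalent to $\beta_2^m \ge 1-\beta_1$, which is the form used in the last step of the proofs in Appendix D.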

6. EXPERIMENTS

In this section, we evaluate the performance of 0/1 Adam on several large-scale model training tasks, comparing with baselines (1-bit Adam (Tang et al., 2021) and original Adam (Kingma and Ba, 2014)). Since Tang et al. (2021) already demonstrated that 1-bit Adam has similar statistical results to Adam, we omit the comparison of end-to-end model accuracy to Adam for brevity. Throughout the experiments, we enable FP16 training for all the tasks following (Tang et al., 2021). This makes the full-precision communication (including Adam, the full-precision stage in 1-bit Adam and the full-precision AllReduce in 0/1 Adam) use 16 bits per number. We use the 1-bit compressor (Equation (4)) in 0/1 Adam.

Experimental details. We adopt the following tasks for the evaluation: BERT-Base (L = 12, H = 768, A = 12, 110M params) and BERT-Large (L = 24, H = 1024, A = 16, 340M params) pre-training, training ResNet18 (12M params) on ImageNet (He et al., 2016), and GPT-2 pre-training. For the BERT models, we use the same dataset as (Devlin et al., 2018), which is a concatenation of Wikipedia and BooksCorpus with 2.5B and 800M words, respectively. We use the GLUE fine-tuning benchmark (Wang et al., 2018b) to evaluate the convergence of the BERT models trained by different algorithms. For ImageNet, we adopt the ImageNet-1k dataset, which contains 1.28M images for training and 50K images for validation (Deng et al., 2009). For GPT-2, we adopt the model from its original paper (Radford et al., 2019b), which contains 117M parameters (48 layers, 1600 hidden size, 25 attention heads). For training data, we adopt the same dataset blend as in (Shoeybi et al., 2019): Wikipedia (Devlin et al., 2018), CC-Stories (Trinh and Le, 2018), RealNews (Zellers et al., 2019), and OpenWebtext (Radford et al., 2019b).

[Figure caption fragment: [...] ResNet18) are small compared to BERT, and its parallelism speed up will be limited if applied to the same large system as BERT (128 GPUs). We therefore test it on 4 to 32 GPUs in Figure (d).]
Other details, including learning rate schedules and hyperparameters, can be found in Appendix C.

Hardware. We evaluate on two clusters: one with 4 NVIDIA V100 GPUs per node and a 40 Gigabit Ethernet inter-node network (2.7 Gbps effective bandwidth); the other with 8 V100 GPUs per node and a 100 Gigabit InfiniBand EDR inter-node network (close to theoretical peak effective bandwidth). We use 4 to 128 GPUs for the BERT and ImageNet training tasks to measure 0/1 Adam's performance gain. We use 64 GPUs for GPT-2 pre-training. Additionally, for ImageNet training we apply the accelerated data loading technique from lmdb.

Policy for $\mathcal{T}_v$ and $\mathcal{T}_u$ in 0/1 Adam. We first illustrate our policy on $\mathcal{T}_v$. We observe from our motivation study (Figure 1) that the variance difference in adjacent steps decreases roughly exponentially. Denoting $k_j$ as the step where the $j$-th variance update takes place, we select $\mathcal{T}_v$ such that

$$k_{j+1} - k_j = 2^{\lfloor j/\kappa \rfloor}$$

for a given $\kappa > 0$. We adopt $\kappa = 16$ for all three tasks.

Table 1: GLUE development set results. BERT-Base/Large (original) results are from (Devlin et al., 2018). BERT-Base/Large (Adam and 1-bit Adam) results are from (Tang et al., 2021).

Then we move on to discuss the policy for $\mathcal{T}_u$. Based on the derivation in Section 4, the approximation noise from local steps is proportional to the learning rate. Therefore, if we denote $t_j$ as the step where the $j$-th synchronization takes place, our intuition is to increase $t_{j+1} - t_j$ roughly inversely proportionally to the learning rate at $t_j$, so as to keep the approximation noise bounded. For BERT-Base/Large pre-training, as illustrated before, the learning rate decreases exponentially by a factor of 0.99 every 520 steps after 12.5K linear warmup steps. We therefore set $t_{j+1} - t_j = 1$ for the first 12.5K steps and after that let it multiply by 2 every 32678 steps, based on the calculation that the learning rate will decrease by half. Similarly, for ImageNet we set $t_{j+1} - t_j = 1$ for the first 50050 steps (10 epochs) and after that let it multiply by 2 every 50050 steps (10 epochs). We clip the interval at 16 in all the tasks. This corresponds to $H = 16$ in Assumption 5. Finally, since our theory in Section 5 indicates that the approximation is more accurate when the variance is frozen, we additionally stop updating the variance when $t_{j+1} - t_j > 1$.

Remarks on the selected policy. As described, the policies for $\mathcal{T}_v$ and $\mathcal{T}_u$ generally follow the adopted learning rate schedule. This is favored in practice for three reasons: (1) It does not require much hyperparameter tuning. If one instead adopted a constant or decaying policy based on some utility function, capturing the dynamics of training to reach a useful interval would require tedious hyperparameter search and retraining. (2) A learning-rate-based policy is well-motivated. Communication is naturally a way of eliminating the difference among workers. If workers adopt a larger learning rate, they take larger steps in the weight space, and so more frequent communication is needed for them to reach consensus. (3) Learning rate schedules are used in nearly all ML/DL applications, and thus the method can be easily adapted to other applications. In fact, in the scope of large model training, a typical learning rate schedule is a linear warm-up followed by a decaying phase, which is very similar to our test cases here.
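The two policies described above can be written down as short generators. The following Python sketch is one possible reading under explicit assumptions, not the authors' code: it assumes the variance updates start at $k_0 = 0$ with index $j$ starting from 0, and that the synchronization gap stays at 1 until one full doubling period has elapsed after warmup. The function names and the default arguments (taken from the BERT description) are illustrative, and the additional rule that stops variance updates once the gap exceeds 1 is not modeled.

```python
def variance_update_steps(total_steps, kappa=16):
    """Steps k_j with k_0 = 0 and k_{j+1} - k_j = 2 ** (j // kappa)."""
    steps, k, j = [], 0, 0
    while k < total_steps:
        steps.append(k)
        k += 2 ** (j // kappa)   # the gap doubles every kappa updates
        j += 1
    return steps

def sync_interval(t, warmup=12500, double_every=32678, clip=16):
    """Gap t_{j+1} - t_j in force at step t under the BERT schedule."""
    if t < warmup:
        return 1                 # synchronize every step during warmup
    # After warmup the gap doubles every `double_every` steps, clipped at `clip`.
    return min(clip, 2 ** ((t - warmup) // double_every))
```

With $\kappa = 16$, the first 16 gaps between variance updates are 1, the next 16 are 2, and so on, so the variance is refreshed exponentially less often as training proceeds.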

6.1. CONVERGENCE SPEED AND QUALITY ANALYSIS

Figure 2 presents the sample-wise and time-wise convergence results for different algorithms with 128 GPUs on the Ethernet cluster. We find that 0/1 Adam provides the same sample-wise convergence speed as the baseline, with up to 2× time-wise speed up. This is achieved by using both the 1-bit quantizer to compress the communication volume (up to 32× reduction) and 1-bit AllReduce to reduce the expensive synchronization overhead for local steps in both warmup and non-warmup phases. Table 2 provides the ImageNet validation accuracy of models trained by different algorithms, and we find the final accuracy reaches the accuracy reported by the PyTorch library (Pytorch, 2014). For brevity, the convergence comparison on GPT-2 is given in Appendix C.

6.2. TRAINING THROUGHPUT ANALYSIS

Figure 3 summarizes the throughput results on different tasks and different clusters. We observe that 0/1 Adam consistently outperforms the baselines in all settings. It is also worth mentioning that 0/1 Adam on Ethernet (2.7 Gbps effective bandwidth, 4 GPUs per node) is able to achieve throughput comparable to 1-bit Adam on InfiniBand (near 100 Gbps effective bandwidth, 8 GPUs per node), as shown by the red line in Figure 3(b) and the blue line in Figure 3(c). This demonstrates that 0/1 Adam removes redundancy in communication effectively enough to overcome the hardware barrier.

Communication reduction and the role of local steps. To better understand the importance and effect of local steps, we additionally run a special case of 0/1 Adam where we keep the same policy for $\mathcal{T}_v$ but use $\mathcal{T}_u = \{0,\cdots,T-1\}$. This special version of 0/1 Adam does not skip rounds but uses the same variance freezing policy. We plot the data volume usage and throughput results in Figures 4 and 5, respectively. We see that although the variant without local steps suffices to reduce the data volume overhead from 1-bit Adam towards 1 bit per parameter in general, its throughput improvement is limited compared to Figure 2.

7. CONCLUSION

In this paper, we study the challenges of using 1-bit communication with Adam, and the limitations of the state-of-the-art 1-bit Adam algorithm. We propose an algorithm named 0/1 Adam that adopts two novel designs: adaptive variance state freezing and 1-bit sync. We provide a convergence proof for 0/1 Adam and measure its effectiveness against baseline Adam and 1-bit Adam on various benchmarks, including BERT-Base/Large, GPT-2 pre-training and ImageNet.

A FULL DESCRIPTION TO ALLREDUCE

As introduced in Section 2, the error-feedback-based 1bit-AllReduce works best both in theory and in practice. In fact, the original 1-bit Adam also adopts the error-feedback design (Tang et al., 2021). We give the full description of this 1bit-AllReduce in Algorithm 2, to replace the 1bit-AllReduce in Algorithm 1. In the theoretical analysis, our proofs also rely on this algorithm. Note that this algorithm does not require any additional assumptions for our theory to hold, since it fits the black-box procedure in Algorithm 4 and Algorithm 1.

Algorithm 2 The full description of Error Feedback 1-bit Communication (1bit-AllReduce)
Require: communication buffer $z_t^{(i)}$, worker error $\delta_t^{(i)}$, server error $\bar{\delta}_t$, 1-bit compressor $C[\cdot]$. Both worker and server errors are initialized at 0 at $t = 0$.
1: (On $i$-th node)
2: Compress $z_t^{(i)}$ into $\hat{z}_t^{(i)} = C[z_t^{(i)} + \delta_t^{(i)}]$, and update the compression error by $\delta_{t+1}^{(i)} = z_t^{(i)} + \delta_t^{(i)} - \hat{z}_t^{(i)}$.
3: Send $\hat{z}_t^{(i)}$ to the server.
4: (On server) Average the received $\hat{z}_t^{(i)}$, compress the result into $\bar{z}_t = C\big[\frac{1}{n}\sum_{i=1}^{n}\hat{z}_t^{(i)} + \bar{\delta}_t\big]$, and update the server error by $\bar{\delta}_{t+1} = \frac{1}{n}\sum_{i=1}^{n}\hat{z}_t^{(i)} + \bar{\delta}_t - \bar{z}_t$.
5: Send $\bar{z}_t$ to all the workers.
6: (On $i$-th node)
7: return $\bar{z}_t$.
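The error-feedback mechanism can be simulated in a single process to see that compression error is deferred rather than discarded. The following NumPy sketch is an illustration under assumptions, not the authors' implementation: all workers are rows of one array, the server applies the same error-feedback compression as the workers (as the proof of Lemma 1 describes), and `one_bit_compress` and `one_bit_allreduce` are hypothetical names.

```python
import numpy as np

def one_bit_compress(a):
    """C[a] = (||a||_1 / d) * sign(a)."""
    return (np.abs(a).sum() / a.size) * np.sign(a)

def one_bit_allreduce(z, worker_err, server_err):
    """Simulate one round for n workers held in one process.

    z, worker_err: arrays of shape (n, d); server_err: shape (d,).
    Returns the broadcast tensor and the updated error buffers.
    """
    # Worker side: compress buffer plus carried error, keep the residual.
    corrected = z + worker_err
    z_hat = np.stack([one_bit_compress(row) for row in corrected])
    new_worker_err = corrected - z_hat
    # Server side: average, then compress with its own error feedback.
    avg = z_hat.mean(axis=0) + server_err
    z_bar = one_bit_compress(avg)
    new_server_err = avg - z_bar
    return z_bar, new_worker_err, new_server_err

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 6))                 # 4 workers, d = 6
z_bar, w_err, s_err = one_bit_allreduce(z, np.zeros((4, 6)), np.zeros(6))
```

Starting from zero errors, `z_bar + s_err + w_err.mean(axis=0)` recovers the exact average of `z`: whatever the compressor drops in this round is stored in the error buffers and re-injected in later rounds.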

B PROFILING RESULTS FOR FIXED COST OF COMMUNICATION

We profile the time taken in computation and others (including initialization of a communication round and compression) during one 1-bit AllReduce round at different scales on Ethernet cluster in Table 3 . 

C ADDITIONAL EXPERIMENTAL DETAILS

Training Parameters. For BERT pre-training, we follow the settings from (Devlin et al., 2018) and let the learning rate linearly increase to $4\times10^{-4}$ as a warmup in the first 12.5K steps, then decay by a factor of 0.99 every 520 steps. We set $\beta_1 = 0.9$ and $\beta_2 = 0.999$ for all the algorithms. We adopt a batch size of 4096. For 1-bit Adam, we follow the hyperparameters given in (Tang et al., 2021) and set the full-precision stage for 1-bit Adam to 16K and 23K steps on BERT-Base and BERT-Large, respectively. All the hyperparameters used here (e.g. learning rate) strictly follow (Tang et al., 2021) for fair comparison.

For ImageNet, we follow the example script from PyTorch and use a batch size of 256 and a milestone decay learning rate schedule: starting at 1e-4 and decaying by a factor of 10 at epochs 30 and 60, with 90 epochs in total. We set 10 epochs (50050 steps) as the full-precision stage for 1-bit Adam.

For GPT-2, we set the batch size to 512 and use 300K training steps (158B tokens). The learning rate schedule follows a linear warmup of 3K steps and a single-cycle cosine decay over the remaining 297K steps (minimum $1\times10^{-5}$). For 1-bit Adam, we set its full-precision stage length to 80K steps, and for 0/1 Adam, we follow the same learning-rate-based policy as for BERT on $\mathcal{T}_v$ and $\mathcal{T}_u$.

For GLUE benchmarks, we use the Adam optimizer and perform single-task training on the dev set. Following the setup in the BERT paper (Devlin et al., 2018) and the 1-bit Adam paper (Tang et al., 2021), we search over the hyperparameter space with batch sizes $\in \{8, 16, 32\}$ and learning rates $\in \{1\times10^{-5}, 3\times10^{-5}, 5\times10^{-5}, 8\times10^{-5}\}$. The convergence plots for GPT-2 pre-training are given in Figure 6.

Require: initialized model on worker $i$: $x_0^{(i)}$, learning rate $\{\gamma_t\}_{t=1}^{T}$, $m_0 = 0$, $v_0 = 0$, total number of iterations $T$, decaying factors $\beta_1$, $\beta_2$ from Adam, numerical constant $\epsilon$, variance update step index set $\mathcal{T}_v$. 
1: for t = 0,• We start from a special case of 0/1 Adam that compresses gradients without local steps. This is given in Algorithm 4. Note that the following proof will use Algorithm 2 to replace Compressed-AllReduce in Algorithm 4, as introduced in Section A. Algorithm 4 allows us to work with weaker assumption as given in the following Assumption 6. Compression error in Algorithm 4: For arbitrary x ∈ R d , there exists a constant 0 ≤ ω < 1, such that the output of compressor C[•] has the following error bound: E∥C[x]-x∥ 2 ≤ ω∥x∥ 2 . Theorem 2. Under Assumption 1, 2, 3, and 6, let m = |T v |, and select β 1 , β 2 ∈ [0, 1) such that m ≤ log(1-β 1 )/log(β 2 ). If we run Algorithm 4 with a constant learning rate: for all t ≥ 0 γ t = min n σ 2 T , 1 2L G 2 ∞ +ϵ , 1 125 , then it holds that 1 T T -1 t=0 E∥∇f (x t )∥ 2 ≤ O σ √ nT + m+n (1-ω) 4 T + 1 T , where we omit f (0)-inf x∈R d f (x), G ∞ , d, ϵ, β 1 , β 2 and L as constants. Proof. The main update of Algorithm 4 (with constant learning rate) can be summarized as: for every t = 0,•••,T -1, m t+1 = β 1 m t +(1-β 1 )g t v t+1 =      β 2 v t +(1-β 2 ) 1 n n i=1 g (i) t 2 t ∈ T v , v t t ̸ ∈ T v . x t+1 = x t -γ m t √ v t +ϵ , where the g t is the output of the 1-bit AllReduce algorithmfoot_8 . Note that based on Algorithm 2, the gradient approximation term follows: g t = 1 n n i=1 ĝ(i) t +δ t -δ t+1 = 1 n n i=1 g (i) t +δ (i) t -δ (i) t+1 +δ t -δ t+1 = 1 n n i=1 g (i) t + 1 n n i=1 δ (i) t -δ t - 1 n n i=1 δ (i) t+1 -δ t+1 =g t +δ t -δ t+1 , where we denote g t = 1 n n i=1 g (i) t δ t = 1 n n i=1 δ (i) t -δ t . To prove the convergence, we now define the following auxiliary sequence: for any t ≥ 0, y t = x t - γm t (1-β 1 ) √ v t +ϵ - γδ t √ v t +ϵ . The rest of the proof is to use this auxiliary sequence to bound two types of steps separately. We call a step t as reuse step if t ̸ ∈ T v and update step otherwise. We see for all the update steps, v t ̸ = v t+1 while for all the reuse steps v t = v t+1 . 
The bounds on two different types of steps are provides by Lemma 5 and Lemma 6. Specifically, denoting V 1 = 1 √ v1+ϵ 1 , from Lemma 5 we obtain for all the reuse steps, t̸ ∈Tv γ 4 G 2 ∞ +ϵ E∥∇f (x t )∥ 2 ≤ t̸ ∈Tv E[f (y t )-f (y t+1 )]+ 227γ 3 L 2 V 2 1 (1+ω) 3 G 2 ∞ d G 2 ∞ +ϵ(T -m) β 2m 2 (1-β 1 ) 2 (1-ω) 4 + Lγ 2 σ 2 V 1 (T -m) 2nβ m 2 . while from Lemma 6 we obtain for all the update steps, t∈Tv γ 4 G 2 ∞ +ϵ E∥∇f (x t )∥ 2 ≤ t∈Tv E[f (y t )-f (y t+1 )]+ 34γ L + γ 4 G 2 ∞ +ϵ • σ 2 n +G 2 ∞ d m + 32γ(1+β 1 ) 2 (1+ω) 3 V 1 G 2 ∞ dmL β m 2 (1-β 1 ) 2 (1-ω) 4 . Note that the two inequalities above hold when the learning rate fulfills γ ≤ min β m 2 2V 1 L G 2 ∞ +ϵ , 1 125 . Combine them together, 1 T T -1 t=0 E∥∇f (x t )∥ 2 4 G 2 ∞ +ϵ ≤ f (0)-f * γT + 227γ 2 L 2 V 2 1 (1+ω) 3 G 2 ∞ d G 2 ∞ +ϵ(T -m) β 2m 2 (1-β 1 ) 2 (1-ω) 4 T + Lγσ 2 V 1 (T -m) 2nβ m 2 T + 34 L + 1 4 G 2 ∞ +ϵ • σ 2 n +G 2 ∞ d m T + 32(1+β 1 ) 2 (1+ω) 3 V 1 G 2 ∞ dmL β m 2 (1-β 1 ) 2 (1-ω) 4 T Dropping the constants, we finally obtain 1 T T -1 t=0 E∥∇f (x t )∥ 2 ≤O f (0)-f * γT + γ 2 β 2m 2 (1-β 1 ) 2 (1-ω) 4 + γσ 2 nβ m 2 + ωm β m 2 (1-β 1 ) 2 (1-ω) 4 T + σ 2 m nT ≤O f (0)-f * γT + γ 2 (1-β 1 ) 4 (1-ω) 4 + γσ 2 n(1-β 1 ) + ωm (1-β 1 ) 3 (1-ω) 4 T + σ 2 m nT , where in the last step we use the condition in the theorem that β m 2 ≥ 1-β 1 . To meet the requirement of learning rate we set γ t = min n σ 2 T , 1 2L G 2 ∞ +ϵ , 1 125 , then it holds that 1 T T -1 t=0 E∥∇f (x t )∥ 2 ≤ O σ √ nT + m+n (1-ω) 4 T + 1 T . That completes the proof.

D.2 PROOF TO THEOREM 1

Note that the following proof will use Algorithm 2 to replace 1bit-AllReduce in Algorithm 1, as introduced in Section A. Theorem 1. Under Assumption 1 to 5, let m = |T v |, select β 1 , β 2 ∈ [0, 1) that fulfills m ≤ log(1-β 1 )/log(β 2 ), if we run Algorithm 1 with a constant learning rate: for all t ≥ 0 γ t = min n σ 2 T , 1 4L G 2 ∞ +ϵ , 2 G 2 ∞ +ϵ L , 1 6 , then it holds that 1 T T -1 t=0 E∥∇f ( xt )∥ 2 ≤ O σ √ nT + H 2 ∆ 2 (m+n) T + 1 T , where xt = 1/n n i=1 x (i) t and we omit f (0)-inf x∈R d f (x), G ∞ , d, ϵ, β 1 , β 2 and L as constants. Proof. We now prove Theorem 1. Similar to the proof to Theorem 2, in this proof we discuss the case of t ∈ T v and t ̸ ∈ T v separately. Following the proof of Theorem 2, we define the following auxiliary sequence ỹt = xt - γ mt (1-β 1 ) √ v t +ϵ - γδ t √ v t +ϵ , where xt = 1 n n i=1 x (i) t mt = 1 n n i=1 m (i) t . And we additionally define that ũt = 1 n n i=1 u (i) t gt = 1 n n i=1 g (i) t . Note that the definition of gt is different from the g t in Theorem 2 since the former is computed on local models which potentially can be different before the sync step. To expect a compression error bound to scale in the order of O(γ 2 ), we slightly modify the update of line 5, 8, 9 of Algorithm 1 into u (i) t+ 1 2 =u (i) t +m (i) t m (i) t+1 =u t+ 1 2 / t k=t ′ x (i) t+1 =x (i) t ′ -γu t+ 1 2 / √ v t +ϵ. Note that since Theorem 1 states the convergence results for constant learning rate, such modification does not change the semantics of the original Algorithm 1. Based on Algorithm 2, we know that u t+ 1 2 = 1 n n i=1 û(i) t+ 1 2 +δ t -δ t+1 = 1 n n i=1 u (i) t+ 1 2 +δ (i) t -δ (i) t+1 +δ t -δ t+1 = 1 n n i=1 u (i) t+ 1 2 + 1 n n i=1 δ (i) t -δ t - 1 n n i=1 δ (i) t+1 -δ t+1 = ũt+ 1 2 +δ t -δ t+1 . 
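Based on Lemma 10, for all $t\in\mathcal{T}_v$ we have the following bound,

$$\sum_{t\in\mathcal{T}_v}\frac{\gamma\,\mathbb{E}\|\nabla f(\tilde{x}_t)\|^2}{4\sqrt{G_\infty^2+\epsilon}} \le \sum_{t\in\mathcal{T}_v}\left[\mathbb{E}f(\tilde{y}_t)-\mathbb{E}f(\tilde{y}_{t+1})\right] + \frac{2\gamma\sigma^2m}{nL} + \frac{106\gamma H^2V_1(M+\Delta^2)mL}{\beta_2^{m}(1-\beta_1)^2} + \frac{\gamma\sigma^2m}{4n\sqrt{G_\infty^2+\epsilon}} + \frac{\gamma G_\infty^2dm}{4\sqrt{G_\infty^2+\epsilon}}.$$

On the other hand, for all $t\notin\mathcal{T}_v$ we have the following bound,

$$\sum_{t\notin\mathcal{T}_v}\frac{\gamma\,\mathbb{E}\|\nabla f(\tilde{x}_t)\|^2}{4\sqrt{G_\infty^2+\epsilon}} \le \sum_{t\notin\mathcal{T}_v}\left[\mathbb{E}f(\tilde{y}_t)-\mathbb{E}f(\tilde{y}_{t+1})\right] + \frac{36\gamma^3H^2V_1(3G_\infty^2d+25\Delta^2)L^2(1+L)(G_\infty^2+\epsilon+1)(T-m)}{\beta_2^{m}(1-\beta_1)^4\sqrt{G_\infty^2+\epsilon}} + \frac{L\gamma^2V_1\sigma^2(T-m)}{n\beta_2^{m}} + \frac{48\gamma^3V_1(H+1)^2(3G_\infty^2d+24\Delta^2)\sqrt{G_\infty^2+\epsilon}\,(T-m)}{\beta_2^{m}(1-\beta_1)^4}.$$

Note that both bounds hold if the learning rate satisfies

$$\gamma \le \min\left\{\frac{\beta_2^{m}}{4V_1L\sqrt{G_\infty^2+\epsilon}},\ \frac{2\sqrt{G_\infty^2+\epsilon}}{L},\ \frac{1}{6}\right\}.$$

Combining them, we obtain

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\frac{\mathbb{E}\|\nabla f(\tilde{x}_t)\|^2}{4\sqrt{G_\infty^2+\epsilon}}
\le{}& \frac{f(0)-f^*}{\gamma T} + \frac{2\sigma^2m}{nLT} + \frac{106H^2V_1(M+\Delta^2)mL}{\beta_2^{m}(1-\beta_1)^2T} + \frac{\sigma^2m}{4n\sqrt{G_\infty^2+\epsilon}\,T} + \frac{G_\infty^2dm}{4\sqrt{G_\infty^2+\epsilon}\,T} + \frac{L\gamma V_1\sigma^2}{n\beta_2^{m}}\\
&+ \frac{36\gamma^2H^2V_1(3G_\infty^2d+25\Delta^2)L^2(1+L)(G_\infty^2+\epsilon+1)}{\beta_2^{m}(1-\beta_1)^4\sqrt{G_\infty^2+\epsilon}} + \frac{48\gamma^2V_1(H+1)^2(3G_\infty^2d+24\Delta^2)\sqrt{G_\infty^2+\epsilon}}{\beta_2^{m}(1-\beta_1)^4}.
\end{aligned}$$

Omitting constants:

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\tilde{x}_t)\|^2
&\le O\left(\frac{f(0)-f^*}{\gamma T} + \frac{\gamma^2H^2\Delta^2}{\beta_2^{m}} + \frac{\gamma\sigma^2}{n\beta_2^{m}} + \frac{\sigma^2m}{nT} + \frac{H^2\Delta^2m}{\beta_2^{m}T} + \frac{m}{T}\right)\\
&\le O\left(\frac{f(0)-f^*}{\gamma T} + \frac{\gamma^2H^2\Delta^2}{1-\beta_1} + \frac{\gamma\sigma^2}{n(1-\beta_1)} + \frac{\sigma^2m}{nT} + \frac{H^2\Delta^2m}{(1-\beta_1)T} + \frac{m}{T}\right),
\end{aligned}$$

where in the last step we use the condition in the theorem that $\beta_2^{m}\ge 1-\beta_1$. To meet the requirement on the learning rate we set

$$\gamma_t = \min\left\{\sqrt{\frac{n}{\sigma^2T}},\ \frac{1}{4L\sqrt{G_\infty^2+\epsilon}},\ \frac{2\sqrt{G_\infty^2+\epsilon}}{L},\ \frac{1}{6}\right\},$$

then it holds that

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\tilde{x}_t)\|^2 \le O\left(\frac{\sigma}{\sqrt{nT}} + \frac{H^2\Delta^2(m+n)}{T} + \frac{1}{T}\right).$$

And that completes the proof.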
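The proof of Theorem 1 relies on the fact that, under error feedback, each compressed output equals the exact average plus a difference of consecutive error terms, so the errors telescope instead of accumulating. The sketch below illustrates this numerically. It is a minimal illustration under stated assumptions, not the paper's Algorithm 2: the scaled-sign compressor, the helper names `one_bit_compress`, `error_feedback_allreduce` and `run_rounds`, and the sign convention used for the carried residual are all choices made for this example.

```python
import numpy as np


def one_bit_compress(x):
    """Scaled-sign compressor (assumed here): one sign per coordinate, one shared scale."""
    return np.abs(x).mean() * np.sign(x)


def error_feedback_allreduce(z, worker_err, server_err):
    """One round of two-sided error-compensated compression (hypothetical helper).

    z:          (n, d) array, one local buffer per worker
    worker_err: (n, d) array, error carried by each worker
    server_err: (d,)   array, error carried by the server
    """
    corrected = z + worker_err                      # add back what was lost last round
    z_hat = np.stack([one_bit_compress(row) for row in corrected])
    new_worker_err = corrected - z_hat              # worker-side residual
    averaged = z_hat.mean(axis=0) + server_err      # server adds its own carried error
    output = one_bit_compress(averaged)
    new_server_err = averaged - output              # server-side residual
    return output, new_worker_err, new_server_err


def run_rounds(n=4, d=16, rounds=20, seed=0):
    """Accumulate compressed outputs and exact averages over several rounds."""
    rng = np.random.default_rng(seed)
    worker_err, server_err = np.zeros((n, d)), np.zeros(d)
    sum_out, sum_mean = np.zeros(d), np.zeros(d)
    for _ in range(rounds):
        z = rng.standard_normal((n, d))
        out, worker_err, server_err = error_feedback_allreduce(z, worker_err, server_err)
        sum_out += out
        sum_mean += z.mean(axis=0)
    return sum_out, sum_mean, worker_err, server_err


if __name__ == "__main__":
    sum_out, sum_mean, worker_err, server_err = run_rounds()
    # Residual still carried after the last round (this sketch's sign convention).
    residual = worker_err.mean(axis=0) + server_err
    # Gap between accumulated outputs and accumulated exact averages, plus residual.
    print(np.max(np.abs((sum_out - sum_mean) + residual)))
```

With errors initialized at zero, the accumulated compressed outputs differ from the accumulated exact averages only by the residual still held after the last round, however many rounds are run. This is the telescoping structure that the $\delta_t-\delta_{t+1}$ terms express in the proof.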

D.3 TECHNICAL LEMMA

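Lemma 1. Consider running Algorithm 2 over a communication buffer $z$ (same notation as in Algorithm 2) under Assumption 6, and let $\delta_t$ denote

$$\delta_t = \frac{1}{n}\sum_{i=1}^{n}\delta_t^{(i)} - \bar{\delta}_t,$$

where $\delta_t^{(i)}$ is the compression error on worker $i$ and $\bar{\delta}_t$ is the server-side compression error. Then, based on Assumptions 6 and 3, for all $t\ge 0$, if $\mathbb{E}\|z_t^{(i)}\|^2\le C$ for some constant $C>0$,

$$\mathbb{E}\|\delta_t\|^2 \le \frac{32\omega(1+\omega)^3C}{(1-\omega)^4}.$$

Proof. The error is initialized at 0, so the bound trivially holds when $t=0$. We next prove the case $t\ge 1$. For any $i\in\{1,\dots,n\}$ and $t\ge 1$, by the definition of the sequence $\delta_t^{(i)}$, and writing $\mathcal{C}[\cdot]$ for the compression operator,

$$\begin{aligned}
\mathbb{E}\left\|\delta_t^{(i)}\right\|^2
&= \mathbb{E}\left\|z_{t-1}^{(i)}+\delta_{t-1}^{(i)}-\hat{z}_{t-1}^{(i)}\right\|^2
= \mathbb{E}\left\|z_{t-1}^{(i)}+\delta_{t-1}^{(i)}-\mathcal{C}\left[z_{t-1}^{(i)}+\delta_{t-1}^{(i)}\right]\right\|^2\\
&\overset{\text{Assumption 6}}{\le} \omega\,\mathbb{E}\left\|z_{t-1}^{(i)}+\delta_{t-1}^{(i)}\right\|^2\\
&\overset{\forall\eta>0}{\le} \omega(1+\eta)\,\mathbb{E}\left\|\delta_{t-1}^{(i)}\right\|^2 + \omega(1+1/\eta)\,\mathbb{E}\left\|z_{t-1}^{(i)}\right\|^2\\
&\overset{\text{Assumption 3}}{\le} \sum_{j=0}^{\infty}\left[\omega(1+\eta)\right]^{j}\omega(1+1/\eta)\,C
\le \frac{\omega(1+1/\eta)}{1-\omega(1+\eta)}\,C.
\end{aligned}$$

Selecting $\eta = \frac{1-\omega}{2\omega}$, we obtain

$$\mathbb{E}\left\|\delta_t^{(i)}\right\|^2 \le \frac{2\omega(1+\omega)}{(1-\omega)^2}\,C.$$

Similarly, we can show that for any $t\ge 1$,

$$\begin{aligned}
\mathbb{E}\left\|\bar{\delta}_t\right\|^2
&= \mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\hat{z}_{t-1}^{(i)}+\bar{\delta}_{t-1}-z_{t-1}\right\|^2
= \mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\hat{z}_{t-1}^{(i)}+\bar{\delta}_{t-1}-\mathcal{C}\left[\frac{1}{n}\sum_{i=1}^{n}\hat{z}_{t-1}^{(i)}+\bar{\delta}_{t-1}\right]\right\|^2\\
&\le \omega\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\hat{z}_{t-1}^{(i)}+\bar{\delta}_{t-1}\right\|^2\\
&\le \omega(1+\eta)\,\mathbb{E}\left\|\bar{\delta}_{t-1}\right\|^2 + \omega(1+1/\eta)\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\hat{z}_{t-1}^{(i)}\right\|^2\\
&\le \omega(1+\eta)\,\mathbb{E}\left\|\bar{\delta}_{t-1}\right\|^2 + \omega(1+1/\eta)\cdot\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\hat{z}_{t-1}^{(i)}\right\|^2,
\end{aligned}$$

where in the last step we apply Jensen's inequality. Since we do not assume a bound on $\left\|\hat{z}_{t-1}^{(i)}\right\|^2$, we need to bound it in terms of $C$:

$$\mathbb{E}\left\|\hat{z}_{t-1}^{(i)}\right\|^2 = \mathbb{E}\left\|z_{t-1}^{(i)}+\delta_{t-1}^{(i)}-\delta_t^{(i)}\right\|^2 \le 2\,\mathbb{E}\left\|z_{t-1}^{(i)}+\delta_{t-1}^{(i)}\right\|^2 + 2\,\mathbb{E}\left\|\delta_t^{(i)}\right\|^2 \le \frac{4(1+\omega)}{(1-\omega)^2}C + \frac{4\omega(1+\omega)}{(1-\omega)^2}C \le \frac{4(1+\omega)^2}{(1-\omega)^2}C,$$

where we apply the bound on $\mathbb{E}\left\|\delta_t^{(i)}\right\|^2$. Given this bound, and following the analysis for $\mathbb{E}\left\|\delta_t^{(i)}\right\|^2$, we can now bound $\mathbb{E}\left\|\bar{\delta}_t\right\|^2$ as follows:

$$\mathbb{E}\left\|\bar{\delta}_t\right\|^2 \le \frac{2\omega(1+\omega)}{(1-\omega)^2}\cdot\frac{4(1+\omega)^2}{(1-\omega)^2}C = \frac{8\omega(1+\omega)^3}{(1-\omega)^4}C.$$

Finally, we obtain for all $t\ge 1$,

$$\mathbb{E}\|\delta_t\|^2 = \mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\delta_t^{(i)}-\bar{\delta}_t\right\|^2 \le 2\,\mathbb{E}\left\|\bar{\delta}_t\right\|^2 + 2\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\delta_t^{(i)}\right\|^2 \le 2\,\mathbb{E}\left\|\bar{\delta}_t\right\|^2 + \frac{2}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\delta_t^{(i)}\right\|^2 \le \frac{32\omega(1+\omega)^3C}{(1-\omega)^4}.$$

That completes the proof.

Lemma 2. For the variance term, we have the following upper and lower bound: for any $t\ge 1$,

$$\beta_2^{m/2}\sqrt{v_1+\epsilon} \le \sqrt{v_t+\epsilon} \le \sqrt{G_\infty^2+\epsilon},$$

where the inequality holds element-wise.

Proof.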
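On one hand, for any $t_j\le t<t_{j+1}$, where $t_j$ denotes an update step, we obtain element-wise

$$v_t \ge \beta_2v_{t_j} \ge \cdots \ge \beta_2^{j}v_1 \ge \beta_2^{m}v_1,$$

so that

$$\sqrt{v_t+\epsilon} \ge \sqrt{\beta_2^{m}v_1+\epsilon} \ge \sqrt{\beta_2^{m}(v_1+\epsilon)} = \beta_2^{m/2}\sqrt{v_1+\epsilon}.$$

On the other hand, for any $t\ge 1$ and $j\in\{1,\dots,d\}$,

$$[v_t]_j = \sum_{k=1}^{t}(1-\beta_2)\beta_2^{t-k}\left(\frac{1}{n}\sum_{i=1}^{n}\left[g_k^{(i)}\right]_j\right)^2 \le G_\infty^2(1-\beta_2)\sum_{k=0}^{\infty}\beta_2^{k} \le G_\infty^2,$$

so that $\sqrt{v_t+\epsilon}\le\sqrt{G_\infty^2+\epsilon}$. That completes the proof.

Lemma 3. In Algorithm 4, for any $t\ge 0$,

$$\mathbb{E}\|m_t\|^2 \le \frac{195(1+\omega)^3G_\infty^2d}{(1-\omega)^4}.$$

Proof. Let $\bar{g}_k = g_k+\delta_k-\delta_{k+1}$ denote the output of the compressed AllReduce at step $k$, where $g_k = \frac{1}{n}\sum_{i=1}^{n}g_k^{(i)}$. For any $t\ge 0$,

$$\begin{aligned}
\mathbb{E}\|m_t\|^2
&= \mathbb{E}\left\|(1-\beta_1)\sum_{k=0}^{t}\beta_1^{t-k}\bar{g}_k\right\|^2
\le (1-\beta_1)\sum_{k=0}^{t}\beta_1^{t-k}\,\mathbb{E}\|\bar{g}_k\|^2\\
&= (1-\beta_1)\sum_{k=0}^{t}\beta_1^{t-k}\,\mathbb{E}\|g_k+\delta_k-\delta_{k+1}\|^2\\
&\le (1-\beta_1)\sum_{k=0}^{t}\beta_1^{t-k}\left(3\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}g_k^{(i)}\right\|^2 + 3\,\mathbb{E}\|\delta_k\|^2 + 3\,\mathbb{E}\|\delta_{k+1}\|^2\right)\\
&\le (1-\beta_1)\sum_{k=0}^{t}\beta_1^{t-k}\left(\frac{3}{n}\sum_{i=1}^{n}\mathbb{E}\left\|g_k^{(i)}\right\|^2 + 3\,\mathbb{E}\|\delta_k\|^2 + 3\,\mathbb{E}\|\delta_{k+1}\|^2\right)\\
&\overset{(i)}{\le} (1-\beta_1)\sum_{k=0}^{t}\beta_1^{t-k}\left(3G_\infty^2d + \frac{192\omega(1+\omega)^3G_\infty^2d}{(1-\omega)^4}\right)\\
&\le \left(\frac{3(1+\omega)^3G_\infty^2d}{(1-\omega)^4} + \frac{192(1+\omega)^3G_\infty^2d}{(1-\omega)^4}\right)\cdot(1-\beta_1)\sum_{k=0}^{\infty}\beta_1^{k}
\le \frac{195(1+\omega)^3G_\infty^2d}{(1-\omega)^4},
\end{aligned}$$

where in step (i) we use Lemma 1. That completes the proof.

Lemma 4. For any $a,b\in\mathbb{R}^d$, the following bound holds:

$$\left\|\frac{a}{\sqrt{b}}\right\|^2 \le \|a\|^2\left\|\frac{1}{b}\right\|_1.$$

Proof. Let the subscript $j$ denote the coordinate index. Then

$$\left\|\frac{a}{\sqrt{b}}\right\|^2 = \sum_{j=1}^{d}\left(\frac{a_j}{[\sqrt{b}]_j}\right)^2 \le \left(\sum_{j=1}^{d}a_j^2\right)\left(\sum_{j=1}^{d}\frac{1}{b_j}\right) = \|a\|^2\left\|\frac{1}{b}\right\|_1.$$

Note that the second step holds not because of the Cauchy-Schwarz inequality but because $a_j^2\ge 0$ and $b_j>0$ (which $\sqrt{b}$ implicitly assumes).

Lemma 5. In Algorithm 4, for all $t\ge 1$ that fulfill $v_t=v_{t+1}$, i.e., for all $t\notin\mathcal{T}_v$, if we let $\gamma\le\frac{\beta_2^{m}}{2V_1L\sqrt{G_\infty^2+\epsilon}}$, the following bound holds,

$$\sum_{t\notin\mathcal{T}_v}\frac{\gamma}{4\sqrt{G_\infty^2+\epsilon}}\,\mathbb{E}\|\nabla f(x_t)\|^2 \le \sum_{t\notin\mathcal{T}_v}\mathbb{E}[f(y_t)-f(y_{t+1})] + \frac{227\gamma^3L^2V_1^2(1+\omega)^3G_\infty^2d\sqrt{G_\infty^2+\epsilon}\,(T-m)}{\beta_2^{2m}(1-\beta_1)^2(1-\omega)^4} + \frac{L\gamma^2\sigma^2V_1(T-m)}{2n\beta_2^{m}}.$$

Proof.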
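Recall the auxiliary sequence

$$y_t = x_t - \frac{\gamma m_t}{(1-\beta_1)\sqrt{v_t+\epsilon}} - \frac{\gamma\delta_t}{\sqrt{v_t+\epsilon}}.$$

For all the steps $t\ge 0$ that fulfill $v_{t+1}=v_t$, and writing $\bar{g}_t = g_t+\delta_t-\delta_{t+1}$ for the compressed AllReduce output that enters the momentum update, we obtain

$$\begin{aligned}
y_{t+1}-y_t
&= x_{t+1}-x_t - \frac{\gamma}{1-\beta_1}\left(\frac{m_{t+1}}{\sqrt{v_{t+1}+\epsilon}}-\frac{m_t}{\sqrt{v_t+\epsilon}}\right) - \gamma\left(\frac{\delta_{t+1}}{\sqrt{v_{t+1}+\epsilon}}-\frac{\delta_t}{\sqrt{v_t+\epsilon}}\right)\\
&= -\frac{\gamma m_t}{\sqrt{v_t+\epsilon}} - \frac{\gamma}{(1-\beta_1)\sqrt{v_t+\epsilon}}\Bigl(\beta_1m_t+(1-\beta_1)\bar{g}_t-m_t-(1-\beta_1)(\delta_t-\delta_{t+1})\Bigr)\\
&= -\frac{\gamma g_t}{\sqrt{v_t+\epsilon}}.
\end{aligned}$$

From Assumption 1, we have

$$\begin{aligned}
\mathbb{E}f(y_{t+1})-\mathbb{E}f(y_t)
&\le \mathbb{E}\langle\nabla f(y_t),y_{t+1}-y_t\rangle + \frac{L}{2}\mathbb{E}\|y_{t+1}-y_t\|^2\\
&= -\gamma\,\mathbb{E}\left\langle\nabla f(y_t),\frac{g_t}{\sqrt{v_t+\epsilon}}\right\rangle + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2\\
&= -\gamma\,\mathbb{E}\left\langle\nabla f(y_t),\frac{\nabla f(x_t)}{\sqrt{v_t+\epsilon}}\right\rangle + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2\\
&= -\gamma\,\mathbb{E}\left\langle\nabla f(x_t),\frac{\nabla f(x_t)}{\sqrt{v_t+\epsilon}}\right\rangle + \gamma\,\mathbb{E}\left\langle\nabla f(x_t)-\nabla f(y_t),\frac{\nabla f(x_t)}{\sqrt{v_t+\epsilon}}\right\rangle + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2\\
&= -\gamma\,\mathbb{E}\left\langle\nabla f(x_t),\frac{\nabla f(x_t)}{\sqrt{v_t+\epsilon}}\right\rangle + \gamma\,\mathbb{E}\left\langle\frac{\nabla f(x_t)-\nabla f(y_t)}{\sqrt{v_t+\epsilon}},\nabla f(x_t)\right\rangle + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2\\
&\le -\frac{\gamma\,\mathbb{E}\|\nabla f(x_t)\|^2}{\sqrt{G_\infty^2+\epsilon}} + \frac{\gamma}{2\eta}\mathbb{E}\left\|\frac{\nabla f(x_t)-\nabla f(y_t)}{\sqrt{v_t+\epsilon}}\right\|^2 + \frac{\gamma\eta}{2}\mathbb{E}\|\nabla f(x_t)\|^2 + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2,
\end{aligned}$$

where in the last step we use Lemma 2 and the fact that for any $a$, $b$ and constant $\eta>0$, $\langle a,b\rangle\le\frac{\eta}{2}\|a\|^2+\frac{1}{2\eta}\|b\|^2$. Setting $\eta=\left(\sqrt{G_\infty^2+\epsilon}\right)^{-1}$, with Assumption 1 and Lemma 4,

$$\begin{aligned}
\mathbb{E}f(y_{t+1})-\mathbb{E}f(y_t)
&\le -\frac{\gamma\,\mathbb{E}\|\nabla f(x_t)\|^2}{2\sqrt{G_\infty^2+\epsilon}} + \frac{\gamma L^2V_1\sqrt{G_\infty^2+\epsilon}}{2\beta_2^{m}}\mathbb{E}\|x_t-y_t\|^2 + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2\\
&= -\frac{\gamma\,\mathbb{E}\|\nabla f(x_t)\|^2}{2\sqrt{G_\infty^2+\epsilon}} + \frac{\gamma L^2V_1\sqrt{G_\infty^2+\epsilon}}{2\beta_2^{m}}\mathbb{E}\left\|\frac{\gamma m_t}{(1-\beta_1)\sqrt{v_t+\epsilon}}+\frac{\gamma\delta_t}{\sqrt{v_t+\epsilon}}\right\|^2 + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2\\
&\le -\frac{\gamma\,\mathbb{E}\|\nabla f(x_t)\|^2}{2\sqrt{G_\infty^2+\epsilon}} + \frac{\gamma^3L^2V_1\sqrt{G_\infty^2+\epsilon}}{\beta_2^{m}(1-\beta_1)^2}\mathbb{E}\left\|\frac{m_t}{\sqrt{v_t+\epsilon}}\right\|^2 + \frac{\gamma^3L^2V_1\sqrt{G_\infty^2+\epsilon}}{\beta_2^{m}}\mathbb{E}\left\|\frac{\delta_t}{\sqrt{v_t+\epsilon}}\right\|^2 + \frac{L\gamma^2}{2}\mathbb{E}\left\|\frac{g_t}{\sqrt{v_t+\epsilon}}\right\|^2\\
&\le -\frac{\gamma\,\mathbb{E}\|\nabla f(x_t)\|^2}{2\sqrt{G_\infty^2+\epsilon}} + \frac{\gamma^3L^2V_1^2\sqrt{G_\infty^2+\epsilon}}{\beta_2^{2m}(1-\beta_1)^2}\mathbb{E}\|m_t\|^2 + \frac{\gamma^3L^2V_1^2\sqrt{G_\infty^2+\epsilon}}{\beta_2^{2m}}\mathbb{E}\|\delta_t\|^2 + \frac{L\gamma^2V_1}{2\beta_2^{m}}\mathbb{E}\|g_t\|^2,
\end{aligned}$$

where in the last step we apply Lemmas 2 and 4.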
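Using the bound on the error from Lemma 1, Lemma 3 and the assumption on the stochastic gradient, we obtain

$$\begin{aligned}
\left(\frac{\gamma}{2\sqrt{G_\infty^2+\epsilon}}-\frac{L\gamma^2V_1}{2\beta_2^{m}}\right)\mathbb{E}\|\nabla f(x_t)\|^2
&\le \mathbb{E}[f(y_t)-f(y_{t+1})] + \frac{\gamma^3L^2V_1^2\sqrt{G_\infty^2+\epsilon}}{\beta_2^{2m}(1-\beta_1)^2}\mathbb{E}\|m_t\|^2 + \frac{\gamma^3L^2V_1^2\sqrt{G_\infty^2+\epsilon}}{\beta_2^{2m}}\mathbb{E}\|\delta_t\|^2 + \frac{L\gamma^2\sigma^2V_1}{2n\beta_2^{m}}\\
&\le \mathbb{E}[f(y_t)-f(y_{t+1})] + \frac{195\gamma^3L^2V_1^2(1+\omega)^3G_\infty^2d\sqrt{G_\infty^2+\epsilon}}{\beta_2^{2m}(1-\beta_1)^2(1-\omega)^4} + \frac{32\gamma^3L^2V_1^2\omega(1+\omega)^3G_\infty^2d\sqrt{G_\infty^2+\epsilon}}{\beta_2^{2m}(1-\omega)^4} + \frac{L\gamma^2\sigma^2V_1}{2n\beta_2^{m}}\\
&\le \mathbb{E}[f(y_t)-f(y_{t+1})] + \frac{227\gamma^3L^2V_1^2(1+\omega)^3G_\infty^2d\sqrt{G_\infty^2+\epsilon}}{\beta_2^{2m}(1-\beta_1)^2(1-\omega)^4} + \frac{L\gamma^2\sigma^2V_1}{2n\beta_2^{m}}.
\end{aligned}$$

Based on the learning rate bound $\gamma\le\frac{\beta_2^{m}}{2V_1L\sqrt{G_\infty^2+\epsilon}}$, and summing over all the reuse steps, we obtain

$$\sum_{t\notin\mathcal{T}_v}\frac{\gamma}{4\sqrt{G_\infty^2+\epsilon}}\,\mathbb{E}\|\nabla f(x_t)\|^2 \le \sum_{t\notin\mathcal{T}_v}\mathbb{E}[f(y_t)-f(y_{t+1})] + \frac{227\gamma^3L^2V_1^2(1+\omega)^3G_\infty^2d\sqrt{G_\infty^2+\epsilon}\,(T-m)}{\beta_2^{2m}(1-\beta_1)^2(1-\omega)^4} + \frac{L\gamma^2\sigma^2V_1(T-m)}{2n\beta_2^{m}}.$$

That completes the proof.

Lemma 6. In Algorithm 4, for all $t\ge 0$ that fulfill $v_t\ne v_{t+1}$, i.e., $t\in\mathcal{T}_v$, if the learning rate fulfills $\gamma<\frac{1}{125}$, the following bound holds:

$$\sum_{t\in\mathcal{T}_v}\frac{\gamma}{4\sqrt{G_\infty^2+\epsilon}}\,\mathbb{E}\|\nabla f(x_t)\|^2 \le \sum_{t\in\mathcal{T}_v}\mathbb{E}[f(y_t)-f(y_{t+1})] + \left(\frac{34\gamma}{L}+\frac{\gamma}{4\sqrt{G_\infty^2+\epsilon}}\right)\cdot\left(\frac{\sigma^2}{n}+G_\infty^2d\right)m + \frac{32\gamma(1+\beta_1)^2(1+\omega)^3V_1G_\infty^2dmL}{\beta_2^{m}(1-\beta_1)^2(1-\omega)^4}.$$

Proof. For all the steps $t$ that fulfill $v_t\ne v_{t+1}$,

$$\begin{aligned}
y_{t+1}-y_t
&= x_{t+1}-x_t - \frac{\gamma}{1-\beta_1}\left(\frac{m_{t+1}}{\sqrt{v_{t+1}+\epsilon}}-\frac{m_t}{\sqrt{v_t+\epsilon}}\right) + \gamma\left(\frac{\delta_t}{\sqrt{v_t+\epsilon}}-\frac{\delta_{t+1}}{\sqrt{v_{t+1}+\epsilon}}\right)\\
&= -\frac{\gamma m_t}{\sqrt{v_t+\epsilon}} - \frac{\gamma}{1-\beta_1}\left(\frac{m_{t+1}}{\sqrt{v_{t+1}+\epsilon}}-\frac{m_t}{\sqrt{v_t+\epsilon}}\right) + \gamma\left(\frac{\delta_t}{\sqrt{v_t+\epsilon}}-\frac{\delta_{t+1}}{\sqrt{v_{t+1}+\epsilon}}\right)\\
&= -\frac{\gamma\beta_1}{1-\beta_1}\frac{m_t}{\sqrt{v_t+\epsilon}} - \frac{\gamma}{1-\beta_1}\frac{m_{t+1}}{\sqrt{v_{t+1}+\epsilon}} + \gamma\left(\frac{\delta_t}{\sqrt{v_t+\epsilon}}-\frac{\delta_{t+1}}{\sqrt{v_{t+1}+\epsilon}}\right).
\end{aligned}$$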
Based on the smoothness assumption, for constant η > 0 that will be assigned later, Ef (y t+1 )-Ef (y t ) ≤E⟨∇f (y t ),y t+1 -y t ⟩+ L 2 E∥y t+1 -y t ∥ 2 γη<1 ≤ ηγ 2L E∥∇f (y t )∥ 2 + L ηγ E∥y t+1 -y t ∥ 2 ≤ ηγ L E∥∇f (x t )∥ 2 +ηγLE∥y t -x t ∥ 2 + L ηγ E∥y t+1 -y t ∥ 2 ≤ ηγ L E∥∇f (x t )-g t ∥ 2 + ηγ L E∥g t ∥ 2 +ηγLE∥y t -x t ∥ 2 + L ηγ E∥y t+1 -y t ∥ ≤ ηγ n 2 L n i=1 E ∇f (x t )-g (i) t 2 + ηγ nL n i=1 E g (i) t 2 +ηγLE∥y t -x t ∥ 2 + L ηγ E∥y t+1 -y t ∥ 2 ≤ ηγ L σ 2 n +G 2 ∞ d +ηγLE∥y t -x t ∥ 2 + L ηγ E∥y t+1 -y t ∥ 2 . Now we can bound the last two terms as follows, note that E∥y t -x t ∥ 2 =E γm t (1-β 1 ) √ v t +ϵ + γδ t √ v t +ϵ 2 ≤ 2γ 2 (1-β 1 ) 2 E m t √ v t +ϵ 2 +2γ 2 E δ t √ v t +ϵ 2 ≤ 2γ 2 V 1 (1-β 1 ) 2 β m 2 E∥m t ∥ 2 + 2γ 2 V 1 β m 2 E∥δ t ∥ 2 ≤ 390γ 2 (1+ω) 3 V 1 G 2 ∞ d β m 2 (1-β 1 ) 2 (1-ω) 4 + 64γ 2 ω(1+ω) 3 V 1 G 2 ∞ d β m 2 (1-ω) 4 ≤ 454γ 2 (1+ω) 3 V 1 G 2 ∞ d β m 2 (1-β 1 ) 2 (1-ω) 4 , where in the last step we apply Lemma 1. On the other hand, E∥y t+1 -y t ∥ 2 =E γβ 1 1-β 1 m t √ v t +ϵ + γ 1-β 1 m t+1 √ v t+1 +ϵ -γ δ t √ v t +ϵ - δ t+1 √ v t+1 +ϵ 2 ≤E γβ 1 1-β 1 m t √ v t +ϵ + γ 1-β 1 m t+1 √ v t+1 +ϵ -γ δ t √ v t +ϵ - δ t+1 √ v t+1 +ϵ 2 ≤ 4γ 2 β 2 1 (1-β 1 ) 2 E m t √ v t +ϵ 2 + 4γ 2 (1-β 1 ) 2 E m t+1 √ v t+1 +ϵ 2 +4γ 2 E δ t √ v t +ϵ 2 +4γ 2 E δ t+1 √ v t+1 +ϵ 2 ≤ 4γ 2 β 2 1 V 1 (1-β 1 ) 2 β m 2 E∥m t ∥ 2 + 4γ 2 V 1 (1-β 1 ) 2 β m 2 E∥m t+1 ∥ 2 + 4γ 2 V 1 β m 2 E∥δ t ∥ 2 + 4γ 2 V 1 β m 2 E∥δ t+1 ∥ 2 ≤ 780γ 2 (1+β 2 1 )V 1 (1+ω) 3 G 2 ∞ d β m 2 (1-β 1 ) 2 (1-ω) 4 + 256γ 2 V 1 ω(1+ω) 3 G 2 ∞ d β m 2 (1-ω) 4 ≤ 1036γ 2 (1+β 2 1 )V 1 (1+ω) 3 G 2 ∞ d β m 2 (1-β 1 ) 2 (1-ω) 4 , where we again apply Lemma 1 and Lemma 3. 
Put everything together, Ef (y t+1 )-Ef (y t ) ≤ ηγ L σ 2 n +G 2 ∞ d +ηγLE∥y t -x t ∥ 2 + L ηγ E∥y t+1 -y t ∥ 2 ≤ ηγ L σ 2 n +G 2 ∞ d + 454ηγ 3 (1+ω) 3 V 1 G 2 ∞ dL β m 2 (1-β 1 ) 2 (1-ω) 4 + 1036γ(1+β 2 1 )V 1 (1+ω) 3 G 2 ∞ dL ηβ m 2 (1-β 1 ) 2 (1-ω) 4 ≤ ηγ L σ 2 n +G 2 ∞ d + 454ηγ 2 + 1036 η γ(1+β 1 ) 2 (1+ω) 3 V 1 G 2 ∞ dL β m 2 (1-β 1 ) 2 (1-ω) 4 Set η = 34, and considering γ < 1 125 , we get Ef (y t+1 )-Ef (y t ) ≤ 34γ L σ 2 n +G 2 ∞ d + 32γ(1+β 1 ) 2 (1+ω) 3 V 1 G 2 ∞ dL β m 2 (1-β 1 ) 2 (1-ω) 4 . Summing over all the update steps, we obtain 0 ≤ t∈Tv E[f (y t )-f (y t+1 )]+ 34γ L σ 2 m n +G 2 ∞ dm + 32γ(1+β 1 ) 2 (1+ω) 3 V 1 G 2 ∞ dmL β m 2 (1-β 1 ) 2 (1-ω) 4 . Adding γ 4 √ G 2 ∞ +ϵ t∈Tv E∥∇f (x t )∥ 2 on both sides, and note that t∈Tv E∥∇f (x t )∥ 2 = t∈Tv E∥∇f (x t )-g t ∥ 2 + t∈Tv E∥g t ∥ 2 ≤ σ 2 m n +G 2 ∞ dm, we finally obtain t∈Tv γ 4 G 2 ∞ +ϵ E∥∇f (x t )∥ 2 ≤ t∈Tv E[f (y t )-f (y t+1 )]+ 34γ L + γ 4 G 2 ∞ +ϵ • σ 2 n +G 2 ∞ d m + 32γ(1+β 1 ) 2 (1+ω) 3 V 1 G 2 ∞ dmL β m 2 (1-β 1 ) 2 (1-ω) 4 . That completes the proof. Lemma 7. Under Assumption 4, for any t ≥ 0, it holds that E∥δ t ∥ 2 ≤ 4∆ 2 . Proof. Based on the definition of the compression error, we obtain E∥δ t ∥ 2 =E 1 n n i=1 δ (i) t -δ t 2 ≤2E δ t 2 +2E 1 n n i=1 δ (i) t 2 ≤2E δ t 2 +2 1 n n i=1 E δ (i) t 2 ≤4∆ 2 . That completes the proof. Lemma 8. In Algorithm 1, for any t ≥ 0, the momentum term is uniformly bounded by the following: E m (i) t 2 ≤ 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 , E m (i) t+ 1 2 2 ≤ 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 , E∥ mt ∥ 2 ≤ 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 , E mt+ 1 2 2 ≤ 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 . Proof. We prove this lemma via induction. Note that when t = 0, the inequality trivially holds due to initialization at 0 and Jensen Inequality. 
Now suppose the inequality holds up to step t ≥ 0, then for t+1, if t ∈ T u , then E m (i) t+1 2 =E u t+ 1 2 t-k 2 =E ũt+ 1 2 +δ t -δ t+1 t-k 2 =E t j=k+1 mj +δ t -δ t+1 t-k 2 =E t j=k+1 β j-k 1 mk +(1-β 1 ) j-1 h=k β j-h-1 1 g h +δ t -δ t+1 t-k =E 1 t-k t j=k+1 β j-k 1 mk + 1-β 1 t-k t j=k+1 j-1 h=k β j-h-1 1 g h +(δ t -δ t+1 ) 2 ∀η>0 ≤ (1+η)E 1 t-k t j=k+1 β j-k 1 mk 2 +(1+1/η)E 1-β 1 t-k t j=k+1 j-1 h=k β j-h-1 1 g h +(δ t -δ t+1 ) 2 ≤ 1+η t-k t j=k+1 E β j-k 1 mk 2 + 3(1+1/η)(1-β 1 ) t-k t j=k+1 j-1 h=k β j-h-1 1 g h E∥g h ∥ 2 +3(1+1/η)E∥δ t ∥ 2 +3(1+1/η)E∥δ t+1 ∥ 2 η=1/β1-1 ≤ (1+η)β 2 1 • 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 +3(1+1/η)G 2 ∞ d+24(1+1/η)∆ 2 =β 1 • 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 + 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 = 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 . On the other hand, if t ̸ ∈ T u , then E m (i) t+1 2 =E m (i) t+ 1 2 2 = E β 1 m (i) t +(1-β 1 )g (i) t 2 ≤β 1 E β 1 m (i) t 2 +(1-β 1 )E g (i) t 2 ≤ 3G 2 ∞ d+24∆ 2 (1-β 1 ) 2 . For all the t+ 1 2 case, the inequality holds trivially due to Jensen Inequality. Finally, all the • bound can also be obtained via Jensen Inequality. And that completes the proof. Lemma 9. In Algorithm 1, for all the t such that t ̸ ∈ T v , it holds that if we set learning rate γ ≤ min β m 2 4V 1 L G 2 ∞ +ϵ , 2 G 2 ∞ +ϵ L , then, t̸ ∈Tv γE∥∇f ( xt )∥ 2 4 G 2 ∞ +ϵ ≤ t̸ ∈Tv Ef ( ỹt )-Ef ( ỹt+1 )+ 36γ 3 H 2 V 1 (3G 2 ∞ d+25∆ 2 )L 2 (1+L)(G 2 ∞ +ϵ+1)(T -m) β m 2 (1-β 1 ) 4 G 2 ∞ +ϵ + Lγ 2 V 1 σ 2 (T -m) nβ m 2 + 48γ 3 V 1 (H +1) 2 (3G 2 ∞ d+24∆ 2 ) G 2 ∞ +ϵ(T -m) β m 2 (1-β 1 ) 4 . Proof. Since when t ̸ ∈ T v , it can either belongs to T u or not. We first prove the case for t ∈ T u . 
From the definition of the auxiliary sequence, we obtain, ỹt+1 -ỹt = xt+1 -xt - γ 1-β 1 mt+1 √ v t+1 +ϵ - mt √ v t +ϵ - γδ t+1 √ v t+1 +ϵ - γδ t √ v t +ϵ = xt+1 -xt - γ (1-β 1 ) √ v t +ϵ ( mt+1 -mt )- 1 √ v t +ϵ (γδ t+1 -γδ t ) = xt+ 1 2 -xt - γ (1-β 1 ) √ v t +ϵ mt+ 1 2 -mt + xt+1 -xt+ 1 2 - γ (1-β 1 ) √ v t +ϵ mt+1 -mt+ 1 2 - 1 √ v t +ϵ (γδ t+1 -γδ t ) =qt =- γ mt √ v t +ϵ - γ (1-β 1 ) √ v t +ϵ (β 1 mt +(1-β 1 )g t -mt )+q t =- γ gt √ v+ϵ +q t . From Assumption 1, we have Ef ( ỹt+1 )-Ef ( ỹt ) ≤E⟨∇f ( ỹt ), ỹt+1 -ỹt ⟩+ L 2 E∥ ỹt+1 -ỹt ∥ 2 =-γE ∇f ( ỹt ), gt √ v t +ϵ A1 +Lγ 2 E gt √ v t +ϵ 2 A2 -γE⟨∇f ( ỹt ),q t ⟩ A3 +Lγ 2 E∥q t ∥ 2 A4 . We now bound A 1 to A 4 separately. Note that from Lemma 8, the momentum term can be uniformly bounded by a constant. For brevity of the derivation, we use M to denote such constant bound, and fit in its value at the end of the proof. For A 1 , A 1 =-γE ∇f ( ỹt ), gt √ v t +ϵ =-γE ∇f ( ỹt ), 1 n n i=1 ∇f x (i) t √ v t +ϵ =-γE ∇f ( xt ), ∇f ( xt ) √ v t +ϵ -γE ∇f ( xt ), 1 n n i=1 ∇f x (i) t -∇f ( xt ) √ v t +ϵ -γE ∇f ( ỹt )-∇f ( xt ), ∇f ( xt ) √ v t +ϵ -γE ∇f ( ỹt )-∇f ( xt ), 1 n n i=1 ∇f x (i) t -∇f ( xt ) √ v t +ϵ ≤- γE∥∇f ( xt )∥ 2 G 2 ∞ +ϵ + γη 1 2 E∥∇f ( xt )∥ 2 + γ 2η 1 E 1 n n i=1 ∇f x (i) t -∇f ( xt ) √ v t +ϵ 2 + γη 1 2 E∥∇f ( xt )∥ 2 + γ 2η 1 E ∇f ( ỹt )-∇f ( xt ) √ v t +ϵ 2 + γη 1 2 E∥∇f ( ỹt )-∇f ( xt )∥ 2 + γ 2η 1 E 1 n n i=1 ∇f x (i) t -∇f ( xt ) √ v t +ϵ 2 ≤- γ G 2 ∞ +ϵ -γη 1 E∥∇f ( xt )∥ 2 + γV 1 L 2 β m 2 η 1 n n i=1 E x (i) t -xt 2 + γV 1 L 2 2β m 2 η 1 + γη 1 L 2 2 E∥ ỹt -xt ∥ 2 , where in the last step we use Assumption 1, Lemma 2 and Lemma 4. 
For the second term, denote the last sync step before t is k, then we have: E x (i) t -xt 2 =E x (i) t -x (i) k -( xt -xk ) 2 ≤2E x (i) t -x (i) k 2 +2E∥ xt -xk ∥ 2 ≤2γ 2 E t-1 j=k m (i) j √ v t +ϵ 2 +2γ 2 E 1 n n i=1 t-1 j=k m (i) j √ v t +ϵ 2 ≤2γ 2 (t-k) t-1 j=k E m (i) j √ v t +ϵ 2 +2γ 2 (t-k) 1 n n i=1 t-1 j=k E m (i) j √ v t +ϵ 2 ≤ 4γ 2 H 2 V 1 M β m 2 , where the first step holds because Lemma 4, Lemma 2, and the fact that at the sync step k, xk = x (i) k . For the third term, we have E∥ ỹt -xt ∥ 2 =E γ mt (1-β 1 ) √ v t +ϵ + γδ t √ v t +ϵ 2 ≤ 2γ 2 V 1 β m 2 (1-β 1 ) 2 E∥ mt ∥ 2 + 2V 1 β m 2 E∥γδ t ∥ 2 Lemma 7 ≤ 2γ 2 V 1 M β m 2 (1-β 1 ) 2 + 2γ 2 V 1 β m 2 •4∆ 2 ≤ 2γ 2 V 1 M β m 2 (1-β 1 ) 2 + 8γ 2 V 1 ∆ 2 β m 2 , where we again apply the Lemma 2 and Lemma 4. Then we can get A 1 ≤- γ G 2 ∞ +ϵ -γη 1 E∥∇f ( xt )∥ 2 + γV 1 L 2 β m 2 η 1 n n i=1 E x (i) t -xt 2 + γV 1 L 2 2β m 2 η 1 + γη 1 L 2 2 E∥ ỹt -xt ∥ 2 ≤- γ G 2 ∞ +ϵ -γη 1 E∥∇f ( xt )∥ 2 + 4γ 3 H 2 V 2 1 L 2 M β m 2 η 1 + γV 1 L 2 2β m 2 η 1 + γη 1 L 2 2 • 2γ 2 V 1 M β m 2 (1-β 1 ) 2 + 8γ 2 V 1 ∆ 2 β m 2 ≤- γ G 2 ∞ +ϵ -γη 1 E∥∇f ( xt )∥ 2 + 4γ 3 H 2 V 2 1 L 2 M β m 2 η 1 + γ 3 V 2 1 M L 2 η 1 β 2m 2 (1-β 1 ) 2 + γ 3 η 1 V 1 M L 2 β m 2 (1-β 1 ) 2 + 4γ 3 V 2 1 ∆ 2 L 2 η 1 β 2m 2 + 4γ 3 η 1 V 1 ∆ 2 L 2 β m 2 . where in the second step we reuse Equation ( 5). Next we can bound A 2 as follows A 2 =Lγ 2 E gt √ v t +ϵ 2 ≤ Lγ 2 V 1 β m 2 E 1 n n i=1 g (i) t 2 ≤ Lγ 2 V 1 σ 2 nβ m 2 + Lγ 2 V 1 β m 2 E 1 n n i=1 ∇f x (i) t 2 ≤ Lγ 2 V 1 σ 2 nβ m 2 + 2Lγ 2 V 1 β m 2 E 1 n n i=1 ∇f x (i) t -∇f ( xt ) 2 + 2Lγ 2 V 1 β m 2 E∥∇f ( xt )∥ 2 ≤ Lγ 2 V 1 σ 2 nβ m 2 + 2Lγ 2 V 1 L 2 nβ m 2 n i=1 E x (i) t -xt 2 + 2Lγ 2 V 1 β m 2 E∥∇f ( xt )∥ 2 ≤ Lγ 2 V 1 σ 2 nβ m 2 + 8γ 3 V 2 1 H 2 M L 3 β m 2 + 2Lγ 2 V 1 β m 2 E∥∇f ( xt )∥ 2 , where in the sixth step we reuse Equation (5). 
For A 3 , A 3 =-γE⟨∇f ( ỹt ),q t ⟩ =-γE⟨∇f ( xt ),q t ⟩-γE⟨∇f ( ỹt )-∇f ( xt ),q t ⟩ ∀η2>0 ≤ γη 2 2 E∥∇f ( xt )∥ 2 + γη 2 2 E∥∇f ( ỹt )-∇f ( xt )∥ 2 + γ η 2 E∥q t ∥ 2 ≤ γη 2 2 E∥∇f ( xt )∥ 2 + γη 2 L 2 2 • 2γ 2 V 1 M β m 2 (1-β 1 ) 2 + 8γ 2 V 1 ∆ 2 β m 2 + γ η 2 E∥q t ∥ ≤ γη 2 2 E∥∇f ( xt )∥ 2 + γ 3 η 2 V 1 M L 2 β m 2 (1-β 1 ) 2 + 4γ 3 η 2 V 1 ∆ 2 L 2 β m 2 + γ η 2 E∥q t ∥ 2 , where in the last step we reuse Equation ( 6). Combine the bound of A 1 to A 4 , we obtain Ef ( ỹt+1 )-Ef ( ỹt ) ≤- γ G 2 ∞ +ϵ -γη 1 - γη 2 2 E∥∇f ( xt )∥ 2 + 4γ 3 H 2 V 2 1 L 2 M β m 2 η 1 + γ 3 V 2 1 M L 2 η 1 β 2m 2 (1-β 1 ) 2 + γ 3 η 1 V 1 M L 2 β m 2 (1-β 1 ) 2 + 4γ 3 V 2 1 ∆ 2 L 2 η 1 β 2m 2 + 4γ 3 η 1 V 1 ∆ 2 L 2 β m 2 + Lγ 2 V 1 σ 2 nβ m 2 + 8γ 3 V 2 1 H 2 M L 3 β m 2 + 2Lγ 2 V 1 β m 2 E∥∇f ( xt )∥ 2 + γ 3 η 2 V 1 M L 2 β m 2 (1-β 1 ) 2 + 4γ 3 η 2 V 1 ∆ 2 L 2 β m 2 + γ η 2 +Lγ 2 E∥q t ∥ 2 . We set the two constants η 1 ,η 2 as η 1 = 1 4 G 2 ∞ +ϵ η 2 = 1 2 G 2 ∞ +ϵ , then we have, Ef ( ỹt+1 )-Ef ( ỹt ) ≤- γ G 2 ∞ +ϵ -γη 1 - γη 2 2 E∥∇f ( xt )∥ 2 + 4γ 3 H 2 V 2 1 L 2 M β m 2 η 1 + γ 3 V 2 1 M L 2 η 1 β 2m 2 (1-β 1 ) 2 + γ 3 η 1 V 1 M L 2 β m 2 (1-β 1 ) 2 + 4γ 3 V 2 1 ∆ 2 L 2 η 1 β 2m 2 + 4γ 3 η 1 V 1 ∆ 2 L 2 β m 2 + Lγ 2 V 1 σ 2 nβ m 2 + 8γ 3 V 2 1 H 2 M L 3 β m 2 + 2Lγ 2 V 1 β m 2 E∥∇f ( xt )∥ 2 + γ 3 η 2 V 1 M L 2 β m 2 (1-β 1 ) 2 + 4γ 3 η 2 V 1 ∆ 2 L 2 β m 2 + γ η 2 +Lγ 2 E∥q t ∥ 2 ≤- γ 2 G 2 ∞ +ϵ - 2Lγ 2 V 1 β m 2 E∥∇f ( xt )∥ 2 + 36γ 3 H 2 V 1 (M +∆ 2 )L 2 (1+L)(G 2 ∞ +ϵ+1) β m 2 (1-β 1 ) 2 G 2 ∞ +ϵ + Lγ 2 V 1 σ 2 nβ m 2 + 2γ G 2 ∞ +ϵ+Lγ 2 E∥q t ∥ 2 . Finally, we need to bound the norm of q t . 
If we denote the last sync step was k steps before t, then, q t = xt+1 -xt+ 1 2 - γ (1-β 1 ) √ v t +ϵ mt+1 -mt+ 1 2 - γδ t+1 -γδ t √ v t +ϵ = xt+1 -xt-k+1 + xt-k+1 -xt+ 1 2 - γ (1-β 1 ) √ v t +ϵ mt+1 -mt+ 1 2 - γδ t+1 -γδ t √ v t +ϵ =- γ ũt+ 1 2 √ v t +ϵ -   t j=t-k+1 γ mj √ v t +ϵ   - γ mt+1 -mt+ 1 2 (1-β 1 ) √ v t +ϵ =- γ (1-β 1 ) √ v t +ϵ   mt+1 -mt+ 1 2 +2(1-β 1 ) t j=t-k+1 mj   , based on which we obtain E∥q t ∥ 2 =E γ (1-β 1 ) √ v t +ϵ   mt+1 -mt+ 1 2 +2(1-β 1 ) t j=t-k+1 mj   ≤ γ 2 V 1 β m 2 (1-β 1 ) 2   3E∥ mt+1 ∥ 2 +3E mt+ 1 2 2 +12(1-β 1 ) 2 k t j=t-k+1 E∥ mj ∥ 2   ≤ 12γ 2 V 1 (H +1) 2 M β m 2 (1-β 1 ) 2 . Put everything together, and let γ fulfills γ ≤ min β m 2 4V 1 L G 2 ∞ +ϵ , 2 G 2 ∞ +ϵ L , we finally obtain Ef ( ỹt+1 )-Ef ( ỹt ) ≤- γE∥∇f ( xt )∥ 2 4 G 2 ∞ +ϵ + 36γ 3 H 2 V 1 (M +∆ 2 )L 2 (1+L)(G 2 ∞ +ϵ+1) β m 2 (1-β 1 ) 2 G 2 ∞ +ϵ + Lγ 2 V 1 σ 2 nβ m 2 + 48γ 3 V 1 (H +1) 2 M G 2 ∞ +ϵ β m 2 (1-β 1 ) 2 . To this end, we have provided bound to all the sync steps t with (t ̸ ∈ T v and t ∈ T u ). For all the t with (t ̸ ∈ T v and t ̸ ∈ T u ), they can be seen as a special case of q t = 0. Since A 3 +A 4 > 0, this bound will continue to hold for them, so that to sum over all the t with t ̸ ∈ T v , we obtain t̸ ∈Tv γE∥∇f ( xt )∥ 2 4 G 2 ∞ +ϵ ≤ t̸ ∈Tv Ef ( ỹt )-Ef ( ỹt+1 )+ 36γ 3 H 2 V 1 (3G 2 ∞ d+25∆ 2 )L 2 (1+L)(G 2 ∞ +ϵ+1)(T -m) β m 2 (1-β 1 ) 4 G 2 ∞ +ϵ + Lγ 2 V 1 σ 2 (T -m) nβ m 2 + 48γ 3 V 1 (H +1) 2 (3G 2 ∞ d+24∆ 2 ) G 2 ∞ +ϵ(T -m) β m 2 (1-β 1 ) 4 , where we replace M with Lemma 8. That completes the proof. Lemma 10. In Algorithm 1, For all the t ≥ 0 that fulfills v t ̸ = v t+1 , i.e. t ∈ T v , if the learning rate fulfills γ < 1 6 , the following bound holds t∈Tv γE∥∇f ( xt )∥ 2 4 G 2 ∞ +ϵ ≤ t∈Tv Ef ( ỹt )-Ef ( ỹt+1 )+ 2γσ 2 m nL + 106γH 2 V 1 (M +∆ 2 )mL β m 2 (1-β 1 ) 2 + γσ 2 m 4n G 2 ∞ +ϵ + γG 2 ∞ dm 4 G 2 ∞ +ϵ . Proof. From the definition of the auxiliary sequence, we obtain, We now bound the three norm terms separately. 
From Equation (5), we obtain for the first term, ỹt+1 -ỹt = xt+1 -xt - γ 1-β 1 mt+1 √ v t+1 +ϵ - mt √ v t +ϵ - γδ t+1 √ v E x (i) t -xt 2 ≤ 4γ 2 H 2 V 1 M β m 2 , where we again use M to denote the constant bound from Lemma 8 for brevity. On the other hand, based on a similar derivation to Equation ( 6), we obtain E∥ ỹt -xt ∥ 2 ≤ 2γ 2 V 1 M β m 2 (1-β 1 ) 2 + 8γ 2 V 1 ∆ 2 β m 2 . Finally, for the last norm, it's possible that the update towards t+1 step contains synchronization on the buffer. So that we need to discuss the two cases separately. First, for all the t ∈ T u , denote the last sync step before t is k, then we have E∥ ỹt+1 -ỹt ∥ 2 =E xt+1 -xt - γ 1-β 1 mt+1 √ v t+1 +ϵ - mt √ v t +ϵ - γδ t+1 √ v t+1 +ϵ - γδ t √ v t +ϵ 2 ≤7E∥ xt+1 -xt-k+1 ∥ 2 +7E xt-k+1 -xt+ 1 2 2 +7E xt+ 1 2 -xt 2 + 7γ 2 1-β 1 E mt+1 √ v t+1 +ϵ 2 + 7γ 2 1-β 1 E mt √ v t +ϵ 2 +7E γδ t+1 √ v t+1 +ϵ 2 +7E γδ t √ v t +ϵ 2 ≤7E∥ xt+1 -xt-k+1 ∥ 2 +7E xt-k+1 -xt+ 1 2 2 +7γ 2 E mt √ v t +ϵ 2 + 7γ 2 1-β 1 E mt+1 √ v t+1 +ϵ 2 + 7γ 2 1-β 1 E mt √ v t +ϵ 2 +7E γδ t+1 √ v t+1 +ϵ 2 +7E γδ t √ v t +ϵ 2 ≤7E t j=t-k+1 γ mj +δ t -δ t+1 √ v k +ϵ 2 +7E t j=t-k+1 γ mj √ v k +ϵ 2 +7γ 2 E mt √ v t +ϵ 2 + 7γ 2 1-β 1 E mt+1 √ v t+1 +ϵ 2 + 7γ 2 1-β 1 E mt √ v t +ϵ 2 +7E γδ t+1 √ v t+1 +ϵ 2 +7E γδ t √ v t +ϵ 2 ≤ 105γ 2 H 2 V 1 (M +∆ 2 ) β m 2 (1-β 1 ) 2 , where in the last step we use Lemma 7, 8 and 4. It is straightforward to verify that this bound also holds for t ̸ ∈ T u (since there will be no noise from the sync step). 
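Combining the three norm bounds, we obtain

$$\begin{aligned}
\mathbb{E}f(\tilde{y}_{t+1})-\mathbb{E}f(\tilde{y}_t)
&\le \frac{2\eta\gamma L}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\tilde{x}_t-x_t^{(i)}\right\|^2 + \frac{2\eta\gamma\sigma^2}{nL} + \eta\gamma L\,\mathbb{E}\|\tilde{y}_t-\tilde{x}_t\|^2 + \frac{L}{\eta\gamma}\mathbb{E}\|\tilde{y}_{t+1}-\tilde{y}_t\|^2\\
&\le \frac{8\eta\gamma^3H^2V_1ML}{\beta_2^{m}} + \frac{2\eta\gamma\sigma^2}{nL} + \eta\gamma L\left(\frac{2\gamma^2V_1M}{\beta_2^{m}(1-\beta_1)^2}+\frac{8\gamma^2V_1\Delta^2}{\beta_2^{m}}\right) + \frac{105\gamma H^2V_1(M+\Delta^2)L}{\eta\beta_2^{m}(1-\beta_1)^2}\\
&\le \frac{2\eta\gamma\sigma^2}{nL} + \frac{18\eta\gamma^3H^2V_1ML}{\beta_2^{m}(1-\beta_1)^2} + \frac{105\gamma H^2V_1(M+\Delta^2)L}{\eta\beta_2^{m}(1-\beta_1)^2}\\
&\le \frac{2\gamma\sigma^2}{nL} + \frac{106\gamma H^2V_1(M+\Delta^2)L}{\beta_2^{m}(1-\beta_1)^2},
\end{aligned}$$

where in the last step we set $\eta=1$ and use the requirement that $\gamma<1/6$. Summing over all $t\in\mathcal{T}_v$, we get

$$0 \le \sum_{t\in\mathcal{T}_v}\left[\mathbb{E}f(\tilde{y}_t)-\mathbb{E}f(\tilde{y}_{t+1})\right] + \frac{2\gamma\sigma^2m}{nL} + \frac{106\gamma H^2V_1(M+\Delta^2)mL}{\beta_2^{m}(1-\beta_1)^2}.$$

Adding $\frac{\gamma}{4\sqrt{G_\infty^2+\epsilon}}\sum_{t\in\mathcal{T}_v}\mathbb{E}\|\nabla f(\tilde{x}_t)\|^2$ to both sides and bounding it by $\frac{\gamma\sigma^2m}{4n\sqrt{G_\infty^2+\epsilon}}+\frac{\gamma G_\infty^2dm}{4\sqrt{G_\infty^2+\epsilon}}$, as in the proof of Lemma 6, yields the bound stated in the lemma. That completes the proof.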



In the original 1-bit Adam paper, this stage is referred to as the warmup stage. We use a slightly different term to avoid confusion with learning rate warmup.
https://github.com/microsoft/DeepSpeed
Note that in Adam, operations like division should act element-wise.
Concretely, Tang et al. (2021, Section 7.1) show that to train BERT-Large on 64 GPUs using Ethernet, full-precision Adam takes 174.3 hours in total while 1-bit Adam takes 51.5 hours. By a simple calculation, the full-precision stage of 1-bit Adam takes approximately 26.37 hours while the compression stage takes 25.13 hours.
The name comes from the fact that the algorithm can potentially reduce the per-parameter volume to some number between 0 and 1 bit on average.
https://github.com/xunge/pytorch_lmdb_imagenet
https://github.com/pytorch/examples/blob/master/imagenet/main.py
In the original Algorithm 4, $g_t$ is the output of the AllReduce when $t\in\mathcal{T}_v$. This, however, does not affect our analysis, since our proof holds for a noisier case. The original Algorithm 4 is designed mainly for practical reasons: we avoid redundant AllReduce rounds when 1-bit AllReduce is performed.



Figure 1: Momentum and variance profiling for BERT-Large sequence 128 pre-training with original Adam using 64 GPUs. For variance, we profile two types of metrics: the first is the difference between local and global variance: $\|v^{(0)}_t - v_t\|$, where $v^{(0)}$

where $x$ denotes the $d$-dimensional model, $\mathcal{D}$ denotes the training set, and $f(x;\zeta)$ is the loss incurred on sample $\zeta$ given model parameters $x$. This structure naturally captures many model training problems.

Figure 2: Sample-wise and time-wise convergence for BERT-Base/Large pre-training (sequence length 128) and ResNet18 pre-training on ImageNet using 128 GPUs on the Ethernet cluster. Note that the time on the right is measured with both algorithms processing the same number of samples.

Figure 3: End-to-end average throughput for BERT-Base/Large pre-training (sequence length 128) and ResNet18 pre-training on ImageNet using 128 V100 GPUs on the Ethernet/InfiniBand cluster. Note that for ImageNet, both the batch size (256) and the model (ResNet18) are small compared to BERT, so its parallel speedup would be limited on the same large system used for BERT (128 GPUs). We therefore test it on 4 to 32 GPUs in Figure (d).

Figure 4: Reduction in the number of bits per parameter used and the number of communication rounds in different tasks. Note that the communication round numbers are normalized due to the scale difference between tasks. Note that original Adam uses 16 bits per parameter and communicates at every step.

Figure 5: Evaluation of BERT-Base/Large pre-training throughput using 0/1 Adam without skipping communication rounds. Compared with Figures 4 and 2, local steps break the barrier on the performance gain.

Figure 6: Training loss (left) and validation perplexity (right) with respect to tokens for 1-bit Adam and 0/1 Adam.

Figure 7: Training loss and validation error of the ResNet18-CIFAR10 task. Hyperparameters that are not related to 0/1 Adam are set to {learning rate: 1e-4, weight decay: 5e-4}. For hyperparameters associated with 0/1 Adam, we use the same ones as used in ImageNet with no additional tuning.

The scores are the median scores over 10 runs with different seeds, and are obtained on the checkpoints trained on both the sequence 128 and sequence 512 datasets. The first column shows the Top-1 accuracy on ImageNet of ResNet at the end of epoch 90 from different algorithms. The original accuracy is provided by the PyTorch pretrained model library (Pytorch, 2014).

Pytorch. Torchvision 0.11.0 documentation - pytorch.org. https://pytorch.org/vision/stable/models.html, 2014.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Profiling, on the Ethernet cluster, of the time spent in computation and in other work (including initialization of a communication round and compression) during one 1-bit AllReduce round at different scales.

ACKNOWLEDGMENTS

Yucheng Lu is supported by Meta PhD Fellowship. The authors would like to thank anonymous reviewers from ICLR2023 for providing valuable feedback.


