WIN: WEIGHT-DECAY-INTEGRATED NESTEROV ACCELERATION FOR ADAPTIVE GRADIENT ALGORITHMS

Abstract

Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate adaptive gradient algorithms in a general manner", and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) for adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration which combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-alike acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam, which uses a conservative step and a reckless step to update twice and then linearly combines the two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts, taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification and language modeling tasks with both CNN and Transformer backbones. We hope Win becomes a default acceleration option for popular optimizers in the deep learning community to improve training efficiency. Code will be released at https://github.com/sail-sg/win.
Specifically, for both Win-accelerated AdamW and Adam, the stochastic gradient complexity to find an ε-approximate first-order stationary point is O(c_∞^{2.5} σ² L ∆ / (ν^{1.25} ε⁴)), which matches the lower bound Ω(1/ε⁴) in (Arjevani et al., 2019; 2020) (up to constant factors) under the same conditions, where c_∞ upper bounds the ℓ_∞-norm of the stochastic gradient. Moreover, this complexity improves by a factor O(d/c_∞^{0.5}) upon the complexity O(c_∞² d σ² L / (ν^{1.25} ε⁴)) of Adam-type optimizers in (Zhou et al., 2018; Guo et al., 2021), e.g. Adam, AdaGrad (Duchi et al., 2011) and AdaBound (Luo et al., 2018), since the network parameter dimension d is often much larger than c_∞^{0.5}, especially for over-parameterized networks. Indeed, Win-accelerated Adam and AdamW also enjoy lower complexity than other Adam variants, e.g. AdaBelief (Zhuang et al., 2020) with complexity O(c_2⁶ σ² L / (ν² ε⁴)), especially on over-parameterized networks, where c_2 is the maximum ℓ_2-norm of the stochastic gradient.

1. INTRODUCTION

Deep neural networks (DNNs) are effective at modeling realistic data and have been successfully applied to many applications, e.g. image classification (He et al., 2016) and speech recognition (Sainath et al., 2013). Typically, their training can be formulated as a nonconvex problem:

min_{z∈R^d} F(z) := E_{ζ∼D}[f(z, ζ)] + (λ/2)‖z‖₂²,   (1)

where z ∈ R^d is the model parameter, the sample ζ is drawn from a data distribution D, the loss f is differentiable, and λ is a constant. Though many algorithms, e.g. gradient descent (Cauchy et al., 1847) and variance-reduced algorithms (Rie Johnson, 2013), can solve problem (1), SGD (Robbins & Monro, 1951) uses the compositional structure in (1) to efficiently estimate the gradient via minibatch data, and has become a dominant algorithm for training DNNs in practice because of its higher efficiency and effectiveness. However, on sparse data or ill-conditioned problems, SGD suffers from slow convergence (Kingma & Ba, 2014), as it scales the gradient uniformly across all parameter coordinates and ignores the per-coordinate properties of the problem. To resolve this issue, recent work has proposed a variety of adaptive methods, e.g. Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2018), that scale each gradient coordinate according to the current geometric curvature of the loss F(z). This coordinate-wise scaling greatly accelerates convergence and has helped Adam and AdamW become popular in DNN training, especially for transformers. Unfortunately, with the increasing scale of both datasets and models, efficient DNN training, even with SGD or adaptive algorithms, has become very challenging. In this work, we are particularly interested in the problem of "how to accelerate the convergence of adaptive algorithms in a general manner" because of their dominant popularity across many DNNs.
Heavy ball acceleration (Polyak, 1964) and Nesterov acceleration (Nesterov, 2003) are widely used in SGD but are rarely studied in adaptive algorithms. Among the very few, NAdam (Dozat, 2016) simplifies Nesterov acceleration to estimate the first moment of the gradient in Adam while totally ignoring the second-order moment, which is not exact Nesterov acceleration and may not inherit its full acceleration merit.

Contributions: In this work, based on a recent Nesterov-type acceleration formulation (Nesterov et al., 2018) and the proximal point method (PPM) (Moreau, 1965), we propose a new Weight-decay-Integrated Nesterov acceleration (Win for short) to accelerate adaptive algorithms, and further analyze the convergence of Win-accelerated adaptive algorithms to justify their convergence superiority, taking AdamW and Adam as examples. Our main contributions are highlighted below.

Firstly, we use PPM to rigorously derive our Win acceleration for adaptive algorithms. Taking AdamW and Adam as examples, at the k-th iteration, we follow the PPM spirit and minimize a dynamically regularized loss F(z) + (1/(2η_k))‖z − x_k‖²_{√v_k+ν} with the second-order gradient moment v_k and the stabilizing constant ν in AdamW and Adam. Then, to introduce Nesterov-alike acceleration and also make the problem solvable iteratively, we respectively approximate F(z) by its first- and second-order Taylor expansions to update the variable z twice, while always keeping the above dynamic regularization and also an extra regularizer (λ_k/2)‖z‖²_{√v_k+ν} induced by the weight decay in AdamW. As a result, we arrive at our Win acceleration, a Nesterov-alike acceleration, for AdamW and Adam that uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Then we extend this Win acceleration to LAMB (You et al., 2019) and SGD.
The above acceleration derivation is transparent and general, and could motivate other accelerations and provide examples of introducing other accelerations into adaptive algorithms.

Secondly, we prove the convergence of our Win-accelerated AdamW and Adam. For both, to find an ε-approximate first-order stationary point, their stochastic gradient complexity is O(c_∞^{2.5} σ² L ∆ / (ν^{1.25} ε⁴)) (see Sec. 4).

Finally, experimental results on both vision classification tasks and language modeling tasks show that our Win-accelerated algorithms, i.e. accelerated AdamW, Adam, LAMB and SGD, can accelerate convergence and also improve the performance of their corresponding non-accelerated counterparts by a remarkable margin on both CNN and transformer architectures. All these results show the strong compatibility, generality and superiority of our acceleration technique.

2. RELATED WORK

In the context of deep learning, when considering efficiency and generalization, one often prefers SGD and adaptive gradient algorithms, e.g. Adam, over other algorithms, e.g. variance-reduced algorithms (Rie Johnson, 2013), to solve problem (1). But, in both practice and theory, adaptive algorithms often suffer from inferior generalization performance compared with SGD (Zhou et al., 2020a; b). To alleviate this issue, AdamW (Loshchilov & Hutter, 2018) proposes a decoupled weight decay which introduces an ℓ_2-alike regularization into Adam to decay the network weights iteratively, and its effectiveness is widely validated on ViTs (Touvron et al., 2021) and CNNs (Touvron et al., 2021). Later, LAMB (You et al., 2019) scales the update in AdamW to the weight magnitude to avoid overly large or small updates, but suffers from unsatisfactory performance with small batch sizes. In this work, we aim to design a general acceleration for these adaptive algorithms.

Heavy-ball acceleration (Polyak, 1964) and Nesterov acceleration (Nesterov, 2003) are two classical acceleration techniques, and their effectiveness in SGD is well testified. Later, NAdam (Dozat, 2016) integrates Nesterov acceleration into the first-order gradient moment estimation but ignores the second-order gradient moment, which harms the acceleration effect. Some works (Anil et al., 2022; 2020) also explore Nesterov acceleration for second-order algorithms, e.g. Shampoo (Gupta et al., 2018). Recently, for the full gradient descent algorithm, a new general Nesterov-type acceleration (Nesterov et al., 2018) directly interpolates two variables to look ahead for correction, and is more flexible than vanilla Nesterov acceleration (Nesterov, 2003), which interpolates the variable and the gradient; see the discussion in Sec. 3.2. Here we use the proximal point method to introduce this new acceleration into adaptive algorithms via a rigorous and transparent derivation with the necessary tailoring.

3. WEIGHT-DECAY-INTEGRATED NESTEROV ACCELERATION

To accelerate the full gradient descent algorithm, given a full gradient ∇F(z_k) of problem (1) at the k-th iteration, Nesterov-type acceleration (Nesterov et al., 2018) generally uses a conservative step η_k and a reckless step η̃_k to update two sequences x_{k+1} and y_{k+1} respectively, and then linearly combines them to update the variable z_{k+1} of the problem. Similar formulations are also observed and proved in recent works, e.g. (Allen-Zhu & Orecchia, 2014; Bansal & Gupta, 2019; Ahn & Sra, 2022). In general, this acceleration can be formally formulated as

x_{k+1} = x_k − η_k ∇F(z_k),  y_{k+1} = z_k − η̃_k ∇F(z_k),  z_{k+1} = ρ_k x_{k+1} + (1 − ρ_k) y_{k+1}.   (2)

This acceleration enjoys a provably faster convergence rate for the full gradient descent method on convex problems (Nesterov et al., 2018), and has been empirically validated in many convex and nonconvex cases, e.g. (Wilson et al., 2017; Nado et al., 2021). Despite its effectiveness, such acceleration is rarely explored in adaptive gradient algorithms, especially for network training.

In the deterministic optimization setting, another widely used optimization-stabilizing and acceleration approach is the proximal point method (PPM) (Moreau, 1965; Rockafellar, 1976). At the k-th iteration, PPM optimizes an ℓ_2-regularized loss F(z) + (1/(2η_k))‖z − z_{k−1}‖² instead of the vanilla loss F(z). This small change enhances the convexity of the problem, accelerating and also stabilizing the optimization process (Kim et al., 2022; Zhou et al., 2021c). To make the ℓ_2-regularized problem solvable iteratively, PPM approximates the loss F(z) by its first- or second-order Taylor expansion so that each iteration has a closed-form solution (see below). Below, we borrow the idea of PPM to derive a Weight-decay-Integrated Nesterov acceleration (Win) for adaptive algorithms, using AdamW and Adam as examples in Sec. 3.1, and then extend this acceleration technique to LAMB and SGD in Sec. 3.2.
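For concreteness, the two-step update in Eqn. (2) can be sketched in a few lines of NumPy; the quadratic objective, step sizes and combination weight ρ below are illustrative choices of ours, not from the paper:

```python
import numpy as np

def nesterov_type_step(x, z, grad, eta, eta_tilde, rho):
    """One iteration of the Nesterov-type acceleration in Eqn. (2):
    a conservative update x_{k+1}, a reckless update y_{k+1},
    and their linear combination z_{k+1}."""
    x_next = x - eta * grad          # conservative step eta_k
    y_next = z - eta_tilde * grad    # reckless step eta_tilde_k > eta_k
    z_next = rho * x_next + (1 - rho) * y_next
    return x_next, z_next

# Toy run on F(z) = 0.5 * ||z||_2^2, whose full gradient is simply z.
x = np.array([1.0, -2.0])
z = x.copy()
for _ in range(100):
    x, z = nesterov_type_step(x, z, z, eta=0.1, eta_tilde=0.2, rho=0.5)
```

On this strongly convex toy problem both sequences contract toward the minimizer at the origin, with the reckless step looking further ahead along the same gradient direction.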

3.1. WIN-ACCELERATED ADAMW AND ADAM

To begin with, following most adaptive gradient algorithms, e.g. Adam and AdamW, we estimate the first- and second-order moments m_k and v_k of the gradient as follows:

g_k = (1/b) Σ_{i=1}^b ∇f(z_k; ζ_i),  m_k = (1 − β₁) m_{k−1} + β₁ g_k,  v_k = (1 − β₂) v_{k−1} + β₂ g_k²,   (3)

where g_k is the average gradient on minibatch data of size b, and β₁ ∈ [0, 1], β₂ ∈ [0, 1]. For the initialization, we set m₀ = g₀ and v₀ = g₀². For brevity, with a small scalar ν > 0, we define

s_k = √v_k + ν,  u_k = m_k / (√v_k + ν).   (4)

Then, following the spirit of PPM, at the k-th iteration we minimize a regularized loss F(x) + (1/(2η_k))‖x − x_k‖²_{s_k}, where ‖x‖²_{s_k} := ⟨x, s_k ⊙ x⟩ with the element-wise product ⊙. We adopt this s_k-weighted regularizer since 1) it allows us to derive the update in (5), and 2) it increases the convexity of the problem and further accounts for the different sharpness of each coordinate through the different elements of s_k, accelerating convergence. To make the problem solvable iteratively, we approximate the vanilla loss F(z) by its first-order Taylor expansion at the point z_k and update x_{k+1} as

x_{k+1} = argmin_x F(z_k) + ⟨m_k, x − z_k⟩ + (1/(2η_k))‖x − x_k‖²_{s_k} + (λ_k/2)‖x‖²_{s_k} = (1/(1 + λ_k η_k)) (x_k − η_k u_k),   (5)

where m_k is used to approximate the full gradient ∇F(z_k). We add the small regularization (λ_k/2)‖x‖²_{s_k}, since 1) it can largely improve generalization performance in practice (Loshchilov & Hutter, 2018; Touvron et al., 2021), and 2) it allows us to derive Adam (λ_k = 0) and AdamW (λ_k > 0). Here λ_k can be fixed as a constant or evolve with the iteration number k; in practice, an evolving λ_k often enjoys better performance than a fixed one (Caron et al., 2021; Zhou et al., 2022). When λ_k = 0, the update (5) becomes exactly Adam. If λ_k > 0, the update (5) approximates the update rule x_{k+1} = (1 − λ_k η_k) x_k − η_k u_k of AdamW.
This is because, since λ_k η_k is small in practice, we can approximate (1 + λ_k η_k)^{−1} = 1 − λ_k η_k + O(λ_k² η_k²), and thus (1/(1 + λ_k η_k))(x_k − η_k u_k) = [1 − λ_k η_k + O(λ_k² η_k²)] x_k − [η_k − O(λ_k η_k²) + O(λ_k³ η_k³)] u_k, which becomes AdamW after ignoring the negligible terms O(η_k²) and O(η_k³). This is also one reason that we adopt the regularizer ‖x − x_k‖²_{s_k} in (5) instead of the ℓ_2-regularization in PPM, since it lets us flexibly derive both Adam and AdamW. Similarly, we minimize a regularized loss F(z) + (1/(2η_k))‖z − x_{k+1}‖²_{s_k} again, and further approximate F(z) by its second-order approximation F(z_k) + ⟨m_k, z − z_k⟩ + (1/(2η̃_k))‖z − z_k‖²_{s_k}:

z_{k+1} = argmin_z F(z_k) + ⟨m_k, z − z_k⟩ + (1/(2η̃_k))‖z − z_k‖²_{s_k} + (1/(2η_k))‖z − x_{k+1}‖²_{s_k} + (λ_k/2)‖z‖²_{s_k} = η̃_k τ_k x_{k+1} + η_k τ_k (z_k − η̃_k u_k),   (6)

where τ_k = 1/(η_k + η̃_k + λ_k η_k η̃_k), m_k is used to approximate the full gradient as guaranteed by Theorem 1 in Sec. 4, and η̃_k approximates the inverse of the local smoothness parameter of F(z) around z_k. Here we use the regularizer ‖z − x_{k+1}‖²_{s_k} with the latest update x_{k+1} instead of x_k as the anchor point, since the latest update x_{k+1} often provides better regularization for the current optimization step.

Now we have used PPM to rigorously derive our Win-accelerated AdamW and Adam in Eqns. (3), (5) and (6). For clarity, we summarize their algorithmic steps in Algorithm 1, in which we omit the bias-correction term for simplicity. When λ = 0, it is Win-accelerated Adam; when λ > 0, it gives Win-accelerated AdamW. Generally, AdamW can greatly improve the generalization performance of Adam by simply adding a weight decay (i.e. the regularizer (λ/2)‖·‖²_{s_k}) into Adam, as observed in many works, e.g. (Loshchilov & Hutter, 2018; Touvron et al., 2021). Our Win acceleration is quite simple and efficient, since our accelerated AdamW/Adam only adds one extra simple algorithmic step, i.e.
the seventh step in Algorithm 1, on top of vanilla AdamW/Adam, and brings negligible extra computational overhead to the vanilla optimizer, e.g. about 2% ∼ 5% extra training time per iteration for AdamW evaluated on ViT-small and ViT-base. Moreover, for the only extra hyper-parameter, the reckless step η̃_k, in Algorithm 1 over AdamW/Adam, we always set it 2× larger than the conservative step η_k for all iterations, i.e. η̃_k = 2η_k, which works well in all our experiments.

Now we discuss the relations between the Nesterov-type acceleration (2) and our Win acceleration (6). For comparison, we introduce a virtual sequence y_{k+1} = z_k − η̃_k u_k in Win, and rewrite (6) as

x_{k+1} = (1 + λ_k η_k)^{−1} (x_k − η_k u_k),  y_{k+1} = z_k − η̃_k u_k,  z_{k+1} = η̃_k τ_k x_{k+1} + η_k τ_k y_{k+1},   (7)

where u_k is defined in (4). By comparing the Nesterov-type acceleration (2) with our Win acceleration (7), one can observe both similarities and differences. For similarity, both methods use a conservative step η_k and a reckless step η̃_k to update x_{k+1} and y_{k+1} respectively, and then linearly combine x_{k+1} and y_{k+1} to obtain z_{k+1}. For the differences, the first is that Win has a weight-decay-alike factor 1/(1 + λ_k η_k) in (7) which slightly decays the variable x_k, like AdamW, and also the update u_k, while Nesterov acceleration does not. Note that weight decay can greatly benefit generalization in practice, as shown in many works, e.g. (Loshchilov & Hutter, 2018; Touvron et al., 2021; Liu et al., 2021). The other difference is that for almost all acceleration techniques, including the Nesterov-type acceleration (2), the sum of the linear combination factors (e.g. ρ_k and 1 − ρ_k in (2)) is always one. In contrast, in Eqn. (7), Win uses η̃_k τ_k + η_k τ_k = 1 − λ_k η_k η̃_k / (η_k + η̃_k + λ_k η_k η̃_k) < 1 when λ_k > 0, which gives a second weight decay. Since both differences are caused by the weight decay, we call our acceleration "weight-decay-integrated Nesterov acceleration" (Win for short).
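Putting Eqns. (3)–(7) together, one iteration of Win-accelerated AdamW can be sketched as below. This is a minimal NumPy sketch of ours: all hyper-parameter values are illustrative, bias correction is omitted as in Algorithm 1, and setting lam = 0 recovers Win-accelerated Adam:

```python
import numpy as np

def win_adamw_step(x, z, m, v, g, eta, gamma=2.0, lam=0.05,
                   beta1=0.9, beta2=0.999, nu=1e-8):
    """One iteration of Win-accelerated AdamW. Returns updated (x, z, m, v)."""
    m = (1 - beta1) * m + beta1 * g              # first moment, Eqn. (3)
    v = (1 - beta2) * v + beta2 * g ** 2         # second moment, Eqn. (3)
    u = m / (np.sqrt(v) + nu)                    # u_k in Eqn. (4)
    eta_t = gamma * eta                          # reckless step (2x by default)
    x_new = (x - eta * u) / (1 + lam * eta)      # conservative update, Eqn. (5)
    tau = 1.0 / (eta + eta_t + lam * eta * eta_t)
    z_new = eta_t * tau * x_new + eta * tau * (z - eta_t * u)  # Eqn. (6)
    return x_new, z_new, m, v

# One hand-checkable step from x = z = 1 with m = v = 1 and gradient 1.
x, z, m, v = (np.ones(1) for _ in range(4))
x, z, m, v = win_adamw_step(x, z, m, v, np.ones(1), eta=0.1, lam=0.1)
```

Note how the same u_k is reused in both the conservative and the reckless update, so the extra cost over vanilla AdamW is a single linear combination per iteration.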

3.2. EXTENSION TO LAMB AND SGD

Here we generalize Win acceleration to LAMB (You et al., 2019) and SGD (Robbins & Monro, 1951). LAMB scales the update u_k of AdamW in Eqn. (4) so that u_k is of the same magnitude as the network weight x_k. That is, it changes the update rule x_{k+1} = (1 − λ_k η_k) x_k − η_k m_k/s_k of AdamW to x_{k+1} = x_k − η_k (‖x_k‖₂ / ‖r_k + λ_k x_k‖₂)(r_k + λ_k x_k), where r_k = m_k/s_k. This modification avoids overly large or small updates, improving optimization efficiency. To extend Win acceleration to LAMB, we inherit this scaling spirit and scale the update u_k in (4) to the following one:

u_k = (‖x_k‖₂ / ‖r_k + λ_k x_k‖₂)(r_k + λ_k x_k).   (8)

We scale m_k/s_k instead of (m_k/s_k + λ_k x_k) as in LAMB, since our scaling can be repeatedly used to update our two sequences x_k and z_k. Next, we can respectively follow Eqns. (5) and (6) to update the two sequences x_k and z_k. See the detailed steps of Win-accelerated LAMB in Algorithm 1, and the detailed comparison between LAMB and Win-accelerated LAMB in Appendix A.

Algorithm 1: Win-Accelerated AdamW, Adam and LAMB
Input: initialization x₀ = z₀ = 0, step sizes {(η_k, η̃_k)}_{k=0}^T, moment parameters {β₁, β₂}.
Output: (x̄, z̄) uniformly selected from {(x_k, z_k)}_{k=0}^T.
while k < T do
  g_k = (1/b) Σ_{i=1}^b ∇f(z_k; ζ_i)
  m_k = (1 − β₁) m_{k−1} + β₁ g_k   /* m₀ = g₀ */
  v_k = (1 − β₂) v_{k−1} + β₂ g_k²   /* v₀ = g₀² */
  u_k = m_k / (√v_k + ν) for AdamW and Adam;  u_k = (‖x_k‖₂ / ‖m_k/(√v_k + ν) + λ_k x_k‖₂)(m_k/(√v_k + ν) + λ_k x_k) for LAMB
  x_{k+1} = (1/(1 + λ_k η_k)) (x_k − η_k u_k)
  z_{k+1} = η̃_k τ_k x_{k+1} + η_k τ_k (z_k − η̃_k u_k) with τ_k = 1/(η_k + η̃_k + λ_k η_k η̃_k)
end while

For SGD, applying Win acceleration is quite direct. Specifically, the only algorithmic difference between SGD and AdamW on the ℓ_2-regularized problem is that SGD has no second-order moment v_k while AdamW has. So we can borrow the acceleration framework of AdamW in Sec. 3.1 to accelerate SGD by setting s_k = 1 ∈ R^d in Eqns.
(4), (5) and (6), and obtain Win-accelerated SGD:

m_k = β₁ m_{k−1} + β̄₁ g_k,  x_{k+1} = (1/(1 + λ_k η_k))(x_k − η_k m_k),  z_{k+1} = η̃_k τ_k x_{k+1} + η_k τ_k (z_k − η̃_k m_k),   (9)

where β̄₁ ∈ [0, 1] is a dampening parameter. Here we slightly modify the moment m_k to accord with the one used in Nesterov-accelerated SGD (e.g. SGD-M in PyTorch), whose updating steps are

m_k = β₁ m_{k−1} + β̄₁ (g_k + λ_k x_k),  x_{k+1} = (1 − λ_k η_k) x_k − η_k (g_k + β₁ m_k).   (10)

By comparing Win-accelerated SGD (9) and SGD-M (10), one can find big differences, mainly caused by their different acceleration strategies and ways of handling weight decay. Win-accelerated SGD is derived from PPM and the recently proposed acceleration (2), while SGD-M modifies an earlier Nesterov-type acceleration (Nesterov, 2003) (of the formulation m_k = β₁ m_{k−1} − (η_k/b) Σ_{i=1}^b ∇f(x_k + β₁ m_{k−1}; ζ_i) and x_{k+1} = x_k + m_k) to better train networks. See more on the mechanisms of the previous Nesterov acceleration and (10) in (Sutskever et al., 2013; Bengio et al., 2013).
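A minimal sketch of one Win-accelerated SGD iteration from Eqn. (9), with s_k = 1; the PyTorch-style momentum buffer and all hyper-parameter values below are illustrative assumptions of ours:

```python
import numpy as np

def win_sgd_step(x, z, m, g, eta, gamma=2.0, lam=0.0,
                 beta1=0.5, dampening=0.0):
    """One Win-accelerated SGD iteration: with s_k = 1 the AdamW-style
    updates reduce to momentum SGD plus the two-step combination."""
    m = beta1 * m + (1 - dampening) * g          # momentum buffer
    eta_t = gamma * eta                          # reckless step
    x_new = (x - eta * m) / (1 + lam * eta)      # conservative update
    tau = 1.0 / (eta + eta_t + lam * eta * eta_t)
    z_new = eta_t * tau * x_new + eta * tau * (z - eta_t * m)
    return x_new, z_new, m

# One hand-checkable step from x = z = 1, empty momentum, gradient 1.
x, z, m = np.ones(1), np.ones(1), np.zeros(1)
x, z, m = win_sgd_step(x, z, m, np.ones(1), eta=0.1)
```

With lam = 0 the combination weights η̃_kτ_k and η_kτ_k sum to one, matching the discussion below Eqn. (7) that the second weight decay appears only when λ_k > 0.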

4. CONVERGENCE ANALYSIS

Here we investigate the convergence of Win-accelerated algorithms, taking accelerated AdamW, Adam and SGD as examples, as these algorithms are the most widely used in deep learning. Moreover, since we aim to accelerate deep network training, which is a highly nonconvex problem, we focus on analyzing nonconvex problems to accord with the practical setting. For the analysis, we follow previous optimization works, e.g. (Kingma & Ba, 2014; Reddi et al., 2019; Duchi et al., 2011; Zhou et al., 2020b; 2021a; b; Xie et al., 2022), and introduce the necessary assumptions.

Assumption 1 (L-smoothness). We say a function f(z, ·) is L-smooth w.r.t. z if, for all z₁, z₂ and all ζ ∼ D, we have ‖∇f(z₁, ζ) − ∇f(z₂, ζ)‖₂ ≤ L‖z₁ − z₂‖₂ with a universal constant L.

Assumption 2 (Unbiased and bounded gradient estimation). The gradient estimate g_k is unbiased, i.e. for all k, E[g_k] = ∇F(z_k), and its magnitude and variance are bounded, namely, for all k, ‖g_k‖_∞ ≤ c_∞ and E[‖∇F(z_k) − g_k‖²] ≤ σ² with two universal constants c_∞ and σ.

Next, we first define a dynamic function F_k(z) at the k-th iteration, which is the real loss minimized by our algorithms. It combines the vanilla loss F(z) in (1) and a dynamic regularization (λ_k/2)‖z‖²_{s_k}:

F_k(z) = F(z) + (λ_k/2)‖z‖²_{s_k} = E_ζ[f(z; ζ)] + (λ_k/2)‖z‖²_{s_k},   (11)

where s_k is given in (4). To see why (11) arises, following the PPM spirit and Eqn. (5), one can approximate F(z) by its first-order Taylor expansion and obtain Eqn. (5) with x replaced by z, updating z_{k+1} = (1/(1 + λ_k η_k))(z_k − η_k m_k/s_k). Since λ_k η_k is very small, one can follow the discussion below Eqn. (5) and approximate z_{k+1} as z_{k+1} = (1 − λ_k η_k) z_k − η_k m_k/s_k, which is the update rule of AdamW. This is why our analysis of Win-accelerated AdamW involves the dynamic loss F_k(z) in (11). Note that for Win-accelerated Adam (λ_k = 0), F_k(z) degenerates to the vanilla loss F(z).
With these assumptions, we analyze the convergence behaviors of our accelerated algorithms on general nonconvex problems, and summarize the main results in Theorem 1, with proof in Appendix E.

Theorem 1. Suppose Assumptions 1 and 2 hold, and x* ∈ argmin_x F(x). Let η̃_k = γη_k with γ > 1, η_k = η ≤ O(ν^{1.25} b ε² / (c^{1.5} γ^{2.5} σ² L)), β₁ ≤ O(ν^{0.5} b ε² / (c σ²)), β₂ ∈ (0, 1), c = (c_∞² + ν)^{0.5}, λ_k = λ(1 − β₂ c_∞²/ν)^k (k > 0) and λ₀ = 0 with a constant λ > 0. Then after T = O(c_∞^{2.5} γ^{2.5} σ² L ∆ / (ν^{1.25} b ε⁴)) iterations with minibatch size b and ∆ = F(x₀) − F(x*), the sequence {(x_k, z_k)}_{k=0}^T generated by Win-accelerated AdamW and Adam in Algorithm 1 satisfies the following four properties.
a) The gradient ∇F_k(x_k) of the sequence {x_k}_{k=0}^T is upper bounded: (1/T) Σ_{k=0}^{T−1} E[‖∇F_k(x_k)‖₂² + (1/4)‖m_k + λ_k x_k‖²_{s_k}] ≤ ε².
b) The gradient moment m_k can well estimate the full gradients ∇F(x_k) and ∇F(z_k): (1/T) Σ_{k=0}^{T−1} max{E‖m_k − ∇F(x_k)‖₂², E‖m_k − ∇F(z_k)‖₂²} ≤ (16 + ν^{0.5} L/(2c)) ε².
c) The sequence {(x_k, z_k)} satisfies (1/T) Σ_{k=0}^{T−1} {E‖x_k − x_{k+1}‖²_{s_k}, E‖z_{k+1} − z_k‖₂², E‖z_k − x_k‖₂²} ≤ {4η² ε², ν^{1.5} β₁² ε² / (4c(1 − β₁)³ L²), ν^{0.5} ε² / (4cL)}.
d) The total stochastic gradient complexity to achieve the above three properties is O(c_∞^{2.5} ∆ σ² L / (ν^{1.25} ε⁴)).

Theorem 1 guarantees the convergence of Win-accelerated AdamW and Adam in Algorithm 1 on nonconvex problems. When λ_k > 0 (resp. λ_k = 0), Algorithm 1 corresponds to Win-accelerated AdamW (resp. Adam), and Theorem 1 holds in both cases. Theorem 1 a) shows that within T = O(c_∞^{2.5} ∆ σ² L / (ν^{1.25} b ε⁴)) iterations, the average gradient norm (1/T) Σ_{k=0}^{T−1} E‖∇F_k(x_k)‖₂² is upper bounded by ε², guaranteeing convergence. Theorem 1 b) indicates that the gradient moment m_k can well estimate the full gradients ∇F(z_k) and ∇F(x_k) because of their small distances, guaranteeing the quality of the Taylor approximations used in Eqns. (5) and (6).
Moreover, Theorem 1 c) shows that although Algorithm 1 uses a conservative step η_k and a reckless step η̃_k = γη_k (γ > 1), the two sequences x_{k+1} and z_{k+1} converge to each other, which could be key to the good convergence behavior of both Win-accelerated AdamW and Adam.

Now we discuss the stochastic gradient complexity of Win-accelerated Adam and AdamW. Theorem 1 d) shows that to find an ε-approximate first-order stationary point, both Win-accelerated Adam and AdamW have complexity O(c_∞^{2.5} ∆ σ² L / (ν^{1.25} ε⁴)), which matches the lower bound Ω(1/ε⁴) in (Arjevani et al., 2019; 2020) (up to constant factors) under the same Assumptions 1 and 2. Our accelerated Adam and AdamW enjoy superior complexity over Adam-type optimizers, e.g. Adam, AdaGrad (Duchi et al., 2011) and AdaBound (Luo et al., 2018), whose previously best known complexity under the same assumptions is O(c_∞² d σ² L / (ν^{1.25} ε⁴)) (Zhou et al., 2018; Chen et al., 2021; Guo et al., 2021). By comparison, both accelerated Adam and AdamW improve this complexity by a factor O(d / c_∞^{0.5}), where the network parameter dimension d is often much larger than c_∞^{0.5}, especially for over-parameterized modern networks. Since the convergence of AdamW has not been proved in the literature, we cannot directly compare with it. Moreover, the complexity of Win-accelerated Adam and AdamW is also lower than the complexity O(c_2⁶ σ² L / (ν² ε⁴)) of AdaBelief (Zhuang et al., 2020) and O(c_∞^{0.5} d^{0.5} σ² L / (ν ε⁴)) of RMSProp (Tijmen & Geoffrey, 2012; Zhou et al., 2018), especially on over-parameterized networks, since for a d-dimensional gradient, its ℓ_2-norm upper bound c_2 is often much larger than the ℓ_∞-norm bound c_∞ and can be √d× larger in the worst case.

Now we discuss the convergence of Win-accelerated SGD in Theorem 2.

Theorem 2. Suppose Assumptions 1 and 2 hold, and x* ∈ argmin_x F(x).
Let η̃_k = γη_k with γ > 1, η_k = η ≤ O(b ε² / (c^{1.5} γ^{2.5} σ² L)), β₁ ≤ O(b ε² / (c σ²)), β̄₁ = 1 − β₁, λ_k = λ, λ₀ = 0. After T = O(∆ σ² L / (b ε⁴)) iterations with minibatch size b and ∆ = F(x₀) − F(x*), the sequence {(x_k, z_k)}_{k=0}^T generated by Win-accelerated SGD in (9) satisfies the four properties in Theorem 1 with ν = c_∞ = c = 1 and s_k = 1 ∈ R^d. See its proof in Appendix F.

Theorem 2 guarantees the convergence of Win-accelerated SGD: under the hyper-parameter settings above, it enjoys the complexity O(L σ² / ε⁴), which also matches the lower bound Ω(1/ε⁴) in (Arjevani et al., 2019; 2020) (up to constant factors) under Assumptions 1 and 2.

5. EXPERIMENTS

Here we evaluate our accelerated algorithms on two representative tasks: vision classification and natural language modeling. For vision tasks, we test the accelerated algorithms on both CNNs, e.g. ResNet (He et al., 2016), and vision transformers (ViTs), e.g. ViT (Dosovitskiy et al., 2020) and PoolFormer (Yu et al., 2021; 2022). For language modeling tasks, we use LSTM (Schmidhuber et al., 1997) and Transformer-XL (Dai et al., 2019) for evaluation. For clarity, we call our accelerated algorithm "X-Win", where "X" denotes the vanilla optimizer, e.g. Adam. In all experiments, we do not change the model architectures or data augmentations, and only replace the default optimizer with ours. Moreover, for all experiments, our accelerated algorithms, e.g. AdamW-Win, always use the default optimizer-inherent hyper-parameters of the vanilla optimizers, e.g. the first- and second-order moment parameters β₁ and β₂ in AdamW, and their reckless step always satisfies η̃_k = 2η_k. These settings greatly reduce the parameter-tuning cost of our algorithms. As with other optimizers, we only slightly tune the other widely tuned hyper-parameters around the vanilla ones, e.g. step size and warm-up epochs, which is reasonable, as our accelerated algorithms have two step sizes, for which the vanilla values are not always suitable.

5.1. RESULTS ON VISION CLASSIFICATION TASKS

Results on ResNet18. Here we follow the conventional supervised training setting used for ResNets (He et al., 2016) and evaluate our accelerated algorithms on ImageNet (Fei-Fei, 2009). Due to limited space, we defer the hyper-parameter settings of the four accelerated algorithms in Table 1 to Appendix B. Table 1 shows that our accelerated algorithms improve their corresponding non-accelerated versions by a remarkable margin. For instance, AdamW-Win, Adam-Win and LAMB-Win respectively achieve 3.1%, 2.8% and 2.6% improvements over their non-accelerated counterparts AdamW, Adam and LAMB. Moreover, SGD-Win improves SGD-H (i.e. SGD + heavy ball) by 3.4%, and also surpasses SGD-M (Nesterov-accelerated SGD in Sec. 3.2) by 0.5%, further validating the superiority of our Win acceleration. Besides, our accelerated algorithms, i.e. SGD-Win, AdamW-Win and LAMB-Win, beat several other optimizers, e.g. AdaBound, RAdam (Liu et al., 2019), NAdam, Padam (Chen et al., 2021), AdaBelief and Yogi (Zaheer et al., 2018), among which NAdam uses Nesterov acceleration to estimate its first-order gradient moment. Actually, LAMB-Win sets a new SoTA top-1 accuracy on ResNet18. All these results show the strong compatibility and superiority of our Win acceleration in adaptive algorithms.

Results on ResNet50&101.

Here we adopt the training setting in (Wightman et al., 2021) to train ResNet50 & ResNet101, as this setting uses stronger data augmentation and largely improves CNN performance. See the augmentation details and our algorithmic hyper-parameter settings in Appendix B. Here LAMB is the default optimizer because of its higher performance than other optimizers under the stronger augmentations (Wightman et al., 2021). All optimizers in Table 2 are under this setting. Table 2 shows that our accelerated algorithms consistently outperform their corresponding non-accelerated versions across the three training-epoch settings on both ResNet50 and ResNet101. These improvements are non-trivial for the following two reasons. 1) Since the performance is already high and may approach the model's limit, it is very hard to make a large improvement. This is testified by the fact that in (Wightman et al., 2021), using LAMB to train ResNet50 for 600 epochs only gives 80.4% top-1 accuracy; in contrast, our accelerated LAMB-Win uses 300 epochs (half the training cost) to achieve 80.2%. 2) Comparing the previous optimizers, including SAM, SGD-M, Adam, AdamW and LAMB, one can observe a small accuracy gap (≤ 0.2%) between the best optimizer and the runner-up. For example, on ResNet101, the SoTA optimizer, i.e. SAM, only makes a 0.1% average improvement over the runner-up LAMB. All these comparisons show the non-trivial improvement of our accelerated algorithms over their counterparts.

Results on ViTs. We follow the widely used official training settings of ViTs (Touvron et al., 2021; Yu et al., 2021). To evaluate our accelerated algorithms, we select two popular and representative ViT architectures, ViT (Dosovitskiy et al., 2020) and PoolFormer (Yu et al., 2021). See the training setting and our hyper-parameter settings in Appendix B.
We test our accelerated algorithms under different model sizes and training epochs, and report the results in Table 3. One can find that since AdamW and LAMB use the decoupled weight decay, they enjoy better performance than SGD and Adam, as also observed in other works, e.g. (Xiao et al., 2021; Nado et al., 2021). Moreover, under different training settings, our accelerated algorithms consistently outperform the corresponding non-accelerated counterparts. Specifically, compared with the default AdamW optimizer on both ViT and PoolFormer, our accelerated AdamW-Win makes about 1.0%, 0.9% and 1.0% average improvement under the two training-epoch settings on ViT-S, ViT-B and PoolFormer-S12, respectively. For Adam-Win and LAMB-Win, one can also observe remarkable improvements on the three ViT backbones. Moreover, our accelerated SGD-Win also outperforms Nesterov-accelerated SGD, denoted "SGD-M", by non-trivial margins under all settings. All these results are consistent with the observations on ResNets, and together they demonstrate the advantage of our accelerated optimizers for deep network training. Table 4 shows the stable performance of AdamW-Win and LAMB-Win when tuning γ in a relatively large range, validating their robustness to the hyper-parameter γ.

Results on LSTM. We follow AdaBelief to test our accelerated algorithms by training three-layered LSTMs (Schmidhuber et al., 1997). From Table 5, one can observe that our Win-accelerated algorithms consistently surpass the corresponding non-accelerated counterparts, bringing a 1.2 overall average perplexity improvement over the four non-accelerated counterparts.

Results on Transformer-XL. We adopt a widely used language sequence model, Transformer-XL (Dai et al., 2019), to further evaluate our accelerated algorithms.
Since 1) Adam is the most popular and widely used optimizer for NLP models, including Transformer-XL, and 2) our limited resources cannot well tune the hyper-parameters of the other optimizers in Sec. 5.1, we take Adam as an example to show the superiority of our accelerated algorithms. Following the official setting of Transformer-XL-base, we use Adam-Win with the default hyper-parameters of Adam on the WikiText-103 dataset. See more details in Appendix B.

APPENDIX

The appendix is structured as follows. Appendix A provides more details of LAMB and Win-accelerated LAMB. Appendix B provides more experimental details, such as the hyper-parameter settings of the four accelerated algorithms and the official data augmentations. Appendix C defines some necessary notations for our analysis. Appendix D provides some auxiliary lemmas used throughout this document. Appendices E and F present the proofs of the convergence results, i.e., Theorems 1 and 2, respectively. Finally, Appendix G provides the proofs of the auxiliary lemmas in Appendix D.

A MORE DETAILS OF LAMB AND WIN-ACCELERATED LAMB

Here we introduce more details of vanilla LAMB (You et al., 2019) and our Win-accelerated LAMB. Specifically, Algorithms 2 and 3 respectively summarize the algorithmic steps of LAMB and Win-accelerated LAMB.
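To make the algorithmic steps concrete, the following is a minimal NumPy sketch of one Win-accelerated LAMB iteration. It is an illustration under two simplifying assumptions that are not from the paper's released code: the whole parameter vector is treated as a single LAMB "layer" for the trust ratio, and the moments `m`, `v` start at zero rather than at $g_0$ and $g_0^2$; hyper-parameter values are illustrative.

```python
import numpy as np

def win_lamb_step(x, z, m, v, grad, lr=0.05, gamma=2.0,
                  beta1=0.9, beta2=0.999, nu=1e-8, lam=1e-2):
    """One step of Win-accelerated LAMB (Algorithm 3), as a plain-NumPy sketch.

    Assumptions (not the authors' released implementation): the whole
    parameter vector is one LAMB "layer"; m, v start at zero.
    """
    eta, eta_bar = lr, gamma * lr            # conservative / reckless step sizes
    m = (1 - beta1) * m + beta1 * grad       # first-order moment
    v = (1 - beta2) * v + beta2 * grad**2    # second-order moment
    d = m / (np.sqrt(v) + nu) + lam * x      # raw LAMB direction
    trust = np.linalg.norm(x) / (np.linalg.norm(d) + 1e-12)
    u = trust * d                            # trust-ratio-scaled update u_k
    x_new = x - eta * u                      # conservative step (lambda_k = 0 here)
    tau = 1.0 / (eta + eta_bar)              # tau_k with lambda_k = 0
    # reckless step from z_k, then linear combination of the two updates
    z_new = eta_bar * tau * x_new + eta * tau * (z - eta_bar * u)
    return x_new, z_new, m, v
```

On a toy quadratic, iterating this step drives both sequences toward the minimizer, with the reckless step $\bar\eta_k = \gamma\eta_k$ providing the look-ahead.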

B MORE EXPERIMENTAL DETAILS

Due to space limitations, we defer the experimental details, such as the hyper-parameter settings of the four accelerated algorithms and the official augmentations in (He et al., 2016) and (Wightman et al., 2021), to this section. Our accelerated algorithms, including AdamW-Win, LAMB-Win, Adam-Win and SGD-Win, always share the default optimizer-inherent hyper-parameters of their vanilla counterparts, and the reckless step $\bar\eta_k$ is always $2\times$ larger than the conservative step $\eta_k$ for all iterations, i.e. $\bar\eta_k = 2\eta_k$. For AdamW-Win, LAMB-Win and Adam-Win, the first- and second-order moment parameters $\beta_1$ and $\beta_2$ are set to the default values $\beta_1 = 0.9$ and $\beta_2 = 0.999$ used in AdamW, LAMB and Adam. For LAMB-Win, its other key parameters, such as "grad averaging" and "trust clip", also adopt the defaults of vanilla LAMB. SGD-Win uses the default momentum parameter 0.9 and dampening parameter 0.0 of vanilla SGD.

Algorithm 2: LAMB in (You et al., 2019)
Input: initialization $x_0 = 0$, step sizes $\{\eta_k\}_{k=0}^{T}$, moment parameters $\{\beta_1, \beta_2\}$.
Output: $\bar x$ uniformly selected from $\{x_k\}_{k=0}^{T}$.
while $k < T$ do
  $g_k = \frac{1}{b}\sum_{i=1}^{b} \nabla f(x_k; \zeta_i)$
  $m_k = (1-\beta_1)m_{k-1} + \beta_1 g_k$  /* $m_0 = g_0$ */
  $v_k = (1-\beta_2)v_{k-1} + \beta_2 g_k^2$  /* $v_0 = g_0^2$ */
  $u_k = \frac{\|x_k\|_2}{\big\|\frac{m_k}{\sqrt{v_k}+\nu} + \lambda x_k\big\|_2}\Big(\frac{m_k}{\sqrt{v_k}+\nu} + \lambda x_k\Big)$
  $x_{k+1} = x_k - \eta_k u_k$
end while

Algorithm 3: Win-Accelerated LAMB
Input: initialization $x_0 = z_0 = 0$, step sizes $\{(\eta_k, \bar\eta_k)\}_{k=0}^{T}$, moment parameters $\{\beta_1, \beta_2\}$.
Output: $(\bar x, \bar z)$ uniformly selected from $\{(x_k, z_k)\}_{k=0}^{T}$.
while $k < T$ do
  $g_k = \frac{1}{b}\sum_{i=1}^{b} \nabla f(z_k; \zeta_i)$
  $m_k = (1-\beta_1)m_{k-1} + \beta_1 g_k$  /* $m_0 = g_0$ */
  $v_k = (1-\beta_2)v_{k-1} + \beta_2 g_k^2$  /* $v_0 = g_0^2$ */
  $u_k = \frac{\|x_k\|_2}{\big\|\frac{m_k}{\sqrt{v_k}+\nu} + \lambda_k x_k\big\|_2}\Big(\frac{m_k}{\sqrt{v_k}+\nu} + \lambda_k x_k\Big)$
  $x_{k+1} = \frac{1}{1+\lambda_k\eta_k}(x_k - \eta_k u_k)$, where $\lambda_k = 0$ here
  $z_{k+1} = \bar\eta_k\tau_k x_{k+1} + \eta_k\tau_k(z_k - \bar\eta_k u_k)$ with $\tau_k = \frac{1}{\eta_k + \bar\eta_k + \lambda_k\eta_k\bar\eta_k}$ and $\lambda_k = 0$ here
end while

Settings on ResNet18.
Here we follow the conventional supervised training setting of ResNets (He et al., 2016) and evaluate our accelerated algorithms on ImageNet (Fei-Fei, 2009). The data augmentation in (He et al., 2016) uses random crop and horizontal flipping with probability 0.5. We set the warm-up epochs to 5 for all four accelerated algorithms. We respectively set the base learning rate to $3\times 10^{-3}$, $5\times 10^{-3}$, $3\times 10^{-3}$ and 1.2 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win, and follow the default setting of cosine learning rate decay. We respectively set the weight decay to $5\times 10^{-2}$, $5\times 10^{-2}$, $10^{-6}$ and $10^{-3}$ for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win. On ResNet18, all algorithms are trained for 90 epochs with minibatch size 512, following the conventional setting. Settings on ResNet50&101. For these two networks, we use the "A2 training recipe" in (Wightman et al., 2021), since this setting uses stronger data augmentation and largely improves CNNs' performance. Specifically, the data augmentation in (Wightman et al., 2021) uses random crop, horizontal flipping, Mixup with parameter 0.1 (Zhang et al., 2018), CutMix with parameter 1.0 and probability 0.5 (Yun et al., 2019), and RandAugment (Cubuk et al., 2020) with M = 7, N = 2 and MSTD = 0.5. Moreover, it uses the binary cross-entropy (BCE) loss for training. On both ResNet50 and ResNet101, we respectively set the base learning rate to $2\times 10^{-3}$, $8\times 10^{-3}$, $1\times 10^{-3}$ and 0.8 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win, and follow the default cosine learning rate decay. The weight decay is respectively $5\times 10^{-2}$, $2\times 10^{-2}$, $10^{-5}$ and $5\times 10^{-4}$, and the warm-up epoch number is respectively 5, 5, 10 and 5 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win. Settings on ViT and PoolFormer.
We follow the widely used official training setting of ViTs (Touvron et al., 2021; Yu et al., 2021). In this setting, data augmentation includes random crop, horizontal flipping, Mixup with parameter 0.8 (Zhang et al., 2018), CutMix with parameter 1.0 and probability 0.5 (Yun et al., 2019), RandAugment (Cubuk et al., 2020) with M = 9, N = 2 and MSTD = 0.5, and Random Erasing with probability p = 0.25. For the training loss, we use the cross-entropy loss. On both ViT-S and ViT-B, we respectively set the base learning rate to $2\times 10^{-3}$, $5\times 10^{-3}$, $1\times 10^{-4}$ and 0.8 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win, and follow the default cosine learning rate decay. The weight decay is respectively $5\times 10^{-2}$, $2\times 10^{-2}$, $10^{-5}$ and $5\times 10^{-4}$, and the warm-up epoch number is respectively 5, 60, 30 and 5 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win. For AdamW-Win, following the default setting of AdamW, the minibatch size is 1024 for ViT-S and 512 for ViT-B; for all other accelerated optimizers, the minibatch size is always 1024. Settings on LSTM. On LSTM, we respectively set the base learning rate to $1\times 10^{-3}$, $1\times 10^{-2}$, $1\times 10^{-2}$ and 15.0 for AdamW-Win, LAMB-Win, Adam-Win and SGD-Win, and follow the default setting of dividing the learning rate by 10 at epochs 100 and 145. The weight decay is respectively $2\times 10^{-2}$, $5\times 10^{-2}$, $1.8\times 10^{-6}$ and $2\times 10^{-5}$ for the four algorithms. We do not use the warm-up strategy in this experiment, and set the minibatch size to 20 following the default setting. Settings on Transformer-XL. On Transformer-XL, we set the base learning rate to $4\times 10^{-4}$ for Adam-Win, follow the default cosine learning rate decay, and set the weight decay to $10^{-6}$ for Adam-Win.
For warm-up steps, we set it as 2000. Following the default setting, we set the minibatch size to 60 × 4. Test accuracy curves of SGD-Win and Adam-Win on ResNet18. Fig. 3 plots the test accuracy curves of SGD-Win and Adam-Win on ResNet18. Consistent with the analysis in the manuscript, SGD-Win and Adam-Win converge faster than their non-accelerated counterparts in terms of test accuracy, which could explain their higher performance under the same computational cost.
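The warm-up plus cosine-decay learning-rate schedule used across the recipes above can be sketched as follows. This is a generic illustration, not the exact schedule of any particular framework, and the function name `lr_at` is ours.

```python
import math

def lr_at(epoch, total_epochs, base_lr, warmup_epochs):
    """Cosine learning-rate decay with linear warm-up (a generic sketch;
    exact framework schedules may differ slightly)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs          # linear warm-up
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))      # cosine decay
```

For instance, with `base_lr=2e-3`, `warmup_epochs=5` and 300 total epochs (the AdamW-Win ResNet50 setting above), the rate ramps up linearly for 5 epochs, reaches the base value, then decays smoothly to zero.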

C NOTATIONS

Here we first give some important notations used in this document. For brevity, we let $s_k = \sqrt{v_k + \nu}$. Since Lemma 3 (see Appendix D) gives $\|m_k\|_\infty \le c_\infty$ and $\nu \le \|v_k + \nu\|_\infty \le c_\infty^2 + \nu$, for brevity we let $c_1 := \nu^{0.5} \le \|s_k\|_\infty \le c_2 := (c_\infty^2 + \nu)^{0.5}$.

Also we define

$$w_k := m_k + \lambda_k x_k * s_k, \qquad x_{k+1} - x_k = -\frac{\eta_k}{1+\lambda_k\eta_k}\cdot\frac{m_k + \lambda_k x_k * s_k}{s_k} = -\frac{\eta_k}{1+\lambda_k\eta_k}\cdot\frac{w_k}{s_k}.$$

Next, we introduce a virtual sequence $\{y_k\}$ into the algorithm. In this way, we can rewrite the update steps of Algorithm 1 in the manuscript in the equivalent form (12):

$$g_k = \tfrac{1}{b}\textstyle\sum_{i=1}^{b}\nabla f(z_k;\zeta_i); \quad m_k = (1-\beta_1)m_{k-1} + \beta_1 g_k; \quad v_k = (1-\beta_2)v_{k-1} + \beta_2 g_k^2;$$
$$x_{k+1} = \tfrac{1}{1+\lambda_k\eta_k}\Big(x_k - \eta_k\tfrac{m_k}{s_k}\Big); \quad y_{k+1} = z_k - \bar\eta_k\tfrac{m_k}{s_k}; \quad z_{k+1} = \bar\eta_k\tau_k x_{k+1} + \eta_k\tau_k y_{k+1}, \qquad (12)$$

where $m_0 = g_0$ and $v_0 = g_0^2$. For the analysis, we further define

$$F_k(\theta) = F(\theta) + \tfrac{\lambda_k}{2}\|\theta\|_{s_k}^2 = \mathbb{E}_\zeta[f(\theta;\zeta)] + \tfrac{\lambda_k}{2}\|\theta\|_{s_k}^2, \qquad (13)$$

where $\lambda_k = \lambda(1-\mu)^k$ in which $\mu = \frac{\beta_2 c_\infty^2}{\nu}$. In the following, we mainly use these notations to complete our proofs.
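The update (12) can be made concrete with a minimal NumPy sketch of one Win-accelerated AdamW step. This is a hypothetical standalone illustration, not the authors' released code; as a simplification, `m` and `v` start at zero here rather than at $g_0$ and $g_0^2$, and the hyper-parameter defaults are illustrative.

```python
import numpy as np

def adamw_win_step(x, z, m, v, grad, lr=0.01, gamma=2.0,
                   beta1=0.9, beta2=0.999, nu=1e-8, lam=5e-2):
    """One Win-accelerated AdamW update following Eqn. (12) (sketch only)."""
    eta, eta_bar = lr, gamma * lr            # conservative / reckless step sizes
    m = (1 - beta1) * m + beta1 * grad       # paper's convention: beta1 weights the new gradient
    v = (1 - beta2) * v + beta2 * grad**2
    s = np.sqrt(v + nu)                      # adaptive scaling s_k = sqrt(v_k + nu)
    x_new = (x - eta * m / s) / (1 + lam * eta)        # conservative step with decoupled decay
    y_new = z - eta_bar * m / s                        # reckless step from z_k
    tau = 1.0 / (eta + eta_bar + lam * eta * eta_bar)  # tau_k
    z_new = eta_bar * tau * x_new + eta * tau * y_new  # linear combination of the two updates
    return x_new, z_new, m, v
```

Gradients are evaluated at $z_k$, matching (12): the conservative step updates $x$, the reckless step updates the virtual point $y$, and $z$ mixes the two.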

D AUXILIARY LEMMAS

Before giving our analysis, we first provide some important lemmas.

Lemma 3. Suppose the sequence $\{x_k, y_k, z_k\}$ is updated by Eqn. (12), i.e., $x_{k+1} = \frac{1}{1+\lambda_k\eta_k}\big(x_k - \eta_k\frac{m_k}{s_k}\big)$, $y_{k+1} = z_k - \bar\eta_k\frac{m_k}{s_k}$, $z_{k+1} = \bar\eta_k\tau_k x_{k+1} + \eta_k\tau_k y_{k+1}$, with $s_k = \sqrt{v_k+\nu}$. Assume $c_{s,\infty} \le \|g_k\|_\infty \le c_\infty$. Then $\{(m_k, s_k)\}$ satisfies
$$\|m_k\|_\infty \le c_\infty, \qquad \|v_k + \nu\|_\infty \le c_\infty^2 + \nu, \qquad 1 - \frac{\beta_2 c_\infty^2}{2(c_{s,\infty}^2+\nu)} \le \Big\|\frac{s_k}{s_{k+1}}\Big\|_\infty < 1 + \frac{\beta_2 c_\infty^2}{2(c_{s,\infty}^2+\nu)}.$$
See its proof in Appendix G.1.

Lemma 4. (Xie et al., 2022) Suppose the sequence $\{x_k, y_k, z_k\}$ is updated by Eqn. (12) as above. Then $\{z_k\}$ satisfies
$$\mathbb{E}\|m_k - \nabla F(z_k)\|^2 \le (1-\beta_1)\mathbb{E}\|m_{k-1} - \nabla F(z_{k-1})\|^2 + \frac{(1-\beta_1)^2 L^2}{\beta_1}\mathbb{E}\|z_k - z_{k-1}\|^2 + \frac{\beta_1^2\sigma^2}{b}.$$

Lemma 5. Suppose the sequence $\{x_k, y_k, z_k\}$ is updated by Eqn. (12) as above. By setting $\eta_k = \eta$, $\bar\eta_k = \bar\eta$, $\beta_{1,k} = \beta_1$ and $\beta_{2,k} = \beta_2$, we have
$$y_{k+1} - (1+\lambda_k\bar\eta)x_{k+1} = -\rho_{k+1}\sum_{i=0}^{k}\frac{1}{\rho_{i+1}}\cdot\frac{\bar\eta-\eta}{1+\lambda_i\eta}\cdot\frac{w_i}{s_i},$$
$$\|y_{k+1} - (1+\lambda_k\bar\eta)x_{k+1}\|^2 \le \rho_{k+1}(\bar\eta-\eta)^2\sum_{i=0}^{k}\frac{1}{\rho_{i+1}(1-\bar\eta\tau_{i-1})(1+\lambda_i\eta)^2}\Big\|\frac{w_i}{s_i}\Big\|^2,$$
$$\|z_{k+1} - x_{k+1}\|^2 \le \tau_k\rho_{k+1}\eta(\bar\eta-\eta)^2\sum_{i=0}^{k}\frac{1}{\rho_{i+1}(1-\bar\eta\tau_{i-1})(1+\lambda_i\eta)^2}\Big\|\frac{w_i}{s_i}\Big\|^2,$$
$$\|z_{k+1} - z_k\|^2 \le \frac{2\eta^2}{(1+\lambda_k\eta)^2}\Big\|\frac{w_k}{s_k}\Big\|^2 + 2\rho_{k+1}\bar\eta^2(\bar\eta-\eta)^2\tau_k^2(1+\lambda_k\eta)^2\sum_{i=0}^{k}\frac{1}{\rho_{i+1}(1-\bar\eta\tau_{i-1})(1+\lambda_i\eta)^2}\Big\|\frac{w_i}{s_i}\Big\|^2,$$
where $\rho_{k+1} = \bar\eta\tau_{k-1}\rho_k$, $\rho_1 = 1$ and $\rho_0 = 0$. See its proof in Appendix G.2.

Lemma 6. Suppose the sequence $\{x_k, y_k, z_k\}$ is updated by Eqn. (12) as above.
By setting η k = η, ηk = η, β 1,k = β 1 and β 2,k = β 2 , then we have E m k -∇F (x k ) 2 ≤2(1 -β 1 )E m k-1 -∇F (z k-1 ) 2 + 2Π k (1 -β 1 ) 2 L 2 β 1 + 2β 2 1 σ 2 b + 2LΠ k , where Π k := 2η 2 (1 + λ k-1 η) 2 w k-1 s k-1 2 + 2ρ k η2 (η -η) 2 τ 2 k-1 (1 + λ k-1 η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 , Π k :=τ k-1 ρ k η(η -η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 , where ρ k+1 = ητ k-1 ρ k , ρ 1 = 1 and ρ 0 = 0. see its proof in Appendix G.3. E PROOF OF THEOREM 1 Proof. Recall our definition F k (z k ) = F (z) + λ k 2 z 2 s k = E ζ [f (z; ζ)] + λ k 2 z 2 s k , in the (13). By using the smoothness of f (θ; ζ), we can obtain F k+1 (x k+1 ) ≤F (x k ) + ∇F (x k ), x k+1 -x k + L 2 x k+1 -x k 2 + λ k+1 2 x k+1 2 s k+1 x ≤F (x k ) + ∇F (x k ), x k+1 -x k + L 2 x k+1 -x k 2 + λ k+1 2(1 -µ) x k+1 2 s k y ≤F (x k ) + λ k 2 x k 2 s k + ∇F (x k ) + λ k x k * s k , x k+1 -x k + L 2 x k+1 -x k 2 + λ k 2 x k+1 -x k 2 s k =F k (x k ) - η k 1 + λ k η k ∇F (x k ) + λ k x k * s k , w k s k + Lη 2 k 2(1 + λ k η k ) 2 w k s k 2 + λ k η 2 k 2(1 + λ k η k ) 2 w k s k 2 s k =F k (x k ) + 1 2 η k (1 + λ k η k )s k (∇F (x k ) + λ k x k * s k -w k ) 2 - 1 2 η k (1 + λ k η k )s k (∇F (x k ) + λ k x k * s k ) 2 - 1 2 η k (1 + λ k η k )s k w k 2 + Lη 2 k 2(1 + λ k η k ) 2 w k s k 2 + λ k η 2 k 2(1 + λ k η k ) 2 w k s k 2 s k z ≤F k (x k ) + η k 2c 1 (1 + λ k η k ) ∇F (x k ) -m k 2 - η k 2c 2 (1 + λ k η k ) ∇F k (x k ) 2 - η k 2c 2 (1 + λ k η k ) 1 - c 2 Lη k c 2 1 (1 + λ k η k ) - c 2 λ k η k c 1 (1 + λ k η k ) w k 2 { ≤F k (x k ) + η k 2c 1 (1 + λ k η k ) ∇F (x k ) -m k 2 - η k 2c 2 (1 + λ k η k ) ∇F k (x k ) 2 - η k 4c 2 (1 + λ k η k ) w k 2 , where x holds since Lemma 3 proves s k s k+1 ∞ ∈ [1 -µ, 1 + µ] (∀p ∈ [0, 1]) in which µ = β2c 2 ∞ ν ; y holds because λ k = λ k+1 1-µ and x k+1 2 s k = x k 2 s k + x k+1 -x k 2 s k + 2 x k+1 -x k , x k s k . 
z holds, because w k := m k + λ k x k * s k , x k+1 -x k = - η k 1 + λ k η k m k + λ k x k * s k s k = - η k 1 + λ k η k w k s k , c 1 := ν 0.5 ≤ s k ∞ ≤ c 2 := (c 2 ∞ + ν) 0.5 . { holds, since we set η k ≤ c 2 1 (1+λ k η k ) 2c2(L+λ k c1) such that c2Lη k c 2 1 (1+λ k η k ) + c2λ k η k c1(1+λ k η k ) ≤ 1 2 . From Lemma 6, by setting η k = η, ηk = η and β 1,k = β 1 , we have E m k -∇F (x k ) 2 ≤2(1 -β 1 )E m k-1 -∇F (z k-1 ) 2 + 2Π k (1 -β 1 ) 2 L 2 β 1 + 2β 2 1 σ 2 b + 2LΠ k , where Π k := 2η 2 (1 + λ k-1 η) 2 w k-1 s k-1 2 + 2ρ k η2 (η -η) 2 τ 2 k-1 (1 + λ k-1 η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 , Π k :=τ k-1 ρ k η(η -η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 , Here ρ k+1 = ητ k-1 ρ k , ρ 1 = 1 and ρ 0 = 0. By considering c 2 ≥ s k ∞ ≥ c 1 , we have Π k ≤ Πk := 2η 2 c 2 1 (1 + λ k-1 η) 2 w k-1 2 + 2ρ k η2 (η -η) 2 τ 2 k-1 (1 + λ k-1 η) 2 c 2 1 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i 2 , Π k ≤ Π k := τ k-1 ρ k η(η -η) 2 c 2 1 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i 2 , Therefore, by plugging the results in Eqn. ( 14) into the upper bound of F k+1 (x k+1 ), we have F k+1 (x k+1 ) ≤F k (x k ) - η 2c 2 (1 + λ k η) ∇F k (x k ) 2 - η 4c 2 (1 + λ k η) w k 2 + η(1 -β 1 ) c 1 (1 + λ k η) E m k-1 -∇F (z k-1 ) 2 + η Πk (1 -β 1 ) 2 L 2 c 1 β 1 (1 + λ k η) + ηβ 2 1 σ 2 c 1 (1 + λ k η)b + ηL Π k c 1 (1 + λ k η) x ≤F k (x k ) - η 2c 2 (1 + λ k η) ∇F k (x k ) 2 - η 4c 2 (1 + λ k η) w k 2 + η(1 -β 1 ) c 1 E m k-1 -∇F (z k-1 ) 2 + η Πk (1 -β 1 ) 2 L 2 c 1 β 1 (1 + λ k η) + ηβ 2 1 σ 2 c 1 (1 + λ k η)b + ηL Π k c 1 (1 + λ k η) , where x uses the fact that 0 < λ k ≤ λ. Then, from Lemma 4, we have E m k -∇F (z k ) 2 ≤(1 -β 1 )E m k-1 -∇F (z k-1 ) 2 + (1 -β 1 ) 2 L 2 β 1 E z k -z k-1 2 + β 2 1 σ 2 b x ≤(1 -β 1 )E m k-1 -∇F (z k-1 ) 2 + (1 -β 1 ) 2 L 2 Πk β 1 + β 2 1 σ 2 b (18) where we use the results in Lemma 5 that z k -z k-1 2 ≤Π k ≤ Πk . Then we add Eqn. 
(17) and α× (18) as follows: F k+1 (x k+1 ) + αE m k -∇F (z k ) 2 ≤F k (x k ) - η 2c 2 (1 + λ k η) ∇F k (x k ) 2 - η 4c 2 (1 + λ k η) w k 2 + (1 -β 1 ) η c 1 + α E m k-1 -∇F (z k-1 ) 2 + η Πk (1 -β 1 ) 2 L 2 c 1 β 1 (1 + λ k η) + ηβ 2 1 σ 2 c 1 (1 + λ k η)b + ηL Π k c 1 (1 + λ k η) + α(1 -β 1 ) 2 L 2 Πk β 1 + αβ 2 1 σ 2 b (19) Then by setting α = η(1-β1) c1β1 and G k+1 (x k+1 ) = F k+1 (x k+1 )+ η(1-β1) c1β1 E m k -∇F (x k ) 2 = E ζ [f (z; ζ)] + λ k 2 z 2 s k + η(1-β1) c1β1 E m k -∇F (x k ) 2 , we can obtain G k+1 (x k+1 ) ≤G k (x k ) - η 2c 2 (1 + λ k η) ∇F k (x k ) 2 - η 4c 2 (1 + λ k η) w k 2 + η Πk (1 -β 1 ) 2 L 2 c 1 β 1 (1 + λ k η) + ηβ 2 1 σ 2 c 1 (1 + λ k η)b + ηL Π k c 1 (1 + λ k η) + η(1 -β 1 ) 3 L 2 Πk c 1 β 2 1 + η(1 -β 1 )β 1 σ 2 c 1 b x ≤G k (x k ) - η 2c 2 (1 + λ k η) ∇F k (x k ) 2 - η 4c 2 (1 + λ k η) w k 2 + η(1 -β 1 ) 2 L 2 Πk c 1 β 2 1 + ηL Π k c 1 (1 + λ k η) + ηβ 1 σ 2 c 1 b , where x uses the fact that 0 < λ k ≤ λ. Then summing the above inequality from k = 0 to k = T -1 and using 0 < λ k ≤ λ give 1 T T -1 k=0 E ∇F k (x k ) 2 + 1 2 w k 2 ≤ 2c 2 (1 + λη) ηT [G(x 0 ) -G(x T )] + 2c 2 β 1 σ 2 (1 + λη) c 1 bT + 2c 2 β 2 1 σ 2 c 1 b + 2c 2 (1 -β 1 ) 2 L 2 (1 + λη) c 1 β 2 1 T T -1 k=0 Πk + 2c 2 L c 1 T T -1 k=0 Π k ≤ 2c 2 (1 + λη)∆ ηT + 2c 2 β 1 σ 2 (1 + λη) c 1 b + 2c 2 (1 -β 1 ) 2 L 2 (1 + λη) c 1 β 2 1 T T -1 k=0 Πk + 2c 2 L c 1 T T -1 k=0 Π k where G(x 0 ) -G(x T ) =F 0 (x 0 ) + η(1 -β 1 ) c 1 β 1 E m -1 -∇F (x -1 ) 2 -F T (x T ) - η(1 -β 1 ) c 1 β 1 E m T -1 -∇F (x T -1 ) 2 =F (x 0 ) + λ 0 x 0 s0 -F (x T ) -λ T x T s T - η(1 -β 1 ) c 1 β 1 E m T -1 -∇F (x T -1 ) 2 where ∆ = F (x 0 ) -F (x ); x -1 and m -1 are two virtual points which satisfy m -1 = ∇F (x -1 ).

Now we try to bound

T -1 k=0 Πk and T -1 k=0 Πk . Firstly, we have T -1 k=0 Πk = T -1 k=0 2η 2 c 2 1 (1 + λ k-1 η) 2 w k-1 2 + 2ρ k η2 (η -η) 2 τ 2 k-1 (1 + λ k-1 η) 2 c 2 1 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i 2 x ≤ 2η 2 c 2 1 T -1 k=0 w k-1 2 + 2η 2 (η -η) 2 c 2 1 T -1 k=0 ρ k τ 2 k-1 (1 + λ k-1 η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i 2 = 2η 2 c 2 1 T -1 k=0 w k-1 2 + 2η 2 (η -η) 2 c 2 1 T -1 k=0 1 ρ k+1 (1 -ητ k-1 )(1 + λ k η) 2 w k 2 T -1 i=k ρ i τ 2 i-1 (1 + λ i-1 η) 2 y ≤ 2η 2 c 2 1 T -1 k=0 w k-1 2 + 2a 2 η2 (η -η) 2 c 2 1 (1 -ητ ) T -1 k=0 1 ρ k+1 w k 2 T -1 i=k ρ i τ 2 i-1 = 2η 2 c 2 1 T -1 k=0 w k-1 2 + 2a 2 η2 (η -η) 2 τ c 2 1 η(1 -ητ ) 2 T -1 k=0 w k 2 z ≤ 2η 2 c 2 1 T -1 k=0 w k-1 2 + 2a 2 η2 (η -η) 2 τ c 2 1 η(1 -ητ ) 2 T -1 k=0 w k 2 { ≤ 2γ 2 η 2 c 2 1 1 + a 2 (1 + γ)(γ -1) 2 T -1 k=0 w k-1 2 ≤ 2γ 2 η 2 c 2 1 1 + a 2 γ 3 T -1 k=0 w k-1 2 , where x holds since 0 ≤ λ k ≤ λ; y holds, since 1) for i ≥ k we have 1+λi-1η 1+λ k η ≤ 1+λ k-1 η 1+λ k η = 1+λ k-1 η 1+(1-µ)λ k-1 η ≤ 1+λη 1+(1-µ)λη = a ≤ 1 1-µ and 2) 1 1-ητi-1 = η+η+λi-1 ηη η+λi-1 ηη = 1 + η η+λi-1 ηη ≤ 1 + η η = 1 1-ητ whose minimum is at λ i-1 = 0 and τ = 1 η+η ; z holds, since T -1 i=k ρ i τ 2 i-1 = 1 η T -1 i=k ρ i+1 τ i-1 ≤ τ η T -1 i=k ρ i+1 ≤ τ η ρ k+1 (1-η T -k τ T -k ) 1-ητ ≤ τ ρ k+1 η(1-ητ ) ; { holds by setting η = γη. Similarly, we can bound T -1 k=0 Π k = T -1 k=0 τ k-1 ρ k η(η -η) 2 c 2 1 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i 2 ≤ τ η(η -η) 2 c 2 1 (1 -ητ ) T -1 k=0 ρ k k-1 i=0 1 ρ i+1 w i 2 ≤ τ η(η -η) 2 c 2 1 (1 -ητ ) T -1 k=0 w k 2 ρ k+1 T -1 i=k ρ i x ≤ (η -η) 2 c 2 1 (1 -ητ ) 2 T -1 k=0 w k 2 ≤ η 2 γ 2 (γ -1) 2 c 2 1 (1 + γ) 2 T -1 k=0 w k 2 ≤ η 2 (γ -1) 2 c 2 1 T -1 k=0 w k 2 (20) where x holds since 1) ρ k+1 = ητ k-1 ρ k ≤ ητ ρ k and ρ 1 = 1 and 2) T -1 i=k ρ i ≤ ρ k (1-η T -k τ T -k ) 1-ητ ≤ ρ k 1-ητ which together give 1 ρ k+1 T -1 i=k ρ i ≤ 1 ρ k+1 ρ k 1-ητ ≤ 1 ητ 1 1-ητ ≤ 1 ητ (1-ητ ) . 
Therefore, we have 1 T T -1 k=0 E ∇F k (x k ) 2 + 1 2 w k 2 ≤ 2c 2 (1 + λη)∆ ηT + 2c 2 β 1 σ 2 (1 + λη) c 1 b + 4c 2 γ 2 η 2 (1 -β 1 ) 2 L 2 (1 + λη)(1 + a 2 γ 3 ) c 3 1 β 2 1 T T -1 k=0 w k-1 2 + 2c 2 η 2 L(γ -1) 2 c 3 1 T T -1 k=0 w k 2 x ≤ 2c 2 (1 + λη)∆ ηT + 2c 2 β 1 σ 2 (1 + λη) c 1 b + 1 4T T -1 k=0 w k 2 where x holds since we choose proper η and β 1 such that 4c 2 γ 2 η 2 (1 -β 1 ) 2 L 2 (1 + λη)(1 + a 2 γ 3 ) c 3 1 β 2 1 ≤ 1 8 2c 2 η 2 L(γ -1) 2 c 3 1 ≤ 1 8 where x uses γ 3 (γ-1) 2 (1+λτ ) (1+γ) 5 ≤ 1 + γτ = 1 + γη(γ + 1) and y uses γ 3 (γ-1) 2 (1+γ) 4 < γ. Now we select η and β 1 such that (21) holds: η ≤ min c 1.5 1 β 1 4 √ 2c 0.5 2 γ(1 -β 1 )L(1 + λη) 0.5 (1 + a 2 γ 3 ) 0.5 , c 1.5 1 4c 0.5 2 L 0.5 (γ -1) So we arrive at 1 T T -1 k=0 E ∇F k (x k ) 2 + 1 4 w k 2 ≤ 2c 2 (1 + λη)∆ ηT + 2c 2 β 1 (1 + λη)σ 2 c 1 b x ≤ 2 , ( ) where we set T ≥ 4c2(1+λη)∆ η 2 and β 1 ≤ c1b 2 4c2(1+λη)σ 2 . This result directly bounds 1 T T -1 k=0 s k * (x k -x k+1 ) 2 = η 2 T T -1 k=0 1 (1 + λ k η) 2 m k + λx k * s k 2 ≤ η 2 T T -1 k=0 w k 2 ≤ η 2 2 . Moreover, from Lemma 5, we have 1 T T -1 k=0 y k -(1 + λ k-1 η)x k 2 x ≤ 1 T T -1 k=0 ρ k (η -η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 z = 1 T T -1 k=0 Π k , 1 T T -1 k=0 z k -x k 2 x ≤ 1 T T -1 k=0 τ k-1 ρ k η(η -η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 y = 1 T T -1 k=0 Π k , 1 T T -1 k=0 z k+1 -z k 2 x ≤ 1 T T -1 k=0 2η 2 (1 + λ k η) 2 + 2ρ k+1 η2 (η -η) 2 τ 2 k (1 + λ k η) 2 k i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w k s k 2 y ≤ 1 T T -1 k=0 Π k where ρ k+1 = ητ k-1 ρ k , ρ 1 = 1 and ρ 0 = 0. where x holds by using Lemma 5; y holds by using the definition in Eqn. (15); z holds by defining: Π k :=ρ k (η -η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 .

The remaining task is now to upper bound

1 T T -1 k=0 Π k , 1 T T -1 k=0 Π k and 1 T T -1 k=0 Π k . Here we first bound 1 T T -1 k=0 Π k by using almost the same proof in Eqn. ( 20): 1 T T -1 k=0 Π k x ≤ T -1 k=0 ρ k (η -η) 2 c 2 1 T k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i 2 ≤ (η -η) 2 c 2 1 (1 -ητ )T T -1 k=0 ρ k k-1 i=0 1 ρ i+1 w i 2 ≤ (η -η) 2 c 2 1 (1 -ητ )T T -1 k=0 w k 2 ρ k+1 T -1 i=k ρ i y ≤ (η -η) 2 c 2 1 ητ (1 -ητ ) 2 T T -1 k=0 w k 2 ≤ η 2 γ 2 (γ -1) 2 c 2 1 (1 + γ)T T -1 k=0 w k 2 ≤ η 2 γ(γ -1) 2 c 2 1 T T -1 k=0 w k 2 z ≤ c 1 γ 16c 2 L 4 2 = c 1 γ 2 4c 2 L where x holds since 22); 2) we use the results in Eqn. ( 21) to obtain 1 1-ητi-1 = η+η+λi-1 ηη η+λi-1 ηη = 1 + η η+λi-1 ηη ≤ 1 + η η = 1 1-ητ whose minimum is at λ i-1 = 0 and τ = 1 η+η ; y holds since 1) ρ k+1 = ητ k-1 ρ k ≤ ητ ρ k and ρ 1 = 1 and 2) T -1 i=k ρ i ≤ ρ k (1-η T -k τ T -k ) 1-ητ ≤ ρ k 1-ητ which together give 1 ρ k+1 T -1 i=k ρ i ≤ 1 ρ k+1 ρ k 1-ητ ≤ 1 ητ 1 1-ητ ≤ 1 ητ (1-ητ ) ; z holds by using 1) 1 T T -1 k=0 E w k 2 ≤ 4 2 in Eqn. ( η 2 γ(γ -1) 2 c 2 1 ≤ γ(γ -1) 2 c 2 1 c 3 1 16c 2 L(γ -1) 2 ≤ c 1 γ 16c 2 L . From the bound in Eqn. ( 16) and the following bound on 1 T T -1 k=0 Πk and 1 T T -1 k=0 Π k , we have 1 T T -1 k=0 Π k ≤ 1 T T -1 k=0 Πk ≤ 2γ 2 η 2 c 2 1 T 1 + a 2 γ 3 T -1 k=0 E w k 2 x ≤ c 1 β 2 1 2 4c 2 (1 -β 1 ) 2 L 2 (1 + λη) 1 T T -1 k=0 Π k ≤ 1 T T -1 k=0 Π k ≤ η 2 (γ -1) 2 c 2 1 T T -1 k=0 E w k 2 x ≤ c 1 2 4c 2 L where x holds, since 1) 1 T T -1 k=0 E w k 2 ≤ 4 2 ; 2) we use the results in Eqn. (21) to obtain 2γ 2 η 2 c 2 1 1 + a 2 γ 3 ≤ 2γ 2 c 2 1 1 + a 2 γ 3 c 3 1 β 2 1 32c 2 γ 2 (1 -β 1 ) 2 L 2 (1 + λη)(1 + a 2 γ 3 ) ≤ c 1 β 2 1 16c 2 (1 -β 1 ) 2 L 2 (1 + λη) η 2 (γ -1) 2 c 2 1 ≤ (γ -1) 2 c 2 1 c 3 1 16c 2 L(γ -1) 2 ≤ c 1 16c 2 L Therefore, we have 1 T T -1 k=0 E y k -(1 + λ k η)x k 2 ≤ c 1 γ 2 4c 2 L , 1 T T -1 k=0 E z k -x k 2 ≤ c 1 2 4c 2 L , 1 T T -1 k=0 E z k+1 -z k 2 ≤ c 1 β 2 1 2 4c 2 (1 -β 1 ) 2 L 2 (1 + λη) . 
Published as a conference paper at ICLR 2023 Besides, we have 1 T T -1 k=0 E m k -∇F (x k ) 2 ≤ 1 T T -1 k=0 E m k + λ k x k * s k -∇F (x k ) -λ k x k * s k 2 ≤ 2 T T -1 k=0 E m k + λ k x k * s k 2 + ∇F (x k ) + λ k x k * s k 2 = 2 T T -1 k=0 E m k + λ k x k * s k 2 + ∇F k (x k ) 2 x ≤2 2 + 3 4 × 4 2 ≤ 8 2 . where in x we use w k = m k + λ k x k * s k . In this way, we have 1 T T -1 k=0 E m k -∇F (z k ) 2 ≤ 2 T T -1 k=0 E m k -∇F (x k ) 2 + ∇F (x k ) -∇F (z k ) 2 ≤16 2 + 2L 2 T T -1 k=0 E x k -z k 2 ≤16 2 + c 1 L 2 2c 2 = (c 1 L + 32c 2 ) 2c 2 2 . For all hyper-parameters, we put their constrains together: β 1 ≤ c 1 b 2 4c 2 (1 + λη)σ 2 = O c 1 b 2 c 2 σ 2 , where c 1 = ν 0.5 ≤ s k ∞ ≤ c 2 ∞ + ν 0.5 = c 2 . For η, it should satisfy η ≤ min c 1.5 1 β 1 4 √ 2c 0.5 2 γ(1 -β 1 )L(1 + λη) 0.5 (1 + a 2 γ 3 ) 0.5 , c 1.5 1 4c 0.5 2 L 0.5 (γ -1) , c 2 1 (1 + λη) 2c 2 (L + λc 1 ) Considering λη << 1, 1+λη 1+(1-µ)λη = a ≤ 1 1-µ , µ is a constant, and c 1 = ν 0.5 << 1, then we have F PROOFS OF THEOREM 2 Proof. Recall our definition F k (θ k ) = F (θ) + λ k 2 θ 2 2 = E ζ [f (θ; ζ)] + λ k 2 θ 2 2 in the (13). By setting β 1 = 1 -β 1 , then we have m k ∞ ≤ c ∞ by using Lemma 3 (see Appendix D). Also we define w k := m k + λx k , x k+1 -x k = - η k 1 + λη k (m k + λx k ) = - η k 1 + λη k w k . Note in the following, we set all λ k = λ. 
By using the smoothness of f (θ; ζ), we can obtain F k+1 (x k+1 ) ≤F (x k ) + ∇F (x k ), x k+1 -x k + L 2 x k+1 -x k 2 + λ 2 x k+1 2 x ≤F (x k ) + λ 2 x k 2 + ∇F (x k ) + λx k , x k+1 -x k + L 2 x k+1 -x k 2 + λ 2 x k+1 -x k 2 =F k (x k ) - η k 1 + λη k ∇F (x k ) + λx k , w k + Lη 2 k 2(1 + λη k ) 2 w k 2 + λη 2 k 2(1 + λη k ) 2 w k 2 =F k (x k ) + 1 2 η k (1 + λη k ) (∇F (x k ) + λx k -w k ) 2 - 1 2 η k (1 + λη k ) (∇F (x k ) + λx k ) 2 - 1 2 η k (1 + λη k ) w k 2 + Lη 2 k 2(1 + λη k ) 2 w k 2 + λη 2 k 2(1 + λη k ) 2 w k 2 y ≤F k (x k ) + η k 2(1 + λη k ) ∇F (x k ) -m k 2 - η k 2(1 + λη k ) ∇F k (x k ) 2 - η k 2(1 + λη k ) 1 - Lη k (1 + λη k ) - λη k (1 + λη k ) w k 2 z ≤F k (x k ) + η k 2(1 + λη k ) ∇F (x k ) -m k 2 - η k 2(1 + λη k ) ∇F k (x k ) 2 - η k 4(1 + λη k ) w k 2 , where x holds because x k+1 2 s k = x k 2 s k + x k+1 -x k 2 s k + 2 x k+1 -x k , x k s k . y holds, because w k := m k + λx k , x k+1 -x k = - η k 1 + λη k (m k + λx k ) = - η k 1 + λη k w k . { holds, since we set η k ≤ c 2 1 (1+λη k ) 2c2(L+λc1) such that c2Lη k c 2 1 (1+λη k ) + c2λη k c1(1+λη k ) ≤ 1 2 . Then in the following, we can directly follow the proof of Theorem 1. This is because the only difference between accelerated SGD and AdamW is that SGD has no the second-order moment v k , while AdamW has. By let s k = 1 in accelerated AdamW and setting β 1 = 1 -β 1 in accelerated SGD, then they share the exact the same updating rules. So after setting β 1 = 1 -β 1 in accelerated SGD, to follow the proofs of Theorem 1, we only need to verify whether the auxiliary lemmas and the proof process of Theorem 1 hold for s k = 1. This is the true case. Please check our auxiliary lemmas, including Lemma 3 ∼ 6, and the proof process of Theorem 1. Consider s k = 1 in accelerated SGD, we have c c) The sequence {x k , z k } satisfies 1 := 1 ≤ s k ∞ ≤ c 2 := 1. 1 T T -1 k=0 E x k -x k+1 2 , E z k+1 -z k 2 2 , E z k -x k 2 2 ≤ 4η 2 2 , β 2 1 2 4(1-β 1 ) 3 L 2 , 2 4L . 
d) The total stochastic gradient complexity to achieve the above three properties is $O\big(\frac{\Delta\sigma^2 L}{\epsilon^4}\big)$.

G PROOFS OF AUXILIARY LEMMAS

G.1 PROOF OF LEMMA 3

Proof. To begin with, we assume that for all $t \le k$ it holds that $\|m_t\|_\infty \le c_\infty$ and $\|v_t + \nu\|_\infty \le c_\infty^2 + \nu$. Then we consider the case $t = k+1$ as follows:
$$\|m_{k+1}\|_\infty = \|(1-\beta_1)m_k + \beta_1 g_{k+1}\|_\infty \le (1-\beta_1)\|m_k\|_\infty + \beta_1\|g_{k+1}\|_\infty \le c_\infty,$$
$$\|v_{k+1}\|_\infty = \|(1-\beta_2)v_k + \beta_2 g_{k+1}^2\|_\infty \le (1-\beta_2)\|v_k\|_\infty + \beta_2\|g_{k+1}^2\|_\infty \le c_\infty^2.$$
Then we derive the second result as follows:
$$\Big\|\frac{v_k+\nu}{v_{k+1}+\nu}\Big\|_\infty = \Big\|1 + \frac{v_k - v_{k+1}}{v_{k+1}+\nu}\Big\|_\infty = \Big\|1 + \frac{\beta_2(v_k - g_{k+1}^2)}{v_{k+1}+\nu}\Big\|_\infty.$$
Therefore, we have
$$1 - \frac{\beta_2 c_\infty^2}{c_{s,\infty}^2+\nu} \le \Big\|\frac{v_k+\nu}{v_{k+1}+\nu}\Big\|_\infty \le 1 + \frac{\beta_2 c_\infty^2}{c_{s,\infty}^2+\nu},$$
which yields the claimed bounds on $\|s_k/s_{k+1}\|_\infty$ since $s_k = \sqrt{v_k+\nu}$. We complete the proof.
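The induction above can be checked numerically: with gradients bounded element-wise between $c_{s,\infty}$ and $c_\infty$, the moments stay bounded and the ratio $(v_k+\nu)/(v_{k+1}+\nu)$ stays within $1 \pm \beta_2 c_\infty^2/(c_{s,\infty}^2+\nu)$. The sketch below is an illustrative sanity check under our own parameter choices, not a proof.

```python
import numpy as np

def lemma3_bounds_hold(T=500, dim=8, c_lo=1.0, c_inf=4.0,
                       beta1=0.9, beta2=0.999, nu=1e-3, seed=0):
    """Numerically check Lemma 3's bounds for c_lo <= |g_k| <= c_inf
    element-wise (illustrative check, not a proof)."""
    rng = np.random.default_rng(seed)
    bound = beta2 * c_inf**2 / (c_lo**2 + nu)
    draw = lambda: rng.uniform(c_lo, c_inf, dim) * rng.choice([-1.0, 1.0], dim)
    g = draw()
    m, v = g.copy(), g**2                    # m_0 = g_0, v_0 = g_0^2
    for _ in range(T):
        g = draw()
        m_new = (1 - beta1) * m + beta1 * g  # first-moment EMA stays in [-c_inf, c_inf]
        v_new = (1 - beta2) * v + beta2 * g**2
        if np.max(np.abs(m_new)) > c_inf + 1e-9:
            return False
        if np.max(v_new) > c_inf**2 + 1e-9:
            return False
        ratio = (v + nu) / (v_new + nu)      # the ratio bounded by the lemma
        if np.max(np.abs(ratio - 1.0)) > bound + 1e-9:
            return False
        m, v = m_new, v_new
    return True
```

The key observation mirrors the proof: $m_{k+1}$ is a convex combination of a bounded $m_k$ and a bounded gradient, and the ratio's deviation from 1 equals $\beta_2(v_k - g_{k+1}^2)/(v_{k+1}+\nu)$.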

G.2 PROOF OF LEMMA 5

Proof. To begin with, we have y k+1 -(1 + λ k ηk )x k+1 =z k -ηk m k s k - 1 + λ k ηk 1 + λ k η k x k -η k m k s k =η k-1 τ k-1 x k + η k-1 τ k-1 y k -ηk m k s k - 1 + λ k ηk 1 + λ k η k x k -η k m k s k =η k-1 τ k-1 (y k -(1 + λ k ηk-1 )x k ) -ηk - 1 + λ k ηk-1 1 + λ k η k-1 η k m k s k + λ k (η k -ηk ) 1 + λ k η k x k x =η k-1 τ k-1 (y k -(1 + λ k ηk-1 )x k ) -ηk - 1 + λ k ηk-1 1 + λ k η k-1 η k w k -λ k √ v k s k + λ k (η k -ηk ) 1 + λ k η k x k =η k-1 τ k-1 (y k -(1 + λ k ηk-1 )x k ) -ηk - 1 + λ k ηk-1 1 + λ k η k-1 η k w k s k + λ k ηk - 1 + λ k ηk-1 1 + λ k η k-1 λ k η k + λ k (η k -ηk ) 1 + λ k η k x k y =ητ k-1 (y k -(1 + λ k η)x k ) - η -η 1 + λ k η w k s k where x holds since w k := m k + λ k x k * s k ; y holds since we set all η k = η and ηk = η which gives τ k = τ = 1 η+η+λ k η η . Therefore, by defining ρ k+1 = ητ k-1 ρ k , ρ 1 = 1 and ρ 0 = 0, then we have y k+1 -(1 + λ k η)x k+1 ρ k+1 = y k -(1 + λ k η)x k ρ k - 1 ρ k+1 η -η 1 + λ k η w k s k (k ≥ 1) For k = 0, we have y 1 -(1 + λ 0 η)x 1 =z 0 - η m 0 s 0 - 1 + λ 0 η 1 + λ 0 η x 0 -η m 0 s 0 =z 0 -η w 0 -λ 0 s 0 * x 0 s 0 -1 + λ 0 η 1 + λ 0 η x 0 -η w 0 -λ 0 s 0 * x 0 s 0 =z 0 -x 0 -η -η 1 + λ 0 η w 0 s 0 In this way, one can obtain y k+1 -(1 + λ k η)x k+1 ρ k+1 =z 0 -x 0 - η -η 1 + λ 0 η w 0 s 0 - k i=1 1 ρ i+1 η -η 1 + λ i η w i s i = - k i=0 1 ρ i+1 η -η 1 + λ i η w i s i where x hold since z 0 = x 0 and ρ 1 = 1. Then we can upper bound y k+1 -(1 + λ k η)x k+1 ρ k+1 2 = k i=0 ρ k+1 (1 -ητ i-1 ) ρ i+1 η -η ρ k+1 (1 -ητ i-1 )(1 + λ i η)  w i s i 2 x ≤ k i=0 ρ k+1 (1 -ητ i-1 ) ρ i+1 (η -η) 2 ρ 2 k+1 (1 -ητ i-1 ) 2 (1 + λ i η) 2 w i s i 2 = (η -η) 2 ρ k+1 k i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 z k+1 -z k 2 ≤2η 2 τ 2 k (1 + λ k η) 2 (1 + λ k η)x k+1 -y k+1 2 + 2η 2 (1 + λ k η) 2 w k s k 2 ≤ 2η 2 (1 + λ k η) 2 w k s k 2 + 2ρ k+1 η2 (η -η) 2 τ 2 k (1 + λ k η) 2 k i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 The proof is completed.

G.3 PROOF OF LEMMA 6

Proof. From Lemma 4, we have E m k -∇F (z k ) 2 ≤(1 -β 1,k )E m k-1 -∇F (z k-1 ) 2 + (1 -β 1,k ) 2 L 2 β 1,k E z k -z k-1 2 + β 2 1,k σ 2 b x ≤(1 -β 1,k )E m k-1 -∇F (z k-1 ) 2 + Π k (1 -β 1,k ) 2 L 2 β 1,k + β 2 1,k σ 2 b where in x, we use the results in Lemma 5 that z k -z k-1 2 ≤Π k with Π k := 2η 2 (1 + λ k-1 η) 2 w k-1 s k-1 2 + 2ρ k η2 (η -η) 2 τ 2 k-1 (1 + λ k-1 η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 . Then we have E m k -∇F (x k ) 2 ≤2E m k -∇F (z k ) 2 + 2E ∇F (z k ) -∇F (x k ) 2 ≤2E m k -∇F (z k ) 2 + 2LE z k -x k 2 x ≤2(1 -β 1,k )E m k-1 -∇F (z k-1 ) 2 + 2Π k (1 -β 1,k ) 2 L 2 β 1,k + 2β 2 1,k σ 2 b + 2LΠ k , where in x, we use the results in Lemma 4 that E m k -∇F (z k ) 2 ≤(1 -β 1,k )E m k-1 -∇F (z k-1 ) 2 + (1 -β 1,k ) 2 L 2 β 1 E z k -z k-1 2 + β 2 1,k σ 2 b . and also the results in Lemma 5 that z k -x k 2 ≤ Π k :=τ k-1 ρ k η(η -η) 2 k-1 i=0 1 ρ i+1 (1 -ητ i-1 )(1 + λ i η) 2 w i s i 2 . The proof is completed.



CONCLUSION

In this work, we adopt the proximal point method to derive a weight-decay-integrated Nesterov acceleration for AdamW and Adam, and extend it to LAMB and SGD. Moreover, we prove the convergence of our accelerated algorithms, i.e. accelerated AdamW, Adam and SGD, and show the superiority of the accelerated Adam-type algorithms over the vanilla ones in terms of stochastic gradient complexity. Finally, experimental results validate the advantages of our accelerated algorithms.



$s_k * x$ denotes an element-wise product $*$. Here we use the regularizer $\|x - x_k\|_{s_k}^2$ instead of the $\ell_2$-regularization $\|x - x_k\|_2^2$, since 1) this new regularization can induce adaptive algorithms, as shown below the corresponding equation in the manuscript.
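For clarity, the $s_k$-weighted squared norm used in this regularizer can be written out directly; the snippet below assumes the element-wise weighting $\|x\|_{s}^2 = \sum_i s[i]\,x[i]^2$ (a small illustration of our own, not library code).

```python
import numpy as np

def weighted_sq_norm(x, xk, s):
    """||x - x_k||_s^2 = sum_i s[i] * (x[i] - xk[i])^2, assuming this
    element-wise weighting."""
    d = x - xk
    return float(np.sum(s * d * d))
```

With $s = \mathbf{1}$ this reduces to the usual squared $\ell_2$ distance, which is exactly why the SGD case of the analysis corresponds to $s_k = 1$.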

Figure 2: Test accuracy curves of AdamW-Win and LAMB-Win on ResNet18.

Results Analysis. Here we investigate the convergence behaviors of our accelerated algorithms, and aim to explain their better test performance over their non-accelerated counterparts. In Fig. 1, we plot the curves of training and test losses along the training epochs on ResNet18 and ViT-B. One can find that our accelerated algorithms, e.g. AdamW-Win, show much faster convergence than their non-accelerated counterparts, e.g. AdamW. Moreover, SGD-Win also converges faster than the Nesterov-accelerated SGD, i.e. SGD-M. We also plot the test accuracy curves in Fig. 2, showing the superior convergence speed of AdamW-Win and LAMB-Win over their non-accelerated versions. Fig. 3 in Appendix B also reveals that SGD-Win and Adam-Win enjoy faster convergence than their non-accelerated counterparts in terms of test accuracy. These faster convergence behaviors could explain the higher performance of our accelerated algorithms over the non-accelerated ones under the same computational cost.

Robust Analysis. For the only extra hyper-parameter $\bar\eta_k$ in our accelerated algorithms over their non-accelerated counterparts, in experiments we always set $\bar\eta_k = \gamma\eta_k$ with $\gamma = 2$. Here we investigate the effect of $\gamma$ on the accelerated algorithms on ResNet50, taking AdamW-Win and LAMB-Win as examples because of their superior performance.


Figure 3: Test accuracy curves of SGD-Win and Adam-Win on ResNet18. See the curves of AdamW-Win and LAMB-Win in the manuscript.

In this way, by setting $\bar\eta_k = \gamma\eta_k$ with $\gamma > 1$, $\eta_k = \eta \le O\big(\frac{b\epsilon^2}{c^{1.5}\gamma^{2.5}\sigma^2 L}\big)$, $\beta_1 \le O\big(\frac{b\epsilon^2}{c\sigma^2}\big)$, $\bar\beta_1 = 1-\beta_1$, $\lambda_k = \lambda$, $\lambda_0 = 0$, after $T = O\big(\frac{\Delta\sigma^2 L}{b\epsilon^4}\big)$ iterations with minibatch size $b$ and $\Delta = F(x_0) - F(x^*)$, the sequence $\{(x_k, z_k)\}_{k=0}^{T}$ generated by accelerated SGD satisfies the following four properties. a) The gradient $\nabla F_k(x_k)$ of the sequence $\{x_k\}_{k=0}^{T}$ can be upper bounded by
$$\frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\|\nabla F_k(x_k)\|_2^2 = \frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\|\nabla F(x_k) + \lambda_k x_k\|_2^2 \le \epsilon^2.$$
b) The gradient moment $m_k$ can well estimate the full gradients $\nabla F(x_k)$ and $\nabla F(z_k)$:
$$\frac{1}{T}\sum_{k=0}^{T-1}\max\Big(\mathbb{E}\|m_k - \nabla F(x_k)\|_2^2,\ \mathbb{E}\|m_k - \nabla F(z_k)\|_2^2\Big) \le O(\epsilon^2).$$
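As a concrete illustration of the accelerated SGD analyzed here, the $s_k = 1$ specialization of Eqn. (12) can be sketched in NumPy as follows. This is an assumption-laden sketch, not the released implementation; `beta` plays the role of $1-\beta_1$ after the momentum reparameterization noted in the proof, and the momentum buffer starts at zero.

```python
import numpy as np

def sgd_win_step(x, z, m, grad, lr=0.1, gamma=2.0, beta=0.9, lam=1e-3):
    """Win-accelerated SGD: the Eqn. (12) update specialized to s_k = 1
    (illustrative sketch of the setting analyzed in Theorem 2)."""
    eta, eta_bar = lr, gamma * lr
    m = beta * m + (1 - beta) * grad                   # heavy-ball momentum
    x_new = (x - eta * m) / (1 + lam * eta)            # conservative step
    y_new = z - eta_bar * m                            # reckless step from z_k
    tau = 1.0 / (eta + eta_bar + lam * eta * eta_bar)  # tau_k
    z_new = eta_bar * tau * x_new + eta * tau * y_new  # linear combination
    return x_new, z_new, m
```

Removing the second-order moment is the only structural change relative to the AdamW-Win update, which is why the proof of Theorem 2 can reuse the Theorem 1 argument with $s_k = 1$.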

The bound
$$\big\|y_{k+1} - (1+\lambda_k\eta)x_{k+1}\big\|^2 \le \rho_{k+1}(\bar{\eta}-\eta)^2 \sum_{i=0}^{k} \frac{1}{\rho_{i+1}}(1-\eta\tau_{i-1})(1+\lambda_i\eta)^2 \Big\|\frac{w_i}{s_i}\Big\|^2$$
can also bound $\|z_{k+1} - x_{k+1}\|^2$:
$$\big\|z_{k+1} - x_{k+1}\big\|^2 = \big\|\bar{\eta}\tau_k x_{k+1} + \eta\tau_k y_{k+1} - x_{k+1}\big\|^2 = \eta\tau_k\big\|y_{k+1} - (1+\lambda_k\eta)x_{k+1}\big\|^2 \le \tau_k\rho_{k+1}\eta(\bar{\eta}-\eta)^2 \sum_{i=0}^{k} \frac{1}{\rho_{i+1}}(1-\eta\tau_{i-1})(1+\lambda_i\eta)^2 \Big\|\frac{w_i}{s_i}\Big\|^2.$$
On the other hand, we have
$$\begin{aligned}
z_{k+1} - z_k &= \bar{\eta}\tau_k x_{k+1} + \eta\tau_k y_{k+1} - z_k \\
&\overset{①}{=} \bar{\eta}\tau_k x_{k+1} + \eta\tau_k y_{k+1} - y_{k+1} - \bar{\eta}\,\frac{m_k}{s_k} \\
&= \bar{\eta}\tau_k x_{k+1} + \eta\tau_k y_{k+1} - y_{k+1} - \bar{\eta}\,\frac{w_k - \lambda_k x_k \odot s_k}{s_k} \\
&= \bar{\eta}\tau_k x_{k+1} + \eta\tau_k y_{k+1} - y_{k+1} - \bar{\eta}\,\frac{w_k}{s_k} + \bar{\eta}\lambda_k x_k \\
&\overset{②}{=} (\bar{\eta}\tau_k + \bar{\eta}\lambda_k)x_{k+1} - (1-\eta\tau_k)y_{k+1} - \frac{\bar{\eta}}{1+\lambda_k\eta}\,\frac{w_k}{s_k} \\
&\overset{③}{=} \eta\tau_k(1+\lambda_k\eta)\big((1+\lambda_k\eta)x_{k+1} - y_{k+1}\big) - \frac{\bar{\eta}}{1+\lambda_k\eta}\,\frac{w_k}{s_k},
\end{aligned}$$
and hence
$$\big\|z_{k+1} - z_k\big\| \le \eta\tau_k(1+\lambda_k\eta)\,\big\|(1+\lambda_k\eta)x_{k+1} - y_{k+1}\big\| + \frac{\bar{\eta}}{1+\lambda_k\eta}\,\Big\|\frac{w_k}{s_k}\Big\|,$$
where in ① we plug in $y_{k+1} = z_k - \bar{\eta}\,\frac{m_k}{s_k}$; in ② we plug in $x_{k+1} = \frac{1}{1+\lambda_k\eta}\big(x_k - \eta\,\frac{m_k}{s_k}\big) = \frac{1}{1+\lambda_k\eta}\big(x_k - \eta\,\frac{w_k - \lambda_k x_k \odot s_k}{s_k}\big) = x_k - \frac{\eta}{1+\lambda_k\eta}\,\frac{w_k}{s_k}$; and in ③ we use $\bar{\eta}\tau_k + \bar{\eta}\lambda_k = \eta\tau_k(1+\eta\lambda_k)^2$ and $1 - \eta\tau_k = \eta\tau_k(1+\eta\lambda_k)$. Then we can upper bound

ImageNet top-1 accuracy (%) of ResNet50 and ResNet101, whose official optimizer is LAMB due to the stronger data augmentation used for better performance. * is reported in (Wightman et al., 2021).



ImageNet top-1 accuracy (%) of ViT and PoolFormer, whose default optimizer in both cases is AdamW. * and † are respectively reported in (…).

Effects of γ on the top-1 accuracy (%) of AdamW-Win and LAMB-Win on ResNet50.

Test perplexity of LSTM on Penn Treebank. * is reported by AdaBelief (Zhuang et al., 2020).

Test PPL of Transformer-XL base on WikiText-103, where Adam is the official optimizer. * is reported in the official implementation.

shows that under different training steps, our accelerated Adam-Win always achieves lower test PPL than the official Adam optimizer. Specifically, it improves the average test PPL by 1.5 over Adam on the three test cases. All these results are consistent with the observations on vision tasks, and together they demonstrate the advantages of our accelerated algorithms.

ACKNOWLEDGEMENTS

Xingyu Xie was supported by the National Key R&D Program of China (2022ZD0160302) and the National Natural Science Foundation of China (No. 62276004).

