ADAN: ADAPTIVE NESTEROV MOMENTUM ALGORITHM FOR FASTER OPTIMIZING DEEP MODELS

Abstract

Adaptive gradient algorithms combine the moving average idea with heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration in theory and also in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to speed up the training of deep neural networks effectively. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an ϵ-approximate first-order stationary point within O(ϵ^{-3.5}) stochastic gradient complexity on nonconvex stochastic problems (e.g. deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g. ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, MAE, etc., and also shows great tolerance to a large range of minibatch sizes, e.g. from 1k to 32k. We hope Adan can contribute to the development of deep learning by reducing training costs and relieving the engineering burden of trying different optimizers on various architectures.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable success in many fields, e.g. computer vision (Szegedy et al., 2015; He et al., 2016) and natural language processing (Sainath et al., 2013; Abdel-Hamid et al., 2014). A noticeable part of this success is contributed by the stochastic gradient-based optimizers, which find satisfactory solutions with high efficiency. Among current deep optimizers, SGD (Robbins & Monro, 1951) is the earliest and also the most representative stochastic optimizer, with dominant popularity for its simplicity and effectiveness. It adopts a single common learning rate for all gradient coordinates but often suffers from an unsatisfactory convergence speed on sparse data or ill-conditioned problems. In recent years, adaptive gradient algorithms, e.g. Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2018), have been proposed, which adjust the learning rate for each gradient coordinate according to the current curvature of the loss objective. These adaptive algorithms, e.g. Adam, often offer a faster convergence speed than SGD in practice. However, none of the above optimizers always stays undefeated among its competitors across different network architectures and application settings. For instance, for vanilla ResNet (He et al., 2016), SGD often achieves better generalization performance than adaptive gradient algorithms such as Adam, whereas on vision transformers (ViTs) (Touvron et al., 2021), SGD often fails, and AdamW is the dominant optimizer with higher and more stable performance. Moreover, these commonly used optimizers usually fail in large-batch training, which is a default setting of prevalent distributed training. Although there is some performance degradation, we still tend to choose the large-batch setting for large-scale deep learning training tasks due to the otherwise unaffordable training time.
For example, training ViT-B with a batch size of 512 usually takes several days, but when the batch size grows to 32K, we may finish the training within three hours (Liu et al., 2022a). Although some methods, e.g. LARS (You et al., 2017) and LAMB (You et al., 2019), have been proposed to handle large batch sizes, their performance often varies significantly across batch sizes. This performance inconsistency increases the training cost and engineering burden, since one usually has to try various optimizers for different architectures or training settings.

Table 1: Comparison of different adaptive gradient algorithms on nonconvex stochastic problems. "Separated Reg." refers to whether the ℓ2 regularizer (weight decay) can be separated from the loss objective as in AdamW. "Complexity" denotes the stochastic gradient complexity to find an ϵ-approximate first-order stationary point. Adam-type methods (Guo et al., 2021) include Adam, AdaGrad (Duchi et al., 2011), etc. AdamW has no available convergence result. For SAM (Foret et al., 2020), A-NIGT (Cutkosky & Mehta, 2020) and Adam+ (Liu et al., 2020), we compare their adaptive versions. d is the variable dimension. The lower bound is proven in (Arjevani et al., 2020).

When we rethink the current adaptive gradient algorithms, we find that they mainly combine the moving average idea with the heavy-ball acceleration technique to estimate the first- and second-order moments of the gradient, e.g. Adam, AdamW and LAMB. However, previous studies (Nesterov, 1983; 1988; 2003) have revealed that Nesterov acceleration can theoretically achieve a faster convergence speed than heavy-ball acceleration, as it uses the gradient at an extrapolation point of the current solution and thus sees a slight "future". Moreover, recent works (Nado et al., 2021; He et al., 2021) have shown the potential of Nesterov acceleration for large-batch training.
Thus we are inspired to consider efficiently integrating Nesterov acceleration with adaptive algorithms. Contributions: 1) We propose an efficient DNN optimizer, named Adan. Adan develops a Nesterov momentum estimation method to estimate stable and accurate first- and second-order moments of the gradient in adaptive gradient algorithms for acceleration. 2) Moreover, Adan enjoys a provably faster convergence speed than previous adaptive gradient algorithms such as Adam. 3) Empirically, Adan shows superior performance over the SoTA deep optimizers across vision, language, and reinforcement learning (RL) tasks. Our detailed contributions are highlighted below.

Firstly, we propose an efficient Nesterov-acceleration-induced deep learning optimizer termed Adan. Given a function f and the current solution θ_k, Nesterov acceleration (Nesterov, 1983; 1988; 2003) estimates the gradient g_k = ∇f(θ′_k) at the extrapolation point θ′_k = θ_k - η(1-β1)m_{k-1} with learning rate η and momentum coefficient β1 ∈ (0, 1), and updates the moving gradient average as m_k = (1-β1)m_{k-1} + g_k. Then it runs a step θ_{k+1} = θ_k - η m_k. However, the inconsistency between the position θ_k for parameter updating and the position θ′_k for gradient estimation incurs the additional cost of reloading the model parameters during back-propagation (BP), which is unaffordable especially for large DNNs. To avoid model reloading during BP, we propose an alternative Nesterov momentum estimation (NME). We compute the gradient g_k = ∇f(θ_k) at the current solution θ_k, and estimate the moving gradient average as m_k = (1-β1)m_{k-1} + g′_k, where g′_k = g_k + (1-β1)(g_k - g_{k-1}). Our NME is provably equivalent to the vanilla one yet avoids the extra model reloading. Then by regarding g′_k as the current stochastic gradient in adaptive gradient algorithms, e.g.
Adam, we accordingly estimate the first- and second-order moments as m_k = (1-β1)m_{k-1} + β1 g′_k and n_k = (1-β2)n_{k-1} + β2 (g′_k)^2, respectively. Finally, we update θ_{k+1} = θ_k - η m_k/(√n_k + ε). In this way, Adan enjoys the merits of Nesterov acceleration, namely a faster convergence speed and tolerance to large minibatch sizes (Lin et al., 2020), which is verified in our experiments in Sec. 5.

Secondly, as shown in Table 1, we theoretically justify the advantages of Adan over previous SoTA adaptive gradient algorithms on nonconvex stochastic problems, e.g. deep learning problems. 1) Under the Lipschitz gradient condition, to find an ϵ-approximate first-order stationary point, Adan has the stochastic gradient complexity O(c_∞^{2.5} ϵ^{-4}), which accords with the lower bound Ω(ϵ^{-4}) (up to a constant factor) (Arjevani et al., 2019). This complexity is lower than the O(c_2^6 ϵ^{-4}) of Adabelief (Zhuang et al., 2020) and the O(c_2^2 d ϵ^{-4}) of LAMB, especially on over-parameterized networks. Specifically, for a d-dimensional gradient, its ℓ∞ norm c_∞ is usually much smaller than its ℓ2 norm c_2, and can be √d× smaller in the best case. Moreover, different from Adam-type optimizers (e.g. Adam), Adan can separate the ℓ2 regularizer from the loss objective like AdamW, whose generalization benefits have been validated in many works (Touvron et al., 2021). 2) Under the Lipschitz Hessian condition, Adan has a complexity of O(c_∞^{1.25} ϵ^{-3.5}), which also matches the lower bound Ω(ϵ^{-3.5}) in Arjevani et al. (2020). This complexity is superior to the O(ϵ^{-3.5} log(c_2/ϵ)) of A-NIGT (Cutkosky & Mehta, 2020) and the O(ϵ^{-3.625}) of Adam+ (Liu et al., 2020). Indeed, Adam+ needs a minibatch size of order O(ϵ^{-1.625}), which is prohibitive in practice. For other optimizers, e.g. Adam, their convergence has not been established yet under the Lipschitz Hessian condition.
Finally, Adan simultaneously surpasses the corresponding SoTA optimizers across vision, language, and RL tasks, and establishes new SoTAs for many networks and settings, e.g. ResNet, ConvNext (Liu et al., 2022b), ViT (Touvron et al., 2021), Swin (Liu et al., 2021), MAE (He et al., 2022), LSTM (Schmidhuber et al., 1997), Transformer-XL (Dai et al., 2019), and BERT (Devlin et al., 2018). More importantly, with half of the training cost (epochs) of SoTA optimizers, Adan can achieve higher or comparable performance. Besides, Adan works well in a large range of minibatch sizes, e.g. from 1k to 32k on ViTs. The improvement of Adan across various architectures and settings can greatly relieve the engineering burden of trying different optimizers.
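As a sanity check, the equivalence between vanilla Nesterov acceleration and the reformulated NME described above can be verified numerically. The sketch below (our illustration, not from the paper) runs both updates side by side on the toy objective f(θ) = θ²/2 with deterministic gradients; the step size, momentum coefficient, and step count are illustrative only:

```python
# Numerical check: vanilla Nesterov acceleration (gradient at the
# extrapolation point) vs. the NME reformulation (gradient at the current
# point plus a gradient-difference term), on f(t) = 0.5 * t**2, grad(t) = t.
eta, beta1, steps = 0.1, 0.1, 50
grad = lambda t: t

theta_v, m_v = 1.0, 0.0            # vanilla Nesterov state
theta_ii, m_ii = 1.0, 0.0          # NME state
g_prev = 0.0                       # g_{-1} = 0 makes the first steps agree

for k in range(steps):
    # Change-of-variables identity: the NME iterate always equals
    # theta_v - eta * (1 - beta1) * m_v of the vanilla run.
    assert abs(theta_ii - (theta_v - eta * (1 - beta1) * m_v)) < 1e-12

    g = grad(theta_v - eta * (1 - beta1) * m_v)   # extrapolated gradient
    m_v = (1 - beta1) * m_v + g
    theta_v = theta_v - eta * m_v

    g2 = grad(theta_ii)                           # gradient at current point
    m_ii = (1 - beta1) * m_ii + g2 + (1 - beta1) * (g2 - g_prev)
    g_prev = g2
    theta_ii = theta_ii - eta * m_ii
```

Both trajectories coincide at every step, so NME needs no model-parameter reload at the extrapolation point during BP.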

2. RELATED WORK

Current DNN optimizers can be grouped into two families: SGD and its accelerated variants, and adaptive gradient algorithms. SGD computes a stochastic gradient and updates the variable along the gradient direction. Later, heavy-ball acceleration (Polyak, 1964) was introduced to movingly average the stochastic gradients in SGD for faster convergence. Nesterov acceleration runs a step along the moving gradient average and then computes the gradient at the new point to look ahead for correction. Typically, Nesterov acceleration converges faster both empirically and theoretically, at least on convex problems, and also shows superior generalization results on DNNs (Foret et al., 2020; Kwon et al., 2021). Unlike SGD, adaptive gradient algorithms, e.g. AdaGrad, RMSProp and Adam, view the second moment of the gradient as a preconditioner and also use the moving gradient average to update the variable. Later, many variants have been proposed to estimate a more accurate and stable first or second moment of the gradient, e.g. AMSGrad, Adabound, and Adabelief. To improve generalization, AdamW decouples the simple ℓ2-type regularization from the objective, and its effectiveness has been validated across many applications; SAM and its variants (Kwon et al., 2021) aim to find flat minima but need two forward and backward passes per iteration. LARS and LAMB train DNNs with large batches but suffer unsatisfactory performance on small batches. Xie et al. (2022) reveal the generalization and convergence gap between Adam and SGD from the perspective of diffusion theory and propose the optimizer Adai, which accelerates training and provably favors flat minima. Padam (Chen et al., 2021a) provides a simple but effective way to improve the generalization performance of Adam by adjusting its second-order moment. The most related work to ours is Nadam, which simplifies Nesterov acceleration to estimate the first moment of the gradient in Adam.
But its acceleration does not use any gradient from the extrapolation points and thus does not look ahead for correction. Moreover, there is no theoretical result ensuring its convergence. See Sec. 3.2 for more discussion of the differences.

3. METHODOLOGY

3.1. PRELIMINARIES

In this work, we study the following regularized nonconvex optimization problem:

min_θ F(θ) := E_{ζ∼D}[f(θ, ζ)] + (λ/2)‖θ‖_2^2,  (1)

where f(θ, ζ) is the loss on the sample ζ drawn from the data distribution D, and λ ≥ 0 is the regularization coefficient. Two representative adaptive gradient algorithms are RMSProp and Adam:

RMSProp:
  n_k = (1-β)n_{k-1} + β g_k^2,
  θ_{k+1} = θ_k - η/(√n_k + ε) ∘ g_k;

Adam:
  m_k = (1-β1)m_{k-1} + β1 g_k,
  n_k = (1-β2)n_{k-1} + β2 g_k^2,
  θ_{k+1} = θ_k - η/(√n_k + ε) ∘ m_k,

where m_0 = g_0, n_0 = g_0^2, the scalar η is the base learning rate, and ∘ denotes the element-wise product. Based on RMSProp, Adam replaces the estimated gradient g_k with a moving average m_k of all previous gradients. By inspection, one can easily observe that the moving average idea in Adam is similar to the classical (stochastic) heavy-ball acceleration (HBA) technique (Polyak, 1964):

HBA:  g_k = ∇f(θ_k) + ξ_k,  m_k = (1-β1)m_{k-1} + g_k,  θ_{k+1} = θ_k - η m_k.

Both Adam and HBA share the spirit of the moving gradient average, though HBA does not have the factor β1 on the gradient g_k. That is, given one gradient coordinate, if its gradient directions are consistent along the optimization trajectory, Adam/HBA accumulates a larger gradient value in this direction and thus takes a bigger gradient step, which accelerates convergence. In addition to HBA, Nesterov's accelerated (stochastic) gradient descent (AGD) (Nesterov, 1983; 1988; 2003) is another popular acceleration technique in the optimization community:

AGD:  g_k = ∇f(θ_k - η(1-β1)m_{k-1}) + ξ_k,  m_k = (1-β1)m_{k-1} + g_k,  θ_{k+1} = θ_k - η m_k.  (2)

Unlike HBA, AGD uses the gradient at the extrapolation point θ′_k = θ_k - η(1-β1)m_{k-1}. Hence when the adjacent iterates share consistent gradient directions, AGD sees a slight future to converge faster. Indeed, AGD theoretically converges faster than HBA and achieves the optimal convergence rate on general smooth convex problems (Nesterov, 2003).
Meanwhile, since over-parameterized DNNs have been observed or proved to have many convex-like local basins (Hardt & Ma, 2016; Xie et al., 2017; Li & Yuan, 2017), AGD seems more suitable than HBA for DNNs. For large-batch training, Nado et al. (2021) showed that AGD has the potential to achieve performance comparable to some specifically designed optimizers, e.g. LARS and LAMB. Given its advantages in convergence and large-batch training, we consider applying AGD to improve adaptive algorithms.
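To make the HBA/AGD contrast concrete, the following toy sketch (our illustration, not from the paper) runs both update rules as written above on f(θ) = θ²/2 with deterministic gradients and the same untuned η and β1; on this problem the AGD error envelope shrinks markedly faster:

```python
# Toy comparison of heavy-ball (HBA) vs. Nesterov (AGD) momentum on
# f(t) = 0.5 * t**2, grad(t) = t. Deterministic gradients; eta and beta1
# are illustrative, not tuned.
eta, beta1, steps = 0.1, 0.1, 100
grad = lambda t: t

th_hba, m_hba = 1.0, 0.0
th_agd, m_agd = 1.0, 0.0
tail_hba, tail_agd = [], []

for k in range(steps):
    # HBA: gradient at the current iterate.
    m_hba = (1 - beta1) * m_hba + grad(th_hba)
    th_hba -= eta * m_hba
    # AGD: gradient at the extrapolation point ("sees a slight future").
    m_agd = (1 - beta1) * m_agd + grad(th_agd - eta * (1 - beta1) * m_agd)
    th_agd -= eta * m_agd
    if k >= steps - 20:                 # record the tail of each run
        tail_hba.append(abs(th_hba))
        tail_agd.append(abs(th_agd))

# Both converge, but the AGD iterates end up much closer to the optimum.
assert max(tail_agd) < max(tail_hba) < 0.1
```

On this quadratic the linearized HBA dynamics contract at rate about 0.95 per step versus 0.9 for AGD, which is the gap the assertion detects; this is only a toy illustration of the general claim, not a proof.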

3.2. ADAPTIVE NESTEROV MOMENTUM ALGORITHM

Main Iteration. We temporarily set λ = 0 in Eqn. (1). As aforementioned, AGD computes the gradient at an extrapolation point θ′_k instead of the current iterate θ_k, which however brings extra computation and memory overhead for computing θ′_k and preserving both θ_k and θ′_k. To resolve this issue, Lemma 1, with proof in Appendix D, reformulates AGD (2) into an equivalent but more DNN-efficient version.

Lemma 1. Assume E(ξ_k) = 0 and Cov(ξ_i, ξ_j) = 0 for any i ≠ j. Let θ̃_k and m̃_k be the iterate and momentum of the vanilla AGD in Eqn. (2), respectively. Let θ_{k+1} := θ̃_{k+1} - η(1-β1)m̃_k and m_k := (1-β1)^2 m̃_{k-1} + (2-β1)(∇f(θ_k) + ξ_k). Then the vanilla AGD in Eqn. (2) becomes

AGD-II:  g_k = ∇f(θ_k) + ξ_k,  m_k = (1-β1)m_{k-1} + [g_k + (1-β1)(g_k - g_{k-1})],  θ_{k+1} = θ_k - η m_k.

Moreover, if the vanilla AGD in Eqn. (2) converges, so does AGD-II, and E(θ_∞) = E(θ̃_∞).

The main idea in Lemma 1 is that we maintain (θ_k - η(1-β1)m_{k-1}) rather than θ_k of vanilla AGD at each iteration, since there is no difference between them when the algorithm converges. Like other adaptive optimizers, by regarding g′_k = g_k + (1-β1)(g_k - g_{k-1}) as the current stochastic gradient and movingly averaging g′_k to estimate the first- and second-order moments of the gradient, we obtain

Vanilla Adan:
  m_k = (1-β1)m_{k-1} + β1[g_k + (1-β1)(g_k - g_{k-1})],
  n_k = (1-β3)n_{k-1} + β3[g_k + (1-β1)(g_k - g_{k-1})]^2,
  θ_{k+1} = θ_k - η_k ∘ m_k,  with η_k = η/(√n_k + ε).

The main difference of Adan from Adam-type methods and Nadam (Dozat, 2016) is that, as compared in Eqn. (3), the momentum m_k of Adan is the average of {g_t + (1-β1)(g_t - g_{t-1})}_{t=1}^k while those of Adam-type methods and Nadam are the average of {g_t}_{t=1}^k. So is their second-order term n_k.
m_k =
  Σ_{t=0}^k c_{k,t}[g_t + (1-β1)(g_t - g_{t-1})],  (Adan)
  Σ_{t=0}^k c_{k,t} g_t,  (Adam)
  (μ_{k+1}/μ′_{k+1}) Σ_{t=0}^k c_{k,t} g_t + ((1-μ_k)/μ′_k) g_k,  (Nadam)
with c_{k,t} = β1(1-β1)^{k-t} for t > 0 and c_{k,t} = (1-β1)^k for t = 0,  (3)

where {μ_t}_{t=1}^∞ is a predefined exponentially decaying sequence and μ′_k = 1 - Π_{t=1}^k μ_t. So Nadam is more like Adam than Adan, as their m_k movingly averages the historical gradients instead of the gradient differences in Adan. For a large k (i.e. a small μ_k), m_k in Nadam and Adam are almost the same.

As shown in Eqn. (3), the moment m_k in Adan consists of two terms, i.e. the gradient term g_t and the gradient difference term (g_t - g_{t-1}), which actually have different physical meanings. So here we decouple them for greater flexibility and a better trade-off between them. Specifically, we estimate

(θ_k - θ_{k+1})/η_k = Σ_{t=0}^k [c_{k,t} g_t + (1-β2)c′_{k,t}(g_t - g_{t-1})] = m_k + (1-β2)v_k,

where c′_{k,t} = β2(1-β2)^{k-t} for t > 0, c′_{k,t} = (1-β2)^k for t = 0, and m_k and v_k are defined as

m_k = (1-β1)m_{k-1} + β1 g_k,   v_k = (1-β2)v_{k-1} + β2(g_k - g_{k-1}).

This change for a flexible estimation does not impair the convergence speed. As we show in Theorem 1, the complexity of Adan under this change matches the lower complexity bound. We do not separate the gradients and their differences in the second-order moment n_k, since E(n_k) contains the correlation term Cov(g_k, g_{k-1}) ≠ 0, which may have statistical significance.

Decay Weight by Proximation. As observed in AdamW, decoupling the optimization objective and the simple-type regularization (e.g. the ℓ2 regularizer) can largely improve the generalization performance. Here we follow this idea but from a rigorous optimization perspective.
Intuitively, at each iteration, we minimize the first-order approximation of F(•) at the point θ_k:

θ_{k+1} = θ_k - η_k ∘ m̄_k = argmin_θ { F(θ_k) + ⟨m̄_k, θ - θ_k⟩ + (1/(2η))‖θ - θ_k‖²_{√n_k} },

where ‖x‖²_{√n_k} := ⟨x, (√n_k + ε) ∘ x⟩ and m̄_k := m_k + (1-β2)v_k is the first-order derivative of F(•) in some sense. Following the idea of proximal gradient descent (Parikh & Boyd, 2014; Zhuang et al., 2022), we decouple the ℓ2 regularizer from F(•) and only linearize the loss function f(•):

θ_{k+1} = argmin_θ { (λ_k/2)‖θ‖²_{√n_k} + ⟨m̄_k, θ - θ_k⟩ + (1/(2η))‖θ - θ_k‖²_{√n_k} } = (θ_k - η_k ∘ m̄_k)/(1 + λ_k η),

where λ_k > 0 is the weight decay at the k-th iteration. One can find that the optimization objective at the k-th iteration is changed from the vanilla "static" function F(•) in Eqn. (1) to a "dynamic" function F_k(•) in Eqn. (6), which adaptively regularizes the coordinates with larger gradients more:

F_k(θ) := E_{ζ∼D}[f(θ, ζ)] + (λ_k/2)‖θ‖²_{√n_k}.  (6)

We summarize our Adan in Algorithm 1. We reset the momentum term properly by the restart condition, a common trick to stabilize optimization and benefit convergence (Li & Lin, 2022; Jin et al., 2018). But to keep Adan simple, in all experiments except Table 8, we do not use this restart strategy, although Table 8 shows that it can improve performance.
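The closed-form decoupled weight-decay step above can be verified by checking the first-order optimality condition of the proximal subproblem. A small sketch with arbitrary illustrative values (per-coordinate arithmetic in the diagonal metric; all names are ours):

```python
# Check the decay-weight-by-proximation step: with the diagonal metric
# ||x||^2_{sqrt(n)} = <x, (sqrt(n) + eps) * x>, the minimizer of
#   lam/2 * ||th||^2 + <m_bar, th - th_k> + 1/(2*eta) * ||th - th_k||^2
# (all norms in that metric) is th_{k+1} = (th_k - eta_k * m_bar)/(1 + lam*eta).
import random

random.seed(0)
eta, lam, eps, dim = 0.01, 0.05, 1e-8, 5
th_k  = [random.uniform(-1, 1) for _ in range(dim)]   # current iterate
m_bar = [random.uniform(-1, 1) for _ in range(dim)]   # m_k + (1-beta2) v_k
n_k   = [random.uniform(0, 1) for _ in range(dim)]    # second-order moment
w     = [v ** 0.5 + eps for v in n_k]                 # sqrt(n_k) + eps

# Closed-form proximal step from the text.
th_new = [(t - eta * m / wi) / (1 + lam * eta)
          for t, m, wi in zip(th_k, m_bar, w)]

# First-order optimality: the subproblem gradient vanishes at th_new.
for t_new, t_old, m, wi in zip(th_new, th_k, m_bar, w):
    g = lam * wi * t_new + m + (1.0 / eta) * wi * (t_new - t_old)
    assert abs(g) < 1e-9
```

Setting that gradient to zero and solving per coordinate recovers exactly the (1 + λ_k η)^{-1} rescaling, i.e. the decoupled weight decay.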

4. CONVERGENCE ANALYSIS

For analysis, we make several mild assumptions used in many works, e.g. (Guo et al., 2021).

Assumption 1 (L-smoothness). The function f(•, •) is L-smooth w.r.t. the parameter, i.e., ∃ L > 0 such that ‖∇E_ζ[f(x, ζ)] - ∇E_ζ[f(y, ζ)]‖ ≤ L‖x - y‖ for all x, y.

Assumption 2 (Unbiased and bounded gradient oracle). The stochastic gradient oracle g_k = ∇E_ζ[f(θ_k, ζ)] + ξ_k is unbiased, and its magnitude and variance are bounded with probability 1: E(ξ_k) = 0, ‖g_k‖_∞ ≤ c_∞/3, and E‖ξ_k‖² = E‖∇E_ζ[f(θ_k, ζ)] - g_k‖² ≤ σ² for all k ∈ [T].

Algorithm 1: Adan (Adaptive Nesterov Momentum Algorithm)
  Input: initialization θ_0, step size η, weight decay λ_k > 0, restart condition.
  Output: some average of {θ_k}_{k=1}^K.
  while k < K do
    compute the stochastic gradient estimator g_k at θ_k;
    m_k = (1-β1)m_{k-1} + β1 g_k  /* set m_0 = g_0 */;
    v_k = (1-β2)v_{k-1} + β2(g_k - g_{k-1})  /* set v_1 = g_1 - g_0 */;
    n_k = (1-β3)n_{k-1} + β3[g_k + (1-β2)(g_k - g_{k-1})]^2;
    θ_{k+1} = (1 + λ_k η)^{-1}[θ_k - η_k ∘ (m_k + (1-β2)v_k)]  with η_k = η/(√n_k + ε);
    if the restart condition holds then
      get the stochastic gradient estimator g_0 at θ_{k+1};
      set m_0 = g_0, v_0 = 0, n_0 = g_0^2, update θ_1 by Line 6, and set k = 1;
    end if
  end while

Assumption 3 (ρ-Lipschitz-continuous Hessian). The function f(•, •) has a ρ-Lipschitz-continuous Hessian: ‖∇²E_ζ[f(x, ζ)] - ∇²E_ζ[f(y, ζ)]‖ ≤ ρ‖x - y‖ for all x, y, where ‖•‖ is the spectral norm for matrices and the ℓ2 norm for vectors.

For a general nonconvex problem, if Assumptions 1 and 2 hold, the lower bound of the stochastic gradient complexity to find an ϵ-approximate first-order stationary point (ϵ-ASP) is Ω(ϵ^{-4}) (Arjevani et al., 2019; 2020). Moreover, if Assumption 3 further holds, the lower complexity bound becomes Ω(ϵ^{-3.5}) for non-variance-reduction algorithms (Arjevani et al., 2019; 2020).

Lipschitz Gradient. Theorem 1, with proof in Appendix E, establishes the convergence of Adan on problem (6) under the Lipschitz gradient condition.

Theorem 1. Suppose Assumptions 1 and 2 hold.
Let max{β1, β2} = O(ϵ^2), μ := √(2β3)·c_∞/ε ≪ 1, η = O(ϵ^2), and λ_k = λ(1-μ)^k. Then Algorithm 1 runs at most K = O(c_∞^{2.5} ϵ^{-4}) iterations to achieve

(1/(K+1)) Σ_{k=0}^K E‖∇F_k(θ_k)‖² ≤ 4ϵ².

That is, to find an ϵ-ASP, the stochastic gradient complexity of Adan on problem (6) is O(c_∞^{2.5} ϵ^{-4}).

Theorem 1 shows that under Assumptions 1 and 2, Adan can converge to an ϵ-ASP of a nonconvex stochastic problem with stochastic gradient complexity O(c_∞^{2.5} ϵ^{-4}), which accords with the lower bound Ω(ϵ^{-4}) in Arjevani et al. (2019). For this convergence, Adan has no requirement on the minibatch size and only assumes the gradient estimate to be unbiased and bounded. Moreover, as shown in Table 1 in Sec. 1, the complexity of Adan is superior to those of previous adaptive gradient algorithms. Compared with Adabelief and LAMB, Adan always has lower complexity and respectively enjoys d^3× and d^2× lower complexity in the worst case. Adam-type optimizers (e.g. Adam and AMSGrad) enjoy the same complexity as Adan, but they cannot separate the ℓ2 regularizer from the objective like AdamW and our Adan. The regularizer separation can boost generalization performance (Touvron et al., 2021; Liu et al., 2021) and already helps AdamW dominate the training of ViT-alike architectures. Besides, some previous analyses (Luo et al., 2018; Zaheer et al., 2018; Liu et al., 2019a; Shi et al., 2020) need the momentum coefficients (i.e. the βs) to be close or increased to one, which contradicts the practice that the βs are close to zero. In contrast, Theorem 1 assumes that all the βs are very small, which is more consistent with practice. Note that when μ = c/T, we have λ_k/λ ∈ [(1-c), 1] during training. Hence we can choose λ_k as a fixed constant in the experiments for convenience.

Lipschitz Hessian. With Assumption 3, we further need a restart condition.
Consider an extrapolation point y_{k+1} := θ_{k+1} + η_k ∘ [m_k + (1-β2)v_k - β g_k], and a restart condition:

(k+1) Σ_{t=0}^k ‖y_{t+1} - y_t‖²_{√n_t} > R²,  (7)

where the constant R controls the restart frequency. Intuitively, when the parameters have accumulated enough updates, the iterate may reach a new local basin. Resetting the momentum at this moment helps Adan better exploit the local geometric information. Besides, we change η_k from η/(√n_k + ε) to η/(√n_{k-1} + ε) to ensure that η_k is independent of the noise ζ_k. See the proof in Appendix F.

Theorem 2. Suppose Assumptions 1, 2, and 3 hold. Let R = O(ϵ^{0.5}), max{β1, β2} = O(ϵ^2), β3 = O(ϵ^4), η = O(ϵ^{1.5}), K = O(ϵ^{-2}), and λ = 0. Then Algorithm 1 with the restart condition in Eqn. (7) satisfies

E‖∇F(θ̄)‖ = O(c_∞^{0.5} ϵ),  where θ̄ := (1/K_0) Σ_{k=1}^{K_0} θ_k and K_0 = argmin_{⌊K/2⌋ ≤ k ≤ K-1} ‖y_{k+1} - y_k‖²_{√n_k}.

Moreover, to find an ϵ-ASP, Algorithm 1 restarts at most O(c_∞^{0.5} ϵ^{-1.5}) times, each restart cycle has at most K = O(ϵ^{-2}) iterations, and hence the total stochastic gradient complexity is at most O(c_∞^{1.25} ϵ^{-3.5}).

From Theorem 2, one can observe that with the extra smooth Hessian condition in Assumption 3 and the restart condition (7), Adan improves its vanilla stochastic gradient complexity from O(c_∞^{2.5} ϵ^{-4}) to O(c_∞^{1.25} ϵ^{-3.5}), which also matches the corresponding lower bound Ω(ϵ^{-3.5}). This complexity is lower than the O(ϵ^{-3.5} log(c_2/ϵ)) of A-NIGT and the O(ϵ^{-3.625}) of Adam+. For other DNN optimizers, e.g. Adam, their convergence under the Lipschitz Hessian condition has not been proved yet. Moreover, Theorem 2 still holds for large batch sizes. For example, with minibatch size b = O(ϵ^{-1.5}), our results still hold when R = O(ϵ^{0.5}), max{β1, β2} = O(ϵ^{0.5}), β3 = O(ϵ), η = O(1), K = O(ϵ^{-0.5}), and λ = 0. In this case, our η is of the order O(1) and is much larger than the O(poly(ϵ)) of other optimizers (e.g. LAMB and Adam+) for handling large minibatches. This large step size often boosts the convergence speed in practice, which is actually desired.
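To make Algorithm 1 concrete, here is a minimal plain-Python sketch of its main loop with the restart branch omitted and the initialization slightly simplified. The βs are the paper's reported defaults; the toy objective, step size, and step count are our illustrative choices, not tuned:

```python
# Sketch of Algorithm 1 (Adan) without the restart branch, per-coordinate
# arithmetic on plain lists. Toy objective: f(th) = 0.5 * ||th||^2 with
# exact gradients. Hyper-parameters beta1/beta2/beta3 follow the paper's
# defaults; everything else here is illustrative only.
def adan_minimize(grad, th, steps=300, eta=0.01,
                  beta1=0.02, beta2=0.08, beta3=0.01, lam=0.0, eps=1e-8):
    d = len(th)
    m, v, n, g_prev = [0.0] * d, [0.0] * d, [0.0] * d, None
    for k in range(steps):
        g = grad(th)
        if k == 0:                       # m_0 = g_0, v_0 = 0, n_0 = g_0^2
            m, n = list(g), [gi * gi for gi in g]
        else:
            m = [(1 - beta1) * mi + beta1 * gi for mi, gi in zip(m, g)]
            v = [(1 - beta2) * vi + beta2 * (gi - gp)
                 for vi, gi, gp in zip(v, g, g_prev)]
            u = [gi + (1 - beta2) * (gi - gp) for gi, gp in zip(g, g_prev)]
            n = [(1 - beta3) * ni + beta3 * ui * ui for ni, ui in zip(n, u)]
        g_prev = g
        # theta_{k+1} = (1 + lam*eta)^{-1} [theta_k - eta_k o (m_k + (1-beta2) v_k)]
        th = [(ti - eta / (ni ** 0.5 + eps) * (mi + (1 - beta2) * vi))
              / (1 + lam * eta)
              for ti, ni, mi, vi in zip(th, n, m, v)]
    return th

grad = lambda th: list(th)               # gradient of 0.5 * ||th||^2
th0 = [1.0, -2.0, 0.5]
th = adan_minimize(grad, th0)
loss0 = 0.5 * sum(t * t for t in th0)
loss = 0.5 * sum(t * t for t in th)
assert loss < 0.5 * loss0                # the toy loss clearly decreases
```

The sketch only demonstrates the structure of the update (decoupled m_k/v_k/n_k plus the proximal weight-decay rescaling); it makes no claim about tuned performance.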

5. EXPERIMENTAL RESULTS

We evaluate Adan on vision, NLP and RL tasks. For vision tasks, we test Adan on several representative SoTA backbones under the supervised setting, including 1) CNN-type architectures (ResNets and ConvNexts (Liu et al., 2022b)) and 2) ViT-type architectures (vanilla ViTs and Swins (Liu et al., 2021)). Moreover, we also investigate Adan under self-supervised pretraining by using it to train MAE ViTs (He et al., 2022). For NLP tasks, we train LSTM, Transformer-XL (Dai et al., 2019), and BERT (Devlin et al., 2018) for sequence modeling. On RL tasks, we evaluate Adan on four games in MuJoCo (Todorov et al., 2012). In all experiments, we only replace the optimizer with Adan and tune the step size, warmup epochs, and weight decay, etc., while fixing the optimizer-independent hyper-parameters, e.g. data augmentation and model architectures. Moreover, to keep Adan simple, in all experiments except Table 8, we do not use the restart strategy in Algorithm 1. Due to space limitations, we defer the RL results and the ablation study to Appendix B.3 and B.5, respectively.

5.1. EXPERIMENTS FOR VISION TASKS

Besides the vanilla supervised training setting used for ResNets (He et al., 2016), we further consider two prevalent training settings on ImageNet, namely the following Training Setting I and II.

Training Setting I. The recently proposed "A2 training recipe" in (Wightman et al., 2021) has lifted the performance limits of many SoTA CNN-type architectures by stronger data augmentation. Specifically, for data augmentation, this setting uses random crop, horizontal flipping, Mixup (0.1)/CutMix (1.0) with probability 0.5, and RandAugment with M = 7, N = 2 and MSTD = 0.5. It sets stochastic depth (0.05), and adopts cosine learning rate decay and the binary cross-entropy loss.

Training Setting II. For this setting, data augmentation includes random crop, horizontal flipping, Mixup (0.8), CutMix (1.0), RandAugment (M = 9, MSTD = 0.5) and Random Erasing (p = 0.25). It uses the cross-entropy loss, cosine decay, and stochastic depth. For both settings, please refer to Appendix Sec. A.1 for details, e.g. on data augmentation.

Table 3 shows that across different model sizes of ViT and Swin, Adan outperforms the official AdamW optimizer by a large margin. For ViTs, the gradient per iteration differs much from the previous one due to the much sharper loss landscape than CNNs (Chen et al., 2021b) and the strong random augmentations during training, so it is hard to train ViTs to converge within a few epochs. Thanks to its faster convergence, as shown in Figure 1, Adan is very suitable for this situation. Moreover, the direction correction term from the gradient difference v_k of Adan can also better correct the first- and second-order moments. One piece of evidence is that the first-order moment decay coefficient β1 = 0.02 of Adan is much smaller than the 0.1 used in other deep optimizers.

Results on Transformer-XL. We evaluate Adan on Transformer-XL (Dai et al., 2019), which is often used to model long sequences.
We follow the exact official setting to train Transformer-XL-base on the WikiText-103 dataset, the largest available word-level language modeling benchmark with long-term dependency. We only replace the default Adam optimizer of Transformer-XL-base with our Adan, and make no other changes to the hyper-parameters. For Adan, we set β1 = 0.1, β2 = 0.1, and β3 = 0.001, and choose the learning rate as 0.001. We test Adan and Adam with several numbers of training steps, including 50k, 100k, and 200k (official).

Results on BERT. For pretraining, we use Adan with its default weight decay (0.02) and βs (β1 = 0.02, β2 = 0.08, and β3 = 0.01), and choose the learning rate as 0.001. For fine-tuning, we consider a limited hyper-parameter sweep for each task, with a batch size of 16 and learning rates in {2e-5, 4e-5}, and use Adan with β1 = 0.02, β2 = 0.01, and β3 = 0.01 and weight decay 0.01. Following the conventional setting, we run each fine-tuning experiment three times and report the median performance in Table 6. Same as the official setting, on MNLI we report the mismatched and matched accuracy, and we report the Matthews correlation and Pearson correlation on the CoLA and STS-B tasks, respectively. The performance on the other tasks is measured by classification accuracy. The performance of our reproduced model (second row) is slightly better than the vanilla BERT results reported in the Huggingface transformers library (Wolf et al., 2020) (a widely used codebase for transformers in NLP), since the vanilla Bookcorpus data in (Wolf et al., 2020) is not available and we thus train on the latest Bookcorpus data version.

B.1 RESULTS ON RESNET-18

Since some well-known deep optimizers also test ResNet-18 for 90 epochs under the official vanilla training setting in (He et al., 2016), we also run Adan for 90 epochs under this setting for more comparison. Table 9 shows that Adan consistently outperforms SGD and all the compared adaptive optimizers. Note that for this setting, it is not easy for adaptive optimizers to surpass SGD due to the absence of heavy-tailed gradient noise, which is the crucial factor helping adaptive optimizers beat SGD (Zhang et al., 2020).

B.2 DETAILED COMPARISON AND CONVERGENCE CURVE

Besides AdamW, we also compare Adan with several other popular optimizers, including Adam, SGD-M, and LAMB, on ViT-S. Table 10 shows that SGD, Adam, and LAMB perform poorly on ViT-S, which is also observed in the works (Xiao et al., 2021; Nado et al., 2021). These results demonstrate that the decoupled weight decay in Adan and AdamW is much more effective than 1) the vanilla weight decay, namely the commonly used ℓ2 regularization in SGD, and 2) using no weight decay at all, since, as shown in Eqn. (6), the decoupled weight decay is a dynamic regularization along the training trajectory and can better regularize the loss. Compared with AdamW, the advantages of Adan mainly come from its faster convergence, shown in Figure 2(b) and discussed below. Figure 2 shows that Adan converges faster than the other baselines in terms of both training loss and test loss, which also partly explains its good performance over the other optimizers.

Discussion on convergence complexity. Under the corresponding assumptions, most compared optimizers already achieve the optimal complexity in terms of the dependence on the accuracy ϵ, and their complexities differ only in their constant factors, e.g. c_2, c_∞ and d. For instance, with a Lipschitz gradient but without a Lipschitz Hessian, most optimizers have complexity O(x·ϵ^{-4}), which matches the lower bound Ω(ϵ^{-4}) in Arjevani et al. (2019), where the constant factor x varies across optimizers, e.g. x = c_∞^2·d for Adam-type optimizers, x = c_2^6 for Adabelief, x = c_2^2·d for LAMB, and x = c_∞^{2.5} for Adan. So under the same conditions, one cannot improve the complexity dependence on ϵ but can improve the constant factors, which are significant, especially for over-parameterized networks. Actually, we empirically find c_∞ ≈ 8.2, c_2 ≈ 430, and d = 2.2 × 10^7 on ViT-S across different optimizers, e.g. AdamW, Adam, Adan, and LAMB.
In the extreme case, under the widely used Lipschitz gradient assumption, the complexity bound of Adan is about 7.6 × 10^6 times smaller than that of Adam, 3.3 × 10^13 times smaller than that of AdaBelief, and 2.1 × 10^10 times smaller than that of LAMB. For ResNet50, we also observe c_∞ = O(78), c_2 = O(970), and d = 2.5 × 10^7, which also implies a large improvement of Adan over the other optimizers.
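The constant-factor comparison above is simple arithmetic; the sketch below recomputes the reported ratios from the empirical ViT-S constants quoted in the text (the exact ratios depend on how the constants are rounded):

```python
# Recompute the complexity-constant ratios from the empirical ViT-S values
# quoted above: c_inf ~ 8.2, c_2 ~ 430, d = 2.2e7.
c_inf, c_2, d = 8.2, 430.0, 2.2e7

adan = c_inf ** 2.5        # constant factor x for Adan
adam = c_inf ** 2 * d      # Adam-type optimizers
adabelief = c_2 ** 6       # AdaBelief
lamb = c_2 ** 2 * d        # LAMB

print(f"Adam/Adan      ratio: {adam / adan:.2e}")       # ~7.7e6, paper reports 7.6e6
print(f"AdaBelief/Adan ratio: {adabelief / adan:.2e}")  # ~3.3e13
print(f"LAMB/Adan      ratio: {lamb / adan:.2e}")       # ~2.1e10
```

The small discrepancy on the first ratio comes purely from rounding of the empirical constants.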

B.3 RESULTS ON REINFORCEMENT LEARNING TASKS

Here we evaluate Adan on reinforcement learning tasks. Specifically, we replace the default Adam optimizer in PPO (Duan et al., 2016), one of the most popular policy gradient methods, and make no other changes to PPO. For brevity, we call this new PPO version "PPO-Adan". Then we test PPO and PPO-Adan on several games, which are continuous control environments simulated by the standard and widely used engine MuJoCo (Todorov et al., 2012). For these test games, the agents receive a reward at each step. Following the standard evaluation protocol, we run each game under 10 different and independent random seeds (i.e. 1 ∼ 10), and test the performance for 10 episodes every 30,000 steps. All these experiments are based on the widely used codebase Tianshou (Weng et al., 2021). For fairness, we use the default hyper-parameters in Tianshou, e.g. batch size, discount factor, and GAE parameter. We use Adan with its default βs (β1 = 0.02, β2 = 0.08, and β3 = 0.01). Following the default setting, we do not adopt weight decay and choose the learning rate as 3e-4. We report the results on four test games in Figure 3, in which the solid line denotes the average episode reward in evaluation and the shaded region is its 75% confidence interval. From Figure 3, one can observe that on the four test games, PPO-Adan achieves much higher rewards than vanilla PPO, which uses Adam as its optimizer. These results demonstrate the advantages of Adan over Adam, since PPO-Adan simply replaces the Adam optimizer in PPO with our Adan and makes no other changes.

B.4 RESULTS ON LSTM

To begin with, we test Adan on LSTMs (Schmidhuber et al., 1997) using the Penn TreeBank dataset (Marcinkiewicz, 1994), and report the perplexity (the lower, the better) on the test set in Table 11. We follow the exact experimental setting of AdaBelief (Zhuang et al., 2020). Indeed, all our implementations are also based on the code provided by AdaBelief (Zhuang et al., 2020), and we use the default settings for all the hyper-parameters it provides, since it offers more baselines for a fair comparison. For Adan, we utilize its default weight decay (0.02) and βs (β1 = 0.02, β2 = 0.08, and β3 = 0.01), and choose the learning rate as 0.01. Table 11 shows that on the three LSTM models, Adan always achieves the lowest perplexity, with an overall average perplexity improvement of about 1.0 over the runner-up. Moreover, when the LSTM depth increases, the advantage of Adan becomes more remarkable.
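As a reminder of what the reported metric measures, perplexity is the exponential of the average per-token cross-entropy (negative log-likelihood); a minimal illustration (the loss values are made up):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# e.g. a model with an average loss of ln(60) nats/token has perplexity 60
losses = [math.log(60.0)] * 4
print(perplexity(losses))
```

A 1.0 improvement in perplexity thus corresponds to a small but consistent reduction in average per-token loss.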

B.5.1 ROBUSTNESS TO MOMENTUM COEFFICIENTS

Here we choose MAE to investigate the effects of the momentum coefficients (βs) on Adan, since, as shown in MAE, its pre-training is sensitive to the momentum coefficients of AdamW. To this end, following MAE, we pretrain and fine-tune ViT-B on ImageNet for 800 pre-training and 100 fine-tuning epochs. We fix one of (β1, β2, β3) and tune the others. Figure 4 shows that with only 800 pre-training epochs, Adan achieves 83.7%+ accuracy in most cases and outperforms the official 83.6% accuracy obtained by AdamW with 1600 pre-training epochs, indicating the robustness of Adan to its βs. We also observe that 1) Adan is not sensitive to β2; 2) β1 has a certain impact on Adan, namely the smaller (1.0 - β1) is, the worse the accuracy; and 3) similar to the findings of MAE, a small second-order coefficient (1.0 - β3) can improve the accuracy. The smaller (1.0 - β3) is, the more current landscape information the optimizer utilizes to adjust the coordinate-wise learning rate. Perhaps the complex pre-training task of MAE benefits more from this local geometric information.

B.5.2 ROBUSTNESS TO TRAINING SETTINGS

Conventionally, many works (Liu et al., 2021; 2022b; Touvron et al., 2022; Wightman et al., 2021; Touvron et al., 2021)

C NOTATION

We provide some notation that is frequently used throughout the paper. Scalars, e.g. c, are in normal font, and vectors are in bold lowercase. Given two vectors x and y, x ≥ y means that (x - y) is a non-negative vector; x/y (or x over y) denotes element-wise vector division; x • y denotes element-wise multiplication, and (x)^2 = x • x; ⟨•, •⟩ is the inner product. Given a non-negative vector n ≥ 0, we define the weighted norm ∥x∥^2_{√n} := ⟨x, (√n + ε) • x⟩. Unless otherwise specified, ∥x∥ is the vector ℓ2 norm. Note that E(x) is the expectation of the random vector x.
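To make the weighted norm concrete, a tiny sketch (the vectors and ε below are illustrative, not values from the paper):

```python
import math

EPS = 1e-8

def weighted_norm_sq(x, n, eps=EPS):
    """||x||^2_{sqrt(n)} := <x, (sqrt(n) + eps) * x>, computed element-wise."""
    return sum((math.sqrt(ni) + eps) * xi * xi for xi, ni in zip(x, n))

x = [1.0, -2.0, 3.0]
n = [4.0, 0.0, 1.0]   # a non-negative second-moment vector
print(weighted_norm_sq(x, n))  # (2+eps)*1 + eps*4 + (1+eps)*9 = 11 + O(eps)
```

Each coordinate of x is weighted by its own curvature estimate √n_i + ε, which is exactly how the adaptive step size enters the analysis.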

D PROOF OF LEMMA 1: EQUIVALENCE BETWEEN THE AGD AND AGD II
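Before the formal derivation, the claimed equivalence can be sanity-checked numerically: AGD and AGD II generate the same trajectory under the change of variables θ̄_k = θ_k - ηα m_{k-1}. The sketch below runs both on a 1-D quadratic f(θ) = θ^2/2 (f, η, and α are illustrative choices, not values from the paper; conventions m_{-1} = 0, m̄_{-1} = 0, ∇f(θ̄_{-1}) = 0):

```python
# Sanity check of Lemma 1: AGD and AGD II coincide under
# theta_bar_k = theta_k - eta * alpha * m_{k-1}.

def grad(theta):          # f(theta) = theta^2 / 2
    return theta

eta, alpha = 0.1, 0.9
K = 20

# --- AGD: gradient taken at the extrapolation point ---
theta, m = 1.0, 0.0       # m holds m_{k-1}, with m_{-1} = 0
agd_bar = []              # records theta_k - eta * alpha * m_{k-1}
for _ in range(K):
    agd_bar.append(theta - eta * alpha * m)
    g = grad(theta - eta * alpha * m)
    m = alpha * m + g
    theta = theta - eta * m

# --- AGD II: gradient difference replaces the extrapolation ---
tb, mb, g_prev = 1.0, 0.0, 0.0
agd2 = []
for _ in range(K):
    agd2.append(tb)
    g = grad(tb)
    mb = alpha * mb + g + alpha * (g - g_prev)
    g_prev = g
    tb = tb - eta * mb

assert all(abs(a - b) < 1e-9 for a, b in zip(agd_bar, agd2))
print("AGD and AGD II trajectories coincide.")
```

Note that AGD II never evaluates the gradient at an extrapolated point, which is the practical point of the reformulation.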

In this section, we show how to derive AGD II from AGD. For convenience, we omit the noise term ζ_k and let α := 1 - β1. AGD reads:

g_k = ∇f(θ_k - ηα m_{k-1}),   m_k = α m_{k-1} + g_k,   θ_{k+1} = θ_k - η m_k.

We can get:

θ_{k+1} - ηα m_k = θ_k - η m_k - ηα m_k
= θ_k - η(1 + α)(α m_{k-1} + ∇f(θ_k - ηα m_{k-1}))
= θ_k - ηα m_{k-1} - ηα^2 m_{k-1} - η(1 + α)∇f(θ_k - ηα m_{k-1}).   (8)

Let θ̄_{k+1} := θ_{k+1} - ηα m_k and m̄_k := α^2 m_{k-1} + (1 + α)∇f(θ_k - ηα m_{k-1}) = α^2 m_{k-1} + (1 + α)∇f(θ̄_k). Then, by Eq. (8), we have:

θ̄_{k+1} = θ̄_k - η m̄_k.   (9)

On the other hand, since m̄_{k-1} = α^2 m_{k-2} + (1 + α)∇f(θ̄_{k-1}) and m_{k-1} = α m_{k-2} + ∇f(θ̄_{k-1}), we have:

m̄_k - α m̄_{k-1} = α^2 m_{k-1} + (1 + α)∇f(θ̄_k) - α m̄_{k-1}
= (1 + α)∇f(θ̄_k) + α^2 (α m_{k-2} + ∇f(θ̄_{k-1})) - α m̄_{k-1}
= (1 + α)∇f(θ̄_k) + α (α^2 m_{k-2} + α∇f(θ̄_{k-1}) - m̄_{k-1})
= (1 + α)∇f(θ̄_k) - α∇f(θ̄_{k-1})
= ∇f(θ̄_k) + α (∇f(θ̄_k) - ∇f(θ̄_{k-1})).   (10)

Finally, combining Eq. (9) and Eq. (10), we obtain AGD II:

m̄_k = α m̄_{k-1} + ∇f(θ̄_k) + α (∇f(θ̄_k) - ∇f(θ̄_{k-1})),   θ̄_{k+1} = θ̄_k - η m̄_k.

E CONVERGENCE ANALYSIS WITH LIPSCHITZ GRADIENT

Before starting the proof, we first provide several notations. Let

F_k(θ) := E_ζ[f(θ, ζ)] + (λ_k/2)∥θ∥^2_{√n_k},   µ := √2 β3 c_∞/ε,   ∥x∥^2_{√n_k} := ⟨x, (√n_k + ε) • x⟩,   λ_k := λ(1 - µ)^k.

Moreover, we let θ̃_k := (√n_k + ε) • θ_k.

Lemma 2. Assume f(•) is L-smooth, and let

θ_{k+1} = argmin_θ { (λ_k/2)∥θ∥^2_{√n_k} + f(θ_k) + ⟨u_k, θ - θ_k⟩ + (1/2η)∥θ - θ_k∥^2_{√n_k} }.

With η ≤ min{ε/(3L), 1/(10λ)}, we have:

F_{k+1}(θ_{k+1}) ≤ F_k(θ_k) - (η/4c_∞)∥u_k + λ_k θ̃_k∥^2 + (η/2ε)∥g_k - u_k∥^2,

where g_k := ∇f(θ_k).

Proof. We denote p_k := u_k/(√n_k + ε). By the optimality condition of θ_{k+1}, we have:

λ_k θ_k + p_k = (λ_k θ̃_k + u_k)/(√n_k + ε) = ((1 + ηλ_k)/η)(θ_k - θ_{k+1}).   (11)
Then for η ≤ ε/(3L), we have:

F_{k+1}(θ_{k+1}) ≤ f(θ_k) + ⟨∇f(θ_k), θ_{k+1} - θ_k⟩ + (L/2)∥θ_{k+1} - θ_k∥^2 + (λ_{k+1}/2)∥θ_{k+1}∥^2_{√n_{k+1}}
(a) ≤ f(θ_k) + ⟨∇f(θ_k), θ_{k+1} - θ_k⟩ + (L/2)∥θ_{k+1} - θ_k∥^2 + (λ_k/2)∥θ_{k+1}∥^2_{√n_k}
(b) ≤ F_k(θ_k) + ⟨θ_{k+1} - θ_k, λ_k θ_k + g_k/(√n_k + ε)⟩_{√n_k} + ((L/ε + λ_k)/2)∥θ_{k+1} - θ_k∥^2_{√n_k}
= F_k(θ_k) + ((L/ε + λ_k)/2)∥θ_{k+1} - θ_k∥^2_{√n_k} + ⟨θ_{k+1} - θ_k, λ_k θ_k + p_k + (g_k - u_k)/(√n_k + ε)⟩_{√n_k}
(c) = F_k(θ_k) + ((L/ε + λ_k)/2 - (1 + ηλ_k)/η)∥θ_{k+1} - θ_k∥^2_{√n_k} + ⟨θ_{k+1} - θ_k, (g_k - u_k)/(√n_k + ε)⟩_{√n_k}
(d) ≤ F_k(θ_k) + (L/(2ε) - 1/η)∥θ_{k+1} - θ_k∥^2_{√n_k} + (1/2η)∥θ_{k+1} - θ_k∥^2_{√n_k} + (η/2ε)∥g_k - u_k∥^2
≤ F_k(θ_k) - (1/3η)∥θ_{k+1} - θ_k∥^2_{√n_k} + (η/2ε)∥g_k - u_k∥^2
≤ F_k(θ_k) - (η/4c_∞)∥u_k + λ_k θ̃_k∥^2 + (η/2ε)∥g_k - u_k∥^2,

where ⟨x, y⟩_{√n_k} := ⟨x, (√n_k + ε) • y⟩. Here (a) comes from the fact λ_{k+1}(1 - µ)^{-1} = λ_k and Proposition 3, which gives ((√n_k + ε)/(√n_{k+1} + ε))_i ≥ 1 - µ and hence:

λ_{k+1}∥θ_{k+1}∥^2_{√n_{k+1}} ≤ (λ_{k+1}/(1 - µ))∥θ_{k+1}∥^2_{√n_k} = λ_k∥θ_{k+1}∥^2_{√n_k};

(b) is from the expansion ∥θ_{k+1}∥^2_{√n_k} = ∥θ_k∥^2_{√n_k} + 2⟨θ_{k+1} - θ_k, θ_k⟩_{√n_k} + ∥θ_{k+1} - θ_k∥^2_{√n_k}; (c) is due to Eqn. (11); for (d), we utilize Young's inequality:

⟨θ_{k+1} - θ_k, (g_k - u_k)/(√n_k + ε)⟩_{√n_k} ≤ (1/2η)∥θ_{k+1} - θ_k∥^2_{√n_k} + (η/2ε)∥g_k - u_k∥^2;

and the last inequality comes from Eqn. (11) and η ≤ 1/(10λ), which imply:

(1/3η)∥θ_{k+1} - θ_k∥^2_{√n_k} = (η/3)∥(u_k + λ_k θ̃_k)/((√n_k + ε)^{1/2}(1 + ηλ_k))∥^2 ≥ (η/4c_∞)∥u_k + λ_k θ̃_k∥^2.

Theorem 1. Suppose Assumptions 1 and 2 hold. Let c_l := 1/c_∞ and c_u := 1/ε. With β3 c_∞/ε ≪ 1,

η^2 ≤ c_l β1^2/(8 c_u^3 L^2),   max{β1, β2} ≤ c_l ϵ^2/(96 c_u σ^2),   T ≥ max{24Δ_0/(η c_l ϵ^2), 24 c_u σ^2/(β1 c_l ϵ^2)},

where Δ_0 := F_0(θ_0) - f* and f* := min_θ E_ζ[f(θ, ζ)]. Then, letting u_k := m_k + (1 - β1)v_k, we have:

(1/(T+1)) Σ_{k=0}^{T} E∥u_k + λ_k θ̃_k∥^2 ≤ ϵ^2,   (1/(T+1)) Σ_{k=0}^{T} E∥m_k - g_k^full∥^2 ≤ ϵ^2/4,   (1/(T+1)) Σ_{k=0}^{T} E∥v_k∥^2 ≤ ϵ^2/4.

Hence, we have:

(1/(T+1)) Σ_{k=0}^{T} E∥∇_θ((λ_k/2)∥θ∥^2_{√n_k} + E_ζ[f(θ, ζ)])|_{θ=θ_k}∥^2 ≤ 4ϵ^2.

Proof. For convenience, we let u_k := m_k + (1 - β1)v_k and g_k^full := E_ζ[∇f(θ_k, ζ)].
We have:

∥u_k - g_k^full∥^2 ≤ 2∥m_k - g_k^full∥^2 + 2(1 - β1)^2∥v_k∥^2.

By Lemma 2, Lemma 5, and Lemma 6, we already have:

F_{k+1}(θ_{k+1}) ≤ F_k(θ_k) - (η c_l/4)∥u_k + λ_k θ̃_k∥^2 + η c_u ∥g_k^full - m_k∥^2 + η c_u (1 - β1)^2 ∥v_k∥^2,   (12)
E∥m_{k+1} - g_{k+1}^full∥^2 ≤ (1 - β1)E∥m_k - g_k^full∥^2 + ((1 - β1)^2 L^2/β1)E∥θ_{k+1} - θ_k∥^2 + β1^2 σ^2,   (13)
E∥v_{k+1}∥^2 ≤ (1 - β2)E∥v_k∥^2 + 2β2 E∥g_{k+1}^full - g_k^full∥^2 + 3β2^2 σ^2.   (14)

Then, by adding Eq. (12) with (η c_u/β1) × Eq. (13) and (η c_u (1 - β1)^2/β2) × Eq. (14), we can get:

E(Φ_{k+1}) ≤ E[Φ_k - (η c_l/4)∥u_k + λ_k θ̃_k∥^2] + (η c_u/β1)[((1 - β1)^2 L^2/β1)E∥θ_{k+1} - θ_k∥^2 + β1^2 σ^2] + (η c_u (1 - β1)^2/β2)E[2β2 L^2 ∥θ_{k+1} - θ_k∥^2 + 3β2^2 σ^2]
≤ E[Φ_k - (η c_l/4)∥u_k + λ_k θ̃_k∥^2] + η c_u L^2 ((1 - β1)^2/β1^2 + 2(1 - β1)^2)E∥θ_{k+1} - θ_k∥^2 + (β1 + 3β2) η c_u σ^2
(a) ≤ E[Φ_k - (η c_l/4)∥u_k + λ_k θ̃_k∥^2] + (η c_u L^2/β1^2)E∥θ_{k+1} - θ_k∥^2 + 4β_m η c_u σ^2
(b) ≤ E[Φ_k + ((η c_u)^3 L^2/β1^2 - η c_l/4)∥u_k + λ_k θ̃_k∥^2] + 4β_m η c_u σ^2
≤ E[Φ_k - (η c_l/8)∥u_k + λ_k θ̃_k∥^2] + 4β_m η c_u σ^2,

where we let:

Φ_k := F_k(θ_k) - f* + (η c_u/β1)∥m_k - g_k^full∥^2 + (η c_u (1 - β1)^2/β2)∥v_k∥^2,   β_m := max{β1, β2} ≤ 2/3,   η^2 ≤ c_l β1^2/(8 c_u^3 L^2);

for (a), when β1 ≤ 2/3, we have (1 - β1)^2/β1^2 + 2(1 - β1)^2 < 1/β1^2, and (b) is due to Eq. (11) from Lemma 2. Hence, we have:

Σ_{k=0}^{T} E(Φ_{k+1}) ≤ Σ_{k=0}^{T} E(Φ_k) - (η c_l/8) Σ_{k=0}^{T} E∥u_k + λ_k θ̃_k∥^2 + (T + 1) · 4η c_u β_m σ^2.

Hence, we can get:

(1/(T+1)) Σ_{k=0}^{T} E∥u_k + λ_k θ̃_k∥^2 ≤ 8Φ_0/(η c_l T) + 32 c_u β_m σ^2/c_l = 8Δ_0/(η c_l T) + 8 c_u σ^2/(β1 c_l T) + 32 c_u β_m σ^2/c_l ≤ ϵ^2,

where Δ_0 := F_0(θ_0) - f*, β_m ≤ c_l ϵ^2/(96 c_u σ^2), and T ≥ max{24Δ_0/(η c_l ϵ^2), 24 c_u σ^2/(β1 c_l ϵ^2)}. This finishes the first part of the theorem. From Eq. (13), we can conclude that:

(1/(T+1)) Σ_{k=0}^{T} E∥m_k - g_k^full∥^2 ≤ σ^2/(β1 T) + L^2 η^2 c_u^2 ϵ^2/β1^2 + β1 σ^2 < ϵ^2/4.

From Eq. (14), we can conclude that:

(1/(T+1)) Σ_{k=0}^{T} E∥v_k∥^2 ≤ 2L^2 η^2 c_u^2 ϵ^2 + 3β2 σ^2 < ϵ^2/4.

Finally, we have:

(1/(T+1)) Σ_{k=0}^{T} E∥∇_θ((λ_k/2)∥θ∥^2_{√n_k} + E_ζ[f(θ, ζ)])|_{θ=θ_k}∥^2 ≤ (1/(T+1)) Σ_{k=0}^{T} E[2∥u_k + λ_k θ̃_k∥^2 + 4∥m_k - g_k^full∥^2 + 4∥v_k∥^2] ≤ 4ϵ^2.

Now, we have finished the proof.

F FASTER CONVERGENCE WITH LIPSCHITZ HESSIAN

For convenience, we let λ = 0, β1 = β2 = β, and β3 = β^2 in the following proof. To handle the weight decay term in the proof, we refer to the previous section for more details. For ease of notation, we use x instead of θ to denote the variable to be optimized, and abbreviate E_ζ[f(θ_k, ζ)] as f(θ_k).

F.1 REFORMULATION

Algorithm 2: Nesterov Adaptive Momentum Estimation Reformulation
Input: initial point x_0, step size η, average coefficient β, and ε.
begin
  while k < K do
    get stochastic gradient estimator g_k at x_k;
    m̄_k = (1 - β) m̄_{k-1} + β(g_k + (1 - β)(g_k - g_{k-1}));
    n_k = (1 - β^2) n_{k-1} + β^2 (g_{k-1} + (1 - β)(g_{k-1} - g_{k-2}))^2;
    η_k = η/(√n_k + ε);
    y_{k+1} = x_k - η_k β g_k;
    x_{k+1} = y_{k+1} + (1 - β)[(y_{k+1} - y_k) + (η_{k-1} - η_k)(m̄_{k-1} - β g_{k-1})];
    if (k + 1) Σ_{t=0}^{k} ∥(√n_t + ε)^{1/2} • (y_{t+1} - y_t)∥^2 ≥ R^2 then
      get stochastic gradient estimator g_0 at x_{k+1};
      m̄_0 = g_0, n_0 = g_0^2, x_0 = y_0 = x_{k+1}, x_1 = y_1 = x_0 - η m̄_0/(√n_0 + ε), k = 1;
    end if
  end while
  K_0 = argmin_{⌊K/2⌋ ≤ k ≤ K-1} ∥(√n_k + ε)^{1/2} • (y_{k+1} - y_k)∥;
end
Output: x̄ := (1/K_0) Σ_{k=1}^{K_0} x_k

We first prove the equivalence between Algorithm 1 and Algorithm 2. The main iteration in Algorithm 1 is:

m_k = (1 - β) m_{k-1} + β g_k,
v_k = (1 - β) v_{k-1} + β(g_k - g_{k-1}),
x_{k+1} = x_k - η_k • (m_k + (1 - β) v_k).

Letting m̄_k := m_k + (1 - β) v_k, we can merge the two moving averages:

m̄_k = (1 - β) m̄_{k-1} + β(g_k + (1 - β)(g_k - g_{k-1})),   x_{k+1} = x_k - η_k • m̄_k.

We let y_{k+1} := x_{k+1} + η_k • (m̄_k - β g_k); then we can get:

y_{k+1} = x_{k+1} + η_k • m̄_k - β η_k • g_k = x_{k+1} + x_k - x_{k+1} - β η_k • g_k = x_k - β η_k • g_k.

On one hand, we have x_{k+1} = x_k - η_k • m̄_k = y_{k+1} - η_k • (m̄_k - β g_k). On the other hand:

η_k • (m̄_k - β g_k) = (1 - β) η_k • (m̄_{k-1} + β(g_k - g_{k-1}))
= (1 - β) η_k • ((x_{k-1} - x_k)/η_{k-1} + β(g_k - g_{k-1}))
= (1 - β)(η_k/η_{k-1}) • (x_{k-1} - x_k + β η_{k-1} • (g_k - g_{k-1}))
= (1 - β)(η_k/η_{k-1}) • (y_k - x_k + β η_{k-1} • g_k)
= (1 - β)(η_k/η_{k-1}) • (y_k - y_{k+1} - β(η_k - η_{k-1}) • g_k)
= (1 - β)[(y_k - y_{k+1}) + ((η_k - η_{k-1})/η_{k-1}) • (y_k - y_{k+1} - β η_k • g_k)]
= (1 - β)[(y_k - y_{k+1}) + ((η_k - η_{k-1})/η_{k-1}) • (y_k - x_k)]
= (1 - β)[(y_k - y_{k+1}) + (η_k - η_{k-1}) • (m̄_{k-1} - β g_{k-1})].

Hence, we can conclude that:

x_{k+1} = y_{k+1} + (1 - β)[(y_{k+1} - y_k) + (η_{k-1} - η_k) • (m̄_{k-1} - β g_{k-1})].
The main iteration in Algorithm 1 thus becomes:

y_{k+1} = x_k - β η_k • g_k,
x_{k+1} = y_{k+1} + (1 - β)[(y_{k+1} - y_k) + ((η_{k-1} - η_k)/η_{k-1}) • (y_k - x_k)].
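The merge of the two moving averages can be checked numerically. The sketch below runs the (m_k, v_k) form of Algorithm 1 and the merged m̄_k form side by side on a 1-D quadratic; the gradient, β, β2, and η are illustrative, and the second-moment update is simplified to use g_k^2 in both forms so that the two iterations share the same η_k:

```python
import math

def grad(x):                      # f(x) = x^2 / 2
    return x

beta, beta2, eta, eps = 0.1, 0.08, 0.05, 1e-8
K = 15

def run(merged):
    """Run K steps; merged=False uses (m, v), merged=True uses m_bar."""
    x, g_prev = 1.0, None
    m = v = mb = n = 0.0
    xs = []
    for _ in range(K):
        g = grad(x)
        if g_prev is None:        # initialization: m_0 = g_0, v_0 = 0
            m, v, mb, n = g, 0.0, g, g * g
        else:
            m = (1 - beta) * m + beta * g
            v = (1 - beta) * v + beta * (g - g_prev)
            mb = (1 - beta) * mb + beta * (g + (1 - beta) * (g - g_prev))
            n = (1 - beta2) * n + beta2 * g * g
        eta_k = eta / (math.sqrt(n) + eps)
        x = x - eta_k * (mb if merged else m + (1 - beta) * v)
        g_prev = g
        xs.append(x)
    return xs

xs_a, xs_b = run(merged=False), run(merged=True)
assert all(abs(a - b) < 1e-9 for a, b in zip(xs_a, xs_b))
print("Merged iteration matches Algorithm 1.")
```

Both forms produce identical iterates, confirming m̄_k = m_k + (1 - β)v_k along the whole trajectory.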

F.2 AUXILIARY BOUNDS

We first show some interesting property. Define K to be the iteration number when the 'if condition' triggers, that is, K := min k k k k-1 t=0 ( √ n t + ε) 1/2 • (y t+1 -y t ) 2 > R 2 . Proposition 1. Given k ≤ K and β ≤ ε/ √ 2c ∞ + ε , we have: ( √ n k + ε) 1/2 • (x k -y k ) ≤ R. Proof. First of all, we let nk := √ n k + ε 1/2 . Due to Proposition 3, we have: √ n k-1 + ε √ n k + ε i ∈ 1 - √ 2β 2 c ∞ ε , 1 + √ 2β 2 c ∞ ε , then, we get: nk ≤ 1 - √ 2β 2 c ∞ ε -1/2 nk-1 ≤ (1 -β) -1/4 nk-1 , where we use the fact β ≤ ε/ 2 √ 2c ∞ + ε .For any 1 ≤ k ≤ K, we have: ∥n k • (y k -y k-1 )∥ 2 ≤ (1 -β) -1/2 ∥n k-1 • (y k -y k-1 )∥ 2 ≤(1 -β) -1 k-1 t=1 ∥n t • (y t+1 -y t )∥ 2 ≤ R 2 k(1 -β) , hence, we can conclude that: ∥n k • (y k -y k-1 )∥ 2 ≤ R 2 k(1 -β) . On the other hand, by Eq.( 15), we have: x k+1 -y k+1 = (1 -β) (y k+1 -y k ) + η k -η k-1 η k-1 (x k -y k ) , and hence, ∥n k • (x k -y k )∥ ≤ (1 -β) ∥n k • (y k -y k-1 )∥ + η k-1 -η k-2 η k-2 ∞ ∥n k • (x k-1 -y k-1 )∥ (a) ≤ 1 -β R √ k + (1 -β) √ 2β 2 c ∞ ε 1 - √ 2β 2 c ∞ ε -1/2 ∥n k-1 • (x k-1 -y k-1 )∥ ≤ 1 -β R √ k + β(1 -β) 3/4 ∥n k-1 • (x k-1 -y k-1 )∥ ≤ 1 -βR 1 √ k + β(1 -β) 3/4 √ k -1 + • • • + β(1 -β) 3/4 k-1 (b) ≤ 1 -βR k-1 t=1 1 t 2 1/4 k t=0 β(1 -β) 3/4 4t/3 3/4 (c) < R, where (a) comes from Eq.( 16) and the proposition 3, (b) is the application of Hölder's inequality and (c) comes from the facts when β ≤ 1/2: ∞ t=1 1 t 2 = π 2 6 , 1 -β k t=0 β(1 -β) 3/4 4t/3 3/4 ≤ (1 -β) 2/3 1 -β 4/3 (1 -β) 3/4 . F.3 DECREASE OF ONE RESTART CYCLE Lemma 3. Suppose that Assumptions 1-2 hold. Let R = O ϵ 0.5 , β = O ϵ 2 , η = O ϵ 1.5 , K ≤ K = O ϵ -2 . Then we have: E (f (y K ) -f (x 0 )) = -O ϵ 1.5 . ( ) Proof. Recall Eq.( 15) and denote g f ull k := ∇f (θ k ) for convenience:      y k+1 = x k -βη k • g f ull k + ξ k x k+1 -y k+1 = (1 -β) (y k+1 -y k ) + η k -η k-1 η k-1 • (x k -y k ) , In this proof, we let nk := √ n k + ε 1/2 , and hence η k = η/n 2 k . 
By the L-smoothness condition, for 1 ≤ k ≤ K, we have: E (f (y k+1 ) -f (x k )) ≤ E ⟨g k , y k+1 -x k ⟩ + L 2 ∥y k+1 -x k ∥ 2 = E - y k+1 -x k βη k + ξ k , y k+1 -x k + L 2 ∥y k+1 -x k ∥ 2 (a) ≤ E - 1 ηβ ∥n k • (y k+1 -x k )∥ 2 + L 2 ∥y k+1 -x k ∥ 2 + ηβσ 2 ε ≤ E - 1 ηβ ∥n k • (y k+1 -x k )∥ 2 + L 2ε ∥n k • (y k+1 -x k )∥ 2 + ηβσ 2 ε ≤ E - 1 2ηβ ∥n k • (y k+1 -x k )∥ 2 + ηβσ 2 ε , where (a) comes from the facts: E (⟨ξ k , y k+1 -x k ⟩) = E (⟨ξ k , x k -βη k • (g k + ξ k )⟩) = E (⟨ξ k , βη k • ξ k ⟩) ≤ ηβσ 2 ε . and the last inequality is due to Lη ≤ ε. On the other hand, we have: E(f (x k ) -f (y k )) ≤ E ⟨∇f (y k ), x k -y k ⟩ + L 2 ∥x k -y k ∥ 2 = E ⟨g k , x k -y k ⟩ + ⟨∇f (y k ) -∇f (x k ), x k -y k ⟩ + L 2 ∥x k -y k ∥ 2 ≤ E ⟨g k , x k -y k ⟩ + 1 2L ∥∇f (y k ) -∇f (x k )∥ 2 + L 2 ∥x k -y k ∥ 2 + L 2 ∥x k -y k ∥ 2 ≤ E ⟨g k , x k -y k ⟩ + 3L 2 ∥x k -y k ∥ 2 = E - y k+1 -x k βη k + ξ k , x k -y k + 3L 2 ∥x k -y k ∥ 2 = E 1 ηβ n2 k • (y k+1 -x k ), y k -x k + 3L 2 ∥x k -y k ∥ 2 (a) ≤ E 1 2ηβ ∥n k • (y k+1 -x k )∥ 2 + ∥n k • (y k -x k )∥ 2 -∥n k • (y k+1 -y k )∥ 2 + 3L 2 ∥x k -y k ∥ 2 (b) ≤ E 1 2ηβ ∥n k • (y k+1 -x k )∥ 2 -∥n k • (y k+1 -y k )∥ 2 + 1 + β/2 2ηβ ∥n k • (y k -x k )∥ 2 ) where (a) comes from the following facts, and in (b), we use 3Lη ≤ ε 2 : 2 n2 k • (y k+1 -x k ), y k -x k = ∥n k • (y k+1 -x k )∥ 2 +∥n k • (y k -x k )∥ 2 -∥n k • (y k+1 -y k )∥ 2 . 
By combing Eq.( 19) and Eq.( 20), we have: E (f (y k+1 ) -f (y k )) ≤ E - 1 2ηβ ∥n k • (y k+1 -y k )∥ 2 + 1 + β/2 2ηβ ∥n k • (y k -x k )∥ 2 + ηβσ 2 ε (a) ≤ E - 1 2ηβ ∥n k • (y k+1 -y k )∥ 2 + 1 -β/2 -β 2 /2 2ηβ ∥n k-1 • (y k -y k-1 )∥ 2 + 4β 2 R 2 c 2 ∞ ηε 2 + ηβσ 2 ε , where (a) comes from the following fact, and note that by Proposition 1 we already have nk ≤ (1 -β) -1/4 nk-1 : ∥n k • (x k -y k )∥ 2 ≤(1 -β) 2 (1 + α)∥n k • (y k -y k-1 )∥ 2 + (1 + 1 α ) β2 ∥n k • (x k-1 -y k-1 )∥ 2 ≤(1 -β) 3/2 (1 + α)∥n k-1 • (y k -y k-1 )∥ 2 + (1 + 1 α ) β2 ∥n k-1 • (x k-1 -y k-1 )∥ 2 ≤(1 -β)∥n k-1 • (y k -y k-1 )∥ 2 + β2 (1 -β) 3/2 1 -(1 -β) 1/2 ∥n k-1 • (x k-1 -y k-1 )∥ 2 ≤(1 -β)∥n k-1 • (y k -y k-1 )∥ 2 + 2 β2 β ∥n k-1 • (x k-1 -y k-1 )∥ 2 ≤(1 -β)∥n k-1 • (y k -y k-1 )∥ 2 + 4β 3 R 2 c 2 ∞ /ε 2 , where we let β := √ 2β 2 c ∞ /ε, α = (1 -β) -1/2 -1, and the last inequality we use the results in Proposition 1. Summing over k = 2, • • • , K -1, and note that y 1 = x 1 , and hence we have E (f (y 2 ) -f (x 1 )) = E (f (y 2 ) -f (y 1 )) ≤ ηβσc ∞ / √ ε due to Eq. ( 19), then we get: E (f (y K ) -f (y 1 )) ≤ E - 1 4η K-1 t=1 ∥n k • (y t+1 -y t )∥ 2 + 4Kβ 2 R 2 c 2 ∞ ηε 2 + Kηβσ 2 ε . On the other hand, similar to the results given in Eq.( 19), we have: E (f (y 1 ) -f (y 0 )) = E (f (x 1 ) -f (x 0 )) ≤ E - 1 2η ∥n k • (y 1 -y 0 )∥ 2 + ησ 2 ε . Therefore, using βK = O(1) and the restart condition K K-1 t=0 ( √ n t + ε) 1/2 • (y t+1 -y t ) 2 ≥ R 2 , we can get: E (f (y K ) -f (y 0 )) ≤ E - 1 4η K-1 t=0 ∥n k • (y k+1 -y k )∥ 2 + 4Kβ 2 R 2 c 2 ∞ ηε 2 + (Kβ + 1)ησ 2 ε ≤ - R 2 4Kη + 4Kβ 2 R 2 c 2 ∞ ηε 2 + (Kβ + 1)ησ 2 ε = -O R 2 Kη - βR 2 η -η = -O ϵ 1.5 . Now, we finish the proof of this claim.

F.4 GRADIENT IN THE LAST RESTART CYCLE

Before showing the main results, we first provide several definitions for the convenience of proof. Note that, for any k < K we already have: (ε) 1/2 ∥y k -y 0 ∥ ≤ (ε) 1/2 k k-1 t=0 ∥y t+1 -y t ∥ 2 ≤ R. and we have: E (∥x k -x 0 ∥) ≤ E (∥y k -x k ∥ + ∥y k -x 0 ∥) ≤ 2R ε 1/2 , ( ) where we utilize the results from Proposition 1. For each epoch, denote H := ∇ 2 f (x 0 ). We then define: h(y) := g f ull 0 , y -x 0 + 1 2 (y -x 0 ) ⊤ H(y -x 0 ). Recall the Eq. ( 15):      y k+1 = x k -βη k • g f ull k + ξ k = x k -βη k • (∇h(x k ) + δ k + ξ k ) x k+1 -y k+1 = (1 -β) (y k+1 -y k ) + η k -η k-1 η k-1 • (x k -y k ) , where we let δ k := g f ull k -∇h(x k ), and we can get that: E (∥δ k ∥) = E g f ull k -g f ull 0 -H(x k -x 0 ) = E 1 0 ∇ 2 h(x 0 + t(x k -x 0 )) -H (x k -x 0 )dt ≤ ρ 2 E ∥x k -x 0 ∥ 2 ≤ 2ρR 2 ε . Iterations in Eq.( 23) can be viewed as applying the proposed optimizer to the quadratic approximation h(x) with the gradient error δ k , which is in the order of O ρR 2 /ε . Lemma 4. Suppose that Assumptions 1-3 hold. Let B = O ϵ 0.5 , β = O ϵ 2 , η = O ϵ 1.5 , K ≤ K = O ϵ -2 . Then we have: E (∥∇f (x)∥) = O(ϵ), where x := 1 K0-1 K0 k=1 x k . Proof. 
Since h(•) is quadratic, then we have: E (∥∇h(x)∥) = E 1 K 0 -1 K0 k=1 ∇h(x k ) = 1 K 0 -1 E K0 k=1 (βη k ) -1 • (y k+1 -x k ) + ξ k + δ k ≤ 1 (K 0 -1)β E K0 k=1 (βη k ) -1 • (y k+1 -x k ) + 1 (K 0 -1) E K0 k=1 ξ k + 1 (K 0 -1) E K0 k=1 δ k (a) ≤ 1 (K 0 -1)β E K0 k=1 (η k ) -1 • (y k+1 -x k ) + σ √ K 0 -1 + 2ρR 2 ε = 1 (K 0 -1)β E K0 k=1 y k+1 -y k -(1 -β)(y k -y k-1 ) η k -(1 -β) η k-1 -η k-2 η k-2 η k (x k-1 -y k-1 ) + σ √ K 0 -1 + 2ρR 2 ε (b) ≤ 1 (K 0 -1)β E K0 k=1 y k+1 -y k -(1 -β)(y k -y k-1 ) η k + 2βc 1.5 ∞ R ηε + σ √ K 0 -1 + 2ρR 2 ε (c) ≤ 1 (K 0 -1)β E K0 k=1 y k+1 -y k η k - (1 -β)(y k -y k-1 ) η k-1 + 4βc 1.5 ∞ R ηε + σ √ K 0 -1 + 2ρR 2 ε ≤ 1 (K 0 -1)β E y K0 -y K0-1 η K0 + 1 (K 0 -1) E K0-1 k=1 y k+1 -y k η k + 4βc 1.5 ∞ R ηε + σ √ K 0 -1 + 2ρR 2 ε (d) ≤ 1 (K 0 -1) E K0 k=1 y k+1 -y k η k + 4R √ c ∞ βηK 2 + 4βc 1.5 ∞ R ηε + σ √ K 0 -1 + 2ρR 2 ε ≤ √ 2c ∞ ηK E K0 k=1 ( √ n k + ε) 1/2 • (y k+1 -y k ) + 4R √ c ∞ βηK 2 + 4βc 1.5 ∞ B ηε + σ √ K 0 -1 + 2ρR 2 ε ≤ √ 2c ∞ R ηK + 4R √ c ∞ βηK 2 + 4βc 1.5 ∞ R ηε + σ √ K 0 -1 + 2ρR 2 ε =O R ηK + βR η + 1 √ K + R 2 = O(ϵ), where (a) is due to the independence of ξ k 's and Eq.( 24), (b) comes from Propositions 1 and 2: η k-1 -η k-2 η k-2 η k (x k-1 -y k-1 ) ≤ √ n k + ε η √ n k-1 + ε 1/2 η k-1 -η k-2 η k-2 ∞ ∥n k-1 • (x k-1 -y k-1 )∥ ≤ √ n k + ε 1/2 η √ 2β 2 c ∞ ε 1 - √ 2β 2 c ∞ ε -1/2 R ≤ (c ∞ + ε) 1/2 η √ 2β 2 c ∞ ε R (1 -β) 1/4 ≤ 1 1 -β 1/4 2β 2 c 1.5 ∞ R ηε , we use the following bounds in (c): (y k -y k-1 ) η k-1 - (y k -y k-1 ) η k = η k -η k-1 η k-1 η k (y k -y k-1 ) ≤ √ n k-1 + ε 1/2 η η k -η k-1 η k ∞ ( √ n k-1 + ε) 1/2 • (y k -y k-1 ) ≤ √ n k-1 + ε 1/2 η √ 2β 2 c ∞ ε R k ≤ (c ∞ + ε) 1/2 η √ 2β 2 c ∞ ε R k ≤ 2β 2 c 1.5 ∞ R ηεk , (d) is implied by K 0 = argmin ⌊ K 2 ⌋≤k≤K-1 √ n k + ε 1/2 • (y k+1 -y k ) and restart condition: y K0 -y K0-1 η K0 2 ≤ √ n K0 + ε η 2 √ n K0 + ε 1/2 • (y K0 -y K0-1 ) 2 √ n K0 + ε 1/2 • (y K0 -y K0-1 ) 2 ≤ 1 K -⌊K/2⌋ K-1 k=⌊K/2⌋ ( √ n k + ε) 1/2 • (y k+1 -y k ) 2 ≤ 1 K -⌊K/2⌋ K k=1 
( √ n k + ε) 1/2 • (y k+1 -y k ) 2 ≤ 1 K -⌊K/2⌋ R 2 K ≤ 2R 2 K 2 . Finally, we have: E (∥∇f (x)∥) = E (∥∇h(x)∥) + E (∥∇f (x) -∇h(x)∥) = O(ϵ) + 2ρR 2 ε = O(ϵ), where we use the results from Eq.( 24), namely: E (∥∇f (x) -∇h(x)∥) = E ∇f (x) -g f ull 0 -H(x -x 0 ) ≤ ρ 2 E ∥x -x 0 ∥ 2 , and we also note that, by Eq.( 22): E ∥x -x 0 ∥ ≤ 1 K 0 -1 K0 k=1 E ∥x k -x 0 ∥ ≤ 2R ε 1/2 .

F.5 PROOF FOR MAIN THEOREM

Theorem 2. Suppose that Assumptions 1-3 hold. Let R = O(ϵ^0.5), β = O(ϵ^2), η = O(ϵ^1.5), and K ≤ K̄ = O(ϵ^-2). Then Algorithm 1 finds an ϵ-approximate first-order stationary point within at most O(ϵ^-3.5) iterations. Namely, we have:

E(f(y_K) - f(x_0)) = -O(ϵ^1.5),   E(∥∇f(x̄)∥) = O(ϵ).

Proof. Note that at the beginning of each restart cycle in Algorithm 2, we set x_0 to be the last iterate x_K of the previous restart cycle. By Lemma 3, within each cycle we already have:

E(f(y_K) - f(x_0)) = -O(ϵ^1.5).

Summing this inequality over all cycles, say N total restart cycles, we have:

min_x f(x) - f(x_init) ≤ -O(N ϵ^1.5).

Hence, Algorithm 2 terminates within at most O(ϵ^-1.5)Δ_f restart cycles, where Δ_f := f(x_init) - min_x f(x). Since each cycle contains at most K̄ = O(ϵ^-2) iteration steps, the total iteration number is at most O(ϵ^-3.5)Δ_f. On the other hand, by Lemma 4, in the last restart cycle we have E(∥∇f(x̄)∥) = O(ϵ). Now, we obtain the final conclusion of the theorem.
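The counting argument above can be summarized in one line: each restart cycle decreases f by Ω(ϵ^1.5) and runs at most O(ϵ^-2) iterations, so

```latex
\underbrace{O\!\left(\epsilon^{-1.5}\right)\Delta_f}_{\text{number of restart cycles } N}
\;\times\;
\underbrace{O\!\left(\epsilon^{-2}\right)}_{\text{iterations per cycle } \bar K}
\;=\;
O\!\left(\epsilon^{-3.5}\right)\Delta_f,
\qquad \Delta_f := f(x_{\mathrm{init}}) - \min_x f(x).
```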

G AUXILIARY LEMMAS

Proposition 2. If Assumption 2 holds, we have ∥m_k∥_∞ ≤ c_∞ and ∥n_k∥_∞ ≤ c_∞^2.

Proof. By the definition of m_k, we have:

m_k = Σ_{t=0}^{k} c_{k,t} g_t,  where c_{k,t} = β1(1 - β1)^{k-t} for t > 0 and c_{k,0} = (1 - β1)^k.

Similarly, we also have:

n_k = Σ_{t=0}^{k} c'_{k,t}(g_t + (1 - β2)(g_t - g_{t-1}))^2,  where c'_{k,t} = β3(1 - β3)^{k-t} for t > 0 and c'_{k,0} = (1 - β3)^k.

It is obvious that Σ_{t=0}^{k} c_{k,t} = 1 and Σ_{t=0}^{k} c'_{k,t} = 1; hence, we get:

∥m_k∥_∞ ≤ Σ_{t=0}^{k} c_{k,t} ∥g_t∥_∞ ≤ c_∞,   ∥n_k∥_∞ ≤ Σ_{t=0}^{k} c'_{k,t} ∥g_t + (1 - β2)(g_t - g_{t-1})∥^2_∞ ≤ c_∞^2.

Proposition 3. If Assumption 2 holds, we have:

∥(η_k - η_{k-1})/η_{k-1}∥_∞ ≤ √2 β3 c_∞/ε.

Proof. Given any index i ∈ [d] and the definition of η_k, we have:

((η_k - η_{k-1})/η_{k-1})_i = ((√n_{k-1} + ε)/(√n_k + ε))_i - 1 = ((√n_{k-1} - √n_k)/(√n_k + ε))_i.

Note that, by the definition of n_k, we have:

((√n_{k-1} - √n_k)/(√n_k + ε))_i ≤ (|n_{k-1} - n_k|/(√n_k + ε))_i = β3 (|n_{k-1} - (g_k + (1 - β2)(g_k - g_{k-1}))^2|/(√n_k + ε))_i ≤ √2 β3 c_∞/ε;

hence, we have ((η_k - η_{k-1})/η_{k-1})_i ∈ [0, √2 β3 c_∞/ε]. We finish the proof.

Lemma 5. Consider a moving-average sequence m_k = (1 - β)m_{k-1} + β g_k, where g_k = E_ζ[∇f(θ_k, ζ)] + ξ_k, and denote g_k^full := E_ζ[∇f(θ_k, ζ)] for convenience. Then we have:

E∥m_k - g_k^full∥^2 ≤ (1 - β)E∥m_{k-1} - g_{k-1}^full∥^2 + ((1 - β)^2 L^2/β)E∥θ_{k-1} - θ_k∥^2 + β^2 σ^2.

Proof. Note that we have:

m_k - g_k^full = (1 - β)(m_{k-1} - g_{k-1}^full) + (1 - β)(g_{k-1}^full - g_k^full) + β(g_k - g_k^full).

Then, taking expectation on both sides:

E∥m_k - g_k^full∥^2 = (1 - β)^2 E∥m_{k-1} - g_{k-1}^full∥^2 + (1 - β)^2 E∥g_{k-1}^full - g_k^full∥^2 + β^2 σ^2 + 2(1 - β)^2 E⟨m_{k-1} - g_{k-1}^full, g_{k-1}^full - g_k^full⟩
≤ ((1 - β)^2 + (1 - β)^2 a) E∥m_{k-1} - g_{k-1}^full∥^2 + (1 + 1/a)(1 - β)^2 E∥g_{k-1}^full - g_k^full∥^2 + β^2 σ^2
(a) ≤ (1 - β)E∥m_{k-1} - g_{k-1}^full∥^2 + ((1 - β)^2/β)E∥g_{k-1}^full - g_k^full∥^2 + β^2 σ^2
≤ (1 - β)E∥m_{k-1} - g_{k-1}^full∥^2 + ((1 - β)^2 L^2/β)E∥θ_{k-1} - θ_k∥^2 + β^2 σ^2,

where for (a), we set a = β/(1 - β).

Lemma 6. Consider a moving-average sequence v_k = (1 - β)v_{k-1} + β(g_k - g_{k-1}), where g_k = E_ζ[∇f(θ_k, ζ)] + ξ_k, and denote g_k^full := E_ζ[∇f(θ_k, ζ)] for convenience. Then we have:

E∥v_k∥^2 ≤ (1 - β)E∥v_{k-1}∥^2 + 2β E∥g_k^full - g_{k-1}^full∥^2 + 3β^2 σ^2.

Proof. Taking expectation on both sides:

E∥v_k∥^2 = (1 - β)^2 E∥v_{k-1}∥^2 + β^2 E∥g_k - g_{k-1}∥^2 + 2β(1 - β)E⟨v_{k-1}, g_k - g_{k-1}⟩
(a) = (1 - β)^2 E∥v_{k-1}∥^2 + β^2 E∥g_k - g_{k-1}∥^2 + 2β(1 - β)E⟨v_{k-1}, g_k^full - g_{k-1}⟩
(b) ≤ (1 - β)^2 E∥v_{k-1}∥^2 + 2β^2 E∥g_k^full - g_{k-1}^full∥^2 + 2β(1 - β)E⟨v_{k-1}, g_k^full - g_{k-1}⟩ + 3β^2 σ^2
(c) ≤ (1 - β)^2 E∥v_{k-1}∥^2 + 2β^2 E∥g_k^full - g_{k-1}^full∥^2 + 2β(1 - β)E⟨v_{k-1}, g_k^full - g_{k-1}^full⟩ + 3β^2 σ^2
(d) ≤ (1 - β)E∥v_{k-1}∥^2 + 2β E∥g_k^full - g_{k-1}^full∥^2 + 3β^2 σ^2,

where for (a), we utilize the independence between ξ_k and v_{k-1}; for (b), we use:

E∥g_k - g_{k-1}∥^2 ≤ E∥g_k - g_k^full∥^2 + 2E∥g_{k-1}^full - g_{k-1}∥^2 + 2E∥g_k^full - g_{k-1}^full∥^2;

for (c), we use:

E⟨v_{k-1}, g_{k-1}^full - g_{k-1}⟩ = E⟨(1 - β)v_{k-2} + β(g_{k-1} - g_{k-2}), g_{k-1}^full - g_{k-1}⟩ = -β E∥g_{k-1}^full - g_{k-1}∥^2 ≤ 0,

and thus E⟨v_{k-1}, g_k^full - g_{k-1}⟩ ≤ E⟨v_{k-1}, g_k^full - g_{k-1}^full⟩; finally, for (d), we use:

2E⟨v_{k-1}, g_k^full - g_{k-1}^full⟩ ≤ E∥v_{k-1}∥^2 + E∥g_k^full - g_{k-1}^full∥^2.
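The first step of Lemma 5's proof is an exact algebraic decomposition, which can be verified directly on numbers (all values below are arbitrary illustrative floats):

```python
import random

# Check the exact identity used in Lemma 5:
# m_k - gf_k = (1-b)(m_{k-1} - gf_{k-1}) + (1-b)(gf_{k-1} - gf_k) + b(g_k - gf_k),
# where gf_k is the full gradient and g_k = gf_k + noise.

random.seed(0)
b = 0.3
m_prev, gf_prev = random.random(), random.random()
gf_k = random.random()
g_k = gf_k + random.gauss(0.0, 1.0)          # noisy stochastic gradient

m_k = (1 - b) * m_prev + b * g_k             # moving-average update
lhs = m_k - gf_k
rhs = ((1 - b) * (m_prev - gf_prev)
       + (1 - b) * (gf_prev - gf_k)
       + b * (g_k - gf_k))
assert abs(lhs - rhs) < 1e-12
print("Decomposition identity holds.")
```

Taking squared norms and expectations of this identity, and dropping the zero-mean noise cross terms, yields the recursions (13) and (14) used in Theorem 1.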



For presentation convenience, we omit the de-bias term in adaptive gradient methods.
https://github.com/kimiyoung/transformer-xl
https://github.com/thu-ml/tianshou
https://github.com/juntang-zhuang/Adabelief-Optimizer. The reported results in (Zhuang et al., 2020) slightly differ from those in (Chen et al., 2021a) because of their different settings for the LSTM and the training hyper-parameters.



In Figure 2 (a), we plot the curves of training and test loss along the training epochs on ResNet50. One can observe that Adan converges faster than the compared baselines and enjoys the smallest training and test losses, which demonstrates its fast convergence and good generalization ability. To further investigate the fast convergence of Adan, we also plot the training and test curves of ViT-S under Training Setting II in Figure 2 (b). From the results, we can see that Adan consistently converges faster than the baselines.

Figure 2: Training and test curves of various optimizers on the ImageNet dataset. The training loss is larger than the test loss due to the stronger data augmentation used during training.

Figure 3: Comparison of PPO and our PPO-Adan on several RL games simulated by MuJoCo. Here PPO-Adan simply replaces the Adam optimizer in PPO with our Adan and makes no other changes.
Table 11: Test perplexity (the lower, the better) on Penn Treebank for one-, two- and three-layered LSTMs. All results except Adan and Padam in the table are reported by AdaBelief.

Figure 4: Effects of momentum coefficients (β 1 , β 2 , β 3 ) to top-1 accuracy (%) of Adan on ViT-B under MAE training framework (800 pretraining and 100 fine-tuning epochs on ImageNet).Table 12: Top-1 accuracy (%) of ViT-S on ImageNet trained under Training Setting I and II. * is reported in(Touvron et al., 2021).

PRELIMINARIES

Adaptive gradient algorithms, e.g. Adam and AdamW, have become the default choice to train CNNs and ViTs. Unlike SGD, which uses one learning rate for all gradient coordinates, adaptive algorithms adjust the learning rate for each gradient coordinate according to the current geometric curvature of the objective function, and thus converge faster. Take RMSProp and Adam as examples. Given the stochastic gradient estimator g_k := E_{ζ∼D}[∇f(θ_k, ζ)] + ξ_k, e.g. a minibatch gradient, where ξ_k is the gradient noise, RMSProp updates the variable θ as follows:
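The update rule referenced above was lost in extraction; as a minimal sketch, the standard RMSProp step keeps a moving average of squared gradients and rescales each coordinate by it (hyper-parameter values below are illustrative defaults, not the paper's):

```python
import math

def rmsprop_step(theta, g, n, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp update: n tracks a moving average of squared gradients,
    and each coordinate's step is scaled by 1 / (sqrt(n_i) + eps)."""
    n = [beta * ni + (1 - beta) * gi * gi for ni, gi in zip(n, g)]
    theta = [ti - lr * gi / (math.sqrt(ni) + eps)
             for ti, gi, ni in zip(theta, g, n)]
    return theta, n

theta, n = [1.0, -2.0], [0.0, 0.0]
theta, n = rmsprop_step(theta, [0.5, 0.5], n)
print(theta, n)
```

Adam additionally keeps a moving average of the gradient itself (the first moment) and uses it in place of the raw gradient g in the step above.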

Top-1 accuracy (%) of ResNet and ConvNext on ImageNet under their official settings. * and ⋄ are respectively reported in (Wightman et al., 2021; Liu et al., 2022b).

Top-1 accuracy (%) of ViT and Swin on ImageNet. We use their official Training Setting II to train them. * and ⋄ are respectively reported in (Touvron et al., 2021; Liu et al., 2021).

Results on CNN-type Architectures. To train ResNet and ConvNext, we respectively use their official Training Settings I and II. For the SoTA ResNet/ConvNext, the default official optimizer is LAMB/AdamW. From Table 2, one can observe that on ResNet, 1) in most cases, Adan running only 200 epochs can achieve higher or comparable top-1 accuracy on ImageNet compared with the official SoTA result trained by LAMB with 300 epochs; and 2) Adan gains larger improvements over other optimizers when training is insufficient, e.g. 100 epochs. The possible reason for observation 1) is the regularizer separation, which can dynamically adjust the weight decay for each coordinate instead of sharing a common one. Observation 2) can be explained by the faster convergence of Adan: as shown in Figure 1, Adan converges faster than many adaptive gradient optimizers. This faster speed partially comes from its large learning rate guaranteed by The-

Top-1 Acc. (%) of ViT-B and ViT-L trained by MAE under the official Training Setting II. * and ⋄ are respectively reported in (Chen et al., 2022; He et al., 2022).

Top-1 Acc. (%) of ViT-S on ImageNet under Training Setting I.

Results (the higher, the better) of the BERT-base model on the development set of GLUE. The first line is from (Wolf et al., 2020), while the second line is reproduced by us. Similar to the pre-training experiments on MAE, which is also a self-supervised learning framework for vision tasks, we utilize Adan to train BERT (Devlin et al., 2018) from scratch; BERT is one of the most widely used pre-training models/frameworks for NLP tasks. We employ the exact BERT training setting in the widely used codebase Fairseq (Ott et al., 2019). See more training details in Appendix A.3. From Table 6, one can see that in this most commonly used BERT training experiment, Adan shows a clear advantage over Adam. Specifically, on the BERT-base model, Adan achieves higher performance than Adam on all GLUE tasks, with an average improvement of 1.8 across all tasks. In addition, on some tasks, BERT-base trained by Adan can even outperform larger models, e.g. BERT-large, which achieves 70.4% on RTE, 93.2% on SST-2, and 60.6 correlation on CoLA, and XLNet-large, which has 63.6 correlation on CoLA. See (Liu et al., 2019b) for more results.

Test PPL (the lower, the better) for the Transformer-XL-base model on the WikiText-103 dataset.

shows that on Transformer-XL-base, Adan surpasses the default Adam optimizer in terms of test PPL (the lower, the better) under all training steps. Surprisingly, Adan with 100k training steps can even achieve results comparable to Adam with 200k training steps. All these results demonstrate the superiority of Adan over the default SoTA Adam optimizer for Transformer-XL. Results on LSTM. In Appendix B.4, the results on LSTM show the superiority of our Adan over several representative optimizers, e.g. SGD, Adam, and AdamW, on the Penn TreeBank dataset.

Top-1 Acc. (%) of ViT-S and ConvNext-T on ImageNet under Training Setting II trained with 300 epochs. Here we investigate the performance of Adan with and without the restart strategy on ViT and ConvNext under 300 training epochs. From the results in Table 8, one can observe that the restart strategy slightly improves the test performance of Adan on both ViT and ConvNext. However, to keep Adan simple and avoid tuning the hyper-parameters of the restart strategy (e.g., the restart frequency), we do not use the restart strategy in any experiments except Table 8.

Top-1 accuracy (%) of ResNet18 under the official setting in (He et al., 2016). * are reported in (Zhuang et al., 2020).

Test perplexity (the lower, the better) on Penn Treebank for one-, two-, and three-layer LSTMs. All results in the table except those of Adan and Padam are reported by AdaBelief.

Top-1 accuracy (%) of ViT-S on ImageNet trained under Training Settings I and II. * is reported in (Touvron et al., 2021).

often preferably chose LAMB/Adam/SGD for Training Setting I and AdamW for Training Setting II. Table 12 investigates Adan under both settings and shows consistent improvement from Adan. Moreover, one can also observe that Adan under Setting I largely improves the accuracy of

APPENDIX

The appendix contains additional experimental results and the technical proofs of the convergence results of the paper entitled "Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models". It is structured as follows. Sec. A provides details of the training settings and Adan's implementation, and also gives the detailed steps for the experiment on BERT. Sec. B includes the additional experimental results, which contain the results on ResNet-18 in Sec. B.1, the convergence curves in Sec. B.2, the experiments on RL tasks in Sec. B.3, the results on LSTM in Sec. B.4, and the ablation study in Sec. B.5. After Sec. C, which summarizes the notation used throughout this document, we provide the technical proofs of the convergence results. Sec. D provides the proof of the equivalence between AGD and reformulated AGD, i.e., the proof of Lemma 1. Then, given the Lipschitz gradient condition, Sec. E provides the convergence analysis of Theorem 1. Next, we show Adan's faster convergence under the Lipschitz Hessian condition in Sec. F, by first reformulating Algorithm 1 and introducing some auxiliary bounds. Finally, we present some auxiliary lemmas in Sec. G.

A TRAINING SETTING AND IMPLEMENTATION DETAILS A.1 TRAINING SETTING

Training Setting I. The recently proposed "A2 training recipe" (Wightman et al., 2021) has pushed the performance limits of many SoTA CNN-type architectures by using stronger data augmentation and more training iterations. For example, on ResNet50 it sets a new SoTA of 80.4%, improving upon the 76.1% accuracy of the vanilla setting in (He et al., 2016). Specifically, for data augmentation, this setting uses random crop, horizontal flipping, Mixup (0.1) (Zhang et al., 2018)/CutMix (1.0) (Yun et al., 2019) with probability 0.5, and RandAugment (Cubuk et al., 2020) with M = 7, N = 2, and MSTD = 0.5. It uses stochastic depth (0.05) (Huang et al., 2016), and adopts cosine learning rate decay and the binary cross-entropy (BCE) loss. For Adan, we use batch size 2048 for ResNet and ViT.

Training Setting II. We follow the official training procedure of ViT/Swin/ConvNext. For this setting, data augmentation includes random crop, horizontal flipping, Mixup (0.8), CutMix (1.0), RandAugment (M = 9, MSTD = 0.5), and Random Erasing (p = 0.25). We use the CE loss, cosine decay for the base learning rate, stochastic depth (with the official parameters), and weight decay. For Adan, we set batch size 2048 for Swin/ViT/ConvNext and 4096 for MAE. We follow MAE and tune β3 to 0.1.
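For quick reference, the two recipes above can be summarized as plain configuration dicts. The values are transcribed from the text; the key names are our own shorthand, not an official API of any augmentation library.

```python
# Hyper-parameters of the two training recipes (shorthand keys, not an official API).
SETTING_I = {
    "random_crop": True, "hflip": True,
    "mixup": 0.1, "cutmix": 1.0, "mix_prob": 0.5,          # Mixup/CutMix applied with prob. 0.5
    "randaugment": {"M": 7, "N": 2, "MSTD": 0.5},
    "stochastic_depth": 0.05,
    "loss": "BCE", "lr_decay": "cosine",
    "batch_size": 2048,                                     # ResNet and ViT
}

SETTING_II = {
    "random_crop": True, "hflip": True,
    "mixup": 0.8, "cutmix": 1.0,
    "randaugment": {"M": 9, "MSTD": 0.5},
    "random_erasing": 0.25,                                 # p = 0.25
    "loss": "CE", "lr_decay": "cosine",
    "batch_size": {"swin_vit_convnext": 2048, "mae": 4096},
}
```

Setting I is the stronger-regularization "A2" recipe with BCE loss; Setting II matches the official ViT/Swin/ConvNext procedure with CE loss.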

A.2 IMPLEMENTATION DETAILS OF ADAN

For the large-batch training experiment, we use the sqrt rule to scale the learning rate: lr = √(batch size / 256) × 6.25e-3, and respectively set warmup epochs {20, 40, 60, 100, 160, 200} for batch sizes bs = {1k, 2k, 4k, 8k, 16k, 32k}. For the remaining experiments, we use the following hyper-parameters: learning rate 1.5e-2 for ViT/Swin/ResNet/ConvNext and MAE fine-tuning, and 2.0e-3 for MAE pre-training, according to the official settings. We set β1 = 0.02, β2 = 0.08, and β3 = 0.01, and let the weight decay be 0.02 unless noted otherwise. We clip the global gradient norm to 5 for ResNet training and do not clip the gradient for ViT, Swin, ConvNext, and MAE. In the implementation, to keep consistent with Adam-type optimizers, we use the de-bias strategy for Adan.
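The sqrt scaling rule and a single Adan update can be sketched as follows. This is a minimal illustration for one scalar parameter, not the official implementation: the helper names are our own, the de-bias and restart steps are omitted for brevity, and the update paraphrases Algorithm 1 of the main text (first moment of the gradient, a moment of the gradient difference, and a second moment built from the Nesterov momentum estimate).

```python
import math

# Warmup epochs per batch size, as listed above.
WARMUP_EPOCHS = {1024: 20, 2048: 40, 4096: 60, 8192: 100, 16384: 160, 32768: 200}

def scaled_lr(batch_size, base_lr=6.25e-3):
    """Sqrt rule: lr = sqrt(batch_size / 256) * 6.25e-3."""
    return math.sqrt(batch_size / 256) * base_lr

def adan_step(theta, g, g_prev, state, lr=1.5e-2,
              b1=0.02, b2=0.08, b3=0.01, wd=0.02, eps=1e-8):
    """One simplified Adan update for a scalar parameter (no de-bias/restart)."""
    m, v, n = state
    diff = g - g_prev
    m = (1 - b1) * m + b1 * g                            # 1st moment of the gradient
    v = (1 - b2) * v + b2 * diff                         # moment of the gradient difference
    n = (1 - b3) * n + b3 * (g + (1 - b2) * diff) ** 2   # 2nd moment of the NME
    step = lr / (math.sqrt(n) + eps) * (m + (1 - b2) * v)
    theta = (theta - step) / (1 + lr * wd)               # decoupled weight decay
    return theta, (m, v, n)
```

For example, repeatedly calling `adan_step` with the gradient g = 2θ of f(θ) = θ² drives θ toward 0; each call consumes the current and previous minibatch gradients, which is the extra state Adan keeps relative to Adam.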

A.3 DETAILED STEPS FOR BERT

We replace the default Adam optimizer in BERT with Adan for both pretraining and fine-tuning. Specifically, we first pretrain BERT-base on the Bookcorpus and Wikipedia datasets, and then fine-tune BERT-base separately for each GLUE task on the corresponding training data. Note that GLUE is a collection of 9 tasks/datasets for evaluating natural language understanding systems, in which the tasks are organized as either single-sentence classification or sentence-pair classification. Here we simply replace the Adam optimizer in BERT with Adan and make no other changes, e.g. to the random seed, warmup steps, learning rate decay strategy, dropout probability, etc. For

