ADAN: ADAPTIVE NESTEROV MOMENTUM ALGORITHM FOR FASTER OPTIMIZING DEEP MODELS

Abstract

Adaptive gradient algorithms combine the moving-average idea with heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration in theory and in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an ϵ-approximate first-order stationary point within O(ϵ^{-3.5}) stochastic gradient complexity on nonconvex stochastic problems (e.g. deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g. ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can use half the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, MAE, etc., and also shows great tolerance to a large range of minibatch sizes, e.g. from 1k to 32k. We hope Adan can contribute to the development of deep learning by reducing training costs and relieving the engineering burden of trying different optimizers on various architectures.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable success in many fields, e.g. computer vision (Szegedy et al., 2015; He et al., 2016) and natural language processing (Sainath et al., 2013; Abdel-Hamid et al., 2014). A noticeable part of this success is contributed by stochastic gradient-based optimizers, which find satisfactory solutions with high efficiency. Among current deep optimizers, SGD (Robbins & Monro, 1951) is the earliest and also the most representative stochastic optimizer, with dominant popularity for its simplicity and effectiveness. It adopts a single common learning rate for all gradient coordinates but often suffers from unsatisfactory convergence speed on sparse data or ill-conditioned problems. In recent years, adaptive gradient algorithms, e.g. Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2018), have been proposed, which adjust the learning rate for each gradient coordinate according to the current geometric curvature of the loss objective. These adaptive algorithms, e.g. Adam, often offer faster convergence than SGD in practice. However, none of the above optimizers can always stay undefeated among all its competitors across different network architectures and application settings. For instance, on vanilla ResNet (He et al., 2016), SGD often achieves better generalization performance than adaptive gradient algorithms such as Adam, whereas on vision transformers (ViTs) (Touvron et al., 2021), SGD often fails, and AdamW is the dominant optimizer with higher and more stable performance. Moreover, these commonly used optimizers usually fail for large-batch training, which is a default setting of prevalent distributed training. Despite some performance degradation, we still tend to choose the large-batch setting for large-scale deep learning training tasks, since the training time is otherwise unaffordable.
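The per-coordinate adaptation that distinguishes Adam-style methods from SGD can be made concrete with a small sketch. This is an illustrative NumPy toy, not the paper's code; the hyperparameter values are common defaults, not prescriptions:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # SGD: one shared learning rate for every gradient coordinate.
    return theta - lr * grad

def adam_step(theta, grad, m, v, k, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the first and second gradient moments.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta1**k)
    v_hat = v / (1 - beta2**k)
    # Per-coordinate step: coordinates with a large squared-gradient
    # history are scaled down, small ones are scaled up.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

After one Adam step, a coordinate with gradient 100 and one with gradient 1 move by nearly the same amount, whereas under SGD they differ by a factor of 100; this normalization is what helps on sparse or ill-conditioned problems.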
For example, training ViT-B with a batch size of 512 usually takes several days, but with a batch size of 32K the training can finish within three hours (Liu et al., 2022a).

| Assumption | Method | Batch size | Bounded gradient | Complexity | Lower bound |
|---|---|---|---|---|---|
| Lipschitz Gradient | Adam-type | – | ℓ∞ ≤ c∞ | O(c∞^2 d ϵ^{-4}) | Ω(ϵ^{-4}) |
| | RMSProp | – | ℓ∞ ≤ c∞ | O(√(c∞ d) ϵ^{-4}) | Ω(ϵ^{-4}) |
| | AdamW | – | – | – | – |
| | Adabelief | – | ℓ2 ≤ c2 | O(c2^6 ϵ^{-4}) | Ω(ϵ^{-4}) |
| | Padam | – | ℓ∞ ≤ c∞ | O(√(c∞ d) ϵ^{-4}) | Ω(ϵ^{-4}) |
| | LAMB | O(ϵ^{-4}) | ℓ2 ≤ c2 | O(c2^2 d ϵ^{-4}) | Ω(ϵ^{-4}) |
| | Adan (ours) | – | ℓ∞ ≤ c∞ | O(c∞^{2.5} ϵ^{-4}) | Ω(ϵ^{-4}) |
| Lipschitz Hessian | A-NIGT | – | ℓ2 ≤ c2 | O(ϵ^{-3.5} log(c2/ϵ)) | Ω(ϵ^{-3.5}) |
| | Adam+ | O(ϵ^{-1.625}) | ℓ2 ≤ c2 | O(ϵ^{-3.625}) | Ω(ϵ^{-3.5}) |
| | Adan (ours) | – | ℓ∞ ≤ c∞ | O(c∞^{1.25} ϵ^{-3.5}) | Ω(ϵ^{-3.5}) |

Although some methods, e.g. LARS (You et al., 2017) and LAMB (You et al., 2019), have been proposed to handle large batch sizes, their performance often varies significantly across batch sizes. This performance inconsistency increases the training cost and engineering burden, since one usually has to try various optimizers for different architectures or training settings. When we rethink the current adaptive gradient algorithms, we find that they mainly combine the moving-average idea with the heavy-ball acceleration technique to estimate the first- and second-order moments of the gradient, e.g. Adam, AdamW, and LAMB. However, previous studies (Nesterov, 1983; 1988; 2003) have revealed that Nesterov acceleration can theoretically achieve a faster convergence speed than heavy-ball acceleration, as it uses the gradient at an extrapolation point of the current solution and thus sees a slight "future". Moreover, recent works (Nado et al., 2021; He et al., 2021) have shown the potential of Nesterov acceleration for large-batch training. We are thus inspired to efficiently integrate Nesterov acceleration with adaptive algorithms.

Contributions: 1) We propose an efficient DNN optimizer, named Adan. Adan develops a Nesterov momentum estimation method to estimate stable and accurate first- and second-order moments of the gradient in adaptive gradient algorithms for acceleration.
2) Moreover, Adan enjoys a provably faster convergence speed than previous adaptive gradient algorithms such as Adam. 3) Empirically, Adan shows superior performance over the SoTA deep optimizers across vision, language, and reinforcement learning (RL) tasks. Our detailed contributions are highlighted below.

Firstly, we propose an efficient Nesterov-acceleration-induced deep learning optimizer termed Adan. Given a function f and the current solution θ_k, Nesterov acceleration (Nesterov, 1983; 1988; 2003) estimates the gradient g_k = ∇f(θ'_k) at the extrapolation point θ'_k = θ_k − η(1 − β_1)m_{k−1}, with learning rate η and momentum coefficient β_1 ∈ (0, 1), and updates the moving gradient average as m_k = (1 − β_1)m_{k−1} + g_k. Then it takes a step θ_{k+1} = θ_k − ηm_k. However, the inconsistency between the position θ_k used for the parameter update and the position θ'_k used for gradient estimation leads to the additional cost of reloading model parameters during back-propagation (BP), which is unaffordable, especially for large DNNs.

To avoid model reloading during BP, we propose an alternative Nesterov momentum estimation (NME). We compute the gradient g_k = ∇f(θ_k) at the current solution θ_k and estimate the moving gradient average as m_k = (1 − β_1)m_{k−1} + g'_k, where g'_k = g_k + (1 − β_1)(g_k − g_{k−1}). Our NME is provably equivalent to the vanilla one yet avoids the extra model reloading. Then, by regarding g'_k as the current stochastic gradient in adaptive gradient algorithms such as Adam, we accordingly estimate the first- and second-order moments as m_k = (1 − β_1)m_{k−1} + β_1 g'_k and n_k = (1 − β_2)n_{k−1} + β_2 (g'_k)^2, respectively. Finally, we update θ_{k+1} = θ_k − ηm_k/(√n_k + ε). In this way, Adan enjoys the merits of Nesterov acceleration, namely faster convergence and tolerance to large minibatch sizes (Lin et al., 2020), which is verified in our experiments in Sec. 5.
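The two estimators described above can be set side by side in a short sketch. This is an illustrative NumPy toy following the update equations exactly as written in this section, not the paper's implementation: the full algorithm additionally includes weight decay, restart tricks, and separate moment buffers, and the hyperparameter values below are arbitrary choices for the demo.

```python
import numpy as np

def vanilla_nesterov_step(theta, m_prev, grad_fn, lr=0.05, beta1=0.9):
    # Vanilla Nesterov: the gradient is evaluated at the extrapolation
    # point theta', which for a DNN would require reloading parameters
    # during back-propagation.
    theta_extra = theta - lr * (1 - beta1) * m_prev
    g = grad_fn(theta_extra)
    m = (1 - beta1) * m_prev + g
    return theta - lr * m, m

def adan_step(theta, g, g_prev, m, n, lr=0.01,
              beta1=0.1, beta2=0.01, eps=1e-8):
    # NME: correct the gradient at the *current* point with a
    # gradient-difference term instead of re-evaluating at an
    # extrapolation point.
    g_prime = g + (1 - beta1) * (g - g_prev)
    m = (1 - beta1) * m + beta1 * g_prime         # first-order moment
    n = (1 - beta2) * n + beta2 * g_prime**2      # second-order moment
    theta = theta - lr * m / (np.sqrt(n) + eps)   # adaptive step
    return theta, m, n
```

On a toy convex quadratic f(θ) = 0.5‖θ‖^2 (so ∇f(θ) = θ), both loops drive θ toward the minimizer at 0, with Adan's step additionally normalized per coordinate by the second-moment estimate.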



Table 1: Comparison of different adaptive gradient algorithms on nonconvex stochastic problems. "Separated Reg." refers to whether the ℓ2 regularizer (weight decay) can be separated from the loss objective, as in AdamW. "Complexity" denotes the stochastic gradient complexity to find an ϵ-approximate first-order stationary point. Adam-type methods (Guo et al., 2021) include Adam, AdaGrad (Duchi et al., 2011), etc. AdamW has no available convergence result. For SAM (Foret et al., 2020), A-NIGT (Cutkosky & Mehta, 2020), and Adam+ (Liu et al., 2020), we compare their adaptive versions. d is the variable dimension. The lower bound is proven in (Arjevani et al., 2020).

