ADAN: ADAPTIVE NESTEROV MOMENTUM ALGORITHM FOR FASTER OPTIMIZING DEEP MODELS

Abstract

Adaptive gradient algorithms combine the moving-average idea with heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration both in theory and in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an ϵ-approximate first-order stationary point within O(ϵ^{-3.5}) stochastic gradient complexity on non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can use half the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, MAE, etc., and also shows great tolerance to a large range of minibatch sizes, e.g., from 1k to 32k. We hope Adan can contribute to the development of deep learning by reducing training costs and relieving the engineering burden of trying different optimizers on various architectures.
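To make the NME idea concrete, the sketch below contrasts it with vanilla Nesterov acceleration: instead of evaluating the gradient at an extrapolated point (an extra forward/backward pass and an extra parameter copy), a look-ahead gradient is estimated from the difference of two consecutive gradients at the iterates themselves. This is an illustrative toy on a quadratic objective, not Adan itself; the full update rules (including the second-moment rescaling and decoupled weight decay) are given later in the paper, and the hyperparameter names here are our own.

```python
import numpy as np

def grad(theta):
    # Toy objective f(theta) = 0.5 * ||theta||^2, so grad f(theta) = theta.
    return theta

def nme_step(theta, g, g_prev, m, lr=0.1, beta=0.9):
    """One momentum step using a Nesterov-style gradient estimate.

    Vanilla Nesterov would need grad(theta + beta * (theta - theta_prev)),
    i.e., a gradient at an extrapolated point. The NME-style estimate below
    instead uses only gradients already computed at the iterates:
        g_hat = g_k + beta * (g_k - g_{k-1})
    so no extra gradient evaluation or extrapolated parameter copy is needed.
    (Sketch only; Adan's actual update differs and adds adaptive scaling.)
    """
    g_hat = g + beta * (g - g_prev)      # look-ahead gradient estimate
    m = beta * m + (1 - beta) * g_hat    # first-order moment (momentum)
    return theta - lr * m, m

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
g_prev = grad(theta)
for _ in range(200):
    g = grad(theta)
    theta, m = nme_step(theta, g, g_prev, m)
    g_prev = g

print(np.linalg.norm(theta))  # converges toward 0 on this quadratic
```

The key point is that `g_hat` approximates the extrapolation-point gradient using only `g` and `g_prev`, both of which a momentum method must compute anyway; this is the overhead saving the abstract refers to.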

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable success in many fields, e.g., computer vision (Szegedy et al., 2015; He et al., 2016) and natural language processing (Sainath et al., 2013; Abdel-Hamid et al., 2014). A notable part of this success is attributable to stochastic gradient-based optimizers, which find satisfactory solutions with high efficiency. Among current deep optimizers, SGD (Robbins & Monro, 1951) is the earliest and also the most representative stochastic optimizer, with dominant popularity owing to its simplicity and effectiveness. It adopts a single common learning rate for all gradient coordinates but often suffers from unsatisfactory convergence speed on sparse data or ill-conditioned problems. In recent years, adaptive gradient algorithms, e.g., Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2018), have been proposed, which adjust the learning rate for each gradient coordinate according to the local curvature of the loss objective. These adaptive algorithms, e.g., Adam, often offer faster convergence than SGD in practice. However, none of the above optimizers consistently outperforms its competitors across different network architectures and application settings. For instance, on vanilla ResNet (He et al., 2016), SGD often achieves better generalization performance than adaptive gradient algorithms such as Adam, whereas on vision transformers (ViTs) (Touvron et al., 2021), SGD often fails, and AdamW is the dominant optimizer with higher and more stable performance. Moreover, these commonly used optimizers usually fail for large-batch training, which is a default setting of prevalent distributed training. Despite some performance degradation, the large-batch setting is still often preferred for large-scale deep learning tasks, because the training time would otherwise be unaffordable.
For example, training ViT-B with a batch size of 512 usually takes several days, whereas with a batch size of 32k the training can finish within three hours (Liu et al., 2022a). Although some

