WIN: WEIGHT-DECAY-INTEGRATED NESTEROV ACCELERATION FOR ADAPTIVE GRADIENT ALGORITHMS

Abstract

Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate adaptive gradient algorithms in a general manner", and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamic loss per iteration which combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-alike acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam, which uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts by taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification and language modeling tasks with both CNN and transformer backbones. We hope Win will become a default acceleration option for popular optimizers in the deep learning community to improve training efficiency. Code will be released at https://github.com/sail-sg/win.

1. INTRODUCTION

Deep neural networks (DNNs) are effective at modeling realistic data and have been successfully applied to many applications, e.g. image classification (He et al., 2016) and speech recognition (Sainath et al., 2013). Typically, their training can be formulated as the nonconvex problem

$$\min_{z\in\mathbb{R}^d} F(z) := \mathbb{E}_{\zeta\sim\mathcal{D}}[f(z,\zeta)] + \frac{\lambda}{2}\|z\|_2^2, \tag{1}$$

where $z \in \mathbb{R}^d$ is the model parameter, the sample $\zeta$ is drawn from a data distribution $\mathcal{D}$, the loss $f$ is differentiable, and $\lambda$ is a constant. Though many algorithms, e.g. gradient descent (Cauchy et al., 1847) and variance-reduced algorithms (Rie Johnson, 2013), can solve problem (1), SGD (Robbins & Monro, 1951) uses the compositional structure in (1) to efficiently estimate the gradient via minibatch data, and has become a dominant algorithm for training DNNs in practice because of its higher efficiency and effectiveness. However, on sparse data or ill-conditioned problems, SGD suffers from slow convergence (Kingma & Ba, 2014), as it scales the gradient uniformly across all parameter coordinates and ignores the problem properties on each coordinate. To resolve this issue, recent work has proposed a variety of adaptive methods, e.g. Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2018), that scale each gradient coordinate according to the current curvature of the loss $F(z)$. This coordinate-wise scaling greatly accelerates optimization convergence and has helped these methods, e.g. Adam and AdamW, become popular in DNN training, especially for transformers. Unfortunately, with the increasing scale of both datasets and models, efficient DNN training even with SGD or adaptive algorithms has become very challenging. In this work, we are particularly interested in the problem of "how to accelerate the convergence of adaptive algorithms in a general manner" because of their dominant popularity across many DNNs.
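To illustrate the coordinate-wise scaling discussed above, the following minimal pure-Python toy compares one SGD step with one Adam-style step on an ill-conditioned gradient. This is a sketch for intuition only, not the paper's implementation; the Adam update follows Kingma & Ba (2014) with standard bias correction.

```python
import math

def sgd_step(z, grad, lr=0.1):
    # SGD scales every coordinate by the same learning rate.
    return [zi - lr * gi for zi, gi in zip(z, grad)]

def adam_step(z, grad, m, v, k, lr=0.1, b1=0.9, b2=0.999, nu=1e-8):
    # Adam tracks per-coordinate first/second gradient moments and
    # scales each coordinate by the inverse root of its second moment.
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** k) for mi in m]   # bias correction at step k
    v_hat = [vi / (1 - b2 ** k) for vi in v]
    z = [zi - lr * mh / (math.sqrt(vh) + nu)
         for zi, mh, vh in zip(z, m_hat, v_hat)]
    return z, m, v

# Ill-conditioned gradient: coordinates differ by four orders of magnitude.
g = [100.0, 0.01]
print(sgd_step([0.0, 0.0], g))   # SGD step sizes also differ by four orders of magnitude
z, m, v = adam_step([0.0, 0.0], g, [0.0, 0.0], [0.0, 0.0], k=1)
print(z)                         # both coordinates move by roughly lr = 0.1
```

The toy shows why adaptive scaling helps on such problems: SGD either crawls along the small-gradient coordinate or overshoots along the large one, while the Adam-style step moves both coordinates at a comparable pace.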
Heavy-ball acceleration (Polyak, 1964) and Nesterov acceleration (Nesterov, 2003) are widely used in SGD but are rarely studied in adaptive algorithms. Among the very few, NAdam (Dozat, 2016) simplifies Nesterov acceleration to estimate the first moment of the gradient in Adam while totally ignoring the second moment, which is not exact Nesterov acceleration and may not inherit its full acceleration merit.

In this work, based on a recent Nesterov-type acceleration formulation (Nesterov et al., 2018) and the proximal point method (PPM) (Moreau, 1965), we propose a new Weight-decay-Integrated Nesterov acceleration (Win for short) to accelerate adaptive algorithms, and we further analyze the convergence of Win-accelerated adaptive algorithms to justify their convergence superiority, taking AdamW and Adam as examples. Our main contributions are highlighted below.

Firstly, we use PPM to rigorously derive our Win acceleration for accelerating adaptive algorithms. Taking AdamW and Adam as examples, at the $k$-th iteration we follow the PPM spirit and minimize the dynamically regularized loss $F(z) + \frac{1}{2\eta_k}\|z - x_k\|^2_{\sqrt{v_k}+\nu}$, where $v_k$ is the second-order gradient moment and $\nu$ is the stabilizing constant in AdamW and Adam. Then, to introduce Nesterov-alike acceleration and make the problem iteratively solvable, we approximate $F(z)$ by its first- and second-order Taylor expansions, respectively, to update the variable $z$ twice, while always keeping the above dynamic regularization and an extra regularizer $\frac{1}{2\eta_k}\|z\|^2_{\sqrt{v_k}+\nu}$ induced by the weight decay in AdamW. As a result, we arrive at our Win acceleration, a Nesterov-alike acceleration, for AdamW and Adam that uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. We then extend this Win acceleration to LAMB (You et al., 2019) and SGD. The above acceleration derivation is transparent and general, and could motivate other accelerated methods as well as their integration into adaptive algorithms.

Secondly, we prove the convergence of our Win-accelerated AdamW and Adam. For both, to find an $\epsilon$-approximate first-order stationary point, their stochastic gradient complexity matches the lower bound $\Omega(1/\epsilon^4)$ in (Arjevani et al., 2019; 2020) (up to constant factors) under the same conditions, where $c_\infty$ upper bounds the $\ell_\infty$-norm of the stochastic gradient. Moreover, this complexity improves upon that of the Adam-type optimizers analyzed in (Zhou et al., 2018; Guo et al., 2021), e.g. Adam, AdaGrad (Duchi et al., 2011) and AdaBound (Luo et al., 2018), since the network parameter dimension $d$ is often much larger than $c_\infty^{0.5}$, especially for over-parameterized networks. Indeed, Win-accelerated Adam and AdamW also enjoy better complexity than other Adam variants, e.g. AdaBelief (Zhuang et al., 2020), especially on over-parameterized networks, where $c_2$ denotes the maximum $\ell_2$-norm of the stochastic gradient.

Finally, experimental results on both vision classification and language modeling tasks show that our Win-accelerated algorithms, i.e. accelerated AdamW, Adam, LAMB and SGD, accelerate the convergence and improve the performance of their non-accelerated counterparts by a remarkable margin on both CNN and transformer architectures.

All these results show the strong compatibility, generality and superiority of our acceleration technique.

2. RELATED WORK

In the context of deep learning, when considering efficiency and generalization, one often prefers SGD and adaptive gradient algorithms, e.g. Adam, over other algorithms, e.g. variance-reduced algorithms (Rie Johnson, 2013), to solve problem (1). But, in both practice and theory, adaptive algorithms often generalize worse than SGD (Zhou et al., 2020a;b). To alleviate this issue, AdamW (Loshchilov & Hutter, 2018) proposes a decoupled weight decay, which introduces an $\ell_2$-alike regularization into Adam to decay the network weights at each iteration, and its effectiveness is widely validated on ViTs (Touvron et al., 2021) and CNNs (Touvron et al., 2021). Later, LAMB (You et al., 2019) scales the AdamW update by the weight magnitude to avoid overly large or small updates, but suffers from unsatisfactory performance with small batch sizes. In this work, we aim to design a general acceleration scheme for these adaptive algorithms. Heavy-ball acceleration (Polyak, 1964) and Nesterov acceleration (Nesterov, 2003) are two classical acceleration techniques, and their effectiveness in SGD is well established. Later, NAdam (Dozat, 2016) simplifies Nesterov acceleration to estimate the first moment of the gradient in Adam.
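To make the "conservative step plus reckless step" structure of such accelerated updates concrete, the toy Python sketch below mimics a Win-style AdamW step. This is a hypothetical illustration, not the authors' exact algorithm: the function name, the two step sizes `lr` and `reckless`, and the fixed combination weight `alpha` are illustrative assumptions, whereas in the paper the combination coefficients follow from the PPM derivation.

```python
import math

def win_adamw_sketch(x, y, grad, m, v, lr=1e-3, reckless=2e-3,
                     b1=0.9, b2=0.999, nu=1e-8, wd=0.01, alpha=0.5):
    """One hypothetical Win-style step (illustration only)."""
    # First- and second-order gradient moments, as in Adam/AdamW.
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, grad)]
    scale = [math.sqrt(vi) + nu for vi in v]
    # Conservative step: small step size; dividing by (1 + lr*wd)
    # applies AdamW's decoupled weight decay in proximal form.
    x_new = [(xi - lr * mi / si) / (1 + lr * wd)
             for xi, mi, si in zip(x, m, scale)]
    # Reckless step: a larger step size applied to the auxiliary sequence y.
    y_tmp = [(yi - reckless * mi / si) / (1 + reckless * wd)
             for yi, mi, si in zip(y, m, scale)]
    # Linearly combine the two updates to form the accelerated iterate.
    y_new = [alpha * xn + (1 - alpha) * yt for xn, yt in zip(x_new, y_tmp)]
    return x_new, y_new, m, v
```

In a Nesterov-style scheme the next gradient would typically be evaluated at the accelerated iterate `y_new`, which provides the look-ahead effect; how Win chooses its step sizes and combination weights is specified by its derivation rather than by the fixed constants used here.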

