WIN: WEIGHT-DECAY-INTEGRATED NESTEROV ACCELERATION FOR ADAPTIVE GRADIENT ALGORITHMS

Abstract

Training deep networks on large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate adaptive gradient algorithms in a general manner", and aim to provide practical efficiency-boosting insights. To this end, we propose an effective and general Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration that combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. To introduce Nesterov-like acceleration into AdamW and Adam, we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration for AdamW and Adam, which uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Next, we extend Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts, taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their non-accelerated counterparts on vision classification tasks and language modeling tasks with both CNN and Transformer backbones. We hope Win will become a default acceleration option for popular optimizers in the deep learning community to improve training efficiency. Code will be released at https://github.com/sail-sg/win.

1. INTRODUCTION

Deep neural networks (DNNs) are effective at modeling realistic data and have been successfully applied to many applications, e.g. image classification (He et al., 2016) and speech recognition (Sainath et al., 2013). Typically, their training can be formulated as a nonconvex problem:
$$\min_{z\in\mathbb{R}^d} F(z) := \mathbb{E}_{\zeta\sim\mathcal{D}}\big[f(z,\zeta)\big] + \frac{\lambda}{2}\|z\|_2^2, \tag{1}$$
where $z \in \mathbb{R}^d$ is the model parameter; sample $\zeta$ is drawn from a data distribution $\mathcal{D}$; the loss $f$ is differentiable; $\lambda$ is a constant. Though many algorithms, e.g. gradient descent (Cauchy, 1847) and variance-reduced algorithms (Johnson & Zhang, 2013), can solve problem (1), SGD (Robbins & Monro, 1951) uses the compositional structure in (1) to efficiently estimate the gradient via minibatch data, and has become a dominant algorithm for training DNNs in practice because of its higher efficiency and effectiveness. However, on sparse data or ill-conditioned problems, SGD suffers from slow convergence (Kingma & Ba, 2014), as it scales the gradient uniformly across all parameter coordinates and ignores the per-coordinate properties of the problem. To resolve this issue, recent work has proposed a variety of adaptive methods, e.g. Adam (Kingma & Ba, 2014) and AdamW (Loshchilov & Hutter, 2018), that scale each gradient coordinate according to the current geometric curvature of the loss $F(z)$. This coordinate-wise scaling greatly accelerates convergence and has helped adaptive methods such as Adam and AdamW become popular for DNN training, especially for transformers.
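The coordinate-wise scaling described above can be made concrete with a minimal sketch of the standard (non-accelerated) Adam update from Kingma & Ba (2014), which this paper takes as a baseline. Each coordinate's step is divided by a running estimate of that coordinate's gradient magnitude, in contrast to SGD's uniform step size; the function signature below is our own illustrative choice:

```python
import numpy as np

def adam_step(z, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One vanilla Adam step (Kingma & Ba, 2014).

    Each coordinate is rescaled by sqrt(v_hat), a per-coordinate estimate
    of the squared gradient, so ill-scaled coordinates take comparable steps."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (per-coordinate scale)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    z = z - lr * m_hat / (np.sqrt(v_hat) + eps)   # coordinate-wise scaled update
    return z, m, v

# Usage: coordinates with very different gradient magnitudes (100 vs 0.01)
# still move by roughly the same effective step size lr after one update.
z, m, v = adam_step(np.zeros(2), np.array([100.0, 0.01]),
                    np.zeros(2), np.zeros(2), t=1)
```

Note that under vanilla SGD the first coordinate would move 10,000 times farther than the second; Adam's per-coordinate normalization removes this imbalance, which is the behavior the Win acceleration of this paper builds on.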

