ADAM+: A STOCHASTIC METHOD WITH ADAPTIVE VARIANCE REDUCTION

Abstract

Adam is a widely used stochastic optimization method for deep learning applications. While practitioners prefer Adam because it requires less parameter tuning, its use is problematic from a theoretical point of view since it may not converge. Variants of Adam have been proposed with provable convergence guarantees, but they tend not to be competitive with Adam in practical performance. In this paper, we propose a new method named Adam+ (pronounced as Adam-plus). Adam+ retains some of the key components of Adam, but it also has several noticeable differences: (i) it does not maintain a moving average of the second moment estimate, but instead computes a moving average of the first moment estimate at extrapolated data points; (ii) its adaptive step size is formed not by dividing by the coordinate-wise square root of the second moment estimate, but by dividing by the square root of the norm of the first moment estimate. As a result, Adam+ requires as little parameter tuning as Adam, yet it enjoys a provable convergence guarantee. Our analysis further shows that Adam+ enjoys adaptive variance reduction, i.e., the variance of the stochastic gradient estimator decreases as the algorithm converges, hence enjoying adaptive convergence. We also propose a more general variant of Adam+ with different adaptive step sizes and establish its fast convergence rate. Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam+ significantly outperforms Adam and achieves comparable performance with best-tuned SGD and momentum SGD.

1. INTRODUCTION

Adaptive gradient methods (Duchi et al., 2011; McMahan & Streeter, 2010; Tieleman & Hinton, 2012; Kingma & Ba, 2014; Reddi et al., 2019) are among the most important variants of Stochastic Gradient Descent (SGD) in modern machine learning applications. In contrast to SGD, adaptive gradient methods typically require little parameter tuning while still retaining the computational efficiency of SGD. One of the most widely used adaptive methods is Adam (Kingma & Ba, 2014), which practitioners consider the de-facto default optimizer for deep learning frameworks. Adam computes the update for every dimension of the model parameter through moment estimation, i.e., estimates of the first and second moments of the gradients. The estimates of the first and second moments are updated using exponential moving averages with two different control parameters. These moving averages are the key difference between Adam and earlier adaptive gradient methods such as Adagrad (Duchi et al., 2011).

Although Adam exhibits great empirical performance, many mysteries remain about its convergence. First, it has been shown that Adam may not converge for some objective functions (Reddi et al., 2019; Chen et al., 2018b). Second, it is unclear what benefit the moving average brings from a theoretical point of view, especially its effect on the convergence rate. Third, it has been empirically observed that adaptive gradient methods can have worse generalization performance than their non-adaptive counterparts (e.g., SGD) on various deep learning tasks due to the coordinate-wise learning rates (Wilson et al., 2017). These issues motivate us to design a new algorithm that achieves the best of both worlds, i.e., provable convergence with benefits from the moving average, together with good generalization performance in deep learning.
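As a concrete reference point, the standard Adam update described above can be sketched as follows (a simplified single-tensor NumPy version; the hyperparameter values are the commonly used defaults, not values prescribed by this paper):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the first and
    second moments, with a coordinate-wise adaptive step size."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment EMA (coordinate-wise)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate learning rate
    return w, m, v
```

Note that the denominator `np.sqrt(v_hat)` gives every coordinate its own effective learning rate; this coordinate-wise scaling is precisely what Wilson et al. (2017) identify as a source of worse generalization.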
Specifically, we focus on the following optimization problem:

min_{w ∈ ℝ^d} F(w),

Table 1: Summary of different algorithms with different assumptions and complexity results for finding an ε-stationary point. "Individual Smooth" means assuming that F(w) = E_{ξ∼D}[f(w; ξ)] and that every component function f(w; ξ) is L-smooth. "Hessian Lipschitz" means that ‖∇²F(x) − ∇²F(y)‖ ≤ L_H‖x − y‖ holds for all x, y with L_H ≥ 0. "Type I" means that the complexity depends on E[∑_{i=1}^d ‖g_{1:T,i}‖], where g_{1:T,i} stands for the i-th row of the matrix [g_1, . . . , g_T], with g_t being the stochastic gradient at the t-th iteration and T the number of iterations. "Type II" means that the complexity depends on E[∑_{t=1}^T ‖z_t‖], where z_t is the variance-reduced gradient estimator at the t-th iteration.

Algorithm | Individual Smooth | Hessian Lipschitz | Worst-case complexity better than O(ε^{-4})? | Data-dependent Complexity
Generalized Adam (Chen et al., 2018b) | … | … | … | …
PAdam | … | … | … | …
(remaining rows of Table 1 not recoverable)

where we only have access to stochastic gradients of F. Note that F could possibly be nonconvex in w. Due to the non-convexity, our goal is to design a stochastic first-order algorithm that finds an ε-stationary point, i.e., a point w such that E[‖∇F(w)‖] ≤ ε, with low iteration complexity.

Our key contribution is the design and analysis of a new stochastic method named Adam+. Adam+ retains some of the key components of Adam, but it also has several noticeable differences: (i) it does not maintain a moving average of the second moment estimate, but instead computes a moving average of the first moment estimate at extrapolated points; (ii) its adaptive step size is formed not by dividing by the square root of the coordinate-wise second moment estimate, but by dividing by the square root of the norm of the first moment estimate. These features allow us to establish the adaptive convergence of Adam+. Existing variance reduction methods typically rely on large minibatches or on computing stochastic gradients at two points per iteration (Zhou et al., 2018; Pham et al., 2020; Cutkosky & Orabona, 2019); in contrast, we do not necessarily require either to achieve variance reduction. In addition, we establish a fast rate for a variant of Adam+ that matches the state-of-the-art complexity under the same conditions. Table 1 provides an overview of our results and a summary of existing results. We refer readers to Section F for a comprehensive survey of other related work. We further corroborate our theoretical results with an extensive empirical study on various deep learning tasks.

Our contributions are summarized below.

• We propose a new algorithm with adaptive step size, namely Adam+, for general nonconvex optimization. We show that it enjoys a new type of data-dependent adaptive convergence that depends on the variance reduction property of the first moment estimate.
Notably, this data-dependent complexity does not require sparsity in the stochastic gradients to guarantee fast convergence, as in previous works (Duchi et al., 2011; Kingma & Ba, 2014; Reddi et al., 2019; Chen et al., 2019; 2018a). To the best of our knowledge, this is the first work establishing such a new type of data-dependent complexity.

• We show that a general variant of our algorithm achieves O(ε^{-3.5}) worst-case complexity, which matches the state-of-the-art complexity guarantee under the Hessian Lipschitz assumption (Cutkosky & Mehta, 2020).

• We demonstrate the effectiveness of our algorithms on image classification, language modeling, and automatic speech recognition. Our empirical results show that our proposed algorithm consistently outperforms Adam on all tasks and achieves performance comparable to best-tuned SGD and momentum SGD.
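To make differences (i) and (ii) concrete, the following is a minimal sketch of an Adam+-style update. It assumes a simple momentum-style extrapolation w + β(w − w_prev); the extrapolation form, hyperparameter names, and values here are illustrative assumptions for exposition, not the authors' exact algorithm:

```python
import numpy as np

def adam_plus_step(w, w_prev, z, grad_fn, lr=0.1, alpha=0.1, beta=0.9, eps=1e-8):
    """One hypothetical Adam+-style update (illustrative sketch).

    z       : moving average of first-moment (gradient) estimates
    grad_fn : returns a stochastic gradient at a given point
    """
    # (i) gradient evaluated at an extrapolated point (assumed form)
    w_extra = w + beta * (w - w_prev)
    g = grad_fn(w_extra)
    z = (1 - alpha) * z + alpha * g          # first-moment EMA; no second-moment EMA
    # (ii) a single scalar step size from the square root of the NORM of z,
    # rather than a coordinate-wise second-moment denominator
    step = lr / (np.sqrt(np.linalg.norm(z)) + eps)
    return w - step * z, w, z
```

In contrast to the Adam update, the denominator here is one scalar shared by all coordinates, so the update direction is exactly −z; as z shrinks near a stationary point, the scalar step size grows, which is the mechanism behind the adaptive behavior described above.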

