TRAINING BY VANILLA SGD WITH LARGER LEARNING RATES
Anonymous authors
Paper under double-blind review

Abstract

The stochastic gradient descent (SGD) method, first proposed in the 1950s, has been the foundation for deep-neural-network (DNN) training, with numerous enhancements including adding momentum, adaptively selecting learning rates, or both. A common view of SGD is that the learning rate should eventually be made small in order to reach sufficiently good approximate solutions. Another widely held view is that vanilla SGD is out of fashion in comparison to its many modern variations. In this work, we provide a contrarian claim that, when training over-parameterized DNNs, vanilla SGD can still compete well with, and oftentimes outperform, its more recent variations by simply using learning rates significantly larger than commonly used values. We establish theoretical results to explain this local convergence behavior of SGD on nonconvex functions, and also present computational evidence, across multiple tasks including image classification, speech recognition and natural language processing, to support the practice of using larger learning rates.

1. INTRODUCTION

We are interested in minimizing a function $f : \mathbb{R}^d \to \mathbb{R}$ with an expectation form:

$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi}[K(x, \xi)]$, (1a)

where the subscript indicates that the expectation is taken over the random variable $\xi$. In particular, if the probability distribution of $\xi$ is known and $\xi$ is uniformly distributed over $N$ items, the objective function $f$ can be expressed in a finite-sum form:

$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x)$, (1b)

where $f_i : \mathbb{R}^d \to \mathbb{R}$ is the $i$-th component function. The optimization of Eqs. (1a) and (1b) is widely encountered in machine learning tasks (Goodfellow et al., 2016; Simonyan & Zisserman, 2014).

Figure 1: The step size vs. iteration for nonconvex functions: in a neighborhood of the solution, the learning rate can be a constant value below a threshold.

To solve the problem given in Eq. (1b), we can compute the gradient of the objective function directly with the classic gradient descent (GD) algorithm. However, this method suffers from expensive gradient computation when $N$ is extremely large, and hence the stochastic gradient method (SGM) is applied to address this issue. Incremental gradient descent (IGD) is a basic version of SGM, in which the gradient is computed on a single component $f_i$ at each iteration, instead of on the full sum. As a special case of IGD, SGD (Robbins & Monro, 1951), a fundamental method for training neural networks, updates parameters using the gradient computed on a minibatch. There is an intuitive view that SGD with a constant step size (SGD-CS) potentially leads to faster convergence. Some studies (Solodov, 1998; Tseng, 1998) show that, under a so-called strong growth condition (SGC), SGD-CS converges to an optimal point faster than SGD with a diminishing step size. In this work, we study the local convergence (or last-iterate convergence, as it is called in Jain et al. (2019)) of SGD-CS on nonconvex functions. Note that SGD-CS does not guarantee convergence when starting from an arbitrary initialization point.
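As a concrete illustration (this is a minimal sketch, not the paper's experimental setup), the following runs vanilla SGD with a constant step size on a finite-sum least-squares objective constructed to satisfy interpolation; all problem sizes and the step size are hypothetical choices.

```python
import numpy as np

# Sketch: vanilla SGD with a constant step size on the finite-sum objective
#   f(x) = (1/N) * sum_i 0.5 * (a_i^T x - b_i)^2,
# constructed so that b = A @ x_true, i.e. the interpolation property holds
# and every component gradient vanishes at x_true.
rng = np.random.default_rng(0)
N, d = 50, 5
A = rng.normal(size=(N, d))
x_true = rng.normal(size=d)
b = A @ x_true

def sgd_constant_step(A, b, alpha, iters, rng):
    """Vanilla SGD: sample one component i per step and move along -grad f_i."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        i = rng.integers(len(b))
        grad_i = (A[i] @ x - b[i]) * A[i]  # gradient of 0.5*(a_i^T x - b_i)^2
        x -= alpha * grad_i
    return x

x_hat = sgd_constant_step(A, b, alpha=0.05, iters=20000, rng=rng)
print(np.linalg.norm(x_hat - x_true))
```

Under interpolation, each step contracts the error in the sampled direction, so the constant step size never has to be decreased for the iterates to approach the minimizer.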
Therefore, a useful strategy is to use SGD with a decreasing step size at the beginning, and then switch to SGD-CS in a neighborhood of a minimizer. Fig. 1 illustrates the range of the step size of SGD versus the number of iterations. Our main theoretical and experimental contributions are as follows.

• We establish local (or last-iterate) convergence of SGD with a constant step size (SGD-CS) on nonconvex functions under the interpolation condition. We note that previous results are mostly for strongly convex functions under the strong (or weak) growth condition. Our result is much closer to situations commonly encountered in practice.

• We discover that on linear regression problems with $\ell_2$ regularization, convergent learning rates can be quite large for incremental gradient descent (IGD). Our numerical results show that, within a fairly large range, the larger the step size, the smaller the spectral radius of the iteration map, and the faster the convergence.

• Based on the above observations, we further propose a strategy called SGDL that uses SGD with a large initial learning rate (more than 10 times larger than the learning rate in SGD with momentum), while still being a vanilla SGD. We conduct extensive experiments on various popular deep-learning tasks and models in computer vision, audio recognition and natural language processing. Our results show that the method converges successfully, has strong generalization performance, and sometimes outperforms its advanced variant (SGDM) and several other popular adaptive methods (e.g., Adam, AdaDelta).
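The spectral-radius observation in the second contribution can be probed numerically: one epoch of IGD on ridge regression is an affine map whose linear part is a product of per-component matrices, and its spectral radius governs the per-epoch linear rate. The sketch below (with hypothetical data and step sizes, not the paper's experiments) computes that radius for a few step sizes.

```python
import numpy as np

# One IGD epoch on ridge regression
#   min_x (1/N) sum_i [ 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2 ]
# sweeps the components i = 1..N in order; its linear part is
#   M(alpha) = prod_i (I - alpha*(a_i a_i^T + lam*I)),
# and the spectral radius rho(M(alpha)) < 1 implies linear convergence.
rng = np.random.default_rng(1)
N, d, lam = 20, 3, 0.1
A = rng.normal(size=(N, d))

def epoch_map(A, lam, alpha):
    """Linear part of one in-order IGD sweep over all N components."""
    d = A.shape[1]
    M = np.eye(d)
    for a in A:
        M = (np.eye(d) - alpha * (np.outer(a, a) + lam * np.eye(d))) @ M
    return M

rhos = {alpha: max(abs(np.linalg.eigvals(epoch_map(A, lam, alpha))))
        for alpha in (0.01, 0.05, 0.1)}
for alpha in sorted(rhos):
    print(f"alpha={alpha:.2f}  spectral radius={rhos[alpha]:.4f}")
```

Since each factor $I - \alpha(a_i a_i^\top + \lambda I)$ has spectral norm below one whenever $0 < \alpha < 2/(\|a_i\|^2 + \lambda)$, the product, and hence its spectral radius, stays below one across a fairly wide range of step sizes.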

2. RELATED WORK

There are many papers on stochastic optimization; we summarize the representative ones most relevant to our work. The convergence of SGD for over-parameterized models is analyzed in (Vaswani et al., 2018; Mai & Johansson, 2020; Allen-Zhu et al., 2019; Li & Liang, 2018). The power of interpolation is studied in (Ma et al., 2018; Vaswani et al., 2019). Jastrzębski et al. (2017) show that a large ratio of learning rate to batch size often leads to a wide endpoint. Smith & Topin (2019) report a phenomenon called super-convergence, which is in contrast to the results in (Bottou et al., 2018). More recently, several new learning rate schedules have been proposed for SGD (Loshchilov & Hutter, 2016; Smith, 2017; Agarwal et al., 2017; Carmon et al., 2018). Adaptive gradient methods are widely used in deep-learning applications. Popular methods include AdaDelta (Zeiler, 2012), RMSProp (Hinton et al., 2012), Adam (Kingma & Ba, 2014), and so on. Unfortunately, adaptive methods are believed to have poor empirical performance in some settings. For example, Wilson et al. (2017) observed that Adam hurts generalization performance in comparison to SGD with or without momentum. Schmidt & Roux (2013) show that SGD-CS attains a linear convergence rate for strongly convex functions under the strong growth condition (SGC), and a sublinear convergence rate for convex functions under the SGC. Cevher & Vu (2018) investigate the weak growth condition (WGC), a necessary condition for linear convergence of SGD-CS. For a general convex function, Ma et al. (2018) show that SGD-CS attains a linear convergence rate under the interpolation property. For a finite sum $f$, the interpolation property means $\nabla f_i(x^*) = 0$ for $i = 1, \ldots, N$, which evidently holds for over-parameterized DNNs. However, the theoretical analysis of SGD-CS for nonconvex functions is far less mature than in the convex case.
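The interpolation property can be verified directly in a toy over-parameterized setting (a sketch with hypothetical dimensions, not drawn from the cited works): when a linear model has more parameters than samples, the minimum-norm least-squares solution fits every sample exactly, so every component gradient vanishes at the minimizer.

```python
import numpy as np

# Interpolation check: with d > N and A of full row rank, the least-squares
# solution x_star satisfies A @ x_star = b exactly, hence every component
# gradient grad f_i(x_star) = (a_i^T x_star - b_i) * a_i is zero.
rng = np.random.default_rng(2)
N, d = 10, 20                      # over-parameterized: d > N
A = rng.normal(size=(N, d))
b = rng.normal(size=N)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # min-norm exact fit

grad_norms = [np.linalg.norm((A[i] @ x_star - b[i]) * A[i]) for i in range(N)]
print(max(grad_norms))
```

This is the sense in which the property "evidently holds" for over-parameterized models: zero training loss on every sample forces every $\nabla f_i$ to vanish at the common minimizer.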

