TRAINING BY VANILLA SGD WITH LARGER LEARNING RATES
Anonymous authors
Paper under double-blind review

Abstract

The stochastic gradient descent (SGD) method, first proposed in the 1950s, has been the foundation of deep-neural-network (DNN) training, with numerous enhancements such as adding momentum, adaptively selecting learning rates, or combining both strategies and more. A common view of SGD is that the learning rate should eventually be made small in order to reach sufficiently good approximate solutions. Another widely held view is that vanilla SGD is out of fashion compared to many of its modern variations. In this work, we make the contrarian claim that, when training over-parameterized DNNs, vanilla SGD can still compete well with, and oftentimes outperform, its more recent variations simply by using learning rates significantly larger than commonly used values. We establish theoretical results to explain this local convergence behavior of SGD on nonconvex functions, and also present computational evidence, across multiple tasks including image classification, speech recognition, and natural language processing, to support the practice of using larger learning rates.

1. INTRODUCTION

We are interested in minimizing a function f : R^d → R given in expectation form:

    min_{x ∈ R^d} f(x) := E_ξ[K(x, ξ)],    (1a)

where the subscript indicates that the expectation is taken over the random variable ξ. In particular, if the probability distribution of ξ is known and ξ is uniformly distributed over N terms, the objective function f can be expressed in the finite-sum form:

    min_{x ∈ R^d} f(x) := (1/N) Σ_{i=1}^{N} f_i(x),    (1b)

where f_i : R^d → R is the i-th component function. The optimization of Eqs. (1a) and (1b) is widely encountered in machine learning tasks (Goodfellow et al., 2016; Simonyan & Zisserman, 2014).

[Figure 1: The step size vs. iteration for nonconvex functions: in a neighborhood of the solution, the learning rate can be a constant value below a threshold.]

To solve the problem given in Eq. (1b), one can compute the gradient of the objective function directly with the classic gradient descent (GD) algorithm. However, this method suffers from expensive gradient computation when N is extremely large, so the stochastic gradient method (SGM) is commonly applied to address this issue. Incremental gradient descent (IGD) is a basic version of SGM, in which the gradient is computed on a single component f_i at each iteration instead of on the whole sum. As a special case of IGD, SGD (Robbins & Monro, 1951), a fundamental method for training neural networks, updates parameters using the gradient computed on a minibatch. There exists an intuitive view that SGD with constant step size (SGD-CS) potentially leads to faster
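To make the setup concrete, the following is a minimal sketch of vanilla SGD with a constant step size (SGD-CS) on a toy finite-sum least-squares instance of Eq. (1b). The problem data, step size, and epoch count are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Toy finite-sum objective f(x) = (1/N) * sum_i 0.5 * (a_i . x - b_i)^2,
# an illustrative noiseless least-squares instance (assumed, not from the paper).
rng = np.random.default_rng(0)
N, d = 200, 5
A = rng.normal(size=(N, d))
x_true = rng.normal(size=d)
b = A @ x_true  # consistent system, so all f_i share the minimizer x_true

def grad_i(x, i):
    # Gradient of the i-th component f_i(x) = 0.5 * (a_i . x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

def sgd_constant_step(lr=0.02, epochs=50):
    # Vanilla SGD-CS: one component gradient per update (IGD / minibatch size 1),
    # with a constant learning rate throughout, never decayed.
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(N):
            x -= lr * grad_i(x, i)
    return x

x_hat = sgd_constant_step()
print(np.linalg.norm(x_hat - x_true))  # residual norm of the iterate
```

In this over-parameterized-style regime (every component f_i is minimized at the same point), the constant step size does not prevent convergence, which is the local behavior the paper's theory addresses for nonconvex training.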

