TRAINING BY VANILLA SGD WITH LARGER LEARNING RATES
Anonymous authors
Paper under double-blind review

Abstract

The stochastic gradient descent (SGD) method, first proposed in the 1950s, has been the foundation of deep-neural-network (DNN) training, with numerous enhancements such as adding momentum, adaptively selecting learning rates, or combining both strategies and more. A common view of SGD is that the learning rate should eventually be made small in order to reach sufficiently good approximate solutions. Another widely held view is that vanilla SGD is out of fashion compared to many of its modern variations. In this work, we make the contrarian claim that, when training over-parameterized DNNs, vanilla SGD can still compete well with, and oftentimes outperform, its more recent variations simply by using learning rates significantly larger than commonly used values. We establish theoretical results to explain this local convergence behavior of SGD on nonconvex functions, and present computational evidence across multiple tasks, including image classification, speech recognition and natural language processing, to support the practice of using larger learning rates.

1. INTRODUCTION

We are interested in minimizing a function f : ℝ^d → ℝ in expectation form:

min_{x∈ℝ^d} f(x) := E_ξ[K(x, ξ)], (1a)

where the subscript indicates that the expectation is taken over the random variable ξ. Specifically, if the probability distribution of ξ is clear and ξ is uniformly distributed over N terms, the objective function f can be expressed in a finite-sum form:

min_{x∈ℝ^d} f(x) := (1/N) Σ_{i=1}^N f_i(x), (1b)

where f_i : ℝ^d → ℝ is the i-th component function. The optimization of Eqs. (1a) and (1b) is widely encountered in machine learning tasks (Goodfellow et al., 2016; Simonyan & Zisserman, 2014).

Figure 1: The step size vs. iteration for nonconvex functions: in a neighborhood of the solution, the learning rate can be a constant value below a threshold.

To solve the problem given in Eq. (1b), we can compute the gradient of the objective function directly with the classic GD (gradient descent) algorithm. However, this method suffers from expensive gradient computation when N is extremely large, and hence stochastic gradient methods (SGM) are applied to address this issue. Incremental gradient descent (IGD) is a primary version of SGM, in which the gradient is computed on a single component f_i at each iteration instead of on the whole sum. As a special case of IGD, SGD (Robbins & Monro, 1951), a fundamental method for training neural networks, updates parameters using the gradient computed on a minibatch. There is an intuitive view that SGD with a constant step size (SGD-CS) potentially leads to a faster convergence rate. Some studies (Solodov, 1998; Tseng, 1998) show that, under a so-called strong growth condition (SGC), SGD-CS converges to an optimal point faster than SGD with a diminishing step size. In this work, we study the local convergence (or last-iterate convergence, as it is called in Jain et al. (2019)) of SGD-CS on nonconvex functions. Note that SGD-CS does not provide a guarantee of convergence when started from an arbitrary initialization point.
Therefore, a useful strategy is to use SGD with a decreasing step size at the beginning, and then switch to SGD-CS in a neighborhood of a minimizer. Fig. 1 illustrates the range of the step size of SGD versus the number of iterations. Our main theoretical and experimental contributions are as follows.

• We establish local (or last-iterate) convergence of SGD with a constant step size (SGD-CS) on nonconvex functions under the interpolation condition. Previous results are mostly for strongly convex functions under the strong (or weak) growth condition; our result is much closer to common situations in practice.

• We discover that on linear regression problems with ℓ2 regularization, the range of convergent learning rates can be quite large for incremental gradient descent (IGD). Our numerical results show that, within a fairly large range, the larger the step size is, the smaller the spectral radius is, and the faster the convergence is.

• Based on the above observations, we propose a strategy called SGDL that uses SGD with a large initial learning rate (more than 10 times larger than the learning rate typically used in SGD with momentum), while still being vanilla SGD. We conduct extensive experiments on various popular deep-learning tasks and models in computer vision, audio recognition and natural language processing. Our results show that the method converges reliably, has strong generalization performance, and sometimes outperforms its advanced variant (SGDM) and several other popular adaptive methods (e.g., Adam, AdaDelta).
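As a toy illustration of this intuition (all problem data below are synthetic assumptions, not from the paper), the following sketch runs SGD-CS with two constant step sizes on an over-parameterized least-squares problem where the interpolation property holds; the larger constant step size reaches a much smaller residual in the same budget:

```python
# Toy illustration (all data synthetic, not from the paper): SGD-CS on an
# over-parameterized least-squares problem where interpolation holds, so a
# larger constant step size gives faster convergence.
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 50                        # d > N: over-parameterized, Ax = b is solvable
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d)       # noiseless labels => interpolation property

def sgd_cs(alpha, iters=5000):
    """Run SGD with constant step size alpha; return the final residual norm."""
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(N)                          # sample one component f_i
        x -= alpha * (A[i] @ x - b[i]) * A[i]        # minibatch of size m = 1
    return float(np.linalg.norm(A @ x - b))

err_small, err_large = sgd_cs(0.001), sgd_cs(0.01)
print(err_small, err_large)          # the larger step size ends far closer to zero
```

Because the system interpolates, both runs converge; the point is only that the larger constant step size does so faster, consistent with the claim above.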

2. RELATED WORK

There are many papers on stochastic optimization; we summarize the ones most relevant to our work. The convergence of SGD for over-parameterized models is analyzed in (Vaswani et al., 2018; Mai & Johansson, 2020; Allen-Zhu et al., 2019; Li & Liang, 2018). The power of interpolation is studied in (Ma et al., 2018; Vaswani et al., 2019). The work (Jastrzębski et al., 2017) finds that a large ratio of learning rate to batch size often leads to a wide endpoint. The work (Smith & Topin, 2019) shows a phenomenon called super-convergence, which stands in contrast to the results in (Bottou et al., 2018). More recently, several new learning-rate schedules have been proposed for SGD (Loshchilov & Hutter, 2016; Smith, 2017; Agarwal et al., 2017; Carmon et al., 2018). Adaptive gradient methods are widely used in deep learning applications. Popular examples include AdaDelta (Zeiler, 2012), RMSProp (Hinton et al., 2012), Adam (Kingma & Ba, 2014) and so on. Unfortunately, it is believed that adaptive methods may have poor empirical performance. For example, Wilson et al. (2017) observed that Adam hurts generalization performance in comparison to SGD with or without momentum. The work (Schmidt & Roux, 2013) shows that SGD-CS attains a linear convergence rate for strongly convex functions under the strong growth condition (SGC), and a sublinear rate for convex functions under the SGC. The work (Cevher & Vu, 2018) investigates the weak growth condition (WGC), a necessary condition for linear convergence of SGD-CS. For a general convex function, the work (Ma et al., 2018) shows that SGD-CS attains a linear convergence rate under the interpolation property. For a finite sum f, the interpolation property means ∇f_i(x*) = 0 for i = 1, ..., N, which clearly holds for over-parameterized DNNs. However, the theoretical analysis of SGD-CS for nonconvex functions is far less mature than in the convex case.

3.1. NOTATIONS

We denote the minimizer of the objective function f by x* and use the usual partial ordering for symmetric matrices: A ⪰ B means A − B is positive semidefinite, and similarly for the relations ⪯, ≻, ≺. The norm ‖·‖ denotes either the Euclidean norm for vectors or the Frobenius norm for matrices. The Hessian matrix of each component function f_i(x) is denoted by H_i(x). With a slight abuse of notation, the Hessian matrix H_i at x* is denoted by H*_i. We also denote the mean and variance of {H*_1, ..., H*_N} by H* ≡ (1/N) Σ_{i=1}^N H*_i and Σ* ≡ (1/N) Σ_{i=1}^N (H*_i)² − (H*)², and define a ∧ b = min(a, b). X →^P Y means the random variable X converges to the random variable Y in probability, while X →^{L¹} Y means X converges to Y in the L¹ norm.
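To make the notation concrete, here is a small sketch (the quadratic f_i and all names are our own assumptions, not the paper's code) computing H* and Σ* for a regularized least-squares finite sum; since Σ* is a matrix variance, it is positive semidefinite:

```python
# Concrete computation of H* and Σ* from this section for a small regularized
# least-squares finite sum (an assumed toy f_i, not the paper's code).
# Σ* is a matrix variance, hence positive semidefinite.
import numpy as np

rng = np.random.default_rng(1)
N, d, mu = 5, 3, 0.1
A = rng.standard_normal((N, d))
# For f_i(x) = (N/2)(a_i^T x - b_i)^2 + (mu/2)||x||^2 the Hessians are constant:
H = [N * np.outer(A[i], A[i]) + mu * np.eye(d) for i in range(N)]
H_star = sum(H) / N                                      # H*  = (1/N) Σ H*_i
Sigma = sum(Hi @ Hi for Hi in H) / N - H_star @ H_star   # Σ* = (1/N) Σ (H*_i)² − (H*)²
print(np.linalg.eigvalsh(Sigma).min())                   # ≥ 0 up to rounding
```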

3.2. ASSUMPTIONS

Our results hold under the following mild assumptions:

(A.1) The objective function f has the finite-sum structure given in Eq. (1b).

(A.2) Each component function f_i in Eq. (1b) is twice continuously differentiable, with the corresponding Hessian matrix H_i being Lipschitz continuous with constant L_i at the minimizer x*. Without loss of generality, we suppose that ‖H_i(x) − H_i(x*)‖ ≤ L‖x − x*‖ for i = 1, ..., N, where L = max_{i=1,...,N} L_i.

(A.3) As in (Zhang et al., 2000, Remark 5.2), we assume the point of interest x* is a strong minimizer (i.e., H(x*) ≻ 0).

Assumption (A.3) requires the objective function to be essentially strongly convex in a sufficiently small neighborhood of a minimizer. The work (Liu et al., 2020) shows that the loss function is typically nonconvex in any neighborhood of a minimizer; however, this assumption is not restrictive in the regularized setting.

Remark 1. Even if the Hessian matrix at the point of interest x* has zero eigenvalues, we can study the ℓ2-regularized problem instead, for which the Hessian matrix becomes positive definite. This is also validated in our numerical experiment in Section 5, and we add an ℓ2 regularizer to the DNNs in Section 6.

Based on the above technical assumptions, we focus on the convergence analysis of SGD-CS, whose iteration is

x_{k+1} = x_k − α (1/m) Σ_{j=1}^m ∇f_{ξ_k^{(j)}}(x_k). (2)

Next, we introduce the notion of a point of attraction to help us understand the convergence behavior of Eq. (2).

Definition 1. We say x* is a point of attraction of Eq. (2) if there is an open ball N(x*, ε) around x* such that, for any x_0 ∈ N, the sequence {x_k} generated by Eq. (2) lies entirely in N and converges to x* in the L¹ norm.
The Ostrowski theorem says that a sufficient condition for x* to be a point of attraction of the deterministic iteration x_{k+1} = T(x_k) is that the spectral radius of the Jacobian of T at x* is strictly less than one, which further implies that there exists an open ball N such that once the initial x_0 ∈ N, the iterates {x_k} converge to x* at a linear rate. We define the strong minimizer x* as a point of strong attraction in a similar manner.

Definition 2. We say x* is a point of strong attraction of Eq. (2) if there exists a neighborhood N of x* such that, for any x_0 ∈ N, the sequence {x_k} generated by Eq. (2) lies entirely in N and satisfies E‖x_k − x*‖² ≤ ρ^k ‖x_0 − x*‖² for some ρ ∈ [0, 1).

The following discussion studies necessary and sufficient conditions for the minimizer x* to be a point of attraction of SGD with constant step size α. However, starting from an arbitrary initialization point, SGD-CS is not guaranteed to find such a local neighborhood around x*, except in some special cases, including the loss functions studied in Section 5 and Appendix A.2. Therefore, a practical strategy is to use a diminishing step size when starting from a random initial point (Allen-Zhu, 2018) and, once the iterates enter a neighborhood of a point of attraction, to switch to a constant step size.
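The Ostrowski-style condition is easy to check numerically. The sketch below (the toy Hessian Q is an assumption made for illustration) evaluates the spectral radius of the Jacobian I − αQ of the deterministic GD map on a quadratic; the iteration contracts locally exactly when this radius is below one, i.e. when 0 < α < 2/λ_max(Q):

```python
# Numerical check of the Ostrowski-style condition on a toy quadratic: for the
# GD map T(x) = x − α∇f(x) with Hessian Q, the Jacobian is I − αQ, and local
# linear convergence holds iff its spectral radius is below one.
import numpy as np

Q = np.diag([1.0, 4.0, 10.0])        # toy Hessian; 2/λ_max(Q) = 0.2

def spectral_radius(alpha):
    J = np.eye(3) - alpha * Q        # Jacobian of the GD map at the minimizer
    return np.abs(np.linalg.eigvals(J)).max()

print(spectral_radius(0.15))         # < 1: x* is a point of attraction
print(spectral_radius(0.25))         # > 1: the iteration repels
```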

4. RESULTS FOR NONCONVEX FUNCTIONS

In this section, we rigorously establish a necessary condition for the minimizer x* to be a point of attraction (Theorem 1), and then provide a sufficient condition for the strong minimizer x* to be a point of strong attraction with high probability (Theorem 3). The detailed proofs are in Appendix A.1.

4.1. NECESSARY CONDITION

Lemma 1. Suppose that assumptions (A.1) and (A.2) hold. Then ‖∇f_i(x) − ∇f_i(x*)‖ < ε(Lε + ‖H*_i‖) for i = 1, ..., N, provided that x ∈ N(x*, ε).

Theorem 1. Suppose that assumptions (A.1) and (A.2) hold. If the minimizer x* is a point of attraction of Eq. (2), then the interpolation property is satisfied, i.e., ∇f_i(x*) = 0 for i = 1, ..., N.

4.2.1. SUFFICIENT CONDITION FOR POINTS OF STRONG ATTRACTION

Define the error function e_i(x) as

e_i(x) = −H*_i (x − x*) + [∇f_i(x) − ∇f_i(x*)]. (3)

This error function quantifies the residual of the first-order Taylor expansion of ∇f_i(x) around x*, since ∇f_i(x) = ∇f_i(x*) + H*_i (x − x*) + e_i(x). Substituting e_i(x) into Eq. (2), we have

x_{k+1} − x* = x_k − α (1/m) Σ_{j=1}^m [H*_{ξ_k^{(j)}} (x_k − x*) + e_{ξ_k^{(j)}}(x_k)] − x*
            = (1/m) Σ_{j=1}^m [(I − αH*_{ξ_k^{(j)}})(x_k − x*) − α e_{ξ_k^{(j)}}(x_k)]. (4)

In addition, our proof of the sufficient condition relies on a bound on the error function.

Lemma 2. Suppose that assumptions (A.1) and (A.2) hold, and x* is a local minimizer of f. If the interpolation property is satisfied, then the error function defined in Eq. (3) is bounded by

‖e_i(x)‖ ≤ L‖x − x*‖². (5)

Under the assumption that the iterates {x_k} stay in the local neighborhood, we can show that the interpolation property is a sufficient condition for x* to be a point of strong attraction of Eq. (2), provided that the step size α is sufficiently small.

Theorem 2. Suppose that assumptions (A.1)-(A.3) hold. Then the iterates {x_k} generated by Eq. (2) all lie in the neighborhood N(x*, ε) and satisfy E[‖x_k − x*‖²] ≤ ρ E[‖x_{k-1} − x*‖²], if the contraction factor ρ, the step size α and the radius ε satisfy the following conditions:

ρ ≡ λ_max( I − 2αH* + α²(H*)² + α²Σ* + α²L²ε² I + 2αLε I ) < 1, (6a)
0 < α ≤ min{ min_{i=1,...,N} 1/λ_max(H*_i), 1/(2λ_max(H*)) },  0 < ε < λ_min(H*)/L, (6b)
0 < α < 2 λ_min( [(H*)² + Σ* + L²ε² I]^{−1/2} (H* − Lε I) [(H*)² + Σ* + L²ε² I]^{−1/2} ). (6c)

If, in addition, ∇f_i(x*) = 0 for i = 1, ..., N, then the minimizer x* is a point of strong attraction of the iteration (2).
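Condition (6a) can be evaluated directly once the component Hessians are known. The sketch below (a synthetic toy problem; the component Hessians, L and ε are all assumptions made for illustration, not the paper's code) computes the contraction factor ρ(α) for a few step sizes:

```python
# Sketch: evaluating the contraction factor ρ of condition (6a) on a synthetic
# problem. The Hessians H*_i, the Lipschitz constant and the radius ε are all
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 6
H = [np.eye(d) + 0.5 * W @ W.T
     for W in (rng.standard_normal((d, d)) for _ in range(N))]   # H*_i ≻ 0
H_star = sum(H) / N
Sigma = sum(Hi @ Hi for Hi in H) / N - H_star @ H_star
L_hess, eps = 1.0, 0.01              # assumed Hessian Lipschitz constant, radius

def rho(alpha):
    M = (np.eye(d) - 2 * alpha * H_star + alpha**2 * (H_star @ H_star)
         + alpha**2 * Sigma
         + (alpha * L_hess * eps)**2 * np.eye(d)
         + 2 * alpha * L_hess * eps * np.eye(d))
    return np.linalg.eigvalsh(M).max()

print([rho(a) for a in (1e-4, 1e-3, 1e-2)])   # ρ < 1 for small enough α
```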

4.2.2. STAYING IN THE LOCAL NEIGHBORHOOD

We now show that the iterates {x_k} generated by Eq. (2) stay in the local neighborhood with high probability. The proof follows the idea in (Tan & Vershynin, 2019). In the remainder of this section, we slightly abuse notation and write X_k for the random vector at the k-th iteration of the SGD-CS scheme, and x_k for a realization of X_k. Considering the setup of Theorem 2 without assuming that {x_k} always lies in the neighborhood N(x*, ε), we have the following technical results.

Lemma 3. Let x_0 ∈ N(x*, ε) be such that ‖x_0 − x*‖ ≤ √δ ε for some δ ∈ [0, 1). Define the stopping time τ = min{k : X_k ∉ N(x*, ε)}. Then P{τ = ∞} ≥ 1 − δ.

Remark 2. Under the setup of Lemma 3, the sequence {x_k} stays in the local neighborhood N(x*, ε) with probability at least 1 − δ. Indeed, we can recover the result of Theorem 2 by relaxing its assumption to requiring only that the sequence generated by Eq. (2) stays in the local neighborhood for all iterations.

Under the setup of Lemma 3, we establish the sufficient condition for points of strong attraction with a probabilistic guarantee.

Theorem 3. There exists an event E with probability P(E) ≥ 1 − δ on which the sequence {x_k} generated by Eq. (2) satisfies

E[‖X_k − x*‖² 1_E] ≤ ρ^k ‖x_0 − x*‖². (7)

Remark 3. Clearly, x* is a point of attraction if it is a point of strong attraction. Therefore, on the event that the iterates generated by SGD-CS stay in the local neighborhood, the interpolation property becomes a necessary and sufficient condition for the strong minimizer x* to be a point of attraction of Eq. (2).

5. CASE STUDIES ON LARGE LEARNING RATES

This section shows that incremental gradient descent (IGD) with a constant step size actually converges to the global optimum for some special problems; numerical experiments are then presented to illustrate this point. We present the results through the following example, with more examples in Appendix A.2:

f(x) = (1/2)‖Ax − b‖² + (μ/2)‖D^{1/2}x‖². (8)

Under some mild assumptions (cf. Example 3 in Appendix A.2), IGD with constant step size t converges when

t ∈ (0, 2N / ρ(μD + AᵀA)),

where ρ(·) denotes the spectral radius. The problem in Eq. (8) can be extended to general nonlinear least squares,

f(x) = (1/2) Σ_{i=1}^N r_i²(x) = (1/2) R(x)ᵀR(x),

for which a sufficient condition for the convergence of IGD is

t ∈ (0, 1/λ_max(E[H_i(x*)])),  where E[H_i(x*)] ≡ (1/N) Σ_i H_i(x*),

and H_i(·) denotes the Hessian matrix of the i-th component function. We conduct a numerical simulation for the regularized least-squares problem in Eq. (8). The training data x are generated via x = 1/(1 + e^z) + N(0, σ²), where z is uniformly sampled from (0, 10). Because of the noise, the minimizer x* does not necessarily represent the best available solution, so all approximate solutions within the same noise level are, in fact, equally good. The experimental results demonstrate that, even with large step sizes, IGD still converges and performs decently, which meets our expectations. After the first sharp rise, the step size can be updated with little sensitivity to the error. In the next section, we implement vanilla SGD with a large initial learning rate and use this optimizer for large-scale problems.
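The step-size range above is easy to probe numerically. The sketch below (synthetic A, μ and D; an illustration, not the paper's experiment) builds the end-to-end IGD iteration matrix K_N(t) for Eq. (8) and checks that its spectral radius stays below one well inside the predicted range:

```python
# Spectral radius of one full IGD sweep on the regularized least squares of
# Eq. (8), with synthetic data (illustrative, not the paper's experiment).
import numpy as np

rng = np.random.default_rng(3)
N, d = 40, 5
A = rng.standard_normal((N, d))
mu, D = 0.5, np.eye(d)

def K(t):
    """End-to-end matrix of N incremental steps with step size t."""
    M = np.eye(d)
    for i in range(N):   # f_i = (1/2)(a_i^T x - b_i)^2 + (mu/2N)||D^{1/2}x||^2
        M = (np.eye(d) - t * np.outer(A[i], A[i]) - (t * mu / N) * D) @ M
    return M

def rho(t):
    return np.abs(np.linalg.eigvals(K(t))).max()

t_bound = 2 * N / np.abs(np.linalg.eigvals(mu * D + A.T @ A)).max()
print(t_bound, rho(0.01), rho(0.05))   # both tested step sizes contract (ρ < 1)
```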

6. EXPERIMENTS

We present empirical results on different models to compare our SGDL with other popular optimization methods, including SGD, SGDM, Adam, and AdaDelta. We focus on the CIFAR10 and CIFAR100 image classification tasks (Krizhevsky et al., 2009), the downsampled variant of ImageNet named ImageNet32 (Chrabaszcz et al., 2017), the Speech Commands Dataset for audio recognition (Warden, 2018), and the language modeling task on Penn Treebank (Marcus et al., 1993).

The first experiment is carried out on the CIFAR10 and CIFAR100 datasets, which are standard image collections with 10 and 100 classes, respectively. After the preprocessing described in Appendix A.3.1, we train VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016) and DenseNet (Huang et al., 2017). We compare the different optimization methods, including SGD, SGDM, Adam and AdaDelta, and show their performance in Fig. 4. We can see that our method works fairly well and matches or exceeds the performance of SGDM and the other optimizers.

6.2. IMAGENET32

The second experiment is carried out on the ImageNet32 dataset (Chrabaszcz et al., 2017), a downsampled version of ImageNet (Deng et al., 2009) with more than a million images in 1000 classes, out of which 50,000 images are used as a test set. Each image has 32 × 32 pixels. The objective is to train an image classifier. We apply a training strategy similar to the one used on CIFAR10 and CIFAR100 in Section 6.1, except that the epoch budget is reduced to 120 and the learning rate is shrunk to one-tenth of its current value every 30 epochs. Our experiments focus only on ResNet56 due to computational resource limitations. To match the difficulty of this dataset, we increase the model capacity by quadrupling the number of channels in each convolution layer. From the results in Fig. 5, we find that our method outperforms SGDM slightly, by about 0.5% in best accuracy. More interestingly, SGD also performs well, while the two adaptive methods both work only passably.

6.3. AUDIO RECOGNITION

The third experiment is carried out on the Speech Commands Dataset, which consists of recordings from thousands of different people in uncontrolled recording conditions. Each recording is one second long and is represented as a 16000-dimensional vector. We divide the data into a training set and a test set. On this dataset, we train a 2-layer neural net with 20 channels, a 2D dropout layer and 2 fully connected layers with 1,000 hidden nodes. More specific hyperparameter settings are given in Appendix A.3.2. From the results in Fig. 6, we can see that SGDL is quite competitive and substantially outperforms vanilla SGD.

6.4. LSTM LANGUAGE MODEL

The fourth experiment is carried out on the standard language modeling task on the Penn Treebank (PTB) dataset. Learning rates for this task have been selected and tuned by the community for several years, and the current state-of-the-art results support our perspective on large learning rates (Merity et al., 2017a). We compare the effect of SGDL, SGDM and other optimizers on top of a state-of-the-art {2, 3}-layer LSTM training recipe with several training tricks (Merity et al., 2017b; a; Inan et al., 2016; Gal & Ghahramani, 2016). The details of the LSTMs are in Appendix A.3.3. In Fig. 7, we evaluate performance by the perplexity metric. We find that Adam makes the fastest initial progress, but its final performance is worse than that of our SGDL. We also observe that AdaDelta performs poorly, while SGDM follows almost the same curves as Adam.

6.5. DISCUSSION

In Sections 6.1-6.4, we have observed that when vanilla SGD uses the same learning rate as SGDM, it tends to have poor generalization performance on some tasks, and this is one of the reasons why SGD is usually shelved for training models nowadays. Our experiments in Fig. 3 show that a larger initial learning rate increases test accuracy. Furthermore, SGDL performs admirably in our experiments and outperforms SGDM on some tasks. For various deep neural networks such as ResNet, DenseNet and VGG, SGDL works well, unaffected by the number of layers or the presence of the residual mechanism. The experimental results provide evidence that SGD with a large learning rate is a good alternative to SGD with momentum in some practical tasks.

7. CONCLUSION

We provide a rigorous proof of the local convergence behavior of SGD-CS on smooth nonconvex functions.

A APPENDIX

A.1 PROOFS FOR SECTION 4

Proof of Lemma 1. For i = 1, ..., N, by the Lipschitz continuity of H_i at x*, for any y ∈ N(x*, ε),

‖H_i(y) − H*_i‖ ≤ L‖y − x*‖ < Lε,

and therefore ‖H_i(y)‖ ≤ ‖H_i(y) − H*_i‖ + ‖H*_i‖ < Lε + ‖H*_i‖. For fixed x ∈ N(x*, ε), define g_i(t) = ∇f_i(x* + t(x − x*)), so that g_i′(t) = H_i(x* + t(x − x*))(x − x*). By the fundamental theorem of calculus,

‖∇f_i(x) − ∇f_i(x*)‖ = ‖g_i(1) − g_i(0)‖ = ‖∫₀¹ g_i′(t) dt‖ ≤ ∫₀¹ ‖g_i′(t)‖ dt
 ≤ ∫₀¹ ‖x − x*‖ ‖H_i(x* + t(x − x*))‖ dt < ∫₀¹ ‖x − x*‖ (Lε + ‖H*_i‖) dt < ε(Lε + ‖H*_i‖).

Proof of Theorem 1. Since x* is a point of attraction, the sequence {x_k} satisfies the Cauchy criterion lim_{N→∞} sup_{m,n≥N} E‖x_m − x_n‖ = 0, and hence lim_{k→∞} E‖x_{k+1} − x_k‖ = 0. Substituting x_{k+1} from iteration (2),

0 = lim_{k→∞} E‖α (1/m) Σ_{j=1}^m ∇f_{ξ_k^{(j)}}(x_k)‖
 = α lim_{k→∞} E[ E[ ‖(1/m) Σ_{j=1}^m ∇f_{ξ_k^{(j)}}(x_k)‖ | x_k ] ]
 = α lim_{k→∞} Σ_{ℓ_{1:m} ∈ {1,...,N}^m} (1/N^m) E‖(1/m) Σ_{j=1}^m ∇f_{ℓ_j}(x_k)‖. (9)

Since ‖(1/m) Σ_{j=1}^m ∇f_{ℓ_j}(x_k)‖ ≥ 0 for every ℓ_{1:m} ∈ {1, ..., N}^m and all k, we obtain lim_{k→∞} E‖(1/m) Σ_{j=1}^m ∇f_{ℓ_j}(x_k)‖ = 0 for each ℓ_{1:m}. In particular, taking ℓ_1 = ··· = ℓ_m = i for fixed i = 1, ..., N gives lim_{k→∞} E‖∇f_i(x_k)‖ = 0. For fixed i, the hypotheses of the bounded convergence theorem (Durrett, 2019, Theorem 1.5.3) are satisfied:

• ‖∇f_i(x)‖ ≤ ‖∇f_i(x*)‖ + ε(Lε + ‖H*_i‖) for any x ∈ N(x*, ε), by Lemma 1.

• ‖∇f_i(x_k)‖ →^P ‖∇f_i(x*)‖: since x_k →^{L¹} x* implies ∇f_i(x_k) →^{L¹} ∇f_i(x*), the comparison theorem for convergence gives ‖∇f_i(x_k)‖ →^P ‖∇f_i(x*)‖.

Applying the bounded convergence theorem to (9) over the region N(x*, ε) gives

E[lim_{k→∞} ‖∇f_i(x_k)‖] = 0 ⟹ E‖∇f_i(x*)‖ = ‖∇f_i(x*)‖ = 0, for i = 1, ..., N,

which implies the desired result.

Proof of Lemma 2. By the interpolation property and the definition of e_i(x), for i = 1, ..., N,

e_i(x) = −H*_i (x − x*) + ∇f_i(x)
       = −H*_i (x − x*) + ∇f_i(x* + (x − x*))
       = −H*_i (x − x*) + [∇f_i(x*) + H_i(x* + t(x − x*)) (x − x*)]  for some t ∈ [0, 1] (10a)
       = [H_i(x* + t(x − x*)) − H*_i] (x − x*), (10b)

where (10a) is by the Taylor expansion of ∇f_i(x), and (10b) is because ∇f_i(x*) = 0. By the Cauchy-Schwarz inequality,

‖e_i(x)‖ ≤ ‖H_i(x* + t(x − x*)) − H*_i‖ ‖x − x*‖ ≤ L‖t(x − x*)‖ ‖x − x*‖ (10c)
         ≤ L‖x − x*‖²,

where (10c) is by the Lipschitz continuity of H_i at x*.

Proof of Theorem 2. Since the step size satisfies α ≤ min_{i∈{1,...,N}} 1/λ_max(H*_i), we have

0 ⪯ I − αH*_i ⪯ I, for i = 1, ..., N.

Substituting (4) into ‖x_{k+1} − x*‖² gives

‖x_{k+1} − x*‖² = ‖(1/m) Σ_{j=1}^m [(I − αH*_{ξ_k^{(j)}})(x_k − x*) − α e_{ξ_k^{(j)}}(x_k)]‖²
 ≤ max_{j∈{1,...,m}} [ (x_k − x*)ᵀ(I − αH*_{ξ_k^{(j)}})²(x_k − x*) + α²‖e_{ξ_k^{(j)}}(x_k)‖² − 2α(x_k − x*)ᵀ(I − αH*_{ξ_k^{(j)}}) e_{ξ_k^{(j)}}(x_k) ] (11a)
 ≤ max_{j∈{1,...,m}} [ (x_k − x*)ᵀ(I − αH*_{ξ_k^{(j)}})²(x_k − x*) + α²‖e_{ξ_k^{(j)}}(x_k)‖² + 2α‖x_k − x*‖ ‖I − αH*_{ξ_k^{(j)}}‖ ‖e_{ξ_k^{(j)}}(x_k)‖ ] (11b)
 ≤ max_{j∈{1,...,m}} [ (x_k − x*)ᵀ(I − αH*_{ξ_k^{(j)}})²(x_k − x*) + α²L²‖x_k − x*‖⁴ + 2αL‖I − αH*_{ξ_k^{(j)}}‖ ‖x_k − x*‖³ ] (11c)
 ≤ max_{j∈{1,...,m}} (x_k − x*)ᵀ[ (I − αH*_{ξ_k^{(j)}})² + α²L²ε² I + 2αLε I ](x_k − x*), (11d)

where (11a) uses ‖(1/d) Σ_{i=1}^d a_i‖² ≤ max_{i∈{1,...,d}} ‖a_i‖²; (11b) follows from the Cauchy-Schwarz inequality applied to the last term; (11c) follows from Lemma 2; and (11d) holds because the sequence {x_k} lies in N(x*, ε) and ‖I − αH*_{ξ_k^{(j)}}‖ ≤ 1.
Therefore, taking the conditional expectation of both sides given x_k yields

E[‖x_{k+1} − x*‖² | x_k] ≤ max_{j∈{1,...,m}} (x_k − x*)ᵀ[ E_{ξ_k^{(j)}}[(I − αH*_{ξ_k^{(j)}})²] + α²L²ε² I + 2αLε I ](x_k − x*). (12)

The expectation term E_{ξ_k^{(j)}}[(I − αH*_{ξ_k^{(j)}})²] simplifies as follows:

E_{ξ_k^{(j)}}[(I − αH*_{ξ_k^{(j)}})²] = E_{ξ_k^{(j)}}[ I − 2αH*_{ξ_k^{(j)}} + α²(H*_{ξ_k^{(j)}})² ]
 = I − 2αH* + α² (1/N) Σ_{i=1}^N (H*_i)²
 = I − 2αH* + α²(H*)² + α²[ (1/N) Σ_{i=1}^N (H*_i)² − (H*)² ]
 = I − 2αH* + α²(H*)² + α²Σ*. (13)

Combining (12) and (13) gives an upper bound on E[‖x_{k+1} − x*‖² | x_k]:

E[‖x_{k+1} − x*‖² | x_k] ≤ (x_k − x*)ᵀ[ I − 2αH* + α²(H*)² + α²Σ* + α²L²ε² I + 2αLε I ](x_k − x*).

Since α and ε are sufficiently small that

0 ≤ ρ ≡ λ_max( I − 2αH* + α²(H*)² + α²Σ* + α²L²ε² I + 2αLε I ) < 1,

the conditional expectation is further upper bounded by

E[‖x_{k+1} − x*‖² | x_k] ≤ ρ‖x_k − x*‖². (14)

Therefore, E[‖x_{k+1} − x*‖²] = E[E[‖x_{k+1} − x*‖² | x_k]] ≤ ρE[‖x_k − x*‖²], and inductively

E[‖x_k − x*‖²] ≤ ρE[‖x_{k-1} − x*‖²] ≤ ··· ≤ ρ^k ‖x_0 − x*‖². (15)

It remains to verify that (6b) and (6c) indeed imply 0 ≤ ρ < 1. By (6b),

H* ≻ Lε I  and  I − 2αH* ⪰ 0. (16)

Since H* − Lε I ≻ 0, condition (6c) gives

α((H*)² + Σ*) + αL²ε² I ≺ 2H* − 2Lε I. (17)

Combining (16) and (17) gives

0 ⪯ I − 2αH* + α²(H*)² + α²Σ* + α²L²ε² I + 2αLε I ≺ I.

Proof of Lemma 3. Let F_k denote the σ-algebra generated by the first k SGD-CS random vectors ξ_1^{(1:m)}, ..., ξ_k^{(1:m)}. Construct Z_k = ‖X_{τ∧k} − x*‖² / ρ^{τ∧k}. We first show that Z_k is a supermartingale:

E[Z_{k+1} | F_k] = E[ ‖X_{τ∧(k+1)} − x*‖² / ρ^{τ∧(k+1)} · 1_{τ≤k} | F_k ] + E[ ‖X_{τ∧(k+1)} − x*‖² / ρ^{τ∧(k+1)} · 1_{τ>k} | F_k ]
 = E[ ‖X_{τ∧k} − x*‖² / ρ^{τ∧k} · 1_{τ≤k} | F_k ] + E[ ‖X_{k+1} − x*‖² / ρ^{k+1} · 1_{τ>k} | F_k ]
 (i) = Z_k 1_{τ≤k} + (1/ρ^{k+1}) E[ ‖X_{k+1} − x*‖² 1_{τ>k} | F_k ]
 (ii) ≤ Z_k 1_{τ≤k} + (1/ρ^{k+1}) ρ E[ ‖X_k − x*‖² 1_{τ>k} | F_k ]
 = Z_k 1_{τ≤k} + Z_k 1_{τ>k} = Z_k,

where (i) holds because ‖X_{τ∧k} − x*‖²/ρ^{τ∧k} · 1_{τ≤k} is measurable with respect to F_k, and (ii) follows by applying (14) from Theorem 2.
As a result,

Z_0 ≥ E[Z_k | F_0] ≥ E[ ‖X_{τ∧k} − x*‖²/ρ^{τ∧k} · 1_{k≥τ} | F_0 ] ≥ E[ ‖X_τ − x*‖²/ρ^τ · 1_{k≥τ} | F_0 ]. (18)

By the definition of the stopping time, ‖X_τ − x*‖² ≥ ε²; and Z_0 := ‖x_0 − x*‖² ≤ δε², since ‖x_0 − x*‖ ≤ √δ ε. Substituting these two relations into (18) gives

δε² ≥ E[ ε²/ρ^τ · 1_{k≥τ} | F_0 ] ⟹ δ ≥ E[ 1_{k≥τ}/ρ^τ | F_0 ] ≥ E[1_{k≥τ} | F_0] ≥ P{k ≥ τ}, ∀k.

Letting k → ∞ gives δ ≥ P(τ < ∞), which implies P(τ = ∞) ≥ 1 − δ.

Proof of Theorem 3. Define the event E = {τ = ∞}. It suffices to show relation (7). By direct calculation,

E[‖X_{k+1} − x*‖² 1_{τ>k+1} | X_k = x_k] ≤ E[‖X_{k+1} − x*‖² 1_{τ>k} | X_k = x_k]
 = E[‖X_{k+1} − x*‖² | X_k = x_k] 1_{τ>k} ≤ ρ‖x_k − x*‖² 1_{τ>k},

where the last inequality follows from (14). As a result,

E[‖X_{k+1} − x*‖² 1_{τ>k+1}] = E[ E[‖X_{k+1} − x*‖² 1_{τ>k+1} | X_k] ] ≤ ρE[‖X_k − x*‖² 1_{τ>k}].

Inductively, E[‖X_k − x*‖² 1_{τ>k}] ≤ ρ^k ‖x_0 − x*‖². Therefore,

E[‖X_k − x*‖² 1_E] = E[‖X_k − x*‖² 1_{τ=∞}] ≤ E[‖X_k − x*‖² 1_{τ>k}] ≤ ρ^k ‖x_0 − x*‖²,

which completes the proof.

Proof of Remark 2. Conditioned on the event {τ = ∞}, we have P(‖X_k − x*‖² > ε² | τ = ∞) = 0. It follows that

P(‖X_k − x*‖² ≤ ε²) ≥ P(‖X_k − x*‖² ≤ ε² | τ = ∞) P(τ = ∞) ≥ 1 − δ.

Moreover, since E[‖X_k − x*‖² 1_{τ=∞}] = E[‖X_k − x*‖² | τ = ∞] P(τ = ∞) ≥ (1 − δ) E[‖X_k − x*‖² | τ = ∞], combining this with the bound E[‖X_k − x*‖² 1_{τ=∞}] ≤ ρ^k ‖x_0 − x*‖² from Theorem 3 yields

E[‖X_k − x*‖² | τ = ∞] ≤ (ρ^k / (1 − δ)) ‖x_0 − x*‖².

A.2 EXAMPLES

Example 1. (Quadratic Functions) Consider minimizing the objective function f(x) = Σ_{i=1}^N x_i² ≡ xᵀx, where x ∈ ℝ^{N×1} and f_i(x) = x_i², i ∈ [N]. The gradient of each component f_i is ∇f_i(x) = 2E_i x, where E_i ∈ ℝ^{N×N} is the all-zero matrix except for the (i, i) entry, which equals 1. Then the end-to-end run of N IGD updates with step size t can be expressed in compact matrix form as x_new = K_N(t) x_old, where

K_N(t) = Π_{i=1}^N (I − 2tE_i) = diag(1 − 2t, 1 − 2t, ..., 1 − 2t).

By basic linear algebra, ρ(K_N(t)) = |1 − 2t|. As long as t ∈ (0, 1), IGD converges from any initial point.

Example 2. (Standard Least Squares Problem) Consider the underdetermined least squares problem f(x) = (1/2)‖Ax − b‖², A ∈ ℝ^{N×n}, N < n. Similarly, the end-to-end run of N IGD updates with step size t forms the linear system x_new = K_N(t) x_old + t c(t), with

K_0(t) ≡ I,  K_j(t) = Π_{i=1}^j (I − t a_i a_iᵀ),  j = 1, 2, ..., N,  and  c(t) = Σ_{j=1}^N b_j K_{j−1}(t) a_j.

The sufficient condition for convergence is ρ(K_N(t)) < 1, and the necessary condition is ρ(K_N(t)) ≤ 1. In this situation, we can assert that ρ(K_N(t)) ≥ 1: we can pick x_0 ∈ ℝ^n \ {0} with Ax_0 = 0, which implies K_N(t) x_0 = x_0. This means that 1 is an eigenvalue of K_N(t), i.e., ρ(K_N(t)) ≥ 1. This example also suggests that regularization helps with convergence in optimization.

Example 3. (Regularized Least Squares Problem) Consider the regularized least squares objective function f(x) = (1/2)‖Ax − b‖² + (μ/2)‖D^{1/2}x‖². It can be shown that the end-to-end run of N IGD iterations with step size t can be expressed in compact matrix form as x_new = K_N(t) x_old + t c(t), with

K_0(t) ≡ I,  K_j(t) = Π_{i=1}^j (I − t a_i a_iᵀ − (tμ/N) D),  j = 1, ..., N,  c(t) = Σ_{j=1}^N b_j K_{j−1}(t) a_j.

The necessary and sufficient conditions for the convergence of IGD reduce to characterizing when ρ(t) ≡ ρ(K_N(t)) < 1.
We make use of the well-known matrix AM-GM conjecture: for any positive semidefinite matrices P_1, P_2, ..., P_n ∈ ℝ^{n×n},

(1/n!) Σ_{σ=(σ_1,...,σ_n)∈Γ} P_{σ_1} P_{σ_2} ··· P_{σ_n} ⪯ ( (1/n) Σ_{i=1}^n P_i )^n,

where Γ denotes the set of permutations of (1, ..., n). Under the simplifying assumption that the factors within K_N(t) commute, applying this conjecture gives

K_N(t) ⪯ ( (1/N) Σ_{i=1}^N (I − t a_i a_iᵀ − (tμ/N) D) )^N = ( I − (t/N)[μD + AᵀA] )^N.

Hence, a sufficient condition for convergence is

ρ( (I − (t/N)[μD + AᵀA])^N ) < 1.

Equivalently, we need to pick t such that, for every eigenvalue λ of the matrix μD + AᵀA,

|1 − tλ/N|^N < 1 ⟺ t ∈ (0, 2N/ρ(μD + AᵀA)).

Example 4. (General Nonlinear Least Squares Problem) Consider the general nonlinear least squares problem f(x) = (1/2) Σ_{i=1}^N r_i²(x) = (1/2) R(x)ᵀR(x). Assuming that the r_i(x) are twice continuously differentiable for all i ∈ [N], the end-to-end run of N IGD iterations with step size t can be approximated as

x_new = K_N(t) x_old + t c(t) + o(‖x_0 − x*‖²),
K_0(t) ≡ I,  K_j(t) = Π_{i=1}^j (I − t H_i(x*)),  j = 1, ..., N,  c(t) = Σ_{j=1}^N K_{j−1}(t) H_j(x*) x*.

It is reasonable to assume that H_i(x*) is full rank for all i, since adding small regularization terms resolves any rank deficiency. Applying the conjecture again, we obtain

K_N(t) ⪯ ( I − (t/N) Σ_i H_i(x*) )^N.

Therefore, a sufficient condition for the convergence of IGD is

t ∈ (0, 1/λ_max(E[H_i(x*)])),  where E[H_i(x*)] ≡ (1/N) Σ_i H_i(x*).
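Example 1 can be verified directly in a few lines: one full IGD sweep multiplies out to (1 − 2t)I, so its spectral radius is |1 − 2t| (the values of N and t below are chosen arbitrarily for illustration):

```python
# Verify Example 1: for f(x) = Σ_i x_i^2, one sweep of IGD with step t is
# K_N(t) = Π_i (I − 2tE_i) = (1 − 2t) I, so ρ(K_N(t)) = |1 − 2t|.
import numpy as np

N, t = 4, 0.75
K = np.eye(N)
for i in range(N):
    E_i = np.zeros((N, N))
    E_i[i, i] = 1.0                     # gradient of f_i(x) = x_i^2 is 2 E_i x
    K = (np.eye(N) - 2 * t * E_i) @ K
print(np.abs(np.linalg.eigvals(K)).max())   # |1 − 2t| = 0.5
```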

A.3.1 IMAGE CLASSIFICATION

We normalize the data and then augment them by horizontal flips and random crops from the image padded by 4 pixels, filling missing pixels with reflections of the original image. We adopt Kaiming initialization (He et al., 2015) and batch normalization (Ioffe & Szegedy, 2015), but no dropout. The models are trained for up to 150 epochs with a minibatch size of 256. We select the best learning rate from Table 1 in Appendix A.3.1 for SGD(M); for SGDL, the learning rate is more than 10 times larger than the best SGDM learning rate, and in this paper we select it from {1.0, 1.1, 1.2, 1.3}. More specific parameter settings are as follows. Additionally, for all optimizers except AdaDelta, we use an annealing strategy that divides the learning rate by 10 at epochs 50 and 100. The figures show results for ResNet110, VGG and DenseNet on CIFAR10 and CIFAR100, and each figure gives the specific learning rate used for SGDL. On CIFAR10, the SGDL learning rate is 1.0 for ResNet56, 1.1 for DenseNet and 1.2 for ResNet110 and VGG16, while on CIFAR100, the SGDL learning rate is 1.0 for VGG16 and 1.3 for ResNet56, ResNet110 and DenseNet. For SGDM, the learning rate is 0.1 for most experiments, and 0.05 for VGG16 on CIFAR10. Finally, 0.001 is the most suitable learning rate for Adam. On ImageNet32, we use 1.1, 0.1 and 0.001 as the learning rates for SGDL, SGDM and Adam, respectively.
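The annealing schedule described above can be sketched as a small helper (the function name and defaults are ours; the milestones and factor are those stated in the text):

```python
# Step schedule used for SGDL in the text: a large initial learning rate,
# divided by 10 at epochs 50 and 100 (helper name and signature are ours).
def sgdl_lr(epoch, base_lr=1.2, milestones=(50, 100), factor=0.1):
    lr = base_lr
    for m in milestones:
        if epoch >= m:       # each passed milestone shrinks the rate by `factor`
            lr *= factor
    return lr

print(sgdl_lr(0), sgdl_lr(60), sgdl_lr(120))   # 1.2, then 0.12, then 0.012
```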



Figure 3: Different learning rates for vanilla SGD with the ResNet56 model on CIFAR10 and CIFAR100. Models are trained with learning rates from 0.2 to 1.

Figure 4: The performance of several popular models on CIFAR10 and CIFAR100. Figures from left to right are for ResNet56, ResNet110, VGG16 and DenseNet, respectively.

Figure 5: Left: Train error for ResNet56 on ImageNet32; Middle: Top-1 error; Right: Top-5 error.

Figure 6: LeNet5 on Speech Commands Dataset.

Perplexity in 3-layer LSTM.

Figure 7: LSTM on Penn Treebank Dataset. For Fig. (a) and Fig. (b), the left shows Train Perplexity and the right shows Test Perplexity.

Figure 2: Left: the range of the spectral radius over which IGD converges; there is a wide range of t for which IGD converges. Middle: the log error between the computed and true solutions. The error curve (blue) grows slowly after an initial rapid rise, which implies that, in the presence of noise, large t-values can still compute x close to x*. Right: the solution computed with a large step size (3.0); the computed values x(t) do not deviate significantly from the noisy data x even at large t-values.

The models are trained for up to 150 epochs with a minibatch size of 256. Additionally, for all optimizers except AdaDelta, we use an annealing strategy that lowers the learning rate by a factor of 10 at epochs 50 and 100. For the ResNet experiments, we select networks with 56 and 110 layers, respectively. For VGG, we use VGG16. For DenseNet, we use a DenseNet with 100 layers and growth rate k = 12. We first run a ResNet56 model on CIFAR10 and CIFAR100 with different learning rates. From the results shown in Fig. 3, we can see that a small learning rate such as 0.2 or 0.4 yields relatively poor performance on the test set.

Motivated by our numerical experiments on IGD, we recommend vanilla SGD with a large learning rate (SGDL) for training neural networks. Extensive evaluations have been carried out on deep learning tasks (vision, speech, NLP) with popular neural network architectures. The results demonstrate the smooth convergence and effective generalization performance of SGDL. In future work, we will analyze how additional training strategies such as batch normalization affect convergence, and further examine the pros and cons of SGD and its variants (e.g., SGDM).

Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148-4158, 2017.

A.3.3 LSTM LANGUAGE MODEL

For the LSTMs, there are 1150 units in the hidden layer, an embedding size of 400 and a batch size of 20. We select the best learning rate (1) from {0.001, 0.01, 0.1, 1} for SGD(M), and the learning rates for Adam and AdaDelta are the same as in the sections above. For SGDL, following the authors' advice (Merity et al., 2017a), we use 30 as the initial learning rate for the {2, 3}-layer LSTMs, training for 150 epochs with the same annealing mechanism as above. To train all models, we apply gradient clipping with a maximum norm of 0.25. More specifically, the dropout values are (0.4, 0.3, 0.4, 0.1, 0.5) for the word vectors, the output between LSTM layers, the output of the final LSTM layer, embedding dropout and DropConnect, respectively.
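The gradient clipping mentioned above (maximum norm 0.25) can be sketched in plain numpy; this is our own illustrative re-implementation of the usual global-norm clipping, not the authors' code:

```python
# Global-norm gradient clipping with max norm 0.25, as used for the LSTMs
# above (illustrative numpy sketch; in practice this is done by the framework).
import numpy as np

def clip_grad_norm(grads, max_norm=0.25, eps=1e-6):
    """Scale the list of gradient arrays so their joint L2 norm is <= max_norm."""
    total = float(np.sqrt(sum(np.sum(g**2) for g in grads)))
    scale = max_norm / (total + eps)
    if scale < 1.0:                       # only shrink, never amplify
        grads = [g * scale for g in grads]
    return grads, total

clipped, norm = clip_grad_norm([np.ones(4), np.ones(4)])   # joint norm is √8 ≈ 2.83
print(norm)
```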

