SCHEDULED RESTART MOMENTUM FOR ACCELERATED STOCHASTIC GRADIENT DESCENT

Abstract

Stochastic gradient descent (SGD) with constant momentum and its variants, such as Adam, are the optimization methods of choice for training deep neural networks (DNNs). There is great interest in speeding up the convergence of these methods due to their high computational expense. Nesterov accelerated gradient (NAG) with a time-varying momentum, denoted as NAG below, improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose scheduled restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD with the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance, in training ResNet-200 for ImageNet classification, SRSGD achieves an error rate of 20.93% vs. the benchmark of 22.13%. These improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.

1. INTRODUCTION

Training many machine learning (ML) models reduces to solving the following finite-sum optimization problem:

min_w f(w) := min_w (1/N) Σ_{i=1}^N f_i(w) := min_w (1/N) Σ_{i=1}^N L(g(x_i, w), y_i), w ∈ R^d,  (1)

where {x_i, y_i}_{i=1}^N are the training samples and L is the loss function, e.g., the cross-entropy loss for a classification task, which measures the discrepancy between the ground-truth label y_i and the prediction of the model g(·, w), parametrized by w. Problem (1) is known as empirical risk minimization (ERM). In many applications, f(w) is non-convex, and g(·, w) is chosen among deep neural networks (DNNs) due to their preeminent performance across various tasks. These deep models are heavily overparametrized and require large amounts of training data, so both N and the dimension of w can scale up to millions or even billions. These complications pose serious computational challenges. One of the simplest algorithms for solving (1) is gradient descent (GD), which updates w according to

w_{k+1} = w_k − s_k (1/N) Σ_{i=1}^N ∇f_i(w_k),  (2)

where s_k > 0 is the step size at the k-th iteration. Computing ∇f(w_k) on the entire training set is memory intensive and often prohibitive for devices with limited random access memory (RAM), such as the graphics processing units (GPUs) used for deep learning (DL). In practice, we sample a subset of the training set of size m, with m ≪ N, and approximate ∇f(w_k) by the mini-batch gradient (1/m) Σ_{j=1}^m ∇f_{i_j}(w_k), resulting in (mini-batch) stochastic gradient descent (SGD). SGD and its accelerated variants are among the most used optimization algorithms in ML. These gradient-based algorithms have low computational complexity and are easy to parallelize, making them suitable for large-scale and high-dimensional problems (Zinkevich et al., 2010; Zhang et al., 2015). Nevertheless, GD and SGD converge slowly, especially when the problem is ill-conditioned.
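The mini-batch gradient step above can be sketched in a few lines of NumPy. This is an illustrative toy (the quadratic per-sample loss, batch size, and step size are our choices, not from the paper):

```python
import numpy as np

def sgd_step(w, per_sample_grad, batch_idx, step_size):
    """One mini-batch SGD step: average per-sample gradients over the batch."""
    g = np.mean([per_sample_grad(i, w) for i in batch_idx], axis=0)
    return w - step_size * g

# Toy example: f_i(w) = 0.5 * ||w - x_i||^2, so grad f_i(w) = w - x_i,
# and the ERM minimizer is the sample mean of the x_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # N = 1000 samples, d = 5
grad_fi = lambda i, w: w - X[i]

w = np.zeros(5)
for k in range(500):
    batch = rng.integers(0, len(X), size=32)  # batch size m = 32 << N
    w = sgd_step(w, grad_fi, batch, step_size=0.1)

print(np.linalg.norm(w - X.mean(axis=0)))     # small: w is near the minimizer
```

The only difference from full-batch GD (2) is that the average runs over a random batch of size m instead of all N samples.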
There are two common techniques for accelerating GD and SGD: adaptive step sizes (Duchi et al., 2011; Hinton et al.; Zeiler, 2012) and momentum (Polyak, 1964). The integration of both adaptive step sizes and momentum with SGD leads to Adam (Kingma & Ba, 2014), one of the most used optimizers for training DNNs, and many recent developments have improved Adam further (Reddi et al., 2019; Dozat, 2016; Loshchilov & Hutter, 2018; Liu et al., 2020). GD with constant momentum leverages the previous step to accelerate GD according to

v_{k+1} = w_k − s_k ∇f(w_k); w_{k+1} = v_{k+1} + μ(v_{k+1} − v_k),  (3)

where μ > 0 is a constant. A similar acceleration can be achieved by the heavy-ball (HB) method (Polyak, 1964). The momentum updates in both (3) and HB have the same convergence rate of O(1/k) as GD for convex smooth optimization. A breakthrough due to Nesterov (1983; 2018) replaces μ with (k − 1)/(k + 2), which is known as the Nesterov accelerated gradient (NAG) with time-varying momentum; for simplicity, we denote this method as NAG below. NAG accelerates the convergence rate to O(1/k²), which is optimal for convex and smooth loss functions (Nesterov, 1983; 2018). NAG can also speed up the escape from saddle points (Jin et al., 2017). In practice, NAG momentum can accelerate GD for nonconvex optimization, especially when the underlying problem is poorly conditioned (Goh, 2017). However, NAG accumulates error and causes instability when the gradient is inexact (Devolder et al., 2014; Assran & Rabbat, 2020). In many DL applications, such as training DNNs for image classification, constant momentum achieves state-of-the-art results. Since NAG momentum achieves a much better convergence rate than constant momentum with the exact gradient for general convex optimization, we consider the following question: Can we leverage NAG with a time-varying momentum parameter to accelerate SGD in training DNNs and improve the test accuracy of the trained models?

Contributions.
We answer the above question by proposing the first algorithm that integrates scheduled restart NAG momentum with plain SGD. Here, we restart the momentum, which is orthogonal to the learning rate restart of Loshchilov & Hutter (2016). We name the resulting algorithm scheduled restart SGD (SRSGD). Theoretically, we prove the error accumulation of Nesterov accelerated SGD (NASGD) and the convergence of SRSGD. The major practical benefits of SRSGD are fourfold:

• SRSGD remarkably speeds up DNN training. For image classification, SRSGD significantly reduces the number of training epochs while preserving or even improving the network's accuracy. In particular, on CIFAR10/100 the number of training epochs is reduced by half with SRSGD, while on ImageNet the reduction in training epochs is also remarkable.

• DNNs trained by SRSGD generalize significantly better than those trained with the current benchmark optimizers. The improvement becomes more significant as the network grows deeper, as shown in Fig. 1.

• SRSGD reduces overfitting in training very deep networks such as ResNet-200 for ImageNet classification, enabling the accuracy to keep increasing with depth.

• SRSGD is straightforward to implement and only requires changing a few lines of the SGD code. There is no additional computational or memory overhead.

We focus on image classification with DNNs, for which SGD with constant momentum is the method of choice.

Related Work. Momentum has long been used to accelerate SGD. SGD with scheduled momentum and a good initialization can handle the curvature issues in training DNNs and enable the trained models to generalize well (Sutskever et al., 2013). Kingma & Ba (2014) and Dozat (2016) integrated momentum with adaptive step sizes to accelerate SGD. In this work, we study the time-varying momentum version of NAG with restart for stochastic optimization.
Adaptive and scheduled restart have been used to accelerate NAG with the exact gradient (Nemirovskii & Nesterov, 1985; Nesterov, 2013; Iouditski & Nesterov, 2014; Lin & Xiao, 2014; Renegar, 2014; Freund & Lu, 2018; Roulet et al., 2015; O'donoghue & Candes, 2015; Giselsson & Boyd, 2014; Su et al., 2014). These studies of restarting NAG momentum are for convex optimization with exact gradients. Restart techniques have also been used for stochastic optimization (Kulunchakov & Mairal, 2019). Other works have studied the non-acceleration issues of SGD with HB and NAG momentum (Kidambi et al., 2018; Liu & Belkin, 2020), as well as the acceleration of first-order algorithms with noise-corrupted gradients (Cohen et al., 2018; Aybat et al., 2018; Lan, 2012). Ghadimi & Lan (2013; 2016) provide an analysis of general stochastic gradient-based optimization algorithms.

Organization.

In Section 2, we review and discuss momentum for accelerating GD in convex smooth optimization. In Section 3, we present the SRSGD algorithm and its theoretical guarantees. In Section 4, we verify the efficacy of the proposed SRSGD in training DNNs for image classification on CIFAR and ImageNet; in Section 4.3, we perform an empirical analysis of SRSGD. We end with some concluding remarks. Technical proofs, some experimental details, and further results on training LSTMs (Hochreiter & Schmidhuber, 1997) and WGANs (Arjovsky et al., 2017; Gulrajani et al., 2017) are provided in the Appendix.

Notation. We denote scalars and vectors by lower-case and lower-case boldface letters, respectively, and matrices by upper-case boldface letters. For a vector x = (x_1, ..., x_d) ∈ R^d, we denote its ℓ_p norm (p ≥ 1) by ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^{1/p}. For a matrix A, we use ‖A‖_p to denote the norm induced by the vector ℓ_p norm. Given two sequences {a_n} and {b_n}, we write a_n = O(b_n) if there exists a positive constant C such that a_n ≤ C b_n. We denote the integer interval from a (excluded) to b (included) as (a, b]. For a function f(w) : R^d → R, we denote its gradient by ∇f(w) and its Hessian by ∇²f(w).

2. REVIEW: MOMENTUM IN GRADIENT DESCENT

GD. GD (2) is a popular approach to solving (1), which dates back to Cauchy (1847). If f(w) is convex and L-smooth (i.e., ‖∇²f(w)‖₂ ≤ L), then GD converges with rate O(1/k) when s_k ≡ 1/L (we use this s_k in all the discussion below), independently of the dimension of w.

HB. HB (Polyak, 1964) accelerates GD by using historical information:

w_{k+1} = w_k − s_k ∇f(w_k) + μ(w_k − w_{k−1}), μ > 0.  (4)

We can also accelerate GD by using the Nesterov/lookahead momentum, which leads to (3). Both (3) and (4) have a convergence rate of O(1/k) for convex smooth optimization. Recently, several variants of (3) have been proposed for DL, e.g., (Sutskever et al., 2013) and (Bengio et al., 2013).

NAG. NAG (Nesterov, 1983; 2018; Beck & Teboulle, 2009) replaces μ with (t_k − 1)/t_{k+1}, where t_{k+1} = (1 + √(1 + 4t_k²))/2 with t_0 = 1. NAG iterates as follows:

v_{k+1} = w_k − s_k ∇f(w_k); w_{k+1} = v_{k+1} + ((t_k − 1)/t_{k+1})(v_{k+1} − v_k).  (5)

NAG achieves a convergence rate of O(1/k²) with the step size s_k = 1/L.

Remark 1. Su et al. (2014) showed that (k − 1)/(k + 2) is the asymptotic limit of (t_k − 1)/t_{k+1}. In the following presentation of NAG with restart, for ease of notation, we replace the momentum coefficient (t_k − 1)/t_{k+1} with (k − 1)/(k + 2).

Adaptive Restart NAG (ARNAG). The sequence {f(w_k) − f(w*)}, where w* is the minimum of f(w), generated by GD or by GD with constant momentum (GD + Momentum, which follows (3)) converges monotonically to zero. However, the sequence generated by NAG oscillates, as illustrated in Fig. 2 (a) for a quadratic f(w). O'donoghue & Candes (2015) proposed ARNAG, which restarts the time-varying momentum of NAG according to the change of function values, to alleviate this oscillatory phenomenon. ARNAG iterates as follows:

v_{k+1} = w_k − s_k ∇f(w_k); w_{k+1} = v_{k+1} + ((m(k) − 1)/(m(k) + 2))(v_{k+1} − v_k),  (6)

where m(1) = 1, m(k + 1) = m(k) + 1 if f(w_{k+1}) ≤ f(w_k), and m(k + 1) = 1 otherwise.
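The recursion for t_k and its momentum coefficient can be checked directly. The snippet below (illustrative, not from the paper) generates the NAG coefficients (t_k − 1)/t_{k+1} and shows that they approach the asymptotic form (k − 1)/(k + 2) noted in Remark 1:

```python
import math

def nag_momentum(num_iters):
    """Generate the NAG momentum coefficients (t_k - 1) / t_{k+1},
    with t_0 = 1 and t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t = 1.0
    coeffs = []
    for _ in range(num_iters):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        coeffs.append((t - 1.0) / t_next)
        t = t_next
    return coeffs

mu = nag_momentum(1000)
print(mu[0], mu[-1])           # starts at 0, approaches 1 from below
print((999 - 1) / (999 + 2))   # asymptotic value (k - 1)/(k + 2) at k = 999
```

The fact that the coefficient tends to 1 is exactly what Section 3 exploits to explain error accumulation when the gradient is inexact.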
Scheduled Restart NAG (SRNAG). Scheduled restart (SR) is another strategy to restart the time-varying momentum of NAG. We first divide the total iterations (0, T] (integers only) into a few intervals {I_i}_{i=1}^m with I_i = (T_{i−1}, T_i], such that (0, T] = ∪_{i=1}^m I_i. In each I_i, we restart the momentum after every F_i iterations. The update rule is then given by

v_{k+1} = w_k − s_k ∇f(w_k); w_{k+1} = v_{k+1} + ((k mod F_i)/((k mod F_i) + 3))(v_{k+1} − v_k).  (7)

Both AR and SR accelerate NAG to linear convergence for convex problems satisfying the Polyak-Łojasiewicz (PL) condition (Roulet & d'Aspremont, 2017).

Case Study: Quadratic Function. Consider the following quadratic optimization (Hardt, 2014):

min_x f(x) = (1/2) xᵀLx − xᵀb,  (8)

where L ∈ R^{d×d} is the Laplacian of a cycle graph, and b is a d-dimensional vector whose first entry is 1 and all other entries are 0. Note that f(x) is convex with Lipschitz constant 4. In particular, we set d = 1K (1K := 10³) and run T = 50K iterations with step size 1/4. In SRNAG, we restart, i.e., set the momentum to 0, after every 1K iterations. Fig. 2 (a) shows that GD + Momentum (3) converges faster than GD, while NAG speeds up GD + Momentum dramatically and converges to the minimum in an oscillatory fashion. Both AR and SR accelerate NAG significantly.

[Fig. 2: f(x_k) − f(x*) vs. iteration, panels (a)-(c), for GD, GD + Momentum, NAG, ARNAG, and SRNAG.]
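The quadratic case study can be reproduced at small scale. The sketch below (dimension and iteration counts shrunk for illustration; not the paper's exact script) builds the cycle-graph Laplacian, runs the SRNAG update (7) with step size 1/4, and resets the momentum counter every F iterations:

```python
import numpy as np

def cycle_laplacian(d):
    """Laplacian of a cycle graph on d nodes; its spectral norm is at most 4."""
    L = 2.0 * np.eye(d)
    for i in range(d):
        L[i, (i + 1) % d] -= 1.0
        L[i, (i - 1) % d] -= 1.0
    return L

def srnag(L, b, num_iters, F, step=0.25):
    """Scheduled restart NAG on f(x) = 0.5 x^T L x - b^T x (update (7))."""
    d = L.shape[0]
    v = w = np.zeros(d)
    for k in range(num_iters):
        grad = L @ w - b
        v_next = w - step * grad
        mu = (k % F) / ((k % F) + 3.0)  # momentum resets to 0 every F iterations
        w = v_next + mu * (v_next - v)
        v = v_next
    return w

d = 50
L = cycle_laplacian(d)
b = np.zeros(d); b[0] = 1.0
# The cycle Laplacian is singular (constant vectors lie in its null space),
# so we track objective values rather than solving L x = b directly.
f = lambda x: 0.5 * x @ L @ x - b @ x
x_nag = srnag(L, b, num_iters=2000, F=2000)  # no restart within the horizon
x_sr = srnag(L, b, num_iters=2000, F=100)    # scheduled restart every 100 steps
print(f(x_nag), f(x_sr))
```

With the exact gradient both runs make progress; the contrast with inexact gradients is what Section 3 studies.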

3. ALGORITHM PROPOSED: SCHEDULED RESTART SGD (SRSGD)

Computing the gradient for ERM (1) can be computationally costly and memory intensive, especially when the training set is large. In many applications, such as training DNNs, SGD is used instead. In this section, we first prove that the error of SGD with NAG momentum cannot be bounded by a convergent sequence; we then formulate our new SRSGD as a solution that accelerates the convergence of SGD using the NAG momentum.

3.1. UNCONTROLLED BOUND OF NESTEROV ACCELERATED SGD (NASGD)

Replacing ∇f(w_k) := (1/N) Σ_{i=1}^N ∇f_i(w_k) in (5) with the mini-batch gradient (1/m) Σ_{j=1}^m ∇f_{i_j}(w_k) leads to an uncontrolled error bound. Theorem 1 formulates this observation for NASGD.

Theorem 1 (Uncontrolled Bound of NASGD). Let f(w) be a convex and L-smooth function with ‖∇f(w)‖₂ ≤ R, where R > 0 is a constant. The sequence {w_k}_{k≥0} generated by (5), with a stochastic gradient of bounded variance (Bubeck, 2014; Bottou et al., 2018) and any constant step size s_k ≡ s ≤ 1/L, satisfies

E[f(w_k) − f(w*)] = O(k),

where w* is the minimum of f, and the expectation is taken over the generation of the stochastic gradient.

One way to prove Theorem 1 is by leveraging the established results in Lan (2012). We provide a new proof of Theorem 1 in Appendix A. The proof shows that the error bound is uncontrolled because the time-varying momentum approaches 1 as the iteration count increases. To remedy this, we can restart the momentum in order to guarantee that the time-varying momentum stays below a number strictly less than 1. Devolder et al. (2014) proved a similar error bound for the δ-inexact gradient, and we provide a brief review of NAG with a δ-inexact gradient in Appendix B. As far as we know, no lower bound on E[f(w_k) − f(w*)] is available even for the δ-inexact gradient, and we leave the lower bound estimation as an open problem. We consider three different inexact gradients: gradients corrupted by Gaussian noise with constant variance and with decaying variance for the quadratic optimization (8), and mini-batch gradients in training a logistic regression model for MNIST classification (LeCun & Cortes, 2010). The detailed settings and discussion are provided in Appendix B. We denote SGD with NAG momentum as NASGD, and NASGD with AR and SR as ARSGD and SRSGD, respectively. The results shown in Fig. 2 (b) and (c) (iteration vs. optimality gap for the quadratic optimization (8)) and Fig. 3 (a) (iteration vs. loss for training logistic regression) confirm Theorem 1. For these cases, SR improves the performance of NAG with inexact gradients. Moreover, when an inexact gradient is used, ARNAG/ARSGD performs almost the same as GD/SGD asymptotically, because ARNAG/ARSGD restarts too often and almost degenerates to GD/SGD.
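A minimal illustration of Theorem 1's message (our toy, not the paper's exact experiment): on a 1-D quadratic with Gaussian noise added to the gradient, the NAG-style momentum that drifts toward 1 lets noise accumulate in the iterates, while resetting the momentum counter every F iterations keeps them near the minimum:

```python
import numpy as np

def noisy_momentum_run(num_iters, F, noise_std=0.1, step=0.1, seed=0):
    """Minimize f(w) = 0.5 w^2 with noise-corrupted gradients and momentum
    (k mod F)/((k mod F) + 3). Setting F = num_iters means no restart occurs.
    Returns the worst |w| over the second half of the run."""
    rng = np.random.default_rng(seed)
    v = w = 5.0
    worst = 0.0
    for k in range(num_iters):
        grad = w + rng.normal(scale=noise_std)  # inexact gradient of 0.5 w^2
        v_next = w - step * grad
        mu = (k % F) / ((k % F) + 3.0)
        w = v_next + mu * (v_next - v)
        v = v_next
        if k > num_iters // 2:
            worst = max(worst, abs(w))
    return worst

no_restart = noisy_momentum_run(5000, F=5000)  # momentum drifts toward 1
restart = noisy_momentum_run(5000, F=50)       # scheduled restart
print(no_restart, restart)                     # restarted run stays much closer to 0
```

The restarted run caps the momentum at 47/50, so the noise amplification stays bounded, mirroring the remedy described above.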

3.2. SRSGD AND ITS CONVERGENCE

For ERM (1), SRSGD replaces ∇f(w) in (7) with a mini-batch stochastic gradient of batch size m, giving

v_{k+1} = w_k − s_k (1/m) Σ_{j=1}^m ∇f_{i_j}(w_k); w_{k+1} = v_{k+1} + ((k mod F_i)/((k mod F_i) + 3))(v_{k+1} − v_k),  (10)

where F_i is the restart frequency used in the interval I_i. We implemented SRSGD in both PyTorch (Paszke et al., 2019) and Keras (Chollet et al., 2015) by changing just a few lines of code on top of the existing implementation of the SGD optimizer; we provide snippets of the SRSGD code in Appendices J (PyTorch) and K (Keras). We formulate the convergence of SRSGD for general convex and nonconvex problems in Theorem 2 and provide its proof in Appendix C.

Theorem 2 (Convergence of SRSGD). Suppose f(w) is L-smooth. Consider the sequence {w_k}_{k≥0} generated by (10) with a stochastic gradient that is bounded and has bounded variance, with any restart frequency F and any constant step size s_k ≡ s ≤ 1/L. Assume that Σ_{k∈A} (E f(w_{k+1}) − E f(w_k)) = R < +∞ with R being a constant and the set A := {k ∈ Z_+ | E f(w_{k+1}) ≥ E f(w_k)}. Then we have

min_{1≤k≤K} E‖∇f(w_k)‖₂² = O(s + 1/(sK)).

If f(w) is furthermore convex and Σ_{k∈B} (E f(w_{k+1}) − E f(w_k)) = R < +∞ with R being a constant and the set B := {k ∈ Z_+ | E‖w_{k+1} − w*‖₂ ≥ E‖w_k − w*‖₂}, then

min_{1≤k≤K} E[f(w_k) − f(w*)] = O(s + 1/(sK)),

where w* is the minimum of f. To obtain an ε error for any ε > 0, we set s = O(ε) and K = O(1/ε²).

Theorem 2 relies on the assumption that Σ_{k∈A or B} (E f(w_{k+1}) − E f(w_k)) is bounded, for which we provide an empirical verification in Appendix C.1. We leave it open how to establish the convergence of SRSGD without this assumption.
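Update (10) needs only the previous v, an iteration counter, and the current restart frequency. A minimal NumPy sketch (our illustration, not the paper's Appendix J implementation) is:

```python
import numpy as np

class SRSGD:
    """Scheduled restart SGD: NAG-style momentum (k mod F)/((k mod F) + 3),
    reset to zero every F iterations. F may be changed between stages."""

    def __init__(self, w, lr, F):
        self.w = np.asarray(w, dtype=float)
        self.v = self.w.copy()
        self.lr, self.F, self.k = lr, F, 0

    def step(self, grad):
        v_next = self.w - self.lr * grad           # gradient step
        mu = (self.k % self.F) / ((self.k % self.F) + 3.0)
        self.w = v_next + mu * (v_next - self.v)   # restarted NAG extrapolation
        self.v = v_next
        self.k += 1
        return self.w

# Usage: minimize f(w) = 0.5 ||w||^2 with a noiseless "mini-batch" gradient.
opt = SRSGD(w=np.full(3, 10.0), lr=0.1, F=30)
for _ in range(300):
    opt.step(grad=opt.w)                           # grad of 0.5 ||w||^2 is w
print(np.linalg.norm(opt.w))                       # near zero after 300 steps
```

In practice `grad` would be the mini-batch gradient of the training loss; everything else is unchanged relative to plain SGD, which is why the implementation only touches a few lines.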

4. EXPERIMENTAL RESULTS

We evaluate SRSGD on a variety of benchmarks for image classification, including CIFAR10, CIFAR100, and ImageNet. In all experiments, we show the advantage of SRSGD over the widely used and well-calibrated SGD baselines with a constant momentum of 0.9 and a learning rate decreased at certain epochs, which we denote as SGD. We also compare SRSGD with the well-calibrated SGD in which we switch to the Nesterov momentum of 0.9, and we denote this optimizer as SGD + NM. We fine-tune the SGD and SGD + NM baselines to obtain the best validation performance, and we then adopt the same set of parameters for training with SRSGD. In the SRSGD experiments, we tune the restart frequencies on small DNNs for each task based on validation performance and apply the calibrated restart frequencies to large DNNs for the same task. Note that ARSGD is impractical for training on large-scale datasets, since it requires computing the loss over the whole training set at each iteration, which is computationally inefficient. Alternatively, ARSGD can estimate the loss and restart using mini-batches, but it then restarts too often and degenerates to SGD without momentum, as mentioned in Section 3. Thus, we do not compare with ARSGD in our CIFAR and ImageNet experiments. Details about hyper-parameter calibration can be found in Appendix D.4, and we provide a detailed description of the datasets and experimental settings in Appendix D. Additional experimental results on training LSTMs (Hochreiter & Schmidhuber, 1997) and WGANs (Arjovsky et al., 2017; Gulrajani et al., 2017) with SRSGD, as well as the comparison between SRSGD and SGD + NM on the ImageNet classification task, are provided in Appendix E. We also note that in all of the following experiments, the training loss blows up if we apply NASGD without restart. This further confirms the stabilizing effect of scheduled restart in training DNNs.

4.1. CIFAR10 AND CIFAR100

We summarize our results for CIFAR in Tables 1 and 2. We explore two restarting frequency schedules for SRSGD: a linear and an exponential schedule. Both schedules are governed by two parameters: the initial restarting frequency F_1 and the growth rate r. In both scheduling schemes, the restarting frequency at the 1st learning rate stage is set to F_1, and the restarting frequency at the (k + 1)-th learning rate stage is determined by

F_{k+1} = F_1 × r^k (exponential schedule), or F_{k+1} = F_1 × (1 + (r − 1) × k) (linear schedule).

We search for F_1 and r using the method outlined in Appendix D.4. For CIFAR10, (F_1 = 40, r = 1.25) and (F_1 = 30, r = 2) are good initial restarting frequencies and growth rates for the exponential and linear schedules, respectively. For CIFAR100, those values are (F_1 = 45, r = 1.5) for the exponential schedule and (F_1 = 50, r = 2) for the linear schedule.

Improvement in Accuracy Increases with Depth. We observe that the linear restart schedule yields lower test error on CIFAR than the exponential schedule for most models, except for Pre-ResNet-470 and Pre-ResNet-1001 on CIFAR100 (see Tables 1 and 2). SRSGD with either the linear or the exponential restart schedule outperforms SGD. Furthermore, the advantage of SRSGD over SGD is more significant for deeper networks. This observation holds strictly when using the linear schedule (see Fig. 1) and is generally true when using the exponential schedule, with only a few exceptions.

Faster Convergence Reduces the Training Time by Half. SRSGD also converges faster than SGD. This result is consistent with our MNIST case study in Section 3 and is indeed expected, since SRSGD avoids the error accumulation caused by an inexact oracle. For CIFAR, Fig. 3 (b) shows that SRSGD yields a smaller training loss than SGD during training. Interestingly, SRSGD converges quickly to good loss values in the 2nd and 3rd stages.
This suggests that the model can be trained with SRSGD in many fewer epochs compared to SGD while achieving a similar error rate. Results in Table 3 confirm this hypothesis. We train Pre-ResNet models with SRSGD in only 100 epochs, decreasing the learning rate by a factor of 10 at the 80th, 90th, and 95th epoch while using the same linear schedule for the restarting frequency as before, with (F_1 = 30, r = 2) for CIFAR10 and (F_1 = 50, r = 2) for CIFAR100.

Comparison with Adam and RMSProp. SRSGD outperforms not only SGD with momentum but also other popular optimizers, including Adam and RMSProp (Tieleman & Hinton, 2012), on image classification tasks. In fact, for image classification, Adam and RMSProp yield worse performance than the baseline SGD with momentum (Chen & Kyrillidis, 2019). Table 4 compares SRSGD with Adam and RMSProp on CIFAR10.

Table 5: Single-crop validation errors (%) on ImageNet of ResNets trained with the SGD baseline and SRSGD. We report the results of SRSGD with the increasing restarting frequency in the first two learning rates; in the last learning rate, the restarting frequency is linearly decreased to 1. For baseline results, we also include the reported single-crop validation errors (He et al., 2016c).
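The two restarting-frequency schedules are simple to compute; this small helper (our illustration) returns the frequency at each learning-rate stage:

```python
def restart_frequency(F1, r, stage, schedule="linear"):
    """Restarting frequency at the (stage + 1)-th learning-rate stage
    (stage = 0, 1, ...): exponential gives F1 * r^stage,
    linear gives F1 * (1 + (r - 1) * stage)."""
    if schedule == "exponential":
        return F1 * r ** stage
    if schedule == "linear":
        return F1 * (1 + (r - 1) * stage)
    raise ValueError(f"unknown schedule: {schedule}")

# CIFAR10 linear schedule from the paper: F1 = 30, r = 2 -> 30, 60, 90
print([restart_frequency(30, 2, k) for k in range(3)])
# CIFAR10 exponential schedule: F1 = 40, r = 1.25 -> 40.0, 50.0, 62.5
print([restart_frequency(40, 1.25, k, "exponential") for k in range(3)])
```

Stage boundaries coincide with the learning-rate drops, so the schedule is fully specified by (F_1, r) and the learning-rate decay epochs.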

4.2. IMAGENET

Next, we discuss our experimental results on the 1000-way ImageNet classification task (Russakovsky et al., 2015). We conduct our experiments on ResNet-50, 101, 152, and 200 with 5 different seeds, using the official PyTorch implementation for all of our ResNet models (Paszke et al., 2019). Following common practice, we train each model for 90 epochs and decrease the learning rate by a factor of 10 at the 30th and 60th epoch. We use an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0001. Additional details and comparisons between SRSGD and SGD + NM are given in Appendix E. We report single-crop validation errors of ResNet models trained with SGD and SRSGD on ImageNet in Table 5. In contrast to our CIFAR experiments, we observe that for ResNets trained on ImageNet with SRSGD, linearly decreasing the restarting frequency to 1 in the last stage (i.e., after the 60th epoch) helps improve the generalization of the models. Thus, in our experiments, we use linear scheduling with (F_1 = 40, r = 2); from epoch 60 to 90, the restarting frequency decays linearly to 1.

Advantage of SRSGD Continues to Grow with Depth. Similar to the CIFAR experiments, we observe that SRSGD outperforms the SGD baseline for all ResNet models that we study. As shown in Fig. 1, the advantage of SRSGD over SGD grows with network depth, just as in our CIFAR experiments with Pre-ResNet architectures.

Avoiding Overfitting in ResNet-200. ResNet-200 demonstrates that SRSGD is better than the SGD baseline at avoiding overfitting. The ResNet-200 trained with SGD has a top-1 error of 22.13%, higher than the ResNet-152 trained with SGD, which achieves a top-1 error of 22.03% (see Table 5). He et al. (2016b) pointed out that ResNet-200 suffers from overfitting.
The ResNet-200 trained with our SRSGD has a top-1 error of 20.93%, which is 1.2% lower than the ResNet-200 trained with SGD and also lower than the ResNet-152 trained with SRSGD and with SGD, improvements of 0.53% and 1.1%, respectively. We hypothesize that SRSGD with an appropriate restart frequency is locally non-monotonic (see Fig. 3 (b, c)), and this property allows SRSGD to escape from bad minima and reach better ones, which helps avoid overfitting in very deep networks. A theoretical analysis of the observation that SRSGD overfits less in training DNNs is under investigation.

Training ImageNet in Fewer Epochs. As in the CIFAR experiments, we note that when training on ImageNet, SRSGD converges faster than SGD at the first and last learning rates while quickly reaching a good loss value at the second learning rate (see Fig. 3 (c)). This observation suggests that ResNets can be trained with SRSGD in fewer epochs while still achieving error rates comparable to the same models trained by the SGD baseline using all 90 epochs. We summarize the results in Table 6. On ImageNet, SRSGD helps reduce the number of training epochs for very deep networks (ResNet-101, 152, 200). For smaller networks like ResNet-50, training with fewer epochs slightly decreases the accuracy.

4.3. EMPIRICAL ANALYSIS

SRSGD Helps Reduce the Training Time. We find that SRSGD training using fewer epochs yields error rates comparable to both the SGD baseline and the full 200-epoch SRSGD training on CIFAR. We conduct an ablation study to understand the impact of reducing the number of epochs on the final error rate when training with SRSGD on CIFAR10 and ImageNet. In the CIFAR10 experiments, we vary the epoch reduction from 15 to 90, while in the ImageNet experiments we vary it from 10 to 30. We summarize our results in Fig. 4 and provide detailed results in Appendix F. For CIFAR10, we can train with 30 fewer epochs while still maintaining an error rate comparable to the full SRSGD training, and a better error rate than the SGD baseline trained for the full 200 epochs. For ImageNet, SRSGD training with fewer epochs decreases the accuracy but still obtains results comparable to the 90-epoch SGD baseline.

Impact of Restarting Frequency. We examine the impact of the restarting frequency on network training in a case study: training Pre-ResNet-290 on CIFAR10 using SRSGD with the linear schedule for the restarting frequency. We fix the growth rate r = 2 and vary the initial restarting frequency F_1 from 1 to 80. As shown in Fig. 5, SRSGD with a large F_1, e.g., F_1 = 80, approximates NASGD (yellow); we also show the training loss and test accuracy of NASGD in red. As discussed in Section 3, NASGD suffers from error accumulation due to stochastic gradients and converges slowly or even diverges. SRSGD with a small F_1, e.g., F_1 = 1, restarts very often and behaves more like SGD without NAG momentum.

[Fig. 5 legend: F_1 = 1, 10, 30, 50, 80; NASGD (F_1 = +∞).]

5. CONCLUSIONS

We propose Scheduled Restart SGD (SRSGD), which makes two major changes to the widely used SGD with constant momentum. First, we replace the momentum in SGD with the iteration-dependent momentum used in the Nesterov accelerated gradient (NAG). Second, we restart the NAG momentum according to a schedule to prevent error accumulation when the stochastic gradient is used. For image classification, SRSGD can significantly improve the accuracy of the trained DNNs; compared to the SGD baseline, SRSGD also requires fewer training epochs to reach the same accuracy. There are numerous avenues for future work: 1) deriving the optimal restart schedule and the corresponding convergence rate of SRSGD, and 2) integrating the scheduled restart NAG momentum with adaptive learning rate algorithms, e.g., Adam (Kingma & Ba, 2014).

Part Appendices

The appendices are structured as follows. In Section A, we prove Theorem 1. In Section B, we review an error accumulation result for the Nesterov accelerated gradient with a δ-inexact gradient. In Section C, we prove Theorem 2. In Section D, we provide experimental details, in particular the calibration of the restarting hyperparameters. In Section E, we compare SRSGD with benchmark optimization algorithms on other tasks, including training LSTMs and Wasserstein GANs. In Section F, we provide detailed experimental settings for studying the effect of reducing the number of epochs in training deep neural networks with SRSGD, together with additional experimental results. In Sections G and H, we further study the effects of the restarting frequency and of training with fewer epochs using SRSGD. In Section I, we visualize the optimization trajectory of SRSGD and compare it with benchmark methods. Snippets of our implementation of SRSGD in PyTorch and Keras are available in Sections J and K, respectively.


A. PROOF OF THEOREM 1

Consider the problem min_w f(w), where f(w) is L-smooth and convex. Starting from w_k, the GD update with step size 1/r can be obtained from the minimization of the function

Q_r(v, w_k) := ⟨v − w_k, ∇f(w_k)⟩ + (r/2)‖v − w_k‖₂².

By direct computation, we get

Q_r(v_{k+1}, w_k) − min_v Q_r(v, w_k) = ‖g_k − ∇f(w_k)‖₂²/(2r),

where g_k := (1/m) Σ_{j=1}^m ∇f_{i_j}(w_k). We assume the variance is bounded, which gives the following: the stochastic gradient rule R_s satisfies

E[Q_r(v_{k+1}, w_k) − min_v Q_r(v, w_k) | χ_k] ≤ δ,

with δ a constant and χ_k the sigma algebra generated by w_1, w_2, ..., w_k, i.e., χ_k := σ(w_1, w_2, ..., w_k). NASGD can then be reformulated as

v_{k+1} ≈ argmin_v Q_r(v, w_k) with rule R_s; w_{k+1} = v_{k+1} + ((t_k − 1)/t_{k+1})(v_{k+1} − v_k),

where t_0 = 1 and t_{k+1} = (1 + √(1 + 4t_k²))/2.
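The claim that minimizing Q_r reproduces a GD step with step size 1/r can be checked numerically. A quick sanity check (our illustration; `g` stands in for ∇f(w_k)):

```python
import numpy as np

# The minimizer of Q_r(v, w) = <v - w, g> + (r/2) ||v - w||^2
# is v* = w - g / r, i.e., a gradient descent step with step size 1/r,
# and the minimum value is -||g||^2 / (2r).
rng = np.random.default_rng(1)
w, g, r = rng.normal(size=4), rng.normal(size=4), 2.5

Q = lambda v: (v - w) @ g + 0.5 * r * np.sum((v - w) ** 2)
v_star = w - g / r                          # claimed minimizer

# Q at the claimed minimizer is no larger than at random nearby points.
for _ in range(100):
    v = v_star + rng.normal(scale=0.1, size=4)
    assert Q(v_star) <= Q(v) + 1e-12
print(Q(v_star))                            # equals -||g||^2 / (2r)
```

This is the identity used below when the optimality condition ṽ_{k+1} = w_k − (1/r)∇f(w_k) is invoked.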

A.1 PRELIMINARIES

To proceed, we introduce several definitions and some useful properties from variational and convex analysis. More detailed background can be found in Mordukhovich (2006); Nesterov (1998); Rockafellar & Wets (2009); Rockafellar (1970). Let f be a convex function. We say that f is L-smooth (gradient Lipschitz) if f is differentiable and ‖∇f(v) − ∇f(w)‖₂ ≤ L‖v − w‖₂, and we say f is ν-strongly convex if for any w, v ∈ dom(f),

f(w) ≥ f(v) + ⟨∇f(v), w − v⟩ + (ν/2)‖w − v‖₂².

Below, we list several basic but useful lemmas; their proofs can be found in Nesterov (1998).

Lemma 1. If f is ν-strongly convex, then for any v ∈ dom(f) we have

f(v) − f(v*) ≥ (ν/2)‖v − v*‖₂²,

where v* is the minimizer of f.

Lemma 2. If f is L-smooth, then for any w, v ∈ dom(f),

f(w) ≤ f(v) + ⟨∇f(v), w − v⟩ + (L/2)‖w − v‖₂².

A.2 UNCONTROLLED BOUND OF NASGD: ANALYSIS

In this part, we denote

ṽ_{k+1} := argmin_v Q_r(v, w_k).  (17)

Lemma 3. If the constant r > 0, then

E[‖v_{k+1} − ṽ_{k+1}‖₂² | χ_k] ≤ 2δ/r.  (18)

Proof. Note that Q_r(v, w_k) is strongly convex with constant r, and ṽ_{k+1} in (17) is the minimizer of Q_r(v, w_k). With Lemma 1, we have

Q_r(v_{k+1}, w_k) − Q_r(ṽ_{k+1}, w_k) ≥ (r/2)‖v_{k+1} − ṽ_{k+1}‖₂².  (19)

Noticing that

E[Q_r(v_{k+1}, w_k) − Q_r(ṽ_{k+1}, w_k)] = E[Q_r(v_{k+1}, w_k) − min_v Q_r(v, w_k)] ≤ δ,

the inequality (18) is established by combining the above two inequalities.

Lemma 4. If the constant r satisfies r > L, then we have

E[(f(ṽ_{k+1}) + (r/2)‖ṽ_{k+1} − w_k‖₂²) − (f(v_{k+1}) + (r/2)‖v_{k+1} − w_k‖₂²)] ≥ −τδ − ((r − L)/2) E‖w_k − ṽ_{k+1}‖₂²,  (20)

where τ = L²/(r(r − L)) + 1.

Proof. The convexity of f at ṽ_{k+1} gives us

0 ≤ ⟨∇f(ṽ_{k+1}), ṽ_{k+1} − v_{k+1}⟩ + f(v_{k+1}) − f(ṽ_{k+1}).  (21)

From the definition of the stochastic gradient rule R_s, we have

−δ ≤ E[Q_r(ṽ_{k+1}, w_k) − Q_r(v_{k+1}, w_k)] = E[⟨ṽ_{k+1} − w_k, ∇f(w_k)⟩ + (r/2)‖ṽ_{k+1} − w_k‖₂²] − E[⟨v_{k+1} − w_k, ∇f(w_k)⟩ + (r/2)‖v_{k+1} − w_k‖₂²].  (22)
Combining (21) and (22), we have

−δ ≤ E[f(ṽ_{k+1}) + (r/2)‖ṽ_{k+1} − w_k‖₂²] − E[f(v_{k+1}) + (r/2)‖v_{k+1} − w_k‖₂²] + E⟨∇f(w_k) − ∇f(ṽ_{k+1}), ṽ_{k+1} − v_{k+1}⟩.  (23)

With the Schwarz inequality ⟨a, b⟩ ≤ ‖a‖₂²/(2μ) + (μ/2)‖b‖₂², applied with μ = L²/(r − L), a = ∇f(w_k) − ∇f(ṽ_{k+1}), and b = ṽ_{k+1} − v_{k+1}, and then with the L-smoothness of f, we have

⟨∇f(w_k) − ∇f(ṽ_{k+1}), ṽ_{k+1} − v_{k+1}⟩ ≤ ((r − L)/(2L²))‖∇f(w_k) − ∇f(ṽ_{k+1})‖₂² + (L²/(2(r − L)))‖v_{k+1} − ṽ_{k+1}‖₂² ≤ ((r − L)/2)‖w_k − ṽ_{k+1}‖₂² + (L²/(2(r − L)))‖v_{k+1} − ṽ_{k+1}‖₂².  (24)

Combining (23) and (24), we have

−δ ≤ E[f(ṽ_{k+1}) + (r/2)‖ṽ_{k+1} − w_k‖₂²] − E[f(v_{k+1}) + (r/2)‖v_{k+1} − w_k‖₂²] + (L²/(2(r − L))) E‖v_{k+1} − ṽ_{k+1}‖₂² + ((r − L)/2) E‖w_k − ṽ_{k+1}‖₂².  (25)

Rearranging (25) and using Lemma 3, we obtain the result.

Lemma 5. If the constant r satisfies r > L, then we have the following bounds:

E[f(v_k) − f(v_{k+1})] ≥ (r/2) E‖w_k − v_{k+1}‖₂² + r E⟨w_k − v_k, ṽ_{k+1} − w_k⟩ − τδ,  (26)

E[f(v*) − f(v_{k+1})] ≥ (r/2) E‖w_k − v_{k+1}‖₂² + r E⟨w_k − v*, ṽ_{k+1} − w_k⟩ − τδ,  (27)

where τ := L²/(r(r − L)) + 1 and v* is the minimum.

Proof. With Lemma 2, we have

−f(ṽ_{k+1}) ≥ −f(w_k) − ⟨ṽ_{k+1} − w_k, ∇f(w_k)⟩ − (L/2)‖ṽ_{k+1} − w_k‖₂².  (28)

Using the convexity of f, we have f(v_k) − f(w_k) ≥ ⟨v_k − w_k, ∇f(w_k)⟩, i.e.,

f(v_k) ≥ f(w_k) + ⟨v_k − w_k, ∇f(w_k)⟩.  (29)

According to the definition of ṽ_{k+1} in (17), i.e.,

ṽ_{k+1} = argmin_v Q_r(v, w_k) = argmin_v {⟨v − w_k, ∇f(w_k)⟩ + (r/2)‖v − w_k‖₂²},

the optimality condition gives

ṽ_{k+1} = w_k − (1/r)∇f(w_k).  (30)

Substituting (30) into (29), we obtain

f(v_k) ≥ f(w_k) + ⟨v_k − w_k, r(w_k − ṽ_{k+1})⟩.  (31)

Direct summation of (28) and (31) gives

f(v_k) − f(ṽ_{k+1}) ≥ (r − L/2)‖ṽ_{k+1} − w_k‖₂² + r⟨w_k − v_k, ṽ_{k+1} − w_k⟩.  (32)

Summing (32) and (20), we obtain the inequality (26):

E[f(v_k) − f(v_{k+1})] ≥ (r/2) E‖w_k − v_{k+1}‖₂² + r E⟨w_k − v_k, ṽ_{k+1} − w_k⟩ − τδ.

On the other hand, with the convexity of f, we have

f(v*) − f(w_k) ≥ ⟨v* − w_k, ∇f(w_k)⟩ = ⟨v* − w_k, r(w_k − ṽ_{k+1})⟩.  (34)

The summation of (28) and (34) results in

f(v*) − f(ṽ_{k+1}) ≥ (r − L/2)‖w_k − ṽ_{k+1}‖₂² + r⟨w_k − v*, ṽ_{k+1} − w_k⟩.  (35)
Summing (35) and (20), we obtain
$$\mathbb{E}\big[f(v^*) - f(v^{k+1})\big] \ge \frac{r}{2}\mathbb{E}\|w^k - v^{k+1}\|_2^2 + r\,\mathbb{E}\big\langle w^k - v^*,\, \tilde v^{k+1} - w^k\big\rangle - \tau\delta, \tag{36}$$
which is the same as (27).

Theorem 3 (Uncontrolled bound of NASGD; Theorem 1 with detailed bound). Let the constant r satisfy r > L and let the sequence $\{v^k\}_{k\ge0}$ be generated by NASGD with a stochastic gradient that has bounded variance, using any constant step size $s_k \equiv s \le 1/L$. Then we have
$$\mathbb{E}\big[f(v^k) - \min_v f(v)\big] \le \Big(\frac{2(\tau+1)\delta}{r} + R^2\Big)\frac{4k}{3}. \tag{37}$$

Proof. Denote $F_k := \mathbb{E}[f(v^k) - f(v^*)]$. By $(26)\times(t_k - 1) + (27)$, we have
$$\frac{2[(t_k - 1)F_k - t_k F_{k+1}]}{r} \ge t_k\,\mathbb{E}\|v^{k+1} - w^k\|_2^2 + 2\,\mathbb{E}\big\langle\tilde v^{k+1} - w^k,\, t_k w^k - (t_k - 1)v^k - v^*\big\rangle - \frac{2\tau t_k\delta}{r}. \tag{38}$$
With $t_{k-1}^2 = t_k^2 - t_k$, multiplying (38) by $t_k$ yields
$$\frac{2[t_{k-1}^2 F_k - t_k^2 F_{k+1}]}{r} \ge \mathbb{E}\|t_k v^{k+1} - t_k w^k\|_2^2 + 2t_k\,\mathbb{E}\big\langle\tilde v^{k+1} - w^k,\, t_k w^k - (t_k - 1)v^k - v^*\big\rangle - \frac{2\tau t_k^2\delta}{r}. \tag{39}$$
Substituting $a = t_k v^{k+1} - (t_k - 1)v^k - v^*$ and $b = t_k w^k - (t_k - 1)v^k - v^*$ into the identity
$$\|a - b\|_2^2 + 2\langle a - b, b\rangle = \|a\|_2^2 - \|b\|_2^2, \tag{40}$$
it follows that
$$\mathbb{E}\|t_k v^{k+1} - t_k w^k\|_2^2 + 2t_k\,\mathbb{E}\big\langle\tilde v^{k+1} - w^k,\, t_k w^k - (t_k - 1)v^k - v^*\big\rangle$$
$$= \mathbb{E}\|t_k v^{k+1} - t_k w^k\|_2^2 + 2t_k\,\mathbb{E}\big\langle v^{k+1} - w^k,\, t_k w^k - (t_k - 1)v^k - v^*\big\rangle + 2t_k\,\mathbb{E}\big\langle\tilde v^{k+1} - v^{k+1},\, t_k w^k - (t_k - 1)v^k - v^*\big\rangle$$
$$\overset{(40)}{=} \mathbb{E}\big[\|t_k v^{k+1} - (t_k - 1)v^k - v^*\|_2^2 - \|t_k w^k - (t_k - 1)v^k - v^*\|_2^2\big] + 2t_k\,\mathbb{E}\big\langle\tilde v^{k+1} - v^{k+1},\, t_k w^k - (t_k - 1)v^k - v^*\big\rangle$$
$$= \mathbb{E}\|t_k v^{k+1} - (t_k - 1)v^k - v^*\|_2^2 - \mathbb{E}\|t_{k-1}v^k - (t_{k-1} - 1)v^{k-1} - v^*\|_2^2 + 2t_k\,\mathbb{E}\big\langle\tilde v^{k+1} - v^{k+1},\, t_{k-1}v^k - (t_{k-1} - 1)v^{k-1} - v^*\big\rangle. \tag{41}$$
In the third identity we used the fact $t_k w^k = t_k v^k + (t_{k-1} - 1)(v^k - v^{k-1})$, which also gives $t_k w^k - (t_k - 1)v^k - v^* = t_{k-1}v^k - (t_{k-1} - 1)v^{k-1} - v^*$. Denoting $u_k := \mathbb{E}\|t_{k-1}v^k - (t_{k-1} - 1)v^{k-1} - v^*\|_2^2$, (39) can be rewritten as
$$\frac{2t_k^2 F_{k+1}}{r} + u_{k+1} \le \frac{2t_{k-1}^2 F_k}{r} + u_k + \frac{2\tau t_k^2\delta}{r} + 2t_k\,\mathbb{E}\big\langle v^{k+1} - \tilde v^{k+1},\, t_{k-1}v^k - (t_{k-1} - 1)v^{k-1} - v^*\big\rangle \le \frac{2t_{k-1}^2 F_k}{r} + u_k + \frac{2(\tau+1)t_k^2\delta}{r} + t_{k-1}^2 R^2, \tag{42}$$
where we used Young's inequality and Lemma 3:
$$2t_k\,\mathbb{E}\big\langle v^{k+1} - \tilde v^{k+1},\, t_{k-1}v^k - (t_{k-1} - 1)v^{k-1} - v^*\big\rangle \le t_k^2\,\mathbb{E}\|v^{k+1} - \tilde v^{k+1}\|_2^2 + \mathbb{E}\|t_{k-1}v^k - (t_{k-1} - 1)v^{k-1} - v^*\|_2^2 \le \frac{2t_k^2\delta}{r} + t_{k-1}^2 R^2.$$
Denoting $\xi_k := \frac{2t_{k-1}^2 F_k}{r} + u_k$, telescoping (42) gives
$$\xi_{k+1} \le \xi_0 + \Big(\frac{2(\tau+1)\delta}{r} + R^2\Big)\sum_{i=1}^{k} t_i^2 \le \Big(\frac{2(\tau+1)\delta}{r} + R^2\Big)\frac{k^3}{3}. \tag{43}$$
With the fact $\xi_k \ge \frac{2t_{k-1}^2 F_k}{r} \ge \frac{k^2 F_k}{4}$, we then prove the result.

For any $w \in \mathbb{R}^d$, an exact first-order oracle returns a pair $(f(w), \nabla f(w)) \in \mathbb{R} \times \mathbb{R}^d$ such that for all $v \in \mathbb{R}^d$,
$$0 \le f(v) - \big(f(w) + \langle\nabla f(w), v - w\rangle\big) \le \frac{L}{2}\|w - v\|_2^2.$$
A $\delta$-inexact oracle returns a pair $(f_\delta(w), \nabla f_\delta(w)) \in \mathbb{R} \times \mathbb{R}^d$ such that for all $v \in \mathbb{R}^d$,
$$0 \le f(v) - \big(f_\delta(w) + \langle\nabla f_\delta(w), v - w\rangle\big) \le \frac{L}{2}\|w - v\|_2^2 + \delta.$$
We have the following convergence results for GD and NAG under a $\delta$-inexact oracle for convex smooth optimization.

Theorem 4 (Devolder et al. (2014)). Consider $\min_w f(w)$, $w \in \mathbb{R}^d$, where f is convex and L-smooth with minimizer $w^*$. Given access to a $\delta$-inexact oracle, GD with step size 1/L returns a point $w^k$ after k steps such that
$$f(w^k) - f(w^*) = O\Big(\frac{L}{k}\Big) + \delta.$$
In contrast, NAG with step size 1/L returns
$$f(w^k) - f(w^*) = O\Big(\frac{L}{k^2}\Big) + O(k\delta).$$

Theorem 4 says that NAG may not be robust to a $\delta$-inexact gradient. In the following, we study the numerical behavior of a variety of first-order algorithms for convex smooth optimization with different inexact gradients.

Constant Variance Gaussian Noise: We consider the inexact oracle where the true gradient is contaminated with Gaussian noise $N(0, 0.001^2)$. We run 50K iterations of the different algorithms. For SRNAG, we restart every 200 iterations. Fig. 2 (b) shows iteration vs. optimal gap, $f(x^k) - f(x^*)$, with $x^*$ being the minimum. NAG with the inexact gradient due to constant-variance noise does not converge. GD performs almost the same as ARNAG asymptotically, because ARNAG restarts too often and almost degenerates into GD. GD with constant momentum outperforms the three schemes above, and SRNAG slightly outperforms GD with constant momentum.
Decaying Variance Gaussian Noise: Again, consider minimizing (8) with the same experimental setting as before, except that $\nabla f(x)$ is now contaminated with a decaying Gaussian noise $N(0, (\frac{0.1}{\lfloor t/100\rfloor + 1})^2)$. For SRNAG, we restart every 200 iterations in the first 10K iterations and every 400 iterations in the remaining 40K iterations. Fig. 2 (c) shows the iteration vs. optimal gap for the different schemes. ARNAG still performs almost the same as GD. The path of NAG is oscillatory. GD with constant momentum again outperforms the previous three schemes. Here SRNAG significantly outperforms all the other schemes.

Logistic Regression for MNIST Classification: We apply the above schemes with stochastic gradients to train a logistic regression model for MNIST classification (LeCun & Cortes, 2010). We consider five different schemes, namely, SGD, SGD + (constant) momentum, NASGD, ARSGD, and SRSGD. In ARSGD, we perform restart based on the loss value of the mini-batch training data. In SRSGD, we restart the NAG momentum after every 10 iterations. We train the logistic regression model with an $\ell_2$ weight decay of $10^{-4}$ by running 20 epochs using the different schemes with batch size 128. The step sizes for all the schemes are set to 0.01. Fig. 3 (a) plots the training loss vs. iteration. In this case, NASGD does not converge, and SGD with momentum does not speed up SGD. ARSGD's performance is on par with SGD's. Again, SRSGD gives the best performance, with the smallest training loss among these five schemes.
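The SRSGD update used in this experiment can be sketched in pure Python. This is a minimal illustration, not the paper's training script: the two-blob synthetic dataset stands in for MNIST and all helper names are ours. The update is $v^{k+1} = w^k - s\,\hat\nabla f(w^k)$, $w^{k+1} = v^{k+1} + \mu_k(v^{k+1} - v^k)$ with $\mu_k = (\tilde k - 1)/(\tilde k + 2)$ and the counter $\tilde k$ reset every F iterations.

```python
import math, random

def srsgd_step(w, v, grad, lr, it, F):
    """One SRSGD update with scheduled restart of the iteration counter."""
    mu = (it - 1.0) / (it + 2.0)                      # NAG-style momentum
    v_new = [wi - lr * gi for wi, gi in zip(w, grad)]  # gradient step
    w_new = [vn + mu * (vn - vo) for vn, vo in zip(v_new, v)]
    it = 1 if it >= F else it + 1                      # restart every F steps
    return w_new, v_new, it

# toy logistic regression: two separable 2-D blobs (stand-in for MNIST)
rng = random.Random(0)
data = [([rng.gauss(1, 0.5), rng.gauss(1, 0.5)], 1) for _ in range(50)] + \
       [([rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)], 0) for _ in range(50)]

def minibatch_grad(w, batch):
    g = [0.0, 0.0, 0.0]                      # two weights + bias
    for x, y in batch:
        p = 1.0 / (1.0 + math.exp(-(w[0]*x[0] + w[1]*x[1] + w[2])))
        for j, xj in enumerate(x + [1.0]):
            g[j] += (p - y) * xj / len(batch)
    return g

def loss(w):
    tot = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w[0]*x[0] + w[1]*x[1] + w[2])))
        tot -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return tot / len(data)

w = [0.0, 0.0, 0.0]
v = list(w)
it, loss0 = 1, loss(w)
for k in range(500):
    batch = rng.sample(data, 10)
    w, v, it = srsgd_step(w, v, minibatch_grad(w, batch), 0.01, it, F=10)
final = loss(w)
```

Here F = 10 matches the restart frequency used in the logistic regression experiment above; the training loss decreases well below its initial value of ln 2.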

C CONVERGENCE OF SRSGD

We prove the convergence of Nesterov accelerated SGD with scheduled restart, i.e., the convergence of SRSGD. We denote $\theta_k := \frac{t_k - 1}{t_{k+1}}$ in the Nesterov iteration and $\tilde\theta_k$ its counterpart in the restarted version, i.e., SRSGD. For any restart frequency F (a positive integer), we have $\tilde\theta_k = \theta_{k - \lfloor k/F\rfloor F}$. In the restarted version, we thus have $\tilde\theta_k \le \theta_F =: \theta < 1$. Throughout this section we write the constant step size as s.

Lemma 6. Let the sequence $\{v^k\}_{k\ge0}$ be generated by SRSGD with any restart frequency F (a positive integer) and constant step size $s \le 1/L$. Then
$$\sum_{i=1}^{k}\|v^i - v^{i-1}\|_2^2 \le \frac{s^2 k R^2}{(1-\theta)^2}, \tag{44}$$
where $\theta := \theta_F < 1$ and $R := \sup_x\{\|\nabla f(x)\|_2\}$.

Proof. It holds that
$$\|v^{k+1} - w^k\|_2 = \|v^{k+1} - v^k + v^k - w^k\|_2 \ge \|v^{k+1} - v^k\|_2 - \|v^k - w^k\|_2 \ge \|v^{k+1} - v^k\|_2 - \theta\|v^k - v^{k-1}\|_2. \tag{45}$$
Thus,
$$\|v^{k+1} - w^k\|_2^2 \ge \|v^{k+1} - v^k\|_2^2 - 2\theta\|v^{k+1} - v^k\|_2\|v^k - v^{k-1}\|_2 + \theta^2\|v^k - v^{k-1}\|_2^2 \ge (1-\theta)\|v^{k+1} - v^k\|_2^2 - \theta(1-\theta)\|v^k - v^{k-1}\|_2^2, \tag{46}$$
where the last inequality uses $2ab \le a^2 + b^2$. Summing (46) from k = 1 to K and using $\|v^{k+1} - w^k\|_2 = s\|\hat\nabla f(w^k)\|_2 \le sR$, we get
$$(1-\theta)^2\sum_{k=1}^{K}\|v^k - v^{k-1}\|_2^2 \le \sum_{k=1}^{K}\|v^{k+1} - w^k\|_2^2 \le s^2 K R^2. \tag{47}$$

In the following, we denote $\mathcal{A} := \{k \in \mathbb{Z}^+ \mid \mathbb{E}f(v^k) \ge \mathbb{E}f(v^{k-1})\}$.

Theorem 5 (Convergence of SRSGD; Theorem 2 with detailed bound). Suppose f(w) is L-smooth. Consider the sequence $\{w^k\}_{k\ge0}$ generated by (10) with a stochastic gradient that is bounded and has bounded variance, using any restart frequency F and any constant step size $s_k \equiv s \le 1/L$. Assume that $\sum_{k\in\mathcal{A}}\big(\mathbb{E}f(w^{k+1}) - \mathbb{E}f(w^k)\big) = \bar R < +\infty$. Then we have
$$\min_{1\le k\le K}\mathbb{E}\|\nabla f(w^k)\|_2^2 \le \frac{sR^2}{(1-\theta)^2}\cdot\frac{L(1+\theta^2)}{2} + \frac{sLR^2}{2} + \frac{\mathbb{E}f(w^0) - f(w^*) + \theta\bar R}{sK}. \tag{48}$$
If f(w) is further convex and the set $\mathcal{B} := \{k \in \mathbb{Z}^+ \mid \mathbb{E}\|w^{k+1} - w^*\|_2 \ge \mathbb{E}\|w^k - w^*\|_2\}$ obeys $\sum_{k\in\mathcal{B}}\big(\mathbb{E}f(w^{k+1}) - \mathbb{E}f(w^k)\big) = \bar R < +\infty$, then
$$\min_{1\le k\le K}\mathbb{E}\big[f(w^k) - f(w^*)\big] \le \frac{\|w^0 - w^*\|_2^2 + \bar R}{2sK} + \frac{sR^2}{2}, \tag{49}$$
where $w^*$ is the minimizer of f. To obtain an $\epsilon$ error (for any $\epsilon > 0$), we can set $s = O(\epsilon)$ and $K = O(1/\epsilon^2)$.

Proof. First, we show the convergence of SRSGD for nonconvex optimization.
The L-smoothness of f, i.e., the Lipschitz continuity of $\nabla f$, gives us
$$f(v^{k+1}) \le f(w^k) + \langle\nabla f(w^k), v^{k+1} - w^k\rangle + \frac{L}{2}\|v^{k+1} - w^k\|_2^2. \tag{50}$$
Taking expectations and using $v^{k+1} = w^k - s\,\hat\nabla f(w^k)$, we get
$$\mathbb{E}f(v^{k+1}) \le \mathbb{E}f(w^k) - s\,\mathbb{E}\|\nabla f(w^k)\|_2^2 + \frac{s^2 L R^2}{2}. \tag{51}$$
On the other hand, we have
$$f(w^k) \le f(v^k) + \tilde\theta_k\langle\nabla f(v^k), v^k - v^{k-1}\rangle + \frac{L\tilde\theta_k^2}{2}\|v^k - v^{k-1}\|_2^2. \tag{52}$$
Then we have
$$\mathbb{E}f(v^{k+1}) \le \mathbb{E}f(v^k) + \tilde\theta_k\,\mathbb{E}\langle\nabla f(v^k), v^k - v^{k-1}\rangle + \frac{L\tilde\theta_k^2}{2}\mathbb{E}\|v^k - v^{k-1}\|_2^2 - s\,\mathbb{E}\|\nabla f(w^k)\|_2^2 + \frac{s^2 L R^2}{2}. \tag{53}$$
We also have
$$\tilde\theta_k\langle\nabla f(v^k), v^k - v^{k-1}\rangle \le \tilde\theta_k\Big(f(v^k) - f(v^{k-1}) + \frac{L}{2}\|v^k - v^{k-1}\|_2^2\Big). \tag{54}$$
We then get that
$$\mathbb{E}f(v^{k+1}) \le \mathbb{E}f(v^k) + \tilde\theta_k\big(\mathbb{E}f(v^k) - \mathbb{E}f(v^{k-1})\big) - s\,\mathbb{E}\|\nabla f(w^k)\|_2^2 + A_k, \tag{55}$$
where $A_k := \frac{L}{2}\mathbb{E}\|v^k - v^{k-1}\|_2^2 + \frac{L\tilde\theta_k^2}{2}\mathbb{E}\|v^k - v^{k-1}\|_2^2 + \frac{s^2 L R^2}{2}$. Summing this inequality gives us
$$\mathbb{E}f(v^{K+1}) \le \mathbb{E}f(v^0) + \theta\sum_{k\in\mathcal{A}}\big(\mathbb{E}f(v^k) - \mathbb{E}f(v^{k-1})\big) - s\sum_{k=1}^{K}\mathbb{E}\|\nabla f(w^k)\|_2^2 + \sum_{k=1}^{K} A_k. \tag{56}$$
It is easy to see that $\theta\sum_{k\in\mathcal{A}}(\mathbb{E}f(v^k) - \mathbb{E}f(v^{k-1})) = \theta\bar R$. We get the result by using Lemma 6.

Secondly, we prove the convergence of SRSGD for convex optimization. Let $w^*$ be the minimizer of f. We have
$$\mathbb{E}\|v^{k+1} - w^*\|_2^2 = \mathbb{E}\|w^k - s\,\hat\nabla f(w^k) - w^*\|_2^2 = \mathbb{E}\|w^k - w^*\|_2^2 - 2s\,\mathbb{E}\langle\nabla f(w^k), w^k - w^*\rangle + s^2\,\mathbb{E}\|\hat\nabla f(w^k)\|_2^2 \le \mathbb{E}\|w^k - w^*\|_2^2 - 2s\,\mathbb{E}\langle\nabla f(w^k), w^k - w^*\rangle + s^2 R^2. \tag{57}$$
We can also derive
$$\mathbb{E}\|w^k - w^*\|_2^2 = \mathbb{E}\|v^k + \tilde\theta_k(v^k - v^{k-1}) - w^*\|_2^2 = \mathbb{E}\|v^k - w^*\|_2^2 + 2\tilde\theta_k\,\mathbb{E}\langle v^k - v^{k-1}, v^k - w^*\rangle + \tilde\theta_k^2\,\mathbb{E}\|v^k - v^{k-1}\|_2^2$$
$$= \mathbb{E}\|v^k - w^*\|_2^2 + \tilde\theta_k\,\mathbb{E}\big[\|v^k - w^*\|_2^2 + \|v^{k-1} - v^k\|_2^2 - \|v^{k-1} - w^*\|_2^2\big] + \tilde\theta_k^2\,\mathbb{E}\|v^k - v^{k-1}\|_2^2$$
$$\le \mathbb{E}\|v^k - w^*\|_2^2 + \tilde\theta_k\,\mathbb{E}\big[\|v^k - w^*\|_2^2 - \|v^{k-1} - w^*\|_2^2\big] + 2\tilde\theta_k\,\mathbb{E}\|v^k - v^{k-1}\|_2^2,$$
where we used the identity
$$\langle a - b, c - d\rangle = \frac{1}{2}\big[\|a - d\|_2^2 - \|a - c\|_2^2 + \|b - c\|_2^2 - \|b - d\|_2^2\big]$$
with $a = c = v^k$, $b = v^{k-1}$, and $d = w^*$. Then we have
$$\mathbb{E}\|v^{k+1} - w^*\|_2^2 \le \mathbb{E}\|v^k - w^*\|_2^2 - 2s\,\mathbb{E}\langle\nabla f(w^k), w^k - w^*\rangle + 2\tilde\theta_k\,\mathbb{E}\|v^k - v^{k-1}\|_2^2 + s^2 R^2 + \tilde\theta_k\,\mathbb{E}\big(\|v^k - w^*\|_2^2 - \|v^{k-1} - w^*\|_2^2\big). \tag{58}$$
Using the convexity of f, $\langle\nabla f(w^k), w^k - w^*\rangle \ge f(w^k) - f(w^*)$, we then get that
$$2s\,\mathbb{E}\big[f(w^k) - f(w^*)\big] \le \mathbb{E}\|v^k - w^*\|_2^2 - \mathbb{E}\|v^{k+1} - w^*\|_2^2 + \tilde\theta_k\big(\mathbb{E}\|v^k - w^*\|_2^2 - \mathbb{E}\|v^{k-1} - w^*\|_2^2\big) + 2\tilde\theta_k\,\mathbb{E}\|v^k - v^{k-1}\|_2^2 + s^2 R^2. \tag{59}$$
Summing the inequality gives us the desired convergence result for convex optimization.
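The restarted momentum coefficient $\tilde\theta_k \le \theta_F < 1$ that drives the bounds in this section is easy to check numerically. A small sketch (function and variable names are ours):

```python
import math

def theta_sequence(K, F=None):
    """Momentum coefficients theta_k = (t_k - 1)/t_{k+1} with
    t_{k+1} = (1 + sqrt(1 + 4 t_k^2))/2 and t_0 = 1.

    With a restart frequency F the counter is reset every F steps, so the
    restarted coefficient is theta_{k - floor(k/F)*F}, as in Section C.
    """
    ts = [1.0]
    for _ in range(K + 1):
        ts.append((1.0 + math.sqrt(1.0 + 4.0 * ts[-1] ** 2)) / 2.0)

    def theta(k):
        if F is not None:
            k = k - (k // F) * F   # scheduled restart of the counter
        return (ts[k] - 1.0) / ts[k + 1]

    return [theta(k) for k in range(K)]

plain = theta_sequence(1000)          # theta_k -> 1 as k grows
restarted = theta_sequence(1000, F=40)  # stays bounded by theta_F < 1
```

With F = 40, every restarted coefficient stays below $\theta_{40}$, while the unrestarted $\theta_k$ approaches 1; this is exactly the gap Lemma 6 exploits.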

C.1 NUMERICAL VERIFICATION OF THE ASSUMPTIONS IN THEOREM 2

In this part, we numerically verify the assumptions in Theorem 2. In particular, we apply SRSGD with learning rate 0.1 to train LeNet for MNIST classification (we test on MNIST due to the extremely large computational cost of this verification). We conduct the numerical verification as follows: starting from a given point $w^0$, we randomly sample 469 mini-batches (in total there are 469 batches of size 128 in the training data) and compute the stochastic gradient using each mini-batch. Next, we advance to the next step with each of these 469 stochastic gradients and average to approximate $\mathbb{E}f(w^1)$. We randomly choose one of these 469 positions as the updated weights of our model. By iterating this procedure, we obtain $w^1, w^2, \dots$ and $\mathbb{E}f(w^1), \mathbb{E}f(w^2), \dots$, and we use these values to verify the assumptions in Theorem 2. We set the restart frequencies to 20, 40, and 80, respectively. The top panels of Figure 6 plot k vs. the cardinality of the set $\mathcal{A} := \{k \in \mathbb{Z}^+ \mid \mathbb{E}f(w^{k+1}) \ge \mathbb{E}f(w^k)\}$, and the bottom panels plot k vs. $\sum_{k\in\mathcal{A}}(\mathbb{E}f(w^{k+1}) - \mathbb{E}f(w^k))$. Figure 6 shows that $\sum_{k\in\mathcal{A}}(\mathbb{E}f(w^{k+1}) - \mathbb{E}f(w^k))$ converges to a constant $\bar R < +\infty$. We also notice that when the training plateaus, $\mathbb{E}f(w^k)$ still oscillates, but the magnitude of the oscillation diminishes as the iteration goes on, which is consistent with our plots: the cardinality of $\mathcal{A}$ increases linearly, but $\bar R$ converges to a finite number. These numerical results show that our assumption in Theorem 2 is reasonable.

D DATASETS AND IMPLEMENTATION DETAILS

D.1 CIFAR

The Pre-ResNet architectures for our CIFAR experiments follow He et al. (2016b). We train each model for 200 epochs with batch size 128 and initial learning rate 0.1, which is decayed by a factor of 10 at the 80th, 120th, and 160th epochs. The weight decay rate is 5 × 10^-4 and the momentum for the SGD baseline is 0.9. Random cropping and random horizontal flipping are applied to the training data. Our code is modified from the PyTorch classification project Yang (2017), which was also used by Liu et al.
(2020). We provide the restarting frequencies for the exponential and linear schemes for CIFAR10 and CIFAR100 in Table 7 below. Using the same notation as in the main text, we denote Fi as the restarting frequency at the i-th learning rate.

For our ImageNet experiments, we use the official PyTorch ResNet implementation (Paszke et al., 2019) and run on 8 Nvidia V100 GPUs. We report single-crop top-1 and top-5 errors of our models. In our experiments, we set F1 = 40 at the 1st learning rate, F2 = 80 at the 2nd learning rate, and F3 is linearly decayed from 80 to 1 at the 3rd learning rate (see Table 8).

For the restarting hyper-parameter search, we use 10,000 images from the original training set as a validation set; this validation set contains 1,000 and 100 images from each class for CIFAR10 and CIFAR100, respectively. We first train Pre-ResNet-110 on the remaining 40,000 training images and use the performance on the validation set, averaged over 5 random seeds, to select the initial restarting frequency F1 and the growth rate r. Both F1 and r are selected by grid search over {20, 25, 30, 35, 40, 45, 50} and {1, 1.25, 1.5, 1.75, 2}, respectively. We then train all models, including Pre-ResNet-110, 290, 470, 650, and 1001, on all 50,000 training images using the selected values of F1 and r and report the results on the test set, which contains 10,000 test images. The reported test performance is averaged over 5 random seeds. We also use the same selected values of F1 and r for our short-training experiments in Section 4.3.
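The linearly decayed restarting frequency F3 mentioned above can be generated, for instance, as follows; the exact rounding convention is our assumption, not taken from the released code.

```python
def linear_restart_schedule(f_start, f_end, num_epochs):
    """One restarting frequency per epoch, linearly decayed from
    f_start to f_end over num_epochs epochs (e.g., F3: 80 -> 1)."""
    if num_epochs == 1:
        return [f_start]
    step = (f_start - f_end) / (num_epochs - 1)
    return [round(f_start - i * step) for i in range(num_epochs)]

# F3 linearly decayed from 80 to 1 over the last 24 epochs
freqs = linear_restart_schedule(80, 1, 24)
```

Each entry is the restart period used during the corresponding epoch of the final learning-rate stage.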
Table 7: Restarting frequencies for the linear and exponential schedules on CIFAR10 and CIFAR100.

Linear schedule: F1 = 30, F2 = 60, F3 = 90, F4 = 120 (r = 2) and F1 = 50, F2 = 100, F3 = 150, F4 = 200 (r = 2).
Exponential schedule: F1 = 40, F2 = 50, F3 = 63, F4 = 78 (r = 1.25) and F1 = 45, F2 =

For ImageNet experiments, we use linear scheduling and sweep the initial restarting frequency F1 and the growth rate r over {20, 30, 40, 50, 60} and {1, 1.25, 1.5, 1.75, 2}, respectively. We select the values F1 = 40 and r = 2, which attain the highest final validation accuracy averaged over 5 random seeds. As in the CIFAR10 and CIFAR100 experiments, we select F1 and r using our smallest model, ResNet-50, and apply the same selected hyperparameter values to all models, including ResNet-50, 101, 152, and 200. We also use the same selected values of F1 and r for our short-training experiments in Section 4.3. However, for ResNet-50, we observe that F1 = 60 and r = 1.

For the pixel-by-pixel MNIST task (LeCun & Cortes, 2010), we follow the implementation of Le et al. (2015) and feed each pixel of the image into the RNN sequentially.
In addition, we choose a random permutation of the 28 × 28 = 784 elements at the beginning of the experiment. This fixed permutation is applied to both training and testing sequences. This task is known as permuted MNIST (PMNIST) classification and has become a standard benchmark for measuring the performance of RNNs and their ability to capture long-term dependencies.

Table 10: Single-crop validation errors (%) on ImageNet of ResNets trained with SGD + NM and SRSGD. We report the results of SRSGD with the increasing restarting frequency in the first two learning rates. In the last learning rate, the restarting frequency is linearly decreased from 70 to 1. For baseline results, we also include the reported single-crop validation errors of He et al. (2016c).

Implementation and Training Details: For the LSTM model, we initialize the forget bias to 1 and all other biases to 0. All weight matrices are initialized orthogonally, except for the hidden-to-hidden weight matrices, which are initialized to identity. We train each model for 350 epochs with an initial learning rate of 0.01, which is reduced by a factor of 10 at epochs 200 and 300. The momentum is set to 0.9 for SGD with standard and Nesterov constant momentum. The restart schedule for SRSGD is set to 90, 30, 90, changing at epochs 200 and 300. In all experiments, we use batch size 128, and the gradients are clipped so that their L2 norm is at most 1. Our code is based on the code from the exponential RNN GitHub repository.

Results: Our experiments corroborate the superiority of SRSGD over the two baselines. SRSGD yields a much smaller test error and converges faster than SGD with standard and Nesterov constant momentum across all settings with different numbers of LSTM hidden units. We summarize our results in Table 11 and Figure 7.

E.3 WASSERSTEIN GENERATIVE ADVERSARIAL NETWORKS (WGAN) TRAINING ON MNIST

We demonstrate the advantage of SRSGD over SGD with standard and Nesterov momentum in training deep generative models. In our experiments, we train a WGAN with gradient penalty (Gulrajani et al., 2017) on MNIST. We evaluate our models using the discriminator's loss, i.e., the Earth Mover's distance estimate, since lower discriminator loss and better sample quality are correlated in WGANs (Arjovsky et al., 2017).

Implementation and Training Details:

The detailed implementations of our generator and discriminator are given below. For the generator, we set latent_dim to 100 and d to 32. For the discriminator, we set d to 32. We train each model for 350 epochs with an initial learning rate of 0.01, which is reduced by a factor of 10 at epochs 200 and 300. The momentum is set to 0.9 for SGD with standard and Nesterov constant momentum. The restart schedule for SRSGD is set to 60, 120, 180.

Results: SRSGD is again better than both baselines. SRSGD achieves a smaller discriminator loss, i.e., Earth Mover's distance estimate, and converges faster than SGD with standard and Nesterov constant momentum. We summarize our results in Table 12 and Figure 8. We also show digits generated by the trained WGANs in Figure 9. By visual evaluation, we observe that samples generated by the WGAN trained with SRSGD look slightly better than those generated by the WGANs trained with SGD with standard and Nesterov constant momentum.

Short-training comparisons are reported in Table 13 above. In particular, for both SGD and SRSGD training using 100 epochs, we decrease the learning rate by a factor of 10 at the 80th, 90th, and 95th epochs. We observe that SGD short training has the worst performance, while SRSGD short training yields comparable or even better results than SGD full training.

F.3 ADDITIONAL EXPERIMENTAL RESULTS

Figure 10 shows error rate vs. reduction in training epochs for all models trained on CIFAR10 and ImageNet; it is a more complete version of Figure 4 in the main text. Tables 13 and 14 provide the detailed test errors corresponding to Figure 4 and Figure 10. We also conduct an additional ablation study of error rate vs. reduction in epochs for CIFAR100 and include the results in Figure 11 and Table 18 below.

G.3 ADDITIONAL EXPERIMENTAL RESULTS

To complete our study on the impact of restarting frequency in Section 5.2 of the main text, we examine the cases of CIFAR100 and ImageNet in this section. We summarize our results in Figures 14 and 15 below. Also, Figure 13 is a more detailed version of Figure 5 in the main text.

The SRSGD optimizer implements the update

    v_{t+1} = p_t - lr * g_t
    p_{t+1} = v_{t+1} + (iter_count) / (iter_count + 3) * (v_{t+1} - v_t)

PyTorch version (initialization):

    def __init__(self, params, lr=required, weight_decay=0.,
                 iter_count=1, restarting_iter=100):
        if lr is not required and lr < 0.0:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if weight_decay < 0.0:
            raise ValueError("Invalid weight_decay value: {}".format(weight_decay))
        if iter_count < 1:
            raise ValueError("Invalid iter_count: {}".format(iter_count))
        if restarting_iter < 1:
            raise ValueError("Invalid iter_total: {}".format(restarting_iter))
        defaults = dict(lr=lr, weight_decay=weight_decay,
                        iter_count=iter_count, restarting_iter=restarting_iter)
        super(SRSGD, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(SRSGD, self).__setstate__(state)

Keras version:

    def __init__(self, learning_rate=0.01, iter_count=1,
                 restarting_iter=40, **kwargs):
        learning_rate = kwargs.pop('lr', learning_rate)
        self.initial_decay = kwargs.pop('decay', 0.0)
        super(SRSGD, self).__init__(**kwargs)
        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.learning_rate = K.variable(learning_rate, name='learning_rate')
            self.decay = K.variable(self.initial_decay, name='decay')
            # for srsgd
            self.iter_count = K.variable(iter_count, dtype='int64',
                                         name='iter_count')
            self.restarting_iter = K.variable(restarting_iter, dtype='int64',
                                              name='restarting_iter')
        self.nesterov = nesterov

    @interfaces.legacy_get_updates_support
    @K.symbolic
    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]
        momentum = ((K.cast(self.iter_count, dtype=K.dtype(self.decay)) - 1.)
                    / (K.cast(self.iter_count, dtype=K.dtype(self.decay)) + 2.))
        lr = self.learning_rate
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay
                             * K.cast(self.iterations, K.dtype(self.decay))))
        # ... parameter-update loop omitted ...
        condition = K.all(K.less(self.iter_count, self.restarting_iter))
        new_iter_count = K.switch(condition, self.iter_count + 1,
                                  self.iter_count - self.restarting_iter + 1)
        self.updates.append(K.update(self.iter_count, new_iter_count))
        return self.updates

    def get_config(self):
        config = {'learning_rate': float(K.get_value(self.learning_rate)),
                  'decay': float(K.get_value(self.decay)),
                  'iter_count': int(K.get_value(self.iter_count)),
                  'restarting_iter': int(K.get_value(self.restarting_iter))}
        base_config = super(SRSGD, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))



Footnotes:
- We leave the analysis under the other assumptions (Jain et al., 2018) as future work.
- By overfitting, we mean that the model achieves low training error but high test error.
- We adopt the result from Hardt (2014).
- We used the PyTorch implementation of LeNet at https://github.com/activatedgeek/LeNet-5.
- Implementation available at https://github.com/bearpaw/pytorch-classification
- Implementation available at https://github.com/pytorch/examples/tree/master/imagenet
- Implementation available at https://github.com/Lezcano/expRNN
- Implementation available at https://github.com/arturml/pytorch-wgan-gp



Figure 1: Error rate vs. depth of ResNet models trained with SRSGD and the baseline SGD with constant momentum. The advantage of SRSGD continues to grow with depth.

Figure 2: Comparison between different schemes in optimizing the quadratic function in (8) with (a) exact gradient, (b) gradient with constant variance Gaussian noise, and (c) gradient with decaying variance Gaussian noise. NAG, ARNAG, and SRNAG can speed up convergence remarkably when exact gradient is used. Also, SRNAG is more robust to noisy gradient than NAG and ARNAG.

Figure 3: (a) Training loss comparison between different schemes in training logistic regression for MNIST classification. Here, SGD is plain SGD without momentum, and SGD + Momentum follows (3) with the gradient replaced by the mini-batch stochastic gradient. NASGD is not robust to the noisy gradient, ARSGD almost degenerates to SGD, and SRSGD performs best in this case. (b, c) Training loss vs. training epoch of ResNet models trained with SRSGD (blue) and the SGD baseline with constant momentum as in the PyTorch implementation, denoted by SGD in Section 4 (red).

2) for CIFAR100. We compare the test error of the trained models with those trained by the SGD baseline in 200 epochs. We observe that SRSGD training consistently yields lower test errors than SGD except for the case of Pre-ResNet-110 even though the number of training epochs of our method is only half of the number of training epochs required by SGD. For Pre-ResNet-110, SRSGD needs 110 epochs with learning rate decreased at the 80th, 90th, and 100th epoch to achieve the same error rate as the 200-epoch SGD training on CIFAR10. On CIFAR100, SRSGD training for Pre-ResNet-110 needs 140 epochs with learning rate decreased at the 80th, 100th and 120th epoch to outperform the 200-epoch SGD. Comparison with SGD short training is provided in Appendix F.2.

Contents of the Appendix:

A Uncontrolled Bound of NASGD
  A.1 Preliminaries
  A.2 Uncontrolled Bound of NASGD: Analysis
B NAG with δ-Inexact Oracle & Experimental Settings in Section 3.1
C Convergence of SRSGD
  C.1 Numerical Verification of the Assumptions in Theorem 2
D Datasets and Implementation Details
  D.1 CIFAR
  D.2 ImageNet
  D.3 Training ImageNet in Fewer Number of Epochs
  D.4 Details on Restarting Hyper-parameters Search
E SRSGD vs. SGD and SGD + NM on ImageNet Classification and Other Tasks
  E.1 Comparing with SGD with Nesterov Momentum on ImageNet Classification
  E.2 Long Short-Term Memory (LSTM) Training for Pixel-by-Pixel MNIST
  E.3 Wasserstein Generative Adversarial Networks (WGAN) Training on MNIST
F Error Rate vs. Reduction in Training Epochs
  F.1 Implementation Details
  F.2 Short Training on CIFAR10/CIFAR100 Using SGD
  F.3 Additional Experimental Results
G Impact of Restarting Frequency for ImageNet and CIFAR100
  G.1 Implementation Details
  G.2 Impact of the Growth Rate r
  G.3 Additional Experimental Results
H Full Training with Less Epochs at the Intermediate Learning Rates
I Visualization of SRSGD's Trajectory
J SRSGD Implementation in Pytorch
K SRSGD Implementation in Keras

A UNCONTROLLED BOUND OF NASGD

Consider the following optimization problem
$$\min_w f(w),$$

we then prove the result.

B NAG WITH δ-INEXACT ORACLE & EXPERIMENTAL SETTINGS IN SECTION 3.1

In Devolder et al. (2014), the authors define a δ-inexact gradient oracle for convex smooth optimization as follows:

Definition 1 (δ-Inexact Oracle, Devolder et al. (2014)). For a convex L-smooth function f : R^d → R.


Figure 6: Cardinality of the set $\mathcal{A} := \{k \in \mathbb{Z}^+ \mid \mathbb{E}f(w^{k+1}) \ge \mathbb{E}f(w^k)\}$ (top panels) and the value of $\bar R = \sum_{k\in\mathcal{A}}(\mathbb{E}f(w^{k+1}) - \mathbb{E}f(w^k))$ (bottom panels). We notice that when the training plateaus, $\mathbb{E}f(w^k)$ still oscillates, but the magnitude of the oscillation diminishes as the iteration goes on, which is consistent with our plots: the cardinality of $\mathcal{A}$ increases linearly, but $\bar R$ converges to a finite number under different restart frequencies. These results confirm that our assumption in Theorem 2 is reasonable.
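The branching estimate of $\mathbb{E}f(w^{k+1})$ used in this verification can be sketched on a toy problem. This is an illustrative stand-in, not the LeNet experiment: here f is a 1-D quadratic, "mini-batch gradients" are modeled as Gaussian perturbations of the true gradient, and all names are ours.

```python
import random

def verify_assumption(steps=200, batches=20, lr=0.1, seed=0):
    """From w_k, branch with every mini-batch gradient to estimate
    E f(w_{k+1}), pick one branch at random as w_{k+1}, and accumulate
    R_bar = sum over the set A of E f(w_{k+1}) - E f(w_k)."""
    rng = random.Random(seed)
    f = lambda w: 0.5 * w * w
    w, Ef_prev, R_bar, card_A = 5.0, None, 0.0, 0
    for k in range(steps):
        grads = [w + rng.gauss(0.0, 0.5) for _ in range(batches)]  # noisy grads
        nxt = [w - lr * g for g in grads]                  # all branch points
        Ef = sum(f(x) for x in nxt) / batches              # estimate E f(w_{k+1})
        if Ef_prev is not None and Ef >= Ef_prev:          # k belongs to A
            card_A += 1
            R_bar += Ef - Ef_prev
        Ef_prev = Ef
        w = rng.choice(nxt)                                # advance on one branch
    return card_A, R_bar

card_A, R_bar = verify_assumption()
```

As in Figure 6, the running sum stays finite even though the estimated objective keeps oscillating once the iterates reach the noise-dominated regime.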

68, F3 = 101, F4 = 152 (r = 1.50).

D.2 IMAGENET

The ImageNet dataset contains roughly 1.28 million training color images and 50K validation color images from 1000 classes (Russakovsky et al., 2015). We run our ImageNet experiments on ResNet-50, 101, 152, and 200 with 5 different seeds. Following He et al. (2016a;b), we train each model for 90 epochs with a batch size of 256 and decrease the learning rate by a factor of 10 at the 30th and 60th epochs. The initial learning rate is 0.1, the momentum is 0.9, and the weight decay rate is 1 × 10^-5. Random 224 × 224 cropping and random horizontal flipping are applied to the training data. We use the official PyTorch ResNet implementation (Paszke et al., 2019).

F2 = 105, F3: linearly decayed from 105 to 1 in the last 24 epochs.
ResNet-101: decrease the learning rate by a factor of 10 at the 30th and 56th epochs; train for a total of 80 epochs. F1 = 40, F2 = 80, F3: linearly decayed from 80 to 1 in the last 24 epochs.
ResNet-152: decrease the learning rate by a factor of 10 at the 30th and 51st epochs; train for a total of 75 epochs. F1 = 40, F2 = 80, F3: linearly decayed from 80 to 1 in the last 24 epochs.
ResNet-200: decrease the learning rate by a factor of 10 at the 30th and 46th epochs; train for a total of 60 epochs. F1 = 40, F2 = 80, F3: linearly decayed from 80 to 1 in the last 14 epochs.
These choices are made based on final validation performance. The same chosen restarting frequencies are applied for all models including Pre-ResNet-110, 290, 470, 650, and 1001.


Figure 7: Training loss vs. training iterations of LSTM trained with SGD (red), SGD + NM (green), and SRSGD (blue) for PMNIST classification tasks.



Figure 8: Earth Mover's distance estimate (i.e. discriminator loss) vs. epochs of WGAN with gradient penalty trained with SGD (red), SGD + NM (green), and SRSGD (blue) on MNIST.

Figure 9: MNIST digits generated by WGAN with gradient penalty trained by SGD (left), SGD + NM (middle), and SRSGD (right).

which are the blue dots. Those bad local minima achieve good training error but bad test error. We plot the trained models and the bad local minima using PCA (Wold et al., 1987) and t-SNE (Maaten & Hinton, 2008) embeddings. The blue color bar is for the test accuracy of the bad local minima; the red color bar is for the number of training epochs.

Figure 19: Trajectory through bad minima of SGD, SGD with constant momentum, and SRSGD during training: we train a neural net classifier and plot the iterates of SGD after every ten epochs (red dots). We also plot the locations of nearby "bad" minima with poor generalization (blue dots). We visualize these using PCA and t-SNE embeddings. The blue color bar is for the test accuracy of the bad local minima, while the red color bar is for the number of training epochs. All blue dots for SGD with constant momentum and SRSGD achieve near-perfect train accuracy but test accuracy below 59%. All blue dots for SGD achieve an average train accuracy of 73.11%, also with test accuracy below 59%. The final iterates (yellow stars) of SGD, SGD with constant momentum, and SRSGD achieve 73.13%, 99.25%, and 100.0% test accuracy, respectively.
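For reference, the PCA part of such a visualization can be sketched in a few lines of NumPy; the paper also uses t-SNE, which this sketch does not cover, and the function name and shapes are our own illustrative choices.

```python
import numpy as np

def pca_embed(X, k=2):
    """Project the rows of X (e.g. flattened iterates, one per row)
    onto their top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # (n_samples, k) embedding
```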

    def update_iter(self):
        idx = 1
        for group in self.param_groups:
            if idx == 1:
                group['iter_count'] += 1
                if group['iter_count'] >= group['restarting_iter']:
                    group['iter_count'] = 1  # scheduled restart: reset momentum
            idx += 1
        return group['iter_count'], group['restarting_iter']

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates
                the model and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            weight_decay = group['weight_decay']
            # NAG-style increasing momentum (k - 1) / (k + 2)
            momentum = (group['iter_count'] - 1.) / (group['iter_count'] + 2.)
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay != 0:
                    d_p.add_(weight_decay, p.data)  # in-place weight decay
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    buf0 = param_state['momentum_buffer'] = torch.clone(p.data).detach()
                else:
                    buf0 = param_state['momentum_buffer']
                buf1 = p.data - group['lr'] * d_p
                p.data = buf1 + momentum * (buf1 - buf0)
                param_state['momentum_buffer'] = buf1
        iter_count, iter_total = self.update_iter()
        return loss

K SRSGD IMPLEMENTATION IN KERAS

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.optimizers import Optimizer
from keras.legacy import interfaces

if K.backend() == 'tensorflow':
    import tensorflow as tf

class SRSGD(Optimizer):
    """Scheduled Restart Stochastic gradient descent optimizer.

    Includes support for Nesterov momentum and learning rate decay.

    # Arguments
        learning_rate: float >= 0. Learning rate.
    """

                dtype=K.dtype(self.decay))))
        # momentum
        shapes = [K.int_shape(p) for p in params]
        moments = [K.variable(value=K.get_value(p), dtype=K.dtype(self.decay),
                              name='moment_' + str(i))
                   for (i, p) in enumerate(params)]
        self.weights = [self.iterations] + moments + [self.iter_count]
        for p, g, m in zip(params, grads, moments):
            v = p - lr * g
            new_p = v + momentum * (v - m)
            self.updates.append(K.update(m, v))
            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)
            self.updates.append(K.update(p, new_p))
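Both listings implement the same update rule. Stripped of framework plumbing, it can be sketched in plain Python on a 1-D quadratic f(w) = 0.5 w^2 (so the gradient is simply w); all names and defaults here are illustrative, not part of the released code.

```python
def srsgd_quadratic(w0=1.0, lr=0.1, F=40, steps=200):
    """Run SRSGD on f(w) = 0.5 * w**2 with restarting frequency F."""
    w, buf_prev, k = w0, w0, 1
    for _ in range(steps):
        momentum = (k - 1.0) / (k + 2.0)  # NAG-style increasing momentum
        grad = w                          # gradient of 0.5 * w**2
        buf = w - lr * grad               # plain gradient step
        w = buf + momentum * (buf - buf_prev)
        buf_prev = buf
        k += 1
        if k >= F:                        # scheduled restart: reset momentum
            k = 1
    return w
```

With F = 2 the momentum is always zero and the iteration reduces to plain gradient descent; with large F it behaves like NAG between restarts.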

Classification test error (%) on CIFAR10 using SGD, SGD + NM, and SRSGD. We report the results of SRSGD with two restarting schedules: linear (lin) and exponential (exp). The numbers of iterations after which we restart the momentum in the lin schedule are 30, 60, 90, 120 for the 1st, 2nd, 3rd, and 4th stages. Those numbers for the exp schedule are 40, 50, 63, 78. We include the reported results from (He et al., 2016b) (in parentheses) in addition to our reproduced results.



Comparison of single crop validation errors on ImageNet (%) between SRSGD training with fewer epochs and SGD training with the full 90 epochs.

Restarting frequencies for CIFAR10 and CIFAR100 experiments

Restarting frequencies for ImageNet experiments: ImageNet, linear schedule, F1 = 40, F2 = 80, F3: linearly decayed from 80 to 1 in the last 30 epochs.

D.3 TRAINING IMAGENET IN FEWER EPOCHS

Table 9 contains the learning rate and restarting frequency schedules for our experiments on training ImageNet in fewer epochs, i.e. the reported results in Table 6 in the main text. Other settings are the same as in the full-training ImageNet experiments described in Section D.2 above.

Learning rate and restarting frequency schedule for ImageNet short training, i.e. Table 6 in the main text. Decrease the learning rate by a factor of 10 at the 30th and 56th epochs. Train for a total of 80 epochs.

75 yields better performance in short training. All reported results are averaged over 5 random seeds.

E SRSGD VS. SGD AND SGD + NM ON IMAGENET CLASSIFICATION AND OTHER TASKS

E.1 COMPARING WITH SGD WITH NESTEROV MOMENTUM ON IMAGENET CLASSIFICATION

In this section, we compare SRSGD with SGD with constant Nesterov momentum (SGD + NM) in training ResNets for ImageNet classification. All hyper-parameters of SGD with constant Nesterov momentum used in our experiments are the same as those of SGD described in Section D.2. We list the results in Table 10. Again, SRSGD remarkably outperforms SGD + NM in training ResNets for ImageNet classification, and as the network goes deeper, the improvement becomes more significant.

E.2 LONG SHORT-TERM MEMORY (LSTM) TRAINING FOR PIXEL-BY-PIXEL MNIST

In this task, we examine the advantage of SRSGD over SGD and SGD with Nesterov momentum in training recurrent neural networks. In our experiments, we use an LSTM with different numbers of hidden units (128, 256, and 512) to classify samples from the well-known MNIST dataset LeCun &
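The pixel-by-pixel permuted MNIST setup used here can be sketched as follows: each 28 × 28 image is flattened to a length-784 sequence and a single fixed random permutation is applied to every sequence before the LSTM reads one pixel per time step. The function name and the seed are our assumptions.

```python
import numpy as np

def make_pmnist(images, seed=0):
    """Flatten images to pixel sequences and apply one fixed permutation."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(28 * 28)          # one fixed permutation for all data
    flat = images.reshape(len(images), -1)   # (N, 784) pixel sequences
    return flat[:, perm]                     # LSTM reads one pixel per time step
```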


Test errors (%) on permuted MNIST of LSTMs trained with SGD, SGD + NM, and SRSGD. The LSTM model has 128 hidden units. In all experiments, we use an initial learning rate of 0.01, which is reduced by a factor of 10 at epochs 200 and 300. All models are trained for 350 epochs. The momentum for SGD and SGD + NM is set to 0.9. The restart schedule in SRSGD is set to 90, 30, and 90.


The restart schedule changes at epochs 200 and 300. In all experiments, we use a batch size of 64. Our code is based on the code from the PyTorch WGAN-GP GitHub repository.

Discriminator loss (i.e. Earth Mover's distance estimate) of the WGAN with gradient penalty trained on MNIST with SGD, SGD + NM, and SRSGD. In all experiments, we use an initial learning rate of 0.01, which is reduced by a factor of 10 at epochs 200 and 300. All models are trained for 350 epochs. The momentum for SGD and SGD + NM is set to 0.9. The restart schedule in SRSGD is set to 60, 120, and 180.

SRSGD is still better than both baselines: it achieves a smaller discriminator loss, i.e. Earth Mover's distance estimate, and converges faster than SGD with standard and Nesterov constant momentum. We summarize our results in Table 11 and Figure 7. We also show the digits generated by the trained WGANs in Figure 8. By visual evaluation, we observe that samples generated by the WGAN trained with SRSGD look slightly better than those generated by the WGANs trained with SGD with standard and Nesterov constant momentum.
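The Earth Mover's (Wasserstein-1) distance that the discriminator loss estimates has a simple closed form in one dimension, which may help build intuition: for two equal-size empirical samples it is the mean absolute difference of their sorted values. This is an illustrative calculation, not the WGAN critic itself; the function name is ours.

```python
def emd_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    assert len(xs) == len(ys)
    # optimal 1-D transport matches sorted values to sorted values
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```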

Table 12 contains the learning rate schedule for each number of epoch reduction in

Table 13 and Table 14 provide detailed test errors vs. the number of training epoch reductions reported in Figure 4 and Figure 9. We also conduct an

Learning rate (LR) schedule for the ablation study of error rate vs. reduction in training epochs for the CIFAR10 experiments, i.e. Figure 4 in the main text, and for the CIFAR100 experiments, i.e. Figure 11 in this Appendix.

Classification test error (%) of SGD short training (100 epochs), SGD full training (200 epochs), SRSGD short training (100 epochs), and SRSGD full training (200 epochs) on CIFAR10. SGD short training yields much worse test errors than the others while SRSGD short training yields either comparable or even better results than SGD full training.

Classification test error (%) of SGD short training (100 epochs), SGD full training (200 epochs), SRSGD short training (100 epochs), and SRSGD full training (200 epochs) on CIFAR100. SGD short training yields worse test errors than the others, while SRSGD short training yields comparable or even better results than SGD full training.

For a better comparison between SRSGD training with fewer epochs and SGD full training, we also conduct experiments with SGD training with fewer epochs on CIFAR10 and CIFAR100. Tables 14 and 15 compare SRSGD short training with 100 epochs, SGD short training with 100 epochs, SRSGD full training with 200 epochs, and SGD full training with 200 epochs for Pre-ResNet-110, 290, and 470 on CIFAR10 and CIFAR100, respectively. The learning rate schedule for SGD short training with 100 epochs is the same as the learning rate schedule for SRSGD short training with 100 epochs given in Section 4 and in Table

Table 16 and Table 17 provide detailed test errors vs. the number of training epoch reductions reported in Figure

Figure 13: Training loss (left) and test error (right) of Pre-ResNet-290 trained on CIFAR10 with different initial restarting frequencies F1 (linear schedule). SRSGD with small F1 approximates SGD without momentum, while SRSGD with large F1 approximates NASGD. The training loss curve and test accuracy of NASGD are shown in red and confirm the result of Theorem 1 that NASGD accumulates error due to the stochastic gradients.

Figure 14: Training loss and test error of Pre-ResNet-290 trained on CIFAR100 with different initial restarting frequencies F1 (linear schedule). SRSGD with small F1 approximates SGD without momentum, while SRSGD with large F1 approximates NASGD. The training loss curve and test accuracy of NASGD are shown in red and confirm the result of Theorem 1 that NASGD accumulates error due to the stochastic gradients.

FULL TRAINING WITH FEWER EPOCHS AT THE INTERMEDIATE LEARNING RATES

We explore SRSGD full training (200 epochs on CIFAR and 90 epochs on ImageNet) with fewer epochs at the intermediate learning rates and report the results in Tables 19, 20, and 21 and Figures 16, 17, and 18 below. The settings and implementation details here are similar to those in Section F, but using all 200 epochs for the CIFAR experiments and 90 epochs for the ImageNet experiments.

Figure 16: Test error when using new learning rate schedules with fewer training epochs at the 2nd and 3rd learning rates for CIFAR10. We still train for the full 200 epochs in this experiment. On the x-axis, 10, for example, means we reduce the number of training epochs by 10 at each intermediate learning rate, i.e. the 2nd and 3rd learning rates. The dashed lines are the test errors of the SGD baseline.
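The approximation claim in the captions above can be checked numerically: under restarting frequency F, the NAG momentum (k - 1)/(k + 2) never exceeds (F - 2)/(F + 1), so a small F keeps the momentum near zero (close to SGD without momentum), while a large F lets it approach 1 (close to NASGD). A hypothetical helper, with names of our choosing:

```python
def max_momentum(F):
    """Largest NAG momentum value reached before a restart at count F."""
    # iteration counts cycle through k = 1, ..., F - 1 before resetting
    return max((k - 1.0) / (k + 2.0) for k in range(1, F))
```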

Test error when using new learning rate schedules with fewer training epochs at the 2nd and 3rd learning rates for CIFAR10. We still train for the full 200 epochs in this experiment. In the table, 80-90-100, for example, means we reduce the learning rate by a factor of 10 at the 80th, 90th, and 100th epochs.


Here, we fix the initial restarting frequency F1 = 30 for all training runs. Increasing the restarting frequency during training yields better results than decreasing it, but increasing the restarting frequency too fast and by too much also diminishes the performance of SRSGD.

G.2 IMPACT OF THE GROWTH RATE r

We conduct an ablation study of the growth rate r to understand its impact on the behavior of SRSGD. As a case study, we train Pre-ResNet-110 on CIFAR10 using SRSGD with a linear schedule for the restarting frequency. We fix the initial restarting frequency F1 = 30 and vary the growth rate r, chosen from the set {0.7, 1.0, 2.0, 10.0}. These values of r represent four different scenarios. When r = 0.7, the restarting frequency decreases every time the learning rate is reduced by a factor of 10. When r = 1.0, the restarting frequency stays constant during training. When r = 2.0, the restarting frequency increases every time the learning rate is reduced by a factor of 10. Finally, r = 10.0 is similar to r = 2.0, but the restarting frequency increases much faster and to larger values. Figure 12 summarizes the results of our ablation study. We observe that for CIFAR10, decreasing the restarting frequency or keeping it constant during training yields worse results than increasing it. However, increasing the restarting frequency too much also diminishes the performance of SRSGD.

Table 21: Top-1 single crop validation error when using new learning rate schedules with fewer training epochs at the 2nd learning rate for ImageNet. We still train for the full 90 epochs in this experiment. In the table, 30-40, for example, means we reduce the learning rate by a factor of 10 at the 30th and 40th epochs.
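The per-stage frequencies in this ablation can be sketched as F1 scaled by r at each learning rate reduction, i.e. the frequency at stage s is F1 · r^(s-1). The function name and the rounding rule are our assumptions.

```python
def stage_frequencies(F1=30, r=2.0, stages=3):
    """Restarting frequency for each learning rate stage, grown by rate r."""
    return [max(1, round(F1 * r ** (s - 1))) for s in range(1, stages + 1)]
```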

