STOCHASTIC OPTIMIZATION WITH NON-STATIONARY NOISE: THE POWER OF MOMENT ESTIMATION

Anonymous

Abstract

We investigate stochastic optimization under weaker assumptions on the noise distribution than those used in standard analyses. Our assumptions are motivated by empirical observations in neural network training. In particular, standard results on optimal convergence rates for stochastic optimization assume either that there exists a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses. These assumptions do not match the empirical behavior of optimization algorithms used in neural network training, where the noise level in stochastic gradients can even increase with time. We address this nonstationary behavior of noise by analyzing the convergence rates of stochastic gradient methods subject to a changing second moment (or variance) of the stochastic oracle. When the noise variation is known, we show that it is always beneficial to adapt the step size and exploit the noise variability. When the noise statistics are unknown, we obtain similar improvements by developing an online estimator of the noise level, thereby recovering close variants of RMSProp (Tieleman and Hinton, 2012). Consequently, our results reveal why adaptive step size methods can outperform SGD, while still enjoying theoretical guarantees.

1. INTRODUCTION

Stochastic gradient descent (SGD) is one of the most popular optimization methods in machine learning because of its computational efficiency compared to traditional full gradient methods. Great progress has been made in understanding the performance of SGD under different smoothness and convexity conditions (Agarwal et al., 2009; Arjevani et al., 2019; Drori and Shamir, 2019; Ghadimi and Lan, 2012; 2013; Nemirovsky and Yudin, 1983; Rakhlin et al., 2012). These results show that with a fixed step size, SGD achieves the minimax optimal convergence rate for both convex and nonconvex optimization problems, provided the gradient noise is uniformly bounded. Yet, despite the theoretical minimax optimality of SGD, adaptive gradient methods (Duchi et al., 2011; Kingma and Ba, 2014; Tieleman and Hinton, 2012) have become the methods of choice for training deep neural networks, and have received a surge of attention recently (Agarwal et al., 2018; Chen et al., 2019; Huang et al., 2019; Levy, 2017; Levy et al., 2018; Li and Orabona, 2019; Liu et al., 2019; 2020; Ma and Yarats, 2019; Staib et al., 2019; Ward et al., 2019; Zhang et al., 2019; 2020; Zhou et al., 2018; 2019; Zou and Shen, 2018; Zou et al., 2019). Instead of using fixed stepsizes, these methods construct their stepsizes adaptively from the current and past gradients. Despite advances in the literature on adaptivity, theoretical understanding of the benefits of adaptation remains very limited. We provide a different perspective on the benefits of adaptivity by considering it in the context of non-stationary gradient noise, i.e., noise whose intensity varies across iterations. Surprisingly, this setting is rarely studied, even for SGD. To our knowledge, this paper is the first to formally study stochastic gradient methods in this varying noise scenario. Our main goal is to show that: Adaptive step-sizes can guarantee faster rates than SGD when the noise is non-stationary.
We focus on this goal based on several empirical observations (Section 2), which lead us to model the noise of stochastic gradient oracles via the following iteration dependent quantities:

$$m_k^2 := \mathbb{E}[\|g(x_k)\|^2], \qquad \text{or} \qquad \sigma_k^2 := \mathbb{E}[\|g(x_k) - \nabla f(x_k)\|^2], \tag{1}$$

where $g(x_k)$ is the stochastic gradient and $\nabla f(x_k)$ the true gradient at iteration $k$. Notation (1) provides a more fine-grained description of the noise behavior than uniform bounds on the variance, by permitting iteration dependent noise intensities. It is intuitive that one should prefer smaller stepsizes when the noise is large, and vice versa. Thus, under non-stationarity, an ideal algorithm should adapt its stepsize to the parameters $m_k$ or $\sigma_k$, suggesting a potential benefit of adaptive stepsizes.

Figure 1: We observe that the magnitude of these quantities changes significantly as the iteration count increases, ranging from 10 times (ResNet) to $10^6$ times (Transformer). This phenomenon motivates us to consider a setting with non-stationary noise.

Contributions. The primary contribution of our paper is to show that a stochastic optimization method with adaptive stepsizes can achieve a faster rate of convergence (by a factor polynomial in $T$) than fixed-step SGD. We first analyze an idealized setting where the noise intensities are known, using it to illustrate how to select noise dependent stepsizes that are provably more effective (Theorem 1). Next, we study the case of unknown noise, where we show, under an appropriate smoothness assumption on the noise variation, that a variant of RMSProp (Tieleman and Hinton, 2012) achieves the idealized convergence rate (Theorem 3). Remarkably, this variant does not require the noise levels. Finally, we generalize our results to nonconvex settings (Theorems 12 and 14).

2. MOTIVATING OBSERVATION: NONSTATIONARY NOISE IN DEEP LEARNING

Neural network training involves optimizing an empirical risk minimization problem of the form $\min_x f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i$ represents the loss function with respect to the $i$-th data point or minibatch. Stochastic methods optimize this objective by randomly sampling an incremental gradient $\nabla f_i$ at each iteration and using it as an unbiased estimate of the full gradient. The noise intensity of this stochastic gradient is measured by its second moment or variance, defined as:

1. Second moment: $m^2(x) = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x)\|^2$;
2. Variance: $\sigma^2(x) = \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f(x)\|^2$,

where $\nabla f(x)$ is the full gradient. To illustrate how these quantities evolve over iterations, we empirically evaluate them on three popular neural network training tasks: ResNet18 on the CIFAR10 dataset for image classification, an LSTM on the PTB dataset for language modelling, and a Transformer on WMT16 En-De for language translation. The results are shown in Figure 1, where both the second moments and the variances are evaluated using the default training procedure of the original code. On one hand, the variation of the second moment/variance has a very different shape in each of the considered tasks. In the CIFAR experiment, the noise intensity is quite steady after the first iteration, indicating fast convergence of the training model. In LSTM training, the noise level increases and converges to a threshold. In Transformer training, the noise level increases very fast in the early epochs, reaches a maximum, and then decreases gradually. On the other hand, the preferred optimization algorithms in these tasks also differ. For CIFAR10, SGD with momentum is the most popular choice, while for language models, adaptive methods such as Adam or RMSProp are the rule of thumb. This discrepancy is usually taken for granted, based on empirical validation; little theoretical understanding of it exists in the literature.
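As a concrete reference, the two statistics above can be computed from a batch of per-example gradients. The following numpy sketch (the function name and array layout are our own choices) computes both and illustrates the identity $m^2(x) = \sigma^2(x) + \|\nabla f(x)\|^2$ that reappears in Section 5.

```python
import numpy as np

def noise_statistics(per_example_grads):
    """Estimate the second moment m^2(x) and variance sigma^2(x) of the
    stochastic gradient from per-example gradients.

    per_example_grads: array of shape (n, d); row i holds grad f_i(x).
    Returns (second moment, variance).
    """
    g = np.asarray(per_example_grads, dtype=float)
    full_grad = g.mean(axis=0)                          # grad f(x)
    m2 = float(np.mean(np.sum(g ** 2, axis=1)))         # (1/n) sum ||grad f_i(x)||^2
    sigma2 = float(np.mean(np.sum((g - full_grad) ** 2, axis=1)))
    return m2, sigma2
```

In practice one would log these two numbers once per epoch, as in Figure 1; the decomposition $m^2 = \sigma^2 + \|\nabla f\|^2$ holds exactly for any batch.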
Based on the observations in Figure 1, a natural candidate emerges to explain this discrepancy in the choice of algorithms: the performance of stochastic algorithms varies according to the characteristics of the gradient noise encountered during training. Despite this behavior, noise level modeling has drawn surprisingly limited attention in prior work. Moulines and Bach (2011) study the convergence of SGD assuming each component function is convex and smooth; extensions to the variation of the full covariance matrix appear in (Gadat and Panloup, 2017). A more fine-grained stochastic oracle assumes that the variance grows with the gradient norm as $\sigma^2 + c\|\nabla f(x)\|^2$, or with the suboptimality as $\sigma^2 + c\|x - x^*\|^2$ (Bottou et al., 2018; Jofré and Thompson, 2019; Rosasco et al., 2019). Unfortunately, these existing oracles fail to express the variation of noise observed in Figure 1. Indeed, the norm of the full gradient, represented by the difference between the orange and the blue lines, is significantly smaller than the noise level. This suggests that the noise variation is not due to the gradient norm, but to implicit properties of the objective function. This observation motivates us to introduce the following non-stationary noise oracle.

Definition 1 (non-stationary noise oracle). The stochasticity of the problem is governed by a sequence of second moments $\{m_k\}_{k\in\mathbb{N}}$ or variances $\{\sigma_k\}_{k\in\mathbb{N}}$, such that at the $k$-th iteration the gradient oracle returns an unbiased gradient $g(x_k)$ with $\mathbb{E}[g(x_k)] = \nabla f(x_k)$, and either
(a) second moment $\mathbb{E}[\|g(x_k)\|^2] = m_k^2$; or
(b) variance $\mathbb{E}[\|g(x_k) - \nabla f(x_k)\|^2] = \sigma_k^2$.

The non-stationary noise oracle is a relaxation of the standard uniform noise oracle, in which $m_k$ or $\sigma_k$ are constant. By introducing the time dependency, we aim to understand how the variation of the noise influences the convergence rate of optimization algorithms.
An example that falls into this category is additive gradient noise, namely $g(x_k) \sim \nabla f(x_k) + \mathcal{N}(0, \sigma_k^2)$. We emphasize that our goal is to demystify the correlation between the noise intensity and the performance of optimization algorithms, rather than to explain why a certain noise shape occurs. In general, the variation in noise is a consequence of the combination of data distribution, training model, and optimization method, which is complex and highly nontrivial. We simplify it by assuming that the noise intensity is decoupled from its location, meaning that the parameters $m_k$ or $\sigma_k$ depend only on the iteration number $k$, and not on the specific location where the gradient is evaluated. This is empirically justified, as the pattern of the noise is mostly determined by the task rather than the optimization algorithm; see Appendix A. The simplification helps us focus on the shape of the noise, taking a first step towards the goal: characterize the convergence rate of adaptive algorithms under non-stationary noise.
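For simulations, an additive-noise oracle of this kind can be built as follows (a minimal sketch; the function and parameter names are ours, and `grad_f` stands for any deterministic gradient):

```python
import numpy as np

def make_additive_oracle(grad_f, sigma_schedule, seed=0):
    """Build an oracle g(x, k) = grad_f(x) + sigma_k * N(0, I): unbiased,
    with a noise level sigma_k that depends only on the iteration k."""
    rng = np.random.default_rng(seed)
    def oracle(x, k):
        x = np.asarray(x, dtype=float)
        return grad_f(x) + sigma_schedule[k] * rng.standard_normal(x.shape)
    return oracle
```

By construction the noise level is decoupled from the iterate: it is read off the schedule `sigma_schedule[k]`, never from `x`.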

3. THE BENEFIT OF ADAPTIVITY UNDER NONSTATIONARY NOISE

In this section, we investigate the influence of nonstationary noise in an idealized setting where the noise parameters $m_k$ are known. For brevity, we first focus on the convex setting and present our results based on the second moment parameters $m_k$; we defer the discussion of nonconvex problems and variants based on the variance parameters $\sigma_k$ to Section 5. One reason we prioritize the second moment over the variance is to draw a connection with the well-known adaptive methods RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014). A common feature shared by both algorithms is that they scale the step sizes inversely to an exponential moving average of estimated second moments. Below, we start from the idealized case, assuming the second moments are known, and show why inverse scaling can speed up convergence.

Let $f$ be convex and differentiable. We consider the problem $\min_x f(x)$, where the gradient is given by the nonstationary noise oracle of Definition 1. We assume that the optimum is attained at $x^*$ and denote by $f^*$ the minimum of the objective. We are interested in the convergence rate of a stochastic algorithm with update rule

$$x_{k+1} = x_k - \eta_k g(x_k), \tag{2}$$

where the stepsizes $\eta_k$ are oblivious to the iterates $\{x_k\}_{k\in\mathbb{N}}$.

Theorem 1. Under the second moment oracle of Definition 1(a), the weighted average $\bar{x}_T = (\sum_{k=1}^T \eta_k x_k)/(\sum_{k=1}^T \eta_k)$ obtained by update rule (2) satisfies the suboptimality bound

$$\mathbb{E}[f(\bar{x}_T) - f^*] \le \frac{\|x_1 - x^*\|^2 + \sum_{k=1}^T \eta_k^2 m_k^2}{\sum_{k=1}^T \eta_k}. \tag{3}$$

Although the theorem follows from standard analysis, it leads to valuable results, as explained below.

Corollary 2. Denote $M = \max_k m_k$ and $R = \|x_1 - x^*\|$. We have the following convergence rates:

1. SGD with constant stepsize: if $\eta_k = \eta = \frac{R}{\sqrt{\sum_{k=1}^T m_k^2}}$, then
$$\mathbb{E}[f(\bar{x}_T) - f^*] \le \frac{2R\sqrt{\sum_{k=1}^T m_k^2}}{T} = \frac{2RM}{\sqrt{T}} \cdot \sqrt{\frac{1}{T}\sum_{k=1}^T \frac{m_k^2}{M^2}}. \quad \text{(constant baseline)}$$

2.
SGD with idealized stepsize: if $\eta_k = \frac{R}{\sqrt{T}\, m_k}$, then
$$\mathbb{E}[f(\bar{x}_T) - f^*] \le \frac{2R\sqrt{T}}{\sum_{k=1}^T \frac{1}{m_k}} = \frac{2RM}{\sqrt{T}} \cdot \frac{1}{M \cdot \frac{1}{T}\sum_{k=1}^T \frac{1}{m_k}}. \quad \text{(idealized baseline)}$$

To facilitate comparison, we have normalized the convergence rates with respect to the conventional rate $2RM/\sqrt{T}$ (Nemirovski et al., 2009). In the standard setting, the values of $m_k$ are unavailable but the uniform bound $M$ is known; in that case taking $m_k = M$ recovers the standard result. When the values of $m_k$ are given, both the constant baseline and the idealized baseline benefit, improving upon the conventional rate $2RM/\sqrt{T}$. The improvement factor in the constant baseline depends on the average of the second moments, $\frac{1}{T}\sum_k m_k^2$, whereas the improvement factor in the idealized baseline depends on the harmonic average $\frac{1}{T}\sum_k \frac{1}{m_k}$. In particular, Jensen's inequality $(\mathbb{E}[X])^{-2} \le \mathbb{E}[X^{-2}]$ applied to $X$ uniform over $\{1/m_k\}$ gives
$$\Big(\frac{1}{T}\sum_k \frac{1}{m_k}\Big)^{-2} \le \frac{1}{T}\sum_k m_k^2,$$
implying that the idealized baseline is always at least as good as the constant baseline. This result is rather expected, as the stepsizes are adapted to the noise intensity. As a consequence, the accumulations of the parameters $m_k$, in their different forms, govern the convergence rate. To further illustrate this difference, we consider an illustrative synthetic noise model, mimicking the shape of noise observed in Transformer training (see Figure 1(c)).

Example 1. Consider the piecewise-linear noise model displayed below, with $\gamma = 5(1 - T^{-\alpha})/T$. In this example, the maximum noise intensity $M$ is 1, and the minimum intensity is $1/T^{\alpha}$, inducing a large ratio of order $T^{\alpha}$. Following the bounds of Corollary 2, the constant baseline maintains the standard $O(1/\sqrt{T})$ convergence rate, whereas the idealized baseline converges at $O(1/T^{\frac{1}{2}+\alpha})$. Hence a nontrivial acceleration of order $T^{\alpha}$ is obtained by using the idealized stepsize, and this acceleration can be arbitrarily large as $\alpha$ increases.
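The Jensen comparison between the two baselines can be checked numerically; the sketch below (with an arbitrary positive sequence $m_k$ of our choosing) verifies that the harmonic mean of the $m_k$, which governs the idealized rate, never exceeds the root mean square, which governs the constant baseline.

```python
import numpy as np

# Arbitrary positive noise levels m_1, ..., m_T (illustrative, not from the paper).
rng = np.random.default_rng(0)
m = rng.uniform(0.1, 10.0, size=1000)

rms = np.sqrt(np.mean(m ** 2))          # governs the constant baseline
harmonic = 1.0 / np.mean(1.0 / m)       # governs the idealized baseline

# Jensen: (1/T sum 1/m_k)^(-2) <= (1/T) sum m_k^2, i.e. harmonic <= rms,
# so the idealized stepsize is never worse than the constant one.
assert harmonic <= rms
```

The gap between the two means is exactly what the idealized stepsize exploits: the more the $m_k$ vary, the larger the gap.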
This example is encouraging, showing that the speedup due to adaptive stepsizes can be polynomial in the number of iterations, especially when the ratio between the maximum and the minimum noise intensity is large. However, explicit knowledge of $m_k$ is required to implement these idealized stepsizes, which is unrealistic. The next section demonstrates that estimating the moment bound in an online fashion can achieve a convergence rate comparable to the idealized setting. For reference, the piecewise noise model of Example 1 is:

$$m_k = \begin{cases} T^{-\alpha} & \text{if } k \in [1, \tfrac{T}{5}];\\ \gamma(k - \tfrac{2T}{5}) + 1 & \text{if } k \in (\tfrac{T}{5}, \tfrac{2T}{5}];\\ 1 & \text{if } k \in (\tfrac{2T}{5}, \tfrac{3T}{5}];\\ \gamma(\tfrac{3T}{5} - k) + 1 & \text{if } k \in (\tfrac{3T}{5}, \tfrac{4T}{5}];\\ T^{-\alpha} & \text{if } k \in (\tfrac{4T}{5}, T]. \end{cases}$$
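For experiments, the piecewise noise model of Example 1 can be generated directly from its definition (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def example1_noise(T, alpha):
    """m_k from Example 1: low plateau, linear ramp up, high plateau,
    linear ramp down, low plateau. gamma = 5(1 - T^-alpha)/T makes the
    ramps continuous between T^-alpha and 1."""
    gamma = 5 * (1 - T ** (-alpha)) / T
    k = np.arange(1, T + 1, dtype=float)
    m = np.full(T, T ** (-alpha))
    up = (k > T / 5) & (k <= 2 * T / 5)
    mid = (k > 2 * T / 5) & (k <= 3 * T / 5)
    down = (k > 3 * T / 5) & (k <= 4 * T / 5)
    m[up] = gamma * (k[up] - 2 * T / 5) + 1
    m[mid] = 1.0
    m[down] = gamma * (3 * T / 5 - k[down]) + 1
    return m
```

As stated in the example, the maximum intensity is $1$ and the minimum is $T^{-\alpha}$, so the max/min ratio grows as $T^{\alpha}$.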

4. ADAPTIVE METHODS: ONLINE ESTIMATION OF MOMENTS

From now on, we assume that the moment bounds $m_k$ are not given. To address the non-stationarity, we estimate the noise intensity with an exponential moving average, a technique commonly used in adaptive methods. More precisely, the moment estimator $\hat{m}_k$ is constructed recursively as

$$\hat{m}_{k+1}^2 = \beta \hat{m}_k^2 + (1 - \beta)\|g_k\|^2, \quad \text{(ExpMvAvg)}$$

where $g_k$ is the $k$-th stochastic gradient and $\beta$ is the decay parameter. We then choose the stepsizes inversely proportional to $\hat{m}_k$, leading to Algorithm 1.

Algorithm 1 Adaptive SGD $(x_1, T, c, m)$
1: Initialize $\hat{m}_1^2 = \|g(x_1)\|^2$.
2: for $k = 1, 2, \ldots, T$ do
3: Evaluate the stochastic gradient $g_k$ at $x_k$.
4: $x_{k+1} = x_k - \eta_k g_k$ with $\eta_k = \frac{c}{\hat{m}_k + m}$.
5: $\hat{m}_{k+1}^2 = \beta \hat{m}_k^2 + (1 - \beta)\|g_k\|^2$.
6: end for
7: return $\bar{x}_T = (\sum_{i=1}^T \eta_i x_i)/(\sum_{i=1}^T \eta_i)$.

Algorithm 1 can be viewed as a "norm" version of RMSProp (Tieleman and Hinton, 2012): the exponential moving average is performed coordinate-wise in RMSProp, whereas we use the full norm of $g_k$ to update the moment estimator $\hat{m}_{k+1}$. Such a simplification via a full-norm variant has also been analyzed in the uniformly bounded noise setting (Levy, 2017; Levy et al., 2018; Li and Orabona, 2019; Ward et al., 2019); we leave investigation of the more advanced coordinate-wise version as a topic of future research. Another important component of the stepsize is the correction constant $m$ appearing in the denominator. This constant provides a safety threshold when $\hat{m}_k$ underestimates $m_k$; such a term is commonly used in practical implementations of adaptive methods, and even beyond, in reinforcement learning as the so-called exploration bonus (Azar et al., 2017; Jin et al., 2018a; Strehl and Littman, 2008). To show convergence of the algorithm, we need to impose a regularity assumption on the sequence of noise intensities; otherwise, previous estimates may not provide any information about the next one.

Assumption 1. We assume that an upper bound $M$ on $m_k$ is given, i.e.
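A direct transcription of Algorithm 1 might look as follows (a numpy sketch under our own naming; `grad_oracle` stands for any unbiased stochastic gradient, and `m_corr` for the correction constant $m$):

```python
import numpy as np

def adaptive_sgd(grad_oracle, x1, T, c, m_corr, beta):
    """Sketch of Algorithm 1 ("norm" RMSProp): stepsize c / (m_hat_k + m_corr),
    where m_hat_k^2 is an exponential moving average of ||g_k||^2."""
    x = np.asarray(x1, dtype=float).copy()
    g0 = np.asarray(grad_oracle(x), dtype=float)
    m2_hat = float(g0 @ g0)                        # m_hat_1^2 = ||g(x_1)||^2
    etas, iterates = [], []
    for _ in range(T):
        g = np.asarray(grad_oracle(x), dtype=float)
        eta = c / (np.sqrt(m2_hat) + m_corr)       # eta_k = c / (m_hat_k + m)
        etas.append(eta)
        iterates.append(x.copy())                  # x_k enters the weighted average
        x = x - eta * g
        m2_hat = beta * m2_hat + (1 - beta) * float(g @ g)   # (ExpMvAvg)
    etas = np.asarray(etas)
    # return the eta-weighted average of the iterates x_1, ..., x_T
    return (etas[:, None] * np.asarray(iterates)).sum(axis=0) / etas.sum()
```

Note that the stepsize for iteration $k$ is computed before the estimator is updated with $\|g_k\|^2$, exactly as in lines 4-5 of Algorithm 1.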
$\max_k m_k \le M$, and that:
(a) the fourth moment of $g_k$ is bounded by $M^4$, namely $\mathbb{E}[(\|g_k\|^2 - m_k^2)^2] \le M^4$ for all $k$;
(b) the total variation of $m_k$ is bounded, namely $\sum_k |m_k^2 - m_{k+1}^2| \le D^2$ with $D^2 = \Omega(M^2)$.

The bounded fourth moment ensures concentration of $\|g_k\|^2$, which is necessary to guarantee the quality of the online estimator. In particular, this assumption is satisfied when $g_k$ follows an $m_k$ sub-Gaussian distribution. Though stronger than bounded variance, this assumption does not lead to better convergence rates for SGD, as many existing results (including lower bounds) require sub-Gaussian or bounded noise; see Agarwal et al. (2009); Ghadimi and Lan (2013). The bounded total variation should be viewed as a regularity condition on the noise sequence. It is motivated by, and common in, the dynamic online learning literature (Besbes et al., 2014; Jadbabaie et al., 2015; Mokhtari et al., 2016). A key aspect of it is to rule out infinite oscillation, such as the pathological setting where $m_{2k} = 1$ and $m_{2k+1} = M$, in which case the total variation scales with the number of iterations $T$. The specific constant in $D^2 = \Omega(M^2)$ depends on the shape of the noise. When $m_k$ is increasing in the first half and decreasing in the second half, as in the Transformer experiments and Example 1, the total variation is bounded by $D^2 \le 2M^2$. More generally, if the noise can be decomposed into $K$ piecewise monotone fragments, then the bound $D^2 \le KM^2$ holds. With the above assumptions, we are ready to present our convergence result.

Theorem 3. Under Assumption 1, with probability at least $1/2$, the iterates generated by Algorithm 1 with parameters
$$\beta = 1 - 2T^{-2/3}, \qquad m = 4\sqrt{D^2 + M^2}\; T^{-\frac{1}{9}} \ln(T)^{\frac{1}{2}}, \qquad c = \frac{R}{\sqrt{T}}$$
satisfy
$$f(\bar{x}_T) - f^* \le \frac{2RM}{\sqrt{T}} \cdot \frac{32}{M \cdot \frac{1}{T}\sum_{k=1}^T \frac{1}{m_k + m}}.$$

Remark 4. Our result directly implies a $1-\delta$ high-probability convergence rate, obtained by restarting the algorithm $2\log(1/\delta)$ times.
An additional $\log(1/\delta)$ dependency is introduced in the complexity, as in standard high-probability results (Fang et al., 2018; Jin et al., 2018b; Nemirovski et al., 2009). The key to proving the theorem is to effectively bound the estimation error $|\hat{m}_k^2 - m_k^2|$, relying on concentration and on the bounded variation in Assumption 1. In particular, the choice of the decay parameter $\beta$ is critical, as it determines how fast the contribution of past gradients decays. Because of the non-stationarity of the noise, the online estimator $\hat{m}_k$ is biased. The proposed choice of $\beta$ carefully balances the bias error and the variance error, leading to sublinear regret; see Appendix C. Due to the correction constant $m$, the obtained convergence rate depends inversely on $\sum_{k=1}^T \frac{1}{m_k + m}$ instead of the idealized $\sum_{k=1}^T \frac{1}{m_k}$. This additional term makes the comparison less straightforward, and we now discuss different scenarios to obtain a better understanding.

Table 1: Comparison of convergence rates under the noise model of Example 1.

|                                  | Constant                | Adaptive                                | Idealized                         |
|----------------------------------|-------------------------|-----------------------------------------|-----------------------------------|
| $0 \le \alpha \le \frac{1}{9}$   | $O(T^{-\frac{1}{2}})$   | $\tilde{O}(T^{-\frac{1+2\alpha}{2}})$   | $O(T^{-\frac{1+2\alpha}{2}})$     |
| $\frac{1}{9} < \alpha$           | $O(T^{-\frac{1}{2}})$   | $\tilde{O}(T^{-\frac{11}{18}})$         | $O(T^{-\frac{1+2\alpha}{2}})$     |

4.1. DISCUSSION OF THE CONVERGENCE RATE

To illustrate the difference between the convergence rates, we first consider the synthetic noise model introduced in Example 1. The detailed comparison is presented in Table 1, where we observe two regimes with respect to the exponent $\alpha$:
• When $0 \le \alpha \le \frac{1}{9}$, the rate of the adaptive algorithm matches the (idealized baseline) up to logarithmic factors, and is $T^{\alpha}$ faster than the (constant baseline).
• When $\frac{1}{9} < \alpha$, the adaptive rate no longer matches the (idealized baseline); nevertheless, it is always $T^{\frac{1}{9}}$ faster than the (constant baseline).
In both cases, the adaptive method achieves a non-trivial improvement, polynomial in $T$, over the (constant baseline). Even though the improvement $T^{\frac{1}{9}}$ might seem insignificant, this is the first result showing a plausible non-trivial advantage of adaptive methods over SGD under nonstationary noise. Further, note that the adaptive rate does not always match the (idealized baseline) when $\alpha$ is large. This discrepancy comes from the correction term $m$, which makes the stepsize more conservative than it should be, especially when $m_k$ is small.

The above comparison relies on the specific noise model of Example 1. We now formalize simple conditions that allow comparison in more general settings.

Corollary 5. If the ratio $M/(\min_k m_k) \le T^{\frac{1}{9}}$, then the adaptive method converges at the same rate as the (idealized baseline), up to logarithmic factors.

This result is remarkable since the adaptive method does not require any knowledge of the $m_k$ values, and yet achieves the idealized rate. In other words, the exponential moving average estimator successfully adapts to the variation in the noise, allowing faster convergence than constant stepsizes.

Corollary 6. Let $m_{\mathrm{avg}}^2 = \sum_{k=1}^T m_k^2/T$ be the average second moment. If $M/m_{\mathrm{avg}} \le T^{\frac{1}{9}}$, then the adaptive method is no slower than the (constant baseline), up to logarithmic factors.
The condition in Corollary 6 is strictly weaker than that in Corollary 5: even when an adaptive method does not match the idealized baseline, it can still be non-trivially better than the constant baseline. This happens, e.g., when $\alpha > \frac{1}{9}$ in Table 1, where the adaptive method is $O(T^{\frac{1}{9}})$ faster than the constant baseline. Indeed, $O(T^{\frac{1}{9}})$ is the maximum improvement one can expect under our current analysis.

Corollary 7. Recall that $M$ is an upper bound on $m_k$, i.e., $\max_k m_k \le M$. Therefore:
1. The convergence rate of the constant baseline is no slower than $O(2RM/\sqrt{T})$.
2. The convergence rate of the adaptive method is no faster than $\tilde{O}(2RM/T^{\frac{1}{2}+\frac{1}{9}})$.

The order of the maximum improvement, $O(T^{\frac{1}{9}})$, is determined by the specific choice of $m$ in Theorem 3, which is $\tilde{O}(MT^{-\frac{1}{9}})$. Indeed, the correction term is helpful when the estimator $\hat{m}_k$ underestimates the true value $m_k$, avoiding the singularity at zero. Hence, the choice of $m$ is related to the average deviation between $\hat{m}_k$ and $m_k$. Under a stronger concentration assumption, we can strengthen the maximum improvement to $O(T^{\frac{1}{6}})$, as shown in Appendix E. The noise model in Example 1 provides a favorable scenario where the maximum improvement is attained. In some scenarios, however, the convergence rate of the adaptive method can be slower than the constant baseline.

Adversarial scenario. If $m_k = 1/T^{\alpha}$ for all $k \in [1, T]$ except at $k = T/2$, where it takes the value $m_{T/2} = 1$, with $\alpha > 1/9$, then both the constant and idealized baselines converge at $O(T^{-\frac{1}{2}})$, while the adaptive method only converges at $\tilde{O}(T^{-\frac{1-2\alpha}{2}})$. The abrupt change at iteration $T/2$ inflates the exponential moving average estimator, which then requires a non-negligible period to return to the constant level; the estimator becomes less meaningful under such a change. Overall, it is hard to give a complete characterization of the variation in noise.
In Corollary 5 and 6, we show that when the ratio between the maximum and the minimum/average second moment is not growing too fast, adaptive methods do improve upon SGD.

5. EXTENSIONS OF THM 3

In this section, we discuss several extensions of Theorem 3. The results are nontrivial, but the analysis is almost the same; hence we defer the exact statements and proofs to the appendices.

Addressing the variance oracle. So far, we have focused on the noise oracle based on the second moment $m_k$ and drawn the connection with existing adaptive methods. However, there is some unnaturalness in the non-stationary oracle on $m_k$: it is hard to argue that $m_k$ is iterate independent, since $m_k^2 = \sigma_k^2 + \|\nabla f(x_k)\|^2$. Even though the influence of $\|\nabla f(x_k)\|^2$ may be minor when the variance $\sigma_k^2$ is high (e.g., as in Figure 1), it still changes $m_k$. In contrast, the variance $\sigma_k$ is an intrinsic quantity of the noise model, which can be iterate independent; hence the variance oracle is theoretically more sound. We now present the modifications needed to adapt to the variance oracle. First, to estimate the variance, we query two independent stochastic gradients $g_k$ and $g'_k$ at the same iterate, and construct the estimator by the recursion
$$\hat{\sigma}_{k+1}^2 = \beta \hat{\sigma}_k^2 + (1 - \beta)\|g_k - g'_k\|^2.$$
Second, a smoothness condition on $f$ is required, i.e., $L$-Lipschitzness of the gradient of $f$. In this case, it is necessary to ensure that the stepsize is no larger than $1/(2L)$, which translates into an additional constraint on the correction constant $m$. More precisely, the stepsize is given by $\eta_k = \frac{c}{\hat{\sigma}_k + m}$ with $m \ge 2cL$. Note that the $L$-smoothness condition is not required for the second moment oracle, which is why that oracle is more suitable for the nonsmooth setting (see Section 6.1 of Bubeck (2014)). A complete algorithm for the variance oracle is provided in Algorithm 2; the convergence results are essentially the same, with $m_k$ replaced by $\sigma_k$, see Appendix H.

Extension to the nonconvex setting. We also extend our analysis to the nonconvex smooth setting.
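One step of the two-query variance estimator can be sketched as below. Note that for independent queries $\mathbb{E}\|g_k - g'_k\|^2 = 2\sigma_k^2$, so the recursion tracks twice the variance; this constant factor (our observation, not stated in the theorem) can be absorbed into the stepsize constant $c$.

```python
import numpy as np

def variance_ema_step(sigma2_hat, g, g_prime, beta):
    """One update sigma2_{k+1} = beta * sigma2_k + (1 - beta) * ||g - g'||^2,
    where g and g' are two independent stochastic gradients at the same iterate."""
    diff = np.asarray(g, dtype=float) - np.asarray(g_prime, dtype=float)
    return beta * sigma2_hat + (1 - beta) * float(diff @ diff)
```

The stepsize rule would then be $\eta_k = c/(\hat{\sigma}_k + m)$ with $m \ge 2cL$, mirroring Algorithm 1 with $\hat{m}_k$ replaced by $\hat{\sigma}_k$.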
In this case, we characterize convergence with respect to the gradient norm $\|\nabla f(x_k)\|^2$, i.e., convergence to a stationary point. The conclusions are very similar to those in the convex setting, and the results (Theorems 12 and 14) are deferred to Appendix F.

Variants on stepsizes. To go beyond the second moment of the noise, one can apply an estimator of the form $\hat{m}_{k+1}^p = \beta \hat{m}_k^p + (1 - \beta)\|g_k\|^p$ when the $p$-th moment of the gradient is bounded. This allows stepsizes of the form $\eta_k \propto 1/(\hat{m}_k^p + m^p)^{1/p}$, as in Adam and Adamax (Kingma and Ba, 2014).

6. EXPERIMENTS

In this section, we describe two sets of experiments that verify the faster convergence of Algorithm 1 against vanilla SGD. The first experiment is on linear regression with synthetic noise described in Example 1 and the second set of experiments is on neural network training.

6.1. SYNTHETIC EXPERIMENTS

In the synthetic experiment, we generate a random linear regression dataset using the sklearn library. We design the stochastic oracle as the full gradient with injected Gaussian noise whose coordinate-wise intensity follows the noise model of Example 1. We observe that the performance is ranked as follows: idealized baseline, Algorithm 2, Algorithm 1, and the standard baseline.

6.2. NEURAL NETWORK TRAINING

We demonstrate how the proposed algorithm performs in real-world neural network training. We first test our algorithm on the CIFAR10 classification task, and then implement it in the AWD-LSTM codebase described in Merity et al. (2018). We see from Figure 3 that our proposed algorithm achieves slightly better performance than the baselines. Although our main contribution is the theoretical analysis of the fast convergence of adaptive methods with moment estimation, these results show that our analysis can also lead to efficient and practical algorithm design. Besides convergence, we also measured the noise level during neural network training with different optimizers, and found that the noise pattern is mostly determined by the learning task rather than the optimization algorithm. Details can be found in Appendix A.

7. CONCLUSIONS

This paper discusses convergence rates of stochastic gradient methods in an empirically motivated setting where the noise level changes over iterations. We show that, under mild assumptions, one can converge faster than fixed-step SGD by a factor polynomial in the number of iterations, by applying online noise estimation and adaptive step sizes. Our analysis therefore provides one explanation for the recent success of adaptive methods in neural network training. There is much more to be done along the line of non-stationary stochastic optimization. Under our current analysis, there is a gap between the adaptive method and the idealized method when the noise variation is large (see the second row of Table 1). A natural question is whether one can reduce this gap, or, alternatively, whether there is a threshold preventing the adaptive method from getting arbitrarily close to the idealized baseline. Moreover, could one attain further acceleration by combining momentum or coordinate-wise update techniques? Answering these questions would provide more insight into, and a better understanding of, widely used adaptive methods. Perhaps a more fundamental question concerns iterate dependency: the setting where the moments $m_k$ or the variances $\sigma_k$ are functions of the current iterate $x_k$, not just of the iteration index $k$. Significant effort is needed to address this additional correlation under appropriate regularity conditions. We believe our work lays the foundation for addressing this challenging research problem.

A ADDITIONAL EXPERIMENT DETAILS AND RESULTS

In this section, we provide more details on our experimental setup. The code for reproducing the noise estimation and the neural network training of the AWD-LSTM model is included in the supplementary material. The code for CIFAR10 classification is available upon request.

A.1 DETAILS ON PTB TRAINING

Our implementation is based on the authors' GitHub repository. The original codebase trains the network using clipped gradient descent followed by an averaged SGD (ASGD) phase to prevent overfitting. As generalization error is beyond our discussion, we focus on the first phase (which takes about 200 epochs) by removing the ASGD training part. Aside from the number of epochs, all other parameters are the same as in the default training procedure. For our algorithm, we set the clipping value to 0.25, and set the hyperparameters of Algorithm 1 to $\eta_k = 5$, $\beta = 0.95$, $m = 0.01$.

A.2 DETAILS ON CIFAR10 TRAINING

Our implementation is based on a PyTorch implementation of the original ResNet paper (He et al., 2016). We train the ResNet18 model on the CIFAR10 dataset for 200 epochs. The baseline uses SGD with the learning rate initialized to 0.1 and decayed by a factor of 10 at epochs 100 and 150; this achieves approximately 95% validation accuracy. For our algorithm, we set $\eta_k = 1$, $\beta = 0.99$, $m = 0.01$, and use the same learning rate schedule as the baseline.

A.3 NOISE BEHAVIOR FOR DIFFERENT ALGORITHMS

In this subsection, we include an interesting observation on the noise pattern of neural network training. From Figure 1, we see that the noise pattern looks very different for each task. One may naturally wonder whether the difference results from the particular training task or from the optimization algorithm. Noticing that each of the three tasks is trained with a different algorithm (CIFAR10 with momentum SGD; PTB with clipped SGD; En-De translation with Adam), we retrained the CIFAR10 task and the PTB language model with Adam, and reran the En-De translation experiment with momentum SGD. We found that the En-De translation task diverges when trained with SGD, even with a learning rate of 0.001; hence we only report the results of the other two experiments in Figure 4. From Figure 4, we see that the patterns look very similar to the original plots in Figure 1. These experiments therefore suggest that the noise pattern is mostly determined by the learning task rather than by the optimization algorithm.

B PROOF OF THEOREM 1

Proof. The iterate suboptimality satisfies the relation

$\|x_{k+1} - x^*\|^2 = \|x_k - \eta_k g_k - x^*\|^2 = \|x_k - x^*\|^2 - 2\eta_k \langle g_k, x_k - x^* \rangle + \eta_k^2 \|g_k\|^2.$

Rearranging and taking expectation with respect to $g_k$, we have

$2\eta_k (f(x_k) - f^*) \le 2\eta_k \langle \nabla f(x_k), x_k - x^* \rangle \le \mathbb{E}\|x_k - x^*\|^2 - \mathbb{E}\|x_{k+1} - x^*\|^2 + \eta_k^2 m_k^2.$

Summing over $k$ and taking expectation, we get

$\mathbb{E}\big[\textstyle\sum_{k=1}^T 2\eta_k (f(x_k) - f^*)\big] \le \|x_1 - x^*\|^2 + \textstyle\sum_{k=1}^T \eta_k^2 m_k^2.$

Then from convexity, we have

$\mathbb{E}[f(\bar{x}_T) - f^*] \le \dfrac{\|x_1 - x^*\|^2 + \sum_{k=1}^T \eta_k^2 m_k^2}{\sum_{k=1}^T \eta_k},$

where $\bar{x}_T = (\sum_{i=1}^T \eta_i x_i)/(\sum_{i=1}^T \eta_i)$. Corollary 2 follows by specifying the particular choices of stepsizes.
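As a toy illustration of the gap between the constant and idealized stepsizes analyzed above, the following sketch runs SGD on $f(x) = \frac{1}{2}\|x\|^2$ with a noise level that jumps mid-run. The problem, parameters, and variable names are ours, not the paper's experiments; the two stepsize choices mirror the constant and idealized schedules of Corollary 2.

```python
import numpy as np

def run_sgd(stepsizes, sigma, x0, rng):
    """SGD on f(x) = 0.5 * ||x||^2 (optimum x* = 0) with oracle g_k = grad f(x_k) + noise.
    Returns the stepsize-weighted average iterate, as analyzed in Theorem 1."""
    x = x0.copy()
    weighted = np.zeros_like(x0)
    for eta, s in zip(stepsizes, sigma):
        weighted += eta * x
        g = x + s * rng.standard_normal(x.shape)  # unbiased gradient, noise level s
        x = x - eta * g
    return weighted / stepsizes.sum()

T, d, R = 2000, 20, 10.0
sigma = np.where(np.arange(T) < T // 2, 1.0, 10.0)       # noise level jumps mid-run
eta_const = np.full(T, R / np.sqrt(np.sum(sigma ** 2)))  # constant baseline stepsize
eta_ideal = R / (sigma * np.sqrt(T))                     # idealized, noise-adaptive stepsize

f = lambda x: 0.5 * np.dot(x, x)                         # f* = 0
x0 = np.full(d, R / np.sqrt(d))                          # ||x0 - x*|| = R
gap_const = [f(run_sgd(eta_const, sigma, x0, np.random.default_rng(s))) for s in range(10)]
gap_ideal = [f(run_sgd(eta_ideal, sigma, x0, np.random.default_rng(s))) for s in range(10)]
print(np.mean(gap_const), np.mean(gap_ideal))
```

Averaged over ten seeds, the idealized schedule takes larger steps in the low-noise phase and smaller ones in the high-noise phase, and its final suboptimality is visibly smaller than the constant baseline's.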

C KEY LEMMA

Lemma 8. Under Assumption 1, taking $\beta = 1 - T^{-2/3}/2$, the total estimation error of the $\hat{m}_k^2$ based on (ExpMvAvg) is bounded by

$\mathbb{E}\big[\textstyle\sum_{k=1}^T |\hat{m}_k^2 - m_k^2|\big] \le 2(D^2 + M^2)\, T^{2/3} \ln(T^{2/3}).$

Proof. At a high level, we decouple the error into a bias term and a variance term: the total-variation assumption bounds the bias, the exponential moving average reduces the variance, and we pick $\beta$ to balance the two. From the triangle inequality,

$\sum_{k=1}^T \mathbb{E}|\hat{m}_k^2 - m_k^2| \le \underbrace{\sum_{k=1}^T \mathbb{E}\big|\hat{m}_k^2 - \mathbb{E}[\hat{m}_k^2]\big|}_{\text{variance term}} + \underbrace{\sum_{k=1}^T \big|\mathbb{E}[\hat{m}_k^2] - m_k^2\big|}_{\text{bias term}}.$

We first bound the bias term. By definition of $\hat{m}_k$, we have

$\mathbb{E}[\hat{m}_k^2] - m_k^2 = \beta\, \mathbb{E}[\hat{m}_{k-1}^2] + (1-\beta)\, m_{k-1}^2 - m_k^2 = \beta\big(\mathbb{E}[\hat{m}_{k-1}^2] - m_{k-1}^2\big) + \big(m_{k-1}^2 - m_k^2\big).$

Hence by recursion,

$\mathbb{E}[\hat{m}_k^2] - m_k^2 = \beta^{k-1}\underbrace{\big(\mathbb{E}[\hat{m}_1^2] - m_1^2\big)}_{=0} + \beta^{k-2}\big(m_1^2 - m_2^2\big) + \cdots + \big(m_{k-1}^2 - m_k^2\big).$

Therefore, the bias term can be bounded by

$\sum_{k=1}^T \big|\mathbb{E}[\hat{m}_k^2] - m_k^2\big| \le \sum_{k=1}^T \sum_{j=1}^{k-1} \beta^{k-1-j}\big|m_j^2 - m_{j+1}^2\big| = \sum_{k=1}^{T-1} \big|m_k^2 - m_{k+1}^2\big| \sum_{j=0}^{T-1-k} \beta^j \le \frac{1}{1-\beta} \sum_{k=1}^{T-1} \big|m_k^2 - m_{k+1}^2\big| \le \frac{D^2}{1-\beta},$

where the first inequality is the triangle inequality, the second inequality uses the geometric sum over $\beta$, and the last follows from Assumption 1. To bound the variance term, we remark that

$\hat{m}_k^2 = (1-\beta) g_{k-1}^2 + (1-\beta)\beta\, g_{k-2}^2 + \cdots + (1-\beta)\beta^{k-2} g_1^2 + \beta^{k-1} g_0^2.$

Hence, from the independence of the gradients, we have

$\mathbb{E}\big|\hat{m}_k^2 - \mathbb{E}[\hat{m}_k^2]\big| \le \sqrt{\mathrm{Var}[\hat{m}_k^2]},$
$\mathrm{Var}[\hat{m}_k^2] = \mathrm{Var}[(1-\beta) g_{k-1}^2] + \mathrm{Var}[(1-\beta)\beta\, g_{k-2}^2] + \cdots + \mathrm{Var}[\beta^{k-1} g_0^2] \le \big[(1-\beta)^2 + (1-\beta)^2\beta^2 + \cdots + (1-\beta)^2\beta^{2(k-2)} + \beta^{2(k-1)}\big] M^4,$

where $M^4$ upper-bounds $\mathrm{Var}[g_i^2]$. The first inequality follows from Jensen's inequality, the equality uses the independence of $g_i$ given $g_1, \ldots, g_{i-1}$, and the last inequality follows from Assumption 1. We distinguish two cases. When $k$ is small, we simply bound the bracketed coefficient by 1, i.e.
$(1-\beta)^2 + (1-\beta)^2\beta^2 + \cdots + (1-\beta)^2\beta^{2(k-2)} + \beta^{2(k-1)} \le 1.$

When $k$ is large, namely $k \ge 1 + \gamma$ with $\gamma = \frac{1}{2(1-\beta)} \ln\big(\frac{1}{1-\beta}\big)$, we have $\beta^{2(k-1)} \le 1-\beta$, and thus

$(1-\beta)^2 + (1-\beta)^2\beta^2 + \cdots + (1-\beta)^2\beta^{2(k-2)} + \beta^{2(k-1)} \le \frac{(1-\beta)^2}{1-\beta^2} + \beta^{2(k-1)} \le \frac{(1-\beta)^2}{1-\beta^2} + (1-\beta) \le 2(1-\beta),$

where the second inequality follows from $k \ge 1 + \gamma$. Therefore, when $k \ge 1 + \gamma$,

$\mathbb{E}\big|\hat{m}_k^2 - \mathbb{E}[\hat{m}_k^2]\big| \le \sqrt{2(1-\beta)}\, M^2.$

Substituting the above into the variance term,

$\sum_{k=1}^T \mathbb{E}\big|\hat{m}_k^2 - \mathbb{E}[\hat{m}_k^2]\big| = \sum_{k=1}^{\gamma} \mathbb{E}\big|\hat{m}_k^2 - \mathbb{E}[\hat{m}_k^2]\big| + \sum_{k=\gamma+1}^{T} \mathbb{E}\big|\hat{m}_k^2 - \mathbb{E}[\hat{m}_k^2]\big| \le \big(\gamma + (T-\gamma)\sqrt{2(1-\beta)}\big) M^2.$

Summing the variance term and the bias term yields

$\sum_{k=1}^T \mathbb{E}\big|\hat{m}_k^2 - m_k^2\big| \le \frac{D^2}{1-\beta} + \big(\gamma + (T-\gamma)\sqrt{2(1-\beta)}\big) M^2. \quad (5)$

Taking $\beta = 1 - T^{-2/3}/2$ yields

$\sum_{k=1}^T \mathbb{E}\big|\hat{m}_k^2 - m_k^2\big| \le 2(D^2 + M^2)\, T^{2/3} \ln(T^{2/3}).$

D PROOF OF THEOREM 3

At a high level, the deviation of the adaptive stepsize from the idealized stepsize mainly depends on the estimation error $|\hat{m}_k^2 - m_k^2|$, which has sublinear regret by Lemma 8. We then carefully integrate this regret bound to control the deviation from the idealized algorithm.

Proof. By the update rule for $x_{k+1}$, we have

$\|x_{k+1} - x^*\|^2 = \|x_k - \eta_k g_k - x^*\|^2 = \|x_k - x^*\|^2 - 2\eta_k \langle g_k, x_k - x^* \rangle + \eta_k^2 \|g_k\|^2.$

Noting that the stepsize $\eta_k$ is independent of $g_k$, taking expectation with respect to $g_k$ conditionally on the past iterates leads to

$2\eta_k (f(x_k) - f^*) \le 2\eta_k \langle \nabla f(x_k), x_k - x^* \rangle = \mathbb{E}[2\eta_k \langle g_k, x_k - x^* \rangle \mid x_k, \ldots, x_1] = -\mathbb{E}[\|x_{k+1} - x^*\|^2 \mid x_k, \ldots, x_1] + \|x_k - x^*\|^2 + \eta_k^2 m_k^2.$

Recalling that $R = \|x_1 - x^*\|$, taking expectation and summing over the iterations $k$, we get

$\mathbb{E}\big[2\big(\textstyle\sum_{k=1}^T \eta_k\big)(f(\bar{x}_T) - f^*)\big] \le R^2 + \mathbb{E}\big[\textstyle\sum_{k=1}^T \eta_k^2 m_k^2\big].$

Hence by Markov's inequality, with probability at least 3/4,

$2\big(\textstyle\sum_{k=1}^T \eta_k\big)(f(\bar{x}_T) - f^*) \le 4\,\mathbb{E}\big[2\big(\textstyle\sum_{k=1}^T \eta_k\big)(f(\bar{x}_T) - f^*)\big] \le 4\big(R^2 + \mathbb{E}\big[\textstyle\sum_{k=1}^T \eta_k^2 m_k^2\big]\big). \quad (7)$
Now we can upper bound the right-hand side. Indeed,

$\sum_{k=1}^T \mathbb{E}[\eta_k^2 m_k^2] = c^2 \sum_{k=1}^T \mathbb{E}\Big[\frac{m_k^2}{(\hat{m}_k + m)^2}\Big] \le c^2 \Big(\sum_{k=1}^T \mathbb{E}\Big[\frac{m_k^2 - \hat{m}_k^2}{(\hat{m}_k + m)^2}\Big] + \sum_{k=1}^T \mathbb{E}\Big[\frac{\hat{m}_k^2}{(\hat{m}_k + m)^2}\Big]\Big) \le c^2 \Big(\frac{1}{m^2} \sum_{k=1}^T \mathbb{E}\big|m_k^2 - \hat{m}_k^2\big| + T\Big) \le c^2 \Big(\frac{2(M^2 + D^2)\, T^{2/3} \ln(T^{2/3})}{m^2} + T\Big) \le 3c^2 T, \quad (8)$

where the last inequality follows from the choice of $m$. Hence, from Eq. (7), we have with probability at least 3/4,

$2\big(\textstyle\sum_{k=1}^T \eta_k\big)(f(\bar{x}_T) - f^*) \le 4(R^2 + 3c^2 T). \quad (9)$

Next, denoting $(x)_+ = \max(x, 0)$, we lower bound the left-hand side:

$\frac{1}{c}\eta_k = \frac{1}{\hat{m}_k + m} = \frac{1}{m_k + m} + \Big(\frac{1}{\hat{m}_k + m} - \frac{1}{m_k + m}\Big) \ge \frac{1}{m_k + m} - \frac{(\hat{m}_k - m_k)_+}{(m_k + m)(\hat{m}_k + m)} \ge \frac{1}{m_k + m} - \frac{(\hat{m}_k - m_k)_+}{\sqrt{m_k + m}\; m^{3/2}} \ge \frac{1}{m_k + m} - \frac{1}{2}\Big(\frac{(\hat{m}_k - m_k)_+^2}{m^3} + \frac{1}{m_k + m}\Big) = \frac{1}{2}\,\frac{1}{m_k + m} - \frac{1}{2m^3}\,(\hat{m}_k - m_k)_+^2. \quad (10)$

Table 2: Comparison of the convergence rates under noise example 1.

                   Constant        Adaptive               Idealized
0 ≤ α ≤ 1/6        O(T^{-1/2})     Õ(T^{(-1+2α)/2})       O(T^{(-1+2α)/2})
α > 1/6            O(T^{-1/2})     Õ(T^{-2/3})            O(T^{(-1+2α)/2})

Under this stronger assumption, we can run our online estimator on the first moment $\mathbb{E}[\|g_k\|]$ instead of the second moment, replacing line 6 of Algorithm 1 by

$\hat{m}_{k+1} = \beta \hat{m}_k + (1-\beta)\, \|g_k\|. \quad (13)$

The theorem below gives the convergence rate of the new algorithm.

Theorem 10. Under Assumptions 2 and 3, with $m = 16(D + M)\, T^{-1/6} \ln(T)$ and $c = R/\sqrt{T}$, Algorithm 1 with update rule (13) achieves, with probability at least 1/2,

$f(\bar{x}_T) - f^* \le \frac{2R}{\sqrt{T}} \cdot \frac{12T}{\sum_{k=1}^T 1/(m_k + m)}.$

With the better concentration of the online estimator, we can afford a less conservative correction constant $m$, of order $M T^{-1/6}$. It is this parameter that controls the maximum attainable improvement over the constant baseline. Indeed, consider again noise example 1, summarized in Table 2. Here the adaptive method obtains an improvement of order $T^{1/6}$ over the constant baseline, whereas previously only $T^{1/9}$ was achievable. The proof follows a routine similar to that of Theorem 3. We start by presenting an analogue of Lemma 8.

Lemma 11.
Under Assumption 3, the estimator achieves the following bound on the total estimation error:

$\mathbb{E}\big[\textstyle\sum_{k=1}^T |\hat{m}_k - \lambda_k|\big] \le 2(D + M)\, T^{2/3} \ln(T^{2/3}).$

Proof. The proof is the same as that of Lemma 8, replacing the second moment $m_k^2$ by the first moment $\lambda_k = \mathbb{E}[\|g_k\|]$.

Proof of Theorem 10. By Assumption 2, we can use the first moment of $g_k$ to bound the second moment. Hence, Eq. (7) implies that with probability at least 3/4,

$2\big(\textstyle\sum_{k=1}^T \eta_k\big)(f(\bar{x}_T) - f^*) \le 4\big(R^2 + 4\,\mathbb{E}\big[\textstyle\sum_{k=1}^T \eta_k^2 \lambda_k^2\big]\big).$

Now we upper bound the right-hand side. Indeed,

$\sum_{k=1}^T \mathbb{E}[\eta_k^2 \lambda_k^2] = c^2 \sum_{k=1}^T \mathbb{E}\Big[\frac{\lambda_k^2}{(\hat{m}_k + m)^2}\Big] \le c^2 \Big(\sum_{k=1}^T \mathbb{E}\Big[\frac{\lambda_k^2 - \hat{m}_k^2}{(\hat{m}_k + m)^2}\Big] + \sum_{k=1}^T \mathbb{E}\Big[\frac{\hat{m}_k^2}{(\hat{m}_k + m)^2}\Big]\Big) \le c^2 \Big(\sum_{k=1}^T \mathbb{E}\Big[\frac{(\lambda_k - \hat{m}_k)(\lambda_k + \hat{m}_k)}{(\hat{m}_k + m)^2}\Big] + T\Big) \le c^2 \Big(\frac{2M}{m^2} \sum_{k=1}^T \mathbb{E}\big[|\lambda_k - \hat{m}_k|\big] + T\Big) \le c^2 \Big(\frac{4(M + D) M\, T^{2/3} \ln(T^{2/3})}{m^2} + T\Big) \le 2c^2 T.$

Under review as a conference paper at ICLR 2021.

Hence by Markov's inequality, with probability at least 3/4,

$2\big(\textstyle\sum_{k=1}^T \eta_k\big)(f(\bar{x}_T) - f^*) \le 4\,\mathbb{E}\big[2\big(\textstyle\sum_{k=1}^T \eta_k\big)(f(\bar{x}_T) - f^*)\big] \le 4(R^2 + 2c^2 T).$

Next, we lower bound the left-hand side:

$\frac{1}{c}\eta_k = \frac{1}{\hat{m}_k + m} = \frac{1}{\lambda_k + m} + \Big(\frac{1}{\hat{m}_k + m} - \frac{1}{\lambda_k + m}\Big) \ge \frac{1}{\lambda_k + m} - \frac{(\hat{m}_k - \lambda_k)_+}{(\hat{m}_k + m)(\lambda_k + m)} \ge \frac{1}{\lambda_k + m} - \frac{1}{m^2}\,(\hat{m}_k - \lambda_k)_+.$

By Markov's inequality and Lemma 11, with probability at least 3/4,

$\sum_{k=1}^T (\hat{m}_k - \lambda_k)_+ \le 4\,\mathbb{E}\big[\textstyle\sum_{k=1}^T |\hat{m}_k - \lambda_k|\big] \le 8(D + M)\, T^{2/3} \ln(T^{2/3}).$

Following the choice $m = 16(D + M)\, T^{-1/6} \ln(T)$, we have

$\frac{1}{m^2} \sum_{k=1}^T (\hat{m}_k - \lambda_k)_+ \le \frac{T}{2(M + m)} \le \frac{1}{2} \sum_{k=1}^T \frac{1}{\lambda_k + m}.$

Consequently, we know that with probability at least $1 - \frac{1}{4} - \frac{1}{4} = \frac{1}{2}$,

$f(\bar{x}_T) - f^* \le \frac{4(R^2 + 2c^2 T)}{\sum_{k=1}^T \frac{c}{2(\lambda_k + m)}} \le \frac{2R}{\sqrt{T}} \cdot \frac{12T}{\sum_{k=1}^T 1/(m_k + m)},$

by setting $c = R/\sqrt{T}$ and using the fact that $\lambda_k \le m_k$.

F VARIANCE ORACLE AND EXTENSION TO NONCONVEX SETTING

In this section, we show how to adapt our algorithm to the variance oracle in Definition 1 (b), where $\mathbb{E}[\|g_k - \nabla f(x_k)\|^2] = \sigma_k^2$. To avoid redundancy, we present the result in the nonconvex smooth setting, under the following smoothness assumption.

Assumption 4. The function is $L$-smooth, i.e., for any $x, y$, $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.

Remark that the $L$-smoothness condition is not required under the second moment oracle; this is why the second moment assumption is usually imposed in the non-differentiable setting (see Section 6.1 of Bubeck (2014)). We first provide the convergence of SGD as a baseline.

Theorem 12 (Nonconvex baseline). Under the variance oracle in Definition 1 (b) and Assumption 4, SGD with update $x_{k+1} = x_k - \eta_k g_k$ and $\eta_k \le \frac{1}{2L}$ satisfies

$\mathbb{E}[\|\nabla f(x_I)\|^2] \le \dfrac{f(x_1) - f^* + \frac{L}{2} \sum_{k=1}^T \eta_k^2 \sigma_k^2}{\sum_{k=1}^T \eta_k}, \quad (17)$

where $I$ is a random variable with $\mathbb{P}(I = i) \propto \eta_i$.

This convergence result is very similar to the convex-setting bound of Theorem 1: instead of bounding the function suboptimality, the upper bound controls the norm of the gradient, implying convergence to a stationary point. We remark that an additional requirement on the stepsize is needed, namely $\eta_k \le 1/2L$. This is not surprising, since taking a large stepsize cannot help under the $L$-smoothness condition; hence we cannot take a stepsize inversely proportional to $\sigma_k$ when the noise is small. This restriction makes the comparison of convergence rates less straightforward. To facilitate the discussion, we make an additional assumption lower bounding the variance $\sigma_k$.

Algorithm 2 Variance Adaptive SGD $(x_1, T, c, m)$
1: Initialize $\hat{\sigma}_1^2 = \frac{1}{2}\|g_1 - g_1'\|^2$, where $g_1, g_1'$ are two independent stochastic gradients at $x_1$.
2: for $k = 1, 2, \ldots, T$ do
3:   Query two independent stochastic gradients $g_k, g_k'$ at $x_k$.
4:   Update $x_{k+1} = x_k - \eta_k (g_k + g_k')/2$ with $\eta_k = \frac{c}{\hat{\sigma}_k + m}$ and $m \ge 2cL$.
5:   Update $\hat{\sigma}_{k+1}^2 = \beta \hat{\sigma}_k^2 + (1-\beta) \frac{1}{2}\|g_k - g_k'\|^2$.
6: end for
7: return $x_I$, where $I$ is the random variable such that $\mathbb{P}(I = i) \propto \eta_i$.

Assumption 5. For any $k \in [1, T]$, $\sigma_k \ge 8L(f(x_1) - f(x^*))/\sqrt{T}$.

We emphasize that the above condition is not necessary for the convergence analysis, but only for clarity when comparing convergence rates. It lets us focus on the regime where the noise (rather than the shape of the deterministic function) dominates the convergence rate and determines the stepsize. In other words, under this assumption our stepsize choice satisfies $\eta_k \le 1/2L$, leading to the following convergence rates.

Corollary 13. Let $\Delta = f(x_1) - f^*$. We have the following two convergence rate bounds for SGD:

1. SGD with constant stepsize: if $\eta_k = \eta = \sqrt{\frac{2\Delta}{L \sum_k \sigma_k^2}}$, then

$\mathbb{E}[\|\nabla f(x_I)\|^2] \le \frac{\sqrt{2L\Delta \sum_{k=1}^T \sigma_k^2}}{T} = \sqrt{\frac{2L\Delta}{T}} \cdot \sqrt{\frac{\sum_{k=1}^T \sigma_k^2}{T}}. \quad \text{(constant baseline)}$

2. SGD with idealized stepsize: if $\eta_k = \frac{1}{\sigma_k}\sqrt{\frac{2\Delta}{LT}}$, then

$\mathbb{E}[\|\nabla f(x_I)\|^2] \le \frac{\sqrt{2LT\Delta}}{\sum_{k=1}^T 1/\sigma_k} = \sqrt{\frac{2L\Delta}{T}} \cdot \frac{T}{\sum_{k=1}^T 1/\sigma_k}. \quad \text{(idealized baseline)}$

The resulting convergence rates are similar to the convex setting. We now modify the adaptive algorithm to exploit the variance oracle and the smoothness assumption; the full procedure is presented in Algorithm 2. In particular, we keep an exponential moving average of the variance instead of the moments, using two independent stochastic gradients $g_k$ and $g_k'$ at the same iterate:

$\hat{\sigma}_{k+1}^2 = \beta \hat{\sigma}_k^2 + (1-\beta) \tfrac{1}{2}\|g_k - g_k'\|^2.$

In particular, $\frac{1}{2}\|g_k - g_k'\|^2$ is an unbiased estimator of $\sigma_k^2$. To provide the convergence analysis, we make the following assumptions on the variation of the variance.

Assumption 6. We assume an upper bound on $\sigma_k^2 = \mathbb{E}[\|g_k - \nabla f(x_k)\|^2]$, i.e., $\max_k \sigma_k \le M$. We also assume that the total variation of $\sigma_k^2$ is bounded, i.e., $\sum_k |\sigma_k^2 - \sigma_{k+1}^2| \le D^2$.

With the above assumptions, Algorithm 2 achieves the following convergence rate.

Theorem 14. Under Assumptions 4 and 6, and for $T$ large enough that $\ln T \le T^{1/3}$, Algorithm 2 with $c = \sqrt{\frac{2\Delta}{LT}}$ and $m = 4\sqrt{D^2 + M^2}\, T^{-1/9} \ln(T)^{1/2} + 2cL$ achieves, with probability at least 1/2,

$\|\nabla f(x_I)\|^2 \le \sqrt{\frac{2L\Delta}{T}} \cdot \frac{32T}{\sum_{k=1}^T 1/(\sigma_k + m)}.$

The above theorem closely parallels Theorem 3, and all the remarks for Theorem 3 also apply in the nonconvex case.
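A minimal Python sketch of Algorithm 2 follows, assuming a user-supplied stochastic gradient oracle. The demo oracle, its noise level, and the hyperparameter values below are our own illustration, not tuned settings from the paper.

```python
import numpy as np

def variance_adaptive_sgd(grad_oracle, x1, T, c, m, beta, rng):
    """Sketch of Algorithm 2: at each iterate draw two independent stochastic
    gradients; half the squared norm of their difference is an unbiased
    estimate of sigma_k^2, smoothed by an exponential moving average."""
    x = x1.copy()
    g, gp = grad_oracle(x), grad_oracle(x)
    var_est = np.sum((g - gp) ** 2) / 2              # line 1: initialize sigma_hat_1^2
    etas, iterates = [], []
    for _ in range(T):
        g, gp = grad_oracle(x), grad_oracle(x)       # line 3: two independent gradients
        eta = c / (np.sqrt(var_est) + m)             # line 4: noise-adaptive stepsize
        iterates.append(x.copy())
        etas.append(eta)
        x = x - eta * (g + gp) / 2                   # line 4: averaged-gradient step
        var_est = beta * var_est + (1 - beta) * np.sum((g - gp) ** 2) / 2   # line 5
    etas = np.array(etas)
    i = rng.choice(T, p=etas / etas.sum())           # line 7: P(I = i) proportional to eta_i
    return iterates[i]

# Demo on f(x) = 0.5 * ||x||^2 with additive Gaussian gradient noise.
rng = np.random.default_rng(1)
oracle = lambda x: x + 0.5 * rng.standard_normal(x.shape)
x1 = np.full(5, 3.0)
x_out = variance_adaptive_sgd(oracle, x1, T=500, c=0.5, m=0.5, beta=0.99, rng=rng)
print(np.linalg.norm(x1), np.linalg.norm(x_out))
```

Querying two gradients per step doubles the oracle cost, but it is what makes the variance estimate unbiased without knowing $\nabla f(x_k)$.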

G PROOF OF THEOREM 12

Proof. By $L$-smoothness, we have

$f(x_{k+1}) \le f(x_k) - \eta_k \langle g_k, \nabla f(x_k) \rangle + \frac{L\eta_k^2}{2} \|g_k\|^2.$

Rearranging and taking expectation with respect to $g_k$, we get

$\Big(\eta_k - \frac{L\eta_k^2}{2}\Big) \|\nabla f(x_k)\|^2 \le f(x_k) - \mathbb{E}[f(x_{k+1})] + \frac{L\eta_k^2}{2}\, \mathbb{E}[\|g_k - \nabla f(x_k)\|^2].$

From the condition $\eta_k \le \frac{1}{2L}$, we have

$\frac{\eta_k}{2} \|\nabla f(x_k)\|^2 \le f(x_k) - \mathbb{E}[f(x_{k+1})] + \frac{L}{2} \eta_k^2 \sigma_k^2.$

Summing over $k$ and taking expectation,

$\mathbb{E}\big[\textstyle\sum_{k=1}^T \eta_k \|\nabla f(x_k)\|^2\big] \le f(x_1) - f(x^*) + \frac{L}{2} \textstyle\sum_k \eta_k^2 \sigma_k^2.$

Denote by $I$ the random variable such that $\mathbb{P}(I = i) \propto \eta_i$. We know

$\big(\textstyle\sum_{k=1}^T \eta_k\big)\, \mathbb{E}[\|\nabla f(x_I)\|^2] \le f(x_1) - f(x^*) + \frac{L}{2} \textstyle\sum_{k=1}^T \eta_k^2 \sigma_k^2.$

This yields the desired convergence rate in (17).

H PROOF OF THEOREM 14

The proof is almost identical to that of Theorem 3. We start by presenting an analogue of Lemma 8.

Lemma 15. Under Assumption 6, the estimator (ExpMvAvg) achieves the following bound on the total estimation error:

$\mathbb{E}\big[\textstyle\sum_{k=1}^T |\hat{\sigma}_k^2 - \sigma_k^2|\big] \le 2(D^2 + M^2)\, T^{2/3} \ln(T^{2/3}).$

Proof. This follows by exactly the same argument as Lemma 8, together with the fact that $\mathbb{E}[\|g - g'\|^2] = 2\,\mathbb{E}[\|g - \nabla f\|^2]$.

Proof of Theorem 14. The first part of the proof follows the same schema as the proof of Theorem 12: when $\eta_k \le 1/2L$, we know that

$\mathbb{E}\big[\big(\textstyle\sum_{k=1}^T \eta_k\big) \|\nabla f(x_I)\|^2\big] \le \Delta + \frac{L}{2} \textstyle\sum_k \mathbb{E}[\eta_k^2 \sigma_k^2].$

The rest of the proof is analogous to the proof of Theorem 3. We can upper bound the right-hand side:

$\sum_{k=1}^T \mathbb{E}[\eta_k^2 \sigma_k^2] = c^2 \sum_{k=1}^T \mathbb{E}\Big[\frac{\sigma_k^2}{(\hat{\sigma}_k + m)^2}\Big] = c^2 \Big(\sum_{k=1}^T \mathbb{E}\Big[\frac{\sigma_k^2 - \hat{\sigma}_k^2}{(\hat{\sigma}_k + m)^2}\Big] + \sum_{k=1}^T \mathbb{E}\Big[\frac{\hat{\sigma}_k^2}{(\hat{\sigma}_k + m)^2}\Big]\Big) \le c^2 \Big(\frac{1}{m^2} \sum_{k=1}^T \mathbb{E}\big|\sigma_k^2 - \hat{\sigma}_k^2\big| + T\Big) \le c^2 \Big(\frac{2(M^2 + D^2)\, T^{2/3} \ln(T^{2/3})}{m^2} + T\Big) \le 3c^2 T,$

where the last inequality follows from the parameter choice ensuring $\frac{M^2 + D^2}{m^2} \le \frac{1}{16} \frac{T^{1/3}}{\ln(T)}$. Hence by Markov's inequality, with probability at least 3/4,

$\big(\textstyle\sum_{k=1}^T \eta_k\big) \|\nabla f(x_I)\|^2 \le 4\,\mathbb{E}\big[\big(\textstyle\sum_{k=1}^T \eta_k\big) \|\nabla f(x_I)\|^2\big] \le 4\Big(\Delta + \frac{3Lc^2 T}{2}\Big).$

Next, we lower bound the left-hand side as in (10):

$\frac{1}{c}\eta_k = \frac{1}{\hat{\sigma}_k + m} \ge \frac{1}{2}\,\frac{1}{\sigma_k + m} - \frac{1}{2m^3}\,(\hat{\sigma}_k - \sigma_k)_+^2. \quad (18)$

Finally, by Markov's inequality, with probability at least 3/4,

$\sum_{k=1}^T (\hat{\sigma}_k - \sigma_k)_+^2 \le 4\,\mathbb{E}\big[\textstyle\sum_{k=1}^T (\hat{\sigma}_k - \sigma_k)_+^2\big] \le 4\,\mathbb{E}\big[\textstyle\sum_{k=1}^T |\hat{\sigma}_k^2 - \sigma_k^2|\big] \le 8(D^2 + M^2)\, T^{2/3} \ln(T^{2/3}).$

Following the choice of $m = 4\sqrt{D^2 + M^2}\, T^{-1/9} \ln(T)^{1/2} + 2cL$, we have

$\frac{1}{2m^3} \sum_{k=1}^T (\hat{\sigma}_k - \sigma_k)_+^2 \le \frac{T}{4(M + m)} \le \frac{1}{4} \sum_{k=1}^T \frac{1}{\sigma_k + m}.$

Together with (18), this implies that with probability at least 3/4,

$\sum_{k=1}^T \eta_k \ge \frac{c}{4} \sum_{k=1}^T \frac{1}{\sigma_k + m}.$

Consequently, we know that with probability at least $1 - \frac{1}{4} - \frac{1}{4} = \frac{1}{2}$,

$\|\nabla f(x_I)\|^2 \le \frac{4\big(\Delta + \frac{3Lc^2 T}{2}\big)}{\sum_{k=1}^T \frac{c}{4(\sigma_k + m)}} \le \sqrt{\frac{2L\Delta}{T}} \cdot \frac{32T}{\sum_{k=1}^T 1/(\sigma_k + m)},$

by setting $c = \sqrt{\frac{2\Delta}{LT}}$.



Code source for CIFAR10: https://github.com/kuangliu/pytorch-cifar
Code source for LSTM: https://github.com/salesforce/awd-lstm-lm
Code source for Transformer: https://github.com/jadore801120/attention-is-all-you-need-pytorch
Synthetic regression data: scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html



Figure 1: We empirically evaluate the second moment (in blue) and the variance (in orange) of stochastic gradients during the training of neural networks. We observe that the magnitude of these quantities changes significantly as the iteration count increases, by a factor ranging from 10 (ResNet) to 10^6 (Transformer). This phenomenon motivates us to consider a setting with non-stationary noise.

Figure 2: Left: the injected noise intensity over iterations. Middle: average loss trajectory over 10 runs for four different algorithms: the standard baseline, the idealized baseline, Alg 1 and Alg 2. The comparison (idealized vs standard) confirms that adapting step sizes inversely to the noise level leads to faster convergence and less variation. Right: average and standard deviation of the function suboptimality, normalized by the average MSE of the standard baseline.

Figure 3: The left two plots show the accuracy of training ResNet18 on the Cifar10 dataset. The right two plots present the negative log-likelihood loss for LSTM language modelling from Merity et al. (2018). The baselines are provided by the repos cited on page 2. Algorithm 1 is described in Alg 1.

The standard deviation σ is shown in the left figure of Fig 2. We then run the four algorithms discussed in this work: the standard baseline, the idealized baseline, Alg 1 and Alg 2. We finetune the step size for each algorithm by grid search over 10^k for integer k. We repeat the experiment for 10 runs and show the average training trajectory as well as the function suboptimality in Fig 2. We observe that the performance ranks as follows: idealized baseline, Alg 2, Alg 1, standard baseline.

Figure 4: We retrain the models with the Adam optimizer and evaluate the second moment (in blue) and variance (in orange) of stochastic gradients for the Cifar10 and PTB experiments from Figure 1.

Z. Zhou, Q. Zhang, G. Lu, H. Wang, W. Zhang, and Y. Yu. Adashift: Decorrelation and convergence of adaptive learning rate methods. In International Conference on Learning Representations (ICLR), 2019.
F. Zou and L. Shen. On the convergence of weighted adagrad with momentum for training deep neural networks. arXiv preprint arXiv:1808.03408, 2018.
F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11127-11135, 2019.


Finally, by Markov's inequality and Lemma 8, with probability at least 3/4,

$\sum_{k=1}^T (\hat{m}_k - m_k)_+^2 \le 4\,\mathbb{E}\big[\textstyle\sum_{k=1}^T |\hat{m}_k^2 - m_k^2|\big] \le 8(D^2 + M^2)\, T^{2/3} \ln(T^{2/3}).$

Following the choice of $m = 4\sqrt{D^2 + M^2}\, T^{-1/9} \ln(T)^{1/2}$, we have

$\frac{1}{2m^3} \sum_{k=1}^T (\hat{m}_k - m_k)_+^2 \le \frac{T}{4(M + m)} \le \frac{1}{4} \sum_{k=1}^T \frac{1}{m_k + m}.$

Consequently, together with (9) and (10), we know that with probability at least $1 - \frac{1}{4} - \frac{1}{4} = \frac{1}{2}$,

$f(\bar{x}_T) - f^* \le \frac{4(R^2 + 3c^2 T)}{\sum_{k=1}^T \frac{c}{4(m_k + m)}} \le \frac{2R}{\sqrt{T}} \cdot \frac{32T}{\sum_{k=1}^T 1/(m_k + m)},$

where the last inequality follows by setting $c = R/\sqrt{T}$.

Remark 9. For more general choices of the stepsize, $\eta_k = \frac{c}{(\hat{m}_k^p + m^p)^{1/p}}$, the upper bound in Eq. (8) holds exactly as in the above proof, and the lower bound in Eq. (10) follows from an analogous elementary inequality.

E PROOF WITH CONCENTRATED NOISE

In this section, we add an additional constraint on the noise concentration.

Assumption 2. The expected norm of the gradient is not much smaller than the square root of its second moment, i.e., $\mathbb{E}[\|g(x)\|^2]^{1/2} \le 2\,\mathbb{E}[\|g(x)\|]$.

The constant "2" in the assumption above is arbitrary and can be increased to any fixed constant. The assumption is satisfied if $g(x)$ follows a Gaussian distribution. It is also satisfied if, for some fixed constant $\gamma$, $\mathbb{P}(\|g(x)\| \ge r) \le \gamma\, \mathbb{E}[\|g(x)\|]^3\, r^{-4}$ for all $r \ge \gamma\, \mathbb{E}[\|g(x)\|]$.

We also assume that the total variation of the first moment is bounded.

Assumption 3. Denote $\lambda_k = \mathbb{E}[\|g_k\|]$. We assume $\max_k \lambda_k \le M$ and $\sum_{k=1}^{T-1} |\lambda_k - \lambda_{k+1}| \le D$.
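As a quick numerical check of the Gaussian claim above: for centered Gaussians the ratio $\mathbb{E}[\|g\|^2]^{1/2} / \mathbb{E}[\|g\|]$ equals $\sqrt{\pi/2} \approx 1.25$ in one dimension and approaches 1 as the dimension grows, comfortably below the constant 2. The script below, with its dimensions and scales, is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = []
for d, scale in [(1, 1.0), (10, 0.5), (100, 2.0)]:
    g = scale * rng.standard_normal((200_000, d))        # centered Gaussian samples
    second = np.sqrt(np.mean(np.sum(g ** 2, axis=1)))    # estimate of E[||g||^2]^(1/2)
    first = np.mean(np.linalg.norm(g, axis=1))           # estimate of E[||g||]
    ratios.append(second / first)
print([round(r, 3) for r in ratios])  # d = 1 gives a value close to sqrt(pi/2) ~ 1.253
```

By Jensen's inequality the empirical ratio is always at least 1, so the interesting direction is only the upper bound, which the Gaussian tails keep far from 2.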

