STOCHASTIC OPTIMIZATION WITH NON-STATIONARY NOISE: THE POWER OF MOMENT ESTIMATION

Anonymous

Abstract

We investigate stochastic optimization under weaker assumptions on the noise distribution than those used in standard analyses. Our assumptions are motivated by empirical observations from neural network training. In particular, standard results on optimal convergence rates for stochastic optimization assume either that there exists a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses. These assumptions do not match the empirical behavior of optimization algorithms in neural network training, where the noise level in stochastic gradients can even increase with time. We address this non-stationary behavior of noise by analyzing convergence rates of stochastic gradient methods subject to a changing second moment (or variance) of the stochastic oracle. When the noise variation is known, we show that it is always beneficial to adapt the step size and exploit the noise variability. When the noise statistics are unknown, we obtain similar improvements by developing an online estimator of the noise level, thereby recovering close variants of RMSProp (Tieleman and Hinton, 2012). Consequently, our results reveal why adaptive step-size methods can outperform SGD, while still enjoying theoretical guarantees.

1. INTRODUCTION

Stochastic gradient descent (SGD) is one of the most popular optimization methods in machine learning because of its computational efficiency compared to traditional full-gradient methods. Great progress has been made in understanding the performance of SGD under different smoothness and convexity conditions (Agarwal et al., 2009; Arjevani et al., 2019; Drori and Shamir, 2019; Ghadimi and Lan, 2012; 2013; Nemirovsky and Yudin, 1983; Rakhlin et al., 2012). These results show that with a fixed step size, SGD achieves the minimax-optimal convergence rate for both convex and nonconvex optimization problems, provided the gradient noise is uniformly bounded.

Yet, despite the theoretical minimax optimality of SGD, adaptive gradient methods (Duchi et al., 2011; Kingma and Ba, 2014; Tieleman and Hinton, 2012) have become the methods of choice for training deep neural networks, and have received a surge of attention recently (Agarwal et al., 2018; Chen et al., 2019; Huang et al., 2019; Levy, 2017; Levy et al., 2018; Li and Orabona, 2019; Liu et al., 2019; 2020; Ma and Yarats, 2019; Staib et al., 2019; Ward et al., 2019; Zhang et al., 2019; 2020; Zhou et al., 2018; 2019; Zou and Shen, 2018; Zou et al., 2019). Instead of using fixed stepsizes, these methods construct their stepsizes adaptively from the current and past gradients. Despite advances in the literature on adaptivity, theoretical understanding of the benefits of adaptation remains very limited.

We provide a different perspective on the benefits of adaptivity by considering it in the context of non-stationary gradient noise, i.e., noise whose intensity varies across iterations. Surprisingly, this setting is rarely studied, even for SGD. To our knowledge, this paper is the first to formally study stochastic gradient methods in this varying-noise scenario. Our main goal is to show that: Adaptive step-sizes can guarantee faster rates than SGD when the noise is non-stationary.
We focus on this goal based on several empirical observations (Section 2), which lead us to model the noise of stochastic gradient oracles via the following iteration-dependent quantities:

$$m_k^2 := \mathbb{E}\big[\|g(x_k)\|^2\big] \quad \text{or} \quad \sigma_k^2 := \mathbb{E}\big[\|g(x_k) - \nabla f(x_k)\|^2\big], \qquad (1)$$

where $g(x_k)$ is the stochastic gradient and $\nabla f(x_k)$ the true gradient at iteration $k$. Notation (1) provides a more fine-grained description of noise behavior than uniform bounds on the variance by permitting iteration-dependent noise intensities. Intuitively, one should prefer smaller stepsizes when the noise is large and vice versa. Thus, under non-stationarity, an ideal algorithm should adapt its stepsize according to the parameters $m_k$ or $\sigma_k$, suggesting a potential benefit of adaptive stepsizes.

Contributions. The primary contribution of our paper is to show that a stochastic optimization method with an adaptive stepsize can achieve a faster rate of convergence (by a factor polynomial in $T$) than fixed-step SGD. We first analyze an idealized setting where the noise intensities are known, using it to illustrate how to select noise-dependent stepsizes that are provably more effective (Theorem 1). Next, we study the case of unknown noise, where we show, under an appropriate smoothness assumption on the noise variation, that a variant of RMSProp (Tieleman and Hinton, 2012) achieves the idealized convergence rate (Theorem 3). Remarkably, this variant does not require knowledge of the noise levels. Finally, we generalize our results to nonconvex settings (Theorems 12 and 14).
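To make the intuition concrete, the following sketch (not the paper's analyzed algorithm; all names, constants, and the test problem are illustrative) shows an RMSProp-style update in which an exponential moving average `v` serves as an online estimate of the per-coordinate second moment, so the effective stepsize `lr / sqrt(v)` automatically shrinks when the estimated noise level grows:

```python
import numpy as np

def rmsprop_step(x, grad, v, lr=1e-3, beta=0.99, eps=1e-8):
    """One RMSProp-style update: v is an online estimate of the
    per-coordinate second moment; the effective step lr / sqrt(v)
    shrinks as the estimated noise level grows."""
    v = beta * v + (1.0 - beta) * grad**2    # online second-moment estimate
    x = x - lr * grad / (np.sqrt(v) + eps)   # noise-adapted step
    return x, v

# Illustrative run on f(x) = 0.5 * ||x||^2 with noise that grows with k,
# mimicking the non-stationary regime discussed above.
rng = np.random.default_rng(0)
x, v = np.ones(5), np.zeros(5)
for k in range(1, 2001):
    noise_scale = 0.1 * np.sqrt(k)           # non-stationary: noise grows with k
    g = x + noise_scale * rng.normal(size=5) # stochastic gradient g(x_k)
    x, v = rmsprop_step(x, g, v)
```

Because the step is normalized by the running second-moment estimate, the iterates remain stable even as the injected noise level increases, whereas fixed-step SGD would need a stepsize tuned to the worst-case noise.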

2. MOTIVATING OBSERVATION: NONSTATIONARY NOISE IN DEEP LEARNING

Neural network training involves minimizing an empirical risk of the form $\min_x f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where each $f_i$ is the loss with respect to the $i$-th data point or minibatch. Stochastic methods optimize this objective by randomly sampling an incremental gradient $\nabla f_i$ at each iteration and using it as an unbiased estimate of the full gradient. The noise intensity of this stochastic gradient is measured by its second moment or variance, defined as

1. Second moment: $m^2(x) = \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x)\|^2$;
2. Variance: $\sigma^2(x) = \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2$,

where $\nabla f(x)$ is the full gradient.

To illustrate how these quantities evolve over iterations, we empirically evaluate them on three popular neural network training tasks: ResNet18 on the CIFAR10 dataset for image classification (1), an LSTM on the PTB dataset for language modelling (2), and a Transformer on WMT16 en-de for language translation (3). The results are shown in Figure 1, where both the second moments and variances are evaluated using the default training procedure of the original code.

On one hand, the variation of the second moment/variance has a very different shape in each of the considered tasks. In the CIFAR10 experiment, the noise intensity is quite steady after the first iteration, indicating fast convergence of the training model. In LSTM training, the noise level increases and converges to a threshold. In Transformer training, the noise level increases rapidly in the early epochs, reaches a maximum, and then gradually decreases. On the other hand, the preferred optimization algorithms in these tasks also differ. For CIFAR10, SGD with momentum is the most popular choice, while for language models, adaptive methods such as Adam or RMSProp are the default. This discrepancy is usually taken for granted on the basis of empirical validation; little theoretical understanding of it exists in the literature.
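Given access to per-sample gradients, the two quantities above can be computed directly. The sketch below (illustrative; the function name and array shapes are assumptions, with gradients stacked as an `(n, d)` array) also verifies the standard decomposition $m^2(x) = \sigma^2(x) + \|\nabla f(x)\|^2$:

```python
import numpy as np

def noise_stats(per_sample_grads):
    """Estimate the second moment m^2(x) and variance sigma^2(x)
    from a stack of per-sample gradients grad f_i(x), shape (n, d)."""
    full_grad = per_sample_grads.mean(axis=0)                   # ∇f(x)
    m2 = np.mean(np.sum(per_sample_grads**2, axis=1))           # (1/n) Σ ||∇f_i(x)||²
    sigma2 = np.mean(np.sum((per_sample_grads - full_grad)**2, axis=1))
    return m2, sigma2

# Sanity check on synthetic gradients: m² = σ² + ||∇f(x)||² holds exactly.
rng = np.random.default_rng(1)
G = rng.normal(size=(32, 10))       # 32 per-sample gradients in dimension 10
m2, s2 = noise_stats(G)
full = G.mean(axis=0)
assert np.isclose(m2, s2 + np.sum(full**2))
```

The decomposition shows that the second moment upper-bounds the variance, which is why the two curves in Figure 1 track each other whenever the full gradient is small.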



(1) Code source for CIFAR10: https://github.com/kuangliu/pytorch-cifar
(2) Code source for LSTM: https://github.com/salesforce/awd-lstm-lm
(3) Code source for Transformer: https://github.com/jadore801120/attention-is-all-you-need-pytorch



Figure 1: We empirically evaluate the second moment (in blue) and variance (in orange) of stochastic gradients during the training of neural networks. We observe that the magnitude of these quantities changes significantly as the iteration count increases, by a factor ranging from 10 times (ResNet) to $10^6$ times (Transformer). This phenomenon motivates us to consider a setting with non-stationary noise.

