STOCHASTIC OPTIMIZATION WITH NON-STATIONARY NOISE: THE POWER OF MOMENT ESTIMATION

Anonymous

Abstract

We investigate stochastic optimization under weaker assumptions on the noise distribution than those used in standard analyses. Our assumptions are motivated by empirical observations in training neural networks. In particular, standard results on optimal convergence rates for stochastic optimization assume either that there is a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses. These assumptions do not match the empirical behavior of optimization algorithms used in neural network training, where the noise level in stochastic gradients can even increase with time. We address this non-stationary behavior of the noise by analyzing convergence rates of stochastic gradient methods subject to a changing second moment (or variance) of the stochastic oracle. When the noise variation is known, we show that it is always beneficial to adapt the step size and exploit the noise variability. When the noise statistics are unknown, we obtain similar improvements by developing an online estimator of the noise level, thereby recovering close variants of RMSProp (Tieleman and Hinton, 2012). Consequently, our results reveal why adaptive step-size methods can outperform SGD while still enjoying theoretical guarantees.

1. INTRODUCTION

Stochastic gradient descent (SGD) is one of the most popular optimization methods in machine learning because of its computational efficiency compared to traditional full-gradient methods. Great progress has been made in understanding the performance of SGD under different smoothness and convexity conditions (Agarwal et al., 2009; Arjevani et al., 2019; Drori and Shamir, 2019; Ghadimi and Lan, 2012; 2013; Nemirovsky and Yudin, 1983; Rakhlin et al., 2012). These results show that with a fixed step size, SGD achieves the minimax-optimal convergence rate for both convex and nonconvex optimization problems, provided the gradient noise is uniformly bounded. Yet, despite this theoretical minimax optimality of SGD, adaptive gradient methods (Duchi et al., 2011; Kingma and Ba, 2014; Tieleman and Hinton, 2012) have become the methods of choice for training deep neural networks and have received a surge of attention recently (Agarwal et al., 2018; Chen et al., 2019; Huang et al., 2019; Levy, 2017; Levy et al., 2018; Li and Orabona, 2019; Liu et al., 2019; 2020; Ma and Yarats, 2019; Staib et al., 2019; Ward et al., 2019; Zhang et al., 2019; 2020; Zhou et al., 2018; 2019; Zou and Shen, 2018; Zou et al., 2019). Instead of using fixed step sizes, these methods construct their step sizes adaptively from the current and past gradients. Despite advances in the literature on adaptivity, theoretical understanding of the benefits of adaptation remains limited.

We provide a different perspective on the benefits of adaptivity by considering it in the context of non-stationary gradient noise, i.e., noise whose intensity varies across iterations. Surprisingly, this setting is rarely studied, even for SGD. To our knowledge, this paper is the first to formally study stochastic gradient methods in this varying-noise scenario. Our main goal is to show that:

Adaptive step sizes can guarantee faster rates than SGD when the noise is non-stationary.

We focus on this goal based on several empirical observations (Section 2), which lead us to model the noise of the stochastic gradient oracle via the following iteration-dependent quantities:

$$
m_k^2 := \mathbb{E}\big[\|g(x_k)\|^2\big], \qquad \text{or} \qquad \sigma_k^2 := \mathbb{E}\big[\|g(x_k) - \nabla f(x_k)\|^2\big], \tag{1}
$$

where $g(x_k)$ is the stochastic gradient and $\nabla f(x_k)$ the true gradient at iteration $k$. The notation in (1) provides a more fine-grained description of the noise behavior than a uniform bound on the variance.
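To make the quantities in (1) concrete, the following self-contained Python sketch (illustrative only, not part of the paper) estimates $m_k^2$ and $\sigma_k^2$ by Monte Carlo at a fixed iterate of a toy least-squares problem; the objective, batch size, sample count, and function names are hypothetical choices made for this example.

```python
# Illustrative sketch: Monte Carlo estimates of the per-iteration noise moments
# m_k^2 = E[||g(x_k)||^2] and sigma_k^2 = E[||g(x_k) - grad f(x_k)||^2]
# for a toy least-squares objective f(x) = (1/2n) ||Ax - b||^2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
b = rng.standard_normal(1000)

def full_gradient(x):
    # Exact gradient of f(x) = (1/2n) ||Ax - b||^2.
    return A.T @ (A @ x - b) / len(b)

def stochastic_gradient(x, batch_size=32):
    # Minibatch gradient: an unbiased estimate of full_gradient(x).
    idx = rng.choice(len(b), size=batch_size, replace=False)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ x - bb) / batch_size

def noise_moments(x, num_samples=200):
    # Empirical estimates of m_k^2 and sigma_k^2 at the iterate x.
    g_full = full_gradient(x)
    sq_norms, sq_devs = [], []
    for _ in range(num_samples):
        g = stochastic_gradient(x)
        sq_norms.append(np.sum(g ** 2))
        sq_devs.append(np.sum((g - g_full) ** 2))
    return np.mean(sq_norms), np.mean(sq_devs)

x_k = rng.standard_normal(20)
m2_k, sigma2_k = noise_moments(x_k)
print(f"m_k^2 ~ {m2_k:.3f}, sigma_k^2 ~ {sigma2_k:.3f}")
```

Repeating this estimate along an optimization trajectory is one way to observe whether the noise level is stationary or changes with $k$.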


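As a further illustration of how an online noise estimate can drive the step size, here is a minimal, self-contained sketch of a noise-adaptive SGD loop in the spirit of RMSProp (Tieleman and Hinton, 2012). This is not the paper's algorithm: it keeps a single scalar running average of $\|g(x_k)\|^2$ (mirroring $m_k^2$ above) rather than RMSProp's coordinate-wise averages, and the toy objective, hyperparameters, and names are hypothetical choices for this example.

```python
# Illustrative sketch: SGD whose step size shrinks when an online estimate of
# the gradient second moment grows, on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 20))
b = rng.standard_normal(1000)

def stochastic_gradient(x, batch_size=32):
    # Minibatch gradient of f(x) = (1/2n) ||Ax - b||^2 (unbiased).
    idx = rng.choice(len(b), size=batch_size, replace=False)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ x - bb) / batch_size

def noise_adaptive_sgd(x0, num_iters=2000, base_lr=0.5, beta=0.9, eps=1e-8):
    x, v = x0.copy(), 0.0
    for _ in range(num_iters):
        g = stochastic_gradient(x)
        v = beta * v + (1 - beta) * np.sum(g ** 2)   # online estimate of E[||g(x_k)||^2]
        x = x - (base_lr / (np.sqrt(v) + eps)) * g   # noisier gradients -> smaller steps
    return x

x_hat = noise_adaptive_sgd(rng.standard_normal(20))
print("final objective:", 0.5 * np.mean((A @ x_hat - b) ** 2))
```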