RMSPROP CONVERGES WITH PROPER HYPER-PARAMETER

Abstract

Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges with a proper choice of hyper-parameters under certain conditions. More specifically, we prove that when the hyper-parameter β2 is close enough to 1, RMSprop and its random-shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not depend on the "bounded gradient" assumption, which is often the key assumption utilized by existing theoretical work on Adam-type adaptive gradient methods. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. Finally, based on our theory, we conjecture that in practice there is a critical threshold β2*, such that RMSprop generates reasonably good results only if 1 > β2 ≥ β2*. We provide empirical evidence for such a phase transition in our numerical experiments.

1. INTRODUCTION

RMSprop (Tieleman & Hinton, 2012) remains one of the most popular algorithms for machine learning applications. As a non-momentum version of the more general algorithm Adam, RMSprop's good empirical performance has been well acknowledged by practitioners in generative adversarial networks (GANs) (Seward et al., 2018; Yazıcı et al., 2019; Karnewar & Wang, 2020; Jolicoeur-Martineau, 2019), reinforcement learning (Mnih et al., 2016), etc. In spite of its prevalence, however, Reddi et al. (2018) discovered that RMSprop (as well as the more general Adam) can diverge even for simple convex functions. To fix the algorithm, the authors of Reddi et al. (2018) proposed a new variant called AMSGrad, which is guaranteed to converge under certain conditions. Since then, designing provably convergent variants of RMSprop has been an active area of research. These variants include AdaFom (Chen et al., 2019), AdaBound (Luo et al., 2019), Nostalgic Adam (Huang et al., 2019), Yogi (Zaheer et al., 2018), and many more. Despite the variants, vanilla RMSprop works well in practice, and after proper hyper-parameter tuning, the non-convergence issue is not commonly observed. Why is there such a large gap between theory and practice? Is this because real-world problems are likely to be "nice", or is it because the theoretical analysis of RMSprop does not match how the algorithm is used in practice?

With the above questions in mind, we revisited the counter-example of Reddi et al. (2018) and found an interesting phenomenon. One counter-example of Reddi et al. (2018) is the following:

    f_t(x) = Cx,   if t mod C = 1,
    f_t(x) = -x,   otherwise,                                   (1)

where x ∈ [-1, 1]. They proved divergence under the condition β2 ≤ min{C^(-4/(C-2)), 1 - 9/(2C^2)}, where β2 is the second-order momentum coefficient in Algorithm 1 (the algorithm is presented later). For instance, when C = 10, the algorithm diverges if β2 < 0.3. Reddi et al.
(2018) mentioned that "this explains why large β2 is advisable while using the Adam algorithm", but they did not analyze whether a large β2 leads to convergence in their example. We ran simulations on problem (1) with different β2 and found that there is always a threshold of β2 above which RMSprop converges; see Figure 1. For instance, when C = 10, the transition point of β2 is roughly 0.955: the algorithm converges if β2 > 0.956 but diverges if β2 < 0.955. In general, there is a curve of phase transition from divergence to convergence, and this curve slopes upward, which means the transition point moves closer to 1 as C becomes larger. Based on this observation, we make the following conjecture:

Conjecture: RMSprop converges if β2 is large enough.

Before further discussion, we introduce the following assumption.

Assumption 1.1. f(x) = Σ_{j=0}^{n-1} f_j(x), and Σ_{j=0}^{n-1} ||∇f_j(x)||_2^2 ≤ D1 ||∇f(x)||_2^2 + D0.

We divide optimization problems into two classes: realizable problems, where D0 = 0, and non-realizable problems, where D0 > 0. When D0 = 0, Assumption 1.1 becomes Σ_{j=0}^{n-1} ||∇f_j(x)||_2^2 ≤ D1 ||∇f(x)||_2^2, which is called the "strong growth condition" (SGC) (Vaswani et al., 2019). It requires the norm of the stochastic gradient to be proportional to the batch gradient norm. When ∇f(x) = 0, under SGC we have ∇f_j(x) = 0 for all j. For linear regression problems, SGC holds if the linear model can fit all data. More specifically, for the problem min_x ||Ax||^2 = Σ_{j=1}^n (a_j^T x)^2, where A is an n-by-n matrix and a_j^T is the j-th row vector of A, SGC holds with D1 ≤ λ_max(Σ_{i=1}^n a_i a_i^T a_i a_i^T) / λ_min(A^T A) (Raj & Bach, 2020). SGC can be viewed as a simple condition that models overparameterized neural networks capable of interpolating all data points (Vaswani et al., 2019). Therefore, in this work we use the terminology "realizable problems" to refer to problems that satisfy SGC.
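For small problems, the SGC constant can be probed numerically. The sketch below (assuming NumPy; the function name and sampling scheme are our own choices) estimates the smallest D1 satisfying Σ_j ||∇f_j(x)||^2 ≤ D1 ||∇f(x)||^2 for the realizable least-squares problem f(x) = ||Ax||^2 by comparing the two sides at random points; since it samples finitely many points, it only lower-bounds the true constant.

```python
import numpy as np

def sgc_ratio(A, trials=1000, seed=0):
    """Estimate the smallest D1 with sum_j ||grad f_j(x)||^2 <= D1 * ||grad f(x)||^2
    for f(x) = ||A x||^2, f_j(x) = (a_j^T x)^2, by sampling random points x."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    row_norms_sq = np.sum(A**2, axis=1)         # ||a_j||^2 for each row
    worst = 0.0
    for _ in range(trials):
        x = rng.standard_normal(d)
        r = A @ x                                # residuals a_j^T x
        # per-sample gradients: grad f_j(x) = 2 (a_j^T x) a_j
        per_sample = 4.0 * np.sum(r**2 * row_norms_sq)
        # full gradient: grad f(x) = 2 A^T A x = 2 A^T r
        full = 4.0 * np.sum((A.T @ r) ** 2)
        if full > 1e-12:
            worst = max(worst, per_sample / full)
    return worst

# D0 = 0 here: the only global minimum is x = 0, where every f_j is
# also minimized -- the "realizable" (interpolation) case.
A = np.random.default_rng(1).standard_normal((5, 5))
D1 = sgc_ratio(A)
print(D1)
```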

1.1. MAIN CONTRIBUTIONS

In an attempt to resolve the conjecture, we delve into RMSprop's convergence issues and obtain a series of theoretical and empirical results. Our contributions are summarized below:



Figure 1: Phase diagram of the outcome of RMSprop on the counter-example (1). Different marks represent different outcomes: we label a data point as convergence if the distance between x and -1 is smaller than 0.01 on average after 750000 iterations, and as divergence otherwise. For each choice of β2, there exists a counter-example, but for each counter-example on which Adam diverges, there exists a larger β2 that makes Adam converge. We fix β1 = 0. The step size is set as η_t = 1/√t.
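The phase transition in Figure 1 can be reproduced with a short simulation. Below is a minimal sketch, assuming the standard RMSprop update v_t = β2 v_{t-1} + (1-β2) g_t^2, x_{t+1} = x_t - η_t g_t / (√v_t + ε), with step size η_t = 1/√t and projection onto [-1, 1]; the iteration count and ε below are our own choices, not taken from the paper.

```python
import math

def run_rmsprop(C=10, beta2=0.955, iters=200_000, eps=1e-8):
    """Run RMSprop on f_t(x) = C*x if t mod C == 1 else -x, with x in [-1, 1].
    The minimizer of the average loss is x = -1."""
    x, v = 0.0, 0.0
    for t in range(1, iters + 1):
        g = float(C) if t % C == 1 else -1.0    # gradient of f_t at x
        v = beta2 * v + (1 - beta2) * g * g     # second-moment estimate
        x -= (1.0 / math.sqrt(t)) * g / (math.sqrt(v) + eps)
        x = min(1.0, max(-1.0, x))              # project back onto [-1, 1]
    return x

# Small beta2 drifts toward x = +1 (divergence away from the optimum x = -1),
# while beta2 close to 1 settles near x = -1, matching the phase transition.
for b2 in (0.3, 0.99):
    print(b2, run_rmsprop(beta2=b2))
```

For C = 10, sweeping β2 over a grid near 0.955 reproduces the sharp transition point reported in the text.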

