RMSPROP CONVERGES WITH PROPER HYPER-PARAMETER

Abstract

Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges under certain conditions with a proper choice of hyper-parameters. More specifically, we prove that when the hyper-parameter β_2 is close enough to 1, RMSprop and its random-shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not depend on the "bounded gradient" assumption, which is often the key assumption utilized in existing theoretical work on Adam-type adaptive gradient methods. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. Finally, based on our theory, we conjecture that in practice there is a critical threshold β_2^* such that RMSprop generates reasonably good results only if 1 > β_2 ≥ β_2^*. We provide empirical evidence for such a phase transition in our numerical experiments.

1. INTRODUCTION

RMSprop (Tieleman & Hinton, 2012) remains one of the most popular algorithms for machine learning applications. As a non-momentum version of the more general algorithm Adam, RMSprop's good empirical performance has been well acknowledged by practitioners in generative adversarial networks (GANs) (Seward et al., 2018; Yazıcı et al., 2019; Karnewar & Wang, 2020; Jolicoeur-Martineau, 2019), reinforcement learning (Mnih et al., 2016), etc. In spite of its prevalence, however, Reddi et al. (2018) discovered that RMSprop (as well as the more general version Adam) can diverge even for simple convex functions. To fix the algorithm, the authors of Reddi et al. (2018) proposed a new variant called AMSGrad, which is guaranteed to converge under certain conditions. Since then, it has been an active area of research to design provably convergent variants of RMSprop. These variants include AdaFom (Chen et al., 2019), AdaBound (Luo et al., 2019), Nostalgic Adam (Huang et al., 2019), Yogi (Zaheer et al., 2018), and many more.

Despite these variants, vanilla RMSprop indeed works well in practice, and with proper hyper-parameter tuning the non-convergence issue is not commonly observed. Why is there such a large gap between theory and practice? Is this because real-world problems are likely to be "nice", or is it because the theoretical analysis of RMSprop does not match how it is used in practice?

With the above questions in mind, we revisited the counter-example of Reddi et al. (2018) and found an interesting phenomenon. One counter-example of Reddi et al. (2018) is the following:

f_t(x) = Cx,  if t mod C = 1,
f_t(x) = -x,  otherwise,

where x ∈ [-1, 1]. They proved divergence under the condition β_2 ≤ min{C^{-4/(C-2)}, 1 - 9/(2C^2)}, where β_2 is the second-order momentum coefficient in Algorithm 1 (the algorithm is presented later). For instance, when C = 10, the algorithm diverges if β_2 < 0.3 (a minimal numerical sketch of this behavior is given at the end of this section). Reddi et al. (2018) mentioned


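To make the divergence behavior of this counter-example concrete, below is a minimal numerical sketch (not taken from the paper). It applies a standard RMSprop update, v_t = β_2 v_{t-1} + (1 - β_2) g_t², x_{t+1} = x_t - α g_t / (√v_t + ε), to the piecewise-linear losses above; the step size α, ε, iteration budget, and the projection back onto [-1, 1] are illustrative choices rather than the exact settings analyzed in the paper.

```python
import math

C = 10  # period of the counter-example: f_t(x) = C*x if t mod C == 1, else -x

def grad(t):
    """Gradient of f_t (constant in x): C on the 'bad' step, -1 otherwise."""
    return float(C) if t % C == 1 else -1.0

def rmsprop_final_x(beta2, lr=0.01, eps=1e-8, steps=50_000):
    """Run RMSprop on the counter-example and return the last iterate."""
    x, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(t)
        v = beta2 * v + (1.0 - beta2) * g * g   # second-moment estimate
        x -= lr * g / (math.sqrt(v) + eps)      # RMSprop step
        x = max(-1.0, min(1.0, x))              # project back onto [-1, 1]
    return x

# The per-cycle average loss is (1/C)*(C*x - (C-1)*x) = x/C, minimized at x = -1.
for beta2 in (0.3, 0.99):
    print(f"beta2 = {beta2:4.2f} -> final x = {rmsprop_final_x(beta2):+.3f}")
```

In this sketch, a small β_2 (e.g., 0.3) drives the iterate toward the wrong endpoint x = +1, while β_2 close to 1 keeps it near the minimizer x = -1, consistent with the phase transition described in the abstract.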