RMSPROP CONVERGES WITH PROPER HYPER-PARAMETER

Abstract

Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges with a proper choice of hyper-parameters under certain conditions. More specifically, we prove that when the hyper-parameter β2 is close enough to 1, RMSprop and its random-shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not depend on the "bounded gradient" assumption, which is often the key assumption utilized by existing theoretical work on Adam-type adaptive gradient methods. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. Finally, based on our theory, we conjecture that in practice there is a critical threshold β2*, such that RMSprop generates reasonably good results only if 1 > β2 ≥ β2*. We provide empirical evidence for such a phase transition in our numerical experiments.

1. INTRODUCTION

RMSprop (Tieleman & Hinton, 2012) remains one of the most popular algorithms for machine learning applications. As a non-momentum version of the more general algorithm Adam, RMSprop's good empirical performance has been well acknowledged by practitioners in generative adversarial networks (GANs) (Seward et al., 2018; Yazıcı et al., 2019; Karnewar & Wang, 2020; Jolicoeur-Martineau, 2019), reinforcement learning (Mnih et al., 2016), etc. In spite of its prevalence, however, Reddi et al. (2018) discovered that RMSprop (as well as the more general Adam) can diverge even for simple convex functions. To fix the algorithm, the authors of Reddi et al. (2018) proposed a new variant called AMSGrad, which is guaranteed to converge under certain conditions. Since then, it has been an active area of research to design provably convergent variants of RMSprop. These variants include AdaFom (Chen et al., 2019), AdaBound (Luo et al., 2019), Nostalgic Adam (Huang et al., 2019), Yogi (Zaheer et al., 2018), and many more.

Despite the variants, vanilla RMSprop works well in practice, and after proper hyper-parameter tuning, the non-convergence issue is not commonly observed. Why is there a large gap between theory and practice? Is this because real-world problems are likely to be "nice", or is it because the theoretical analysis of RMSprop does not match how it is used in practice? With these questions in mind, we revisited the counter-examples of Reddi et al. (2018) and found an interesting phenomenon. One counter-example of Reddi et al. (2018) is the following:

f_t(x) = Cx, for t mod C = 1; f_t(x) = -x, otherwise,    (1)

where x ∈ [-1, 1]. They proved divergence under the condition β2 ≤ min{ C^(-4/(C-2)), 1 - 9/(2C^2) }, where β2 is the second-order momentum coefficient in Algorithm 1 (the algorithm is presented later). For instance, when C = 10, the algorithm diverges if β2 < 0.3. Reddi et al. (2018) mentioned that "this explains why large β2 is advisable while using Adam algorithm", but they did not analyze whether large β2 leads to convergence in their example. We ran simulations for problem (1) with different β2 and found that there is always a threshold of β2 above which RMSprop converges; see Figure 1. For instance, when C = 10, the transition point of β2 is roughly 0.955: the algorithm converges if β2 > 0.956 but diverges if β2 < 0.955. In general, there is a curve of phase transition from divergence to convergence, and this curve slopes upward, which means the transition point is closer to 1 as C becomes larger. Based on this observation, we make the following conjecture:

Conjecture: RMSprop converges if β2 is large enough.

Before further discussion, we introduce the following assumption.

Assumption 1.1. f(x) = Σ_{j=0}^{n-1} f_j(x), and

Σ_{j=0}^{n-1} ||∇f_j(x)||_2^2 ≤ D1 ||∇f(x)||_2^2 + D0.    (2)

We divide optimization problems into two classes: realizable problems, where D0 = 0, and non-realizable problems, where D0 > 0. When D0 = 0, Assumption 1.1 becomes Σ_{j=0}^{n-1} ||∇f_j(x)||_2^2 ≤ D1 ||∇f(x)||_2^2, which is called the "strong growth condition" (SGC) (Vaswani et al., 2019). It requires the norm of the stochastic gradient to be proportional to the batch gradient norm. When ∇f(x) = 0, under SGC we have ∇f_j(x) = 0 for all j. For linear regression problems, SGC holds if the linear model can fit all data. More specifically, for the problem min_x ||Ax||_2^2 = Σ_{j=1}^n (a_j^T x)^2, where A is an n-by-n matrix and a_j^T is the j-th row vector of A, SGC holds with D1 ≤ λ_max(Σ_{i=1}^n a_i a_i^T a_i a_i^T) / λ_min(A^T A) (Raj & Bach, 2020). SGC can be viewed as a simple condition that models overparameterized neural networks capable of interpolating all data points (Vaswani et al., 2019). Therefore, in this work we use the terminology "realizable problems" to refer to problems that satisfy SGC.
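To make the phase transition concrete, the following sketch runs RMSprop (Algorithm 1 with β1 = 0) on the counter-example (1), projecting onto [-1, 1] as in Reddi et al. (2018). The step-size constant, iteration budget, and initialization of v here are illustrative choices of ours, not the exact settings behind Figure 1.

```python
import numpy as np

def rmsprop_on_counterexample(beta2, C=10, T=300_000, eta1=1.0, eps=0.0):
    """RMSprop on example (1): grad f_t = C if t mod C == 1, else -1.

    The minimizer of the epoch-averaged loss on [-1, 1] is x = -1.
    """
    x, v = 0.0, float(C * C)  # v started at the largest squared gradient
    for t in range(1, T + 1):
        g = float(C) if t % C == 1 else -1.0
        v = beta2 * v + (1.0 - beta2) * g * g
        x -= (eta1 / np.sqrt(t)) * g / (np.sqrt(v) + eps)
        x = min(1.0, max(-1.0, x))  # projection onto the feasible set [-1, 1]
    return x
```

With C = 10, a small β2 (e.g. 0.3) drives the iterate to the wrong boundary +1 (the right-moving steps dominate each epoch), while a large β2 (e.g. 0.999) brings it close to the minimizer -1, in line with the transition visible in Figure 1.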

1.1. MAIN CONTRIBUTIONS

In an attempt to resolve the conjecture, we delve into RMSprop's convergence issues and obtain a series of theoretical and empirical results. Our contributions are summarized below:

• We find that RMSprop's convergence is contingent on the choice of β2. For general optimization problems, there are two types of hyper-parameters: problem-dependent hyper-parameters, such as the step size in GD, and universal constants, such as the momentum coefficient in the heavy-ball method¹. Our result reveals that β2 is closer to the first type.

• We prove that RMSprop converges to stationary points for realizable problems (the interpolation regime), and to a bounded region for non-realizable problems. Combined with the divergence example of RMSprop, this indicates the existence of a phase transition from divergence to convergence depending on β2. Note that when we say "convergence", in a weak sense it means that the sequence converges to a bounded region (the non-realizable case); in a strong sense it means that the sequence converges to stationary points (the realizable case).

• To the best of our knowledge, we are the first to prove the convergence of RMSprop, and of Adam in certain regimes, without any assumption on the boundedness of the gradient norm. This is important for showing the transition: under an added bounded-gradient assumption, the gradients cannot diverge, while the counter-example shows that they can.

2. PRELIMINARIES

We consider a finite-sum problem:

min_{x ∈ R^d} f(x) = Σ_{j=0}^{n-1} f_j(x).    (3)

In neural network training, f_j usually represents the loss contributed by the j-th sample batch. We present randomly shuffled Adam in Algorithm 1. RMSprop is the special case of Adam with β1 = 0. In this work, we mainly focus on RMSprop; nevertheless, we will present a result for a special case of Adam with small β1.

Algorithm 1 Randomly Shuffled Adam

Initialize m_{1,-1} = (1/(1-β1)) ∇f(x_0) and v_{1,-1} = (1/(1-β2)) max_j {∇f_j(x_0) ∘ ∇f_j(x_0)}
for k = 1 → ∞ do
    Sample {τ_{k,0}, τ_{k,1}, ..., τ_{k,n-1}} as a random permutation of {0, 1, 2, ..., n-1}
    for i = 0 → n-1 do
        m_{k,i} = β1 m_{k,i-1} + (1 - β1) ∇f_{τ_{k,i}}
        v_{k,i} = β2 v_{k,i-1} + (1 - β2) ∇f_{τ_{k,i}} ∘ ∇f_{τ_{k,i}}
        x_{k,i+1} = x_{k,i} - η_{k·n} · m_{k,i} / (√v_{k,i} + ε)
    end for
    Break if a certain stopping criterion is satisfied.
    x_{k+1,0} = x_{k,n}, v_{k+1,-1} = v_{k,n-1}, m_{k+1,-1} = m_{k,n-1}
end for
return x

In Algorithm 1, x denotes the optimization variable, m denotes the first-order momentum, and v denotes the second-order momentum. Specifically, x_{k,i}, m_{k,i}, v_{k,i} ∈ R^d denote the values of x, m, v at the k-th outer loop and i-th inner loop, respectively. We denote by ∇f_j the gradient of f_j and let ∘ be the component-wise multiplication; the division of two vectors is component-wise as well. Moreover, we denote by η_t the step size and by β1, β2 the hyper-parameters of the algorithm. When n = 1, we obtain full-batch Adam.

We replaced the bias-correction step of (Kingma & Ba, 2015) with a special initialization of m_{1,-1} and v_{1,-1}. This initialization can also correct the bias, but leads to cleaner results. Since the effect of initialization or bias correction becomes more and more negligible as training progresses, RMSprop with zero initialization and RMSprop with our initialization have the same asymptotic behavior. We put our results for the original version of RMSprop in the appendix. As for the hyper-parameters, we choose η_t = η1/√t and fix β2 to be a constant independent of the iteration count. We allow ε to be an arbitrary non-negative constant; in particular, our result holds even for ε = 0. The constant ε is added in practice for numerical stability, and is typically chosen to be 10^-6 or even 10^-8; it is much smaller than √v_{k,i} (which is roughly the size of the gradient norm).
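The loop structure of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal rendering of ours, assuming the step size η_t = η1/√t is refreshed at every inner iteration and the per-batch gradients are supplied as a list of callables; β1 = 0 recovers RMSprop.

```python
import numpy as np

def shuffled_adam(grads, x0, eta1=0.1, beta1=0.0, beta2=0.999,
                  eps=1e-8, epochs=100, rng=None):
    """Sketch of Algorithm 1 (randomly shuffled Adam).

    grads: list of n callables, grads[j](x) returns the gradient of f_j at x.
    """
    rng = np.random.default_rng(rng)
    n = len(grads)
    x = np.asarray(x0, dtype=float).copy()
    g0 = np.stack([g(x) for g in grads])
    m = g0.sum(axis=0) / (1.0 - beta1)         # m_{1,-1}
    v = (g0 * g0).max(axis=0) / (1.0 - beta2)  # v_{1,-1}, component-wise max
    t = 0
    for _ in range(epochs):
        for j in rng.permutation(n):           # random shuffling per epoch
            t += 1
            g = grads[j](x)
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            x -= (eta1 / np.sqrt(t)) * m / (np.sqrt(v) + eps)
    return x
```

For instance, on a realizable 1-D least-squares problem with gradients x - 1 and 2(x - 1) (so that SGC holds with D0 = 0), the iterate approaches the common minimizer x = 1.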

2.1. RELATED WORK

As discussed earlier, one line of research focuses on variants of RMSprop and Adam that can be proved to converge. These works usually modify the update rule of v_t; for instance, AMSGrad (Reddi et al., 2018) and AdaFom (Chen et al., 2019) explicitly make v_t non-decreasing, while Nostalgic Adam (Huang et al., 2019) lets v_t weigh more on past gradients.

Besides modifying the algorithm, a few attempts have been made to address the non-convergence issues of the original versions, but they often rely on extra assumptions. A number of works (Zaheer et al., 2018; De et al., 2019; Défossez et al., 2020) prove the convergence of Adam under such additional assumptions. One representative work along this line, Défossez et al. (2020), establishes a clean convergence result and also provides insights into the momentum mechanism by improving the dependence of the iteration complexity on 1 - β1. However, these works assume ε to be relatively large compared to √v_{k,i}. The issue is that such a choice essentially transforms RMSprop back into SGD, since the effective step size is primarily controlled by ε rather than by √v_{k,i}. This is contrary to the spirit of RMSprop, which is to use an adaptive step size to accelerate convergence.

A few other works do not need the assumption on ε, but they have other assumptions. De et al. (2018) analyze deterministic and stochastic RMSprop, but they utilize a rather unrealistic assumption that the signs of all noisy gradients are the same, i.e., sign(∇f_p(x)) = sign(∇f_q(x)) for all p, q. Chen et al. (2019) describe a few quantities based on the iterates, and prove that if these quantities grow at a certain rate along the iterates, the algorithm converges; the drawback is that this condition cannot be checked a priori. Besides the assumptions mentioned above, all the aforementioned works require the gradient to be bounded. In general, removing boundedness assumptions (of any kind, including bounded gradients, bounded iterates, etc.) is not easy.
Thus, such results are appreciated even for basic SGD. For instance, Bertsekas & Tsitsiklis (2000) present a nice discussion of various results on inexact GD without involving conventional boundedness assumptions, and claim "bounded-assumption-free" as one of the main contributions of their work. Very recently, we noticed another work (Liu et al., 2020b) that removes the bounded gradient assumption for SGDM (SGD with momentum) and obtains satisfactory rates. Nevertheless, we are not aware of an existing result on RMSprop that does not require the bounded gradient assumption. We will explain later why removing this assumption is particularly important for our paper.

3. THE RAISON D'ÊTRE FOR β2

Figure 1 clearly demonstrates the important role of β2 in the convergence of RMSprop: a sufficiently large β2 is critical for RMSprop's convergence. Indeed, some recent works (Reddi et al., 2018; Zhou et al., 2019) have made similar arguments, but they focus on one part of the phenomenon, namely that small β2 leads to divergence. Our goal in this work is to complete the other part of the story by showing that a sufficiently large β2 guarantees convergence. The formal result will be provided in Sec. 4.

To understand the function of β2, we first discuss why RMSprop diverges. It is known that the stochastic noise due to mini-batching distorts the gradient direction, leading to possible divergence; but in standard SGD, the distortion over multiple iterations is eliminated, since the stochastic gradient is an unbiased estimate of the gradient. For RMSprop, at a given iteration the scaling constant 1/√v in the update direction may cause a larger gradient distortion than in standard SGD. The distortion can be so significant that the average update direction falls outside the dual cone of the true gradient. To illustrate this, consider the extreme case β2 = 0 and ε = 0 (i.e., signSGD) applied to the special example (1).
When applying signSGD to solve (1), each epoch consists of C iterations: one iteration moves x left, followed by C - 1 iterations that move x right. Since all step sizes are the same within one epoch, the accumulated effect of one epoch makes x move in the ascending direction instead of the descending direction.

Then why does large β2 help? Intuitively, a large β2 can control the distortion of the update directions. In the extreme case β2 = 1 and ε = 0, RMSprop reduces to SGD, where the distortion over multiple iterations is mitigated, leading to convergence. We suspect that β2 does not need to be exactly 1, and that a large β2 is enough to control the distortion. Our experiment in Figure 1 confirms that, at least for the counter-example of Reddi et al. (2018), there is an interval β2 ∈ [c, 1] on which RMSprop converges.

What was initially unclear is whether the counter-example of Reddi et al. (2018) is a very special case, or whether the convergence of large-β2-RMSprop holds for all problems. We found the real situation to be somewhat more subtle. For non-realizable problems, we discovered an example for which RMSprop cannot converge to the minimum for a wide range of β2 < 1; however, unlike in the small-β2 case, the iterates converge to a small ball around the minimum. This motivates us to distinguish three situations: divergence, convergence to a small region, and convergence to critical points. What we can prove for the general problem is (see Theorem 4.3): for small β2, RMSprop can diverge; for large β2, RMSprop must converge to a small region whose size depends on β2.

Then why do we observe convergence to a single point in the experiment for (1)? We suspect this is because problem (1) is realizable, and we conjecture that "convergence to critical points" holds for all realizable problems. We indeed prove this conjecture (see Corollary 4.1): large-β2-RMSprop converges to critical points if the problem satisfies SGC.
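The signSGD intuition above is easy to check numerically. The sketch below accumulates one epoch of sign updates on example (1) with a constant step η (an illustrative value of ours):

```python
C, eta = 10, 0.01   # C from example (1); eta is an arbitrary constant step
x = 0.0
for t in range(1, C + 1):                   # one epoch = C iterations
    g = float(C) if t % C == 1 else -1.0    # gradient of f_t (independent of x)
    x -= eta * (1.0 if g > 0 else -1.0)     # signSGD update (beta2 = 0, eps = 0)
net_move = x
# one left step of size eta, then C - 1 right steps: net (C - 2) * eta to the
# right, i.e. in the ascending direction of the epoch-averaged objective
```

With C = 10 and η = 0.01, the net displacement per epoch is +0.08: x drifts right even though the epoch-summed objective decreases to the left.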
We summarize our findings about the convergence properties of RMSprop in Table 1 .

Table 1: Outcome of shuffled RMSprop under different hyper-parameter settings.

Setting          β2 close to 1                             β2 close to 0
non-realizable   converge to small region (Thm. 4.3)       diverge
realizable       converge to critical points (Coro. 4.1)   diverge

On unconstrained problems, if the gradient norm decreases to zero after infinitely many iterations, we classify the outcome as "convergence to critical points". If the gradient norm keeps increasing to infinity during training, we classify the outcome as "divergence". We classify an outcome as "convergence to a small region" if the gradient norm falls below a certain level but does not approach 0.

Note that our results do not conflict with Theorem 3 in Reddi et al. (2018), which claims that "for any constant β1 and β2 there exists a divergent example", since here we choose β2 to be problem-dependent, just like one chooses a step size smaller than 2/L for GD, where L is a problem-dependent parameter. Another remark is that although β2 can be close to 1, RMSprop still retains the ability to adapt v to the squared gradient norm as long as β2 < 1, because a new gradient signal is added at each iteration and the impact of previous signals decays exponentially. It is this adaptivity that distinguishes RMSprop from SGD. Proving the theoretical advantage of RMSprop over SGD (i.e., that choosing β2 < 1 is better than β2 = 1) is an intriguing question; in general, the theoretical advantage of adaptive gradient methods (including RMSprop and AdaGrad) over SGD is a long-standing open question. In this work, we focus on the fundamental problem of convergence, instead of the more challenging question of justifying the advantage of RMSprop.
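The remark that β2 < 1 preserves adaptivity can be quantified: v is an exponential moving average with an effective memory of roughly 1/(1 - β2) iterations, so even β2 = 0.999 re-adapts to a new gradient scale within a few thousand steps. A small sketch (the jump from 1 to 100 is an arbitrary illustration of ours):

```python
beta2 = 0.999
v = 1.0                      # steady state when squared gradient norms equal 1
for _ in range(5000):        # about five time constants of length 1/(1 - beta2)
    g_sq = 100.0             # squared gradient norm jumps to 100
    v = beta2 * v + (1.0 - beta2) * g_sq
# v has moved from 1 to roughly 100: the old signal has decayed exponentially
```

After 5000 steps, v sits within about 1% of the new level 100, so the effective step size η/√v has re-scaled to the new gradient magnitude; with β2 = 1 (plain SGD averaging) no such re-scaling would occur.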

4. CONVERGENCE RESULTS

In this section, we present the formal theoretical results. We start from the results of full batch RMSprop/Adam, and then present the results for the stochastic versions. Note that random shuffling is not a key factor, and the proof works for other settings.

4.1. FULL-BATCH VERSION

We first consider the full-batch version of RMSprop. The following theorem shows that if we use all samples to evaluate the gradient, RMSprop with diminishing step size converges to critical points regardless of the choice of β2. Here we consider the popular step size schedule η_t = η1/√t.

Theorem 4.1. (convergence of full-batch RMSprop) For problem (3) with n = 1, assume that f is gradient Lipschitz continuous with constant L and lower bounded by f*. Then, for full-batch RMSprop (Alg. 1 with β1 = 0, ε = 0) with diminishing step size η_t = η1/√t and any β2 ∈ (0, 1), we have

min_{t ∈ (1,T]} ||∇f_t||_1 ≤ O(log T / √T),

where T > 0 is the total iteration number.

De et al. (2019) also prove the convergence of full-batch RMSprop, but they require the gradient norm to be upper bounded; in contrast, we do not need this assumption, and only require lower-boundedness and L-smoothness of f. Our result suggests that the convergence property of full-batch RMSprop is similar to that of signSGD, an algorithm that only uses the sign of the gradient as its descent direction (Bernstein et al., 2018): in the full-batch setting, signSGD (which may be called signGD) has also been proved to converge without a bounded gradient assumption.


Below, we also derive an analogous result for full-batch Adam with only one additional constraint, β1 < √β2 < 1, which is often satisfied in practice:

Theorem 4.2. (convergence of full-batch Adam) For optimization problem (3) with n = 1, assume that f is gradient Lipschitz continuous with constant L and lower bounded by f*. Then, for full-batch Adam with diminishing step size η_t = η1/√t and any β1 < √β2 < 1, we have

min_{t ∈ (1,T]} ||∇f_t||_1 ≤ O(log T / √T).
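As a sanity check on Theorem 4.1, the following sketch runs full-batch RMSprop (β1 = 0, ε = 0) with η_t = η1/√t on the toy objective f(x) = x²/2 (our choice, not an example from the paper) and records min_t ||∇f_t||_1, which should keep shrinking as T grows:

```python
import numpy as np

def min_grad_norm(T, x0=5.0, eta1=0.5, beta2=0.9):
    # Full-batch RMSprop on f(x) = x^2 / 2, so grad f(x) = x.
    x = x0
    v = x0 * x0 / (1.0 - beta2)       # v_{1,-1} initialization of Alg. 1
    best = abs(x)                     # ||grad f||_1 in one dimension
    for t in range(1, T + 1):
        g = x
        v = beta2 * v + (1.0 - beta2) * g * g
        x -= (eta1 / np.sqrt(t)) * g / np.sqrt(v)
        best = min(best, abs(x))
    return best
```

The best gradient norm seen keeps decreasing with the horizon T, and no bound on the gradient is used anywhere, only smoothness and lower-boundedness of f.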

4.2. STOCHASTIC VERSIONS

As mentioned earlier, our simulation shows that RMSprop may not converge to critical points for non-realizable problems (an example is provided in the appendix). Nevertheless, we can still show that randomly shuffled large-β2-RMSprop converges to a bounded region:

Theorem 4.3. (large-β2-RMSprop converges to a region) For problem (3), assume that f is lower bounded by f* and every ∇f_j is L-Lipschitz continuous. Furthermore, assume that (2) holds and that β2 is close enough to 1 that

T2(β2) ≤ (√2 - 1) / (2√2),    (4)

where T2(β2) is an explicit constant, given in the appendix, that depends on d, n, D1, and β2 and tends to 0 as β2 → 1. Then, for randomly shuffled RMSprop with η_t = η1/√t, we have

min_{t ∈ (1,T]} min{ ||∇f_{nt}||_1, ||∇f_{nt}||_2^2 √(D1 d / D0) } ≤ O(log T / √T) + O(Q_{3,3} √D0), ∀ T ≥ 4.

Here Q_{3,3} > 0 is a β2-dependent constant that goes to zero in the limit β2 → 1.

Remark 1. This result and the result in Reddi et al. (2018) together distinguish large-β2-RMSprop from small-β2-RMSprop: the former converges to a bounded region, while the latter can diverge. Note that there is a gap between our lower bound on β2 and the upper bound on β2 in the counter-example. We do not attempt to provide tight bounds on the threshold of β2, as our main goal is to show a qualitative difference between large-β2-RMSprop and small-β2-RMSprop.

Remark 2:

Condition (4) in Theorem 4.3 implies that 1 - β2 ≤ O(n^{-3.5}). In the appendix we introduce three problem-dependent parameters ρ1 ∈ [1, √n], ρ2 ∈ [0, n], and ρ3 ∈ [1, √n] in equations (14), (15), and (16), and improve the sufficient condition (4) to 1 - β2 ≤ O(1/(n ρ1 ρ2 ρ3)). In the worst case, this bound is O(n^{-3.5}), just like condition (4) in Theorem 4.3. In the actual training process, ρ1, ρ2, and ρ3 may not reach their upper bounds, so the threshold for β2 can be lower in practice (see Appendix A.5 for some empirical estimates of the ρi's). The dependence on the number of batches n suggests that as n increases, the required hyper-parameter β2 should be larger. This is understandable, since more mini-batches mean larger noise in the stochastic gradient, and thus a larger β2 is required. There is a gap between our theoretical bound on β2 and the empirical transition point, and it is an interesting future question to close this gap.

Remark 3. We point out three possible algorithm behaviors: divergence to infinity (divergence for short), convergence to a bounded region (non-divergence for short), and convergence to critical points. Distinguishing the three cases makes it easier to explain the qualitative difference between the small-β2 and large-β2 regimes. For non-realizable problems, the phase transition is from divergence to non-divergence. This is why it is important to discard the bounded-gradient assumption: that assumption eliminates the possibility of divergent gradients a priori. To be clear, there are actually two sub-cases of non-divergence: the iterates can stay in a bounded but huge region (the bad case), or the iterates can stay in a bounded region whose size depends on problem parameters (the good case). Indeed, the "convergence" of constant-stepsize SGD is in the sense of "converging to a region with size proportional to the noise variance".
Our result of "converging to a bounded region" is also meaningful, as the size of the region goes to zero when the noise variance goes to 0 or D0 goes to 0 (the realizable case). Note that "divergence" could also be interpreted as "not converging to critical points", which is the notion used in Reddi et al. (2018), instead of "diverging to infinity". We use the latter concept for the term "divergence", because "not converging to critical points" includes the benign case of converging to a small region around critical points (as for constant-stepsize SGD). In the example of Reddi et al. (2018), a constrained problem is considered (with bound constraint [-1, 1]), so divergence to infinity cannot happen. We add an example in which the iterates and the gradients diverge to infinity for small β2; see Appendix A.2.

As a corollary of Theorem 4.3, if the optimization problem satisfies SGC (i.e., D0 = 0), RMSprop converges to critical points.

Corollary 4.1. Suppose the assumptions of Theorem 4.3 hold. Further, assume that (2) holds with D0 = 0, i.e., Σ_{j=0}^{n-1} ||∇f_j(x)||_2^2 ≤ D1 ||∇f(x)||_2^2 for all x. Then we have

min_{t ∈ (1,T]} ||∇f_{nt}||_1 ≤ O(log T / √T), ∀ T ≥ 4.

With this corollary, the numerical result in Figure 1 should not be surprising: problem (1) satisfies the strong growth condition, so there is always a range of β2 on which RMSprop converges; we just need to tune β2 to be larger. We can prove similar convergence results for Adam with small β1 and large β2.

Theorem 4.4. For optimization problem (3), assume that f is lower bounded by f* and each f_j is gradient Lipschitz continuous with constant L. Furthermore, assume that (2) holds for all x.
Then, for randomly shuffled Adam with diminishing step size η_t = η1/√t and β1, β2 satisfying T1(β1, β2) + T2(β2) < 1 - 1/√2, we have

min_{t ∈ [1,T]} ||∇f_{nt}||_1 ≤ O(log T / √T) + O(Q_{3,5} √D0), ∀ T ≥ 4,

where Q_{3,5} is a constant that approaches 0 in the limit T1 + T2 → 0, T2 is defined in (4), and T1(β1, β2) is an explicit constant, given in the appendix, that vanishes as β1 → 0.

Remark: This result shows that controlling β2 and β1 together can ensure the convergence of Adam. We conjecture that the same convergence can be proved for a much larger range of β1, but we are not able to prove that for now (which is why we focus on RMSprop in this work) and leave it to future work.

5. EXPERIMENTS

We conduct experiments on image classification and GANs on MNIST and CIFAR-10 to support our theoretical findings. The details of the GAN experiments are in the appendix; in this section we focus on the image classification results.

We visualize the optimization trajectories when training on MNIST for small β2 = 0.8 and large β2 = 0.99 in Figure 2. We observe different behaviors: while the trajectory for β2 = 0.8 moves away from the bottom of the basin, for the larger β2 the trajectory stays in the level set and has decreasing loss values.

In the CIFAR experiments, we use ResNet-18. We choose β2 = 0.8, 0.9, 0.95, 0.99, respectively. With batch sizes 8, 16, and 32, we run each algorithm for 100 epochs without explicit regularization. Table 2 shows two phenomena: first, for a fixed batch size, there is a transition point of β2 above which the accuracy suddenly jumps; second, the transition point is closer to 1 as the batch size decreases. More specifically, for batch size 8, the transition point lies in [0.95, 0.99]: the average training accuracy is 44.53% for β2 = 0.95, but jumps to 99.74% for β2 = 0.99. For batch size 16, the transition point lies in [0.9, 0.95]: the average training accuracy is 67.27% for β2 = 0.9, but jumps to 96.38% for β2 = 0.95. For batch size 32, the transition point lies in [0.8, 0.9]. As the batch size increases from 8 to 16 and then to 32, the transition point decreases from 0.99 to 0.95 and then to 0.9. These two phenomena are consistent with our theory. The first phenomenon can be explained by Theorem 4.3 and Corollary 4.1, which state that for large enough β2, RMSprop converges.
The second phenomenon can be explained by Theorem 4.3 as well: as explained in Remark 2, the required β2 decreases as the number of mini-batches n decreases, i.e., as the batch size increases. Next, we demonstrate that the convergence speed of SGD is much slower than that of Adam under the same experimental setting as in Table 2. We compare the average training and test accuracy at the 10-th epoch. As Table 3 shows, the accuracy of Adam is much higher than that of SGD at the 10-th epoch. All code generating the experimental results is available at https://github.com/soundsinteresting/RMSprop

6. CONCLUSION

In this work, we study the convergence behavior of RMSprop by taking a closer look at its hyper-parameters. Specifically, for realizable problems, we provide a data-dependent threshold of β2 above which we prove the convergence of randomly shuffled RMSprop, and of Adam with small β1, without any bounded gradient assumption. We also show that RMSprop converges to a bounded region in non-realizable settings. These findings reveal that there is a critical threshold of β2 governing the convergence behavior of RMSprop, and the resulting phase transition is supported by our numerical experiments. Our results provide basic guidelines for tuning hyper-parameters in practice.

7. ACKNOWLEDGEMENT

M. Hong is supported by NSF grant CMMI-1727757. Ruichen Li from Peking University helped check parts of the proof of Theorem 4.3. We thank all anonymous reviewers for their feedback. We also thank Eduard Gorbunov and Juntang Zhuang for pointing out mistakes in earlier versions on OpenReview.



¹ Rigorously speaking, for the best convergence rate, the momentum coefficient should also be problem-dependent; but just for achieving convergence, it can be problem-independent.



Figure 1: Phase diagram of the outcome of RMSprop on the counter-example (1). Different marks represent different outcomes: we label a data point as convergence if the distance between x and -1 is smaller than 0.01 on average after 750000 iterations, and as divergence otherwise. For each choice of β2 there exists a counter-example, but for each counter-example in which Adam diverges, there exists a larger β2 that makes Adam converge. We fix β1 = 0. The step size is set as ηt = 1/√t.

Figure 2: Trajectories of RMSprop with large or small β2 on the loss surface. The two trajectories start from the same initialization and are trained on the same dataset (MNIST), but behave very differently. We calculate the hyper-plane spanned by the starting point of the two trajectories and their respective ending points, and then project both trajectories onto this plane. The large-β2 trajectory converges to minima, while the small-β2 trajectory diverges. The difference between the two trajectories suggests a phase transition for β2 in (0.8, 0.99).

The algorithms analyzed in Zou et al. (2019) and Chen et al. (2019) use iteration-dependent β_{2t} (and/or β_{1t}) to let v_t weigh more on past gradients. Some works add new modifications to RMSprop and Adam; for instance, Zhou et al. (2019) mitigate the bias in the update direction by using a different estimate of v_t, Dozat (2016) combines Adam with Nesterov momentum, and Liu et al. (2020a) employ a warm-up technique.

Table 2: Performance of Adam with different β2 with ResNet-18 on CIFAR-10 (100 epochs).

Table 3: Training and test accuracy at the 10-th epoch.

