QUICKLY FINDING A BENIGN REGION VIA HEAVY BALL MOMENTUM IN NON-CONVEX OPTIMIZATION

Abstract

The Heavy Ball method (Polyak, 1964), proposed over five decades ago, is a first-order method for optimizing continuous functions. While its stochastic counterpart has proven extremely popular in training deep networks, there are almost no known functions where deterministic Heavy Ball is provably faster than the simple and classical gradient descent algorithm in non-convex optimization. The success of Heavy Ball has thus far eluded theoretical understanding. Our goal is to address this gap, and in the present work we identify two non-convex problems where we provably show that Heavy Ball momentum helps the iterate enter, more quickly, a benign region that contains a global optimal point. We show that Heavy Ball exhibits simple dynamics that clearly reveal the benefit of using a larger value of the momentum parameter for these problems. The first problem is phase retrieval, which has useful applications in physical science. The second is cubic-regularized minimization, a critical subroutine required by the Nesterov-Polyak cubic-regularized method (Nesterov & Polyak (2006)) to find second-order stationary points in general smooth non-convex problems.

1. INTRODUCTION

Polyak's Heavy Ball method (Polyak (1964)) has been very popular in modern non-convex optimization and deep learning, and its stochastic version (a.k.a. SGD with momentum) has become the de facto algorithm for training neural nets. Many empirical results show that the algorithm outperforms standard SGD in deep learning (see e.g. Hoffer et al. (2017); Loshchilov & Hutter (2019); Wilson et al. (2017); Sutskever et al. (2013)), but there are almost no corresponding mathematical results that show a benefit relative to the more standard (stochastic) gradient descent. Despite its popularity, we still have very little theoretical justification for its success in non-convex optimization tasks, and Kidambi et al. (2018) established a negative result, showing that Heavy Ball momentum cannot outperform other methods on certain problems. Furthermore, even for convex problems, strongly convex, smooth, and twice differentiable functions (e.g. strongly convex quadratic functions) are one of just a handful of examples for which a provable speedup over standard gradient descent can be shown (e.g. (Lessard et al., 2016; Goh, 2017; Ghadimi et al., 2015; Gitman et al., 2019; Loizou & Richtárik, 2017; 2018; Gadat et al., 2016; Scieur & Pedregosa, 2020; Sun et al., 2019; Yang et al., 2018a; Can et al., 2019; Liu et al., 2020; Sebbouh et al., 2020; Flammarion & Bach, 2015)). There are even negative results when the function is strongly convex but not twice differentiable: Heavy Ball momentum can lead to divergence in convex optimization (see e.g. (Ghadimi et al., 2015; Lessard et al., 2016)). The algorithm's apparent success in modern non-convex optimization has remained quite mysterious. In this paper, we identify two non-convex optimization problems for which the Heavy Ball method has a provable advantage over vanilla gradient descent. The first problem is phase retrieval.
Phase retrieval has useful applications in physical science such as microscopy and astronomy (see e.g. (Candés et al., 2013), (Fannjiang & Strohmer, 2020), and (Shechtman et al., 2015)). The objective is
$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{4n}\sum_{i=1}^{n}\big((x_i^\top w)^2 - y_i\big)^2, \qquad (1)$$
where $x_i \in \mathbb{R}^d$ is the design vector and $y_i = (x_i^\top w_*)^2$ is the label of sample $i$. The goal is to recover $w_*$ up to the sign, which is not recoverable (Candés et al., 2013). Under the Gaussian design setting (i.e. $x_i \sim N(0, I_d)$), it is known that the empirical risk minimizer of (1) is $w_*$ or $-w_*$, as long as the number of samples $n$ exceeds the order of the dimension $d$ (see e.g. Bandeira et al. (2014)). Therefore, solving (1) allows one to recover the desired vector $w_* \in \mathbb{R}^d$ up to the sign. Unfortunately, the problem is non-convex, which limits our ability to efficiently find a minimizer. For this problem, there are many specialized algorithms that aim at achieving a better computational complexity and/or sample complexity to recover $w_*$ modulo the unrecoverable sign (e.g. (Cai et al., 2016; Candés & Li, 2014; Candés et al., 2015; 2013; Chen & Candés, 2017; Duchi & Ruan, 2018; Ma et al., 2017; 2018; Netrapalli et al., 2013; Qu et al., 2017; Tu et al., 2016; Wang et al., 2017a; b; Yang et al., 2018b; Zhang et al., 2017a; b; Zheng & Lafferty, 2015)). Our goal is not to provide a state-of-the-art algorithm for solving (1). Instead, we treat this problem as a starting point for understanding Heavy Ball momentum in non-convex optimization, and we hope to gain insight into why Heavy Ball (Algorithms 1 and 2) can be faster than vanilla gradient descent in non-convex optimization and deep learning in practice. If we want to understand why Heavy Ball momentum leads to acceleration on complicated non-convex problems, we should first understand it in the simplest possible settings. We provably show that Heavy Ball recovers the desired vector $w_*$, up to a sign flip, given a random isotropic initialization.
Our analysis divides the execution of the algorithm into two stages. In the first stage, the ratio of the projection of the current iterate $w_t$ on $w_*$ to the projection of $w_t$ on the perpendicular component keeps growing, which makes the iterate eventually enter a benign region that is strongly convex, smooth, twice differentiable, and contains a global optimal point. Therefore, in the second stage, Heavy Ball enjoys a linear convergence rate. Furthermore, up to a certain threshold, a larger value of the momentum parameter yields faster linear convergence than vanilla gradient descent in the second stage. Yet, most importantly, we show that Heavy Ball momentum also plays an important role in reducing the number of iterations in the first stage, which is when the iterate might be in a non-convex region. We show that the higher the momentum parameter $\beta$, the fewer iterations are spent in the first stage (see also Figure 1). Namely, momentum helps the iterate enter a benign region faster. Consequently, using a non-zero momentum parameter leads to a speedup over standard gradient descent ($\beta = 0$). Therefore, our result shows a provable acceleration, relative to vanilla gradient descent, for computing a global optimal solution in non-convex optimization. The second problem is solving a class of cubic-regularized problems,
$$\min_{w} f(w) := \frac{1}{2}w^\top A w + b^\top w + \frac{\rho}{3}\|w\|^3, \qquad (2)$$
where the matrix $A \in \mathbb{R}^{d\times d}$ is symmetric and possibly indefinite. Problem (2) is a sub-routine of the Nesterov-Polyak cubic-regularized method (Nesterov & Polyak (2006)), which aims to minimize a non-convex objective $F(\cdot)$ by iteratively solving
$$w_{t+1} = \arg\min_{w\in\mathbb{R}^d}\Big\{\nabla F(w_t)^\top (w - w_t) + \frac{1}{2}(w - w_t)^\top \nabla^2 F(w_t)(w - w_t) + \frac{\rho}{3}\|w - w_t\|^3\Big\}. \qquad (3)$$
With some additional post-processing, the iterate $w_t$ converges to an $(\epsilon_g, \epsilon_h)$ second-order stationary point, defined as $\{w : \|\nabla F(w)\| \le \epsilon_g \text{ and } \nabla^2 F(w) \succeq -\epsilon_h I_d\}$, for any small $\epsilon_g, \epsilon_h > 0$.
However, their algorithm needs to compute a matrix inverse to solve (2), which is computationally expensive when the dimension is high. A recent result due to Carmon & Duchi (2019) shows that vanilla gradient descent approximately finds the global minimum of (2) under mild conditions; gradient descent needs only Hessian-vector products, which can be computed with the same computational complexity as gradients (Pearlmutter, 1994), and is hence computationally cheaper than inverting the Hessian. Our result shows that, as in the case of phase retrieval, Heavy Ball momentum helps the iterate enter, faster than vanilla gradient descent, a benign region of (2) that contains a global optimal solution. For certain non-convex problems where it suffices to find a second-order stationary point, e.g. dictionary learning (Sun et al., 2015), matrix completion (Chi et al., 2019), robust PCA (Ge et al., 2017), and learning a neural network (Ge et al. (2019); Bai & Lee (2020)), our result may consequently find application. To summarize, our theoretical results for the two non-convex problems provably show the benefit of Heavy Ball momentum: compared to vanilla gradient descent, momentum helps to accelerate the optimization process. The key to showing the acceleration in reaching benign regions of these problems is a family of simple dynamics induced by Heavy Ball momentum. We will argue that these simple dynamics are not restricted to the two main problems considered in this paper. Specifically, the dynamics also arise naturally when solving the problem of top-eigenvector computation (Golub & Loan, 1996) and the problem of escaping saddle points (e.g. (Jin et al., 2017; Wang et al., 2020)), which suggests the broad applicability of the dynamics for analyzing Heavy Ball in non-convex optimization.
Algorithm 1: Heavy Ball method (Polyak, 1964) (Equivalent version 1)
1: Required: step size η and momentum parameter β ∈ [0, 1].
2: Init: w_0 = w_{-1} ∈ R^d.
3: for t = 0 to T do
4:    Update iterate w_{t+1} = w_t − η∇f(w_t) + β(w_t − w_{t−1}).
5: end for

Algorithm 2: Heavy Ball method (Polyak, 1964) (Equivalent version 2)
1: Required: step size η and momentum parameter β ∈ [0, 1].
2: Init: w_0 ∈ R^d and m_{−1} = 0.
3: for t = 0 to T do
4:    Update momentum m_t := β m_{t−1} + ∇f(w_t).
5:    Update iterate w_{t+1} := w_t − η m_t.
6: end for
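As a concrete illustration, here is a minimal Python sketch of both equivalent forms, run on an assumed toy quadratic $f(w) = \frac12 w^\top A w - b^\top w$ (the matrix, vector, and hyperparameter values below are illustrative choices, not from the paper); the two versions produce the same iterates up to floating-point error.

```python
import numpy as np

def heavy_ball_v1(grad, w0, eta, beta, T):
    """Version 1: w_{t+1} = w_t - eta*grad(w_t) + beta*(w_t - w_{t-1}), w_{-1} = w_0."""
    w_prev, w = w0.copy(), w0.copy()
    iterates = [w.copy()]
    for _ in range(T):
        w_next = w - eta * grad(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
        iterates.append(w.copy())
    return iterates

def heavy_ball_v2(grad, w0, eta, beta, T):
    """Version 2: m_t = beta*m_{t-1} + grad(w_t); w_{t+1} = w_t - eta*m_t, m_{-1} = 0."""
    w = w0.copy()
    m = np.zeros_like(w0)
    iterates = [w.copy()]
    for _ in range(T):
        m = beta * m + grad(w)
        w = w - eta * m
        iterates.append(w.copy())
    return iterates

# Toy strongly convex quadratic f(w) = 0.5*w^T A w - b^T w (illustrative values).
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b

w0 = np.zeros(2)
iters1 = heavy_ball_v1(grad, w0, eta=0.05, beta=0.9, T=100)
iters2 = heavy_ball_v2(grad, w0, eta=0.05, beta=0.9, T=100)
max_gap = max(np.linalg.norm(u - v) for u, v in zip(iters1, iters2))
```

On this instance the two trajectories coincide up to rounding, and both converge toward the minimizer $A^{-1}b$.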

2. MORE RELATED WORKS

Heavy Ball (HB): HB has two exactly equivalent presentations in the literature (see Algorithms 1 and 2). Given the same initialization, both algorithms generate the same sequence $\{w_t\}$. In Algorithm 2, we note that the momentum can be written as $m_t = \sum_{s=0}^{t}\beta^{t-s}\nabla f(w_s)$ and can be viewed as a weighted sum of gradients. As described in the opening paragraph, there is little theory showing a provable acceleration of the method in non-convex optimization. The only exception that we are aware of is (Wang et al., 2020). They show that HB momentum can help to escape saddle points and find a second-order stationary point faster for smooth non-convex optimization. They also observed that stochastic HB solves (1) and that using higher values of the momentum parameter $\beta$ leads to faster convergence. However, their work focused on the stochastic setting, and their main result required some assumptions on the statistical properties of the sequence of observed gradients; it is not clear whether these would hold in general. In Appendix A, we provide a more detailed literature review of HB. To summarize, current results in the literature imply that we are still very far from understanding deterministic HB in non-convex optimization, let alone the success of stochastic HB in deep learning. Hence, this work aims to make progress on a simple question: can we give a precise argument for the acceleration effect of Heavy Ball in the deterministic setting? Phase retrieval: The optimization landscape of problem (1) and its variants has been studied by (Davis et al., 2018; Soltanolkotabi, 2014; Sun et al., 2016; White et al., 2016), who show that as long as the number of samples is sufficiently large, it has no spurious local optima. We note that the problem can also be viewed as a special case of matrix sensing (e.g. Li et al. (2018); Gunasekar et al. (2017); Li & Lin (2020); Li et al. (2019); Gidel et al. (2019); You et al.
(2020)); in Appendix A, we provide a brief summary of matrix sensing. For solving phase retrieval, Mannellia et al. (2020) study gradient flow, while Chen et al. (2018) show that standard gradient descent with a simple random initialization, such as Gaussian initialization, solves (1) and recovers $w_*$ up to the sign. Tan & Vershynin (2019) show that online gradient descent with a simple random initialization converges to a global optimal point in an online setting where fresh samples are required at each step. In this paper, we show that Heavy Ball converges even faster than vanilla gradient descent. Zhou et al. (2016) propose leveraging Nesterov's momentum to solve phase retrieval. However, their approach requires a delicate and computationally expensive initialization, such as spectral initialization, so that the initial point is already within the neighborhood of a minimizer. Similarly, Xiong et al. (2018; 2020) show local convergence of Nesterov's momentum and Heavy Ball momentum for phase retrieval, but require the initial point to be in the neighborhood of an optimal point. Jin et al. (2018) propose an algorithm that uses Nesterov's momentum together with perturbations as a subroutine for finding a second-order stationary point, which could be applied to solving phase retrieval. Compared to (Zhou et al., 2016; Jin et al., 2018; Xiong et al., 2018; 2020), we consider directly applying gradient descent with Heavy Ball momentum (i.e. the HB method) to the objective function with a simple random initialization, e.g. Gaussian initialization, which is what people do in practice and what we want to understand; the goals of these works are different from ours. Finally, we note that there are efforts in integrating generative models and phase retrieval, which could help the task of image recovery (e.g. Hand et al. (2018)). Phase retrieval might also be a good entry point for understanding some observations in optimization and neural net training (e.g.
Mannellia et al. (2020)).

Figure 1: Here "1.0 → 0.9" stands for using parameter β = 1.0 in the first few iterations and then switching to β = 0.9 after that. In our experiment, for ease of implementation, the switching criterion is $\mathbb{1}\{\frac{f(w_1) - f(w_t)}{f(w_1)} \ge 0.5\}$, i.e. whether the relative decrease of the objective value from its initial value has reached 50%. Algorithms 3 and 4 in Appendix H describe the procedures. All the lines are obtained by initializing the iterate at the same point $w_0 \sim N(0, I_d/(10000d))$ and using the same step size $\eta = 5\times 10^{-4}$. Here we set $w_* = e_1$ and sample $x_i \sim N(0, I_d)$ with dimension $d = 10$ and number of samples $n = 200$. (a): Objective value (1) vs. iteration $t$. We see that the higher the momentum parameter β, the faster the algorithm enters the linear convergence regime. (b): The size of the projection of $w_t$ on $w_*$ over iterations (i.e. $|w_t|$ vs. $t$), which is non-decreasing throughout the iterations until reaching an optimal point (here, $\|w_*\| = 1$). (c): The size of the perpendicular component over iterations (i.e. $\|w_t^\perp\|$ vs. $t$), which increases in the beginning and then decreases toward zero after some point. The slope of the curve corresponding to a larger momentum parameter β is steeper than that of a smaller one, which confirms Lemma 1 and Lemma 3.

3.1. PRELIMINARIES

Following the works of Candés et al. (2013); Chen et al. (2018), we assume that the design vectors $\{x_i\}$ (which are known a priori) are drawn from the Gaussian distribution $x_i \sim N(0, I_d)$. Furthermore, without loss of generality, we assume that $w_* = e_1$ (so that $\|w_*\| = 1$), where $e_1$ is the standard unit vector whose first element is 1. We also denote $w_t := w_t[1]$ and $w_t^{\perp} := [w_t[2], \dots, w_t[d]]^\top$. That is, $w_t$ is the projection of the current iterate $w_t$ on $w_*$, while $w_t^\perp$ is the perpendicular component. Throughout the paper, the subscript $t$ indexes the iterations while the subscript $i$ indexes the samples. Before describing the main results, we would like to provide a preliminary analysis to show how momentum helps. Applying gradient descent with Heavy Ball momentum (Algorithm 1) to objective (1), the iterate is generated according to
$$w_{t+1} = w_t - \eta\,\frac{1}{n}\sum_{i=1}^{n}\big((x_i^\top w_t)^3 - (x_i^\top w_t)\,y_i\big)x_i + \beta(w_t - w_{t-1}).$$
The population counterpart of the update rule (i.e. when the number of samples $n$ is infinite) turns out to be the key to understanding momentum. The population gradient $\nabla F(w) := \mathbb{E}_{x\sim N(0, I_d)}[\nabla f(w)]$ is (proof is available in Appendix B)
$$\nabla F(w) = \big(3\|w\|^2 - 1\big)w - 2(w_*^\top w)\,w_*. \qquad (4)$$
Using the population gradient (4), we have the population update, $w_{t+1} = w_t - \eta\nabla F(w_t) + \beta(w_t - w_{t-1})$, which can be decomposed as follows:
$$w_{t+1} = \big(1 + 3\eta(1 - \|w_t\|^2)\big)w_t + \beta(w_t - w_{t-1}),$$
$$w^{\perp}_{t+1} = \big(1 + \eta(1 - 3\|w_t\|^2)\big)w^{\perp}_t + \beta(w^{\perp}_t - w^{\perp}_{t-1}). \qquad (5)$$
Assume that the random initialization satisfies $\|w_0\|^2 \le \frac13$. From the population recursive system (5), both the magnitude of the signal component $w_t$ and that of the perpendicular component $w_t^\perp$ grow exponentially in the first few iterations.

Lemma 1.
For a positive number $\theta$ and momentum parameter $\beta \in [0, 1]$, if a non-negative sequence $\{a_t\}$ satisfies $a_0 \ge a_{-1} > 0$ and, for all $t \le T$,
$$a_{t+1} \ge (1 + \theta)a_t + \beta(a_t - a_{t-1}), \qquad (6)$$
then $\{a_t\}$ satisfies
$$a_{t+1} \ge \Big(1 + \Big(1 + \frac{\beta}{1+\theta}\Big)\theta\Big)a_t, \quad \text{for every } t = 1, \dots, T+1. \qquad (7)$$
Similarly, if a non-positive sequence $\{a_t\}$ satisfies $a_0 \le a_{-1} < 0$ and, for all $t \le T$, $a_{t+1} \le (1 + \theta)a_t + \beta(a_t - a_{t-1})$, then $\{a_t\}$ satisfies
$$a_{t+1} \le \Big(1 + \Big(1 + \frac{\beta}{1+\theta}\Big)\theta\Big)a_t, \quad \text{for every } t = 1, \dots, T+1. \qquad (8)$$
One can view $a_t$ in Lemma 1 as the projection of the current iterate $w_t$ onto a vector of interest. The lemma says that with a larger value of the momentum parameter $\beta$, the magnitude of $a_t$ increases faster. It also implies that if the projection under vanilla gradient descent satisfies $a_{t+1} \ge (1+\theta)a_t$, then the magnitude of the projection only grows faster with the use of Heavy Ball momentum. The dynamics in the lemma are the key to showing that Heavy Ball momentum accelerates the process of entering a benign (convex) region. The factor $\frac{\beta}{1+\theta}\theta$ in (7) and (8) represents the contribution of momentum, and the contribution is larger for a larger value of the momentum parameter $\beta$. Now let us apply Lemma 1 to the recursive system (5), pretending for a moment that the magnitude of $w_t$ is constant. Denote $\theta_t := 3\eta(1 - \|w_t\|^2)$ and $\tilde\theta_t := \eta(1 - 3\|w_t\|^2)$, and notice that $\theta_t > \tilde\theta_t > 0$ when $\|w_t\|^2 < \frac13$. We can rewrite the recursive system as
$$w_{t+1} = (1 + \theta_t)w_t + \beta(w_t - w_{t-1}), \qquad w^{\perp}_{t+1} = (1 + \tilde\theta_t)w^{\perp}_t + \beta(w^{\perp}_t - w^{\perp}_{t-1}).$$
Since the above system is in the form of (6), the dynamics (7) and (8) in Lemma 1 suggest that the larger the momentum parameter $\beta$, the faster the growth of the magnitudes of both the signal component $w_t$ and the perpendicular component $w_t^\perp$. Moreover, the magnitude of the signal component grows faster than that of the perpendicular component. Both components grow until the size of the iterate is sufficiently large (i.e. $\|w_t\|^2 > \frac13$). After that, the magnitude of the perpendicular component $w_t^\perp$ starts decaying, while $|w_t|$ keeps growing until it approaches 1. Furthermore, the larger the momentum parameter $\beta$, the faster the decay rate of the magnitude of the perpendicular component. In other words, $|w_t|$ converges to 1 and $\|w_t^\perp\|$ converges to 0 quickly. Lemma 3 in Appendix C, which is a counterpart of Lemma 1, can be used to explain the faster decay of the magnitude under a larger value of the momentum parameter $\beta$. Using the population recursive system (5) and Lemmas 1 and 3 as tools, we obtain a high-level insight into how momentum helps (see also Figure 1): momentum helps to drive the iterate into the neighborhood of $w_*$ (or $-w_*$) faster.
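The growth dynamics of Lemma 1 can be checked numerically. The sketch below (with an arbitrary illustrative value of $\theta$, chosen by us) iterates $a_{t+1} = (1+\theta)a_t + \beta(a_t - a_{t-1})$ and verifies the claimed per-step growth factor, as well as the fact that a larger $\beta$ produces faster growth.

```python
def momentum_recursion(theta, beta, T):
    """Iterate a_{t+1} = (1 + theta)*a_t + beta*(a_t - a_{t-1}), a_{-1} = a_0 = 1."""
    a = [1.0, 1.0 + theta]   # a_0 and a_1 (the first step has zero momentum)
    for _ in range(T):
        a.append((1.0 + theta) * a[-1] + beta * (a[-1] - a[-2]))
    return a

theta, T = 0.1, 50
# Lemma 1 growth factor: 1 + (1 + beta/(1+theta)) * theta
growth = lambda beta: 1.0 + (1.0 + beta / (1.0 + theta)) * theta

results = {beta: momentum_recursion(theta, beta, T) for beta in (0.0, 0.5, 0.9)}
# Check a_{t+1} >= growth(beta) * a_t for every t >= 1 (up to float tolerance).
lemma_ok = all(
    a[t + 1] >= growth(beta) * a[t] - 1e-9
    for beta, a in results.items()
    for t in range(1, len(a) - 1)
)
```

With equality holding at the boundary case (the recursion taken with equality), the per-step bound of Lemma 1 is tight at $t = 1$ and conservative afterwards.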

3.2. MAIN RESULTS

We denote $\mathrm{dist}(w_t, w_*) := \min\{\|w_t - w_*\|_2, \|w_t + w_*\|_2\}$, the distance between the current iterate $w_t$ and $w_*$, modulo the unrecoverable sign. Note that both $\pm w_*$ achieve zero testing error. Furthermore, as long as the number of samples is sufficiently large, i.e. $n \gtrsim d \log d$, there exists a constant $\vartheta > 0$ such that the Hessian satisfies $\nabla^2 f(w) \succeq \vartheta I_d$ for all $w \in B_\zeta(\pm w_*)$ with high probability, where $B_\zeta(\pm w_*)$ denotes the balls centered at $\pm w_*$ with radius $\zeta$ (e.g. (Ma et al., 2017)). So in this paper we consider the case in which local strong convexity holds in the neighborhoods $B_\zeta(\pm w_*)$. We divide the iterations into two stages. The first stage consists of the iterations $0 \le t \le T_\zeta$, where
$$T_\zeta := \min\{t : \big||w_t| - 1\big| \le \zeta/2 \text{ and } \|w_t^\perp\| \le \zeta/2\},$$
and $\zeta > 0$ is sufficiently small that $w_{T_\zeta}$ lies in a neighborhood of $w_*$ or $-w_*$ that is smooth, twice differentiable, and strongly convex; see e.g. (Ma et al., 2017; Soltanolkotabi, 2014). In the second stage, the iterate is in this benign region from the start, which allows linear convergence to a global optimal point. That is, we have that $\mathrm{dist}(w_t, w_*) \le (1 - \nu)^{t - T_\zeta}\zeta$ for all $t \ge T_\zeta$, where $0 < \nu < 1$ is some number. Since the behavior of the momentum method in the second stage can be explained by existing results (e.g. Section 3 of Saunders (2018) or Xiong et al. (2020)), the goal is to understand why momentum helps to drive the iterate into the benign region faster. To deal with the fact that only finitely many samples are available in practice, we consider perturbations of the population dynamics (5). In particular, we consider
$$w_{t+1} = \big(1 + 3\eta(1 - \|w_t\|^2) + \eta\xi_t\big)w_t + \beta(w_t - w_{t-1}),$$
$$w^{\perp}_{t+1}[j] = \big(1 + \eta(1 - 3\|w_t\|^2) + \eta\rho_{t,j}\big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big),$$
where $\{\xi_t\}$ and $\{\rho_{t,j}\}$, for $1 \le j \le d-1$, are the perturbation terms. Furthermore, for all $t \ge T_\zeta$, the distance shrinks linearly for suitable values of $\eta$ and $\beta < 1$.
That is, we have that $\mathrm{dist}(w_t, w_*) \le (1 - \nu)^{t - T_\zeta}\zeta$, for some number $0 < \nu < 1$. The theorem states that the number of iterations required for gradient descent to enter the linear convergence regime is reduced by a factor of $(1 + c_\eta\beta)$, which clearly demonstrates that momentum helps to drive the iterate into a benign region faster. The constant $c_\eta$ suggests that the smaller the step size $\eta$, the more evident the acceleration due to momentum; the reduction can be about $1 + \beta$ for a small $\eta$. After $T_\zeta$, Heavy Ball enjoys local linear convergence to $w_*$ or $-w_*$. Specifically, if $w_0 > 0$, then $w_t > 0$ for all $t$ and the iterate converges to $w_*$; otherwise, $w_t < 0$ for all $t$ and the iterate converges to $-w_*$. The proof of Theorem 1 is in Appendix D. Remark 1: The initial point is required to satisfy $\langle w_0, w_*\rangle \gtrsim 1/\sqrt{d \log d}$ and $\|w_0\| < \frac13$. The first condition is achieved with high probability by generating $w_0$ from a Gaussian distribution or by uniformly sampling from a sphere (see e.g. Chapter 2 of (Blum et al., 2018)). The second condition can then be satisfied by scaling the size appropriately. We note that the condition that the norm of the momentum is bounded is also assumed in (Wang et al., 2020). Remark 2: Our theorem indicates that in the early stage of the optimization process, the momentum parameter $\beta$ can be as large as 1, which is also verified in the experiment (Figure 1). However, to guarantee convergence after the iterate is in the neighborhood of a global optimal solution, the parameter must satisfy $\beta < 1$ (Polyak, 1964; Lessard et al., 2016). Remark 3: The number $\nu$ in the local linear convergence rate of Heavy Ball momentum actually depends on the smoothness constant $L$ and the strong convexity constant $\mu$ in the neighborhood of a global solution, as well as on the step size $\eta$ and the momentum parameter $\beta$ (see e.g. Section 3 of (Saunders, 2018) or the original paper (Polyak, 1964)).
By setting the step size $\eta = 4/(\sqrt{L} + \sqrt{\mu})^2$ and the momentum parameter $\beta = \max\{|1 - \sqrt{\eta L}|, |1 - \sqrt{\eta\mu}|\}^2$, $\nu$ depends on the square root of the condition number, $\sqrt{\kappa} := \sqrt{L/\mu}$, instead of $\kappa := L/\mu$, which means that an optimal local convergence rate is achieved (e.g. (Bubeck, 2014)). In general, up to a certain threshold, a larger value of $\beta$ leads to a faster rate than that of standard gradient descent.
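To illustrate the two-stage behavior, the following is a rough simulation of the population recursive system (5), reduced to two scalars: the signal $s_t$ and the (signed) perpendicular magnitude $p_t$. All numeric values below ($\eta$, the initial point, $\zeta$) are our own illustrative assumptions, not the paper's constants; the simulation measures the first iteration $T_\zeta$ at which both $|s_t - 1| \le \zeta/2$ and $|p_t| \le \zeta/2$ hold.

```python
def population_dynamics(beta, eta=0.01, s0=0.01, p0=0.1, zeta=0.2, max_iter=5000):
    """Iterate the population recursion (5), with s_t the signal component and
    p_t the (signed) perpendicular magnitude:
      s_{t+1} = (1 + 3*eta*(1 - ||w_t||^2)) * s_t + beta*(s_t - s_{t-1})
      p_{t+1} = (1 +   eta*(1 - 3*||w_t||^2)) * p_t + beta*(p_t - p_{t-1})
    Return the first t with |s_t - 1| <= zeta/2 and |p_t| <= zeta/2 (a proxy
    for T_zeta), or None if it is never reached."""
    s = s_prev = s0
    p = p_prev = p0
    for t in range(max_iter):
        if abs(s - 1.0) <= zeta / 2 and abs(p) <= zeta / 2:
            return t
        norm_sq = s * s + p * p          # ||w_t||^2 in the two-scalar reduction
        s, s_prev = (1 + 3 * eta * (1 - norm_sq)) * s + beta * (s - s_prev), s
        p, p_prev = (1 + eta * (1 - 3 * norm_sq)) * p + beta * (p - p_prev), p
    return None

T_gd = population_dynamics(beta=0.0)   # vanilla gradient descent
T_hb = population_dynamics(beta=0.9)   # Heavy Ball momentum
```

Under these assumed values, the run with momentum reaches the benign region in fewer iterations, matching the qualitative claim of Theorem 1.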

4.1. NOTATIONS

We begin by introducing the notation used in this section. For the symmetric but possibly indefinite matrix $A$, we denote its eigenvalues $\lambda^{(1)}(A) \le \dots \le \lambda^{(d)}(A)$ in increasing order, where any $\lambda^{(i)}$ might be negative. We denote the eigendecomposition of $A$ as $A := \sum_{i=1}^{d}\lambda^{(i)}(A)\,v_i v_i^\top$, where the $v_i \in \mathbb{R}^d$ are orthonormal. We also denote $\gamma := -\lambda^{(1)}(A)$, $\gamma_+ := \max\{\gamma, 0\}$, and $\|A\|_2 := \max\{|\lambda^{(1)}(A)|, |\lambda^{(d)}(A)|\}$. For any vector $w \in \mathbb{R}^d$, we denote by $w^{(i)} := \langle w, v_i\rangle$ its projection on the $i$-th eigenvector of $A$. Denote $w_*$ a global minimizer of the cubic-regularized problem (2) and denote $A_* := A + \rho\|w_*\| I_d$. Previous works of Nesterov & Polyak (2006); Carmon & Duchi (2019) show that the minimizer $w_*$ has a characterization: it satisfies $\rho\|w_*\| \ge \gamma$ and $\nabla f(w_*) = A_* w_* + b = 0$. Furthermore, the minimizer $w_*$ is unique if $\rho\|w_*\| > \gamma$. In this paper, we assume that the problem has a unique minimizer, so that $\rho\|w_*\| > \gamma$. The gradient of the cubic-regularized problem (2) is
$$\nabla f(w) = Aw + b + \rho\|w\|w = A_*(w - w_*) - \rho\big(\|w_*\| - \|w\|\big)w.$$
By applying the Heavy Ball algorithm (Algorithm 1) to the cubic-regularized problem (2), we see that it generates the iterates via
$$w_{t+1} = w_t - \eta\nabla f(w_t) + \beta(w_t - w_{t-1}) = \big(I_d - \eta A - \rho\eta\|w_t\| I_d\big)w_t - \eta b + \beta(w_t - w_{t-1}).$$
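The gradient identity above can be sanity-checked numerically on a toy instance: pick any $w_*$ with $\rho\|w_*\| > \gamma$ and define $b$ so that the first-order characterization $\nabla f(w_*) = 0$ holds by construction. The matrix and vectors below are illustrative assumptions of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([-1.0, 0.5, 2.0])        # symmetric, indefinite; gamma = 1
rho = 1.0
w_star = np.array([1.2, 0.5, -0.3])  # rho * ||w_star|| ~ 1.33 > gamma
# Choose b so that grad(w_star) = (A + rho*||w_star|| I) w_star + b = 0.
b = -(A + rho * np.linalg.norm(w_star) * np.eye(3)) @ w_star

A_star = A + rho * np.linalg.norm(w_star) * np.eye(3)

def grad(w):
    """Gradient of f(w) = 0.5 w^T A w + b^T w + (rho/3)||w||^3."""
    return A @ w + b + rho * np.linalg.norm(w) * w

def grad_via_identity(w):
    """Same gradient via A_*(w - w_*) - rho(||w_*|| - ||w||) w."""
    return A_star @ (w - w_star) - rho * (np.linalg.norm(w_star) - np.linalg.norm(w)) * w

w = rng.standard_normal(3)
gap = np.linalg.norm(grad(w) - grad_via_identity(w))
```

The two expressions agree up to floating-point error, and the constructed $w_*$ is indeed a stationary point.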

4.2. ENTERING A BENIGN REGION FASTER

For the cubic-regularized problem, we define a notion of a benign region different from that for phase retrieval. The benign region here is smooth, contains the unique global optimal point $w_*$, and satisfies a notion of one-point strong convexity (Kleinberg et al., 2018; Li & Yuan, 2017; Safran et al., 2020):
$$\big\{w \in \mathbb{R}^d : \langle w - w_*, \nabla f(w)\rangle \ge \vartheta\|w - w_*\|^2\big\}, \quad \text{where } \vartheta > 0. \qquad (14)$$
We note that the standard strong convexity used in the definition of the benign region for phase retrieval implies the one-point strong convexity here, but not vice versa (Hinder et al., 2020). Previous work of Carmon & Duchi (2019) shows that if the norm of the iterate is sufficiently large, i.e. $\rho\|w\| \ge \gamma - \delta$ for any sufficiently small $\delta > 0$, the iterate is in a benign region that contains the global minimizer $w_*$. To see this, using the gradient expression of $\nabla f(w)$, we have
$$\langle w - w_*, \nabla f(w)\rangle = (w - w_*)^\top\Big(A_* + \frac{\rho}{2}\big(\|w\| - \|w_*\|\big)I_d\Big)(w - w_*) + \frac{\rho}{2}\big(\|w_*\| - \|w\|\big)^2\big(\|w\| + \|w_*\|\big). \qquad (15)$$
The first term on the r.h.s. of the equality is nonnegative if the matrix $A_* + \frac{\rho}{2}(\|w\| - \|w_*\|)I_d$ is PSD. Since $A_* \succeq (-\gamma + \rho\|w_*\|)I_d$, the matrix is PSD whenever $\rho\|w\| \ge \gamma - (\rho\|w_*\| - \gamma)$ (note that, by the characterization, $\rho\|w_*\| - \gamma > 0$). Furthermore, if the size of the iterate satisfies $\rho\|w\| > \gamma - (\rho\|w_*\| - \gamma)$ strictly, the matrix is positive definite and, consequently, $A_* + \frac{\rho}{2}(\|w\| - \|w_*\|)I_d \succeq \vartheta I_d$ for some number $\vartheta > 0$. Therefore, (15) becomes
$$\langle w - w_*, \nabla f(w)\rangle \ge \vartheta\|w - w_*\|^2, \quad \vartheta > 0. \qquad (16)$$
Therefore, the benign region of the cubic-regularized problem can be characterized as
$$B := \big\{w \in \mathbb{R}^d : \rho\|w\| > \gamma - (\rho\|w_*\| - \gamma)\big\}. \qquad (17)$$
What we are going to show is that HB with a larger value of the momentum parameter $\beta$ enters the benign region $B$ faster. We have the following theorem, which shows that the size of the iterate $\rho\|w_t\|$ grows very fast to exceed any level below $\gamma$; furthermore, the larger the momentum parameter $\beta$, the faster the growth, which shows the advantage of Heavy Ball over vanilla gradient descent.
Theorem 2. Fix any number $\delta > 0$ and define $T_\delta := \min\{t : \rho\|w_{t+1}\| \ge \gamma - \delta\}$. Suppose that the initialization satisfies $w_0^{(1)} b^{(1)} \le 0$ and set the momentum parameter $\beta \in [0, 1]$. If the step size satisfies $\eta \le \frac{1}{\|A\|_2 + \rho\|w_*\|}$, then Heavy Ball (Algorithms 1 & 2) takes at most
$$T_\delta \le \frac{2}{\eta\delta\big(1 + \frac{\beta}{1+\eta\delta}\big)}\,\log\Big(1 + \frac{\big(1 + \frac{\beta}{1+\eta\delta}\big)\gamma^2}{4\rho\,|b^{(1)}|}\Big)$$
iterations to enter the benign region $B$. Note that the case $\beta = 0$ in Theorem 2 recovers the result of Carmon & Duchi (2019), which analyzes vanilla gradient descent. The theorem implies that the higher the momentum parameter $\beta$, the faster the iterate enters the benign region in which linear convergence is possible; see also Figure 2 for the empirical results. Specifically, $\beta$ reduces the number of iterations $T_\delta$ by a factor of $(1 + \beta/(1+\eta\delta))$ (ignoring the $\beta$ inside the log factor, whose effect is small), which also implies that for a smaller step size $\eta$, the acceleration effect of momentum is more evident; the factor can be approximately $1 + \beta$ for a small $\eta$. Lastly, the condition $w_0^{(1)} b^{(1)} \le 0$ in Theorem 2 can be satisfied by $w_0 = w_{-1} = -r\,\frac{b}{\|b\|}$ for any $r > 0$. Proof. (sketch; the detailed proof is available in Appendix E) The theorem holds trivially when $\gamma \le 0$, so let us assume $\gamma > 0$. Recall the notation that $w^{(1)}$ represents the projection of $w$ on the eigenvector $v_1$ of the least eigenvalue $\lambda^{(1)}(A)$, i.e. $w^{(1)} = \langle w, v_1\rangle$. From the update, we have that
$$\frac{w^{(1)}_{t+1}}{-\eta b^{(1)}} = \big(1 + \eta\gamma - \rho\eta\|w_t\|\big)\,\frac{w^{(1)}_t}{-\eta b^{(1)}} + 1 + \beta\Big(\frac{w^{(1)}_t}{-\eta b^{(1)}} - \frac{w^{(1)}_{t-1}}{-\eta b^{(1)}}\Big). \qquad (18)$$
Denote $a_t := \frac{w^{(1)}_t}{-\eta b^{(1)}}$; in the detailed proof we show that $a_t \ge 0$ for all $t \le T_\delta$. We can rewrite (18) as
$$a_{t+1} = \big(1 + \eta\gamma - \rho\eta\|w_t\|\big)a_t + 1 + \beta(a_t - a_{t-1}) \ge (1 + \eta\delta)a_t + 1 + \beta(a_t - a_{t-1}),$$
where the inequality holds because $\rho\|w_t\| \le \gamma - \delta$ for $t \le T_\delta$. So we can now see that the dynamics are essentially in the form of (6), except that there is an additional 1 on the r.h.s. of the inequality.
Therefore, we can invoke Lemma 1 to show that the higher the momentum, the faster the iterate enters the benign region. In Appendix E, we account for the presence of the 1 on the r.h.s. and obtain a tighter bound than what Lemma 1 alone can provide. We have shown that the simple dynamics result in entering, faster, a benign region that is one-point strongly convex with respect to $w_*$. However, unlike the case of phase retrieval, we are not aware of any prior result showing that the iterate generated by Heavy Ball stays in a region with property (16) once it is there. Carmon & Duchi (2019) show that for the cubic-regularized problem, the iterate generated by vanilla gradient descent stays in the region under certain conditions, which leads to a linear convergence rate after it enters the benign region. Showing that the property holds for HB is beyond the scope of this paper, but we empirically observe that Heavy Ball stays in the region: subfigure (a) of Figure 2 shows that the norm $\|w_t\|$ is monotonically increasing for a wide range of $\beta$, which means that the iterate stays in the benign region according to (17). Assuming that the iterate stays in the region, in Appendix F we show local linear convergence of HB for which, up to a certain threshold of $\beta$, the larger the $\beta$ the better the convergence rate.
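Putting the pieces of this section together, here is a rough numeric sketch of the claim of Theorem 2. The instance (matrix, vector $b$, step size, and $\delta$) is an illustrative assumption of our own, with $b$ aligned with $v_1$ so that the iterates stay on a single axis; we measure $T_\delta$, the first iteration with $\rho\|w_t\| \ge \gamma - \delta$, for $\beta = 0$ versus $\beta = 0.9$.

```python
import numpy as np

A = np.diag([-1.0, 0.5, 2.0])   # gamma = -lambda_min(A) = 1
rho, eta, delta = 1.0, 0.05, 0.1
b = np.array([1.0, 0.0, 0.0])   # b^{(1)} != 0, aligned with v_1

def time_to_benign(beta, max_iter=10000):
    """Run HB on f(w) = 0.5 w^T A w + b^T w + (rho/3)||w||^3, starting from
    w_0 = w_{-1} = -0.1 * b/||b|| (so that w_0^{(1)} b^{(1)} <= 0), and return
    the first t with rho*||w_t|| >= gamma - delta (entry into region B)."""
    gamma = 1.0
    w_prev = w = -0.1 * b / np.linalg.norm(b)
    for t in range(max_iter):
        if rho * np.linalg.norm(w) >= gamma - delta:
            return t
        g = A @ w + b + rho * np.linalg.norm(w) * w
        w, w_prev = w - eta * g + beta * (w - w_prev), w
    return None

T_gd = time_to_benign(beta=0.0)
T_hb = time_to_benign(beta=0.9)
```

On this toy instance the momentum run crosses the threshold in roughly half the iterations of the $\beta = 0$ run, consistent with the $(1 + \beta/(1+\eta\delta))$ reduction factor.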

5. DISCUSSION AND CONCLUSION

Let us conclude with a discussion of the applicability of the simple dynamics to other non-convex optimization problems. Let $A$ be a positive semi-definite matrix. Consider applying HB to top-eigenvector computation, i.e. solving $\min_{w\in\mathbb{R}^d : \|w\|\le 1} -\frac12 w^\top A w$, which is a non-convex optimization problem as it is about maximizing a convex function. The update of HB for this objective is $w_{t+1} = (I_d + \eta A)w_t + \beta(w_t - w_{t-1})$. By projecting the iterate $w_{t+1}$ on an eigenvector $u_i$ of the matrix $A$, we have that
$$\langle w_{t+1}, u_i\rangle = (1 + \eta\lambda_i)\langle w_t, u_i\rangle + \beta\big(\langle w_t, u_i\rangle - \langle w_{t-1}, u_i\rangle\big).$$
We see that this is in the form of the simple dynamics in Lemma 1 again. So one might be able to show that the larger the momentum parameter $\beta$, the faster the top-eigenvector computation. This connection might in turn be used to show that the dynamics of HB momentum implicitly help fast escape from saddle points. In Appendix G, we provide further discussion and show some empirical evidence. We conjecture that if a non-convex optimization problem has an underlying structure like the ones in this paper, then HB might be able to exploit that structure and hence make progress faster than vanilla gradient descent.
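A quick numeric check of this claim (with an illustrative PSD matrix, step size, and horizon of our own choosing): run the HB update for the top-eigenvector objective with and without momentum, and compare the alignment of the normalized iterate with the top eigenvector.

```python
import numpy as np

A = np.diag([1.0, 2.0, 3.0])   # PSD; top eigenvector is e_3
eta, T = 0.1, 30

def alignment_after(beta):
    """Run w_{t+1} = (I + eta*A) w_t + beta*(w_t - w_{t-1}) for T steps and
    return |<w_T, e_3>| / ||w_T||, the alignment with the top eigenvector."""
    w = w_prev = np.ones(3) / np.sqrt(3.0)   # start equally aligned with all u_i
    for _ in range(T):
        w, w_prev = w + eta * (A @ w) + beta * (w - w_prev), w
    return abs(w[2]) / np.linalg.norm(w)

align_gd = alignment_after(beta=0.0)
align_hb = alignment_after(beta=0.9)
```

Both runs rotate the iterate toward the top eigenvector; the run with momentum is further along after the same number of steps, as the per-component growth rates of Lemma 1 suggest.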

A RELATED WORKS A.1 HEAVY BALL

We first note that Algorithm 1 and Algorithm 2 generate the same sequence of iterates $\{w_t\}$ given the same initialization $w_0$, the same step size $\eta$, and the same momentum parameter $\beta$. Lessard et al. (2016) analyze the Heavy Ball algorithm for strongly convex quadratic functions using tools from dynamical systems and prove its accelerated linear rate. Ghadimi et al. (2015) show an $O(1/T)$ ergodic convergence rate for general smooth convex problems, while Sun et al. (2019) show last-iterate convergence on some classes of convex problems. Nevertheless, the convergence rates of both results are not better than gradient descent. Maddison et al. (2018) and Diakonikolas & Jordan (2019) study a class of momentum methods, which includes Heavy Ball, via a continuous-time analysis. Can et al. (2019) prove an accelerated linear convergence to a stationary distribution under the Wasserstein distance for strongly convex quadratic functions. Gitman et al. (2019) analyze the stationary distribution of the iterates of a class of momentum methods that includes SGD with momentum for a quadratic function with noise, and study the conditions for its asymptotic convergence. Loizou & Richtárik (2017) show linear convergence results of the Heavy Ball method for a broad class of least-squares problems. Loizou & Richtárik (2018) study solving an average consensus problem by HB, which can be viewed as minimizing a strongly convex quadratic function. Sebbouh et al. (2020) show a convergence result of stochastic HB in a smooth convex setting and show that it can outperform SGD under the assumption that the data is interpolated. Chen & Kolar (2020) study stochastic HB under a growth condition. Yang et al. (2018a) show an $O(1/\sqrt{T})$ rate of convergence in expected gradient norm for smooth non-convex problems, but the rate is not better than SGD. Liu et al. (2020) provide an improved analysis of SGD with momentum.
They show that SGD with momentum can converge as fast as SGD in smooth non-convex settings in terms of the expected gradient norm. Krichene et al. (2020) show that in the continuous-time regime, i.e. with an infinitesimal step size, stochastic HB converges asymptotically to a stationary solution when training a one-hidden-layer network with an infinite number of neurons, but the result does not show a clear advantage over standard SGD. Lastly, Wang et al. (2020) show that Heavy Ball momentum can help to escape saddle points and find a second-order stationary point faster in smooth non-convex optimization. However, while their work focuses on the stochastic setting, their main result requires two assumptions on the statistical properties of the sequence of observed gradients, and it is not clear whether these hold in general. Specifically, they make an assumption called APAG (Almost Positively Aligned with Gradient), i.e. $\mathbb{E}_t[\langle\nabla f(w_t), m_t\rangle] \ge -\frac{1}{2}\|\nabla f(w_t)\|^2$, where $\nabla f(w_t)$ is the deterministic gradient, $g_t$ is the stochastic gradient, and $m_t$ is the stochastic momentum. They also make an assumption called APCG (Almost Positively Correlated with Gradient), i.e. $\mathbb{E}_t[\langle\nabla f(w_t), M_t m_t\rangle] \ge -c\,\eta\,\sigma_{\max}(M_t)\|\nabla f(w_t)\|^2$, where $M_t$ is a PSD matrix related to the local optimization landscape. We also note that there are negative results regarding Heavy Ball (see e.g. Lessard et al. (2016); Ghadimi et al. (2015); Kidambi et al. (2018)).

A.2 MATRIX SENSING

Problem (1) can also be viewed as a special case of matrix factorization or matrix sensing. To see this, one can rewrite (1) as $\min_{U \in \mathbb{R}^{d\times 1}} \frac{1}{4n}\sum_{i=1}^n \big(y_i - \langle A_i, UU^\top\rangle\big)^2$, where $A_i = x_i x_i^\top \in \mathbb{R}^{d\times d}$ and $\langle\cdot,\cdot\rangle$ denotes the trace inner product. Li et al. (2018) show that when the matrices $\{A_i\}$ satisfy the restricted isometry property (RIP) and $U \in \mathbb{R}^{d\times d}$, gradient descent converges to a global solution from a close-to-zero random initialization. Yet, if the matrix $A_i$ is a rank-one product, it might not satisfy RIP, and a modification of the algorithm might be required (Li et al. (2018)). Li et al. (2019), a different group of authors, show that with a carefully designed initialization (e.g. spectral initialization), gradient descent starts in a benign region and converges to a global optimal point. In our work, we do not assume that $x_i x_i^\top$ satisfies RIP, nor do we assume that a carefully designed initialization such as spectral initialization is available. Li & Lin (2020) show local convergence to an optimal solution with Nesterov's momentum for a matrix factorization problem; the initial point needs to be in a neighborhood of an optimal solution. In contrast, we study Polyak's momentum and establish global convergence to an optimal solution with a simple random initialization.
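For concreteness, the equivalence between the phase-retrieval objective and its matrix-sensing rewriting can be checked numerically; the data below are synthetic and purely illustrative:

```python
import numpy as np

# With A_i = x_i x_i^T and U = w (a d x 1 matrix), we have
# <A_i, U U^T> = trace(x_i x_i^T w w^T) = (x_i^T w)^2, so the two objectives coincide.
rng = np.random.default_rng(0)
d, n = 5, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = (X @ w_star) ** 2
w = rng.standard_normal(d)

phase_retrieval = np.mean((y - (X @ w) ** 2) ** 2) / 4
matrix_sensing = np.mean([(y[i] - np.trace(np.outer(X[i], X[i]) @ np.outer(w, w))) ** 2
                          for i in range(n)]) / 4
print(abs(phase_retrieval - matrix_sensing))  # ~0 up to floating point
```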

B POPULATION GRADIENT

Lemma 2. Assume that $\|w^*\| = 1$. Then
$$\mathbb{E}_{x\sim N(0,I_d)}\big[(x^\top w)^3 x - (x^\top w^*)^2(x^\top w)x\big] = \big(3\|w\|^2 - 1\big)w - 2(w^{*\top}w)w^*.$$

Proof. In the following, denote $Q := \mathbb{E}[(x^\top w)^3 x]$ and $R := \mathbb{E}[(x^\top w^*)^2(x^\top w)x]$. Define $h(w) := \mathbb{E}[(x^\top w)^4]$. We have $h(w) = 3\|w\|^4$, since $x^\top w \sim N(0, \|w\|^2)$ (the fourth moment of a Gaussian). Then $\nabla h(w) = 4\mathbb{E}[(x^\top w)^3 x] = 4Q$, so $Q = \frac{1}{4}\nabla h(w) = 3\|w\|^2 w$.

For the other term, define $g(w) := \mathbb{E}[(x^\top w^*)^2(x^\top w)^2]$. Given that
$$\begin{pmatrix} x^\top w^* \\ x^\top w \end{pmatrix} \sim N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}\|w^*\|^2 & w^{*\top}w \\ w^{*\top}w & \|w\|^2\end{pmatrix}\right),$$
we can write $x^\top w^* \overset{d}{=} \theta_1 x^\top w + \theta_2 z$, where $z \sim N(0,1)$ is independent of $x^\top w$, $\theta_1 := \frac{w^{*\top}w}{\|w\|^2}$, and $\theta_2^2 := \|w^*\|^2 - \frac{(w^{*\top}w)^2}{\|w\|^2}$. Then
$$g(w) = \theta_1^2\mathbb{E}[(x^\top w)^4] + 2\theta_1\theta_2\mathbb{E}[(x^\top w)^3 z] + \theta_2^2\mathbb{E}[z^2(x^\top w)^2] = 3(w^{*\top}w)^2 + \theta_2^2\,\mathbb{E}[z^2]\mathbb{E}[(x^\top w)^2] = 2(w^{*\top}w)^2 + \|w\|^2\|w^*\|^2.$$
So $\nabla g(w) = 2\mathbb{E}[(x^\top w^*)^2(x^\top w)x] = 2R$, which in turn implies $R = \frac{1}{2}\nabla g(w) = \frac{1}{2}\big(4(w^{*\top}w)w^* + 2\|w^*\|^2 w\big) = 2(w^{*\top}w)w^* + \|w^*\|^2 w$. Combining the above results, we have
$$\nabla F(w) = Q - R = 3\|w\|^2 w - \|w^*\|^2 w - 2(w^{*\top}w)w^* = \big(3\|w\|^2 - 1\big)w - 2(w^{*\top}w)w^*.$$
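As a sanity check, the closed form of Lemma 2 can be verified by Monte Carlo; the dimension, sample size, and tolerance below are illustrative:

```python
import numpy as np

# Monte Carlo check of Lemma 2: for x ~ N(0, I_d) and ||w*|| = 1,
#   E[(x^T w)^3 x - (x^T w*)^2 (x^T w) x] = (3||w||^2 - 1) w - 2 (w*^T w) w*.
rng = np.random.default_rng(0)
d, n = 3, 2_000_000

w_star = np.array([1.0, 0.0, 0.0])   # unit norm
w = np.array([0.5, -0.3, 0.2])

x = rng.standard_normal((n, d))
s, s_star = x @ w, x @ w_star        # x^T w and x^T w*
empirical = ((s**3 - s_star**2 * s)[:, None] * x).mean(axis=0)

closed_form = (3 * w @ w - 1) * w - 2 * (w_star @ w) * w_star
print(np.max(np.abs(empirical - closed_form)))  # should be small (O(1/sqrt(n)))
```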

C SIMPLE LEMMAS

Lemma 1. For a positive number $\theta$ and a momentum parameter $\beta \in [0,1]$, if a non-negative sequence $\{a_t\}$ satisfies $a_0 \ge a_{-1} > 0$ and, for all $t \le T$,
$$a_{t+1} \ge (1+\theta)a_t + \beta(a_t - a_{t-1}),$$
then $\{a_t\}$ satisfies $a_{t+1} \ge \big(1 + (1+\frac{\beta}{1+\theta})\theta\big)a_t$ for every $t = 1, \dots, T+1$. Similarly, if a non-positive sequence $\{a_t\}$ satisfies $a_0 \le a_{-1} < 0$ and, for all $t \le T$, $a_{t+1} \le (1+\theta)a_t + \beta(a_t - a_{t-1})$, then $\{a_t\}$ satisfies $a_{t+1} \le \big(1 + (1+\frac{\beta}{1+\theta})\theta\big)a_t$ for every $t = 1, \dots, T+1$.

Proof. Let us first prove the first part of the statement. In the following, denote $c := \frac{1}{1+\theta}$. For the base case $t = 1$, we have $a_2 \ge (1+\theta)a_1 + \beta(a_1 - a_0) \ge (1+\theta)a_1 + \beta c\theta a_1$, where the last inequality holds because $a_1 - a_0 \ge \theta c a_1 \iff a_1 \ge \frac{1}{1-\theta c}a_0$, and indeed $a_1 \ge (1+\theta)a_0 = \frac{1}{1-\theta c}a_0$. Now suppose the claim holds at iteration $t$: $a_{t+1} \ge (1+(1+\beta c)\theta)a_t$. At iteration $t+1$, we have $a_{t+2} \ge (1+\theta)a_{t+1} + \beta(a_{t+1} - a_t) \ge (1+\theta)a_{t+1} + \theta c\beta a_{t+1}$, where the last inequality holds because $a_{t+1} - a_t \ge \theta c a_{t+1} \iff a_{t+1} \ge \frac{1}{1-\theta c}a_t$, which follows from $a_{t+1} \ge (1+(1+\beta c)\theta)a_t \ge \frac{1}{1-\theta c}a_t$, given the induction hypothesis at $t$ and $c := \frac{1}{1+\theta}$. The second part of the statement can be proved similarly.

Lemma 3. For a positive number $\theta < 1$ and a momentum parameter $\beta \in [0,1]$ that satisfy $(1+\frac{\beta}{1-\theta})\theta < 1$, if a non-negative sequence $\{b_t\}$ satisfies $b_0 \le b_{-1}$ and, for all $t \le T$,
$$b_{t+1} \le (1-\theta)b_t + \beta(b_t - b_{t-1}),$$
then $\{b_t\}$ satisfies $b_{t+1} \le \big(1 - (1+\frac{\beta}{1-\theta})\theta\big)b_t$ for every $t = 1, \dots, T+1$. Similarly, if a non-positive sequence $\{b_t\}$ satisfies $b_0 \ge b_{-1}$ and, for all $t \le T$, $b_{t+1} \ge (1-\theta)b_t + \beta(b_t - b_{t-1})$, then $\{b_t\}$ satisfies $b_{t+1} \ge \big(1 - (1+\frac{\beta}{1-\theta})\theta\big)b_t$ for every $t = 1, \dots, T+1$.

Proof. Let us first prove the first part of the statement. In the following, denote $c := \frac{1}{1-\theta}$.
For the base case $t = 1$, we have $b_2 \le (1-\theta)b_1 + \beta(b_1 - b_0) \le (1-\theta)b_1 - \beta c\theta b_1$, where the last inequality holds because $b_1 - b_0 \le -\theta c b_1 \iff b_1 \le \frac{1}{1+\theta c}b_0$, and indeed $b_1 \le (1-\theta)b_0 = \frac{1}{1+\theta c}b_0$ due to $c = \frac{1}{1-\theta}$. Now suppose the claim holds at iteration $t$: $b_{t+1} \le (1-(1+\beta c)\theta)b_t$. At iteration $t+1$, we have $b_{t+2} \le (1-\theta)b_{t+1} + \beta(b_{t+1} - b_t) \le (1-\theta)b_{t+1} - \theta c\beta b_{t+1}$, where the last inequality holds because
$$b_{t+1} - b_t \le -\theta c b_{t+1} \iff b_{t+1} \le \frac{1}{1+\theta c}b_t, \qquad (33)$$
which follows from $b_{t+1} \le (1-(1+\beta c)\theta)b_t \le \frac{1}{1+\theta c}b_t$, by the induction hypothesis at $t$ and $c = \frac{1}{1-\theta}$. The second part of the statement can be proved similarly.
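The two lemmas can be checked numerically by iterating the recursions with equality; the parameters below are illustrative (for Lemma 3 we pick a smaller $\beta$ so that the sequence stays non-negative, as the lemma's hypotheses require):

```python
# Numerical check of Lemma 1 and Lemma 3 on the equality versions of the recursions.
theta, T = 0.1, 50

# Lemma 1: a_{t+1} = (1 + theta) a_t + beta (a_t - a_{t-1}), a_0 = a_{-1} = 1.
beta = 0.9
rate_up = 1.0 + (1.0 + beta / (1.0 + theta)) * theta
a_prev, a, hist_a = 1.0, 1.0, [1.0]
for _ in range(T):
    a_prev, a = a, (1 + theta) * a + beta * (a - a_prev)
    hist_a.append(a)
# The claimed per-step growth rate holds from t = 1 onward.
ok_growth = all(hist_a[t + 1] >= rate_up * hist_a[t] - 1e-12 for t in range(1, T))

# Lemma 3: b_{t+1} = (1 - theta) b_t + beta (b_t - b_{t-1}), b_0 = b_{-1} = 1,
# under the required condition (1 + beta / (1 - theta)) * theta < 1.
beta = 0.4
assert (1 + beta / (1 - theta)) * theta < 1
rate_down = 1.0 - (1.0 + beta / (1.0 - theta)) * theta
b_prev, b, hist_b = 1.0, 1.0, [1.0]
for _ in range(T):
    b_prev, b = b, (1 - theta) * b + beta * (b - b_prev)
    hist_b.append(b)
ok_decay = all(hist_b[t + 1] <= rate_down * hist_b[t] + 1e-12 for t in range(1, T))

print(ok_growth, ok_decay)  # True True
```

Note that both amplified rates, $1 + (1+\frac{\beta}{1+\theta})\theta$ and $1 - (1+\frac{\beta}{1-\theta})\theta$, improve monotonically in $\beta$, which is the source of the speedup in the stage analyses below.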

D PROOF OF THEOREM 1

Recall the recursive system (11):
$$w_{t+1} = \big(1 + 3\eta(1 - \|w_t\|^2) + \eta\xi_t\big)w_t + \beta(w_t - w_{t-1}),$$
$$w^{\perp}_{t+1}[j] = \big(1 + \eta(1 - 3\|w_t\|^2) + \eta\rho_{t,j}\big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big),$$
where $\{\xi_t\}$ and $\{\rho_{t,j}\}$ are perturbation terms. In the analysis, we will show that the sign of the signal component $w_t$ never changes during the execution of the algorithm. Our analysis divides the iterations of the first stage into several sub-stages. We assume that the initial point $w_0$ is small enough that the algorithm begins in Stage 1.1.
• (Stage 1.1) covers the duration in which $\|w_t\| \le \sqrt{4/9 + c_n}$, which lasts at most $T_0$ iterations, where $T_0$ is defined in Lemma 4. A by-product of our analysis shows that $|w_{T_0+1}| \ge \sqrt{4/9 + c_n}$.
• (Stage 1.2) covers the duration in which the perpendicular component $\|w^{\perp}_t\|$ is decreasing until it falls below $\zeta/2$; it consists of the iterations $T_0 \le t \le T_0 + T_b$, where $T_b$ is defined in Lemma 5.
• (Stage 1.3) covers the duration in which $|w_t|$ converges to the interval $[1-\frac{\zeta}{2}, 1+\frac{\zeta}{2}]$, if it was outside the interval; it consists of the iterations $T_0 + T_b \le t \le T_0 + T_b + T_a$, where $T_a$ is defined in Lemma 6.
In Stage 1.1, both the signal component $|w_t|$ and the perpendicular component $\|w^{\perp}_t\|$ grow in the beginning (see also Figure 1). The signal component grows exponentially, so it takes only a logarithmic number of iterations for $|w_t|$ to reach $\sqrt{4/9 + c_n}$. Furthermore, for all $t$ in this stage, $\|w_t\| \le \frac{5}{6} - \frac{c_n}{3}$. In Stage 1.2, we have $\|w_t\|^2 \ge |w_t|^2 \ge \frac{4}{9} + c_n > \frac{1}{3}$, so the perpendicular component $\|w^{\perp}_t\|$ decays while the signal component $|w_t|$ keeps growing before reaching $1$ (see also Figure 1). In particular, the perpendicular component decays exponentially, so at most
$$T_b \le \frac{\log\big(\frac{\zeta}{2\omega}\big)}{\log\big(1 - \frac{\eta}{3}(1 + c_b\beta)\big)}$$
additional iterations are needed for it to fall below $\frac{\zeta}{2}$ (i.e. $\|w^{\perp}_t\| \le \frac{\zeta}{2}$); that is, there are at most $T_b$ iterations with $\sqrt{4/9 + c_n} \le |w_t|$ and $\|w^{\perp}_t\| \ge \frac{\zeta}{2}$.
In Stage 1.3, we show that $|w_t|$ converges towards $1$ linearly, given that $\|w^{\perp}_t\| \le \frac{\zeta}{2}$. There will be at most
$$T_a := \max\left\{\frac{\log\big((1-\zeta/2)/\sqrt{4/9+c_n}\big)}{\log\big(1 + \eta\frac{\zeta}{2}(1 + c_{\zeta,g}\beta)\big)},\; \frac{\log\big(\frac{1+\zeta/2}{\omega}\big)}{\log\big(1 - \eta\zeta(1 + c_{\zeta,d}\beta)\big)}\right\}$$
iterations with $\big||w_t| - 1\big| \ge \frac{\zeta}{2}$ and $\|w^{\perp}_t\| \le \frac{\zeta}{2}$. Combining the results of Lemma 4, Lemma 5, and Lemma 6, we have
$$T \le T_0 + T_b + T_a = \frac{\log\big(\frac{\sqrt{4/9+c_n}}{|w_0|}\big)}{\log\big(1 + \eta\frac{5}{3}(1 + c_a\beta)\big)} + \frac{\log\big(\frac{\zeta}{2\omega}\big)}{\log\big(1 - \frac{\eta}{3}(1 + c_b\beta)\big)} + \max\left\{\frac{\log\big(\frac{1-\zeta/2}{\sqrt{4/9+c_n}}\big)}{\log\big(1 + \eta\frac{\zeta}{2}(1 + c_{\zeta,g}\beta)\big)}, \frac{\log\big(\frac{1+\zeta/2}{\omega}\big)}{\log\big(1 - \eta\zeta(1 + c_{\zeta,d}\beta)\big)}\right\}$$
$$\le \frac{6\log\big(\frac{\sqrt{4/9+c_n}}{|w_0|}\big)}{5\eta(1 + c_a\beta)} + \frac{6\log\big(\frac{2\omega}{\zeta}\big)}{\eta(1 + c_b\beta)} + \max\left\{\frac{2\log\big(\frac{1-\zeta/2}{\sqrt{4/9+c_n}}\big)}{\eta\zeta(1 + c_{\zeta,g}\beta)}, \frac{2\log\big(\frac{\omega}{1+\zeta/2}\big)}{\eta\zeta(1 + c_{\zeta,d}\beta)}\right\} \lesssim \frac{\log d}{\eta(1 + c_\eta\beta)},$$
where the last inequality uses that $|w_0| \gtrsim \frac{1}{\sqrt{d}\log d}$ due to the random isotropic initialization. The detailed proofs of Lemmas 4-6 are given in the following subsections.
After $t \ge T_\zeta$, the iterate enters a benign region that is locally strongly convex, smooth, twice differentiable, and contains $w^*$ (or $-w^*$) (Ma et al., 2017; White et al., 2016), which allows us to invoke existing results on gradient descent with Heavy Ball momentum to show linear convergence. In particular, the local-landscape results (e.g. Ma et al. (2017); White et al. (2016)) and the known convergence results for gradient descent with Heavy Ball momentum (e.g. Saunders (2018); Polyak (1964); Lessard et al. (2016); Xiong et al. (2020)) can be used to show that for all $t > T_\zeta$,
$$\mathrm{dist}(w_t, w^*) \le (1-\nu)^{t-T_\zeta}\,\mathrm{dist}(w_{T_\zeta}, w^*) \le (1-\nu)^{t-T_\zeta}\,\zeta,$$
for some number $0 < \nu < 1$.

Proof. (of Lemma 4) Let us first assume that $w_t > 0$ and denote $a_t := w_t$. Using that $\|w_t\| \le \sqrt{4/9 + c_n}$ in this stage, we can lower-bound the growth rate of $a_t$ as
$$a_{t+1} \ge \big(1 + 3\eta(1 - \|w_t\|^2) - \eta|\xi_t|\big)a_t + \beta(a_t - a_{t-1}) \ge \Big(1 + 3\eta\big(1 - \tfrac{4}{9} - c_n\big) - \eta c_n\Big)a_t + \beta(a_t - a_{t-1}) \ge \big(1 + \eta\tfrac{5}{3}\big)a_t + \beta(a_t - a_{t-1}) \ge \big(1 + (1 + c_a\beta)\eta\tfrac{5}{3}\big)a_t,$$
where in the last inequality we use Lemma 1 with $c_a = \frac{1}{1 + \eta\frac{5}{3}}$. Observe that $1 + (1 + c_a\beta)\eta\frac{5}{3} > 1$; consequently, the sign of $w_t$ never changes in this stage. So for $w_t$ to reach $\sqrt{4/9 + c_n}$, it takes at most
$$T_0 := \frac{\log\big(\frac{\sqrt{4/9+c_n}}{|w_0|}\big)}{\log\big(1 + (1 + c_a\beta)\eta\frac{5}{3}\big)} \le \frac{2\log\big(\frac{\sqrt{4/9+c_n}}{|w_0|}\big)}{(1 + c_a\beta)\eta\frac{5}{3}}$$
iterations. Similarly, when $w_t < 0$, we can show that after at most $T_0$ iterations, $w_t$ falls below $-\sqrt{4/9 + c_n}$. Since $\|w_t\| \ge |w_t|$, there are at most $T_0$ iterations with $\|w_t\| \le \sqrt{4/9 + c_n}$.

Lemma 5. There will be at most $T_b := \frac{\log(\frac{\zeta}{2\omega})}{\log(1 - \frac{\eta}{3}(1 + c_b\beta))}$ iterations such that $\sqrt{4/9 + c_n} \le |w_t|$ and $\|w^{\perp}_t\| \ge \frac{\zeta}{2}$.

Proof. Let $t'$ be the last iteration of the previous stage. We have $\|w_{t'}\| \ge |w_{t'}| \ge \sqrt{4/9 + c_n}$. Denote $a_t := |w_t|$. In this stage, $a_t$ keeps increasing until $\|w_t\|^2$ approaches $1$. Moreover, $w_t$ keeps the same sign as in the previous stage. Now fix an element $j \ne 1$ and denote $b_t := |w^{\perp}_t[j]|$. From Lemma 7, we know that the magnitude $b_t$ is non-increasing in this stage. Furthermore, we can show the decay of $b_t$ as follows. If $w^{\perp}_t[j], w^{\perp}_{t-1}[j] > 0$, then
$$w^{\perp}_{t+1}[j] \le \big(1 + \eta(1 - 3\|w_t\|^2) + \eta|\rho_{t,j}|\big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big) \le \Big(1 + \eta\big(1 - 3(\tfrac{4}{9} + c_n)\big) + \eta c_n\Big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big) \le \big(1 - \tfrac{\eta}{3}\big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big) \le \big(1 - \tfrac{\eta}{3}(1 + c_b\beta)\big)w^{\perp}_t[j],$$
where in the last inequality we used Lemma 3, as the condition $(1 + \frac{\beta}{1-\eta/3})\frac{\eta}{3} < 1$ is satisfied, and we denote $c_b := \frac{1}{1 - \eta/3}$.
On the other hand, if $w^{\perp}_t[j], w^{\perp}_{t-1}[j] < 0$, then
$$w^{\perp}_{t+1}[j] \ge \big(1 + \eta(1 - 3\|w_t\|^2) + \eta|\rho_{t,j}|\big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big) \ge \Big(1 + \eta\big(1 - 3(\tfrac{4}{9} + c_n)\big) + \eta c_n\Big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big) \ge \big(1 - \tfrac{\eta}{3}\big)w^{\perp}_t[j] + \beta\big(w^{\perp}_t[j] - w^{\perp}_{t-1}[j]\big) \ge \big(1 - \tfrac{\eta}{3}(1 + c_b\beta)\big)w^{\perp}_t[j],$$
where in the last inequality we used Lemma 3, as the condition $(1 + \frac{\beta}{1-\eta/3})\frac{\eta}{3} < 1$ is satisfied, and we denote $c_b := \frac{1}{1 - \eta/3}$. The inequalities (39) and (41) allow us to write $|w^{\perp}_{t+1}[j]| \le \big(1 - \frac{\eta}{3}(1 + c_b\beta)\big)|w^{\perp}_t[j]|$. Taking the square of both sides and summing over all dimensions $j \ne 1$, we have
$$\|w^{\perp}_{t+1}\|^2 \le \big(1 - \tfrac{\eta}{3}(1 + c_b\beta)\big)^2\|w^{\perp}_t\|^2.$$
Consequently, for $\|w^{\perp}_t\|$ to fall below $\zeta/2$, it takes at most
$$T_b := \frac{\log\big(\frac{\zeta/2}{\|w^{\perp}_{t'}\|}\big)}{\log\big(1 - \frac{\eta}{3}(1 + c_b\beta)\big)} \le \frac{\log\big(\frac{\zeta}{2\omega}\big)}{\log\big(1 - \frac{\eta}{3}(1 + c_b\beta)\big)}$$
iterations. Lastly, Lemma 7 implies that once the magnitude of $w^{\perp}_t[j]$ stops increasing, it remains non-increasing in this stage.

Lemma 6. There will be at most $T_a := \max\left\{\frac{\log((1-\zeta/2)/\sqrt{4/9+c_n})}{\log(1+\eta\frac{\zeta}{2}(1+c_{\zeta,g}\beta))}, \frac{\log(\frac{1+\zeta/2}{\omega})}{\log(1-\eta\zeta(1+c_{\zeta,d}\beta))}\right\}$ iterations such that $\big||w_t| - 1\big| \ge \frac{\zeta}{2}$ and $\|w^{\perp}_t\| \le \frac{\zeta}{2}$.

Proof. Denote by $t'$ the last iteration of the previous stage; we have $t' \le T_0 + T_b$. Since $w_t$ does not change sign in Stages 1.1 and 1.2, w.l.o.g. we assume $w_t > 0$ and denote $a_t := w_t$. We consider $a_t$ in two cases: $a_t \le 1 - \frac{\zeta}{2}$ and $a_t \ge 1 + \frac{\zeta}{2}$. If $a_t \le 1 - \frac{\zeta}{2}$, then for all $t$ in this stage,
$$\|w_t\|^2 \le a_t^2 + \|w^{\perp}_t\|^2 \le \big(1 - \tfrac{\zeta}{2}\big)^2 + \big(\tfrac{\zeta}{2}\big)^2 \le 1 - \tfrac{\zeta}{2},$$
for any sufficiently small $\zeta$. So $a_t$ grows as follows:
$$a_{t+1} \ge \big(1 + 3\eta(1 - \|w_t\|^2) - \eta|\xi_t|\big)a_t + \beta(a_t - a_{t-1}) \ge \big(1 + 3\eta\tfrac{\zeta}{2} - \eta c_n\big)a_t + \beta(a_t - a_{t-1}) \overset{(a)}{\ge} \big(1 + \eta\tfrac{\zeta}{2}\big)a_t + \beta(a_t - a_{t-1}) \overset{(b)}{\ge} \big(1 + \eta\tfrac{\zeta}{2}(1 + c_{\zeta,g}\beta)\big)a_t,$$
where (a) is by $\zeta \ge c_n$ and (b) is due to Lemma 1 with $c_{\zeta,g} = \frac{1}{1+\eta\frac{\zeta}{2}}$. Consequently, it takes at most
$$\frac{\log\big(\frac{1-\zeta/2}{\sqrt{4/9+c_n}}\big)}{\log\big(1 + \eta\frac{\zeta}{2}(1 + c_{\zeta,g}\beta)\big)}$$
iterations in this stage for $a_t$ to rise above $1 - \frac{\zeta}{2}$. On the other hand, if $a_t \ge 1 + \frac{\zeta}{2}$, then we can lower-bound $\|w_t\|^2$ in this stage as $\|w_t\|^2 \ge a_t^2 \ge (1 + \frac{\zeta}{2})^2$.
We have
$$a_{t+1} \le \big(1 + 3\eta(1 - \|w_t\|^2) + \eta|\xi_t|\big)a_t + \beta(a_t - a_{t-1}) \le \big(1 - 3\eta\zeta + \eta c_n\big)a_t + \beta(a_t - a_{t-1}) \overset{(a)}{\le} \big(1 - \eta\zeta\big)a_t + \beta(a_t - a_{t-1}) \overset{(b)}{\le} \big(1 - \eta\zeta(1 + c_{\zeta,d}\beta)\big)a_t,$$
where (a) uses $\zeta \ge c_n$ and (b) uses Lemma 3 with $c_{\zeta,d} := \frac{1}{1 - \eta\zeta}$. That is, $a_t$ decreases towards $1 + \zeta/2$. Denoting $\omega := \max_t |w_t|$, we see that it takes at most
$$\frac{\log\big(\frac{1+\zeta/2}{\omega}\big)}{\log\big(1 - \eta\zeta(1 + c_{\zeta,d}\beta)\big)}$$
iterations in this stage for $a_t$ to fall below $1 + \zeta/2$. Lastly, Lemma 8 implies that at the time the magnitude of $w_t$ starts decreasing, $\|w_t\|^2 \le \frac{10}{9} + \frac{1}{3}c_n$, which in turn implies $\omega := \max_t |w_t| \le \sqrt{\frac{10}{9} + \frac{1}{3}c_n}$. On the other hand, by Lemma 10, the magnitude of the perpendicular component is non-increasing in this stage, and hence $\|w^{\perp}_t\|$ stays below $\zeta/2$. A similar analysis holds for $w_t < 0$, and hence we omit the details.
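The three-stage behavior can be observed by simulating the noiseless population dynamics with the gradient from Lemma 2; the dimension, step size, and stopping tolerance below are illustrative:

```python
import numpy as np

# Population dynamics for phase retrieval with Heavy Ball:
#   w_{t+1} = w_t - eta * gradF(w_t) + beta * (w_t - w_{t-1}),
# where gradF(w) = (3||w||^2 - 1) w - 2 (w*^T w) w*  (Lemma 2, ||w*|| = 1).
# We count iterations until dist(w_t, {w*, -w*}) <= 0.01.
def grad_F(w, w_star):
    return (3 * w @ w - 1) * w - 2 * (w_star @ w) * w_star

def iters_to_benign(w0, w_star, eta, beta, tol=1e-2, max_iter=100_000):
    w_prev, w = w0.copy(), w0.copy()
    for t in range(1, max_iter + 1):
        w_next = w - eta * grad_F(w, w_star) + beta * (w - w_prev)
        w_prev, w = w, w_next
        if min(np.linalg.norm(w - w_star), np.linalg.norm(w + w_star)) <= tol:
            return t
    return max_iter

rng = np.random.default_rng(0)
d = 50
w_star = np.zeros(d); w_star[0] = 1.0
w0 = 0.01 * rng.standard_normal(d) / np.sqrt(d)   # small random initialization

t_gd = iters_to_benign(w0, w_star, eta=0.01, beta=0.0)
t_hb = iters_to_benign(w0, w_star, eta=0.01, beta=0.9)
print(t_gd, t_hb)  # Heavy Ball should reach the neighborhood of ±w* in fewer iterations
```

Both the signal-growth and the perpendicular-decay recursions are of the Lemma 1 / Lemma 3 form, so the larger momentum amplifies both rates simultaneously.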

D.4 SOME SUPPORTING LEMMAS

Lemma 7. Suppose that $\eta \le \frac{1}{36(\frac{1}{3}+c_n)\max\{c_m,1\}}$. Let $t_0$ be the first time such that $\|w_{t_0}\|^2 \ge \frac{1}{3} + c_n$. Then there exists a time $\tau \le t_0$ such that $w^{\perp}_{\tau+1}[j] \le w^{\perp}_{\tau}[j]$ if $w^{\perp}_{\tau}[j] \ge 0$; similarly, $w^{\perp}_{\tau+1}[j] \ge w^{\perp}_{\tau}[j]$ if $w^{\perp}_{\tau}[j] \le 0$. Furthermore, we have $\|w_{t_0}\|^2 \le \frac{4}{9} + c_n$.

Proof. Recall that $w^{\perp}_{t+1}[j] = \big(1 + \eta(1 - 3\|w_t\|^2) + \eta\rho_{t,j}\big)w^{\perp}_t[j] + \beta(w^{\perp}_t[j] - w^{\perp}_{t-1}[j])$. W.l.o.g., let us consider $w^{\perp}_0[j] > 0$. Assume that $w^{\perp}_t[j] \ge w^{\perp}_{t-1}[j]$ for all $t \le t_0$; otherwise there exists a time $\tau < t_0$ with $w^{\perp}_{\tau}[j] \le w^{\perp}_{\tau-1}[j]$ and we are done. Denote $\lambda_{t_0,j} := \eta(1 - 3\|w_{t_0}\|^2) + \eta\rho_{t_0,j}$. Since $\|w_{t_0}\|^2 \ge \frac{1}{3} + c_n$, we have $\lambda_{t_0,j} \le 0$. We can rewrite the dynamics as
$$\begin{pmatrix} w^{\perp}_{t_0+1}[j] \\ w^{\perp}_{t_0}[j] \end{pmatrix} = \begin{pmatrix} 1+\lambda_{t_0,j}+\beta & -\beta \\ 1 & 0 \end{pmatrix}\begin{pmatrix} w^{\perp}_{t_0}[j] \\ w^{\perp}_{t_0-1}[j] \end{pmatrix},$$
so that
$$\left\|\begin{pmatrix} w^{\perp}_{t_0+1}[j] \\ w^{\perp}_{t_0}[j] \end{pmatrix}\right\| \le \left\|\begin{pmatrix} 1+\lambda_{t_0,j}+\beta & -\beta \\ 1 & 0 \end{pmatrix}\right\|_2 \cdot \left\|\begin{pmatrix} w^{\perp}_{t_0}[j] \\ w^{\perp}_{t_0-1}[j] \end{pmatrix}\right\|.$$
We will show that $w^{\perp}_{t_0+1}[j] \le w^{\perp}_{t_0-1}[j]$ if $w^{\perp}_{t_0-1}[j] > 0$, which means that the magnitude of $w^{\perp}[j]$ has stopped increasing. It suffices to show that the spectral norm of the matrix above is not greater than $1$. Note that the roots of its characteristic equation $z^2 - (1+\lambda_{t_0,j}+\beta)z + \beta$ are $\frac{1+\lambda_{t_0,j}+\beta \pm \sqrt{(1+\lambda_{t_0,j}+\beta)^2 - 4\beta}}{2}$. If the roots are complex conjugates, their magnitude is at most $\sqrt{\beta} \le 1$. On the other hand, if the roots are real, to show that the larger root is not larger than $1$, it suffices to show $\sqrt{(1+\lambda_{t_0,j}+\beta)^2 - 4\beta} \le 1 - \lambda_{t_0,j} - \beta$, which is guaranteed if $\lambda_{t_0,j} \le 0$. To show that the smaller root is not smaller than $-1$, we need $\sqrt{(1+\lambda_{t_0,j}+\beta)^2 - 4\beta} \le 3 + \lambda_{t_0,j} + \beta$, which is guaranteed if $\lambda_{t_0,j} \ge -1$. By definition, $\|w_{t_0-1}\|^2 < \frac{1}{3} + c_n$. If $\eta \le \min\{\frac{1}{36\|w_{t_0-1}\|c_m}, \frac{1}{18c_m}\}$, then by invoking Lemma 9 we have $\|w_{t_0}\|^2 - \|w_{t_0-1}\|^2 \le \frac{1}{9}$, and consequently $\|w_{t_0}\|^2 \le \frac{4}{9} + c_n$.
Thus, by choosing $\eta \le \frac{1}{36(\frac{1}{3}+c_n)c_m}$, we have $\|w_{t_0}\|^2 \le \frac{4}{9} + c_n$. Using this upper bound on $\|w_{t_0}\|^2$ and the constraint on $\eta$, we get $\lambda_{t_0,j} \ge -\eta(\frac{1}{3} + 4c_n) \ge -1$. A similar analysis applies when $w^{\perp}_t[j]$ is negative, and is hence omitted.

Lemma 8. Suppose that $\eta \le \frac{1}{36(1+\frac{1}{3}c_n)\max\{c_m,1\}}$. Let $t_1$ be the first time (if it exists) such that $\|w_{t_1}\|^2 \ge 1 + \frac{1}{3}c_n$. Then there exists a time $\tau \le t_1$ such that $w_{\tau+1} \le w_{\tau}$ if $w_{\tau} \ge 0$; similarly, $w_{\tau+1} \ge w_{\tau}$ if $w_{\tau} \le 0$. Furthermore, we have $\|w_{t_1}\|^2 \le \frac{10}{9} + \frac{1}{3}c_n$.

Proof. Recall that $w_{t+1} = \big(1 + 3\eta(1 - \|w_t\|^2) + \eta\xi_t\big)w_t + \beta(w_t - w_{t-1})$. W.l.o.g., let us consider $w_0 > 0$. Assume that $w_t \ge w_{t-1}$ for all $t \le t_1$; otherwise there exists a time $\tau < t_1$ with $w_{\tau} \le w_{\tau-1}$ and we are done. Denote $\lambda_{t_1} := 3\eta(1 - \|w_{t_1}\|^2) + \eta\xi_{t_1}$. Since $\|w_{t_1}\|^2 \ge 1 + \frac{1}{3}c_n$, we have $\lambda_{t_1} \le 0$. We can rewrite the dynamics as
$$\begin{pmatrix} w_{t_1+1} \\ w_{t_1} \end{pmatrix} = \begin{pmatrix} 1+\lambda_{t_1}+\beta & -\beta \\ 1 & 0 \end{pmatrix}\begin{pmatrix} w_{t_1} \\ w_{t_1-1} \end{pmatrix}, \quad\text{so that}\quad \left\|\begin{pmatrix} w_{t_1+1} \\ w_{t_1} \end{pmatrix}\right\| \le \left\|\begin{pmatrix} 1+\lambda_{t_1}+\beta & -\beta \\ 1 & 0 \end{pmatrix}\right\|_2 \cdot \left\|\begin{pmatrix} w_{t_1} \\ w_{t_1-1} \end{pmatrix}\right\|.$$
The analysis essentially follows the same lines as Lemma 7. Specifically, to show that $w_{t_1+1} \le w_{t_1-1}$, it suffices to ensure that $\lambda_{t_1} := 3\eta(1 - \|w_{t_1}\|^2) + \eta\xi_{t_1} \in [-1, 0]$. We have $\lambda_{t_1} \le 0$ by the definition of $t_1$. Furthermore, by the definition of $t_1$, we have $\|w_{t_1-1}\|^2 < 1 + \frac{1}{3}c_n$. So if $\eta \le \min\{\frac{1}{36\|w_{t_1-1}\|c_m}, \frac{1}{18c_m}\}$, then by invoking Lemma 9 we have $\|w_{t_1}\|^2 - \|w_{t_1-1}\|^2 \le \frac{1}{9}$, and consequently $\|w_{t_1}\|^2 \le \frac{10}{9} + \frac{1}{3}c_n$. Thus, by choosing $\eta \le \frac{1}{36(1+\frac{1}{3}c_n)c_m}$, we have $\|w_{t_1}\|^2 \le \frac{10}{9} + \frac{1}{3}c_n$. Using this upper bound on $\|w_{t_1}\|^2$ and the constraint on $\eta$, we get $\lambda_{t_1} \ge -\eta(\frac{1}{3} + 2c_n) \ge -1$. Therefore, we have completed the proof.

Lemma 9. Assume that the norm of the momentum is bounded for all $t \le T_\zeta$, i.e. $\|m_t\| \le c_m$ for all $t \le T_\zeta$. Set the step size $\eta$ so that $\eta \le \min\{\frac{1}{36\|w_{t-1}\|c_m}, \frac{1}{18c_m}\}$. Then we have $\|w_t\|^2 - \|w_{t-1}\|^2 \le \frac{1}{9}$.

Proof.
To see this, we use the alternative presentation Algorithm 2, which shows that $w_t = w_{t-1} - \eta m_{t-1}$, where the momentum $m_{t-1}$ stands for the weighted sum of gradients up to (and including) iteration $t-1$, i.e. $m_{t-1} = \sum_{s=0}^{t-1}\beta^{t-1-s}\nabla f(w_s)$. Using this expression, we can expand $\|w_t\|^2 - \|w_{t-1}\|^2$ as
$$\|w_t\|^2 - \|w_{t-1}\|^2 = \|w_{t-1} - \eta m_{t-1}\|^2 - \|w_{t-1}\|^2 = -2\eta\langle w_{t-1}, m_{t-1}\rangle + \eta^2\|m_{t-1}\|^2 \le 2\eta\|w_{t-1}\|\|m_{t-1}\| + \eta^2\|m_{t-1}\|^2 \le \tfrac{1}{9},$$
where the last inequality holds if $\eta \le \min\{\frac{1}{36\|w_{t-1}\|c_m}, \frac{1}{18c_m}\}$.

Lemma 10. Fix an index $j$. Set $\eta \le \frac{1}{36(1+\frac{1}{3}c_n)\max\{c_m,1\}}$ and $\beta \le 1$. Suppose that $\|w_t\|^2 \ge \frac{1}{3} + c_n$. If for a number $R > 0$ we have $\left\|\begin{pmatrix} w^{\perp}_t[j] \\ w^{\perp}_{t-1}[j] \end{pmatrix}\right\| \le R$, then $\left\|\begin{pmatrix} w^{\perp}_{t+1}[j] \\ w^{\perp}_t[j] \end{pmatrix}\right\| \le R$.

Proof. The proof is similar to that of Lemma 7. Denote $\lambda_{t,j} := \eta(1 - 3\|w_t\|^2) + \eta\rho_{t,j}$. We have
$$\left\|\begin{pmatrix} w^{\perp}_{t+1}[j] \\ w^{\perp}_t[j] \end{pmatrix}\right\| \le \left\|\begin{pmatrix} 1+\lambda_{t,j}+\beta & -\beta \\ 1 & 0 \end{pmatrix}\right\|_2 \cdot \left\|\begin{pmatrix} w^{\perp}_t[j] \\ w^{\perp}_{t-1}[j] \end{pmatrix}\right\|.$$
As in the proof of Lemma 7, it remains to show that the spectral norm of the matrix is not greater than one, for which it suffices to have $\beta \le 1$ and $\lambda_{t,j} \in [-1, 0]$. By the assumption, $\|w_t\|^2 \ge \frac{1}{3} + c_n$, so $\lambda_{t,j} \le 0$. Furthermore, by Lemma 8, if the step size satisfies $\eta \le \frac{1}{36(1+\frac{1}{3}c_n)\max\{c_m,1\}}$, we have $\|w_t\|^2 \le \frac{10}{9} + \frac{1}{3}c_n$. Therefore, using this upper bound on the norm and the constraint on the step size, we obtain $\lambda_{t,j} := \eta(1 - 3\|w_t\|^2) + \eta\rho_{t,j} \ge -1$. Hence, we have completed the proof.
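The two-step recursions in Lemmas 7, 8, and 10 are all controlled through the same $2\times 2$ companion matrix. A quick numerical sweep (the grid is illustrative) confirms that its eigenvalues stay in the closed unit disk whenever $\lambda \in [-1, 0]$ and $\beta \in [0, 1]$, matching the case analysis of the characteristic roots above:

```python
import numpy as np

# Eigenvalues of the companion matrix [[1 + lam + beta, -beta], [1, 0]] are the
# roots of z^2 - (1 + lam + beta) z + beta. The proofs argue they lie in the
# unit disk for lam in [-1, 0] and beta in [0, 1]; check on a grid.
max_mag = 0.0
for lam in np.linspace(-1.0, 0.0, 101):
    for beta in np.linspace(0.0, 1.0, 101):
        M = np.array([[1.0 + lam + beta, -beta], [1.0, 0.0]])
        max_mag = max(max_mag, float(np.max(np.abs(np.linalg.eigvals(M)))))
print(max_mag)  # <= 1, up to numerical precision
```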

E PROOF OF THEOREM 2

To prove Theorem 2, we will need the following lemma.

Lemma 11. Fix any number $\delta > 0$. Define $T_\delta := \min\{t : \rho\|w_{t+1}\| \ge \gamma - \delta\}$. Assume that $\eta \le \frac{1}{\|A\|_2 + \rho\|w^*\|}$. Suppose that $w^{(1)}_0 b^{(1)} \le 0$. Then we have $w^{(1)}_t b^{(1)} \le 0$ for all $0 \le t \le T_\delta$.

Proof. The lemma holds trivially when $\gamma \le 0$, so let us assume $\gamma > 0$. Recall that Heavy Ball generates the iterates as
$$w^{(1)}_{t+1} = \big(1 - \eta\lambda^{(1)}(A) - \rho\eta\|w_t\|\big)w^{(1)}_t - \eta b^{(1)} + \beta\big(w^{(1)}_t - w^{(1)}_{t-1}\big).$$
We are going to show that for all $t$,
$$w^{(1)}_t b^{(1)} \le 0 \quad\text{and}\quad \big(w^{(1)}_t - w^{(1)}_{t-1}\big)b^{(1)} \le -c_{bw}\,w^{(1)}_t b^{(1)}, \qquad (57)$$
for any constant $c_{bw} \ge 0$. The initialization guarantees that $w^{(1)}_0 b^{(1)} \le 0$ and that $(w^{(1)}_0 - w^{(1)}_{-1})b^{(1)} = 0 \le -c_{bw}\,w^{(1)}_0 b^{(1)}$. Suppose that (57) is true at iteration $t$, and consider iteration $t+1$:
$$w^{(1)}_{t+1} b^{(1)} = \big(1 - \eta\lambda^{(1)}(A) - \rho\eta\|w_t\|\big)w^{(1)}_t b^{(1)} - \eta\big(b^{(1)}\big)^2 + \beta\big(w^{(1)}_t - w^{(1)}_{t-1}\big)b^{(1)} \le \big(1 - \eta\lambda^{(1)}(A) - \rho\eta\|w_t\| - \beta c_{bw}\big)w^{(1)}_t b^{(1)} - \eta\big(b^{(1)}\big)^2 \le 0, \qquad (58)$$
where the first inequality is by the induction hypothesis at iteration $t$ and the second one is true if $1 - \eta\lambda^{(1)}(A) - \rho\eta\|w_t\| - \beta c_{bw} \ge 0$, which gives a constraint on $\eta$:
$$1 - \eta\lambda^{(1)}(A) - \rho\eta\|w_t\| \ge c_{bw}. \qquad (59)$$
Now let us show that $(w^{(1)}_{t+1} - w^{(1)}_t)b^{(1)} \le -c_{bw}\,w^{(1)}_{t+1}b^{(1)}$, which is equivalent to showing that $w^{(1)}_{t+1}b^{(1)} \le \frac{1}{1+c_{bw}}w^{(1)}_t b^{(1)}$. From (58), it suffices to show
$$\big(1 - \eta\lambda^{(1)}(A) - \rho\eta\|w_t\| - \beta c_{bw}\big)w^{(1)}_t b^{(1)} - \eta\big(b^{(1)}\big)^2 \le \frac{1}{1+c_{bw}}w^{(1)}_t b^{(1)}. \qquad (60)$$
Since $w^{(1)}_t b^{(1)} \le 0$, a sufficient condition of the above inequality is
$$1 - \eta\lambda^{(1)}(A) - \rho\eta\|w_t\| - \beta c_{bw} - \frac{1}{1+c_{bw}} \ge 0. \qquad (61)$$
Now using $\rho\|w_t\| \le \gamma - \delta$, $\lambda^{(1)}(A) = -\gamma$, and $\frac{1}{1+x} \le 1 - \frac{1}{2}x$ for $x \in [0,1]$, it suffices to have
$$1 - \eta\lambda^{(1)}(A) - \eta(\gamma-\delta) - \beta c_{bw} - \frac{1}{1+c_{bw}} \ge 1 + \eta\delta - \beta c_{bw} - 1 + \tfrac{1}{2}c_{bw} \ge \eta\delta - \tfrac{1}{2}c_{bw} \ge 0. \qquad (62)$$
Setting $c_{bw} = 0$ satisfies the inequality. Substituting $c_{bw} = 0$ into (59), we get $\eta \le \frac{1}{\lambda^{(1)}(A) + \rho\|w_t\|}$. Recall that $\rho\|w_t\| \le \gamma - \delta$ for all $t \le T_\delta$, so it suffices that $\eta \le \frac{1}{\|A\|_2 + (\gamma-\delta)\cdot\mathbb{1}_{\gamma-\delta \ge 0}}$.
Furthermore, by using $\rho\|w^*\| > \gamma$, it suffices to have $\eta \le \frac{1}{\|A\|_2 + \rho\|w^*\|}$. We have completed the proof.

Given Lemma 11, we are ready to prove Theorem 2.

Proof. (of Theorem 2) The theorem holds trivially when $\gamma \le 0$, so let us assume $\gamma > 0$. Recall the notation that $w^{(1)}$ represents the projection of $w$ on the eigenvector $v_1$ of the least eigenvalue $\lambda^{(1)}(A)$, i.e. $w^{(1)} = \langle w, v_1\rangle$. From the update rule, we have
$$\frac{w^{(1)}_{t+1}}{-\eta b^{(1)}} = \big(1 + \eta\gamma - \rho\eta\|w_t\|\big)\frac{w^{(1)}_t}{-\eta b^{(1)}} + 1 + \beta\left(\frac{w^{(1)}_t}{-\eta b^{(1)}} - \frac{w^{(1)}_{t-1}}{-\eta b^{(1)}}\right). \qquad (63)$$
Denote $a_t := \frac{w^{(1)}_t}{-\eta b^{(1)}}$. We can rewrite (63) as
$$a_{t+1} = \big(1 + \eta\gamma - \rho\eta\|w_t\|\big)a_t + 1 + \beta(a_t - a_{t-1}) \ge (1 + \eta\delta)a_t + 1 + \beta(a_t - a_{t-1}),$$
where the inequality is due to $\rho\|w_t\| \le \gamma - \delta$ for $t \le T_\delta$. Now we are going to show that $a_{t+1} \ge \big(1 + \eta\delta + \frac{\eta\delta}{1+\eta\delta}\beta\big)a_t + 1$. For this inequality to hold, it suffices to show that $a_t - a_{t-1} \ge \frac{\eta\delta}{1+\eta\delta}a_t$, i.e. $a_t \ge \frac{1}{1-\eta\delta/(1+\eta\delta)}a_{t-1}$. The base case $t = 1$ holds because $a_1 \ge (1+\eta\delta)a_0 + 1 \ge \frac{1}{1-\eta\delta/(1+\eta\delta)}a_0$. Suppose that at iteration $t$ we have $a_t \ge \frac{1}{1-\eta\delta/(1+\eta\delta)}a_{t-1}$. For $t+1$, we have $a_{t+1} \ge (1+\eta\delta)a_t + 1 + \beta(a_t - a_{t-1}) \ge (1+\eta\delta)a_t + 1 \ge \frac{1}{1-\eta\delta/(1+\eta\delta)}a_t$, where the second-to-last inequality holds because $a_t \ge \frac{1}{1-\eta\delta/(1+\eta\delta)}a_{t-1}$ implies $a_t \ge a_{t-1}$. Therefore, we have completed the induction. So we have shown that
$$\frac{w^{(1)}_{t+1}}{-\eta b^{(1)}} \ge \Big(1 + \eta\delta + \frac{\eta\delta}{1+\eta\delta}\beta\Big)\frac{w^{(1)}_t}{-\eta b^{(1)}} + 1.$$
Recursively expanding the inequality, we have
$$\frac{w^{(1)}_{t+1}}{-\eta b^{(1)}} \ge \Big(1 + \eta\delta + \tfrac{\eta\delta}{1+\eta\delta}\beta\Big)^2\frac{w^{(1)}_{t-1}}{-\eta b^{(1)}} + \Big(1 + \eta\delta + \tfrac{\eta\delta}{1+\eta\delta}\beta\Big) + 1 \ge \dots \ge \frac{1}{\eta\delta\big(1+\frac{\beta}{1+\eta\delta}\big)}\left(\Big(1 + \eta\delta + \tfrac{\eta\delta}{1+\eta\delta}\beta\Big)^t - 1\right). \qquad (65)$$
Therefore,
$$\frac{\gamma-\delta}{\rho} \overset{(a)}{\ge} \|w_{T_\delta}\| \ge \big|w^{(1)}_{T_\delta}\big| \overset{(b)}{\ge} \frac{|b^{(1)}|}{\delta + \delta\beta/(1+\eta\delta)}\left(\Big(1 + \eta\delta + \tfrac{\eta\delta}{1+\eta\delta}\beta\Big)^{T_\delta} - 1\right),$$
where (a) uses that $\rho\|w_t\| \le \gamma - \delta$ for $t \le T_\delta$, and (b) uses (65) and Lemma 11, which guarantees $w^{(1)}_t b^{(1)} \le 0$.
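The scalar recursion at the heart of this proof can be checked numerically; the parameters below are illustrative:

```python
# The proof of Theorem 2 reduces the escape analysis to the scalar recursion
#   a_{t+1} = (1 + eta*delta) a_t + 1 + beta (a_t - a_{t-1}),  a_0 = a_{-1} = 0,
# and lower-bounds a_t by (r^t - 1)/(r - 1) with
#   r = 1 + eta*delta + eta*delta*beta/(1 + eta*delta).
def run(eta, delta, beta, T):
    a_prev, a, out = 0.0, 0.0, [0.0]
    for _ in range(T):
        a_prev, a = a, (1 + eta * delta) * a + 1 + beta * (a - a_prev)
        out.append(a)
    return out

eta, delta, T = 0.01, 1.0, 300
for beta in (0.0, 0.5, 0.9):
    r = 1 + eta * delta + eta * delta * beta / (1 + eta * delta)
    a = run(eta, delta, beta, T)
    lower = [(r**t - 1) / (r - 1) for t in range(T + 1)]
    assert all(a[t] >= lower[t] - 1e-6 * (1 + lower[t]) for t in range(T + 1))

# Larger beta grows the escaping component faster.
a0, a9 = run(eta, delta, 0.0, T), run(eta, delta, 0.9, T)
print(a9[-1] > a0[-1])  # True
```

Since the base of the geometric growth increases with $\beta$, the bound on $T_\delta$ below shrinks as the momentum parameter grows.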
Consequently,
$$T_\delta \le \frac{\log\Big(1 + \frac{(\gamma-\delta)(\delta+\delta\beta/(1+\eta\delta))}{\rho|b^{(1)}|}\Big)}{\log\big(1 + \eta\delta + \frac{\eta\delta}{1+\eta\delta}\beta\big)} \overset{(a)}{\le} \frac{2}{\eta\delta\big(1+\beta/(1+\eta\delta)\big)}\log\Big(1 + \frac{(\gamma-\delta)\delta\big(1+\beta/(1+\eta\delta)\big)}{\rho|b^{(1)}|}\Big) \overset{(b)}{\le} \frac{2}{\eta\delta\big(1+\beta/(1+\eta\delta)\big)}\log\Big(1 + \frac{\gamma^2\big(1+\beta/(1+\eta\delta)\big)}{4\rho|b^{(1)}|}\Big).$$

F PROOF OF THEOREM 3

Suppose that there is a number $R$ such that the size of the iterate satisfies $\|w_t\| \le R$ for all $t$ during the execution of the algorithm, and that $\|w^*\| \le R$. Denote $L := \|A\|_2 + 2\rho R$. Also, suppose that the momentum parameter $\beta \in [0, 0.65]$ and that the step size $\eta$ satisfies
$$\eta \le \min\left\{\frac{1}{4\big(\|A_*\| + 26\rho R\big)},\; \frac{c_0 c_\delta\big(\rho\|w^*\| - \gamma\big)}{\big(50L + 2(26)^2\big)\big(\|A_*\| + \rho R\big)\big(1 + \|A_*\| + \rho R\big) + 1.3L}\right\},$$
where $c_\delta$ and $c_0$ are constants satisfying $0.5 > c_\delta > 0$ and $c_0 > 0$. Set $w_0 = w_{-1} = -r\frac{b}{\|b\|}$ for any sufficiently small $r > 0$. Then, in the benign region, it takes at most
$$T := \frac{1}{\eta c_\delta\big(\rho\|w^*\| - \gamma\big)\big(1 + \beta c_{\mathrm{converge}}\big)}\log\Big(\frac{\big(\|A\|_2 + 2\rho R\big)c^2}{\epsilon}\Big)$$
iterations to reach an $\epsilon$-approximate error, where $c := 4R^2(1 + \eta\bar{C})$ with $\bar{C} = \frac{4}{3}c_0\big(\rho\|w^*\| - \gamma\big) + 3c_0 c_\delta\big(\rho\|w^*\| - \gamma\big) + 10\max\{0, \gamma\}$. Note that the constraint on $\eta$ ensures that $\eta c_\delta(\rho\|w^*\| - \gamma) \le \frac{1}{8}$; as a consequence, $c_{\mathrm{converge}} \le 1$. Theorem 3 indicates that, up to an upper threshold, a larger value of $\beta$ reduces the number of iterations needed to converge linearly to an $\epsilon$-optimal point, and hence leads to faster convergence. Let us now make a few remarks. First, we want to emphasize that in the linear-convergence regime, a constant-factor improvement of the convergence rate (i.e. of $T$ here) means that the slope of the curve in the log plot of optimization error versus iteration is steeper. Our experimental result (Figure 2) confirms this: in the figure, the curve corresponding to a larger momentum parameter $\beta$ has a steeper slope than those of the smaller ones. The slope steepens as $\beta$ increases, which justifies the effectiveness of the momentum in the linear-convergence regime. Our theoretical result also indicates that the acceleration due to the use of momentum is more evident for a small step size $\eta$.
When $\eta$ is sufficiently small, the number $c_{\mathrm{converge}}$ is close to $1$, which means the number of iterations can be reduced approximately by a factor of $1+\beta$. Secondly, the theorem requires $|b^{(1)}|$ to be non-zero, which can be guaranteed with high probability by adding a small Gaussian perturbation to $b$; we refer the reader to Section 4.2 of Carmon & Duchi (2019) for this technique. To prove Theorem 3, we will need a series of lemmas, which are given in the following subsection.

F.1 SOME SUPPORTING LEMMAS

Lemma 12. Denote $c_w := \frac{2\beta}{1-\beta}$ and $\bar{c}_w := (1+c_w) + (1+c_w)^2$, and define the sequences
$$d_t := \tfrac{\rho}{2}\big(\|w^*\| - \|w_t\|\big)\|w_t - w^*\|^2, \qquad e_t := -(w_t - w^*)^\top\big(A_* - 2\eta\bar{c}_w A_*^2\big)(w_t - w^*),$$
$$g_t := -\tfrac{\rho}{2}\big(\|w^*\| - \|w_t\|\big)^2\big(\|w^*\| + \|w_t\| - 4\eta\rho\bar{c}_w\|w_t\|^2\big).$$
For all $t$, we have
$$\langle w_t - w^*, w_t - w_{t-1}\rangle + c_w\|w_t - w_{t-1}\|^2 \le \eta\sum_{s=1}^{t-1}\beta^{t-1-s}(d_s + e_s + g_s).$$

Proof. We use induction for the proof. The base case $t = 0$ holds, because $w_0 = w_{-1}$ by initialization and both sides of the inequality are $0$. Let us assume that the claim holds at iteration $t$, i.e.
$$\langle w_t - w^*, w_t - w_{t-1}\rangle + c_w\|w_t - w_{t-1}\|^2 \le \eta\sum_{s=1}^{t-1}\beta^{t-1-s}(d_s + e_s + g_s).$$
Consider iteration $t+1$. We want to prove that $\langle w_{t+1} - w^*, w_{t+1} - w_t\rangle + c_w\|w_{t+1} - w_t\|^2 \le \eta\sum_{s=1}^{t}\beta^{t-s}(d_s + e_s + g_s)$. Denote $\Delta := w_{t+1} - w_t = -\eta\nabla f(w_t) + \beta(w_t - w_{t-1})$. The claim is equivalent to $\langle\Delta, w_t + \Delta - w^*\rangle + c_w\|\Delta\|^2 \le \eta\sum_{s=1}^{t}\beta^{t-s}(d_s + e_s + g_s)$, which, expanding $\Delta$, is in turn equivalent to showing that
$$\underbrace{-\eta\langle\nabla f(w_t), w_t - w^*\rangle}_{(a)} + \underbrace{\eta^2(1+c_w)\|\nabla f(w_t)\|^2}_{(b)} \underbrace{- 2\eta\beta(1+c_w)\langle\nabla f(w_t), w_t - w_{t-1}\rangle}_{(c)} + \beta\langle w_t - w_{t-1}, w_t - w^*\rangle + \beta^2\|w_t - w_{t-1}\|^2 + c_w\beta^2\|w_t - w_{t-1}\|^2 \le \eta\sum_{s=1}^{t}\beta^{t-s}(d_s + e_s + g_s).$$
For term (a), we have
$$\langle w_t - w^*, \nabla f(w_t)\rangle = (w_t - w^*)^\top A_*(w_t - w^*) + \rho\big(\|w_t\| - \|w^*\|\big)\big(\|w_t\|^2 - w^{*\top}w_t\big) = (w_t - w^*)^\top\Big(A_* + \tfrac{\rho}{2}\big(\|w_t\| - \|w^*\|\big)I_d\Big)(w_t - w^*) + \tfrac{\rho}{2}\big(\|w^*\| - \|w_t\|\big)^2\big(\|w_t\| + \|w^*\|\big).$$
Notice that $A_* \succeq -\gamma I_d$, so $A_* + \rho\|w^*\|I_d \succeq 0$, as $\rho\|w^*\| \ge \gamma$. On the other hand, for term (b), we get
$$\|\nabla f(w_t)\|^2 = \big\|A_*(w_t - w^*) - \rho\big(\|w^*\| - \|w_t\|\big)w_t\big\|^2 \le 2(w_t - w^*)^\top A_*^2(w_t - w^*) + 2\rho^2\big(\|w^*\| - \|w_t\|\big)^2\|w_t\|^2.$$
For term (c), we can bound it as
$$-2\eta\beta(1+c_w)\langle\nabla f(w_t), w_t - w_{t-1}\rangle \le \big\|\sqrt{2}\eta(1+c_w)\nabla f(w_t)\big\|\cdot\big\|\sqrt{2}\beta(w_t - w_{t-1})\big\| \le \tfrac{1}{2}\Big(2\eta^2(1+c_w)^2\|\nabla f(w_t)\|^2 + 2\beta^2\|w_t - w_{t-1}\|^2\Big) = \eta^2(1+c_w)^2\|\nabla f(w_t)\|^2 + \beta^2\|w_t - w_{t-1}\|^2.$$
Combining the above, we have
$$-\eta\langle\nabla f(w_t), w_t - w^*\rangle + \eta^2(1+c_w)\|\nabla f(w_t)\|^2 - 2\eta\beta(1+c_w)\langle\nabla f(w_t), w_t - w_{t-1}\rangle + \beta\langle w_t - w_{t-1}, w_t - w^*\rangle + \beta^2\|w_t - w_{t-1}\|^2 + c_w\beta^2\|w_t - w_{t-1}\|^2$$
$$\le -\eta\langle\nabla f(w_t), w_t - w^*\rangle + \eta^2\bar{c}_w\|\nabla f(w_t)\|^2 + \beta\langle w_t - w_{t-1}, w_t - w^*\rangle + \beta^2(2+c_w)\|w_t - w_{t-1}\|^2$$
$$\le -\eta(w_t - w^*)^\top\big(A_* - 2\eta\bar{c}_w A_*^2\big)(w_t - w^*) - \eta\tfrac{\rho}{2}\big(\|w^*\| - \|w_t\|\big)^2\big(\|w^*\| + \|w_t\| - 4\eta\rho\bar{c}_w\|w_t\|^2\big) + \eta\tfrac{\rho}{2}\big(\|w^*\| - \|w_t\|\big)\|w_t - w^*\|^2 + \underbrace{\beta\langle w_t - w_{t-1}, w_t - w^*\rangle + \beta c_w\|w_t - w_{t-1}\|^2}_{=\beta\left(\langle w_t - w_{t-1}, w_t - w^*\rangle + c_w\|w_t - w_{t-1}\|^2\right)}$$
$$\le \eta(d_t + e_t + g_t) + \eta\beta\sum_{s=1}^{t-1}\beta^{t-1-s}(d_s + e_s + g_s) = \eta\sum_{s=1}^{t}\beta^{t-s}(d_s + e_s + g_s),$$
where in the second-to-last inequality we used $\beta(2+c_w) = c_w$ and in the last inequality we used the induction hypothesis at iteration $t$. So we have completed the induction.

Lemma 13. Assume that $\|w_t\| \le R$ for all $t$, for some number $R$. With the sequences $d_t$, $e_t$, $g_t$ as defined in Lemma 12, we have
$$\|w_{t+1} - w^*\|^2 \le \Big(1 - \eta(1+\beta)\big(\rho\|w_t\| - (\gamma - \tfrac{\rho\|w^*\| - \gamma}{2})\big)\Big)\|w_t - w^*\|^2 + \eta^2(w_{t-1} - w^*)^\top\big(2c_\beta A_*^2 + 2\beta^2 L I_d + 2c_\beta\rho^2 R^2 I_d\big)(w_{t-1} - w^*) + \eta^2(c_\beta - 2c_w)\|z_{t-2}\|^2 + 2\eta\beta^2\sum_{s=0}^{t-2}\beta^{t-2-s}(d_s + e_s + g_s).$$

Proof. Recall that the Heavy Ball update is $w_{t+1} = w_t - \eta\nabla f(w_t) + \beta(w_t - w_{t-1})$. So the distance term can be decomposed as
$$\|w_{t+1} - w^*\|^2 = \|w_t - \eta\nabla f(w_t) + \beta(w_t - w_{t-1}) - w^*\|^2 = \|w_t - w^*\|^2 - 2\eta\langle w_t - w^*, \nabla f(w_t)\rangle + \eta^2\|\nabla f(w_t)\|^2 + \beta^2\|w_t - w_{t-1}\|^2 - 2\eta\beta(w_t - w_{t-1})^\top\nabla f(w_t) + 2\beta(w_t - w^*)^\top(w_t - w_{t-1})$$
$$= \|w_t - w^*\|^2 \underbrace{- 2\eta(1+\beta)\langle w_t - w^*, \nabla f(w_t)\rangle}_{(a)} + \underbrace{\eta^2\|\nabla f(w_t)\|^2}_{(b)} + (\beta^2 + 2\beta)\|w_t - w_{t-1}\|^2 - 2\eta\beta\langle w^* - w_{t-1}, \nabla f(w_t)\rangle + 2\beta(w_{t-1} - w^*)^\top(w_t - w_{t-1}).$$
(75)
For term (a), $\langle w_t - w^*, \nabla f(w_t)\rangle$, using that $\nabla f(w) = A_*(w - w^*) - \rho(\|w^*\| - \|w\|)w$, we can bound it as
$$\langle w_t - w^*, \nabla f(w_t)\rangle = (w_t - w^*)^\top A_*(w_t - w^*) + \rho\big(\|w_t\| - \|w^*\|\big)\big(\|w_t\|^2 - w^{*\top}w_t\big) = (w_t - w^*)^\top\Big(A_* + \tfrac{\rho}{2}\big(\|w_t\| - \|w^*\|\big)I_d\Big)(w_t - w^*) + \tfrac{\rho}{2}\big(\|w^*\| - \|w_t\|\big)^2\big(\|w_t\| + \|w^*\|\big). \qquad (76)$$
On the other hand, for term (b), $\|\nabla f(w_t)\|^2$, we get
$$\|\nabla f(w_t)\|^2 = \big\|A_*(w_t - w^*) - \rho\big(\|w^*\| - \|w_t\|\big)w_t\big\|^2 \le 2(w_t - w^*)^\top A_*^2(w_t - w^*) + 2\rho^2\big(\|w^*\| - \|w_t\|\big)^2\|w_t\|^2. \qquad (77)$$
By combining (75), (76), and (77), we can bound the distance term as
$$\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 - 2\eta(1+\beta)(w_t - w^*)^\top\Big(A_* + \tfrac{\rho}{2}\big(\|w_t\| - \|w^*\|\big)I_d\Big)(w_t - w^*) - \rho\eta(1+\beta)\big(\|w^*\| - \|w_t\|\big)^2\big(\|w_t\| + \|w^*\|\big) + \eta^2\Big(2(w_t - w^*)^\top A_*^2(w_t - w^*) + 2\rho^2\big(\|w^*\| - \|w_t\|\big)^2\|w_t\|^2\Big) + (\beta^2 + 2\beta)\|w_t - w_{t-1}\|^2 - 2\eta\beta\langle w^* - w_{t-1}, \nabla f(w_t)\rangle + 2\beta(w_{t-1} - w^*)^\top(w_t - w_{t-1}). \qquad (78)$$
Now let us bound the terms on the last line of (78). The second-to-last term equals
$$-2\eta\beta\langle w^* - w_{t-1}, \nabla f(w_t)\rangle = -2\eta\beta\langle w^* - w_{t-1}, \nabla f(w_{t-1})\rangle - 2\eta\beta\langle w^* - w_{t-1}, \nabla f(w_t) - \nabla f(w_{t-1})\rangle. \qquad (79)$$
On the other hand, the last term of (78) equals
$$2\beta(w_{t-1} - w^*)^\top(w_t - w_{t-1}) = -2\beta\eta\langle w_{t-1} - w^*, \nabla f(w_{t-1})\rangle + 2\beta^2\langle w_{t-1} - w^*, w_{t-1} - w_{t-2}\rangle. \qquad (80)$$
By Lemma 12, for all $t$ we have
$$\langle w_{t-1} - w^*, w_{t-1} - w_{t-2}\rangle \le -c_w\|w_{t-1} - w_{t-2}\|^2 + \eta\sum_{s=1}^{t-2}\beta^{t-2-s}(d_s + e_s + g_s). \qquad (81)$$
Therefore, by combining (79), (80), and (81), we have
$$(\beta^2 + 2\beta)\|w_t - w_{t-1}\|^2 - 2\eta\beta\langle w^* - w_{t-1}, \nabla f(w_t)\rangle + 2\beta(w_{t-1} - w^*)^\top(w_t - w_{t-1}) \le (\beta^2 + 2\beta)\|w_t - w_{t-1}\|^2 - 2\eta\beta\langle w_{t-1} - w^*, \nabla f(w_{t-1}) - \nabla f(w_t)\rangle - 2c_w\beta^2\|w_{t-1} - w_{t-2}\|^2 + 2\eta\beta^2\sum_{s=1}^{t-2}\beta^{t-2-s}(d_s + e_s + g_s).$$
Let us summarize the results so far. By (83) and (90), we have
$$\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 - 2\eta(1+\beta)(w_t - w^*)^\top\Big(A_* + \tfrac{\rho}{2}\big(\|w_t\| - \|w^*\|\big)I_d\Big)(w_t - w^*) - \rho\eta(1+\beta)\big(\|w^*\| - \|w_t\|\big)^2\big(\|w_t\| + \|w^*\|\big) + \eta^2\Big(2(w_t - w^*)^\top A_*^2(w_t - w^*) + 2\rho^2\big(\|w^*\| - \|w_t\|\big)^2\|w_t\|^2\Big) + 2\eta^2 c_\beta(w_{t-1} - w^*)^\top A_*^2(w_{t-1} - w^*) + 2\eta^2 c_\beta\rho^2\big(\|w^*\| - \|w_{t-1}\|\big)^2\|w_{t-1}\|^2 + 2\eta^2\beta^2 L\|w_{t-1} - w^*\|^2 + \eta^2(c_\beta - 2c_w)\|z_{t-2}\|^2 + 2\eta\beta^2\sum_{s=1}^{t-2}\beta^{t-2-s}(d_s + e_s + g_s),$$
where we also used $w_{t-1} - w_{t-2} = -\frac{\eta}{\beta}z_{t-2}$. We can rewrite the inequality above further as
$$\|w_{t+1} - w^*\|^2 \le (w_t - w^*)^\top\Big(I_d - 2\eta(1+\beta)A_*\big(I_d - \tfrac{\eta}{1+\beta}A_*\big) - \eta\rho(1+\beta)\big(\|w_t\| - \|w^*\|\big)I_d\Big)(w_t - w^*) \underbrace{- \eta\rho(1+\beta)\big(\|w^*\| - \|w_t\|\big)^2\Big(\|w_t\|\big(1 - \tfrac{2\eta\rho\|w_t\|}{1+\beta}\big) + \|w^*\|\Big)}_{\le 0} + 2\eta^2 c_\beta\rho^2\big(\|w^*\| - \|w_{t-1}\|\big)^2\|w_{t-1}\|^2 + \eta^2(w_{t-1} - w^*)^\top\big(2c_\beta A_*^2 + 2\beta^2 L I_d\big)(w_{t-1} - w^*) + \eta^2(c_\beta - 2c_w)\|z_{t-2}\|^2 + 2\eta\beta^2\sum_{s=1}^{t-2}\beta^{t-2-s}(d_s + e_s + g_s), \qquad (92)$$
where we used that $-\eta\rho(1+\beta)\big(\|w^*\| - \|w_t\|\big)^2\big(\|w_t\|(1 - \tfrac{2\eta\rho\|w_t\|}{1+\beta}) + \|w^*\|\big) \le 0$ for any $\eta$ satisfying $\eta \le \frac{1+\beta}{2\rho R}$, as $\|w_t\| \le R$ for all $t$. Let us simplify (92) further by writing it as
$$\|w_{t+1} - w^*\|^2 \overset{(a)}{\le} (w_t - w^*)^\top\Big(I_d - 2\eta(1+\beta)A_*\big(I_d - \tfrac{\eta}{1+\beta}A_*\big) - \eta\rho(1+\beta)\big(\|w_t\| - \|w^*\|\big)I_d\Big)(w_t - w^*) + \eta^2(w_{t-1} - w^*)^\top\big(2c_\beta A_*^2 + 2\beta^2 L I_d + 2c_\beta\rho^2 R^2 I_d\big)(w_{t-1} - w^*) + \eta^2(c_\beta - 2c_w)\|z_{t-2}\|^2 + 2\eta\beta^2\sum_{s=1}^{t-2}\beta^{t-2-s}(d_s + e_s + g_s)$$
$$\overset{(b)}{\le} \Big(1 - \eta(1+\beta)\big(\rho\|w_t\| - (\gamma - \tfrac{\rho\|w^*\| - \gamma}{2})\big)\Big)\|w_t - w^*\|^2 + \eta^2(w_{t-1} - w^*)^\top\big(2c_\beta A_*^2 + 2\beta^2 L I_d + 2c_\beta\rho^2 R^2 I_d\big)(w_{t-1} - w^*) + \eta^2(c_\beta - 2c_w)\|z_{t-2}\|^2 + 2\eta\beta^2\sum_{s=1}^{t-2}\beta^{t-2-s}(d_s + e_s + g_s),$$
where (a) is because $\big(\|w^*\| - \|w_{t-1}\|\big)^2\|w_{t-1}\|^2 \le \|w^* - w_{t-1}\|^2\|w_{t-1}\|^2 \le \|w^* - w_{t-1}\|^2 R^2$, as $\|w_t\| \le R$ for all $t$, while (b) is by another constraint on $\eta$,
$$\eta \le \frac{1+\beta}{4\|A_*\|_2},$$
so that $2\eta A_*\big(I_d - \tfrac{\eta}{1+\beta}A_*\big) \succeq \tfrac{3}{2}\eta A_*$.

Lemma 14. Fix some numbers $c_0, c_1 > 0$. Denote $\omega^{\beta}_{t-2} := \sum_{s=0}^{t-2}\beta^s$. We follow the notations and assumptions used in Lemma 12 and Lemma 13.
If η satisfies: (1) η ≤ c0β 2c β A * 2 2 +2β 2 L+2c β ρ 2 R 2 , (2) η ≤ c1 β(2cw+Lβω β t-2 ) A * , (3) η ≤ c1 ρβ 2 Lω β t-2 R , (4) η ≤ 1 4ρcwR , and (5) η ≤ c1 A * +ρR Lβ 2 ω β t-2 ( A * 2 +ρ 2 R 2 ) for all t, then we have that for all t, w t+1 -w * 2 ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w t -w * 2 + ηβc 0 w t-1 -w * 2 + 2ηβc 1 A * + ρR t-2 s=0 β t-2-s w s -w * 2 -2ηβ 2 t-2 s=1 β t-2-s (w s -w * ) A * - ρ 2 w * -w t I d (w s -w * ). Proof. From Lemma 13, we have that w t+1 -w * 2 ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w t -w * 2 + η 2 (w t-1 -w * ) (2c β A 2 * + 2β 2 LI d + 2c β ρ 2 R 2 I d )(w t-1 -w * ) + η 2 (c β -2c w ) z t-2 2 + 2ηβ 2 t-2 s=1 β t-2-s (d s + e s + g s ), ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w t -w * 2 + ηβc 0 w t-1 -w * 2 + η 2 (c β -2c w ) z t-2 2 + 2ηβ 2 t-2 s=1 β t-2-s (d s + e s + g s ), (b) ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w t -w * 2 + ηβc 0 w t-1 -w * 2 + 2η 2 (c β -2c w )β 2 ω β t-2 t-2 s=0 β t-2-s (w s -w * ) (A 2 * + ρ 2 R 2 I d )(w s -w * ) + 2ηβ 2 t-2 s=1 β t-2-s (d s + e s + g s ), where (a) is by a constraint of η so that η(2c  and (b ) is due to the following, denote ω β t := t s=0 β, we have that β A * 2 2 + 2β 2 L + 2c β ρ 2 R 2 ) ≤ c 0 β, z t 2 := t s=0 β t-s+1 ∇f (w s ) 2 = β 2 (ω β t ) 2 t s=0 β t-s ω β t ∇f (w s ) 2 ≤ β 2 (ω β t ) 2 t s=0 β t-s ω β t ∇f (w s ) 2 ≤ 2β 2 ω β t t s=0 β t-s (w s -w * ) (A 2 * + ρ 2 R 2 I d )(w s -w * ) . ( ) where the first inequality of ( 97) is due to Jensen's inequality and the second inequality of ( 97) is due to the following upper-bound of the gradient norm, ∇f (w s ) 2 = A * (w s -w * ) -ρ( w * - w s )w s 2 ≤ 2(w s -w * ) A 2 * (w s -w * ) + 2ρ 2 ( w * -w s ) 2 w s 2 ≤ 2(w s -w * ) A 2 * (w s - w * ) + 2ρ 2 w * -w s 2 R 2 . 
By the definitions of d s , e s , g s , we further have that w t+1 -w * 2 ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w s -w * 2 + ηβc 0 w t-1 -w * 2 -2ηβ 2 t-2 s=1 β t-2-s (w s -w * ) A * - ρ 2 ( w * -w s )I d (w s -w * ) + 2ηβ 2 t-2 s=1 β t-2-s (w s -w * ) η(2c w + (c β -2c w )ω β t-2 )A 2 * (w s -w * ) + 2η 2 ρ 2 R 2 β 2 ω β t-2 (c β -2c w ) t-2 s=1 β t-2-s w s -w * 2 -ηβ 2 ρ t-2 s=1 β t-2-s ( w * -w s ) 2 w * + w s -4ηρc w w s 2 + η 2 (c β -2c w )2β t ω β t-2 A * 2 2 + ρ 2 R 2 ) w 0 -w * 2 ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w t -w * 2 + ηβc 0 w t-1 -w * 2 -2ηβ 2 t-2 s=1 β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ) + 2ηβc 1 t-2 s=1 β t-2-s (w s -w * ) (A * + ρRI d )(w s -w * ) -ηβ 2 ρ t-2 s=1 β t-2-s ( w * -w s ) 2 w * + w s -4ηρc w w s 2 + η 2 (c β -2c w )2β t ω β t-2 A * 2 2 + ρ 2 R 2 ) w 0 -w * 2 , ( ) where the last inequality is due to (1): 2η 2 β 2 (2c w + (c β -2c w )ω β t-2 )A 2 * 2η 2 β 2 (2c w + Lβω β t-2 )A 2 * 2ηβc 1 A * , as c β -2c w ≤ Lβ and that η ≤ c1 β(2cw+Lβω β t-2 ) A * , and (2) that 2η 2 ρ 2 R 2 β 2 ω β t-2 (c β -2c w ) ≤ 2η 2 ρ 2 R 2 Lβ 3 ω β t-2 ≤ 2ηρβc 1 R, as η ≤ c1 ρβ 2 Lω β t-2 R . To continue, let us bound the last two terms on (98). For the second to the last term, by using that w t ≤ R for all t and that η ≤ 1 4ρcwR , we have that w * + w s -4ηρc w w s 2 ≥ w * . Therefore, we have that the second to last term on (98) is non-positive, namely, -ηβ 2 ρ t-2 s=1 β t-2-s ( w * -w s ) 2 w * + w s -4ηρc w w s 2 ≤ 0. For the last term on (98), by using that η ≤ c1 A * +ρR Lβ 2 ω β t-2 ( A * 2 +ρ 2 R 2 ) , we have that η 2 (c β -2c w )2β t ω β t-2 A * 2 2 + ρ 2 R 2 ) w 0 - w * 2 ≤ η 2 2Lβ t+1 ω β t-2 A * 2 2 + ρ 2 R 2 ) w 0 -w * 2 ≤ 2ηβ t-1 c 1 A * + ρR w 0 -w * 2 . Combining the above results, we have that w t+1 -w * 2 ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w t -w * 2 + ηβc 0 w t-1 -w * 2 + 2ηβc 1 A * + ρR t-2 s=0 β t-2-s w s -w * 2 -2ηβ 2 t-2 s=1 β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ). 
The following lemma will be used for getting the iteration complexity from Lemma 14. Lemma 15. For a non-negative sequence {y t } t≥0 , suppose that it satisfies y t+1 ≤ py t + qy t-1 + rβ ȳt-2 + xβ t-2 s=τ β t-2-s y s + zβ t-τ +1 , for all t ≥ τ, for non-negative numbers p, q, r, x, z ≥ 0 and β ∈ [0, 1), where we denote ȳt := t s=0 β t-s y s . ( ) Fix some numbers φ ∈ (0, 1) and ψ > 0. Define θ = -p+ √ p 2 +4(q+φ) 2 . Suppose that β ≤ p+ √ p 2 +4(q+φ) 2 φ r + φ + x and β ≤ p + p 2 + 4(q + φ) 2 - 1 ψ . then we have that for all t ≥ τ , y t ≤ p + p 2 + 4(q + φ) 2 t-τ c τ,β , where c τ,β is an upper bound that satisfies, for any t ≤ τ , y t + θy t-1 + φȳ t-2 + βψz ≤ c τ,β , Proof. For a non-negative sequence {y t } t≥0 , suppose that it satisfies y t+1 ≤ py t + qy t-1 + rβ ȳt-2 + xβ t-2 s=τ β t-2-s y s + zβ t-τ +1 , for all t ≥ τ, for non-negative numbers p, q, r, x, z and β ∈ [0, 1), where we denote ȳt := t s=0 β t-s y s . (105) y t+1 + θy t + φȳ t-1 + ψzβ t-τ +2 (a) ≤ (p + θ)y t + qy t-1 + rβ ȳt-2 + φȳ t-1 + xβ t-2 s=τ β t-2-s y s + (ψβ + 1)zβ t-τ +1 (b) = (p + θ)y t + (q + φ)y t-1 + (rβ + φβ)ȳ t-2 + xβ t-2 s=τ β t-2-s y s + (ψβ + 1)zβ t-τ +1 ≤ (p + θ)(y t + (q + φ) p + θ y t-1 ) + β(r + φ + x)ȳ t-2 + (ψβ + 1)zβ t-τ +1 (c) ≤ (p + θ)(y t + θy t-1 ) + β(r + φ + x)ȳ t-2 + (ψβ + 1)zβ t-τ +1 (d) ≤ (p + θ)(y t + θy t-1 + φȳ t-2 + ψzβ t-τ +1 ) ≤ (p + θ) 2 (y t-1 + θy t-2 + φȳ t-3 + ψzβ t-τ -1 ) ≤ . . . := (p + θ) t-τ +1 c τ,β . where (a) is due to the dynamics (100) , (b) is because ȳt-1 = y t-1 + β ȳt-2 , (c) is by, , which completes the proof. q + φ p + θ ≤ θ, Lemma 16. Assume that for all t, w t ≤ R for some number R. Fix the numbers c 0 , c 1 > 0 in Lemma 14 so that c 1 ≤ c0 40( A * +ρR) . 
Suppose that the step size η satisfies (1) η ≤ c0β 2c β A * 2 2 +2β 2 L+2c β ρ 2 R 2 , (2) η ≤ c1 β(2cw+Lβω β t-2 ) A * , (3) η ≤ c1 ρβ 2 Lω β t-2 R , (4) η ≤ 1 4ρcwR , (5) η ≤ c1 A * +ρR Lβ 2 ω β t-2 ( A * 2 +ρ 2 R 2 ) , (6) η ≤ 1+β 2ρR , and (7) η ≤ 1+β 4 A * 2 for all t, where c β := (2β 2 +4β+Lβ), L := A 2 +2ρR, c w := 2β 1-β , cw := (1+c w )+(1+c w ) 2 and ω β t-2 := t-2 s=0 β. Furthermore, suppose that the momentum parameter β satisfies β ≤ min{ 10 11 1 -η(1 + β)δ , 1 -η(1 + β)δ -0.1 . Fix numbers δ, δ so that 1 2 (ρ w * -γ) > δ > 0 and c0 20 ≥ δ > 0. Assume that w t is nondecreasing. If ρ w t ≥ γ -1 2 (ρ w * -γ) + δ and ρ w t ≥ γ -δ for some t ≥ t * , then we have that w t -w * 2 ≤ 1 -η(1 + β)δ + 2ηβc 0 1 -η(1 + β)δ t-t * c β,t * where c β,t * is a number that satisfies for any t ≤ t * , w t -w * 2 + 2ηc0β 1-η(1+β)δ w t-1 -w * 2 + ηβc 0 t-2 s=0 β t-2-s w s -w * 2 + max{0, β20η t * -1 s=1 β t * -1-s γ 2 -ρ 2 w s w s -w * 2 } ≤ c β,t * . Proof. From Lemma 14, we have that w t+1 -w * 2 ≤ 1 -η(1 + β) ρ w t -(γ - ρ w * -γ 2 ) w t -w * 2 + ηβc 0 w t-1 -w * 2 + 2ηβc 1 ( A * + ρR) t-2 s=0 β t-2-s w s -w * 2 -2ηβ 2 t-2 s=1 β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ). Using that for all t ≥ t * , ρ w t ≥ γ -1 2 (ρ w * -γ) + δ, we have that w t+1 -w * 2 ≤ 1 -η(1 + β)δ w t -w * 2 + ηβc 0 w t-1 -w * 2 + 2ηβc 1 ( A * + ρR) t-2 s=0 β t-2-s w s -w * 2 -2ηβ 2 t-2 s=1 β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ) = 1 -η(1 + β)δ w t -w * 2 + ηβc 0 w t-1 -w * 2 + 2ηβc 1 ( A * + ρR) t-2 s=0 β t-2-s w s -w * 2 -2ηβ 2 t-2 s=t * β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ) -2ηβ 2 t * -1 s=1 β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ). We bound the second to last term of (110) as follows. Note that for s ≥ t * , we have that ρ w s ≥ γ -δ by the assumption that w t is non-decreasing. Therefore, once ρ w t exceeds any level blow γ -δ , it will not fall below γ -δ . 
So we have that A * - ρ 2 w * -w s I d (a) A * - ρ 2 w * I d + γ -δ 2 I d (b) (-γ + ρ w * )I d - ρ 2 w * I d + γ -δ 2 I d (c) - δ 2 I d where (a) uses that ρ w s ≥ γ -δ for some number δ > 0, (b) uses the fact that A * (-γ + ρ w * )I d , and (c) uses the fact that ρ w * ≥ γ. Using the result, we can bound the second to the last term of (110) as -2ηβ 2 t-2 s=t * β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ) ≤ ηβ 2 δ t-2 s=t * β t-2-s w s -w * 2 . ( ) Now let us switch to the last term of (110). Using the fact that A * (-γ + ρ w * )I d , and that ρ w * ≥ γ by the characterization of the optimizer w * , we have that A * - ρ 2 w * -w s I d ρ 2 w * + w s I d -γ ρ 2 w s - γ 2 I d . So we can bound the last term as -2ηβ 2 t * -1 s=1 β t-2-s (w s -w * ) A * - ρ 2 w * -w s I d (w s -w * ) ≤ 2ηβ t-t * +1 t * -1 s=1 β t * -1-s (w s -w * ) γ 2 - ρ 2 w s I d (w s -w * ) := 2ηβ t-t * +1 D t * , where we denote D t * := t * -1 s=1 β t * -1-s (w s -w * ) γ 2 -ρ 2 w s I d (w s -w * ). Combining ( 110), ( 112), (114), we have that w t+1 -w * 2 ≤ 1 -η(1 + β)δ w t -w * 2 + ηβc 0 w t-1 -w * 2 + 2ηβc 1 ( A * + ρR) t-2 s=0 β t-2-s w s -w * 2 + ηβ 2 δ t-2 s=t * β t-2-s w s -w * 2 + 2ηβ t-t * +1 D t * . Now we are ready to use Lemma 15. Set p, q, r, x, z in Lemma 15 as follows. • p = 1 -η(1 + β)δ • q = ηβc 0 • r = 2η( A * + ρR)c 1 • x = ηβδ • z = 2ηD t * 1{D t * ≥ 0}. • φ = q = ηβc 0 • ψ = 10. we have that . Using the expression of p, q, φ, and ψ, it suffices to have that β ≤ 10 11 1 -η(1 + β)δ . On the other hand, for the second condition, β ≤ p+ √ p 2 +4(q+φ) 2 w t -w * 2 ≤ 1 -η(1 + β)δ + (1 -η(1 + β)δ) 2 + 4(2ηβc 0 ) 2 t-t * c β,t * ≤ 1 -η(1 + β)δ + 2ηβc 0 1 -η(1 + β)δ t- -1 ψ , by using the expression of p, q, φ, and ψ, it suffices to have that β ≤ 1 -η(1 + β)δ -0.1.

F.2 PROOF OF THEOREM 3

Proof. Let δ = c δ ρ w * -γ for some number c δ < 0.5. Denote c converge ηδ := 1 -2c0 1-2ηδ for some number c0 > 0. By Lemma 16, we have that w t -w * 2 ≤ 1 -η(1 + β)δ + 2ηβc 0 δ 1 -η(1 + β)δ t-t * c β,t * ≤ 1 -ηδ(1 + βc converge ηδ ) t-t * c β,t * where c β,t * is a number that satisfies for any t ≤ t * , w t -w * 2 + 2ηc0β 1-η(1+β)δ w t-1 -w * 2 + ηβc 0 t-2 s=0 β t-2-s w s -w * 2 +max{0, β20η t * -1 s=1 β t * -1-s γ 2 -ρ 2 w s w s -w * 2 } ≤ c β,t * . Note that we can obtain a trivial upper-bound c β,t for any t as follows. Using that w t -w * 2 ≤ 4R 2 and that c 0 ← c0 δ and that δ ← c δ ρ w * -γ , we can upper-bound the term as c β,t ≤ 4R 2 1 + 2ηc 0 c δ ρ w * -γ β 1 -η(1 + β)c δ ρ w * -γ + ηβc 0 c δ ρ w * -γ 1 -β t 1 -β + β10η max{0, γ} 1 -β t 1 -β ≤ 4R 2 (1 + ηβ Cβ ) := cβ . ( ) where we define Cβ := 2c0c δ ρ w * -γ 1-η(1+β)c δ ρ w * -γ + 1 1-β c0 c δ (ρ w * -γ) + 10 max{0, γ} and cβ := 4R 2 (1 + ηβ Cβ ). Now denote c converge := 1 - 2c0 1-2ηc δ ρ w * -γ . We have that f (w t * +t ) -f (w * ) (a) ≤ A 2 + 2ρR 2 w t * +t -w * 2 (b) ≤ A 2 + 2ρR 2 cβ exp -ηc δ (ρ w * -γ)(1 + βc converge )t where (a) uses the ( A 2 + 2ρR)-smoothness of function f (•) in the region of {w : w ≤ R}, and (b) uses ( 117), ( 118) and that δ := c δ ρ w * -γ . So we see that the number of iterations in the linear convergence regime is at most t ≤ T := 1 ηc δ (ρ w * -γ)(1 + βc converge ) log( ( A 2 + 2ρR)c β 2 ). Lastly, let us check if the step size η satisfies the constraints of Lemma 16. Recall the notations that c w := 2β 1-β , cw := (1 + c w ) + (1 + c w ) 2 , c β := (2β 2 + 4β + Lβ), and L := A 2 + 2ρR. Lemma 16 has the following constraints, (1) η ≤ c0β 2c β A * 2 2 +2β 2 L+2c β ρ 2 R 2 , (2) η ≤ c1 β(2cw+Lβ/(1-β)) A * , (3) η ≤ c1 ρβ 2 LR/(1-β) , (4) η ≤ 1 4ρcwR , (5) η ≤ c1 A * +ρR Lβ 2 ( A * 2 +ρ 2 R 2 )/(1-β) , (6) η ≤ 1+β 2ρR , and (7) η ≤ 1+β 4 A * 2 . 
For the constraints of (1), using c 0 = c0 δ and that δ = c δ ρ w * -γ , it can be rewritten as η ≤ c0 c δ ρ w * -γ 2(2β + 4 + L) A * 2 2 + ρ 2 R 2 + 2βL . ( ) For the constraints of (2), using c 1 ≤ c0δ 40( A * +ρR) and that δ = c δ ρ w * -γ , it can be rewritten as η ≤ c0 c δ ρ w * -γ 40β( A * 2 + ρR A * )(2c w + Lβ/(1 -β)) . ( ) The constraints of (3) can be written as, using c 1 ≤ c0δ 40( A * +ρR) and that δ = c δ ρ w * -γ , η ≤ c0 c δ ρ w * -γ 40ρR( A * + ρR)β 2 L/(1 -β) . ( ) The constraints of (5) translates into, using c 1 ≤ c0δ 40( A * +ρR) and that δ = c δ ρ w * -γ , η ≤ c0 c δ ρ w * -γ Lβ 2 ( A * 2 + ρ 2 R 2 )/(1 -β) . Considering all the above constraints, it suffices to let η satisfies η ≤ min 1 4( A * + cw ρR) , c0 c δ ρ w * -γ C β ( A * + ρR)(1 + A * + ρR) + 2βL , where C β := max 4β + 8 + 2L, 40β 2 L 1-β + 80βc w , 4c w . Note that the constraint of η satisfies ηc δ ρ w * -γ ≤ 1 4 c δ ≤ 1 8 . Using this inequality, we can simplify the constraint regarding the parameter β in Lemma 16, which leads to β ∈ [0, 0.65]. Consequently, we can simplify and upper bound the constants cw , C β and Cβ , which leads to the theorem statement. Thus, we have completed the proof.

G MORE DISCUSSIONS

Recall from the discussion in the main text that the iterate w_{t+1} generated by HB satisfies ⟨w_{t+1}, u_i⟩ = (1 + ηλ_i)⟨w_t, u_i⟩ + β(⟨w_t, u_i⟩ − ⟨w_{t−1}, u_i⟩), which is in the form of the dynamics shown in Lemma 1. Hence, one might be able to show that, with the use of momentum, the growth rate of the projection on the eigenvector u_i (i.e., |⟨w_{t+1}, u_i⟩|) becomes faster as the momentum parameter β increases. Furthermore, the top eigenvector projection |⟨w_{t+1}, u_1⟩| is the one that grows at the fastest rate. As a result, after normalization (i.e., w_{t+1}/‖w_{t+1}‖), the normalized solution converges to the top eigenvector after a number of iterations T. However, power iteration and the Lanczos method are the standard, specialized, state-of-the-art algorithms for computing the top eigenvector, and it is true that HB is outperformed by these methods. But in the next subsection, we will show an implication of the acceleration result, compared to vanilla gradient descent, of top eigenvector computations.

A related line of work studies escaping saddle points in smooth non-convex optimization. The common assumptions are that the gradient is L-Lipschitz, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, and that the Hessian is ρ-Lipschitz, ‖∇²f(x) − ∇²f(y)‖ ≤ ρ‖x − y‖, while some related works make additional assumptions. All the related works agree that if the current iterate w_{t0} is in a region of strict saddle points, defined as one where the gradient is small (i.e., ‖∇f(w_{t0})‖ ≤ ε_g) but the least eigenvalue of the Hessian is strictly negative (i.e., λ_min(∇²f(w_{t0})) ≤ −ε_h), then the eigenvector corresponding to the least eigenvalue λ_min(∇²f(w_{t0})) is the escape direction. To elaborate, by the ρ-Lipschitzness of the Hessian,

f(w_{t0+t}) − f(w_{t0}) ≤ ⟨∇f(w_{t0}), w_{t0+t} − w_{t0}⟩ + ½ (w_{t0+t} − w_{t0})^T ∇²f(w_{t0}) (w_{t0+t} − w_{t0}) + (ρ/3) ‖w_{t0+t} − w_{t0}‖³,

where the quadratic term can exhibit negative curvature. So if w_{t0+t} − w_{t0} is in the direction of the bottom eigenvector of ∇²f(w_{t0}), then ½ (w_{t0+t} − w_{t0})^T ∇²f(w_{t0}) (w_{t0+t} − w_{t0}) ≤ −c ε_h for some c > 0.
Together with the fact that the gradient is small in the region of saddle points, the use of a sufficiently small step size guarantees that the function value decreases sufficiently (i.e., f(w_{t0+t}) − f(w_{t0}) ≤ −c′ ε_h for some c′ > 0). Therefore, many related works design fast algorithms by leveraging the problem structure to quickly compute the bottom eigenvector of the Hessian (see e.g. Carmon et al. (2018); Agarwal et al. (2017); Allen-Zhu & Li (2018); Xu et al. (2018)). An interesting question is the following: "If the Heavy Ball algorithm is used directly to solve a non-convex optimization problem, can it escape possible saddle points faster than gradient descent?" Before answering the question, let us first conduct an experiment to see whether the Heavy Ball algorithm can accelerate the process of escaping saddle points. Specifically, we consider a problem that was considered by Staib et al. (2019), min_w f(w) := ½ w^T Hw + (1/n) Σ_{i=1}^n x_i^T w + ‖w‖_10^10 with H = diag([1, −1/10]), where x_i ∼ N(0, diag([0.1, 0.001])) and the small variance in the second component provides a smaller gradient component along the escape direction. At the origin, the gradient is small but the Hessian exhibits negative curvature. For this problem, Wang et al. (2020) observe that SGD with momentum escapes the saddle points faster, but they make strong assumptions in their analysis. We instead consider the Heavy Ball algorithm (i.e., Algorithm 1, the deterministic version of SGD with momentum). Figure 4 shows the result, and we see that the higher the momentum parameter β, the faster the process of escaping the saddle points. We are going to argue that this observation can be explained by our theoretical result that the Heavy Ball algorithm computes the top eigenvector faster than gradient descent. Let us denote h(w) := (1/n) Σ_{i=1}^n x_i^T w + ‖w‖_10^10, so that ∇h(w) = x + 10·abs(w)^8 ∘ w with x := (1/n) Σ_i x_i, where ∘ denotes the element-wise product and abs(·) denotes the element-wise absolute value of its argument.
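To make the toy experiment concrete, the following is a minimal numpy sketch of Heavy Ball on this objective. The Hessian block H = diag([1, −1/10]) is read off from the dynamics (133) and (136); the fixed vector `x` below is only a stand-in for the sampled mean (1/n) Σ_i x_i, and its values are our illustrative assumption, not the paper's exact draw:

```python
import numpy as np

# Toy saddle-escape objective: f(w) = 0.5 * w'Hw + x'w + sum_j |w_j|^10,
# with H = diag([1, -1/10]). The origin is near a strict saddle: positive
# curvature along e1, negative curvature -1/10 along the escape direction e2.
H = np.diag([1.0, -0.1])
x = np.array([0.05, -0.001])  # stand-in for (1/n) sum_i x_i; second entry small

def grad_f(w):
    # grad f(w) = Hw + x + 10 * abs(w)^8 * w (element-wise)
    return H @ w + x + 10.0 * np.abs(w) ** 8 * w

def heavy_ball(beta, eta=0.01, T=300):
    w_prev = np.zeros(2)  # w_{-1} = w_0 = 0, as in the text
    w = np.zeros(2)
    for _ in range(T):
        w, w_prev = w - eta * grad_f(w) + beta * (w - w_prev), w
    return w

w_gd = heavy_ball(beta=0.0)   # vanilla gradient descent
w_hb = heavy_ball(beta=0.9)   # Heavy Ball with large momentum
# The projection on the escape direction e2 grows faster with larger beta.
```

With this choice of `x`, the second coordinate grows monotonically away from the saddle, and the momentum term amplifies each increment, matching the qualitative claim of Figure 4.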
By the initialization w_0 = w_{−1} = 0, we have w_0^T u_1 = w_{−1}^T u_1 = w_0^T u_2 = w_{−1}^T u_2 = 0 and the dynamics

w_{t+1}^T u_2 = (1 − η) w_t^T u_2 + β(w_t^T u_2 − w_{t−1}^T u_2) − η u_2^T ∇h(w_t)
w_{t+1}^T u_1 = (1 + η/10) w_t^T u_1 + β(w_t^T u_1 − w_{t−1}^T u_1) − η u_1^T ∇h(w_t),

while we also have the initial condition

w_1^T u_2 = −η u_2^T ∇h(w_0) = −η u_2^T x = −η x[1],
w_1^T u_1 = −η u_1^T ∇h(w_0) = −η u_1^T x = −η x[2].

That is, w_1 = −ηx. (135) Using (132), we can rewrite (133) as

w_{t+1}^T u_2 = (1 − η) w_t^T u_2 + β(w_t^T u_2 − w_{t−1}^T u_2) − η x[1] − 10η (w_t^T u_2)^9.

Therefore, the Hessian at a stationary point is positive semi-definite as long as |w[2]|^8 ≥ 1/900, which can be guaranteed by some realizations of x[2] according to (138). Now let us illustrate, at a high level, why higher momentum β leads to faster convergence. From (138), we see that one can specify the stationary point by choosing x[1] and x[2]. Furthermore, from (139), once the iterate w_t satisfies |w_t[2]|^8 > 1/900, the iterate enters the locally strongly convex and smooth region, for which the local convergence of the Heavy Ball algorithm is known (Ghadimi et al. (2015)). W.l.o.g., let us assume that x[2] is negative. From (134), we have that w_1^T u_1 > 0 when x[2] is negative. Moreover, from (136), if, before the iterate satisfies |w_t[2]|^8 > 1/900, we have (1/10)(w_t^T u_1) − 10(w_t^T u_1)^9 − x[2] > 0, then the contribution to the projection on the escape direction (i.e., w_{t+1}^T u_1) from the momentum term β(w_t^T u_1 − w_{t−1}^T u_1) is positive, which implies that the larger β is, the larger the contribution and hence the faster the growth rate of |w_t[2]|. Now let us check the condition (1/10)(w_t^T u_1) − 10(w_t^T u_1)^9 − x[2] > 0 before |w_t[2]|^8 > 1/900. A sufficient condition is (1/10 − 1/90)(w_t^T u_1) − x[2] > 0, which immediately holds given that x[2] is negative, w_1^T u_1 > 0, and the magnitude of w_t^T u_1 is increasing.
A similar reasoning can be conducted for the other coordinate, w_{t+1}[1] := w_{t+1}^T u_2.

In our experiment, for ease of implementation, we let the criteria of the switch be 1{(f(w_1) − f(w_t))/f(w_1) ≥ 0.5}, i.e., whether the relative change of the objective value compared to the initial value has reached 50%.

Algorithm 3: Switching β = 1 to β < 1
1: Required: step size η and momentum parameter β ∈ [0, 1).
2: Init: u = v = w_0 ∈ R^d and β̂ = 1.
3: for t = 0 to T do
4:   Update iterate w_{t+1} = w_t − η∇f(w_t) + β̂(u − v).
5:   Update auxiliary iterate u = v.
6:   Update auxiliary iterate v = w_t.
7:   If { criteria is met }
8:     β̂ = β.
9:     u = v = w_{t+1}.  # (reset momentum)
10:  end
11: end for

Algorithm 4: Switching β = 1 to β < 1.
1: Required: step size η and momentum parameter β ∈ [0, 1).
2: Init: w_0 ∈ R^d, m_{−1} = 0_d, β̂ = 1.
3: for t = 0 to T do
4:   Update momentum m_t := β̂ m_{t−1} + ∇f(w_t).
5:   Update iterate w_{t+1} := w_t − η m_t.
6:   If { criteria is met }
7:     β̂ = β.
8:     m_t = 0.  # (reset momentum)
9:   end
10: end for

H.2 CUBIC-REGULARIZED PROBLEM

Figure 2 shows empirical results of solving the cubic-regularized problem with Heavy Ball under different values of the momentum parameter β. Subfigure (a) shows that a larger momentum parameter β results in a faster growth rate of ‖w_t‖, which confirms Lemma 2 and shows that the iterate enters the benign region B faster with larger β. Note that here ‖w*‖ = 1. The figure suggests that the norm is non-decreasing during the execution of the algorithm for a wide range of β, except for very large β: for β = 0.9, the norm starts decreasing only after it rises above ‖w*‖. Subfigure (b) shows that higher β also accelerates the linear convergence, as the slope of a line corresponding to a higher β is steeper than that of a lower one (e.g., compared to β = 0), which verifies Theorem 3. We also observe an interesting phenomenon: when β is set to a very large value (e.g., 0.9 here), the pattern is qualitatively different from that of smaller values. The convergence is not monotone (it exhibits bumps and overshoots while decreasing); furthermore, the norm of w_t generated with high β exceeds ‖w*‖ at some point during the execution, unlike the behavior with smaller values of β (where the norm is non-decreasing until convergence). Our theoretical results cannot explain the behavior at such high β, since this value of β exceeds the upper threshold required by the theorem. An investigation and understanding of this observation is left for future work.
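Returning to the switching scheme of Algorithms 3 and 4 above, here is a minimal numpy sketch of Algorithm 4. The function name `heavy_ball_switch` and the `threshold` parameter are ours; it uses the relative-decrease criteria from the text, and the toy quadratic at the end is only there to exercise the mechanics:

```python
import numpy as np

def heavy_ball_switch(grad, f, w0, eta, beta_target, T, threshold=0.5):
    """Sketch of Algorithm 4: run with momentum parameter 1 until the relative
    decrease of f (measured against f(w_1)) reaches `threshold`, then switch
    to beta_target < 1 and reset the momentum buffer."""
    w = w0.copy()
    m = np.zeros_like(w0)      # m_{-1} = 0_d
    beta_hat = 1.0             # current momentum parameter, starts at 1
    f1 = None
    switched = False
    for _ in range(T):
        m = beta_hat * m + grad(w)   # m_t = beta_hat * m_{t-1} + grad f(w_t)
        w = w - eta * m              # w_{t+1} = w_t - eta * m_t
        if f1 is None:
            f1 = f(w)                # f(w_1), reference value for the criteria
        elif not switched and (f1 - f(w)) / abs(f1) >= threshold:
            beta_hat = beta_target   # switch beta = 1 -> beta < 1
            m = np.zeros_like(w0)    # reset momentum
            switched = True
    return w

# Exercise the mechanics on a toy quadratic f(w) = 0.5 ||w||^2.
w_out = heavy_ball_switch(grad=lambda w: w, f=lambda w: 0.5 * (w @ w),
                          w0=np.full(3, 2.0), eta=0.1, beta_target=0.5, T=500)
```

On this quadratic the criteria fires after a few steps, after which the β = 0.5 phase contracts linearly to the minimizer at the origin.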






Figure 1: Performance of HB with different β = {0, 0.3, 0.5, 0.7, 0.9, 1.0 → 0.9} for phase retrieval. (Enlarged figures are available in Appendix H.) Here "1.0 → 0.9" stands for using parameter β = 1.0 in the first few iterations and then switching to β = 0.9 after that. In our experiment, for ease of implementation, we let the criteria of the switch be whether the relative change of the objective value compared to the initial value has reached 50%.

c η β number of iterations to enter the benign region B ζ (w * ) or B ζ (-w * ), where c η := 1

(a) norm wt vs. iteration (b) f (wt) -f (w * ) vs. iteration

Figure 2: Solving (2) with different values of the momentum parameter β. The empirical result shows the clear advantage of Heavy Ball momentum. Subfigure (a) shows that a larger momentum parameter β results in a faster growth rate of ‖w_t‖, which confirms Lemma 2 and shows that the iterate enters the benign region B faster with larger β. Note that here ‖w*‖ = 1. It suggests that the norm is non-decreasing during the execution of the algorithm for a wide range of β, except for very large β: for β = 0.9, the norm starts decreasing only after it rises above ‖w*‖. Subfigure (b) shows that higher β also accelerates the linear convergence. Now let us describe the setup of the experiment. We first set step size η = 0.01, dimension d = 4, ρ = ‖w*‖ = ‖A‖_2 = 1, γ = 0.2, and gap = 5 × 10^{−3}. Then we set A = diag([−γ; −γ + gap; a_33; a_44]), where the entries a_33 and a_44 are sampled uniformly at random in [−γ + gap, ‖A‖_2]. We draw w = (A + ρ‖w*‖I_d)^{−ξ} θ, where θ ∼ N(0, I_d) and log_2 ξ is uniform on [−1, 1]. We set w* = ‖w*‖ · w/‖w‖ and b = −(A + ρ‖w*‖I_d) w*. This procedure makes w* the global minimizer of the problem instance (A, b, ρ). The patterns shown in this figure appear for other random problem instances as well.
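The construction described in the caption can be sketched in a few lines of numpy. This is a simplified stand-in rather than the exact experiment: in particular, we initialize at w_0 = 0 instead of the random draw described above, and use a hypothetical hitting level ‖w_t‖ ≥ 0.5 as a proxy for entering the benign region:

```python
import numpy as np

# Cubic-regularized instance: f(w) = 0.5 w'Aw + b'w + (rho/3)||w||^3,
# with b chosen so that w* is the global minimizer.
rng = np.random.default_rng(0)
d, rho, gamma, gap = 4, 1.0, 0.2, 5e-3
a33, a44 = rng.uniform(-gamma + gap, 1.0, size=2)
A = np.diag([-gamma, -gamma + gap, a33, a44])
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)                    # ||w*|| = 1
b = -(A + rho * np.linalg.norm(w_star) * np.eye(d)) @ w_star

def grad_f(w):
    return A @ w + b + rho * np.linalg.norm(w) * w

def run_hb(beta, eta=0.01, T=6000):
    w_prev = np.zeros(d)
    w = np.zeros(d)   # simplified: start at the origin
    norms = []
    for _ in range(T):
        w, w_prev = w - eta * grad_f(w) + beta * (w - w_prev), w
        norms.append(np.linalg.norm(w))
    return w, norms

w_gd, norms_gd = run_hb(beta=0.0)
w_hb, norms_hb = run_hb(beta=0.5)
# Larger beta grows ||w_t|| faster, so the iterate reaches the (proxy) benign
# region sooner, while still converging to w*.
hit_gd = next(t for t, v in enumerate(norms_gd) if v >= 0.5)
hit_hb = next(t for t, v in enumerate(norms_hb) if v >= 0.5)
```

Comparing `hit_hb` against `hit_gd` reproduces, on this single instance, the qualitative message of subfigure (a).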

+ 1 to reach |w T0+1 | > 4 9 + c n , which also means that w T0+1 > 4 9 + c n . Moreover, the larger the momentum parameter β, the smaller the number of T 0 . After w t passing the threshold, the iterate enters Stage 1.2. Lemma 4. (Stage 1.1) Denote c a := 1 1+ η There will be at most T 0 := w t ≤ 4

with larger β leading to a smaller number of T b . Notice that if w t satisfies ||w t |-1| ≤ ζ 2 when w ⊥ t falls below ζ 2 , we immediately have that T ζ ≤ T 0 +T b ; otherwise, the iterate enters the next stage, Stage 1.3. Lemma 5. (Stage 1.2) Denote c b :

additional T a iterations, we have that ||w t | -1| ≤ ζ 2 and that larger β reduces the number of iterations T a . Lemma 6. (Stage 1.3) Denote c ζ,g := 1 1+η ζ 2 , c ζ,d := 1 1-ηζ , and ω := max t |w t | ≤ 10 9 + 1 3 c n .

w t ≤ 4 9 + c n . Furthermore, we have that |w T0+1 | > 4 9 + c n .

n . There will be at most T

where (a) uses that log(1 + x) ≥ x/2 for any x ∈ [0, ≈2.51], and (b) uses γδ − δ² ≤

HB FOR THE CUBIC-REGULARIZED PROBLEM

Theorem 3. Assume that the iterate stays in the benign region that exhibits one-point strong convexity with respect to w*, i.e., B := {w ∈ R^d : ρ‖w‖ ≥ γ − δ} for a number δ > 0, once it enters the benign region. Denote c_converge := 1 − 2c̄_0 / (1 − 2ηc_δ(ρ‖w*‖ − γ))

where c_w := 2β/(1 − β) and c̄_w := (1 + c_w) + (1 + c_w)². Let us also denote c_β := (2β² + 4β + Lβ), L := ‖A‖_2 + 2ρR, and z_t := Σ_{s=0}^t β^{t−s+1} ∇f(w_s). If η satisfies: (1) η ≤ (1 + β)/(2ρR), and (2) η ≤ (1 + β)/(4‖A*‖_2), then we have that for all t,

Figure 3: Distance to the top leading eigenvector vs. iteration when applying HB for solving min_w ½ w^T Aw. Panels: (a) η = 1 × 10^{−2}, (b) η = 5 × 10^{−3}, (c) η = 1 × 10^{−3}, (d) η = 5 × 10^{−4}. The acceleration effect due to the use of momentum is more evident for small η. Here we construct the matrix A = BB^T ∈ R^{10×10} with each entry of B ∈ R^{10×10} sampled from N(0, 1).
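A minimal numpy sketch of this experiment follows, using the per-eigendirection dynamics of Lemma 1, i.e. the unnormalized update w_{t+1} = (I + ηA)w_t + β(w_t − w_{t−1}); the deterministic starting vector is our choice, not the paper's:

```python
import numpy as np

# HB as an (unnormalized) top-eigenvector iteration, per Lemma 1:
# <w_{t+1}, u_i> = (1 + eta*lambda_i)<w_t, u_i> + beta(<w_t,u_i> - <w_{t-1},u_i>).
rng = np.random.default_rng(1)
B = rng.standard_normal((10, 10))
A = B @ B.T
u1 = np.linalg.eigh(A)[1][:, -1]          # top eigenvector of A

def hb_eigvec(beta, eta=1e-3, T=300):
    w_prev = np.ones(10) / np.sqrt(10.0)  # fixed, non-degenerate start
    w = w_prev.copy()
    for _ in range(T):
        w, w_prev = w + eta * (A @ w) + beta * (w - w_prev), w
        s = np.linalg.norm(w)             # rescale the pair jointly to avoid
        w, w_prev = w / s, w_prev / s     # overflow (the recursion is linear)
    return w / np.linalg.norm(w)

dist_gd = 1.0 - abs(hb_eigvec(beta=0.0) @ u1)
dist_hb = 1.0 - abs(hb_eigvec(beta=0.9) @ u1)
# With the same step size and iteration budget, larger beta ends up closer to u1.
```

Because the recursion is linear and homogeneous in the pair (w_t, w_{t−1}), rescaling both iterates by the same factor leaves the direction of all future iterates unchanged, so the normalization inside the loop is only a numerical safeguard.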

;Reddi et al. (2018);Wang et al. (2020) for the challenge of escaping saddle points. The problem is min w f (w)

:= ½ w^T Hw + (1/n) Σ_{i=1}^n x_i^T w + ‖w‖_10^10. We can rewrite the objective (127) as f(w) := ½ w^T Hw + h(w). Then, denoting u_2 := e_1 and using that ∂|w_j|^p/∂w_j = p|w_j|^{p−1} ∂|w_j|/∂w_j = p|w_j|^{p−2} w_j, we have that ∇h(w) = x + 10·abs(w)^8 ∘ w,

w_{t+1}^T u_1 = (1 + η/10) w_t^T u_1 + β(w_t^T u_1 − w_{t−1}^T u_1) − η x[2] − 10η (w_t^T u_1)^9. (136) Note that we have ∇f(w) := Hw + ∇h(w) = Hw + x + 10·abs(w)^8 ∘ w, so the stationary points of the objective function satisfy (1 + 10|w[1]|^8) w[1] + x[1] = 0 and (−0.1 + 10|w[2]|^8) w[2] + x[2] = 0. (138) To check whether a stationary point is a local minimum, we can use the expression of the Hessian ∇²f(w) := H + ∇²h(w)


The second stage consists of those iterations that satisfy t ≥ T ζ . Given that dist(w T ζ , w * ) ≤ ζ and that the local strong convexity holds in B ζ (±w * ), the iterate w t would

The perturbation terms are used to model the deviation from the population dynamics. In this paper, we assume that there exists a small number c_n > 0 such that for all iterations t ≤ T_ζ and all j ∈ [d − 1], max{|ξ_t|, |ρ_{t,j}|} ≤ c_n, where the value c_n should decay as the number of samples n increases, and c_n ≈ 0 when the number of samples n is sufficiently large. Theorem 1. Suppose that the approximated dynamics (11) holds with max{|ξ_t|, |ρ_{t,j}|} ≤ c_n for all iterations t ≤ T_ζ and all dimensions j ∈ [d − 1], where c_n ≤ ζ. Suppose that the local strong convexity holds in B_ζ(±w*). Assume that the initial point w_0 satisfies |w_0| … Assume that the norm of the momentum m_t is bounded for all t ≤ T_ζ, i.e., ‖m_t‖ ≤ c_m for some constant c_m > 0. If the step size η satisfies η ≤

2020) study implicit regularization of gradient descent for the matrix sensing/matrix factorization problem. The directions are different from ours.

t * c β,t * Now let us check the conditions of Lemma 15 to see if β ≤ + ρR)c 1 + ηβc 0 + ηβδ ≤


Combining ( 78) and (82) leads to the following,-ρη(1 + β) w * -w t 2 ( w t + w * ) + η 2 2(w t -w * ) A 2 * (w t -w * ) + 2ρ 2 w * -w t 2 w t 2 + (β 2 + 2β) w t -w t-1 2 -2ηβ w t-1 -w * , ∇f (w t-1 ) -∇f (w t ) .-2c w β 2 w t-1 -w t-2 2 + 2ηβ 2 t-2 s=1 β t-2-s (d s + e s + g s ).(Let us now bound the terms on the second to the last line (83) above,First, note that ∇ 2 f (w) ≤ A 2 + 2ρ w . So we know that f is L := A 2 + 2ρR smooth on {w : w ≤ R}. Second, denotewe can bound w t -w t-1 asUsing the results, we can bound (84) as(89) By ( 87), (88), we have thatthe Heavy Ball algorithm generates the iterate according toBy setting η ≤ 1, we have that the top eigenvector of A is u 1 = e 2 , which is the escape direction. Now observe the similarity (if one ignores the term η∇h(w t )) between the update (128) and 19. It suggests that a similar analysis could be used for explaining the escape process. In Appendix G.2, we provide a detailed analysis of the observation. However, problem (127) has a fixed Hessian and is a synthetic objective function. For general smooth non-convex optimization, can we have a similar explanation? Consider applying the Heavy Ball algorithm to min w f (w) and suppose that at time t 0 , the iterate is in the region of strict saddle points. We have that w t+1+t0 - By setting η ≤ 1 L with L being the smoothness constant of the problem, we have that the top eigenvector I d -η∇ 2 f (w t0 ) is the eigenvector that corresponds to the smallest eigenvalue of the Hessian ∇ 2 f (w t0 ), which is an escape direction. Therefore, if the deviation term can be controlled, the dynamics of the Heavy Ball algorithm can be viewed as implicitly and approximately computing the eigenvector, and hence we should expect that higher values of momentum parameter accelerate the process. To control ∇ 2 f (w t0 )(w t+t0 -w t0 )-∇f (w t+t0 ), one might want to exploit the ρ-Lipschitzness assumption of the Hessian and might need further mild assumptions, as Du et al. 
(2017) provide examples showing that gradient descent can take exponential time T = Ω(exp(d)) to escape saddle points. Previous work of Wang et al. (2020) makes strong assumptions to avoid this exponential-time result. On the other hand, Lee et al. (2019) show that first-order methods escape strict saddle points almost surely. So we conjecture that, under additional mild conditions, if gradient descent can escape in polynomial time, then using Heavy Ball momentum (β > 0) will accelerate the process of escape. We leave this as future work.

G.2 ESCAPING SADDLE POINTS OF (127)

Recall that the objective is (127), where x_i ∼ N(0, diag([0.1, 0.001])) and the small variance in the second component provides a smaller gradient component along the escape direction. At the origin, the gradient is small but the Hessian exhibits negative curvature. Let us denote h(w) := (1/n) Σ_{i=1}^n x_i^T w + ‖w‖_10^10. We can rewrite the objective as f(w) := ½ w^T Hw + h(w). Then, the Heavy Ball algorithm generates the iterate according to w_{t+1} = Aw_t + β(w_t − w_{t−1}) − η∇h(w_t), where the matrix A is defined as A := I_d − ηH. By setting η ≤ 1, we have that the top eigenvector of A is u_1 = e_2 = (0, 1)^T, which is the escape direction.

(a): The size of the projection of w_t on w* over iterations (i.e., |w_t| vs. t), which is non-decreasing throughout the iterations until reaching an optimal point (here, ‖w*‖ = 1). (b): The size of the perpendicular component over iterations (i.e., ‖w_t^⊥‖ vs. t), which increases in the beginning and then decreases toward zero after some point. We see that the slope of the curve corresponding to a larger momentum parameter β is steeper than that of a smaller one, which confirms Lemma 1 and Lemma 3. All the lines are obtained by initializing the iterate at the same point w_0 ∼ N(0, I_d/(10000d)) and using the same step size η = 5 × 10^{−4}. Here we set w* = e_1 and sample x_i ∼ N(0, I_d) with dimension d = 10 and number of samples n = 200. We see that the higher the momentum parameter β, the faster the algorithm enters the linear convergence regime. For the line represented by HB (β = 1.0 → 0.9), the algorithm switches from β = 1 to β = 0.9 after some iterations. Below, in Algorithm 3 and Algorithm 4, we show two equivalent presentations of this switching strategy.
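The phase-retrieval experiment just described can be sketched as follows. We assume the standard (Wirtinger-flow-style) empirical risk f(w) = (1/4n) Σ_i ((x_i^T w)² − (x_i^T w*)²)², which we take to match the paper's phase-retrieval objective; the hitting level 0.1 is our illustrative proxy for entering the benign region:

```python
import numpy as np

# Phase retrieval with HB, following the setup in the caption above:
# w* = e1, x_i ~ N(0, I_d), d = 10, n = 200, eta = 5e-4, w0 ~ N(0, I_d/(10000 d)).
rng = np.random.default_rng(0)
d, n, eta = 10, 200, 5e-4
X = rng.standard_normal((n, d))
w_star = np.zeros(d); w_star[0] = 1.0
y = (X @ w_star) ** 2
w0 = rng.standard_normal(d) / np.sqrt(10000.0 * d)

def f(w):
    return np.mean(((X @ w) ** 2 - y) ** 2) / 4.0

def grad_f(w):
    Xw = X @ w
    return X.T @ ((Xw ** 2 - y) * Xw) / n

def hitting_time(beta, level=0.1, T=40000):
    # first iteration at which f(w_t) drops below `level` (returns T if never)
    w_prev, w = w0.copy(), w0.copy()
    for t in range(T):
        w, w_prev = w - eta * grad_f(w) + beta * (w - w_prev), w
        if f(w) < level:
            return t
    return T

t_gd = hitting_time(beta=0.0)
t_hb = hitting_time(beta=0.9)
# Larger momentum drives the loss below the level in fewer iterations.
```

Both runs start from the same random w_0, so the comparison isolates the effect of β, mirroring how the curves in the enlarged figure share a common initialization.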

