Published as a conference paper at ICLR 2021

THE ROLE OF MOMENTUM PARAMETERS IN THE OPTIMAL CONVERGENCE OF ADAPTIVE POLYAK'S HEAVY-BALL METHODS

Abstract

The adaptive stochastic gradient descent (SGD) method with momentum has been widely adopted in deep learning as well as convex optimization. In practice, the last iterate is commonly used as the final solution. However, the available regret analysis and the setting of constant momentum parameters only guarantee the optimal convergence of the averaged solution. In this paper, we fill this theory-practice gap by investigating the convergence of the last iterate (referred to as individual convergence), which is a more difficult task than convergence analysis of the averaged solution. Specifically, in the constrained convex cases, we prove that the adaptive Polyak's heavy-ball (HB) method, in which the step size is only updated using the exponential moving average strategy, attains an individual convergence rate of O(1/√t), as opposed to that of O(log t/√t) for SGD, where t is the number of iterations. Our new analysis not only shows how the HB momentum and its time-varying weight help us to achieve acceleration in convex optimization but also gives valuable hints on how the momentum parameters should be scheduled in deep learning. Empirical results validate the correctness of our convergence analysis in optimizing convex functions and demonstrate the improved performance of the adaptive HB methods in training deep networks.

1. INTRODUCTION

One of the most popular optimization algorithms in deep learning is the momentum method (Krizhevsky et al., 2012). The first momentum method can be traced back to the pioneering work on Polyak's heavy-ball (HB) method (Polyak, 1964), which helps accelerate stochastic gradient descent (SGD) in the relevant direction and dampens oscillations (Ruder, 2016). Recent studies also find that the HB momentum has the potential to escape from local minima and saddle points (Ochs et al., 2014; Sun et al., 2019a). From the perspective of theoretical analysis, HB enjoys a smaller convergence factor than SGD when the objective function is twice continuously differentiable and strongly convex (Ghadimi et al., 2015). In nonsmooth convex cases, with a suitably chosen step size, HB attains an optimal convergence rate of O(1/√t) in terms of the averaged output (Yang et al., 2016), where t is the number of iterations.

To overcome the data-independent limitation of predetermined step-size rules, some adaptive gradient methods have been proposed to exploit the geometry of historical data. The first algorithm in this line is AdaGrad (Duchi et al., 2011). The intuition behind AdaGrad is that seldom-updated weights should be updated with a larger step size than frequently-updated weights. Typically, AdaGrad rescales each coordinate and estimates the predetermined step size by a sum of squared past gradient values. As a result, AdaGrad has the same convergence rate as vanilla SGD but enjoys a smaller factor, especially in sparse learning problems. The detailed analysis of AdaGrad (Mukkamala & Hein, 2017) implies that one can derive similar convergence rates for the adaptive variants of predetermined step-size methods without additional difficulties. Unfortunately, experimental results illustrate that AdaGrad under-performs when applied to training deep neural networks (Wilson et al., 2017).
Practical experience has led to the development of adaptive methods that are able to emphasize the more recent gradients. Specifically, an exponential moving average (EMA) strategy was proposed in RMSProp to replace the cumulative sum operation (Tieleman & Hinton, 2012). Adam (Kingma & Ba, 2014), which remains one of the most popular optimization algorithms in deep learning to this day, builds upon RMSProp by additionally updating the search directions via the HB momentum. Generally speaking, gradient-based momentum algorithms that simultaneously update the search directions and learning rates using past gradients are referred to as Adam-type methods (Chen et al., 2019). These methods have achieved several state-of-the-art results on various learning tasks (Sutskever et al., 2013). Compared with HB and AdaGrad, the main novelty of Adam lies in applying EMA to the gradient estimate (first-order) and to the element-wise square-of-gradients (second-order), with momentum parameter β_{1t} and step-size parameter β_{2t} (see (6)) (Alacaoglu et al., 2020). However, the use of EMA considerably complicates the convergence analysis. For example, in the online setting, (Kingma & Ba, 2014) offered a proof that Adam would converge to the optimum. Despite its remarkable practicality, Adam suffers from a non-convergence issue. To overcome this drawback, several variants such as AMSGrad and AdamNC were proposed (Reddi et al., 2018). Unfortunately, the regret bound of AMSGrad in (Reddi et al., 2018) is O(√(log t) · √t) for nonsmooth convex problems, as opposed to O(√t) for SGD. On the other hand, EMA uses the current step size in the exponential moving average, while the original HB can use the previous information (Zou et al., 2018). This can cause the update to stagnate when β_{1t} is very close to 1.
Fortunately, such a dilemma does not appear in Polyak's HB method, and a simple proof of the convergence of this kind of Adam in smooth cases has been provided (Défossez et al., 2020). In this paper, we focus on the adaptive Polyak's HB method, in which the step size is only updated using EMA. Despite the various practical successes reported for Adam-type methods, there still exist some gaps between theoretical guarantees and empirical success.

• First of all, some important regret bounds have been established to guarantee the performance of online Adam-type algorithms. Nevertheless, the online-to-batch conversion inevitably leads the solution of the induced stochastic algorithm to take the form of an average of all past iterates. In practice, the last iterate is popularly used as the final solution, which has the advantage of readily enforcing the learning structure (Chen et al., 2012). For SGD, the convergence of the last iterate, which is referred to as individual convergence in (Tao et al., 2020b), was posed as an open problem (Shamir, 2012). Only recently was its optimal individual convergence rate proved to be O(log t/√t) and O(log t/t) for general and strongly convex problems, respectively (Harvey et al., 2019; Jain et al., 2019). Despite the optimal averaging convergence of the adaptive HB (Yang et al., 2016), as far as we know, its individual convergence has not been discussed.

• Secondly, the momentum technique is often claimed as an acceleration strategy in the machine learning community. However, almost all the theoretical analysis is limited to Nesterov's accelerated gradient (NAG) (Nesterov, 1983) method, especially in smooth cases (Hu et al., 2009; Liu & Belkin, 2020), where it accelerates the rate of SGD from O(1/t) to O(1/t²).
While the individual convergence of HB is also considered in some papers (Sebbouh et al., 2020; Sun et al., 2019b), the problems studied there are limited to the smooth case and the derived rates are not optimal in convex cases. It has been discovered that NAG is capable of accelerating the individual convergence rate of SGD from O(log t/√t) to O(1/√t) (Tao et al., 2020a) in nonsmooth convex cases. Nevertheless, there is still no report on the acceleration of the adaptive HB.

• Finally, in practice, almost all momentum and Adam-type algorithms are used with a constant momentum parameter β_{1t} (typically between 0.9 and 0.99). In theory, regret guarantees for online Adam require a rapidly decaying β_{1t} → 0 schedule, which is also considered in (Sutskever et al., 2013; Orvieto et al., 2019). This gap was recently bridged by obtaining the same regret bounds as in (Reddi et al., 2018) with a constant β_{1t} (Alacaoglu et al., 2020). In all state-of-the-art deep learning libraries (e.g., TensorFlow, PyTorch, and Keras), HB is named SGD with momentum, and β_{1t} is empirically set to 0.9 (Ruder, 2016). Despite its intuitive role in controlling the number of forgotten past gradients and its guarantee of optimal averaging convergence (Yang et al., 2016), how β_{1t} affects individual convergence has not been discussed (Gitman et al., 2019).

The goal of this paper is to close this theory-practice gap when using HB to train deep neural networks as well as to optimize convex objective functions. Specifically:

• By setting β_{1t} = t/(t+2), we prove that the adaptive HB attains an individual convergence rate of O(1/√t) (Theorem 5), as opposed to O(log t/√t) for SGD. Our proof is different from all the existing analyses of averaging convergence. It not only provides a theoretical guarantee for the acceleration of HB but also clarifies how the momentum and its parameter β_{1t} help us to achieve the optimal individual convergence.
• If 0 ≤ β_{1t} ≡ β < 1, we prove that the adaptive HB attains the optimal averaging convergence (Theorem 6). To guarantee the optimal individual convergence, Theorem 5 suggests that a time-varying β_{1t} can be adopted. Note that β_{1t} = t/(t+2) → 1; thus our new convergence analysis not only offers an interesting explanation of why we usually restrict β_{1t} → 1 but also gives valuable hints on how the momentum parameters should be scheduled in deep learning.

We mainly focus on the proof of the individual convergence of HB (Theorem 3). The analysis of averaging convergence (Theorem 4, Appendix A.1) is simpler. Their extensions to the adaptive cases (Theorems 5 and 6) are slightly more complex, but the proofs are similar to that of AdaGrad (Mukkamala & Hein, 2017) and the details can be found in the supplementary material.
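As a concrete illustration of the suggested schedule, the following sketch tabulates β_{1t} = t/(t+2) and the step size α_t = α/((t+2)√t) used in our analysis; the base constant `alpha` here is a hypothetical tuning value, not one fixed by the theory.

```python
# A sketch of the parameter schedules suggested by our analysis:
# beta_{1t} = t/(t+2) (which tends to 1) and alpha_t = alpha/((t+2)sqrt(t)).
# The base constant `alpha` is a hypothetical tuning value, not fixed by the theory.
import math

def beta1(t: int) -> float:
    """Momentum parameter beta_{1t} = t/(t+2); approaches 1 as t grows."""
    return t / (t + 2)

def step_size(t: int, alpha: float = 0.1) -> float:
    """Step size alpha_t = alpha/((t+2)sqrt(t)) used in Theorems 3 and 5."""
    return alpha / ((t + 2) * math.sqrt(t))

# beta_{1t} climbs toward 1, consistent with the common practice of 0.9-0.99:
print([round(beta1(t), 3) for t in (1, 10, 100, 1000)])  # → [0.333, 0.833, 0.98, 0.998]
```

Note how the schedule approaches 1 from below rather than fixing a constant such as 0.9, which is precisely the behavior our individual-convergence analysis requires.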

2. PROBLEM STATEMENT AND RELATED WORK

Consider the following optimization problem:

  min f(w), s.t. w ∈ Q,  (1)

where Q ⊆ R^d is a closed convex set and f(w) is a convex function. Denote by w* an optimal solution and by P the projection operator onto Q. Generally, averaging convergence is defined as

  f(w̄_t) - f(w*) ≤ ε(t),  (2)

where w̄_t = (1/t) Σ_{i=1}^t w_i and ε(t) is the convergence bound with respect to t. By contrast, individual convergence is described as

  f(w_t) - f(w*) ≤ ε(t).  (3)

Throughout this paper, we use g(w_t) to denote a subgradient of f at w_t. Projected subgradient descent (PSG) is one of the most fundamental algorithms for solving problem (1) (Dimitri P. et al., 2003), and its iteration is

  w_{t+1} = P[w_t - α_t g(w_t)],  (4)

where α_t > 0 is the step size. To analyze the convergence, we need the following assumption.

Assumption 1. Assume that there exists a number M > 0 such that ‖g(w)‖ ≤ M, ∀w ∈ Q.

It is known that the optimal bound for the nonsmooth convex problem (1) is O(1/√t) (Nemirovsky & Yudin, 1983). PSG attains this optimal convergence rate in terms of the averaged output, while its optimal individual rate is only O(log t/√t) (Harvey et al., 2019; Jain et al., 2019).

When Q = R^d, the regular HB for solving the unconstrained problem (1) is

  w_{t+1} = w_t - α_t g(w_t) + β_t (w_t - w_{t-1}).  (5)

If 0 ≤ β_t ≡ β < 1, the key property of HB is that it can be reformulated as (Ghadimi et al., 2015)

  w_{t+1} + p_{t+1} = w_t + p_t - (α_t/(1-β)) g(w_t), where p_t = (β/(1-β))(w_t - w_{t-1}).

Thus its convergence analysis differs little from that of PSG. In particular, if α_t ≡ α/√T, its averaging convergence rate is O(1/√T) (Yang et al., 2016), where T is the total number of iterations. Simply speaking, the regular Adam (Kingma & Ba, 2014) takes the form

  w_{t+1} = w_t - (α/√t) V_t^{-1/2} ĝ_t,  (6)

where ĝ(w_t) is an unbiased estimate of g(w_t) and

  ĝ_t = β_{1t} ĝ_{t-1} + (1 - β_{1t}) ĝ(w_t),  V_t = β_{2t} V_{t-1} + (1 - β_{2t}) diag(ĝ(w_t) ĝ(w_t)^⊤).
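To make the recursions above concrete, the following sketch runs PSG (4) and the projected HB recursion on a toy nonsmooth problem, f(w) = ‖w - c‖₁ over the Euclidean unit ball. The target c, the base step constant, and the iteration budget are illustrative choices, not values from the paper; HB uses the time-varying schedule analyzed later in the paper.

```python
# A toy comparison of PSG (4) and the projected HB recursion on the nonsmooth
# problem f(w) = ||w - c||_1 over the Euclidean unit ball. The target c, base
# step constant, and iteration budget are illustrative choices; HB uses the
# time-varying schedule beta_t = t/(t+2), alpha_t = alpha/((t+2)sqrt(t)).
import numpy as np

def project_l2_ball(w, radius=1.0):
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

def subgrad(w, c):
    return np.sign(w - c)  # a subgradient of ||w - c||_1

c = np.array([0.5, -0.3])       # minimizer, lies inside the unit ball
w_psg = np.zeros(2)
w_hb = w_hb_prev = np.zeros(2)
alpha = 0.5
for t in range(1, 2001):
    # PSG: w_{t+1} = P[w_t - alpha_t g(w_t)] with alpha_t = alpha/sqrt(t)
    w_psg = project_l2_ball(w_psg - alpha / np.sqrt(t) * subgrad(w_psg, c))
    # HB: w_{t+1} = P[w_t - alpha_t g(w_t) + beta_t (w_t - w_{t-1})]
    beta_t = t / (t + 2)
    alpha_t = alpha / ((t + 2) * np.sqrt(t))
    w_next = project_l2_ball(w_hb - alpha_t * subgrad(w_hb, c)
                             + beta_t * (w_hb - w_hb_prev))
    w_hb_prev, w_hb = w_hb, w_next

f = lambda w: np.abs(w - c).sum()
# Both last iterates end up near the minimizer w* = c, where f(w*) = 0.
print(f(w_psg), f(w_hb))
```

The last iterates of both methods approach the minimizer here; the theoretical distinction is in the guaranteed rate of this last-iterate convergence, not in whether it occurs on such a benign instance.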

3. INDIVIDUAL CONVERGENCE OF HB

To solve the constrained problem (1), HB can be naturally reformulated as

  w_{t+1} = P_Q[w_t - α_t g(w_t) + β_t (w_t - w_{t-1})].  (7)

We first prove a key lemma, which extends (5) to the constrained and time-varying cases.

Lemma 1. (Dimitri P. et al., 2003) For w ∈ R^d and w_0 ∈ Q, ⟨w - w_0, u - w_0⟩ ≤ 0 for all u ∈ Q if and only if w_0 = P(w).

Lemma 2. Let {w_t}_{t=1}^∞ be generated by HB (7). Let

  p_t = t(w_t - w_{t-1}),  β_t = t/(t+2),  α_t = α/((t+2)√t).

Then HB (7) can be reformulated as

  w_{t+1} + p_{t+1} = P_Q[w_t + p_t - (α/√t) g(w_t)].  (8)

Proof. The projection operation can be rewritten as an optimization problem (Duchi, 2018), i.e., w_{t+1} = P_Q[w_t - α_t g(w_t) + β_t(w_t - w_{t-1})] is equivalent to

  w_{t+1} = argmin_{w∈Q} { ⟨α_t g(w_t), w⟩ + (1/2)‖w - w_t - β_t(w_t - w_{t-1})‖² }.  (9)

Then, ∀w ∈ Q, we have

  ⟨w_{t+1} - w_t - β_t(w_t - w_{t-1}) + α_t g(w_t), w_{t+1} - w⟩ ≤ 0.

This is

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) g(w_t), w_{t+1} - w⟩ ≤ 0,  (10)

where we have multiplied the inequality by t + 2 and used the definitions of p_t, β_t, and α_t. Specifically, taking w = w_t,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) g(w_t), w_{t+1} - w_t⟩ ≤ 0.  (11)

From (10) and (t + 1) times (11),

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) g(w_t), w_{t+1} - w + (t+1)(w_{t+1} - w_t)⟩ ≤ 0,

i.e.,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) g(w_t), w_{t+1} + p_{t+1} - w⟩ ≤ 0.

Using Lemma 1, Lemma 2 is proved.

Due to the non-expansive property of P_Q (Dimitri P. et al., 2003), Lemma 2 implies that the convergence analysis for unconstrained problems can be applied to the constrained case.

Theorem 3. Assume that Q is bounded. Let {w_t}_{t=1}^∞ be generated by HB (7). Set β_t = t/(t+2) and α_t = α/((t+2)√t). Then

  f(w_t) - f(w*) ≤ O(1/√t).

Proof. According to Lemma 2,

  ‖w* - (w_{t+1} + p_{t+1})‖² ≤ ‖w* - (w_t + p_t) + (α/√t) g(w_t)‖²
  = ‖w* - (w_t + p_t)‖² + (α²/t)‖g(w_t)‖² + (2α/√t)⟨g(w_t), w* - w_t⟩ + (2αt/√t)⟨g(w_t), w_{t-1} - w_t⟩.

Note that

  ⟨g(w_t), w* - w_t⟩ ≤ f(w*) - f(w_t),  ⟨g(w_t), w_{t-1} - w_t⟩ ≤ f(w_{t-1}) - f(w_t).

Then, multiplying by √t/(2α) and rearranging,

  (t+1)[f(w_t) - f(w*)] ≤ t[f(w_{t-1}) - f(w*)] + (√t/(2α))‖w* - (w_t + p_t)‖² - (√t/(2α))‖w* - (w_{t+1} + p_{t+1})‖² + (α/(2√t))‖g(w_t)‖².

Summing this inequality from k = 1 to t, we obtain

  (t+1)[f(w_t) - f(w*)] ≤ f(w_0) - f(w*) + Σ_{k=1}^t (α/(2√k))‖g(w_k)‖² + Σ_{k=1}^t (√k/(2α))(‖w* - (w_k + p_k)‖² - ‖w* - (w_{k+1} + p_{k+1})‖²).

Note that

  Σ_{k=1}^t (1/(2√k))‖g(w_k)‖² ≤ √t M²

and

  Σ_{k=1}^t (√k/2)(‖w* - (w_k + p_k)‖² - ‖w* - (w_{k+1} + p_{k+1})‖²)
  ≤ (1/2)‖w* - (w_1 + p_1)‖² - (√t/2)‖w* - (w_{t+1} + p_{t+1})‖² + Σ_{k=2}^t ((√k - √(k-1))/2)‖w* - (w_k + p_k)‖².

Since Q is a bounded set, there exists a positive number M_0 > 0 such that ‖w* - (w_{t+1} + p_{t+1})‖² ≤ M_0, ∀t ≥ 0. Therefore

  (t+1)[f(w_t) - f(w*)] ≤ f(w_0) - f(w*) + α√t M² + (√t/(2α)) M_0.

This completes the proof of Theorem 3.

It is necessary to give some remarks about Theorem 3.
• In nonsmooth convex cases, Theorem 3 shows that the individual convergence rate of SGD can be accelerated from O(log t/√t) to O(1/√t) via the HB momentum. The proof clarifies how the HB-type momentum w_t - w_{t-1} and its time-varying weight β_t help us to derive the optimal individual convergence.

• The convergence analysis in Theorem 3 differs markedly from the regret analysis in all the available papers, because a connection between f(w_t) - f(w*) and f(w_{t-1}) - f(w*) must be established here. This shows that establishing optimal individual convergence is more difficult than the averaging-convergence analysis in papers such as (Zinkevich, 2003) and (Yang et al., 2016).

• We can obtain a stochastic HB by replacing the subgradient g(w_t) in (7) with an unbiased estimate ĝ(w_t). This substitution does not affect our convergence analysis, which means that E[f(w_t) - f(w*)] ≤ O(1/√t) under the same assumptions.

If β_t remains constant, we can obtain the averaging convergence rate; the proof of the first part is similar to Lemma 2 and that of the second part is similar to online PSG (Zinkevich, 2003).

Theorem 4. Assume that Q is bounded and 0 ≤ β_t ≡ β < 1. Let {w_t}_{t=1}^∞ be generated by HB (7). Set p_t = (β/(1-β))(w_t - w_{t-1}) and α_t = α/√t. Then we have

  w_{t+1} + p_{t+1} = P_Q[w_t + p_t - (α_t/(1-β)) g(w_t)],
  f((1/t) Σ_{k=1}^t w_k) - f(w*) ≤ O(1/√t).

If Q is unbounded, the boundedness of ‖w* - (w_{t+1} + p_{t+1})‖ cannot be ensured, which may invalidate Theorem 4. Fortunately, as in (Yang et al., 2016), E[f((1/T) Σ_{k=1}^T w_k) - f(w*)] ≤ O(1/√T) still holds, but we need to set α_t ≡ α/√T.
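Both reformulation identities used above, the time-varying one of Lemma 2 and the constant-β one behind Theorem 4, can be checked numerically in the unconstrained case (Q = R^d). The quadratic objective below is an illustrative stand-in, not a problem from the paper.

```python
# Numeric sanity checks (unconstrained case, Q = R^d) of the two momentum
# reformulations: Lemma 2 with beta_t = t/(t+2), alpha_t = alpha/((t+2)sqrt(t)),
# p_t = t(w_t - w_{t-1}); and the constant-beta identity of Theorem 4 with
# p_t = (beta/(1-beta))(w_t - w_{t-1}). f(w) = ||w||^2 is an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.3, 0.9

# Lemma 2: w_{t+1} + p_{t+1} = w_t + p_t - (alpha/sqrt(t)) g(w_t)
w_prev, w = rng.normal(size=3), rng.normal(size=3)
for t in range(1, 6):
    g = 2.0 * w
    alpha_t = alpha / ((t + 2) * np.sqrt(t))
    w_next = w - alpha_t * g + (t / (t + 2)) * (w - w_prev)
    lhs = w_next + (t + 1) * (w_next - w)                # w_{t+1} + p_{t+1}
    rhs = w + t * (w - w_prev) - alpha / np.sqrt(t) * g  # w_t + p_t - (alpha/sqrt t) g
    assert np.allclose(lhs, rhs)
    w_prev, w = w, w_next

# Theorem 4: w_{t+1} + p_{t+1} = w_t + p_t - (alpha_t/(1-beta)) g(w_t)
w_prev, w = rng.normal(size=3), rng.normal(size=3)
for t in range(1, 6):
    g = 2.0 * w
    alpha_t = alpha / np.sqrt(t)
    w_next = w - alpha_t * g + beta * (w - w_prev)
    lhs = w_next + beta / (1 - beta) * (w_next - w)
    rhs = w + beta / (1 - beta) * (w - w_prev) - alpha_t / (1 - beta) * g
    assert np.allclose(lhs, rhs)
    w_prev, w = w, w_next

print("both reformulation identities hold")
```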

4. EXTENSION TO ADAPTIVE CASES

It is easy to see that HB (8) is in fact a gradient-based algorithm with the predetermined step size α/√t. Thus its adaptive variant with EMA can be naturally formulated as

  w_{t+1} = P_Q[w_t - (αβ_{1t}/(t√t)) V̂_t^{-1} ĝ(w_t) + β_{1t}(w_t - w_{t-1})],  (12)

where β_{1t} = t/(t+2) and V_t = β_{2t} V_{t-1} + (1 - β_{2t}) diag(ĝ(w_t) ĝ(w_t)^⊤). The detailed steps of the adaptive HB are shown in Algorithm 1.

Algorithm 1 Adaptive HB
Input: momentum parameters β_{1t}, β_{2t}, constant δ > 0, the total number of iterations T
1: Initialize w_0 = 0, V_0 = 0_{d×d}
2: repeat
3:   ĝ(w_t) = ∇f_t(w_t)
4:   V_t = β_{2t} V_{t-1} + (1 - β_{2t}) diag(ĝ(w_t) ĝ(w_t)^⊤)
5:   V̂_t = V_t^{1/2} + (δ/√t) I_d
6:   w_{t+1} = P_Q[w_t - (αβ_{1t}/(t√t)) V̂_t^{-1} ĝ(w_t) + β_{1t}(w_t - w_{t-1})]
7: until t = T
Output: w_T

Theorem 5. Assume that Q is a bounded set. Let {w_t}_{t=1}^∞ be generated by the adaptive HB (Algorithm 1). Denote p_t = t(w_t - w_{t-1}). Suppose that β_{1t} = t/(t+2) and 1 - 1/t ≤ β_{2t} ≤ 1 - γ/t for some 0 < γ ≤ 1. Then

  w_{t+1} + p_{t+1} = P_Q[w_t + p_t - (α/√t) V̂_t^{-1} ĝ(w_t)],  (13)
  E[f(w_t) - f(w*)] ≤ O(1/√t).

The proof of (13) is identical to that of Lemma 2. It is easy to see that (13) is an adaptive variant of (8), which implies that the proof of the second part is similar to that of AdaGrad (Mukkamala & Hein, 2017). When 0 ≤ β_{1t} ≡ β < 1, the adaptive variant of HB (7) is

  w_{t+1} = P_Q[w_t - (α/√t) V_t^{-1/2} ĝ(w_t) + β(w_t - w_{t-1})],

where V_t = β_{2t} V_{t-1} + (1 - β_{2t}) diag(ĝ(w_t) ĝ(w_t)^⊤). Similarly to the proof of Theorem 5, we can obtain the following averaging convergence.

Theorem 6. Assume that Q is bounded and 0 ≤ β_{1t} ≡ β < 1 in Algorithm 1. Let {w_t}_{t=1}^∞ be generated by the adaptive HB (Algorithm 1). Suppose that 1 - 1/t ≤ β_{2t} ≤ 1 - γ/t for some 0 < γ ≤ 1. Denote p_t = (β/(1-β))(w_t - w_{t-1}). Then

  w_{t+1} + p_{t+1} = P_Q[w_t + p_t - (α/((1-β)√t)) V̂_t^{-1} ĝ(w_t)],
  E[f((1/t) Σ_{k=1}^t w_k) - f(w*)] ≤ O(1/√t).

It is necessary to give some remarks about Theorem 5 and Theorem 6.
• The adaptive HB is usually employed with a constant β_{1t} in deep learning. However, according to Theorem 6, a constant β_{1t} only guarantees the optimal data-dependent averaging convergence; the convergence property of the last iterate remains unknown.

• To ensure the optimal individual convergence, according to Theorem 5, β_{1t} has to be time-varying. The schedule β_{1t} = t/(t+2) explains why we usually restrict β_{1t} → 1 in practice, and it also offers a new schedule for selecting the momentum parameters in deep learning.
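A minimal sketch of Algorithm 1 for an unconstrained problem (so the projection P_Q is the identity) may help fix ideas; the test problem and the constants α, γ, δ below are illustrative assumptions, not values prescribed by the theory.

```python
# A minimal sketch of Algorithm 1 (adaptive HB) for an unconstrained problem,
# so the projection P_Q is the identity. The test problem and the constants
# alpha, gamma, delta are illustrative assumptions, not prescribed by the theory.
import numpy as np

def adaptive_hb(grad, w0, T, alpha=0.5, gamma=0.1, delta=1e-8):
    w_prev = w0.copy()
    w = w0.copy()
    v = np.zeros_like(w0)                         # diagonal of V_t
    for t in range(1, T + 1):
        g = grad(w)
        beta1 = t / (t + 2)                       # beta_{1t} = t/(t+2)
        beta2 = 1.0 - gamma / t                   # beta_{2t} = 1 - gamma/t
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = np.sqrt(v) + delta / np.sqrt(t)   # diag of V_t^{1/2} + (delta/sqrt t) I
        step = alpha * beta1 / (t * np.sqrt(t))   # alpha * beta_{1t} / (t sqrt t)
        w_next = w - step * g / v_hat + beta1 * (w - w_prev)
        w_prev, w = w, w_next
    return w

# Usage: minimize f(w) = ||w - c||^2 for a hypothetical target c.
c = np.array([1.0, -2.0, 0.5])
w_T = adaptive_hb(lambda w: 2.0 * (w - c), np.zeros(3), T=5000)
print(np.abs(w_T - c).max())
```

The last iterate w_T is returned directly, mirroring the practical usage that Theorem 5 is meant to justify.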

5. EXPERIMENTS

In this section, we present some empirical results. The first two experiments validate the correctness of our convergence analysis and investigate the performance of the suggested parameter schedule. For a fair comparison, we independently repeat each experiment five times and report the averaged results. The last experiment (Appendix A.4) shows the effective acceleration of HB over GD in terms of individual convergence.

5.1. EXPERIMENTS ON OPTIMIZING GENERAL CONVEX FUNCTIONS

This experiment optimizes the hinge loss with an l1-ball constraint,

  min f(w), s.t. w ∈ {w : ‖w‖_1 ≤ τ},

where τ denotes the radius of the l1-ball. For the l1 projection operation, we use the SLEP package¹.

Datasets: A9a, W8a, Covtype, Ijcnn1, Rcv1, Realsim (available at the LibSVM website²).

Algorithms: PSG (α_t = α/√t), HB (α_t = α/((t+2)√t), β_t = t/(t+2)), NAG (Tao et al., 2020a), and the adaptive HB (12) (β_{1t} = t/(t+2)).

The relative function value f(w_t) - f(w*) versus epochs is illustrated in Figure 1. As expected, the individual convergence of the adaptive HB behaves almost the same as the averaged output of PSG and the individual outputs of HB and NAG. Since these three stochastic methods have the optimal convergence rate for general convex problems, we conclude that the stochastic adaptive HB attains the optimal individual convergence.

5.2. EXPERIMENTS ON TRAINING DEEP NEURAL NETWORKS

Algorithms: Adam (α, β_{1t} ≡ 0.9, β_{2t} ≡ 0.999, ε = 10^-8) (Kingma & Ba, 2014), SGD (α_t ≡ α), SGD-momentum (α_t ≡ α, β_t ≡ 0.9), AdaGrad (α_t ≡ α) (Duchi et al., 2011), and RMSProp (α_t ≡ α, β_{2t} ≡ 0.9, ε = 10^-8) (Tieleman & Hinton, 2012). For our adaptive HB, γ = 0.1 and δ = 10^-8. Unlike the existing methods, we set β_{1t} = t/(t+2) and β_{2t} = 1 - γ/t in Algorithm 1; within each epoch, β_{1t} and β_{2t} remain unchanged. Since all methods have only one adjustable parameter α, we choose α from {0.1, 0.01, 0.001, 0.0001} in all experiments. Following (Mukkamala & Hein, 2017) and (Wang et al., 2020), we design a simple 4-layer CNN architecture that consists of two convolutional layers (32 filters of size 3 × 3), one max-pooling layer (2 × 2 window and 0.25 dropout), and one fully connected layer (128 hidden units and 0.5 dropout). We also use weight decay and batch normalization to reduce over-fitting. The learning rate is always chosen for each algorithm separately so that it achieves either the best training objective or the best test performance after a fixed number of epochs.
The loss function is the cross-entropy. The training loss results are illustrated in Figures 2 and 4, and the test accuracy results are presented in Figures 3 and 5. As can be seen, the adaptive HB achieves improved training loss, and this improvement also leads to good performance in test accuracy. The experimental results show that our suggested momentum-parameter schedule can yield improved practical performance even in deep learning tasks.
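Section 5.1 relies on Euclidean projection onto the l1 ball, performed there with the SLEP package. For reference, a minimal sort-based routine in the style of Duchi et al. (2008) can be sketched as follows; this is a generic illustration, not the SLEP implementation itself.

```python
# A minimal Euclidean projection onto {w : ||w||_1 <= tau} using the
# sort-and-threshold scheme of Duchi et al. (2008). Generic illustration,
# not the SLEP implementation used in the experiments.
import numpy as np

def project_l1_ball(v, tau=1.0):
    """Return argmin_w ||w - v||_2 subject to ||w||_1 <= tau."""
    if np.abs(v).sum() <= tau:
        return v.copy()                            # already feasible
    u = np.sort(np.abs(v))[::-1]                   # magnitudes, descending
    cssv = np.cumsum(u)
    # Largest index rho with u[rho] * (rho+1) > cssv[rho] - tau
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv - tau)[0][-1]
    theta = (cssv[rho] - tau) / (rho + 1.0)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

w = project_l1_ball(np.array([3.0, -1.0, 0.2]), tau=2.0)
print(w, np.abs(w).sum())  # the projection lands exactly on the l1 sphere
```

The projection soft-thresholds the coordinates, so it tends to produce sparse iterates, which is exactly why the l1-ball constraint is a natural test bed for the "learning structure" of the last iterate discussed in the introduction.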

6. CONCLUSION

In this paper, we prove that the adaptive HB method attains an optimal data-dependent individual convergence rate in constrained convex cases, which bridges a theory-practice gap in using momentum methods to train deep neural networks as well as to optimize convex functions. Our new analysis not only clarifies how the HB momentum and its time-varying weight β_{1t} = t/(t+2) help us to achieve the acceleration but also gives valuable hints on how the momentum parameters should be scheduled in deep learning. Empirical results on optimizing convex functions validate the correctness of our convergence analysis.

A.1 PROOF FOR THEOREM 4

Let {w_t}_{t=1}^∞ be generated by HB (7). Set p_t = (β/(1-β))(w_t - w_{t-1}) and α_t = α/√t. Then, ∀w ∈ Q, according to Lemma 1, we have

  ⟨w_{t+1} - w_t - β(w_t - w_{t-1}) + α_t g(w_t), w_{t+1} - w⟩ ≤ 0.

This is

  ⟨(1/(1-β))(w_{t+1} - w_t) - p_t + (α_t/(1-β)) g(w_t), w_{t+1} - w⟩ ≤ 0,

i.e.,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α_t/(1-β)) g(w_t), w_{t+1} - w⟩ ≤ 0.  (16)

Specifically,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α_t/(1-β)) g(w_t), (β/(1-β))(w_{t+1} - w_t)⟩ ≤ 0.  (17)

From (16) and (17),

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α_t/(1-β)) g(w_t), w_{t+1} + p_{t+1} - w⟩ ≤ 0.

Using Lemma 1, we have

  w_{t+1} + p_{t+1} = P_Q[w_t + p_t - (α_t/(1-β)) g(w_t)].

Then

  ‖w* - (w_{t+1} + p_{t+1})‖² ≤ ‖w* - (w_t + p_t) + (α_t/(1-β)) g(w_t)‖²
  = ‖w* - (w_t + p_t)‖² + (α_t²/(1-β)²)‖g(w_t)‖² + (2α_t/(1-β))⟨g(w_t), w* - w_t⟩ + (2α_t β/(1-β)²)⟨g(w_t), w_{t-1} - w_t⟩.

Note that ⟨g(w_t), w* - w_t⟩ ≤ f(w*) - f(w_t) and ⟨g(w_t), w_{t-1} - w_t⟩ ≤ f(w_{t-1}) - f(w_t). Then

  ‖w* - (w_{t+1} + p_{t+1})‖² ≤ ‖w* - (w_t + p_t)‖² + (α_t²/(1-β)²)‖g(w_t)‖² + (2α_t/(1-β))[f(w*) - f(w_t)] + (2α_t β/(1-β)²)[f(w_{t-1}) - f(w_t)].

Rearranging the inequality, we have

  (2α_t/(1-β))[f(w_t) - f(w*)] ≤ (2α_t β/(1-β)²)[f(w_{t-1}) - f(w_t)] + ‖w* - (w_t + p_t)‖² - ‖w* - (w_{t+1} + p_{t+1})‖² + (α_t²/(1-β)²)‖g(w_t)‖²,

i.e.,

  f(w_t) - f(w*) ≤ (β/(1-β))[f(w_{t-1}) - f(w_t)] + ((1-β)/(2α_t))[‖w* - (w_t + p_t)‖² - ‖w* - (w_{t+1} + p_{t+1})‖²] + (α_t/(2(1-β)))‖g(w_t)‖².
Summing this inequality from k = 1 to t, we obtain

  Σ_{k=1}^t [f(w_k) - f(w*)] ≤ (β/(1-β))[f(w_0) - f(w_t)] + ((1-β)/(2α_1))‖w* - (w_1 + p_1)‖² - ((1-β)/(2α_t))‖w* - (w_{t+1} + p_{t+1})‖² + Σ_{k=1}^t (α_k/(2(1-β)))‖g(w_k)‖² + Σ_{k=2}^t ‖w* - (w_k + p_k)‖² ((1-β)/(2α_k) - (1-β)/(2α_{k-1})),

i.e., with α_k = α/√k,

  Σ_{k=1}^t [f(w_k) - f(w*)] ≤ (β/(1-β))[f(w_0) - f(w_t)] + ((1-β)/(2α))‖w* - (w_1 + p_1)‖² + Σ_{k=2}^t ‖w* - (w_k + p_k)‖² ((1-β)√k/(2α) - (1-β)√(k-1)/(2α)) + Σ_{k=1}^t (α/(2(1-β)√k))‖g(w_k)‖².  (18)

Note that

  Σ_{k=1}^t (1/(2√k))‖g(w_k)‖² ≤ √t M²,  (19)

and since Q is a bounded set, there exists a positive number M_0 > 0 such that

  ‖w* - (w_{t+1} + p_{t+1})‖² ≤ M_0, ∀t ≥ 0.  (20)

From (18), (19), and (20) we have

  Σ_{k=1}^t [f(w_k) - f(w*)] ≤ (β/(1-β))[f(w_0) - f(w_t)] + ((1-β)√t M_0)/(2α) + (α√t M²)/(1-β).

By the convexity of f(w), we obtain

  f((1/t) Σ_{k=1}^t w_k) - f(w*) ≤ (β/((1-β)t))[f(w_0) - f(w_t)] + ((1-β) M_0)/(2α√t) + (α M²)/((1-β)√t).

This completes the proof of Theorem 4.

A.2 PROOF FOR THEOREM 5

Notation. For a positive definite matrix H ∈ R^{d×d}, the weighted l2-norm is defined by ‖x‖²_H = x^⊤ H x, and the H-weighted projection P^H_Q(x) of x onto Q is defined by P^H_Q(x) = argmin_{y∈Q} ‖y - x‖²_H. We use g(w_k) to denote the subgradient of f_k(·) at w_k. For a diagonal matrix sequence {M_k}_{k=1}^t, we use m_{k,i} to denote the i-th diagonal element of M_k. We introduce the notation g_{1:k,i} = (g_{1,i}, g_{2,i}, ..., g_{k,i})^⊤, where g_{k,i} is the i-th element of g(w_k).

Lemma 7. (Mukkamala & Hein, 2017) Suppose that 1 - 1/t ≤ β_{2t} ≤ 1 - γ/t for some 0 < γ ≤ 1 and t ≥ 1. Then

  Σ_{i=1}^d Σ_{k=1}^t g²_{k,i}/(√(k v_{k,i}) + δ) ≤ Σ_{i=1}^d (2(2-γ)/γ)(√(t v_{t,i}) + δ).

Proof of Theorem 5. Without loss of generality, we only prove Theorem 5 in the full-gradient setting. It can be extended to the stochastic case using the standard technique in (Rakhlin et al., 2011).
Note that the projection operation can be rewritten as an optimization problem (Duchi, 2018), i.e., w_{t+1} = P_Q[w_t - α_t V̂_t^{-1} g(w_t) + β_{1t}(w_t - w_{t-1})] is equivalent to

  w_{t+1} = argmin_{w∈Q} { ⟨α_t V̂_t^{-1} g(w_t), w⟩ + (1/2)‖w - w_t - β_{1t}(w_t - w_{t-1})‖² }.

Then, ∀w ∈ Q, we have

  ⟨w_{t+1} - w_t - β_{1t}(w_t - w_{t-1}) + α_t V̂_t^{-1} g(w_t), w_{t+1} - w⟩ ≤ 0.

This is

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) V̂_t^{-1} g(w_t), w_{t+1} - w⟩ ≤ 0.  (22)

Specifically,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) V̂_t^{-1} g(w_t), w_{t+1} - w_t⟩ ≤ 0.  (23)

From (22) and (23),

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) V̂_t^{-1} g(w_t), w_{t+1} - w + (t+1)(w_{t+1} - w_t)⟩ ≤ 0,

i.e.,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α/√t) V̂_t^{-1} g(w_t), w_{t+1} + p_{t+1} - w⟩ ≤ 0.

Using Lemma 1, we have

  w_{t+1} + p_{t+1} = P^{V̂_t}_Q[w_t + p_t - (α/√t) V̂_t^{-1} g(w_t)].

Then

  ‖w* - (w_{t+1} + p_{t+1})‖²_{V̂_t} ≤ ‖w* - (w_t + p_t) + (α/√t) V̂_t^{-1} g(w_t)‖²_{V̂_t}
  = ‖w* - (w_t + p_t)‖²_{V̂_t} + (α²/t)‖g(w_t)‖²_{V̂_t^{-1}} + (2α/√t)⟨g(w_t), w* - w_t⟩ + (2αt/√t)⟨g(w_t), w_{t-1} - w_t⟩.

Note that ⟨g(w_t), w* - w_t⟩ ≤ f(w*) - f(w_t) and ⟨g(w_t), w_{t-1} - w_t⟩ ≤ f(w_{t-1}) - f(w_t). Then

  (t+1)[f(w_t) - f(w*)] ≤ t[f(w_{t-1}) - f(w*)] + (√t/(2α))‖w* - (w_t + p_t)‖²_{V̂_t} - (√t/(2α))‖w* - (w_{t+1} + p_{t+1})‖²_{V̂_t} + (α/(2√t))‖g(w_t)‖²_{V̂_t^{-1}}.

Summing this inequality from k = 1 to t, we obtain

  (t+1)[f(w_t) - f(w*)] ≤ f(w_0) - f(w*) + Σ_{k=1}^t (α/(2√k))‖g(w_k)‖²_{V̂_k^{-1}} + Σ_{k=1}^t (√k/(2α))(‖w* - (w_k + p_k)‖²_{V̂_k} - ‖w* - (w_{k+1} + p_{k+1})‖²_{V̂_k}).

Using Lemma 7, we have

  Σ_{k=1}^t (α/(2√k))‖g(w_k)‖²_{V̂_k^{-1}} ≤ Σ_{i=1}^d (α(2-γ)/γ)(√(t v_{t,i}) + δ).

Note that, since √k v̂_{k,i} = √(k v_{k,i}) + δ,

  Σ_{k=1}^t (√k/(2α))(‖w* - (w_k + p_k)‖²_{V̂_k} - ‖w* - (w_{k+1} + p_{k+1})‖²_{V̂_k})
  = Σ_{i=1}^d ((√(v_{1,i}) + δ)/(2α))(w*_i - (w_{1,i} + p_{1,i}))² - Σ_{i=1}^d ((√(t v_{t,i}) + δ)/(2α))(w*_i - (w_{t+1,i} + p_{t+1,i}))²
  + Σ_{i=1}^d Σ_{k=2}^t (1/(2α))(√(k v_{k,i}) - √((k-1) v_{k-1,i}))(w*_i - (w_{k,i} + p_{k,i}))².

Since Q is a bounded set, there exists a positive number M_1 > 0 such that (w*_i - (w_{t+1,i} + p_{t+1,i}))² ≤ M_1, ∀t ≥ 0, i = 1, 2, ..., d. Moreover, from v_{k,i} = β_{2k} v_{k-1,i} + (1 - β_{2k}) g²_{k,i} and β_{2k} ≥ 1 - 1/k (which implies kβ_{2k} ≥ k - 1), we get

  √(k v_{k,i}) + δ = √(kβ_{2k} v_{k-1,i} + k(1 - β_{2k}) g²_{k,i}) + δ ≥ √((k-1) v_{k-1,i}) + δ.

Therefore

  Σ_{k=1}^t (√k/(2α))(‖w* - (w_k + p_k)‖²_{V̂_k} - ‖w* - (w_{k+1} + p_{k+1})‖²_{V̂_k})
  ≤ Σ_{i=1}^d ((√(v_{1,i}) + δ)/(2α)) M_1 + Σ_{i=1}^d Σ_{k=2}^t (1/(2α))(√(k v_{k,i}) - √((k-1) v_{k-1,i})) M_1
  = (M_1/(2α)) Σ_{i=1}^d (√(t v_{t,i}) + δ).

Since √(t v_{t,i}) = ‖g_{1:t,i}‖ when β_{2k} = 1 - 1/k, we therefore obtain the data-dependent bound

  (t+1)[f(w_t) - f(w*)] ≤ f(w_0) - f(w*) + (M_1/(2α) + α(2-γ)/γ) Σ_{i=1}^d (√(t v_{t,i}) + δ),

and since v_{t,i} ≤ M² by Assumption 1, the right-hand side is O(√t). This proves f(w_t) - f(w*) ≤ O(1/√t).

A.3 PROOF FOR THEOREM 6

Let {w_t}_{t=1}^∞ be generated by the adaptive HB (Algorithm 1) with 0 ≤ β_{1t} ≡ β < 1. Set p_t = (β/(1-β))(w_t - w_{t-1}) and α_t = α/√t. Then, ∀w ∈ Q, according to Lemma 1, we have

  ⟨w_{t+1} - w_t - β(w_t - w_{t-1}) + α_t V̂_t^{-1} g(w_t), w_{t+1} - w⟩ ≤ 0.

This is

  ⟨(1/(1-β))(w_{t+1} - w_t) - p_t + (α_t/(1-β)) V̂_t^{-1} g(w_t), w_{t+1} - w⟩ ≤ 0,

i.e.,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α_t/(1-β)) V̂_t^{-1} g(w_t), w_{t+1} - w⟩ ≤ 0.  (26)

Specifically,

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α_t/(1-β)) V̂_t^{-1} g(w_t), (β/(1-β))(w_{t+1} - w_t)⟩ ≤ 0.  (27)

From (26) and (27),

  ⟨w_{t+1} + p_{t+1} - (w_t + p_t) + (α_t/(1-β)) V̂_t^{-1} g(w_t), w_{t+1} + p_{t+1} - w⟩ ≤ 0.

Using Lemma 1, we have

  w_{t+1} + p_{t+1} = P^{V̂_t}_Q[w_t + p_t - (α_t/(1-β)) V̂_t^{-1} g(w_t)].

Then

  ‖w* - (w_{t+1} + p_{t+1})‖²_{V̂_t} ≤ ‖w* - (w_t + p_t) + (α_t/(1-β)) V̂_t^{-1} g(w_t)‖²_{V̂_t}
  = ‖w* - (w_t + p_t)‖²_{V̂_t} + (α_t²/(1-β)²)‖g(w_t)‖²_{V̂_t^{-1}} + (2α_t/(1-β))⟨g(w_t), w* - w_t⟩ + (2α_t β/(1-β)²)⟨g(w_t), w_{t-1} - w_t⟩.

Note that ⟨g(w_t), w* - w_t⟩ ≤ f(w*) - f(w_t) and ⟨g(w_t), w_{t-1} - w_t⟩ ≤ f(w_{t-1}) - f(w_t). Then

  ‖w* - (w_{t+1} + p_{t+1})‖²_{V̂_t} ≤ ‖w* - (w_t + p_t)‖²_{V̂_t} + (α_t²/(1-β)²)‖g(w_t)‖²_{V̂_t^{-1}} + (2α_t/(1-β))[f(w*) - f(w_t)] + (2α_t β/(1-β)²)[f(w_{t-1}) - f(w_t)].

Rearranging the inequality, we have

  (2α_t/(1-β))[f(w_t) - f(w*)] ≤ (2α_t β/(1-β)²)[f(w_{t-1}) - f(w_t)] + ‖w* - (w_t + p_t)‖²_{V̂_t} - ‖w* - (w_{t+1} + p_{t+1})‖²_{V̂_t} + (α_t²/(1-β)²)‖g(w_t)‖²_{V̂_t^{-1}},

i.e.,

  f(w_t) - f(w*) ≤ (β/(1-β))[f(w_{t-1}) - f(w_t)] + ((1-β)/(2α_t))[‖w* - (w_t + p_t)‖²_{V̂_t} - ‖w* - (w_{t+1} + p_{t+1})‖²_{V̂_t}] + (α_t/(2(1-β)))‖g(w_t)‖²_{V̂_t^{-1}}.

Summing this inequality from k = 1 to t, we obtain

  Σ_{k=1}^t [f(w_k) - f(w*)] ≤ (β/(1-β))[f(w_0) - f(w_t)] + ((1-β)/(2α_1))‖w* - (w_1 + p_1)‖²_{V̂_1} - ((1-β)/(2α_t))‖w* - (w_{t+1} + p_{t+1})‖²_{V̂_t} + Σ_{k=1}^t (α_k/(2(1-β)))‖g(w_k)‖²_{V̂_k^{-1}} + Σ_{i=1}^d Σ_{k=2}^t (w*_i - (w_{k,i} + p_{k,i}))² ((1-β)v̂_{k,i}/(2α_k) - (1-β)v̂_{k-1,i}/(2α_{k-1})).
i.e., with α_k = α/√k and √k v̂_{k,i} = √(k v_{k,i}) + δ,

  Σ_{k=1}^t [f(w_k) - f(w*)] ≤ (β/(1-β))[f(w_0) - f(w_t)] + ((1-β)/(2α))‖w* - (w_1 + p_1)‖²_{V̂_1} + Σ_{i=1}^d Σ_{k=2}^t (w*_i - (w_{k,i} + p_{k,i}))² ((1-β)/(2α))(√(k v_{k,i}) - √((k-1) v_{k-1,i})) + Σ_{k=1}^t (α/(2(1-β)√k))‖g(w_k)‖²_{V̂_k^{-1}}.  (28)

Using Lemma 7, we have

  Σ_{k=1}^t (α/(2(1-β)√k))‖g(w_k)‖²_{V̂_k^{-1}} ≤ (α(2-γ)/(γ(1-β))) Σ_{i=1}^d (√(t v_{t,i}) + δ),  (29)

and since Q is a bounded set, there exists a positive number M_0 > 0 such that

  ‖w* - (w_{t+1} + p_{t+1})‖² ≤ M_0, ∀t ≥ 0.  (30)

From (28), (29), and (30) we have

  Σ_{k=1}^t [f(w_k) - f(w*)] ≤ (β/(1-β))[f(w_0) - f(w_t)] + Σ_{i=1}^d ((1-β)(√(v_{1,i}) + δ) M_0)/(2α) + (α(2-γ)/(γ(1-β))) Σ_{i=1}^d (√(t v_{t,i}) + δ) + Σ_{i=1}^d ((1-β)(√(t v_{t,i}) + δ) M_0)/(2α) - Σ_{i=1}^d ((1-β)(√(v_{1,i}) + δ) M_0)/(2α),

i.e., by the convexity of f(w) and using √(t v_{t,i}) = ‖g_{1:t,i}‖ when β_{2k} = 1 - 1/k,

  f((1/t) Σ_{k=1}^t w_k) - f(w*) ≤ (β/((1-β)t))[f(w_0) - f(w_t)] + (α(2-γ)/(γ(1-β)t)) Σ_{i=1}^d (‖g_{1:t,i}‖ + δ) + ((1-β) M_0/(2αt)) Σ_{i=1}^d (‖g_{1:t,i}‖ + δ).

This completes the proof of Theorem 6.

A.4 EXPERIMENTS ON OPTIMIZING A SYNTHETIC CONVEX FUNCTION

A constrained convex optimization problem was constructed in (Harvey et al., 2019) to show that the optimal individual convergence rate of SGD is O(log t/√t). We use this example to illustrate the acceleration of HB. Let Q be the unit ball in R^T. For i ∈ [T] and c ≥ 1, define the positive scalar parameters

  a_i = 1/(8c(T - i + 1)),  b_i = √i/(2c√T),

and define f : Q → R as in (Harvey et al., 2019) using these parameters. Setting c = 2, the function value f(w_t) versus iterations is illustrated in Figure 6, where the step size of GD is c/√t and the parameters of the constrained HB (7) (α = 8) and AdaHB (12) (α = 0.08, γ = 0.9, δ = 10^-8) are selected according to Theorem 3 and Theorem 5, respectively. As expected, the individual convergence of HB is much faster than that of PSG. We thus conclude that HB is an effective acceleration of GD in terms of individual convergence.



¹ http://yelabs.net/software/SLEP/
² http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/



Figure 1: Convergence on different LibSVM datasets

Figure 2: Training loss v.s. number of epochs on 4-layer CNN: CIFAR10, CIFAR100, MNIST



Obviously, the minimum value of f on the unit ball is non-positive because f(0) = 0, while it can be proved that SGD satisfies f(w_T) ≥ log T/(32c√T).

Figure 6: Convergence of the function value when T = 1000 and T = 5000

7. ACKNOWLEDGEMENTS

This work was supported in part by National Natural Science Foundation of China under Grants (62076252, 61673394, 61976213) and in part by Beijing Advanced Discipline Fund.

