THE ROLE OF MOMENTUM PARAMETERS IN THE OPTIMAL CONVERGENCE OF ADAPTIVE POLYAK'S HEAVY-BALL METHODS

Abstract

The adaptive stochastic gradient descent (SGD) with momentum has been widely adopted in deep learning as well as in convex optimization. In practice, the last iterate is commonly used as the final solution. However, the available regret analysis and the setting of constant momentum parameters only guarantee the optimal convergence of the averaged solution. In this paper, we fill this theory-practice gap by investigating the convergence of the last iterate (referred to as individual convergence), which is a more difficult task than the convergence analysis of the averaged solution. Specifically, in the constrained convex cases, we prove that the adaptive Polyak's Heavy-ball (HB) method, in which the step size is only updated using the exponential moving average strategy, attains an individual convergence rate of O(1/√t), as opposed to that of O(log t/√t) of SGD, where t is the number of iterations. Our new analysis not only shows how the HB momentum and its time-varying weight help us to achieve acceleration in convex optimization but also gives valuable hints on how the momentum parameters should be scheduled in deep learning. Empirical results validate the correctness of our convergence analysis in optimizing convex functions and demonstrate the improved performance of the adaptive HB methods in training deep networks.

1. INTRODUCTION

One of the most popular optimization algorithms in deep learning is the momentum method (Krizhevsky et al., 2012). The first momentum can be traced back to the pioneering work of Polyak's heavy-ball (HB) method (Polyak, 1964), which helps accelerate stochastic gradient descent (SGD) in the relevant direction and dampens oscillations (Ruder, 2016). Recent studies also find that the HB momentum has the potential to escape from local minima and saddle points (Ochs et al., 2014; Sun et al., 2019a). From the perspective of theoretical analysis, HB enjoys a smaller convergence factor than SGD when the objective function is twice continuously differentiable and strongly convex (Ghadimi et al., 2015). In nonsmooth convex cases, with a suitably chosen step size, HB attains an optimal convergence rate of O(1/√t) in terms of the averaged output (Yang et al., 2016), where t is the number of iterations.

To overcome the data-independent limitation of predetermined step size rules, some adaptive gradient methods have been proposed to exploit the geometry of historical data. The first algorithm in this line is AdaGrad (Duchi et al., 2011). The intuition behind AdaGrad is that the seldom-updated weights should be updated with a larger step size than the frequently-updated weights. Typically, AdaGrad rescales each coordinate and estimates the predetermined step size by a sum of squared past gradient values. As a result, AdaGrad has the same convergence rate as vanilla SGD but enjoys a smaller factor, especially in sparse learning problems. The detailed analysis of AdaGrad (Mukkamala & Hein, 2017) implies that one can derive similar convergence rates for the adaptive variants of the predetermined step size methods without additional difficulties. Unfortunately, experimental results illustrate that AdaGrad under-performs when applied to training deep neural networks (Wilson et al., 2017).
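To make the per-coordinate rescaling concrete, the following is a minimal NumPy sketch of the AdaGrad update on a toy quadratic; the function name, step size, and test function are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def adagrad_step(x, grad, sq_sum, alpha=0.1, eps=1e-8):
    """One AdaGrad step: each coordinate is rescaled by the square root
    of the accumulated sum of its squared past gradients, so a
    seldom-updated coordinate keeps a larger effective step size."""
    sq_sum = sq_sum + grad ** 2                     # cumulative sum of squared gradients
    x = x - alpha * grad / (np.sqrt(sq_sum) + eps)  # per-coordinate rescaled step
    return x, sq_sum

# Toy usage: minimize f(x) = ||x||^2 / 2, whose gradient at x is x itself.
x = np.array([1.0, -2.0])
sq_sum = np.zeros_like(x)
for _ in range(200):
    x, sq_sum = adagrad_step(x, x, sq_sum)
```

Note that the effective step size alpha/√(Σ g²) is monotonically decreasing, which can be seen as the data-dependent analogue of a predetermined O(1/√t) step size rule.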
Practical experience has led to the development of adaptive methods that are able to emphasize the more recent gradients. Specifically, an exponential moving average (EMA) strategy was proposed in RMSProp to replace the cumulative sum operation (Tieleman & Hinton, 2012). Adam (Kingma & Ba, 2014), which remains one of the most popular optimization algorithms in deep learning to this day, built upon RMSProp by additionally updating the search directions via the HB momentum. Generally speaking, the gradient-based momentum algorithms that simultaneously update the search directions and learning rates using the past gradients are referred to as the Adam-type methods (Chen et al., 2019). These kinds of methods have achieved several state-of-the-art results on various learning tasks (Sutskever et al., 2013). Compared with HB and AdaGrad, the main novelty of Adam lies in applying EMA to the gradient estimate (first-order) and to the element-wise square of gradients (second-order), with the momentum parameter β_1t and step size parameter β_2t (see (6)) (Alacaoglu et al., 2020). However, the use of EMA introduces considerable complexity into the convergence analysis. For example, in the online setting, (Kingma & Ba, 2014) offered a proof that Adam would converge to the optimum. Despite its remarkable practicality, Adam suffers from a non-convergence issue. To overcome this drawback, several variants such as AMSGrad and AdamNC were proposed (Reddi et al., 2018). Unfortunately, the regret bound of AMSGrad in (Reddi et al., 2018) is O(√(log t) √t) for nonsmooth convex problems, as opposed to that of O(√t) for SGD. On the other hand, EMA uses the current step size in the exponential moving average, while the original HB can use the previous information (Zou et al., 2018). This will cause the update to stagnate when β_1t is very close to 1.
Fortunately, such a dilemma does not appear in Polyak's HB method, and a simple proof of the convergence of this kind of Adam in smooth cases has been provided (Défossez et al., 2020). In this paper, we will focus on the adaptive Polyak's HB method, in which the step size is only updated using EMA. Despite the reported practical performance of the Adam-type methods, there still exist some gaps between theoretical guarantees and empirical success.

• First of all, some important regret bounds have been established to guarantee the performance of online Adam-type algorithms. Nevertheless, the online-to-batch conversion inevitably leads the solution of the induced stochastic algorithm to take the form of an average of all the past iterates. In practice, the last iterate is popularly used as the final solution, which has the advantage of readily enforcing the learning structure (Chen et al., 2012). For SGD, the convergence of the last iterate, which is referred to as individual convergence in (Tao et al., 2020b), was posed as an open problem (Shamir, 2012). Only recently, its optimal individual convergence rate was proved to be O(log t/√t) and O(log t/t) for general and strongly convex problems respectively (Harvey et al., 2019; Jain et al., 2019). Despite enjoying optimal averaging convergence (Yang et al., 2016), as far as we know, the individual convergence of the adaptive HB has not been discussed.

• Secondly, the momentum technique is often regarded as an acceleration strategy in the machine learning community. However, almost all the theoretical analysis is limited to Nesterov's accelerated gradient (NAG) method (Nesterov, 1983), especially in smooth cases (Hu et al., 2009; Liu & Belkin, 2020), where it accelerates the rate of SGD from O(1/t) to O(1/t²).
While the individual convergence of HB is also considered in some papers (Sebbouh et al., 2020; Sun et al., 2019b), the problems considered there are limited to smooth cases and the derived rates are not optimal in convex cases. It is discovered that NAG is capable of accelerating the rate of individual convergence of SGD from O(log t/√t) to O(1/√t) (Tao et al.,
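To make the last-iterate-versus-average distinction concrete, here is a minimal NumPy sketch of Polyak's HB iteration on a toy quadratic that returns both the last iterate and the running average; the constant step size and momentum weight are illustrative assumptions, not the time-varying schedule analyzed in this paper.

```python
import numpy as np

def heavy_ball(grad_f, x0, alpha=0.05, beta=0.9, steps=300):
    """Polyak's heavy-ball iteration:
        x_{t+1} = x_t - alpha * grad_f(x_t) + beta * (x_t - x_{t-1}).
    Returns the last iterate (what practice uses as the final solution)
    and the running average of iterates (what classical averaged-output
    analysis guarantees)."""
    x_prev = x0.copy()
    x = x0.copy()
    avg = x0.copy()
    for t in range(1, steps + 1):
        x_next = x - alpha * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
        avg += (x - avg) / (t + 1)  # incremental running average of iterates
    return x, avg

# Toy usage: f(x) = ||x||^2 / 2, gradient is x.
last, avg = heavy_ball(lambda x: x, np.array([3.0, -1.0]))
```

On this simple problem both outputs approach the minimizer, but the theoretical question studied in this paper is precisely which of the two enjoys the optimal rate guarantee in the nonsmooth constrained setting.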

