THE ROLE OF MOMENTUM PARAMETERS IN THE OPTIMAL CONVERGENCE OF ADAPTIVE POLYAK'S HEAVY-BALL METHODS

ABSTRACT

Adaptive stochastic gradient descent (SGD) with momentum has been widely adopted in deep learning as well as convex optimization. In practice, the last iterate is commonly used as the final solution. However, the available regret analysis and the setting of constant momentum parameters only guarantee the optimal convergence of the averaged solution. In this paper, we fill this theory-practice gap by investigating the convergence of the last iterate (referred to as individual convergence), which is a more difficult task than convergence analysis of the averaged solution. Specifically, in the constrained convex case, we prove that the adaptive Polyak's heavy-ball (HB) method, in which the step size is only updated using the exponential moving average strategy, attains an individual convergence rate of O(1/√t), as opposed to the O(log t/√t) rate of SGD, where t is the number of iterations. Our new analysis not only shows how the HB momentum and its time-varying weight help to achieve acceleration in convex optimization but also gives valuable hints on how the momentum parameters should be scheduled in deep learning. Empirical results validate the correctness of our convergence analysis in optimizing convex functions and demonstrate the improved performance of the adaptive HB methods in training deep networks.

1. INTRODUCTION

One of the most popular optimization algorithms in deep learning is the momentum method (Krizhevsky et al., 2012). The momentum method can be traced back to the pioneering work of Polyak's heavy-ball (HB) method (Polyak, 1964), which helps accelerate stochastic gradient descent (SGD) in the relevant direction and dampens oscillations (Ruder, 2016). Recent studies also find that the HB momentum has the potential to escape from local minima and saddle points (Ochs et al., 2014; Sun et al., 2019a). From the perspective of theoretical analysis, HB enjoys a smaller convergence factor than SGD when the objective function is twice continuously differentiable and strongly convex (Ghadimi et al., 2015). In nonsmooth convex cases, with a suitably chosen step size, HB attains an optimal convergence rate of O(1/√t) in terms of the averaged output (Yang et al., 2016), where t is the number of iterations.
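To make the method under discussion concrete, the following is a minimal sketch of the classical heavy-ball iteration x_{t+1} = x_t - α∇f(x_t) + β(x_t - x_{t-1}) from Polyak (1964), with fixed (non-adaptive) step size α and momentum parameter β; the function name and the example objective are illustrative choices, not part of the paper:

```python
import numpy as np

def heavy_ball(grad, x0, alpha=0.1, beta=0.9, iters=200):
    """Polyak's heavy-ball method with constant parameters:
    x_{t+1} = x_t - alpha * grad(x_t) + beta * (x_t - x_{t-1}).
    The beta-weighted difference term is the "momentum" that
    accelerates progress along persistent descent directions.
    """
    x_prev = x0.copy()  # x_{t-1}; initialized so the first step has zero momentum
    x = x0.copy()       # x_t
    for _ in range(iters):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Illustrative use: minimize f(x) = ||x||^2 / 2, whose gradient is x
# and whose unique minimizer is the origin.
x_star = heavy_ball(lambda x: x, np.array([5.0, -3.0]))
```

The adaptive variants analyzed in this paper replace the constant α with a step size updated by an exponential moving average, and the individual-convergence analysis concerns the last iterate `x` returned above rather than an average over the trajectory.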

