QUICKLY FINDING A BENIGN REGION VIA HEAVY BALL MOMENTUM IN NON-CONVEX OPTIMIZATION

Abstract

The Heavy Ball method (Polyak, 1964), proposed over five decades ago, is a first-order method for optimizing continuous functions. While its stochastic counterpart has proven extremely popular for training deep networks, there are almost no known functions on which deterministic Heavy Ball is provably faster than the simple and classical gradient descent algorithm in non-convex optimization. The success of Heavy Ball has thus far eluded theoretical understanding. Our goal is to address this gap: in the present work we identify two non-convex problems for which we provably show that Heavy Ball momentum helps the iterate enter, more quickly, a benign region that contains a global optimum. We show that Heavy Ball exhibits simple dynamics that clearly reveal the benefit of using a larger momentum parameter on these problems. The first of these optimization problems is phase retrieval, which has useful applications in the physical sciences. The second is cubic-regularized minimization, a critical subroutine required by the Nesterov-Polyak cubic-regularized method (Nesterov & Polyak, 2006) to find second-order stationary points of general smooth non-convex problems.

1. INTRODUCTION

Polyak's Heavy Ball method (Polyak, 1964) has been very popular in modern non-convex optimization and deep learning, and its stochastic version (a.k.a. SGD with momentum) has become the de facto algorithm for training neural nets. Many empirical results show that the algorithm outperforms standard SGD in deep learning (see e.g. Hoffer et al. (2017); Loshchilov & Hutter (2019); Wilson et al. (2017); Sutskever et al. (2013)), but there are almost no corresponding mathematical results that show a benefit relative to the more standard (stochastic) gradient descent. Despite its popularity, we still have very poor theoretical justification for its success in non-convex optimization tasks, and Kidambi et al. (2018) established a negative result, showing that Heavy Ball momentum cannot outperform other methods on certain problems. Furthermore, even for convex problems it appears that strongly convex, smooth, twice differentiable functions (e.g. strongly convex quadratic functions) are one of just a handful of examples for which a provable speedup over standard gradient descent can be shown (e.g. (Lessard et al., 2016; Goh, 2017; Ghadimi et al., 2015; Gitman et al., 2019; Loizou & Richtárik, 2017; 2018; Gadat et al., 2016; Scieur & Pedregosa, 2020; Sun et al., 2019; Yang et al., 2018a; Can et al., 2019; Liu et al., 2020; Sebbouh et al., 2020; Flammarion & Bach, 2015)). There are even some negative results when the function is strongly convex but not twice differentiable: Heavy Ball momentum might lead to divergence in convex optimization (see e.g. (Ghadimi et al., 2015; Lessard et al., 2016)). The algorithm's apparent success in modern non-convex optimization has thus remained quite mysterious. In this paper, we identify two non-convex optimization problems for which the Heavy Ball method has a provable advantage over vanilla gradient descent. The first problem is phase retrieval.
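For concreteness, the Heavy Ball update discussed above takes the form $w_{t+1} = w_t - \eta \nabla f(w_t) + \beta (w_t - w_{t-1})$, where $\eta$ is the step size and $\beta$ the momentum parameter. A minimal sketch in NumPy follows; the function names, default parameter values, and the quadratic test function are illustrative choices, not taken from the paper:

```python
import numpy as np

def heavy_ball(grad, w0, eta=0.01, beta=0.9, n_iters=1000):
    """Polyak's Heavy Ball method.

    Iterates w_{t+1} = w_t - eta * grad(w_t) + beta * (w_t - w_{t-1}),
    starting from w_0 (with w_{-1} = w_0 by convention).
    """
    w_prev = w0.copy()
    w = w0.copy()
    for _ in range(n_iters):
        # Gradient step plus momentum term from the previous displacement.
        w_next = w - eta * grad(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w

# Illustrative usage on the strongly convex quadratic f(w) = 0.5 * ||w||^2,
# whose gradient is simply w; the iterates converge to the minimizer 0.
w_out = heavy_ball(lambda w: w, np.array([5.0, -3.0]))
```

Setting $\beta = 0$ recovers plain gradient descent, which makes the update a convenient baseline for the comparisons discussed in this paper.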
It has some useful applications in physical science such as microscopy or astronomy (see e.g. (Candès et al., 2013), (Fannjiang & Strohmer, 2020), and (Shechtman et al., 2015)). The objective is $\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{4n} \sum_{i=1}^{n} \left( (x_i^\top w)^2 - y_i \right)^2$, where $x_i \in \mathbb{R}^d$ is the design vector and $y_i = (x_i^\top w^*)^2$ is the label of sample $i$. The goal is to recover $w^*$ up to the sign, which is not recoverable (Candès et al., 2013). Under the Gaussian design

