IMPLICIT REGULARIZATION IN HEAVY-BALL MOMENTUM ACCELERATED STOCHASTIC GRADIENT DESCENT

Abstract

It is well known that the finite step size h in gradient descent (GD) implicitly regularizes solutions towards flatter minima. A natural question to ask is: does the momentum parameter β play a role in implicit regularization in Heavy-ball (H.B.) momentum accelerated gradient descent (GD+M)? To answer this question, we first show that the discrete H.B. momentum update (GD+M) follows a continuous trajectory induced by a modified loss, which consists of the original loss and an implicit regularizer. We then show that this implicit regularizer for (GD+M) is stronger than that of (GD) by a factor of (1+β)/(1−β), thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the (SGD+M) update in a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory.
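To make the factor concrete, the claim above can be sketched as follows. Assuming the GD implicit regularizer of Barrett & Dherin (2020) with step size h, the modified losses read (the exact constants for GD+M are established in the paper; this is an illustrative summary, not a derivation):

```latex
% Modified loss traced by GD with step size h (Barrett & Dherin, 2020):
\tilde{L}_{\mathrm{GD}}(\theta) \;=\; L(\theta) \;+\; \frac{h}{4}\,\bigl\|\nabla L(\theta)\bigr\|^{2}
% Strengthening claimed for heavy-ball momentum with parameter \beta:
\tilde{L}_{\mathrm{GD+M}}(\theta) \;=\; L(\theta) \;+\; \frac{h}{4}\,\frac{1+\beta}{1-\beta}\,\bigl\|\nabla L(\theta)\bigr\|^{2}
```

As β → 1 the ratio (1+β)/(1−β) grows without bound, so larger momentum corresponds to a stronger penalty on the gradient norm.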

1. INTRODUCTION

Deep neural networks (NNs) have led to huge empirical successes in recent years across a wide variety of tasks, ranging from computer vision, natural language processing, and autonomous driving to medical imaging, astronomy, and physics (Bengio & LeCun, 2007; Hinton et al., 2006; Goodfellow et al., 2016). Most deep learning problems are in essence solving an over-parameterized, large-scale, non-convex optimization problem. A mysterious phenomenon about NNs that has attracted much attention in the past few years is why they generalize so well. Indeed, even with extremely over-parameterized models, NNs rarely show a sign of over-fitting (Neyshabur, 2017). Thus far, studies along this line have successfully revealed many forms of implicit regularization that potentially lead to good generalization when gradient descent (GD) or stochastic gradient descent (SGD) algorithms are used for training, including norm penalty (Soudry et al., 2018), implicit gradient regularization (Barrett & Dherin, 2020), and implicit Hessian regularization through noise injection (Orvieto et al., 2022a;b). In contrast, the family of momentum accelerated gradient descent methods, including Polyak's Heavy-ball momentum (Polyak, 1964), Nesterov's momentum (Sutskever et al., 2013), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2014), albeit being powerful alternatives to SGD with faster convergence rates, are far from well-understood in the aspect of implicit regularization. In this paper, we analyze the implicit gradient regularization in the Heavy-ball momentum accelerated SGD (SGD+M) algorithm, with the goal of gaining more theoretical insight into how momentum affects the generalization performance of SGD and why it tends to introduce a variance-reduction effect whose strength increases with the momentum parameter.

2. RELATED LITERATURE

It has been well studied that gradient-based optimization implicitly biases solutions towards models of lower complexity, which encourages better generalization. For example, in an over-parameterized quadratic model, gradient descent with a near-zero initialization implicitly biases solutions towards minimum-norm solutions.
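For reference, the two discrete updates compared throughout, plain GD and Polyak's heavy-ball (GD+M), can be sketched on a toy quadratic loss. The loss, curvature matrix, and parameter values below are illustrative choices, not taken from the paper:

```python
import numpy as np

def grad_L(theta):
    # Gradient of the toy quadratic loss L(theta) = 0.5 * theta^T A theta.
    A = np.diag([1.0, 10.0])  # illustrative ill-conditioned curvature
    return A @ theta

def gd(theta0, h=0.05, steps=200):
    # Plain gradient descent: theta_{k+1} = theta_k - h * grad L(theta_k).
    theta = theta0.copy()
    for _ in range(steps):
        theta -= h * grad_L(theta)
    return theta

def gd_momentum(theta0, h=0.05, beta=0.9, steps=200):
    # Polyak heavy-ball: v_{k+1} = beta * v_k - h * grad L(theta_k),
    #                    theta_{k+1} = theta_k + v_{k+1}.
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v - h * grad_L(theta)
        theta += v
    return theta

theta0 = np.array([1.0, 1.0])
print(gd(theta0), gd_momentum(theta0))  # both approach the minimizer at the origin
```

Setting beta = 0 in `gd_momentum` recovers plain GD exactly, which is why β is the natural knob for studying how momentum changes the implicit regularization.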

