IMPLICIT REGULARIZATION IN HEAVY-BALL MOMENTUM ACCELERATED STOCHASTIC GRADIENT DESCENT

Published as a conference paper at ICLR 2023

Abstract

It is well known that the finite step size (h) in Gradient Descent (GD) implicitly regularizes solutions towards flatter minima. A natural question to ask is "Does the momentum parameter β play a role in implicit regularization in Heavy-ball (H.B.) momentum accelerated gradient descent (GD+M)?" To answer this question, first, we show that the discrete H.B. momentum update (GD+M) follows a continuous trajectory induced by a modified loss, which consists of the original loss and an implicit regularizer. Then, we show that this implicit regularizer for (GD+M) is stronger than that of (GD) by a factor of (1+β)/(1−β), thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the update of (SGD+M) in a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory.
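As a minimal illustration of the Heavy-ball update discussed above, the following sketch implements x_{k+1} = x_k − h∇E(x_k) + β(x_k − x_{k−1}) on a toy ill-conditioned quadratic; the loss, step size h = 0.05, and β = 0.8 are arbitrary illustrative choices, not settings from this paper.

```python
def heavy_ball(grad, x0, h, beta, steps):
    """Heavy-ball momentum: x_{k+1} = x_k - h*grad(x_k) + beta*(x_k - x_{k-1}).
    Setting beta = 0 recovers plain gradient descent."""
    x_prev, x = list(x0), list(x0)  # zero initial momentum via x_{-1} = x_0
    for _ in range(steps):
        g = grad(x)
        x, x_prev = [xi - h * gi + beta * (xi - pi)
                     for xi, gi, pi in zip(x, g, x_prev)], x
    return x

# Toy ill-conditioned quadratic E(x) = 0.5 * sum(lam_i * x_i^2), so grad E = lam * x.
lams = [1.0, 30.0]
grad = lambda x: [l * xi for l, xi in zip(lams, x)]

x_gd = heavy_ball(grad, [1.0, 1.0], h=0.05, beta=0.0, steps=100)  # plain GD
x_hb = heavy_ball(grad, [1.0, 1.0], h=0.05, beta=0.8, steps=100)  # GD+M
```

On this example the momentum iterates contract much faster along the poorly conditioned direction; the generalization effects of β analyzed in this paper are a separate phenomenon from this classical acceleration.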

1. INTRODUCTION

Deep neural networks (NNs) have led to huge empirical successes in recent years across a wide variety of tasks, ranging from computer vision, natural language processing, and autonomous driving to medical imaging, astronomy, and physics (Bengio & LeCun, 2007; Hinton et al., 2006; Goodfellow et al., 2016). Most deep learning problems are in essence over-parameterized, large-scale non-convex optimization problems. A mysterious phenomenon about NNs that has attracted much attention in the past few years is why they generalize so well. Indeed, even with extremely over-parameterized models, NNs rarely show signs of over-fitting (Neyshabur, 2017). Thus far, studies along this line have successfully revealed many forms of implicit regularization that potentially lead to good generalization when gradient descent (GD) or stochastic gradient descent (SGD) algorithms are used for training, including norm penalties (Soudry et al., 2018), implicit gradient regularization (Barrett & Dherin, 2020), and implicit Hessian regularization through noise injection (Orvieto et al., 2022a; b). In contrast, the family of momentum accelerated gradient descent methods, including Polyak's Heavy-ball momentum (Polyak, 1964), Nesterov's momentum (Sutskever et al., 2013), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2014), albeit being powerful alternatives to SGD with faster convergence rates, is far from well understood in the aspect of implicit regularization. In this paper, we analyze the implicit gradient regularization in the Heavy-ball momentum accelerated SGD (SGD+M) algorithm with the goal of gaining more theoretical insight into how momentum affects the generalization performance of SGD, and why it tends to introduce a variance reduction effect whose strength increases with the momentum parameter.

2. RELATED LITERATURE

It has been well studied that gradient-based optimization implicitly biases solutions towards models of lower complexity, which encourages better generalization. For example, in an over-parameterized quadratic model, gradient descent with a near-zero initialization implicitly biases solutions towards having a small nuclear norm (Arora et al., 2019; Gunasekar et al., 2017; Razin & Cohen, 2020); in a least-squares regression problem, gradient descent solutions with a zero initial guess are biased towards having a minimum ℓ2 norm (Soudry et al., 2018; Neyshabur et al., 2014; Ji & Telgarsky, 2019; Poggio et al., 2020). Similarly, in a linear classification problem with separable data, the solution of gradient descent is biased towards the max-margin (i.e., the minimum ℓ2 norm) solution (Soudry et al., 2018). However, in (Vardi & Shamir, 2021), the authors showed that these norm-based regularization results proved in simple settings might not extend to non-linear neural networks. The first general implicit regularization for GD discovered for all non-linear models (including neural networks) is Implicit Gradient Regularization (IGR) (Barrett & Dherin, 2020). It is shown that the learning rate in gradient descent (GD) penalizes the second moment of the loss gradients, hence encouraging the discovery of flatter optima. Flatter optima usually give higher test accuracy and are more robust to parameter perturbations (Barrett & Dherin, 2020). Implicit gradient regularization was also discovered for Stochastic Gradient Descent (SGD) (Smith et al., 2021; Li et al., 2019), as one (but perhaps not the only) reason for its good generalization. SGD is believed to also benefit from its stochasticity, which might act as a type of noise injection that enhances performance.
Indeed, it is shown in (Wu et al., 2020) that, by injecting noise into the gradients, full-batch gradient descent is able to match the performance of SGD with small batch sizes. Besides injecting noise into the gradients, many other forms of noise injection have been discovered to have an implicit regularization effect on the model parameters, including noise injection into the model space (Orvieto et al., 2022b) and into the network activations (Camuto et al., 2020). However, how these different types of regularization cooperatively affect generalization is still quite unclear. The effect of momentum accelerated gradient descent on generalization has been studied much less. Li et al. (2019) analyzed the trajectory of SGD+M and found that it can be weakly approximated by solutions of certain Itô stochastic differential equations, which hinted at the existence of IGR in (SGD+M). However, both the explicit formula of IGR and its relation to generalization remain unknown. Recently, in (Wang et al., 2021), the authors analyzed the implicit regularization in momentum (GD+M) for a linear classification problem with separable data and showed that (GD+M) converges to the ℓ2 max-margin solution. Although this is one of the first proposed forms of implicit regularization for momentum-based methods, it fails to provide insight into the implicit regularization of momentum in non-linear neural networks. Recently, Jelassi & Li (2022) showed that (GD+M) increases the generalization capacity of networks in some special settings (i.e., a simple binary classification problem with a two-layer network where part of the input features are much weaker than the rest), but it is unclear to what extent the insight obtained from this special setting can be extended to practical NN models. To the best of our knowledge, no prior work has derived an implicit regularization for (SGD+M) for general non-linear neural networks.

3. IMPLICIT GRADIENT REGULARIZATION FOR GRADIENT DESCENT AND ITS RELATION TO GENERALIZATION

We briefly review the IGR derived for GD (Barrett & Dherin, 2020), on which our analysis is based. Let E(x) be the loss function defined over the parameter space x ∈ R^p of the neural network. Gradient descent iterates take a discrete step of size h opposite to the gradient of the loss at the current iterate:

x_{k+1} = x_k − h ∇E(x_k). (1)

With an infinitesimal step size (h → 0), the trajectory of GD converges to that of the first-order ODE

x′(t) = −∇E(x(t)), (2)

known as the gradient flow. But for a finite (albeit small) step size h, GD steps off the path of the gradient flow and follows more closely the path of a modified flow:

x′(t) = −∇Ê(x(t)), where Ê(x) = E(x) + (h/4) ∥∇E(x)∥². (3)
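On a one-dimensional quadratic loss, all three trajectories in (1)–(3) are available in closed form, so the claim that GD tracks the modified flow more closely than the gradient flow can be checked numerically. A minimal sketch (the curvature lam, step size h, and horizon k below are illustrative choices, not values from this paper):

```python
import math

def gd_vs_flows(lam, h, k, x0=1.0):
    """For E(x) = lam*x^2/2, compare k GD steps, eq. (1), with the exact
    gradient flow, eq. (2), and the modified flow, eq. (3), at time t = k*h."""
    gd = x0 * (1.0 - h * lam) ** k           # discrete GD iterate, eq. (1)
    flow = x0 * math.exp(-lam * k * h)       # gradient flow of E, eq. (2)
    # E_hat(x) = E(x) + (h/4)*(lam*x)^2, so grad E_hat(x) = (lam + h*lam^2/2)*x
    mod = x0 * math.exp(-(lam + 0.5 * h * lam ** 2) * k * h)  # modified flow, eq. (3)
    return gd, flow, mod

gd, flow, mod = gd_vs_flows(lam=10.0, h=0.02, k=20)
```

For these values the GD iterate lies an order of magnitude closer to the modified flow than to the plain gradient flow, consistent with the O(h²) per-step correction that the regularizer in (3) captures.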

