Published as a conference paper at ICLR 2023

IMPLICIT REGULARIZATION IN HEAVY-BALL MOMENTUM ACCELERATED STOCHASTIC GRADIENT DESCENT

Abstract

It is well known that the finite step size $h$ in Gradient Descent (GD) implicitly regularizes solutions towards flatter minima. A natural question to ask is: does the momentum parameter $\beta$ play a role in implicit regularization in Heavy-ball (H.B.) momentum accelerated gradient descent (GD+M)? To answer this question, we first show that the discrete H.B. momentum update (GD+M) follows a continuous trajectory induced by a modified loss, which consists of the original loss and an implicit regularizer. We then show that this implicit regularizer for (GD+M) is stronger than that of (GD) by a factor of $\frac{1+\beta}{1-\beta}$, thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the (SGD+M) update in a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory.

1. INTRODUCTION

Deep neural networks (NN) have led to huge empirical successes in recent years across a wide variety of tasks, ranging from computer vision, natural language processing, and autonomous driving to medical imaging, astronomy, and physics (Bengio & LeCun, 2007; Hinton et al., 2006; Goodfellow et al., 2016). Most deep learning problems are in essence over-parameterized, large-scale non-convex optimization problems. A mysterious phenomenon about NNs that has attracted much attention in the past few years is why they generalize so well. Indeed, even with extremely over-parametrized models, NNs rarely show signs of over-fitting (Neyshabur, 2017). Thus far, studies along this line have successfully revealed many forms of implicit regularization that potentially lead to good generalization when gradient descent (GD) or stochastic gradient descent (SGD) algorithms are used for training, including norm penalties (Soudry et al., 2018), implicit gradient regularization (Barrett & Dherin, 2020), and implicit Hessian regularization through noise injection (Orvieto et al., 2022a; b). In contrast, the family of momentum accelerated gradient descent methods, including Polyak's Heavy-ball momentum (Polyak, 1964), Nesterov's momentum (Sutskever et al., 2013), RMSProp (Tieleman et al., 2012), and Adam (Kingma & Ba, 2014), albeit being powerful alternatives to SGD with faster convergence rates, is far from well understood in terms of implicit regularization. In this paper, we analyze the implicit gradient regularization in the Heavy-ball momentum accelerated SGD (SGD+M) algorithm, with the goal of gaining more theoretical insight into how momentum affects the generalization performance of SGD, and why it tends to introduce a variance reduction effect whose strength increases with the momentum parameter.

2. RELATED LITERATURE

It has been well studied that gradient-based optimization implicitly biases solutions towards models of lower complexity, which encourages better generalization. For example, in an over-parameterized quadratic model, gradient descent with a near-zero initialization implicitly biases solutions towards a small nuclear norm (Arora et al., 2019; Gunasekar et al., 2017; Razin & Cohen, 2020); in a least-squares regression problem, gradient descent solutions with a zero initial guess are biased towards a minimum $\ell_2$ norm (Soudry et al., 2018; Neyshabur et al., 2014; Ji & Telgarsky, 2019; Poggio et al., 2020). Similarly, in a linear classification problem with separable data, the solution of gradient descent is biased towards the max-margin (i.e., the minimum $\ell_2$ norm) solution (Soudry et al., 2018). However, in (Vardi & Shamir, 2021), the authors showed that these norm-based regularization results proved in simple settings might not extend to non-linear neural networks. The first general implicit regularization for GD discovered for all non-linear models (including neural networks) is Implicit Gradient Regularization (IGR) (Barrett & Dherin, 2020): the finite learning rate in gradient descent penalizes the second moment of the loss gradients, hence encouraging the discovery of flatter optima. Flatter optima usually give higher test accuracy and are more robust to parameter perturbations (Barrett & Dherin, 2020). Implicit gradient regularization was also discovered for Stochastic Gradient Descent (SGD) (Smith et al., 2021; Li et al., 2019), as one (but perhaps not the only) reason for its good generalization. SGD is believed to also benefit from its stochasticity, which might act as a type of noise injection that enhances performance. Indeed, it is shown in (Wu et al., 2020) that, by injecting noise into the gradients, full-batch gradient descent is able to match the performance of SGD with small batch sizes.
Besides injecting noise into the gradients, many other forms of noise injection have been discovered to have an implicit regularization effect on the model parameters, including noise injected into the model space (Orvieto et al., 2022b) and into the network activations (Camuto et al., 2020). However, how these different types of regularization cooperatively affect generalization is still quite unclear. The effect of momentum accelerated gradient descent on generalization has been studied much less. Li et al. (2019) analyzed the trajectory of SGD+M and found that it can be weakly approximated by solutions of certain Ito stochastic differential equations, which hinted at the existence of IGR in (SGD+M). However, both the explicit formula of the IGR and its relation to generalization remained unknown. Recently, in (Wang et al., 2021), the authors analyzed the implicit regularization of momentum (GD+M) in a linear classification problem with separable data and showed that (GD+M) converges to the $\ell_2$ max-margin solution. Although this is one of the first proposed forms of implicit regularization for momentum-based methods, it fails to provide insight into the implicit regularization of momentum in non-linear neural networks. Recently, (Jelassi & Li, 2022) showed that (GD+M) increases the generalization capacity of networks in some special settings (i.e., a simple binary classification problem with a two-layer network where part of the input features are much weaker than the rest), but it is unclear to what extent the insight obtained from this special setting extends to practical NN models. To the best of our knowledge, no prior work has derived an implicit regularization for (SGD+M) for general non-linear neural networks.

3. IMPLICIT GRADIENT REGULARIZATION FOR GRADIENT DESCENT AND ITS RELATION TO GENERALIZATION

We briefly review the IGR defined for GD (Barrett & Dherin, 2020), on which our analysis is based. Let $E(x)$ be the loss function defined over the parameter space $x \in \mathbb{R}^p$ of the neural network. Gradient descent iterates take a discrete step of size $h$ opposite to the gradient of the loss at the current iterate:
$$x_{k+1} = x_k - h\nabla E(x_k). \quad (1)$$
With an infinitesimal step size ($h \to 0$), the trajectory of GD converges to that of the first-order ODE
$$x'(t) = -\nabla E(x(t)), \quad (2)$$
known as the gradient flow. But for a finite (albeit small) step size $h$, the GD updates step off the path of the gradient flow and follow more closely the path of a modified flow:
$$x'(t) = -\nabla \hat E(x(t)), \quad \text{where } \hat E(x) = E(x) + \frac{h}{4}\|\nabla E(x)\|^2. \quad (3)$$
It is shown in (Barrett & Dherin, 2020), via classical backward error analysis, that when GD and the two gradient flows (2) and (3) all set off from the same point $x_k$, the next gradient update $x_{k+1}$ is $O(h^2)$-close to the original gradient flow (2) evaluated at the next time point $t_{k+1} = t_k + h$, but $O(h^3)$-close to the modified flow (3) evaluated at $t_{k+1}$. So, locally, the modified flow tracks the gradient descent trajectory more closely than the original flow. To discuss the global behaviour of GD, we need the following definition of closeness between two trajectories.

Definition 3.1 ($O(h^\alpha)$-closeness in the strong sense). Fix some $T > 0$. We call the trajectory of the discrete GD update $x_k$ and a continuous flow $x(t)$ $O(h^\alpha)$-close in the strong sense[1] if
$$\max_{k \in K} \|x_k - x(t_k)\|_2 \le c h^\alpha, \quad K = \{1, \dots, \lfloor T/h \rfloor\},$$
where $t_k = hk$ and $c$ is some constant independent of $h$ and $k$.

Definition 3.1 quantifies the global closeness of a discrete trajectory and a continuous one. By accumulating the local error, one can show that, setting off from the same location $x_0$, the original gradient flow (2) is $O(h)$-close to the GD trajectory while the modified flow (3) is $O(h^2)$-close.
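As a quick numerical sanity check of this closeness hierarchy (not part of the paper's experiments), the sketch below compares discrete GD against the exact solutions of the gradient flow and the modified flow on a 1-D quadratic loss $E(x) = \frac{1}{2}\lambda x^2$, where both flows have closed forms; the values of $\lambda$, $h$, and the horizon are arbitrary choices:

```python
import math

def gd_iterates(x0, lam, h, K):
    # discrete GD on E(x) = 0.5 * lam * x^2: x_{k+1} = x_k - h * lam * x_k
    xs = [x0]
    for _ in range(K):
        xs.append(xs[-1] * (1.0 - h * lam))
    return xs

def max_flow_error(x0, lam, h, K, modified):
    # Exact flow solutions: the gradient flow x' = -lam * x decays at rate lam,
    # while the modified flow for E_hat = E + (h/4) * (E')^2 decays at rate
    # lam * (1 + h * lam / 2).  Compare each against the discrete GD iterates.
    rate = lam * (1.0 + 0.5 * h * lam) if modified else lam
    xs = gd_iterates(x0, lam, h, K)
    return max(abs(xs[k] - x0 * math.exp(-rate * h * k)) for k in range(K + 1))

err_flow = max_flow_error(1.0, 1.0, 0.01, 200, modified=False)  # O(h)-close
err_mod = max_flow_error(1.0, 1.0, 0.01, 200, modified=True)    # O(h^2)-close
```

On this toy problem the modified flow's error is roughly two orders of magnitude smaller than the gradient flow's, matching the $O(h)$ versus $O(h^2)$ distinction above.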
Based on this observation, the authors defined the $O(h)$ term $\frac{h}{4}\|\nabla E(x)\|^2$ in the modified flow as the IGR term and concluded that it guides the solutions to flatter minima in a highly non-convex landscape. To justify why minimizing $\|\nabla E(x)\|^2$ is a good idea and why it encourages flat minimizers (which seems to be missing from the original paper), we borrow an argument from (Foret et al., 2020) that was originally developed for a different purpose. By the PAC-Bayes analysis (Neyshabur, 2017), a simplified generalization bound for NN derived under some technical conditions can be stated as (Foret et al., 2020)
$$L_D(x) \le \max_{\|\epsilon\| \le \rho} L_S(x + \epsilon) + \hat h\!\left(\frac{\|x\|^2}{\rho}\right), \quad \text{for any } \rho > 0,$$
where $L_D$ is the population loss (i.e., the generalization error), $L_S$ is the empirical/training loss, and $\hat h: \mathbb{R}^+ \to \mathbb{R}^+$ is some strictly increasing function. One can minimize this upper bound in order to control the generalization error $L_D$. The term $\hat h(\|x\|^2/\rho)$ in the upper bound can be controlled by activating a weight-decay penalty during training, and the first term in the bound is usually decomposed as
$$\max_{\|\epsilon\|\le\rho} L_S(x+\epsilon) = \underbrace{\max_{\|\epsilon\|\le\rho}\big(L_S(x+\epsilon) - L_S(x)\big)}_{\text{sharpness}} + \underbrace{L_S(x)}_{\text{training loss}},$$
which consists of the training loss and an extra term called the sharpness; minimizing the latter helps with generalization. Since directly minimizing the sharpness is difficult, one can use the first-order Taylor approximation
$$\max_{\|\epsilon\|\le\rho}\big(L_S(x+\epsilon) - L_S(x)\big) \approx \max_{\|\epsilon\|\le\rho} \epsilon^T \nabla L_S(x) = \rho\,\|\nabla L_S(x)\|.$$
In our notation, $\nabla L_S(x)$ is $\nabla E(x)$, so the sharpness is approximately proportional to $\|\nabla E\|$, the square root of the IGR term.
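The first-order approximation of the sharpness can be checked numerically. The sketch below uses an arbitrary smooth 2-parameter toy loss (our own choice, not a loss from the paper), scans perturbations of norm $\rho$ on the sphere, and compares the maximum loss increase to $\rho\|\nabla L_S\|$:

```python
import math

def loss(w):
    # a toy 2-parameter "empirical loss" L_S (an arbitrary smooth function)
    return math.sin(w[0]) + 0.5 * w[1] ** 2

def grad_loss(w):
    return [math.cos(w[0]), w[1]]

def sharpness(w, rho, n=2000):
    # max_{||eps|| <= rho} (L_S(w + eps) - L_S(w)); for small rho the maximum
    # sits on the sphere of radius rho, which we scan densely in 2-D
    base = loss(w)
    best = 0.0
    for i in range(n):
        th = 2.0 * math.pi * i / n
        best = max(best, loss([w[0] + rho * math.cos(th),
                               w[1] + rho * math.sin(th)]) - base)
    return best

w, rho = [0.3, 0.7], 1e-3
g = grad_loss(w)
first_order = rho * math.sqrt(g[0] ** 2 + g[1] ** 2)  # rho * ||grad L_S(w)||
exact = sharpness(w, rho)                             # agree up to O(rho^2)
```

For small $\rho$, the two quantities differ only by the $O(\rho^2)$ curvature term, which is why $\|\nabla E\|$ is a useful proxy for sharpness.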

4. IMPLICIT GRADIENT REGULARIZATION FOR HEAVY BALL ACCELERATED GRADIENT DESCENT

The main mathematical challenge in studying the IGR of momentum updates is that we now need to perform a global error analysis instead of the local backward analysis, as the momentum updates utilize the entire update history. The IGR for Heavy-ball momentum was previously analyzed in (Kovachki & Stuart, 2021) through its relationship with the damped second-order Hamiltonian dynamics
$$m x''(t) + \gamma x'(t) + \nabla E(x(t)) = 0,$$
which is well known as the underlying ODE for the momentum updates. However, only $O(h)$ closeness was proven between the momentum updates and this ODE trajectory under general step-size assumptions[2], which is not enough, since the implicit regularization term $\frac{h}{4}\|\nabla E\|^2$ is itself of order $O(h)$. In addition, this approach is difficult to apply in the stochastic setting. In this paper, we circumvent the use of the second-order ODE (as it only gives $O(h)$ closeness) and directly obtain a continuous path that is $O(h^2)$-close to the momentum updates for both GD and SGD. This is achieved by linking the momentum updates with a first-order piecewise ODE, proving that this ODE has a piecewise differentiable trajectory that is $O(h^2)$-close to the momentum updates, and then using its trajectory to study the IGR. The detailed argument can be found in the appendix. Here we provide the final mathematical formula for the implicit regularization of the (Heavy-ball) momentum-based gradient descent method (IGR-M).

Theorem 4.1 (IGR-M). Let the full-batch loss $E$ be smooth and 4-times differentiable. Then the (GD+M) updates
$$x_{k+1} = x_k - h\nabla E(x_k) + \beta(x_k - x_{k-1}), \quad k = 1, 2, \dots, n,$$
are $O(h^2)$-close to the continuous trajectory of the piecewise first-order ODE
$$\tilde x'(t) = -\frac{1-\beta^{k+1}}{1-\beta}\,\nabla E(\tilde x(t)) - \frac{h\gamma(1+\beta)}{2(1-\beta)^3}\,\nabla^2 E(\tilde x(t))\,\nabla E(\tilde x(t)), \quad t \in [t_k, t_{k+1}], \quad (4)$$
where $t_k = kh$ and $\gamma = (1-\beta^{2k+2}) - 4(k+1)\beta^{k+1}\frac{1-\beta}{1+\beta}$.
Since $\beta^k$ quickly decays to 0 as $k$ grows, for a sufficiently large iteration $k$, equation (4) reduces to
$$\tilde x'(t) = -\frac{1}{1-\beta}\nabla \hat E(\tilde x(t)), \quad t \in [0, T], \quad (5)$$
driven by the modified loss
$$\hat E(\tilde x) := E(\tilde x) + \frac{(1+\beta)h}{4(1-\beta)^2}\,\|\nabla E(\tilde x)\|_2^2.$$
More specifically, for a fixed time $T$, there exists a constant $C$ such that for any learning rate $0 < h < T$,
$$\|\tilde x(t_k) - x_k\|_2 \le C h^2, \quad t_k = kh, \quad k = 1, 2, \dots, \lfloor T/h \rfloor.$$
Comparing the continuous trajectory (5) with the momentum-free one (3), we immediately make a few important observations.

Remark 4.1. Ignoring the $O(h)$ terms in (5) and (3), we see that the momentum trajectory is driven by a force $\frac{1}{1-\beta}$ times as large as that for GD. Therefore, at the same learning rate, (GD+M) is expected to converge $\frac{1}{1-\beta}$ times as fast as GD. Alternatively, (GD+M) with a learning rate $h$ has roughly the same convergence rate as GD with a learning rate $\frac{h}{1-\beta}$. From now on, we call $\frac{h}{1-\beta}$ the effective learning rate of (GD+M).

Remark 4.2. In terms of the IGR, adding momentum amplifies the strength of the IGR: the coefficient in front of the IGR term increases from $\frac{h}{4}$ in (GD) to $\frac{h}{4}\cdot\frac{1+\beta}{(1-\beta)^2}$ in (GD+M). Even when the effective learning rates of the two methods are equal (i.e., $\frac{h_{GD+M}}{1-\beta} = h_{GD}$), the implicit regularization in (GD+M) is still $\frac{1+\beta}{1-\beta}$ times stronger.

Remark 4.3. The IGR for (GD+M) reduces to the IGR for GD when $\beta = 0$. With momentum parameter $\beta > 0$, the strength of the implicit regularizer increases by a factor of $\frac{1+\beta}{1-\beta}$. Hence, for increasing values of $\beta$, the strength of the regularization increases, preferring flatter trajectories during the descent.
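The effective learning rate of Remark 4.1 has an elementary mechanical interpretation: on a loss with constant gradient, each gradient contributes a geometric series of displacements. The toy sketch below (our own illustration, with arbitrary values of $h$, $\beta$, $g$) runs (GD+M) on a linear loss $E(x) = gx$ and watches the per-step displacement approach $-\frac{h}{1-\beta}g$:

```python
def momentum_displacements(g, h, beta, n):
    # (GD+M) on a linear loss E(x) = g * x, whose gradient is the constant g
    x_prev = x = 0.0
    steps = []
    for _ in range(n):
        x_new = x - h * g + beta * (x - x_prev)
        steps.append(x_new - x)
        x_prev, x = x, x_new
    return steps

h, beta, g = 0.01, 0.9, 2.0
steps = momentum_displacements(g, h, beta, 50)
# the per-step displacement builds up the geometric series
# -h*g*(1 + beta + beta^2 + ...) -> -(h/(1-beta)) * g,
# i.e. momentum behaves like GD with the effective learning rate h/(1-beta)
```

The first step is exactly $-hg$, and after a transient of a few multiples of $\frac{1}{1-\beta}$ iterations, the displacement per step settles at $\frac{1}{1-\beta}$ times that of plain GD.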

4.1. IGR-M IN A 2D LINEAR MODEL

We first show the impact of IGR-M in a very simple setting: minimizing a loss function with two learnable parameters,
$$(\hat w_1, \hat w_2) = \arg\min_{w_1, w_2} E(w_1, w_2), \quad \text{where } E(w_1, w_2) = \frac{1}{2}(y - w_1 w_2 x)^2.$$
Here $x, y, w_1, w_2$ are all scalars, mimicking a simple scalar linear two-layer network. For a given pair $(x, y)$, the global minima of $E(w_1, w_2)$ are all the points on the curve $w_1 w_2 = y/x$. The continuous gradient flow of the parameters is $w_1'(t) = -\frac{\partial E}{\partial w_1}$, $w_2'(t) = -\frac{\partial E}{\partial w_2}$.

Figure 1: Implicit regularization for (GD+M) is stronger than that of (GD) at the same effective learning rate $\frac{h}{1-\beta}$. As $\beta$ increases, the optimizer finds solutions with a lower norm. This confirms Remark 4.2 that the strength of implicit regularization increases with $\beta$. The background color denotes the squared norm of the gradient, $\|\nabla E\|_2^2$.

The IGR for this problem is $\frac{h_{GD}}{4}\|\nabla E\|_2^2 \propto (w_1^2 + w_2^2)E$, which regularizes the trajectory to find minima with a smaller value of $w_1^2 + w_2^2$ (towards the cross in Figure 1) among all the global minima lying on $w_1 w_2 = y/x$. We intentionally chose the same experiment as in (Barrett & Dherin, 2020) to compare the effect of implicit regularization for (GD) and (GD+M). For a fair comparison, we set the effective learning rates to be equal, i.e., $h_{GD} = \frac{h_{GD+M}}{1-\beta}$, as in Remark 4.1. With the same initialization $(w_1^0, w_2^0) = (2.8, 3.4)$, we track the paths of four trajectories with $(h_{GD+M}, \beta)$ being $(5\times 10^{-3}, 0.5)$, $(2\times 10^{-3}, 0.8)$, $(10^{-3}, 0.9)$ and $(10^{-2}, 0.0)$. In all four cases, the effective learning rates are the same: $\frac{h_{GD+M}}{1-\beta} = h_{GD} = 10^{-2}$. We make the following observations: a) For all four trajectories, the converged parameters $w^* = (w_1^*, w_2^*)$ lie on the global-minima curve.
b) With increasing $\beta$, the converged solutions have decreasing $\ell_2$ norm (i.e., increasing strength of implicit regularization):
$$\|w^*_{(10^{-3},\,0.9)}\|_2 < \|w^*_{(2\times 10^{-3},\,0.8)}\|_2 < \|w^*_{(5\times 10^{-3},\,0.5)}\|_2 < \|w^*_{(10^{-2},\,0.0)}\|_2.$$
This observation supports Remark 4.2: the strength of implicit regularization increases with $\beta$, even at the same effective learning rate.
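This toy experiment can be reproduced in a few lines. The sketch below assumes $x = y = 1$ (the data scalars are not specified above, so this is our choice), runs (GD) and (GD+M) from the same initialization at the same effective learning rate $10^{-2}$, and compares the norms of the converged solutions:

```python
def descend(h, beta, steps=20000, w_init=(2.8, 3.4), x=1.0, y=1.0):
    # heavy-ball iterations on E(w1, w2) = 0.5 * (y - w1*w2*x)^2;
    # beta = 0 recovers plain GD
    w, w_prev = list(w_init), list(w_init)
    for _ in range(steps):
        r = y - w[0] * w[1] * x                 # residual
        grad = [-r * w[1] * x, -r * w[0] * x]   # dE/dw1, dE/dw2
        w_new = [w[i] - h * grad[i] + beta * (w[i] - w_prev[i]) for i in range(2)]
        w_prev, w = w, w_new
    return w

w_gd = descend(h=1e-2, beta=0.0)    # effective learning rate 1e-2
w_gdm = descend(h=1e-3, beta=0.9)   # effective learning rate 1e-3 / 0.1 = 1e-2
norm_gd = (w_gd[0] ** 2 + w_gd[1] ** 2) ** 0.5
norm_gdm = (w_gdm[0] ** 2 + w_gdm[1] ** 2) ** 0.5
```

Both runs land on the hyperbola $w_1 w_2 = y/x$, and the momentum run ends at a smaller norm, consistent with its stronger IGR.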

5. IMPLICIT REGULARIZATION IN SGD WITH MOMENTUM

In SGD, the full-batch gradient is replaced by its sampled unbiased estimator. Assume that the loss function has the form $E(x) = \frac{1}{M}\sum_{j=1}^{M} E^{(j)}(x)$, where $E^{(j)}$ is the $j$-th mini-batch loss. In the $k$-th iteration, we randomly pick a mini-batch, whose loss is denoted by $E_k$, and update the parameters accordingly. The heavy-ball accelerated SGD iterates as
$$\begin{cases} x_{k+1} = x_k - h\nabla E_k(x_k) + \beta(x_k - x_{k-1}), & k = 1, 2, \dots, n, \\ x_1 = x_0 - h\nabla E_0(x_0), & \\ x_0 = x_{-1}. & \end{cases} \quad (8)$$
For each iteration $k$, the update is driven by the current mini-batch loss $E_k$. Its continuous approximation during the time $[t_k, t_{k+1}]$ is
$$x'(t) = -\nabla E_k(x(t)) + \frac{\beta}{h}\big(x(t_k) - x(t_{k-1})\big), \quad t_k < t < t_{k+1}.$$
As a result, the trajectory of $x$ is continuous but piecewise differentiable: the left and right derivatives are not equal at a transit point $t_k$ from one batch to another, $x'(t_k^+) \ne x'(t_k^-)$. Therefore, we expect the implicit regularization term to also have discontinuous derivatives on different intervals. Below we present the mathematical formula for IGR-M in the stochastic setting.

Theorem 5.1 (IGR-M, stochastic version (IGRM$_s$)). Let the loss for each mini-batch $E_k$ be 4-times differentiable. Then the Heavy-ball momentum updates (8) are $O(h^2)$-close to the trajectory of the gradient flow with the modified loss,
$$\tilde x'(t) = -\nabla \hat E_k(\tilde x(t)), \quad t_k \le t < t_{k+1},$$
where
$$\hat E_k(\tilde x) = \underbrace{G_k(\tilde x)}_{\text{force}} + \underbrace{\frac{h}{4}\Big(\|\nabla G_k(\tilde x)\|_2^2 + 2\sum_{r=0}^{k-1}\beta^{k-r}\|\nabla G_r(\tilde x)\|_2^2\Big)}_{\mathrm{IGRM}_s} \quad (10)$$
with $G_k(\tilde x(t)) = \sum_{r=0}^{k}\beta^{k-r} E_r(\tilde x(t))$. Specifically, there exists a constant $C$ such that $\|\tilde x(t_k) - x_k\|_2 \le C h^2$, $k = 1, 2, \dots, n$.

The theorem tells us that the IGR for momentum takes the form of $\ell_2$ norms of $\nabla G_k$, which is a sum of the historical gradients $\nabla E_r$, $r = 0, \dots, k$, weighted by powers of $\beta$ and evaluated at the current location $\tilde x(t)$.
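The weighted-sum force $G_k$ is exactly what a standard "momentum buffer" implementation accumulates (there, the historical gradients are evaluated at the past iterates rather than at the current point). The sketch below, a toy 1-D setting with hypothetical mini-batch losses $E_k(x) = \frac{1}{2}(x - c_k)^2$ of our own choosing, checks that the heavy-ball recursion (8) coincides with the buffer form $x_{k+1} = x_k - h\,m_k$, $m_k = \nabla E_k(x_k) + \beta m_{k-1} = \sum_r \beta^{k-r}\nabla E_r(x_r)$:

```python
def heavy_ball(cs, h, beta):
    # x_{k+1} = x_k - h * grad(E_k)(x_k) + beta * (x_k - x_{k-1}),
    # with mini-batch losses E_k(x) = 0.5 * (x - c_k)^2, so grad(E_k)(x) = x - c_k
    x_prev = x = 0.0
    traj = []
    for c in cs:
        x_new = x - h * (x - c) + beta * (x - x_prev)
        x_prev, x = x, x_new
        traj.append(x)
    return traj

def buffer_form(cs, h, beta):
    # the "momentum buffer" m_k = grad(E_k)(x_k) + beta * m_{k-1}
    #                          = sum_r beta^(k-r) * grad(E_r)(x_r),
    # a beta-weighted sum of the historical mini-batch gradients
    x, m = 0.0, 0.0
    traj = []
    for c in cs:
        m = (x - c) + beta * m
        x = x - h * m
        traj.append(x)
    return traj

cs = [1.0, -0.5, 2.0, 0.3, 1.7, -1.2, 0.8]  # arbitrary mini-batch "targets"
t1 = heavy_ball(cs, h=0.1, beta=0.9)
t2 = buffer_form(cs, h=0.1, beta=0.9)       # the two trajectories coincide
```

The equivalence follows by induction from $x_k - x_{k-1} = -h\,m_{k-1}$, which is why the driving force in Theorem 5.1 is naturally a $\beta$-weighted sum of historical gradients.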
In addition, the stochastic IGR-M has different expressions on different intervals, caused by the abrupt changes between adjacent batches. Some further remarks about the (SGD+M) algorithm are listed below.

Remark 5.1. The flow of the continuous trajectory is governed by the driving force $-\nabla G_k(\tilde x(t))$ and the negative gradient of an implicit regularizer
$$\mathrm{IGRM}_s(\tilde x) = \frac{h}{4}\Big(\|\nabla G_k(\tilde x)\|_2^2 + 2\sum_{r=0}^{k-1}\beta^{k-r}\|\nabla G_r(\tilde x)\|_2^2\Big),$$
which depends on both the learning rate $h$ and the momentum $\beta$.

Remark 5.2. When $\beta = 0$, (10) reduces to
$$\tilde x'(t) = -\nabla\Big(\underbrace{E_k(\tilde x(t))}_{\text{force}} + \underbrace{\frac{h}{4}\|\nabla E_k(\tilde x(t))\|_2^2}_{\text{IGR}}\Big), \quad t \in [t_k, t_{k+1}],$$
which is the modified loss for SGD.

Remark 5.3. Taking expectation over the random selection of batches, we get $\mathbb{E}(\mathrm{IGRM}_s)(x) = \frac{h(1+\beta)}{4(1-\beta)^3}\|\nabla E\|^2 + \frac{h}{4(1-\beta)^2}F$, where $F := \mathbb{E}\|\nabla E_k - \nabla E\|^2$ (appendix Th 3.1). In comparison, the IGR term in SGD after taking expectation is $\frac{h}{4}(\|\nabla E\|^2 + F)$ (Smith et al., 2021), which is much weaker. Even with the adjusted learning rate, the IGR in (SGD+M) is still about $\frac{1}{1-\beta}$ times stronger than in SGD.

Remark 5.4 (Variance reduction). We notice that momentum has a variance-reduction effect. Explicitly, suppose the effective learning rate (Remark 4.1) is used so that the force terms in (SGD) and (SGD+M) have the same expectation; we can then compare their variances. Let the covariance matrix of $\nabla E_k$ at a fixed point $x$ be $C := \mathbb{E}\big(\nabla E_k(x) - \nabla E(x)\big)\big(\nabla E_k(x) - \nabla E(x)\big)^T$, where $\nabla E(x)$ denotes the full-batch gradient. Then the covariance matrix of the force $-\nabla E_k$ driving (SGD) is exactly $C$, while that of the force $-\nabla G_k$ driving (SGD+M) is only $\frac{1-\beta}{1+\beta}\,C$ (appendix Th 4.1), i.e., smaller by a factor of about $\frac{1-\beta}{1+\beta}$.

Remark 5.5. It has been observed and confirmed in the literature that a larger variance of the SGD iterations (caused by either a smaller batch size or a larger learning rate (Smith & Le, 2017; Li et al., 2017)) increases the generalization power.
A larger variance in the mini-batch gradients increases the escape efficiency of SGD from bad local minima (Ibayashi & Imaizumi, 2022) [see Appendix Section 7 for a detailed discussion], hence increasing generalization power. Therefore, the variance-reduction effect of (SGD+M) hurts generalization. On the other hand, the fact that (SGD+M) has a stronger IGR (Remark 5.3) and that (SGD+M) is more stable than (SGD) under large effective learning rates (see, e.g., (Cohen et al., 2021)) tends to help its generalization. This explains why in practice we usually observe that (SGD+M) is only slightly better than (SGD).
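The variance-reduction factor of Remark 5.4 can be verified by a small Monte Carlo experiment (a 1-D illustration with i.i.d. Gaussian gradient noise, our own construction): the momentum accumulator $G_k = \sum_r \beta^{k-r} g_r$, rescaled by $(1-\beta)$ so that its mean force matches plain SGD, should have stationary variance $\frac{1-\beta}{1+\beta}\sigma^2$.

```python
import random

def rescaled_momentum_variance(beta, sigma, n_steps=100, n_runs=10000, seed=0):
    # Stationary variance of (1 - beta) * G_k, where G_k = sum_r beta^(k-r) g_r
    # and the g_r are i.i.d. N(0, sigma^2) gradient noises.  The (1 - beta)
    # rescaling matches the mean force of plain SGD (the effective-learning-rate
    # normalization of Remark 4.1).
    rng = random.Random(seed)
    samples = []
    for _ in range(n_runs):
        G = 0.0
        for _ in range(n_steps):
            G = beta * G + rng.gauss(0.0, sigma)
        samples.append((1.0 - beta) * G)
    mean = sum(samples) / n_runs
    return sum((s - mean) ** 2 for s in samples) / n_runs

beta, sigma = 0.9, 1.0
v = rescaled_momentum_variance(beta, sigma)
predicted = sigma ** 2 * (1 - beta) / (1 + beta)  # Remark 5.4: (1-beta)/(1+beta) * C
```

The geometric sum $\mathrm{Var}(G_k) = \sigma^2 \sum_i \beta^{2i} \to \frac{\sigma^2}{1-\beta^2}$ gives the same answer in closed form; the Monte Carlo estimate should match it within sampling error.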

6. NUMERICAL EXPERIMENTS

Our first experiment compares full-batch (GD) with (GD+M). For a linear least-squares problem with a Hessian bounded by $L$ in the spectral norm, it is well known (e.g., (Cohen et al., 2021)) that (GD+M) is stable as long as $h \le \frac{2+2\beta}{L}$, while GD is stable as long as $h \le \frac{2}{L}$. This means the maximum achievable effective learning rate for (GD) is $\frac{2}{L}$, while that for (GD+M) can be as large as $\frac{2(1+\beta)}{L(1-\beta)}$. Since a larger effective learning rate means a stronger IGR, (GD+M) clearly benefits from its larger stability region. To confirm this, ResNet-18 is used to classify a uniformly sub-sampled MNIST dataset with 1000 training images. All external regularization schemes except learning-rate decay and batch normalization are turned off. We perform the following experiments: 1) full-batch gradient descent with $\beta = 0$ (i.e., GD) for various learning rates $h$; the best test accuracy is 95.72% (Figure 2). 2) Full-batch gradient descent with momentum (GD+M) for various $\beta$ with a fixed step size $h = 0.1$; the best test accuracy is 96.22% (Figure 3). Our observation is that the best performance of GD (across all learning rates) is worse than the best performance of (GD+M) (across all $\beta$'s). This observation cannot be explained by the known theory of the edge of stability[3], but is well explained by our implicit regularization theory for (GD+M): adding momentum increases the strength of the IGR. To study the effect of implicit regularization in (SGD+M), a series of experiments is performed on an image classification task. Four well-known and popular network architectures, namely DenseNet (Iandola et al., 2014), ResNet-18, ResNet-50 (He et al., 2016) and WideResNet (Zagoruyko & Komodakis, 2016), are trained to classify images from the CIFAR-10 and CIFAR-100 datasets. We are interested in how well networks trained with (SGD) and (SGD+M), respectively, generalize to the test dataset.
To isolate the effects of the momentum parameter $\beta$ and the learning rate $h$ on generalization, we turn off all external regularization such as dropout, weight decay, and label smoothing. We fix the batch size to 640 in all our experiments.

Table 1: Testing accuracy on CIFAR-10 and CIFAR-100 with different momentum $\beta$ and learning rates $h$, but the same effective learning rate $\frac{h}{1-\beta} = 0.1$. The best performance of each model is highlighted. The mean and the standard deviation of test accuracy are calculated over the last 5 epochs and three random seed initializations. DN-121 denotes DenseNet-121, RN-18 and RN-50 denote ResNet-18 and ResNet-50, and WRN-16-8 denotes WideResNet with depth 16 and width factor 8.


In the first experiment, we showed that (GD+M) has a larger stability region than (GD) and hence allows the use of a larger effective learning rate. The same conclusion holds for (SGD+M) and (SGD). However, here we want to show that even in the region where both algorithms are stable, (SGD+M) is still not just a scaled version of (SGD). For this purpose, we pick a small learning rate to ensure the stability of both algorithms, and keep the effective learning rate $\frac{h}{1-\beta}$ for (SGD+M) the same as the learning rate for (SGD) (both equal 0.1). We observe from Table 1 (also Figure 4) that the maximum test accuracy is almost always achieved at the highest value of $\beta$. This observation is consistent with Remark 5.3, where we showed that the implicit regularization in (SGD+M) is indeed stronger than that in (SGD), even after the learning-rate adjustment. The standard deviation of test accuracy in Table 1 is calculated over the last 5 epochs and three random seed initializations. A lower standard deviation indicates a smoother test-accuracy curve, i.e., less variation of test accuracy within an epoch interval. We observe that the lowest standard deviation is achieved at the highest value of $\beta$. Hence, the observation that the variance-reduction effect is more prominent at higher $\beta$ is consistent with Remark 5.4.

7. COMBINED EFFECTS OF IGR AND NOISE INJECTION

Despite its close relation to sharpness (Section 3), the IGR term $\|\nabla E\|^2$ becomes very weak and irrelevant as $x_k$ approaches a local minimizer, since $\nabla E \to 0$. However, we find that this is not the case under noise injection, which can help the IGR term retain its power even near local minima. More specifically, as studied in previous literature (Orvieto et al., 2022b; Camuto et al., 2020), the algorithm resulting from injecting noise into each iteration of GD is usually called PGD (perturbed gradient descent); it essentially minimizes the averaged objective $R(x) := \mathbb{E}_{\eta\sim\mathcal N(0,\sigma^2 I)}\, E(x+\eta)$. For small values of $\sigma$, we can expand $R(x)$ as
$$R(x) = E(x) + \mathbb{E}_\eta\big[\eta^T \nabla E(x)\big] + \tfrac{1}{2}\mathbb{E}_\eta\big[\eta^T \nabla^2 E(x)\,\eta\big] + O(\sigma^3) = E(x) + \tfrac{1}{2}\sigma^2\,\mathrm{Tr}\big(\nabla^2 E(x)\big) + O(\sigma^3),$$
where $\mathrm{Tr}$ denotes the trace operator. Thus minimizing $R(x)$ regularizes the trace of the Hessian of $E$. When minimizing $R(x)$ with an SGD-type update, the iterations are $x_{k+1} = x_k - h\nabla E(x_k + \eta_k)$, $\eta_k \sim \mathcal N(0, \sigma^2 I)$, which is known as a form of PGD. Because of the finite learning rate, the updates follow the modified flow with an IGR term, which in this case is $E(x+\eta_k) + \frac{h}{4}\|\nabla E(x+\eta_k)\|^2$. In expectation, the modified loss is
$$\mathbb{E}_{\eta_k}\Big[E(x+\eta_k) + \frac{h}{4}\|\nabla E(x+\eta_k)\|^2\Big] = \mathbb{E}_{\eta_k}\Big[E(x) + \eta_k^T\nabla E(x) + \tfrac{1}{2}\eta_k^T\nabla^2 E(x)\,\eta_k\Big] + \frac{h}{4}\,\mathbb{E}_{\eta_k}\big\|\nabla E(x) + \nabla^2 E(x)\,\eta_k + \nabla^3 E(x)[\eta_k,\eta_k]\big\|^2 + O(\sigma^3)$$
$$= E(x) + \tfrac{1}{2}\sigma^2\,\mathrm{Tr}\big(\nabla^2 E(x)\big) + \frac{h}{4}\Big(\sigma^2\|\nabla^2 E(x)\|_F^2 + \mathbb{E}_{\eta_k}\big\|\nabla E(x) + \nabla^3 E(x)[\eta_k,\eta_k]\big\|^2\Big) + O(\sigma^3).$$
We see that there is now a Hessian regularization term $\frac{h}{4}\sigma^2\|\nabla^2 E(x)\|_F^2$ coming out of the IGR which does not vanish even around local minimizers, and its strength is proportional to the learning rate. We expect this new regularization term to get stronger when momentum is added, as momentum amplifies the power of the IGR (Remarks 4.2, 5.3). This observation suggests that IGR and noise injection, as two different types of implicit regularization, might reinforce each other when used collaboratively.
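The first expansion above, $R(x) \approx E(x) + \frac{\sigma^2}{2}\mathrm{Tr}(\nabla^2 E(x))$, can be checked directly on a quadratic loss, where the relation is exact (no $O(\sigma^3)$ remainder). The sketch below uses an arbitrary 2-D quadratic of our own choosing and compares a Monte Carlo estimate of the smoothed loss to the trace formula:

```python
import random

def E(x, A):
    # quadratic loss E(x) = 0.5 * x^T A x, whose Hessian is A everywhere
    n = len(x)
    return 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def smoothed_loss(x, A, sigma, n_samples=100000, seed=0):
    # Monte Carlo estimate of R(x) = E_eta[E(x + eta)], eta ~ N(0, sigma^2 I)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        xp = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += E(xp, A)
    return total / n_samples

A = [[2.0, 0.5], [0.5, 1.0]]
x = [1.0, -1.0]
sigma = 0.1
mc = smoothed_loss(x, A, sigma)
trace_hessian = A[0][0] + A[1][1]
pred = E(x, A) + 0.5 * sigma ** 2 * trace_hessian  # E(x) + (sigma^2/2) Tr(Hessian)
```

For non-quadratic losses the same comparison holds up to the $O(\sigma^3)$ remainder, which is the sense in which noise injection regularizes the trace of the Hessian.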

8. CONCLUSION

This work studies the generalization of the momentum-driven gradient descent approach through the lens of implicit regularization (IR), with both theoretical analysis and experimental validation. We examined the similarities and differences between (SGD) and (SGD+M) and found that (SGD+M) with suitable parameters outperforms (SGD) in almost all settings. Moreover, we found that, in addition to momentum, the IGR may also be magnified by noise injection, a topic we plan to explore further in the future.



[1] Weak-sense approximation as studied in Li et al. (2019) only requires the distributions of the sample processes $x$ and $\tilde x$ to be close, whereas our strong-sense approximation requires each instance of $x$ and $\tilde x$ to be close. The strong-sense IGR found using the latter is valid for trainings with any fixed random-batch sequence and any fixed initialization, while the former only characterizes the mean trajectory, taking expectation over many different trainings (each with a random batch sequence and initialization).

[2] An $O(h^2)$ closeness is proven under very stringent conditions on the learning rate, which exclude the interesting regime where momentum has an advantage over plain GD in terms of convergence rate and stability.

[3] "Edge of stability" (EOS) (Cohen et al., 2021) is a phenomenon observed during network training by full-batch gradient descent: the sharpness $\|\nabla^2 E\|_2$ tends to progressively increase until it reaches the threshold $\frac{2}{h}$ and then hovers around it. For (GD+M), the sharpness hovers around the larger value $\frac{2(1+\beta)}{h}$. Since larger sharpness usually means worse generalization, the EOS theory would predict that adding momentum hurts generalization.



Figure 2: (GD): Classification results of ResNet-18 on the MNIST dataset with full-batch gradient descent. The figure shows the effect of implicit regularization due to the finite learning rate $h$. Test accuracy improves with higher learning rates up to $h = 0.2$, where the best test accuracy of 95.72% is achieved.

Figure 4: Classification results for the CIFAR-10 dataset with various network architectures and combinations of $(h, \beta)$ chosen such that the effective learning rate $\frac{h}{1-\beta}$ remains the same. In all experiments, external regularization such as weight decay, learning-rate scheduling, dropout, and label smoothing is kept off (except batch normalization). Results are averaged over 3 random seeds with different initializations. Compared with (SGD), (SGD+M) shows a) higher test accuracy for increasing $\beta$, confirming Remark 5.3, and b) less variance in test accuracy, confirming Remark 5.4.


