IMPLICIT GRADIENT REGULARIZATION

Abstract

Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.

1. INTRODUCTION

The loss surface of a deep neural network is a mountainous terrain: highly non-convex, with a multitude of peaks, plateaus, and valleys (Li et al., 2018; Liu et al., 2020). Gradient descent provides a path through this landscape, taking discrete steps in the direction of steepest descent toward a sub-manifold of minima. However, this simple strategy can be just as hazardous as it sounds. For small learning rates, our model is likely to get stuck at the local minimum closest to the starting point, which is unlikely to be the most desirable destination. For large learning rates, we run the risk of ricocheting between peaks and diverging. For moderate learning rates, however, gradient descent seems to move away from the closest local minima and toward flatter regions where test errors are often smaller (Keskar et al., 2017; Lewkowycz et al., 2020; Li et al., 2019). This phenomenon becomes stronger for larger networks, which also tend to have smaller test error (Arora et al., 2019a; Belkin et al., 2019; Geiger et al., 2020; Liang & Rakhlin, 2018; Soudry et al., 2018). In addition, models with low test errors are more robust to parameter perturbations (Morcos et al., 2018). Overall, these observations contribute to an emerging view that there is some form of implicit regularization in gradient descent, and several sources of implicit regularization have been identified.

We have found a surprising form of implicit regularization hidden within the discrete numerical flow of gradient descent. Gradient descent iterates in discrete steps along the gradient of the loss, so after each step it steps off the exact continuous path that minimizes the loss at each point. Instead of following a trajectory down the steepest local gradient, gradient descent follows a shallower path.
We show that this trajectory is closer to an exact path along a modified loss surface, which can be calculated using backward error analysis from numerical integration theory (Hairer et al., 2006). Our core idea is that the discrepancy between the original loss surface and this modified loss surface is a form of implicit regularization (Theorem 3.1, Section 3). We begin by calculating the discrepancy between the modified loss and the original loss using backward error analysis and find that it is proportional to the second moment of the loss gradients, which we call Implicit Gradient Regularization (IGR). Using differential geometry, we show that IGR is also proportional to the square of the loss surface slope, indicating that it encourages optimization paths with shallower slopes and the discovery of optima in flatter regions of the loss surface. Next, we explore the properties of this regularization in deep neural networks such as MLPs trained to classify MNIST digits and ResNets trained to classify CIFAR-10 images, and in a tractable two-parameter model. In these cases, we verify that IGR effectively encourages models toward minima in the vicinity of small gradient values, in flatter regions with shallower slopes, and that these minima have low test error, consistent with previous observations. We find that IGR can account for the observation that learning rate size is correlated with test accuracy and model robustness. Finally, we demonstrate that IGR can be used as an explicit regularizer, allowing us to directly strengthen this regularization beyond the maximum possible implicit gradient regularization strength.
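The backward error analysis claim can be checked numerically in the simplest possible setting. The following sketch (our own illustrative setup, not code from the paper) uses a one-dimensional quadratic loss E(θ) = θ²/2, for which the modified loss with m = 1 parameter is Ẽ(θ) = E(θ) + (h/4)(E′(θ))². A single discrete gradient descent step lands closer to the exact flow of Ẽ than to the exact flow of E:

```python
import numpy as np

# Illustrative 1-D quadratic loss: E(theta) = theta^2 / 2, so grad E = theta.
# With m = 1 parameter, the modified loss is E~(theta) = E + (h/4) * (grad E)^2.
h = 0.1
theta0 = 1.0

# One discrete gradient descent step: theta0 * (1 - h).
gd_step = theta0 - h * theta0

# Exact solution of the original gradient flow d(theta)/dt = -theta after time h.
flow = theta0 * np.exp(-h)

# Exact solution of the modified gradient flow:
# grad E~ = theta + (h/2) * theta, so d(theta)/dt = -(1 + h/2) * theta.
mod_flow = theta0 * np.exp(-(1.0 + h / 2.0) * h)

# The discrete step tracks the modified flow more closely than the original flow.
print(abs(gd_step - flow), abs(gd_step - mod_flow))
```

The discrepancy between the discrete step and the original flow is O(h²), while the discrepancy with the modified flow is O(h³), which is why the regularization term in Ẽ captures the leading-order behavior of discrete gradient descent.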

2. THE MODIFIED LOSS LANDSCAPE INDUCED BY GRADIENT DESCENT

The general goal of gradient descent is to find a weight vector θ in parameter space R^m that minimizes a loss E(θ). Gradient descent proceeds by iteratively updating the model weights with learning rate h in the direction of the steepest loss gradient:

θ_{n+1} = θ_n − h ∇_θ E(θ_n).    (1)

Now, even though gradient descent takes steps in the direction of the steepest loss gradient, it does not stay on the exact continuous path of steepest descent, because each iteration steps off that path. Instead, we show that gradient descent follows a path that is closer to the exact continuous path given by θ̇ = −∇_θ Ẽ(θ), along a modified loss Ẽ(θ), which can be calculated analytically using backward error analysis (see Theorem 3.1 and Section 3), yielding:

Ẽ(θ) = E(θ) + λ R_IG(θ),    (2)

where

λ ≡ hm/4,    (3)

and

R_IG(θ) ≡ (1/m) Σ_{i=1}^{m} (∂E(θ)/∂θ_i)².

Immediately, we see that this modified loss is composed of the original training loss E(θ) and an additional term, which we interpret as a regularizer R_IG(θ) with regularization rate λ. We call R_IG(θ) the implicit gradient regularizer because it penalizes regions of the loss landscape that have large gradient values, and because it is implicit in gradient descent, rather than being explicitly added to the loss.

Definition. Implicit gradient regularization is the implicit regularization behavior originating from the use of discrete update steps in gradient descent, as characterized by Equation 2.

We can now make several predictions about IGR, which we explore in experiments:

Prediction 2.1. IGR encourages smaller values of R_IG(θ) relative to the loss E(θ). Given Equation 2 and Theorem 3.1, we expect gradient descent to follow trajectories that have relatively small values of R_IG(θ). Since it is already well known that gradient descent converges by reducing the loss gradient, it is important to note that this prediction concerns the relative size of R_IG(θ) along the trajectory of gradient descent.
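Equations 2 and 3 can be made concrete with a minimal numpy sketch. Here the loss, its coefficients, and the function names (`igr_penalty`, `modified_loss`) are our own illustrative assumptions, not code from the paper; the point is only to show how R_IG(θ) and Ẽ(θ) are computed from the loss gradients:

```python
import numpy as np

# Illustrative anisotropic quadratic loss over m = 2 parameters:
# E(theta) = sum_i a_i * theta_i^2  (a is an assumption for this sketch).
a = np.array([1.0, 10.0])

def loss(theta):
    return np.sum(a * theta**2)

def loss_grad(theta):
    return 2.0 * a * theta

def igr_penalty(theta):
    # R_IG(theta) = (1/m) * sum_i (dE/dtheta_i)^2,
    # the second moment of the loss gradient.
    return np.mean(loss_grad(theta) ** 2)

def modified_loss(theta, h):
    # E~(theta) = E(theta) + lambda * R_IG(theta), with lambda = h * m / 4.
    m = theta.size
    lam = h * m / 4.0
    return loss(theta) + lam * igr_penalty(theta)

theta = np.array([1.0, 1.0])
print(loss(theta), igr_penalty(theta), modified_loss(theta, h=0.01))
```

Note that λ = hm/4 grows with both the learning rate h and the number of parameters m, which is what drives Prediction 2.1 below: larger models and larger learning rates strengthen the implicit penalty on R_IG(θ).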
To expose this phenomenon in experiments, great care must be taken when comparing different gradient descent trajectories. For instance, in our deep learning experiments, we compare models at the iteration time of maximum test accuracy (and we consider other controls in the appendix), which is an important time point for practical applications and is not trivially determined by the speed of learning (Figures 1, 2). Relatedly, since the regularization rate λ is proportional to the learning rate h and network size m (Equation 3), we expect larger models and larger learning rates to encourage smaller values of R_IG(θ) (Figure 2).

Prediction 2.2. IGR encourages the discovery of flatter optima. In Section 3 we will show that R_IG(θ) is proportional to the square of the loss surface slope. Given this and Prediction 2.1, we expect IGR to guide gradient descent along paths with shallower loss surface slopes, thereby encouraging the discovery of flatter, broader optima. Of course, it is possible to construct loss surfaces at odds with this (such as a Mexican-hat loss surface, where all minima are equally flat). However, we will provide experimental support for this prediction using loss surfaces of widespread interest in deep learning, such as MLPs trained on MNIST (Figures 1, 2, 3).
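The intuition behind Prediction 2.2 can be sketched numerically: because R_IG(θ) is the mean squared loss gradient, two points at the *same* loss value carry very different penalties depending on the sharpness of the surrounding surface. The setup below (two 1-D quadratics with different curvatures, our own illustrative assumption) shows that at equal loss, the sharper surface incurs the larger R_IG:

```python
import numpy as np

# Two 1-D loss surfaces E(theta) = a * theta^2 with different curvatures:
# a = 10 (sharp minimum) and a = 1 (flat minimum).
def quad_loss(theta, a):
    return a * theta**2

def quad_grad(theta, a):
    return 2.0 * a * theta

# Pick a point on each surface with the SAME loss value c, then compare
# R_IG = (grad E)^2 (here m = 1, so the mean over parameters is trivial).
c = 2.0
theta_sharp = np.sqrt(c / 10.0)  # quad_loss(theta_sharp, 10) == c
theta_flat = np.sqrt(c / 1.0)    # quad_loss(theta_flat, 1) == c

r_sharp = quad_grad(theta_sharp, 10.0) ** 2
r_flat = quad_grad(theta_flat, 1.0) ** 2

# At equal loss, the sharp surface is penalized far more than the flat one,
# so a trajectory minimizing E + lambda * R_IG is biased toward flat regions.
print(r_sharp, r_flat)
```

This is only a static comparison of the penalty, not a full optimization run, but it captures why minimizing the modified loss biases gradient descent toward broad, flat basins rather than sharp ones.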

