UNDERSTANDING GRADIENT REGULARIZATION IN DEEP LEARNING: EFFICIENT FINITE-DIFFERENCE COMPUTATION AND IMPLICIT BIAS

Abstract

Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. Although some studies have reported that GR improves generalization performance in deep learning, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost for GR. In addition, this computation empirically achieves better generalization performance. Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias in a certain problem. In particular, learning with the finite-difference GR chooses better minima as the ascent step size becomes larger. Finally, we demonstrate that finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima: sharpness-aware minimization and the flooding method. We reveal that flooding performs finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR in both practice and theory.

1. INTRODUCTION

Explicit or implicit regularization is a key component for achieving better performance in deep learning. For instance, adding some regularization on the local sharpness of the loss surface is one common approach to enable the trained model to achieve better performance (Hochreiter & Schmidhuber, 1997; Foret et al., 2021; Jastrzebski et al., 2021) . In the related literature, some recent studies have empirically reported that gradient regularization (GR), i.e., adding penalty of the gradient norm to the original loss, makes the training dynamics reach flat minima and leads to better generalization performance (Barrett & Dherin, 2021; Smith et al., 2021; Zhao et al., 2022) . Using only the information of the first-order gradient seems a simple and computationally friendly idea. Because the first-order gradient is used to optimize the original loss, using its norm is seemingly easier to use than other sharpness penalties based on second-order information such as the Hessian and Fisher information (Hochreiter & Schmidhuber, 1997; Jastrzebski et al., 2021) . Despite its simplicity, our understanding of GR has been limited so far in the following ways. First, we need to consider the fact that GR must compute the gradient of the gradient with respect to the parameter. This type of computation has been investigated in a slightly different context: input-Jacobian regularization, that is, penalizing the gradient with respect to the input dimension to increase robustness against input noise (Drucker & Le Cun, 1992; Hoffman et al., 2019) . Some studies proposed the use of double backpropagation (DB) as an efficient algorithm for computing the gradient of the gradient for input-Jacobian regularization, whereas others proposed the use of finite-difference computation (Peebles et al., 2020; Finlay & Oberman, 2021) . Second, theoretical understanding of GR has been limited. Although empirical studies have confirmed that the GR causes the gradient dynamics to eventually converge to better minima with higher performance, the previous work provides no concrete theoretical evaluation for this result. Third, it remains unclear whether the GR has any potential connection to other regularization methods. Because the finite difference is composed of both gradient ascent and descent steps by definition, we are reminded of some learning algorithms for exploring flat minima such as sharpness-aware minimization (SAM) (Foret et al., 

