UNDERSTANDING GRADIENT REGULARIZATION IN DEEP LEARNING: EFFICIENT FINITE-DIFFERENCE COMPUTATION AND IMPLICIT BIAS

Abstract

Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. Although some studies have reported that GR improves generalization performance in deep learning, little attention has been paid to it from an algorithmic perspective, that is, to algorithms that compute GR efficiently while improving performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. In addition, this computation empirically achieves better generalization performance. Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias in a certain problem. In particular, learning with finite-difference GR chooses better minima as the ascent step size becomes larger. Finally, we demonstrate that finite-difference GR is closely related to other algorithms based on iterative ascent and descent steps for exploring flat minima: sharpness-aware minimization and the flooding method. We reveal that flooding performs finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR in both practice and theory.

1. INTRODUCTION

Explicit or implicit regularization is a key component for achieving better performance in deep learning. For instance, adding a regularization term on the local sharpness of the loss surface is one common approach to help the trained model achieve better performance (Hochreiter & Schmidhuber, 1997; Foret et al., 2021; Jastrzebski et al., 2021). In the related literature, some recent studies have empirically reported that gradient regularization (GR), i.e., adding a penalty on the gradient norm to the original loss, makes the training dynamics reach flat minima and leads to better generalization performance (Barrett & Dherin, 2021; Smith et al., 2021; Zhao et al., 2022). Using only first-order gradient information is a simple and computationally friendly idea: because the first-order gradient is already computed to optimize the original loss, penalizing its norm is seemingly easier than applying other sharpness penalties based on second-order information such as the Hessian and Fisher information (Hochreiter & Schmidhuber, 1997; Jastrzebski et al., 2021). Despite its simplicity, our understanding of GR has been limited so far in the following ways. First, we need to consider the fact that GR must compute the gradient of the gradient with respect to the parameters. This type of computation has been investigated in a slightly different context: input-Jacobian regularization, that is, penalizing the gradient with respect to the input dimensions to increase robustness against input noise (Drucker & Le Cun, 1992; Hoffman et al., 2019). Some studies proposed double backpropagation (DB) as an efficient algorithm for computing the gradient of the gradient for input-Jacobian regularization, whereas others proposed finite-difference computation (Peebles et al., 2020; Finlay & Oberman, 2021). Second, theoretical understanding of GR has been limited.
Although empirical studies have confirmed that GR causes the gradient dynamics to eventually converge to better minima with higher performance, previous work provides no concrete theoretical evaluation of this result. Third, it remains unclear whether GR has any potential connection to other regularization methods. Because the finite difference is composed of both gradient ascent and descent steps by definition, it is reminiscent of learning algorithms for exploring flat minima, such as sharpness-aware minimization (SAM) (Foret et al., 2021) and the flooding method (Ishida et al., 2020), which are also composed of ascent and descent steps. Clarifying these points would deepen our understanding of efficient regularization methods for deep learning. In this work, we reveal that GR works efficiently with a finite-difference computation. This approach has a lower computational cost and, surprisingly, achieves better generalization performance than other computation methods. We present three main contributions that deepen our understanding of GR:

• We demonstrate several advantages of the finite-difference computation. We give a brief estimation of the computational costs of finite difference and DB in a deep neural network and show that the finite difference is more efficient than DB (Section 3.2). We find that a so-called forward finite difference leads to better generalization than a backward one and DB (Section 3.3). Learning with forward finite-difference GR requires two gradients of the loss function, one for a gradient ascent step and one for descent. A relatively large ascent step improves generalization.

• We give a theoretical analysis of the performance improvement obtained by GR. We analyze the selection of global minima in a diagonal linear network (DLN), which is a theoretically solvable model. We prove that GR has an implicit bias toward selecting desirable solutions in the so-called rich regime (Woodworth et al., 2020), which would potentially lead to better generalization (Section 4.2). This implicit bias is strengthened when we use forward finite-difference GR with an increasing ascent step size. In contrast, it is weakened for a backward finite difference, i.e., a negative ascent step.

• Finite-difference GR is also closely related to other learning methods composed of both gradient ascent and descent steps, that is, SAM and the flooding method. In particular, we reveal that the flooding method performs finite-difference GR in an implicit way (Section 5.2).

Thus, this work gives a comprehensive perspective on GR for both practical and theoretical understanding.
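The forward finite-difference computation of the GR gradient can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: it uses a toy quadratic loss with an analytic gradient, and the names `grad_loss` and `gr_gradient_fd` as well as the step sizes `gamma` and `alpha` are hypothetical choices.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, so grad L(theta) = A @ theta
# and the Hessian is A. (Hypothetical example for illustration only.)
A = np.array([[3.0, 0.5], [0.5, 1.0]])

def grad_loss(theta):
    return A @ theta

def gr_gradient_fd(theta, gamma=0.1, alpha=1e-2):
    """Forward finite-difference gradient of the GR objective
    L(theta) + (gamma/2) * ||grad L(theta)||^2.

    The exact gradient is grad L + gamma * (Hessian @ grad L); the
    Hessian-gradient product is approximated with one extra gradient
    evaluated after a gradient *ascent* step of size alpha."""
    g = grad_loss(theta)                     # descent gradient at theta
    g_ascent = grad_loss(theta + alpha * g)  # gradient after the ascent step
    return g + gamma * (g_ascent - g) / alpha

theta = np.array([1.0, -2.0])
approx = gr_gradient_fd(theta)
# For this quadratic loss the exact GR gradient is available in closed form:
exact = grad_loss(theta) + 0.1 * (A @ (A @ theta))
```

Only two first-order gradient evaluations are needed per update, which is the source of the cost advantage over double backpropagation; a backward finite difference corresponds to a negative `alpha`.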

2. RELATED WORK

Barrett & Dherin (2021) and Smith et al. (2021) investigated explicit and implicit GR in deep learning. They found that the discrete-time update of the usual gradient descent implicitly regularizes the gradient norm when its dynamics are mapped to the continuous-time counterpart. This is referred to as implicit GR. They also investigated explicit GR, i.e., adding a GR term explicitly to the original loss, and reported that it improved generalization performance even further. Jia & Su (2020) also empirically confirmed that explicit GR improved generalization. Barrett & Dherin (2021) characterized GR as the slope of the loss surface and showed that a low GR (gentle slope) prefers flat regions of the surface. Recently, Zhao et al. (2022) independently proposed a similar but distinct gradient norm regularization, that is, explicitly adding a non-squared L2 norm of the gradient to the original loss. They used a forward finite-difference computation, but its superiority over other computation methods remains unconfirmed. The implementation of GR has not been discussed in much detail in the literature. In general, there are two well-known methods for computing the gradient of the gradient: DB and finite difference. Some previous studies applied DB to the regularization of an information matrix (Jastrzebski et al., 2021) and to input-Jacobian regularization, i.e., adding the L2 norm of the derivative with respect to the input dimensions (Drucker & Le Cun, 1992; Hoffman et al., 2019). Others have used finite-difference computation for Hessian regularization (Peebles et al., 2020) and input-Jacobian regularization (Finlay & Oberman, 2021). Here, we apply finite-difference computation to GR and present evidence that it outperforms DB in terms of both computational cost and generalization performance.
In Section 4, we give a theoretical analysis of learning with GR in diagonal linear networks (DLNs) (Woodworth et al., 2020). The characteristic property of this solvable model is that we can evaluate the implicit bias of learning algorithms (Nacson et al., 2022; Pesme et al., 2021). Our analysis includes the analysis of SAM in DLNs as a special case (Andriushchenko & Flammarion, 2022). In contrast to previous work, we evaluate an additional lower-order term, which enables us to show that forward finite-difference GR selects global minima in the so-called rich regime.

