UNDERSTANDING GRADIENT REGULARIZATION IN DEEP LEARNING: EFFICIENT FINITE-DIFFERENCE COMPUTATION AND IMPLICIT BIAS

Abstract

Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. Although some studies have reported that GR improves generalization performance in deep learning, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost for GR. In addition, this computation empirically achieves better generalization performance. Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias in a certain problem. In particular, learning with the finite-difference GR chooses better minima as the ascent step size becomes larger. Finally, we demonstrate that finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima: sharpness-aware minimization and the flooding method. We reveal that flooding performs finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR in both practice and theory.

1. INTRODUCTION

Explicit or implicit regularization is a key component for achieving better performance in deep learning. For instance, adding some regularization on the local sharpness of the loss surface is one common approach to enable the trained model to achieve better performance (Hochreiter & Schmidhuber, 1997; Foret et al., 2021; Jastrzebski et al., 2021) . In the related literature, some recent studies have empirically reported that gradient regularization (GR), i.e., adding penalty of the gradient norm to the original loss, makes the training dynamics reach flat minima and leads to better generalization performance (Barrett & Dherin, 2021; Smith et al., 2021; Zhao et al., 2022) . Using only the information of the first-order gradient seems a simple and computationally friendly idea. Because the first-order gradient is used to optimize the original loss, using its norm is seemingly easier to use than other sharpness penalties based on second-order information such as the Hessian and Fisher information (Hochreiter & Schmidhuber, 1997; Jastrzebski et al., 2021) . Despite its simplicity, our understanding of GR has been limited so far in the following ways. First, we need to consider the fact that GR must compute the gradient of the gradient with respect to the parameter. This type of computation has been investigated in a slightly different context: input-Jacobian regularization, that is, penalizing the gradient with respect to the input dimension to increase robustness against input noise (Drucker & Le Cun, 1992; Hoffman et al., 2019) . Some studies proposed the use of double backpropagation (DB) as an efficient algorithm for computing the gradient of the gradient for input-Jacobian regularization, whereas others proposed the use of finite-difference computation (Peebles et al., 2020; Finlay & Oberman, 2021) . Second, theoretical understanding of GR has been limited. 
Although empirical studies have confirmed that GR causes the gradient dynamics to eventually converge to better minima with higher performance, previous work provides no concrete theoretical evaluation of this result. Third, it remains unclear whether GR has any potential connection to other regularization methods. Because the finite difference is composed of both gradient ascent and descent steps by definition, we are reminded of some learning algorithms for exploring flat minima such as sharpness-aware minimization (SAM) (Foret et al., 2021) and the flooding method (Ishida et al., 2020), which are also composed of ascent and descent steps. Clarifying these points would help to deepen our understanding of efficient regularization methods for deep learning. In this work, we reveal that GR works efficiently with a finite-difference computation. This approach has a lower computational cost and, surprisingly, achieves better generalization performance than the other computation methods. We present three main contributions to deepen our understanding of GR:
• We demonstrate some advantages of using the finite-difference computation. We give a brief estimation of the computational costs of finite difference and DB in a deep neural network and show that the finite difference is more efficient than DB (Section 3.2). We find that a so-called forward finite difference leads to better generalization than a backward one and DB (Section 3.3). Learning with forward finite-difference GR requires two gradient computations of the loss function, one for the gradient ascent step and one for the descent step. A relatively large ascent step improves the generalization.
• We give a theoretical analysis of the performance improvement obtained by GR. We analyze the selection of global minima in a diagonal linear network (DLN), which is a theoretically solvable model. We prove that GR has an implicit bias for selecting desirable solutions in the so-called rich regime (Woodworth et al., 2020), which would potentially lead to better generalization (Section 4.2). This implicit bias is strengthened when we use forward finite-difference GR with an increasing ascent step size. In contrast, it is weakened for a backward finite difference, i.e., a negative ascent step.
• Finite-difference GR is also closely related to other learning methods composed of both gradient ascent and descent, that is, SAM and the flooding method. In particular, we reveal that the flooding method performs finite-difference GR in an implicit way (Section 5.2).
Thus, this work gives a comprehensive perspective on GR for both practical and theoretical understanding.

2. RELATED WORK

Barrett & Dherin (2021) and Smith et al. ( 2021) investigated explicit and implicit GR in deep learning. They found that the discrete-time update of the usual gradient descent implicitly regularizes the gradient norm when its dynamics are mapped to the continual-time counterpart. This is referred to as implicit GR. They also investigated explicit GR, i.e., adding a GR term explicitly to the original loss, and reported that it improved generalization performance even further. Jia & Su (2020) also empirically confirmed that the explicit GR gave the improvement of generalization. Barrett & Dherin (2021) characterized GR as the slope of the loss surface and showed that a low GR (gentle slope) prefers flat regions of the surface. Recently, Zhao et al. (2022) independently proposed a similar but different gradient norm regularization, that is, explicitly adding a non-squared L2 norm of the gradient to the original loss. They used a forward finite-difference computation, but its superiority to other computation methods remains unconfirmed. The implementation of GR has not been discussed in much detail in the literature. In general, to compute the gradient of the gradient, there are two well-known computational methods: DB and finite difference. Some previous studies applied DB to the regularization of an information matrix (Jastrzebski et al., 2021) and input-Jacobian regularization, i.e., adding the L2 norm of the derivative with respect to the input dimension (Drucker & Le Cun, 1992; Hoffman et al., 2019) . Others have used the finite-difference computation for Hessian regularization (Peebles et al., 2020) and input-Jacobian regularization (Finlay & Oberman, 2021) . Here, we apply the finite-difference computation to GR and present some evidence that the finite-difference computation outperforms DB computation with respect to computational costs and generalization performance. 
In Section 4, we give a theoretical analysis of learning with GR in diagonal linear networks (DLNs) (Woodworth et al., 2020) . The characteristic property of this solvable model is that we can evaluate the implicit bias of learning algorithms (Nacson et al., 2022; Pesme et al., 2021) . Our analysis includes the analysis of SAM in DLN as a special case (Andriushchenko & Flammarion, 2022) . In contrast to previous work, we evaluate another lower-order term, and this enables us to show that forward finite-difference GR selects global minima in the so-called rich regime.

3. GRADIENT REGULARIZATION

We consider GR (Barrett & Dherin, 2021; Smith et al., 2021), wherein the squared L2 norm of the gradient is explicitly added to the original loss $L(\theta)$ as follows:

$$\tilde{L}(\theta) = L(\theta) + \frac{\gamma}{2} R(\theta), \qquad R(\theta) = \|\nabla L(\theta)\|^2, \tag{1}$$

where $\|\cdot\|$ denotes the Euclidean norm and $\gamma > 0$ is a constant regularization coefficient. We abbreviate the derivative with respect to the parameters $\nabla_\theta$ by $\nabla$. Its gradient descent is given by $\theta_{t+1} = \theta_t - \eta \nabla \tilde{L}(\theta_t)$ for time step $t = 0, 1, \ldots$ and learning rate $\eta > 0$. While previous studies have reported that explicitly adding a GR term empirically improves generalization performance, its algorithms and implementations have not been discussed in much detail.

3.1. ALGORITHMS

To optimize the loss function with GR (1) using a gradient method, we need to compute the gradient of the gradient, i.e., ∇R(θ). As is well studied in input-Jacobian regularization (Drucker & Le Cun, 1992; Hoffman et al., 2019; Finlay & Oberman, 2021) , there are two main approaches to computing the gradient of the gradient.

Finite difference:

The finite-difference method approximates a derivative by a finite step. In the case of GR, we have

$$\frac{\nabla R(\theta_t)}{2} = \frac{\nabla L(\theta') - \nabla L(\theta_t)}{\varepsilon} + O(\varepsilon), \qquad \theta' = \theta_t + \varepsilon \nabla L(\theta_t),$$

for a constant $\varepsilon > 0$. The final term is expressed in Landau notation and is neglected in the computation. We update the GR term by

$$\Delta R_F(\varepsilon) = \frac{\nabla L(\theta_t + \varepsilon \nabla L(\theta_t)) - \nabla L(\theta_t)}{\varepsilon}. \tag{F-GR}$$

We refer to this gradient as forward finite-difference GR (F-GR). Because the gradient $\nabla L(\theta_t)$ is already computed for the original loss, the finite difference (F-GR) requires only one additional gradient computation, $\nabla L(\theta')$. The computation time is therefore only about double that of the usual gradient descent. The finite-difference method also has a backward computation:

$$\Delta R_B(\varepsilon) = \frac{\nabla L(\theta_t) - \nabla L(\theta_t - \varepsilon \nabla L(\theta_t))}{\varepsilon}. \tag{B-GR}$$

If we allow a negative step size, $\Delta R_B$ corresponds to $\Delta R_F$ through $\Delta R_B(\varepsilon) = \Delta R_F(-\varepsilon)$. For a sufficiently small $\varepsilon$, both finite-difference GRs yield the same original gradient $\nabla R(\theta)/2$ if we can neglect any numerical instability caused by the limit. The finite-difference method has been used in the literature for the optimization of neural networks, especially for Hessian-based techniques (Bishop, 2006; Peebles et al., 2020). When we need a more precise $\nabla R$, we can use a higher-order approximation, e.g., the centered finite difference, but this requires additional gradient computations, and hence we focus on the first-order finite difference.
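As an illustration of the two finite-difference rules above, the following minimal sketch evaluates F-GR and B-GR on a two-parameter toy loss and compares them with the exact Hessian-vector product. The toy loss, the helper names (`grad`, `hessian_vector`, `delta_r_forward`, `delta_r_backward`, `gr_step`, `max_error`), and the numerical values are assumptions made for this example and are not taken from the experiments in this paper.

```python
def grad(theta):
    """Gradient of the toy loss L = (t0**4 + t1**4)/4 + t0*t1/2 (hypothetical)."""
    t0, t1 = theta
    return [t0**3 + 0.5 * t1, t1**3 + 0.5 * t0]


def hessian_vector(theta, v):
    """Exact Hessian-vector product of the toy loss, used only as a reference."""
    t0, t1 = theta
    return [3 * t0**2 * v[0] + 0.5 * v[1], 0.5 * v[0] + 3 * t1**2 * v[1]]


def delta_r_forward(theta, eps):
    """F-GR: (grad(theta + eps * grad(theta)) - grad(theta)) / eps."""
    g = grad(theta)
    shifted = [t + eps * gi for t, gi in zip(theta, g)]
    return [(gs - gi) / eps for gs, gi in zip(grad(shifted), g)]


def delta_r_backward(theta, eps):
    """B-GR, obtained from F-GR with a negative step size."""
    return delta_r_forward(theta, -eps)


def gr_step(theta, lr, gamma, eps):
    """One gradient-descent step on L + (gamma / 2) * R using F-GR."""
    g = grad(theta)
    dr = delta_r_forward(theta, eps)
    return [t - lr * (gi + gamma * di) for t, gi, di in zip(theta, g, dr)]


def max_error(theta, eps, rule):
    """Largest entry-wise gap between a finite-difference rule and H * grad."""
    exact = hessian_vector(theta, grad(theta))
    return max(abs(a - b) for a, b in zip(rule(theta, eps), exact))


theta = [1.0, -0.5]
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, max_error(theta, eps, delta_r_forward),
          max_error(theta, eps, delta_r_backward))
```

On this toy loss the gap to the Hessian-vector product shrinks roughly in proportion to the step size, in line with the $O(\varepsilon)$ term; a real network would replace `grad` with backpropagation.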

Double Backpropagation:

The other approach is to apply automatic differentiation directly to the GR term, i.e., $\nabla R$. For example, its PyTorch implementation is quite straightforward, as shown in Section C.1 of the Appendices. This approach is referred to as DB, which was originally developed for input-Jacobian regularization (Drucker & Le Cun, 1992). We explain more details on the DB computation and its computational graph in Section 3.2. DB, in effect, corresponds to computing the following Hessian-vector product:

$$\Delta R_{DB} = H(\theta_t) \nabla L(\theta_t),$$

where $H(\theta) = \nabla\nabla L(\theta)$. The following equation may give us an intuition about the difference between the finite-difference and DB alternatives. From the mean value theorem, F-GR is equivalent to

$$\Delta R_F(\varepsilon) = \frac{1}{\varepsilon} \int_0^{\varepsilon} ds \, H(\theta_t + s \nabla L(\theta_t)) \nabla L(\theta_t).$$

We can interpret the finite difference as taking an average of the curvature (Hessian) along the line of the gradient update. For $\varepsilon \to 0$, this reduces to $\Delta R_{DB}$. Note that the difference among these algorithms appears only in non-linear models. For a naive linear model $X\theta$, the squared error loss has a constant Hessian $XX^\top$. Therefore, all of the $\Delta R$ variants yield the same update. We analyze a simple network model with non-linearity on the parameters in Section 4 and reveal the difference in implicit biases.
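The remark that all variants coincide for a constant Hessian can be checked with a short sketch. The quadratic loss below, with a hypothetical symmetric matrix `A` standing in for the constant Hessian and a hypothetical vector `b`, is an assumption chosen for illustration only.

```python
A = [[2.0, 0.5], [0.5, 1.0]]  # constant Hessian (hypothetical values)
b = [1.0, -1.0]               # linear term (hypothetical values)


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def grad_quadratic(theta):
    """Gradient of L = 0.5 * theta^T A theta - b^T theta."""
    return [a - bi for a, bi in zip(matvec(A, theta), b)]


def delta_r_fd(theta, eps):
    """Finite-difference GR term; eps > 0 gives F-GR, eps < 0 gives B-GR."""
    g = grad_quadratic(theta)
    shifted = [t + eps * gi for t, gi in zip(theta, g)]
    return [(gs - gi) / eps for gs, gi in zip(grad_quadratic(shifted), g)]


def delta_r_db(theta):
    """DB term H * grad, available in closed form for the quadratic loss."""
    return matvec(A, grad_quadratic(theta))


theta = [0.3, -0.7]
for eps in (0.5, -0.5, 1e-2):
    gap = max(abs(x - y) for x, y in zip(delta_r_fd(theta, eps), delta_r_db(theta)))
    print(eps, gap)
```

The printed gaps are at the level of floating-point rounding for every step size, so any difference between F-GR, B-GR, and DB must come from the non-linearity of the model.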

3.2. COMPUTATIONAL COST

We clarify the computational efficiency of each algorithm of GR in deep networks. First, we give a rough estimation of the computational cost by counting the number of matrix multiplications required to compute $\nabla \tilde{L}$. Consider an $L$-layer fully connected neural network with a linear output layer: $A_l = \phi(U_l)$, $U_l = W_l A_{l-1}$ for $l = 1, \ldots, L$. Note that $A_l$ denotes a batch of activations and $W_l A_{l-1}$ requires a matrix multiplication. We denote the element-wise activation function as $\phi(\cdot)$ and the weight matrix as $W_l$. For simplicity, we neglect the bias terms. The number of matrix multiplications required to compute $\nabla \tilde{L}$ is given by

$$N_{mul} \sim 6L \ \text{(for F-GR)}, \qquad 9L \ \text{(for DB)}, \tag{7}$$

where $\sim$ hides an uninteresting constant shift independent of the depth. One can evaluate $N_{mul}$ straightforwardly from the computational graph (Figure 2), originally developed for the DB computation of input-Jacobian regularization (Drucker & Le Cun, 1992). In brief, the original gradient $\nabla L$, that is, the backpropagation on the forward pass $\{A_0 \to A_1 \to \cdots \to A_L\}$, requires $3L$ matrix multiplications: $L$ for the forward pass, $L$ for the backward pass $B_l = \phi'(U_l) \circ (W_{l+1}^\top B_{l+1})$, and $L$ for the gradient $G_l := \partial L / \partial W_l = B_l A_{l-1}^\top$ (here $L$ in the derivative denotes the loss, not the depth). Because F-GR is composed of both gradient ascent and descent steps, we eventually need $6L$. In contrast, for learning using the DB of GR, we need $3L$ for $\nabla L$ and an additional $6L$ for the GR term. The GR term requires a forward pass composed of $A_l$, $B_l$, and $G_l$, which are obtained in the gradient computation of $\nabla L$. Note that the upper part $\{A_0 \to A_1 \to \cdots \to B_L \to \cdots \to B_1\}$ is well known as the DB of input-Jacobian regularization. As pointed out in Drucker & Le Cun (1992), the computation of $\nabla B_1$ is equivalent to treating the upper part of the graph as the forward pass and applying backpropagation. It requires $2L$ multiplications. In our GR case, we have an additional $L$ multiplications due to $G_l$.
Because the backward pass doubles the number of required multiplications, we eventually need $2 \times (2L + L) = 6L$ multiplications. Further details are given in Section C.1. The results of numerical experiments shown in Figure 1 confirm the superiority of finite-difference GR in typical experimental settings. We trained deep neural networks using an NVIDIA A100 GPU for this experiment. All experiments were implemented in PyTorch. We summarize the pseudo code and implementation of GR and present the detailed settings of all experiments in Section C. Figure 1(a) shows the wall time required for one epoch of training with stochastic gradient descent (SGD) and the objective function (1). We trained various multi-layer perceptrons (MLPs) and residual neural networks (ResNets) with different depths. The wall time increased almost linearly as the depth increased. The slope of the line is different for F-GR and DB, and F-GR was faster. This observation is consistent with the number of multiplications (7). In particular, in ResNet, one of the most typical deep neural networks, learning with finite-difference GR was more than twice as fast as learning with DB. We also show the convergence measured by the training loss and time steps. All of them showed better convergence for the finite difference. Note that the finite difference is also preferable from the perspective of memory efficiency. This is because DB requires all of the $\{A_l, B_l, G_l\}$ to be retained for the forward pass, which occupies more memory. It is also noteworthy that, in general, it is difficult for theory to completely predict the realistic computational time required because it could heavily depend on the hardware and the implementation framework and does not necessarily correlate well with the number of floating-point operations (FLOPs) (Dehghani et al., 2021). Our result suggests that at least the number of matrix multiplications explains well the superiority of the finite-difference approach in typical settings.

[Figure 2: Computational graph of the DB computation, with the forward pass and the backward pass indicated.]
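The leading-order counts of Section 3.2 can be tallied with a few lines of arithmetic. The function name `matmul_counts` is hypothetical, and the sketch ignores the depth-independent constant shift mentioned in the text.

```python
def matmul_counts(depth):
    """Leading-order matrix-multiplication counts for an L-layer network."""
    # One gradient: L (forward) + L (backward) + L (weight gradients).
    one_gradient = 3 * depth
    # F-GR: one gradient at theta and one at the ascent point theta'.
    fgr = 2 * one_gradient
    # DB: one gradient, plus backpropagation through a forward pass of
    # 2L (A_l and B_l) + L (G_l) multiplications, which doubles that cost.
    db = one_gradient + 2 * (2 * depth + depth)
    return {"F-GR": fgr, "DB": db}


for depth in (4, 18, 34):
    print(depth, matmul_counts(depth))
```

Under this count the ratio of DB to F-GR is 1.5 at any depth; as noted above, measured wall time also depends on hardware and framework and need not follow this ratio.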

3.3. GENERALIZATION PERFORMANCE

Here, we show that the superiority of finite-difference computation over DB also appears in the eventual performance of trained models. Figures 3 and S.2 show the test accuracy of a 4-layer MLP and ResNet-18 trained by using SGD with GR on CIFAR-10. We trained the models in an exhaustive manner with various values for $\gamma$ and $\varepsilon$ for each algorithm of GR. For learning with F-GR, the model achieved the highest accuracy on relatively large ascent steps ($\varepsilon \sim 0.1$). In contrast, learning with B-GR showed a rapid decrease in performance as the step size $\varepsilon$ increased. The highest average test accuracy of F-GR was better than those of B-GR and DB, although our purpose is to confirm the difference among the algorithms and not to achieve higher accuracy by tuning both $\gamma$ and $\varepsilon$. It was (F-GR, B-GR, DB) = (58.6, 58.3, 57.6) ± (0.2, 0.2, 0.2) for MLP and (87.0, 86.2, 86.3) ± (0.2, 0.3, 0.3) for ResNet-18. We also confirmed that the same tendencies appeared in the grid search of ResNet-34 on CIFAR-100 (Figure S.3). Moreover, we confirmed that F-GR performed better than B-GR and DB in a more realistic training of a wide residual network (WRN-28-10) on CIFAR-10 and CIFAR-100 with/without data augmentation (Table S.1). It is noteworthy that the best accuracy of F-GR was obtained close to the line of $\gamma = \varepsilon$. This line is closely related to the SAM algorithm. We explain more details in Section 5.1. We also observed that when the ascent step was too small (e.g., $\varepsilon \lesssim 10^{-4}$), numerical instability sometimes appeared in the calculation of the finite difference $\Delta R$. Overall, the experiments suggest that F-GR with a large ascent step is better to use for achieving higher generalization performance.

4. THEORETICAL ANALYSIS OF IMPLICIT BIAS

Although previous work and our experiments in Section 3.3 indicate improvements of prediction performance caused by GR, theoretical understanding of this phenomenon remains limited. Because the gradient norm itself eventually becomes zero after the model achieves a zero training loss, it seems challenging to distinguish the generalization capacity by simply observing the value of the gradient norm after training. In addition, our experiments clarified that the performance also depends on the choice of the algorithm and revealed that the situation is complicated. To attack this problem, we consider a solvable model and reveal that GR methods actually contribute to the selection of global minima and the eventual performance.

4.1. DIAGONAL LINEAR NETWORK

A DLN is a solvable model proposed by Woodworth et al. (2020). It is a linear transformation of input $x \in \mathbb{R}^d$ defined as $\langle \beta, x \rangle$, where $\beta$ is parameterized by $\beta = w_+^2 - w_-^2$ with $w = (w_+, w_-) \in \mathbb{R}^{2d}$. Here, the square of the vector is an element-wise square operation. Suppose we have $n$ training samples $(x_i, y_i)$ $(i = 1, \ldots, n)$. The training loss is given by

$$L(w) = \frac{1}{4n} \sum_{i=1}^{n} \left( \langle w_+^2 - w_-^2, x_i \rangle - y_i \right)^2.$$

Consider continuous-time training dynamics $dw/dt = -\nabla L$. We set an initialization $w_+(t = 0) = w_-(t = 0) = \alpha_0$, which is a $d$-dimensional vector whose entries are non-zero. We define a data matrix $X$ whose $i$-th row is given by $x_i$. Woodworth et al. (2020) found that interpolation solutions of usual gradient descent are given by

$$\beta^\infty(\alpha) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d \ \text{s.t.} \ X\beta = y} \phi_\alpha(\beta)$$

with $\alpha = \alpha_0$. The potential function $\phi_\alpha$ is given by $\phi_\alpha(\beta) = \sum_{i=1}^{d} \alpha_i^2 \, q(\beta_i / \alpha_i^2)$ with $q(z) = 2 - \sqrt{4 + z^2} + z \, \mathrm{arcsinh}(z/2)$. For a larger scale of initialization $\alpha$, this potential function becomes closer to L2 regularization as $\alpha_i^2 q(\beta_i / \alpha_i^2) \sim |\beta_i|^2$, which corresponds to the L2 min-norm solution of the lazy regime (Chizat et al., 2019). In contrast, for a smaller scale of initialization $\alpha$, it becomes closer to L1 regularization as $\alpha_i^2 q(\beta_i / \alpha_i^2) \sim |\beta_i|$. In this way, we can observe a one-parameter interpolation between L1 and L2 implicit biases. Deep neural networks in practice acquire rich features depending on the data structure and are believed to be beyond the lazy regime. Thus, obtaining an L1 solution by setting a small $\alpha$ is referred to as the rich regime and is desirable. Previous work has revealed that the effective value of $\alpha$ is decreased by a larger learning rate in the discrete update (Nacson et al., 2022), by SGD (Pesme et al., 2021), and by the SAM update (Andriushchenko & Flammarion, 2022). These learning methods have an implicit bias that chooses the L1 sparse solution in the rich regime.
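The interpolation between L2-like and L1-like behavior of the potential can be seen numerically. The following sketch evaluates $\phi_\alpha$ for a single coordinate and reports how it scales when $\beta$ is doubled: a factor near 4 indicates quadratic (L2-like) growth, and a factor near 2 indicates linear (L1-like) growth up to a logarithmic correction. The function names and the chosen scales of $\alpha$ are assumptions for illustration.

```python
import math


def q(z):
    """q(z) = 2 - sqrt(4 + z^2) + z * arcsinh(z / 2)."""
    return 2.0 - math.sqrt(4.0 + z * z) + z * math.asinh(z / 2.0)


def potential(beta, alpha):
    """phi_alpha(beta) = sum_i alpha_i^2 * q(beta_i / alpha_i^2)."""
    return sum(a * a * q(b / (a * a)) for b, a in zip(beta, alpha))


def doubling_ratio(alpha_scale):
    """phi(2 * beta) / phi(beta) for the one-dimensional case beta = (1,)."""
    alpha = [alpha_scale]
    return potential([2.0], alpha) / potential([1.0], alpha)


for alpha_scale in (10.0, 1.0, 1e-3, 1e-6):
    print(alpha_scale, doubling_ratio(alpha_scale))
```

The ratio moves from about 4 at a large initialization scale toward 2 as the scale shrinks, which mirrors the transition from the lazy regime to the rich regime described above.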

4.2. IMPLICIT BIAS OF GR

Now, consider gradient descent with F-GR: $dw/dt = -\nabla L(w) - \gamma \Delta R_F(w)$. We find that GR has an implicit bias toward the rich regime, and moreover, the strength of the bias depends on the ascent step size.

Theorem 4.1. Assume that (i) the gradient dynamics converges to the interpolation solution satisfying $X\beta = y$, (ii) the L2 norm of the parameter $\|w(t)\|$ has a constant upper bound independent of $\gamma$ and $\varepsilon$, (iii) for sufficiently small $\gamma$ and $\varepsilon$, the integral of the training loss, i.e., $\int_0^\infty L(w(t)) \, dt$, has a constant upper (lower, respectively) bound $\bar{R}$ ($\underline{R}$) independent of $\gamma$ and $\varepsilon$. Then, for sufficiently small $\gamma$ and $\varepsilon$, interpolation solutions are given by $\beta^\infty(\alpha_{F\text{-}GR})$ with

$$\alpha_{F\text{-}GR} \le \alpha_0 \circ \exp\left(-\gamma \varepsilon c^* + O(\gamma^2) + O(\varepsilon^2)\right).$$

The exponent $c^* \in \mathbb{R}^d$ is a non-negative constant vector given by

$$c^* = \frac{1}{2n^2} \left( X^\top (X\beta(t = 0) - y) \right)^2.$$

Note that the inequality is element-wise. The proof is given in Section A.1. Technically speaking, the analysis of learning with F-GR requires evaluating a novel $c^*$ term, which has not appeared in the analyses of […] (2022) recently reported that we can obtain interpolation solutions with a smaller parameter norm $\|w(t)\|$ using the discrete update with a larger learning rate. Because the interpolation solutions of gradient descent are also those of our learning with GR, assumption (ii) seems rational. The upper bound of assumption (iii) means that the convergence speed of $L(w(t))$ does not get too small for sufficiently small $\gamma$ and $\varepsilon$. As a side note, we can replace assumption (iii) with the positive definiteness of a certain matrix (assumption A.2). This is seemingly rather technical, but it is related to a sufficient condition for the dynamics to converge to the global minima. See Section A.2 for details. This theorem reveals that GR has an implicit bias to select the L1 solution, that is, the rich regime, because $\alpha$ is always smaller than $\alpha_0$. As the ascent step increases, we have an exponentially smaller upper bound and the implicit bias toward the L1 solution becomes stronger.
We confirm this dependence of solutions on the ascent step in numerical experiments (Figure 4). As in previous work, we trained DLNs on the synthetic data of a sparse regression problem, where $x_i \sim N(\mu \mathbf{1}, \sigma^2 I)$ and $y_i \sim N(\langle \beta^*, x_i \rangle, 0.01)$. The dashed lines show the results of gradient descent without GR. This result is consistent with our experiments in more realistic settings (Figure 3), where a relatively large $\varepsilon$ achieves the best performance. In Figure 4(c), we also present the largest eigenvalue of the Hessian (S.50), computed after training. As the ascent step size increases, F-GR chooses flatter minima. This is also consistent with empirical observations in previous studies on GR. Note that B-GR can potentially make the bound looser as the step size $\varepsilon$ increases since B-GR is equivalent to F-GR with $-\varepsilon$. Actually, we can immediately find a lower bound $\alpha_{B\text{-}GR} \gtrsim C \circ \exp(\gamma \varepsilon c^*)$ for a positive constant vector $C$, as is remarked in Section A.1. The results of numerical experiments on DLNs shown in Figure S.3 confirm that learning with F-GR achieved better generalization performance than B-GR. While Theorem 4.1 gives us insight into the finite-difference GR, the upper bound converges to $\alpha_0$ for the DB limit ($\varepsilon \to 0^+$) and becomes meaningless. Fortunately, we can construct an upper bound applicable to the DB limit.

Proposition 4.2. Under the same assumptions as in Theorem 4.1, for sufficiently small $\varepsilon$ and $\gamma$,

$$\alpha_{F\text{-}GR} \le \alpha_0 \circ \exp\left(-\gamma c + O(\gamma^2) + O(\varepsilon^2)\right), \tag{12}$$

where the exponent $c \in \mathbb{R}^d$ is a non-negative variable given by $c = n^{-2} \int_0^\infty \left( X^\top (X\beta(s) - y) \right)^2 ds$.

Its derivation is given in Section A.3. One can regard this proposition as a minor extension of Theorem 1 in Andriushchenko & Flammarion (2022), which investigated $\gamma = \varepsilon$. This setting has a special meaning, as we mention in Section 5.1. From the proposition, one can see that the DB limit still has the implicit bias to select the rich regime.
This is consistent with the numerical experiments in Figure 4, where the limit of small $\varepsilon$ achieves slightly better and sparser solutions than GD without GR. Although the bound (12) is informative, it is difficult to evaluate a concrete value of $c$. As a side note, we can bound the average over entries, that is, $\sum_{i=1}^{d} c_i / d \ge (4n/d) \lambda_{\min}(XX^\top) \underline{R}$, where $\underline{R}$ is the lower bound in assumption (iii). See Section A.3 for details.

5. GR IN GRADIENT-ASCENT-AND-DESCENT LEARNING

We have revealed that learning with finite-difference GR, F-GR in particular, improves performance. We recall that the GR is composed of both gradient ascent and descent steps. This computation makes the GR essentially related to two other learning methods similarly composed of both gradient ascent and descent steps: the SAM algorithm and the flooding method.

5.1. CONNECTION WITH SAM

The SAM algorithm was derived from the minimization of a surrogate loss $\max_{\|\varepsilon\| \le \rho} L(\theta + \varepsilon)$ for a fixed $\rho > 0$, and has achieved the highest performance in various models (Foret et al., 2021). After some heuristic approximations, its update rule reduces to iterative gradient ascent and descent steps: $\theta_{t+1} = \theta_t - \eta \nabla L(\theta')$ with $\theta' = \theta_t + \varepsilon_t \nabla L(\theta_t)$ and $\varepsilon_t = \rho / \|\nabla L(\theta_t)\|$. Under a specific condition, the SAM update can be seen as gradient descent with F-GR. Let us consider a time-dependent regularization coefficient $\gamma_t$ and ascent step $\varepsilon_t$. Then, for $\gamma_t = \varepsilon_t$, the gradient descent with F-GR becomes equivalent to the SAM update:

$$\nabla L(\theta) + \frac{\gamma_t}{\varepsilon_t} \left( \nabla L(\theta') - \nabla L(\theta) \right) = \nabla L(\theta').$$

A similar equivalence has been pointed out in Zhao et al. (2022), which supposes a non-squared gradient norm, in which case $\varepsilon_t = \rho / \|\nabla L(\theta_t)\|$ naturally appears. Let us focus on the SAM update without the gradient normalization for simplicity, that is, $\varepsilon_t = \rho$. This simplified SAM update was analyzed on DLNs in Andriushchenko & Flammarion (2022). We can recover the SAM case by setting a sufficiently small $\gamma = \varepsilon$ in Proposition 4.2. Although it would be interesting to identify an optimal setting of $(\gamma, \varepsilon)$, our analysis is limited to the range of the first-order Taylor expansion, and characterizing any optimal setting seems beyond its scope. In Figure 3, we empirically observed that the optimal setting for generalization was very close to or just on the line $\gamma = \varepsilon$. In contrast, our Figures 4 and S.3 and the previous study of Zhao et al. (2022) demonstrated that the optimal setting was not necessarily on $\gamma = \varepsilon$, and thus combining the ascent and descent steps would still be promising.
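The equivalence between the F-GR descent direction with $\gamma_t = \varepsilon_t$ and the SAM direction can be checked numerically. The sketch below uses a hypothetical two-parameter toy loss and hypothetical helper names; it is an illustration of the identity above, not the implementation used in our experiments.

```python
import math


def grad(theta):
    """Gradient of the toy loss L = (t0**4 + t1**4)/4 + t0*t1/2 (hypothetical)."""
    t0, t1 = theta
    return [t0**3 + 0.5 * t1, t1**3 + 0.5 * t0]


def sam_direction(theta, eps):
    """SAM descent direction: the gradient at the ascent point theta'."""
    g = grad(theta)
    return grad([t + eps * gi for t, gi in zip(theta, g)])


def fgr_direction(theta, gamma, eps):
    """F-GR descent direction: grad L + gamma * Delta R_F(eps)."""
    g = grad(theta)
    g_shift = grad([t + eps * gi for t, gi in zip(theta, g)])
    return [gi + gamma * (gs - gi) / eps for gi, gs in zip(g, g_shift)]


theta = [1.0, -0.5]
rho = 0.05
# Normalized ascent step eps_t = rho / ||grad L(theta_t)||.
eps_t = rho / math.sqrt(sum(gi * gi for gi in grad(theta)))

same = fgr_direction(theta, eps_t, eps_t)       # gamma_t = eps_t
other = fgr_direction(theta, 2 * eps_t, eps_t)  # gamma_t != eps_t
print(sam_direction(theta, eps_t), same, other)
```

With $\gamma_t = \varepsilon_t$ the two directions agree up to rounding, whereas any other ratio $\gamma_t / \varepsilon_t$ gives a different mixture of the gradients at $\theta$ and $\theta'$.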

5.2. FLOODING PERFORMS GR IN AN IMPLICIT WAY

The flooding method (Ishida et al., 2020) is another learning algorithm composed of both gradient ascent and descent steps. Its update rule is given by $\theta_{t+1} = \theta_t - \eta \, \mathrm{Sign}(L - b) \nabla L$ for a constant $b > 0$, referred to as the flood level. When the training loss becomes lower than the flood level, the sign of the gradient is flipped and the parameter is updated by gradient ascent. Therefore, flooding causes the training dynamics to continue to wander around $L(\theta) \sim b$, and its gradient continues to take a non-zero value. This would seem to be a kind of early stopping, but previous work empirically demonstrates that flooding performs better than naive early stopping and finds flat minima. For simplicity, let us focus on the gradient descent for a full batch. The following theorem clarifies a hidden mechanism of flooding.

Theorem 5.1. Consider the time step $t$ satisfying $L(\theta_t) < b$ and $L(\theta_{t+1}) > b$. Then, the flooding update from $\theta_t$ to $\theta_{t+2}$ is equivalent to the gradient of the F-GR with $\varepsilon = \gamma = \eta$:

$$\theta_{t+2} = \theta_t - \eta^2 \, \frac{\nabla L(\theta_t + \eta \nabla L(\theta_t)) - \nabla L(\theta_t)}{\eta}.$$

Similarly, for $L(\theta_t) > b$ and $L(\theta_{t+1}) < b$, the flooding update is equivalent to the gradient of the B-GR.

Although its derivation is quite straightforward (see Section B), this essential connection between finite-difference GR and flooding has been missed in the literature. Ishida et al. (2020) conjectured that flooding causes a random walk on the loss surface and that this would contribute to the search for flat minima in some ways. Our result implies that the dynamics of flooding are not necessarily random and that it can actively search the loss surface in a direction that decreases the GR. This is consistent with the observations that the usual gradient descent with GR finds flat minima (Barrett & Dherin, 2021; Zhao et al., 2022). Note that the ascent step is given by the learning rate $\eta$, and $\eta$ is usually decayed during training.
This implies that because the ascent step size is relatively small, the implicit B-GR in the flooding update would not make the generalization performance much worse. Figure 5 empirically confirms that the flooding method decreases the gradient norm $R(\theta)$. We trained ResNet-18 on CIFAR-10 by using flooding. Figure 5(a) shows that at the beginning of the training, the training loss decreases in the usual way because the loss is far above the flood level $b$. Around the 10th epoch, the loss value becomes sufficiently close to the flood level for the decrease in the loss to slow (Figure S.5). Then, the flooding update becomes dominant in the dynamics, and the gradient norm begins to decrease. Figure 5(b) demonstrates that the gradient norm of the trained model decreases as the initial learning rate increases. This is consistent with Theorem 5.1 because the theorem claims that a larger learning rate induces a larger regularization coefficient of the GR, $\gamma = \eta$. In contrast, naive SGD training without flooding always reaches an almost zero gradient norm regardless of the learning rate. Thus, the change in the gradient norm depending on the learning rate is specific to flooding and implies that it implicitly performs GR.
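The statement of Theorem 5.1 can be verified on a small example: one ascent step followed by one descent step of flooding coincides with a single F-GR update with $\varepsilon = \gamma = \eta$. The toy loss, the flood level, the learning rate, and the helper names below are assumptions chosen so that $L(\theta_t) < b < L(\theta_{t+1})$ holds; they are not taken from the experiments in Figure 5.

```python
def loss(theta):
    """Toy loss L = (t0**4 + t1**4)/4 + t0*t1/2 (hypothetical)."""
    t0, t1 = theta
    return 0.25 * (t0**4 + t1**4) + 0.5 * t0 * t1


def grad(theta):
    """Gradient of the toy loss."""
    t0, t1 = theta
    return [t0**3 + 0.5 * t1, t1**3 + 0.5 * t0]


def flooding_step(theta, lr, flood_level):
    """theta - lr * Sign(L - b) * grad L: descent above b, ascent below b."""
    sign = 1.0 if loss(theta) > flood_level else -1.0
    return [t - lr * sign * gi for t, gi in zip(theta, grad(theta))]


def fgr_only_step(theta, lr):
    """theta - lr^2 * Delta R_F(lr), i.e., the F-GR term with eps = gamma = lr."""
    g = grad(theta)
    g_shift = grad([t + lr * gi for t, gi in zip(theta, g)])
    return [t - lr**2 * (gs - gi) / lr for t, gs, gi in zip(theta, g_shift, g)]


theta_t = [1.0, -0.5]
lr, flood_level = 0.1, 0.05

theta_t1 = flooding_step(theta_t, lr, flood_level)   # ascent, since L < b
theta_t2 = flooding_step(theta_t1, lr, flood_level)  # descent, since L > b
print(loss(theta_t), loss(theta_t1))
print(theta_t2, fgr_only_step(theta_t, lr))
```

The two printed parameter vectors agree up to rounding, which illustrates that the pair of flooding steps contains no net contribution from $\nabla L(\theta_t)$ itself and leaves only the finite-difference GR term.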

6. DISCUSSION

This work presented novel practical and theoretical insights into GR. The finite-difference computation is effective in the sense of both reducing computational cost and improving performance. Theoretical analysis supports the empirical observation that the forward difference computation has an implicit bias that chooses potentially better minima depending on the size of the ascent step. Because deep learning requires large-scale models, it would be reasonable to use learning methods composed only of first-order descent or ascent gradients. The current work suggests that F-GR is a promising direction for further investigation and could be extended for our understanding and practical usage of gradient-based regularization.

We suggest several potentially interesting research directions. From a broader perspective, we may regard finite-difference GR, SAM, and flooding as a single learning framework composed of iterative gradient ascent and descent steps. It would be interesting to investigate whether there is an optimal combination of these steps for further improving performance. As our experiments suggest, using only the gradient descent or ascent does not necessarily achieve the best performance, and a combination of them seems to be the best approach. Similar results were empirically observed in other gradient-based regularization techniques (Zhao et al., 2022; Zhuang et al., 2022). Related to the combination of gradient descent and ascent, although we fixed the ascent step size as a constant, a step size decay or any scheduling could enhance the performance further. For instance, Zhuang et al. (2022) used a time-step dependent ascent step to achieve high prediction performance for SAM.

It will also be interesting to explore any theoretical clarification beyond the scope of DLNs. Although a series of analyses in DLNs enables us to explore the implicit bias for selecting global minima, they assume global convergence and avoid an explicit evaluation of convergence dynamics. Thus, it would be informative to explore the convergence rate or escape from local minima in other solvable models or a more general formulation if possible. Constructing generalization bounds would also be an interesting direction. Some theoretical work has proved that regularizing first-order derivatives of the network output controls the generalization capacity (Ma & Ying, 2021), and such derivatives are included in the gradient norm as a part. We expect that the current work will serve as a foundation for further developing and understanding regularization methods in deep learning.

A.1.2 EVALUATION ON $\alpha_{\text{F-GR}}$

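From the definitions of $r(t)$ and $r^*(t)$, we have

$$r^*(t) - r(t) = \frac{2\varepsilon}{n}\tilde{X}\big((\tilde{X}^\top r(t)) \circ w(t)^2\big) + \frac{\varepsilon^2}{n^2}\tilde{X}\big((\tilde{X}^\top r(t))^2 \circ w(t)^2\big). \tag{S.12}$$

Then,

$$\Psi = \int_0^\infty (X^\top r(s))^2\,ds + \frac{\varepsilon}{n}\int_0^\infty \underbrace{2\big(X^\top \tilde{X}((\tilde{X}^\top r(s)) \circ w(s)^2)\big) \circ (X^\top r(s))}_{=:z(s)}\,ds + \frac{\varepsilon^2}{n^2}\int_0^\infty \underbrace{\big(X^\top \tilde{X}((\tilde{X}^\top r(s))^2 \circ w(s)^2)\big) \circ (X^\top r(s))}_{=:z_h(s)}\,ds. \tag{S.13}$$

Let us put

$$\Psi = \Psi_0 + \frac{\varepsilon}{n}\Psi_1 + \frac{\varepsilon^2}{n^2}\Psi_2, \tag{S.14}$$

$$\Psi_0 := \int_0^\infty (X^\top r(s))^2\,ds, \quad \Psi_1 := \int_0^\infty z(s)\,ds, \quad \Psi_2 := \int_0^\infty z_h(s)\,ds. \tag{S.15}$$

Note that the first term $\Psi_0$ essentially corresponds to the implicit bias of the SAM update investigated in the previous study (Andriushchenko & Flammarion, 2022). Because the SAM update corresponds to $\gamma = \varepsilon$, the dominant term of $\gamma\Psi$ is $\Psi_0$ and they neglect the other terms. In our GR case, $\gamma$ and $\varepsilon$ have different scales in general and we need to evaluate the coefficient of the ascent step, that is, $\Psi_1$.

Lemma A.1. Under assumptions (i)-(iii), for sufficiently small $\gamma$, $\Psi_1 > n b(0)^2/2 + O(\gamma)$. If we further assume $b_i(0) \neq 0$ for all $i$, $\Psi_1 > n b(0)^2/4$.

Proof of Lemma A.1. The dynamics (S.3) are rewritten as

$$n\frac{dw}{dt} = -\tilde{b} \circ w - \frac{\gamma}{n}\big[2(\tilde{Z}(\tilde{b} \circ w^2)) \circ w + \tilde{b}^2 \circ w\big] - \frac{\gamma\varepsilon}{n^2}\big[(\tilde{Z}(\tilde{b}^2 \circ w^2)) \circ w + 2(\tilde{Z}(\tilde{b} \circ w^2)) \circ w \circ \tilde{b}\big] - \frac{\gamma\varepsilon^2}{n^3}\big[(\tilde{Z}(\tilde{b}^2 \circ w^2)) \circ w \circ \tilde{b}\big], \tag{S.16}$$

where we put $\tilde{b} = \tilde{X}^\top r$ and $\tilde{Z} = \tilde{X}^\top \tilde{X}$. This gives us

$$\frac{n}{2}\frac{d\beta}{dt} = -b \circ a - \frac{\gamma}{n}\underbrace{\big[2(Z(b \circ a)) \circ a + b^2 \circ \beta\big]}_{=:Q_1(t)} - \frac{\gamma\varepsilon}{n^2}\underbrace{\big[(Z(b^2 \circ \beta)) \circ a + 2(Z(b \circ a)) \circ \beta \circ b\big]}_{=:Q_2(t)} - \frac{\gamma\varepsilon^2}{n^3}\underbrace{\big[(Z(b^2 \circ \beta)) \circ \beta \circ b\big]}_{=:Q_3(t)}, \tag{S.17}$$

where we put $a = w_+^2 + w_-^2$, $b = X^\top r$ and $Z = X^\top X$. Note that $db/dt = X^\top (dr/dt) = X^\top X (d\beta/dt)$. By multiplying (S.17) by $X^\top X$ and taking the Hadamard product with $b$, we have

$$n\frac{db^2}{dt} = -4b \circ (X^\top X(b \circ a)) - \frac{4\gamma}{n}\underbrace{b \circ \Big[X^\top X\Big(Q_1(t) + \frac{\varepsilon}{n}Q_2(t) + \frac{\varepsilon^2}{n^2}Q_3(t)\Big)\Big]}_{=:Q(t)}. \tag{S.18}$$

The point is that we have $2b \circ (X^\top X(b \circ a)) = z(t)$. This relation lets us evaluate the seemingly complicated term $\Psi_1$ by the change of $b(t)^2$, which corresponds to a training loss. By taking the integral over time, the above dynamics become

$$\Psi_1 = \int_0^\infty z(s)\,ds = \frac{n}{2}b(0)^2 - \frac{2\gamma}{n}\int_0^\infty Q(s)\,ds. \tag{S.19}$$

We used assumption (i) that we have a global minimum and $b(\infty) = 0$. If $\gamma$ is sufficiently small and $\int_0^\infty Q(s)\,ds$ is finite, we will have a non-negative $\Psi_1$. Here, we use assumption (ii) that the parameter norm has a finite constant upper bound independent of $\gamma$ and $\varepsilon$. Because $\|a(t)\| = \|w_+(t)^2 + w_-(t)^2\| \le \|w\|^2$, we have an upper bound of $\|a(t)\|$ as well:

$$\|a(t)\| \le \bar{a}. \tag{S.20}$$

Define $\kappa_1 := \max_i \|X x_i\|$, $\kappa_2 := \max_i \|x_i\|$ and $\kappa_3 := \|XX^\top\|_2$. Then, we find

$$|Q_{1,i}(t)| \le 2a_i\|X x_i\|\|b \circ a\| + b_i^2|\beta_i| \tag{S.21}$$

$$\le 2\bar{a}^2\kappa_1\sqrt{\kappa_3}\|r(t)\| + \bar{a}\kappa_2^2\|r(t)\|^2, \tag{S.22}$$

where we used $\|b \circ a\| \le \|b\|\|a\| \le \sqrt{\kappa_3}\,\bar{a}\|r\|$ and $\|\beta\| \le \|a\| \le \bar{a}$. Similarly, we have

$$|Q_{2,i}(t)| \le \bar{a}^2\kappa_1\kappa_3\|r\|^2 + 2\bar{a}^2\kappa_1\kappa_2\sqrt{\kappa_3}\|r\|^2, \tag{S.23}$$

where we used $\|b^2\| \le \sqrt{\sum_i (X_i r)^4} \le \sum_i (X_i r)^2 = \|b\|^2$. We also have

$$|Q_{3,i}(t)| \le \bar{a}^2\kappa_1\kappa_2\kappa_3\|r\|^3. \tag{S.24}$$

Note that under assumption (ii), the training loss is upper-bounded as well because

$$\|r(t)\| \le \|X\beta\| + \|y\| \le \sqrt{\kappa_3}\,\bar{a} + \|y\| =: \bar{L}. \tag{S.25}$$

Therefore, we have

$$|Q_{3,i}(t)| \le \|a\|\kappa_1\kappa_2\kappa_3\bar{L}\|r\|^2. \tag{S.26}$$

Altogether, (S.22), (S.23), and (S.26) lead to

$$\int_0^\infty ds\,Q_i(s) \le C\int_0^\infty ds\,\|r\|^2, \tag{S.27}$$

where $C$ denotes an uninteresting positive constant. By using assumption (iii) that $\int_0^\infty ds\,\|r\|^2$ has a constant upper bound $4n\bar{R}$ independent of $\gamma$ and $\varepsilon$, we have

$$\Psi_1 \ge \frac{n b(0)^2}{2} - 8\gamma C\bar{R}. \tag{S.28}$$

Therefore, for sufficiently small $\gamma$, the dominant term is non-negative. Moreover, if we have $b_i(0) \neq 0$ for all $i$,

$$\Psi_1 \ge \frac{n b(0)^2}{4} > 0 \quad \text{for} \quad \gamma < \min_i \frac{n b_i(0)^2}{32 C\bar{R}}. \tag{S.29}$$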

■
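Two algebraic identities carry most of the proof: the expansion (S.12) of r*(t) − r(t), and the relation z(t) = 2b ∘ (XᵀX(b ∘ a)). The following is a small numerical sanity check of both, assuming L(w) = ‖r‖²/(4n) with r = X̃w² − y and X̃ = [X, −X] as defined in the appendix. The problem sizes, the random data, and the variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 5, 8, 0.1              # illustrative sizes and ascent step (assumptions)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=2 * d)         # w = [w_plus, w_minus]
X_tilde = np.hstack([X, -X])       # X~ = [X, -X]

def residual(w):
    # r = X~ w^2 - y, with an element-wise square
    return X_tilde @ w**2 - y

def grad_loss(w):
    # Gradient of L(w) = ||r||^2 / (4n)
    return (X_tilde.T @ residual(w)) * w / n

r = residual(w)
r_star = residual(w + eps * grad_loss(w))   # residual after the ascent step

# Right-hand side of (S.12)
b_tilde = X_tilde.T @ r
rhs = (2 * eps / n) * (X_tilde @ (b_tilde * w**2)) \
    + (eps**2 / n**2) * (X_tilde @ (b_tilde**2 * w**2))

# z written with tilde quantities, as in (S.13), and with a and b, as in the proof
b = X.T @ r
a = w[:d]**2 + w[d:]**2
z_tilde = 2 * (X.T @ (X_tilde @ (b_tilde * w**2))) * b
z_ab = 2 * b * (X.T @ (X @ (b * a)))

print(np.allclose(r_star - r, rhs), np.allclose(z_tilde, z_ab))
```

Both comparisons hold for arbitrary w, since they are purely algebraic and do not rely on assumptions (i)-(iii).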

As a side note, the inequality (S.29) on $\gamma$ gives us some insight into a non-asymptotic evaluation of how large a $\gamma$ we can take. First, the constant $C$ includes $\bar{a}$, which implies that we need a smaller $\gamma$ for a larger parameter norm $\bar{a}$. Second, note that $\bar{R}$ controls the integral of the training loss over the whole training dynamics. We also need a smaller $\gamma$ for a larger $\bar{R}$, which implies that the convergence of the dynamics is slower.

Next, we evaluate $\Psi_2$. Since

$$z_h(s) = (Z(b^2 \circ \beta)) \circ b, \tag{S.30}$$

we have

$$|z_{h,i}| \le \kappa_1\kappa_2\kappa_3\,\bar{a}\bar{L}\|r\|^2. \tag{S.31}$$

Therefore,

$$\Big|\int_0^\infty z_{h,i}(s)\,ds\Big| \le C'\bar{R}. \tag{S.32}$$

Thus, $\Psi_2$ is finite and becomes negligible in $\Psi$ for a sufficiently small $\varepsilon$. Finally, we have

$$\gamma\Psi \ge \varepsilon\gamma\frac{b(0)^2}{2} - \frac{8\varepsilon\gamma^2}{n}C\bar{R} - \frac{\varepsilon^2\gamma}{n^2}C'\bar{R}, \tag{S.33}$$

where we used Lemma A.1. Substituting the above inequality and $b(0) = X^\top r(0)$ into (S.7), we obtain Theorem 4.1.

Remark on higher-order terms in Theorem 4.1: First, let us remark on the $O(\gamma^2)$ term. Lemma A.1 tells us that if we have $b_i(0) \neq 0$ for all $i$, we have a slightly stronger result than Theorem 4.1:

$$\alpha_{\text{F-GR}} \le \alpha_0 \circ \exp(-\gamma\varepsilon c^*/2 + O(\varepsilon^2)), \tag{S.34}$$

where the $O(\gamma^2)$ term disappears. The condition $b_i(0) \neq 0$ seems to hold in usual cases because the network parameters and training samples are randomly assigned at initialization. Second, regarding the $O(\varepsilon^2)$ term, we have $\Psi \ge 0$ for

$$\varepsilon \le \frac{n^2}{C'\bar{R}}\min_i\Big(\frac{b_i(0)^2}{2} - \frac{8\gamma}{n}C\bar{R}\Big) \tag{S.35}$$

$$< \min_i \frac{n^2 b_i(0)^2}{4C'\bar{R}}, \tag{S.36}$$

where the first line comes from (S.33) and the second one from a small $\gamma$ satisfying (S.33). This implies that we need a smaller $\varepsilon$ for larger $\bar{a}$ and $\bar{R}$ in a similar way to $\gamma$.

Remark on B-GR:

We have obtained the upper bound of $\alpha_{\text{F-GR}}$, that is, the lower bound of $\Psi$ for F-GR. Since we can see B-GR as F-GR with a negative $\varepsilon$, the sign of $\Psi_1$ is flipped in B-GR. Then, we can easily obtain the upper bound of $\Psi$ as follows. First, we have

$$\Psi_0 \le 2\lambda_{\max}(XX^\top)\int_0^\infty \|r(s)\|^2\,ds \tag{S.37}$$

$$\le 8n\lambda_{\max}(XX^\top)\bar{R}, \tag{S.38}$$

where $\lambda_{\max}(XX^\top)$ denotes the largest eigenvalue of $XX^\top$. We have

$$\Psi = \Psi_0 - \frac{\varepsilon}{n}\Psi_1 + O(\varepsilon^2) \tag{S.39}$$

$$\lesssim 8n\lambda_{\max}(XX^\top)\bar{R} - \varepsilon\frac{b(0)^2}{2}, \tag{S.40}$$

where we used Lemma A.1. Substituting the above inequality into (S.7), we obtain $\alpha_{\text{F-GR}} \gtrsim C \circ \exp(\gamma\varepsilon c^*)$ for a positive constant $C$. Thus, the lower bound increases for a larger step size $\varepsilon > 0$ in B-GR and the implicit bias is strengthened in the direction of L2 solutions.

A.2 ALTERNATIVE TO ASSUMPTION (III)

Instead of assumption (iii), we may use

Assumption A.2. For sufficiently small $\varepsilon$ and $\gamma$, the smallest eigenvalue of $S(t) := X\,\mathrm{diag}(a(t))\,X^\top$ is positive.

Since we suppose the overparameterized case ($d > n$), the matrix $X$ is a wide matrix and $S$ has no trivial zero eigenvalue. The positive definiteness of $S$ is a sufficient condition of global convergence as follows. From Eq. (S.17), we have

$$\frac{n}{4}\frac{d\|r\|^2}{dt} = \frac{n}{2}b^\top\frac{d\beta}{dt} = -r^\top S r - \frac{\gamma}{n}r^\top X\Big(Q_1(t) + \frac{\varepsilon}{n}Q_2(t) + \frac{\varepsilon^2}{n^2}Q_3(t)\Big)$$

where we take the lower bound of the smallest eigenvalue as $\lambda^*_{\min} = \min_{t,\gamma,\varepsilon}\lambda_{\min}(S(t))$. By taking a sufficiently small $\gamma$ such that $\gamma < 3\lambda^*_{\min}/(4C)$, we obtain

$$\|r(t)\|^2 \le \|r(0)\|^2\exp(-\lambda^*_{\min}t/n), \tag{S.43}$$

Taking the summation, we get

$$\theta_{t+2} = \theta_t - \eta\big(\nabla_\theta L(\theta_{t+1}) - \nabla_\theta L(\theta_t)\big) \tag{S.53}$$

$$= \theta_t - \eta^2\,\frac{\nabla L(\theta_t + \eta\nabla L(\theta_t)) - \nabla L(\theta_t)}{\eta}. \tag{S.54}$$

Similarly, for $L(\theta_t) > b$ and $L(\theta_{t+1}) < b$, we have

$$\theta_{t+1} = \theta_t - \eta\nabla_\theta L(\theta_t), \tag{S.55}$$

$$\theta_{t+2} = \theta_{t+1} + \eta\nabla_\theta L(\theta_{t+1}), \tag{S.56}$$

and get

$$\theta_{t+2} = \theta_t + \eta\big(\nabla_\theta L(\theta_{t+1}) - \nabla_\theta L(\theta_t)\big) \tag{S.57}$$

$$= \theta_t - \eta^2\,\frac{\nabla L(\theta_t) - \nabla L(\theta_t - \eta\nabla L(\theta_t))}{\eta}.$$

In the experiments on benchmark datasets, we computed the GR term in each mini-batch of the SGD update. The pseudo-code for F-GR is given in Algorithm 1. The double backward computation is implemented as shown in Listing 1.

Algorithm 1 Learning with F-GR
Input: mini-batches {B_1, ..., B_K}
1: while SGD update do
2:   if i-th mini-batch then
3:     ∆L ← ∇L(θ; B_i)
4:     θ′ ← θ + ε∆L
5:     ∆L′ ← ∇L(θ′; B_i)
6:     ∆R ← (∆L′ − ∆L)/ε
7:     θ ← θ − η(∆L + γ∆R)
8:   end if
9: end while
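As an illustration of Algorithm 1, the following is a minimal NumPy sketch of a single F-GR update. It assumes the gradient is available through a user-supplied callable; `f_gr_step`, `grad_fn`, the quadratic toy loss, and the hyper-parameter values are hypothetical and are not the settings used in the experiments.

```python
import numpy as np

def f_gr_step(theta, grad_fn, eta, gamma, eps):
    """One F-GR update in the spirit of Algorithm 1 (single mini-batch)."""
    g = grad_fn(theta)                     # gradient of the loss at theta
    g_ascent = grad_fn(theta + eps * g)    # gradient after the ascent step
    delta_r = (g_ascent - g) / eps         # forward finite difference for the GR term
    return theta - eta * (g + gamma * delta_r)

# Hypothetical toy loss L(theta) = 0.5 * theta^T H theta, so grad L = H theta.
H = np.diag([1.0, 10.0])

def grad_fn(theta):
    return H @ theta

theta = np.array([1.0, 1.0])
for _ in range(200):
    theta = f_gr_step(theta, grad_fn, eta=0.05, gamma=0.1, eps=0.1)

print(theta)  # close to the minimizer at the origin
```

Each update costs two gradient evaluations and no second-order automatic differentiation. For this quadratic toy loss the finite difference is exact and equals H²θ, the gradient of ‖∇L‖²/2; for general losses it is only an approximation that depends on ε.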

C.1.2 EVALUATION ON THE NUMBER OF MATRIX MULTIPLICATION

We represent an $L$-layer fully connected neural network with a linear output layer by $A_l = \phi(U_l)$, $U_l = W_l A_{l-1}$ for $l = 1, ..., L$. We denote the element-wise activation function by $\phi(\cdot)$ and the weight matrix by $W_l$. For simplicity, we neglect bias terms. Note that we have multiple samples $A_0$ (within each mini-batch) as an input, so $W_l A_{l-1}$ requires a matrix-matrix product. Therefore, the forward pass requires $L$ matrix multiplications. Next, let us overview the usual backpropagation on the forward pass $\{A_0 \to A_1 \to \cdots \to A_L\}$. We can express the backward pass as $B_l = \phi'(U_l) \circ (W_{l+1}^\top B_{l+1})$, where the backward signal $B_l$ corresponds to $\partial L/\partial U_l$ ($l = 1, ..., L-1$). Then, the backward pass requires $L - 1$ matrix-matrix multiplications between weights $W$ and backward signals $B$. In addition, we need to compute the gradient $\partial L/\partial W_l = B_l A_{l-1}^\top$ for $\nabla L$, and this is also a matrix-matrix multiplication. After all, we need $3L - 1$ matrix multiplications for $\nabla L$.

Finite difference computation: $\nabla L(\theta')$ requires the same number of matrix multiplications as the normal backpropagation. Therefore, the gradient of the regularized loss requires $6L - 2$ in total. For a sufficiently deep network, this is $\sim 6L$.

Double Backward computation: Let us denote $\partial L/\partial W_l$ by $G_l$. Figure 2 represents the forward pass for computing the gradient of GR. Note that the upper part of this graph, i.e., $\{A_0 \to A_1 \to \cdots \to B_L \to \cdots \to B_1\}$, is well-known in double backpropagation of $\nabla B_1$ for the input-Jacobian regularization. As explained in Drucker & Le Cun (1992), the computation of $\nabla B_1$ is equivalent to applying backpropagation to this upper part of the graph. GR requires additional $L$ nodes for $G_l$. Note that when we have a forward pass with a matrix multiplication, its backward computation requires two matrix multiplications. That is, when a node $S$ of the forward pass is a function of the matrix $X$ given by $X = UV$, we need to compute $\partial S/\partial U = (\partial S/\partial X)V^\top$ and $\partial S/\partial V = U^\top(\partial S/\partial X)$ in the backpropagation. In addition, we do not need to compute the derivative of $A_0$. After all, we need $2 \times (3L - 1) - 2 = 6L - 4$ for $\nabla R$. Since we also compute the gradient of the original loss $\nabla L$, we need $9L - 5$. For a sufficiently deep network, this is $\sim 9L$.
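The counting argument above can be written down as a few lines of arithmetic. The sketch below only restates the counts derived in the text (forward L, backward signals L − 1, weight gradients L); the function names are hypothetical and the code does not measure an actual network.

```python
def matmuls_backprop(num_layers):
    # Forward pass, backward signals, and weight gradients: L + (L - 1) + L = 3L - 1
    forward = num_layers
    backward_signals = num_layers - 1
    weight_grads = num_layers
    return forward + backward_signals + weight_grads

def matmuls_finite_difference(num_layers):
    # Two ordinary gradient computations, at theta and at theta': 6L - 2
    return 2 * matmuls_backprop(num_layers)

def matmuls_double_backward(num_layers):
    # Gradient of the GR term (6L - 4) plus the gradient of the original loss: 9L - 5
    grad_r = 2 * matmuls_backprop(num_layers) - 2
    return grad_r + matmuls_backprop(num_layers)

for depth in (5, 20, 100):
    fd = matmuls_finite_difference(depth)
    db = matmuls_double_backward(depth)
    print(depth, fd, db, round(db / fd, 3))
```

The ratio approaches 9L/6L = 1.5 as the depth grows, which is the asymptotic saving in matrix multiplications suggested by this count.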

C.2 DETAILS OF EXPERIMENTAL SETTINGS

Figure 1: We trained MLP (width 512) and ResNet on CIFAR-10 by using SGD with GR. We set batch size 256, momentum 0.9, initial learning rate 0.01 and used a step decay of the learning rate (scaled by 5 at epochs 60, 120, 160), and γ = ε = 0.05 for GR. We showed the average and standard deviation over 5 trials of different random initialization.

Figure 4: We generated synthetic data by $x_i \sim N(\mu\mathbf{1}, \sigma^2 I)$ and $y_i \sim N(\langle\beta^*, x_i\rangle, 0.01)$. $\beta^*$ is $k^*$-sparse with non-zero entries equal to $1/\sqrt{k^*}$. We set $d = 100$, $n = 50$, $\mu = \sigma^2 = 5$, $\gamma = 0.02$ and initialization $\alpha_{0,i} \sim N(0, 0.01)$. We trained the models by the discrete time update with a small learning rate $\eta = 0.001$. We showed the average of 25 trials with different seeds. We trained the models until the training loss $L$ became lower than $10^{-8}$.

We observed that learning with F-GR could make the loss decrease faster than DB in the sense of convergence rate (i.e., the number of epochs). This means that the loss converges even faster in wall time.
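The synthetic data described for Figure 4 can be generated along the following lines. This is a sketch under stated assumptions: the second argument of N(·, ·) is read as a variance, the support of β* is drawn uniformly at random, and the sparsity level `k_star`, the seed, and the function name `make_sparse_regression` are hypothetical choices not specified in the text.

```python
import numpy as np

def make_sparse_regression(n=50, d=100, k_star=5, mu=5.0, sigma2=5.0,
                           noise_var=0.01, seed=0):
    """Draw x_i ~ N(mu * 1, sigma2 * I) and y_i ~ N(<beta*, x_i>, noise_var)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(n, d))
    beta_star = np.zeros(d)
    support = rng.choice(d, size=k_star, replace=False)   # assumed random support
    beta_star[support] = 1.0 / np.sqrt(k_star)            # non-zero entries 1/sqrt(k*)
    y = X @ beta_star + rng.normal(scale=np.sqrt(noise_var), size=n)
    return X, y, beta_star

X, y, beta_star = make_sparse_regression()
print(X.shape, y.shape, np.count_nonzero(beta_star))
```

With non-zero entries equal to 1/√k*, the ground-truth vector has unit L2 norm for any sparsity level, which keeps the signal scale comparable when k* is varied.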

Figure S.2: To see the difference among algorithms in more detail, we show the test accuracy along the ε axis with a fixed γ of the grid search shown in Figure 3. Each line represents the average and standard deviation over 5 trials of different random initialization. We fixed γ = 0.5 for MLP and γ = 0.05 for ResNet-18. This means that the objective function is the same among different algorithms. Nevertheless, the eventual performance is different. For a large ε, F-GR achieves a higher test accuracy than DB beyond one standard deviation. For such a large ε, F-GR also performs better than B-GR.

Table S.1: We trained WideResNet-28-10 (WRN-28-10) with γ = {0, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}}. For F-GR and B-GR, we set ε = {0.001, 0.01, 0.1}. We computed the average and standard deviation over 5 trials of different random initialization, and reported the best average accuracy achieved over all the above combinations of hyper-parameters. F-GR performs better than DB and B-GR beyond one standard deviation in most cases. We used crop and horizontal flip as data augmentation, cosine scheduling with an initial learning rate 0.1, and set momentum 0.9, batch size 128, and weight decay 0.0001.



Figure 1: Finite-difference computation is more efficient than DB computation in wall time. (a) Wall time required for learning with GR in one epoch. For the ResNet, we used ResNet-{18, 34, 50, 101, 152}. (b) Training dynamics in ResNet-18 on CIFAR-10. Learning with F-GR is much faster in wall time.

(b) confirms that F-GR has fast convergence in ResNet-18 on CIFAR-10. In Figure S.1 0

Figure 2: Computational graph of DB. Each node with an incoming solid arrow requires one matrix multiplication for the forward pass.

Figure 3: Grid search on learning with different GR algorithms shows the superiority of F-GR and that a relatively large ε achieves a high test accuracy. The color bar shows the average test accuracy over 5 trials. Gray dashed lines indicate γ = ε.

Figure 4: Results of training of DLNs using gradient descent with F-GR (γ = 0.02). (a) L1 norm of the solutions, (b) test loss, and (c) the largest eigenvalue of the Hessian of the training loss.

and where $\beta^*$ is $k^*$-sparse with non-zero entries equal to $1/\sqrt{k^*}$ ($d = 100$ and $n = 50$). Following Nacson et al. (2022), we chose $\mu = \sigma^2 = 5$, where the parameter norm $a(t)$ is suppressed and assumption (ii) is expected to hold. As the ascent step increases, models trained by F-GR obtain sparser solutions (Figure 4(a)) and better generalization performance (Figure 4(b))

Figure 5: Flooding decreases the gradient norm, as expected by theory. (a) Training dynamics of flooding with b = 0.05. (b) Test accuracy and gradient norm after the training.

    ...
    loss.backward(create_graph=True)  # backpropagation of the original loss
    loss_DB = (gamma / 2) * sum(torch.sum(p.grad ** 2) for p in model.parameters())  # computing the GR term
    loss_DB.backward()  # backpropagation of the GR term
    optimizer.step()
    ...

Listing 1: Implementation of DB in PyTorch.

Figure 3: (a) We trained the 4-layer MLP and ResNet-18 on CIFAR-10 by using SGD with GR. We trained the models with various hyper-parameters ε = {10^{-5}, 5 × 10^{-5}, ..., 0.5, 1} and γ = {10^{-4}, 2 × 10^{-4}, 5 × 10^{-4}, 10^{-3}, ..., 1, 2, 5}. The other settings are the same as in Figure 1. We set batch size 128, weight decay 0.0001, and used no other regularization technique or data augmentation.

Figure 5: We trained ResNet-18 on CIFAR-10 by SGD with flooding (b = 0.05). The setting is the same as in Figure 1. We computed the gradient norm R by the average of mini-batches in each epoch. We showed the average and standard deviation over 10 trials of different random initialization.

Figure S.1 : Training dynamics in ResNet-18 on CIFAR-10. Learning with F-GR is much faster in wall time.

Figure S.2 : Test accuracy shown in Figure 3 along ε axis with a fixed γ. We fixed γ = 0.5 for MLP and γ = 0.05 for ResNet-18.

Figure S.3: This figure shows an additional experiment of the grid search shown in Section 3.3. We did experiments on a different architecture and dataset, that is, ResNet-34 on CIFAR-100. The result is consistent with those in Figure 3. Learning with F-GR achieves the highest accuracy for large ascent steps. B-GR performs much worse for them. In addition, the highest accuracy of F-GR is better than that of DB. The best test accuracy was (F-GR, B-GR, DB) = (59.9, 58.6, 59.5) ± (0.5, 0.4, 0.5). From this experiment, we can see that the result of the finite difference computation with small ε does not necessarily coincide with that of DB. We observed that the training dynamics showed instability for too small ε. This would be attributed to numerical instability. The important point is that F-GR shows better accuracy than DB for large ascent steps.

Figure S.3 : Grid search on learning with different GR algorithms in ResNet-34 on CIFAR-100. The color bar shows the average test accuracy over 5 trials. Gray dashed lines indicate γ = ε.

Figure S.4: We trained DLNs with various ε and γ in the same setting as in Figure 4. The black circles in the figure show the cases of the lowest test error. The best test error was (F-GR, B-GR) = ($10^{-1.37}$, $10^{-1.16}$) and F-GR performed better.

Figure S.4 : Learning of diagonal linear networks with GR. The color bar shows the average test loss over 10 trials. Training dynamics exploded in the gray area.

Figure S.5 : Flip rate of flooding with b = 0.05.

Table S.1: Test accuracy of WRN-28-10 shows that F-GR performs better. We trained the models with/without data augmentation (DA).

annex

$= -q_1\nabla L(w_t) - q_2\nabla L(w_t + \varepsilon\nabla L(w_t))$, (S.2)

where $q_1 = (1 - \gamma/\varepsilon)$, $q_2 = \gamma/\varepsilon$. The training loss $L(w)$ is defined in (8). The dynamics are rewritten as

where $\circ$ denotes the element-wise product between vectors. We defined $r(t) = \tilde{X}w(t)^2 - y$, $r^*(t) = \tilde{X}w^*(t)^2 - y$, $w^*(t) = w(t) + \varepsilon\nabla L(w(t))$, and put $\tilde{X} = [X \;\; {-X}] \in \mathbb{R}^{n \times 2d}$. We recall that the square of the vector is an element-wise square operation. The general solution of (S.3) is written as

This recovers the GD solution obtained by Woodworth et al. (2020) for $(q_1, q_2) = (1, 0)$, and the SAM solution by Andriushchenko & Flammarion (2022) for $(q_1, q_2) = (0, 1)$. To evaluate the effect of both ε and γ on the implicit bias, we need a lower-order evaluation compared to previous work.

Suppose an interpolation solution $\beta_\infty$ satisfying $X\beta_\infty = y$. We can represent it by

Because the form of the function β ∞ = B α X ⊤ ν is the same as in the analysis of usual gradient descent (Woodworth et al., 2020), we can use their transformation of $\beta_\infty$ as it is. We have a KKT condition $\nabla\phi_\alpha(w) = X^\top\nu$ and the function $\phi_\alpha$ satisfies

We have $\beta_\infty(\alpha) = \arg\min$ (S.11)

In our GR case, α is just replaced by $\alpha_{\text{F-GR}}$.

from Grönwall's inequality. Since $L(w(t)) = \|r(t)\|^2/(4n)$, we obtain global convergence. In addition, we have

This gives the upper bound $\bar{R}$. Similarly, we obtain R by taking the lower bound of (S.41) and using Grönwall's inequality. Thus, instead of assumption (iii), we can apply Assumption A.2 in the transformation from (S.27) to (S.29).

Note that $S(t)$ is known as the neural tangent kernel in the lazy regime and its positive definiteness is straightforward (Woodworth et al., 2020). Although there is no proof of the positive definiteness in the rich regime, we observed it in numerical experiments and Assumption A.2 seems rational.

A.3 DERIVATION OF PROPOSITION 4.2

We obtained the upper bound of $\alpha_{\text{F-GR}}$ (in other words, the lower bound of Ψ) from the term of $\Psi_1$. In some cases, $\Psi_0$ gives us complementary insight.

Proposition A.3. Under the same assumptions as in Theorem 4.1, for sufficiently small ε and γ,

where the exponent $c$ is a non-negative variable given by $c = n^{-2}\int_0^\infty (X^\top(X\beta(s) - y))^2\,ds$.

Proof. $\Psi_0$ is non-negative by definition and written as

In addition, we have $\Psi_1 \ge O(\gamma)$ from Lemma A.1. Therefore, we have

Substituting this into (S.7), we obtain the result.

Remark on an average of c: Note that $c$ may depend on ε and γ because it is given by the integral of training dynamics. It seems hard to obtain a concrete value of this integral. Instead of evaluating each entry of $c$, let us analyze the average value of $c$, that is, $\|c\|_1/d = \sum_{i=1}^d c_i/d$. This approach gives us some insight into a typical value of the exponent $c$:

(S.48)

A similar evaluation has been used in the analysis of SAM (Andriushchenko & Flammarion, 2022).

A.4 HESSIAN

The MSE loss of the diagonal linear network has the following Hessian: (S.52)

