SECOND-ORDER REGRESSION MODELS EXHIBIT PROGRESSIVE SHARPENING TO THE EDGE OF STABILITY

Anonymous

Abstract

Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant Neural Tangent Kernel (NTK) regime, for which the predictive function is approximately linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call second-order regression models. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which we explicitly compute. In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior aren't unique features of neural networks, and could be a more general property of discrete learning algorithms in high-dimensional non-linear models.

1. INTRODUCTION

A recent trend in the theoretical understanding of deep learning has focused on the linearized regime, where the Neural Tangent Kernel (NTK) controls the learning dynamics (Jacot et al., 2018; Lee et al., 2019). The NTK describes the learning dynamics of all networks over short enough time horizons, and can describe the dynamics of wide networks over large time horizons. In the NTK regime, there is a function-space ODE which allows for explicit characterization of the network outputs (Jacot et al., 2018; Lee et al., 2019; Yang, 2021). This approach has been used across the board to gain insights into wide neural networks, but it suffers from a major limitation: the model is linear in the parameters, so it describes a regime with relatively trivial dynamics that cannot capture feature learning and cannot accurately represent the types of complex training phenomena often observed in practice. While other large-width scaling regimes can preserve some non-linearity and allow for certain types of feature learning (Bordelon & Pehlevan, 2022; Yang et al., 2022), such approaches tend to focus on small-learning-rate or continuous-time dynamics. In contrast, recent empirical work has highlighted a number of important phenomena arising from the non-linear discrete dynamics of training practical networks with large learning rates (Neyshabur et al., 2017; Gilmer et al., 2022; Ghorbani et al., 2019; Foret et al., 2022). In particular, many experiments have shown the tendency of networks to display progressive sharpening of the curvature towards the edge of stability, in which the maximum eigenvalue of the loss Hessian increases over the course of training until it stabilizes at a value roughly equal to two divided by the learning rate, corresponding to the largest eigenvalue for which gradient descent would converge in a quadratic potential (Wu et al., 2018; Giladi et al., 2020; Cohen et al., 2022a;b).
In order to build a better understanding of this behavior, we introduce a class of models which display all the relevant phenomenology, yet are simple enough to admit numerical and analytic understanding. In particular, we propose a simple quadratic regression model and corresponding quartic loss function which fulfills both these goals. We prove that under the right conditions, this simple model shows both progressive sharpening and edge-of-stability behavior. We then empirically analyze a more general model which shows these behaviors generically in the large datapoint, large model limit. Finally, we conduct a numerical analysis on the properties of a real neural network and use tools from our theoretical analysis to show that edge-of-stability behavior "in the wild" shows some of the same patterns as the theoretical models.

2.1. MODEL DEFINITION

We consider the optimization of the quadratic loss function $L(\theta) = z^2/2$, where $z$ is a quadratic function of the $P \times 1$-dimensional parameter vector $\theta$ and $Q$ is a $P \times P$ symmetric matrix:

$$z = \tfrac{1}{2}\theta^\top Q\theta - E.$$

This can be interpreted either as a model in which the predictive function is quadratic in the parameters, or as a second-order approximation to a more complicated non-linear function such as a deep network. For this objective, the gradient flow (GF) dynamics with scaling factor $\eta$ are given by

$$\dot\theta = -\eta \nabla_\theta L = -\eta z \left(\frac{\partial z}{\partial \theta}\right)^{\!\top} = -\eta \left(\tfrac{1}{2}\theta^\top Q\theta - E\right) Q\theta.$$

It is useful to re-write the dynamics in terms of $z$ and the $1 \times P$-dimensional Jacobian $J = \partial z/\partial\theta$:

$$\dot z = -\eta (JJ^\top) z, \qquad \dot J = -\eta z J Q.$$

The curvature is a scalar, described by the neural tangent kernel (NTK) $JJ^\top$. In these coordinates, we have $E = \tfrac{1}{2} J Q^{+} J^\top - z$, where $Q^{+}$ denotes the Moore-Penrose pseudoinverse.

The GF equations can be simplified by two transformations. First, we rescale to $\tilde z = \eta z$ and $\tilde J = \eta^{1/2} J$. Next, we rotate $\theta$ so that $Q$ is diagonal; this is always possible since $Q$ is symmetric, and since the NTK is given by $JJ^\top$, the rotation preserves the dynamics of the curvature. Let $\omega_1 \geq \cdots \geq \omega_P$ be the eigenvalues of $Q$, and $v_i$ the associated eigenvectors (in case of degeneracy, one can pick any basis). We define $\tilde J(\omega_i) = \tilde J v_i$, the projection of $\tilde J$ onto the $i$th eigenvector. Then the gradient flow equations can be written as

$$\frac{d\tilde z}{dt} = -\tilde z \sum_{i=1}^{P} \tilde J(\omega_i)^2, \qquad \frac{d\,\tilde J(\omega_i)^2}{dt} = -2\, \tilde z\, \omega_i\, \tilde J(\omega_i)^2.$$

The first equation implies that $\tilde z$ does not change sign under GF dynamics. Modes with $\omega_i \tilde z > 0$ decrease the curvature, and those with $\omega_i \tilde z < 0$ increase it. In order to study edge-of-stability behavior, we need initializations which allow the curvature ($JJ^\top$ in this case) to increase over time, a phenomenon known as progressive sharpening. Progressive sharpening has been shown to be ubiquitous in machine learning models (Cohen et al., 2022a), so any useful phenomenological model should show it as well.
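As a concrete sketch of the model, the snippet below writes down $z$, the loss, and the analytic gradient $z\,Q\theta$ of $L = z^2/2$, and checks the gradient against a central finite difference. The specific $Q$, $E$, and $\theta$ are arbitrary illustrative choices of our own.

```python
import numpy as np

# Minimal sketch of the quadratic regression model from the text:
#   z = (1/2) theta^T Q theta - E,   L = z^2 / 2,   J = dz/dtheta = theta^T Q.
# Q, E, and theta below are arbitrary illustrative choices.

def z_fn(theta, Q, E):
    return 0.5 * theta @ Q @ theta - E

def grad_L(theta, Q, E):
    # dL/dtheta = z * dz/dtheta = z * Q theta   (Q symmetric)
    return z_fn(theta, Q, E) * (Q @ theta)

rng = np.random.default_rng(0)
P = 5
A = rng.standard_normal((P, P))
Q = 0.5 * (A + A.T)                  # symmetric Q
E = 1.0
theta = rng.standard_normal(P)

# check the analytic gradient against a central finite difference in coordinate i
i, eps = 2, 1e-6
e = np.zeros(P); e[i] = eps
fd = (z_fn(theta + e, Q, E) ** 2 - z_fn(theta - e, Q, E) ** 2) / (4 * eps)
print(fd, grad_L(theta, Q, E)[i])
```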
One such initialization for this quadratic regression model is $\omega_1 = -\omega$, $\omega_2 = \omega$, $\tilde J(\omega_1) = \tilde J(\omega_2)$. This initialization (and others like it) shows progressive sharpening at all times.
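This sharpening can be checked numerically. The sketch below integrates the transformed mode equations with forward Euler under the initialization above, tracking the curvature $\sum_i \tilde J(\omega_i)^2$; the numerical values of $\omega$, $\tilde z(0)$, and $\tilde J(\omega_i)^2$ are illustrative choices of our own.

```python
# Forward-Euler integration of the transformed gradient-flow mode equations
#   dz/dt = -z (a + b),   da/dt = -2 z w1 a,   db/dt = -2 z w2 b,
# with the sharpening initialization w1 = -w, w2 = w, a(0) = b(0).
# (a, b stand for J(w1)^2, J(w2)^2; all initial values are illustrative.)
w = 1.0
z, a, b = 1.0, 0.5, 0.5          # z > 0 at initialization
dt = 1e-3
curvatures = []
for _ in range(5000):
    curvatures.append(a + b)     # scalar NTK = sum of squared Jacobian modes
    dz = -z * (a + b)
    da = -2.0 * z * (-w) * a     # negative-eigenvalue mode grows
    db = -2.0 * z * (+w) * b     # positive-eigenvalue mode shrinks
    z, a, b = z + dt * dz, a + dt * da, b + dt * db

print(curvatures[0], curvatures[-1])
```

With $\tilde z > 0$, the $\omega_1 = -\omega$ mode grows while the $\omega_2 = \omega$ mode shrinks, so the curvature $a + b$ is non-decreasing along the whole trajectory.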

2.2. GRADIENT DESCENT

We are interested in understanding the edge-of-stability (EOS) behavior in this model: gradient descent (GD) trajectories where the maximum eigenvalue of the NTK, $JJ^\top$, remains close to the critical value $2/\eta$. We define the edge of stability with respect to the maximum NTK eigenvalue rather than the maximum loss Hessian eigenvalue used in Cohen et al. (2022a). We will prove this form of EOS in our simpler models, and find that it holds empirically in more complex models; see Appendix A.1 for further discussion. When $Q$ has both positive and negative eigenvalues, the loss landscape is the square of a hyperbolic paraboloid (Figure 1, left). As suggested by the gradient flow analysis, some trajectories then increase their curvature before convergence, so the final curvature depends on both the initialization and the learning rate. One of the challenges in analyzing the gradient descent (GD) dynamics is that they rapidly and heavily oscillate around minima for large learning rates. One
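A small simulation illustrates this EOS behavior in the two-dimensional case. The sketch below runs GD on $L = z^2/2$ with $Q = \mathrm{diag}(-1, 1)$; the values of $\eta$, $E$, and the initial $\theta$ are illustrative choices of our own, picked so that the NTK starts well below $2/\eta$, sharpens, and then oscillates near the critical value.

```python
import numpy as np

# Gradient descent on L = z^2/2 with z = (1/2) theta^T Q theta - E and
# Q = diag(-1, 1): one negative and one positive eigenvalue, so the loss
# surface is the square of a hyperbolic paraboloid. eta, E, and theta0
# are illustrative choices.
eta = 1.0
Q = np.diag([-1.0, 1.0])
E = -1.0
theta = np.array([0.6, 0.4])

ntks = []
for _ in range(300):
    z = 0.5 * theta @ Q @ theta - E
    J = Q @ theta                    # Jacobian dz/dtheta
    ntks.append(J @ J)               # scalar NTK, JJ^T
    theta = theta - eta * z * J      # GD step on L = z^2/2

print(ntks[0], np.mean(ntks[-50:]))  # initial NTK vs. late-time average
```

With these values the NTK starts around 0.5, sharpens past the gradient-flow scale, and then oscillates around a value near $2/\eta = 2$ rather than diverging, which is the EOS signature discussed above.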

