SECOND-ORDER REGRESSION MODELS EXHIBIT PRO-GRESSIVE SHARPENING TO THE EDGE OF STABILITY Anonymous

Abstract

Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant Neural Tangent Kernel (NTK) regime, for which the predictive function is approximately linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call second-order regression models. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which we explicitly compute. In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior aren't unique features of neural networks, and could be a more general property of discrete learning algorithms in high-dimensional non-linear models.

1. INTRODUCTION

A recent trend in the theoretical understanding of deep learning has focused on the linearized regime, where the Neural Tangent Kernel (NTK) controls the learning dynamics (Jacot et al., 2018; Lee et al., 2019) . The NTK describes learning dynamics of all networks over short enough time horizons, and can describe the dynamics of wide networks over large time horizons. In the NTK regime, there is a function-space ODE which allows for explicit characterization of the network outputs (Jacot et al., 2018; Lee et al., 2019; Yang, 2021) . This approach has been used across the board to gain insights into wide neural networks, but it suffers a major limitation: the model is linear in the parameters, so it describes a regime with relatively trivial dynamics that cannot capture feature learning and cannot accurately represent the types of complex training phenomena often observed in practice. While other large-width scaling regimes can preserve some non-linearity and allow for certain types of feature learning (Bordelon & Pehlevan, 2022; Yang et al., 2022) , such approaches tend to focus on the small learning-rate or continuous-time dynamics. In contrast, recent empirical work has highlighted a number of important phenomena arising from the non-linear discrete dynamics in training practical networks with large learning rates (Neyshabur et al., 2017; Gilmer et al., 2022; Ghorbani et al., 2019; Foret et al., 2022) . In particular, many experiments have shown the tendency for networks to display progressive sharpening of the curvature towards the edge of stability, in which the maximum eigenvalue of the loss Hessian increases over the course of training until it stabilizes at a value equal to roughly two divided by the learning rate, corresponding to the largest eigenvalue for which gradient descent would converge in a quadratic potential (Wu et al., 2018; Giladi et al., 2020; Cohen et al., 2022b; a) . In order to build a better understanding of this behavior, we introduce a class of models which display all the relevant phenomenology, yet are simple enough to admit numerical and analytic understanding. In particular, we propose a simple quadratic regression model and corresponding quartic loss function which fulfills both these goals. We prove that under the right conditions, this simple model shows both progressive sharpening and edge-of-stability behavior. We then empirically analyze a more general model which shows these behaviors generically in the large datapoint, large model limit. Finally, we conduct a numerical analysis on the properties of a real neural network and use tools from our theoretical analysis to show that edge-of-stability behavior "in the wild" shows some of the same patterns as the theoretical models.

2.1. MODEL DEFINITION

We consider the optimization of the quadratic loss function L(θ) = z 2 /2, where z a quadratic function on the P × 1-dimensional parameter vector θ and Q is a P × P symmetric matrix: z = 1 2 θ Qθ -E . This can be interpreted either as a model in which the predictive function is quadratic in the input parameters, or as a second-order approximation to a more complicated non-linear function such as a deep network. In this objective, the gradient flow (GF) dynamics with scaling factor η is given by θ = -η∇ θ L = ηz ∂z ∂θ = η 2 θ Qθ -E Qθ . It is useful to re-write the dynamics in terms of z and the 1 × P -dimensional Jacobian J = ∂z/∂θ: ż = -η(JJ )z, J = -2ηzQJ . ( ) The curvature is a scalar, described by the neural tangent kernel (NTK) JJ . In these coordinates, we have E = JQ + J -2z, where Q + denotes the Moore-Penrose pseudoinverse. The GF equations can be simplified by two transformations. First, we transform to z = ηz and J = η 1/2 J. Next, we rotate θ so that Q is diagonal. This is always possible since Q is symmetric. Since the NTK is given by JJ , this rotation preserves the dynamics of the curvature. Let ω 1 ≥ . . . ≥ ω P be the eigenvalues of Q, and v i be the associated eigenvectors (in case of degeneracy, one can pick any basis). We define J(ω i ) = Jv i , the projection of J onto the ith eigenvector. Then the gradient flow equations can be written as: dz dt = -z P i=1 J(ω i ) 2 , d J(ω i ) 2 dt = -2zω i J(ω i ) 2 . ( ) The first equation implies that z does not change sign under GF dynamics. Modes with positive ω i z decrease the curvature, and those with negative ω i z increase the curvature. In order to study edge-of-stability behavior, we need initializations which allow the curvature (JJ in this case) to increase over time -a phenomenon known as progressive sharpening. Progressive sharpening has been shown to be ubiquitous in machine learning models (Cohen et al., 2022a) , so any useful phenomenological model should show it as well. One such initialization for this quadratic regression model is ω 1 = -ω, ω 2 = ω, J(ω 1 ) = J(ω 2 ). This initialization (and others) show progressive sharpening at all times.

2.2. GRADIENT DESCENT

We are interested in understanding the edge-of-stability (EOS) behavior in this model: gradient descent (GD) trajectories where the maximum eigenvalue of the NTK, JJ , remains close to the critical value 2/η. We define edge of stability with respect to the maximum NTK eigenvalue instead of the maximum loss Hessian eigenvalue from Cohen et al. (2022a) . We will prove this form of EOS in our simpler models, and find that it holds empirically in more complex models. See Appendix A.1 for further discussion. When Q has both positive and negative eigenvalues, the loss landscape is the square of a hyperbolic parabaloid (Figure 1 , left). As suggested by the gradient flow analysis, this causes some trajectories to increase their curvature before convergence. This causes the final curvature to depend on both the initialization and learning rate. One of the challenges in analyzing the gradient descent (GD) dynamics is that they rapidly and heavily oscillate around minima for large learning rates. One way to mitigate this issue is to consider only every other step (Figure 1 , right). We will use this observation to analyze the gradient descent (GD) dynamics directly to find configurations where these trajectories show edge-of-stability behavior. In the eigenbasis coordinates, the gradient descent equations are zt+1 -zt = -z t P i=1 J(ω i ) 2 t + 1 2 (z 2 t ) P i=1 ω i J(ω i ) 2 t (5) J(ω i ) 2 t+1 -J(ω i ) 2 t = -z t ω i (2 -zt ω i ) J(ω i ) 2 t for all 1 ≤ i ≤ P . We'll find it convenient in the following to write the dynamics in terms of weighted averages of J(ω i ) 2 instead of the modes J(ω i ): T (α) = P i=1 ω α i J(ω i ) 2 . ( ) The dynamical equations become: zt+1 -zt = -z t T t (0) + 1 2 (z 2 t )T t (1) T t+1 (k) -T t (k) = -z t (2T t (k + 1) -zt T t (k + 2)) . ( ) If Q is invertible, then we have E = T t (-1) -2z t . Note that by definition T t (0) = ηJ t J t is the (rescaled) NTK. edge-of-stability behavior corresponds to dynamics which keep T t (0) near the value 2 as zt goes to 0.

2.2.1. REDUCTION TO CATAPULT DYNAMICS

If the eigenvalues of Q are {-ω, ω}, and E = 0, the model becomes equivalent to a single hidden layer linear network with one training datapoint (Appendix A.2) -also known as the catapult phase dynamics. This model doesn't exhibit sharpening or edge-of-stability behavior (Lewkowycz et al., 2020) . We will analyze this model in our z -T (0) variables as a warmup, with an eye towards analyzing a different parameter setting which does show sharpening and edge of stability. We assume without loss of generality that the eigenvalues are {-1, 1} -which can be accomplished by rescaling z. The loss function is then the square of a hyperbolic parabaloid. Since there are only 2 variables, we can rewrite the dynamics in terms of z and the curvature T (0) only (Appendix B.1): zt+1 -zt = -z t T t (0) + 1 2 (z 2 t )(2z t + E) (10) T t+1 (0) -T t (0) = -2z t (2z t + E) + z 2 t T t (0) . For E = 0, we can see that sign(∆T (0)) = sign(T t (0) -4), as in Lewkowycz et al. (2020) -so convergence requires strictly decreasing curvature. For E = 0, there is a region where the curvature can increase (Appendix B.1). However, there is still no edge-of-stability behavior -there is no set of initializations which starts with λ max far from 2/η, which ends up near 2/η. In contrast, we will show that asymmetric eigenvalues can lead to EOS behavior.

2.2.2. EDGE OF STABILITY REGIME

In this section, we consider the case in which Q has two eigenvalues -one of which is large and positive, and the other one small and negative. Without loss of generality, we assume that the largest eigenvalue of Q is 1. We denote the second eigenvalue by -, for 0 < ≤ 1. With this notation we can write the dynamical equations (Appendix B.1) as zt+1 -zt = -z t T t (0) + 1 2 (z 2 t )((1 -)T t (0) + (2z t + E)) T t+1 (0) -T t (0) = -2z t ( (2z t + E) + (1 -)T t (0)) + z2 t [T t (0) + ( -1) (T t (0) -E -2z t )] . (13) For small , there are trajectories where λ max is initially away from 2/η but converges towards it (Figure 2 , left) -in other words, EOS behavior. We used a variety of step sizes η but initialized at pairs initialized at pairs (ηz 0 , ηT 0 (0)) to show the universality of the z-T (0) coordinates. In order to quantitatively understand the progressive sharpening and edge of stability, it is useful to look at the two-step dynamics. One additional motivation for studying the two-step dynamics follows from the analysis of gradient descent on linear least squares (i.e., linear model) with a large step size λ. For every coordinate θ, the one-step and two-step dynamics are θt+1 -θt = -λ θt and θt+2 -θt = (1 -λ) 2 θt (GD in quadratic potential) . While the dynamics converge for λ < 2, if λ > 1 the one-step dynamics oscillate when approaching minimum, whereas the the two-step dynamics maintain the sign of θ and the trajectories exhibit no oscillations. Likewise, plotting every other iterate in the two parameter model more clearly demonstrates the phenomenology. For small , the dynamics shows the distinct phases described in (Li et al., 2022) : an initial increase in T (0), a slow increase in z, then a decrease in T (0), and finally a slow decrease of z while T (0) remains near 2 (Figure 2 , middle). Unfortunately, the two-step version of the dynamics defined by Equations 12 and 13 are more complicated -they are 3rd order in T (0) and 9th order in z; see Appendix B.2 for a more detailed discussion. However we can still analyze the dynamics as z goes to 0. In order to understand the mechanisms of the EOS behavior, it is useful to understand the nullclines of the two step dynamics. The nullcline f z (z) of z and f T (z) of T (0) are defined implicitly by (z t+2 -zt )(z, f z (z)) = 0, (T t+2 (0) -T t (0))(z, f T (z)) = 0 ( ) where zt+2 -zt and T t+2 (0) -T t (0) are the aforementioned high order polynomials in z and T (0). Since these polynomials are cubic in T (0), there are three possible solutions as z goes to 0. We are particularly interested in the solution that goes through z = 0, T (0) = 2 -that is, the critical point corresponding to EOS. Calculations detailed in Appendix B.2 show that the distance between the two nullclines is linear in , so they become close as goes to 0. (Figure 2 , middle). In addition, the trajectories stay near f z -which gives rise to EOS behavior. This suggests that the dynamics are slow near the nullclines, and trajectories appear to be approaching an attractor. We can find the structure of the attractor by changing variables to y t ≡ T t (0) -f z (z t ) -the distance from the z nullcline. To lowest order in z and y, the two-step dynamical equations become (Appendix B.3): zt+2 -zt = 2y t zt + O(y 2 t zt ) + O(y t z2 t ) y t+2 -y t = -2(4 -3 + 4 2 )y t z2 t -4 z2 t + O(z 3 t ) + O(y 2 z2 t ) We immediately see that z changes slowly for small y -since we chose coordinates where zt+2 -z t = 0 when y = 0. We can also see that y t+2 -y t is O( ) for y t = 0 -so for small , the y dynamics is slow too. Moreover, we see that the coefficient of the z2 t term is negative -the changes in z tend to drive y (and therefore T (0)) to decrease. The coefficient of the y t term is negative as well; the dynamics of y tends to be contractive. The key is that the contractive behavior takes y to an O( ) fixed point at a rate proportional to z2 , while the dynamics of z are proportional to . This suggests a separation of timescales if z2 , where y first equilibrates to a fixed value, and then z converges to 0 (Figure 2 , right). This intuition for the lowest order terms can be formalized, and gives us a prediction of lim t→∞ y t = -/2, confirmed numerically in the full model (Appendix B.5). -3 , left). Trajectories are the same up to scaling because corresponding rescaled coordinates z and T (0) are the same at initialization. Plotting every other iterate, we see that for a variety of initializations (black x's), trajectories in z -T (0) space stay near the nullcline (z, f z (z)) -the curve where zt+2 -zt = 0 (middle). Changing variables to y = T (0) -f z (z) shows quick concentration to a curve of near-constant, small, negative y (right). We can prove the following theorem about the long-time dynamics of z and y when the higher order terms are included (Appendix B.4): Theorem 2.1. There exists an c > 0 such that for a quadratic regression model with E = 0 and eigenvalues {-, 1}, ≤ c , there exists a neighborhood U ⊂ R 2 and interval [η 1 , η 2 ] such that for initial θ ∈ U and learning rate η ∈ [η 1 , η 2 ], the model displays edge-of-stability behavior: 2/η -δ λ ≤ lim t→∞ λ max ≤ 2/η , for δ λ of O( ). This neighborhood corresponds to the inverse image of the z -y space region [0, zc ) × [0, y c ), for -independent zc and y c . Therefore, unlike the catapult phase model, the small provably has EOS behavior -whose mechanism is well-understood by the z -y coordinate transformation.

3.1. GENERAL MODEL

While the model defined in Equation 1 provable displays edge-of-stability behavior, it required tuning of the eigenvalues of Q to demonstrate it. We can define a more general model which exhibits edge-of-stability behavior with less tuning. We define the quadratic regression model as follows. Given a P -dimensional parameter vector θ, the D-dimensional output vector z is given by z = y + G θ + 1 2 Q(θ, θ) . ( ) Here y is a D-dimensional vector, G is a D × P -dimensional matrix, and Q is a D × P × Pdimensional tensor symmetric in the last two indices -that is, Q(•, •) takes two P -dimensional vectors as input, and outputs a D-dimensional vector verifying Q(θ, θ) α = θ Q α θ. If Q = 0, the model corresponds to linearized learning (as in the NTK regime). When Q = 0, we obtain the first correction to NTK regime. We note that: G αi = ∂z α ∂θ i θ=0 , Q αij = ∂ 2 z α ∂θ i ∂θ j , → J = G + Q(θ, •) , for the D × P dimensional Jacobian J. For D = 1, we recover the model of Equation 1. In the remainder of this section, we will study the limit as D and P increase with fixed ratio D/P . The quadratic regression model corresponds to a model with a constant second derivative with respect to parameter changes -or a second order expansion of a more complicated ML model. Quadratic expansions of shallow MLPs have been previously studied (Bai & Lee, 2020; Zhu et al., 2022) , and the perturbation theory for small Q is studied in Roberts et al. (2022) . Other related models are detailed in Appendix A. We will provide evidence that even random, unstructured quadratic regression models lead to EOS behavior.

3.2. GRADIENT FLOW DYNAMICS

We will focus on training with squared loss L(z) = 1 2 α z 2 α . We begin by considering the dynamics under gradient flow (GF): θ = - ∂L(z) ∂θ = -J z . ( ) We can write the dynamics in the output space z and the Jacobian J as ż = J θ = -JJ z, J = -Q(J z, •) When Q = 0 (linearized/NTK regime), J is constant, the dynamics are then linear in z, and are controlled by the eigenstructure of JJ , the empirical NTK. In this regime there is no EOS behavior. We are interested in settings where progressive sharpening occurs under GF. We can study the dynamics of the maximum eigenvalue λ max of JJ at early times for random initializations. In Appendix C.1, we prove the following theorem: Theorem 3.1. Let z, J, and Q be initialized with i.i.d. elements with zero mean and variances σ 2 z , σ 2 J , and 1 respectively, with distributions invariant to rotation in data and parameter space, and have finite fourth moments. Let λ max be the largest eigenvalue of JJ . In the limit of large D and P , with fixed ratio D/P , at initialization we have E[ λmax (0)] = 0, E[ λmax (0)]/E[λ max (0)] = σ 2 z (23) where E denotes the expectation over z, J, and Q at initialization. Much like in the D = 1 case, Theorem 3.1 suggests that it is easy to find initializations that show progressive sharpening -and increasing σ z makes sharpening more prominent.

3.3. GRADIENT DESCENT DYNAMICS

We now consider finite-step size gradient descent (GD) dynamics. The dynamics for θ are given by: θ t+1 = θ t -ηJ t z t . In this setting, the dynamic equations can be written as z t+1 -z t = -ηJ t J t z t + 1 2 η 2 Q(J t z t , J t z t ) (25) J t+1 -J t = -ηQ(J t z t , •) . ( ) If Q = 0, the dynamics reduce to discrete gradient descent in a quadratic potential -which converges iff λ max < 2/η. One immediate question is: when does the η 2 in Equation 25 affect the dynamics? Given that it scales with higher powers of η and z than the first term, we can conjecture that the ratio of the magnitudes of the terms, r N L , is proportional to ||z|| 2 and η. A calculation in Appendix C.2 shows that, for random rotationally invariant initializations, we have: where as before the expectation is taken over the initialization of z, J, and Q. This suggests that increasing the learning rate increases the deviation of the dynamics from GF (which is obvious), but increasing ||z|| also increases the deviation from GF. r N L ≡ E[|| 1 2 η 2 Q(J 0 z 0 , J 0 z 0 )|| 2 2 ] E[||ηJ 0 J 0 z 0 || 2 2 ] 1/2 = 1 2 ησ z D , We can see this phenomenology in the dynamics of the GD equations (Figure 3 ). Here we plot different trajectories for random initializations of the type in Theorem 3.1 with D = 60, P = 120, and η = 1. As σ z increases, so does the curvature λ max (as suggested by Theorem 3.1), and when σ z is O(1), the dynamics is non-linear (as predicted by r N L ) and EOS behavior emerges. This suggests that the second term in Equation 25 is crucial for the stabilization of λ max . We can confirm this more generally by initializing over various η, D, P , σ z , and σ J over multiple seeds, and plotting the resulting phase diagram of the final λ max reached. We can simplify the plotting with some rescaling of parameters and initializations. For example, in the rescaled variables z = ηz, J = η 1/2 J , the dynamics are equivalent to Equations 25 and 26 with η = 1. As in the z -T (0) model of Equations 8-9, λ max in the rescaled coordinates is equivalent to ηλ max in the unscaled coordinates. We can also define rescaled initializations for z and J. If we set σ z = σz /D, σ J = σJ / (DP ) 1/4 , then we have r N L = σz which allows for easier comparison across (D, P ) pairs. Using this initialization scheme, we can plot the final value of λ max reached as a function of σz and σJ for 100 independent random initializations for each σz , σJ pair (Figure 4 ). We see that the key is for r N L = σz to be O(1) -corresponding to both progressive sharpening and non-linear dynamics near initialization. In particular, initializations with small σJ values which converge at the EOS correspond to trajectories which first sharpen, and then settle near λ max = 2/η. Large σz and large σJ dynamics diverge. There is a small band of initial σJ over a wide range of σz which have final λ max ≈ 2/η; these correspond to models initialized near the EOS, which stay near it. This suggests that progressive sharpening and edge of stability aren't uniquely features of neural network models, and could be a more general property of learning in high-dimensional, non-linear models.

4. CONNECTION TO REAL WORLD MODELS

In this section we examine how representative is the proposed model and the developed theory to the behavior of "real world" models. Following Cohen et al. (2022a) , we trained a 2-hidden layer tanh network using the squared loss on 5000 examples from CIFAR10 with learning rate 10 -2a setting which shows edge of stability behavior. Close to the onset of EOS, we approximately computed λ 1 , the largest eigenvalue of JJ , and its corresponding eigenvector v 1 using a Lanczos method (Ghorbani et al., 2019; Novak et al., 2019) . We use v 1 to compute z 1 = v 1 z, where z is the vector of residuals f (X, θ) -Y for neural network function f , training inputs X, labels Y, and There is evidence that low-dimensional features of a quadratic regression model could be used to explain some aspects of EOS behavior. We empirically compute the the second derivative of the output f (x, θ) by automatic differentiation. We denote by Q(•, •) the resulting tensor. We can use matrix-vector products to compute the spectrum of the matrix 6 , left). This figure reveals that the spectrum does not shift much from step 3200 to 3900 (the range of our plots). This suggests that Q doesn't change much as these EOS dynamics are displayed. We can also see that Q is much larger in the v 1 direction than a random direction. Q 1 ≡ v 1 • Q(•, •), which is projection of the output of Q in the v 1 direction, without instantiating Q in memory (Figure Let y be defined as y = λ 1 η -2. Plotting the two-step dynamics of z 1 versus 2yz we see a remarkable agreement (Figure 6 , middle). This is the same form that the dynamics of z takes in our simplified model. It can also be found by iterating Equation 25 twice with fixed Jacobian for y = λ 1 η -2 and discarding terms higher order in η. This suggests that during this particular EOS behavior, much like in our simplified model the dynamics of the eigenvalue is more important than any rotation in the eigenbasis. The dynamics of y is more complicated; y t+2 -y t is anticorrelated with z 2 1 but there is no low-order functional form in terms of y and z 1 (Appendix D.1). We can get some insight into the stabilization by plotting the ratio of η 2 Q 1 (Jz 1 v 1 , Jz 1 v 1 ) (the non-linear contribution to the z 1 dynamics from the v 1 direction) and λ 1 z 1 (the linearized contribution), and compare it to the dynamics of y (Figure 6 , right). The ratio is small during the initial sharpening, but becomes O(1) shortly before the curvature decreases for the first time. It remains O(1) through the rest of the dynamics. This suggests that the non-linear feedback from the dynamics of the top eigenmode onto itself is crucial to understanding the EOS dynamics. FAR10 (left) . Projection onto largest eigendirection v 1 (blue and orange) is larger than projection onto random direction (green). Two step difference (z 1 ) t+2 -(z 1 ) t is well approximated by 2z 1 y (middle), leading order term of models with fixed eigenbasis. Non-linear dynamical contribution η 2 Q 1 (Jz 1 v 1 , Jz 1 v 1 ) is small during sharpening, but becomes large immediately preceding decrease in top eigenvalue (right) -as is the case in the simple model.

5.1. LESSONS LEARNED FROM QUADRATIC REGRESSION MODELS

The main lesson to be learned from the quadratic regression models is that behavior like progressive sharpening (for both GF and GD) and edge-of-stability behavior (for GD) may be common features of high-dimensional gradient-based training of non-linear models. Indeed, these phenomena can be revealed in simple settings without any connection to deep learning models: with mild tuning our simplified model, which corresponds to 1 datapoint and 2 parameters can provably show EOS behavior. This combined with the analysis of the CIFAR model suggest that the general mechanism may have a low-dimensional description. Quadratic approximations of real models quantitatively can capture the early features of EOS behavior (the initial return to λ max < 2/η), but do not necessarily capture the magnitude and period of subsequent oscillations -these require higher order terms (Appendix D.2). Nevertheless, the quadratic approximation does correctly describe much of the qualitative behavior, including the convergence of λ max to a limiting two-cycle that oscillates around 2/η, with an average value below 2/η. In the simplified two-parameter model, it is possible to analytically predict the final value at convergence, and indeed we find that it deviates slightly from the value 2/η. A key feature of all the models studied in this work is that looking at every-other iterate (the twostep dynamics) greatly aids in understanding the models theoretically and empirically. Near the edge of stability, this makes the changes in the top eigenmode small. In the simplified model, the slow z dynamics (and related slow T (0) dynamics) allowed for the detailed theoretical analysis; in the CIFAR model, the two-step dynamics is slowly varying in both z 1 and λ max . The quantitative comparisons of these small changes may help uncover any universal mechanisms/canonical forms that explain EOS behavior in other systems and scenarios.

5.2. FUTURE WORK

One avenue for future work is to quantitatively understand progressive sharpening and EOS behavior in the quadratic regression model for large D and P . In particular, it may be possible to predict the final deviation 2 -ηλ max in the edge-of-stability regime as a function of σ z , σ J , and D/P . It would also be useful to understand how higher order terms affect the training dynamics. One possibility is that a small number of statistics of the higher order derivatives of the loss function are sufficient to obtain a better quantitative understanding of the oscillations around y = 2. Finally, our analysis has not touched on the feature learning aspects of the model. In the quadratic regression model, feature learning is encoded in the relationship between J and z, and in particu-lar the relationship between z and the eigenstructure of JJ . Understanding how Q mediates the dynamics of these two quantities may provide a quantitative basis for understanding feature learning which is complementary to existing theoretical approaches (Roberts et al In this work we focus on EOS dynamics of the largest eigenvalue of the NTK, rather than the Hessian as in Cohen et al. (2022a) . We note that a version of Theorem 2.1 is true for the maximum Hessian eigenvalue as well. In general, the Hessian can be written as ∂ 2 L ∂θ∂θ = ∇L • ∂ 2 z ∂θ∂θ + J T ∂ 2 L ∂z∂z J For squared loss in particular, we have ∂ 2 L ∂θ∂θ = ∇L • ∂ 2 z ∂θ∂θ + J T J (31) As the loss gradient goes to 0, the Hessian eigenvalues approach the eigenvalues of JJ T -whose non-zero eigenvalues are the same as those of the empirical NTK JJ T . Since the theorem involves behavior as z goes to convergence, the maximum NTK and maximum Hessian eigenvalues are equal in the limit, and the same EOS behavior applied in both cases. For the higher dimensional models (quadratic regression model and fully connected network on CI-FAR10), our experiments show that the maximum NTK eigenvalue shows edge of stability behavior. The CIFAR model is the same as the one in Cohen et al. (2022a) which was used to illustrate the edge of stability in terms of the maximum Hessian eigenvalues. Therefore we focused on the NTK version of EOS in our paper, as we found it more amenable to theoretical analysis and explanation. There are almost certainly cases where EOS behavior is displayed in the Hessian eigenvalues but not the NTK eigenvalues, particularly in cases where the loss is highly non-isotropic in the outputs (that is, ∂ 2 L ∂z∂z is far from a multiple of the identity matrix). As pointed out in previous works in these cases even the Hessian-based EOS is more difficult to analyze Cohen et al. (2022a) . We leave understanding of EOS with more complicated loss functions for future work.

A.2 ONE-HIDDEN LAYER LINEAR NETWORK

Consider a one hidden layer network with a scalar output: f (x) = v Ux (32) where x is an input vector of length N , U is a K × N dimensional matrix, and v is a K dimensional vector. We note that ∂ 2 f (x) ∂v i ∂v j = ∂ 2 f (x) ∂U ij ∂U kl = 0, ∂ 2 f (x) ∂v i ∂U jk = δ ij x k (33) where δ ij is the Kroenecker delta. For a fixed training set, this second derivative is constant; therefore, the one-hidden layer linear network is a quadratic regression model of the type studied in Section 3. In the particular case of a single datapoint x, we can compute the eigenvectors of the Q matrix. Let (w, W) be an eigenvector of Q, representing the v and U components respectively. The eigenvector equations are ωw i = x m δ ij W jm (34) ωW jm = x m δ ij w i Simplifying, we have: ωw = Wx (36) ωW = wx We have two scenarios. The first is that ω = 0. In this case, we have w = 0, and W is a matrix with x in its nullspace. The latter condition gives us M constraints on M × N equations -for a total of M (N -1) of our M (N + 1) total eigenmodes. If ω = 0, then combining the equations we have the conditions: ω 2 w = (x • x)w (38) ω 2 W = Wxx (39) This gives us ω = ± √ x • x. We know from Equation 37 that W is low rank. Therefore, we can guess a solution of the form W ±,i = ±e i x where the e i are the M coordinate vectors. This suggests that we have w ±,i = ( √ x • x)e i This gives us our final 2M eigenmodes. We can analyze the initial values of of the J(ω i ) as well. The components of the Jacobian can be written as: (J v ) i ≡ ∂f (x) ∂v i = U im x m (42) (J U ) jm ≡ ∂f (x) ∂U jm = v j x m From this form, we can deduce that J is orthogonal to the 0 modes. We can also compute the conserved quantity. Let J 2 + be the total weight in the positive eigenmodes, and J 2 -be the total weight in the negative eigenmodes. A direct calculation shows that ω -1 (J 2 + -J 2 -) = 2f (x) which implies that E = 0. Therefore, the single-hidden layer linear model on one datapoint is equivalent to the quartic loss model with E = 0 and eigenvalues ± √ x • x.

A.3 CONNECTION TO BORDELON & PEHLEVAN (2022)

Since the one-hidden layer linear model has constant Q, the models in Section F.1 of Bordelon & Pehlevan (2022) fall into the quadratic regression class. In the case of Section F.1.1, Equation 67, we can make the mapping to a D = 1 model explicit. The dynamics are equivalent to said model with a single eigenvalue ω 0 if we make the identifications ∆ = z, H y = J 2 0 , γ 0 = √ 2ω, y = -E/2 A.4 CONNECTION TO NTH The Neural Tangent Hierarchy (NTH) equations extend the NTK dynamics to account for changes in the tangent kernel by constructing an infinite sequence of higher order tensors which control the non-linear dynamics of learning Huang & Yau (2020) . Truncation of the NTH equations at 3rd order is related to, but not the same as the quadratic regression model, as we will show here. The 3rd order NTH equation describes the change in the tangent kernel JJ . Consider the D × D × D-dimensional kernel K 3 whose elements are given by (K 3 ) αβγ = ∂ 2 z α ∂θ i ∂θ j J iγ J jβ + ∂ 2 z β ∂θ i ∂θ j J iγ J jα where repeated indices are summed over. In the NTH, for squared loss the change in the NTK JJ is given by d dt JJ αβ = -η(K 3 ) αβγ z γ For fixed Q = ∂ 2 z ∂θ∂θ , this equation is identical to the GF equations for the NTK in the quadratic regression model. We note that K 3 is not constant under the quadratic regression model. Conversely, for fixed K 3 , ∂ 2 z ∂θ∂θ is not constant either. Therefore, the two methods can be used to construct different low-order expansions of the dynamics.

B 2 PARAMETER MODEL B.1 DERIVATION OF z-T (0) EQUATIONS

We can use the conserved quantity E to write the dynamics in terms of z and T (0) only. Without loss of generality, let the eigenvalues are 1 and λ, with -1 ≤ λ ≤ 1. (We can achieve this by rescaling z.) Recall the dynamical equations zt+1 -zt = -z t T t (0) + 1 2 (z 2 t )T t (1) T t+1 (0) -T t (0) = -z t (2T t (1) -zt T t (2)) We will find substitutions for T (1) and T (2) in terms of z and T (0). Recall that we have T (-1) = E + 2z ( ) where E is conserved throughout the dynamics (and indeed is a property of the landscape). We will use this definition to solve for T (1) and T (2). Since P = 2, we can write T (-1) = bT (0) + aT (1), for coefficients a and b which are valid for all combinations of J. If J(λ) = 0, we have b = 1 -a. If J(1) = 0, we have 1 = λ(1 -a) + λ 2 a. Solving, we have: T (-1) = (1 -a)T (0) + aT (1) for a = - 1 λ (51) The restrictions on λ translate to a / ∈ (-1, 1). In terms of the conserved quantity E = T (-1) -2z, we have: T (-1) = E + 2z In order to convert the dynamics, we need to solve for T (1) and T (2) in terms of T (0) and z. We have: T (1) = 1 a (T (-1) + (a -1)T (0)) = 1 a (E + 2z + (a -1)T (0)) We also have T (2) = T (0) + 1 -a a 2 (T (0) -E -2z) This gives us  zt+1 -zt = -z t T t (0) + 1 2a (z 2 t )((a -1)T t (0) + 2z t + E) T t+1 (0) -T t (0) = - 2 a zt (2z t + E + (a -1)T t (0)) + z 2 t T t (0) + 1 -a a 2 (T t (0) -E -2z t ) (56) If λ = -(that is, a = -1 ) we recover the equations from the main text. The non-negativity of J2 gives us constraints on the values of z and T . For a > 1 (small negative second eigenvalue), the constraints are: T > 2z + E, T > -(2z + E)/a This is an upward-facing cone with vertex at z = -E/2 (Figure 8 , left). For a < -1, the constraints are -(2z + E)/a < T < 2z + E (58) This is a sideways facing cone with vertex at z = -E/2 (Figure 8 , right). We see that in this case, there is a limited set of values of T to converge to. Indeed, for E = 0, there is no convergence except at T (0) = 0. This why we focus on the case of one positive and one negative eigenvalue. We can also solve for the nullclines -the curves where either zt+1 -zt = 0 (blue in Figure 8 ), or T t+1 (0) -T t (0) = 0 (orange in Figure 8 ). The nullcline (z, f z (z)) for z is given by f z (z) = z(2z + E) 2a -(a -1)z (59) The nullcline (z, f T (z)) for T (0) is given by f T (z) = - (a -1)z -2a (a 2 -a + 1)z -2a(a -1) (2z + E) The line z = 0 is also a nullcline. For the symmetric model = 1, the structure of the nullclines determines the presence or lack of progressive sharpening. For E = 0, there is no sharpening; the phase portrait (Figure 7 , left) confirms this as the nullcline in T t (0) divides the space into two halves, one which converges, and the other which doesn't. However, when E = 0, the nullclines split, and there is a small region where progressive sharpening can occur (Figure 7 , middle). However, there is still no edge-ofstability behavior in this case -there is no region where the trajectories cluster near λ max = 2/η (Figure 7 , right). For the symmetric model = 1, the structure of the nullclines determines the presence or lack of progressive sharpening. For E = 0, there is no sharpening; the phase portrait (Figure 7 , left) confirms this as the nullcline in T t (0) divides the space into two halves, one which converges, and the other which doesn't. However, when E = 0, the nullclines split, and there is a small region where progressive sharpening can occur (Figure 7 , middle). However, there is still no edge-ofstability behavior in this case -there is no region where the trajectories cluster near λ max = 2/η (Figure 7 , right). 

B.2 TWO-STEP DYNAMICS

The two-step difference equations can be derived by iterating Equations 12 and 13. We have zt+2 -zt = p 0 (z t , ) + p 1 (z t , )T t (0) + p 2 (z t , )T t (0) 2 + p 3 (z t , )T t (0) 3 (61) T (0) t+2 -T t (0) = q 0 (z t , ) + q 1 (z t , )T t (0) + q 2 (z t , )T t (0) 2 + q 3 (z t , )T t (0) 3 Here the p i and q i are polynomials in z, maximum 9th order in z and 6th order in . They can be computed explicitly but we choose to omit the exact forms for now. Numerical simulation of the dynamics for small reveals an edge of stability effect (Figure 9 ). We see that the distribution of final values of T for random initializations has a peak near T (0) = 2 (right). By plotting the two-step dynamics, we can see that the two-step nullclines which go in to T (0) = 2 almost coincide (left). By studying these nullclines, we will be able to understand the edge of stability effect. For fixed , we can solve for the z two-step nullclines (z t+2 -z t = 0) and the T nullclines (T t+2 (0)-T t (0) = 0) using Cardano's formula to solve for T as a function of z. In particular, each nullcline equation has a solution that goes through z = 0, T (0) = 2, independent of . This is the family of solutions that we will focus on. Let (z, f z, (z)) be the nullcline of z, and let (z, f T, (z)) be the nullcline of T (0). We will show that the T values of the nullclines, as a function of z and , is differentiable around z = 0, = 0. The nullclines are defined by the implicit equations 0 = 6z 3 -2T z -3T z2 ( -1) -T z3 ( + 2)(2 + 1) + T 2 z + 7 2 T 2 z2 ( -1) + 1 2 T 2 z3 9 2 -10 + 9 - 1 2 T 3 z2 ( -1) - 1 2 T 3 z3 3 2 -4 + 3 + O(z 4 ) (63) 0 = -8z 2 -12z 3 ( -1) + 4T z( -1) + 2T z2 3 2 -+ 3 + 4T z3 ( -1) 2 + 4 + 1 -2T 2 z( -1) -T 2 z2 7 2 -8 + 7 -T 2 z3 ( -1) 9 2 -+ 9 + T 3 z2 2 -+ 1 + T 3 z3 ( -1) 3 2 -+ 3 + O(z 4 ) (64) We omit the higher order terms for now in anticipation of differentiating at z = 0 to use the implicit function theorem. Dividing by z, we have the equations 0 = 6z 2 -2T -3T z( -1) -T z2 ( + 2)(2 + 1) + T 2 + 7 2 T 2 z( -1) + 1 2 T 2 z2 9 2 -10 + 9 - 1 2 T 3 z( -1) - 1 2 T 3 z2 3 2 -4 + 3 + O(z 3 ) (65) 0 = -8z -12z 2 ( -1) + 4T ( -1) + 2T z 3 2 -+ 3 + 4T z2 ( -1) 2 + 4 + 1 -2T 2 ( -1) -T 2 z 7 2 -8 + 7 -T 2 z2 ( -1) 9 2 -+ 9 + T 3 z 2 -+ 1 + T 3 z2 ( -1) 3 2 -+ 3 + O(z 3 ) We immediately see that z = 0, T = 2 solves both equations for all . Let w( , z, T ) and v( , z, T ) be the right hand sides of Equations 65 and 66 respectively. We have ∂w ∂T (0,0,2) = 2, ∂v ∂T (0,0,2) = 4 In both cases the derivative is invertible. Therefore, f z, (z) and f T, (z) are continuously differentiable in both z and in some neighborhood of 0. In fact, since w and v are analytic in all three arguments, f z, (z) and f T, (z) are analytic as well. We can use the analyticity to solve for the low-order structure of the nullclines. One way to compute the values of the derivatives is to define the nullclines as formal power series: f z (z) = 2 + ∞ j=1 ∞ k=1 a j,k j zk (68) f T (z) = 2 + ∞ j=1 ∞ k=1 b j,k j zk We can then solve for the first few terms of the series using Equations 65 and 66. From this procedure, we have: f z, (z) = 2 + 2 (1 -) z + 2 1 -+ 2 z2 + O(z 3 ) (70) f T, (z) = 2 - 2 -3 + 2 2 1 - z + 1 2 4 -+ 4 2 z2 + O(z 3 ) The difference f ∆, (z) between the two is: f ∆ (z) ≡ f z (z) -f T (z) = - 1 - z - 3 2 z2 + O(z 3 ) As decreases, for the low order terms the distance between the nullclines also decreases. We can show that the difference goes as . The one-step dynamical equations for = 0 are zt+1 -zt = -z t T t (0) + 1 2 z2 t T t (0) T t+1 (0) -T t (0) = -2z t T t (0) + z 2 t T t (0) ) Therefore, ∆z = 2∆T . This means that both the one step AND two-step nullclines are identical. Since f z,0 (z) = f T,0 (z), and both are differentiable with respect to , we have: f z, (z) -f T, (z) = f ∆, (z) ) for some function f ∆, (z) which is analytic in and z in a neighborhood around (0, 0). From this we can conclude that zt converges to 0. Therefore, for any positive initialization with z0 ≤ zc , y 0 ≤ y c , and y 0 ≤ z2 0 , we have: lim t→∞ zt → 0, lim t→∞ y = -y f where y f = O( ). Now we can prove the statement of Theorem 2.1. Given a model with ≤ c , there is a continuous mapping between θ -η space and z -y space. Since the region [0, zc ) × [0, y c ) in z -y space displays edge-of-stability behavior for -independent zc and y c (that is, T t (0) converges to within O( ) of 2 in that region), the inverse image of that neighborhood is a neighborhood in θ -η space that displays edge-of-stability behavior. This concludes the proof.

B.5 LOW ORDER DYNAMICS

In order to predict the final value of y, and understand the convergence to the fixed point, We can study the low order dynamics in z and y. The low order dynamical equations are: zt+2 -zt = 2y t zt (121) y t+2 -y t = -2(4 -3 + 4 2 )y t z2 t -4 z2 t ( ) For these reduced dynamics, we can show the following: Theorem B.3. For the dynamics defined by Equations 121 and 122, for 1, for positive inititializations z0 1, y 0 1 with the additional constraints -log( ) 16z 2 0 and y 0 < 2z 2 0 , we have lim t→∞ zt = 0, lim t→∞ y t = -/2 + O( 2 ) Proof. The proof distinguishes two phases in the time evolution: • Phase 1: z starts positive and increases, y starts positive and decreases. At the end of the phase we want zt ≤ 2z 0 and y to be negative but bounded by -16z 2 0 . • Phase 2: z decreases slowly, and y settles to the fixed point (relatively) quickly, up to error O( 2 ). Let 1. Consider an initialization (z 0 , y 0 ) where both variables are positive, such that z0 1, log( ) z2 0 , and y 0 z2 0 . From Equations 121 and 122, we see that the dynamics of y will depend on the balance of the two terms. Initially z increases and y decreases. We analyze the dynamics of y assuming that z is fixed, and then compute the corrections. Phase 1. At initialization, the first term in the dynamics dominates, since by assumption z2 t y t z2 t . Since z2 0 1, y initially decreases exponentially with decay rate bounded from above by 8z 2 0 . Therefore within log(-/y 0 )/8z 2 0 steps, y < . At this point, the rate of change of y is at least -4 z2 0 . Therefore, in no more than 1/4z 2 0 additional steps, y becomes negative. Let t -be the first time that y becomes negative. We note that y t-≥ -4 z2 0 under this analysis -the first term in Equation 122 is less than y t in magnitude, so the smallest value that y t+2 can take if y t is positive is -4 z2 0 . We can now understand the corrections due to the change in z. We note that e -8z 2 0 t is an upper bound for y -since z is increasing, and the -4 z2 t decreases y faster than exponential decay from the first term. Since z is increasing, y t ≥ e -8z 2 0 t as long as y remains positive (t < t -). Let t sm be a time such that ztsm < 2z 0 . We can bound the change in zt for t < t sm . We know that y t ≥ y 0 e -8z 2 0 t . The change in z can be bounded by ztsm -z0 ≤ tsm t=0 2z t y t ≤ 4z 0 tsm t=0 y t ≤ 4z 0 y 0 tsm t=0 e -8z 2 0 t ≤ 1 2 • y 0 z0 . ( ) If y 0 < 2z 2 0 , then the bound holds independent of the value of t sm , as long as the bound on y is correct. We know that the bound on y is correct until time t -; therefore, t sm ≥ t -. Phase 2. This proves that there exists a time t -, such that zt-≤ 2z, and -16z 2 0 ≤ y t-≤ 0. Now that y is negative, it will stay negative, and z will decrease until it reaches 0. In order to understand the dynamics, we will use a change of coordinates. Consider solving Equation 122 for y t+2 -y t = 0 for zt = 0. We have y * = - 2 -3/2 + 2 2 (125) Consider now the coordinate δ t defined by the equation y t = -(1 + δ t ) 2 -3/2 + 2 2 (126) The dynamics of δ t are given by δ t+2 = (1 -2(4 -3 + 4 2 )z 2 t )δ t Since zt 1, δ t is strictly decreasing in magnitude. We can bound δ t from above by |δ t | ≤ exp   -8 t s=t- z2 s   |δ t-| Since δ starts negative, and is decreasing in magnitude, we know that y t > -2-3/2 +2 2 . This means that we can bound zt by zt ≥ 2e -t z0 (129) Substitution gives us the following bound on δ t : |δ t | ≤ exp   -8 t s=t- 4e -2 s z2 0   |δ t-| Using the integral approximation for the sum, the bound becomes |δ t | ≤ exp -32z 2 0 t 0 e -2 s ds δ t-= exp -16z 2 0 / (1 -e -2 t ) |δ t-| From our previous analysis, we know that -1 ≤ δ t-≤ 0. In the limit of large t we have lim t→∞ |δ t | ≤ exp -16z 2 0 / |δ t-| If we have the condition 16z 2 0 / ≥ -log( ) (133) then lim t→∞ |δ t | ≤ 2 . If we want lim t→∞ y t = -/2 + O( 2), then we need the condition 16z 2 0 ≥ -log( ) or equivalently -log( ) < 16z 2 0 . Under these conditions, lim t→∞ zt = 0 and lim t→∞ y t = -/2 + O( 2 ). This result can be confirmed numerically by running the dynamical equations from a variety of initializations, computing the median eigenvalue (restricted to the range [1.9, 2.0]), and plotting versus (Figure 10 ).We note that since the dynamics is slow, the ODE given by ż = 2yz (135) ẏ = -2(4 -3 + 4 2 )yz 2 -4 z2 (136) also obtains the same limit (Figure 10 ). The ODE suggests that the concentration relies on both the equal-orders in z of the y 0 and y 1 terms, as well as a separation of timescales -z converges to 0 at a rate of , while y converges to the fixed point at a rate z2 t . In both cases, the deviation from -/2 scales as O( 2 ) (Figure 10 , right). B.6 PARAMETER SPACE VS. z -T SPACE Most of our analysis has been focused in the normalized z -T coordinate space. In this section, we confirm that the more usual setup in parameter space is consistent with the normalized coordinate space. In particular, EOS behavior is often described by fixing an initialization, and training with different learning rates -as in Figure 1 . We can plot the dynamics of T (0) for the trajectories from Figure 1 . We see that for small learning rates there is convergence to T (0) < 2, large learning rates there's divergence, and for intermediate learning rates there is convergence to 2 -/2 (Figure 11 ). This confirms that the theorem is useful to describe the more traditional method of discovering and exploring EOS behavior. 

C QUADRATIC REGRESSION MODEL DYNAMICS

We use Einstein summation notation in this section -repeated indices on the right-hand-side of equations are considered to be summed over, unless they show up on the left-hand-side. C.1 PROOF OF THEOREM 3.1 Let z, J, and Q be initialized with i.i.d. random elements with 0 mean and variance σ 2 z , σ 2 J , and 1 respectively. Furthermore, Let the distributions be invariant to rotations in both data space and parameter space, and have finite 4th moment. In order to understand the development of the curvature at early times, we consider coordinates which convert J into its singular value form. In these coordinates, we can write: J αi = 0 if α = i σ α if α = i The singular values σ α are the square roots of the singular values of the NTK matrix. We assume that they are ordered from largest (σ 1 ) to smallest in magnitude. By assumption, under this rotation the statistics of z and Q are left unchanged. The time derivatives at t = 0 can be computed directly in the singular value coordinates. The first derivative is given by d dt σ 2 α = 2σ α σα Using the diagonal coordinate system, we have E d dt σ 2 α = E[Q αβj J βj z β ] = 0 However, the average second derivative is positive. Calculating, we have: d 2 dt 2 σ 2 α = 2( σ2 α + σ α σα ) We can compute the average at initialization. We have: E[ σ2 α ] = E[Q αβj J βj z β Q αδk J δk z δ ] = E[δ βδ δ jk J βj J δk z β z δ ] (141) E[ σ2 α ] = E[Q αβj J βj z β Q αδk J δk z δ ] = j E[J 2 βj z 2 β ] = DP σ 2 J σ 2 z (142) To compute the second term, we compute Jαi : Jαi = -Q αij (J βj żβ + Jβj z β ) Expanding, we have: Jαi = Q αij (J βj J βk J δk z δ + Q βjk J δk z δ z β ) In the diagonal coordinates J αα = σ α . This gives us: E[σ α σα ] = E[σ α Q ααj Q βjk J δk z δ z β ] Averaging over the Q, we get: E[σ α σα ] = P E[σ α δ αβ δ αk J δk z δ z β ] = E[σ α z α z δ J δα ] Which evaluates to: E[σ α σα ] = σ 2 z P E[σ 2 α ] In the limit of large D and P , for fixed ratio D/P , the statistics of the Marchenko-Pastur distribution allow us to compute the derivative of the largest eigenmode as E[σ 0 σ0 ] = σ 2 z σ 2 J P 2 D(1 + D/P ) 2 Taken together, this gives us E d 2 λ max dt 2 = σ 2 z σ 2 J DP (P (1 + D/P ) 2 + 1) We confirm the prediction numerically in Figure 12 . That is, the second derivative of the maximum curvature is positive on average. If we normalize with respect to the eigenvalue scale, in the limit of large D and P we have: E d 2 λ max dt 2 /E[λ max ] = σ 2 z (150) Therefore, increasing σ z increases the relative curvature of the λ max trajectory. This gives us the proof of Theorem 3.1. This result suggests that as σ z increases, so does the degree of progressive sharpening. This can be confirmed by looking at GF trajectories (Figure 13 ). The trajectories with small σ z don't change their curvature much, and the loss decays exponentially at some rate. However, when σ z is larger, the curvature increases initially, and then stabilizes to a higher value, allowing for faster convergence to the minimum of the loss. Figure 13 : Gradient flow trajectories of loss and max NTK eigenvalues for quadratic regression models for varying σ z . As σ z increases, λ max changes more quickly, and is generally increasing. Models with higher σ z converge faster in GF dynamics.

C.2 TIMESCALES FOR GRADIENT DESCENT

Consider a random initialization of z, J, and Q, where the terms are i.i.d. with zero mean variances σ 2 z , σ 2 J , and 1 respectively, and finite fourth moments. Furthermore, suppose that z, J, and Q are rotationally invariant in both input and output space. Under these conditions, we hope to compute r 2 N L ≡ E[|| 1 2 η 2 Q αij (J βi ) 0 (z β ) 0 (J δj ) 0 (z δ ) 0 || 2 2 ] E[||η(J αi ) 0 (J iβ ) 0 (z β ) 0 || 2 2 ] = 1 4 η 2 σ 2 z D 2 (151) at initialization, in the limit of large D and P . The denominator is given by: E[J αi J βi (z β )J αj J δj (z δ )] = σ 2 z E[J αi J βi J αj J δj δ βδ ] = σ 2 z E[J αi J βi J αj J βj ] Evaluation gives us: E[J αi J βi (z β )J αj J δj (z δ )] = σ 2 z (σ 4 J (P (P -1)D) + C 4 DP ) where C 4 is the 4th moment of J αi . To lowest order in D and P E[J αi J βi (z β )J αj J δj (z δ )] = σ 2 z σ 4 J DP 2 + O(DP ) Evaluating the numerator, we have: E[Q αij J βi z β J δj z δ Q αmn J γm z γ J νn z ν ] = E[J βi z β J δj z δ J γm z γ J νn z ν ](δ im δ jn + (M 4 -1)δ ijmn ) (155) where M 4 is the 4th moment of Q αij . This gives us: 1 D E[Q αij J βi z β J δj z δ Q αmn J γm z γ J νn z ν ] = E[J βi z β J δj z δ J γi z γ J νj z ν ]+ (M 4 -1)E[J βi z β J δi z δ J γi z γ J νi z ν ] Next, we perform the z averages. We have 1 D E[Q αij J βi z β J δj z δ Q αmn J γm z γ J νn z ν ] = σ 4 z E[J βi J δj J γi J νj ](δ βδ δ γν + δ βγ δ δν + δ βν δ δγ ) + (C 4 -σ 4 )E[J βi J δj J γi J νj ]δ βδγν + (M 4 -1)σ 4 z E[J βi J δi J γi J νi ](δ βδ δ γν + δ βγ δ δν + δ βν δ δγ ) + (M 4 -1)(C 4 -σ 4 )E[J βi J δi J γi J νi ]δ βδγν (157) where C 4 is the 4th moment of z. Simplification gives us: 1 D E[Q αij J βi z β J δj z δ Q αmn J γm z γ J νn z ν ] = σ 4 z (E[J βi J βj J δi J δj ] + E[J βi J δj J βi J δj ] + E[J βi J δj J δi J βj ]) + (C 4 -σ 4 z )E[J βi J βj J βi J βj ] + (M 4 -1)σ 4 z (E[J βi J βi J γi J γi ] + E[J βi J δi J βi J δi ] + E[J βi J δi J δi J βi ]) + (M 4 -1)(C 4 -σ 4 )E[J βi J βi J βi J βi ] For large D and P , the final three terms are asymptotically smaller than the first term. Evaluating the first term, to leading order we have: 1 D E[Q αij J βi z β J δj z δ Q αmn J γm z γ J νn z ν ] = σ 4 z σ 4 J (2DP 2 + 2D 2 P + D 2 P 2 ) + O(D 2 P + DP 2 ) (159) E[Q αij J βi z β J δj z δ Q αmn J γm z γ J νn z ν ] = σ 4 z σ 4 J D 3 P 2 + O(D 3 P + D 2 P 2 ) This gives us: r 2 N L = 1 4 σ 4 z σ 4 J D 3 P 2 σ 2 z σ 4 J DP 2 = 1 4 σ 2 z D 2 to leading order, in the limit of large D and P .

C.3 DEPENDENCE ON D AND P

We can see empirically that the sharpening is more pronounced in the overparameterized regime where D > P . Using the trajectories from Figure 4 , we can make a scatter plot of the normalized maximum NTK eigenvalues ηλ max at both initialization and the final point of the dynamics (Figure C.3) . In all cases, a variety of initializations (x-axis) lead to final values which concentrate around 2 (y-axis). We can see that the concentration is tightest for the overparameterized regime where P > D (right plot). We hypothesize that for large D and P , the EOS behavior is stronger and more likely to happen when P > D. We leave futher exploration of this hypothesis for future work. The dynamics of y in the CIFAR10 model analyzed in Section 4 are more complicated than the z 1 dynamics. We see from Figure 5 that there is a z 1 and y-independent component of the two-step change in y. We can approximate this change b by computing the average value of y t+2 -y t for small z 1 (taking z 1 < 10 -4 in this case). We can then subtract off b from y t+2 -y t , and plot the remainder against z 2 t (Figure 15 left ). We see that y t+2 -y t -b is negatively correlated with z 2 t , particularly for large z t . However, y t+2 -y t is clearly not simply function of z 1 . The two-step model dynamics could be written as (ay + c)z 2 . If we plot (y t+2 -y t -b)/z 2 1 versus y t , we again don't have a single-valued function (Figure 15, right) . Therefore, the functional form of y t+2 -y t is not given by b + ayz 2 1 + cz 2 1 . 



Figure 1: Quartic loss landscape L(•) as a function of the parameters θ, where D = 2, E = 0 and Q has eigenvalues 1 and -0.1. The GD trajectories (initialized at (1.5, -4.32), marked with an x) converge to minima with larger curvature than at initialization and therefore show progressive sharpening (left). The two-step dynamics, in which we consider only even iteration numbers, exhibit fewer oscillations near the edge of stability (right).

Figure2: For small , two-eigenvalue model shows EOS behavior for various step sizes ( = 5 • 10 -3 , left). Trajectories are the same up to scaling because corresponding rescaled coordinates z and T (0) are the same at initialization. Plotting every other iterate, we see that for a variety of initializations (black x's), trajectories in z -T (0) space stay near the nullcline (z, f z (z)) -the curve where zt+2 -zt = 0 (middle). Changing variables to y = T (0) -f z (z) shows quick concentration to a curve of near-constant, small, negative y (right).

Figure 3: Gradient descent dynamics in the quadratic regression model. As z initialization variance σ 2z increases, so does the curvature λ max upon convergence. As sharpening drives ηλ max near 2, larger σ z allows for non-linear effects to induce edge-of-stability behavior (right). Resulting loss trajectories are non-monotonic but still converge to 0 (left).

Figure 4: σz /σ 2J phase planes for quadratic regression models, for various D and P . Models were initialized with 100 random seeds for each σz , σJ pair and iterated until convergence. For each pair σz , σ2J we plot the median λ max of the NTK J J. For intermediate σz , where both sharpening and non-linear z dynamics occur, trajectories tend to converge so λ max of the NTK is near 2/η (EOS).

Figure6: Q is approximately constant during edge-of-stability dynamics for FCN trained on CI-FAR10 (left). Projection onto largest eigendirection v 1 (blue and orange) is larger than projection onto random direction (green). Two step difference (z 1 ) t+2 -(z 1 ) t is well approximated by 2z 1 y (middle), leading order term of models with fixed eigenbasis. Non-linear dynamical contribution η 2 Q 1 (Jz 1 v 1 , Jz 1 v 1 ) is small during sharpening, but becomes large immediately preceding decrease in top eigenvalue (right) -as is the case in the simple model.

Figure 7: Phase portraits for symmetric model. Arrows indicate signs of changes in z and T , and grey area represents disallowed coordinates. Dynamics are run from an evenly spaced grid of initializations, and the final value of the curvature T (0) is recorded. Nullclines representing zt+1 -zt = 0 (blue) and T t+1 (0) -T t (0) = 0 (orange) depend on E. Trajectories show progressive sharpening but no edge-of-stability effect (right).

Figure 8: Phase planes of D = 1, P = 2 model. Grey region corresponds to parameters forbidden by positivity constraints on J(ω i ) 2 . For λ > 0, allowed region is smaller and intersects z = 0 at a small range only. Nullclines can be solved for analytically.

Figure 10: Final values of y, normalized deviation from critical value T (0) = 2, for discrete dynamics and ODE approximation. Deviation is well approximated by /2 over a large range (left). Deviations from /2 are O( 2 ) (right).

Figure 11: Dynamics of T for trajectories from Figure 1. For small learning rate η, trajectories converge to T < 2, and for large learning rates trajectories diverge. For intermediate trajectories, we have EOS behavior, where final T is predicted by Theorem 2.1.

Figure 12: Average λmax (0) versus σ z , various D and P (100 seeds).

Figure 14: Scatter plots of initial vs. final normalized maximum eigenvalues ηλ max for quadratic regression models. Trajectories are taken from the data used to generate Figure 4. For large D and P , as the model becomes overparameterized (P > D), a subset of trajectories show tighter EOS behavior where ηλ max concentrates close to 2 for a variety of initializations.

Figure15: Gradient flow trajectories of loss and max NTK eigenvalues for quadratic regression models for varying σ z . As σ z increases, λ max changes more quickly, and is generally increasing. Models with higher σ z converge faster in GF dynamics.

., 2022; Bordelon & Pehlevan, 2022; Yang et al., 2022). Daniel A. Roberts, Sho Yaida, and Boris Hanin. The Principles of Deep Learning Theory. May 2022. doi: 10.1017/9781009023405. Lei Wu, Chao Ma, and Weinan E. How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. Greg Yang. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. arXiv:1910.12478 [cond-mat, physics:math-ph], May 2021. Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer, March 2022.

B.3 TWO-STEP DYNAMICS OF y

It is useful to define dynamical equations in coordinates (z, y) where y is the difference between T (0) and the z nullcline:To lowest order in z and we haveWe note that y = 0, at z = 0 corresponds to T (0) = 2. y near but slightly less than 0 is equivalent to edge-of-stability behavior. For positive z, y = 0 implies T (0) > 2.We can write the dynamics of z and y. The dynamics for z are given by: zt+2 -zt = p 0 (z t , ) + p 1 (z t , )(y t + f z, (z t )) + p 2 (z t , )(y t + f z, (z t )) 2 + p 3 (z t , )(y t + f z, (z t )) 3 (78) We know that the right hand side of this equation is analytic in z, , and (trivially) y as well. By evaluating the multiple continuous derivatives of f , we can write: zt+2 -zt = 2y t zt + y 2 t zt f 1, (z t , y t ) + y t z2 t f 2, (z t )Here, f 1, and f 2, are analytic in z, , and y in some neighborhood around 0.This means that we have the boundsfor some non-negative constants F 1 and F 2 . Note that this bound is independent of . Now we consider the dynamics of y. We have:Since lim z→0,y→0 zt+2 = 0, f z, (z t+2 ) is analytic in some neighborhood of (0, 0, 0). Therefore y t+2 -y t is analytic as well. Substituting, we haveIf we write f z, (z) = f T, (z) + f ∆, (z), then we can write:By the definition of the nullclines, the first four terms vanish. Once again using the differentiability of the nullclines, as well as f 1, and f 2, , we can rewrite the dynamics in terms of the expansion:Here g 1, and g 2, are analytic near zero in z, y, and . We have the boundsfor some non-negative constants G 1 and G 2 . This bound is also independent of .We can summarize these bounds in the following lemma: Where f 1, , f 2, , g 1, , g 2, are all analytic in z, y, and . Additionally, there exist positive zc , y c , and c such thatwhere F 1 , F 2 , G 1 , and G 2 are all non-negative constants.We can use this Lemma to analyze the dynamics for small fixed , for small initializations of z, y.The control of the higher order terms will allow for an analysis which focuses on the effects of the lower order terms.

B.4 PROOF OF THEOREM 2.1

Using Lemma B.1, the dynamics in z and y can be written as:Let < d . Then we can use the bounds from Lemma B.1 to control the contributions of the higher order terms to the dynamics:Lemma B.2. Given constants A > 0 and B > 0, there exist zc and y c such that for zwe have the bounds:Proof. We begin by the following decomposition:where the magnitudes of f 1, , f 2, , g 1, , and g 2, are bounded by F 1 , F 2 , G 1 , and G 2 respectively.Define zc and y c asThe desired bounds follow immediately.We consider an initialization (z 0 , y 0 ) such that z0 ≤ zc and y 0 ≤ y c , and y 0 ≤ z2 0 . Armed with Lemma B.2, we can analyze the dynamics. There are two phases; in the first phase, z is increasing, and y is decreasing. The first phase ends when y becomes negative for the first time -reaching a value of O( ). In the second phase, z is decreasing, and y stays negative and O( ).

B.4.1 PHASE ONE

Let t sm be the time such that for t ≤ t sm , zt ≤ 2z 0 . (We will later show that zt ≤ 2z 0 over the whole dynamics.) For t ≤ t sm , using Lemma B.2, the change in z can be bounded from below byTherefore at initialization, z is increasing. It remains increasing until y t becomes negative, or zt ≥ 2z 0 . We want to show that y t becomes negative before zt ≥ 2z 0 .For any t ≤ t sm , Lemma B.2 gives the following upper bound on y t+2 -y t :Let t -be the first time that y t becomes negative. Since zt is increasing for t ≤ t -, we haveThis gives us the following bound on y t :valid for t ≤ t -and t ≤ t sm .We will now show that t -< t sm . Suppose that t sm ≤ t -. Then at t sm + 2, ztsm+2 > 2z 0 for the first time. Summing the bound in Equation 99, we have:where the second bound comes from the definition of t sm . Using our bound on y t , we have:Since y 0 ≤ z2 0 , ztsm+2 ≤ 2z 0 . However, by assumption ztsm+2 > 2z 0 . We arrive at a contradiction; t sm is not less than or equal to t -.There are three possibilities: the first is that t -is well-defined, and t -< t sm . Another possibility is that t -is not well-defined -that is, y t never becomes negative. In this case the bounds we derived are valid for all t. Therefore using Equation 102, there exists some time t where y t < (4 -B) z2 0 . Then, using Equation 101 we have y t +2 < 0. Therefore, we conclude that t -is finite and less than t sm .Since the well defined value t -< t sm ,, when y first becomes negative, zt-≤ 2z 0 . This means that we can continue to apply the bounds from Lemma B.2 at the start of the next phase. At t = t --2, applying Lemma B.2 and zt-≤ 2z 0 , we havewhich gives us y t-≥ -4(4 + B) z2 0 . This concludes the first phase. To summarize we have -4(4 + B) z2 0 < y t-≤ 0, zt-≤ 2z 0 (106)

B.4.2 PHASE TWO

Now consider the second phase of the dynamics. We will show that y remains negative and O( ), and z decreases to 0. While y is negative, z decreases. While y ≥ -y 0 , from Lemma B.2 we haveTherefore as long as -y 0 ≤ y < 0, zt is decreasing. If this is true for all subsequent t, z0 will converge to 0.We will now show that y remains negative and O( ), concluding the proof. LetWe can re-write the dynamical equation for y asApplying Lemma B.2 to the higher order terms, we have:These inequalities are valid as long as |y t | < y c .At t -, y * < y t < 0. When y * < y t < 0, then |y t | ≤ |y * |. Note that < 2|y * |. From Equation 109, we haveFrom this inequality we can conclude thatIf B < 1, then both terms are negative. We can conclude that if y * < y t < 0, y t+2 < 0. In fact, from the last term we can conclude that y t+2 < -4 z2 t . Now we must show that when y * < y t < 0, y t+2 does not become too negative (namely, smaller than -y c ). Using Equation 110, we have:This means that if y t starts larger than y * , it will be at most 3B z2 0 y * below y * at the next step. Since B < 1, y t+2 > -y c if y * < y t < 0.Finally, we will show that if y * (1 + 3B/(8 -B)) < y t < y * , y * (1 + 3B/(8 -B)) < y t+2 < 0. Since y t-+2 fits this condition, we can conclude that y t is negative for all t > t -, with magnitude bounded from below by y * (1 + 3B/(8 -B)), and complete the proof.We will first show that y * (1 + 3B/(8 -B)) < y t implies that y * (1 + 3B/(8 -B)) < y t+2 . Let y t = (1 + δ t )y * , for δ t < 3B/(8 -B). We will show that δ t+2 < 3B/(8 -B). Using Equation 110, we have:Substituting y t+2 = (1 + δ t+2 )y * , and dividing both sides by y * we haveIf δ t < 3B/(8 -B), then we have δ t+2 < 3B/(8 -B) as desired.Finally, we will show that 0which gives usFinally, we make some choices of B and z0 to guarantee convergence. Choose z2 0 < 3/7, and choose B < 1 2 . Then in summary, what we have shown for phase two is:• At the start of the phase (time t -), y * < y t-< 0.Through our choices of z0 and B, we know that [1 + 3B/(8 -B)]y * < y * (1 + 3B z2 0 ). Therefore, the entire trajectory for t > t -is accounted for by these regions, and [1 + 3B/(8 -B)]y * < y t < 0 for all t > t -. Additionally, we know that at least once every 2 steps, y t < -4 z2 t . This means that the dynamics of zt can be bounded from above byWe trained a CIFAR model using the first two classes only with 5000 datapoints using the Neural Tangents library (Novak et al., 2019) -which let us perform 2nd and 3rd order Taylor expansions of the model at arbitrary parameters. The models were 2-hidden layer fully-connected networks, with hidden width 256 and Erf non-linearities. Models were initialized with the NTK parameterization, with weight variance 1 and bias variance 0. The targets were scalar valued -+1 for the first class, -1 for the second class. A learning rate of 0.003204 was used in all experiments. All plots were made using float-64 precision.Taking a quadratic expansion at initialization, we see that the loss tracks the full model for the first 1000 steps in this setting (Figure 16 , left), but misses the edge-of-stability behavior. We use Neural Tangents to efficiently compute the NTK to get the top eigenvalue λ 1 (and consequently, y). We can also compute z 1 by computing the associated eigenvector v 1 and projecting residuals z. If the quadratic expansion is taken closer to the edge of stability, the dynamics of z 1 well approximates the true z 1 dynamics, up to a shift associated with exponential growth of z 1 occurring at different times (Figure 16 , middle). We see that the shape of the first peak in |z 1 | is the same for the full model and the quadratic model, but the subsequent oscillations are faster and more quickly damped in the full model. This suggests that the initial EOS behavior may be captured by the quadratic model, but the detailed dynamics require an understanding of higher order terms. For example, the 3rd order Taylor expansion improves the prediction of the magnitude and period of the oscillations, but still misses key quantitative features (Figure 16 , right). 

