UNDERSTANDING EDGE-OF-STABILITY TRAINING DYNAMICS WITH A MINIMALIST EXAMPLE

Abstract

Recently, researchers observed that gradient descent for deep neural networks operates in an "edge-of-stability" (EoS) regime: the sharpness (the maximum eigenvalue of the Hessian) is often larger than the stability threshold 2/η (where η is the step size). Despite this, the loss oscillates yet converges in the long run, and the sharpness at the end is just slightly below 2/η. While many other well-understood nonconvex objectives, such as matrix factorization or two-layer networks, can also converge despite large sharpness, there is often a larger gap between the sharpness of the endpoint and 2/η. In this paper, we study the EoS phenomenon by constructing a simple function that exhibits the same behavior. We give a rigorous analysis of its training dynamics in a large local region and explain why the final converging point has sharpness close to 2/η. Globally, we observe that the training dynamics for our example exhibit an interesting bifurcating behavior, which has also been observed in the training of neural nets.

1. INTRODUCTION

Many works have tried to understand how simple gradient-based methods can optimize complicated neural network objectives. Recently, however, empirical observations showed that optimization for deep neural networks may operate in a more surprising regime. In particular, Cohen et al. (2021) observed that when running gradient descent on neural networks with a fixed step size η, the sharpness (largest eigenvalue of the Hessian) along the training trajectory often oscillates around the stability threshold of 2/η, while the loss still continues to decrease in the long run. (The value 2/η is called the stability threshold because, for an objective with a fixed Hessian, the gradient descent trajectory becomes unstable if the largest eigenvalue of the Hessian is larger than 2/η.) This phenomenon is called "edge of stability" and has received a lot of attention (see Section 1.2 for related works). While many works try to understand why (variants of) gradient descent can still converge even though the sharpness is larger than 2/η, empirically gradient descent for deep neural networks has even stronger properties. As shown in Fig. 1a, for a fixed initialization, if one changes the step size η, the final converging point has sharpness very close to the corresponding 2/η. We call this phenomenon "sharpness adaptivity". Another perspective on the same phenomenon is that for a wide range of initializations and a fixed step size η, the final converging points all have sharpness very close to 2/η. We call this phenomenon "sharpness concentration". Surprisingly, both sharpness adaptivity and sharpness concentration happen for deeper networks, while for shallower models of non-convex optimization, such as matrix factorization or 2-layer neural networks, the gap between the sharpness and 2/η is often much larger (see Fig. 1b). This suggests that these phenomena are related to network depth. What is the mechanism behind sharpness adaptivity and concentration, and how does it relate to the number of layers? To answer these questions, in this paper we consider a minimalist example of edge of stability.
More specifically, we construct an objective function (a 4-layer scalar network with coupled entries) such that gradient descent on this objective has similar empirical behavior to deeper networks. We give a rigorous analysis for the training dynamics of this objective function in a large local region, which proves that the dynamics satisfy both sharpness adaptivity and sharpness concentration. The global training dynamics for our objective exhibit a complicated fractal behavior (which is also why our rigorous results are local), and such behavior has been observed in the training of neural networks.

[Figure 1: We consider three models: a 5-layer ReLU-activated fully connected network, a 2-layer fully connected linear network with asymmetric initialization factor (4, 0.1) (see Appendix A.1 for explanation), and a 4-layer scalar network equivalent to min_{x,y} (1/4)(1 - x^2 y^2)^2. For each model we run gradient descent from the same initialization using different learning rates. For (a) and (c), the sharpness converges very close to 2/η with the loss continuing to decrease. For (b), the sharpness decreases to be significantly lower than 2/η.]

1.1. OUR RESULTS

The objective function we consider is very simple: L(x, y, z, w) ≜ (1/2)(1 - xyzw)^2. One can view this as a 4-layer scalar network (each layer has a single neuron). We further couple the initialization so that x = z and y = w, so effectively it becomes an objective in two variables: L(x, y) ≜ (1/4)(1 - x^2 y^2)^2. For this objective function we prove its convergence and sharpness concentration properties: Theorem 1.1 (Sharpness Concentration, Informal). For any learning rate η smaller than some constant, there is a constant-size region S_η such that the GD trajectories with step size η from all initializations in S_η converge to global minima with sharpness within (2/η - (20/3)η, 2/η). As a direct corollary, we can also prove that it has the sharpness adaptivity property. Corollary 1.1 (Sharpness Adaptivity, Informal). There exists a constant-size region S and a corresponding range of step sizes K such that for all η ∈ K, the GD trajectory with step size η from any initialization in S converges to a global minimum with sharpness within (2/η - (20/3)η, 2/η). The training dynamics are illustrated in Fig. 2. To analyze the training dynamics, we reparameterize the objective function and show that the 2-step dynamics of gradient descent roughly follow a parabolic trajectory. The extreme point of this parabola is the final converging point, which has sharpness very close to 2/η. Intuitively, the parabolic trajectory comes from a cubic term in the approximation of the training dynamics (see Section 3.1 for detailed discussions). We can also extend our result to a setting where x, y are replaced by vectors; see Theorem 3.2 in Section 3.3. In Section 4 we explain the difference between the dynamics of our degree-4 model and degree-2 models (which are more similar to matrix factorization or 2-layer neural networks). We show that the dynamics for degree-2 models do not have the higher-order terms, and their trajectories form an ellipse instead of a parabola.
In Section 5 we show why it is difficult to extend Theorem 3.1 to global convergence -the training trajectory exhibits fractal behavior globally. Such behaviors can be qualitatively approximated by simple low-degree nonlinear dynamics standard in chaos theory, but are still very difficult to analyze. Finally, in Section 6 we present the similarity between our minimalist model and the GD trajectory of some over-parameterized deep neural networks trained on a real-world dataset. Toward the end of convergence, the trajectory of the deep networks mostly lies on a 2-dimensional subspace and can be well characterized by a parabola as in the scalar case.

1.2. RELATED WORKS

The phenomenon of gradient descent on the Edge of Stability (EoS) was first formalized and empirically demonstrated in Cohen et al. (2021). They show that the loss can decrease non-monotonically even when the sharpness λ > 2/η. The non-monotone behavior of the loss has also been observed in many other settings (Jastrzebski et al., 2020; Xing et al., 2018; Lewkowycz et al., 2020; Wang et al., 2022; Arora et al., 2018; Li et al., 2022a). Recently, several works have tried to understand the mechanism behind EoS with different loss functions under various assumptions (Ahn et al., 2022; Ma et al., 2022; Arora et al., 2022; Lyu et al., 2022; Li et al., 2022b). Ahn et al. (2022) studied the non-monotonic decreasing behavior of gradient descent (which they call unstable convergence) and discussed the possible causes of this phenomenon. From a landscape perspective, Ma et al. (2022) defined a special subquadratic property of the loss function and proved that EoS occurs under this assumption. Despite its simplicity, their model displays the EoS phenomenon without sharpness adaptivity. In contrast, our model focuses on a minimalist scalar network, and we prove the convergence results together with the sharpness adaptivity phenomenon. Arora et al. (2022) and Lyu et al. (2022) studied the implicit bias on the sharpness of gradient descent for some general loss functions. Both works focus on the regime where the parameter is close to the manifold of minimum loss. Arora et al. (2022) proved that with a modified loss √L or using normalized GD, gradient descent enters the EoS regime and has a sharpness-reduction effect around the manifold of minima. Lyu et al. (2022) provably showed how GD enters the EoS regime and keeps reducing spherical sharpness on a scale-invariant objective. In both works, the effective step size η changes throughout the training process, so sharpness adaptivity and concentration do not apply.
Our results start from a simpler example without normalization, whereas the above works focus on general functions with a normalized gradient or scale-invariance property. Another line of work (Lewkowycz et al., 2020; Wang et al., 2022) focuses on the implicit bias introduced by a large learning rate. Lewkowycz et al. (2020) first proposed the "catapult phase", a regime similar to EoS where the loss does not diverge even if the sharpness is larger than 2/η. Wang et al. (2022) provided a convergence analysis on the matrix factorization problem for large learning rates beyond 2/λ, where λ is the sharpness. Their results include two stages: in the first phase, the loss may oscillate but never diverges; the sharpness decreases to enter the second phase, where the loss decreases monotonically. Recently, Li et al. (2022b) provided a theoretical analysis of sharpness along the GD trajectory in a two-layer linear network setting under some assumptions on the training process. These works mostly focus on the degree-2 setting, which does not have the sharpness adaptivity and sharpness concentration properties.

2. PRELIMINARIES AND NOTATIONS

In this section, we introduce the minimalist model which exhibits both sharpness adaptivity and sharpness concentration.

2.1. GRADIENT DESCENT ON PRODUCT OF 4 SCALARS

We focus on the simple objective L(x, y, z, w) ≜ (1/2)(1 - xyzw)^2. The learnable parameters x, y, z, w ∈ R are trained using gradient descent with a fixed step size η ∈ R_+:

(x_{t+1}, y_{t+1}, z_{t+1}, w_{t+1}) = (x_t, y_t, z_t, w_t) - η∇L(x_t, y_t, z_t, w_t).   (1)

Here x_t denotes the value of parameter x after the t-th update. To further simplify the problem, we consider the symmetric initialization z_0 = x_0, w_0 = y_0. Due to the symmetry of the objective, the identical entries remain identical throughout the training process, so the training dynamics reduce to two dimensions and the 1-step update of x and y follows

x_{t+1} = x_t - η x_t y_t^2 (x_t^2 y_t^2 - 1),   y_{t+1} = y_t - η x_t^2 y_t (x_t^2 y_t^2 - 1).   (2)

It is easy to show that the set of global minima for this function forms the hyperbola xy = 1. Without loss of generality we focus on the case where x, y > 0, and in most of the analysis we also focus on the side where x > y. As shown in Fig. 2, with GD running on such a minimal model, we observe convergence on the EoS for a wide range of initializations. Eventually all such trajectories converge to minima that are just slightly flatter than the "EoS minima" (the minima whose sharpness is exactly 2/η; see Definition 1).

[Figure 2: The sharpness of all trajectories converges to around the corresponding stability threshold 2/η while the loss decreases exponentially. In the 2D trajectory, the 2-step movement quickly converges to some smooth curves ending very close to the EoS minimum. In (b) we demonstrate sharpness concentration by running GD with constant learning rate η = 0.2 for 50000 iterations from a dense grid of initializations and plotting the sharpness of their converging minima. Initializations in the red shaded area all converge to a minimum with sharpness in (2/η - 0.1, 2/η).]
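The coupled dynamics in Eq. (2) can be simulated in a few lines. The sketch below uses hypothetical values for η and the initialization, chosen small enough that GD stays in the stable regime (rather than on the EoS), and checks that the iterates land on the hyperbola of minima xy = 1:

```python
def gd_step(x, y, eta):
    # one GD step on L(x, y) = (1/4) * (1 - x**2 * y**2) ** 2, per Eq. (2)
    g = x * x * y * y - 1.0
    return x - eta * x * y * y * g, y - eta * x * x * y * g

eta, x, y = 0.05, 1.5, 0.5   # hypothetical values; eta is small enough
for _ in range(5000):        # that this run stays in the stable regime
    x, y = gd_step(x, y, eta)
print(x * y)                 # lands on the hyperbola of minima xy = 1
print(2 * (x * x + y * y))   # sharpness at the minimum, well below 2/eta = 40
```

The same loop with a larger η and a sharper initialization reproduces the oscillating EoS trajectories of Fig. 2.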

2.2. EOS MINIMA AND REPARAMETERIZATION

Given that a wide range of initializations all converge very close to the "EoS minima" with sharpness 2/η, we want to characterize those points concretely. The complete calculations are deferred to Appendix B.1. Denote γ = xy. The Hessian of the objective L at (x, x, y, y) admits eigenvalues

λ_{1,2} = (1/2) [ (x^2 + y^2)(3γ^2 - 1) ± sqrt( (x^2 + y^2)^2 (1 - 3γ^2)^2 + 4γ^2 (3 - 10γ^2 + 7γ^4) ) ]   (3)

and λ_3 = x^2 (1 - γ), λ_4 = y^2 (1 - γ). When (x, y) converges to any minimum, γ = xy = 1, so λ_2, λ_3, λ_4 all vanish. Therefore it is λ_1 that corresponds to the EoS phenomenon people observe. When η < 1/2, solving λ_1 = 2/η with x^2 y^2 = 1 gives x = ±(1/√2)((η^{-2} - 4)^{1/2} + η^{-1})^{1/2}, y = ±√2((η^{-2} - 4)^{1/2} + η^{-1})^{-1/2}, and their multiplicative inverses. These solutions correspond to the minima with sharpness exactly equal to the EoS threshold of 2/η. Since they are all symmetric to each other, without loss of generality we pick the minimum of interest as follows. Definition 1 (η-EoS Minimum). For any step size η ∈ (0, 1/2), the η-EoS minimum under the (x, y)-parameterization is

(x̄, ȳ) ≜ ( (1/√2)((η^{-2} - 4)^{1/2} + η^{-1})^{1/2}, √2((η^{-2} - 4)^{1/2} + η^{-1})^{-1/2} ).

Though we are able to obtain a closed-form expression for the EoS minimum, its x-y coordinates can still be tricky to analyze. Thus we consider the following reparameterization: for any (x, y) ∈ {(x, y) ∈ R_+ × R_+ : x > y}, define c ≜ (x^2 - y^2)^{1/2} and d ≜ xy. This gives a bijective continuous mapping between {(x, y) ∈ R_+ × R_+ : x > y} and {(c, d) ∈ R_+ × R_+}. This is a natural reparameterization since, intuitively, the basis of the new coordinate system consists of the two orthogonal families of hyperbolas xy = C and x^2 - y^2 = C. The former captures the movement orthogonal to the manifold of minima xy = 1, while the latter captures the movement along the manifold of minima.
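Definition 1 can be sanity-checked numerically: on the manifold xy = 1 (γ = 1), the expression in Eq. (3) reduces to λ_1 = 2(x^2 + y^2), and at the η-EoS minimum this equals 2/η exactly. A short check (η = 0.2 is an arbitrary choice in (0, 1/2)):

```python
import math

eta = 0.2                                   # any step size in (0, 1/2)
s = math.sqrt(eta**-2 - 4) + 1 / eta        # shorthand: (eta^-2 - 4)^(1/2) + eta^-1
x_bar = math.sqrt(s / 2)                    # = (1/sqrt(2)) * s^(1/2)
y_bar = math.sqrt(2 / s)                    # = sqrt(2) * s^(-1/2)
print(x_bar * y_bar)                        # on the manifold of minima: xy = 1
print(2 * (x_bar**2 + y_bar**2) - 2 / eta)  # lambda_1 - 2/eta vanishes
```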
Note that a similar separation of dynamics was also used in Arora et al. (2022). Definition 2 (η-EoS Reparameterization). For any step size η > 0 and any (x, y) ∈ R_+ × R_+ with x > y, the (a, b)-reparameterization of (x, y) is given by

(a, b) ≜ ( (x^2 - y^2)^{1/2} - (η^{-2} - 4)^{1/4}, xy - 1 ),

i.e., the offset of (c, d) from the η-EoS minimum (c, d) = ((η^{-2} - 4)^{1/4}, 1). Let κ ≜ √η. Following Eq. (2), the 1-step update under the reparameterization becomes

a_{t+1} = -(κ^{-4} - 4)^{1/4} + (a_t + (κ^{-4} - 4)^{1/4}) (1 - ((1 + b_t)^3 - (1 + b_t))^2 κ^4)^{1/2},
b_{t+1} = b_t + ((1 + b_t)^3 - 2(1 + b_t)^5 + (1 + b_t)^7) κ^4 + ((1 + b_t) - (1 + b_t)^3) (4(1 + b_t)^2 κ^4 + (a_t κ + (1 - 4κ^4)^{1/4})^4)^{1/2}.   (6)

Now we can proceed to analyze the dynamics of this simple example.
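Since Eq. (6) is just Eq. (2) rewritten in the (a, b) coordinates, the two must trace the same trajectory. The sketch below checks this numerically for one step at an arbitrary point with x > y > 0 (the specific values are hypothetical, and the formulas follow our reading of Eq. (6)):

```python
import math

def xy_step(x, y, eta):
    # one GD step in the original coordinates, Eq. (2)
    g = eta * (x * x * y * y - 1.0)
    return x * (1 - g * y * y), y * (1 - g * x * x)

def ab_step(a, b, kappa):
    # one step of the reparameterized dynamics, Eq. (6)
    r = (kappa**-4 - 4) ** 0.25            # the (eta^-2 - 4)^(1/4) offset
    B = 1 + b
    a1 = -r + (a + r) * math.sqrt(1 - ((B**3 - B) * kappa**2) ** 2)
    b1 = (b + (B**3 - 2 * B**5 + B**7) * kappa**4
            + (B - B**3) * math.sqrt(4 * B**2 * kappa**4
                                     + (a * kappa + (1 - 4 * kappa**4) ** 0.25) ** 4))
    return a1, b1

eta = 0.04                                  # hypothetical step size
kappa = math.sqrt(eta)
x, y = 1.4, 0.8                             # hypothetical point with x > y > 0
r = (eta**-2 - 4) ** 0.25
a, b = math.sqrt(x * x - y * y) - r, x * y - 1
x1, y1 = xy_step(x, y, eta)
a1, b1 = ab_step(a, b, kappa)
print(a1 - (math.sqrt(x1 * x1 - y1 * y1) - r))  # both differences vanish:
print(b1 - (x1 * y1 - 1))                       # same trajectory, two coordinate systems
```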

3. DYNAMICS OF GRADIENT DESCENT ON DEGREE-4 MODEL

In this section, we rigorously analyze the training dynamics characterized by Eq. (6). First we introduce approximations of the one- and two-step updates and build up intuition about the dynamics. Then we present our main theoretical results showing that the degree-4 model exhibits both characteristic phenomena of EoS training.

3.1. APPROXIMATING 1-STEP AND 2-STEP UPDATES

Here we introduce the informal approximation of Eq. (6) and the corresponding two-step updates. For cleanness of presentation we use ≈ to hide all dominated terms. The rigorous statements of the approximations and the corresponding proofs are deferred to Appendix B.3. When describing the one/two-step dynamics, we use a, a', a'' to denote a_t, a_{t+1}, a_{t+2} and b, b', b'' to denote b_t, b_{t+1}, b_{t+2}:

a' ≈ a - 2b^2 κ^3,   b' ≈ -b - 4abκ - 3b^2 - b^3;
a'' ≈ a - 4b^2 κ^3,  b'' ≈ b + 8abκ - 16b^3.   (7)

In this approximation, a is monotonically decreasing at a steady rate of 2b^2 κ^3 per step. The one-step update of b flips its sign and contains second- and third-order terms in b. For the two-step approximation, however, the oscillating behavior and the even-order terms in b all cancel. This is consistent with the analysis in Arora et al. (2022) that the two-step dynamics travel along a sharpness-reducing flow. Before proceeding to analyze the discrete GD movement, we first build intuition by approximating the two-step dynamics with a simple ODE:

db/da = (b'' - b)/(a'' - a) = (16b^3 - 8abκ)/(4b^2 κ^3).   (8)

This is the limit when κ goes to 0 and the movement of the two-step dynamics becomes very small. The general solution of Eq. (8) is

b^2 = (1/2)aκ + (1/16)κ^4 + C exp(8aκ^{-3})   (9)

for some constant C ∈ R. As a decreases following Eq. (7), the trajectory converges toward the parabola b^2 = (1/2)aκ + (1/16)κ^4. The convergence to the parabola is exponential with respect to a, so if a is initialized positive and not too small, the trajectory will converge to a minimum very close to a = -κ^3/8, as shown in Fig. 3. This is a minimum that is just slightly flatter than the κ^2-EoS minimum.

[Figure 3: The two-step trajectory in the (a, b)-plane, showing the manifold of global minima, the κ^2-EoS minimum, the parabola b^2 = (1/2)aκ + (1/16)κ^4, and the converging minimum at a = -κ^3/8.]

Theorem 3.1. For a large enough absolute constant K and a small enough step size κ^2, suppose the initialization (a_0, b_0) satisfies a_0 ∈ (12κ^{5/2}, (1/4)K^{-2}κ^{-1}) and b_0 ∈ (-K^{-1}, K^{-1}) \ {0}. Consider the GD trajectory characterized in Eq.
(6) with fixed step size κ^2 from (a_0, b_0): for any ε > 0 there exists T = O(K^{-2}κ^{-15/2} + (log(ε^{-1}) + log(|b_0|^{-1}))κ^{-7/2}) such that for all t > T, |b_t| < ε and a_t ∈ (-(5/3)κ^3, -(1/10)κ^3). In the context of the (x, y) coordinates and sharpness, Theorem 3.1 gives the following corollary. Corollary 3.1 (Sharpness Concentration under (x, y)-Parameterization). For a large enough absolute constant K, suppose η < (1/8000000)K^{-2}, and the initialization (x_0, y_0) satisfies x_0 ∈ (x̄ + 13η^{5/4}, x̄ + (1/5)K^{-2}η^{-1/2}) and |x_0 y_0 - 1| ∈ (0, K^{-1}), where (x̄, ȳ) is the η-EoS minimum defined in Definition 1. The GD trajectory characterized in Eq. (2) with fixed step size η from (x_0, y_0) will converge to a global minimum with sharpness λ ∈ (2/η - (20/3)η, 2/η). Note that when the step size η (and hence κ) is relatively small, the final sharpness is very close to 2/η. The range of initializations satisfying the requirement is quite large: in the original (x, y)-parameterization it contains a box of width Θ(K^{-2}η^{-1/2}) and height Θ(K^{-1}η^{1/2}). Many of the initial points can be far from the EoS minimum. The complete proofs are deferred to Appendix B.6. Here we discuss the proof sketch of Theorem 3.1. Our convergence analysis focuses on the 2-step update and contains two phases (the regions I-VI are defined in Fig. 4).

Phase 1.

• We show that once (a, b) enters II, it stays in II until it exits from the left and enters IV (Lemma 11). Thus after Phase 1, all initializations will be in IV.

Phase 2. (Convergence along parabola)

• After (a, b) enters IV, we show that it further converges to the parabola: |b^2 - (1/2)aκ - (1/16)κ^4| < (1/200)κ^4 will be satisfied before a decreases to κ^{5/2} and the iterate enters V (Lemma 13). • We then show that this inequality is preserved in V while the iterate moves left until it enters VI (Lemma 14). • In VI, the dynamics are again similar to III, but with a being negative. We conclude our proof by showing that |b| converges to 0 superlinearly with respect to a (Lemma 15). Following Theorem 3.1, we can also formally characterize the sharpness-adaptive phenomenon for a local region using the following corollary. The proof is deferred to Appendix B.6.2.

[Figure 4: The regions I-VI in the (a, b)-plane, delimited by the lines a = 2κ^{5/2}, a = κ^{5/2}, a = -(17/200)κ^3 and the curves b^2 = 2aκ, b^2 = (1/4)aκ, b^2 = (1/2)aκ + (1/16)κ^4, together with the κ^2-EoS minimum at (0, 0) and the band |b^2 - (1/2)aκ - (1/16)κ^4| < (1/200)κ^4.]

Corollary 3.2 (Sharpness Adaptivity). For a large enough constant K, fix any α < (1/(2000√2))K^{-1}. For all initializations (x_0, y_0) in the region characterized by x_0 ∈ (α^{-1} + (1/15)K^{-2}α^{-1}, α^{-1} + (1/6)K^{-2}α^{-1}) and |x_0 y_0 - 1| ∈ (0, K^{-1}), the GD trajectory from (x_0, y_0) characterized by Eq. (2) with any step size η ∈ (α^2 - (1/10)K^{-2}α^2, α^2) will converge to a minimum with sharpness λ ∈ (2/η - (20/3)η, 2/η).
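The two-phase picture can be reproduced with the approximate two-step map from Eq. (7). In the sketch below (κ and the initialization are hypothetical, with a_0 > 0 and b_0 small and nonzero), the iterate first climbs onto the parabola, then drifts left and settles with b → 0 and a slightly below -κ^3/8:

```python
kappa = 0.3              # hypothetical step-size parameter (eta = kappa^2)
a, b = 0.1, 0.05         # hypothetical start: a > 0, b small and nonzero
for _ in range(5000):
    # approximate two-step update from Eq. (7)
    a, b = a - 4 * b * b * kappa**3, b + 8 * a * b * kappa - 16 * b**3
print(a / kappa**3)      # slightly below -1/8: a ends just below -kappa^3/8
print(abs(b))            # the oscillation has died out
```

The final a lands inside the interval (-(5/3)κ^3, -(1/10)κ^3) of Theorem 3.1.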

3.3. CONVERGENCE ON EOS FOR RANK-1 FACTORIZATION OF ISOTROPIC MATRIX

Inspired by the scalar factorization problem, we extend it to a rank-1 factorization of an isotropic matrix. In particular, we consider the following optimization problem:

min_{x,y∈R^d} (1/4) ‖I_{d×d} - xy^⊤xy^⊤‖_F^2.   (11)

Similar to the under-parameterized case in Wang et al. (2022), this problem also guarantees the alignment between x and y if (x, y) is a global minimum, i.e., x = cy for some c ∈ R. To prove convergence for Eq. (11) at the edge of stability, we first prove that the alignment is achieved quickly. After the alignment, we prove the equivalence between this problem and the degree-4 scalar model, and then prove the convergence of this problem. We directly give the final theorem; the proof is deferred to Appendix C. The experiments demonstrate a similar EoS phenomenon (see Appendix A.5). Theorem 3.2. For a large enough absolute constant K, for any initialization (x_0, y_0) satisfying x_0 ∼ δ_x Unif(S^{d-1}), y_0 ∼ δ_y Unif(S^{d-1}), δ_x δ_y = 1/2, δ_x ∈ (x̄ + (1/80)K^{-2}η^{-1/2}, x̄ + (1/8)K^{-2}η^{-1/2}), if the step size η < min{K^{-4}/8000000, K^{-2}/(20000 + 2000(log(d) - log(δ_0)))}, and a multiplicative perturbation y'_t = y_t(1 + 2K^{-1}) is performed at time t = t_p for some t_p > O(-log(η) + log(d) - log(δ_0) + K^3), then for any ε > 0, with probability p > 1 - 2δ_0 - 2 exp{-Ω(d)} there exists T = O(K^{-2}κ^{-15/2} - log(ε) - log(δ_0)) such that for all t > T, L(x_t, y_t) < ε and ‖x_t‖^2 + ‖y_t‖^2 ∈ (1/η - (10/3)η, 1/η). Note that we require an additional perturbation because we need to guarantee that the trajectory does not converge to an unstable point (where the sharpness λ > 2/η). This was proved without perturbation in the scalar case but is more challenging in higher dimensions. The objective will still converge to a minimum very close to an η-EoS minimum. The experiment results are available in Appendix A.5.

4. DIFFERENCES IN DEGREE-2 AND HIGHER DEGREE MODELS

In this section, we look at some similar models of lower degree and explain why, for degree-2 models, the sharpness of the final converging point is often farther from 2/η compared to higher-degree models. We use methods similar to those in Section 2 and Section 3 to gain intuition for the dynamics. Previous works including Chen & Bruna (2022) and Wang et al. (2022) have studied the dynamics of beyond-EoS training on the problem of factorizing a single scalar or an isotropic matrix into two components. The objectives studied include min_{x,y∈R^d} (µ - x^⊤y)^2, min_{x,y∈R^d} ‖µI_d - xy^⊤‖_F^2, and the corresponding scalar case min_{x,y∈R} (µ - xy)^2. They were able to show that for initializations with sharpness greater than 2/η, GD with constant learning rate η provably converges to a global minimum with sharpness less than or equal to 2/η. Empirically, the sharpness-reduction process on these 2-component objectives usually "overshoots" the EoS threshold and converges to a minimum that is significantly flatter than the EoS minimum, and one does not observe the oscillation of sharpness around the EoS threshold (see Appendix A.3). In this section we consider the scalar objective min_{x,y∈R} (1 - xy)^2, since it captures the major dynamical properties of those more complex objectives, as discussed in Wang et al. (2022). As shown in Fig. 5, initializations with sharpness exceeding the EoS threshold converge to minima that are distinguishably flatter than the EoS minimum, and globally there is no region of initializations that gives EoS convergence. Unlike the parabola in the degree-4 case, the 2-step update travels along a roughly circular trajectory centered at the κ^2-EoS minimum, as shown in Fig. 5a (right). Therefore, locally we observe that sharper initializations tend to converge to flatter minima. The difference between the degree-2 and degree-4 cases can be easily explained by a local expansion.
Using the same (c, d)-reparameterization and setting (a, b) to be the offset of (c, d) from the EoS minimum, the two-step update of (a, b) under learning rate κ^2 can be approximated by a'' ≈ a - √2 b^2 κ^3, b'' ≈ b + 4√2 abκ. This is very similar to Eq. (7), except that we no longer have the -b^3 term in the 2-step update of b, which was attracting b toward 0. In this case, the ODE approximation db/da = -4a/(bκ^2) gives the general solution b^2 = 4(C - a^2)/κ^2 for C ∈ R_+, which corresponds to the family of ellipses centered at (0, 0) and matches the two-step trajectory in Fig. 5a.

[Figure 5: The same experiment as in Fig. 2, but with the objective min_{x,y∈R} (1 - xy)^2. Note that in this case the two-step trajectories form circular curves and converge to points that are farther from the EoS minima.]

In Appendix A.2.3, we discuss a degree-3 model exhibiting mixed behavior around different EoS minima, which further verifies the explanation above. We also note empirically that the coupling of entries arises naturally when training general scalar networks from non-coupled initializations, so it is not an artifact we have to impose on the model to observe EoS (see Appendix A.4).
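To leading order, the degree-2 two-step map conserves the ellipse invariant a^2 + κ^2 b^2/4 = C (equivalently b^2 = 4(C - a^2)/κ^2): the first-order changes 2a·Δa and (κ^2/2)b·Δb cancel exactly, leaving only a small second-order drift. A numerical check with hypothetical values:

```python
import math

kappa = 0.1              # hypothetical values; the smaller kappa is,
a, b = 0.02, 0.1         # the better the conservation
C0 = a * a + (kappa * b) ** 2 / 4     # ellipse invariant a^2 + kappa^2 b^2 / 4
for _ in range(1000):
    # approximate degree-2 two-step update
    a, b = a - math.sqrt(2) * b * b * kappa**3, b + 4 * math.sqrt(2) * a * b * kappa
C1 = a * a + (kappa * b) ** 2 / 4
print(C0, C1)            # nearly equal: the orbit stays close to one ellipse
```

Because the invariant is (approximately) conserved, there is no attraction toward a single curve, unlike the degree-4 parabola.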

5. GLOBAL TRAJECTORY AND CHAOS

There exists very limited global convergence analysis for constant-step-size gradient descent training beyond EoS on complicated non-convex objectives. Even for the product of 4 scalars, the boundary separating converging and diverging initializations (Fig. 6a) exhibits complicated fractal structures. Moreover, we observe that for initializations close to this boundary, the GD training trajectories usually begin with a phase of chaotic oscillation which eventually "de-bifurcates" and converges to the parabolic two-step trajectory discussed in Section 3. A similar oscillation phenomenon has also been observed empirically by Ruiz-Garcia et al. (2021) in neural networks when they increase the learning rate and destabilize the network from a local trajectory. So what is causing the bifurcation? Previously, Ruiz-Garcia et al. (2021) attributed the phenomenon to the cascading effect of oscillation along multiple large eigendirections of the network. Yet this explanation is unsatisfying for our simple model, as there is only one oscillating direction. Looking closely at the trajectory (Fig. 6b), one finds it very similar to the bifurcation diagram of self-recurrent polynomial maps (such as the famous logistic map x_{t+1} = r x_t (1 - x_t), parameterized by r). In the degree-4 model, the existence of such a self-recurrent map is explicit: following Eq. (7), the approximate 2-step update of b can be rewritten as b'' = b(1 + 8aκ - 16b^2). If we consider a to be relatively stationary, the trajectory of b is locally characterized by the self-recurrent 1D nonlinear dynamical system b_{t+1} = b_t(1 + 8aκ - 16b_t^2), parameterized by a. In Fig. 6c, we compute the bifurcation diagram for this recurrent map numerically and see that they are qualitatively similar.
Following this analogy, one may relate the first bifurcation point to the EoS minimum that the trajectory eventually converges to, and the non-bifurcating regime of the polynomial map to the "sub-EoS regime" to the left (in Fig. 6b) of the EoS minimum.
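The transition in this 1D map can be checked directly. At the nonzero fixed point b* = sqrt(aκ/2), the multiplier of the map is 1 - 16aκ, so for 16aκ < 2 the orbit settles onto b*, while past that point the fixed point is unstable and the orbit keeps oscillating. A sketch with hypothetical values of κ and a:

```python
def orbit(a, kappa, b0=0.1, burn=500, keep=200):
    # iterate the self-recurrent map b <- b * (1 + 8*a*kappa - 16*b^2)
    b = b0
    for _ in range(burn):
        b = b * (1 + 8 * a * kappa - 16 * b * b)
    tail = []
    for _ in range(keep):
        b = b * (1 + 8 * a * kappa - 16 * b * b)
        tail.append(b)
    return tail

kappa = 0.5  # hypothetical value
# below the period-doubling point (16*a*kappa < 2): a stable fixed point
small = orbit(0.2, kappa)
print(small[-1])                 # settles at b* = sqrt(a*kappa/2) ~ 0.2236
# well past it (16*a*kappa = 4): the tail never settles
large = orbit(0.5, kappa)
print(max(large) - min(large))   # sustained, bounded oscillation
```

Sweeping a over a grid and plotting the tails reproduces a bifurcation diagram qualitatively similar to Fig. 6c.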

6. CONNECTION TO REAL-WORLD MODELS

In this section we show how the degree-4 model analyzed above resembles the converging dynamics of over-parameterized regression models trained on a real-world dataset. We train a 5-layer ELU-activated fully connected network on a 2-class small subset of CIFAR-10 (Krizhevsky et al., 2009) with GD. The loss converges to 0 and the sharpness converges to just slightly below 2/η. We visualize the dynamics by projecting the trajectory onto the subspace spanned by the top Hessian eigenvector at the minimum (the oscillation direction) and the movement direction of the parameters orthogonal to the oscillation (see Definition 4 in Appendix A.6.1 for the exact characterization). As shown in Fig. 7 (mid), after some initial bifurcation-like oscillation, the 2-step trajectory stabilizes and moves along some smooth curves toward the minimum. Near the minimum (Fig. 7, right), the trajectory in fact lies mostly in this 2-dimensional subspace (see Fig. 23c in the Appendix) and can be very well captured by a parabola, which is very similar to our minimalist example. More experimental results on real-world models are available in Appendix A.6.

7. DISCUSSION AND CONCLUSION

In this paper we proposed a simple degree-4 model that captures the sharpness adaptivity and sharpness concentration phenomena that occur in gradient descent training of deep neural networks. The simplicity of the model allowed us to perform a rigorous analysis of the training dynamics in a large local region. The analysis gives new insights into why the training dynamics of the degree-4 model are inherently different from those of degree-2 models. Finally, we showed that over-parameterized deep networks trained on real data exhibit a parabolic converging trajectory similar to the scalar example. We hope many of these observations can be generalized to highlight the difference between the training dynamics of deeper networks and shallower models. There are still many open problems. Can we identify the hidden dynamics of the real-world model that yields the parabolic converging trajectory? Can we theoretically understand the automatic coupling of small entries as discussed in Appendix A.4? Is there a way to understand and leverage the fractal/bifurcation behavior in Section 5 toward a global dynamics analysis?

Supplementary Materials for Understanding Edge-of-Stability

Training Dynamics with a Minimalist Example

A ADDITIONAL EXPERIMENTS

In this section, we provide more empirical evidence supporting the main text. In Appendix A.1, we first introduce our experiment setup, including the structure and initialization of the neural network models as well as the generative model for the synthetic datasets. In Appendix A.2, we present some additional figures demonstrating the training dynamics near the EoS minima for the degree-2 and degree-4 examples discussed in Section 3 and Section 4. We also discuss a degree-3 example exhibiting different behavior around different EoS minima, and explain the phenomenon using our understanding of the degree-2 and degree-4 models. In Appendix A.3, we provide additional empirical evidence that shallow neural networks usually do not converge to the exact EoS threshold. In Appendix A.4, we present some results on the training dynamics of scalar networks without the coupled initialization; we empirically show that the coupling of entries arises along the training process. In Appendix A.5, we demonstrate the EoS phenomenon on the rank-1 factorization. In Appendix A.6, we introduce the experiment on learning real-world images (as presented in Section 6) in more detail, and present additional experiment results on networks with different activations and on the local trajectory under perturbation. In Appendix A.7, we present some experiments on the edge-of-stability phenomenon when the model is optimized with stochastic gradient descent.

A.1.1 CALCULATION OF NUMERICAL SHARPNESS

For the scalar network examples the closed-form Hessian is simple: we compute the exact parameter Hessian and use numerical packages to compute its top eigenvalue. For neural networks, we use the PyHessian package by Yao et al. (2020), which computes the top eigenvalue-eigenvector pair by evaluating Hessian-vector products and running power iteration. For all numerical sharpness values computed for neural networks, we set tol = 1e-6 and max iter = 10000.
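For the scalar example, the pipeline just described can be sketched end to end: a finite-difference Hessian (a simple stand-in for the Hessian-vector products PyHessian uses) followed by power iteration. The point (2, 0.5) is a global minimum of L(x, y) = (1/4)(1 - x^2 y^2)^2, where the sharpness has the closed form 2(x^2 + y^2) = 8.5:

```python
import numpy as np

def loss(p):
    x, y = p
    return 0.25 * (1 - x**2 * y**2) ** 2

def num_hessian(f, p, h=1e-4):
    # central finite-difference Hessian
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return H

def top_eigenvalue(H, iters=200):
    # plain power iteration, followed by a Rayleigh quotient
    v = np.ones(H.shape[0])
    for _ in range(iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return float(v @ H @ v)

p = np.array([2.0, 0.5])                  # a global minimum: xy = 1
lam = top_eigenvalue(num_hessian(loss, p))
print(lam)                                # ~ 2 * (x^2 + y^2) = 8.5
```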

A.1.2 SYNTHETIC EXPERIMENTS

For all experiments involving neural networks on synthetic datasets (Fig. 1a, Fig. 1b, Fig. 15), we use fully connected networks with the same dimension for the input, output, and all hidden layers. The biases of all layers are fixed to 0. Formally, an L-layer width-d network can be modeled by f : R^d → R^d such that for an input vector x ∈ R^d,

f(x) = W_L σ(W_{L-1} ... σ(W_2 σ(W_1 x)) ... ),

where σ : R^d → R^d is some entry-wise activation function and W_l ∈ R^{d×d} for all l ∈ [L]. For this paper we only consider σ being the ReLU activation σ(x) = x · 1_{x≥0} or the identity σ(x) = x.
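A minimal sketch of this architecture (the dimensions and seed are illustrative):

```python
import numpy as np

def forward(weights, x, act):
    # f(x) = W_L act(W_{L-1} ... act(W_1 x) ...): activation after every
    # layer except the last
    h = x
    for W in weights[:-1]:
        h = act(W @ h)
    return weights[-1] @ h

relu = lambda z: np.maximum(z, 0.0)       # sigma(x) = x * 1_{x >= 0}
identity = lambda z: z

d, L = 4, 3                               # illustrative sizes
rng = np.random.default_rng(0)
weights = [rng.standard_normal((d, d)) for _ in range(L)]
x = rng.standard_normal(d)
print(forward(weights, x, relu).shape)    # (4,): the output stays in R^d
```

With the identity activation, `forward` reduces to the matrix product W_L ... W_1 x, which is the linear-network case.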

Initialization of Neural Networks

We use Xavier initialization (Glorot & Bengio, 2010) with a gain of 1 to initialize all weight matrices. For shallow two-layer networks, which do not enter the EoS regime under a completely random initialization, we asymmetrically re-scale the layers after random initialization by multiplying all entries of the same layer by a constant. When we present results for the re-scaled experiments, we state the re-scale factor.

Synthetic Dataset and Loss Function

For the experiments involving neural networks, we use synthetic datasets very similar to the linear network experiment in Section L.3 of Cohen et al. (2021). For a neural network as described above with dimension d, we consider the problem of mapping n inputs x_1, ..., x_n ∈ R^d to n outputs y_1, ..., y_n ∈ R^d. Let X ∈ R^{d×d} and Y ∈ R^{d×d} denote the vertically stacked inputs and outputs respectively (here n = d). We generate X as a whitened matrix such that XX^T = dI_d and generate Y by Y = XA where A = diag(1, (d-1)/d, ..., 2/d, 1/d). For all experiments with neural networks on synthetic datasets, we use the simple squared loss (1/n) Σ_{i=1}^n ∥f(x_i) − y_i∥^2_2, which matches our analysis on scalar networks.
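The construction can be sketched as follows; taking X as a scaled random orthogonal matrix is our own illustrative choice (any X with XX^T = dI_d is whitened in this sense):

```python
import numpy as np

def make_synthetic_dataset(d, seed=0):
    """Whitened inputs X with X @ X.T == d * I_d (rows are samples), and
    targets Y = X @ A with A = diag(1, (d-1)/d, ..., 2/d, 1/d)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal Q
    X = np.sqrt(d) * Q
    A = np.diag(np.arange(d, 0, -1) / d)
    Y = X @ A
    return X, Y, A

def squared_loss(pred, Y):
    """(1/n) * sum_i ||f(x_i) - y_i||^2, with rows as samples."""
    return float(np.mean(np.sum((pred - Y) ** 2, axis=1)))
```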

A.1.3 REAL-WORLD DATA EXPERIMENTS

Here we provide the detailed setting for the experiment results shown in Section 6 in the main text as well as Appendix A.6. We consider a binary classification problem on a subset of CIFAR-10 image classification dataset (Krizhevsky et al., 2009) .

Dataset

To study the training process in an over-parameterized setting (in which the loss can converge close to 0), we take a 50-sample binary subset of CIFAR-10 containing the first 25 samples each of class 0 (airplane) and class 1 (automobile). We then label samples from class 0 by -1 and samples from class 1 by +1. We consider a binary classification problem because for networks with output dimension larger than 2, there is typically not a strong eigengap between the first eigenvalue and the other eigenvalues (Sagun et al., 2016; Papyan, 2018; Wu et al., 2020). The dynamics with multiple eigenvalues around the stability threshold may exhibit cascading oscillations along different eigendirections (Ruiz-Garcia et al., 2021), and could be complicated to analyze.

Network Structure

We conduct the experiment on fully-connected neural networks with four hidden layers of width 200. We consider tanh and ELU as activations. In Table 1 we provide the structure of a fully-connected ELU-activated architecture; this architecture follows the experiments in Li et al. (2022b). For the dataset with n = 50, x_i ∈ R^{3072}, and y_i ∈ {1, -1}, we consider the mean squared loss (1/n) Σ_{i=1}^n ∥f(x_i) − y_i∥^2_2, which matches our theoretical analysis on the scalar example.

Published as a conference paper at ICLR 2023

A.2 ADDITIONAL EXPERIMENTS FOR SCALAR NETWORK EXAMPLES

In this section we show some additional figures demonstrating the local training dynamics and convergence boundary for the two cases we analyzed in Section 3 and Section 4.

A.2.1 4-LAYER SCALAR NETWORK

A.2.2 2-LAYER SCALAR NETWORK

The setting is the same as for the 4-layer case except that we use the degree-2 example, with the same step size η = 0.2. There is no longer the concentration behavior we see for the degree-4 case: locally, only initializations very close to the EoS minimum converge to a sharpness near the stability threshold (which is 10 for η = 0.2).

A.2.3 3-LAYER SCALAR NETWORK

Now we look into an interesting example with different behaviors around different EoS minima. We consider a 3-layer scalar network with objective

min_{x,y,z ∈ R} (1/2)(1 − xyz)^2.

To make the dynamics two dimensional, we consider initializations with z = y. The equality of the last two entries is preserved throughout training, so the dynamics are two dimensional in terms of x and y. In the positive quadrant, the set of global minima is xy^2 = 1, and there are two EoS minima. In Fig. 13, we plot the converging sharpness from different initializations, in comparison with Fig. 2b and Fig. 5b in the main text. Around the EoS minimum where the single entry x is small and the duplicated entries y are large (upper left of Fig. 13), the behavior is similar to the 2-scalar case (Fig. 5b), with no sharpness concentration. Around the EoS minimum with a large single entry and small duplicated entries (lower right of Fig. 13), there is a region of initializations (the red shaded area) with sharpness concentration similar to the 4-scalar case (Fig. 2b). A heuristic explanation for this difference lies in the degree of the small entries. Around the EoS minimum where the single entry is small, the local two-step approximation is similar to Eq. (12) and gives elliptical two-step trajectories. Around the minimum with small duplicated entries, the two-step approximation contains the cubic term as in Eq. (7), which gives both sharpness concentration and adaptivity. This heuristic can be further verified by visualizing the local dynamics around the minima. As shown in Fig. 14, the local dynamics around the minimum with small duplicated entries is similar to the degree-4 example, with convergence toward a parabolic trajectory.
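The preserved coupling z = y can be checked with a small simulation of GD on this objective (the step size and initialization below are our own illustrative choices, in the stable rather than the EoS regime):

```python
def gd_degree3(x, y, z, eta, steps):
    """GD on L(x, y, z) = 0.5 * (1 - x*y*z)^2; returns the trajectory."""
    traj = [(x, y, z)]
    for _ in range(steps):
        r = 1.0 - x * y * z            # residual; grad_x L = -r * y * z, etc.
        x, y, z = x + eta * r * y * z, y + eta * r * x * z, z + eta * r * x * y
        traj.append((x, y, z))
    return traj
```

Starting from z_0 = y_0, the last two coordinates stay equal along the whole trajectory, so the dynamics are effectively two dimensional.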
On the other hand, the local dynamics around the minimum with only one small entry is similar to the degree-2 example, where the parameters follow a locally elliptic trajectory centered at the EoS minimum. The left figure of Fig. 14 corresponds to the EoS minimum at the lower right of Fig. 13: at this EoS minimum, the duplicated entry y is small, and the local behavior is very similar to the degree-4 example (Fig. 9), for which we have provable sharpness concentration. The right figure corresponds to the EoS minimum at the top left of Fig. 13: the local behavior is very similar to the degree-2 example (Fig. 12), for which we do not have EoS behaviors.

A.3 ADDITIONAL EXPERIMENTS FOR 2-COMPONENT SCALAR FACTORIZATION

In this section we present the experiment results for the 2-component scalar factorization deferred from Section 4. The dynamics shown in Fig. 15 are very similar to those of the degree-2 example in Fig. 11. We run gradient descent with η = 0.05 for two different objectives. Both models are asymmetrically initialized with factors (5, 0.1) so that they have an initial sharpness larger than 2/η. In both cases, the converging sharpness is distinguishably smaller than the stability threshold.

A.4 EXPERIMENTS FOR GENERAL SCALAR NETWORKS

In this section, we present some empirical observations on the training dynamics of more general scalar networks related to the sharpness concentration and adaptivity phenomena. An n-layer scalar network is defined to be the model parameterized by n entries x_1, ..., x_n ∈ R with objective

L(x_1, ..., x_n) ≜ (1/2)(1 − Π_{i=1}^n x_i)^2. (17)

A.4.1 INITIALIZATION WITHOUT DUPLICATED ENTRIES

We first consider a variant of the degree-3 example discussed in Appendix A.2.3. In particular, we initialize the two small entries differently and record their values throughout the training trajectory. In Fig. 16 (right) we plot the distance of the last entry x_3 to the other entries. As we can see in Fig. 16, at the very beginning of training the second entry x_2 converges to x_3 geometrically; the dynamics then reduce to the case of duplicated entries, for which we know the sharpness concentration behavior occurs for sufficiently asymmetric initialization. In Fig. 17 we consider a 7-layer scalar network with 3 different large entries and 4 different small entries. We observe similar behavior: x_4, x_5, x_6 all converge to x_7 geometrically, and we observe concentration of sharpness with 4 duplicated small entries. To probe into the detailed training dynamics of general scalar networks, we plot the pairwise dynamics of the entries in Fig. 18. Here x_2 and x_3 are initialized large while x_4 and x_5 are initialized small. We see that within the small entries and within the large entries, the pairwise dynamics are all approximately linear, while the cross comparisons between the small entries and the large entries give the parabolic two-step trajectory (and also some bifurcation behavior). We also plot the value of a large entry (initialized to 3) and a small entry (initialized to 0.1) along the training process: the large entry decreases in an approximately monotone manner, while the small entry increases while oscillating.
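The objective in Eq. (17) is easy to experiment with directly; below is a minimal gradient descent sketch (the initialization and step size are our own illustrative choices, in the stable rather than the EoS regime, so it demonstrates the small-entry gap shrinking rather than full EoS oscillation):

```python
import numpy as np

def gd_scalar_net(x0, eta, steps):
    """GD on L(x_1..x_n) = 0.5 * (1 - prod_i x_i)^2 (Eq. 17).
    Assumes all entries stay nonzero so prod/x_i is well defined."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        p = np.prod(x)
        grad = -(1.0 - p) * p / x      # dL/dx_i = -(1 - p) * prod_{j!=i} x_j
        x = x - eta * grad
        traj.append(x.copy())
    return np.array(traj)
```

For two small entries x_2, x_3 the one-step gap obeys x_2' − x_3' = (x_2 − x_3)(1 − η r x_1) with r = 1 − Π x_i, so the gap contracts while the residual is positive.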
A.4.2 ON "LARGE" AND "SMALL" INITIALIZATIONS

In the experiments shown above, we have seen mainly two classes of behaviors for the entries: the entries that were initialized to be large move slowly with little oscillation, while the entries that were initialized to be small oscillate significantly along the trajectory. Intuitively, it is the decrease of the large entries that reduces the sharpness, stabilizes the oscillating small entries, and results in the final convergence close to the EoS minimum. A natural question is whether there exists a clear boundary separating the "small" and "large" entries. We consider a 4-layer scalar network with initialization (6, 0.7, 0.3, 0.2) optimized with GD with step size η = 0.2. In Fig. 20 and Fig. 21 we visualize the training loss, the sharpness, and the pairwise dynamics; we also plot the distance of the last entry x_4 to the other entries. In this example, the small entries did not converge to exactly the same value, yet the loss still decreased geometrically and the sharpness concentration phenomenon still occurred. Here x_1 was initialized large (at 6), x_3 and x_4 were initialized small (at 0.3 and 0.2), and x_2 was initialized moderately small (at 0.7). We see that the two-step pairwise dynamics between x_2 and x_3 roughly follow a parabolic trajectory, yet the pairwise dynamics between x_1 and x_2 also exhibit similar features while still following a roughly linear relation. These mixed behaviors suggest that a clear boundary between the "large" and "small" entries does not exist, and the complexity of this problem is beyond this simple heuristic.

A.5 ADDITIONAL EXPERIMENTS FOR RANK-1 FACTORIZATION OF ISOTROPIC MATRIX

In this section, we demonstrate the EoS phenomenon on the rank-1 factorization of an isotropic matrix from Section 3.3. We first show that the loss, the sharpness, and the trajectory of GD are very similar to the degree-4 scalar network case. For each learning rate, the sharpness concentrates in a tiny interval close to the corresponding stability threshold 2/η when trained with gradient descent. We also consider the 2D trajectory of the two vectors x, y: plotting the trajectory in terms of the norms ∥x∥, ∥y∥ yields a figure similar to the scalar case. In fact, we can prove that the dynamics of this training objective eventually reduce to our degree-4 scalar network. This is because every global minimum of this optimization problem requires x to be aligned with y, i.e. x = cy; after the two vectors are aligned, the training dynamics of ∥x∥, ∥y∥ are exactly equivalent to those of the scalar network. Fig. 22 shows how GD enters EoS on the rank-1 factorization problem, and how fast the alignment of the two vectors is achieved. Here we consider the alignment indicator ∥x∥^2∥y∥^2 − (x^⊤y)^2, which measures how well the two vectors are aligned: if x is parallel to y, i.e. x = cy, the indicator becomes 0. The detailed analysis for this problem is deferred to Appendix C.

Figure 22: EoS phenomenon on the rank-1 factorization of an isotropic matrix. In (a), similar to the degree-4 scalar case, we demonstrate sharpness adaptivity by running GD with learning rates η = 2^8, 2^10, 2^12 from the same initialization (∥x_0∥∥y_0∥ = 1/2). The sharpness of each trajectory converges to around the corresponding stability threshold 2/η while the loss decreases exponentially. In the 2D trajectory, GD quickly converges along smooth curves ending near the EoS minimum. In (b), we see that in the first 30 iterations, the alignment indicator decreases geometrically and then stabilizes at its numerical minimum.
Theoretically, we characterize the geometric decay of the alignment indicator in Lemma 20 in Appendix C.
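The alignment indicator is a one-line computation; by Cauchy-Schwarz it is nonnegative, and it vanishes exactly when the two vectors are parallel:

```python
import numpy as np

def alignment_indicator(x, y):
    """||x||^2 * ||y||^2 - (x . y)^2; zero iff x and y are parallel."""
    return float(np.dot(x, x) * np.dot(y, y) - np.dot(x, y) ** 2)
```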

A.6 EOS CONVERGENCE FOR DEEP NEURAL NETWORKS TRAINED ON REAL DATA

In this section, we present a more comprehensive description and additional results for the experiment of learning a 2-class, 50-image subset of CIFAR-10 in an over-parameterized regression setting. The details of the network structures and dataset construction are available in Appendix A.1.3. In this experiment, we train two 5-layer fully connected (FC) networks of width 200 with ELU and tanh activations using (full-batch) gradient descent on the binary dataset with the mean squared loss. We chose these two activation functions following the experiments in Cohen et al. (2021); ReLU is not used since its training dynamics as the loss converges to 0 are very unstable. We record the training loss and sharpness of the two training processes. To better visualize the training trajectory, we consider the following projection mechanism.

A.6.1 TRAJECTORY PROJECTION

Inspired by the observations on the scalar example, we note that the dynamics toward the end of convergence have two distinctive directions: an "oscillation direction", which is aligned with the first eigenvector of the Hessian, and a "movement direction", along which the 2-step average of the model moves and converges to the final minimum. In the context of our (a, b)-reparameterization for the theoretical analysis (Definition 2), the oscillation direction corresponds to b and the movement direction corresponds to a. In a local region around the converging minimum, (a, b) forms a parabolic trajectory that is well captured by the solution of the ODE in Eq. (8). In a high-dimensional setting, the oscillation direction is still naturally the top eigenvector at the minimum, but we have to manually pick a movement direction to project onto. To be concrete, consider a trajectory of parameters {θ_1, θ_2, ..., θ_{T-1}, θ_T}, where θ_t ∈ R^d is the parameter vector of the model after the t-th iteration. We define the oscillation direction v_osc as the first eigenvector of the parameter Hessian H(θ_T), and the movement direction v_move(t̃) as follows.

Definition 3 (Movement direction). For some iteration t̃ ∈ [T], define v_move(t̃) as

v_move(t̃) ≜ (1/2)(θ_{t̃-1} + θ_{t̃}) − (1/2)(θ_{T-1} + θ_T).

For a fixed iteration t̃, v_move(t̃) captures the non-oscillatory movement of the parameters from step t̃ to step T. We orthonormalize the basis by projecting v_osc off from v_move(t̃):

ṽ_move(t̃) ≜ v_move(t̃) − proj_{v_osc}(v_move(t̃)), v̂_move(t̃) ≜ ṽ_move(t̃)/∥ṽ_move(t̃)∥, v̂_osc ≜ v_osc/∥v_osc∥.

Now with the orthonormal basis, we define the movement-oscillation projection of θ_t as the projection of its offset from the minimum (which we approximate by the mean of the last two steps of the trajectory) onto v̂_move(t̃) and v̂_osc.

Definition 4 (Movement-Oscillation Projection).
Fix an iteration t̃ for determining the movement direction. The movement-oscillation projection of θ_t is

( v̂_move(t̃)^⊤(θ_t − (1/2)(θ_{T-1} + θ_T)), v̂_osc^⊤(θ_t − (1/2)(θ_{T-1} + θ_T)) ).

When doing the projection in practice (as in Fig. 24 and Fig. 26), we fix t̃ = 5000, which is when the 2-step trajectory becomes relatively stable. We also record the norm of the component of the offset θ_t − (1/2)(θ_{T-1} + θ_T) that is orthogonal to the subspace spanned by v̂_move(t̃) and v̂_osc. These results are shown in Fig. 23c and Fig. 25c.
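Definitions 3 and 4 can be sketched as follows (a minimal numpy version; `traj` stacks the parameter iterates and `v_osc` is the top Hessian eigenvector, which we assume has been computed separately, e.g. with PyHessian):

```python
import numpy as np

def movement_oscillation_projection(traj, v_osc, t_move):
    """Movement-oscillation projection (Definitions 3 and 4).

    traj: (T, d) array of iterates theta_1..theta_T; t_move >= 2 is the
    (1-based) iteration defining the movement direction."""
    traj = np.asarray(traj, dtype=float)
    minimum = 0.5 * (traj[-2] + traj[-1])                 # approx. minimum
    v_move = 0.5 * (traj[t_move - 2] + traj[t_move - 1]) - minimum
    v_osc_hat = v_osc / np.linalg.norm(v_osc)
    v_move_t = v_move - (v_move @ v_osc_hat) * v_osc_hat  # project v_osc off
    v_move_hat = v_move_t / np.linalg.norm(v_move_t)
    offsets = traj - minimum
    proj = np.stack([offsets @ v_move_hat, offsets @ v_osc_hat], axis=1)
    residual = np.linalg.norm(
        offsets - np.outer(proj[:, 0], v_move_hat)
                - np.outer(proj[:, 1], v_osc_hat), axis=1)
    return proj, residual
```

For a trajectory that truly lives in the plane spanned by the two directions, the returned residual is zero.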

A.6.2 ELU-ACTIVATED FULLY CONNECTED NETWORK

Here we present the experiment results for training a 5-layer ELU-activated FC network on the binary subset of CIFAR-10. In Fig. 23, we show the evolution of the loss and sharpness along the training process. The sharpness eventually converges to just slightly below the 2/η threshold. We also observe that the dynamics toward the end of the converging process mainly happen in the 2-dimensional subspace spanned by the oscillation and movement directions (Fig. 23c). In Fig. 24 we plot the projected trajectory of the training process. Toward the end of training, the trajectory can be very accurately characterized by a parabola, and the converging sharpness is just slightly below the stability threshold; this is identical to what we observed (and proved) for the scalar network case. We see that the model is capable of memorizing all the data, as the loss decreases exponentially to 0. Toward convergence, the sharpness oscillates very close to the stability threshold and eventually converges to 199.97. In (b) we show a section of (a) between iterations 5000 and 5030. We can clearly observe two distinctive features of the EoS regime: the loss decreases non-monotonically and the sharpness oscillates around 2/η. In (c) we plot the norm of the offset from the minimum that is orthogonal to the movement-oscillation subspace. After 3000 iterations the residual becomes very small, suggesting that the dynamics mainly happen in the 2-dimensional subspace, and hence the projection captures the dynamics quite well.

A.7 EDGE OF STABILITY AND STOCHASTIC GRADIENT DESCENT

In this section, we briefly discuss some empirical observations on EoS under stochastic gradient descent (SGD). In Appendix A.7.1, we first empirically present the effects of different forms of noise on our scalar model. Then in Appendix A.7.2 we compare these with observations on real-world models trained with mini-batch gradient descent. Finally, we discuss the limitations of our scalar model in explaining what people observe about EoS when the model is trained with SGD.

A.7.1 GD WITH NOISE ON SCALAR NETWORK

We first look into the training trajectory of our degree-4 scalar network example with noise injected to the gradient descent process. We consider label noise, which perturbs the target by a small amount per iteration, and gradient noise, which perturbs the gradient by a small amount per iteration before updating the parameter according to it.

Label Noise:

To simulate the existence of label noise, at each iteration we compute the gradient of the objective L_LN(x, y, δ) = (1/4)(1 + δ − x^2y^2)^2, where δ is freshly sampled from a zero-mean Gaussian N(0, σ^2) at each iteration. This is equivalent to adding a perturbation of δ to the label (which is 1 in our original model). We start from the same initialization as in Fig. 8 and plot the trajectory in Fig. 27. As shown in Fig. 27a and Fig. 27b, the trajectory first roughly follows a parabolic boundary and reaches close to the set of global minima near the EoS minimum relatively quickly (in around 200 iterations). This part of the trajectory resembles our analysis for the case without label noise. After the model reaches the tip of the parabola, the dynamics are mainly dominated by the label noise. The gradient is dominated by its noise component δxy(y, x), which is orthogonal to the manifold of global minima xy = 1, so the model starts oscillating around the global minima. As shown in Fig. 27c and Fig. 27d, the sharpness then decreases very slowly (over 10^6 iterations) and eventually reaches the flattest global minimum at (1, 1) with sharpness 4. We believe this is within the regime of the sharpness-reduction flow near the manifold of minima, which is comprehensively studied by Damian et al. (2021).

Gradient Noise:

For the gradient noise model, we sample a perturbation (δ, δ') from a 2-dimensional spherical Gaussian N(0, σ^2 I_2) at each iteration and apply it to the gradient before updating the parameters. The one-step dynamics with gradient noise (δ, δ') is then: x_{t+1} = x_t − η(x_t y_t^2(x_t^2 y_t^2 − 1) + δ), y_{t+1} = y_t − η(x_t^2 y_t(x_t^2 y_t^2 − 1) + δ'). With gradient noise, the initial parabolic trajectory can still be observed, as shown in Fig. 28b. As the model reaches close to the manifold of global minima near the tip of the parabola, it no longer follows a monotone sharpness-reduction flow (as in Fig. 27c for the label noise case) but instead oscillates randomly along the manifold of global minima between the two EoS minima. We believe this is due to the component of the gradient noise parallel to the minima manifold, which dominates the sharpness-reduction effect.
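The label-noise dynamics above can be simulated with a few lines (a sketch; the step size, initialization, and noise scale below are our own illustrative choices):

```python
import numpy as np

def gd_label_noise(x, y, eta, sigma, steps, seed=0):
    """GD on L_LN(x, y, delta) = 0.25 * (1 + delta - x^2 y^2)^2, with a
    fresh label perturbation delta ~ N(0, sigma^2) at every iteration."""
    rng = np.random.default_rng(seed)
    traj = [(x, y)]
    for _ in range(steps):
        delta = sigma * rng.standard_normal()
        r = x * x * y * y - 1.0 - delta          # perturbed residual
        x, y = x - eta * x * y * y * r, y - eta * x * x * y * r
        traj.append((x, y))
    return np.array(traj)
```

With sigma = 0 this reduces to the noiseless degree-4 dynamics.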

A.7.2 MINIBATCH SGD FOR OVER-PARAMETERIZED MODELS

In this section, we empirically investigate the converging sharpness when over-parameterized networks are trained with mini-batch SGD. We use the same 5-layer FC models and dataset as in Section 6. In addition to full-batch gradient descent, we also train the models with mini-batch gradient descent with varying batch sizes and record their converging sharpness. For each learning rate and batch size, we train 10 models from different random initializations for 20000 epochs and record their converging sharpness; the standard deviation of the sharpness is represented by the shaded area. (The loss of all models converges below 10^{-8}.) When the batch size of SGD is large, the converging sharpness is close to 2/η, very similar to the gradient descent case. On the other hand, the converging sharpness is significantly lower when the batch size is small compared with the dataset size. It is worth noting that the converging sharpness for each batch size is quite concentrated. Instead of going to the flattest minima (as in the label noise case) or randomly oscillating below the EoS threshold (as in the gradient noise case), the converging minima of over-parameterized deep networks trained with mini-batch SGD have highly concentrated sharpness that is correlated with the batch size. We note that a key difference between over-parameterized mini-batch SGD and the noisy GD experiments discussed in Appendix A.7.1 is that the loss for mini-batch SGD can converge to a fixed point with loss 0 (i.e. the model can memorize all the training data), while the models with fixed additive noise will not converge to a fixed point. Currently, the minimalist scalar model we analyzed can only memorize one data point. We believe it is an interesting future direction to generalize the model to memorize more data and to understand why mini-batch SGD converges below the EoS threshold.

B THEORETICAL ANALYSIS ON THE DEGREE-4 EXAMPLE

In this section we present the complete rigorous analysis of the training dynamics of the degree-4 example discussed in Section 3. The section is organized as follows. In Appendix B.1 we define the problem and the two reparameterizations used in the analysis; this serves as a more comprehensive version of Section 2 in the main text. In Appendix B.2 we restate our main theorem for the degree-4 example (Theorem 3.1) along with two corollaries (Corollary 3.1 and Corollary 3.2); these results characterize the sharpness concentration and sharpness adaptivity phenomena of the degree-4 example. We then provide a more comprehensive proof sketch for Theorem 3.1, similar to the discussion in Section 3.2 of the main text. Next we provide the lemmas for the dynamics approximation (Appendix B.3), Phase I convergence (Appendix B.4), and Phase II convergence (Appendix B.5). Finally, in Appendix B.6 we use these lemmas to complete the proof of the main theorem and its corollaries.

B.1 PRELIMINARIES

We consider the simple objective function L(x, y, z, w) = (1/2)(xyzw − 1)^2. Denote γ := xyzw; then

∇L(x, y, z, w) = (γ^2 − γ)[x^{-1}, y^{-1}, z^{-1}, w^{-1}],

∇^2 L(x, y, z, w) =
[ γ^2/x^2            (2γ^2 − γ)/(xy)    (2γ^2 − γ)/(xz)    (2γ^2 − γ)/(xw) ]
[ (2γ^2 − γ)/(xy)    γ^2/y^2            (2γ^2 − γ)/(yz)    (2γ^2 − γ)/(yw) ]
[ (2γ^2 − γ)/(xz)    (2γ^2 − γ)/(yz)    γ^2/z^2            (2γ^2 − γ)/(zw) ]
[ (2γ^2 − γ)/(xw)    (2γ^2 − γ)/(yw)    (2γ^2 − γ)/(zw)    γ^2/w^2 ].    (22)

The parameters [x, y, z, w] are optimized by gradient descent with step size η, so that

[x^{(t+1)}, y^{(t+1)}, z^{(t+1)}, w^{(t+1)}] = [x^{(t)}, y^{(t)}, z^{(t)}, w^{(t)}] − η∇L(x^{(t)}, y^{(t)}, z^{(t)}, w^{(t)}).

To further simplify the problem, we consider the symmetric initialization z_0 = x_0, w_0 = y_0. Due to the symmetry of the objective, identical entries remain identical throughout training, so the training dynamics reduce to two dimensions, and the set of global minima is simply S = {(x, y) ∈ R^2 : x^2 y^2 = 1}. Computing the closed form of the gradient, the 1-step updates of x and y follow

x_{t+1} = x_t − η x_t y_t^2 (x_t^2 y_t^2 − 1), y_{t+1} = y_t − η x_t^2 y_t (x_t^2 y_t^2 − 1). (24)

Denote γ = xy. The parameter Hessian of the objective L at (x, y, x, y) admits eigenvalues λ_1 = x^2(1 − γ), λ_2 = y^2(1 − γ), and

λ_3 = (1/2)[(x^2 + y^2)(3γ^2 − 1) − sqrt((x^2 + y^2)^2(1 − 3γ^2)^2 + 4γ^2(3 − 10γ^2 + 7γ^4))],
λ_4 = (1/2)[(x^2 + y^2)(3γ^2 − 1) + sqrt((x^2 + y^2)^2(1 − 3γ^2)^2 + 4γ^2(3 − 10γ^2 + 7γ^4))].

When (x, y) converges to any minimum, γ = xy = 1, so λ_1, λ_2, λ_3 all vanish. Therefore it is λ_4 that corresponds to the EoS phenomenon people observe. When η < 1/2, solving λ_4 = 2/η with x^2 y^2 = 1 gives

x = ±(1/√2)((η^{-2} − 4)^{1/2} + η^{-1})^{1/2}, y = ±√2((η^{-2} − 4)^{1/2} + η^{-1})^{-1/2} (26)

and their multiplicative inverses. These solutions correspond to the minima whose sharpness exactly equals the EoS threshold 2/η.
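The closed form in Eq. (26) can be cross-checked against the exact Hessian in Eq. (22) numerically; the small script below is an illustrative sanity check:

```python
import numpy as np

def eos_minimum(eta):
    """Positive minimum of 0.5*(xyzw - 1)^2 (restricted to z = x, w = y)
    whose sharpness equals 2/eta, following Eq. (26)."""
    s = np.sqrt(eta ** -2 - 4) + 1.0 / eta
    return np.sqrt(s / 2.0), np.sqrt(2.0 / s)

def hessian_4d(x, y, z, w):
    """Exact Hessian of L = 0.5*(x*y*z*w - 1)^2, per Eq. (22)."""
    p = np.array([x, y, z, w])
    g = np.prod(p)                                # gamma = xyzw
    H = (2 * g * g - g) / np.outer(p, p)          # off-diagonal entries
    np.fill_diagonal(H, g * g / p ** 2)           # diagonal entries
    return H
```

For η = 0.2, the top eigenvalue of the Hessian at this minimum is 2/η = 10.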
Since the solutions are all symmetric to each other, without loss of generality we pick the minimum of interest as

(x̄, ȳ) ≜ ((1/√2)((η^{-2} − 4)^{1/2} + η^{-1})^{1/2}, √2((η^{-2} − 4)^{1/2} + η^{-1})^{-1/2}).

To analyze the dynamics in a more natural coordinate system, we consider the following reparameterization: for any (x, y) ∈ {(x, y) ∈ R^+ × R^+ : x > y}, define

c ≜ (x^2 − y^2)^{1/2}, d ≜ xy.

This gives a bijective continuous mapping between {(x, y) ∈ R^+ × R^+ : x > y} and R^+ × R^+. Intuitively, we are taking the lower half of y = 1/x in the positive quadrant as d = 1. With c, d as defined, the η-EoS minimum simplifies to (c̄, d̄) ≜ ((η^{-2} − 4)^{1/4}, 1). The inverse map can be computed as

x = ((c^2 + (c^4 + 4d^2)^{1/2})/2)^{1/2}, y = √2 d (c^2 + (c^4 + 4d^2)^{1/2})^{-1/2}.

To expand the dynamics near the η-EoS minimum, we define a ≜ c − (η^{-2} − 4)^{1/4} and b ≜ d − 1 as the offsets from (c̄, d̄). Our analysis will primarily use the (a, b)-parameterization. To summarize, the (c, d) and (a, b) reparameterizations of (x, y) are respectively given by

(c, d) ≜ ((x^2 − y^2)^{1/2}, xy), (a, b) ≜ ((x^2 − y^2)^{1/2} − (η^{-2} − 4)^{1/4}, xy − 1).

Let κ ≜ √η. Under this reparameterization, Eq. (24) becomes

a_{t+1} = −(κ^{-4} − 4)^{1/4} + (a_t + (κ^{-4} − 4)^{1/4})(1 − ((1 + b_t)^3 − (1 + b_t))^2 κ^4)^{1/2},
b_{t+1} = b_t + ((1 + b_t)^3 − 2(1 + b_t)^5 + (1 + b_t)^7)κ^4 + ((1 + b_t) − (1 + b_t)^3)(4(1 + b_t)^2 κ^4 + (a_t κ + (1 − 4κ^4)^{1/4})^4)^{1/2}. (32)
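Since the (c, d) and (a, b) maps are exact changes of variables, one GD step in (x, y) followed by the reparameterization must agree with one step of Eq. (32) in (a, b). The sketch below verifies this numerically for η = 0.2 (the test point is our own arbitrary choice):

```python
import numpy as np

ETA = 0.2
KAPPA = np.sqrt(ETA)

def to_ab(x, y):
    """(x, y) -> (a, b) reparameterization (requires x > y > 0)."""
    c = np.sqrt(x * x - y * y)
    return c - (ETA ** -2 - 4) ** 0.25, x * y - 1.0

def gd_step_xy(x, y):
    """One GD step of Eq. (24)."""
    r = x * x * y * y - 1.0
    return x - ETA * x * y * y * r, y - ETA * x * x * y * r

def gd_step_ab(a, b):
    """One step of the reparameterized dynamics, Eq. (32)."""
    k = KAPPA
    root = (k ** -4 - 4) ** 0.25
    d = 1.0 + b
    a_new = -root + (a + root) * np.sqrt(1 - (d ** 3 - d) ** 2 * k ** 4)
    b_new = b + (d ** 3 - 2 * d ** 5 + d ** 7) * k ** 4 \
            + (d - d ** 3) * np.sqrt(4 * d * d * k ** 4
                                     + (a * k + (1 - 4 * k ** 4) ** 0.25) ** 4)
    return a_new, b_new
```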

B.2 THEORETICAL RESULTS AND PROOF SKETCH

Now with the reparameterization defined, we restate our convergence result for the 4-scalar objective and discuss the proof sketch.

Theorem 3.1 (Sharpness Concentration). For a large enough absolute constant K, suppose κ < (1/(2000√2))K^{-1}, and the initialization (a_0, b_0) satisfies a_0 ∈ (12κ^{5/2}, (1/4)K^{-2}κ^{-1}) and b_0 ∈ (−K^{-1}, K^{-1})\{0}. Consider the GD trajectory characterized by Eq. (6) with fixed step size κ^2 from (a_0, b_0). For any ϵ > 0 there exists T = O(K^{-2}κ^{-15/2} + (log(ϵ^{-1}) + log(|b_0|^{-1}))κ^{-2}) such that for all t > T, |b_t| < ϵ and a_t ∈ (−(5/3)κ^3, −(1/10)κ^3).

B.2.1 PROOF SKETCH OF THEOREM 3.1

Our analysis begins by approximating the local movement via Taylor expansion around the κ^2-EoS minimum (Appendix B.3). We show that for initializations within a local region of width 2K^{-2}κ^{-1} and height 2K^{-1} centered at the κ^2-EoS minimum (Condition B.1), the local two-step updates of a and b can be characterized by

a'' = a − 4b^2 κ^3 + R_a, b'' = b − 16b^3 + 8abκ + R_b, (33)

where R_a, R_b are remainders that we can effectively bound (Corollary B.1). We note that in the region under consideration, a monotonically decreases by roughly b^2 κ^3 every 2 steps (Lemma 5). With the approximation ready, we conduct our convergence analysis in 2 phases. (Fig. 30 depicts regions I-VI, the κ^2-EoS minimum, and the curves b^2 = 2aκ, b^2 = (1/4)aκ, and b^2 = (1/2)aκ + (1/16)κ^4 referenced below.) In Phase 1 (Appendix B.4), we consider all possible initializations (a_0, b_0) such that a_0 ∈ (12κ^{5/2}, (1/4)K^{-2}κ^{-1}) and b_0 ∈ (−K^{-1}, K^{-1})\{0}. We partition the region of initializations into three parts separated by b^2 = 2aκ and b^2 = (1/4)aκ (shown as regions I, II, III in Fig. 30).

• For initializations in region I where b^2 > 2aκ (Condition B.3), we show that the cubic term b^3 in the expression for b'' in Eq. (33) dominates the abκ term as well as the remainder, so the two-step update of |b| is monotonically decreasing with at least an additive update of −|b^3|. Combined with the slow movement of a, we show that initializations in region I quickly enter region II (Lemma 8).

• For initializations in region III where b^2 ∈ (0, (1/4)aκ) (Condition B.4), we show that the abκ term dominates the b^3 term and the other remainders. Thus the two-step update of |b| monotonically increases with a multiplicative rate of at least (1 + aκ). Combined with the slow movement of a, we show that initializations in region III also quickly enter region II (Lemma 9).
• For the last step of Phase 1, we show that 2-step trajectories entering region II stay in the region, in the sense that b^2 ∈ ((1/4)aκ, 2aκ) (Lemma 11). We also show that a keeps decreasing and enters ((3/2)κ^{5/2}, 2κ^{5/2}); in the diagram (Fig. 30) this corresponds to entering region IV from II.

At the end of Phase 1, we have shown that all trajectories starting from the required initializations converge to near the parabola b^2 = (1/2)aκ and enter region IV from the right.

In Phase 2 (Appendix B.5), we begin with an initialization (a_0, b_0) such that a_0 ∈ ((3/2)κ^{5/2}, 2κ^{5/2}) and b_0^2 ∈ ((1/4)a_0 κ, 2a_0 κ). In Phase 1 we established rough convergence close to the parabola b^2 = (1/2)aκ with extreme point (0, 0). In Phase 2 we change the parabola of interest to b^2 = (1/2)aκ + (1/16)κ^4, which is characterized by the ODE approximation discussed in Section 3.1. In particular we focus on the residual ξ ≜ b^2 − (1/2)aκ − (1/16)κ^4. The Phase 2 convergence has the following three stages; throughout the analysis we fix a small constant δ = 0.04.

• Stage 1. After the trajectory enters a ∈ ((3/2)κ^{5/2}, 2κ^{5/2}) and b^2 ∈ ((1/4)aκ, 2aκ), we show that the two-step update of ξ satisfies |ξ'' − (1 − 32b^2)ξ| < δb^2 κ^4 (Lemma 12). We then show that |ξ| decreases below (1/8)δκ^4 before a decreases below κ^{5/2}, so the trajectory enters region V from the right (Lemma 13).

• Stage 2. After the trajectory enters region V, we show that it remains close to the parabola, in the sense that |ξ| stays below (1/8)δκ^4, while a decreases into the interval (−(1/8)(1 − 3δ)κ^3, −(1/10)(1 + 2δ)κ^3) and the trajectory enters region VI (Lemma 14).

• Stage 3. Finally we conclude the proof by a convergence analysis in region VI.
The two-step dynamics approximation in region VI is very similar to that of region III, in that the abκ term in the two-step update of b dominates. Since a is now negative, |b| follows the multiplicative update |b''| < (1 + aκ)|b|. We also show that the movement of a is small, so the final converging minimum is not far from the extreme point a = −(1/8)κ^3 (Lemma 15).
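The qualitative picture of the proof sketch can be reproduced by iterating the leading-order two-step map of Eq. (33) with the remainders R_a, R_b dropped (an illustrative simulation; κ and the initialization are our own choices, and the true dynamics carry the remainder terms):

```python
def two_step_map(a, b, kappa):
    """Leading-order two-step update of Eq. (33), remainders dropped."""
    return a - 4 * b ** 2 * kappa ** 3, b - 16 * b ** 3 + 8 * a * b * kappa

def simulate(a, b, kappa, steps):
    """Iterate the two-step map `steps` times."""
    for _ in range(steps):
        a, b = two_step_map(a, b, kappa)
    return a, b
```

Starting from a small positive a_0 and a tiny b_0, |b| first grows (region III), the iterates then track the parabola while a drifts down, and once a turns negative |b| contracts, leaving a near the tip a = −κ^3/8.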

B.3 1 AND 2-STEP DYNAMICS OF a AND b

Now we begin the rigorous analysis of the dynamics of (a_t, b_t). For notational simplicity, when analyzing the 1-step and 2-step dynamics of (a_t, b_t), we use a, a', a'' to denote a_t, a_{t+1}, a_{t+2} and b, b', b'' to denote b_t, b_{t+1}, b_{t+2}. For simplicity of calculation, we consider the change of variable κ = √η. In the following analysis, the operator O(·) hides only absolute constants that are independent of ϵ, κ, a, b, and no asymptotic limits are taken. Concretely, for a monomial x and a polynomial y of some variables, we write y = O(x) if there exists an absolute constant K independent of the variables such that for any assignment of the variables, |y| < K|x|. Note that this is stronger than the usual big-O notation. Throughout the analysis we use K to denote the absolute constant that uniformly upper bounds all the constants of the O(·) terms; this is well defined as we only consider a finite number of O(·) terms.

B.3.1 ONE STEP DYNAMICS APPROXIMATION OF a AND b

Lemma 1. Fix any positive ϵ < 0.5. For any κ ∈ (0, ϵ^{1/4}) and all (a, b) such that |a| < ϵκ^{-1} and |b| < min{1, (1/5)ϵκ^{-2}}, we have

a′ = a − 2b²κ³ + O(ϵb²κ³) + O(b²κ⁴).

Proof of Lemma 1. Recall from Eq. (32) that

a′ = a + (κ^{-4} − 4)^{1/4} (1 − b²(1 + b)²(2 + b)²κ⁴)^{1/2} − (κ^{-4} − 4)^{1/4}.

Since (κ^{-4} − 4)^{1/4} approaches infinity as κ goes to 0, we instead analyze

κa′ = κa + (1 − 4κ⁴)^{1/4} (1 − b²(1 + b)²(2 + b)²κ⁴)^{1/2} − (1 − 4κ⁴)^{1/4}. (36)

Note that (1 − b²(1 + b)²(2 + b)²κ⁴)^{1/2} is close to 1 − (1/2)b²(1 + b)²(2 + b)²κ⁴ when b is not too large, and (1 − 4κ⁴)^{1/4} is close to 1 for small κ; we leverage these two properties to approximate a′. First observe that for any x ∈ (0, 8ϵ/(1 + 2ϵ)²) we have 1 − x/2 − ϵx < √(1 − x) < 1 − x/2. Since ϵ < 0.5, (1 + 2ϵ)² < 4, so it is sufficient to have x < 2ϵ for the inequality to hold.
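The sandwich bound on √(1 − x) is elementary and can be spot-checked numerically; a quick illustrative script (not part of the proof):

```python
import math

# Check: for eps in (0, 0.5) and any x in (0, 8*eps/(1 + 2*eps)**2),
#     1 - x/2 - eps*x  <  sqrt(1 - x)  <  1 - x/2.
violations = 0
for eps in (0.05, 0.2, 0.49):
    x_max = 8 * eps / (1 + 2 * eps) ** 2
    for i in range(1, 1000):
        x = x_max * i / 1000          # samples strictly inside (0, x_max)
        s = math.sqrt(1 - x)
        if not (1 - x / 2 - eps * x < s < 1 - x / 2):
            violations += 1

print(violations)  # 0: the sandwich holds on the whole sampled range
```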

Now we substitute x = b²(1 + b)²(2 + b)²κ⁴. Since |b| < 1, (1 + b)²(2 + b)² < 36. Thus when |b| < (1/5)ϵκ^{-2} we have b²(1 + b)²(2 + b)²κ⁴ < 36b²κ⁴ < 36((1/5)ϵκ^{-2})²κ⁴ < 2ϵ, and therefore

(1 − b²(1 + b)²(2 + b)²κ⁴)^{1/2} = 1 − (1/2)b²(1 + b)²(2 + b)²κ⁴ + O(ϵb²κ⁴) = 1 − 2b²κ⁴ + O(b²κ⁵) + O(ϵb²κ⁴). (37)

Since 1 − 4κ⁴ > 0, Bernoulli's inequality gives (1 − κ⁴)⁴ ≥ 1 − 4κ⁴; combining with the requirement κ < ϵ^{1/4}, we have (1 − 4κ⁴)^{1/4} ≥ 1 − κ⁴ > 1 − ϵ. Meanwhile, since we required |a| < ϵκ^{-1}, we have |κa| < ϵ, so κa + (1 − 4κ⁴)^{1/4} = 1 + O(ϵ). (38)

Proof of Lemma 4. Combining Lemma 1 and Lemma 2 we have

a′′ = a′ − 2b′²κ³ + O(ϵb′²κ³) + O(b′²κ⁴). (57)

From Eq. (51) in the proof of Lemma 3 we have

b′² = b² + O(b³) + O(a²b²κ²) + O(b³κ⁴) + O(b²κ⁵). (58)

Hence the last three terms of Eq. (57) can be computed as

b′²κ³ = b²κ³ + O(b³κ³) + O(a²b²κ⁵) + O(b³κ⁷) + O(b²κ⁸), (59)
O(ϵb′²κ³) = ϵκ³ · O(b′²) = O(ϵb²κ³) + O(ϵb³κ³) + O(ϵa²b²κ⁵) + O(ϵb³κ⁷) + O(ϵb²κ⁸) = O(ϵb²κ³), (60)
O(b′²κ⁴) = κ⁴ · O(b′²) = O(b²κ⁴) + O(b³κ⁴) + O(a²b²κ⁶) + O(b³κ⁸) + O(b²κ⁹) = O(b²κ⁴) + O(a²b²κ⁶) + O(b²κ⁹). (61)

Combining the above, we have

a′′ = a′ − 2b′²κ³ + O(ϵb′²κ³) + O(b′²κ⁴)
= a − 2b²κ³ + O(ϵb²κ³) + O(b²κ⁴) − 2b²κ³ + O(b³κ³) + O(a²b²κ⁵) + O(b³κ⁷) + O(b²κ⁸) + O(ϵb²κ³) + O(b²κ⁴) + O(a²b²κ⁶) + O(b²κ⁹)
= a − 4b²κ³ + O(ϵb²κ³) + O(b³κ³) + O(b²κ⁴). (62)

This completes the proof. Combining Lemma 4 and Lemma 3, the 2-step dynamics approximation can be summarized as Corollary B.1 below: trajectories starting from any initialization (a₀, b₀) such that a₀ ∈ (12κ^{5/2}, (1/4)K^{-2}κ^{-1}) and b₀ ∈ (−K^{-1}, K^{-1}) will converge to near the parabola very fast. As mentioned above, there are mainly three regimes of interest.
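As a sanity check on these regimes (illustrative only, keeping just the leading terms of the two-step b-update summarized above), the update −16b³ + 8abκ vanishes exactly on the parabola b² = aκ/2: above the parabola the cubic term wins and |b| shrinks, below it the abκ term wins and |b| grows.

```python
kappa, a = 0.05, 0.01                       # illustrative positive values
b_star = (a * kappa / 2) ** 0.5             # the parabola level b^2 = a*kappa/2

signs = {}
for frac in (0.5, 0.9, 1.1, 2.0):
    b = frac * b_star
    delta = -16 * b**3 + 8 * a * b * kappa  # leading terms of b'' - b
    signs[frac] = delta > 0                 # True: |b| grows; False: |b| shrinks

print(signs)  # grows below the parabola (frac < 1), shrinks above it (frac > 1)
```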
We will first determine the region in which the two-step update b′′ − b is dominated solely by −b³ or by abκ. To formally analyze the different dynamics and characterize their regimes, in this section we use K to denote the uniform upper bound on the absolute constants hidden by the O(·) operator in the 2-step dynamics approximation of Corollary B.1. Since there are only finitely many O(·) terms, such a constant K is well defined and independent of ϵ, κ, a, b. Without loss of generality we assume K > 512. Rewriting Corollary B.1 with the uniform upper bound K, we have the following corollary:

Corollary B.2 (2-step dynamics approximation of a and b). There exists an absolute constant K such that for any constant ϵ < 0.5 and all κ, a, b satisfying Condition B.1, the 2-step update of (a, b) can be characterized by

a′′ = a − 4b²κ³ + R_a, b′′ = b − 16b³ + 8abκ + R_b, (64)

where the remainders R_a and R_b are bounded by

|R_a| < Kϵb²κ³ + K|b³|κ³ + Kb²κ⁴, |R_b| < Kb⁴ + K|a|b²κ + Ka²|b|κ² + Kb²κ⁴ + K|b|κ⁵ + Kϵb²κ³. (65)

Now we establish the conditions characterizing the working zones, so that the analysis is more tractable in the different regimes.

B.4.1 WHEN a′′ − a IS CLOSE TO −4b²κ³

Here we formalize the observation that when b and κ are not large, the two-step movement of a is monotone and always close to −4b²κ³. Concretely we have the following lemma:

Lemma 5. Fix ϵ = K^{-1}. For all κ, a, b satisfying Condition B.1 and additionally κ < K^{-1} and |b| < K^{-1}, we have |a′′ − (a − 4b²κ³)| < 3b²κ³.

Proof of Lemma 5. By Corollary B.2, a′′ = a − 4b²κ³ + R_a where |R_a| < Kϵb²κ³ + K|b³|κ³ + Kb²κ⁴. (66) Thus to prove the lemma we only need to bound every term on the RHS of Eq. (66) by b²κ³. (i) Since we fixed ϵ = K^{-1}, Kϵb²κ³ = b²κ³. (ii) Since |b| < K^{-1}, K|b³|κ³ < K·K^{-1}·b²κ³ = b²κ³. (iii) Since κ < K^{-1}, Kb²κ⁴ < K·K^{-1}·b²κ³ = b²κ³. Therefore |R_a| < 3b²κ³, which completes the proof.

Note that when we fix ϵ = K^{-1}, Condition B.1 becomes κ < min{0.1, K^{-1/4}}, |a| < K^{-1}κ^{-1}, |b| < min{1, (1/5)K^{-1}κ^{-2}}.
Combining with the additional conditions of Lemma 5, we can summarize the conditions for Lemma 5 to hold as

κ < min{0.1, K^{-1/4}, K^{-1}}, (1)
|a| < K^{-1}κ^{-1}, (2)
|b| < min{1, K^{-1}, (1/5)K^{-1}κ^{-2}}. (3) (68)

With K > 512, K^{-1} < 0.1, so (1) reduces to κ < K^{-1}. Since κ < K^{-1}, (1/5)κ^{-2} > (1/5)K², so (1/5)K^{-1}κ^{-2} > (1/5)K > 1 > K^{-1}, and thus (3) reduces to |b| < K^{-1}. In conclusion, the following condition is sufficient for |a′′ − (a − 4b²κ³)| < 3b²κ³:

Condition B.2 (Condition for (4 ± 3)b²κ³ movement of a). κ < K^{-1}, |a| < K^{-1}κ^{-1}, |b| < K^{-1}. (69)

B.4.2 WHEN −b³ DOMINATES THE DYNAMICS OF b

Lemma 6. For all κ, a, b satisfying

κ < K^{-1}, |a| ∈ (κ³, K^{-2}κ^{-1}), |b| ∈ (√(|a|κ), K^{-1}), (70)

we have |b′′ − (b − 16b³)| ≤ 14|b³|.

Proof of Lemma 6. By Corollary B.2, |R_b| < Kb⁴ + K|a|b²κ + Ka²|b|κ² + Kb²κ⁴ + K|b|κ⁵ + Kϵb²κ³. (71) Thus to prove the claim it suffices to bound |8abκ| by 8|b³| and every term on the RHS of Eq. (71) by |b³|. (i) Since √(|a|κ) < |b|, we have |aκ| < b², so 8|abκ| = 8|b||aκ| < 8|b³|. (ii) Since |b| < K^{-1}, Kb⁴ < K·K^{-1}|b³| = |b³|. (iii) Since |aκ| < b², K|a|b²κ < Kb⁴ < |b³|. (iv) Since a²κ² < b⁴ and Kb² < K·K^{-2} < 1, Ka²|b|κ² < K|b|⁵ < |b³|. (v) Since |b| > √(|a|κ) > κ² and Kκ² < K^{-1} < 1, Kb²κ⁴ = (Kκ²)κ²·b² < κ²b² < |b³|. (vi) Since b² > |a|κ > κ⁴ and Kκ⁵ < κ⁴, K|b|κ⁵ < |b|κ⁴ < |b|b² = |b³|. (vii) Since Kϵκ³ < Kκ³ < κ² < |b|, Kϵb²κ³ < |b³|. Therefore |b′′ − (b − 16b³)| ≤ 8|b³| + |R_b| ≤ 14|b³|, which completes the proof. We restate the sufficient condition for |b′′ − (b − 16b³)| ≤ 14|b³| as follows:

Condition B.3 (Condition for −b³-dominated movement of b). κ < K^{-1}, |a| ∈ (κ³, K^{-2}κ^{-1}), |b| ∈ (√(|a|κ), K^{-1}).

B.4.3 WHEN abκ DOMINATES THE DYNAMICS OF b

Lemma 7. For all κ, a, b satisfying

κ < K^{-1}, |a| ∈ (κ³, K^{-2}κ^{-1}), |b| < min{(1/(2√2))√(|a|κ), K^{-1}}, (73)

we have |b′′ − (b + 8abκ)| ≤ 7|abκ|.

Proof of Lemma 7. By Corollary B.2, |R_b| < Kb⁴ + K|a|b²κ + Ka²|b|κ² + Kb²κ⁴ + K|b|κ⁵ + Kϵb²κ³. (74) Thus to prove the claim it suffices to bound 16|b³| by 2|abκ| and every term on the RHS of Eq. (74) by |abκ| (with two of them by (1/2)|abκ|). (i) Since |b| < (1/(2√2))√(|a|κ), b² < (1/8)|a|κ, so 16|b³| = 16b²|b| < 2|abκ|. (ii) Since |a| < K^{-2}κ^{-1}, we have |a|κ < K^{-2}, hence (|a|κ)^{3/2} < K^{-1}|a|κ, i.e. √(|a|κ) < (K^{-1}|a|κ)^{1/3}. Since |b| < √(|a|κ), |b|³ < K^{-1}|a|κ, and multiplying by K|b| gives Kb⁴ < |abκ|. (iii) Since |b| < K^{-1}, K|a|b²κ = K|b|·|abκ| < |abκ|. (iv) Since |a| < K^{-2}κ^{-1} < (1/2)K^{-1}κ^{-1} (as we assumed K > 512), K|a|κ < 1/2, so Ka²|b|κ² = K|a|κ·|abκ| < (1/2)|abκ|. (v) Since |a| > κ³ and κ < K^{-1}, we have |a| > κ³ > K²κ⁷; multiplying both sides by K^{-2}|a|κ^{-6} gives K^{-2}a²κ^{-6} > |a|κ, and taking the square root, √(|a|κ) < K^{-1}|a|κ^{-3}. Since |b| < (1/(2√2))√(|a|κ) < (1/2)√(|a|κ), we get |b| < (1/2)K^{-1}|a|κ^{-3}; multiplying by K|b|κ⁴ gives Kb²κ⁴ < (1/2)|abκ|. (vi) Since |a| > κ³ > Kκ⁴, multiplying by |b|κ gives K|b|κ⁵ < |abκ|. (vii) Since |a| > κ³ > K²ϵ²κ⁵ (as Kϵκ < 1), we have K²ϵ²κ⁵|a| < a², i.e. Kϵκ²√(|a|κ) < |a|. Since |b| < √(|a|κ), it follows that Kϵb²κ³ < Kϵκ³|b|√(|a|κ) = (Kϵκ²√(|a|κ))·|b|κ < |abκ|. From (i) we have 16|b³| ≤ 2|abκ|, and from (ii)–(vii) we have |R_b| < 5|abκ|. Therefore |b′′ − (b + 8abκ)| ≤ 16|b³| + |R_b| ≤ 7|abκ|, which completes the proof.

We restate the sufficient condition for |b′′ − (b + 8abκ)| ≤ 7|abκ| as follows:

Condition B.4 (Condition for abκ-dominated movement of b). κ < K^{-1}, |a| ∈ (κ³, K^{-2}κ^{-1}), |b| < min{(1/(2√2))√(|a|κ), K^{-1}}.

B.4.4 CONVERGENCE WHEN −b³ DOMINATES THE DYNAMICS OF b

Lemma 8 (Convergence to near parabola from large b). For any κ < K^{-1} and any initialization (a₀, b₀) satisfying a₀ ∈ (12κ^{5/2}, (1/4)K^{-2}κ^{-1}) and |b₀| ∈ [2√(a₀κ), K^{-1}), there exists some T < κ^{-4} such that a_T ∈ (2κ^{5/2}, K^{-2}κ^{-1}) and |b_T| ∈ (√(a_Tκ), 2√(a_Tκ)).

Proof of Lemma 8. We will prove the claim using induction.

Consider the inductive hypothesis, for k ∈ {0, 1, …, ⌊κ^{-4}⌋},

P(k): a_k ∈ (2κ^{5/2}, (1/4)K^{-2}κ^{-1}) and |b_k| ∈ (2√(a_kκ), min{K^{-1}, (k + 1)^{-1/2}}).

Since |b₀| < K^{-1} < 1 = (0 + 1)^{-1/2} and a₀ ∈ (12κ^{5/2}, (1/4)K^{-2}κ^{-1}) ⊂ (2κ^{5/2}, (1/4)K^{-2}κ^{-1}) as required by the initialization, the base case P(0) holds trivially. Now we can proceed to the inductive step. Assume P(l) holds for all l ≤ k; we want to show that P(k + 1) holds unless |b_{k+1}| < 2√(a_{k+1}κ). By the inductive hypothesis, (a_k, b_k) satisfies Condition B.2 and Condition B.3. Thus by Lemma 5 and Lemma 6 we have

a_{k+1} ∈ (a_k − 7b_k²κ³, a_k − b_k²κ³), |b_{k+1}| ∈ (|b_k| − 30|b_k³|, |b_k| − 2|b_k³|). (76)

By the strong inductive hypothesis, these properties also hold when substituting k by any l < k. We first check the lower bound for a_{k+1} under the assumption that k < κ^{-4}. Since a_{l+1} > a_l − 7b_l²κ³ for all l ≤ k, we have

a_{k+1} > a₀ − Σ_{l=0}^{k} 7b_l²κ³ > a₀ − Σ_{l=0}^{k} 7((l + 1)^{-1/2})²κ³ (since b_l < (l + 1)^{-1/2} by the IH)
> 12κ^{5/2} − 7κ³ Σ_{l=0}^{k} 1/(l + 1) > 12κ^{5/2} − 7κ³ (1 + ∫_1^{k+1} (1/τ) dτ) = 12κ^{5/2} − 7κ³(1 + log(k + 1)). (77)

Under the assumption that k < κ^{-4} and κ < K^{-1} < 1/512, it is not hard to check that

1 + log(k + 1) < 2 + log(κ^{-4}) = 2 − 4log(κ) < (10/7)κ^{-1/2}. (78)

The last inequality holds since at κ = 1/512 we have 2 − 4log(1/512) − (10/7)(1/512)^{-1/2} < −5, and as κ decreases below 1/512 the term (10/7)κ^{-1/2} grows faster than 2 − 4log(κ). Plugging back into Eq. (77), we know that if P(l) holds for all l ≤ k and k < κ^{-4}, then a_{k+1} > 12κ^{5/2} − 7κ³·(10/7)κ^{-1/2} = 12κ^{5/2} − 10κ^{5/2} = 2κ^{5/2}. The upper bound a_{k+1} < (1/4)K^{-2}κ^{-1} always holds since a is monotonically decreasing by Eq. (76).

Now we check the upper bound for |b_{k+1}|. Consider f(x) = x^{-1/2}; we have f′(x) = −(1/2)x^{-3/2} = −(1/2)f(x)³ and f′′(x) = (3/4)x^{-5/2}. Note that f′′(x) > 0 for all x > 0, so by a first-order Taylor expansion around x = k + 1 we have f(k + 2) > f(k + 1) − (1/2)f(k + 1)³. Combined with |b_{k+1}| < |b_k| − |b_k|³, it follows that

f(k + 2) − |b_{k+1}| ≥ f(k + 1) − (1/2)f(k + 1)³ − |b_k| + |b_k|³
= (f(k + 1) − |b_k|) − (1/2)(f(k + 1)³ − |b_k|³) + (1/2)|b_k|³
≥ (f(k + 1) − |b_k|) − (1/2)(f(k + 1) − |b_k|)(b_k² + |b_k|f(k + 1) + f(k + 1)²)
= (f(k + 1) − |b_k|)(1 − (1/2)(b_k² + |b_k|f(k + 1) + f(k + 1)²)). (79)

Since f(k + 1) ≤ 1 and |b_k| ≤ |b₀| < K^{-1} < 1/2 for all k, we have b_k² + |b_k|f(k + 1) + f(k + 1)² < 2, and hence 1 − (1/2)(b_k² + |b_k|f(k + 1) + f(k + 1)²) > 0. Since |b_k| < (k + 1)^{-1/2} = f(k + 1) by the induction hypothesis, we get f(k + 2) − |b_{k+1}| > 0, so |b_{k+1}| < (k + 2)^{-1/2}. The other upper bound |b_{k+1}| < K^{-1} holds trivially since |b₀| < K^{-1} as required by the initialization and |b_k| is monotonically decreasing according to Eq. (76).

Now we show that there exists some τ < κ^{-4} such that |b_τ| < 2√(a_τκ). Assume toward contradiction that there is no such τ. Then the induction proceeds to h ≜ ⌊κ^{-4}⌋, so that a_h > 2κ^{5/2}, |b_h| < (h + 1)^{-1/2} < (κ^{-4})^{-1/2} = κ², and |b_h| > 2√(a_hκ) > 2√(2κ^{5/2}·κ) = 2√2·κ^{7/4}. The last two inequalities contradict each other, since 2√2·κ^{7/4} > κ². Hence the assumption fails, and there exists some τ < κ^{-4} with |b_τ| < 2√(a_τκ). Let T ≤ ⌊κ^{-4}⌋ be the smallest such τ; then the induction proceeds to k = T − 1, which guarantees P(t) for all t < T. Thus for all t < T, a_t > 2κ^{5/2}. Now we still need to show |b_T| > √(a_Tκ). Since a_k is monotonically decreasing, it suffices to show |b_T| > √(a_{T−1}κ). From Eq. (76) we have |b_T| > |b_{T−1}| − 30|b_{T−1}³| = |b_{T−1}|(1 − 30b_{T−1}²). Since |b_{T−1}| < K^{-1} < 1/512, (1 − 30b_{T−1}²) > 1/2.
Combined with |b_{T−1}| > 2√(a_{T−1}κ) (which holds as P(T − 1) is guaranteed), we have |b_T| > (1/2)·2√(a_{T−1}κ) = √(a_{T−1}κ) ≥ √(a_Tκ), which completes the proof.

B.4.5 CONVERGENCE WHEN abκ DOMINATES b³

Lemma 9 (Convergence to near parabola from small b). For any κ < K^{-1} and any initialization (a₀, b₀) satisfying a₀ ∈ (12κ^{5/2}, (1/4)K^{-2}κ^{-1}) and |b₀| ∈ (0, (1/4)√(a₀κ)], there exists some T < (1/2)log(|b₀|^{-1})κ^{-7/2} such that |b_T| ∈ ((1/4)√(a_Tκ), (1/2)√(a_Tκ)) and a_T ∈ (2κ^{5/2}, (1/4)K^{-2}κ^{-1}).

Proof of Lemma 9. We will prove the claim using induction. Consider the inductive hypothesis, for k ∈ ℕ,

P(k): |b_k| ∈ (|b₀|(1 + 4κ^{7/2})^k, (1/4)√(a_kκ)), a_k ∈ (2κ^{5/2}, (1/4)K^{-2}κ^{-1}), and, if k ≥ 1, (|b_k| − |b₀|)/(a₀ − a_k) > κ^{-5/4}.

The base case k = 0 holds from the initialization, so we proceed to the inductive step. Assume P(l) holds for all l ≤ k; we want to show P(k + 1) holds unless |b_k| > (1/4)√(a_kκ). First note that by the inductive hypothesis, since a_k < (1/4)K^{-2}κ^{-1}, we have |b_k| < (1/4)√(a_kκ) < (1/4)√(K^{-2}κ^{-1}·κ) < K^{-1}. Hence (a_k, b_k) satisfies Condition B.2 and Condition B.4, so by Lemma 5 and Lemma 7,

a_{k+1} ∈ (a_k − 7b_k²κ³, a_k − b_k²κ³), |b_{k+1}| ∈ (|b_k| + 2|a_kb_k|κ, |b_k| + 14|a_kb_k|κ). (81)

Observe that when a is not too small, the movement of b is significantly larger than the movement of a. From Eq. (81) we have a_k − a_{k+1} < 8b_k²κ³ and |b_{k+1}| − |b_k| > 2a_k|b_k|κ, so

(|b_{k+1}| − |b_k|)/(a_k − a_{k+1}) > 2a_k|b_k|κ/(8b_k²κ³) = a_k/(4|b_k|κ²) > a_k/(√(a_kκ)·κ²) (since |b_k| < (1/4)√(a_kκ)) = √(a_k)·κ^{-5/2} > κ^{-5/4} (since a_k > κ^{5/2}).

When k = 0, since a is monotonically decreasing from Eq. (81), we directly have (|b₁| − |b₀|)/(a₀ − a₁) > κ^{-5/4}. When k ≥ 1, the inductive hypothesis gives (|b_k| − |b₀|)/(a₀ − a_k) > κ^{-5/4}; combining with (|b_{k+1}| − |b_k|)/(a_k − a_{k+1}) > κ^{-5/4}, the mediant inequality yields (|b_{k+1}| − |b₀|)/(a₀ − a_{k+1}) > κ^{-5/4}. The upper bound a_{k+1} < a₀ < (1/4)K^{-2}κ^{-1} is immediate by monotonicity.

Since (|b_k| − |b₀|)/(a₀ − a_k) > κ^{-5/4} by the inductive hypothesis and a₀ − a_k > 0 by monotonicity of a,

a₀ − a_k < (|b_k| − |b₀|)κ^{5/4} < |b_k|κ^{5/4} < (1/4)√(a_kκ)·κ^{5/4} = (1/4)√(a_k)·κ^{7/4} ≤ (1/4)√(a₀)·κ^{7/4},

where the last step holds again by monotonicity of a. Reorganizing the inequality, a_k > a₀ − (1/4)√(a₀)κ^{7/4} = √(a₀)(√(a₀) − (1/4)κ^{7/4}). Since a₀ > 12κ^{5/2} by the initialization condition, √(a₀) > 3κ^{5/4} and √(a₀) − (1/4)κ^{7/4} > 2κ^{5/4}. Thus a_k > √(a₀)(√(a₀) − (1/4)κ^{7/4}) > 6κ^{5/2}. Note that since |b_k| < K^{-1} < 1/512, 7b_k²κ³ < 7K^{-2}κ³ < κ³. From Eq. (81) we have a_{k+1} > a_k − 8b_k²κ³ > 6κ^{5/2} − κ³ > 2κ^{5/2}. This gives the desired lower bound for a_{k+1}. Hence, unless |b_{k+1}| > (1/4)√(a_{k+1}κ), P(k + 1) holds and the induction can proceed.

Now we claim that there exists some τ < (1/2)log(|b₀|^{-1})κ^{-7/2} such that |b_τ| ≥ (1/4)√(a_τκ). Since |b_{k+1}| > |b_k| + 2a_k|b_k|κ from Eq. (81), combining with a_k > 2κ^{5/2} we have |b_{k+1}| > |b_k|(1 + 2a_kκ) > |b_k|(1 + 4κ^{7/2}), and hence |b_k| > |b₀|(1 + 4κ^{7/2})^k. Assume toward contradiction that there is no such τ; then the induction can proceed for any t ∈ ℕ. Consider t ≥ (1/2)log(|b₀|^{-1})κ^{-7/2}; we have

t ≥ log(|b₀|^{-1})((1/4)κ^{-7/2} + (1/4)κ^{-7/2}) > log(|b₀|^{-1})(1 + (1/4)κ^{-7/2}) (since (1/4)κ^{-7/2} > 1)
> ((1/2)log(K^{-2}) + log(|b₀|^{-1}))(1 + (1/4)κ^{-7/2}) (since log(K^{-2}) < 0)
= ((1/2)log((K^{-2}κ^{-1})κ) + log(|b₀|^{-1}))(1 + (1/4)κ^{-7/2})
> ((1/2)log(a_tκ) + log(|b₀|^{-1}))(1 + (1/4)κ^{-7/2}) (since K^{-2}κ^{-1} > a_t)
= ((1/2)log(a_tκ) + log(|b₀|^{-1}))·(1 + 4κ^{7/2})/(4κ^{7/2})
≥ ((1/2)log(a_tκ) + log(|b₀|^{-1}))·(log(1 + 4κ^{7/2}))^{-1} (since log(1 + x) ≥ x/(1 + x)).

It follows that

t > ((1/2)log(a_tκ) + log(|b₀|^{-1}))·(log(1 + 4κ^{7/2}))^{-1} = (log(√(a_tκ)) + log(|b₀|^{-1}))·(log(1 + 4κ^{7/2}))^{-1}
> (log((1/4)√(a_tκ)) + log(|b₀|^{-1}))·(log(1 + 4κ^{7/2}))^{-1} = log_{(1+4κ^{7/2})}((1/4)√(a_tκ)/|b₀|).
Hence (1 + 4κ^{7/2})^t > (1/4)√(a_tκ)/|b₀|, and therefore |b_t| > |b₀|(1 + 4κ^{7/2})^t > (1/4)√(a_tκ), contradicting P(t). Therefore there must exist some τ such that |b_τ| > (1/4)√(a_τκ). Let T < (1/2)log(|b₀|^{-1})κ^{-7/2} be the smallest such τ; then the induction proceeds to k = T − 1. Moreover, since P(T − 1) holds, following the previous analysis the bounds on a_T also hold, so a_T ∈ (2κ^{5/2}, (1/4)K^{-2}κ^{-1}). Now to complete the proof we only need to show that |b_T| < (1/2)√(a_Tκ). From Eq. (81) we have |b_T| < |b_{T−1}| + 14a_{T−1}|b_{T−1}|κ = (1 + 14a_{T−1}κ)|b_{T−1}|. Since a_{T−1} < (1/4)K^{-2}κ^{-1} and K > 512, (1 + 14a_{T−1}κ) < (1 + (14/4)K^{-2}) < 1.1, so

|b_T| < 1.1|b_{T−1}| < 1.1·((1/4)√(a_{T−1}κ)). (85)

On the other hand, a_T > a_{T−1} − 7b_{T−1}²κ³ by Eq. (81), where b_{T−1}² < ((1/4)√(a_{T−1}κ))² = (1/16)a_{T−1}κ. We thus have a_T > a_{T−1} − (7/16)a_{T−1}κ⁴ > (1 − κ)a_{T−1}. Since κ < K^{-1} < 1/512, a_T > 0.99a_{T−1}, and hence

(1/4)√(a_{T−1}κ) < 1.1·((1/4)√(a_Tκ)). (86)

Combining Eq. (85) and Eq. (86), we have |b_T| < 1.21·((1/4)√(a_Tκ)) < (1/2)√(a_Tκ), which completes the proof.

After the trajectory enters |b| ∈ ((1/4)√(aκ), 2√(aκ)), we now show that it will not leave this region unless a is very small. To do so, we first determine a regime in which we can effectively bound the two-step movement of b.

Lemma 10. For any κ, a, b satisfying

κ < K^{-1}, |a| < (1/4)K^{-2}κ^{-1}, |b| < 2√(aκ), (87)

we have |b′′ − b| < (1/16)√(aκ).

Proof of Lemma 10. Fix ϵ = 0.1; it is straightforward to check that κ, a, b satisfy Condition B.1. Hence from Eq. (64) and Eq. (65) we have

|b′′ − b| < 16|b³| + 8|abκ| + Kb⁴ + K|ab²κ| + K|a²bκ²| + Kb²κ⁴ + K|bκ⁵| + Kϵb²κ³. (88)

To prove the claim it suffices to bound each of the eight terms on the RHS by (1/128)√(aκ). Note first that aκ < (1/4)K^{-2}, so √(aκ) < (1/2)K^{-1} < 1/128 and |b| < 2√(aκ) < K^{-1}. Then: (i) 16|b³| < 16(4aκ)(2√(aκ)) = 128aκ√(aκ) ≤ 32K^{-2}√(aκ) < (1/128)√(aκ), since K² > 4096. (ii) 8|abκ| ≤ 8·(1/4)K^{-2}·2√(aκ) = 4K^{-2}√(aκ) < (1/128)√(aκ). (iii) Kb⁴ ≤ 16K(aκ)² = 16K(aκ)^{3/2}√(aκ) ≤ 2K^{-2}√(aκ) < (1/128)√(aκ). (iv) K|ab²κ| ≤ K·(1/4)K^{-2}·4aκ = K^{-1}√(aκ)·√(aκ) ≤ (1/2)K^{-2}√(aκ) < (1/128)√(aκ). (v) K|a²bκ²| ≤ K·(1/16)K^{-4}·2√(aκ) = (1/8)K^{-3}√(aκ) < (1/128)√(aκ). (vi) Kb²κ⁴ ≤ 4Kκ⁴·aκ ≤ 2κ⁴√(aκ) < (1/128)√(aκ), since κ⁴ < 1/256. (vii) K|bκ⁵| ≤ 2Kκ⁵√(aκ) < (1/128)√(aκ), since Kκ⁵ < K^{-4} < 1/256. (viii) Kϵb²κ³ ≤ 0.4Kκ³·aκ ≤ 0.2κ³√(aκ) < (1/128)√(aκ). Having bounded every term on the RHS of Eq. (88) by (1/128)√(aκ), we conclude |b′′ − b| < (8/128)√(aκ) = (1/16)√(aκ), which completes the proof.

Here we restate the condition for Lemma 10:

Condition B.5 (Condition for small b movement). κ < K^{-1}, |a| < (1/4)K^{-2}κ^{-1}, |b| < 2√(aκ).

With the two-step movement of b bounded above, we may proceed to the lemma guaranteeing that b will not leave ((1/4)√(aκ), 2√(aκ)) unless a is very small.

Lemma 11. For any κ and initialization (a₀, b₀) satisfying

κ < (1/16)K^{-1}, a₀ ∈ (κ^{5/2}, (1/4)K^{-2}κ^{-1}), |b₀| ∈ ((1/4)√(a₀κ), 2√(a₀κ)), (89)

there exists some T ≤ 16a₀κ^{-13/2} such that a_T < κ^{5/2}, and for all t < T, a_t > κ^{5/2} and b_t ∈ ((1/4)√(a_tκ), 2√(a_tκ)).

Proof of Lemma 11. First we check that the region defined is not empty. This is true since, given κ < (1/16)K^{-1}, we have (1/4)K^{-2}κ^{-1} > 4K^{-1} > K^{-5/2} > κ^{5/2}. To prove the claim we consider the inductive hypothesis P(k): a_k ∈ (κ^{5/2}, a₀ − (1/16)(k − 1)κ^{13/2}] and |b_k| ∈ ((1/4)√(a_kκ), 2√(a_kκ)). Assume that P(k) holds for some k. Since a_k < a₀ < (1/4)K^{-2}κ^{-1}, we observe that |b_k| < 2√(a_kκ) < 2√((1/4)K^{-2}κ^{-1}·κ) = K^{-1}. With κ < (1/16)K^{-1} and a_k < a₀ < (1/4)K^{-2}κ^{-1}, Conditions B.2 and B.5 are satisfied, so a_{k+1} > a_k − 8b_k²κ³ and

2√(a_{k+1}κ) > 2√((a_k − 8b_k²κ³)κ) > 2√((a_k − 8(4a_kκ)κ³)κ) (since b_k < 2√(a_kκ))
= 2√((a_kκ)(1 − 32κ⁴)) = 2√(1 − 32κ⁴)·√(a_kκ). (91)

Note that with κ < (1/16)K^{-1} and K > 512, √(1 − 32κ⁴) > 0.99, and hence √(a_{k+1}κ) > 0.99√(a_kκ). Now we show that b_{k+1} will not leave ((1/4)√(a_{k+1}κ), 2√(a_{k+1}κ)). There are three cases to consider:

1. When |b_k| ∈ (√(a_kκ), 2√(a_kκ)): along with |b_k| < K^{-1}, Condition B.3 is satisfied, so by Lemma 6, |b_{k+1}| ≤ |b_k|(1 − 2b_k²) < |b_k|(1 − b_k²). Since b_k² > a_kκ > κ^{7/2} > 32κ⁴, we have 1 − 32κ⁴ > 1 − b_k² > 1 − 2b_k² + b_k⁴ = (1 − b_k²)², so 1 − b_k² < √(1 − 32κ⁴). Hence |b_{k+1}| < 2√(a_kκ)(1 − b_k²) < 2√(1 − 32κ⁴)√(a_kκ) < 2√(a_{k+1}κ). For the lower bound, |b_{k+1}| ≥ |b_k|(1 − 30b_k²) > (1/2)|b_k| > (1/2)√(a_kκ) > (1/4)√(a_{k+1}κ).

2. When |b_k| ∈ [(1/(2√2))√(a_kκ), √(a_kκ)]: since |b_{k+1} − b_k| < (1/16)√(a_kκ) by Lemma 10, the triangle inequality gives |b_{k+1}| ∈ [((1/(2√2)) − (1/16))√(a_kκ), (17/16)√(a_kκ)]. Since ((1/(2√2)) − (1/16))√(a_kκ) > (1/4)√(a_kκ) ≥ (1/4)√(a_{k+1}κ) and (17/16)√(a_kκ) < 2√(a_{k+1}κ) (as √(a_{k+1}κ) > 0.99√(a_kκ)), we have |b_{k+1}| ∈ ((1/4)√(a_{k+1}κ), 2√(a_{k+1}κ)).

3. When |b_k| ∈ ((1/4)√(a_kκ), (1/(2√2))√(a_kκ)): along with |b_k| < K^{-1}, Condition B.4 is satisfied, and thus |b_{k+1}| > |b_k| + |a_kb_k|κ > |b_k|. Since √(a_{k+1}κ) < √(a_kκ), |b_{k+1}| > (1/4)√(a_{k+1}κ). On the other side, since |b_{k+1} − b_k| < (1/16)√(a_kκ) and |b_k| < (1/(2√2))√(a_kκ), the triangle inequality gives |b_{k+1}| < ((1/16) + (1/(2√2)))√(a_kκ) < 2√(a_{k+1}κ), where the last step holds since √(a_{k+1}κ) > 0.99√(a_kκ).

Summarizing the three cases, we know that b_{k+1} ∈ ((1/4)√(a_{k+1}κ), 2√(a_{k+1}κ)). By the assumption of P(k) we also have a_k > κ^{5/2} and a_k ≤ a₀ − (1/16)(k − 1)κ^{13/2}. Since b_k > (1/4)√(a_kκ), we have b_k²κ³ > (1/16)a_kκ⁴ > (1/16)κ^{13/2}. Thus a_{k+1} < a_k − b_k²κ³ < a_k − (1/16)κ^{13/2} ≤ a₀ − (1/16)kκ^{13/2}. Therefore, unless a_{k+1} < κ^{5/2}, P(k + 1) holds. Note that there must be some t such that a_t < κ^{5/2}, since for t > 16a₀κ^{-13/2} + 1 we would have a₀ − (1/16)(t − 1)κ^{13/2} < 0. We induct on k from 1; the base case holds by the initialization of a₀ and b₀. Let T be the smallest t such that a_t < κ^{5/2}, at which we terminate the induction. Then for all t < T, a_t ∈ (κ^{5/2}, a₀] and b_t ∈ ((1/4)√(a_tκ), 2√(a_tκ)). This concludes the proof.

Corollary B.3. Following the initialization condition and notation of Lemma 11, if a₀ > 2κ^{5/2}, there exists some τ < T such that a_τ ∈ ((3/2)κ^{5/2}, 2κ^{5/2}).

Proof of Corollary B.3. We will follow the notation defined in the proof of Lemma 11. By definition of T, P(t) holds for all t < T. Then for all t < T, we have a_t > κ^{5/2}, b_t < K^{-1}, and a_{t+1} > a_t − 8b_t²κ³. Hence |a_{t+1} − a_t| < 8K^{-2}κ³ < (1/2)κ^{5/2}, since K > 512.
Since a₀ > 2κ^{5/2} and a_T < κ^{5/2} < (3/2)κ^{5/2}, combining with |a_{t+1} − a_t| < (1/2)κ^{5/2} we know that there must exist some τ such that a_τ ∈ ((3/2)κ^{5/2}, 2κ^{5/2}).

Corollary B.4. Following Lemma 11, if a₀ ∈ ((3/2)κ^{5/2}, 2κ^{5/2}], then T > (1/128)κ^{-4}.

Proof of Corollary B.4. We will follow the notation defined in the proof of Lemma 11. By definition of T, P(t) holds for all t < T. Then for all t < T, we have a_t < a₀ ≤ 2κ^{5/2} and b_t < 2√(a_tκ) ≤ 2√2·κ^{7/4}. It follows that a_{t+1} > a_t − 8b_t²κ³ > a_t − 64κ^{13/2}. Since a₀ − a_T > (3/2)κ^{5/2} − κ^{5/2} = (1/2)κ^{5/2}, we conclude T > ((1/2)κ^{5/2})/(64κ^{13/2}) = (1/128)κ^{-4}.

B.5 PHASE 2: CONVERGENCE ALONG THE PARABOLA

In this section, we will show that (a, b) slowly moves along the parabola and eventually converges to a point with sharpness just below the EoS threshold 2/η = 2/κ². To facilitate the analysis, we define the residual ξ ≜ b² − (1/2)aκ − (1/16)κ⁴ and consider a small perturbation constant threshold δ = 0.04. From Corollary B.1 we have that, for any ϵ < 0.5 and any κ, a, b satisfying Condition B.1,

ξ′′ = b′′² − (1/2)a′′κ − (1/16)κ⁴
= (b − 16b³ + 8abκ)² + O(b)·[O(b⁴) + O(ab²κ) + O(a²bκ²) + O(b²κ⁴) + O(bκ⁵) + O(ϵb²κ³)] − (1/2)(a − 4b²κ³)κ + κ·[O(ϵb²κ³) + O(b³κ³) + O(b²κ⁴)] − (1/16)κ⁴
= b² − 32b⁴ + 16ab²κ − (1/2)aκ + 2b²κ⁴ − (1/16)κ⁴ + O(b⁵) + O(ab³κ) + O(a²b²κ²) + O(b³κ⁴) + O(b²κ⁵) + O(ϵb³κ³) + O(ϵb²κ⁴)
= (1 − 32b²)(b² − (1/2)aκ − (1/16)κ⁴) + O(b⁵) + O(ab³κ) + O(a²b²κ²) + O(b³κ⁴) + O(b²κ⁵) + O(ϵb³κ³) + O(ϵb²κ⁴)
= (1 − 32b²)ξ + O(b⁵) + O(ab³κ) + O(a²b²κ²) + O(b³κ⁴) + O(b²κ⁵) + O(ϵb³κ³) + O(ϵb²κ⁴).

When |b| < κ, the above expression can be further reduced to

ξ′′ = (1 − 32b²)ξ + O(b⁵) + O(ab³κ) + O(a²b²κ²) + O(b³κ⁴) + O(ϵb²κ⁴).
Hence there exists an absolute constant K such that for all ϵ < 0.5 and all a, b, κ satisfying Condition B.1 with |b| < κ, we have ξ′′ = (1 − 32b²)ξ + R_ξ where

|R_ξ| < K|b⁵| + K|ab³κ| + Ka²b²κ² + K|b³|κ⁴ + Kϵb²κ⁴. (94)

Now fix δ = 0.04; we will determine a regime in which |R_ξ| is less than δb²κ⁴.

Lemma 12. For any κ, a, b satisfying κ < (1/(80√2))δK^{-1}, |b| < 2√2·κ^{7/4}, |a| < 2κ^{5/2}, where δ = 0.04, we have

|ξ′′ − (1 − 32b²)ξ| < δb²κ⁴. (95)

Proof of Lemma 12. Fix ϵ = (1/5)δK^{-1}; we claim that κ, a, b in the given regime satisfy Condition B.1. We check the conditions one by one: (i) Since K > 512, fixing ϵ = (1/5)δK^{-1} satisfies ϵ < 0.5. (ii) With both δ and K^{-1} less than 1, κ < (1/(80√2))δK^{-1} < 0.1. Also κ < (1/(80√2))δK^{-1} < (1/5)δK^{-1} = ϵ ≤ ϵ^{1/4}. Thus κ < min{0.1, ϵ^{1/4}}. (iii) With ϵ = (1/5)δK^{-1} and κ < (1/(80√2))δK^{-1}, ϵκ^{-1} > 16√2 > 2κ^{5/2} > |a|. Thus |a| < ϵκ^{-1}. (iv) Since ϵκ^{-2} ≥ ϵκ^{-1} > 16√2 and κ < 1, (1/5)ϵκ^{-2} > (16√2)/5 > 1 > 2√2κ^{7/4} > |b|. Thus |b| < min{1, (1/5)ϵκ^{-2}}.

Thus Eq. (94) applies, and we only need to bound every term on its RHS by (1/5)δb²κ⁴ to complete the proof. We will do that term by term. (i) Since |b| < 2√2κ^{7/4}, |b|³ < 16√2κ^{21/4}; as κ^{5/4} < κ < (1/(80√2))δK^{-1}, we get K|b|³ < 16√2Kκ^{5/4}κ⁴ ≤ (1/5)δκ⁴, hence K|b⁵| = K|b|³b² < (1/5)δb²κ⁴. (ii) Since |a| < 2κ^{5/2} and |b| < 2√2κ^{7/4}, K|a||b|κ < 4√2Kκ^{21/4} = 4√2Kκ^{5/4}κ⁴ < (1/5)δκ⁴, hence K|ab³κ| < (1/5)δb²κ⁴. (iii) Since κ < (1/(80√2))δK^{-1} < 1, we have κ³ < κ < (1/20)δK^{-1}; multiplying by 4κ² gives 4κ⁵ < (1/5)K^{-1}δκ². Since |a| < 2κ^{5/2}, a² < 4κ⁵ < (1/5)K^{-1}δκ²; multiplying by Kb²κ² gives Ka²b²κ² < (1/5)δb²κ⁴. (iv) Since 2√2κ^{3/4} < 1, |b| < 2√2κ^{7/4} < κ < (1/(80√2))δK^{-1} < (1/5)δK^{-1}; multiplying by Kb²κ⁴ gives K|b³κ⁴| < (1/5)δb²κ⁴. (v) Since we fixed ϵ = (1/5)δK^{-1}, Kϵb²κ⁴ = (1/5)δb²κ⁴. Therefore

|R_ξ| < K|b⁵| + K|ab³κ| + Ka²b²κ² + K|b³|κ⁴ + Kϵb²κ⁴ < δb²κ⁴. (96)

Plugging back into ξ′′ = (1 − 32b²)ξ + R_ξ completes the proof.
Here we also record a contraction estimate: when additionally |ξ| > (1/16)δκ⁴ and |b| > (1/2)κ^{7/4}, Lemma 12 yields

|ξ′′| < |(1 − 32b²)ξ| + δb²κ⁴
= (1 − 32b²)|ξ| + δb²κ⁴ (since 32b² < 1)
= |ξ| − 16b²|ξ| − (16b²|ξ| − δb²κ⁴)
< |ξ| − 16b²|ξ| − b²(16·((1/16)δκ⁴) − δκ⁴) (since |ξ| > (1/16)δκ⁴)
= |ξ| − 16b²|ξ|
< |ξ| − 4κ^{7/2}|ξ| (since |b| > (1/2)κ^{7/4})
= (1 − 4κ^{7/2})|ξ|. (100)
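The contraction of the residual is easy to observe numerically. A minimal sketch under the truncated two-step map of Corollary B.2 (remainder terms dropped; κ and the starting point are illustrative only, well outside the theory's K-dependent range):

```python
# Residual xi = b^2 - a*kappa/2 - kappa^4/16 should contract roughly like
# (1 - 32 b^2) per two-step update, consistent with Eq. (100).
kappa = 0.1
a = 0.005
xi0 = 1.5e-4                                  # start off the parabola
b = (a * kappa / 2 + kappa**4 / 16 + xi0) ** 0.5

xi = xi0
for _ in range(200):
    a, b = a - 4 * b**2 * kappa**3, b - 16 * b**3 + 8 * a * b * kappa
    xi = b**2 - a * kappa / 2 - kappa**4 / 16

print(xi / xi0)   # noticeably smaller than 1: |xi| has contracted
```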

B.5.1 PHASE II STAGE 1

In this stage we show that after |b| gets close to √(aκ/2) while a decreases to around 2κ^{5/2} in Phase 1 of the convergence, the residual of (a, b) to the parabola further decreases to below (1/8)δκ⁴.

Lemma 13. For any κ < (1/(80√2))δK^{-1} and any initialization (a₀, b₀) such that a₀ ∈ ((3/2)κ^{5/2}, 2κ^{5/2}) and |b₀| ∈ ((1/4)√(a₀κ), 2√(a₀κ)), let T be the time at which a exits (κ^{5/2}, 2κ^{5/2}), as characterized in Lemma 11 and Corollary B.4. Then there exists some τ < T such that |ξ_τ| < (1/8)δκ⁴.

Proof of Lemma 13. First note that since κ < (1/(80√2))δK^{-1} < (1/16)K^{-1}, the given initialization condition is a subset of the valid initializations for Lemma 11. Thus for all t < T, we have a_t ∈ (κ^{5/2}, 2κ^{5/2}) and |b_t| ∈ ((1/4)√(a_tκ), 2√(a_tκ)), so Lemma 12 applies, and by Eq. (100) the residual contracts by a factor of at least (1 − 4κ^{7/2}) per two-step update while |ξ_t| > (1/16)δκ⁴. Since initially |ξ| ≤ b₀² + (1/2)a₀κ + (1/16)κ⁴ ≤ 10κ^{7/2}, after

τ = log_{(1−4κ^{7/2})}(((1/8)δκ⁴)/(10κ^{7/2})) = log((1/80)δκ^{1/2})/log(1 − 4κ^{7/2}) = log(80δ^{-1}κ^{-1/2})/(−log(1 − 4κ^{7/2}))

two-step updates, |ξ| has decreased below (1/8)δκ⁴. Since log(1 − 4κ^{7/2}) < −4κ^{7/2}, we have τ < log(80δ^{-1}κ^{-1/2})/(4κ^{7/2}). Substituting δ = 0.04, log(80δ^{-1}κ^{-1/2}) = log(2000) + log(κ^{-1/2}) < 8 + log(κ^{-1/2}). Observe that 8 + log(x) < √x for all x > 175. Since κ < (1/(80√2))δK^{-1} < 1/175², letting x = κ^{-1/2} we have 8 + log(κ^{-1/2}) < κ^{-1/4}. Hence log(80δ^{-1}κ^{-1/2}) < κ^{-1/4} and τ < κ^{-1/4}/(4κ^{7/2}) = (1/4)κ^{-15/4}. Since T > (1/128)κ^{-4} by Corollary B.4 and κ^{-1/4} > 32 in this regime, we conclude τ < T, which completes the proof.

B.5.2 PHASE II STAGE 2

Lemma 14. For any κ < (1/(80√2))δK^{-1} and any initialization (a₁, b₁) such that a₁ ∈ (κ^{5/2}, 2κ^{5/2}) and |ξ₁| < (1/8)δκ⁴, there exists some T < 48δ^{-1}κ^{-9/2} + 1 such that a_T < −(1/8)(1 − 3δ)κ³, and for all t < T, a_t > −(1/8)(1 − 3δ)κ³ and |ξ_t| < (1/8)δκ⁴.

Proof of Lemma 14. We will prove the claim using induction. Consider the inductive hypothesis, for k ≥ 1, P(k): |ξ_k| < (1/8)δκ⁴ and a_k ∈ (−(1/8)(1 − 3δ)κ³, a₁ − (1/16)δκ⁷(k − 1)]. Note that P(1) holds directly by construction, so we proceed to the inductive step. Assuming P(k) holds, we will show that either a_{k+1} < −(1/8)(1 − 3δ)κ³ or P(k + 1) holds. First we verify that (a_k, b_k) satisfies Condition B.6, the regime of Lemma 12. Since δ = 0.04, |−(1/8)(1 − 3δ)κ³| < (1/8)κ³ < 2κ^{5/2}. Also, since a₁ < 2κ^{5/2} as required, we have |a_k| < max{|−(1/8)(1 − 3δ)κ³|, |a₁ − (1/16)δκ⁷(k − 1)|} < 2κ^{5/2}.
Since |ξ_k| < (1/8)δκ⁴, we have b_k² < (1/2)a_kκ + (1/16)κ⁴ + (1/8)δκ⁴ < (1/2)(2κ^{5/2})κ + κ⁴ < 2κ^{7/2}, so |b_k| < √2κ^{7/4} < 2√2κ^{7/4} and Condition B.6 is satisfied. For the lower bound, since a_k > −(1/8)(1 − 3δ)κ³ and |ξ_k| = |b_k² − ((1/2)a_kκ + (1/16)κ⁴)| < (1/8)δκ⁴, we have

b_k² > (1/2)a_kκ + (1/16)κ⁴ − (1/8)δκ⁴ > (1/2)(−(1/8)(1 − 3δ)κ³)κ + (1/16)κ⁴ − (1/8)δκ⁴ = (1/16)δκ⁴. (102)

Since |b_k| < K^{-1} and |a_k| < K^{-1}κ^{-1}, Condition B.2 is satisfied and we have a_{k+1} < a_k − b_k²κ³ < a_k − (1/16)δκ⁷. Given that a_k ≤ a₁ − (1/16)δκ⁷(k − 1), this yields a_{k+1} ≤ a₁ − (1/16)δκ⁷k.

Next we bound |ξ_{k+1}|. When |ξ_k| ∈ ((1/16)δκ⁴, (1/8)δκ⁴), Lemma 12 gives |ξ_{k+1}| ≤ (1 − 32b_k²)|ξ_k| + δb_k²κ⁴ < |ξ_k| < (1/8)δκ⁴, since 32|ξ_k| > 2δκ⁴. When |ξ_k| ≤ (1/16)δκ⁴, since |ξ_{k+1} − (1 − 32b_k²)ξ_k| < δb_k²κ⁴, we have

|ξ_{k+1} − ξ_k| < δb_k²κ⁴ + 32b_k²|ξ_k| ≤ δb_k²κ⁴ + 32b_k²((1/16)δκ⁴) = 3δb_k²κ⁴. (103)

Since b_k² < 2κ^{7/2} < 1/48, |ξ_{k+1} − ξ_k| < (1/48)·3δκ⁴ = (1/16)δκ⁴. Since |ξ_k| ≤ (1/16)δκ⁴, we have |ξ_{k+1}| ≤ |ξ_{k+1} − ξ_k| + |ξ_k| ≤ (1/8)δκ⁴ as desired.

In summary, P(k) implies P(k + 1) unless a_{k+1} < −(1/8)(1 − 3δ)κ³. Note that there must exist some t such that a_{t+1} < −(1/8)(1 − 3δ)κ³, since for any τ ≥ 48δ^{-1}κ^{-9/2} + 1, if the induction proceeded to P(τ) we would have a₁ − (1/16)δκ⁷(τ − 1) ≤ a₁ − 3κ^{5/2} < −κ^{5/2} < −(1/8)(1 − 3δ)κ³, which violates P(τ). Let T < 48δ^{-1}κ^{-9/2} + 1 be the first t such that a_t < −(1/8)(1 − 3δ)κ³; then by construction P(t) holds for all t < T. This completes the proof of the lemma.

Corollary B.6. Following Lemma 14, a_{T−1} ∈ (−(1/8)(1 − 3δ)κ³, −(1/10)(1 + 2δ)κ³).

Proof of Corollary B.6. Denote τ = T − 1. By definition of T, we know P(τ) holds, and therefore b_τ² < 2κ^{7/2}, a_τ > −(1/8)(1 − 3δ)κ³, and a_T > a_τ − 5b_τ²κ³. Combining the above, a_T > a_τ − 10κ^{13/2}, so a_τ < a_T + 10κ^{13/2} < −(1/8)(1 − 3δ)κ³ + 10κ^{13/2}. Note that since we set δ = 0.04,

−(1/10)(1 + 2δ)κ³ − (−(1/8)(1 − 3δ)κ³) = ((1/40) − ((2/10) + (3/8))δ)κ³ = (1/500)κ³. (104)

Since κ^{-7/2} > K^{7/2} > 5000, we have 10κ^{13/2} < (1/500)κ³ = −(1/10)(1 + 2δ)κ³ − (−(1/8)(1 − 3δ)κ³). Adding −(1/8)(1 − 3δ)κ³ to both sides, we have a_τ < −(1/8)(1 − 3δ)κ³ + 10κ^{13/2} < −(1/10)(1 + 2δ)κ³.
Combining with a_τ > −(1/8)(1 − 3δ)κ³ concludes the proof.
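Stage 2 can be illustrated numerically with the truncated two-step map of Corollary B.2 (remainder terms dropped; κ is illustrative and far larger than the theory requires): starting near the parabola, a drifts down, and the last iterate before crossing −(1/8)(1 − 3δ)κ³ falls in the window of Corollary B.6.

```python
kappa, delta = 0.1, 0.04
a = 1.8 * kappa**2.5                             # a_1 in ((3/2)k^{5/2}, 2k^{5/2})
b = (a * kappa / 2 + kappa**4 / 16) ** 0.5       # |xi| ~ 0 initially

lo = -(1 / 8) * (1 - 3 * delta) * kappa**3       # = -0.110 * kappa^3
hi = -(1 / 10) * (1 + 2 * delta) * kappa**3      # = -0.108 * kappa^3

prev = a
for _ in range(200_000):                         # safety cap on the iteration count
    if a < lo:
        break
    prev = a
    a, b = a - 4 * b**2 * kappa**3, b - 16 * b**3 + 8 * a * b * kappa

print(lo <= prev < hi)  # the iterate before the crossing lies in the window
```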

B.5.3 PHASE II STAGE 3

Here we state the lemma that establishes the final convergence of the two-step trajectory. The proof is very similar to that of Lemma 7, except that a is now negative and |b| is decreasing.

Lemma 15 (Final Convergence). For all κ < (1/16)K^{-1} and all (a₀, b₀) satisfying a₀ ∈ (−(1/8)(1 − 3δ)κ³, −(1/10)(1 + 2δ)κ³) and |ξ₀| = |b₀² − (1/2)a₀κ − (1/16)κ⁴| < (1/8)δκ⁴, for all ϵ > 0 there exists some T < 25log(ϵ^{-1})κ^{-4} such that for all t ≥ T, |b_t| < ϵ and a_t ∈ (−(5/3)κ³, −(1/10)κ³).

Proof. We will prove the claim using induction. Consider the inductive hypothesis P(k): |b_k| ≤ |b₀|(1 − (1/10)(1 + 2δ)κ⁴)^k and a₀ − a_k ∈ [0, 8√2κ(|b₀| − |b_k|)). When k = 0 the statement holds trivially, so we proceed to the inductive step. Assume P(k) holds for some k ∈ ℕ; we want to show that P(k + 1) holds as well.

First we check that (a_k, b_k) with κ < (1/16)K^{-1} satisfies Condition B.4, so that Lemma 7 applies. Since a is monotonically decreasing and negative, |a_k| ≥ |a₀|; thus to show |b_k| < (1/(2√2))√(|a_k|κ) one only needs to show the k = 0 case, which is equivalent to b₀² < (1/8)|a₀|κ. From the initialization condition, |ξ₀| = |b₀² − (1/2)a₀κ − (1/16)κ⁴| < (1/8)δκ⁴. It follows that

b₀² < (1/2)a₀κ + (1/16)κ⁴ + (1/8)δκ⁴ = (5/8)a₀κ + (1/16)(1 + 2δ)κ⁴ − (1/8)a₀κ
< (5/8)(−(1/10)(1 + 2δ)κ³)κ + (1/16)(1 + 2δ)κ⁴ − (1/8)a₀κ (since a₀ < −(1/10)(1 + 2δ)κ³)
= −(1/16)(1 + 2δ)κ⁴ + (1/16)(1 + 2δ)κ⁴ − (1/8)a₀κ = (1/8)|a₀|κ.

From the initialization condition we have a₀ < −(1/10)(1 + 2δ)κ³ = −0.108κ³ < −(1/16)κ³, so |a_k| ≥ |a₀| > (1/16)κ³. For the upper bound on |a_k|, note that a₀ > −(1/8)(1 − 3δ)κ³ > −(1/8)κ³ by initialization. Since b₀² < (1/8)|a₀|κ < (1/64)(1 − 3δ)κ⁴ < ((1/8)κ²)², we have |b₀| < (1/8)κ². Combining with the inductive hypothesis, a_k > a₀ − 8√2κ|b₀| > −(1/8)κ³ − √2κ³, and hence |a_k| < (5/3)κ³ < K^{-2}κ^{-1}, since we assumed κ < (1/16)K^{-1}. Thus Condition B.4 holds, and by Lemma 7 we have |b_{k+1}| < |b_k(1 + 8a_kκ)| + 7|a_kb_k|κ = |b_k|(1 + a_kκ) and |b_k| − |b_{k+1}| > |a_k||b_k|κ.

Since 1 + a_kκ > 0 (as |a_k| < (5/3)κ³ and κ < K^{-1} < 1/512), we may write the update of b as |b_{k+1}| < |b_k|(1 + a_kκ). Since a_k ≤ a₀ < −(1/10)(1 + 2δ)κ³, we have |b_{k+1}| < |b_k|(1 − (1/10)(1 + 2δ)κ⁴). Combining with the inductive hypothesis |b_k| ≤ |b₀|(1 − (1/10)(1 + 2δ)κ⁴)^k, we get |b_{k+1}| ≤ |b₀|(1 − (1/10)(1 + 2δ)κ⁴)^{k+1}. Since Condition B.4 is stronger than Condition B.2, by Lemma 5 we have a_k − a_{k+1} < 8b_k²κ³. Combining with |b_k| − |b_{k+1}| > |a_k||b_k|κ,

(a_k − a_{k+1})/(|b_k| − |b_{k+1}|) < 8b_k²κ³/(|a_kb_k|κ) = 8|b_k|κ²/|a_k|
< 8·(1/(2√2))√(|a_k|κ)·κ²/|a_k| (since |b_k| < (1/(2√2))√(|a_k|κ))
= 2√2·|a_k|^{-1/2}κ^{5/2}
< 2√2·((1/16)κ³)^{-1/2}κ^{5/2} (since |a_k| > (1/16)κ³)
= 8√2κ.

Since the inductive hypothesis gives (a₀ − a_k)/(|b₀| − |b_k|) < 8√2κ, it follows by the mediant inequality that (a₀ − a_{k+1})/(|b₀| − |b_{k+1}|) < 8√2κ, and a₀ − a_{k+1} ≥ 0 by monotonicity. Thus we have shown P(k) → P(k + 1), and by induction P(k) holds for every k ∈ ℕ.

Now we can wrap up the convergence analysis leveraging this property. Since |b_t| ≤ |b₀|(1 − (1/10)(1 + 2δ)κ⁴)^t for all t, for any ϵ > 0 we may pick T > log_{(1−(1/10)(1+2δ)κ⁴)}(ϵ/|b₀|) such that for all t > T, |b_t| < ϵ. Since |b₀| < (1/8)κ² < 1, we have

T < log(ϵ)/log(1 − (1/10)(1.08)κ⁴) < 25log(ϵ^{-1})κ^{-4}.

For the region of final convergence: for any t, we know from P(t) that a₀ − a_t ∈ [0, 8√2κ(|b₀| − |b_t|)), so a_t > a₀ − 8√2κ|b₀|. Since |b₀| < (1/8)κ² and a₀ > −(1/8)κ³ by initialization, it follows that a_t > −(1/8)κ³ − √2κ³ > −(5/3)κ³. The upper bound a_t < −(1/10)κ³ is immediate since a is monotonically decreasing and a₀ < −(1/10)(1 + 2δ)κ³ < −(1/10)κ³. Thus, in summary, we have shown that for any ϵ > 0 there exists some T < 25log(ϵ^{-1})κ^{-4} such that for all t > T, |b_t| < ϵ and a_t ∈ (−(5/3)κ³, −(1/10)κ³).
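Stage 3 can likewise be sketched numerically with the truncated two-step map of Corollary B.2 (remainder terms dropped; κ and the starting point are illustrative only): with a negative, |b| decays geometrically at a rate of order κ⁴ while a barely moves.

```python
kappa = 0.1
a = -0.12 * kappa**3                                  # inside the final interval
b = max(a * kappa / 2 + kappa**4 / 16, 0.0) ** 0.5    # near the parabola

b0 = b
for _ in range(5_000):
    a, b = a - 4 * b**2 * kappa**3, b - 16 * b**3 + 8 * a * b * kappa

# |b| shrinks by roughly (1 + 8*a*kappa) per update while a stays put.
print(b / b0, a / kappa**3)
```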

B.6 PROOF OF THEOREM 3.1 AND ITS COROLLARIES

With all the lemmas ready, we may now prove Theorem 3.1 and its corollaries. We first restate Theorem 3.1.

Theorem 3.1 (Sharpness Concentration). For a large enough absolute constant $K$, suppose $\kappa < \frac{1}{2000\sqrt{2}}K^{-1}$, and the initialization $(a_0, b_0)$ satisfies $a_0 \in (12\kappa^{\frac{5}{2}}, \frac{1}{4}K^{-2}\kappa^{-1})$ and $b_0 \in (-K^{-1}, K^{-1})\setminus\{0\}$. Consider the GD trajectory characterized in Eq. (6) with fixed step size $\kappa^2$ from $(a_0, b_0)$. For any $\epsilon > 0$ there exists $T = O\big(K^{-2}\kappa^{-\frac{15}{2}} + \log(\epsilon^{-1}) + \log(|b_0|^{-1})\kappa^{-\frac{7}{2}}\big)$ such that for all $t > T$, $|b_t| < \epsilon$ and $a_t \in (-\frac{5}{3}\kappa^3, -\frac{1}{10}\kappa^3)$.

Given the lemmas discussed above, the proof of the main theorem is short.

Proof of Theorem 3.1. Consider any initialization $(a_0, b_0)$ satisfying $a_0 \in (12\kappa^{\frac{5}{2}}, \frac{1}{4}K^{-2}\kappa^{-1})$ and $b_0 \in (-K^{-1}, K^{-1})\setminus\{0\}$. We abuse notation and let $a_t, b_t$ denote the values of $a$ and $b$ after the $t$-th two-step update from $(a_0, b_0)$.

If $|b_0| \geq 2\sqrt{a_0\kappa}$, then by Lemma 8 there exists some $\tau_1 < \kappa^{-4}$ such that $a_{\tau_1} \in (2\kappa^{\frac{5}{2}}, \frac{1}{4}K^{-2}\kappa^{-1})$ and $|b_{\tau_1}| \in (\sqrt{a_{\tau_1}\kappa}, 2\sqrt{a_{\tau_1}\kappa})$; otherwise, by Lemma 9 there exists some $\tau_2 < \frac{1}{2}\log(|b_0|^{-1})\kappa^{-\frac{7}{2}}$ after which $|b|$ enters the same range. Combining the two cases, there exists $T_1 = O(\kappa^{-4} + \log(|b_0|^{-1})\kappa^{-\frac{7}{2}})$ such that $|b_{T_1}| \in (\frac{1}{4}\sqrt{a_{T_1}\kappa}, 2\sqrt{a_{T_1}\kappa})$. This completes phase 1 of convergence.

For phase 2, by Lemma 11 and Corollary B.3 there exists some $\tau_3 \leq a_0\kappa^{-\frac{13}{2}}$ such that within $\tau_3$ two-step updates from $(a_{T_1}, b_{T_1})$ we have $a_{T_2} \in (\kappa^{\frac{5}{2}}, 2\kappa^{\frac{5}{2}})$, where $T_2 = T_1 + \tau_3$. Then by Lemma 13 there exists $\tau_4 < |a_{T_2}|\kappa^{-\frac{13}{2}} < 2\kappa^{-6}$ such that within $\tau_4$ steps from $(a_{T_2}, b_{T_2})$ the residual satisfies $|\xi_{T_2+\tau_4}| < \frac{1}{8}\delta\kappa^4$, where we fix $\delta = 0.04$. Let $T_3 = T_2 + \tau_4$.

After the residual $|\xi|$ decreases below $\frac{1}{8}\delta\kappa^4$ within $T_3$ two-step updates, by Lemma 14 and Corollary B.6 there exists some $\tau_5 < 48\delta^{-1}\kappa^{-\frac{9}{2}}$ such that $a_{T_3+\tau_5} \in (-\frac{1}{8}(1-3\delta)\kappa^3, -\frac{1}{10}(1+2\delta)\kappa^3)$ while $|\xi_{T_3+\tau_5}| < \frac{1}{8}\delta\kappa^4$. Let $T_4 = T_3 + \tau_5$. Finally, by Lemma 15, starting from $(a_{T_4}, b_{T_4})$ there exists some $\tau_6 < 25\log(\epsilon^{-1})$ such that for any $t > T_4 + \tau_6$, $|b_t| < \epsilon$ and $a_t \in (-\frac{5}{3}\kappa^3, -\frac{1}{10}\kappa^3)$. Thus the trajectory converges to some minimum with $a \in (-\frac{5}{3}\kappa^3, -\frac{1}{10}\kappa^3)$. It remains to bound the total number of steps required for convergence.
Since $|a_0| < \frac{1}{4}K^{-2}\kappa^{-1}$, we have $\tau_3 < \frac{1}{4}K^{-2}\kappa^{-\frac{15}{2}}$. Moreover, since $\kappa < K^{-1}$, we have $\kappa^{-4} = O(K^{-2}\kappa^{-\frac{15}{2}})$. Since $T = T_1 + \tau_3 + \tau_4 + \tau_5 + \tau_6$, we have
$$T = O\Big(K^{-2}\kappa^{-\frac{15}{2}} + \log(|b_0|^{-1})\kappa^{-\frac{7}{2}} + \log(\epsilon^{-1})\Big). \tag{108}$$
This completes the proof of the theorem.

B.6.1 PROOF OF COROLLARY 3.1

Before proving Corollary 3.1, we first show a simple lemma on the proximity of $x$ and $c(x, y) \triangleq \sqrt{x^2 - y^2}$ when $x$ is large and $|1 - xy|$ is small. Recall that $c$ was defined in Eq. (31), and the $(a, b)$ coordinates we have been working in are the offset from the $\kappa^2$-EoS minimum in the $(c, d)$ coordinates.

Lemma 16 (Proximity of $c$ to $x$). For any large constant $K > 512$, fix any $\kappa < \frac{1}{2000\sqrt{2}}K^{-1}$. For any $x \in (\frac{\sqrt{2}}{2}\kappa^{-1}, 2\kappa^{-1})$ and any $y$ such that $|1 - xy| < K^{-1}$, we have $c(x, y) \triangleq \sqrt{x^2 - y^2} \in (x - 32\kappa^3, x)$.

Proof of Lemma 16. Since $xy < 1 + K^{-1}$ and $x > \frac{\sqrt{2}}{2}\kappa^{-1}$, we must have $y < 2(1 + K^{-1})\kappa < 2\sqrt{2}\kappa$, where the last step holds since we assumed $K > 512$. Meanwhile $1 - xy < K^{-1}$ also implies $xy > 1 - K^{-1} > 0$, so $y > 0$. Thus
$$c(x, y) = \sqrt{x^2 - y^2} > \sqrt{x^2 - (2\sqrt{2}\kappa)^2} = x\sqrt{1 - 8\kappa^2x^{-2}} > x\sqrt{1 - 16\kappa^4} > x(1 - 16\kappa^4),$$
where the last two inequalities hold since $x > \frac{\sqrt{2}}{2}\kappa^{-1}$ (so $x^{-2} < 2\kappa^2$) and $\sqrt{u} > u$ for $u \in (0, 1)$. Thus $c(x, y) > x - 16x\kappa^4 > x - 32\kappa^3$ since we assume $x < 2\kappa^{-1}$. Since $y > 0$ gives $\sqrt{x^2 - y^2} < x$, we conclude $c(x, y) \in (x - 32\kappa^3, x)$.

Now we can proceed to prove Corollary 3.1. We first restate the result:

Corollary 3.1 (Sharpness Concentration under $(x, y)$-Parameterization). For a large enough absolute constant $K$, suppose $\eta < \frac{1}{8000000}K^{-2}$, and the initialization $(x_0, y_0)$ satisfies $x_0 \in (x + 13\eta^{\frac{5}{4}}, x + \frac{1}{5}K^{-2}\eta^{-\frac{1}{2}})$ and $|x_0y_0 - 1| \in (0, K^{-1})$, where $(x, y)$ is the $\eta$-EoS minimum defined in Definition 1. The GD trajectory characterized in Eq. (2) with fixed step size $\eta$ from $(x_0, y_0)$ converges to a global minimum with sharpness $\lambda \in (\frac{2}{\eta} - \frac{20}{3}\eta, \frac{2}{\eta})$.

Proof of Corollary 3.1.
Following the convergence proof of Theorem 3.1, to prove this corollary we only need to show that every initialization $(x_0, y_0)$ satisfies the initialization condition of Theorem 3.1 after reparameterization to $(a, b)$, and that the sharpness $\lambda$ of the resulting minimum satisfies $\lambda \in (\frac{2}{\eta} - \frac{20}{3}\eta, \frac{2}{\eta})$.

We first check that the initialization in the $(x, y)$ coordinates satisfies the initialization condition of Theorem 3.1. Due to the different contexts, we will use $\kappa = \sqrt{\eta}$ and $\eta$ itself interchangeably. Recall from Eq. (27) that $x = \frac{\sqrt{2}}{2}\big((-4 + \eta^{-2})^{\frac{1}{2}} + \eta^{-1}\big)^{\frac{1}{2}}$. It is not hard to check that $x > \frac{\sqrt{2}}{2}\kappa^{-1}$ and $x < \kappa^{-1}$ where $\kappa = \sqrt{\eta}$. Since $|x_0y_0 - 1| < K^{-1}$ as required by the initialization, Lemma 16 gives $c_0 \triangleq c(x_0, y_0) \in (x_0 - 32\kappa^3, x_0)$. By the same reasoning we also have $c \triangleq c(x, y) \in (x - 32\kappa^3, x)$. Thus $x_0 > x + 13\kappa^{\frac{5}{2}}$. (121)

By a tedious calculation we have
$$\frac{\partial^5\delta}{\partial\kappa^5}(x) = \frac{105\big(16\gamma^2\kappa^3 + 4\beta^3\rho_1\big)^5}{32\phi^9} - \frac{75\big(16\gamma^2\kappa^3 + 4\beta^3\rho_1\big)^3\big(48\gamma^2\kappa^2 + 12\beta^2\rho_1^2 + 4\beta^3\rho_2\big)}{8\phi^7} + \frac{45\big(16\gamma^2\kappa^3 + 4\beta^3\rho_1\big)\big(48\gamma^2\kappa^2 + 12\beta^2\rho_1^2 + 4\beta^3\rho_2\big)^2}{8\phi^5} + \frac{15\big(16\gamma^2\kappa^3 + 4\beta^3\rho_1\big)^2\big(96\gamma^2\kappa + 24\beta\rho_1^3 + 36\beta^2\rho_1\rho_2 + 4\beta^3\rho_3\big)}{4\phi^5} - \frac{5\big(48\gamma^2\kappa^2 + 12\beta^2\rho_1^2 + 4\beta^3\rho_2\big)\big(96\gamma^2\kappa + 24\beta\rho_1^3 + 36\beta^2\rho_1\rho_2 + 4\beta^3\rho_3\big)}{2\phi^3} - \frac{5\big(16\gamma^2\kappa^3 + 4\beta^3\rho_1\big)\big(96\gamma^2 + 24\rho_1^4 + 144\beta\rho_1^2\rho_2 + 36\beta^2\rho_2^2 + 48\beta^2\rho_1\rho_3 + 4\beta^3\rho_4\big)}{4\phi^3} + \frac{240\rho_1^3\rho_2 + 360\beta\rho_1\rho_2^2 + 240\beta\rho_1^2\rho_3 + 120\beta^2\rho_2\rho_3 + 60\beta^2\rho_1\rho_4 + 4\beta^3\rho_5}{2\phi}. \tag{122}$$

Since $\kappa < 0.1$, we have $\alpha = (1 - 4x^4)^{\frac{1}{4}} > 0.99$ and $\phi = (4\gamma^2x^4 + \beta^4)^{\frac{1}{2}} \geq \beta^2 = (\alpha + ax)^2 \geq \alpha^2 > 0.9$. Thus $\alpha$ and $\phi$ are bounded from below by absolute constants. Since $\kappa$ is bounded from above, $\rho_1, \ldots, \rho_5$ are bounded from above by absolute constants. Now we give upper bounds for $\beta$ and $\gamma$. Since $|x| < 0.1$, we have $(1 - 4x^4)^{\frac{1}{4}} \leq 1$. Since $|a| < \kappa^{-1}$ and $x \leq \kappa$, we have $\beta = \alpha + ax < \alpha + 1 < 2$; since $|b| < 1$, we have $\gamma = 1 + b < 2$. It is straightforward from the construction that all these terms are positive. From Eq. (122), $\frac{\partial^6\delta}{\partial\kappa^6}(x)$ is a finite-degree polynomial in $\beta, \gamma, \kappa, \rho_1, \ldots, \rho_6$, and $\phi^{-1}$.
Since the absolute values of all of these terms are bounded above by constants independent of $x$, $\frac{\partial^6\delta}{\partial\kappa^6}(x)$ is uniformly bounded above by some constant $K$ for all $x \in [0, \kappa]$. This completes the proof of the lemma.

Similarly, we can obtain the lower bound for the 2-step dynamics:
$$c_{t+2} > c_t\big(1 + (1+\alpha(t))(1 - c_t) - 2\eta^2\big)^2\big(1 + (1+\alpha(t+1))(1 - c_{t+1}) - 2\eta^2\big)^2$$
$$> c_t\big(1 + (1+\alpha(t))(1 - c_t) - 2\eta^2\big)^2\big(1 + (1+\alpha(t))(1 - c_{t+1}) - 7\eta^2\big)^2$$
$$> c_t\big(1 + (1+\alpha(t))(1 - c_t)\big)^2\big(1 + (1+\alpha(t))(1 - c_{t+1})\big)^2 - 100\eta^2$$
$$> c_t\big(1 + (1+\alpha(t))(1 - c_t)\big)^2\Big(1 + (1+\alpha(t))\big(1 - c_t(1 + (1+\alpha(t))(1 - c_t) + 10\eta^2)^2\big)\Big)^2 - 100\eta^2$$
$$> c_t\big(1 + (1+\alpha(t))(1 - c_t)\big)^2\Big(1 + (1+\alpha(t))\big(1 - c_t(1 + (1+\alpha(t))(1 - c_t))^2\big)\Big)^2 - 200\eta^2. \tag{129}$$
Having bounded all the small terms by $200\eta^2$, we consider the main part
$$c_t\big(1 + (1+\alpha(t))(1 - c_t)\big)^2\Big(1 + (1+\alpha(t))\big(1 - c_t(1 + (1+\alpha(t))(1 - c_t))^2\big)\Big)^2.$$
To prove that it decreases, we need to show that the factor
$$f(\alpha, c) = \big(1 + (1+\alpha)(1 - c)\big)\Big(1 + (1+\alpha)\big(1 - c(1 + (1+\alpha)(1 - c))^2\big)\Big) < 1$$
when $\alpha \in (-\frac{1}{100}, \min\{\frac{1}{2}K^{-2}, \frac{1}{8}\})$ and $1 + K^{-1} - K^{-2} < c < 1.3$. We first prove that $f(\alpha, c)$ is monotonically increasing in $\alpha$ for $\alpha > -\frac{1}{100}$, i.e. $\frac{\partial f(\alpha, c)}{\partial\alpha} > 0$; we directly give the simplified expression of this partial derivative. Therefore $f(\alpha, c) < f(\frac{1}{2}K^{-2}, c)$ for all $1 + K^{-1} - K^{-2} < c < 1.3$. Denote $\beta = \frac{1}{2}(K^{-2} - 2K^{-3} + K^{-4})$ and suppose $c = 1 + t\sqrt{\beta}$ for some $t > \sqrt{2}$ (since $c > 1 + K^{-1} - K^{-2}$). Meanwhile, since $t\sqrt{\beta} < 0.3$, we have $t < 0.3\beta^{-1/2}$. We then plug $c = 1 + t\sqrt{\beta}$ into $f$ and expand.
$$f(\beta, c) = f(\beta, 1 + t\sqrt{\beta}) = 1 + 2t\beta^{3/2} - 2t^3\beta^{3/2} - 3t^2\beta^2 + t^4\beta^2 + 2t\beta^{5/2} - 5t^3\beta^{5/2} - 6t^2\beta^3 + 4t^4\beta^3 - 3t^3\beta^{7/2} - 3t^2\beta^4 + 6t^4\beta^4 + t^3\beta^{9/2} + 4t^4\beta^5 + t^3\beta^{11/2} + t^4\beta^6$$
$$\leq 1 + (2t - t^3)\beta^{3/2} - t^3\beta^{3/2} - 3t^2\beta^2 + 0.3t^3\beta^{3/2} + 2t\beta^{5/2} - 5t^3\beta^{5/2} - 6t^2\beta^3 + 1.2t^3\beta^{5/2} - 3t^3\beta^{7/2} - 3t^2\beta^4 + 1.8t^3\beta^{7/2} + 0.3t^2\beta^4 + 0.36t^2\beta^4 + 0.3t^2\beta^5 + 0.09t^2\beta^5$$
$$< 1 - 0.7t^3\beta^{3/2} - 3t^2\beta^2 < 1 - \beta^{3/2} - 6\beta^2. \tag{131}$$
Thus $c_{t+2} < c_t(1 - \beta^{3/2} - 6\beta^2)^2 + 200\eta^2 < c_t - \beta^{3/2} < c_t - \frac{1}{10}K^{-3}$, so it takes at most $2 \times 0.3/(\frac{1}{10}K^{-3}) < 6K^3$ steps for $c_t$ to enter the region $c_t \in (1 - K^{-1}, 1 + K^{-1} - K^{-2})$.

Finally, we prove that if $c_s \in (1 - K^{-1}, 1 + K^{-1} - K^{-2})$ for some $s$, then $c_t$ remains controlled for all $t > s$. In the first case we obtain $c_{t_1+2k+2} > \big((1 + \frac{10}{27}K^{-2})^2 - 200\eta^2\big)c_{t_1+2k} > c_{t_1+2k} > 1 + \frac{K^{-1}}{20}$ and the proof is done. Otherwise we have $q \in (\frac{2}{3}, 30 + 40K^{-1})$, and we also have the lower bound
$$f(\gamma, c) \geq f(\gamma, 1 + \tfrac{2}{3}\sqrt{\gamma}) = 1 + 2555200\gamma^2 + 10230400\gamma^3 + 15355200\gamma^4 + 10240000\gamma^5 + 2560000\gamma^6 - 127920\gamma^{3/2} - 319920\gamma^{5/2} - 192000\gamma^{7/2} + 64000\gamma^{9/2} + 64000\gamma^{11/2} > 1 - 100000\gamma^{3/2},$$
where the last inequality uses $K > 512$ and $\gamma^{1/2} = \frac{K^{-1}}{10}$. That means
$$c_{t_1+2k+2} > \big((1 - 100000\gamma^{3/2})^2 - 200\eta^2\big)c_{t_1+2k} > (1 - 200000\gamma^{3/2} - 200\eta^2)\big(1 + \tfrac{2}{3}\sqrt{\gamma}\big) > \Big(1 - \tfrac{200000}{26214400}\gamma^{1/2} - 200\eta^2\Big)\Big(1 + \tfrac{2}{3}\sqrt{\gamma}\Big) > \Big(1 - \tfrac{1}{1000}K^{-1} - \tfrac{1}{4000}K^{-4}\Big)\Big(1 + \tfrac{K^{-1}}{15}\Big) > 1 + \tfrac{K^{-1}}{20},$$
which proves the lower bound on $c_{t_1+2k}$. As for the upper bound, when $t = t_1 + 2k$, $(x, y)$ satisfies all the conditions of Lemma 22. If $c_{t_1+2k} < 1 + K^{-1} - K^{-2}$, we have proved that it will never exceed $1 + K^{-1} - K^{-2}$. Otherwise, if $c_{t_1+2k} \in (1 + K^{-1} - K^{-2}, 1 + 3K^{-1} + 2K^{-2})$, we apply inequality Eq. (131) and conclude $c_{t_1+2k+2} < c_{t_1+2k}$. In this way, by induction we obtain the bound $(x_{t_1+2k}^\top y_{t_1+2k})^2 \in (1 + \frac{1}{20}K^{-1}, 1 + 3K^{-1} + 2K^{-2})$.
As for the upper bound on $(x_{t_1+2k-1}^\top y_{t_1+2k-1})^2$, we directly apply the upper bound of the 1-step dynamics (when $c_t > 1 + \frac{1}{20}K^{-1}$):
$$c_{t+1} \leq c_t\big(1 + (1+\alpha(t))(1 - c_t) + 10\eta^2\big)^2 < c_t(2 - c_t + 10\eta^2)^2 < \Big(1 + \tfrac{K^{-1}}{20}\Big)\Big(1 - \tfrac{K^{-1}}{20}\Big)^2 + 100\eta^2 < 1 - \tfrac{K^{-1}}{20} - \tfrac{1}{400}K^{-2} + \tfrac{1}{8000}K^{-3} + 100\eta^2 < 1 - \tfrac{1}{20}K^{-1}.$$
Therefore the upper bound $(x_{t_1+2k-1}^\top y_{t_1+2k-1})^2 < 1 - \frac{1}{20}K^{-1}$ holds for all $k < t_1/2$.

Finally, we denote $a := \|x\|^2 - \|y\|^2 - (\eta^{-2} - 4)^{\frac{1}{2}}$ and $b := \|x\|\|y\| - 1$. With all the lemmas above, we prove the equivalence between the dynamics of $(a, b)$ and the one-step dynamics of the scalar case (Lemma 1, Lemma 2). For simplicity of notation, when analyzing the 1-step and 2-step dynamics of $(a_t, b_t)$, we use $a, a'$ to denote $a_t, a_{t+1}$ and $b, b'$ to denote $b_t, b_{t+1}$, etc. For simplicity of calculation, we use the change of variable $\kappa = \sqrt{\eta}$.
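The $(a, b)$ reparameterization above can be sketched concretely. The code below assumes the scalar degree-4 objective $L(x, y) = \frac{1}{4}(1 - x^2y^2)^2$ (an assumption of this sketch, chosen to be consistent with the $\eta$-EoS minimum quoted in the introduction) and verifies that the $\eta$-EoS minimum maps to $(a, b) = (0, 0)$ with sharpness exactly $2/\eta$:

```python
import math

# (a, b) offset from the eta-EoS minimum, with c = sqrt(x^2 - y^2), d = x * y.
def to_ab(x, y, eta):
    c_eos = (eta ** -2 - 4) ** 0.25       # c-coordinate of the eta-EoS minimum
    return math.sqrt(x * x - y * y) - c_eos, x * y - 1.0

def eos_minimum(eta):
    # Global minimum with x*y = 1 and x^2 + y^2 = 1/eta, hence sharpness 2/eta.
    x = math.sqrt((1.0 / eta + math.sqrt(eta ** -2 - 4)) / 2.0)
    return x, 1.0 / x

def sharpness_at_min(x, y):
    # At a global minimum the Hessian is (1/2) g g^T with g = grad(x^2 y^2),
    # so the top eigenvalue is |g|^2 / 2 = 2 x^2 y^2 (x^2 + y^2).
    return 2.0 * (x * y) ** 2 * (x * x + y * y)

eta = 0.01
xm, ym = eos_minimum(eta)
a0, b0 = to_ab(xm, ym, eta)               # both ~0 at the EoS minimum
lam = sharpness_at_min(xm, ym)            # equals the stability threshold 2/eta
```

This makes precise the statement that $(a, b)$ measures the offset from the EoS minimum: $a$ moves along the manifold of global minima (changing the sharpness), while $b$ measures the deviation $xy - 1$ from the manifold.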






Figure 1: EoS Phenomena in NN Training. We consider three models: a 5-layer ReLU-activated fully connected network, a 2-layer fully connected linear network with asymmetric initialization factor (4, 0.1) (see Appendix A.1 for explanation), and a 4-layer scalar network equivalent to $\min_{x,y}$ …

Figure 2: EoS phenomenon on the degree-4 model. In (a) we demonstrate sharpness adaptivity by running GD with learning rates $\eta = 2^{-8}, 2^{-10}, 2^{-12}$ from the same initialization. The sharpness of all trajectories converges to around the corresponding stability threshold $2/\eta$ while the loss decreases exponentially. In the 2D trajectory, the 2-step movement quickly converges to smooth curves ending very close to the EoS minimum. In (b) we demonstrate sharpness concentration by running GD with constant learning rate $\eta = 0.2$ for 50000 iterations from a dense grid of initializations and plotting the sharpness of the resulting minima. Initializations in the red shaded area all converge to a minimum with sharpness in $(2/\eta - 0.1, 2/\eta)$.
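The stability threshold $2/\eta$ referenced throughout (see footnote 1) can be illustrated in isolation. The minimal sketch below runs GD on a one-dimensional quadratic of curvature $\lambda$, where each step multiplies the iterate by $1 - \eta\lambda$, so the iterates contract exactly when $\lambda < 2/\eta$ and diverge when $\lambda > 2/\eta$ (the specific curvature values are illustrative choices):

```python
# GD on f(x) = lam * x^2 / 2: each step scales x by (1 - eta * lam), so the
# iteration is stable iff |1 - eta * lam| < 1, i.e. iff lam < 2 / eta.
def gd_quadratic(lam, eta, x0=1.0, steps=100):
    x = x0
    for _ in range(steps):
        x -= eta * lam * x          # gradient step
    return abs(x)

eta = 0.01                          # threshold 2/eta = 200
stable = gd_quadratic(lam=190.0, eta=eta)    # lam < 200: |x| shrinks
unstable = gd_quadratic(lam=210.0, eta=eta)  # lam > 200: |x| blows up
```

The interest of the EoS regime is precisely that neural-network training violates this linear intuition: the trajectory survives with sharpness at or above $2/\eta$ because the objective is not a fixed quadratic.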

Figure 3: Solutions of Eq. (8) (κ = 1)

Phase 1. (Convergence to near parabola) We consider initializations in regions I, II, and III.
• In I, $b'' - b$ is dominated by $-b^3$ and $(a, b)$ follows an exponential trajectory. We show that $|b|$ decreases exponentially with respect to $a$ and enters region II (Lemma 8).
• In III, $b'' - b$ is dominated by $ab\kappa$ and $(a, b)$ follows an elliptic trajectory centered at $(0, 0)$. We show that $|b|$ increases superlinearly with respect to $a$ and enters region II (Lemma 9).
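The convergence to the parabola can be seen by iterating the truncated local two-step update $b'' = b - 16b^3 + 8ab\kappa$, $a'' = a - 4b^2\kappa^3$ (cf. Corollary B.1, with the higher-order error terms dropped — so this is a sketch of the leading-order dynamics, not the exact GD map; the starting point is an illustrative choice in region I):

```python
# Truncated two-step map near the EoS minimum:
#   b'' = b - 16 b^3 + 8 a b kappa,   a'' = a - 4 b^2 kappa^3.
kappa = 0.1
a, b = 0.05, 0.2                                  # region I: |b| large
xi0 = b * b - 0.5 * a * kappa - kappa ** 4 / 16   # initial residual from parabola
for _ in range(2000):
    # simultaneous update (both right-hand sides use the old values)
    a, b = a - 4 * b * b * kappa ** 3, b - 16 * b ** 3 + 8 * a * b * kappa
xi = b * b - 0.5 * a * kappa - kappa ** 4 / 16    # final residual from parabola
```

The residual $\xi = b^2 - \frac{1}{2}a\kappa - \frac{1}{16}\kappa^4$ shrinks by orders of magnitude while $a$ drifts slowly downward: the trajectory collapses onto the parabola and then slides along it, exactly the two-phase picture described above.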

Figure 4: Convergence Diagram for GD on the degree-4 example. The quiver arrows indicate the directions of local 2-step movement. This diagram is only for demonstration purpose and ratios are not exact.

Figure 5: Beyond EoS training on a product of two scalars. We run the same experiment as in Fig. 2 except with the objective $\min_{x,y\in\mathbb{R}}(1 - xy)^2$. Note that in this case the two-step trajectories form circular curves and converge to points that are farther from the EoS minima.
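The contrast with the degree-2 (product-of-two-scalars) objective can be reproduced in a few lines. The sketch below uses the $\tfrac{1}{4}$-normalized variant $L = \frac{1}{4}(1 - xy)^2$ so that the numbers are comparable to the degree-4 runs (the normalization, the initialization $(12.5, 0.05)$ borrowed from Fig. 6, and the step count are all choices of this sketch):

```python
import math

# Degree-2 objective: L(x, y) = (1 - x y)^2 / 4.
def gd2(x, y, eta, steps):
    for _ in range(steps):
        r = 1.0 - x * y
        x, y = x + eta * y * r / 2, y + eta * x * r / 2
    return x, y

def sharpness2(x, y):
    # Exact 2x2 Hessian of L = (1 - x y)^2 / 4.
    hxx, hyy, hxy = y * y / 2, x * x / 2, (2 * x * y - 1) / 2
    return (hxx + hyy) / 2 + math.hypot((hxx - hyy) / 2, hxy)

eta = 0.01
x, y = gd2(12.5, 0.05, eta, 2000)
loss2 = 0.25 * (1 - x * y) ** 2
lam2 = sharpness2(x, y)          # far below the threshold 2/eta = 200
```

From this initialization the trajectory converges smoothly to a nearby minimum whose sharpness is far below $2/\eta$ — there is no mechanism pushing the sharpness up to the threshold, illustrating the gap described for shallow models in the introduction.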

step-wise trajectory with asymmetric init. (η = 0.01).

Figure 6: Bifurcation behavior of GD on the degree-4 model. In (a) we show a zoomed-in version of the lower-right part of Fig. 2b; the fractal boundary can be clearly observed. In (b), we run GD with $\eta = 0.01$ starting from the asymmetric initialization $(x_0, y_0) = (12.5, 0.05)$, close to the boundary of divergence, until it converges close to the EoS minimum at around $(10, 0.1)$. In (c), we plot the trajectory with bifurcation under the $(a, b)$-reparameterization and compare it with the bifurcation diagram of the approximated dynamical system characterized by $b'' = b(1 + 8a\kappa - 16b^2)$.
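The approximated map $b'' = b(1 + 8a\kappa - 16b^2)$ from the caption makes the bifurcation easy to see analytically: its nontrivial fixed points $b^* = \pm\sqrt{a\kappa/2}$ (i.e. the parabola $2b^2 = a\kappa$) have multiplier $1 - 16a\kappa$, so they lose stability at $a\kappa = 1/8$ and the orbit begins to oscillate. A minimal sketch (the parameter values are illustrative):

```python
# One-parameter map from the caption: b'' = b * (1 + 8 a kappa - 16 b^2).
def iterate(a, kappa, b0=0.2, steps=500):
    b, traj = b0, []
    for _ in range(steps):
        b = b * (1 + 8 * a * kappa - 16 * b * b)
        traj.append(b)
    return traj

kappa = 0.1
# a * kappa = 0.05 < 1/8: multiplier 1 - 16*0.05 = 0.2, fixed point is stable.
stable_traj = iterate(a=0.5, kappa=kappa)
b_star = (0.5 * kappa / 2) ** 0.5
# a * kappa = 0.15 > 1/8: multiplier -1.4, fixed point unstable -> oscillation.
osc_traj = iterate(a=1.5, kappa=kappa, steps=1000)
tail = osc_traj[-100:]              # post-transient iterates: a bounded 2-cycle
```

Below the threshold the iterates converge to the parabola; above it they settle into a bounded oscillation around it, matching the bifurcation diagram in panel (c).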

Figure 7: Training trajectory of a 5-layer ELU-activated FC network. We train the model with $\eta = 0.01$ for 18500 iterations. The sharpness converges to 199.97 while $2/\eta = 200$. The local trajectory (right) is very well approximated by the parabola $x = 7500y^2$.

Figure8: Sharpness concentration for the degree-4 example. We run GD with η = 0.2 from 3 initializations that are above, below, and very close to the line of global minima. The sharpness of all trajectories converges to around 2/η while the loss decreases exponentially. In the 2D trajectory, the 2-step movement quickly converges to the parabolic curves ending very close to the EoS minimum.

Figure 9: Local two-step movement for the degree-4 example. We record the two-step movement with $\eta = 0.2$ from a grid of initializations near the EoS minimum. At each point, the arrow points in the direction of the two-step movement from that point, and the color of the arrow indicates the converging sharpness of trajectories passing through that point. We can see the parabolic trajectory and how initializations to the right of the EoS minimum all tend to converge to it.

Figure12: Local two-step movement for the degree-2 example. This is the same figure as Fig.9except for the degree-2 example. We use the same step size η = 0.2. There is no longer the concentration behavior as we see for the degree-4 case. Locally, only the initialization very close to the EoS minimum will converge to a sharpness near the stability threshold (which is 10 with η = 0.2).

Figure 13: Converging sharpness of (x, y, y) parameterized initializations (η = 0.2).

Figure 14: Local two-step movement for the degree-3 example. In this figure, we show the local dynamics near the two EoS minima in the positive quadrant of the degree-3 example. The left figure corresponds to the EoS minima at the lower right of Fig.13. At this EoS minima, the duplicated entry y is small, and the local behavior is very similar to the case of degree-4 example (Fig.9) for which we have provable sharpness concentration. The right figure corresponds to the EoS minima at the top left of Fig.13. The local behavior is very similar to the case of degree-2 example (Fig.12), for which we do not have EoS behaviors.

Figure15: GD on 2-component scalar factorization problems. We run gradient descent with η = 0.05 for two different objectives. Both models are asymmetrically initialized with factor (5, 0.1) so that they have an initial sharpness larger than 2/η. For both cases, the converging sharpness is distinguishably smaller than the stability threshold.

Figure 16: This figure records the training loss (left) and sharpness (middle) of a degree-3 scalar network with initialization (6, 0.1, 0.4) optimized by gradient descent with fixed step size η = 0.2. In (right) we plot the distance of the last entry x 3 to other entries.

Figure17: This figure records the training loss (left) and sharpness (middle) of a 7-layer scalar network with initialization (2, 2.5, 3, 0.1, 0.2, 0.3, 0.4) optimized by gradient descent with fixed step size η = 0.2. In (right) we plot the distance of the last entry x 7 to other entries.

Figure 18: Pairwise Training Dynamics. This set of figures record the pairwise training dynamics for entries x 2 , x 3 , x 4 , x 5 in the same experiment as Fig. 17.x 2 and x 3 are initialized large while x 4 and x 5 are initialized small. We see that within the small entries and the large entries, the pairwise dynamics are all approximately linear while the cross comparisons across the small entries and large entries gives the parabolic two-step trajectory (and also some bifurcation behavior).


Figure 19: Single Entry Movement. We plot the values of $x_3$ (initialized to 3) and $x_4$ (initialized to 0.1) along the training process. The larger entry decreases in an approximately monotone manner while the smaller entry increases while oscillating.

Figure20: This figure records the training loss (left) and sharpness (middle) of a 4-layer scalar network with initialization (6, 0.7, 0.3, 0.2) optimized by gradient descent with fixed step size η = 0.2. In (right) we plot the distance of the last entry x 4 to other entries. In this example, the small entries did not converge to be exactly the same value, yet the loss still decreased geometrically and the sharpness concentration phenomenon still occurred.

Figure 21: Pairwise Training Dynamics. This set of figures record the pairwise training dynamics for entries x 1 , x 2 , x 3 , x 4 in the same experiment as Fig. 20.x 1 was initialized large (at 6), x 3 , x 4 were initialized small (at 0.2, 0.3), and x 2 was initialized moderately small (at 0.7). We see that the two-step pairwise dynamics between x 2 and x 3 roughly follows a parabolic trajectory, yet the pairwise dynamics between x 1 and x 2 also exhibits similar features while still following a roughly linear relation.

Figure 22: Evolution of the training loss, sharpness, and trajectory of GD on the rank-1 factorization of an isotropic matrix, together with the evolution of the alignment indicator $\|x\|^2\|y\|^2 - (x^\top y)^2$.

Figure 23: Training Statistics for ELU-activated 5-layer FC network. (η = 0.01, 2/η = 200) (a) is identical to Fig. 7 (left) in the main text.We see that the model is capable of memorizing all data as the loss decreases exponentially to 0. Toward convergence, the sharpness oscillates very close to the stability threshold and eventually converges to 199.97. In (b) we show a section of (a) between iteration 5000 and 5030. We can clearly observe two distinctive features of the EoS regime: the loss decreases non-monotonically and the sharpness oscillates around 2/η. In (c) we plot the norm of the offset from minima that is orthogonal to the movement-oscillation projection. After 3000 iterations the residual becomes very small, suggesting that dynamics is mainly happening in the 2 dimensional subspace and hence the projection captures the dynamics quite well.

Figure 24: Projected Trajectory of ELU-activated 5-layer FC network. (η = 0.01, 2/η = 200) In (a), we plot the projected trajectory for the entire training process. After some large bifurcation like oscillation, the 2-step trajectory quickly stabilizes and moves toward the minimum along the movement direction. In (b), we show the tip of the trajectory, which can be very well captured by a parabola. These figures are identical to Fig. 7 in the main text. The color of the dots reflects the local numerical sharpness.

Figure 25: Training Statistics for tanh-activated 5-layer FC network. (η = 0.01, 2/η = 200) Please refer to the caption of Fig. 23 for detailed explanation.

Figure 26: Projected Trajectory of tanh-activated 5-layer FC network. (η = 0.01, 2/η = 200) Please refer to the caption of Fig. 24 for detailed explanation.


Figure 27: Training Trajectory of Degree-4 Model with Label Noise. (η = 0.2) We plot the loss, sharpness, and training trajectory of the label noise model with σ = 0.01. The label noise model first follows a trajectory similar to the original GD training trajectory and reaches near the EoS minimum as shown in (a, b). Then it follows the sharpness reduction flow along the manifold of minima and reaches the flattest minima as shown in (c, d).

Figure 28: Training Trajectory of Degree-4 Model with Gradient Noise. (η = 0.2) We plot the loss, sharpness, and training trajectory of the gradient noise model with σ = 0.01. Like the label noise model, the gradient noise model first follows a trajectory similar to the original GD training trajectory and reaches near the EoS minimum as shown in (a, b). After getting around the EoS-minimum, the parameter begins to randomly oscillate and traverse around the manifold of global minima between the two EoS minima as shown in (c,d). It is likely that the gradient noise finally converges to some distribution along the manifold of global minima.

Figure 29: FC networks trained with SGD of varying batchsizes. (η = 0.005, 0.01, 0.02)For each learning rate and batchsize, we train 10 models from different random initialization for 20000 epochs and record their converging sharpness. The standard deviation of sharpness is represented by the shaded area. (The loss of all models converges to lower than 10 -8 ). When the batch size of SGD is large, the converging sharpness is close to 2/η, which is very similar to the gradient descent cases. On the other hand, the converging sharpness is significantly lower when the batch size is small compared with the dataset. It is worth noting that the converging sharpness for each batch size is quite concentrated.

Figure 30: Convergence Diagram for GD on the 4 scalar example. The horizontal directions represents a and the vertical direction represents b. The arrows indicate the directions of local 2-step movement. This diagram is for demonstration and ratios are not exact.

Corollary B.1. For any positive constant $\epsilon < 0.5$, with $\kappa, a, b$ satisfying condition B.1 we have
$$b'' = b - 16b^3 + 8ab\kappa + O(b^4) + O(ab^2\kappa) + O(a^2b\kappa^2) + O(b^2\kappa^4) + O(b\kappa^5) + O(\epsilon b^2\kappa^3),$$
$$a'' = a - 4b^2\kappa^3 + O(\epsilon b^2\kappa^3) + O(b^3\kappa^3) + O(b^2\kappa^4). \tag{63}$$

B.4 PHASE I: CONVERGENCE TO NEAR PARABOLA

Now we show that under the $(a, b)$ parameterization, any initialization $(a_0, b_0)$

Fix $\epsilon = K^{-1}$. For all $\kappa, a, b$ satisfying condition B.1 as well as the extra conditions $\kappa < K^{-1}$ and $|b| < K^{-1}$, we have $|a'' - (a - 4b^2\kappa^3)| < 3b^2\kappa^3$.

Proof of Lemma 5. Since we assume $K > 512$, we have $\epsilon = K^{-1} < 0.5$ as required by Corollary B.2. Since $\kappa, a, b$ satisfy condition B.1, by Corollary B.2 we have $a$

First it is straightforward to check that the condition on κ, a, b is stronger than condition B.1 if we set ϵ = 0.1, thus by Corollary B.2 we have b ′′ = b -16b 3 + 8abκ + R b where |R

multiplying both sides by $K|a|\kappa$ gives $Ka^2\kappa^2 < |a|\kappa$, and hence $\sqrt{K}|a|\kappa < \sqrt{|a|\kappa} < |b|$. Squaring both sides and multiplying by $|b|$ gives $K|a^2b\kappa^2| < |b^3|$. (v) Since $|a| > \kappa^3$ and $|b| > \sqrt{|a|\kappa}$, we have $|b| > \kappa^2 > K\kappa^4$. Multiplying both sides by $b^2$, we have $K|b^2\kappa^4| < |b^3|$. (vi) In (v) we observed that $|b| > \kappa^2$, so $b^2 > \kappa^4 > K\kappa^5$; multiplying both sides by $|b|$ gives $K|b\kappa^5| < |b^3|$. (vii) Since we fixed $\epsilon = 0.1$ while $\kappa < K^{-1} < 0.1$, we have $K\epsilon\kappa < 1$; since $|b| > \kappa^2$, we have $|b| > K\epsilon\kappa^3$. Multiplying both sides by $b^2$ gives $K|\epsilon b^2\kappa^3| < |b^3|$. Therefore $|8ab\kappa| + |R_b| < 14|b^3|$, which completes the proof. We restate the sufficient condition for $|b'' - (b - 16b^3)| \leq 14|b^3|$ as follows:

Condition B.3 (Condition for $-b^3$-Dominated $b$ Movement).

.3 WHEN $ab\kappa$ DOMINATES $b'' - b$

Lemma 7. For any $\kappa, a, b$ satisfying

have $|b'' - (b + 8ab\kappa)| \leq 7|ab\kappa|$.

Proof of Lemma 7. First fix $\epsilon = 0.5$. It is straightforward to check that for all $\kappa, a, b$ satisfying the given condition, condition B.1 is also satisfied, so by Corollary B.2 we have $b'' = b - 16b^3 + 8ab\kappa + R_b$ where

|b 3 | and the remaining 4 terms by |b 3 |. We will now bound them term by term. Multiply 16|b| on both side gives 16|b 3 | < 2|abκ|.

and hence $|b^3| < K^{-1}|a|\kappa$. Multiplying both sides by $K|b|$ gives $K|b^4| < |ab\kappa|$. (iii) Since $|b| < K^{-1}$, multiplying both sides by $K|ab\kappa|$ gives $K|ab^2\kappa| < |ab\kappa|$.

.4 CONVERGENCE WHEN $-b^3$ DOMINATES $ab\kappa$

Now we are ready to analyze the dynamics when $-b^3$ dominates the movement of $b$. For $t \in \mathbb{N}$, we abuse notation and let $(a_t, b_t)$ denote $a$ and $b$ after the $t$-th 2-step update with step size $\eta = \kappa^2$ from the initialization $(a_0, b_0)$.

Combining with conditions on |a k | and |b k | from the inductive hypothesis we have (a k , b k ) satisfying condition B.2 and condition B.4. Thus by Lemma 5 and Lemma 7 we have

5 4 by the mediant inequality. Now we check the bounds on |a k+1 | and |b k+1 |.

k by the inductive hypothesis, we have |b k+1 | > |b 0 |(1 + 4κ With the guarantees on |a k+1 |, |b k+1 |, and (|b k |-|b 0 |)/(a 0 -a k ) as shown above, if we additionally assume that |b k+1 | < 1 4

aκ, 2 √ aκ) AS a DECREASES WHEN a IS NOT TOO SMALL After showing that b will enter ( 1 4

and hence 2Kκ 5 √ aκ < 1 512 √ aκ. Thus K|bκ 5 | < 1 512 √ abκ.

we have $\kappa, b_k, a_k$ satisfying condition B.2 and condition B.5. Thus $a_{k+1} < a_k - b_k^2\kappa^3$ and $a_{k+1} > a_k - 8b_k^2\kappa^3$ (by Lemma 5), and $|b_{k+1} - b_k| < \frac{1}{16}\sqrt{a_k\kappa}$ (by Lemma 10).

The last inequality holds since $b_k^2 > b_k^4$ as $|b_k| < 1$. Now taking the square root on both sides we have $1 - b_k^2 < \sqrt{1 - 32\kappa^4}$. Since $|b_k| < 2\sqrt{a_k\kappa}$, combining with Eq. (

which gives the desired upper bound on $|b_{k+1}|$. Now we prove the lower bound for $|b_{k+1}|$. Since $|b_{k+1} - b_k| < \frac{1}{16}\sqrt{a_k\kappa}$, by the triangle inequality $|b_k| > \sqrt{a_k\kappa}$ implies $|b_{k+1}| > \frac{15}{16}\sqrt{a_k\kappa} > \frac{1}{4}\sqrt{a_k\kappa}$

PHASE II: CONVERGENCE ALONG THE PARABOLA

In Appendix B.4 we have shown that for a certain range of initializations, $(a, b)$ converges close to the parabola $2b^2 = a\kappa$ very fast.

-1 κ 4 . Multiplying Kb 2 on both sides gives K|b 5 | < 1 5 δb 2 κ 4 . -1 κ 3 . Multiplying Kb 2 κ on both sides gives K|ab 3 κ| < 1 5 δb 2 κ 4 .

Hence |b t | < 2 √ 2κ 7 4 as required. Therefore we have |ξ k+1 -(1 -32b 2 )ξ k | < δb 2 k κ 4 . Next we establish the lower bounds for |b k |, which will give lower bound for the movement of a. Since a

$\kappa^{-1})$ and $|b_{\tau_1}| \in (\sqrt{a_{\tau_1}\kappa}, 2\sqrt{a_{\tau_1}\kappa})$. If $|b_0| \leq \frac{1}{4}\sqrt{a_0\kappa}$, by Lemma 9 there exists some $\tau_2 < \frac{1}{2}\log(|b_0|^{-1})\kappa^{-\frac{7}{2}}$ such that $|b_{\tau_2}| \in ($

$\kappa, 2\sqrt{a_{T_1}\kappa})$. Now by Lemma 11 and Corollary B.3, we know that there exists some $\tau_3 \leq a_0\kappa^{-\frac{13}{2}}$ such that within $\tau_3$ two-step updates from $(a_{T_1}, b_{T_1})$ we have $a_{T_1+\tau_3} \in ($ … $)$ and $|b_{T_1+\tau_3}| \in (\frac{1}{4}\sqrt{a_{T_1+\tau_3}\kappa}, 2\sqrt{a_{T_1+\tau_3}\kappa})$. Let $T_2 = T_1 + \tau_3$. This completes phase 1 of convergence. For phase 2, since $a_{T_2} \in ($

implies $c_0 > c + 13\kappa^{\frac{5}{2}} - 32\kappa^3$. Since we assume $\kappa < \frac{1}{2000\sqrt{2}}K^{-1}$ with $K > 512$, we have $32\kappa^3 < \kappa^{\frac{5}{2}}$, and hence $c_0 > c + 12\kappa^{\frac{5}{2}}$. Therefore $a_0 \triangleq c_0 - c > 12\kappa^{\frac{5}{2}}$.

B.7 OTHER AUXILIARY LEMMAS

B.7.1 CONSTANT BOUND ON $R_\delta(\kappa)$

Lemma 18. For all $\kappa < 0.1$ and all $(a, b) \in (-\kappa^{-1}, \kappa^{-1}) \times (-1, 1)$, there exists some absolute constant $K$, independent of $a, b, \kappa$, such that
$$\delta \triangleq \sqrt{4\kappa^4(1+b)^2 + \big(a\kappa + (1 - 4\kappa^4)^{\frac{1}{4}}\big)^4} \leq 1 + 2a\kappa + a^2\kappa^2 + 2b(2+b)\kappa^4 + K\kappa^5. \tag{119}$$

Proof of Lemma 18. By explicitly computing the derivatives of $\delta$ with respect to $\kappa$, we know that $\delta$ is $C^6$ in $\kappa$ and has the Taylor expansion
$$\delta = 1 + 2a\kappa + a^2\kappa^2 + 2b(2+b)\kappa^4 + R_\delta(\kappa)\kappa^5, \tag{120}$$
where $R_\delta(\kappa)$ is the Lagrange remainder, $R_\delta(\kappa) = \frac{\partial^5\delta}{\partial\kappa^5}(x)/120$ for some $x \in [0, \kappa]$. To prove the lemma we only need to bound $R_\delta$ by an absolute constant. For simplicity of notation, denote $\alpha \triangleq (1 - 4x^4)^{\frac{1}{4}}$, $\beta = \alpha + ax$, $\gamma = 1 + b$, and $\phi = (4\gamma^2x^4 + \beta^4)^{\frac{1}{2}}$.

∂f (α, c) ∂α = (c -1)(-4 -2α + c(1 + α)[

Structure of fully-connected network

multiplying K 2 |aκ 2 | on both side gives K 2 a 2 κ 2 < |a|κ. it follows by taking square root that K|a|κ < |a|κ. Since |b| > |a|κ, we have K|a|κ < |b|. Multiply b 2 on both side gives K|ab 2 κ| < |b 3 |.

1 we have condition B.3 satisfied, and thus $|b_{k+1}| < |b_k|(1 - b_k^2)$ by Lemma 6. Since $\kappa < \frac{1}{16}K^{-1} < \frac{1}{2048}$, we have $\kappa$

we restate the condition for Lemma 13:

Condition B.6. With $\delta = 0.04$ and $K > 512$,

Proof of Corollary B.5. Since condition B.6 holds, by Lemma 13 we have $|\xi''| < |(1 - 32b^2)\xi| + \delta b^2\kappa^4$.

In this phase, we show that once |ξ| is smaller than 1 8 δκ 4 , a will decrease slowly while |ξ| does not increase beyond 1 8 δκ 4 .

$\frac{1}{16}\delta\kappa^7 k$ by the inductive hypothesis, so $a_{k+1} \leq a_0 - \frac{1}{16}\delta\kappa^7(k+1)$. What remains to show for the inductive step is that $|\xi_{k+1}| < \frac{1}{8}\delta\kappa^4$. There are two cases to consider. When $|\xi_k| \in (\frac{1}{16}\delta\kappa^4, \frac{1}{8}\delta\kappa^4)$, by Corollary B.5 we know $|\xi_{k+1}| < (1 - 4\kappa$

< 1, by the inductive hypothesis and the initialization condition on |b 0 | we have |b k | < |b 0 |. It is also from the inductive hypothesis that |a k

Therefore we have shown that κ, a k , b k satisfies condition B.4 and by Lemma 7 we have |b k+1

).

8. ACKNOWLEDGMENTS

This work is supported by NSF Award DMS-2031849, CCF-1845171 (CAREER), CCF-1934964 (Tripods) and a Sloan Research Fellowship.


Combining Eq. (37) and Eq. (38), we have … Dividing both sides by $\kappa$ completes the proof.

Lemma 2. For any $\kappa \in (0, 0.1)$ and all $(a, b)$ such that $|a| < \kappa^{-1}$ and $|b| < 1$, …

Proof of Lemma 2. For simplicity of notation, let … Plugging Eq. (41) into Eq. (32), since $|b| < 1$ we have … Combining the conditions required by Lemma 1 and Lemma 2, we have …

Proof. Combining the one-step dynamics characterized in Lemma 1 and Lemma 2, we have … We analyze these terms one by one. … Combining the above, we have …

Lemma 4. Fix some positive constant $\epsilon < 0.5$; with $\kappa, a, b$ satisfying condition B.1, …

On the other end, since $x < \kappa^{-1}$ as shown above, we have $x + \frac{1}{4}K^{-2}\kappa^{-1} < 2\kappa^{-1}$. So we can again apply Lemma 16, and since $c_0 < c(x_0)$ by construction, we know $(a_0, b_0)$ satisfies the initialization condition of Theorem 3.1, so the trajectory converges to a global minimum with $a \in [-\frac{5}{3}\kappa^3, -\frac{1}{10}\kappa^3]$ and $b = 0$. Now we show that for a global minimum with $a \in [-\frac{5}{3}\kappa^3, -\frac{1}{10}\kappa^3]$ and $b = 0$, the sharpness $\lambda$ satisfies $\lambda \in (\frac{2}{\eta} - \frac{20}{3}\eta, \frac{2}{\eta})$. Recall from Eq. (3) that the sharpness of $(x, y)$ near the global minima is given by … where $\gamma \triangleq xy$. When $(x, y)$ is a global minimum, $\gamma = 1$, and Eq. (110) reduces to … Now we prove the lower bound for $\lambda$. From $c > (\kappa^{-4} - 4)^{\frac{1}{4}} - \frac{5}{3}\kappa^3$, we have … Hence in conclusion, the converging minimum has sharpness $\lambda \in (\frac{2}{\eta} - \frac{20}{3}\eta, \frac{2}{\eta})$, which completes the proof of the corollary.

B.6.2 PROOF OF COROLLARY 3.2

Before proving Corollary 3.2, we first show a simple lemma on the proximity of $x$ to $\kappa^{-1}$, where $x$ is the $x$-coordinate of the $\kappa^2$-EoS minimum.

Lemma 17 (Proximity of $x$ and $\kappa^{-1}$). For any large constant $K > 512$, fix any $\kappa < $ …

Proof of Lemma 17. Since the condition on $\kappa$ is identical to that of Lemma 16, we have $|x - c| < 32\kappa^3$ from Lemma 16, where $c = (\kappa^{-4} - 4)^{\frac{1}{4}}$ from the calculation in Appendix B.1. Note that $c = (\kappa^{-4} - 4)^{\frac{1}{4}}$ … It is straightforward that $c < \kappa^{-1}$, so $|c - \kappa^{-1}| < 4\kappa^3$.
Combining with $|x - c| < 32\kappa^3$, we have $|x - \kappa^{-1}| < 36\kappa^3$, which completes the proof.

Now we can proceed to prove Corollary 3.2. We first restate the corollary.

Corollary 3.2 (Sharpness Adaptivity). For a large enough constant $K$, fix any $\alpha < $ … For all initializations $(x_0, y_0)$ in the region characterized by … and $|x_0y_0 - 1| \in (0, K^{-1})$, the GD trajectory from $(x_0, y_0)$ characterized by Eq. (2) with any step size $\eta \in (\alpha^2 - \frac{1}{10}K^{-2}\alpha^2, \alpha^2)$ will converge to a minimum with sharpness $\lambda \in (\frac{2}{\eta} - \frac{20}{3}\eta, \frac{2}{\eta})$.

Proof of Corollary 3.2. To prove this corollary, we only need to show that for all $\eta$ in the required range, the initialization region characterized by the corollary is a subset of the initialization region required by Corollary 3.1 for that particular $\eta$. For ease of derivation, we will use $\kappa^2$ to substitute for $\eta$. Since we are dealing with different step sizes, we augment our notation and let $(x_{\kappa^2}, y_{\kappa^2})$ denote the $\kappa^2$-EoS minimum. Note that the condition on $y_0$, namely $|x_0y_0 - 1| \in (0, K^{-1})$, is identical to what is required by Corollary 3.1, so we only need to show that for any learning rate … We first show $x_\eta + 13\kappa$ … $)\alpha$, where the last inequality holds since we may assume $K > 512$. Taking the multiplicative inverse, we have $\kappa$ … Note that since $\kappa < \alpha < $ … where $K > 512$, we have … Thus, combining with $x_{\kappa^2} + 13\kappa$ … The other side is much simpler to show. Since $\kappa < \alpha$, we have $\kappa^{-1} > \alpha^{-1}$. From Lemma 17 we have …. This concludes the proof of the corollary.

C THEORETICAL ANALYSIS ON RANK-1 APPROXIMATION OF ISOTROPIC MATRIX

In this section we prove the convergence of the vector case.
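The displays for the objective and update rules (Eq. (123) through Eq. (127)) are not reproduced in this excerpt, so the sketch below assumes a concrete form consistent with the quantities used later: the objective L(x, y) = (1/4)∥xy^⊤ - I_d∥_F² (whose minimum value is (d - 1)/4, matching the proof of Lemma 19) and the alignment variable ξ ≜ ∥x∥²∥y∥² - (x^⊤y)² (matching the identity ∥x_t∥²∥y_t∥² = c_t + ξ_t used in Lemma 20); both are assumptions of this sketch. It checks the closed-form gradient against finite differences and illustrates the geometric decay of ξ claimed in Lemma 20.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(x, y):
    # Assumed objective: L(x, y) = (1/4) * sum_ij (x_i*y_j - delta_ij)^2
    d = len(x)
    return 0.25 * sum((x[i] * y[j] - (i == j)) ** 2
                      for i in range(d) for j in range(d))

def gd_step(x, y, eta):
    # Closed-form gradients of the assumed objective:
    #   grad_x L = (1/2)(||y||^2 x - y),  grad_y L = (1/2)(||x||^2 y - x)
    ny, nx = dot(y, y), dot(x, x)
    return ([a - 0.5 * eta * (ny * a - b) for a, b in zip(x, y)],
            [b - 0.5 * eta * (nx * b - a) for a, b in zip(x, y)])

def alignment(x, y):
    # xi >= 0 by Cauchy-Schwarz; xi = 0 iff x and y are parallel
    return dot(x, x) * dot(y, y) - dot(x, y) ** 2

# Finite-difference check of grad_x in the first coordinate.
x, y, h = [0.5, 0.3, -0.2], [0.4, -0.2, 0.3], 1e-6
analytic = 0.5 * (dot(y, y) * x[0] - y[0])
numeric = (loss([x[0] + h, x[1], x[2]], y)
           - loss([x[0] - h, x[1], x[2]], y)) / (2 * h)
assert abs(analytic - numeric) < 1e-6

# Geometric decay of the alignment variable along GD, in the regime of
# Lemma 20 (unbalanced norms, (x^T y)^2 of constant order).
eta = 0.01
x = [10.0, 0.5, 0.0, 0.0]
y = [0.099, 0.0, 0.04, 0.0]
xi_prev = alignment(x, y)
for _ in range(40):
    x, y = gd_step(x, y, eta)
    xi_now = alignment(x, y)
    if xi_prev > 1e-12:              # above the floating-point floor
        assert xi_now < xi_prev      # xi shrinks every step
    xi_prev = xi_now
assert abs(xi_prev) < 1e-8
```

One useful design fact the demo exploits: each GD step maps (x, y) by a linear transformation within span{x, y}, so ξ is multiplied by the squared determinant of that 2×2 map, which stays strictly below 1 in the regime above.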

Model:

We consider a generalization of the scalar-product case. The normalization factor 1/4 is added to make explicit the equivalence between this rank-1 isotropic matrix factorization problem and the scalar case considered in Appendix B.6; we show how this equivalence follows from alignment in the next section.

Update Rule: We use gradient descent to optimize this problem. Computing the gradient in closed form gives the one-step updates of x and y. With this gradient descent update, we track the dynamics of the following quantities: ∥x∥^2, ∥y∥^2, and x^⊤y. Furthermore, we define the alignment variable ξ ≜ ∥x∥^2∥y∥^2 - (x^⊤y)^2. From Eq. (126) and Eq. (125), we derive the update rule of ξ in Eq. (127).

C.2 PROOF OF THEOREM 3.2

We restate the theorem for the vector case.

Theorem C.1. For a large enough absolute constant K, suppose the initialization (x_0, y_0) satisfies the conditions stated above, and a multiplicative perturbation y'_t = y_t(1 + 2K^{-1}) is performed at time t = t_p for some t_p > O(log(η) + log(d)log(δ_0) + K^3). Then for any ϵ > 0, with probability p > 1 - 2δ_0 - 2exp{-Ω(d)} there exists T = O(K^{-2}κ^{-15/2}log(ϵ)log(δ_0)) such that for all t > T, L(x, y) < ϵ and ∥x_t∥^2 + ∥y_t∥^2 ∈ (1/η - (10/3)η, 1/η).

In the following analysis, we still use the operator O(·) to hide only absolute constants, and K to denote an absolute constant that uniformly upper-bounds the constants in all the O(·) terms; as before, K > 512. We also keep the notation of the η-EoS minimum (x, y) from the scalar case.

We first prove some properties of the global minimizers. The following lemma guarantees the alignment of the two vectors x, y at any global minimizer (x, y).

Lemma 19 (Global minimizers). For every global minimizer (x, y) of the optimization problem in Eq. (123), we have x = cy for some c ∈ R, and ∥x∥∥y∥ = 1.

Proof.
We directly expand the objective. The global minimum value is (d - 1)/4, and equality holds exactly when x and y are parallel with ∥x∥∥y∥ = 1, which is equivalent to x = cy for some c ∈ R and ∥x∥∥y∥ = 1.

Now we turn to the formal proof of the theorem, which has four steps: (i) we prove that the misalignment between the two vectors x, y decays geometrically after T = O(log(η)log(δ_0) + log(d)) steps (Lemma 20, Lemma 21); (ii) we prove that the norms ∥x∥, ∥y∥ satisfy the initialization condition within T' < 6K^3 + 4K further steps (Lemma 22); (iii) to escape from sharp minima, we add a deterministic perturbation, and we prove that after the perturbation gradient descent re-enters the feasible region while keeping a constant distance from the manifold of minimizers; (iv) after the iterates enter the feasible region and ξ is small enough, the reparameterized dynamics of ∥x∥ and ∥y∥ are captured by the scalar dynamics (Lemma 25). We can then reduce the vector case to the scalar-product case and finish the proof.

We now present the key lemmas for proving convergence at the edge of stability. For simplicity, denote α(t) ≜ η(∥x_t∥^2 + ∥y_t∥^2) - 1.

Lemma 20. Suppose -1/100 < α(t) < 1/8 holds for all t ∈ [0, t_1] for some t_1. Then for all t ∈ [0, t_1 + 1], 0 < (x_t^⊤y_t)^2 < 13/10 and ξ_{t+1} < ξ_t. Moreover, there exists some t_0 such that for all t ∈ [t_0, t_1 + 1], 7/20 < (x_t^⊤y_t)^2 < 13/10 and ξ_{t+1} < 0.7ξ_t.

Proof. We use induction. For iteration 0, the induction basis holds. Now consider iteration t + 1, assuming the conclusion holds at iterations 0, 1, 2, ..., t. Consider Eq. (126), and denote c_t = (x_t^⊤y_t)^2. From the induction hypothesis, ξ_t < ∥x_0∥^2∥y_0∥^2 < 1/4 for all time, and thus ∥x_t∥^2∥y_t∥^2 = c_t + ξ_t < 31/20.
We then bound the one-step update of c_t. Since 0 < c_t < 13/10 and -0.01 < α(t) < 1/8, this function attains its maximum at c_t = 1001/(1500(1 + α(t))). The lower bound 0 for c_t is straightforward, because c_t + ξ_t/2 < 31/20 and -0.01 < α(t) < 1/8.

Next we establish the tighter bound and the faster decay rate after some time t_0. If c_0 > 7/20 and the lower bound holds at t = 0, we may take t_0 = 0 and begin the induction there. For t ≥ t_0, the lower bound of c_{t+1} follows (using 7/20 < c_t < 1.3); the last inequality holds because the final function is monotonically decreasing and attains its minimum at c_t = 13/10. This proves the first statement. For the alignment dynamics, Eq. (127) shows that the contraction factor of ξ can be bounded as claimed, and induction finishes the proof in this case.

If instead c_0 ≤ 7/20, we prove that c_t eventually exceeds 7/20 after some time t_0, and then apply the same induction as above. Considering again the lower bound on the dynamics of c_t: if c_t < 7/20, then c_t at least doubles in one step, so it takes at most t_0 = ⌈log_2(7/(20 x_0^⊤y_0))⌉ steps to satisfy the condition. This finishes the proof.

We next bound the time it takes ξ to become smaller than η^2.

Lemma 21 (Alignment convergence time). Suppose all the conditions in Theorem C.1 hold. Then with probability 1 - 2δ_0 - 2exp{-Ω(d)}, there exists some time T_0 = log_{0.7}(η^2) + log_2(21d/(20δ_0^2)) after which ξ_t < η^2.

Proof. We first prove that the bound on (x^⊤y)^2 is attained with probability 1 - 2δ_0 - 2exp{-Ω(d)}, and that the alignment ξ begins to shrink after t_0 = ⌈log_2(21d/(20δ_0^2))⌉. At initialization, δ_xδ_y = 1/2, and by symmetry, sampling the directions is equivalent to sampling from N(0, I) and dividing by the norm.
By the initialization condition, applying the Gaussian concentration bound and Theorem 3.1.1 in Vershynin (2018) gives the desired initial estimates. Then, with probability p > 1 - 2δ_0 - 2exp{-Ω(d)}, we can apply the first statement of Lemma 20 together with induction to prove that (x_t^⊤y_t)^2 < 13/10 and 0 < α(t) < (13/50)K^{-2}.

For t = 0 the claim follows from the initialization. Suppose the statement holds for t ∈ [0, t_1 - 1]. By the induction hypothesis, the condition of Lemma 20 holds on [0, t_1 - 1], so by Lemma 20, (x_t^⊤y_t)^2 < 13/10 for all t ∈ [0, t_1]. With this upper bound, the one-step movement of ∥x∥^2 + ∥y∥^2 can be bounded. For t < T_0 + 6K^3 + 4K, the total movement of ∥x_t∥^2 + ∥y_t∥^2 is smaller than 5η(T_0 + 6K^3 + 4K), and the corresponding movement of α(t) is smaller than 5η times that bound. Thus, by induction, (x_t^⊤y_t)^2 < 13/10 and α(t) stays within the stated range.

Then, by the second statement of Lemma 20, we have 7/20 < (x_t^⊤y_t)^2 < 13/10 and ξ_{t+1} < 0.7ξ_t for t > ⌈log_2(21d/(20δ_0^2))⌉, and we can compute the time at which ξ_t < η^2: it takes at most log_{0.7}(η^2) = O(log(1/η)) further steps. Hence, with T_0 = log_{0.7}(η^2) + log_2(21d/(20δ_0^2)), the claim holds. On the other hand, we can also lower-bound ∥x_t∥^2 + ∥y_t∥^2 through the corresponding lower bound on its one-step movement.

With the alignment guarantee in hand, we can prove that the iterates enter the feasible region of the scalar case with high probability (Lemma 22).

Proof. To prove the statement, we establish a stronger claim. If that claim holds, then because ξ_t < η^2, the product of the norms satisfies the required bound, and the lemma follows; so in the rest of the proof we prove the stronger claim. First, we show there exists some time t < 4K + 1 at which the squared inner product c_t exceeds 1/2: c_t provably enters c_t > 1/2 in one step. Moreover, if 1/2 < c_t < 1 - K^{-1}, then c_t increases by at least a fixed increment, which means it takes at most 4K steps for c_t to exceed 1 - K^{-1}.
Now we prove that once c_t > 1 - K^{-1}, the remaining convergence takes a bounded number of steps, which finishes the proof. We first show there exists some t at which c_t enters the target region; afterwards, we prove that once c_t enters, it never leaves the region. Next, we consider the two-step dynamics of c_t and prove that it decays by a constant factor every two steps. We first separate out the relatively small terms to identify the main part of the two-step dynamics: if α(t) < α_0 < 1/8 and 1 + K^{-1} - K^{-2} < c_t < 1.3, the two-step update admits the stated expansion. We then upper-bound the one-step difference α(t + 1) - α(t). Considering the dynamics of ∥x∥^2 + ∥y∥^2, the movement at each step is at most a constant multiple of η, so the corresponding update of α(t) is less than 5η^2.

Published as a conference paper at ICLR 2023

We use induction and suppose c_t satisfies the condition above. Given the upper and lower bounds on c_{t+1} obtained above, induction shows that the bounds hold for all t > s. In this way, |∥x_t∥∥y_t∥ - 1| < K^{-1} for t > t_1 + T.

Entering the feasible region is not enough to establish the equivalence with the scalar case: we further need the alignment variable ξ_t to be small enough that a single O(·) term can absorb all the ξ-dependent terms in the (a, b)-parameterization dynamics. We also need to guarantee that the trajectory does not converge to an unstable point near the minima. In the scalar case this holds whenever the initialization is not exactly on the manifold of minimizers, but the argument becomes more challenging in higher dimensions, so we require an additional perturbation to escape from any unstable point. We pick the time t_p = T_0 + 6K^3 + 4K = O(log(η) + log(d)log(δ_0) + K^3) to guarantee that when the perturbation happens, x and y are aligned and (x^⊤y)^2 is not large enough to cause instability.
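The bookkeeping for the multiplicative perturbation can be checked directly: scaling y by a factor s multiplies both the squared inner product (x^⊤y)² and the alignment variable ξ by exactly s², which is the scaling used for ξ'_{t_p} in Lemma 23 with s = 1 + 2K^{-1}. The snippet takes ξ = ∥x∥²∥y∥² - (x^⊤y)², an assumption of these sketches.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

K = 512
s = 1 + 2 / K                       # multiplicative perturbation y -> s * y

x = [10.0, 0.3, 0.0]
y = [0.1, 0.002, 0.001]
y_pert = [s * yi for yi in y]

c_before = dot(x, y) ** 2
c_after = dot(x, y_pert) ** 2
xi_before = dot(x, x) * dot(y, y) - dot(x, y) ** 2
xi_after = dot(x, x) * dot(y_pert, y_pert) - dot(x, y_pert) ** 2

# both quantities scale by exactly s^2, as used in Lemma 23
assert abs(c_after - s ** 2 * c_before) < 1e-9
assert abs(xi_after - s ** 2 * xi_before) < 1e-9
```

Because ξ is multiplied by a factor only slightly above 1 while (x^⊤y)² is pushed a constant K^{-1}-sized distance away from 1, the perturbation escapes the unstable point without undoing the alignment established before t_p.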
After the perturbation at t_p, the gradient descent dynamics keep the iterates from hitting the manifold of minimizers. The following lemma shows that the bounds on c_t ≜ (x^⊤y)^2, α(t) = η(∥x∥^2 + ∥y∥^2) - 1 and ξ remain valid after the perturbation.

Lemma 23. After the perturbation at time t_p, the following properties hold for t ∈ [t_p, 2t_p + 2]: (i) c'_{t_p} ∈ (1 + K^{-1} - 2K^{-2}, 1 + 3K^{-1} + 2K^{-2}), and c_t ∈ (7/20, 13/10); (ii) K^{-2}/100 < α(t) < (1/2)K^{-2}; (iii) ξ_t < (0.7)^{t-t_p}η^2.

Proof. We prove this by induction, starting with the base case t = t_p.

For (i), before the perturbation we have c_{t_p} ∈ (1 - K^{-1}, 1 + K^{-1} - K^{-2}), so after the perturbation c'_{t_p} ∈ (1 + K^{-1} - 2K^{-2}, 1 + 3K^{-1} + 2K^{-2}) ⊂ (7/20, 13/10).

For (ii), the initial value satisfies (1/40)K^{-2} < α(0) < (1/4)K^{-2}. The total movement of α(t) before the perturbation is bounded within (-5η^2t_p, 5η^2t_p) ⊂ (-K^{-2}/200, K^{-2}/200), and the perturbation moves ∥y∥^2 by at most ∥y_t∥^2 < (13/10)/∥x_t∥^2 < (1/1000)K^{-2}η^{-1/2}. Due to the upper bound on η, we get α(t) ∈ (K^{-2}/100, K^{-2}/2).

For (iii), since K > 512, at t_p we have ξ_{t_p} < (0.7)^{6K^3}η^2, and after the perturbation ξ'_{t_p} = (1 + 2K^{-1})^2 ξ_{t_p} < η^2.

So all three statements hold at t = t_p. Now suppose all three statements hold for t ∈ [t_p, t_p + t_1]; we prove them for t_p + t_1 + 1. First, the dynamics of c_t under the conditions c_t ∈ (7/20, 13/10) and 0 < α(t) < K^{-2}/2 give the bound in (i). For (ii), since c_t and ξ_t are bounded for t ≤ t_p + t_1 + 1, the total movement of α(t) still lies in (-5η^2t_p, 5η^2t_p) ⊂ (-K^{-2}/200, K^{-2}/200); combined with the movement of y at the perturbation (< (1/1000)K^{-2}η^{-1/2}), the total movement is bounded by (-6K^{-2}/1000, 6K^{-2}/1000), which proves (ii). Finally, for (iii), as long as c_t < 13/10, we have ξ_{t+1} < 0.7ξ_t < 0.7^{t_1+1}η^2.
Therefore all three statements follow by induction.

Having re-established all the bounds after the perturbation, we need to show that the alignment variable ξ becomes small enough to approximate the scalar case. The following lemma proves that b stays at a constant level when α(t) is of constant order.

Lemma 24. Suppose ξ_t < η^2 and η is sufficiently small; then the two-step dynamics of the inner product (Eq. (128) and Eq. (129)) keep c_t within the stated range.

Proof. We first lower-bound (x_{t_1+2k}^⊤ y_{t_1+2k})^2: we prove the bound for the subsequence c_{t+2k}, k = 0, 1, 2, ..., by induction, and then use this conclusion to upper-bound c_{t+2k-1}. For k = 0 the statement holds. Suppose the lower bound holds for all k ≤ k_0; we prove it for k = k_0 + 1. Consider the same function as in Lemma 22, containing the terms -3q^2γ^4 + 6q^4γ^4 + q^3γ^{9/2} + 4q^4γ^5 + q^3γ^{11/2} + q^4γ^6. Notice that when c > 1, f(γ, c) decreases as c increases. Now consider the range of q: if q ∈ (1/2, 2/3], the required bound holds.

Lemma 25. Suppose ξ_t is sufficiently small and |∥x_t∥∥y_t∥ - 1| < K^{-1}. Then the following equations hold for some fixed constant ϵ ≤ 0.5.

Proof. For simplicity, denote x = ∥x∥ and y = ∥y∥. We first prove the claimed norm dynamics. Write y = cx + θ for some θ ∈ R^d with ⟨θ, x⟩ = 0, and substitute this decomposition into the dynamics of x. Since x^⊤y and xy are both bounded by constants, we can take norms on both sides and apply the triangle inequality to obtain the stated estimate; the dynamics of y follow similarly. We then reparameterize the dynamics in the (a, b)-parameterization, where the remainder δ collects terms of order 4η^2(1 + b)^2 and aη. By the bounds on ∥x_t∥ and |∥x_t∥∥y_t∥ - 1|, we have |a| < ϵκ and |b| < min{1, (1/5)ϵκ^{-2}}, which satisfies the conditions of Lemma 1 and Lemma 2.
We then follow the proofs of Lemma 1 and Lemma 2 to finish the proof, since the only difference lies in the two O(·) terms.

Having reduced the vector-case dynamics to the scalar-case dynamics, we still need to prove that in the scalar case, at every stage and for every t ≥ 0, (b_{t+1}/b_t)^4 > 0.7, which means that ξ_t shrinks faster than b_t^4. In that case, we can conclude that ξ_t < ηb_t^4 holds along the entire scalar-case trajectory.

Lemma 26. Suppose all conditions in Theorem 3.1 hold. Then for all t > 0, (b_{t+1}/b_t)^4 > 0.7.

Proof. From the proof of Theorem 3.1, we know that for all t > 0, |b_t| < K^{-1} and |a_t| < κ^{-1}. Thus Lemma 2 applies, and for all t ≥ 0 we have (b_{t+1}/b_t)^4 ≥ (1 - 2K^{-2} - 3K^{-1} - K^{-2} - K^{-2}κ^4 - κ^4)^4 ≥ (1 - 10K^{-1})^4 ≥ (502/512)^4 > 0.7.

Finally, we conclude the proof of Theorem C.1.

Proof of Theorem C.1. By Lemma 21, with probability 1 - 2δ_0 - 2exp{-Ω(d)} there exists some time T_0 = O(log(η)log(δ_0) + log(d)) such that ξ_t < η^2 for all t > T_0. Also, for t ∈ [T_0, T_0 + 6K^3 + 4K], ∥x_t∥ ∈ (x + (1/200)K^{-2}η^{-1/2}, x + (1/6)K^{-2}η^{-1/2}). This bound on ∥x∥ guarantees that for t ∈ [T_0, T_0 + 6K^3 + 4K], α(t) ∈ (0, K^{-2}/2), which satisfies the condition of Lemma 22. Then, by Lemma 22, for some t* < T_0 + 6K^3 + 4K < t_p, we have |∥x_t∥∥y_t∥ - 1| < K^{-1} for t ∈ [t*, t_p]. After the iterates enter and stay in this region, we add the perturbation at t_p. By Lemma 23, the bounds established before the perturbation continue to hold: (i) after the perturbation, c'_{t_p} ∈ (1 + K^{-1} - 2K^{-2}, 1 + 3K^{-1} + 2K^{-2}), and c_t ∈ (7/20, 13/10); (ii) K^{-2}/100 < α(t) < (1/2)K^{-2}; (iii) ξ_t < (0.7)^{t-t_p}η^2. Therefore, the conditions of Lemma 22 and Lemma 24 are both met. After another 6K^3 + 4K steps, we have |b_t| = |∥x_t∥∥y_t∥ - 1| < K^{-1} and |(x_t^⊤y_t)^2 - 1| > K^{-1}/20. The latter implies |b_t| > (K^{-1}/20)/(1 + ∥x_t∥∥y_t∥) > K^{-1}/41, and hence ξ_t < (0.7)^{4K}η^2 < (0.7)^{2000}η^2 < η·K^{-4}/8000000 < η|b_t|^4.
By Lemma 25, the dynamics of the norms (∥x∥, ∥y∥) are captured by the scalar case, including the initialization condition and the one-step update rules. By Lemma 26, the alignment is maintained along the trajectory, so the reduction to the scalar dynamics remains valid at every step. Finally, we apply Theorem 3.1 and finish the proof.
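As a closing sanity check on the reduction behind Lemma 25: when x and y are exactly aligned, the vector dynamics collapse to a two-variable recursion. Writing x = s·u and y = t·u for a unit vector u, one GD step on the assumed objective L = (1/4)∥xy^⊤ - I∥_F² (an assumption of this sketch, since Eq. (123) is not reproduced in this excerpt) maps (s, t) to (s - (η/2)(st - 1)t, t - (η/2)(st - 1)s). The sketch verifies that the vector norms track this scalar recursion to floating-point accuracy.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

eta = 0.1
u = [0.6, 0.0, 0.8]                 # unit direction shared by x and y
s, t = 2.0, 0.4                     # scalar stand-ins for the norms
x = [s * ui for ui in u]
y = [t * ui for ui in u]            # exactly aligned: xi = 0

for _ in range(100):
    # vector GD step on L = (1/4)||x y^T - I||_F^2
    ny, nx = dot(y, y), dot(x, x)
    x, y = ([a - 0.5 * eta * (ny * a - b) for a, b in zip(x, y)],
            [b - 0.5 * eta * (nx * b - a) for a, b in zip(x, y)])
    # scalar recursion obtained from the aligned reduction
    s, t = s - 0.5 * eta * (s * t - 1) * t, t - 0.5 * eta * (s * t - 1) * s
    assert abs(math.sqrt(dot(x, x)) - abs(s)) < 1e-8
    assert abs(math.sqrt(dot(y, y)) - abs(t)) < 1e-8
```

The exactness of this collapse is why the vector analysis only needs to control the misalignment ξ: once ξ is negligible, the scalar results of Appendix B apply verbatim to (∥x∥, ∥y∥).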

