ON GRADIENT DESCENT CONVERGENCE BEYOND THE EDGE OF STABILITY
Anonymous authors
Paper under double-blind review

Abstract

Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability" (EoS), where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behavior. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a 'Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems. Specifically, we characterize a local condition involving third-order derivatives that stabilizes oscillations of GD above the EoS, and leverage this property in a teacher-student setting, under population loss. Finally, focusing on Matrix Factorization, we establish a non-asymptotic 'Local Implicit Bias' of GD above the EoS, whereby quasi-symmetric initializations converge to symmetric solutions, where sharpness is minimal amongst all minimisers.

1. INTRODUCTION

Given a differentiable objective function f(θ), where θ ∈ R^d is a high-dimensional parameter vector, the most basic and widely used optimization method is gradient descent (GD), defined as θ^{(t+1)} = θ^{(t)} - η ∇_θ f(θ^{(t)}) (1), where η is the learning rate. For all its widespread application across many different ML setups, a basic question remains: what are the convergence guarantees (even to a local minimiser) under typical objective functions, and how do they depend on the (only) hyperparameter η? In the modern context of large-scale ML applications, an additional key question is not only whether GD converges to minimisers, but to which ones, since overparametrisation defines a whole manifold of global minimisers, all potentially enjoying drastically different generalisation performance. The sensible regime to start the analysis is η → 0, where GD inherits the local convergence properties of the gradient flow ODE via standard arguments from numerical integration. However, in the early phase of training, a large learning rate has been observed to result in better generalization (LeCun et al., 2012; Bjorck et al., 2018; Jiang et al., 2019; Jastrzebski et al., 2021), where the extent of "large" is measured by comparing the learning rate η against the curvature of the loss landscape, measured with λ(θ) := λ_max(∇^2_θ f(θ)), the largest eigenvalue of the Hessian with respect to the learnable parameters. Although one requires sup_θ λ(θ) < 2/η to guarantee the convergence of GD to (local) minimisers (Bottou et al., 2018), the work of Cohen et al. (2020) noticed a remarkable phenomenon in the context of neural network training: even in problems where λ(θ) is unbounded (as in NNs), for a fixed η, the curvature λ(θ^{(t)}) increases along the training trajectory (1), bringing λ(θ^{(t)}) ≥ 2/η (Cohen et al., 2020).
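To make the 2/η threshold concrete, the following minimal sketch (our illustration, not from the paper) runs GD on a 1-D quadratic with curvature λ: the iteration contracts when η < 2/λ and diverges when η > 2/λ.

```python
# GD on the quadratic f(theta) = 0.5 * lam * theta^2, whose only Hessian
# eigenvalue is lam. Each step multiplies theta by (1 - eta * lam), so the
# iteration is stable iff |1 - eta * lam| < 1, i.e. eta < 2 / lam.
def gd(eta, lam=1.0, theta0=1.0, steps=50):
    theta = theta0
    for _ in range(steps):
        theta -= eta * lam * theta  # gradient of f is lam * theta
    return theta

stable = gd(eta=1.9)    # eta < 2/lam: |theta| shrinks (with oscillating sign)
unstable = gd(eta=2.1)  # eta > 2/lam: |theta| blows up
```

On a quadratic there is no mechanism to stop the divergence above 2/λ; the higher-order terms studied in Section 4 are exactly what is missing from this picture.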
After that, a surprising phenomenon is that λ(θ^{(t)}) stably hovers above 2/η and the neural network still eventually achieves a decreasing training loss -- the so-called "Edge of Stability". We would like to understand and analyse the conditions of such convergence with a large learning rate for a variety of models that capture the observed empirical behavior. Recently, some works have built connections between EoS and implicit bias (Arora et al., 2022; Lyu et al., 2022; Damian et al., 2021; 2022) in the context of large, overparametrised models such as neural networks. In this setting, GD is expected to converge to a manifold of minimisers, and the question is to what extent a large learning rate 'favors' solutions with small curvature. In essence, these works show that under certain structural assumptions, GD asymptotically tracks a continuous sharpness-reduction flow, in the limit of small learning rates. Compared with these, we study non-asymptotic properties of GD beyond the EoS, by focusing on certain learning problems (e.g., single-neuron ReLU networks and matrix factorization). In particular, we characterize a range of learning rates η above the EoS such that the GD dynamics hover around minimisers. Moreover, in the matrix factorization setup, where minimisers form a manifold with varying local curvature, our results give a non-asymptotic analogue of the 'Sharpness-Minimisation' arguments from Arora et al. (2022); Lyu et al. (2022); Damian et al. (2022). The straightforward starting point for the local convergence analysis is via Taylor approximations of the loss function. However, in a quadratic Taylor expansion, gradient descent diverges once λ(θ) > 2/η (Cohen et al., 2020), indicating that a higher-order Taylor approximation is required.
By considering a 1-D function with a local minimum θ* of curvature λ* = λ(θ*), we show that it is possible to stably oscillate around the minimum with η slightly above the threshold 2/λ*, provided its higher-order derivatives satisfy mild conditions, as in Theorem 1. A typical example of such functions is f(x) = (1/4)(x^2 - µ)^2 with µ > 0. Furthermore, we prove that GD converges to an orbit of period 2 from a more global initialization, rather than relying only on the analysis of a high-order local approximation. As it turns out, the analysis of such stable one-dimensional oscillations is sufficiently intrinsic to become useful in higher-dimensional problems. First, we leverage the analysis for a two-layer single-neuron ReLU network, where the task is to learn a teacher neuron with data on a uniform high-dimensional sphere. We show a convergence result under population loss with GD beyond the EoS, where the direction of the teacher neuron can be learnt and the norms of the two layers' weights stably oscillate. We then focus on matrix factorization, a canonical non-convex problem whose geometry is characterized by a manifold of minimisers having different local curvature. Our techniques allow us to establish a local, non-asymptotic implicit bias of GD beyond the EoS around certain quasi-symmetric initializations, by which the large-learning-rate regime 'attracts' the dynamics towards symmetric minimisers, precisely those where the local curvature is minimal. A further discussion is provided in Appendix M.

2. RELATED WORK

Implicit regularization. Due to its theoretical closeness to gradient descent with a small learning rate, gradient flow is a common setting to study the training behavior of neural networks. Barrett & Dherin (2020) suggest that gradient descent is closer to a gradient flow with an additional term regularizing the norm of the gradients. Through analysing the numerical error of Euler's method, Elkabetz & Cohen (2021) provide theoretical guarantees of a small gap depending on the convexity along the training trajectory. Neither fits the case of our interest, because it is hard to track the parametric gap when η > 1/λ. For instance, on a quadratic function, the trajectory jumps between the two sides of the minimum once η > 1/λ. Damian et al. (2021) show that SGD with label noise is implicitly subjected to a regularizer penalizing sharp minimizers, but the learning rate is constrained strictly below the edge-of-stability threshold. Balancing effect. Du et al. (2018) prove that gradient flow automatically preserves the differences between the norms of different layers of a deep homogeneous network. Ye & Du (2021) show that gradient descent on matrix factorization with a constant small learning rate still enjoys the auto-balancing property. Also in matrix factorization, Wang et al. (2021) prove that gradient descent with a relatively large learning rate leads to a more balanced (though perhaps not perfectly balanced) solution even when the initialization is imbalanced. In a similar spirit, we extend their finding to a larger learning rate, with which perfect balance may be achieved in our setting. We estimate that our learning rate is strictly larger than that of Wang et al. (2021): they show GD with large learning rates converges to a flat region in the interpolation manifold, whereas for our larger learning rate such a flat region does not exist, so GD is forced to wander around the flattest minima.
Note that the implication of the balancing effect is to get close to a flatter solution in the global minimum manifold, which may help improve generalization according to some common arguments in the community. Edge of stability. Cohen et al. (2020) observe a two-stage process in gradient descent: first, the loss curvature grows until the sharpness touches the bound 2/η; second, the curvature hovers around the bound and the training loss still decreases at a macroscopic scale, regardless of local instability. Gilmer et al. (2021) report similar observations for stochastic gradient descent and conduct comprehensive experiments relating loss sharpness to learning rates, architecture choices and initialization. Lewkowycz et al. (2020) argue that gradient descent "catapults" into a flatter region if the loss landscape around initialization is too sharp. Some concurrent works (Ahn et al., 2022; Ma et al., 2022; Arora et al., 2022; Damian et al., 2022) also theoretically investigate the edge of stability. Ahn et al. (2022) suggest that unstable convergence happens when the loss landscape of neural networks forms a local forward-invariant set near the minima due to certain ingredients, such as tanh as the nonlinear activation. Ma et al. (2022) empirically observe a multi-scale structure of the loss landscape and, taking it as an assumption, show that gradient descent with different learning rates may stay at different levels. Arora et al. (2022) show that training provably enters the edge of stability with modified gradient descent or a modified loss, and that its associated flow then goes to flat regions. Under mild conditions, Damian et al. (2022) prove that GD beyond the EoS follows an optimization trajectory subject to a sharpness constraint, so that a flatter region is found. Learning a single neuron.
Yehudai & Ohad (2020) study necessary conditions on both the distribution and the activation function to guarantee that a one-layer single student neuron aligns with the teacher neuron under gradient descent, SGD and gradient flow. Vardi et al. (2021) extend the investigation to a neuron with a bias term. Vardi & Shamir (2021) empirically study the training dynamics of a two-layer single neuron, focusing on its implicit bias. In this work, we present a convergence analysis of a two-layer single-neuron ReLU network trained with population loss at a large learning rate beyond the edge of stability.

3. PROBLEM SETUP

We consider a differentiable objective function f(θ) with θ ∈ R^d, and the GD algorithm from (1). Definition 1. A differentiable function f is L-gradient Lipschitz if
||∇f(θ_1) - ∇f(θ_2)|| ≤ L ||θ_1 - θ_2||, ∀ θ_1, θ_2. (2)
The above definition is equivalent to saying that the spectral norm of the Hessian is bounded by L, i.e., the local curvature at each point is bounded by L. Then η needs to be bounded by 1/L in GD so that it is guaranteed to visit an approximate first-order stationary point (Nesterov, 1998). Perturbed GD requires η = 1/L to visit an approximate second-order stationary point (Jin et al., 2021), and stochastic variants share similar assumptions (Ghadimi & Lan, 2013; Jin et al., 2021). However, in practice, such an assumption may be violated, or even impossible to satisfy when ∇^2 f is not uniformly bounded. Cohen et al. (2020) observe that, with the learning rate η fixed, the largest eigenvalue λ_1 of the loss Hessian of a neural network is below 2/η at initialization, but grows above the threshold along training. This phenomenon is more pronounced when the network is deeper or narrower, and it reveals the non-smooth nature of the loss landscape of neural networks. Furthermore, another observation from Cohen et al. (2020) is that once λ_1 ≥ 2/η, the training loss starts to fluctuate sharply. This is not surprising, because GD would diverge on a quadratic function with such a large curvature. However, despite local instability, the training loss still decreases over a longer range of steps, during which the local curvature stays around 2/η. A further phenomenon is that, when GD is at the edge of stability, if the learning rate suddenly changes to a smaller value η_s < η, then the local curvature quickly grows to 2/η_s, indicating the ability to 'manipulate' the local curvature by adjusting the learning rate. Besides the analysis of GD, the local curvature itself has also received a lot of attention.
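The sharpness λ(θ) can be estimated without ever forming the Hessian; the sketch below (our illustration, not the paper's code) runs power iteration on finite-difference Hessian-vector products, using the 2-D factorization loss f(x, y) = (1/2)(xy - 1)^2 studied later in the paper as a toy objective.

```python
import numpy as np

def grad(theta):
    # Gradient of f(x, y) = 0.5 * (x*y - 1)^2.
    x, y = theta
    r = x * y - 1.0
    return np.array([r * y, r * x])

def sharpness(theta, iters=100, h=1e-5):
    # Power iteration on H @ v, where the Hessian-vector product is
    # approximated by a central finite difference of the gradient along v.
    v = np.array([1.0, 0.0])
    for _ in range(iters):
        hv = (grad(theta + h * v) - grad(theta - h * v)) / (2 * h)
        v = hv / np.linalg.norm(hv)
    return float(v @ hv)  # ~ largest Hessian eigenvalue at theta

# On the minimum manifold x*y = 1 the top eigenvalue is x^2 + y^2,
# so an imbalanced minimum such as (2, 0.5) is sharper than (1, 1):
lam = sharpness(np.array([2.0, 0.5]))  # close to 2^2 + 0.5^2 = 4.25
```

This is the kind of measurement behind the λ_1-versus-2/η curves of Cohen et al. (2020), here in two dimensions only.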
Due to the over-parameterized nature of modern neural networks, the global minimizers of the objective f form a manifold of solutions. There have been active directions to understand the implicit bias of GD methods, namely where in the manifold they converge to, and why some points of the manifold are preferable to others. For the former question, it is believed that (stochastic) GD prefers flatter minima (Barrett & Dherin, 2020; Smith et al., 2021; Damian et al., 2021; Ma & Ying, 2021). For the latter, flatter minima bring better generalization (Hochreiter & Schmidhuber, 1997; Li et al., 2018; Keskar et al., 2016; Ma & Ying, 2021; Ding et al., 2022). It would therefore be meaningful if flatter minima could be obtained via GD with a large learning rate. More specifically, it has been shown that the eigenvalues of the Hessian of a deep homogeneous network can be driven to infinity by rescaling the weights of each layer (Elkabetz & Cohen, 2021). Fortunately, gradient flow preserves the difference of norms across layers along training (Du et al., 2018). As a result, a balanced initialization induces a balanced convergence, while GD would break this balancing effect due to its finite learning rate. However, it has recently been observed that GD with large learning rates enjoys a balancing effect (Wang et al., 2021), converging to a (not perfectly) balanced result despite an imbalanced initialization. Motivated by these connections between optimization, loss landscape and generalization, we would like to understand the training behavior of gradient descent with a large learning rate, from low-dimensional to representative models.

4. STABLE OSCILLATION ON 1-D FUNCTIONS

Definition 2. (Period-2 stable oscillation.) Consider GD on a function f on a domain Ω, and denote the update rule of GD by F(x) for x ∈ Ω. A period-2 stable oscillation exists if there is x ∈ Ω such that F(F(x)) = x and x is not a minimum of f. We initiate our analysis of the stable oscillation phenomenon in 1-D. Starting from a condition on general 1-D functions, we look into several specific 1-D functions to verify our arguments. Then, focusing on a function of the form f(x) = (x^2 - µ)^2, we present the convergence analysis as a foundation for the following discussions. Furthermore, to shed light on the multi-layer setting, we propose a balancing effect to make a connection to the 1-D analysis, as shown in Appendix A.1. General 1-D functions. Consider a 1-D function f(x) with a learnable parameter x ∈ R. The parameter updates following GD with learning rate η as
x^{(t+1)} := x^{(t)} - η f'(x^{(t)}). (3)
Assuming f is differentiable and all derivatives are bounded, the function value at the next step can be approximated by
f(x^{(t+1)}) = f(x^{(t)}) - η [f'(x^{(t)})]^2 (1 - (η/2) f''(x^{(t)})) + o((x^{(t+1)} - x^{(t)})^2).
If η < 2/f''(x^{(t)}), this approximation reveals that the function value monotonically decreases at each step of GD, ignoring higher-order terms. Such an assumption would guarantee convergence to the global minimum of a convex function. However, our interest is in what happens if η > 2/f''(x). For instance, if f is a quadratic function, the second-order derivative f'' is constant; as a result, once η > 2/f'', GD diverges unless initialized at the optimum. However, when trained with a large learning rate η > 2/f''(x), there is still some hope for GD to stay around a local minimum x̄, as stated in the following theorem. Theorem 1. Consider any 1-D differentiable function f(x) around a local minimum x̄, satisfying (i) f'''(x̄) ≠ 0, and (ii) 3[f'''(x̄)]^2 - f''(x̄) f''''(x̄) > 0.
Then, there exists ε with |ε| sufficiently small and ε · f'''(x̄) > 0 such that: for any point x_0 between x̄ and x̄ - ε, there exists a learning rate η such that the update rule F_η of GD satisfies F_η(F_η(x_0)) = x_0, and
2/f''(x̄) < η < 2/(f''(x̄) - ε f'''(x̄)).
The details of the proof are presented in Appendix C. As stated in Theorem 1, we provide a condition that allows GD to stably oscillate around a local minimum. But it does not yet tell whether some functions allow such oscillation with f'''(x̄) = 0. A quadratic function does not satisfy this condition, since f''' = f'''' ≡ 0, and it diverges when GD is beyond the edge of stability. For f(x) = sin(x) around x̄ = -π/2, where f'''(x̄) = 0, it turns out that the sine function does allow stable oscillation. Therefore, we extend the argument of Theorem 1 to the higher-order case in Lemma 1. Lemma 1. Consider any 1-D differentiable function f(x) around a local minimum x̄ such that the lowest-order non-zero derivative at x̄ (other than f'') is f^{(k)}(x̄) with k ≥ 4. Then, there exists ε with |ε| sufficiently small such that, for any point x_0 between x̄ and x̄ - ε:
1. if k is odd and ε · f^{(k)}(x̄) > 0, f^{(k+1)}(x̄) < 0, then there exists η ∈ (2/f''(x̄), 2/(f''(x̄) - f^{(k)}(x̄) ε^{k-2}));
2. if k is even and f^{(k)}(x̄) < 0, then there exists η ∈ (2/f''(x̄), 2/(f''(x̄) + f^{(k)}(x̄) ε^{k-2}));
such that the update rule F_η of GD satisfies F_η(F_η(x_0)) = x_0. With Lemma 1, we can verify that the sine function allows stable oscillation, as in Corollary 1, because its lowest-order non-zero derivative (other than f'') at the local minimum is f''''(x̄) < 0. Meanwhile, Theorem 1 guarantees that the squared loss on any base model g provably allows stable oscillation once g satisfies some mild conditions, as stated below. Lemma 2. Consider a 1-D function g(x), and define the loss function f as f(x) = (g(x) - y)^2.
Assume (i) g'(x̄) ≠ 0 at the point x̄ where g(x̄) = y, and (ii) g'(x̄) g'''(x̄) < 6 [g''(x̄)]^2. Then f satisfies the condition of Theorem 1 or Lemma 1, and thus allows period-2 stable oscillation around x̄. This setup covers generic non-linear least-squares problems, including base models g given by sine, tanh, high-order monomials, exponential, logarithm, sigmoid, softplus, gaussian, etc. The proof details for these settings of g(x) are provided as Corollaries 1-8 in Appendices D and E. Moreover, we provide a straightforward method to build a more complicated model from two simple base models, as follows. Proposition 1 (Composition Rule for Stable Oscillation). Consider two 1-D functions p, q. Assume both p(x) at x = x̄ and q(y) at y = p(x̄) satisfy the conditions on g in Lemma 2 that allow stable oscillations. Then q(p(x)) allows stable oscillation around x = x̄. Proof details of the above lemmas and proposition are presented in Appendices D and E. Next, we present a careful analysis of g(x) = x^2. A special 1-D function. Consider f(x) = (1/4)(x^2 - µ)^2 with µ > 0, for which f''(√µ) = 2µ and f'''(√µ) = 6√µ. This function is special to us because it can be viewed as a symmetric scalar factorization problem under the squared loss. Later we will leverage it to gain insights for asymmetric initialization, two-layer single-neuron networks and matrix factorization. Before that, we would like to show where GD converges when η > 2/f''(√µ) = 1/µ, as follows. Theorem 2. For f(x) = (1/4)(x^2 - µ)^2, consider GD with η = K · (1/µ), where 1 < K < √4.5 - 1 ≈ 1.121, initialized at any point 0 < x_0 < √µ. Then GD converges to an orbit of period 2, except for a measure-zero set of initializations from which it converges to √µ. More precisely, the period-2 orbit consists of the solutions x = δ_1 ∈ (0, √µ), x = δ_2 ∈ (√µ, 2√µ) obtained by solving for δ in
η = 1 / [ δ^2 ( √(µ/δ^2 - 3/4) + 1/2 ) ].
The details of the proof are presented in Appendix F.
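Theorem 2 is easy to check numerically; the sketch below (our illustration) runs GD on f(x) = (1/4)(x^2 - µ)^2 with K = 1.1 and recovers the period-2 orbit together with its defining equation.

```python
import math

mu, K = 1.0, 1.1
eta = K / mu                                # above the EoS threshold 1/mu

F = lambda x: x - eta * x * (x * x - mu)    # GD step; f'(x) = x * (x^2 - mu)
x = 0.3                                     # an initialization in (0, sqrt(mu))
for _ in range(2000):
    x = F(x)

delta1, delta2 = sorted([x, F(x)])          # the two points of the orbit
# Both deltas satisfy eta = 1 / (delta^2 * (sqrt(mu/delta^2 - 3/4) + 1/2)).
```

For K = 1.1 the orbit comes out as roughly δ_1 ≈ 0.815 and δ_2 ≈ 1.116, straddling the minimum √µ = 1 as the theorem predicts.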
As shown above, Theorem 1 and Theorem 2 operate at two different levels: Theorem 1 restricts the discussion to a local view because of the Taylor approximation, while Theorem 2 starts from local convergence and then generalizes it to a global view. Nevertheless, Theorem 1 builds a foundation for Theorem 2, since the latter degenerates to the former when K is extremely close to 1. A natural follow-up question is what implications Theorem 2 brings, because 1-D is far from the practice of neural networks, which involve multi-layer structures, nonlinearity and high dimensions. We incorporate precisely two layers and nonlinearity in Section 5, and high dimensions in Section 6.
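Condition (ii) of Lemma 2 is straightforward to spot-check for the base models listed above; the snippet below (our illustration, with the derivatives written in closed form) evaluates g'(x) g'''(x) < 6 [g''(x)]^2 at a sample point.

```python
import math

def cond(g1, g2, g3, x):
    """Lemma 2, condition (ii): g'(x) * g'''(x) < 6 * [g''(x)]^2."""
    return g1(x) * g3(x) < 6.0 * g2(x) ** 2

x = 0.5  # sample point standing in for the minimum of f(x) = (g(x) - y)^2
checks = {
    "sin":  cond(math.cos, lambda t: -math.sin(t), lambda t: -math.cos(t), x),
    "exp":  cond(math.exp, math.exp, math.exp, x),
    "x^3":  cond(lambda t: 3 * t * t, lambda t: 6 * t, lambda t: 6.0, x),
    "tanh": cond(
        lambda t: 1 - math.tanh(t) ** 2,                        # g'
        lambda t: -2 * math.tanh(t) * (1 - math.tanh(t) ** 2),  # g''
        lambda t: (1 - math.tanh(t) ** 2) * (6 * math.tanh(t) ** 2 - 2),  # g'''
        x,
    ),
}
```

All four models pass the check at this point, consistent with Corollaries 1-8 in Appendices D and E.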

5. ON A TWO-LAYER SINGLE-NEURON HOMOGENEOUS NETWORK

We denote a two-layer single-neuron network as f(x; θ) = v · σ(w^T x), where v ∈ R, w ∈ R^d, the set of trained parameters is θ = (v, w^T)^T ∈ R^{d+1}, and the nonlinearity σ is ReLU. We keep this order in θ to view it as a vector. The input x ∈ R^d is drawn uniformly from the unit sphere S^{d-1}. The parameters are trained by GD on the L_2 population loss, as
θ_{t+1} = θ_t - η ∇_θ L(θ_t),   L(θ_t) = E_{x ∈ S^{d-1}} [ (f(x; θ_t) - y)^2 ].
We generate labels from a single teacher neuron, as y|x = σ(w̄^T x); hence w̄, with ||w̄|| = 1, is our target neuron to learn. We denote the angle between w and w̄ by α ≥ 0. Note that α can be taken non-negative because the loss function is symmetric w.r.t. the angle. Moreover, the rotational symmetry of the population data distribution results in a loss landscape that depends on w only through the angle α and the norm ||w||. Indeed, from the definition, we have
∇_θ L = (1/d) [ v ||w||^2 - (||w||/π)(sin α + (π - α) cos α) ;
                v^2 w - (v/π)(π - α + (1/2) sin 2α) w̄ - (v/π)(1/2 - (1/2) cos 2α) w̄⊥ ],
where w̄⊥ denotes the normalization of w - proj_w̄ w. Consider the Hessian
H = [ ∂_v^2 L , ∂_w ∂_v L ; ∂_v ∂_w L , ∂_w^2 L ].
From the second row of ∇_θ L, which is ∇_w L, it is clear that the updates of w always stay in the plane spanned by w̄ and w^{(0)}. Hence, this problem can be reduced to three variables (v, w_x, w_y) with the target neuron w̄ = [1, 0], where
v^{(t)} := v^{(t)},   w_x^{(t)} := proj_w̄ w^{(t)},   w_y^{(t)} := proj_w̄⊥ w^{(t)} = √( ||w^{(t)}||^2 - (w_x^{(t)})^2 ).
We keep w_y non-negative because the loss L is invariant to its sign and our previous convention α ≥ 0 requires a non-negative w_y. Then we show that w_y decays to 0, as follows. Theorem 3. In the above setting, consider a teacher neuron w̄ = [1, 0] and set the learning rate η = Kd with K ∈ (1, 1.1]. Initialize the student with ||w^{(0)}|| = v^{(0)} ∈ (0, 0.10] and ⟨w^{(0)}, w̄⟩ ≥ 0. Then, for t ≥ T_1 + 4, w_y^{(t)} decays as
w_y^{(t)} < 0.1 · (1 - 0.030 K)^{t - T_1 - 4},   T_1 ≤ log_{2.56}(1.35 π β^2),   β = 1 + 1.1/π.
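Theorem 3 can be simulated directly in the reduced coordinates (v, w_x, w_y); the sketch below is our reconstruction of the population-gradient recursion above (not the authors' code), with η = Kd so that the 1/d factor cancels.

```python
import math

K = 1.05                           # learning rate eta = K * d, with K in (1, 1.1]
v, wx = 0.05, 0.01                 # ||w(0)|| = v(0) = 0.05, <w(0), teacher> >= 0
wy = math.sqrt(v * v - wx * wx)

def step(v, wx, wy):
    # One GD step on the population loss; gradients times d, teacher = [1, 0].
    alpha = math.atan2(wy, wx)     # angle between w and the teacher neuron
    nw2 = wx * wx + wy * wy
    gv = v * nw2 - (math.sqrt(nw2) / math.pi) * (
        math.sin(alpha) + (math.pi - alpha) * math.cos(alpha))
    gx = v * v * wx - (v / math.pi) * (math.pi - alpha + 0.5 * math.sin(2 * alpha))
    gy = v * v * wy - (v / math.pi) * (0.5 - 0.5 * math.cos(2 * alpha))
    return v - K * gv, wx - K * gx, abs(wy - K * gy)  # w_y kept non-negative

for _ in range(1500):
    v, wx, wy = step(v, wx, wy)
```

In this run w_y decays to numerical zero, the imbalance v - w_x vanishes, and (v, w_x) settles on the balanced period-2 orbit described in Proposition 2 below.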
Proof sketch. The details of the proof are presented in Appendix I. The proof is divided into two stages, depending on whether w_y grows or not. The key is that the change of w_y follows (omitting all superscripts t)
Δw_y / w_y ∝ - v w_x + (1/π) · (w_y/w_x) / (1 + (w_y/w_x)^2),   w_y^{(t+1)} = | w_y + Δw_y |,
where the second term in Δw_y / w_y is bounded in [0, 1/(2π)]. In stage 1, where v w_x is relatively small, we show that the growth ratio of w_y is smaller than those of w_x and v w_x, resulting in an upper bound on the number of iterations for v w_x to reach 1/(2π), so max(w_y) is bounded too. Although the initialization is balanced, v^{(0)} = ||w^{(0)}||, for simplicity of the proof, v - w_x is also bounded at the end of stage 1. From the beginning of stage 2, thanks to the relatively narrow range of K, we are able to bound the three quantities (including v - w_x, v w_x and w_y), and they turn out to fall into a basin in the parameter space after four iterations. In this basin, w_y decays exponentially with a linear rate of at most 0.97. With the guarantee of decaying w_y in the above theorem, the dynamics of the single-neuron ReLU network follow the convergence of the 2-D case in Appendix A.1, with the following convergence result. Proposition 2. The single-neuron model in Theorem 3 converges to a period-2 orbit where w_y = 0 and (v, w_x) ∈ γ_K, with γ_K = {(δ_1, δ_1), (δ_2, δ_2)}. Here δ_1 ∈ (0, 1), δ_2 ∈ (1, 2) are the solutions δ of
K = 1 / [ δ^2 ( √(1/δ^2 - 3/4) + 1/2 ) ].
Remark. This convergence is in fact close to the flattest minima: if the learning rate decays to an infinitesimal value after sufficiently many oscillations, then the trajectory walks towards the flattest minimum (v = w_x = 1, w_y = 0). To summarize, the single-neuron model goes through three phases of training dynamics, with an initialization of the angle ∠(w, w̄) of at most π/2. First, the angle decreases monotonically but, due to the growth of the norms, the absolute deviation w_y still increases.
Meanwhile, the imbalance v - w_x stays at a bounded level. Second, w_y starts to decrease and the parameters fall into a basin within four steps. Third, in the basin, w_y decreases exponentially and, once w_y is at a reasonably low level, the model approximately follows the dynamics of the 2-D case and the imbalance v - w_x decreases as well, following Theorem 5. The model converges to a period-2 orbit as in the 1-D case of Theorem 2.

6. QUASI-SYMMETRIC MATRIX FACTORIZATION: WALKING TOWARDS FLATTEST MINIMA

Consider a matrix factorization problem, parameterized by learnable weights X ∈ R^{m×p}, Y ∈ R^{q×p}, with target matrix C ∈ R^{m×q}. The loss L is defined as L(X, Y) = (1/2) ||X Y^T - C||_F^2. Obviously {(X, Y) : X Y^T = C} forms a minimum manifold. In this context, the question is to describe the GD dynamics in terms of a 'descent' phase (i.e., reaching the manifold), followed by a 'hovering' phase, where the dynamics evolve near the minimum manifold. Although we prove that the necessary 1-D condition holds around the minima in Theorem 6 (in Appendix A.2), it is more attractive to investigate GD in high dimensions. A straightforward subset of the "flattest" points in the manifold of minimisers is in fact given by symmetric factorizations, i.e., points of the form (X, X) with X X^T = C. As it turns out, the local behavior of GD beyond the EoS on this symmetric submanifold of minimisers can be explicitly analysed. Indeed, Theorem 7 (in Appendix A.2) shows that the dynamics follow the direction of the leading eigenvector and then stably oscillate with period 2, analogously to the 1-D case in Theorem 2. Note that, although {X : X X^T = X_0 X_0^T} forms a manifold containing infinitely many minimizers, all of them have the same sharpness due to the same leading singular values. So a natural follow-up question is to analyse minimizers with different sharpness.
The simplest setting that contains minimizers of varying sharpness is obtained by rescaling symmetric minimizers, leading to quasi-symmetric matrix factorization. Given a symmetric target C = X_0 X_0^T, assume that we are around the (global) minima Y_1 = α X_0 + ΔY_1, Z_1 = (1/α) X_0 + ΔZ_1, with α > 0 and small deviations ||ΔY_1||, ||ΔZ_1|| ≤ ε. Let σ_1 u_1 v_1^T be the component of the top singular value and vectors in the SVD of X_0. Then the EoS learning rate at (α X_0, (1/α) X_0) is 2 / (σ_1^2 (α^2 + 1/α^2)), which is largest, equal to 1/σ_1^2, at α = 1. We study the convergence of GD starting from Y_1 = α X_0 + ΔY_1, Z_1 = (1/α) X_0 + ΔZ_1 with learning rate η = 1/σ_1^2 + β, β > 0. The following theorem shows that, although starting near a sharper minimum, GD still converges to and stably oscillates around the flattest one. Theorem 4. Consider the above quasi-symmetric matrix factorization with learning rate η = 1/σ_1^2 + β. Assume 0 < β σ_1^2 < √4.5 - 2 ≈ 0.121. Consider a minimum (Y_0 = α X_0, Z_0 = (1/α) X_0), α > 0. The initialization is around this minimum, as Y_1 = Y_0 + ΔY_1, Z_1 = Z_0 + ΔZ_1, with the deviations satisfying u_1^T ΔY_1 v_1 ≠ 0, u_1^T ΔZ_1 v_1 ≠ 0 and ||ΔY_1||, ||ΔZ_1|| ≤ ε. The second largest singular value of X_0 needs to satisfy
max{ (η σ_1^2 / α^2)(1 + α^4 σ_2^2 / σ_1^2),  η σ_1^2 α^2 (1 + σ_2^2 / (α^4 σ_1^2)) } ≤ 2.
Then GD converges to a period-2 orbit γ_η approximately, with error in O(ε), formally written as
(Y_t, Z_t) → γ_η + (ΔY, ΔZ),   ||ΔY||, ||ΔZ|| = O(ε),
γ_η = { ( Y_0 + (ρ_i - α) σ_1 u_1 v_1^T,  Z_0 + (ρ_i - 1/α) σ_1 u_1 v_1^T ) },   (i = 1, 2),
where ρ_1 ∈ (1, 2), ρ_2 ∈ (0, 1) are the two solutions obtained by solving for ρ in
1 + β σ_1^2 = 1 / [ ρ^2 ( √(1/ρ^2 - 3/4) + 1/2 ) ].
Proof sketch. Details of the proof can be found in Appendix J.3; it shares a similar spirit with Theorem 7. The analysis consists of two phases, depending on whether ε_{y,t} := ⟨Y_t - Y_0, u_1 v_1^T⟩ and ε_{z,t} := ⟨Z_t - Z_0, u_1 v_1^T⟩ are small or not. In Phase I, all components of Y_t - Y_0 and Z_t - Z_0 are small due to the initialization near the minimum, but both ε_{y,t} and ε_{z,t} grow exponentially at a rate of η σ_1^2 α^2 + η σ_1^2 / α^2 - 1 ≥ 2 η σ_1^2 - 1 > 1. In Phase II, both ε_{y,t} and ε_{z,t} are much larger than the other components, as long as the other components are still not growing. So their dynamics match GD on the 2-D function f(y, z) = (1/2)(yz - 1)^2 with learning rate 1 + β σ_1^2. Following the 2-D analysis in Theorem 5, ε_{y,t} and ε_{z,t} converge to the same values, which reduces the 2-D problem to a 1-D function. Therefore, the proof concludes with the 1-D convergence analysis of f(x) = (1/4)(x^2 - 1)^2, as shown in Theorem 2. Remark. Note that both Y_0 - α · σ_1 u_1 v_1^T and Z_0 - (1/α) · σ_1 u_1 v_1^T are the residuals of Y_0, Z_0 with the top singular component eliminated. Then, compared with Theorem 7, ρ_i corresponds to δ_i + 1, which means that both the symmetric and quasi-symmetric cases converge to parameters with the same top singular values and wander around the flattest minima. In other words, this convergence is close to the flattest minima because, if the learning rate decays to an infinitesimal value after sufficiently many oscillations, the trajectory walks towards the flattest minima approximately, with parameter distance in O(ε). Also note that, if η is slightly below 1/σ_1^2, we anticipate that GD still escapes from the sharp minima and converges to a flatter one (not necessarily the flattest). This result could be obtained by tracking GD on f(x, y) = (1/2)(xy - 1)^2 with η slightly below 1, but the closed form cannot be expressed explicitly, because it strongly depends on the initialization.
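A minimal simulation of Theorem 4's picture (our illustration with a hypothetical diagonal X_0, not the paper's experiment) shows the escape from the re-scaled minimum and the balanced period-2 orbit around the flattest one.

```python
import numpy as np

sigma1, sigma2, alpha, beta = 1.0, 0.5, 0.8, 0.05
X0 = np.diag([sigma1, sigma2])
C = X0 @ X0.T
eta = 1.0 / sigma1**2 + beta   # slightly above the EoS of the flattest minima

# Quasi-symmetric initialization near the re-scaled minimum (alpha X0, X0/alpha),
# with a small deviation along the top direction u1 v1^T (here e1 e1^T).
Y = alpha * X0 + np.diag([0.01, 0.0])
Z = X0 / alpha
for _ in range(5000):
    R = Y @ Z.T - C            # residual; gradients of 0.5 * ||Y Z^T - C||_F^2
    Y, Z = Y - eta * R @ Z, Z - eta * R.T @ Y

s_y = np.linalg.svd(Y, compute_uv=False)[0]
s_z = np.linalg.svd(Z, compute_uv=False)[0]
```

Despite the imbalanced start (α = 0.8), the top singular values of Y and Z coincide at the end of the run and oscillate with period 2 around σ_1, matching ρ_1 σ_1 and ρ_2 σ_1 from Theorem 4 (consecutive values multiply to σ_1^2 / (1 + β σ_1^2)).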

7. NUMERICAL EXPERIMENTS

In this section, we provide numerical experiments to verify our theorems. Additional experiments on 2-D functions, an MLP and MNIST can be found in Appendix B. 1-D functions. As discussed in Section 4, f(x) = (1/4)(x^2 - 1)^2 satisfies the condition in Theorem 1 and g(x) = 2 sin(x) satisfies Lemma 1, so we expect both f and g to allow stable oscillation around their local minima. It turns out that GD stably oscillates around the local minima of both functions when η is slightly above 2/f''(x̄), as shown in Figure 1. Two-layer single-neuron model. As discussed in Section 5, with a learning rate η ∈ (d, 1.1d], a single-neuron network f(x) = v · σ(w^T x) is able to align with the direction of the teacher neuron under population loss. We train such a model on the empirical loss over 1000 data points uniformly sampled from the sphere S^1, as shown in Figure 2. The student neuron is initialized orthogonal to the teacher neuron. At the end of training, w_y decays to a small value before the imbalance |v - w_x| decays sharply, which verifies our argument in Section 5. With a small w_y, this nonlinear problem degenerates to a 2-D problem in (v, w_x). Then, the balancing property makes it align with the 1-D problem, where v and w_x converge to a period-2 orbit. Note that the small residuals of |v - w_x| and w_y are due to the difference between the population and empirical losses. Symmetric and quasi-symmetric matrix factorization. As discussed in Section 6 and Appendix A.2, under mild assumptions, both the symmetric and quasi-symmetric cases stably wander around the flattest minima. We run GD on a matrix factorization problem with X_0 X_0^T = C ∈ R^{8×8}. The learning rate is 1.02× the EoS threshold. Following the setting in Section 6, in the symmetric case training starts near X_0 and, in the quasi-symmetric case, it starts near (α X_0, (1/α) X_0) with α = 0.8, as shown in Figure 3.
Although it starts from a re-scaled initialization, the quasi-symmetric case achieves the same top singular values in Y and Z, which verifies the balancing effect of 2-D functions in Theorem 5. The top singular values of both cases then converge to the same period-2 orbit, as supported by Theorems 2, 4 and 7.
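The 1-D experiment above can be reproduced in a few lines. The sketch below (initialization and step count are illustrative) runs GD on f(x) = ¼(x² − 1)² with η = 1.05, slightly above the EoS threshold 2/f''(1) = 1:

```python
# GD on f(x) = (x^2 - 1)^2 / 4 with eta slightly above the EoS threshold
# 2/f''(1) = 1; the iterates settle into a stable period-2 oscillation.
f_grad = lambda x: (x**2 - 1.0) * x
eta, x = 1.05, 0.6            # illustrative values; eta*mu = 1.05 is in (1, 1.121)
traj = [x]
for _ in range(500):
    x -= eta * f_grad(x)
    traj.append(x)
a, b = traj[-2], traj[-1]     # late iterates alternate between two values around x = 1
print(round(a, 4), round(b, 4))
```

The two late iterates straddle the minimum x̄ = 1 and repeat with period 2, matching the stable oscillation seen in Figure 1.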

A.1 ON A 2-D FUNCTION

Similar to f(x) = ¼(x² − µ)², consider the 2-D function f(x, y) = ½(xy − µ)². Clearly, if x and y are initialized equal, then (x^(t), y^(t)) always aligns with the 1-D case from the same initialization. It is therefore instructive to analyze this problem under different initializations for x and y, which we call "imbalanced" initialization. Another major difference is that the global minima of the 2-D case form a manifold {(x, y) | xy = µ}, while the 1-D case has only two global minima. We would like to understand which points in the global minima manifold, or in the whole parameter space, are preferred by GD. Note that re-weighting the two parameters can drive the curvature to infinity as in (Elkabetz & Cohen, 2021), so the imbalance strongly affects the local curvature. Viewing f(x) as a symmetric scalar factorization problem, we treat f(x, y) as asymmetric scalar factorization. The GD update rule is
x^(t+1) := x^(t) − η(x^(t)y^(t) − µ)y^(t), y^(t+1) := y^(t) − η(x^(t)y^(t) − µ)x^(t).
The Hessian is
H = [ [∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²] ] = [ [y², 2xy − µ], [2xy − µ, x²] ].
When xy = µ, the eigenvalues of H are λ_1 = x² + y² and λ_2 = 0. Note that λ_1 = (x − y)² + 2µ. Hence, on the global minima manifold, the local curvature at a point is larger when its two parameters are more imbalanced. Among all these points, the smallest curvature λ_1 = 2µ is attained at x = y = √µ. In other words, if the learning rate satisfies η > 2/(2µ) = 1/µ, every point on the manifold is too sharp for GD to converge. We investigate the behavior of GD in this regime. It turns out that the two parameters are driven to a perfect balance even when initialized differently, as follows. Theorem 5. For f(x, y) = ½(xy − µ)², consider GD with learning rate η = K · (1/µ). Assume both x and y stay positive during the whole process {x_i, y_i}_{i≥0}.
In this process, denote the sequence of all points with xy > µ as P = {(x_i, y_i) | x_i y_i > µ}. Then |x − y| decays to 0 along P, for any 1 < K < 1.5.
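The statement can be illustrated with a short simulation (the values of µ, K and the starting point are hypothetical):

```python
# Minimal simulation of Theorem 5's balancing effect: GD on
# f(x, y) = (xy - mu)^2 / 2 from an imbalanced positive start with x*y > mu.
mu, K = 1.0, 1.05
eta = K / mu
x, y = 0.84, 1.25               # imbalanced initialization, x*y > mu
gaps = []
for _ in range(3000):
    g = eta * (x * y - mu)      # shared factor eta * (xy - mu)
    x, y = x - g * y, y - g * x
    gaps.append(abs(x - y))
print(gaps[0], gaps[-1])
```

The imbalance |x − y| decays towards 0 even though the product xy keeps oscillating around µ, so the trajectory approaches the balanced period-2 orbit of the 1-D problem.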

Proof sketch

The detailed proof is presented in Appendix G. Start from a point (x^(t), y^(t)) with x^(t)y^(t) > µ. Because y^(t+1) − x^(t+1) = (y^(t) − x^(t))(1 + η(x^(t)y^(t) − µ)), it suffices to show
|(y^(t+2) − x^(t+2)) / (y^(t) − x^(t))| = |(1 + η(x^(t)y^(t) − µ))(1 + η(x^(t+1)y^(t+1) − µ))| < 1.
Since 1 + η(x^(t)y^(t) − µ) > 1, the analysis of 1 + η(x^(t+1)y^(t+1) − µ) is divided into three cases according to the coupling of (x^(t), y^(t)) and (x^(t+1), y^(t+1)). Remark. For a larger K ≥ 1.5, it is possible for GD to converge to an imbalanced orbit. For instance, Figure 15 in (Wang et al., 2021) shows imbalanced orbits for f(x, y) = ½(xy − 1)² with K = 1.9. Combined with the fact that GD converges to a stationary point whose sharpness is beyond the edge of stability with probability zero (Ahn et al., 2022), Theorem 5 reveals that x and y converge to a perfect balance. Note that this balancing effect differs from that of gradient flow (Du et al., 2018), which states that gradient flow preserves the difference of the norms of different layers along training. As a result, under gradient flow an imbalanced initialization induces an imbalanced limit, while in our case imbalanced-initialized weights converge to a perfect balance. Furthermore, Theorem 5 shows that the two parameters are squeezed into a single variable, which reduces the problem to our 1-D analysis in Theorem 2. Consequently, both cases converge to the same orbit when 1 < K < 1.121, as stated in Prop 3. Numerical results are presented in Figure 4. Proposition 3. Follow the setting of Theorem 5, and further assume 1 < K < √4.5 − 1 ≈ 1.121. Then GD converges to an orbit of period 2. The orbit is formally written as {(x = y = δ_i) | i = 1, 2}, with δ_1 ∈ (0, √µ) and δ_2 ∈ (√µ, 2√µ) the solutions of
η = 1 / [δ² (√(µ/δ² − 3/4) + 1/2)].
Remark.
This convergence is close to the flattest minima in the following sense: if the learning rate decays to an infinitesimal value after sufficiently many oscillations, the trajectory walks towards the flattest minima. However, note that the imbalance at initialization needs to be bounded in Theorem 5, because both x and y are assumed to stay positive along training. More precisely, we have
x^(t+1) y^(t+1) = x^(t)y^(t) (1 − η(x^(t)y^(t) − µ))² − η(x^(t)y^(t) − µ)(x^(t) − y^(t))²,
and then x^(t+1) y^(t+1) < 0 when |x^(t) − y^(t)| is large with x^(t)y^(t) > µ fixed. Therefore, we provide a condition guaranteeing that both x and y stay positive, with details presented in Appendix H. Lemma 3. In the setting of Theorem 5, denote the initialization by m = |y_0 − x_0|/√µ with x_0 y_0 > µ, and let p = 4/(m + √(m² + 4))² and q = (1 + p)². Then both x and y stay positive during the whole process if
max{ η(x_0 y_0 − µ), (4/27)(1 + K)³ + (2/3)K² − (1/3)K + (qK²/(2(K + 1))) · (m²/(qm²)) − K } < p.
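The period-2 orbit of Prop 3 can be computed numerically and verified to be a 2-cycle of the balanced GD map (µ and K below are illustrative):

```python
import math

# Solve Prop 3's orbit equation eta = 1 / (d^2 (sqrt(mu/d^2 - 3/4) + 1/2)) for the
# two period-2 points, then check that GD on the balanced line x = y swaps them.
mu, K = 1.0, 1.05
eta = K / mu

def orbit_eq(d):
    return 1.0 / (d * d * (math.sqrt(mu / (d * d) - 0.75) + 0.5)) - eta

def bisect(lo, hi, fn, iters=200):
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fn(lo) * fn(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

d1 = bisect(0.5, 0.999, orbit_eq)    # delta_1 in (0, sqrt(mu))
d2 = bisect(1.001, 1.154, orbit_eq)  # delta_2 in (sqrt(mu), sqrt(4*mu/3))

step = lambda x: x + eta * (mu - x * x) * x   # one balanced GD step
print(d1, d2)
```

One GD step maps δ_1 to δ_2 and back, confirming that the two roots of the orbit equation form the period-2 orbit.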

A.2 ON MATRIX FACTORIZATION

In this section, we present two additional results on matrix factorization.

A.2.1 ASYMMETRIC CASE: 1D FUNCTION AT THE MINIMA

Before stating the theorem, we clarify the definition of the loss Hessian. We flatten X, Y into a vector θ = vec(X, Y) ∈ R^{mp+pq}, which vectorizes the concatenation. As a result, the loss Hessian w.r.t. θ is a matrix in R^{(mp+pq)×(mp+pq)}, and the loss landscape lives on R^{mp+pq}. Similarly, we use (∆X, ∆Y), of the same shapes as (X, Y), to denote the components of a direction ∆ ≜ vec(∆X, ∆Y) ∈ R^{mp+pq}. In the following theorem, we exhibit the leading eigenvector ∆ of the loss Hessian. Since the cross-section of the loss landscape along ∆ forms a 1-D function f_∆, we also show that the stable-oscillation condition for 1-D functions holds at the minima of f_∆. Theorem 6. For a matrix factorization problem, assume XY = C. Consider the SVDs X = Σ_{i=1}^{min{m,p}} σ_{x,i} u_{x,i} v_{x,i}ᵀ and Y = Σ_{i=1}^{min{p,q}} σ_{y,i} u_{y,i} v_{y,i}ᵀ, where both groups of σ_{•,i} are in descending order and both top singular values σ_{x,1} and σ_{y,1} are unique. Also assume v_{x,1}ᵀ u_{y,1} ≠ 0. Then the leading eigenvector of the loss Hessian is ∆ = vec(C_1 u_{x,1} u_{y,1}ᵀ, C_2 v_{x,1} v_{y,1}ᵀ) with C_1 = σ_{y,1}/√(σ_{x,1}² + σ_{y,1}²) and C_2 = σ_{x,1}/√(σ_{x,1}² + σ_{y,1}²). Denote by f_∆ the 1-D function on the cross-section of the loss landscape along the direction ∆ passing through vec(X, Y). Then, at the minimum of f_∆, it satisfies
3[f_∆^(3)]² − f_∆^(2) f_∆^(4) > 0.
The proof is provided in Appendix J.1. This theorem generalizes our 1-D analysis to higher dimensions: the 1-D condition is satisfied around any minimum for two-layer matrix factorization. By Theorem 1 and Lemma 1, if this 1-D condition holds, there exists a period-2 orbit around the minimum for GD beyond the EoS.
However, this is not straightforward to generalize to high dimensions, because 1) the directions of the leading eigenvector and the (nearby) gradient are not necessarily aligned, and 2) it is more natural and practical to consider initializations in arbitrary directions around the minima rather than strictly along the leading eigenvector. Therefore, we present below a convergence analysis with initialization near the minima in an arbitrary direction.
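Theorem 6 can be sanity-checked numerically on a tiny instance. The sketch below assumes the loss ½‖XY − C‖²_F (the overall constant only rescales the Hessian) and compares a finite-difference Hessian's leading eigenvector with the predicted direction; the matrices are illustrative:

```python
import numpy as np

# Tiny check of Theorem 6: X = diag(2, 1), Y = diag(1.5, 1), XY = C, so
# sigma_x1 = 2, sigma_y1 = 1.5 and the predicted leading eigenvector puts
# weight C1 = 0.6 on X_11 and C2 = 0.8 on Y_11.
X = np.diag([2.0, 1.0])
Y = np.diag([1.5, 1.0])
C = X @ Y

def loss(theta):
    Xm = theta[:4].reshape(2, 2)
    Ym = theta[4:].reshape(2, 2)
    return 0.5 * np.sum((Xm @ Ym - C) ** 2)

theta0 = np.concatenate([X.ravel(), Y.ravel()])
h, n = 1e-4, theta0.size
H = np.zeros((n, n))
for i in range(n):                  # central-difference Hessian
    for j in range(n):
        t = theta0.copy(); t[i] += h; t[j] += h; fpp = loss(t)
        t = theta0.copy(); t[i] += h; t[j] -= h; fpm = loss(t)
        t = theta0.copy(); t[i] -= h; t[j] += h; fmp = loss(t)
        t = theta0.copy(); t[i] -= h; t[j] -= h; fmm = loss(t)
        H[i, j] = (fpp - fpm - fmp + fmm) / (4 * h * h)

w, V = np.linalg.eigh(H)
top = V[:, -1]                      # leading eigenvector of the loss Hessian
pred = np.zeros(n)
pred[0], pred[4] = 0.6, 0.8         # predicted weights on X_11 and Y_11
print(w[-1], abs(top @ pred))
```

The top eigenvalue comes out near σ_{x,1}² + σ_{y,1}² = 6.25 and the eigenvector aligns (up to sign) with the predicted (C_1, C_2) direction.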

A.2.2 SYMMETRIC CASE: CONVERGENCE ANALYSIS AROUND THE MINIMA

In this section, we focus on the symmetric case of matrix factorization, where Y = Xᵀ. Accordingly, we rescale the loss function as L(X) = ¼‖XXᵀ − C‖²_F. Denote the target as C = X_0 X_0ᵀ, and assume we start around the minimum, X_1 = X_0 + ∆X_1 with a small deviation ‖∆X_1‖ ≤ ε. Consider the SVD X_0 = Σ_{i=1}^{min{m,p}} σ_i u_i v_iᵀ = σ_1 u_1 v_1ᵀ + X̃_0, where X̃_0 collects the non-leading components. Then the EoS learning-rate threshold at X = X_0 is η = 1/σ_1². We therefore show the convergence of GD starting from X_1 = X_0 + ∆X_1 with learning rate η = 1/σ_1² + β, where β > 0. Theorem 7. Consider the above symmetric matrix factorization with learning rate η = 1/σ_1² + β. Assume 0 < βσ_1² < √4.5 − 2 ≈ 0.121 (so that ησ_1² < √4.5 − 1 ≈ 1.121, matching the range of Theorem 2) and ησ_2² < 1. The initialization is around the minimum, X_1 = X_0 + ∆X_1, with the deviation satisfying u_1ᵀ∆X_1 v_1 ≠ 0 and ‖∆X_1‖ ≤ ε bounded by a small value. Then GD converges to a period-2 orbit γ_η up to a small margin of O(ε), formally written as
X_t → γ_η + ∆X, ‖∆X‖ = O(ε), γ_η = {X_0 + δ_1 σ_1 u_1 v_1ᵀ, X_0 + δ_2 σ_1 u_1 v_1ᵀ},
where δ_1 ∈ (0, 1) and δ_2 ∈ (−1, 0) are the two solutions of
1 + βσ_1² = 1 / [(δ + 1)² (√(1/(δ + 1)² − 3/4) + 1/2)].
Proof sketch. The proof is provided in Appendix J.2. The analysis consists of two phases, depending on whether ε_t ≜ ⟨X_t − X_0, u_1 v_1ᵀ⟩ is small. In Phase I, all components of X_t − X_0 are small due to the initialization near the minimum, but ε_t grows exponentially at a rate of 2ησ_1² − 1 = 1 + 2βσ_1² > 1. In Phase II, ε_t is much larger than the other components, which have not yet started to grow, so the dynamics of ε_t match GD on the 1-D function f(α) = ¼((α + σ_1)² − σ_1²)² with learning rate η = 1/σ_1² + β > 2/f''(0) = 1/σ_1². The proof concludes with the 1-D convergence analysis of f(x) = ¼(x² − µ)² as shown in Theorem 2. Remark. Theorem 7 allows GD to start from any point in an ε-ball near the minimum, except when u_1ᵀ∆X_1 v_1 = 0.
Note that this exception has Lebesgue measure zero and, even if it happens to hold, GD still has a chance to reach u_1ᵀ∆X_t v_1 ≠ 0 after several steps due to higher-order perturbations. This assumption could then be relaxed to ∆X_1 ≠ 0.
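A compact numerical sketch of Theorem 7 (sizes and constants are illustrative; the gradient of ¼‖XXᵀ − C‖²_F is (XXᵀ − C)X):

```python
import numpy as np

# Symmetric matrix factorization just beyond the EoS: C = X0 X0^T with
# sigma_1 = sqrt(3), eta = 1/sigma_1^2 + beta; the component along u1 v1^T
# settles into a period-2 orbit with one positive and one negative value.
X0 = np.diag([np.sqrt(3.0), 1.0])      # sigma_1 = sqrt(3), sigma_2 = 1
C = X0 @ X0.T
sigma1_sq = 3.0
beta = 0.05 / sigma1_sq                # beta * sigma_1^2 = 0.05 < sqrt(4.5) - 2
eta = 1.0 / sigma1_sq + beta

X = X0.copy()
X[0, 0] += 0.01                        # deviation with u1^T dX v1 != 0
eps = []
for _ in range(3000):
    X = X - eta * (X @ X.T - C) @ X    # gradient step on 0.25*||XX^T - C||_F^2
    eps.append(X[0, 0] - np.sqrt(3.0)) # eps_t = <X_t - X0, u1 v1^T>
print(eps[-2], eps[-1])
```

The two late values of ε_t alternate in sign around 0, i.e. the iterate oscillates across the minimum along u_1 v_1ᵀ, while the stable directions stay put.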

B ADDITIONAL EXPERIMENTS

B.1 2-D FUNCTION

As discussed in Section A.1, for the function f(x, y) = ½(xy − 1)², we expect |x − y| to decay to 0 when η ∈ (1, 1.5), as shown in Figure 4. Since the parameters reach a perfect balance, they then follow the convergence of the corresponding 1-D function f(x) = ¼(x² − 1)². As shown in Figure 4, xy with η = 1.05 converges to a period-2 orbit, as stated in the 1-D discussion of Theorem 2, while xy with η = 1.25 converges to a period-4 orbit, which is outside the range of our theorem but still inside the range for balance in Theorem 5.

B.2 HIGH DIMENSION AND MNIST

We perform two experiments in relatively higher-dimensional settings, aiming to show two observations that coincide with our discussions in low dimensions. Observation 1: GD beyond the EoS drives towards flatter minima. Observation 2: GD beyond the EoS behaves similarly to the low-dimensional case.

B.2.1 2-LAYER HIGH-DIM HOMOGENEOUS RELU NNS WITH PLANTED TEACHER NEURONS

We conduct a synthetic experiment in the high-dimensional teacher-student framework. The teacher network is
ŷ|x := f_teacher(x) = Σ_{i=1}^{16} ReLU(e_iᵀ x),
where x ∈ R^16 and e_i is the i-th vector in the standard basis of R^16. The student and the loss are
f(x; θ) = Σ_{i=1}^{16} v_i · ReLU(w_iᵀ x), L(θ) = (1/m) Σ_{j=1}^{m} (f(x_j; θ) − ŷ|x_j)².
Apparently, the global minimum manifold contains the following set M (w.l.o.g., ignoring any permutation):
M = {(v_i, w_i)_{i=1}^{16} | ∀i ∈ [16], w_i = k_i · e_i, v_i = 1/k_i, k_i > 0}.
However, different choices of {k_i}_{i=1}^{16} induce different sharpness around each minimum. Our aim is to show that GD with a large learning rate beyond the edge of stability drives from sharper minima to the flattest minima. Initialization. We initialize all student neurons directionally aligned with the teachers, w_i ∝ e_i, but choose various k_i, as k_i = 1 + 0.0625(i − 1). Such a choice of {k_i}_{i=1}^{16} is clearly not at the flattest minimum, due to the isotropy of the teacher neurons. We also add small noise to w_i so that training starts close to (but not exactly at) a sharp minimum: w_i = k_i · (e_i + 0.01ξ), ξ ∼ N(0, I). Data. We uniformly sample 10000 data points from the unit sphere S^15. Training. We run gradient descent with two learning rates η_1 = 0.5 and η_2 = 2.6. Experiments below indicate that the EoS threshold for the learning rate is around 2.5, so η_2 is beyond the edge of stability. GD with both learning rates starts from the same initialization and runs for 100 epochs; for the large-learning-rate case we then run another 20 epochs with the learning rate decayed from 2.6 to 0.5. Results. All results are provided in Figure 5. Figures 5(a, b) present the gap between the two trajectories: GD with the small learning rate stays around the sharp minimum, while GD with the larger one drives to flatter minima and then stably oscillates around them.
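As a small grounded check (sample size and seed are hypothetical), every point of M interpolates the teacher exactly, so the k_i above parameterize a manifold of zero-loss solutions that differ only in sharpness:

```python
import numpy as np

# Any choice of k on the manifold M fits the teacher exactly, since
# (1/k) * ReLU(k * e_i^T x) = ReLU(e_i^T x) for k > 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((256, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere S^15
y = np.maximum(X, 0.0).sum(axis=1)              # teacher: sum_i ReLU(e_i^T x)

k = 1.0 + 0.0625 * np.arange(16)                # the paper's imbalanced scales
W = np.diag(k)                                  # student weights w_i = k_i e_i
v = 1.0 / k                                     # output weights v_i = 1/k_i
pred = np.maximum(X @ W.T, 0.0) @ v
print(np.abs(pred - y).max())                   # ~ 0: zero loss for every k
```

This is why the experiment can only distinguish points of M through their sharpness, not their training loss.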
Meanwhile, from Figure 5(b), when the learning rate decreases from 2.6 to 0.5 after 100 epochs, GD converges to a nearby minimum which is significantly flatter than that reached with lr = 0.5. Figure 5(c) provides a more detailed view of ‖w_i‖/v_i for all 16 neurons. All neurons with lr = 0.5 stay at their original ratios k_i², while those with lr = 2.6 all converge to the same ratio around k² = ‖w‖/v = 1.21, as shown in Figure 5(d). We compute the relationship between the sharpness of the global minima in M and the choice of k, as shown in Figures 5(e, f): k² = 1.21 is indeed the choice of {k_i}_{i=1}^{16} at which the minimum is flattest. Therefore, in this high-dimensional teacher-student setting, GD beyond the edge of stability drives to the flattest minima.

B.2.2 MLPS ON MNIST

We conduct an experiment on real data to show that our finding in the low-dimensional setting of Theorem 1 can generalize to high dimensions. More precisely, our goals are to show that, when GD is beyond the EoS, 1. the oscillation direction (the gradient) aligns with the top eigenvector of the Hessian; 2. the 1-D function on the cross-section of the oscillation direction and the high-dimensional loss landscape satisfies the conditions in Theorem 1. Network, dataset and training. We run 3-, 4- and 5-layer ReLU MLPs on MNIST (LeCun et al., 1998). The networks have 16 neurons in each layer. To make high-order derivatives easier to compute, we simplify the dataset by 1) using only 2000 images from classes 0 and 1, and 2) using only the significant input channels whose standard deviation over the dataset is at least 110, which makes the network input dimension 79. We train the networks with the MSE loss using GD with large learning rates η = 0.5, 0.4, 0.35 and a small rate η = 0.1 (for the 3-layer network); the larger rates are beyond the EoS. Definition 3 (line-search minima). Consider a function f, a learning rate η and a point x ∈ domain(f).
We call x' the line-search minimum of x if x' = x − c* · η∇f(x), (27) where c* = argmin_{c ∈ [0,1]} f(x − c · η∇f(x)). The line-search minimum x' can be interpreted as the lowest point on the 1-D function induced by the gradient at x. If GD is beyond the EoS, x' stays in the valley below the oscillation of x. Results. All results are presented in Figures 6, 7 and 8. Take the 3-layer network as an example. From Figures 6(a, b), GD is beyond the EoS during epochs 10-14 and 21-60. For these epochs, the cosine similarity between the top Hessian eigenvector v_1 and the gradient is very close to 1, as shown in Figure 6(c), which verifies goal 1. In Figure 6(d), we compute 3[f^(3)]² − f^(2) f^(4) at the line-search minima along training; Theorem 1 requires this quantity to be positive to allow stable oscillation. It turns out that 3[f^(3)]² − f^(2) f^(4) > 0 at most points; the few exceptions all lie outside the EoS regime, and their negativity is small enough to be attributable to the approximation error of the fourth-order derivative. This verifies goal 2. The same arguments hold for the 4- and 5-layer cases, as shown in Figures 7 and 8. (Panels (c) of each figure show the similarity of the gradient and the top eigenvector v_1; panels (d) show 3[f^(3)]² − f^(2) f^(4) at the line-search minima.)
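Definition 3 can be sketched as follows; the grid search over c and the toy function are hypothetical implementation choices (the paper does not specify how the argmin is computed):

```python
import numpy as np

# Approximate the line-search minimum of Definition 3 by a grid over c in [0, 1].
def line_search_minimum(f, grad_f, x, eta, n_grid=1001):
    g = grad_f(x)
    cs = np.linspace(0.0, 1.0, n_grid)
    vals = [f(x - c * eta * g) for c in cs]
    c_star = cs[int(np.argmin(vals))]
    return x - c_star * eta * g

# Toy check on f(x) = (x^2 - 1)^2 / 4 beyond the EoS: the full GD step overshoots
# the valley, while the line-search minimum lands near the valley floor x = 1.
f = lambda x: 0.25 * (x**2 - 1.0) ** 2
grad = lambda x: (x**2 - 1.0) * x
x, eta = 0.85, 1.05
x_ls = line_search_minimum(f, grad, x, eta)
print(x_ls)
```

On this 1-D example the line-search minimum sits at the bottom of the valley below the oscillation, which is exactly the point at which the paper evaluates 3[f^(3)]² − f^(2) f^(4).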

C PROOF OF THEOREM 1

Theorem 8 (Restatement of Theorem 1). Consider any 1-D differentiable function f(x) around a local minimum x̄, satisfying (i) f^(3)(x̄) ≠ 0, and (ii) 3[f^(3)]² − f'' f^(4) > 0 at x̄. Then there exists ε with sufficiently small |ε| and ε · f^(3)(x̄) > 0 such that: for any point x_0 between x̄ and x̄ − ε, there exists a learning rate η such that the update rule F_η of GD satisfies F_η(F_η(x_0)) = x_0, with 2/f''(x̄) < η < 2/(f''(x̄) − ε · f^(3)(x̄)). Proof. For simplicity, we assume f^(3)(x̄) > 0 and take a starting point x_0 = x̄ − ε with ε > 0. We abbreviate f''(x̄), f^(3)(x̄), f^(4)(x̄) as f'', f^(3), f^(4). After running two steps of gradient descent, we have
x_0 = x̄ − ε,
f'(x_0) = −f'' ε + ½ f^(3) ε² − (1/6) f^(4) ε³ + O(ε⁴),
x_1 = x_0 − η f'(x_0) = x̄ − ε − η(−f'' ε + ½ f^(3) ε² − (1/6) f^(4) ε³) + O(ε⁴),
f'(x_1) = f''(x_1 − x̄) + ½ f^(3)(x_1 − x̄)² + (1/6) f^(4)(x_1 − x̄)³ + O(ε⁴),
x_2 = x_1 − η f'(x_1),
so that
(x_2 − x_0)/η = −f'(x_0) − f'(x_1)
= (2f'' − η f'' f'') ε + [−½ f^(3) + ½ η f'' f^(3) − ½ f^(3)(−1 + ηf'')²] ε² + [(1/6) f^(4) − (1/6) η f'' f^(4) + ½ (−1 + ηf'') η f^(3) f^(3) − (1/6)(−1 + ηf'')³ f^(4)] ε³ + O(ε⁴).
When η = 2/f'', it holds that
(x_2 − x_0)/η = (½ η [f^(3)]² − (1/3) f^(4)) ε³ + O(ε⁴),
which is positive if ½ η [f^(3)]² − (1/3) f^(4) = (1/(3f''))(3[f^(3)]² − f'' f^(4)) > 0 and |ε| is sufficiently small. When η = 2/(f'' − ε f^(3)), so that ηf'' = 2 + 2ε f^(3)/f'' + O(ε²), it holds that
(x_2 − x_0)/η = −2 f^(3) ε² + (−½ f^(3) + f^(3) − ½ f^(3)) ε² + O(ε³) = −2 f^(3) ε² + O(ε³),
which is negative when |ε| is sufficiently small. Therefore, there exists a learning rate η ∈ (2/f'', 2/(f'' − ε f^(3))) such that x_2 = x_0, by the continuity of x_2 − x_0 with respect to η. The proof generalizes to the case x_0 = x̄ − ε' with ε' ∈ (0, ε], with the learning rate still bounded as η ∈ (2/f'', 2/(f'' − ε' f^(3))).
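The continuity argument above can be mirrored numerically: for f(x) = ¼(x² − 1)² (so f'' = 2 and f^(3) = 6 at x̄ = 1), bisecting on η inside the stated interval finds the learning rate with F_η(F_η(x_0)) = x_0. The value of ε is illustrative:

```python
# Bisection companion to the proof of Theorem 1: at eta = 2/f'' the two-step
# displacement is positive, at eta = 2/(f'' - eps*f3) it is negative, so a
# period-2 learning rate lies strictly in between.
f_grad = lambda x: (x**2 - 1.0) * x
F = lambda x, eta: x - eta * f_grad(x)
F2 = lambda x, eta: F(F(x, eta), eta)

eps = 0.05
x0 = 1.0 - eps
lo, hi = 2.0 / 2.0, 2.0 / (2.0 - eps * 6.0)   # (2/f'', 2/(f'' - eps*f3))
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if F2(x0, mid) - x0 > 0:   # displacement still positive: increase eta
        lo = mid
    else:
        hi = mid
eta_star = 0.5 * (lo + hi)
print(eta_star, F2(x0, eta_star) - x0)
```

The recovered η* makes x_0 an exact period-2 point of GD, matching the interval predicted by the theorem.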

D PROOF OF LEMMA 1

Lemma 4 (Restatement of Lemma 1). Consider any 1-D differentiable function f(x) around a local minimum x̄, such that the lowest-order non-zero derivative (other than f'') at x̄ is f^(k)(x̄) with k ≥ 4. Then there exists ε with sufficiently small |ε| such that, for any point x_0 between x̄ and x̄ − ε:
1. if k is odd and ε · f^(k)(x̄) > 0, f^(k+1)(x̄) < 0, then there exists η ∈ (2/f'', 2/(f'' − f^(k) ε^{k−2}));
2. if k is even and f^(k)(x̄) < 0, then there exists η ∈ (2/f'', 2/(f'' + f^(k) ε^{k−2}));
such that the update rule F_η of GD satisfies F_η(F_η(x_0)) = x_0. Proof. (1) If k is odd, assuming f^(k) > 0 for simplicity, we have
x_0 = x̄ − ε,
f'(x_0) = −f'' ε + (1/(k−1)!) f^(k) ε^{k−1} − (1/k!) f^(k+1) ε^k + O(ε^{k+1}),
x_1 = x_0 − η f'(x_0) = x̄ − ε + η f'' ε − (1/(k−1)!) η f^(k) ε^{k−1} + (1/k!) η f^(k+1) ε^k + O(ε^{k+1}),
f'(x_1) = f''(x_1 − x̄) + (1/(k−1)!) f^(k)(x_1 − x̄)^{k−1} + (1/k!) f^(k+1)(x_1 − x̄)^k + O(ε^{k+1}),
(x_2 − x_0)/η = (x_1 − η f'(x_1) − x_0)/η = −f'(x_0) − f'(x_1)
= (2f'' − η f'' f'') ε + [−(1/(k−1)!) f^(k) + (1/(k−1)!) η f'' f^(k) − (1/(k−1)!) f^(k)(−1 + ηf'')^{k−1}] ε^{k−1} + [(1/k!) f^(k+1) − (1/k!) η f'' f^(k+1) − (1/k!) f^(k+1)(−1 + ηf'')^k] ε^k + O(ε^{k+1}).
When η = 2/f'', it holds that
(x_2 − x_0)/η = −(2/k!) f^(k+1) ε^k + O(ε^{k+1}).
When η = 2/(f'' − f^(k) ε^{k−2}), so that ηf'' = 2 + (2 f^(k)/f'') ε^{k−2} + O(ε^{2k−4}), it holds that
(x_2 − x_0)/η = −2 f^(k) ε^{k−1} + O(ε^k).
Since k is odd with ε · f^(k)(x̄) > 0 and f^(k+1)(x̄) < 0, exactly one of the two estimates of (x_2 − x_0)/η is positive and the other negative. Therefore, by the continuity of x_2 − x_0 w.r.t. η, there exists a learning rate η ∈ (2/f'', 2/(f'' − f^(k) ε^{k−2})) such that x_2 = x_0. The proof generalizes to any x_0 between x̄ and x̄ − ε with the same bound for η.
(2) If k is even, we have
x_0 = x̄ − ε,
f'(x_0) = −f'' ε − (1/(k−1)!) f^(k) ε^{k−1} + O(ε^k),
x_1 = x_0 − η f'(x_0) = x̄ − ε + η f'' ε + (1/(k−1)!) η f^(k) ε^{k−1} + O(ε^k),
f'(x_1) = f''(x_1 − x̄) + (1/(k−1)!) f^(k)(x_1 − x̄)^{k−1} + O(ε^k),
(x_2 − x_0)/η = −f'(x_0) − f'(x_1)
= (2f'' − η f'' f'') ε + [(1/(k−1)!) f^(k) − (1/(k−1)!) η f'' f^(k) − (1/(k−1)!) f^(k)(−1 + ηf'')^{k−1}] ε^{k−1} + O(ε^k).
When η = 2/f'', it holds that
(x_2 − x_0)/η = −(2/(k−1)!) f^(k) ε^{k−1} + O(ε^k).
When η = 2/(f'' + c f^(k) ε^{k−2}) with some constant c > 0, implying ηf'' = 2(1 − c (f^(k)/f'') ε^{k−2}) + O(ε^{2k−4}), it holds that
(x_2 − x_0)/η = (2c − 1/(k−1)!) f^(k) ε^{k−1} + O(ε^k),
where we then set c = 1. Hence, for sufficiently small |ε|, exactly one of the two estimates of (x_2 − x_0)/η is positive and the other negative. Therefore, by the continuity of x_2 − x_0, there exists a learning rate η ∈ (2/f'', 2/(f'' + f^(k) ε^{k−2})) such that x_2 = x_0. The proof generalizes to any x_0 between x̄ and x̄ − ε with the same bound for η. Corollary 1. f(x) = sin(x) allows stable oscillation around its local minima x̄. Proof. Its lowest-order non-zero derivative (other than f'') at x̄ is f^(4)(x̄) = sin(x̄) = −1 < 0, and the order 4 is even. Then Lemma 1 gives the result.

E PROOF OF LEMMA 2

Lemma 5 (Restatement of Lemma 2). Consider a 1-D function g(x), and define the loss function f as f(x) = (g(x) − y)². Assume (i) g'(x̄) ≠ 0 where g(x̄) = y, and (ii) g'(x̄) g^(3)(x̄) < 6[g''(x̄)]². Then f satisfies the condition in Theorem 1 or Lemma 1 to allow period-2 stable oscillation around x̄. Proof. From the definition, we have
f''(x) = 2[g(x) − y] g''(x) + 2[g'(x)]², (35)
f^(3)(x) = 2[g(x) − y] g^(3)(x) + 6 g'(x) g''(x),
f^(4)(x) = 2[g(x) − y] g^(4)(x) + 6[g''(x)]² + 8 g'(x) g^(3)(x). (37)
At the global minimum, where g(x̄) = y, this gives f''(x̄) = 2[g'(x̄)]² and f^(3)(x̄) = 6 g'(x̄) g''(x̄). If y is not a trivial value for g, meaning g'(x̄) ≠ 0 at the minimum, and g is not linear around the minimum (so g''(x̄) ≠ 0), then f satisfies f^(3)(x̄) ≠ 0 as in Theorem 1. Meanwhile, the condition 3[f^(3)]² − f'' f^(4) > 0 of Theorem 1 becomes
3 · 36 [g'(x̄)]² [g''(x̄)]² − 2[g'(x̄)]² (6[g''(x̄)]² + 8 g'(x̄) g^(3)(x̄)) > 0 (36)
⟺ 6[g''(x̄)]² > g'(x̄) g^(3)(x̄). (37)
The remaining case is g''(x̄) = 0 with g'(x̄) ≠ 0 at the minimum: then f satisfies the condition for Lemma 1 with k = 4, because f^(3)(x̄) = 0 and f^(4)(x̄) < 0 by (35, 37). Corollary 2. f(x) = (x² − 1)² allows stable oscillation around the local minimum x̄ = 1. Proof. With g(x) = x², we have g'(1) = 2 ≠ 0 and g''(1) = 2 ≠ 0, and all higher-order derivatives of g are zero. Then Lemma 2 gives the result. Corollary 3. f(x) = (sin(x) − y)² allows stable oscillation around the local minimum x̄ = arcsin(y) for y ∈ (−1, 1). Proof. With g(x) = sin(x), we have g'(x̄) = cos(x̄) ≠ 0 and g^(3)(x̄) = −cos(x̄), so
g' g^(3) − 6[g'']² = −cos²(x̄) − 6 sin²(x̄) < 0.
Then Lemma 2 gives the result. Corollary 4. f(x) = (tanh(x) − y)² allows stable oscillation around the local minimum x̄ = tanh^{−1}(y) for y ∈ (−1, 1). Proof.
With g(x) = tanh(x), we have g'(x) = sech²(x) ≠ 0 and g^(3)(x) = −2 sech⁴(x) + 4 sech²(x) tanh²(x), so
g' g^(3) − 6[g'']² = −2 sech⁶ + 4 sech⁴ tanh² − 24 sech⁴ tanh² = −2 sech⁶ − 20 sech⁴ tanh² < 0.
Then Lemma 2 gives the result. Corollary 5. f(x) = (x^α − y)² (with α ∈ Z, α ≥ 2) allows stable oscillation around the local minimum x̄ = y^{1/α}, except for y = 0. Proof. With g(x) = x^α, we have g'(x) = α x^{α−1}, g''(x) = α(α−1) x^{α−2}, g^(3)(x) = α(α−1)(α−2) x^{α−3}. Then
g' g^(3) − 6[g'']² = α²(α−1)(−5α+4) x^{2α−4} < 0.
Then Lemma 2 gives the result. Corollary 6. f(x) = (exp(x) − y)² allows stable oscillation around the local minimum x̄ = log y for y > 0. Proof. With g(x) = exp(x), we have g'(x) = g''(x) = g^(3)(x) = exp(x), so g' g^(3) − 6[g'']² = −5 exp(2x) < 0. Then Lemma 2 gives the result. Corollary 7. f(x) = (log(x) − y)² allows stable oscillation around the local minimum x̄ = exp(y). Proof. With g(x) = log x, we have g'(x) = 1/x, g''(x) = −1/x², g^(3)(x) = 2/x³. Then g' g^(3) − 6[g'']² = 2/x⁴ − 6/x⁴ = −4/x⁴ < 0. Then Lemma 2 gives the result. Corollary 8. f(x) = (1/(1+exp(−x)) − y)² allows stable oscillation around the local minimum x̄ = sigmoid^{−1}(y) for y ∈ (0, 1). Proof. With g(x) = 1/(1+exp(−x)), we have g'(x) = exp(−x)/(exp(−x)+1)², g''(x) = −exp(x)(exp(x)−1)/(exp(x)+1)³, g^(3)(x) = exp(x)(exp(2x) − 4 exp(x) + 1)/(exp(x)+1)⁴. Then
g' g^(3) − 6[g'']² ∝ exp(2x) − 4 exp(x) + 1 − 6(exp(x) − 1)² < 0.
Then Lemma 2 gives the result. Proposition 4 (Restatement of Prop 1). Consider two functions f, g, and assume that f at x = x̄ and g at y = f(x̄) both satisfy the conditions in Lemma 2 to allow stable oscillation. Then g(f(x)) allows stable oscillation around x = x̄. Proof. Denote F(x) ≜ g(f(x)). Then we have
F'(x) = g'(f(x)) f'(x),
F''(x) = g''(f(x)) [f'(x)]² + g'(f(x)) f''(x),
F^(3)(x) = g^(3)(f(x)) [f'(x)]³ + 3 g''(f(x)) f'(x) f''(x) + g'(f(x)) f^(3)(x).
Thus, omitting the arguments x and f(x) in the derivatives, it holds that
F' F^(3) − 6[F'']² = g' f' (g^(3) (f')³ + 3 g'' f' f'' + g' f^(3)) − 6(g''(f')² + g' f'')²
= [g' g^(3) − 6(g'')²](f')⁴ + (g')²[f' f^(3) − 6(f'')²] − 9 g' g'' (f')² f''
≤ −9 g' g'' (f')² f'',
where the inequality uses the conditions in Lemma 2 for f and g. So the only question is whether we can achieve g' g'' f'' > 0. The good news is that, even if g' g'' f'' < 0, we can re-represent g(f(x)) as ĝ(f̂(x)) such that ĝ' ĝ'' f̂'' > 0 and all other conditions in Lemma 2 are satisfied by ĝ, f̂: construct ĝ(y) ≜ g(−y) and f̂(x) ≜ −f(x), so that ĝ(f̂(x)) = g(f(x)). It is easy to verify that both ĝ at y = −f(x̄) and f̂ at x = x̄ satisfy the conditions in Lemma 2, because
ĝ'(y) = −g'(−y), ĝ''(y) = g''(−y), ĝ^(3)(y) = −g^(3)(−y), f̂'(x) = −f'(x), f̂''(x) = −f''(x), f̂^(3)(x) = −f^(3)(x).
Then ĝ'(y) ĝ''(y) f̂''(x) = −g' g'' f'' > 0 at y = −f(x̄), x = x̄. Therefore, we have F' F^(3) − 6[F'']² < 0 and Lemma 2 gives the result.
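The sign condition g' g^(3) − 6[g'']² < 0 used in Corollaries 3-8 can be spot-checked numerically with the closed-form derivatives (the sample grid and the set of functions are illustrative):

```python
import numpy as np

# Spot-check of Lemma 2's condition g' g''' - 6 (g'')^2 < 0 for several link functions.
xs = np.linspace(-2.0, 2.0, 401)

def margin(g1, g2, g3):
    return g1 * g3 - 6.0 * g2**2      # must be strictly negative

xp = xs[xs > 0.1]                     # positive grid for log
checks = {
    "tanh":    margin(1 / np.cosh(xs)**2,
                      -2 * np.tanh(xs) / np.cosh(xs)**2,
                      -2 / np.cosh(xs)**4 + 4 * np.tanh(xs)**2 / np.cosh(xs)**2),
    "exp":     margin(np.exp(xs), np.exp(xs), np.exp(xs)),
    "log":     margin(1 / xp, -1 / xp**2, 2 / xp**3),
    "sigmoid": margin(np.exp(-xs) / (1 + np.exp(-xs))**2,
                      np.exp(xs) * (1 - np.exp(xs)) / (1 + np.exp(xs))**3,
                      np.exp(xs) * (np.exp(2*xs) - 4*np.exp(xs) + 1) / (1 + np.exp(xs))**4),
}
for name, m in checks.items():
    print(name, m.max() < 0)
```

All margins are negative over the grid, consistent with the closed-form arguments in the corollaries.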

F PROOF OF THEOREM 2

Theorem 9 (Restatement of Theorem 2). For f(x) = ¼(x² − µ)², consider GD with η = K · (1/µ), where 1 < K < √4.5 − 1 ≈ 1.121, initialized at any point 0 < x_0 < √µ. Then GD converges to an orbit of period 2, except for a measure-zero set of initializations from which it converges to √µ. More precisely, the period-2 orbit consists of the solutions x = δ_1 ∈ (0, √µ) and x = δ_2 ∈ (√µ, 2√µ) of
η = 1 / [δ² (√(µ/δ² − 3/4) + 1/2)].
Proof. Assume the period-2 orbit is (x̄_0, x̄_1), which means
x̄_1 = x̄_0 − η f'(x̄_0) = x̄_0 + η(µ − x̄_0²) x̄_0, x̄_0 = x̄_1 − η f'(x̄_1) = x̄_1 + η(µ − x̄_1²) x̄_1.
First, we show the existence and uniqueness of such an orbit when K ∈ (1, 1.5] by solving a higher-order equation, some roots of which can be eliminated. Then, we conduct a global convergence analysis by defining a special interval I: GD starting from any admissible point enters I within finitely many steps, any point of I returns to I after two steps of iteration, and any point of I converges to the orbit (x̄_0, x̄_1). Before diving into the proof, we briefly show that x > 0 always holds under our assumption. If x_{t−1} > 0 and x_t ≤ 0, the GD rule gives η(µ − x_{t−1}²) ≤ −1, which implies x_{t−1}² ≥ µ + 1/η. However, the maximum of x + η(µ − x²)x over x ∈ (0, √(µ + 1/η)) is attained at x² = (1/3)(µ + 1/η), where the value is √((1/3)(µ + 1/η)) · (2/3)(1 + ηµ) ≤ 1.42 · √((1/3)(µ + 1/η)) < √(µ + 1/η). As a result, x > 0 always holds. Part I. Existence and uniqueness of (x̄_0, x̄_1). In this part we simply write x_0 for both x̄_0 and x̄_1; every formula below can be read with either value. The GD update rule gives, for the orbit over two steps,
x_0 → x_1 := x_0 + η(µ − x_0²) x_0, x_1 → x_0 = x_1 + η(µ − x_1²) x_1,
which means
0 = η(µ − x_0²) x_0 + η[µ − (x_0 + η(µ − x_0²) x_0)²] (x_0 + η(µ − x_0²) x_0),
0 = (µ − x_0²) + [µ − (x_0 + η(µ − x_0²) x_0)²] (1 + η(µ − x_0²)).
Denote z := 1 + η(µ − x_0²); the above is equivalent to
0 = (µ − x_0²) + (µ − z² x_0²) z = (z + 1)(−x_0² z² + x_0² z + µ − x_0²) = (z + 1)[−x_0²(z − 1/2)² + µ − (3/4) x_0²].
If z + 1 = 0, then x_1 = −x_0, which is out of the range of our discussion on the domain x > 0. So we require −x_0²(z − 1/2)² + µ − (3/4)x_0² = 0. For solutions z to exist, we naturally require µ − (3/4)x_0² ≥ 0, and the solutions are
z = 1/2 ± √(µ/x_0² − 3/4).
However, z = 1/2 − √(µ/x_0² − 3/4) can be ruled out: if it held, then η(µ − x_0²) = z − 1 < −1/2, which means x_0² > µ + 1/(2η). Since we restrict ηµ ∈ (1, 1.121], this gives x_0² > µ(1 + 1/2.242), contradicting µ ≥ (3/4)x_0². Hence z = 1/2 + √(µ/x_0² − 3/4) is the only admissible solution, i.e.
η(µ − x_0²) = −1/2 + √(µ/x_0² − 3/4).
For a given η, this is a third-order equation in x_0². Clearly x_0² = µ is one trivial solution, since for any learning rate GD stays at the global minimum. The two other solutions are exactly the orbit (x̄_0, x̄_1), provided the equation has three distinct roots; this also guarantees the uniqueness of the orbit. Assuming x_0² ≠ µ, the equation can be reformulated as
η = 1 / [x_0² (√(µ/x_0² − 3/4) + 1/2)]. (39)
One necessary condition for existence is µ ≥ (3/4)x_0². Note that x_0 here stands for both x̄_0 and x̄_1, one of which is larger than √µ; for simplicity, assume x̄_0 < √µ < x̄_1. Since η in Eq. (39) is increasing in x_0² when x_0² > µ, setting x_0² = (4/3)µ yields the upper bound ηµ ≤ 3/2, which is satisfied by our assumption 1 < ηµ < √4.5 − 1 ≈ 1.121. Therefore, the period-2 orbit exists and is unique. Part II. Global convergence to (x̄_0, x̄_1). The proof structure is as follows: 1. There exists a special interval I := [x_s, √µ) such that any point of I returns to I after two steps of gradient descent, and x̄_0 ∈ I. 2.
Initialized from any point of I, GD converges to x̄_0 (along every two steps). 3. Initialized from any point between 0 and √µ, GD falls into I within finitely many steps.
(II.1) Consider the function F_η(x) = x + η(µ − x²)x performing one step of gradient descent. Since F_η'(x) = 1 + ηµ − 3ηx², we have F_η'(x) > 0 for 0 < x² < (1/3)(µ + 1/η) and F_η'(x) < 0 otherwise. The threshold x_s² := (1/3)(µ + 1/η) clearly satisfies x_s² < µ; in other words, to the right of x_s, F_η is decreasing. To proceed, we restrict x̄_0 ≥ x_s, i.e.
x̄_0² ≥ (1/3)(µ + 1/η) = (1/3)[µ + x̄_0²(√(µ/x̄_0² − 3/4) + 1/2)].
Solving this inequality gives x̄_0² ≥ ((3 + √2)/7) µ, and consequently, by Eq. (39), ηµ ≤ √4.5 − 1 ≈ 1.121. (42)
With x_s in hand, we define the special interval I := [x_s, √µ). From the definition of F_η, consider the two-step map F_η²(x) := F_η(F_η(x)). From the previous discussion, we know F_η²(x̄_0) = x̄_0. What about F_η²(x_s)? It turns out that F_η²(x_s) > x_s: we have F_η(x_s) = x_s(1 + ηµ − ηx_s²) = x_s · (2/3)(1 + ηµ), and furthermore F_η²(x_s) = F_η(x_s · (2/3)(1 + ηµ)) = x_s · (2/3)(1 + ηµ) · [1 + ηµ − (4/27)(1 + ηµ)³]. Then F_η²(x_s) > x_s because
(2/3)(1 + ηµ) · [1 + ηµ − (4/27)(1 + ηµ)³] > 1 for ηµ ∈ (1, √4.5 − 1). (44)
Combining the facts that i) F_η²(x) − x is continuous w.r.t. x, ii) F_η²(x_s) − x_s > 0, and iii) x = x̄_0 is the only zero of F_η²(x) − x on [x_s, x̄_0], we conclude that F_η²(x) > x for all x ∈ [x_s, x̄_0). Meanwhile, we can prove F_η²(x) < x for any x ∈ (x̄_0, √µ). Since x = √µ and x = x̄_0 are the only two zeros of F_η²(x) − x there, we only need to show that there exists x ∈ (x̄_0, √µ) such that F_η²(x) < x.
We compute the derivative of F_η²(x) − x at x² = µ:
d/dx [F_η²(x) − x] |_{x²=µ} = F_η'(F_η(x)) F_η'(x) |_{x²=µ} − 1 = [F_η'(√µ)]² − 1 = (1 − 2ηµ)² − 1 > 0.
Combining this with F_η²(x̄_0) = x̄_0, there exists a point x ∈ (x̄_0, √µ) very close to √µ such that F_η²(x) < x. Hence we conclude that
F_η²(x) < x for all x ∈ (x̄_0, √µ). (46)
Since F_η(·) is decreasing on [x_s, ∞) and F_η(x) > x_s for x ∈ [x_s, √µ], the map F_η² is increasing on [x_s, √µ]. Hence F_η²(x) ≤ F_η²(x̄_0) = x̄_0 for all x ∈ [x_s, x̄_0], and F_η²(x) ≥ F_η²(x̄_0) = x̄_0 for all x ∈ (x̄_0, √µ). Combining the above results, we have
F_η²(x) ∈ (x, x̄_0] for all x ∈ [x_s, x̄_0), and F_η²(x) ∈ [x̄_0, x) for all x ∈ (x̄_0, √µ). (47)
(II.2) A consequence of (46, 47) is that any point of I converges to x̄_0 along even steps of gradient descent. For simplicity, we give the proof for x ∈ [x_s, x̄_0). Let a_0 ∈ [x_s, x̄_0) and a_n := F_η²(a_{n−1}), n ≥ 1. The sequence {a_i}_{i≥0} satisfies x̄_0 ≥ a_{n+1} > a_n > a_0. Being bounded and strictly increasing, it converges; call the limit a. If a < x̄_0, then x̄_0 ≥ F_η²(a) > a > F_η²(a_n). Since F_η² is continuous, there exists δ > 0 such that |x − a| < δ implies
|F_η²(x) − F_η²(a)| < F_η²(a) − a. (49)
Since a is the limit, there exists N > 0 such that n > N implies 0 < a − a_n < δ. Combining with (49), |F_η²(a_n) − F_η²(a)| < F_η²(a) − a. But the left-hand side equals F_η²(a) − a_{n+1} > F_η²(a) − a, a contradiction. Hence {a_i} converges to x̄_0.
(II.3) Clearly, from any initialization in (0, √µ), gradient descent runs into either (i) the interval I, or (ii) the interval to the right of √µ, i.e. (√µ, ∞). The first case is exactly our target; now consider the second. From the definition of x_s in Part II.1, we know F_η(x_s) = max_{x∈[0,√µ]} F_η(x), so this case means x_n ∈ (√µ, F_η(x_s)].
Then the next step returns to the interval I, because F_η(x_n) ≥ F_η(F_η(x_s)) = F_η²(x_s) > x_s, where the first inequality follows from the decreasing property of F_η(·) and the second inequality is due to F_η²(x) > x on x ∈ [x_s, x_0).
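The two-step convergence just proven is easy to observe numerically. The following sketch is illustrative only (µ and K = ηµ are arbitrary choices satisfying ηµ ∈ (1, √4.5 − 1)); it iterates the map F_η, i.e., GD on the quartic loss f(x) = (x² − µ)²/4:

```python
mu = 1.0
K = 1.05                  # eta * mu, inside (1, sqrt(4.5) - 1) ~ (1, 1.121)
eta = K / mu

def F(x):
    """One GD step on f(x) = (x^2 - mu)^2 / 4, i.e., x -> x + eta*(mu - x^2)*x."""
    return x + eta * (mu - x * x) * x

for x0 in [0.05, 0.3, 0.6, 0.95]:     # initializations in (0, sqrt(mu))
    x = x0
    for _ in range(4000):
        x = F(x)
    # The even-step subsequence has converged: x is a fixed point of F composed with F ...
    assert abs(F(F(x)) - x) < 1e-9
    # ... but not of F itself: GD keeps oscillating around sqrt(mu), beyond the EoS.
    assert abs(F(x) - x) > 1e-3
```

Every start in (0, √µ) settles into the same stable period-2 orbit {x_0, F_η(x_0)} straddling √µ, matching parts (II.1)–(II.3).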

G PROOF OF THEOREM 5

Theorem 10 (Restatement of Theorem 5). For f(x, y) = (1/2)(xy − µ)², consider GD with learning rate η = K · (1/µ). Assume both x and y stay positive during the whole process {x_i, y_i}_{i≥0}. In this process, denote the subsequence of all points with xy > µ as P = {(x_i, y_i) | x_i y_i > µ}. Then |x − y| decays to 0 along P, for any 1 < K < 1.5.

Proof. Suppose the current step is at (x_t, y_t) with x_t y_t > µ. After two steps of gradient descent, we have
x_{t+1} = x_t + η(µ − x_t y_t) y_t, (50)
y_{t+1} = y_t + η(µ − x_t y_t) x_t, (51)
x_{t+2} = x_{t+1} + η(µ − x_{t+1} y_{t+1}) y_{t+1}, (52)
y_{t+2} = y_{t+1} + η(µ − x_{t+1} y_{t+1}) x_{t+1}, (53)
with which the difference evolves as
y_{t+1} − x_{t+1} = (y_t − x_t)(1 − η(µ − x_t y_t)), (54)
y_{t+2} − x_{t+2} = (y_{t+1} − x_{t+1})(1 − η(µ − x_{t+1} y_{t+1})). (55)
Meanwhile, we have
x_{t+1} y_{t+1} = x_t y_t + η(µ − x_t y_t)(x_t² + y_t²) + η²(µ − x_t y_t)² x_t y_t = x_t y_t (1 + η(µ − x_t y_t))² + η(µ − x_t y_t)(x_t − y_t)². (56)
Note that the second term in Eq (56) vanishes when x and y are balanced. When they are not balanced and x_t y_t > µ, it holds that x_{t+1} y_{t+1} < x_t y_t (1 + η(µ − x_t y_t))². Incorporating this inequality into Eqs (54, 55) and assuming y_t − x_t > 0, it holds that
y_{t+2} − x_{t+2} < (y_t − x_t)(1 − η(µ − x_t y_t)) [1 − η(µ − x_t y_t (1 + η(µ − x_t y_t))²)]. (57)
To show that |x − y| decays as in the theorem, we are to show:
1. y_{t+2} − x_{t+2} < y_t − x_t;
2. y_{t+2} − x_{t+2} > −(y_t − x_t).
Note that, although x_t y_t > µ, it is not guaranteed that x_{t+2} y_{t+2} > µ. However, for any 0 < x_i y_i < µ and K < 2, we have |x_{i+1} − y_{i+1}|/|x_i − y_i| = |1 − η(µ − x_i y_i)| < 1, which is saying |x − y| keeps decaying until the process reaches xy > µ again. So it is enough to prove the above two inequalities, whether or not x_{t+2} y_{t+2} > µ.
Part I. To show y_{t+2} − x_{t+2} < y_t − x_t. Since we wish to have y_{t+2} − x_{t+2} < y_t − x_t, by (57) it is sufficient to require
(1 − η(µ − x_t y_t)) [1 − η(µ − x_t y_t (1 + η(µ − x_t y_t))²)] < 1. (59)
Since we assume x_{t+1}, y_{t+1} > 0, Eqs (50, 51) give η(µ − x_t y_t) > −min{x_t/y_t, y_t/x_t}, which is equivalent to 1 − η(µ − x_t y_t) < 1 + min{x_t/y_t, y_t/x_t}.

(I.1) If η(µ − x_{t+1} y_{t+1}) ≥ 1/2: then we have 1 − η(µ − x_{t+1} y_{t+1}) ≤ 1/2. As a result,
(y_{t+2} − x_{t+2})/(y_t − x_t) = (1 − η(µ − x_t y_t))(1 − η(µ − x_{t+1} y_{t+1})) < (1 + min{x_t/y_t, y_t/x_t}) × 1/2 = 1/2 + (1/2) min{x_t/y_t, y_t/x_t} < 1. (60, 61)

(I.2) If η(µ − x_{t+1} y_{t+1}) < 1/2 and x_{t+1} y_{t+1} ≤ x_s² = (1/3)(µ + 1/η): the second condition reveals
(y_{t+2} − x_{t+2})/(y_{t+1} − x_{t+1}) = 1 − η(µ − x_{t+1} y_{t+1}) ≤ 1 − η(µ − (1/3)(µ + 1/η)) = 4/3 − (2/3)K. (62)
The first condition is equivalent to x_{t+1} y_{t+1} > µ − 1/(2η). Since the second term in Eq (56) is negative, we have
x_t y_t (1 + η(µ − x_t y_t))² > µ − 1/(2η), (63)
with which we would like to find an upper bound on x_t y_t. Denoting b = x_t y_t, consider the function q(b) = b(1 + η(µ − b))². Obviously q(µ) = µ. Its derivative is q'(b) = (1 + ηµ − ηb)(1 + ηµ − 3ηb) < 0 on the domain of our interest. If we can show the (negative) upper bound q'(b) < −1 on a proper domain, then it is fair to conclude from Exp (63) that x_t y_t < µ + 1/(2η). Then we have
(y_{t+1} − x_{t+1})/(y_t − x_t) = 1 − η(µ − x_t y_t) < 1 − η(µ − (µ + 1/(2η))) = 3/2. (64)
Then, combining Exps (64, 62), it tells
(y_{t+2} − x_{t+2})/(y_t − x_t) < 2 − K. (65)
The remaining step is to show q'(b) < −1 on the relevant domain [µ, µ + 1/(2η)]. Expanding, q'(b) = (1 + ηµ − 2ηb)² − (ηb)² = 3η²b² − 4η(1 + ηµ)b + (1 + ηµ)² is a convex quadratic in b, so its maximum over the interval is attained at an endpoint; we have q'(µ) = 1 − 2ηµ < −1 and q'(µ + 1/(2η)) = (1/2)(−2ηµ − 1/2) = −ηµ − 1/4 < −1. As a result, it always holds that q'(b) < −1 for b ∈ [µ, µ + 1/(2η)].

(I.3) If x_{t+1} y_{t+1} ≥ x_s²: denoting again b = x_t y_t, the above requirement says, with b > µ,
p(b) := (1 − η(µ − b)) [1 − η(µ − b(1 + η(µ − b))²)] < 1.
After expanding p(·), we have
p(b) − 1 = η(µ − b) [−2 + η(µ − b) + 2ηb − η²b(µ − b) − η³b(µ − b)²].
Apparently p(µ) = 1.
So it is necessary to investigate whether p'(b) < 0 for b > µ. We compute
p'(b)/η = 2 − 2ηb + (µ − b)[η²(1 + η(µ − b))(3b − µ) + η³b(µ − b)].
Since ηb > 1 and b > µ, it is enough to require
(1 + η(µ − b))(3b − µ) + ηb(µ − b) > 0, i.e., (1 + η(µ − b))(b − µ) + ηb(µ − b) + 2b(1 + η(µ − b)) > 0.
Dividing by b, it suffices to show
η(µ − b) + 2(1 + η(µ − b)) = 2 + 3η(µ − b) > 0. (67)
Since x_{t+1} y_{t+1} ≥ x_s² = (1/3)(µ + 1/η), it holds that b(1 + η(µ − b))² ≥ (1/3)(µ + 1/η), and hence
2 + 3η(µ − b) ≥ 3(µ + 1/η)/b − 1 > 0,
where the last inequality holds because: if b ≥ 3(µ + 1/η), then 1 + η(µ − b) ≤ −2ηµ − 2 < 0, which contradicts the assumption that both x_{t+1} and y_{t+1} are positive. As a result, the above argument gives
(y_{t+2} − x_{t+2})/(y_t − x_t) < p(b) < 1 − 2(K − 1)(b − µ). (68)

Part II. To show y_{t+2} − x_{t+2} > −(y_t − x_t). Since x_t y_t > µ, we have 1 − η(µ − x_t y_t) > 1. Combining this with 1 − η(µ − x_t y_t) < 1 + min{x_t/y_t, y_t/x_t} ≤ 2, it holds that
(y_{t+1} − x_{t+1})/(y_t − x_t) = 1 − η(µ − x_t y_t) ∈ (1, 2).
So it remains to have (y_{t+2} − x_{t+2})/(y_{t+1} − x_{t+1}) > −0.5. Indeed, 1 − η(µ − x_{t+1} y_{t+1}) ≥ 1 − ηµ = 1 − K > −0.5. Therefore, we have
(y_{t+2} − x_{t+2})/(y_t − x_t) > 2(1 − K) = −1 + (3 − 2K) > −1, (69)
as required.

Part III. To show y_t − x_t converges to 0. From Exps (61, 65, 68, 69), along the points in P, |y − x| is a monotone strictly decreasing sequence lower bounded by 0; hence it is convergent. In fact it converges to 0: if not, suppose it converges to some ℓ > 0. Since the contraction factors in Exps (61, 65, 68) are bounded away from 1, the next point would have difference ℓ̃ < ℓ, as would all following points, contradicting ℓ being the limit. Hence the difference converges to 0.
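Theorem 5 can be checked numerically. The sketch below is an illustration (µ, K, and the mildly imbalanced start are arbitrary admissible choices, not the paper's experiment): it runs GD on f(x, y) = ½(xy − µ)² and records |x − y| on the subsequence P where xy > µ.

```python
mu, K = 1.0, 1.05
eta = K / mu
x, y = 1.15, 1.0                      # x0 * y0 = 1.15 > mu, both positive

gaps = []                             # |x - y| recorded on P = {xy > mu}
for _ in range(3000):
    if x * y > mu:
        gaps.append(abs(x - y))
    g = eta * (mu - x * y)
    x, y = x + g * y, y + g * x       # one simultaneous GD step on 0.5*(xy - mu)^2
    assert x > 0 and y > 0            # positivity, as assumed in Theorem 5

assert gaps[-1] < 1e-6                # |x - y| decays to 0 along P ...
assert abs(x * y - mu) > 1e-3         # ... while xy keeps oscillating around mu
```

The imbalance |x − y| vanishes even though the product xy never converges: the iterates approach the symmetric period-2 orbit, which is the behavior the theorem formalizes.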

H PROOF OF LEMMA 3

Lemma 6 (Restatement of Lemma 3). In the setting of Theorem 5, denote the initialization imbalance as m = |y_0 − x_0|/√µ with x_0 y_0 > µ. Then, during the whole process, both x and y always stay positive, provided that, denoting p = 4/(m + √(m² + 4))² and q = (1 + p)²,
max{ η(x_0 y_0 − µ), (4/27)(1 + K)³ + [(2/3)K² − (1/3)K + qK²m²/(2(K + 1))] q m² − K } < p.

Proof. Considering x_t y_t > µ, one step of gradient descent returns
x_{t+1} = x_t + η(µ − x_t y_t) y_t, y_{t+1} = y_t + η(µ − x_t y_t) x_t.
To have both x_{t+1} > 0 and y_{t+1} > 0, it suffices to have
η(x_t y_t − µ) < min{y_t/x_t, x_t/y_t}. (71)
This inequality is the main target we need to resolve in this proof. First, we are to show
min{y_0/x_0, x_0/y_0} > 4/(m + √(m² + 4))².
With the difference fixed as m = (y_0 − x_0)/√µ and assuming y_0 > x_0, we have m/y_0 = (1 − x_0/y_0)/√µ. If x_0 y_0 increases (with m fixed), both x_0 and y_0 increase, so m/y_0 decreases, which means x_0/y_0 increases. As a result, we have
min{y_0/x_0, x_0/y_0} > min{y_0/x_0, x_0/y_0}|_{x_0 y_0 = µ} = 4/(m + √(m² + 4))² =: r.
Therefore, at initialization, to have positive x_1 and y_1, it is enough to require η(x_0 y_0 − µ) < r.
From Theorem 5, it is guaranteed that |x_t − y_t| < |x_0 − y_0| for t ≥ 2 until the process next reaches x_t y_t > µ, with which r is still a valid lower bound for min{y_t/x_t, x_t/y_t}. So what remains is to show that η(x_t y_t − µ) < r at the next first time with x_t y_t > µ. If this holds, we can iteratively show, for any x_t y_t > µ along gradient descent,
η(x_t y_t − µ) < r < min{y_t/x_t, x_t/y_t}.
Note that r itself is independent of x_t y_t and of all the history, so it is ideal to compute a uniform upper bound on η(x_t y_t − µ) over all pairs (x_{t−1}, y_{t−1}) satisfying x_{t−1} y_{t−1} < µ. This is indeed possible, since |x_{t−1} − y_{t−1}| is bounded as in Theorem 5. Assume x_i y_i > µ satisfies the condition η(x_i y_i − µ) < r and |x_i − y_i| < |x_0 − y_0|. As in (54), we have
(x_{i+1} − y_{i+1})/(x_i − y_i) = 1 − η(µ − x_i y_i) ∈ (1, 1 + r).
Hence, it suffices to get the maximum value of g(z) for z ∈ (0, µ), where
g(z) = z(1 + η(µ − z))² + η(µ − z)(1 + r)²(x_0 − y_0)², (72)
which follows from (56). Denote z* = argmax_z g(z). Obviously z* < (1/3)(µ + 1/η) =: z_b, because the first term of g(z) achieves its maximum at z = (1/3)(µ + 1/η) and the second term is decreasing in z. Then let us take the derivative of g(z):
g'(z) = (1 + η(µ − z))(1 + ηµ − 3ηz) − η(1 + r)²(x_0 − y_0)²
 = (1 + η(µ − z)) [1 + ηµ − 3ηz − η(1 + r)²(x_0 − y_0)²/(1 + η(µ − z))],
where the first factor is always positive, so at z = z* we have
1 + ηµ − 3ηz* − η(1 + r)²(x_0 − y_0)²/(1 + η(µ − z*)) = 0,
which means
z* = (1/(3η)) [1 + ηµ − η(1 + r)²(x_0 − y_0)²/(1 + η(µ − z*))] (74)
 > (1/(3η)) [1 + ηµ − η(1 + r)²(x_0 − y_0)²/(1 + η(µ − (1/3)(µ + 1/η)))] (75)
 = (1/3)(µ + 1/η) − (1 + r)²(x_0 − y_0)²/(2(ηµ + 1)) =: z_s, (76, 77)
where the inequality is from z* < (1/3)(µ + 1/η). As a result, it is safe to say
g(z*) ≤ z(1 + η(µ − z))² |_{z=z_b} + η(µ − z)(1 + r)²(x_0 − y_0)² |_{z=z_s} (78)
 = (4/27)(1 + ηµ)³ · (1/η) + η(1 + r)² [(2/3)µ − 1/(3η) + (1 + r)²(x_0 − y_0)²/(2(ηµ + 1))] (x_0 − y_0)², (79)
with which we are able to compute max η(g(z*) − µ), which is exactly the final result.
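The bound (79) on max_z g(z) can be sanity-checked numerically. In the sketch below, µ, K, and m are arbitrary example values; r and g(z) follow Eq (72), and `bound` is the right-hand side of Eq (79):

```python
import numpy as np

mu, K = 1.0, 1.2
eta = K / mu
m = 0.1                                   # imbalance |y0 - x0| / sqrt(mu)
r = 4.0 / (m + np.sqrt(m * m + 4.0))**2   # lower bound on min{y/x, x/y}
D2 = (m * np.sqrt(mu))**2                 # (x0 - y0)^2

def g(z):                                 # worst-case next product, cf. Eqs (56), (72)
    return z * (1 + eta * (mu - z))**2 + eta * (mu - z) * (1 + r)**2 * D2

zs = np.linspace(1e-9, mu, 200001)
g_max = g(zs).max()
bound = (4 / 27) * (1 + eta * mu)**3 / eta \
      + eta * (1 + r)**2 * ((2 / 3) * mu - 1 / (3 * eta)
                            + (1 + r)**2 * D2 / (2 * (eta * mu + 1))) * D2
assert g_max <= bound                     # Eq (79) indeed upper-bounds max_z g(z)
assert eta * (g_max - mu) < r             # the lemma's positivity condition holds here
```

The two evaluation points z_b and z_s each bound one term of g from above, so `bound` dominates the grid maximum; for these example values the lemma's condition η(g(z*) − µ) < p = r is comfortably satisfied.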

I PROOF OF THEOREM 3

Theorem 11 (Restatement of Theorem 3). In the above setting, consider a teacher neuron w* = [1, 0] and set the learning rate η = Kd with K ∈ (1, 1.1]. Initialize the student with ‖w^(0)‖ = v^(0) ∈ (0, 0.10] and ⟨w^(0), w*⟩ ≥ 0. Then, for t ≥ T_1 + 4, w_y^(t) decays as
w_y^(t) < 0.1 · (1 − 0.030K)^{t − T_1 − 4}, T_1 ≤ log_{2.56}(135/(πβ²)), β = 1 + 1.1/π.

Proof. We restate the update rules as
∆v^(t) := v^(t+1) − v^(t) = K w_x^(t) [(1 − v^(t) w_x^(t)) − (1/π)(arctan(w_y^(t)/w_x^(t)) − w_x^(t) w_y^(t)/‖w^(t)‖²)] + K ((w_y^(t))²/v^(t)) (−(v^(t))² + v^(t) w_y^(t)/(π‖w^(t)‖²)), (80)
∆w_x^(t) := w_x^(t+1) − w_x^(t) = K v^(t) [(1 − v^(t) w_x^(t)) − (1/π)(arctan(w_y^(t)/w_x^(t)) − w_x^(t) w_y^(t)/‖w^(t)‖²)], (81)
∆w_y^(t) = w_y^(t) · K (−(v^(t))² + v^(t) w_y^(t)/(π‖w^(t)‖²)), (82)
w_y^(t+1) = w_y^(t) + ∆w_y^(t). (83)
For simplicity, we will omit all superscripts of time t unless clarification is necessary. From (83), if the target is to show w_y decaying at a linear rate, it suffices to bound the factor in (82) (by a considerable margin) as
−2 < K(−v² + v w_y/(π‖w‖²)) < 0. (84)
The technical part is the second inequality of (84). If v, w_x > 0, it is equivalent to
v w_x > w_x w_y/(π‖w‖²) = w_x w_y/(π(w_x² + w_y²)),
where the RHS is smaller than or equal to 1/(2π). Hence, 1/(2π) is a special threshold against which we will frequently compare v w_x. Another important variable to control is v − w_x, which reveals how balanced the two layers are: if it is too large, the iterate v^(t+1) w_x^(t+1) may explode, as shown in the 2-D case. The main idea of our proof is:
• Stage 1, with v w_x ≤ w_x w_y/(π‖w‖²): in this stage, w_y grows, but at a smaller rate than v and w_x. Therefore, since we have an upper bound on v w_x while in this stage, we are able to compute an upper bound on the number of iterations needed to finish this stage, which is T_1 in the theorem.
At the end of this stage, both v − w_x and w_y are bounded under our initialization assumption. • Stage 2, with v w_x > w_x w_y/(π‖w‖²): in this stage, w_y decreases. Since our range of large learning rates is relatively narrow (1 < K ≤ 1.1), we are able to compute bounds on v w_x, v − w_x, and w_y. After eight iterations, the process falls into (and stays in) a bounded basin of these three quantities, in which w_y decays at least at a linear rate.

Stage 1.

We are to show that, at the last iteration of this stage, three facts hold: 1) v w_x ≤ 1/(2π); 2) v − w_x ∈ [−0.017, 0.17]; 3) w_y ≤ 0.44.
At initialization, we assume ‖w^(0)‖ = v^(0). Denote α_0 = arctan(w_y^(0)/w_x^(0)) ∈ [0, π/2]. For the next iteration we have
w_y^(1) = v^(0) sin α_0 · [1 + K(−(v^(0))² + (1/π) sin α_0)],
w_x^(1) = v^(0) cos α_0 + K v^(0) [1 − (v^(0))² cos α_0 + (cos α_0 sin α_0 − α_0)/π]. (85)
Apparently w_y^(1) increases as α_0 increases. And
∂_{α_0} w_x^(1) = v^(0) [−sin α_0 + K((v^(0))² sin α_0 + (cos²α_0 − sin²α_0 − 1)/π)] = v^(0) [−sin α_0 + K(((v^(0))² − sin α_0/π) sin α_0 + (−sin²α_0)/π)].
Since in stage 1 it holds that ∆w_y > 0, i.e., −(v^(0))² + (1/π) sin α_0 > 0 in (85), it follows that ∂_{α_0} w_x^(1) ≤ 0. Combining the above arguments, we have
w_x^(1) ≥ w_x^(1)|_{α_0=π/2} = (K/2) v^(0), w_y^(1) ≤ w_y^(1)|_{α_0=π/2} = (1 + K/π − K(v^(0))²) v^(0) ≤ (1 + K/π) v^(0),
w_y^(1)/w_x^(1) ≤ (2 + 2K/π)/K ≤ 2.7.
Regarding v versus w_y: at initialization it holds that v^(0) ≥ w_y^(0), due to v^(0) = ‖w^(0)‖. From (80, 81, 82), we have v∆v = w_x∆w_x + w_y∆w_y, so it holds that v∆v ≥ w_y∆w_y. Meanwhile, ∆w_y/v = K(−v w_y + w_y²/(π‖w‖²)) ∈ [0, K/π]. From Lemma 7, given v^(t) ≥ w_y^(t) and ∆w_y/v ∈ [0, 1] for any t in this stage, it always holds that v^(t+1) ≥ w_y^(t+1). Therefore, it is fair to say
v^(1) w_x^(1)/(w_y^(1))² ≥ 1/2.7.
Additionally, to bound the term v w_y/‖w‖² in ∆w_y, we would like to show that v w_y ≤ ‖w‖² always holds. At initialization, it naturally holds.
Then, for every next iteration, given that it holds at the last iteration, we have
(v + ∆v)(w_y + ∆w_y) − [(w_x + ∆w_x)² + (w_y + ∆w_y)²]
= (v + (w_x∆w_x + w_y∆w_y)/v)(w_y + ∆w_y) − [(w_x + ∆w_x)² + (w_y + ∆w_y)²]
= v w_y + v∆w_y + w_x∆w_x (w_y/v + ∆w_y/v) + (w_y∆w_y + (∆w_y)²)(w_y/v) − [(w_x + ∆w_x)² + (w_y + ∆w_y)²]
≤ v w_y + v∆w_y + w_y∆w_y (w_y/v) − (w_x² + w_y² + 2w_y∆w_y + (∆w_x)²)
≤ v∆w_y + w_y∆w_y (w_y/v) − 2w_y∆w_y − (∆w_x)²
= v∆w_y (1 − w_y/v)² − (∆w_x)² ≤ v∆w_y − (∆w_x)²,
where the first equality uses v∆v = w_x∆w_x + w_y∆w_y, the first inequality uses the proven v ≥ w_y and v ≥ ∆w_y, and the second inequality uses the assumption v w_y ≤ ‖w‖².
Now we are to show v∆w_y − (∆w_x)² ≤ 0. We have
v∆w_y − (∆w_x)² ≤ K v² w_y²/(π‖w‖²) − K² v² (1 − 1/(2π) − γ^(t))², γ^(t) := (1/π)(arctan(w_y^(t)/w_x^(t)) − w_x^(t) w_y^(t)/‖w^(t)‖²).
Since we have proven w_y^(1)/w_x^(1) ≤ 2.7, it is easy to check that
(1/π) · 1/(1 + (w_x^(1)/w_y^(1))²) ≤ (1 − 1/(2π) − γ^(1))².
As a result, v∆w_y − (∆w_x)² ≤ 0 at time 1. Furthermore, by checking each term, v∆w_y − (∆w_x)² decreases as w_y/w_x decreases. We will soon show that w_y/w_x itself decreases, by showing that the growth ratio of w_x is larger than that of w_y. Our target lower bound on the growth ratio of w_x is
∆w_x/w_x ≥ 1 − 1/π − γ, (87)
which is larger than the growth ratio of w_y, bounded by K/π ≤ 1.1/π due to v w_y ≤ ‖w‖². So it suffices to show K v/w_x ≥ 1. Assuming K v^(t)/w_x^(t) ≥ 1 at the current step, we need to show that K v^(t+1)/w_x^(t+1) ≥ 1 also holds at the next step. Let us denote
A^(t) = K [(1 − v^(t) w_x^(t)) − (1/π)(arctan(w_y^(t)/w_x^(t)) − w_x^(t) w_y^(t)/‖w^(t)‖²)], (88)
so that ∆w_x = A v and ∆v ≥ A w_x. Then
(v + ∆v) − (1/K)(w_x + ∆w_x) ≥ v + A w_x − w_x/K − A v/K = (v − w_x/K)(1 − KA) + v(K − 1/K)A. (89)
If KA ≤ 1: since K > 1 and A > 0, we have (89) positive, which is what we need.
If KA > 1: then
(89) ≥ (v − w_x/K)(1 − K²) + v(K − 1/K)A = ((−K + A)v + w_x)(K − 1/K),
where the first inequality is due to A ≤ K and the assumption K v^(t)/w_x^(t) ≥ 1. Then it suffices to show (−K + A)v + w_x ≥ (−K + 1/K)v + w_x ≥ 0. Note that −K + 1/K ∈ (−0.2, 0] when K ∈ (1, 1.1]. It is easy to verify that v^(1) ≤ 5 w_x^(1). Then, for the next step, given v^(t) ≤ 5 w_x^(t), we need to show v^(t+1) ≤ 5 w_x^(t+1). To prove this, we are to bound v − w_x, as
v^(t+1) − w_x^(t+1) = (1 − A)(v − w_x) + K (w_y²/v)(−v² + v w_y/(π‖w‖²)) ≤ 0.4(v − w_x) + K w_y · w_y²/(π‖w‖²) ≤ 0.4(v − w_x) + K w_y/π, (90)
where the first inequality is due to the fact that, when w_y/w_x ≤ 2.7,
A = K(−v w_x + (1/π) w_x w_y/‖w‖²) + K(1 − (1/π) arctan(w_y/w_x)) ≥ K(1 − (1/π) arctan(w_y/w_x)) ≥ 0.6.
We will later show that v^(t+1) − w_x^(t+1) ≥ −0.1(v^(t) − w_x^(t)). Combining this with (90), it is safe to say
v^(t+1) − w_x^(t+1) ≤ 0.4(v − w_x) + K w_y/π ≤ 0.4 × 4 w_x + K × 5 w_x/π ≤ 4 w_x,
where the second inequality is due to v ≤ 5 w_x and v ≥ w_y. Since w_x^(t+1) ≥ w_x^(t) (due to A > 0) in this stage, we have v^(t+1) ≤ 5 w_x^(t+1). Combining the above discussion, we have proven (87). Obviously, when w_y/w_x ≤ 2.7, the RHS of (87) is at least 0.55, larger than 1.1/π, which is the upper bound of ∆w_y/w_y. As a result, w_y/w_x keeps decreasing.
The next step is to show that the growth ratio of v w_x is much larger than that of w_y. From (81, 82), it holds that
v^(t+1) w_x^(t+1) = (v + ∆v)(w_x + ∆w_x) ≥ v w_x + A(v² + w_x²) + A² v w_x ≥ v w_x (1 + A)²,
where the first inequality is due to ∆w_y ≥ 0 (hence δ ≥ 0). It follows that v^(t+1) w_x^(t+1)/(v^(t) w_x^(t)) ≥ 1.6² = 2.56.
So far, we have shown the following facts: under the defined initialization at time 0, starting from time 1, we have
1. v w_x ≤ 1/(2π).
2. ∆w_x/w_x + 1 ≥ 1.55.
3. ∆w_y/w_y + 1 ≤ 1 + K/π.
4. w_y/w_x ≤ 2.7, and it keeps decreasing.
5. v^(t+1) w_x^(t+1)/(v^(t) w_x^(t)) ≥ 2.56.
6. v ≥ w_y.
7. v∆w_y < (∆w_x)².
Now we are to use the above facts to bound v w_x, w_y, and v − w_x up to the end of stage 1. For v w_x, in the previous discussion we have shown that v w_x ≤ 1/(2π). Actually, there is another special value:
w_x w_y/(π(w_x² + w_y²)) = 0.104 when w_y/w_x = 2.7. (91)
This value is slightly larger than 1/(4π). Hence, we would like to split the analysis into three cases, according to the first step of stage 2:
1. v w_x ≥ 1/(2π). 2. 1/(4π) ≤ v w_x < 1/(2π). 3. v w_x < 1/(4π).
Note that, although we are discussing stage 1 in this section, investigating the lower bound at the first step of stage 2 helps calculate the number of iterations in stage 1; furthermore, it helps bound several variables in stage 1.

Case (I). If v w_x ≥ 1/(2π) in the first step of stage 2: since we have proven v^(1) w_x^(1)/(w_y^(1))² ≥ 1/2.7 and v^(t+1) w_x^(t+1)/(v^(t) w_x^(t)) ≥ 2.56, the number of iterations for v w_x to reach 1/(2π) is at most
T_1 ≤ log_{2.56}[(1/(2π))/((w_y^(1))²/2.7)]. (92)
Meanwhile, starting from time 1, the growth ratio of w_y is
(w_y + ∆w_y)/w_y ≤ 1 + K(−v² + 1/π) ≤ 1 + 1.1/π − (v^(1))² ≤ 1 + 1.1/π − (w_y^(1))², (93)
where the first inequality is due to v w_y ≤ ‖w‖², the second is due to 1 < K ≤ 1.1 and v ≥ v^(1), and the third is from v^(1) ≥ w_y^(1). Therefore, combining with (92), we can bound w_y at the end of stage 1 as
w_y ≤ w_y^(1) · [1 + 1.1/π − (w_y^(1))²]^{T_1}, T_1 = log_{2.56}[(1/(2π))/((w_y^(1))²/2.7)].
Since the initialization satisfies ‖w^(0)‖ ≤ 0.1, we have w_y^(1) ≤ 0.1(1 + 1.1/π) = 0.135. Then it can be verified that, for w_y^(1) ∈ (0, 0.135], it holds that
w_y ≤ 0.44. (94)
The next step is to bound v − w_x. Combining the update rules of v and w_x in (80, 81), we have
∆(v − w_x) := (v^(t+1) − w_x^(t+1)) − (v^(t) − w_x^(t)) = K(v − w_x)[v w_x − 1 + (arctan(w_y/w_x) − w_x w_y/‖w‖²)/π] + K (w_y²/v)(−v² + v w_y/(π‖w‖²)).
Note that
−1 ≤ v w_x − 1 + (arctan(w_y/w_x) − w_x w_y/‖w‖²)/π ≤ −1 + arctan(w_y/w_x)/π,
where the left inequality is due to v w_x > 0 and γ ≥ 0, and the right follows from ∆w_y ≥ 0. When w_y/w_x ≤ 2.7, the RHS satisfies −1 + arctan(w_y/w_x)/π ≤ −0.6. So combining both sides tells
1 + K[v w_x − 1 + (arctan(w_y/w_x) − w_x w_y/‖w‖²)/π] ∈ [−K + 1, 0.4] ⊂ [−0.1, 0.4]. (98)
Since ∆w_y ≥ 0, we have 0 ≤ K (w_y²/v)(−v² + v w_y/(π‖w‖²)) ≤ (K/π) w_y · w_y²/‖w‖². Note that at initialization w_x^(0) ≤ v^(0). Then it is easy to verify that
−0.01 ≤ −0.1(v^(0) − w_x^(0)) ≤ v^(1) − w_x^(1) ≤ (1 + K/π − K/2) v^(0) ≤ 0.082.
Because the coefficient on the positive side in (98) is the larger one (0.4 > 0.1), it is appropriate to upper bound v − w_x as
v − w_x ≤ max{0.082, 0.082 · 0.4^T + Σ_{t=1}^{T} 0.4^{t−1} (K/π) w_y^(t) (w_y^(t))²/‖w^(t)‖²}
 ≤ max{0.082, 0.082 · 0.4^T + Σ_{t=1}^{T} 0.4^{t−1} (K/π) w_y^(t) · 1/(1 + (1/2.7)(1.55/(1 + K/π))^{2(t−1)})}
 ≤ max{0.082, 0.082 · 0.4^T + Σ_{t=1}^{T} 0.4^{t−1} (1.1 · 0.44/π) · 1/(1 + (1/2.7)(1.55/(1 + 1.1/π))^{2(t−1)})},
where the second inequality is from the different growth ratios of w_x and w_y. Note that here we range over all T ≥ 1 and pick the largest value of the RHS as the bound. It turns out that
v − w_x ≤ 0.17. (100)
Furthermore, to lower bound v − w_x: since obviously |v − w_x| ≤ 0.17, it follows that v − w_x ≥ −0.1 · |v − w_x|_max ≥ −0.017.
Case (II). If 1/(4π) ≤ v w_x < 1/(2π) in the first step of stage 2: similar to the discussion in Case (I), we can compute the number of iterations for v w_x to reach 1/(4π); it is at most
T_1 ≤ log_{2.56}[(1/(4π))/((w_y^(1))²/2.7)].
Accordingly, w_y is bounded as
w_y ≤ w_y^(1) · [1 + 1.1/π − (w_y^(1))²]^{T_1} ≤ 0.37.
For simplicity, we just keep the bounds for v − w_x from Case (I): v − w_x ∈ [−0.017, 0.17].
Case (III). If v w_x < 1/(4π) in the first step of stage 2: from the condition, we know v w_x < 1/(4π) in the last step of stage 1 as well. Since ∆w_y becomes negative at the first step of stage 2, it tells
(1/π) w_x w_y/‖w‖² < v w_x ≤ 1/(4π), which means max{w_x/w_y, w_y/w_x} ≥ 2 + √3. (101)
Since 2 + √3 > 2.7: if w_y/w_x ≥ 2 + √3, then at time 1 the point (v^(1), w_x^(1), w_y^(1)) would already be in stage 2. However, this is not possible because ‖w^(0)‖ = v^(0) ≤ 0.1, which means v^(1) w_x^(1) cannot reach (1/π) · 2.7/(1 + 2.7²). Therefore, the only possibility is w_x/w_y ≥ 2 + √3. In this case, we are able to bound w_y as
w_y ≤ (2 − √3) w_x ≤ (2 − √3)(√(1/(4π) + 0.0085²) + 0.0085) ≤ 0.078,
where the second inequality is due to v w_x ≤ 1/(4π) and v − w_x ≥ −0.017. Note that here we still use the bound on v − w_x from Case (I); it is somewhat loose, but it is enough for our analysis.
We leave the analysis of the bound on the number of iterations to the end of this section.

Stage 2.

In Case (I) of stage 1, where the first step of stage 2 has v w_x ≥ 1/(2π), we have v − w_x ∈ [−0.017, 0.17] and w_y ≤ 0.44. In Case (II), where the first step of stage 2 has v w_x ∈ [1/(4π), 1/(2π)), we have v − w_x ∈ [−0.017, 0.17] and w_y ≤ 0.37. In Case (III), where the first step of stage 2 has v w_x < 1/(4π), we have v − w_x ∈ [−0.017, 0.17] and w_y ≤ 0.078.
To upper bound v w_x in the first step of stage 2, there are two candidate computations. One is from Case (I):
v^(t+1) w_x^(t+1) = v w_x [1 + K(1 − v w_x − (arctan(w_y/w_x) − (w_y/w_x)/(1 + (w_y/w_x)²))/π)]² + K w_y²(−v w_x + w_x w_y/(π‖w‖²)) + K(v − w_x)² [1 + K(1 − v w_x − (arctan(w_y/w_x) − (w_y/w_x)/(1 + (w_y/w_x)²))/π)]
≤ v w_x (1 + K(1 − v w_x))² + K w_y²(−v w_x + w_x w_y/(π‖w‖²)) + K(v − w_x)²(1 + K(1 − v w_x))
≤ (1/(2π))(1 + 1.1(1 − 1/(2π)))² + 1.1 · 0.44² (−1/(4π) + 1/(2π)) + 1.1 · 0.17² (1 + 1.1(1 − 1/(2π)))
≤ 0.668,
where we use v w_x ≥ 1/(4π) and x/(1 + x²) ≤ 0.5 for any x. The other is from Case (II):
v^(t+1) w_x^(t+1) ≤ v w_x (1 + K(1 − v w_x))² + K w_y²(−v w_x + w_x w_y/(π‖w‖²)) + K(v − w_x)²(1 + K(1 − v w_x))
≤ (1/(4π))(1 + 1.1(1 − 1/(4π)))² + 1.1 · 0.37² · (1/(2π)) + 1.1 · 0.17² (1 + 1.1(1 − 1/(4π)))
≤ 0.48,
where we use v w_x ≤ 1/(4π) and x/(1 + x²) ≤ 0.5 for any x. Therefore, we can see that, in the first step of stage 2, v w_x ≤ 0.668.
Next we are going to show how the iteration proceeds in stage 2. In Case (I), there are three facts:
1. w_y ≤ 0.44.
2. v − w_x ∈ [−0.017, 0.17].
3. v w_x ∈ [1/(2π), 0.668].
Similarly, in Case (II), there are three facts as well:
1. w_y ≤ 0.37.
2. v − w_x ∈ [−0.017, 0.17].
3. v w_x ∈ [1/(4π), 1/(2π)].
The main idea is to find a basin that any iterate with the above properties (i.e., in these intervals) will converge to and then stay in. The method is to iteratively compute the ranges of the variables for several steps, which is feasible thanks to the narrow range of K. Before explicitly computing the ranges, let us write down the computation method, which depends on whether or not v w_x ≥ 1.

Consider any iterate with v w_x ∈ [m_1, m_2], v − w_x ∈ [d_1, d_2], w_y ≤ e. We compute bounds on v^(t+1) w_x^(t+1), v^(t+1) − w_x^(t+1), and w_y^(t+1) by the following process (naturally assuming d_1 < 0 < d_2).
1. If m_1 ≥ 1:
(a) Compute w_x ≥ √(m_1 + (d_2/2)²) − d_2/2 =: f.
(b) Compute w_y/w_x ≤ e/f =: g.
(c) Compute (arctan(w_y/w_x) − w_x w_y/‖w‖²)/π ≤ (arctan(g) − g/(1 + g²))/π =: h.
(d) Compute v^(t+1) w_x^(t+1) ≥ m_2(1 + 1.1(1 − m_2 − h))² + 1.1(1 − m_2 − h) max{|d_1|, |d_2|}² − 1.1 e² m_2. This is from
v^(t+1) w_x^(t+1) ≥ v w_x (1 + K(1 − v w_x − h))² + K w_y²(−v w_x + w_x w_y/(π‖w‖²)) + K(v − w_x)²(1 + K(1 − v w_x − h)) ≥ v w_x (1 + K(1 − v w_x − h))² − K w_y² · v w_x + K(v − w_x)²(1 + K(1 − v w_x − h)).
(e) Compute v^(t+1) w_x^(t+1) ≤ m_1(1 + 1.0(1 − m_1))². This is because x(1 + K(1 − x))² decreases as x increases when x ≥ 1.
(f) Compute v^(t+1) − w_x^(t+1) ∈ [d_1(1 + 1.1(m_2 − 1 + h) − 1.1 e²(√(m_2 + (d_2/2)²) + d_2/2)), d_2(1 + 1.1(m_2 − 1 + h))]. This is due to
∆v − ∆w_x = K(v − w_x)[v w_x − 1 + (1/π)(arctan(w_y/w_x) − w_x w_y/‖w‖²)] + K (w_y²/v)(−v² + v w_y/(π‖w‖²)),
where, for v w_x ≥ 1, the last term lies between −K v w_y² and 0.
(g) Compute w_y^(t+1) ≤ e · max{|j_1|, |j_2|}, where
j_1 = 1 + 1.1 · [(√(m_1 + (d_2/2)²) + d_2/2)/(√(m_1 + (d_2/2)²) − d_2/2)] · (−m_2),
j_2 = 1 + 1.0 · [(√(m_1 + (d_1/2)²) − d_1/2)/(√(m_1 + (d_1/2)²) + d_1/2)] · (−m_1 + 1/(2π)).
This is due to ∆w_y/w_y = K(v/w_x)(−v w_x + (1/π) w_x w_y/‖w‖²): we take the smallest possible value as j_1 − 1 and the largest as j_2 − 1; since w_y is always nonnegative, taking the maximum absolute value gives the upper bound.
2. If m_2 < 1:
(a) Compute w_x ≥ √(m_1 + (d_2/2)²) − d_2/2 =: f.
(b) Compute w_y/w_x ≤ e/f =: g.
(c) Compute (arctan(w_y/w_x) − w_x w_y/‖w‖²)/π ≤ (arctan(g) − g/(1 + g²))/π =: h.
(d) Compute v^(t+1) w_x^(t+1) ≥ min_{x∈[m_1,m_2]} x(1 + 1.0(1 − x − h))² − 1.1 e² x. Compared with the case m_1 ≥ 1, we drop the term 1.1(1 − m_2 − h) max{|d_1|, |d_2|}², because it is possible to have v − w_x = 0 at some iterations.
(e) Compute v^(t+1) w_x^(t+1) ≤ max_{x∈[m_1,m_2]} x(1 + 1.1(1 − x))² + 1.1(1 − x) max{|d_1|, |d_2|}². Compared with the case m_1 ≥ 1, we add a term depending on |v − w_x|_max, because imbalance enlarges v w_x in this regime.
(f) Compute v^(t+1) − w_x^(t+1) ∈ [d_1(1 + 1.1(m_2 − 1 + h) − 1.1 e²(√(m_2 + (d_2/2)²) + d_2/2)), d_2(1 + 1.1(m_2 − 1 + h))]. In fact, a rigorous left bound should include more terms to take a minimum over; here it stays simple because 1 + K(m_1 − 1) ≥ 0 holds throughout the computations below, so we need not worry about sign flips of d_1 and d_2.
(g) Compute w_y^(t+1) ≤ e · max{|j_1|, |j_2|}, where j_1, j_2 are the same as in the case m_1 ≥ 1.
Therefore, with the above process, we are able to compute, by brute force, the ranges of v^(t+1) w_x^(t+1), v^(t+1) − w_x^(t+1), and w_y^(t+1) from the current ranges. Note that this process builds a mapping from one interval to another interval that covers all points of the source interval. It is, however, loose to some extent, because gradient descent is a mapping from a point to another point. The advantage of such a loose method is the feasibility of obtaining bounds, at the cost of tightness. To recover tightness, later we will also include some arguments in a point-to-point style. Also note that a convenient way to combine tightness and efficiency in this method is to split and merge intervals when necessary.
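The interval bookkeeping above is mechanical and can be scripted. The sketch below implements our reconstruction of steps (a)–(g) for the case m_1 ≥ 1 only; the function name and structure are ours, and the constants 1.1 and 1.0 are the upper and lower ends of the range of K:

```python
import math

def propagate(m1, m2, d1, d2, e):
    """Map ranges (vw_x in [m1, m2] with m1 >= 1, v - w_x in [d1, d2] with
    d1 < 0 < d2, w_y <= e) to ranges after one GD step, per steps (a)-(g)."""
    f = math.sqrt(m1 + (d2 / 2)**2) - d2 / 2             # (a) lower bound on w_x
    g = e / f                                             # (b) upper bound on w_y / w_x
    h = (math.atan(g) - g / (1 + g * g)) / math.pi        # (c) upper bound on gamma
    lo = (m2 * (1 + 1.1 * (1 - m2 - h))**2                # (d) lower bound on vw_x
          + 1.1 * (1 - m2 - h) * max(abs(d1), abs(d2))**2 - 1.1 * e * e * m2)
    hi = m1 * (1 + 1.0 * (1 - m1))**2                     # (e) upper bound on vw_x
    d_lo = d1 * (1 + 1.1 * (m2 - 1 + h)                   # (f) range of v - w_x
                 - 1.1 * e * e * (math.sqrt(m2 + (d2 / 2)**2) + d2 / 2))
    d_hi = d2 * (1 + 1.1 * (m2 - 1 + h))
    j1 = 1 + 1.1 * ((math.sqrt(m1 + (d2 / 2)**2) + d2 / 2)
                    / (math.sqrt(m1 + (d2 / 2)**2) - d2 / 2)) * (-m2)
    j2 = 1 + 1.0 * ((math.sqrt(m1 + (d1 / 2)**2) - d1 / 2)
                    / (math.sqrt(m1 + (d1 / 2)**2) + d1 / 2)) * (-m1 + 1 / (2 * math.pi))
    e_new = e * max(abs(j1), abs(j2))                     # (g) upper bound on w_y
    return lo, hi, d_lo, d_hi, e_new

# Example: propagate {w_y <= 0.214, v - w_x in [-0.362, 0.068], vw_x in [1, 1.11178]}.
lo, hi, d_lo, d_hi, e_new = propagate(1.0, 1.11178, -0.362, 0.068, 0.214)
```

For this input the bound on w_y contracts (e_new < e), matching the claimed linear decay inside the basin, and the new lower bound on v w_x comes out near 0.778, consistent with the value 0.777894 appearing in the interval tables below.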

For Case (I):

Now we are to compute the ranges starting from the interval I = {w_y ≤ 0.44, v − w_x ∈ [−0.017, 0.17], v w_x ∈ [1/(2π), 0.668]}. First, we split it into three intervals:
1. I_1 = {w_y ≤ 0.44, v − w_x ∈ [−0.017, 0.17], v w_x ∈ [0.213, 0.4]}.
2. I_2 = {w_y ≤ 0.44, v − w_x ∈ [−0.017, 0.17], v w_x ∈ [0.4, 0.668]}.
3. I_30 = {w_y ≤ 0.44, v − w_x ∈ [−0.017, 0.17], v w_x ∈ [1/(2π), 0.213]}.
Then, following the above method with splitting and merging of intervals, we have:
1. Starting from I_1: I_10 = {w_y ≤ 0.0756, v − w_x ∈ [−0.362, 0.068], v w_x ∈ [0.777894, 1.11178]}. iv. I_7 maps to I_11 = {w_y ≤ 0.134, v − w_x ∈ [−0.394, 0.0782], v w_x ∈ [0.595, 1]}.
(c) Step 3: splitting and merging I_8, I_9, I_10, I_11, we have
i. I_12 = {w_y ≤ 0.134, v − w_x ∈ [−0.394, 0.078], v w_x ∈ [0.595, 0.777]}.
ii. I_13 = {w_y ≤ 0.214, v − w_x ∈ [−0.394, 0.078], v w_x ∈ [0.777, 1]}.
iii. I_14 = {w_y ≤ 0.214, v − w_x ∈ [−0.362, 0.068], v w_x ∈ [1, 1.11178]}.
iv. I_15 = {w_y ≤ 0.214, v − w_x ∈ [−0.
For Case (II): now we are to compute the ranges starting from the interval I = {w_y ≤ 0.37, v − w_x ∈ [−0.017, 0.17], v w_x ∈ [1/(4π), 1/(2π)]}. First, we denote it as
1. I_52 = {w_y ≤ 0.37, v − w_x ∈ [−0.017, 0.17], v w_x ∈ [1/(4π), 1/(2π)]}.
Then, following the above method with splitting and merging of intervals, we arrive at the limiting set
I_s = ⋂_t {(v^(T), w_x^(T), w_y^(T)), T ≥ t | (v^(t), w_x^(t), w_y^(t)) ∈ I_g}.
Then each element (v, w_x, w_y) ∈ I_s has the following properties:
1. w_y = 0.

2. v w_x ∈ [0.181, 1.5].
3. If v w_x ≤ 1, then v − w_x ∈ [−0.735, 0.23]; if v w_x > 1, then v − w_x ∈ [−0.474, 0.148].
The first property is obvious. The third can be proven as follows: each element (v, w_x, w_y) ∈ I_g satisfies
v^(t+1) − w_x^(t+1) = (v − w_x)(1 + K(v w_x − 1)),
where the ratio 1 + K(v w_x − 1) ∈ [1, 1 + 1.1(1.5 − 1)] when v w_x ∈ [1, 1.5]. Furthermore, in the proven 2-D case, we have shown that "if v w_x > 1 with some mild conditions, then (v^(t+2) − w_x^(t+2))/(v − w_x) ∈ (−1, 1)". Actually, it can be tightened to (v^(t+2) − w_x^(t+2))/(v − w_x) ∈ (−0.2, 1), because here K ≤ 1.1 while the original bound was for K ≤ 1.5. The condition of bounded |v − w_x| can also be verified; its purpose is to keep v and w_x always positive. The bound (−0.2, 1] then tells v − w_x ∈ [−0.474, 0.148] on v w_x ≥ 1, because 0.148/0.474 > 0.2 and 0.474/0.148 > 0.2.
For the second property, the left bound can be verified as
min_{x∈[1,1.5]} [x(1 + 1.1(1 − x))² + 1.1(1 − x) · 0.474²] = [x(1 + 1.1(1 − x))² + 1.1(1 − x) · 0.474²] |_{x=1.5} ≥ 0.181.
The right bound can be verified as
max_{x∈[0,1]} [x(1 + 1.1(1 − x))² + 1.1(1 − x) · 0.735²] < 1.5.
After proving these three properties, we would like to bound how far I_f is from I_s; more precisely, the distance is measured by w_y, and we are going to show that w_y decays exponentially. Recall the update rules in (80, 81). Denote γ = (1/π)(arctan(w_y/w_x) − w_x w_y/‖w‖²) again and δ = K (w_y²/v)(−v² + v w_y/(π‖w‖²)); then
∆v = K w_x(1 − v w_x) − K w_x γ + δ, ∆w_x = K v(1 − v w_x) − K v γ, δ ∈ [−K v w_y², 0]. (114)
Note that both γ and δ are very small, so we are to show their effects separately, which is enough for a good approximation. Consider an iteration where v^(t) w_x^(t) > 1 with the corresponding γ^(t). Denote by v^(t+1), w_x^(t+1) the next parameters without the corruption from γ^(t); similarly, denote by v̂^(t+1), ŵ_x^(t+1) the ones corrupted by γ^(t). From the 2-D analysis, we know
(v^(t+2) − w_x^(t+2))/(v^(t) − w_x^(t)) = (1 + K(v^(t) w_x^(t) − 1))(1 + K(v^(t+1) w_x^(t+1) − 1)) < 1. (117)
We would like to show that, with a small γ^(t) and ignoring δ, it remains that
(v̂^(t+2) − ŵ_x^(t+2))/(v^(t) − w_x^(t)) = (1 + K(v^(t) w_x^(t) − 1 + γ^(t)))(1 + K(v̂^(t+1) ŵ_x^(t+1) − 1 + γ^(t+1))) < 1, (118)
where γ^(t+1) is taken at time (t + 1) accordingly. The difference of the LHS of the above two expressions turns out to be
(118) − (117) = Kγ^(t)(1 + K(v^(t+1) w_x^(t+1) − 1)) + (1 + K(v^(t) w_x^(t) − 1)) K(v̂^(t+1) ŵ_x^(t+1) − v^(t+1) w_x^(t+1) + γ^(t+1)) + O(γ²)
= Kγ^(t)(1 + K(v^(t+1) w_x^(t+1) − 1)) + K(1 + K(v^(t) w_x^(t) − 1))(−K(v^(t))² γ^(t) − K(w_x^(t))² γ^(t) + γ^(t+1)) + O(γ²)
≤ Kγ^(t) [1 + (1 + K(v^(t) w_x^(t) − 1))(−K(v^(t))² − K(w_x^(t))² + γ^(t+1)/γ^(t))] + O(γ²)
≤ Kγ^(t) [1 + (1 + K(v^(t) w_x^(t) − 1))(−2K v^(t) w_x^(t) + γ^(t+1)/γ^(t))] + O(γ²). (119)
Since ∆w_x/w_x = K(v/w_x)(1 − v w_x − γ), we have
w_x^(t+1)/w_x^(t) = 1 + K(v^(t)/w_x^(t))(1 − v^(t) w_x^(t) − γ^(t)) < 1.
Also we have
γ^(t+1)/γ^(t) = [arctan(w_y^(t+1)/w_x^(t+1)) − w_x^(t+1) w_y^(t+1)/‖w^(t+1)‖²] / [arctan(w_y^(t)/w_x^(t)) − w_x^(t) w_y^(t)/‖w^(t)‖²]. (122)
Since w_y^(t+1) ≤ w_y^(t) and [arctan(mx) − mx/(1 + m²x²)]/[arctan(x) − x/(1 + x²)] ≤ m³ for any m > 0, x > 0, we have
γ^(t+1)/γ^(t) ≤ 1/[1 + K(v^(t)/w_x^(t))(1 − v^(t) w_x^(t) − γ^(t))]³. (123)
For general v w_x ∈ (1, 1.5], (123) gives
γ^(t+1)/γ^(t) ≤ 1/[1 + 1.1 · ((√(1 + 0.074²) + 0.074)/(√(1 + 0.074²) − 0.074)) · (−1.5 + 1)]³ ≤ 22.
Since 1 + K(v^(t) w_x^(t) − 1) ≤ 1 + 1.1 × 0.5 = 1.55, it is fair to say
γ^(t) ≤ [arctan(x) − x/(1 + x²)]/π ≤ 2.6 × 10⁻⁴,
and, as a result,
(118) − (117) ≤ 0.0084.
Note that this small value is very easy to absorb in (117), which requires 1 − (v^(t+2) − w_x^(t+2))/(v^(t) − w_x^(t)) ≥ 0.0084, except when v w_x is quite close to 1. When v w_x → 1, from the analysis of the 2-D case (derived from the case x_{t+1} y_{t+1} ≥ x_s²),
1 − (v^(t+2) − w_x^(t+2))/(v^(t) − w_x^(t)) ≥ (2K − 2)(v^(t) w_x^(t) − 1).
For (118) − (117), denote a function p(x) as
p(x) = 1 + (1 + Kx) [−2K(x + 1) + 1/(1 + K(v/w_x)(−x))³],
where x = v^(t) w_x^(t) − 1 as in (119, 123). It is obvious that p(0) = 1 + (−2K + 1) < 0.
When x is small, it turns out that
p(x) = −2K + 2 + K(−2K − 1 + 3 v^(t)/w_x^(t)) x + O(x²). (132)
As a result, (118) − (117) < 0 when v w_x − 1 = o(K − 1). What if v w_x − 1 = Ω(K − 1)? Actually, we can get a better bound by a more careful analysis:
((118) − (117))/(Kγ^(t)) ≤ 1 + (1 + K(v^(t) w_x^(t) − 1))(−K(v^(t))² − K(w_x^(t))² + γ^(t+1)/γ^(t)) + K(v^(t) w_x^(t)(1 + K(1 − v^(t) w_x^(t)))² − 1),
where the last term is due to v^(t+1) w_x^(t+1) ≤ v^(t) w_x^(t)(1 + K(1 − v^(t) w_x^(t)))². Hence, with this bound, by expanding the last term, (132) becomes
p(x) = −2K + 2 + K(−2K − 1 + 3 v^(t)/w_x^(t)) x + K(1 − 2K) x + O(x²) = −2K + 2 + K(−4K + 3 v^(t)/w_x^(t)) x + O(x²), (134)
which is definitely negative because
v^(t)/w_x^(t) ≤ (√(1 + 0.074²) + 0.074)/(√(1 + 0.074²) − 0.074) < 1.16 < 4/3.
Meanwhile, we are to prove that the δ in (114) cannot make Ĩ_s reach v − w_x < −0.474 when starting from v − w_x ≥ −0.462. First, in the region {v w_x ∈ [1, 1.5], v − w_x ≤ 0.148}, we have K v w_y² ≤ 0.0144 · (1 + 0.37²) = 0.0164. Since |v^(t+2) − w_x^(t+2)| < |v^(t) − w_x^(t)| when there is no δ, we shall see that there is no need to discuss the case v − w_x ≥ −0.462 + 0.0164 = −0.4456, because it still holds that v^(t+1) − w_x^(t+1) > −0.462. When v^(t) − w_x^(t) ∈ [−0.462, −0.4456], we shall see that, after adding the term δ in v,
(v^(t+2) − w_x^(t+2))/(v^(t) − w_x^(t)) ≤ 1 − (1 + K(v w_x − 1)) · K w_x δ,
which means the absolute value of v − w_x decays at least by a margin depending on δ. After multiplying both sides by the current difference v^(t) − w_x^(t), it gives
(v^(t+2) − w_x^(t+2)) − (v^(t) − w_x^(t)) ≥ v^(t) w_x^(t) w_x^(t) δ. (140)
Note that here v^(t+2) − w_x^(t+2) does not include δ^(t) and δ^(t+1). As stated above, we have δ^(t+1)/δ^(t) ≤ 0.37² ≤ 0.16 due to the decay of w_y. So it is safe to say δ^(t) + δ^(t+1) ≥ 1.16 δ^(t).
Combining with the above inequality, it gives (v (t+2) -w (t+2) x ) -(v (t) -w (t) x ) + δ (t) + δ (t+1) ≥ (v (t) w (t) x w (t) x + 1.16)δ (t) , which gives that the sum in (140) is bounded by (0.6/(1 -0.16))δ (t) ≥ (0.6/(1 -0.16)) v (t) • (-0.0144) ≥ -0.0103. Since -0.474 -(-0.462) < -0.0103, we see that the term of δ cannot drive v -w x < -0.472. Note that (140) shall include a factor (< 1) in front of δ (t) , but we have ignored it to state a more aggressive bound. Therefore, we are able to say that an interval Îs generated by I f also has the following properties: for each element (v, w x , w y ) ∈ Îs , 1. vw x ∈ [0.181, 1.5].

2. If vw x ≤ 1, then v -w x ∈ [-0.735, 0.23]. If vw x > 1, then v -w x ∈ [-0.472, 0.148].

Then the decreasing ratio ∆w y /w y is bounded by ∆w y w y = K v w x -vw x + 1 π w x w y w 2 (143) ∈ -1.1( 1.5 + 0.074 2 + 0.074) 2 , -0.030K = [-1.87, -0.030K]. (145) Hence, w y decays geometrically with ratio at most 0.97 (i.e., 1 -0.030K) for Cases (I, II) in Stage 2. For Case (III), in the first step of Stage 2, it already has w y ≤ 0.078 and v -w x ∈ [-0.017, 0.17], so it will also converge to I s . Here we present the time analysis for Case (III) of both stages. The number of iterations in the first stage is similar to that of Cases (I, II), as T 1 ≤ log 2.56 2.7ψ β 2 , where ψ < 1 4π is the value of vw x in the first step of Stage 2. In Stage 2, since our target is to find how many steps are necessary to reach vw x ≥ 0.181, we have v (t+1) w (t+1) x ≥ v (t) w (t) x   1 -0.181 + 1 - arctan(2 - √ 3) -2- √ 3 1+(2- √ 3) 2 π -1.1w 2 y   (147) ≥ 3.28v (t) w (t) x , where obviously it still holds

J PROOF OF MATRIX FACTORIZATION

Consider a two-layer matrix factorization problem, parameterized by learnable weights X ∈ R m×p , Y ∈ R p×q , with target matrix C ∈ R m×q . The loss L is defined as L(X, Y) = 1 2 ‖XY -C‖ 2 F . Obviously {(X, Y) : XY = C} forms a minimum manifold. Focusing on this manifold, our targets are: 1) to prove that our condition for stable oscillation on 1-D functions holds at the minima of L for any setting of dimensions, and 2) to prove a convergence result beyond the edge of stability with initialization close to a minimum, for the symmetric case Y = Xᵀ.

J.1 ASYMMETRIC CASE: 1D FUNCTION AT THE MINIMA

Before looking into the theorem, we clarify the definition of the loss Hessian. We flatten X, Y into a vector θ = vec(X, Y) ∈ R mp+pq , which vectorizes the concatenation. As a result, we can represent the loss Hessian w.r.t. θ as a matrix in R (mp+pq)×(mp+pq) . Meanwhile, the support of the loss landscape is R mp+pq . In the following theorem, we exhibit the leading eigenvector ∆ = vec(∆X, ∆Y) ∈ R mp+pq of the loss Hessian. Since the cross-section of the loss landscape along ∆ forms a 1-D function f ∆ , we then show that the stable-oscillation condition for 1-D functions holds at the minima of f ∆ .

Theorem 12. For a matrix factorization problem, assume XY = C. Consider the SVDs X = Σ_{i=1}^{min{m,p}} σ x,i u x,i v x,iᵀ and Y = Σ_{i=1}^{min{p,q}} σ y,i u y,i v y,iᵀ , where both groups of σ •,i are in descending order and both top singular values σ x,1 and σ y,1 are unique. Also assume v x,1ᵀ u y,1 ≠ 0. Then the leading eigenvector of the loss Hessian is ∆ = vec(C 1 u x,1 u y,1ᵀ , C 2 v x,1 v y,1ᵀ ) with C 1 = σ y,1 /√(σ² x,1 + σ² y,1 ), C 2 = σ x,1 /√(σ² x,1 + σ² y,1 ). Denote f ∆ as the 1-D function on the cross-section of the loss landscape along the direction ∆, passing through vec(X, Y). Then, at the minimum of f ∆ , it satisfies 3[f (3) ∆ ] 2 -f (2) ∆ f (4) ∆ > 0.

Proof.
To obtain the direction of the leading Hessian eigenvector at parameters (X, Y), consider a small deviation of the parameters as (X + ∆X, Y + ∆Y). With XY = C, evaluate the loss as L(X + ∆X, Y + ∆Y) = 1 2 ‖∆X Y + X ∆Y + ∆X ∆Y‖ 2 F . Expand and split the terms by orders of ∆X, ∆Y as follows: Θ(‖∆X‖² + ‖∆Y‖²): 1 2 ‖∆X Y + X ∆Y‖ 2 F ; Θ(‖∆X‖³ + ‖∆Y‖³): ⟨∆X Y + X ∆Y, ∆X ∆Y⟩; Θ(‖∆X‖⁴ + ‖∆Y‖⁴): 1 2 ‖∆X ∆Y‖ 2 F . From the second-order terms, the leading eigenvector of ∇ 2 L is the solution of vec(∆X, ∆Y) = argmax_{‖∆X‖²_F + ‖∆Y‖²_F = 1} ‖∆X Y + X ∆Y‖ 2 F . Since both top singular values of X, Y are unique, the solution has both ∆X, ∆Y of rank 1. Up to sign, the solution is ∆X = (σ y,1 /√(σ² x,1 + σ² y,1 )) u x,1 u y,1ᵀ , ∆Y = (σ x,1 /√(σ² x,1 + σ² y,1 )) v x,1 v y,1ᵀ . Equipped with the top eigenvector of the Hessian, vec(∆X, ∆Y), we consider the 1-D function f ∆ generated by the cross-section of the loss landscape along the eigenvector, passing through the minimum (X, Y). Define f ∆ (µ) = L(X + µ∆X, Y + µ∆Y), µ ∈ R. Then, around µ = 0, we have f ∆ (µ) = 1 2 ‖∆X Y + X ∆Y‖ 2 F µ 2 + ⟨∆X Y + X ∆Y, ∆X ∆Y⟩ µ 3 + 1 2 ‖∆X ∆Y‖ 2 F µ 4 . Therefore, the derivatives of f ∆ at µ = 0 follow from the Taylor expansion as f (2) ∆ (0) = ‖∆X Y + X ∆Y‖ 2 F , f (3) ∆ (0) = 6⟨∆X Y + X ∆Y, ∆X ∆Y⟩, f (4) ∆ (0) = 12‖∆X ∆Y‖ 2 F . Then we compute the condition of stable oscillation of the 1-D function as 3[f (3) ∆ ] 2 -f (2) ∆ f (4) ∆ (0) = 108⟨∆X Y + X ∆Y, ∆X ∆Y⟩ 2 -12‖∆X Y + X ∆Y‖ 2 F ‖∆X ∆Y‖ 2 F (162) = 96‖∆X Y + X ∆Y‖ 2 F ‖∆X ∆Y‖ 2 F > 0, because ∆X Y, X ∆Y, and ∆X ∆Y are all parallel to u x,1 v y,1ᵀ and v x,1ᵀ u y,1 ≠ 0.
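The parallelism argument in the last step can be confirmed numerically. The sketch below (our construction; the dimensions and seed are arbitrary) builds ∆X, ∆Y from the formula in Theorem 12 and evaluates the three derivatives from the exact polynomial coefficients of f_∆:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, q = 4, 3, 5
X = rng.standard_normal((m, p))
Y = rng.standard_normal((p, q))
C = X @ Y  # so (X, Y) lies on the minimum manifold XY = C

# Top singular triplets of X and Y.
ux, sx, vxt = np.linalg.svd(X)
uy, sy, vyt = np.linalg.svd(Y)
s = np.hypot(sx[0], sy[0])  # sqrt(sigma_x1^2 + sigma_y1^2)

# Leading Hessian eigenvector (Theorem 12), written as the pair (dX, dY).
dX = (sy[0] / s) * np.outer(ux[:, 0], uy[:, 0])
dY = (sx[0] / s) * np.outer(vxt[0, :], vyt[0, :])

# f_D(mu) = A mu^2 + B mu^3 + D mu^4 with A = ||M||^2/2, B = <M, N>, D = ||N||^2/2.
M = dX @ Y + X @ dY          # second-order term
N = dX @ dY                  # fourth-order term
f2 = np.linalg.norm(M) ** 2          # f''(0)
f3 = 6.0 * np.sum(M * N)             # f'''(0)
f4 = 12.0 * np.linalg.norm(N) ** 2   # f''''(0)

cond = 3.0 * f3 ** 2 - f2 * f4
# M and N are both multiples of ux1 vy1^T, so Cauchy-Schwarz is tight and
# the condition collapses to 96 ||M||^2 ||N||^2 > 0.
print(cond, 96.0 * np.linalg.norm(M) ** 2 * np.linalg.norm(N) ** 2)
```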

J.2 SYMMETRIC CASE: CONVERGENCE ANALYSIS AROUND THE MINIMA

In this section, we focus on the symmetric case of matrix factorization where Y = Xᵀ. Accordingly, we rescale the loss function as L(X) = 1 4 ‖XXᵀ -C‖ 2 F . Denote the target as C = X 0 X 0ᵀ, and assume we are around the minimum X 1 = X 0 + ∆X 1 with small ∆X 1 . Consider the SVD X 0 = Σ_{i=1}^{min{m,p}} σ i u i v iᵀ = σ 1 u 1 v 1ᵀ + X̄ 0 . Then the EoS learning-rate threshold at X = X 0 is η = 1/σ 1 2 . Therefore, we are to show convergence when initializing from X 1 = X 0 + ∆X 1 with learning rate η = 1/σ 1 2 + β, β > 0.

Theorem 13. Consider the above symmetric matrix factorization with learning rate η = 1/σ 1 2 + β. Assume 0 < βσ 1 2 < √4.5 -1 ≈ 1.121 and ησ 2 2 < 1. The initialization is around the minimum, as X 1 = X 0 + ∆X 1 , with the deviation satisfying u 1ᵀ ∆X 1 v 1 ≠ 0 and ‖∆X 1 ‖ bounded by a small value ε. Then GD converges to a period-2 orbit γ η up to a small margin of O(ε), as X t → γ η + ∆X, ‖∆X‖ = O(ε), γ η = (X 0 + δ 1 σ 1 u 1 v 1ᵀ , X 0 + δ 2 σ 1 u 1 v 1ᵀ ), where δ 1 ∈ (0, 1), δ 2 ∈ (-1, 0) are the two solutions of

1 + βσ 1 2 = 1 / [ (δ+1) 2 ( √(1/(δ+1) 2 -3/4) + 1/2 ) ].

Proof. The update rule of gradient descent gives X t+1 = X t -η (X t X tᵀ -X 0 X 0ᵀ) X t . Denoting ∆X t = X t -X 0 , the update rule is equivalent to ∆X t+1 = ∆X t -η (∆X t X 0ᵀ + X 0 ∆X tᵀ + ∆X t ∆X tᵀ)(X 0 + ∆X t ). Consider a decomposition ∆X t = ε t ũ t v 1ᵀ + ∆X̃ t , where ∆X̃ t v 1 = 0 and ‖ũ t ‖ = 1. We also control the sign of ũ t by requiring ⟨ũ t , u 1 ⟩ > 0. Then, the update rule is again equivalent to ∆X t+1 = ε t ũ t v 1ᵀ + ∆X̃ t -η [ (ε t ũ t v 1ᵀ + ∆X̃ t ) X 0ᵀ + X 0 (ε t ũ t v 1ᵀ + ∆X̃ t )ᵀ (169) + (ε t ũ t v 1ᵀ + ∆X̃ t )(ε t ũ t v 1ᵀ + ∆X̃ t )ᵀ ] (X 0 + ε t ũ t v 1ᵀ + ∆X̃ t ). After expanding X 0 = σ 1 u 1 v 1ᵀ + X̄ 0 and projecting ∆X t+1 onto v 1 , we have ∆X t+1 v 1 = ε t ũ t -η [ σ 1 2 ε t ũ t + σ 1 ε t 2 ⟨ũ t , u 1 ⟩ ũ t + ε t 3 ũ t + σ 1 2 ε t ⟨ũ t , u 1 ⟩ u 1 + σ 1 ε t 2 u 1 + σ 1 ε t 2 ⟨ũ t , u 1 ⟩ ũ t (171) + σ 1 ∆X̃ t ∆X̃ tᵀ u 1 + σ 1 X̄ 0 ∆X̃ tᵀ u 1 + ε t ∆X̃ t X̄ 0ᵀ ũ t + ε t ∆X̃ t ∆X̃ tᵀ ũ t + ε t X̄ 0 ∆X̃ tᵀ ũ t ].
Phase I: ũt gets close to u 1 sharply from a random direction For now, let's assume ∆X t stays small (which we will show later), then ignoring high-order small values gives t+1 ũt+1 = t ũt -η σ 2 1 t ũt + σ 2 1 t u 1 ũ t u 1 + σ 1 X 0 ∆X t u 1 + O( 2 ), ( ) t+1 ũt+1 , u 1 = (1 -2ησ 2 1 ) t ũt , u 1 + O( 2 ) = (-1 -βσ 2 1 ) t ũt , u 1 + O( 2 ), m t+1 t+1 ũt+1 -t+1 ũt+1 , u 1 u 1 = (1 -ησ 2 1 )( t ũt -t ũt , u 1 u 1 ) -ησ 1 X 0 ∆X t u 1 + O( 2 ), where ( 174) is the projection of ∆X t+1 onto u 1 v 1 , and ( 175) is the orthogonal-to-u 1 component of ∆X t+1 v 1 . Meanwhile, we have the following iteration of ∆X t following the update rule, ∆X t+1 = ∆X t -η σ 1 t ũt u 1 ∆X t + ∆X t X 0 X 0 + ∆X t X 0 ∆X t + σ 1 t u 1 ũ t X 0 (176) +σ 1 t u 1 ũ t ∆X t + X 0 ∆X t X 0 + X 0 ∆X t ∆X t + 2 t ũt ũ t X 0 (177) + 2 t ũt ũ t ∆X t + ∆X t ∆X t X 0 + ∆X t ∆X t ∆X t . Ignoring higher-order small values, it is ∆X t+1 = ∆X t -η σ 1 t u 1 ũ t X 0 + ∆X t X 0 X 0 + X 0 ∆X t X 0 + O( 2 ). Now we are to verify two facts: 1. The orthogonal-to-u 1 component of ∆X t v 1 , denoted as m t , stays small. Then combining the exponential growth of parallel-to-u 1 component in (174) gives ũt , u 1 → 1 quickly. 2. ∆X t stays small. First is the bound of m t . From (175), we have m t+1 = (1 -ησ 2 1 )m t -ησ 1 X 0 ∆X t u 1 + O( 2 ). Combining with (179) and noticing u 1 X 0 = 0, it holds m t+1 ≈ (1 -ησ 2 1 )m t -ησ 1 X 0 ∆X t u 1 (180) σ 1 X 0 ∆X t+1 u 1 ≈ σ 1 X 0 ∆X t u 1 -ησ 1 σ 1 t X 0 X 0 ũt + X 0 X 0 X 0 ∆X t u 1 (181) = I -η X 0 X 0 σ 1 X 0 ∆X t u 1 -ησ 2 1 t X 0 X 0 ũt . Furthermore, we have t X 0 ũt = X 0 m t due to X 0 u 1 = 0, so (182) can be rewritten as σ 1 X 0 ∆X t+1 u 1 ≈ I -η X 0 X 0 σ 1 X 0 ∆X t u 1 -ησ 2 1 X 0 X 0 m t . Combining (180, 183) gives a form of m t+1 as a combination of all previous terms as m t+1 ≈ (1 -ησ 2 1 )m t + t i=1 η 2 σ 2 1 I -η X 0 X 0 i-1 X 0 X 0 m t-i . 
Since our goal is to verify that m t is bounded, pick any eigenvector v p of X 0 X 0 with associated eigenvalue λ p . Then we have m t+1 , v p = (1 -ησ 2 1 ) m t , v p + t i=1 η 2 σ 2 1 v p I -η X 0 X 0 i-1 X 0 X 0 m t-i = (1 -ησ 2 1 ) m t , v p + η 2 σ 2 1 t i=1 (1 -ηλ p ) i-1 λ p m t-i , v p . Obviously m t+1 , v p shall converge to a series with exponential growth. Assume the ratio as m t+1 , v p / m t , v p = r.The above equation is equivalent to r = -βσ 2 1 + η 2 σ 2 1 t i=1 (1 -ηλ p ) i-1 λ p r -i = -βσ 2 1 + η 2 σ 2 1 λ p /r 1 -(1 -ηλ p )/r (187) = -βσ 2 1 + η 2 σ 2 1 λ p r -1 + ηλ p . The solutions for this equation are r = 1 or r = 1 -ηλ p -ησ 2 1 , both of which are in [-1, 1] once λ p ≤ (1 -βσ 2 1 )/η. Hence, it is safe to say m t+1 , v p is bounded as non-increasing. (1 -ηλ p ) i-1 λ p m t-i , v p , where RHS is a constant due to the proven non-increasing | m t , v p |. But this constant is proportional to λ p because m 2 , v p + βσ 2 1 m 1 , v p = ησ 2 1 λ p m 0 , v p ∝ λ p and all further iterations follow a similar factor. Therefore, we have | m 1 , v p | ∝ λ p . Now let's show ∆X t stays small. Consider u 1 ∆X t , (179) gives u 1 ∆X t+1 = u 1 ∆X t I -η X 0 X 0 -ησ 1 m t X 0 (190) = t i=1 -ησ 1 m t-i+1 X 0 I -η X 0 X 0 i-1 , so the norm is bounded as, for some σ of singular value of X 0 , u 1 ∆X t+1 ≤ ησ 1 | 0 | σ 1 -(1 -ησ 2 ) Cσ 2 = | 0 |σ 1 Cσ, where Cσ 2 are from the previous discussion of | m 1 , v p | ∝ λ p , which follows σ = λ p . Hence, it is fair to say u 1 ∆X t stays small. Meanwhile, the residual component of ∆X t that is orthogonal to u 1 on the left, denoted as ∆X t,⊥ , iterates following ∆X t+1,⊥ = ∆X t,⊥ -η ∆X t,⊥ X 0 X 0 + X 0 ∆X t,⊥ X 0 (193) ∆X t+1,⊥ X 0 + X 0 ∆X t+1,⊥ = ∆X t,⊥ X 0 + X 0 ∆X t,⊥ -η ∆X t,⊥ X 0 X 0 X 0 (194) + X 0 ∆X t,⊥ X 0 X 0 + X 0 X 0 ∆X t,⊥ X 0 + X 0 X 0 X 0 ∆X t,⊥ = ∆X t,⊥ X 0 + X 0 ∆X t,⊥ 0.5I -η X 0 X 0 (196) + 0.5I -η X 0 X 0 ∆X t,⊥ X 0 + X 0 ∆X t,⊥ . 
As a result, due to ‖X̄ 0 X̄ 0ᵀ‖ ≤ σ 2 2 < 1/η, the following norm is recursively bounded as ‖∆X t+1,⊥ X̄ 0ᵀ + X̄ 0 ∆X t+1,⊥ᵀ‖ ≤ ‖∆X t,⊥ X̄ 0ᵀ + X̄ 0 ∆X t,⊥ᵀ‖. Since ∆X t+1,⊥ is a polynomial of X̄ 0 and its transpose, the above bound tells that ∆X t+1,⊥ does not grow, i.e., ‖∆X t+1,⊥ ‖ ≤ ‖∆X t,⊥ ‖. Combining (192, 199), we conclude that ∆X̃ t stays small. After proving that both m t and ∆X̃ t stay small, from (174), the only term growing fast (exponentially) is ε t ⟨ũ t , u 1 ⟩, which means the projection of ε t ũ t onto u 1 is also growing sharply.

Phase II: after ũ t ≈ u 1 approximately. In the first phase, ε t ⟨ũ t , u 1 ⟩ grows exponentially while all other components of ∆X t keep small. The consequences of such growth are: 1. ũ t gets close to u 1 in direction, with the cosine similarity between them growing like cos(arctan(exp(-t))), where t is a time variable starting from some constant. 2. While ũ t is a unit vector, the growth of ε t ⟨ũ t , u 1 ⟩ makes ε t grow sharply as well. In Phase I, the proof depends strongly on the assumption that ε t is small; in Phase II, ε t is no longer small. We assume that the second consequence arrives later than the first one, which is feasible once we make the initialization smaller. Then, we verify the dynamics of ε t ⟨ũ t , u 1 ⟩ when ε t is relatively large. After keeping higher-order terms of ε t in (172) and considering u 1ᵀ ũ t ≈ 1, we have ε t+1 ⟨ũ t+1 , u 1 ⟩ = ε t ⟨ũ t , u 1 ⟩ -η (2σ 1 2 ε t + 3σ 1 ε t 2 + ε t 3 ).

Theorem 14 (Restatement of Theorem 4). Consider the above quasi-symmetric matrix factorization with learning rate η = 1/σ 1 2 + β. Assume 0 < βσ 1 2 < √4.5 -1 ≈ 1.121. Consider a minimum (Y 0 = αX 0 , Z 0 = (1/α)X 0ᵀ), α > 0. The initialization is around the minimum, as Y 1 = Y 0 + ∆Y 1 , Z 1 = Z 0 + ∆Z 1 , with the deviations satisfying u 1ᵀ ∆Y 1 v 1 ≠ 0, v 1ᵀ ∆Z 1 u 1 ≠ 0 and ‖∆Y 1 ‖, ‖∆Z 1 ‖ ≤ ε. The second-largest singular value of X 0 needs to satisfy

max { η (σ 1 2 /α 2 + α 2 σ 2 2 ), η (α 2 σ 1 2 + σ 2 2 /α 2 ) } ≤ 2.
Then GD converges to a period-2 orbit γ η approximately, with error in O(ε), formally written as (Y t , Z t ) → γ η + (∆Y, ∆Z), ‖∆Y‖, ‖∆Z‖ = O(ε), γ η = ( Y 0 + (ρ i -α) σ 1 u 1 v 1ᵀ , Z 0 + (ρ i -1/α) σ 1 v 1 u 1ᵀ ), (i = 1, 2), where ρ 1 ∈ (1, 2), ρ 2 ∈ (0, 1) are the two solutions of solving ρ in

1 + βσ 1 2 = 1 / [ ρ 2 ( √(1/ρ 2 -3/4) + 1/2 ) ].

Proof. The update rule of gradient descent gives (for t ≥ 1) Y t+1 = Y t -η (Y t Z t -X 0 X 0ᵀ) Z tᵀ , Z t+1 = Z t -η Y tᵀ (Y t Z t -X 0 X 0ᵀ). Denoting ∆Y t = Y t -Y 0 , ∆Z t = Z t -Z 0 , for t ≥ 1, the update rule is equivalent to (220, 221). Phase I: ũ y,t and ũ z,t get close to u 1 sharply from random directions. Decompose the three matrices at the minimum around the initialization as X 0 = σ 1 u 1 v 1ᵀ + X̄ 0 , Y 0 = ασ 1 u 1 v 1ᵀ + Ȳ 0 , Z 0 = (1/α) σ 1 v 1 u 1ᵀ + Z̄ 0 . Obviously they satisfy X̄ 0 = (1/α) Ȳ 0 = α Z̄ 0ᵀ. For now, let's assume ∆Y t , ∆Z t stay small. Similarly, we need to bound the norm of the following term N t+1 ≜ -( ∆Y t+1,⊥ X̄ 0 + X̄ 0 ∆Y t+1,⊥ ) + α 2 ( ∆Z t+1,⊥ X̄ 0 + X̄ 0 ∆Z t+1,⊥ ) (257) = -∆Y t,⊥ X̄ 0 -X̄ 0 ∆Y t,⊥ + α 2 ∆Z t,⊥ X̄ 0 + α 2 X̄ 0 ∆Z t,⊥ + ηα 2 ∆Z t,⊥ X̄ 0 X̄ 0 X̄ 0 + ηα 2 X̄ 0 X̄ 0 X̄ 0 ∆Z t,⊥ + η X̄ 0 ∆Y t,⊥ X̄ 0 X̄ 0 (259) + η X̄ 0 X̄ 0 ∆Y t,⊥ X̄ 0 -η ∆Y t,⊥ X̄ 0 X̄ 0 X̄ 0 -η X̄ 0 X̄ 0 X̄ 0 ∆Y t,⊥ -ηα 2 X̄ 0 ∆Z t,⊥ X̄ 0 X̄ 0 -ηα 2 X̄ 0 X̄ 0 ∆Z t,⊥ X̄ 0 (261) = α 2 ∆Z t+1,⊥ X̄ 0 (I -η X̄ 0 X̄ 0 ) + (-η X̄ 0 X̄ 0 ) α 2 ∆Z t+1,⊥ X̄ 0 (262) + α 2 X̄ 0 ∆Z t+1,⊥ (-η X̄ 0 X̄ 0 ) + (I -η X̄ 0 X̄ 0 ) α 2 X̄ 0 ∆Z t+1,⊥ -∆Y t+1,⊥ X̄ 0 (I -η X̄ 0 X̄ 0 ) -(-η X̄ 0 X̄ 0 ) ∆Y t+1,⊥ X̄ 0 (264) -X̄ 0 ∆Y t+1,⊥ (-η X̄ 0 X̄ 0 ) -(I -η X̄ 0 X̄ 0 ) X̄ 0 ∆Y t+1,⊥ . Regarding the above equation, there are several key observations: (a) N t is symmetric. (b) For any eigenvector v of X̄ 0ᵀ X̄ 0 , we have vᵀ N t+1 v = vᵀ N t v. (c) For any distinct eigenvectors v p , v q of X̄ 0ᵀ X̄ 0 , we have v pᵀ N t+1 v q + v qᵀ N t+1 v p = v pᵀ N t v q + v qᵀ N t v p . Combining these three observations, we have vᵀ N t+1 v = vᵀ N t v for any v, by decomposing v as a linear combination of the eigenvectors of X̄ 0ᵀ X̄ 0 .
Therefore, it is fair to say N t+1 = N t . So combining M t+1 ≤ M t and N t+1 = N t tells that both ∆Y t+1,⊥ X 0 + X 0 ∆Y t+1,⊥ , ∆Z t+1,⊥ X 0 + X 0 ∆Z t+1,⊥ stay small, which tells ∆Y t,⊥ , ∆Z t,⊥ also stay small. Therefore, we can conclude that both ∆Y t , ∆Z t stay small. After proving that all of m y,t , m z,t , ∆Y t , ∆Z t stay small, from (174), the only terms growing fast are t ũy,t , u 1 , t ũz,t , u 1 exponentially, which means the projections of ũy,t , ũz,t onto u 1 is also growing sharply. Phase II: after ũy , ũz u 1 approximately Following the same spirit of everything in the symmetric case, we re-introduce higher-order terms, as i is bounded for any t ≥ 1, we only need to show the below matrix C with C spectral < 1, with C defined as ∆Y t+1 v 1 y, C =     1 -a 1 v 1 u 1 0 -a 1 v 2 u 1 0 0 1 -a 2 v 2 u 2 0 -a 2 v 1 u 2 -a 2 v 1 u 2 0 1 -a 2 v 2 u 2 0 0 -a 1 v 2 u 1 0 1 -a 1 v 1 u 1     , where v 1 u 1 = 1 1 + σp σ1 α 2 2 1 + σp σ1 α 2 2 , v 2 u 1 = σp σ1 2 1 + σp σ1 α 2 2 1 + σp σ1α 2 2 , v 2 u 2 = 1 1 + σp σ1α 2 2 1 + σp σ1α 2 2 , v 1 u 2 = σp σ1 2 1 + σp σ1 α 2 2 1 + σp σ1α 2 2 . ( ) To obtain C 2 < 1, we only need to compute C i,: < 1 for i ∈ [4]. Taking i = 1 as an example, we have C 1,: 2 = (1 -a 1 v 1 u 1 ) 2 + (a 2 v 2 u 1 ) 2 = 1 -η σ 2 1 α 2 2 +     η σ 2 p α 2 1 + σp σ1 α 2 2 1 + σp σ1α 2 2     2 . If α < 1, the above RHS < (1 -η σ 2 1 α 2 ) 2 + (η σ 2 p α 2 ) 2 < 1 where the second inequality is due to σ 2 p ≤ σ 2 2 < σ 2 1 . If α ≥ 1, the condition of a 1 ≤ 2 (from B ≤ 2) gives σ 2 p ≤ 2/η-σ 2 1 /α 2 α 2 , which is further σ 4 p < 2/η-σ 2 1 /α 2 α 2 σ 2 1 due to σ 2 p < σ 2 1 . As a result, it holds ησ 4 1 -2α 2 σ 2 1 + ησ 4 p α 4 < 0, which means (1 -ησ 2 1 /α 2 ) 2 + (ησ 2 p ) 2 < 1. Noting the above RHS≤ (1 -ησ 2 1 /α 2 ) 2 + (ησ 2 p /α 2 • α 2 ) 2 = (1 -ησ 2 1 /α 2 ) 2 + (ησ 2 p ) 2 when α ≥ 1, it finishes the proof. 
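The conclusions of Theorems 13 and 14 can be checked numerically in a small diagonal instance. The sketch below uses our own illustrative choices (σ₁ = 2, σ₂ = 0.5, α = 0.8, βσ₁² = 0.05, so that the theorems' conditions hold) and runs GD on L(Y, Z) = ½‖YZ − C‖²_F with the proof's update rule:

```python
import numpy as np

# Illustrative spectrum and imbalance (our choice, not the paper's exact experiment):
s1, s2, alpha = 2.0, 0.5, 0.8
beta = 0.05 / s1 ** 2                  # beta * sigma_1^2 = 0.05 < sqrt(4.5) - 1
eta = 1.0 / s1 ** 2 + beta             # learning rate just above the EoS threshold

X0 = np.diag([s1, s2])
C = X0 @ X0.T
Y = alpha * X0                          # quasi-symmetric (sharp) minimum: YZ = C
Z = X0 / alpha
Y = Y + np.diag([0.01, 0.0])            # small top-mode deviation to trigger oscillation

for _ in range(20000):                  # burn-in: oscillation grows, imbalance decays
    R = Y @ Z - C
    Y, Z = Y - eta * R @ Z.T, Z - eta * Y.T @ R

orbit = []                              # record four consecutive iterates
for _ in range(4):
    R = Y @ Z - C
    Y, Z = Y - eta * R @ Z.T, Z - eta * Y.T @ R
    orbit.append((np.linalg.svd(Y, compute_uv=False)[0],
                  np.linalg.svd(Z, compute_uv=False)[0]))
print(orbit)
```

Despite the unbalanced (sharp) initialization with α ≠ 1, the top singular values of Y and Z settle on the same alternating pair, ρ₁σ₁ ≈ 2.182 and ρ₂σ₁ ≈ 1.746, consistent with the period-2 orbit predicted by the ρ-equation above and with the implicit bias towards the balanced, flattest minimisers.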
L ILLUSTRATION OF PERIOD-2 AND PERIOD-4 ORBITS

In the proof of local convergence to the period-2 orbit in (??), we derived an upper bound on the learning rate, √5 -1 ≈ 1.236. Local convergence is guaranteed if the learning rate is smaller than this value. Conversely, if the learning rate is larger, then although the period-2 orbit still exists, GD starting from a point infinitesimally close to the orbit escapes from it; this is when GD converges to a higher-order orbit. Figure 9 precisely shows the effectiveness of this bound: GD converges to the period-2 orbit when η = 1.235 < √5 -1 and to a period-4 orbit when η = 1.237 > √5 -1.
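For f(x) = ¼(x² − 1)², this boundary can be verified in closed form (derivation ours): the GD map is g(x) = x − η(x³ − x); from b = g(a), a = g(b) the period-2 orbit satisfies ab = 1/η and a + b = √(1 + 3/η), so its stability multiplier is g′(a)g′(b) = 9 − 2(1 + η)², which crosses −1 exactly at η = √5 − 1. A short check:

```python
import math

def g(x, eta):
    # One GD step on f(x) = (x^2 - 1)^2 / 4, whose gradient is x^3 - x.
    return x - eta * (x ** 3 - x)

def orbit(eta):
    # Closed-form period-2 orbit {a, b}: a * b = 1/eta, a + b = sqrt(1 + 3/eta),
    # valid for eta > 1 (above the EoS threshold).
    s = math.sqrt(1.0 + 3.0 / eta)
    d = math.sqrt(s * s - 4.0 / eta)
    return (s + d) / 2.0, (s - d) / 2.0

def multiplier(eta):
    # Stability multiplier g'(a) * g'(b); the orbit attracts iff its magnitude < 1.
    a, b = orbit(eta)
    gp = lambda z: 1.0 - eta * (3.0 * z ** 2 - 1.0)
    return gp(a) * gp(b)

for eta in (1.05, 1.235, 1.237):
    print(eta, orbit(eta), multiplier(eta))   # multiplier equals 9 - 2 * (1 + eta)^2

# Simulation lands on the closed-form orbit for a stable learning rate.
x = 1.2
for _ in range(2000):
    x = g(x, 1.05)
print(sorted([x, g(x, 1.05)]), orbit(1.05))
```

The same computation explains Figure 9: the multiplier is 0.595 at η = 1.05, about −0.990 at η = 1.235, and about −1.008 at η = 1.237, so only the last run period-doubles.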

M DISCUSSIONS

First, we provide a general roadmap of our theoretical results in Section M.1, as illustrated in Figure 10. Then, in Section M.2 we discuss three implications from our current low-dimensional settings for more complicated models, towards a future understanding of EoS in practical NNs, where the low-dimensional theorems are complemented with high-dimensional experiments.

M.1 CONNECTIONS BETWEEN THEORETICAL RESULTS

In this section, we discuss the connections between our presented theoretical results, as illustrated in Figure 10. Theorem 1 and Lemma 1 present (local) intrinsic geometric properties for a 1-D function to allow stable oscillations. Such properties provide us the 1-D function f(x) = (µ -x 2 ) 2 .



One can replace the uniform curvature bound by sup_{θ: f(θ) ≤ f(θ (0) )} λ(θ). In particular, it contains the orbit {X 0 U : U ∈ O(p)}.



Figure 1: Running GD around the local minima of f(x) = 1 4 (x 2 -1) 2 (left) and f(x) = 2 sin(x) (right) with learning rate η = 1.01 > 2/f″(x*) = 1. Stars denote the start points. It turns out both functions allow stable oscillation around the local minima.
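The two runs in Figure 1 are easy to reproduce; a minimal sketch (initial points and iteration counts are our own choices):

```python
import math

def run(grad, x0, eta=1.01, steps=20000):
    # Plain gradient descent; return the last three consecutive iterates.
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    x1 = x - eta * grad(x)
    x2 = x1 - eta * grad(x1)
    return x, x1, x2

# f(x) = (x^2 - 1)^2 / 4: minimum x = 1, f''(1) = 2, so the EoS threshold is eta = 1.
x0_, x1_, x2_ = run(lambda x: x ** 3 - x, 1.1)
# f(x) = 2 sin(x): minimum x = -pi/2, f''(-pi/2) = 2, same threshold eta = 1.
y0_, y1_, y2_ = run(lambda x: 2.0 * math.cos(x), -math.pi / 2 + 0.1)

print((x0_, x1_), (y0_, y1_))  # each pair keeps oscillating: stable period-2 orbits
```

In both cases GD settles onto a genuine period-2 orbit around the minimum instead of diverging, even though η exceeds the 2/f″ threshold.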

Figure 2: Running GD in the teacher-student setting with learning rate η = 2.2 = 1.1d, trained on 1000 points uniformly sampled from the sphere S 1 of ‖x‖ = 1. The teacher neuron is w = [1, 0] and the student neuron is initialized as w (0) = [0, 0.1] with v (0) = 0.1.

Figure 3: Symmetric and Quasi-symmetric Matrix factorization: running GD around flat (α = 1) and sharp (α = 0.8) minima. In both cases, their leading singular values converge to the same period-2 orbit (about 6.1 and 5.3). (Left: Training loss. Middle: Largest singular value of symmetric case. Right: Largest singular values of quasi-symmetric case.)

Figure4: Running GD on f (x, y) = 1 2 (xy -1) 2 with learning rate η = 1.05 (top) and η = 1.25 (bottom). When η = 1.05, it converges to a period-2 orbit. When η = 1.25, it converges to a period-4 orbit. In both cases, |x -y| decays sharply.

Figure 5: Result of 2-layer 16-neuron teacher-student experiment.

Figure 6: Result of 3-layer ReLU MLPs on MNIST. Both (c) and (d) are for learning rate as 0.5.

Figure 7: Result of 4-layer ReLU MLPs on MNIST.

Figure 8: Result of 5-layer ReLU MLPs on MNIST.

Step 1: I 1 mapps to I 3 = {w y ≤ 0.416, v -w x ∈ [-0.162, 0.068], vw x ∈ [0.55, 1.12131]}. (b) Step 2: Splitting I 3 , we have i. I 4 = {w y ≤ 0.416, v -w x ∈ [-0.162, 0.068], vw x ∈ [0.55, 0.8]}. ii. I 5 = {w y ≤ 0.416, v -w x ∈ [-0.162, 0.068], vw x ∈ [0.8, 0.9]}. iii. I 6 = {w y ≤ 0.416, v -w x ∈ [-0.162, 0.068], vw x ∈ [0.9, 1.0]}. iv. I 7 = {w y ≤ 0.416, v -w x ∈ [-0.162, 0.068], vw x ∈ [1.0, 1.12131]}. Then, we have i. I 4 mapps to I 8 = {w y ≤ 0.214, v -w x ∈ [-0.309, 0.0545], vw x ∈ [0.942, 1.25786]}. ii. I 5 mapps to I 9 = {w y ≤ 0.0966, v -w x ∈ [-0.335, 0.0613], vw x ∈ [0.880, 1.19649]}. iii. I 6 mapps to

Starting from I 30 , (a) Step 1: I 30 mapps to I 31 = {w y ≤ 0.44, v -w x ∈ [-0.124, 0.037], vw x ∈ [0.422, 0.767]} (b) Step 2: Splitting I 31 , we have i. I = {w y ≤ 0.44, v -w x ∈ [-0.124, 0.037], vw x ∈ [0.422, 0.5]}. ii. I = {w y ≤ 0.44, v -w x ∈ [-0.124, 0.037], vw x ∈ [0.5, 0.6]}. iii. I = {w y ≤ 0.44, v -w x ∈ [-0.124, 0.037], vw x ∈ [0.6, 0.767]}. Then, we have i. I mapps to I = {w y ≤ 0.301, v -w x ∈ [-0.218, 0.0185], vw x ∈ [0.901, 1.20971]}. ii. I mapps to I = {w y ≤ 0.262, v -w x ∈ [-0.245, 0.023], vw x ∈ [0.96322, 1.25093]}. iii. I mapps to I = {w y ≤ 0.213, v -w x ∈ [-0.288, 0.029], vw x ∈ [0.947, 1.25345]}. (c) Step 3: Splitting and merging I 35 , I 36 , I 37 , we have i.I = {w y ≤ 0.301, v -w x ∈ [-0.288, 0.029], vw x ∈ [0.901, 1]}. ii. I = {w y ≤ 0.301, v -w x ∈ [-0.288, 0.029], vw x ∈ [1, 1.1]}. iii. I = {w y ≤ 0.301, v -w x ∈ [-0.288, 0.029], vw x ∈ [1.1, 1.25093]}. iv. I = {w y ≤ 0.262, v -w x ∈ [-0.245, 0.029], vw x ∈ [1.25093, 1.25345]}. Then, we have i. I mapps to I = {w y ≤ 0.0404, v -w x ∈ [-0.392, 0.029], vw x ∈ [0.888, 1.11696]}. ii. I mapps to I = {w y ≤ 0.0740, v -w x ∈ [-0.428, 0.033], vw x ∈ [0.741, 1]}. iii. I mapps to I = {w y ≤ 0.125, v -w x ∈ [-0.482, 0.038], vw x ∈ [0.497, 0.891]}. iv. I mapps to I = {w y ≤ 0.109, v -w x ∈ [-0.400, 0.038], vw x ∈ [0.534, 0.702]}. (d) Step 4: Splitting and merging I 42 , I 43 , I 44 , I 45 , we have i. I = {w y ≤ 0.125, v -w x ∈ [-0.482, 0.038], vw x ∈ [0.497, 0.891]}. ii. I = {w y ≤ 0.074, v -w x ∈ [-0.428, 0.033], vw x ∈ [0.891, 1]}. iii. I = {w y ≤ 0.041, v -w x ∈ [-0.40, 0.029], vw x ∈ [1, 1.11696]}. Then, we have i. I 46 mapps to I 49 = {w y ≤ 0.0424, v -w x ∈ [-0.442, 0.034], vw x ∈ [1.07853, 1.34708]}. ii. I 47 mapps to I 50 = {w y ≤ 0.0110, v -w x ∈ [-0.435, 0.033], vw x ∈ [0.993, 1.13943]}. iii. I 48 mapps to I 51 = {w y ≤ 0.0109, v -w x ∈ [-0.454, 0.033], vw x ∈ [0.497, 0.891]}.

In fact, |⟨m t+1 , v p ⟩| is bounded by |⟨m 0 , v p ⟩| because |⟨m 1 , v p ⟩| < |⟨m 0 , v p ⟩| due to the scaling factor βσ 1 2 < 1. Therefore, after picking any eigenvector v p of X̄ 0 X̄ 0ᵀ, we can conclude ‖m t ‖ ≤ ‖m 0 ‖ ≤ |ε 0 |. A further result is that |⟨m 1 , v p ⟩| ∝ λ p . Notice that ⟨m t+1 , v p ⟩ + βσ 1 2 ⟨m t , v p ⟩ = η 2 σ 1 2 Σ_{i=1}^{t}

309, 0.061], vw x ∈ [1.11178, 1.25786]}. Then, we have i. I 12 mapps to I 16 = {w y ≤ 0.0372, v -w x ∈ [-0.317, 0.061], vw x ∈ [1.14493, 1.31246]}. ii. I 13 mapps to I 17 = {w y ≤ 0.0432, v -w x ∈ [-0.448, 0.078], vw x ∈ [0.943633, 1.24393]}. iii. I 14 mapps to I 18 = {w y ≤ 0.0662, v -w x ∈ [-0.462, 0.077], vw x ∈ [0.77846, 1]}. iv. I 15 mapps to I 20 = {w y ≤ 0.0998, v -w x ∈ [-0.456, 0.0785], vw x ∈ [0.550, 0.878]}. 2. Starting from I 2 , (a) Step 1: I 2 mapps to I 21 = {w y ≤ 0.332, v -w x ∈ [-0.205, 0.114], vw x ∈ [0.864, 1.25894]} (b) Step 2: Splitting I 21 , we have i. I = {w y ≤ 0.332, v -w x ∈ [-0.205, 0.114], vw x ∈ [0.864, 1]}. ii. I = {w y ≤ 0.332, v -w x ∈ [-0.205, 0.114], vw x ∈ [1, 1.125894]}. Then, we have i. I mapps to I = {w y ≤ 0.081, v -w x ∈ [-0.336, 0.114], vw x ∈ [0.858, 1.14813]}. ii. I mapps to I = {w y ≤ 0.184, v -w x ∈ [-0.409, 0.148], vw x ∈ [0.463, 1]}. (c) Step 3: Splitting and merging I 24 , I 25 , we have i. I = {w y ≤ 0.184, v -w x ∈ [-0.409, 0.148], vw x ∈ [0.463, 1]}. ii. I = {w y ≤ 0.081, v -w x ∈ [-0.336, 0.114], vw x ∈ [1, 1.14813]}. Then, we have i. I mapps to I = {w y ≤ 0.083, v -w x ∈ [-0.452, 0.148], vw x ∈ [0.952783, 1.31778]}. ii. I mapps to I = {w y ≤ 0.034, v -w x ∈ [-0.399, 0.133], vw x ∈ [0.777, 1]}.

1. Starting from I 52 , (a) Step 1: I 52 mapps to I 53 = {w y ≤ 0.37, v -w x ∈ [-0.079, 0.0271], vw x ∈ [0.222, 0.616]}. (b) Step 2: I 53 mapps to I 54 = {w y ≤ 0.343, v -w x ∈ [-0.171, 0.017], vw x ∈ [0.621, 1.24894]}. (c) Step 3: Splitting I 54 , we have i.I 55 = {w y ≤ 0.343, v -w x ∈ [-0.171, 0.017], vw x ∈ [0.621, 1}. ii. I 56 = {w y ≤ 0.343, v -w x ∈ [-0.171, 0.017], vw x ∈ [1, 1.24894]}.Then, we have i. I 55 mapps toI 57 = {w y ≤ 0.150, v -w x ∈ [-0.305, 0.017], vw x ∈ [0.840, 1.25908]}. ii. I 56 mapps to I 58 = {w y ≤ 0.137, v -w x ∈ [-0.367, 0.022], vw x ∈ [0.472, 1]}. (d) Step 4: Splitting and merging I 57 , I 58 , we have i. ii. I 59 = {w y ≤ 0.150, v -w x ∈ [-0.367, 0.022], vw x ∈ [0.472, 1}. iii. iv. I 60 = {w y ≤ 0.150, v -w x ∈ [-0.305, 0.017], vw x ∈ [1, 1.25908}. Then, we have i. I 59 mapps to I 61 = {w y ≤ 0.0705, v -w x ∈ [-0.393, 0.022], vw x ∈ [0.971, 1.304]}. ii. I 60 mapps toI 62 = {w y ≤ 0.0613, v -w x ∈ [-0.421, 0.0219], vw x ∈ [0.583, 1]}.From I 16-20 , I 28 , I 29 , I 49-51 , I 61 , I 62 , we can see that it has fallen into an intervalI f = {w y < 0.1, v -w x ∈ [-0.462, 0.148], vw x ∈ [0.497, 1.34078]}.Something special here is that w y has been much smaller than w x . More broadly, let's define an interval I s generated by I g = {w y = 0, v -w x ∈ [-0.464, 0.148], vw x ∈ [1, 1.5]}. Here "generated" means

η ∆Y t Z 0 + Y 0 ∆Z t + ∆Y t ∆Z t (Z 0 + ∆Z t ) , (220)∆Z t+1 = ∆Z t -η ∆Z t Y 0 + Z 0 ∆Y t + ∆Z t ∆Y t (Y 0 + ∆Y t ) . (221)Consider the decompositions ∆Y t = y,t ũy,t v 1 + ∆Y t , ∆Z t = z,t ũz,t v 1 + ∆Z t where we assume ∆Y t v 1 = 0, ∆Z t v 1 = 0, ũy,t = 1, ũz,t = 1. We also control the sign of ũy,t , ũz,t by claiming ũy,t , u 1 > 0, ũz,t , u 1 > 0. Then, the update rule is again equivalent to∆Y t+1 = y,t ũy,t v 1 + ∆Y t -η y,t ũy,t v 1 + ∆Y t Z 0 + Y 0 z,t ũz,t v 1 + ∆Z t + ,t ũy,t v 1 + ∆Y t z,t ũz,t v 1 + ∆Z t (Z 0 + z,t ũz,t v 1 + ∆Z t ), ∆Z t+1 = z,t ũz,t v 1 + ∆Z t -η z,t ũz,t v 1 + ∆Z t Y 0 + Z 0 y,t ũy,t v 1 + ∆Y t + ,t ũy,t v 1 + ∆Y t z,t ũz,t v 1 + ∆Z t (Y 0 + y,t ũy,t v 1 + ∆Y t ).

t+1 ũy,t = y,t ũy,t -η y,t ũy,t ,t+1 = z,t+1 ũz,t+1 , u 1 = z,t ũz,t , u 1 -η σ 2 1 α 2 z,t + 2σ 1 α y,t z,t + σ 2 1 y,t + Since A t is bounded ∀ t if and only if k

annex

We are to show that u 1 ∆X t X 0 ũt is small, so that t+1 ũt , u 1 is a function approximately of only η, t , σ 1 .After re-introducing higher-order terms around (178), we have(203) Then, it holds(206) So we have to ensure that m t keeps small. To obtain this, we start to re-write the update of m t by m t = t ũtt ũt , u 1 u 1 , which is(208) Compared with (180), the above equation has two differences, the first is the coefficient of m t-1 being 1 -η(σ 2 + 2 t ) instead of 1 -ησ 2 , and the second is additional two terms in (208). Let's first discuss the second difference. Actually these two terms somehow act as the coefficient as well, so if X 0 1 then it is safe to neglect them and only consider 1 -ησ 2 . To show this, we again look into the stability of u ∆X t+1 and ∆X t+1,⊥ . We havewhere the first line simply differs from (191) with the coefficient as η(σ 1 + t ) instead of ησ 1 and the second is the same as (193) . Therefore, following the same arguments in the phase I, if both of m t-1 , ∆X t = O( ) are small, we have the following step they are still in O( ).Therefore, we have the dynamics of t as) which corresponds to the update rule of gradient descent on a 1D function fwith learning rate η, it converges to the two positive solutions, , in (39)with learning rate η, it converges to the solutions (one in (-σ 1 , 0), one in (0, σ 2 )) ofAlong with the above argument of stability ∆X t = O( ), it concludes that the trajectory converge to the above solution with deviation upper-bounded by O( ).assume ∆Y t , ∆Z t stay small, then ignoring high-order small values gives y,t+1 ũy,t+1 = ∆Y t+1 v 1 = y,t ũy,t -η y,t ũy,tthen we haveFrom the above two, we would like to find a lower bound of P y,t , P z,t . Note that P z,t+1 -α 2 P y,t+1 = P z,t -α 2 P y,t k, (230)| is growing exponentially with the ratio of 2ησ 21 -1 at least. Meanwhile, with P z,t -α 2 P y,t fixed along time, it holds |P z,t | is growing exponentially with the same ratio as well. 
Now let's see how other things stay small when P y,t and P z,t are the only two terms growing exponentially. If that holds, we can conclude that ũy,t and ũz,t get close to u 1 sharply from random directions. Similar to the discussion of the symmetric case, we have two remaining components as follows (which are wished to be bounded) m y,t+1 y,t+1 ũy,t+1y,t+1 ũy,t+1 , uTake any eigenvector v p of X 0 X 0 with associated eigenvalue σ 2 p with σ p > 0. Then the above system can be written asThen the above system can be re-written as matrixSo we have a rank-2 matrix B with all elements as non-negative. Our target is to show A n is bounded with n → ∞. Hence, we require the spectral norm B ≤ 2 by Lemma 8, which givesTherefore, this gives an upper bound for σ 2 , asHence, with the above discussion, we have shown m y,t , m z,t , u 1 ∆Y t , u 1 ∆Z t stay small in this phase. Meanwhile, the residual components of ∆Y t , ∆Z t that are orthogonal to u 1 on the left, denoted as ∆Y t,⊥ , ∆Z t,⊥ , iterate followingThen we have-Then, by reorganizing terms, M t+1 can be well bounded as M t+1 ≤ M t .Then we haveNote that the shared factor σ1 α y,t + y,t z,t + σ 1 α z,t = (σ 1 α + y,t ) σ1 α + z,t -σ 2 1 . Hence, the above system is equivalent toFurthermore, this is equivalent to f (y, z) = 1 2 (yz -1) 2 with learning rate η = 1 + βσ 2 1 . Therefore, from Theorem 5, we know y, z will converge to the same values, which make the problem with the same solution of 1D functions. Fortunately, it follows the solution for (39). In summary, after straightforward computation, it converges towhere ρ 1 ∈ (1, 2), ρ 2 ∈ (0, 1) are the two solutions of solving ρ in 242), if we have B ≤ 2 and with σ 2 < σ 1 , then A t is bounded entry-wise, for any t ≥ 1.

Proof. Denote u i , v i (i = 1, 2), where z 1 , z 2 , z 3 , z 4 are positive normalization terms ensuring ‖u i ‖ = 1, ‖v i ‖ = 1 for i = 1, 2. Then A can be re-written accordingly. Obviously, for any t ≥ 1, the t-th power of A can be represented via the update rule of k.

Figure 9: Running GD on 1 4 (x 2 -1) 2 with learning rates 1.05, 1.235, and 1.237. The two smaller learning rates drive GD to period-2 orbits while the last one goes to a period-4 orbit. The boundary between period-2 and period-4 is predicted by our proof in (??), as √5 -1 ≈ 1.236.

Figure 10: Roadmap of theoretical results: Local Geometry (Thm. 1), High-order LG (Lem. 1), 1-D case (Thm. 2), LG for MF (Thm. 6). LG stands for Local Geometry; MF stands for Matrix Factorization.

We generalize the local property to a global convergence result in Theorem 2. Then we generalize the 1-D analysis to cases of i) multi-parameter, ii) nonlinear, and iii) high-dimensional models.

(a) Multi-parameter. Compared with the 1-D f(x) = (µ -x 2 ) 2 , the 2-D function f(x, y) = (µ -xy) 2 can be viewed as the simplest setting of two-layer models. We prove that the 2-D case converges to the region x = y in Theorem 5 in Appendix A.1, which means it shares the same convergence as the 1-D model. Moreover, x = y means its sharpness is the flattest.

(b) Nonlinear. We extend the 2-D model to a two-layer single-neuron ReLU model in Section 5. Although the student neuron can be initialized far from the direction of the teacher neuron, we prove the student neuron converges to the correct direction (as w y → 0) in Theorem 3. Then the problem degenerates to the above 2-D analysis, which means it shares the same convergence as the 2-D model, where (v, w x ) corresponds to (x, y) in 2-D.

(c) High-dimension. We extend the 2-D model to quasi-symmetric matrix factorization in Section 6. Although the parameters are initialized near a sharp minimum, GD still walks towards the flattest minimum, as shown in Theorem 4. At convergence, the top singular values of Y, Z are the same, following the 2-D analysis.
So the singular values are in the same period-2 orbit as the 1-D case. Meanwhile, from Theorem 1 and Lemma 1, we prove a condition for base models g in regression tasks to allow stable oscillation in Lemma 2. Furthermore, we provide a composition rule of two base models to find a more complicated model that allows stable oscillation in Prop. 1.

M.2 IMPLICATIONS FROM LOW-DIMENSION TO HIGH-DIMENSION

We would like to emphasize that, although our current simple settings are still far from practical NNs, they help explain the ability of GD at large LRs to discover flat minima in three steps, as follows. We include more experiments in Appendix B.2 to present the following hopes for complicated networks:

(a) By Theorem 1, and especially its second condition, we hope to discover an intrinsic geometric property around local minima of more complicated models. The key is to investigate the 1-D function obtained at the cross-section of the leading eigenvector with the loss landscape.

(c) The final implication is the implicit bias of EoS after such oscillation: it turns out that GD is driven from sharper minima to flatter minima. In the 1-D case there is obviously no implicit bias, since all GD does is approximate the target value. Starting from the 2-D case, however, an implicit bias emerges from the oscillation.

- Theoretical 1: in the 2-D case in Theorem 5, we prove that the two learnable parameters converge to the balanced region $x = y$, where the sharpness is flattest among all global minima.
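This balancing effect can be illustrated with a small numerical sketch of our own (with $\mu = 1$ and all constants chosen for the example): at the symmetric minimiser $x = y = 1$ of $f(x, y) = (1 - xy)^2$ the Hessian's top eigenvalue is $4$, so the classical stability threshold is $\eta = 2/4 = 0.5$. Running GD just above it, the imbalance $|x - y|$ shrinks to zero while the iterates keep oscillating in a period-2 orbit:

```python
# GD with a large learning rate on the 2-D loss f(x, y) = (1 - x*y)**2.
# At the symmetric minimizer x = y = 1 the Hessian's top eigenvalue is 4,
# so classical stability requires eta < 2/4 = 0.5; we run just above it.
# The run illustrates the balancing claim: |x - y| -> 0 (the flattest part
# of the minima manifold) even though the iterates never stop oscillating.

def gd_2d(eta, x, y, n_iter):
    """Run n_iter steps of GD on f(x, y) = (1 - x*y)**2."""
    for _ in range(n_iter):
        r = 1.0 - x * y                       # residual mu - x*y, mu = 1
        x, y = x + 2 * eta * r * y, y + 2 * eta * r * x
    return x, y

if __name__ == "__main__":
    eta = 0.54                                # above the EoS threshold 0.5
    x, y = gd_2d(eta, x=1.2, y=0.6, n_iter=5000)
    print(abs(x - y))                         # imbalance: driven to ~0
    x2, y2 = gd_2d(eta, x, y, n_iter=1)       # one more step of GD
    print(abs(x - x2))                        # but x still oscillates (period 2)
```

The initialization $(1.2, 0.6)$ and the learning rate $0.54$ are illustrative choices; the qualitative outcome is the balanced period-2 oscillation around $xy = 1$.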

N CONCLUSIONS

In this work, we investigate gradient descent with a large step size that crosses the threshold of local stability. In the low-dimensional setting, we provide conditions on high-order derivatives that allow stable oscillation around local minima. For a two-layer single-neuron ReLU network, we prove convergence of the student to align with the teacher neuron under the population loss. For matrix factorization, we prove that the necessary 1-D condition holds around any minimum. Furthermore, we analyze GD in symmetric matrix factorization, which converges to a period-2 orbit consistent with the 1-D convergence. Moreover, we generalize the analysis to quasi-symmetric cases, where GD walks towards the flattest minimiser although initialized near sharp ones. A further discussion is provided in Appendix M.

While these are encouraging results that contribute to the growing understanding of gradient descent beyond the Edge of Stability, our analysis suffers from important limitations that require further work. An important item for future work is therefore to extend it to general dimensions with nonlinearity, which will enable the analysis of empirical landscapes as well as multiple neurons. Moreover, the understanding of the implicit bias of GD in the large-learning-rate regime will not be complete without incorporating noise, either in the classic SGD sense or in the labels, as done in Damian et al. (2021); Li et al. (2021).

