HOW DOES SHARPNESS-AWARE MINIMIZATION MINIMIZE SHARPNESS?

Abstract

Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks in various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in its theoretical characterizations. SAM intends to penalize one notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness is used for proving generalization guarantees. The subtle differences between these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes the third notion of sharpness mentioned above, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of the Hessian when SAM is applied.

Lemma 3.1. For any x at which Φ is defined and differentiable, we have that ∂Φ(x)∇L(x) = 0.

Recent empirical studies have shown that there are essentially no barriers in the loss landscape between different minimizers, that is, the set of minimizers is path-connected (Draxler et al., 2018; Garipov et al., 2018). Motivated by this empirical discovery, we make the assumption below following Fehrman et al. (2020); Li et al. (2021); Arora et al. (2022), which is theoretically justified by Cooper (2018) under a generic setting.

Assumption 3.2. Assume the loss L : ℝ^D → ℝ is C^4, and that there exists a C^2 submanifold Γ of ℝ^D that is (D − M)-dimensional for some integer 1 ≤ M ≤ D, such that every x ∈ Γ is a local minimizer of L and rank(∇²L(x)) = M.

1. INTRODUCTION

Modern deep nets are often overparametrized and have the capacity to fit even randomly labeled data (Zhang et al., 2016). Thus, a small training loss does not necessarily imply good generalization. Yet, standard gradient-based training algorithms such as SGD are able to find generalizable models. Recent empirical and theoretical studies suggest that generalization is well-correlated with the sharpness of the loss landscape at the learned parameter (Keskar et al., 2016). Motivated by this connection, Sharpness-Aware Minimization (SAM) (Foret et al., 2021) explicitly penalizes sharpness during training. Despite its empirical success, the underlying working of SAM remains elusive because of the various intriguing approximations made in its derivation and analysis. There are three different notions of sharpness involved: SAM intends to optimize the first notion, the sharpness along the worst direction, but actually implements a computationally efficient notion, the sharpness along the direction of the gradient. In the analysis of generalization, however, a third notion of sharpness is used to prove generalization guarantees, which admits the first notion as an upper bound. The subtle differences between the three notions can lead to very different biases (see Figure 1 for a demonstration). More concretely, let L be the training loss, x the parameter, and ρ the perturbation radius, a hyperparameter requiring tuning. The first notion corresponds to the following optimization problem:

min_x L^Max_ρ(x), where L^Max_ρ(x) ≜ max_{∥v∥₂≤1} L(x + ρv), (1)

where we call R^Max_ρ(x) = L^Max_ρ(x) − L(x) the worst-direction sharpness at x. SAM intends to minimize the original training loss plus the worst-direction sharpness at x. However, even evaluating L^Max_ρ(x) is computationally expensive, not to mention optimizing it. Thus Foret et al. (2021); Zheng et al. (2021) introduced a second notion of sharpness, which approximates the worst-case direction in (1) by the direction of the gradient:

min_x L^Asc_ρ(x), where L^Asc_ρ(x) ≜ L(x + ρ∇L(x)/∥∇L(x)∥₂). (2)

We call R^Asc_ρ(x) = L^Asc_ρ(x) − L(x) the ascent-direction sharpness at x.
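To make the distinction between the two notions concrete, here is a small numerical sketch (our own toy example, not the paper's code) on a quadratic loss with Hessian eigenvalues 2 and 0.5. Near a point where the gradient happens to point along the small-eigenvalue direction, worst- and ascent-direction sharpness disagree markedly:

```python
import numpy as np

A = np.diag([2.0, 0.5])  # toy Hessian; lambda_1 = 2, lambda_2 = 0.5

def L(x):
    return 0.5 * x @ A @ x

def worst_sharpness(x, rho, n_dirs=100_000, seed=0):
    # R^Max_rho(x): maximize L(x + rho v) over unit vectors v by random search.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_dirs, x.size))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    pts = x + rho * v
    return np.max(0.5 * np.einsum('ni,ij,nj->n', pts, A, pts)) - L(x)

def ascent_sharpness(x, rho):
    # R^Asc_rho(x): perturb only along the normalized gradient.
    g = A @ x
    return L(x + rho * g / np.linalg.norm(g)) - L(x)

rho = 1e-2
x = np.array([0.0, 1e-4])  # near the minimizer; gradient points along the lambda_2 direction
print(worst_sharpness(x, rho) / rho**2)   # ≈ lambda_1 / 2 = 1.0
print(ascent_sharpness(x, rho) / rho**2)  # ≈ lambda_2 / 2 = 0.25
```

The ratios R_ρ/ρ² recover the second-order coefficients that, as shown later, decide the explicit bias near minimizers.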
Table 1: Definitions and biases of different notions of sharpness-aware loss (table body omitted; for example, the average-direction bias is min_x Tr(∇²L(x)), Thm G.5). The corresponding sharpness is defined as the difference between the sharpness-aware loss and the original loss. Here λ₁ denotes the largest eigenvalue and λ_min the smallest non-zero eigenvalue.

For further acceleration, Foret et al. (2021); Zheng et al. (2021) omit the gradient through the other occurrence of x and approximate the gradient of ascent-direction sharpness by the gradient taken after one-step ascent, i.e., ∇L^Asc_ρ(x) ≈ ∇L(x + ρ∇L(x)/∥∇L(x)∥₂), and derive the update rule of SAM, where η is the learning rate:

Sharpness-Aware Minimization (SAM): x(t+1) = x(t) − η∇L(x(t) + ρ∇L(x(t))/∥∇L(x(t))∥₂). (3)

Intriguingly, the generalization bound of SAM upper bounds the generalization error by the third notion of sharpness, the average-direction sharpness R^Avg_ρ(x), defined formally below:

R^Avg_ρ(x) = L^Avg_ρ(x) − L(x), where L^Avg_ρ(x) = E_{g∼N(0,I)} L(x + ρg/∥g∥₂). (4)

The worst-direction sharpness is an upper bound of the average-direction sharpness, and thus it yields a looser bound on the generalization error. In other words, the generalization theory in Foret et al. (2021); Wu et al. (2020) in fact motivates us to directly minimize the average-direction sharpness (as opposed to the worst-direction sharpness that SAM intends to optimize). In this paper, we analyze the biases introduced by penalizing these various notions of sharpness as well as the bias of SAM (Equation 3). Our analysis of SAM is performed for small perturbation radius ρ and learning rate η under the setting where the minimizers of the loss form a manifold, following the setup of Fehrman et al. (2020); Li et al. (2021). In particular, we make the following theoretical contributions.

1. We prove that full-batch SAM indeed minimizes worst-direction sharpness (Theorem 4.5).

2. Surprisingly, when the batch size is 1, SAM minimizes average-direction sharpness (Theorem 5.4).

3.
We provide a characterization (Theorems 4.2 and 5.3) of which minimizers a number of sharpness regularizers bias towards (including all three notions of sharpness in Table 1), when the perturbation radius ρ goes to zero. Surprisingly, both heuristic approximations made in the derivation of SAM lead to inaccurate conclusions: (1) minimizing worst-direction sharpness and minimizing ascent-direction sharpness induce different biases among minimizers, and (2) SAM does not minimize ascent-direction sharpness. The key mechanism behind the bias of SAM is the alignment between the gradient and the top eigenspace of the Hessian of the original loss in the later phase of training: the angle between them decreases gradually to the level of O(ρ). It turns out that the worst-direction sharpness starts to decrease once such alignment is established (see Section 4.3). Interestingly, this alignment is not implied by the minimization problem (2); rather, it is an implicit property of the specific update rule of SAM. It holds for SAM with full batch and for SAM with batch size one, but does not necessarily hold in the general mini-batch case.
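For reference, the SAM update (3) analyzed throughout the paper can be sketched in a few lines; the quadratic test loss and hyperparameters below are our own illustrative choices:

```python
import numpy as np

def sam_step(x, grad_fn, eta, rho):
    """One full-batch SAM step (Equation 3): descend using the gradient
    evaluated after a normalized ascent step of radius rho."""
    g = grad_fn(x)
    x_adv = x + rho * g / np.linalg.norm(g)  # ascent along the gradient direction
    return x - eta * grad_fn(x_adv)          # descend with the perturbed gradient

# Toy run on L(x) = 0.5 x^T A x with A = diag(2, 1).
A = np.diag([2.0, 1.0])
x = np.array([1.0, 1.0])
for _ in range(5000):
    x = sam_step(x, lambda z: A @ z, eta=0.05, rho=0.1)
print(np.linalg.norm(x))  # settles at a small nonzero norm rather than converging to 0
```

Unlike plain gradient descent, the iterates do not converge to the minimizer but hover around it; the paper's analysis shows this hovering is exactly what produces the sharpness-reduction effect.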

2. RELATED WORKS

Sharpness and Generalization. The study of the connection between sharpness and generalization can be traced back to Hochreiter et al. (1997). Zhuang et al. (2022) propose a variant of SAM which improves generalization by simultaneously optimizing the surrogate gap and the sharpness-aware loss. Zhao et al. (2022) propose to improve generalization by penalizing the gradient norm; their algorithm can be viewed as a generalization of SAM. Andriushchenko et al. (2022) study a variant of SAM where the ascent step is ρ∇L(x) instead of ρ∇L(x)/∥∇L(x)∥₂. They show that for a simple model this variant has a stronger regularization effect when the batch size is 1 compared to the full-batch case, and argue that this might explain why SAM generalizes better with small batch sizes. More related works are discussed in Appendix A.

3. NOTATIONS AND ASSUMPTIONS

For any natural number k, we say a function is C^k if it is k-times continuously differentiable, and C^{k,1} if additionally its kth-order derivatives are locally Lipschitz. We say a subset of ℝ^D is compact if each of its open covers has a finite subcover; it is well known that a subset of ℝ^D is compact if and only if it is closed and bounded. For any symmetric positive semidefinite matrix A ∈ ℝ^{D×D}, define {λᵢ(A), vᵢ(A)}_{i∈[D]} as its eigenvalues and corresponding eigenvectors, satisfying λ₁(A) ≥ λ₂(A) ≥ ... ≥ λ_D(A) and ∥vᵢ(A)∥₂ = 1. For any mapping F, we define ∂F(x) as the Jacobian, where [∂F(x)]_{ij} = ∂ⱼFᵢ(x); thus the directional derivative of F along the vector u at x can be written as ∂F(x)u. We further define the second-order directional derivative of F along the vectors u and v at x as ∂²F(x)[u, v] ≜ ∂(∂F • u)(x)v, that is, the directional derivative of ∂F • u along the vector v at x. Given a C^1 submanifold (Definition C.1) Γ of ℝ^D and a point x ∈ Γ, define P_{x,Γ} as the projection matrix onto the normal space of Γ at x, and P⊥_{x,Γ} = I_D − P_{x,Γ}. We fix our initialization as x_init and our loss function as L : ℝ^D → ℝ. Given the loss function, its gradient flow is denoted by the mapping ϕ : ℝ^D × [0, ∞) → ℝ^D. Here, ϕ(x, τ) denotes the iterate at time τ of the gradient flow starting at x and is defined as the unique solution of ϕ(x, τ) = x − ∫₀^τ ∇L(ϕ(x, t)) dt for all x ∈ ℝ^D. We further define the limiting map Φ as Φ(x) = lim_{τ→∞} ϕ(x, τ), that is, Φ(x) denotes the convergent point of the gradient flow starting from x. When L(x) is small, Φ(x) and x are near each other; hence in our analysis we regularly use Φ(x(t)) as a surrogate for x(t) when analyzing the dynamics of x(t). Lemma 3.1 is an important property of Φ from Li et al. (2021) (Lemma C.2), which is used repeatedly in our analysis; its proof is given in Appendix F.
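Lemma 3.1 can be checked by finite differences on a toy loss (our own construction): perturbing x along ∇L(x) keeps it on the same gradient-flow trajectory, so the landing point Φ(x) does not move, whereas perturbing in a generic direction does move it.

```python
import numpy as np

def grad_L(z):
    # L(u, v) = 0.5 * (v - u**2)**2, minimized on the 1-D manifold {v = u**2}.
    u, v = z
    r = v - u * u
    return np.array([-2.0 * u * r, r])

def Phi(z, step=1e-3, iters=20_000):
    # Approximate the gradient-flow limit by many small gradient-descent steps.
    z = np.array(z, dtype=float)
    for _ in range(iters):
        z -= step * grad_L(z)
    return z

x = np.array([0.6, 1.0])
h = 1e-4
# Directional derivative of Phi along grad L(x): should (nearly) vanish, per Lemma 3.1.
d_grad = (Phi(x + h * grad_L(x)) - Phi(x - h * grad_L(x))) / (2 * h)
# Contrast: along e_1 the landing point genuinely moves.
e1 = np.array([1.0, 0.0])
d_e1 = (Phi(x + h * e1) - Phi(x - h * e1)) / (2 * h)
print(np.linalg.norm(d_grad), np.linalg.norm(d_e1))
```

The residual in the first quantity comes only from the Euler discretization of the flow; the second stays of order one.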
Though our analysis for the full-batch setting is performed under the general and abstract Assumption 3.2, our analysis for the stochastic setting uses a more concrete one, Setting 5.1, under which we can prove that Assumption 3.2 holds (see Theorem 5.2).

Definition 3.3 (Attraction Set). Let U be the attraction set of Γ under gradient flow, that is, a neighborhood of Γ containing all points starting from which the gradient flow w.r.t. loss L converges to some point in Γ; mathematically, U ≜ {x ∈ ℝ^D | Φ(x) exists and Φ(x) ∈ Γ}.

It can be shown that for a minimum-loss manifold, the rank of the Hessian plus the dimension of the manifold is at most the ambient dimension D, and thus our assumption about the Hessian rank essentially says the rank is maximal.

4. EXPLICIT AND IMPLICIT BIAS IN THE FULL-BATCH SETTING

In this section, we present our main results in the full-batch setting. Section 4.1 characterizes the explicit biases of worst-direction, ascent-direction, and average-direction sharpness. In particular, we show that ascent-direction sharpness and worst-direction sharpness have different explicit biases. However, it turns out that the explicit bias of ascent-direction sharpness is not the effective bias of SAM (which approximately optimizes the ascent-direction sharpness), because the particular implementation of SAM imposes additional, different biases; this is the main focus of Section 4.2. There we provide our main theorem in the full-batch setting: SAM implicitly minimizes the worst-direction sharpness. We prove this by characterizing its limiting dynamics, as the perturbation radius ρ and learning rate η go to 0, as a Riemannian gradient flow with respect to the top eigenvalue of the Hessian of the loss on the manifold of local minimizers. In Section 4.3, we sketch the proof of the implicit bias of SAM and identify a key property behind it, which we call the implicit alignment between the gradient and the top eigenvector of the Hessian.

4.1. WORST-AND ASCENT-DIRECTION SHARPNESS HAVE DIFFERENT EXPLICIT BIASES

In this subsection, we show that the explicit biases of the three notions of sharpness are all different under Assumption 3.2. We first recap the heuristic derivation of the ascent-direction sharpness R^Asc_ρ. The intuition for approximating R^Max_ρ by R^Asc_ρ comes from the following Taylor expansions (Foret et al., 2021; Wu et al., 2020). For any compact set and sufficiently small ρ, the following holds uniformly for all x in the compact set:

R^Max_ρ(x) = sup_{∥v∥₂≤1} L(x + ρv) − L(x) = sup_{∥v∥₂≤1} [ρv⊤∇L(x) + (ρ²/2) v⊤∇²L(x)v + O(ρ³)], (5)

R^Asc_ρ(x) = L(x + ρ∇L(x)/∥∇L(x)∥₂) − L(x) = ρ∥∇L(x)∥₂ + (ρ²/2) ∇L(x)⊤∇²L(x)∇L(x)/∥∇L(x)∥₂² + O(ρ³). (6)

Here, the preference among the local or global minima is what we are mainly concerned with. Since sup_{∥v∥₂≤1} v⊤∇L(x) = ∥∇L(x)∥₂ when ∥∇L(x)∥₂ > 0, the leading terms in Equations 5 and 6 are both the first-order term ρ∥∇L(x)∥₂, and are the same. However, it is erroneous to think that the first-order term decides the explicit bias: the first-order term ρ∥∇L(x)∥₂ vanishes at the local minimizers of the loss L, so the second-order term becomes the leading term there. Any global minimizer x of the original loss L is an O(ρ²)-approximate minimizer of the sharpness-aware loss because ∇L(x) = 0. Therefore, the optimization error of the sharpness-aware loss needs to be of order ρ² before we can guarantee the second-order terms in Equation 5 and/or Equation 6 to be non-trivially small. Our main result in this subsection (Theorem 4.2) gives an explicit characterization of this phenomenon. The corresponding explicit biases for each type of sharpness are given below in Definition 4.1; as we will see later, they can be derived from a general notion of limiting regularizer (Definition 4.3).

Definition 4.1. For x ∈ ℝ^D, we define S^Max(x) = λ₁(∇²L(x))/2, S^Asc(x) = λ_M(∇²L(x))/2, and S^Avg(x) = Tr(∇²L(x))/(2D).

Theorem 4.2. Let U′ ⊆ U be a bounded open set with U′ ∩ Γ ≠ ∅.
For any type ∈ {Max, Asc, Avg} and any optimality gap Δ > 0, there is a function ϵ : ℝ⁺ → ℝ⁺ with lim_{ρ→0} ϵ(ρ) = 0, such that for all sufficiently small ρ > 0 and all u ∈ U′ satisfying

L(u) + R^type_ρ(u) − inf_{x∈U′} [L(x) + R^type_ρ(x)] ≤ Δρ²,

it holds that L(u) − inf_{x∈U′} L(x) ≤ (Δ + ϵ(ρ))ρ² and that S^type(u) − inf_{x∈U′∩Γ} S^type(x) ∈ [−ϵ(ρ), Δ + ϵ(ρ)].

Theorem 4.2 suggests a sharp phase transition in the property of the solution of min_x L(x) + R^type_ρ(x) when the optimization error drops from ω(ρ²) to O(ρ²). When the optimization error is ω(ρ²), no regularization effect occurs and any approximate minimizer satisfies the requirement. When the error becomes O(ρ²), there is a non-trivial restriction on the second-order term. Next we give a heuristic derivation of the S^type defined above. First, for worst- and average-direction sharpness, the calculations are fairly straightforward and well known in the literature, and we sketch them here. In the limit of perturbation radius ρ → 0, the minimizer of the sharpness-aware loss converges to Γ, the manifold of minimizers of the original loss L. Thus, to decide to which x ∈ Γ the minimizers converge as ρ → 0, it suffices to take the Taylor expansion of L^Max_ρ or L^Avg_ρ at each x ∈ Γ and compare the second-order coefficients; e.g., we have R^Avg_ρ(x) = (ρ²/(2D)) Tr(∇²L(x)) + O(ρ³) and R^Max_ρ(x) = (ρ²/2) λ₁(∇²L(x)) + O(ρ³) by Equation 5. However, the analysis for ascent-direction sharpness is trickier because R^Asc_ρ(x) = ∞ for any x ∈ Γ and thus R^Asc_ρ is not continuous around such x. We therefore have to aggregate information from the neighborhood to capture the explicit bias of R^Asc_ρ around the manifold Γ. This motivates the following definition of the limiting regularizer, which allows us to compare the regularization strength of R_ρ around each point on the manifold Γ as ρ → 0. Definition 4.3 (Limiting Regularizer).
We define the limiting regularizer of {R_ρ} as the function S : Γ → ℝ,

S(x) = lim_{ρ→0} lim_{r→0} inf_{∥x′−x∥₂≤r} R_ρ(x′)/ρ².

To minimize R^Asc_ρ around x, we can pick x′ → x satisfying ∥∇L(x′)∥₂ → 0 while remaining strictly nonzero. By Equation 6, we then have R^Asc_ρ(x′) ≈ (ρ²/2) · ∇L(x′)⊤∇²L(x)∇L(x′)/∥∇L(x′)∥₂². The crucial step of the proof is that, because of Assumption 3.2, ∇L(x′)/∥∇L(x′)∥₂ must almost lie in the column span of ∇²L(x), which implies that

inf_{x′} ∇L(x′)⊤∇²L(x)∇L(x′)/∥∇L(x′)∥₂² → λ_M(∇²L(x)) as x′ → x,

where rank(∇²L(x)) = M by Assumption 3.2. This alignment property between the gradient and the column space of the Hessian can be checked directly for any non-negative quadratic function; the maximal Hessian rank assumption in Assumption 3.2 ensures that it extends to general losses. We defer the proof of Theorem 4.2 to Appendix G.1, where we develop a sufficient condition under which the notion of limiting regularizer characterizes the explicit bias of R_ρ as ρ → 0.
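This mechanism can be checked numerically on a rank-deficient quadratic (our own toy example): approaching a point on the manifold along different nonzero eigendirections gives different values of R^Asc_ρ/ρ², and the infimum is attained along the smallest nonzero eigenvalue.

```python
import numpy as np

# L(x) = 0.5 x^T A x with rank-2 Hessian in D = 3: nonzero eigenvalues 2 and 0.5,
# so the manifold of minimizers is the e3-axis, M = 2, and lambda_M = 0.5.
A = np.diag([2.0, 0.5, 0.0])

def L(x):
    return 0.5 * x @ A @ x

def R_asc(x, rho):
    g = A @ x
    return L(x + rho * g / np.linalg.norm(g)) - L(x)

rho, t = 1e-3, 1e-8  # t -> 0 much faster than rho: gradient tiny but nonzero
r1 = R_asc(t * np.array([1.0, 0.0, 0.0]), rho) / rho**2  # approach along lambda = 2
r2 = R_asc(t * np.array([0.0, 1.0, 0.0]), rho) / rho**2  # approach along lambda = 0.5
print(r1, r2)  # ≈ 1.0 and ≈ 0.25: the infimum picks out lambda_M / 2
```

Approaching along the zero eigendirection is excluded since the gradient vanishes identically there, matching the requirement ∥∇L(x′)∥₂ > 0.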

4.2. SAM PROVABLY DECREASES WORST-DIRECTION SHARPNESS

Though ascent-direction sharpness has a different explicit bias from worst-direction sharpness, in this subsection we show that, surprisingly, SAM (Equation 3), a heuristic method designed to minimize ascent-direction sharpness, provably decreases worst-direction sharpness. The main result here is an exact characterization of the trajectory of SAM (Equation 3) via the following ordinary differential equation (ODE), when the learning rate η and perturbation radius ρ are small and the initialization x(0) = x_init is in U, the attraction set of the manifold Γ:

X(τ) = X(0) − (1/2) ∫₀^τ P⊥_{X(s),Γ} ∇λ₁(∇²L(X(s))) ds, X(0) = Φ(x_init). (7)

We assume the ODE (Equation 7) has a solution until time T₃, that is, Equation 7 holds for all τ ≤ T₃. We call the solution of Equation 7 the limiting flow of SAM; it is exactly the Riemannian gradient flow on the manifold Γ with respect to the loss λ₁(∇²L(·)). In other words, the ODE (Equation 7) is essentially projected gradient descent with loss λ₁(∇²L(·)) on the constraint set Γ and an infinitesimal learning rate. Note that λ₁(∇²L(x)) may not be differentiable at x if λ₁(∇²L(x)) = λ₂(∇²L(x)); thus, to ensure Equation 7 is well-defined, we assume there is a positive eigengap for L on Γ.

Assumption 4.4. For all x ∈ Γ, there is a positive eigengap, i.e., λ₁(∇²L(x)) > λ₂(∇²L(x)).

Theorem 4.5 is the main result of this section, which is a direct combination of Theorems I.1 and I.3; the proof is deferred to Appendix I.3.

Theorem 4.5 (Main). Let {x(t)} be the iterates of full-batch SAM (Equation 3) with x(0) = x_init ∈ U.
Under Assumptions 3.2 and 4.4, for all η, ρ such that η ln(1/ρ) and ρ/η are sufficiently small, the dynamics of SAM can be characterized in the following two phases:

• Phase I (Theorem I.1): Full-batch SAM (Equation 3) follows the gradient flow with respect to L until entering an O(ηρ) neighborhood of the manifold Γ in O(ln(1/ρ)/η) steps.

• Phase II (Theorem I.3): Under a mild non-degeneracy assumption (Assumption I.2) on the initial point of Phase II, full-batch SAM (Equation 3) tracks the solution X of Equation 7, the Riemannian gradient flow with respect to the loss λ₁(∇²L(·)), in an O(ηρ) neighborhood of the manifold Γ. Quantitatively, the approximation error between the iterates x and the corresponding limiting flow X is O(η ln(1/ρ)), that is, ∥x(⌈T₃/(ηρ²)⌉) − X(T₃)∥₂ = O(η ln(1/ρ)). Moreover, the angle between ∇L(x(⌈T₃/(ηρ²)⌉)) and the top eigenspace of ∇²L(x(⌈T₃/(ηρ²)⌉)) is O(ρ).

Theorem 4.5 shows that SAM decreases the largest eigenvalue of the Hessian of the loss locally around the manifold of local minimizers. Phase I uses standard approximation analysis as in Hairer et al. (2008). In Phase II, as T₃ is arbitrary, the approximation and alignment properties hold simultaneously for all X(t) along the trajectory, provided that η ln(1/ρ) and ρ/η are sufficiently small. The subtlety here is that the threshold of being "sufficiently small" for η ln(1/ρ) and ρ/η depends on T₃, and it decreases as T₃ → 0 or T₃ → ∞. We defer the proof of Theorem 4.5 to Appendix I. As a corollary of Theorem 4.5, we can also show that the largest eigenvalue along the limiting flow closely tracks the worst-direction sharpness of the iterates.

Corollary 4.6. In the setting of Theorem 4.5, the difference between the worst-direction sharpness of the iterates and the correspondingly scaled largest eigenvalue along the limiting flow is at most O(ηρ² ln(1/ρ)). That is,

|R^Max_ρ(x(⌈T₃/(ηρ²)⌉)) − ρ²λ₁(∇²L(X(T₃)))/2| = O(ηρ² ln(1/ρ)).
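The two-phase picture can be illustrated on a minimal example of our own: L(x, y) = (1 + x²)y²/2 has minimizer manifold Γ = {y = 0}, on which the top Hessian eigenvalue is λ₁ = 1 + x², so the limiting flow (7) reduces to ẋ = −x. Running the discrete update (3) for about 1/(ηρ²) steps should then visibly decrease λ₁:

```python
import numpy as np

def grad(z):
    # L(x, y) = 0.5 * (1 + x**2) * y**2; Gamma = {y = 0}, lambda_1 = 1 + x**2 on Gamma
    x, y = z
    return np.array([x * y * y, (1.0 + x * x) * y])

eta, rho = 0.05, 0.01
z = np.array([1.0, 0.1])                 # lambda_1 at the projection of the start: 2.0
steps = int(1.0 / (eta * rho**2))        # roughly time 1 of the limiting flow
for _ in range(steps):
    g = grad(z)
    z = z - eta * grad(z + rho * g / np.linalg.norm(g))  # full-batch SAM (3)
print(abs(z[1]), 1.0 + z[0] ** 2)  # y pinned near 0; top eigenvalue decreased from 2.0
```

Phase I (y shrinking toward the manifold) takes a few hundred steps; the slow drift of x, and hence of λ₁ = 1 + x², happens on the much longer Θ(1/(ηρ²)) timescale, matching the theorem.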
Since η ln(1/ρ) is assumed to be sufficiently small, the error O(ηρ² ln(1/ρ)) is only o(ρ²), meaning that penalizing the top eigenvalue on the manifold does lead to a non-trivial reduction of worst-direction sharpness, in the sense of Section 4.1. Hence we can show that full-batch SAM (Equation 3) provably minimizes worst-direction sharpness around the manifold if we additionally assume that the limiting flow converges to a minimizer of the top eigenvalue of the Hessian, as in the following corollary.

Corollary 4.7. Under Assumptions 3.2 and 4.4, define U′ as in Theorem 4.2 and suppose X(∞) = lim_{t→∞} X(t) exists and is a minimizer of λ₁(∇²L(x)) in U′ ∩ Γ. Then for all ϵ > 0, there exists T_ϵ > 0 such that for all ρ, η with η ln(1/ρ) and ρ/η sufficiently small,

L^Max_ρ(x(⌈T_ϵ/(ηρ²)⌉)) ≤ ϵρ² + inf_{x∈U′} L^Max_ρ(x).

We defer the proofs of Corollaries 4.6 and 4.7 to the appendix. We now sketch the proof of Theorem 4.5; the high-level idea is to use Φ(x(t)) as a proxy for x(t) and study the dynamics of Φ(x(t)) via Taylor expansion. Following the analysis in Arora et al. (2022), we can show Equation 9 below using Taylor expansion (we defer its intuitive derivation to Appendix I.5), starting from which we discuss the key innovation of this paper regarding implicit Hessian-gradient alignment:

Φ(x(t+1)) − Φ(x(t)) = −(ηρ²/2) ∂Φ(x(t)) ∂²(∇L)(x(t))[∇L(x(t))/∥∇L(x(t))∥₂, ∇L(x(t))/∥∇L(x(t))∥₂] + O(η²ρ² + ηρ³). (9)

Now, to understand how Φ(x(t)) moves over time, we need to understand which direction the RHS of Equation 9 corresponds to; we will prove that it corresponds to the Riemannian gradient of λ₁(∇²L(·)) at x = Φ(x(t)). To achieve this, the key is to understand the direction ∇L(x(t))/∥∇L(x(t))∥₂. It turns out that ∇L(x(t))/∥∇L(x(t))∥₂ is close to the top eigenvector of the Hessian up to a sign flip, that is, ∥∇L(x(t))/∥∇L(x(t))∥₂ − s·v₁(∇²L(x(t)))∥₂ ≤ O(ρ) for some s ∈ {−1, 1}.
We call this phenomenon Hessian-gradient alignment and discuss it in more detail at the end of this subsection. Using this property, we can proceed with the derivation (detailed in Appendix I.5):

Φ(x(t+1)) − Φ(x(t)) = −(ηρ²/2) ∂Φ(Φ(x(t))) ∇λ₁(∇²L(Φ(x(t)))) + O(η²ρ² + ηρ³). (10)

Implicit Hessian-gradient Alignment. It remains to explain why the gradient implicitly aligns with the top eigenvector of the Hessian, which is the key component of the Phase II analysis. The proof strategy is to first show alignment for a quadratic loss and then generalize the argument to general losses satisfying Assumption 3.2. Below we give the formal statement of the implicit alignment for a quadratic loss (Theorem 4.8) and defer the result for the general case (Lemma I.19) to the appendix. Note that this alignment property is an implicit property of the SAM algorithm, as it is not explicitly enforced by the objective that SAM is intended to minimize, L^Asc_ρ. Indeed, optimizing L^Asc_ρ would instead explicitly align the gradient with the eigenvector of the smallest nonzero eigenvalue (see the proof of Theorem G.5)!

Theorem 4.8. Suppose A is a symmetric positive definite matrix with a unique top eigenvalue. Consider running full-batch SAM (Equation 3) on the loss L(x) := (1/2)x⊤Ax, i.e., Equation 11 below:

x(t+1) = x(t) − ηA(x(t) + ρAx(t)/∥Ax(t)∥₂). (11)

Then, provided ηλ₁(A) < 1, for almost every x(0), x(t) converges in direction to v₁(A) up to a sign flip, and lim_{t→∞} ∥x(t)∥₂ = ηρλ₁(A)/(2 − ηλ₁(A)).

The proof of Theorem 4.8 relies on a two-phase analysis of the behavior of Equation 11: we first show that x(t) enters an invariant set from (almost) any initialization, and in the second phase we construct a potential function to show alignment. The proof is deferred to Appendix H. Below we briefly discuss why the general-loss case is closely related to the quadratic case.
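Theorem 4.8 is easy to verify numerically. The following sketch (our own choice of A, η, ρ) runs update (11) and checks both the alignment with v₁(A) and the predicted limiting norm ηρλ₁/(2 − ηλ₁):

```python
import numpy as np

A = np.diag([2.0, 1.0, 0.5])   # positive definite, unique top eigenvalue lambda_1 = 2
eta, rho = 0.05, 0.1           # eta * lambda_1 = 0.1 < 1

rng = np.random.default_rng(0)
x = rng.standard_normal(3)     # generic initialization
for _ in range(10_000):
    Ax = A @ x
    x = x - eta * A @ (x + rho * Ax / np.linalg.norm(Ax))  # update (11)

lam1 = 2.0
print(abs(x[0]) / np.linalg.norm(x))                           # ≈ 1: aligned with v_1 up to sign
print(np.linalg.norm(x), eta * rho * lam1 / (2 - eta * lam1))  # both ≈ 0.00526
```

At the limit, the aligned iterate flips sign every step (a period-2 cycle of constant norm), which is why only the norm, not the iterate itself, converges.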
We claim that, for a general loss, the analog of Equation 11 is the following update rule for the gradient:

∇L(x(t+1)) = ∇L(x(t)) − η∇²L(x(t)) (∇L(x(t)) + ρ∇²L(x(t))∇L(x(t))/∥∇L(x(t))∥₂) + O(ηρ²). (12)

Note that in the quadratic case, where ∇L(x) = Ax and ∇²L(x) = A, Equation 12 is equivalent to Equation 11: the two sides differ only by a multiplicative factor of A. We defer the intuitive derivation of Equation 12 to Appendix I.5.
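As a sanity check of Equation 12, one can compare both of its sides after a single SAM step on a small non-quadratic loss (our own toy choice of A and cubic term); the discrepancy should be of higher order in η and ρ:

```python
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])
c = 0.1

def grad(x):   # L(x) = 0.5 x^T A x + c * sum_i x_i^3, a simple non-quadratic loss
    return A @ x + 3 * c * x ** 2

def hess(x):
    return A + np.diag(6 * c * x)

eta, rho = 1e-3, 1e-2
x = np.array([0.3, -0.2])
g = grad(x)
x_next = x - eta * grad(x + rho * g / np.linalg.norm(g))   # one SAM step (Equation 3)

H = hess(x)
pred = g - eta * H @ (g + rho * H @ g / np.linalg.norm(g))  # RHS of Equation 12
print(np.linalg.norm(grad(x_next) - pred))  # higher-order small
```

Halving ρ (or η) shrinks the residual further, consistent with the O(ηρ²) (plus higher order in η) error term.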

5. EXPLICIT AND IMPLICIT BIASES IN THE STOCHASTIC SETTING

In practice, SAM is usually used in the stochastic mini-batch setting, and the test accuracy improves as the batch size decreases (Foret et al., 2021). Towards explaining this phenomenon, Foret et al. (2021) argue intuitively that stochastic SAM minimizes stochastic worst-direction sharpness. Given our results in Section 4, it is natural to ask whether we can justify this intuition by showing Hessian-gradient alignment in the stochastic setting. Unfortunately, such alignment is not possible in the most general setting. Yet when the batch size is 1, we prove rigorously in Section 5.2 that stochastic SAM minimizes stochastic worst-direction sharpness, the expectation of the worst-direction sharpness of the loss on each data point (defined in Section 5.1); this is the main result of this section. We stress that stochastic worst-direction sharpness has a different explicit bias from the worst-direction sharpness that full-batch SAM implicitly penalizes: as the perturbation radius ρ → 0, the former corresponds to Tr(∇²L(·)), the same as average-direction sharpness, while the latter corresponds to λ₁(∇²L(·)). Below we introduce our setting for SAM with batch size 1, or 1-SAM; we still require Assumption 3.2 in this section. We first analyze the explicit biases of the stochastic ascent- and worst-direction sharpness in Section 5.1 via the tools developed in Section 4.1; it turns out they are all proportional to the trace of the Hessian as ρ → 0. In Section 5.2, we show that 1-SAM penalizes the trace of the Hessian. We now formally state our setting for stochastic loss with batch size one.

Setting 5.1. Let the total number of data points be M. Let f_k(x) be the model output on the k-th data point, where f_k is a C^4-smooth function, and let y_k be the k-th label, for k = 1, ..., M. We define the loss on the k-th data point as L_k(x) = ℓ(f_k(x), y_k) and the total loss as L = (1/M) Σ_{k=1}^M L_k, where the function ℓ(y′, y) is C^4-smooth in y′.
We also assume that for any y ∈ ℝ, arg min_{y′∈ℝ} ℓ(y′, y) = y and ∂²ℓ(y′, y)/(∂y′)² |_{y′=y} > 0. Finally, we denote by Γ the set of global minimizers of L with full-rank Jacobian, and assume it is non-empty, that is,

Γ ≜ {x ∈ ℝ^D | f_k(x) = y_k, ∀k ∈ [M], and {∇f_k(x)}_{k=1}^M are linearly independent} ≠ ∅.

We remark that given the training data (i.e., {f_k}_{k=1}^M), the Γ defined above equals the full set of global minimizers {x ∈ ℝ^D | f_k(x) = y_k, ∀k ∈ [M]} except for a zero-measure set of labels (y_k)_{k=1}^M when the f_k are C^∞-smooth, by Sard's theorem. Thus Cooper (2018) argued that the global minimizers generically form a differentiable manifold if we allow perturbation of the labels. In this work we do not make such an assumption on the labels; instead, we consider Γ, the subset of global minimizers with full-rank Jacobian. A standard application of the implicit function theorem implies that Γ as defined in Setting 5.1 is indeed a manifold (see Theorem 5.2, whose proof is deferred to Appendix E.1).

Theorem 5.2. The loss L, set Γ, and integer M defined in Setting 5.1 satisfy Assumption 3.2.
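A minimal concrete instance of Setting 5.1 (our own construction, with linear models and squared loss) makes Theorem 5.2 easy to check numerically:

```python
import numpy as np

# M = 2 data points, D = 3 parameters, linear models f_k(x) = w_k . x, and
# squared loss l(y', y) = 0.5 * (y' - y)**2 (so argmin_{y'} l = y and l'' = 1 > 0).
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])   # rows w_k; the gradients grad f_k = w_k are independent
y = np.array([1.0, 2.0])
M, D = W.shape

# On the zero-loss set, the total Hessian is (1/M) * sum_k grad f_k grad f_k^T.
H = W.T @ W / M

# Gamma = {x : W x = y} is a (D - M)-dimensional affine subspace (a line in R^3).
x_star, *_ = np.linalg.lstsq(W, y, rcond=None)
print(np.allclose(W @ x_star, y))       # x_star lies on Gamma
print(np.linalg.matrix_rank(H), M)      # Hessian rank equals M, as Assumption 3.2 requires
```

Here the manifold dimension D − M = 1 plus the Hessian rank M = 2 equals D = 3, illustrating the maximal-rank remark after Definition 3.3.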


We use 1-SAM as a shorthand for SAM on a stochastic loss with batch size 1, given in Equation 13 below, where k_t is sampled i.i.d. from the uniform distribution on [M].

1-SAM:

x(t+1) = x(t) − η∇L_{k_t}(x(t) + ρ∇L_{k_t}(x(t))/∥∇L_{k_t}(x(t))∥₂). (13)

The three notions of stochastic sharpness are defined as the expectations over data of the corresponding per-sample sharpnesses, E_k[R^Max_{k,ρ}], E_k[R^Asc_{k,ρ}], and E_k[R^Avg_{k,ρ}]. Unlike in the full-batch setting, these three sharpness notions have the same explicit bias; more precisely, they have the same limiting regularizers (up to scaling factors).

Theorem 5.3. The limiting regularizers of the three notions of stochastic sharpness, denoted S^Max, S^Asc, S^Avg, satisfy S^Max(x) = S^Asc(x) = D·S^Avg(x) = Tr(∇²L(x))/2. Furthermore, define U′ as in Theorem 4.2. For any type ∈ {Max, Asc, Avg}, if for some u ∈ U′,

L(u) + E_k[R^type_{k,ρ}(u)] ≤ inf_{x∈U′} [L(x) + E_k[R^type_{k,ρ}(x)]] + ϵρ²,

then L(u) − inf_{x∈U′} L(x) ≤ ϵρ² + o(ρ²) and S^type(u) − inf_{x∈U′∩Γ} S^type(x) ≤ ϵ + o(1).

We defer the proof of Theorem 5.3 to Appendix G.4. Unlike in the full-batch setting, where the limiting regularizers of ascent-direction and worst-direction sharpness differ, here they coincide: the Hessian of each individual loss L_k has rank 1, so its maximum and minimum nonzero eigenvalues are equal, and the average of the limiting regularizers equals the limiting regularizer of the averaged regularizer by definition.
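The rank-1 mechanism behind Theorem 5.3 can be sanity-checked numerically (with hypothetical per-sample gradients g_k of our own choosing): for H_k = g_k g_kᵀ we have λ₁(H_k) = Tr(H_k), so the expected per-sample top eigenvalue equals the trace of the averaged Hessian.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((50, 4))   # 50 hypothetical per-sample gradients in D = 4

# Per-sample Hessians at a minimizer are rank-1: H_k = g_k g_k^T, so
# lambda_1(H_k) = ||g_k||^2 = Tr(H_k).
top_eigs = [np.linalg.eigvalsh(np.outer(g, g))[-1] for g in G]
H_avg = sum(np.outer(g, g) for g in G) / len(G)

print(np.mean(top_eigs), np.trace(H_avg))  # equal: E_k[lambda_1(H_k)] = Tr(E_k[H_k])
```

For full-batch SAM the analogous comparison would be λ₁(H_avg) versus Tr(H_avg), which are generally far apart; the equality above is special to per-sample (batch size 1) sharpness.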

5.2. STOCHASTIC SAM MINIMIZES AVERAGE-DIRECTION SHARPNESS

This subsection shows that the implicit bias of 1-SAM (Equation 13) is to minimize the average-direction sharpness for small perturbation radius ρ and learning rate η; by Theorem 5.3, this is the same bias as that of all three notions of stochastic sharpness. As an analog of the analysis in Section 4.3, which shows that full-batch SAM minimizes worst-direction sharpness, the analysis in this section conceptually shows that 1-SAM minimizes stochastic worst-direction sharpness. Mathematically, we prove that for sufficiently small η and ρ, the trajectory of 1-SAM tracks the following Riemannian gradient flow (Equation 14) with respect to the limiting regularizer Tr(∇²L(·)) on the manifold, and thus 1-SAM penalizes stochastic worst-direction sharpness (of batch size 1). We assume the ODE (Equation 14) has a solution until time T₃:

X(τ) = X(0) − (1/2) ∫₀^τ P⊥_{X(s),Γ} ∇Tr(∇²L(X(s))) ds, X(0) = Φ(x_init). (14)

Theorem 5.4. Let {x(t)} be the iterates of 1-SAM (Equation 13) with x(0) = x_init ∈ U. Then under Setting 5.1, for almost every x_init, for all η and ρ such that (η + ρ) ln(1/(ηρ)) is sufficiently small, with probability at least 1 − O(ρ) over the randomness of the algorithm, the dynamics of 1-SAM (Equation 13) can be split into two phases:

• Phase I (Theorem J.1): 1-SAM follows the gradient flow with respect to L until entering an Õ(ηρ) neighborhood of the manifold Γ in O(ln(1/(ρη))/η) steps.

• Phase II (Theorem J.2): 1-SAM tracks X, the solution of Equation 14, the Riemannian gradient flow with respect to Tr(∇²L(·)), in an Õ(ηρ) neighborhood of the manifold Γ. Quantitatively, the approximation error between the iterates x and the corresponding limiting flow X is Õ(η^{1/2} + ρ), that is, ∥x(⌈T₃/(ηρ²)⌉) − X(T₃)∥₂ = Õ(η^{1/2} + ρ).
The high-level intuition for the Phase II result of Theorem 5.4 is that Hessian-gradient alignment holds for every stochastic loss L_k along the trajectory of 1-SAM. Therefore, by Taylor expansion (the same argument as in Section 4.3), at each step Φ(x(t)) moves towards the negative (Riemannian) gradient of λ₁(∇²L_{k_t}), where k_t is the index of the randomly sampled data point, i.e., towards the negative gradient of the limiting regularizer of the worst-direction sharpness of L_{k_t}. Averaged over a long time, the moving direction becomes the negative (Riemannian) gradient of E_{k_t}[λ₁(∇²L_{k_t})], which is the limiting regularizer of stochastic worst-direction sharpness and equals Tr(∇²L) by Theorem 5.3. The reason Hessian-gradient alignment holds under Setting 5.1 is that the Hessian of each stochastic loss L_k at a minimizer p ∈ Γ,

∇²L_k(p) = ∂²ℓ(y′, y_k)/(∂y′)² |_{y′=f_k(p)} · ∇f_k(p)(∇f_k(p))⊤, (15)

is exactly rank-1, which forces the gradient ∇L_k(x) ≈ ∇²L_k(Φ(x))(x − Φ(x)) to (almost) lie in the top (which is also the unique) eigenspace of ∇²L_k(Φ(x)). Lemma 5.5 formally states this property.

Lemma 5.5. Under Setting 5.1, for any p ∈ Γ and k ∈ [M], it holds that ∇f_k(p) ≠ 0 and that there is an open set V containing p such that

∀x ∈ V, ∇L_k(x) ≠ 0 ⟹ ∃s ∈ {−1, 1}, ∇L_k(x)/∥∇L_k(x)∥₂ = s·∇f_k(p)/∥∇f_k(p)∥₂ + O(∥x − p∥₂).

Corollaries 5.6 and 5.7 below are the stochastic counterparts of Corollaries 4.6 and 4.7: the trace of the Hessian closely tracks the stochastic worst-direction sharpness along the limiting flow (Equation 14), and therefore, when the limiting flow converges to a local minimizer of the trace of the Hessian, 1-SAM (Equation 13) minimizes average-direction sharpness. We defer the proofs of Corollaries 5.6 and 5.7 to Appendix J.4. Corollary 5.6.
Under the conditions of Theorem 5.4, we have that with probability 1 − O(√η + √ρ), the difference between the stochastic worst-direction sharpness of the iterates and the correspondingly scaled trace of the Hessian along the limiting flow is at most O((η^{1/4} + ρ^{1/4})ρ²), that is,

|E_k[R^Max_{k,ρ}(x(⌈T₃/(ηρ²)⌉))] − ρ² Tr(∇²L(X(T₃)))/2| = O((η^{1/4} + ρ^{1/4})ρ²).

Corollary 5.7. Define U′ as in Theorem 4.2, and suppose X(∞) = lim_{t→∞} X(t) exists and is a minimizer of Tr(∇²L(x)) in U′ ∩ Γ. Then for all ϵ > 0, there exists a constant T_ϵ > 0, such that for all ρ, η with (η + ρ) ln(1/ηρ) sufficiently small, we have that with probability 1 − O(√η + √ρ),

E_k[L^Max_{k,ρ}(x(⌈T_ϵ/(ηρ²)⌉))] ≤ ϵρ² + inf_{x∈U′} E_k[L^Max_{k,ρ}(x)].
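The identity from Theorem 5.3 invoked above — that with rank-1 per-example Hessians the expected top eigenvalue equals the trace of the averaged Hessian — can be checked numerically. The sketch below uses random rank-1 PSD matrices as stand-in per-example Hessians (a hypothetical instantiation for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 8, 5
# hypothetical rank-1 per-example Hessians H_k = a_k a_k^T (illustration only)
Hs = np.array([np.outer(a, a) for a in rng.normal(size=(M, D))])

mean_top_eig = np.mean([np.linalg.eigvalsh(H)[-1] for H in Hs])  # E_k[lambda_1(H_k)]
trace_of_mean = np.trace(Hs.mean(axis=0))                        # Tr(E_k[H_k])

# for rank-1 PSD matrices lambda_1(H_k) = Tr(H_k), so the two quantities agree
assert np.isclose(mean_top_eig, trace_of_mean)
```

This is exactly why penalizing the stochastic worst-direction sharpness (top eigenvalue per example) amounts to penalizing the trace of the full-batch Hessian.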

6. CONCLUSION

In this work, we have performed a rigorous mathematical analysis of the explicit bias of various notions of sharpness when used as regularizers and of the implicit bias of the SAM algorithm. In particular, we show that the explicit biases of worst-, ascent- and average-direction sharpness around the manifold of minimizers are minimizing the largest eigenvalue, the smallest nonzero eigenvalue, and the trace of the Hessian of the loss function, respectively. We show that in the full-batch setting, SAM provably decreases the largest eigenvalue of the Hessian, while in the stochastic setting with batch size 1, SAM provably decreases the trace of the Hessian. The most interesting future work is to generalize the current analysis for stochastic SAM to arbitrary batch sizes. This is challenging because, without the alignment property which holds automatically with batch size 1, such an analysis essentially requires understanding the stationary distribution of the gradient direction along the SAM trajectory.

A key difference from prior continuous-time analyses (e.g., ..., 2019) is that we focus on a much longer training regime, i.e., T = Θ(η⁻¹ρ⁻²) steps, where the previous continuous-time approximation results no longer hold throughout the entire training. As a result, their continuous approximation is only equivalent to the Phase I dynamics in our Theorems 4.5 and 5.4 and cannot capture the dynamics of SAM in Phase II, when the sharpness-reduction implicit bias happens. The latter requires a more fine-grained analysis to capture the effects of higher-order terms in η and ρ in SAM (Equation 3).

B EXPERIMENTAL DETAILS FOR FIGURE 1

In Figure 1, we choose F₁(x) = x₁² + 6x₂² + 8 and F₂(x) = 4(1 − x₁)² + (1 − x₂)² + 1, which satisfy F₁(x) ≥ 8 > 6 ≥ F₂(x) on [0, 1]². The loss L has a zero-loss manifold {x₃ = x₄ = 0} of codimension M = 2, and the two non-zero eigenvalues of ∇²L at any point x on the manifold are λ₁(∇²L(x)) = F₁(x₁, x₂) and λ₂(∇²L(x)) = F₂(x₁, x₂). As our theory predicts:

1. Full-batch SAM (Equation 3) finds the minimizer with the smallest top eigenvalue F₁(x), which is x₁ = 0, x₂ = 0, x₃ = 0, x₄ = 0;

2. GD on the ascent-direction loss L^Asc_ρ (Equation 2) finds the minimizer with the smallest bottom eigenvalue F₂(x), which is x₁ = 1, x₂ = 1, x₃ = 0, x₄ = 0;

3. Stochastic SAM (Equation 13) (with L₀(x, y) = F₁(x)y₀², L₁(x, y) = F₂(x)y₁²) finds the minimizer with the smallest trace of the Hessian, which is x₁ = 4/5, x₂ = 1/7, x₃ = 0, x₄ = 0.
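The first prediction can be reproduced in a few lines. The sketch below is our own illustration: the exact 4-d loss of Figure 1 is not reproduced in this appendix, so we assume the natural choice L(x) = ½(F₁(x₁, x₂)x₃² + F₂(x₁, x₂)x₄²), whose Hessian on {x₃ = x₄ = 0} indeed has the stated nonzero eigenvalues F₁ and F₂, and run full-batch SAM from a point near the manifold.

```python
import numpy as np

def F1(x1, x2): return x1**2 + 6 * x2**2 + 8
def F2(x1, x2): return 4 * (1 - x1)**2 + (1 - x2)**2 + 1

# Assumed form of the 4-d loss (consistent with the eigenvalues stated above,
# but not necessarily the exact loss of Figure 1):
# L(x) = 0.5 * (F1(x1, x2) * x3^2 + F2(x1, x2) * x4^2)
def grad_L(x):
    x1, x2, x3, x4 = x
    return np.array([
        x1 * x3**2 - 4 * (1 - x1) * x4**2,
        6 * x2 * x3**2 - (1 - x2) * x4**2,
        F1(x1, x2) * x3,
        F2(x1, x2) * x4,
    ])

eta, rho = 1e-2, 5e-2
x = np.array([0.5, 0.5, 1e-2, 0.0])
f1_start = F1(x[0], x[1])
for _ in range(20000):
    g = grad_L(x)
    gn = np.linalg.norm(g)
    d = g / gn if gn > 0 else np.eye(4)[0]   # zero-gradient fallback (Appendix D)
    x = x - eta * grad_L(x + rho * d)        # full-batch SAM (Equation 3)

# the top Hessian eigenvalue F1 along the manifold has decreased
assert F1(x[0], x[1]) < f1_start
```

The iterate oscillates within an O(ηρ) band around {x₃ = x₄ = 0} while (x₁, x₂) drifts so that F₁ decreases, which is the sharpness-reduction behavior of Phase II.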

C ADDITIONAL PRELIMINARY

In this section, we introduce some additional notation and clarifications before the proofs. We first give the detailed definition of a differentiable submanifold.

Definition C.1 (Differentiable Submanifold of R^D). We call a subset Γ ⊂ R^D a C^k submanifold of R^D of dimension n if and only if for every x ∈ Γ, there exists an open neighborhood U of x and an invertible C^k map ψ : U → R^D, such that ψ(Γ ∩ U) = (R^n × {0}) ∩ ψ(U).

Necessity of Manifold Assumption. The connectivity of the set of local minimizers implied by the manifold assumption allows us to take the limit of the perturbation radius ρ → 0 while still yielding interesting and insightful implicit bias results in the end-to-end analysis. So far almost all analyses of implicit bias for general model parameterizations rely on Taylor expansion, e.g., (..., 2020). Thus it is crucial to consider a small perturbation radius ρ. On the contrary, if the set of global minimizers were a set of discrete points, then with a small perturbation radius ρ, the implicit bias of optimizers would not suffice to drive the iterate from one global minimum to another.

Implicit versus Explicit Bias. If an algorithm or optimizer has a bias towards certain types of global/local minima of the loss over other minima, and this bias is not encoded in the loss function, then we call such a bias an implicit bias. On the other hand, if a bias emerges solely as a consequence of successfully minimizing a certain regularized loss, regardless of the optimizer (as long as the optimizer minimizes the loss), we say such a bias is an explicit bias of the regularized loss (or of the regularizer). As a concrete example, we will prove that full-batch SAM (Equation 3) prefers local minima with a certain sharpness property. The bias stems from the particular update rule of full-batch SAM (Equation 3), and not all optimizers for the intended target loss L^Asc_ρ (Equation 2) have this bias. Therefore, it is considered an implicit bias.
As an example of explicit bias, all optimizers minimizing a loss combined with ℓ₂ regularization will prefer models with smaller parameter norm, and this is considered an explicit bias of ℓ₂ regularization.

Usage of O(•) Notation: Our analysis assumes small η and ρ while treating all other problem-dependent parameters as constants, such as the dimension of the parameter space and the maximum possible values of the derivatives (of various orders) of the loss function L and of the limit map Φ. In O(•), Ω(•), o(•), ω(•), Θ(•), we hide all problem-dependent quantities, e.g., the (unique) initialization x_init, the manifold Γ, the compact set U′ in Theorem 4.2, and the continuous time T₃ in Theorems 4.5 and 5.4, and only keep the dependency on ρ and η. For example, O(f(ρ)) is a placeholder for some function g(ρ) such that there exists a problem-dependent constant C > 0 with |g(ρ)| ≤ C|f(ρ)| for all ρ > 0. In informal equations such as Equation 31 in the proof sketch section, we are a bit more sloppy and hide the dependency on x(t) in the O(•) notation as well; these cases are dealt with formally in the proofs.

Ill-definedness of SAM with Zero Gradient. The update rule of SAM (Equations 3 and 13) is ill-defined when the gradient is zero. However, our analysis in Appendix D shows that when the set of stationary points of the loss L, {x | ∇L(x) = 0}, has zero measure, then for any perturbation radius ρ, except for countably many learning rates, full-batch SAM is well-defined for almost all initializations and all steps (Theorem D.1). A similar result is shown for stochastic SAM if the stationary points of each stochastic loss form a zero-measure set (Theorem D.2). Thus SAM is generically well-defined. For the sake of rigorousness, when SAM encounters a zero gradient, we modify the algorithm by replacing the ill-defined normalized gradient with an arbitrary unit-norm vector, and our analysis of the implicit bias of SAM still holds.
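The zero-gradient modification just described is easy to state in code. The sketch below is a minimal illustration (with L(x) = ∥x∥² as a stand-in loss, our choice): it implements one full-batch SAM step (Equation 3) and substitutes an arbitrary unit vector, here e₁, when the gradient vanishes.

```python
import numpy as np

def sam_step(x, grad_fn, eta, rho, fallback=None):
    """One full-batch SAM step (Equation 3); when the gradient vanishes, the
    ill-defined normalized gradient is replaced by an arbitrary unit vector
    (e_1 by default), as described above."""
    g = grad_fn(x)
    n = np.linalg.norm(g)
    d = (np.eye(len(x))[0] if fallback is None else fallback) if n == 0.0 else g / n
    return x - eta * grad_fn(x + rho * d)

# the update stays well-defined even at the stationary point of L(x) = ||x||^2
grad = lambda x: 2.0 * x
step = sam_step(np.zeros(3), grad, eta=0.1, rho=0.05)
assert np.allclose(step, [-0.01, 0.0, 0.0])
```

Note that the fallback only fires on the measure-zero stationary set; everywhere else the update coincides with Equation 3.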

D WELL-DEFINEDNESS OF SAM

In this section, we discuss the well-definedness of SAM. When ∇L(x) = 0, SAM (Equation 3) is not well-defined, because the normalized gradient ∇L(x)/∥∇L(x)∥₂ is not well-defined. The main results of this section are Theorems D.1 and D.2, which say that (stochastic) SAM starting from a random initialization has zero probability of reaching points at which SAM is undefined (i.e., points with zero gradient), for all except countably many learning rates. These results follow from Theorem D.3, a more general theorem applicable to other discrete update rules as well, such as SGD. Note that the results in this section do not rely on the manifold assumption, i.e., Assumption 3.2. We end this section with a concrete example where SAM is undefined with constant probability, suggesting that the exclusion of countably many learning rates is necessary in Theorems D.1 and D.2.

Theorem D.1. Consider any C² loss L with zero-measure stationary set {x | ∇L(x) = 0}. For every ρ > 0, except for countably many learning rates, for almost all initializations and all t, the iterate x(t) of full-batch SAM (Equation 3) has non-zero gradient and is thus well-defined.

Theorem D.2. Consider any C² losses {L_k}_{k=1}^M with zero-measure stationary sets {x | ∇L_k(x) = 0} for each k ∈ [M]. For every ρ > 0, except for countably many learning rates η, for almost all initializations and all t, with probability one over the randomness of the algorithm, the iterate x(t) of stochastic SAM (Equation 13) has non-zero gradient and is thus well-defined.

Before presenting our main theorem (Theorem D.3), we need to introduce some notation. For a map F from R^D \ Z to R^D, we define F_η : R^D \ Z → R^D as F_η(x) ≜ x − ηF(x) for any η ∈ R⁺. Given a sequence of functions {F^n}_{n=1}^∞, we define F^n_η(x) ≜ x − ηF^n(x) for any x ∈ R^D. With a slight abuse of notation, we further write F^n_η(x) ≜ F^n_η(F^{n−1}_η(x)) for the composition of the first n maps, for any n ≥ 1, with F^0_η(x) = x. Theorem D.3.
Let Z be a closed subset of R^D with zero Lebesgue measure and let µ be any probability measure on R^D that is absolutely continuous with respect to the Lebesgue measure. For any sequence of C¹ functions F^n : R^D \ Z → R^D, n ∈ N⁺, the following claim holds for all except countably many η ∈ R⁺:

µ({x ∈ R^D | ∃n ∈ N, F^n_η(x) ∈ Z and ∀0 ≤ i ≤ n − 1, F^i_η(x) ∉ Z}) = 0.

In other words, for almost all η (except countably many positive numbers), the iteration x(t + 1) = x(t) − ηF(x(t)) = F^t_η(x(0)) will not enter Z almost surely, provided that x(0) is sampled from µ. Theorems D.1 and D.2 follow immediately from Theorem D.3.

Proof of Theorem D.1. Let F(x) = ∇L(x + ρ∇L(x)/∥∇L(x)∥₂) and Z = {x ∈ R^D | ∇L(x) = 0}. We can easily check that F is C¹ on R^D \ Z, and by assumption Z is a zero-measure set. Applying Theorem D.3 with F^n ≡ F for all n ∈ N⁺, we get the desired result.

Proof of Theorem D.2. Let G_k(x) = ∇L_k(x + ρ∇L_k(x)/∥∇L_k(x)∥₂) and Z = ∪_{k=1}^M {x ∈ R^D | ∇L_k(x) = 0}. We can easily check that each G_k is C¹ on R^D \ Z and that Z is a zero-measure set. Applying Theorem D.3 with F^n = G_{k_n}, where k_n is the index sampled at step n, we get the desired result.

Lemma D.4. Let Z be a closed subset of R^D with zero Lebesgue measure and F : R^D \ Z → R^D be C¹. Then for all except countably many η ∈ R⁺, {x ∈ R^d \ Z | det(∂F_η(x)) = 0} is a zero-measure set under the Lebesgue measure.

Lemma D.5. Let Z be a closed subset of R^D with zero Lebesgue measure and H : R^D \ Z → R^D be a continuously differentiable function. If {x ∈ R^d \ Z | det(∂H(x)) = 0} is a zero-measure set, then for any zero-measure set Z′, H⁻¹(Z′) is a zero-measure set.

Proof of Theorem D.3. It suffices to prove that for every N ∈ N⁺, for all except countably many η,

µ({x ∈ R^D | F^N_η(x) ∈ Z and ∀0 ≤ i ≤ N − 1, F^i_η(x) ∉ Z}) = 0. (15)

The desired result is immediately implied by the above claim because a countable union of countable sets is countable, and a countable union of zero-measure sets has zero measure. To prove Equation 15, we first introduce some notation. For any η > 0, 0 ≤ n ≤ N − 1, and x ∈ R^D, we define F^{−(n+1)}_η(x) ≜ (F^{N−n}_η)⁻¹(F^{−n}_η(x)), where F^{−0}_η(x) = x.
We extend the definition to sets in the natural way, namely F^{−n}_η(S) ≜ ∪_{x∈S} F^{−n}_η(x) for any S ⊆ R^D. Under this notation, we have that

F^{−N}_η(Z) = {x ∈ R^D | F^N_η(x) ∈ Z and ∀0 ≤ i ≤ N − 1, F^i_η(x) ∉ Z}.

We prove by induction that for each 0 ≤ n ≤ N, except for countably many η ∈ R⁺, F^{−n}_η(Z) has zero Lebesgue measure. The base case n = 0 is trivial, as Z is assumed to have zero measure. Suppose the claim holds for n. By Lemma D.4, except for countably many η ∈ R⁺, {x ∈ R^d \ F^{−n}_η(Z) | det(∂F^{N−n−1}_η(x)) = 0} is a zero-measure set. Next, by Lemma D.5, if for some η ∈ R⁺ this set has zero measure, then F^{−(n+1)}_η(Z) = (F^{N−n−1}_η)⁻¹(F^{−n}_η(Z)) is a zero-measure set. By induction, we conclude that except for countably many η ∈ R⁺, for all integers 0 ≤ n ≤ N, F^{−n}_η(Z) has zero measure. Since µ is absolutely continuous with respect to the Lebesgue measure, µ(F^{−N}_η(Z)) = 0. We end this section with the proofs of Lemmas D.4 and D.5.

Proof of Lemma D.4. Fix any probability measure µ equivalent to the Lebesgue measure (so that µ-null sets and Lebesgue-null sets coincide). We use λ_i(x) to denote the real part of the i-th eigenvalue of the matrix ∂F(x) in descending order. Since ∂F(x) is continuous in x, λ_i(x) is continuous in x as well for any i ∈ [D], and thus {x ∈ R^D \ Z | λ_i(x) = 1/η} is a measurable set. Fix i ∈ [D]. For each positive integer n, let I_n be the set of η with µ({x ∈ R^D \ Z | λ_i(x) = 1/η}) > 1/n; then |I_n| ≤ n, because

|I_n|/n ≤ Σ_{η∈I_n} µ({x ∈ R^D \ Z | λ_i(x) = 1/η}) ≤ µ({x ∈ R^D \ Z | 1/λ_i(x) ∈ I_n}) ≤ 1.

Therefore, there are at most countably many η ∈ R⁺ such that µ({x ∈ R^D \ Z | λ_i(x) = 1/η}) > 0. Further noting that det(∂F_η(x)) = 0 ⟺ ∃i ∈ [D], λ_i(x) = 1/η, we conclude that there are at most countably many η ∈ R⁺ such that µ({x ∈ R^D \ Z | det(∂F_η(x)) = 0}) > 0. This completes the proof.

Proof of Lemma D.5. Denote {x ∈ R^D \ Z | det(∂H(x)) = 0} by Z″. Since det(∂H(x)) is continuous in x (as H is C¹), Z″ is relatively closed in R^D \ Z.
Since Z is closed and Z″ is relatively closed in R^D \ Z, the set R^D \ (Z ∪ Z″) is open. For every x ∈ R^D \ (Z ∪ Z″), we have det(∂H(x)) ≠ 0, and since det(∂H(·)) is continuous, there exists an open neighborhood U_x ⊆ R^D \ (Z ∪ Z″) of x on which det(∂H(·)) is everywhere non-zero. By the inverse function theorem, we may shrink U_x so that H is invertible on U_x and its inverse (H|_{U_x})⁻¹ is differentiable on H(U_x). Therefore, (H|_{U_x})⁻¹ maps any zero-measure set to a zero-measure set. In particular, (H|_{U_x})⁻¹(Z′ ∩ H(U_x)) has zero measure, and so does H⁻¹(Z′) ∩ U_x ⊆ (H|_{U_x})⁻¹(Z′ ∩ H(U_x)). Since R^D is a separable metric space, the open cover {U_x}_{x ∈ R^D \ (Z∪Z″)} of R^D \ (Z ∪ Z″) has a countable subcover {U_x}_{x∈I}, where I is a countable subset of R^D \ (Z ∪ Z″). Therefore,

H⁻¹(Z′) \ (Z ∪ Z″) = H⁻¹(Z′) ∩ (R^D \ (Z ∪ Z″)) = ∪_{x∈I} H⁻¹(Z′) ∩ U_x

is a zero-measure set. Thus H⁻¹(Z′) is also zero-measure, since Z and Z″ are both zero-measure. This completes the proof.

We end this section with an example where SAM is undefined with constant probability.

Theorem D.6. For any η, ρ > 0, there is a C² loss function L : R → R satisfying that (1) L has a unique stationary point and (2) the set of initializations from which SAM with learning rate η and perturbation radius ρ reaches the unique stationary point has positive Lebesgue measure.

Proof of Theorem D.6. We first consider the case ρ = η = 1 with

L(x) = x²/2 + x + 3/4, for x ∈ (−∞, −2);  x⁴/64 + x²/8, for x ∈ [−2, 2];  x²/2 − x + 3/4, for x ∈ (2, ∞). (16)

We first check that L is indeed C²: L(2) = L(−2) = 3/4, L′(2) = −L′(−2) = 1, and L″(2) = L″(−2) = 1. (The additive constant 3/4 in the outer pieces makes L continuous and does not affect any gradient computation below.) Now we claim that for all |x(0)| > 2, x(1) = 0, which is the unique stationary point. Note that L is even and monotonically increasing on [0, ∞), so ∇L(x)/|∇L(x)| = sign(x) for x ≠ 0.
Thus for |x(t)| > 1, it holds that |x(t) + sign(x(t))| > 2, and therefore

x(t + 1) = x(t) − ηL′(x(t) + ρ∇L(x(t))/|∇L(x(t))|)
         = x(t) − L′(x(t) + sign(x(t)))
         = x(t) − (x(t) + sign(x(t)) − sign(x(t) + sign(x(t))))
         = x(t) − x(t) = 0. (17)

Now we turn to the case of arbitrary positive η, ρ. It suffices to consider L_{η,ρ}(x) ≜ (ρ²/η)L(x/ρ), so that L′_{η,ρ}(x) = (ρ/η)L′(x/ρ). Using the calculation for ρ = η = 1, we can verify that for any |x| > 2ρ,

L′_{η,ρ}(x + ρ sign(L′_{η,ρ}(x))) = L′_{η,ρ}(x + ρ sign(x)) = (ρ/η)L′(x/ρ + sign(x)) = x/η,

namely x − ηL′_{η,ρ}(x + ρ sign(L′_{η,ρ}(x))) = 0. This completes the proof.

A common (but wrong) intuition here is that, for a continuously differentiable update rule, as long as the set of points where the update rule is ill-defined (here, the points with zero gradient) has zero measure, then almost surely over the initialization, gradient-based optimization algorithms like SAM will not land exactly on any stationary point. The above example negates this intuition. The issue is that although a differentiable map (like the SAM update x ↦ x − η∇L(x + ρ∇L(x)/∥∇L(x)∥₂)) always maps zero-measure sets to zero-measure sets, the preimage of a zero-measure set is not necessarily zero-measure, because the map is not necessarily invertible. The fact that the update rule of SAM is not invertible at 0 is exactly why the preimage of 0 has positive measure.
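The one-step collapse in the proof (for η = ρ = 1) can be verified numerically. The sketch below implements L′ from Equation 16 and checks that a whole interval of initializations is mapped exactly to the stationary point 0 in a single SAM step; dyadic initial values are used so the arithmetic is exact in floating point.

```python
def Lp(x):
    """Derivative of the piecewise loss in Equation 16 (case eta = rho = 1)."""
    if x < -2.0:
        return x + 1.0
    if x > 2.0:
        return x - 1.0
    return x ** 3 / 16.0 + x / 4.0

def sam_step(x, eta=1.0, rho=1.0):
    g = Lp(x)
    sign = 1.0 if g > 0 else -1.0      # normalized 1-d "gradient direction"
    return x - eta * Lp(x + rho * sign)

# every |x0| > 1 reaches the unique stationary point 0 in a single step,
# a positive-measure event; points well inside (-1, 1) do not
for x0 in [1.5, 2.0, 3.5, -1.25, -2.0, -5.0]:
    assert sam_step(x0) == 0.0
assert sam_step(0.5) != 0.0
```

This makes the positive-measure claim of Theorem D.6 concrete: the entire set {|x| > 1} is in the preimage of the stationary point.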

E PROOF SETUPS

In this section, we provide the details of our proof setup, including notations and assumptions/settings. We first introduce some additional notation that will be used in the proofs. For any subset S ⊆ R^D, we define dist(x, S) ≜ inf_{y∈S} ∥x − y∥₂. For any d > 0 and any subset S ⊆ R^D, we define S^d ≜ {x ∈ R^D | dist(x, S) ≤ d}. Our convention is to use K to denote a compact set and U to denote an open set. Below we restate our main assumption in the full-batch case and the related notation from Section 3. Throughout the analysis, we fix the initialization x_init and the loss function L : R^D → R.

Assumption 3.2. Assume the loss L : R^D → R is C⁴, and there exists a C² submanifold Γ of R^D that is (D − M)-dimensional for some integer 1 ≤ M ≤ D, where for all x ∈ Γ, x is a local minimizer of L and rank(∇²L(x)) = M.

Notations for Full-Batch Setting: Given any point x ∈ Γ, define P_{x,Γ} as the projection operator onto the normal space of Γ at x, and P⊥_{x,Γ} = I_D − P_{x,Γ}. Given the loss function L, its gradient flow is denoted by the mapping ϕ : R^D × [0, ∞) → R^D. Here, ϕ(x, τ) denotes the iterate at time τ of the gradient flow starting at x and is defined as the unique solution of

ϕ(x, τ) = x − ∫₀^τ ∇L(ϕ(x, t)) dt, ∀x ∈ R^D.

We further define the limiting map of ϕ(x, •) as Φ(x) = lim_{τ→∞} ϕ(x, τ); that is, Φ(x) denotes the convergent point of the gradient flow starting from x. For convenience, we define λ_i(x), v_i(x) as λ_i(∇²L(Φ(x))), v_i(∇²L(Φ(x))) whenever the latter are well-defined. When x(t) and Γ are clear from context, we also use λ_i(t) := λ_i(x(t)), v_i(t) := v_i(x(t)), P⊥_{t,Γ} := P⊥_{Φ(x(t)),Γ}, P_{t,Γ} := P_{Φ(x(t)),Γ}.

Definition 3.3 (Attraction Set). Let U be the attraction set of Γ under gradient flow, that is, a neighborhood of Γ containing all points starting from which the gradient flow w.r.t. the loss L converges to some point in Γ, or mathematically, U ≜ {x ∈ R^D | Φ(x) exists and Φ(x) ∈ Γ}.
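The limiting map Φ and its key invariance (Lemma 3.1: Φ is constant along gradient-flow trajectories, since ∂Φ(x)∇L(x) = 0) can be probed numerically by approximating the gradient flow with small-step gradient descent. The toy loss below, L(x) = ½(x₁x₂ − 1)² with zero-loss manifold {x₁x₂ = 1}, is our own illustrative choice.

```python
import numpy as np

def grad_L(x):
    # toy loss L(x) = 0.5 * (x1*x2 - 1)^2 with zero-loss manifold {x1*x2 = 1}
    r = x[0] * x[1] - 1.0
    return r * np.array([x[1], x[0]])

def Phi(x, lr=1e-2, steps=20000):
    # approximate the gradient-flow limit Phi(x) by small-step gradient descent
    x = np.array(x, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_L(x)
    return x

x0 = np.array([1.5, 0.3])
y0 = x0.copy()
for _ in range(1000):            # y0 ~ phi(x0, t): a point further along the flow
    y0 = y0 - 1e-3 * grad_L(y0)

p, q = Phi(x0), Phi(y0)
assert abs(p[0] * p[1] - 1.0) < 1e-6   # Phi lands on the manifold
assert np.linalg.norm(p - q) < 5e-2    # Phi is (numerically) constant along the flow
```

The small residual between the two limits comes from the Euler discretization; in the continuous-time limit they coincide exactly.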
Below we restate the setting for stochastic losses of batch size one from Section 5.

Setting 5.1. Let the total number of data be M. Let f_k(x) be the model output on the k-th data point, where f_k is a C⁴-smooth function, and let y_k be the k-th label, for k = 1, . . . , M. We define the loss on the k-th data point as L_k(x) = ℓ(f_k(x), y_k) and the total loss L = Σ_{k=1}^M L_k/M, where the function ℓ(y′, y) is C⁴-smooth in y′. We also assume that for any y ∈ R, arg min_{y′∈R} ℓ(y′, y) = y and ∂²ℓ(y′, y)/(∂y′)²|_{y′=y} > 0. Finally, we denote the set of global minimizers of L with full-rank Jacobian by Γ and assume it is non-empty, that is,

Γ ≜ {x ∈ R^D | f_k(x) = y_k, ∀k ∈ [M], and {∇f_k(x)}_{k=1}^M are linearly independent} ≠ ∅.

Theorem 5.2. The loss L, set Γ, and integer M defined in Setting 5.1 satisfy Assumption 3.2.

In our analysis, we prove our main theorems in the stochastic setting under a more general condition than Setting 5.1, namely Condition E.1 (on top of Assumption 3.2). The only usage of Setting 5.1 in the proofs is through Theorems 5.2 and E.2.

Condition E.1. The total loss is L = (1/M)Σ_{k=1}^M L_k. For each k ∈ [M], L_k is C⁴, and there exists a (D − 1)-dimensional C²-submanifold Γ_k of R^D, where for all x ∈ Γ_k, x is a global minimizer of L_k, L_k(x) = 0, and rank(∇²L_k(x)) = 1. Moreover, Γ = ∩_{k=1}^M Γ_k for the Γ defined in Assumption 3.2.

Theorem E.2. Setting 5.1 implies Condition E.1.

Notations for Stochastic Setting: Since ∇²L_k is rank-1 on Γ_k for each k ∈ [M], we can write it as ∇²L_k(x) = Λ_k(x)w_k(x)w_k(x)⊤ for any x ∈ Γ, where w_k is a continuous function on Γ with pointwise unit norm. Given the loss function L_k, its gradient flow is denoted by the mapping ϕ_k : R^D × [0, ∞) → R^D. Here, ϕ_k(x, τ) denotes the iterate at time τ of the gradient flow starting at x and is defined as the unique solution of

ϕ_k(x, τ) = x − ∫₀^τ ∇L_k(ϕ_k(x, t)) dt, ∀x ∈ R^D.

We further define the limiting map Φ_k as Φ_k(x) = lim_{τ→∞} ϕ_k(x, τ); that is, Φ_k(x) denotes the convergent point of the gradient flow w.r.t. L_k starting from x.

Definition E.3. A function L is µ-PL in a set U iff ∀x ∈ U, ∥∇L(x)∥₂² ≥ 2µ(L(x) − inf_{x∈U} L(x)).

Definition E.4. The spectral 2-norm of a k-th order tensor X ∈ R^{d₁×...×d_k} is defined as ∥X∥₂ = max_{x_i∈R^{d_i}, ∥x_i∥₂=1} X[x₁, . . . , x_k].

Lemma E.5 (Arora et al. (2022), Lemma B.2). Given any compact set K ⊆ Γ, there exist r(K), µ(K), ∆(K) ∈ R⁺ such that

1. K^{r(K)} ∩ Γ is compact.
2. K^{r(K)} ⊂ U ∩ (∩_{k∈[M]} U_k).
3. L is µ(K)-PL on K^{r(K)}.
4. inf_{x∈K^{r(K)}} (λ₁(∇²L(x)) − λ₂(∇²L(x))) ≥ ∆(K) > 0.
5. inf_{x∈K^{r(K)}} λ_M(∇²L(x)) ≥ µ(K) > 0.
6. inf_{x∈K^{r(K)}} λ₁(∇²L_k(x)) ≥ µ(K) > 0.

Given a compact set K ⊂ Γ, we further define ζ(K) = sup_{x∈K^{r(K)}} ∥∇²L(x)∥₂, ν(K) = sup_{x∈K^{r(K)}} ∥∇³L(x)∥₂, Υ(K) = sup_{x∈K^{r(K)}} ∥∇⁴L(x)∥₂, ξ(K) = sup_{x∈K^{r(K)}} ∥∇²Φ(x)∥₂, χ(K) = sup_{x,y∈K^{r(K)}} ∥∇²Φ(x) − ∇²Φ(y)∥₂/∥x − y∥₂. Similarly, we use the notations ζ_k(K), ν_k(K), Υ_k(K), ξ_k(K), χ_k(K) for the stochastic losses L_k.

Lemma E.6 (Arora et al. (2022)). Given any compact subset K ⊂ Γ and r(K) as defined in Lemma E.5, there exists 0 < h(K) < r(K) such that

1. sup_{x∈K^{h(K)}} L(x) − inf_{x∈K^{h(K)}} L(x) ≤ µ(K)r(K)²/8.
2. ∀x ∈ K^{h(K)}, Φ(x) ∈ K^{r(K)/2}.
3. ∀x ∈ K^{h(K)}, ∥x − Φ(x)∥₂ ≤ 8µ(K)²/(ζ(K)ν(K)).
4. The whole segment between x and Φ(x) lies in K^{r(K)}, and so does the segment between x and Φ_k(x), for any k ∈ [M].

The proofs of the lemmas above can be found in Arora et al. (2022). Readers should note that although Arora et al.
(2022) only prove these lemmas when K is a special compact set (the trajectory of an ODE), their proofs do not use any property of K other than that it is a compact subset of Γ, and thus our Lemmas E.5 and E.6 hold for general compact subsets of Γ. In the rest of the appendix, for convenience, we drop the dependency on K in the various constants when there is no ambiguity.

Proof of Theorem 5.2. Define the map F : R^D → R^M as [F(x)]_k = f_k(x), ∀k ∈ [M]. Let T_x ≜ span({∇f_k(x)}_{k=1}^M) and let T⊥_x be the orthogonal complement of T_x in R^D. Now we apply the implicit function theorem to F at each x ∈ Γ. Without loss of generality (e.g., by rotating the coordinate system), we can assume that x = 0, T⊥_x = R^{D−M} × {0}, and T_x = {0} × R^M. The implicit function theorem ensures that there are two open sets 0 ∈ U ⊂ R^{D−M} and 0 ∈ V ⊂ R^M and a C⁴ map g : U → V such that F⁻¹(Y) ∩ (U × V) = {(u, g(u)) | u ∈ U}, where Y ≜ [y₁, . . . , y_M] ∈ R^M. Moreover, {∇f_k(x′)}_{k=1}^M is linearly independent for every x′ ∈ U × V. Thus by the definition of Γ, it holds that Γ ∩ (U × V) = F⁻¹(Y) ∩ (U × V) = {(u, g(u)) | u ∈ U}. Now for x = (u, v) ∈ U × V, we define ψ : U × V → R^D by ψ(u, v) ≜ (u, v − g(u)). We can check that ψ is C⁴, invertible, and

ψ(Γ ∩ (U × V)) = {(u, v − g(u)) | v = g(u), u ∈ U} = {(u, 0) | u ∈ U} = U × {0} = (R^{D−M} × {0}) ∩ ψ(U × V).

This proves that Γ is a C⁴ submanifold of R^D of dimension D − M (cf. Definition C.1). Since arg min_{y′∈R} ℓ(y′, y) = y for any y ∈ R, it is clear that every x ∈ Γ is a global minimizer of L. Finally, we check the rank of the Hessian of the loss L. Note that for any x ∈ Γ, ∇²L_k(x) = ∂²ℓ(y′, y_k)/(∂y′)²|_{y′=y_k} ∇f_k(x)(∇f_k(x))⊤ with ∂²ℓ(y′, y_k)/(∂y′)²|_{y′=y_k} > 0, so rank(∇²L(x)) = rank(∂F(x)) = M. This completes the proof.

Proof of Theorem E.2. 1. L = (1/M)Σ_{k=1}^M L_k by definition. 2. ∀k ∈ [M], L_k(x) = ℓ(f_k(x), y_k) is C⁴, as ℓ and f_k are both C⁴. 3. For any x ∈ Γ, by Lemma 5.5, we have ∇f_k(x) ≠ 0.
Then there exists an open set V_k with Γ ⊂ V_k such that ∇f_k(x) ≠ 0 for every x ∈ V_k, k ∈ [M]. Applying the implicit function theorem as in the proof of Theorem 5.2, for each k ∈ [M] there exists a (D − 1)-dimensional C⁴-submanifold Γ_k ⊂ V_k such that for any x′ ∈ V_k, f_k(x′) = y_k if and only if x′ ∈ Γ_k. Since f_k(x) = y_k for any x ∈ Γ ⊂ V_k, we can infer that Γ ⊂ Γ_k, and hence Γ ⊂ ∩_{k=1}^M Γ_k. 4. For any x ∈ Γ_k, we have f_k(x) = y_k, which implies L_k(x) = 0. Also, as x ∈ V_k, ∇f_k(x) ≠ 0. By Lemma J.15, we have rank(∇²L_k(x)) = 1.
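The rank-1 structure invoked here (the claim of Lemma J.15) can be checked numerically with finite differences. In the sketch below, f(x) = sin(x₁) + x₁x₂ is a hypothetical model output of our choosing, ℓ is the squared loss (so ∂²ℓ/(∂y′)² = 1), and the label is chosen so that x* interpolates; the Hessian of L_k at x* should then equal ∇f∇f⊤, a rank-1 matrix.

```python
import numpy as np

f = lambda x: np.sin(x[0]) + x[0] * x[1]    # hypothetical smooth model output
xstar = np.array([0.5, 1.0])
y = f(xstar)                                # choose the label so x* interpolates
L = lambda x: 0.5 * (f(x) - y) ** 2         # squared loss: d^2 l / (dy')^2 = 1

# central finite-difference Hessian of L at x*
h = 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        H[i, j] = (L(xstar + ei + ej) - L(xstar + ei - ej)
                   - L(xstar - ei + ej) + L(xstar - ei - ej)) / (4 * h * h)

g = np.array([np.cos(xstar[0]) + xstar[1], xstar[0]])    # grad f at x*
assert np.allclose(H, np.outer(g, g), atol=1e-3)          # Hessian = grad f grad f^T
assert np.linalg.matrix_rank(H, tol=1e-3) == 1            # hence exactly rank 1
```

At an interpolating point, the residual term (f(x) − y)∇²f(x) of the Hessian vanishes, leaving only the rank-1 Gauss-Newton term, which is what the proof uses.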

F PROPERTIES OF LIMITING MAP OF GRADIENT FLOW, Φ

In our analysis, the properties of Φ are heavily used. In this section, we recap some related lemmas from Arora et al. (2022), and then introduce some new lemmas for the stochastic setting with batch size one.

Lemma F.1 (Arora et al. (2022), Lemma B.6). Given any compact set K ⊂ Γ, for any x ∈ K^h,

∥x − Φ(x)∥₂ ≤ ∫₀^∞ ∥dϕ(x, t)/dt∥₂ dt ≤ √(2(L(x) − L(Φ(x)))/µ) ≤ ∥∇L(x)∥₂/µ.

Lemma F.2. Given any compact set K ⊂ Γ, for any x ∈ K^h,

∥∇L(x)∥₂ ≤ ζ∥x − Φ(x)∥₂ ≤ ζ√(2(L(x) − L(Φ(x)))/µ).

Proof of Lemma F.2. Since ∇L(Φ(x)) = 0 and ∥∇²L∥₂ ≤ ζ on K^r, Taylor expansion along the segment between x and Φ(x) gives the first inequality; the second inequality follows from Lemma F.1.

Proof of Lemma 3.1. Since Φ is defined as the limit map of the gradient flow, it holds that Φ(ϕ(x, t)) = Φ(x) for any t ≥ 0. Differentiating both sides at t = 0, we have ∂Φ(ϕ(x, 0)) · (∂ϕ(x, t)/∂t)|_{t=0} = 0. The proof is completed by noting that ∂ϕ(x, t)/∂t = −∇L(ϕ(x, t)) by the definition of ϕ.

Lemma F.3 (Arora et al. (2022)). It holds that

∂Φ(x)∇L(x) = 0, for x ∈ U;
∂Φ(x)∇²L(x)∇L(x) = −∂²Φ(x)[∇L(x), ∇L(x)], for x ∈ U;
∂Φ(x)∂²(∇L)(x)[v₁, v₁] = P⊥_{x,Γ}∇(λ₁(∇²L(x))), for x ∈ Γ.

Lemma F.4 (Arora et al. (2022), Lemmas B.8 and B.9). Given any compact set K ⊂ Γ, for any x ∈ K^h,

∥P⊥_{Φ(x),Γ}(x − Φ(x))∥₂ ≤ (ζν/(4µ²))∥x − Φ(x)∥₂²;
∥∇L(x) − ∇²L(Φ(x))(x − Φ(x))∥₂ ≤ (ν/2)∥x − Φ(x)∥₂²;
|∥∇L(x)∥₂/∥∇²L(Φ(x))(x − Φ(x))∥₂ − 1| ≤ (2ν/µ)∥x − Φ(x)∥₂;
∇L(x)/∥∇L(x)∥₂ = ∇²L(Φ(x))(x − Φ(x))/∥∇²L(Φ(x))(x − Φ(x))∥₂ + O((ν/µ)∥x − Φ(x)∥₂).

Lemma F.6. Given any compact set K ⊂ Γ, for any x ∈ K^h,

∥∂Φ(x)∇L_k(x)∥₂ ≤ (ν_k + ζ_kξ)∥x − Φ(x)∥₂²;
∥∂Φ(x)∇²L_k(x)∇L_k(x)/∥∇L_k(x)∥₂∥₂ ≤ (ν_k + ζ_kξ)∥x − Φ(x)∥₂.

Proof of Lemma F.6. By Lemma E.6 and Taylor expansion,

∥∂Φ(x)∇L_k(x)∥₂ ≤ ∥∂Φ(x)∇²L_k(Φ(x))(x − Φ(x))∥₂ + ν_k∥x − Φ(x)∥₂²
≤ ∥∂Φ(Φ(x))∇²L_k(Φ(x))(x − Φ(x))∥₂ + ν_k∥x − Φ(x)∥₂² + ζ_kξ∥x − Φ(x)∥₂²
= ∥P⊥_{x,Γ}∂Φ(Φ(x))∇²L_k(Φ(x))(x − Φ(x))∥₂ + ν_k∥x − Φ(x)∥₂² + ζ_kξ∥x − Φ(x)∥₂²
= (ν_k + ζ_kξ)∥x − Φ(x)∥₂²,

which proves the first claim. Again by Lemma E.5 and Taylor expansion,

∥∂Φ(x)∇²L_k(x)∇L_k(x)/∥∇L_k(x)∥₂∥₂ ≤ ∥∂Φ(x)∇²L_k(Φ(x))∇L_k(x)/∥∇L_k(x)∥₂∥₂ + ν_k∥x − Φ(x)∥₂
≤ ∥∂Φ(Φ(x))∇²L_k(Φ(x))∇L_k(x)/∥∇L_k(x)∥₂∥₂ + (ν_k + ζ_kξ)∥x − Φ(x)∥₂
= (ν_k + ζ_kξ)∥x − Φ(x)∥₂,

which proves the second claim.

Lemma F.7. Suppose x ∈ K^h and y = x − η∇L(x + ρ∇L(x)/∥∇L(x)∥₂). Then

∥y − x∥₂ ≤ η∥∇L(x)∥₂ + ηζρ;
∥Φ(x) − Φ(y)∥₂ ≤ ξηρ∥∇L(x)∥₂ + νηρ² + ξη²∥∇L(x)∥₂² + ξζ²η²ρ² ≤ ζξηρ∥x − Φ(x)∥₂ + ζ²ξη²∥x − Φ(x)∥₂² + νηρ² + ξζ²η²ρ².

Proof of Lemma F.7. For sufficiently small ρ, x + ρ∇L(x)/∥∇L(x)∥₂ ∈ K^r. By Taylor expansion,

∥y − x∥₂ = η∥∇L(x + ρ∇L(x)/∥∇L(x)∥₂)∥₂ ≤ η∥∇L(x)∥₂ + ηζρ.

This further implies that for sufficiently small η and ρ, the segment between x and y lies in K^r. Again by Taylor expansion,

∥∂Φ(x)(y − x)∥₂ ≤ η∥∂Φ(x)∇L(x) + ρ∂Φ(x)∇²L(x)∇L(x)/∥∇L(x)∥₂∥₂ + ηρ²ν/2.

By Lemma F.3, ∂Φ(x)∇L(x) = 0 and ∂Φ(x)∇²L(x)∇L(x) = −∂²Φ(x)[∇L(x), ∇L(x)].
Hence,

∥∂Φ(x)(y − x)∥₂ ≤ ηρ∥∇L(x)∥₂ ∥∂²Φ(x)[∇L(x)/∥∇L(x)∥₂, ∇L(x)/∥∇L(x)∥₂]∥₂ + ηρ²ν/2 ≤ ξηρ∥∇L(x)∥₂ + ηρ²ν/2.

As the segment between x and y lies in K^r, by Taylor expansion,

∥Φ(y) − Φ(x)∥₂ ≤ ∥∂Φ(x)(y − x)∥₂ + ξ∥y − x∥₂²/2.

Putting things together, we have ∥Φ(x) − Φ(y)∥₂ ≤ ξηρ∥∇L(x)∥₂ + νηρ² + ξη²∥∇L(x)∥₂² + ξζ²η²ρ². Finally, by Lemma F.2, we have

∥Φ(x) − Φ(y)∥₂ ≤ ξηρ∥∇L(x)∥₂ + νηρ² + ξη²∥∇L(x)∥₂² + ξζ²η²ρ² ≤ ζξηρ∥x − Φ(x)∥₂ + ζ²ξη²∥x − Φ(x)∥₂² + νηρ² + ξζ²η²ρ².

This completes the proof.

Lemma F.8. Suppose x ∈ K^h and y = x − η∇L_k(x + ρ∇L_k(x)/∥∇L_k(x)∥₂). Then

∥y − x∥₂ ≤ η∥∇L_k(x)∥₂ + ηζρ;
∥Φ(x) − Φ(y)∥₂ ≤ O(η∥∇L(x)∥₂² + ηρ∥∇L(x)∥₂ + ηρ²).

Proof of Lemma F.8. For sufficiently small ρ, x + ρ∇L_k(x)/∥∇L_k(x)∥₂ ∈ K^r. By Taylor expansion,

∥y − x∥₂ = η∥∇L_k(x + ρ∇L_k(x)/∥∇L_k(x)∥₂)∥₂ ≤ η∥∇L_k(x)∥₂ + ηζρ.

This further implies that for sufficiently small η and ρ, the segment between x and y lies in K^r. Again by Taylor expansion,

∥∂Φ(x)(y − x)∥₂ ≤ η∥∂Φ(x)∇L_k(x) + ρ∂Φ(x)∇²L_k(x)∇L_k(x)/∥∇L_k(x)∥₂∥₂ + ηρ²ν/2.

We further have, by Lemma F.1,

∥∂Φ(x)∇L_k(x)∥₂ ≤ ∥∂Φ(Φ(x))∇L_k(x)∥₂ + ξ∥∇L_k(x)∥₂∥x − Φ(x)∥₂
≤ ∥∂Φ(Φ(x))∇²L_k(Φ(x))(x − Φ(x))∥₂ + ν∥x − Φ(x)∥₂² + ζξ∥x − Φ(x)∥₂²
≤ (ν/µ²)∥∇L(x)∥₂² + (ζξ/µ²)∥∇L(x)∥₂².

Similarly,

∥ρ∂Φ(x)∇²L_k(x)∇L_k(x)/∥∇L_k(x)∥₂∥₂ ≤ ∥ρ∂Φ(Φ(x))∇²L_k(x)∇L_k(x)/∥∇L_k(x)∥₂∥₂ + ρζξ∥x − Φ(x)∥₂
≤ ∥ρ∂Φ(Φ(x))∇²L_k(Φ(x))∇L_k(x)/∥∇L_k(x)∥₂∥₂ + ρζξ∥x − Φ(x)∥₂ + ρν∥x − Φ(x)∥₂
≤ ρ(ζξ/µ)∥∇L(x)∥₂ + ρ(ν/µ)∥∇L(x)∥₂.

This completes the proof.

G ANALYSIS FOR EXPLICIT BIAS

Throughout this section, we assume that Assumption 3.2 holds.

G.1 A GENERAL THEOREM FOR EXPLICIT BIAS IN THE LIMIT CASE

In this subsection, we provide the proof details for Section 4.1, which shows that the explicit biases of the three notions of sharpness are all different, using our new mathematical tool, Theorem G.6.

Notation for Regularizers. Let R_ρ : R^D → R ∪ {∞} be a family of regularizers parameterized by ρ. If R_ρ is not well-defined at some x, then we set R_ρ(x) = ∞. This convention is useful when analyzing the ascent-direction sharpness R^Asc_ρ = L^Asc_ρ − L, which is not defined when ∇L(x) = 0, and it does not change the minimizers of the regularized loss. Intuitively, a regularizer should always be non-negative; however, far away from the manifold, some regularizers R_ρ(x) of interest can actually be negative, e.g., R^Avg_ρ(x) ≈ (ρ²/(2D))Tr(∇²L(x)). Therefore, we make the following assumption to allow the regularizer to be mildly negative.

Condition G.1. For any bounded closed set B ⊂ U, there exists C > 0 such that for sufficiently small ρ, ∀x ∈ B, R_ρ(x) ≥ −Cρ².

Definition 4.3 (Limiting Regularizer). We define the limiting regularizer of {R_ρ} as the function S : Γ → R,

S(x) = lim_{ρ→0} lim_{r→0} inf_{∥x′−x∥₂≤r} R_ρ(x′)/ρ².

The high-level intuition is that the limiting regularizer captures the explicit bias of R_ρ on the manifold of minimizers Γ as ρ → 0, which is decided by the second-order term in the Taylor expansion, e.g., Equation 5 and Equation 6. In other words, the hope is that whenever the regularized loss is optimized, the final solution should be in a neighborhood of a minimizer x with the smallest value of the limiting regularizer S(x). However, such a hope cannot be realized without further assumptions, which motivates the following definition of a good limiting regularizer.

Definition G.2 (Good Limiting Regularizer).
We say the limiting regularizer S of {R_ρ} is good around some x* ∈ Γ if S is non-negative and continuous at x*, and there is an open set V_{x*} containing x* such that for any C > 0, inf_{x′:∥x′−x∥₂≤Cρ} R_ρ(x′)/ρ² converges uniformly to S(x) over x ∈ Γ ∩ V_{x*} as ρ → 0. In other words, a good limiting regularizer satisfies that for any C, ϵ > 0, there is some ρ_{x*} > 0 such that ∀x ∈ Γ ∩ V_{x*} and ρ ≤ ρ_{x*},

S(x) − inf_{∥x′−x∥₂≤Cρ} R_ρ(x′)/ρ² < ϵ.

We say the limiting regularizer S is good on Γ if S is good around every point x ∈ Γ. In this case we also say R_ρ admits S as a good limiting regularizer on Γ. The intuition behind the concept of a good limiting regularizer is that the value of the regularizer should not drop too fast when moving away from a minimizer x within its O(ρ) neighborhood. Otherwise, the minimizer of the regularized loss may be Ω(ρ) away from every minimizer, reducing the regularizer at the cost of increasing the original loss, which makes the limiting regularizer unable to capture the explicit bias of the regularizer (see Appendix G.2 for a counterexample). We emphasize that the conditions of a good limiting regularizer are natural and cover a large family of regularizers, including worst-, ascent- and average-direction sharpness; see Theorems G.3 to G.5 below.

Theorem G.3. Worst-direction sharpness R^Max_ρ admits S(x) = λ₁(∇²L(x))/2 as a good limiting regularizer on Γ.

Theorem G.4. Ascent-direction sharpness R^Asc_ρ admits S(x) = λ_M(∇²L(x))/2 as a good limiting regularizer on Γ.

Theorem G.5. Average-direction sharpness R^Avg_ρ admits S(x) = Tr(∇²L(x))/(2D) as a good limiting regularizer on Γ.

Next, we present the main mathematical tool to analyze the explicit bias of regularizers admitting good limiting regularizers, Theorem G.6.

Theorem G.6. Let U′ be any bounded open set whose closure satisfies cl(U′) ⊆ U and cl(U′) ∩ Γ = cl(U′ ∩ Γ). Then for any family of parameterized regularizers {R_ρ} admitting a good limiting regularizer S on Γ and satisfying Condition G.1, for sufficiently small ρ, it holds that

|inf_{x∈U′} (L(x) + R_ρ(x)) − inf_{x∈U′} L(x) − ρ² inf_{x∈U′∩Γ} S(x)| ≤ o(ρ²).
Moreover, for sufficiently small ρ, it holds uniformly for all u ∈ U′ that
L(u) + R_ρ(u) ≤ inf_{x∈U′} (L(x) + R_ρ(x)) + O(ρ²) =⇒ R_ρ(u)/ρ² − inf_{x∈U′∩Γ} S(x) ≥ −o(1).

Theorem G.6 says that minimizing the regularized loss L(u) + R_ρ(u) is not very different from minimizing the original loss L(u) and the regularizer R_ρ(u) respectively. To see this, we define the following optimality gaps:
A(u) ≜ L(u) + R_ρ(u) − inf_{x∈U′} (L(x) + R_ρ(x)) ≥ 0,
B(u) ≜ L(u) − inf_{x∈U′} L(x) ≥ 0,
C(u) ≜ R_ρ(u)/ρ² − inf_{x∈U′∩Γ} S(x),
and Theorem G.6 implies that A(u) − B(u) − ρ²C(u) = o(ρ²). Moreover, A(u) and B(u) are non-negative by definition, and C(u) ≥ −o(1) is almost non-negative whenever A(u) is O(ρ²)-approximately optimized. For the applications of interest in this paper, the good limiting regularizer S can be continuously extended to the entire space R^D; call the extension S̄. In such a case, the third optimality gap has an approximate alternative form which does not involve R_ρ, namely S̄(u) − inf_{x∈U′∩Γ} S(x). Corollary G.7 shows that minimizing the regularized loss L(u) + R_ρ(u) is equivalent to minimizing the limiting regularizer S̄(u) around the manifold of local minimizers Γ.

Corollary G.7. Under the setting of Theorem G.6, let S̄ be a continuous extension of S to R^D. For any optimality gap ∆ > 0, there is a function ϵ : R⁺ → R⁺ with lim_{ρ→0} ϵ(ρ) = 0, such that for all sufficiently small ρ > 0 and all u ∈ U′ satisfying L(u) + R_ρ(u) − inf_{x∈U′} (L(x) + R_ρ(x)) ≤ ∆ρ², it holds that L(u) − inf_{x∈U′} L(x) ≤ (∆ + ϵ(ρ))ρ² and that S̄(u) − inf_{x∈U′∩Γ} S(x) ∈ [−ϵ(ρ), ∆ + ϵ(ρ)].
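Corollary G.7 can be illustrated numerically. The following sketch is our own toy example, not from the paper: the loss L(u, v) = ½(1+v²)u² has minimizer manifold Γ = {u = 0}, worst-direction sharpness is approximated by sampling directions on a circle of radius ρ, and the regularized loss is minimized by brute-force grid search. The approximate minimizer should land near the flattest point of Γ, where the limiting regularizer S(v) = (1+v²)/2 is smallest.

```python
import numpy as np

def L(u, v):                               # toy loss; Γ = {u = 0}, λ1(∇²L)|Γ = 1 + v²
    return 0.5 * (1 + v**2) * u**2

def R_max(u, v, rho, n=256):               # worst-direction sharpness via direction sampling
    ang = np.linspace(0, 2 * np.pi, n, endpoint=False)
    return np.max(L(u + rho * np.cos(ang), v + rho * np.sin(ang)) - L(u, v))

rho = 1e-2
grid = [(L(u, v) + R_max(u, v, rho), u, v)
        for u in np.linspace(-0.05, 0.05, 101)
        for v in np.linspace(-1.0, 1.0, 101)]
best_val, u_star, v_star = min(grid)

# Theorem G.6 predicts: optimal value ≈ ρ² · inf S = ρ²/2, attained near (u, v) = (0, 0).
assert abs(v_star) <= 0.1 and abs(u_star) <= 2 * rho
assert abs(best_val - 0.5 * rho**2) < 0.2 * rho**2
```

The grid resolution and the sampled maximum are crude, but they suffice to see the ρ²-scale separation between different points of Γ that the limiting regularizer formalizes.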

G.2 BAD LIMITING REGULARIZERS MAY NOT CAPTURE EXPLICIT BIAS

In this subsection, we provide an example where a bad limiting regularizer fails to capture the explicit bias of the regularizer as ρ → 0, to justify the necessity of Definition G.2. Here a bad limiting regularizer is a limiting regularizer which is not good. Consider R_ρ(x) = L(x + ρe) − L(x), where e is a fixed unit vector, ∥e∥_2 = 1. We will show that minimizing the regularized loss L(x) + R_ρ(x) does not imply minimizing the limiting regularizer of R_ρ on the manifold.

By Definition 4.3 and the continuity of R_ρ, the limiting regularizer S of R_ρ is, for all x ∈ Γ (where ∇L(x) = 0),
S(x) = lim_{ρ→0} lim_{r→0} inf_{∥x′−x∥_2≤r} R_ρ(x′)/ρ² = lim_{ρ→0} R_ρ(x)/ρ² = ∇²L(x)[e, e]/2 ≥ 0.
However, for any x ∈ Γ, we can choose x′ = x − ρe; then L(x′) + R_ρ(x′) = L(x′ + ρe) = L(x) = 0. Therefore, no matter how small ρ is, minimizing L(x) + R_ρ(x) can return a solution which is ρ-close to any point of Γ. In other words, the explicit bias of minimizing L(x) + R_ρ(x) is trivial and thus is not equivalent to minimizing the limiting regularizer S on the manifold Γ.

The reason behind the inefficacy of the limiting regularizer S in explaining the explicit bias of R_ρ is that S is not a good limiting regularizer around any x ∈ Γ satisfying S(x) > 0. To be concrete, choose C = 1 and ϵ = S(x)/2 in Definition G.2. For any x ∈ Γ and sufficiently small ρ > 0, consider x′ = x − ρe. Then
R_ρ(x′) = L(x′ + ρe) − L(x′) = L(x) − L(x − ρe) = −L(x − ρe) ≤ 0,
since L is non-negative near Γ and L(x) = 0. This implies inf_{∥x′−x∥_2≤Cρ} R_ρ(x′) ≤ 0. Hence,
S(x) − inf_{∥x′−x∥_2≤Cρ} R_ρ(x′)/ρ² ≥ S(x) > S(x)/2 = ϵ,
violating the uniform convergence required by Definition G.2.
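The counterexample can be checked numerically in one dimension. In this sketch (our own instantiation, not from the paper: L(x) = x²/2, e = 1, so Γ = {0} and the limiting regularizer value is S(0) = 1/2), the regularized loss L(x) + R_ρ(x) = L(x + ρ) is minimized ρ away from Γ, at a value far below ρ²·S(0).

```python
import numpy as np

L = lambda x: 0.5 * x**2            # Γ = {0}; S(0) = L''(0)/2 = 1/2
rho = 1e-3
xs = np.linspace(-0.01, 0.01, 20001)
reg = L(xs + rho) - L(xs)            # R_rho(x) = L(x + rho·e) - L(x) with e = 1
total = L(xs) + reg                  # equals L(x + rho), minimized at x = -rho
x_star = xs[np.argmin(total)]

assert abs(x_star + rho) < 1e-4      # minimizer sits distance rho from Γ = {0}
assert total.min() < 0.1 * rho**2    # far below rho² · S(0) = rho²/2
```

This is exactly the failure mode Definition G.2 rules out: the regularizer collapses to zero at distance O(ρ) from the manifold.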

G.3 PROOF OF THEOREM G.6

This subsection aims to prove Theorem G.6. We start with a few lemmas that will be used later.

Lemma G.8. Γ = U ∩ cl(Γ).

Proof of Lemma G.8. For any point x ∈ U ∩ cl(Γ), there exists a sequence {x_k}_{k=1}^∞ ⊆ Γ with lim_{k→∞} x_k = x. Since x ∈ U and Φ is continuous on U, it holds that lim_{k→∞} Φ(x_k) = Φ(x) ∈ Γ. But Φ(x_k) = x_k because x_k ∈ Γ for all k. Thus x = Φ(x) ∈ Γ, and hence U ∩ cl(Γ) ⊆ Γ. The other inclusion is clear because Γ ⊆ U and Γ ⊆ cl(Γ).

Lemma G.9. Let U′ be any bounded open set such that its closure cl(U′) ⊆ U. If cl(U′) ∩ Γ ⊆ cl(U′ ∩ Γ), then cl(U′) ∩ Γ = cl(U′ ∩ Γ).

Proof of Lemma G.9. By Lemma G.8, it holds that cl(U′) ∩ Γ = cl(U′) ∩ U ∩ cl(Γ) = cl(U′) ∩ cl(Γ). Note that cl(U′ ∩ Γ) ⊆ cl(U′) and cl(U′ ∩ Γ) ⊆ cl(Γ), so cl(U′ ∩ Γ) ⊆ cl(U′) ∩ cl(Γ) = cl(U′) ∩ Γ, which completes the proof.

Lemma G.10. Let U′ be any bounded open set such that cl(U′) ⊆ U and cl(U′) ∩ Γ ⊆ cl(U′ ∩ Γ). Then for all h_2 > 0 there exists ρ_0 > 0 such that for any x ∈ U′, dist(x, Γ) ≤ ρ_0 implies dist(x, U′ ∩ Γ) ≤ h_2.

Proof of Lemma G.10. We prove by contradiction. Suppose there exist h_2 > 0 and {x_k}_{k=1}^∞ ⊆ U′ such that lim_{k→∞} dist(x_k, Γ) = 0 but dist(x_k, U′ ∩ Γ) ≥ h_2 for all k > 0. Since U′ is bounded, cl(U′) is compact and thus {x_k} has at least one accumulation point x* ∈ cl(U′) ⊆ U. Since U is the attraction set of Γ under gradient flow, we know Φ(x*) ∈ Γ. We claim x* ∈ Γ: because lim_{k→∞} dist(x_k, Γ) = 0, there exists a sequence {y_k}_{k=1}^∞ ⊆ Γ with lim_{k→∞} ∥x_k − y_k∥ = 0, so along the subsequence converging to x* we have x* = lim_{k→∞} y_k = lim_{k→∞} Φ(y_k) = Φ(x*), where the last step uses x* ∈ U and the continuity of Φ on U. Thus x* ∈ cl(U′) ∩ Γ ⊆ cl(U′ ∩ Γ), which contradicts dist(x_k, U′ ∩ Γ) ≥ h_2 for all k > 0. This completes the proof.

Lemma G.11. Let U′ be any bounded open set such that cl(U′) ⊆ U and cl(U′) ∩ Γ ⊆ cl(U′ ∩ Γ).
Then for all h_2 > 0 there exists ρ_1 > 0 such that for any x ∈ U′, L(x) ≤ inf_{x′∈U′} L(x′) + ρ_1 implies dist(x, U′ ∩ Γ) ≤ h_2.

Proof of Lemma G.11. We prove by contradiction. Suppose there exist a sequence ρ_k → 0 and points x_k ∈ U′ such that L(x_k) ≤ inf_{x∈U′} L(x) + ρ_k and dist(x_k, U′ ∩ Γ) ≥ h_2. Since U′ is bounded, cl(U′) is compact and thus {x_k}_{k=1}^∞ has at least one accumulation point x* ∈ cl(U′) ⊆ U. Since L is continuous on U, along the corresponding subsequence L(x*) = lim_{k→∞} L(x_k) = inf_{x∈U′} L(x). Thus x* is a local minimizer of L and has zero gradient, which further implies x* = Φ(x*). Thus x* ∈ cl(U′) ∩ Γ ⊆ cl(U′ ∩ Γ), which contradicts dist(x_k, U′ ∩ Γ) ≥ h_2 for all k > 0. This completes the proof.

Lemma G.12. Let U′ be any bounded open set such that cl(U′) ⊆ U and cl(U′) ∩ Γ = cl(U′ ∩ Γ). Suppose the family {R_ρ} admits a limiting regularizer S on Γ. Then inf_{x∈U′} (L(x) + R_ρ(x)) ≤ ρ² inf_{x∈U′∩Γ} S(x) + inf_{x∈U′} L(x) + o(ρ²).

Proof of Lemma G.12. First choose ρ sufficiently small such that ρ < h(U′ ∩ Γ). Choose an approximate minimizer x_0 ∈ U′ ∩ Γ of S such that S(x_0) ≤ inf_{x∈U′∩Γ} S(x) + ρ². Then by the definition of limiting regularizers (Definition 4.3) and the openness of U′, there exists x_1 ∈ U′ satisfying ∥x_1 − x_0∥_2 ≤ ρ² and R_ρ(x_1)/ρ² − S(x_0) ≤ ρ². Thus R_ρ(x_1) ≤ ρ²S(x_0) + ρ⁴. As ∥x_1 − x_0∥_2 ≤ ρ² < h and x_0 ∈ U′ ∩ Γ, the segment from x_0 to x_1 lies in (U′ ∩ Γ)^h. By Taylor expansion of L at x_0 (note ∇L(x_0) = 0), we have L(x_1) ≤ L(x_0) + O(∥x_0 − x_1∥_2²) = inf_{x∈U′} L(x) + O(ρ⁴). Thus it holds that
inf_{x∈U′} (L(x) + R_ρ(x)) ≤ L(x_1) + R_ρ(x_1) ≤ ρ² inf_{x∈U′∩Γ} S(x) + inf_{x∈U′} L(x) + O(ρ⁴).
This completes the proof.

Lemma G.13. Let U′ be any bounded open set such that cl(U′) ⊆ U and cl(U′) ∩ Γ = cl(U′ ∩ Γ).
Suppose the family {R_ρ} admits a good limiting regularizer S on Γ. Then for all u ∈ U′, ∥u − Φ(u)∥_2 = O(ρ) =⇒ R_ρ(u) ≥ ρ² inf_{x∈U′∩Γ} S(x) − o(ρ²).

Proof of Lemma G.13. Define r = r(K), h = h(K) as the constants in Lemma E.5 with K = cl(U′ ∩ Γ). Note K is compact and, by Lemma G.9, K = cl(U′) ∩ Γ ⊂ Γ. By Lemma E.5, K^r ∩ Γ is a compact set, and so is K^h ∩ Γ. Since S is a good limiting regularizer for {R_ρ}, by Definition G.2, for any x* ∈ K^h ∩ Γ there exists an open neighborhood V_{x*} of x* such that for any C, ϵ_1 > 0 there is a ρ_{x*} > 0 such that ∀x ∈ V_{x*} ∩ Γ and ρ ≤ ρ_{x*}, S(x) − inf_{∥x′−x∥_2≤Cρ} R_ρ(x′)/ρ² < ϵ_1. Since K^h ∩ Γ is compact, there exists a finite subset {x_k}_k of K^h ∩ Γ such that K^h ∩ Γ ⊂ ∪_k V_{x_k}. Hence for any C, ϵ_1 > 0, with ρ_K = min_k ρ_{x_k} > 0 it holds that, ∀x ∈ K^h ∩ Γ and ρ ≤ ρ_K,
S(x) − inf_{∥x′−x∥_2≤Cρ} R_ρ(x′)/ρ² < ϵ_1. (18)
We can rewrite Equation 18 as: for any C > 0,
sup_{x∈K^h∩Γ} ( S(x) − inf_{∥x′−x∥_2≤Cρ} R_ρ(x′)/ρ² ) = o(1), as ρ → 0. (19)
As u ∈ U′ ⊆ U, we have Φ(u) ∈ Γ. If ∥u − Φ(u)∥_2 = O(ρ), then dist(u, Γ) ≤ O(ρ). By Lemma G.10, we have dist(u, K) = o(1). This further implies dist(Φ(u), K) ≤ dist(u, K) + dist(Φ(u), u) = o(1). Hence Φ(u) ∈ K^h ∩ Γ for sufficiently small ρ. Thus we can pick x = Φ(u) in Equation 19 and C sufficiently large, which yields
ρ²S(Φ(u)) ≤ inf_{∥u′−Φ(u)∥_2≤O(ρ)} R_ρ(u′) + o(ρ²) ≤ R_ρ(u) + o(ρ²), (20)
where the last step is because ∥u − Φ(u)∥_2 = O(ρ). On the other hand, we have
S(Φ(u)) ≥ inf_{x∈U′∩Γ} S(x) − o(1), (21)
as S is continuous on Γ and dist(U′ ∩ Γ, Φ(u)) = o(1). Combining Equations 20 and 21, we have R_ρ(u) ≥ ρ² inf_{x∈U′∩Γ} S(x) − o(ρ²).

Proof of Theorem G.6. We will first lower bound L(x) + R_ρ(x) for x ∈ U′. Let C_{U′} be the constant in Condition G.1 for B = cl(U′). Define C_1 = sqrt( 2(C_{U′} + inf_{x∈U′∩Γ} S(x) + 1)/µ ). We discuss by cases. For sufficiently small ρ:
1. If x ∉ K^h, then ∥x − Φ(x)∥_2 ≥ C_1 ρ for sufficiently small ρ, so L(x) ≥ µ∥x − Φ(x)∥_2²/2 ≥ (C_{U′} + inf_{x∈U′∩Γ} S(x) + 1)ρ². Together with Condition G.1, this implies L(x) + R_ρ(x) ≥ (inf_{x∈U′∩Γ} S(x) + 1)ρ² + inf_{x∈U′} L(x).

2. If x ∈ K^h but ∥x − Φ(x)∥_2 ≥ C_1 ρ, then by the µ-PL property, L(x) ≥ µ∥x − Φ(x)∥_2²/2 ≥ µC_1²ρ²/2 = (C_{U′} + inf_{x∈U′∩Γ} S(x) + 1)ρ², and the same bound as in the first case follows.

3. If ∥x − Φ(x)∥_2 ≤ C_1 ρ, then by Lemma G.13, R_ρ(x) ≥ ρ² inf_{x∈U′∩Γ} S(x) − o(ρ²); hence L(x) + R_ρ(x) + o(ρ²) ≥ inf_{x∈U′∩Γ} S(x)ρ² + inf_{x∈U′} L(x).

Combining the three cases, we have inf_{x∈U′} (L(x) + R_ρ(x)) ≥ inf_{x∈U′} L(x) + inf_{x∈U′∩Γ} S(x)ρ² − o(ρ²). By Lemma G.12, we have inf_{x∈U′} (L(x) + R_ρ(x)) ≤ ρ² inf_{x∈U′∩Γ} S(x) + inf_{x∈U′} L(x) + o(ρ²). Combining the above two inequalities proves the main statement of Theorem G.6.

Furthermore, if L(u) + R_ρ(u) ≤ inf_{x∈U′} (L(x) + R_ρ(x)) + O(ρ²), then by the main statement and Condition G.1, we have
L(u) − inf_{x∈U′} L(x) ≤ inf_{x∈U′} (L(x) + R_ρ(x)) − R_ρ(u) − inf_{x∈U′} L(x) + O(ρ²) ≤ ρ² inf_{x∈U′∩Γ} S(x) + Cρ² + O(ρ²) = O(ρ²).
Then by Lemma G.11, u ∈ (U′ ∩ Γ)^h for sufficiently small ρ. By Lemma F.1, ∥u − Φ(u)∥_2 = O(ρ). By Lemma G.13, R_ρ(u) ≥ ρ² inf_{x∈U′∩Γ} S(x) − o(ρ²), which proves the second claim.

G.4 PROOF OF COROLLARY G.7

Proof of Corollary G.7. Since L(u) + R_ρ(u) − inf_{x∈U′} (L(x) + R_ρ(x)) ≤ ∆ρ² = O(ρ²), by Theorem G.6 we have L(u) − inf_{x∈U′} L(x) ≤ (∆ + o(1))ρ² and R_ρ(u)/ρ² − inf_{x∈U′∩Γ} S(x) ∈ [−o(1), ∆ + o(1)]. Thus it suffices to show S̄(u) − R_ρ(u)/ρ² ≤ o(1) and the matching lower bound. Since L(u) − inf_{x∈U′} L(x) ≤ (∆ + ϵ(ρ))ρ² = o(1), by Lemma G.11 we know dist(u, U′ ∩ Γ) = o(1). Thus by Lemma F.1, ∥u − Φ(u)∥_2 = o(1), which implies ρ²S(Φ(u)) − o(ρ²) ≤ R_ρ(u). Since S̄ is a continuous extension of S, S̄(u) − S(Φ(u)) = o(1). Thus we conclude that S̄(u) ≤ S(Φ(u)) + o(1) ≤ inf_{x∈U′∩Γ} S(x) + ∆ + o(1). On the other hand, S̄(u) ≥ S(Φ(u)) − o(1) ≥ inf_{x∈U′∩Γ} S(x) − o(1), where the last step uses dist(Φ(u), U′ ∩ Γ) = o(1) and the continuity of S. This completes the proof.
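The limiting-regularizer values established in Theorems G.3 to G.5 below (λ_1/2 for worst-direction, λ_M/2 for ascent-direction, and Tr/2D for average-direction sharpness) can be sanity-checked at finite ρ. The following sketch uses a toy quadratic loss of our own choosing (H = diag(2, 1, 0.5, 0), so D = 4, M = 3), with the worst direction approximated by sampling the unit sphere:

```python
import numpy as np

H = np.diag([2.0, 1.0, 0.5, 0.0])   # Hessian on Γ: lam1 = 2, lam_M = 0.5, Tr = 3.5, D = 4
L = lambda x: 0.5 * x @ H @ x
rho = 1e-3

dirs = np.random.default_rng(0).standard_normal((20000, 4))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Worst-direction sharpness at x = 0 ∈ Γ: maximum over the radius-rho sphere.
r_max = max(L(rho * d) for d in dirs) / rho**2
assert abs(r_max - 2.0 / 2) < 0.02            # ≈ lam1 / 2

# Average-direction sharpness at x = 0: expectation over the sphere.
r_avg = np.mean([L(rho * d) for d in dirs]) / rho**2
assert abs(r_avg - 3.5 / 8) < 0.02            # ≈ Tr(H) / (2D)

# Ascent-direction sharpness, approaching Γ along the eigendirection of lam_M:
x = 1e-8 * np.array([0.0, 0.0, 1.0, 0.0])
g = H @ x
r_asc = (L(x + rho * g / np.linalg.norm(g)) - L(x)) / rho**2
assert abs(r_asc - 0.5 / 2) < 0.01            # ≈ lam_M / 2
```

The three finite-ρ values separate cleanly, matching the claim that the three notions of sharpness have different explicit biases.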

G.5 LIMITING REGULARIZERS FOR DIFFERENT NOTIONS OF SHARPNESS

Proof of Theorem G.3. 1. We will first verify Condition G.1. For fixed compact set B ⊂ U , as ∥∇ 3 L(x)∥ 2 is continuous, there exists constant ν, such that ∀x ∈ B 1 , ∥∇ 3 L(x)∥ 2 ≤ ν. Then by Taylor Expansion, R Max ρ (x) = max ∥v∥ 2 ≤1 L(x + ρv) -L(x) ≥ max ∥v∥ 2 ≤1 ρ⟨∇L(x), v⟩ + ρ 2 v T ∇ 2 L(x)v/2 -νρ 3 /6 ≥ -νρ 3 /6 .

2. Next we verify that S^Max(x) = λ_1(∇²L(x))/2 is the limiting regularizer of R^Max_ρ. Let x be any point in Γ; by continuity of R^Max_ρ,
lim_{ρ→0} lim_{r→0} inf_{∥x′−x∥_2≤r} R^Max_ρ(x′)/ρ² = lim_{ρ→0} R^Max_ρ(x)/ρ² = λ_1(∇²L(x))/2.

3. Finally we verify the definition of a good limiting regularizer. By Assumption 3.2, S^Max(x) = λ_1(∇²L(x))/2 is non-negative and continuous on Γ. For any x* ∈ Γ, choose a sufficiently small open convex set V containing x* such that ∀x ∈ V, ∥∇³L(x)∥_2 ≤ ν. For any x ∈ V ∩ Γ and any x′ satisfying ∥x′ − x∥_2 ≤ Cρ, by Theorem K.3,
R^Max_ρ(x′) = max_{∥v∥_2≤1} L(x′ + ρv) − L(x′) ≥ max_{∥v∥_2≤1} ( ρ⟨∇L(x′), v⟩ + ρ²vᵀ∇²L(x′)v/2 ) − νρ³/6 ≥ ρ²λ_1(∇²L(x′))/2 − νρ³/6 ≥ ρ²λ_1(∇²L(x))/2 − O(ρ³).

This implies inf_{∥x′−x∥_2≤Cρ} R^Max_ρ(x′) ≥ ρ²λ_1(∇²L(x))/2 − O(ρ³). On the other hand, for any x ∈ V ∩ Γ,
R^Max_ρ(x) = max_{∥v∥_2≤1} L(x + ρv) − L(x) ≤ max_{∥v∥_2≤1} ( ρ⟨∇L(x), v⟩ + ρ²vᵀ∇²L(x)v/2 ) + νρ³ = max_{∥v∥_2≤1} ρ²vᵀ∇²L(x)v/2 + νρ³ = ρ²λ_1(∇²L(x))/2 + O(ρ³),
hence inf_{∥x′−x∥_2≤Cρ} R^Max_ρ(x′) ≤ ρ²λ_1(∇²L(x))/2 + O(ρ³). Thus we conclude that
| inf_{∥x′−x∥_2≤Cρ} R^Max_ρ(x′)/ρ² − λ_1(∇²L(x))/2 | = O(ρ), ∀x ∈ V ∩ Γ,
indicating S^Max is a good limiting regularizer of R^Max_ρ on Γ. This completes the proof.

Proof of Theorem G.4. 1. We will first prove that Condition G.1 holds. For any fixed compact set B ⊂ U, as λ_1(∇²L(·)) and ∥∇³L(·)∥ are continuous, there exist constants ζ, ν such that ∀x ∈ B, ∥∇²L(x)∥_2 ≤ ζ and ∥∇³L(x)∥ ≤ ν. Then by Taylor expansion,
R^Asc_ρ(x) = L(x + ρ∇L(x)/∥∇L(x)∥) − L(x) ≥ ρ∥∇L(x)∥_2 + (ρ²/2)(∇L(x)/∥∇L(x)∥)ᵀ∇²L(x)(∇L(x)/∥∇L(x)∥) − νρ³/6 ≥ −(ζ/2 + ν/6)ρ².

2. Next we verify that S^Asc(x) = λ_M(∇²L(x))/2 is the limiting regularizer of R^Asc_ρ. Let x be any point in Γ. Let K = {x} and choose h = h(K) as in Lemma E.5. For any x′ ∈ K^h ∩ U′,
R^Asc_ρ(x′) = L(x′ + ρ∇L(x′)/∥∇L(x′)∥) − L(x′) ≥ ρ∥∇L(x′)∥_2 + (ρ²/2)(∇L(x′)/∥∇L(x′)∥)ᵀ∇²L(x′)(∇L(x′)/∥∇L(x′)∥) − νρ³/6 ≥ (ρ²/2)(∇L(x′)/∥∇L(x′)∥)ᵀ∇²L(Φ(x′))(∇L(x′)/∥∇L(x′)∥) − ζρ²O(∥x′ − Φ(x′)∥_2) − νρ³/6.
By Lemma F.4, we have ∇L(x′)/∥∇L(x′)∥ = ∇²L(Φ(x′))(x′ − Φ(x′))/∥∇²L(Φ(x′))(x′ − Φ(x′))∥_2 + O((ν/µ)∥x′ − Φ(x′)∥_2). Since this direction lies, up to an O(∥x′ − Φ(x′)∥_2) correction, in the span of the nonzero eigendirections of ∇²L(Φ(x′)), whose smallest nonzero eigenvalue is λ_M, we get
R^Asc_ρ(x′) ≥ ρ²λ_M(∇²L(Φ(x′)))/2 − ζρ²O(∥x′ − Φ(x′)∥_2) − νρ³/6.
This implies lim_{ρ→0} lim_{r→0} inf_{∥x′−x∥_2≤r} R^Asc_ρ(x′)/ρ² ≥ λ_M(∇²L(x))/2. We now show the above inequality is in fact an equality. Choose x′′_r = x + r v_M, where v_M is the eigenvector of ∇²L(x) for λ_M. Then by Taylor expansion,
∇L(x′′_r) = ∇L(x) + ∇²L(x)(x′′_r − x) + O(∥x′′_r − x∥²) = rλ_M v_M + O(r²).
This implies lim_{r→0} ∇L(x′′_r)/∥∇L(x′′_r)∥ = v_M. We also have lim_{r→0} ∇²L(x′′_r) = ∇²L(x) and lim_{r→0} ∇L(x′′_r) = 0.
Putting everything together,
lim_{r→0} R^Asc_ρ(x′′_r) = lim_{r→0} [ L(x′′_r + ρ∇L(x′′_r)/∥∇L(x′′_r)∥) − L(x′′_r) ] = lim_{r→0} [ ρ∥∇L(x′′_r)∥_2 + (ρ²/2)(∇L(x′′_r)/∥∇L(x′′_r)∥)ᵀ∇²L(x′′_r)(∇L(x′′_r)/∥∇L(x′′_r)∥) ] + O(νρ³) = ρ²λ_M(∇²L(x))/2 + O(ρ³).
This implies lim_{ρ→0} lim_{r→0} inf_{∥x′−x∥_2≤r} R^Asc_ρ(x′)/ρ² ≤ lim_{ρ→0} lim_{r→0} R^Asc_ρ(x′′_r)/ρ² = λ_M(∇²L(x))/2. Hence the limiting regularizer S^Asc is exactly λ_M(∇²L(·))/2.

3. Finally we verify the definition of a good limiting regularizer. By Assumption 3.2, S^Asc(x) = λ_M(∇²L(x))/2 is non-negative and continuous on Γ. For any x* ∈ Γ, choose a sufficiently small open convex set V containing x* such that ∀x ∈ V, ∥∇³L(x)∥_2 ≤ ν. For any x ∈ V ∩ Γ and any x′ satisfying ∥x′ − x∥_2 ≤ Cρ,
R^Asc_ρ(x′) = L(x′ + ρ∇L(x′)/∥∇L(x′)∥) − L(x′) ≥ ρ∥∇L(x′)∥_2 + (ρ²/2)(∇L(x′)/∥∇L(x′)∥)ᵀ∇²L(x′)(∇L(x′)/∥∇L(x′)∥) − νρ³/6 ≥ (ρ²/2)(∇L(x′)/∥∇L(x′)∥)ᵀ∇²L(Φ(x′))(∇L(x′)/∥∇L(x′)∥) − O(ρ³).
By Lemma F.4, we have ∇L(x′)/∥∇L(x′)∥ = ∇²L(Φ(x′))(x′ − Φ(x′))/∥∇²L(Φ(x′))(x′ − Φ(x′))∥_2 + O((ν/µ)∥x′ − Φ(x′)∥_2). This implies inf_{∥x′−x∥_2≤Cρ} R^Asc_ρ(x′) ≥ ρ²λ_M(∇²L(x))/2 − O(ρ³). On the other hand, similar to the proof of the second part, we have inf_{∥x′−x∥_2≤Cρ} R^Asc_ρ(x′) ≤ ρ²λ_M(∇²L(x))/2 + O(ρ³). Thus we conclude that
| inf_{∥x′−x∥_2≤Cρ} R^Asc_ρ(x′)/ρ² − λ_M(∇²L(x))/2 | = O(ρ), ∀x ∈ V ∩ Γ,
indicating S^Asc is a good limiting regularizer of R^Asc_ρ on Γ. This completes the proof.

Proof of Theorem G.5. 1. We will first verify Condition G.1. For a fixed compact set B ⊂ U, as ∥∇³L(x)∥_2 is continuous, there exists a constant ν such that ∀x ∈ B, ∥∇³L(x)∥_2 ≤ ν. Then by Taylor expansion,
R^Avg_ρ(x) = E_{g∼N(0,I)} [ L(x + ρg/∥g∥) − L(x) ] ≥ E_{g∼N(0,I)} [ ρ⟨∇L(x), g/∥g∥⟩ + (ρ²/2)(g/∥g∥)ᵀ∇²L(x)(g/∥g∥) ] − νρ³/6 = ρ² Tr(∇²L(x))/(2D) − νρ³/6 ≥ −Cρ²
for some C > 0, where the expectation of the first-order term vanishes by symmetry and Tr(∇²L(·)) is bounded on B.

2. Next we verify that S^Avg(x) = Tr(∇²L(x))/(2D) is the limiting regularizer of R^Avg_ρ. Let x be any point in Γ; by continuity of R^Avg_ρ,
lim_{ρ→0} lim_{r→0} inf_{∥x′−x∥_2≤r} R^Avg_ρ(x′)/ρ² = lim_{ρ→0} R^Avg_ρ(x)/ρ² = Tr(∇²L(x))/(2D).

3. Finally we verify the definition of a good limiting regularizer. By Assumption 3.2, S^Avg(x) = Tr(∇²L(x))/(2D) is non-negative and continuous on Γ. For any x* ∈ Γ, choose a sufficiently small open convex set V containing x* such that ∀x ∈ V, ∥∇³L(x)∥_2 ≤ ν. For any x ∈ V ∩ Γ and any x′ satisfying ∥x′ − x∥_2 ≤ Cρ, by Theorem K.3,
R^Avg_ρ(x′) = E_{g∼N(0,I)} [ L(x′ + ρg/∥g∥) − L(x′) ] ≥ E_{g∼N(0,I)} [ ρ⟨∇L(x′), g/∥g∥⟩ + (ρ²/2)(g/∥g∥)ᵀ∇²L(x′)(g/∥g∥) ] − νρ³/6 ≥ ρ² Tr(∇²L(x′))/(2D) − νρ³/6 ≥ ρ² Tr(∇²L(x))/(2D) − O(ρ³).

This implies inf_{∥x′−x∥_2≤Cρ} R^Avg_ρ(x′) ≥ ρ² Tr(∇²L(x))/(2D) − O(ρ³). On the other hand, for any x ∈ V ∩ Γ,
R^Avg_ρ(x) = E_{g∼N(0,I)} [ L(x + ρg/∥g∥) − L(x) ] ≤ E_{g∼N(0,I)} [ ρ⟨∇L(x), g/∥g∥⟩ + (ρ²/2)(g/∥g∥)ᵀ∇²L(x)(g/∥g∥) ] + νρ³ = ρ² Tr(∇²L(x))/(2D) + O(ρ³),
hence inf_{∥x′−x∥_2≤Cρ} R^Avg_ρ(x′) ≤ ρ² Tr(∇²L(x))/(2D) + O(ρ³). Thus we conclude that
| inf_{∥x′−x∥_2≤Cρ} R^Avg_ρ(x′)/ρ² − Tr(∇²L(x))/(2D) | = O(ρ), ∀x ∈ V ∩ Γ,
indicating S^Avg is a good limiting regularizer of R^Avg_ρ on Γ. This completes the proof.

H ANALYSIS FOR SAM ON QUADRATIC LOSS (PROOF OF THEOREM 4.8)

Recall that Theorem 4.8 states (cf. Equation 11): for almost every x(0), x(t) converges in direction to v_1(A) up to a sign flip, and lim_{t→∞} ∥x(t)∥_2 = ηρλ_1(A)/(2 − ηλ_1(A)), provided ηλ_1(A) < 1.

Proof of Theorem 4.8. We first rewrite the iterate as x(t+1) = x(t) − ηAx(t) − ηρ A²x(t)/∥Ax(t)∥_2. Define x̄(t) ≜ ∇L(x(t))/ρ = Ax(t)/ρ, and we have
x̄(t+1) = x̄(t) − ηAx̄(t) − η A²x̄(t)/∥x̄(t)∥_2. (22)
We suppose A ∈ R^{D×D} and use λ_i, v_i to denote λ_i(A), v_i(A). Further, we define
P^{(j:D)} ≜ Σ_{i=j}^D v_i(A)v_i(A)ᵀ, I_j ≜ {x | ∥P^{(j:D)}x∥_2 ≤ ηλ_j²}, x̄_i(t) ≜ ⟨x̄(t), v_i⟩, S ≜ {t | ∥x̄(t)∥_2 ≤ ηλ_1²/(2 − ηλ_1), t > T_1}.
By Lemma H.1, I_j is an invariant set for the update rule in Equation 22. Our proof consists of two steps: first (Appendix H.1) we show that the iterates enter the invariant sets ∩_j I_j; then (Appendix H.2) we show that x̄(t) aligns with v_1 and ∥x̄(t)∥_2 converges to ηλ_1²/(2 − ηλ_1), which implies our final results.
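The conclusions of Theorem 4.8 are easy to verify numerically. The following sketch (the diagonal matrix A and the values of η and ρ are our own choices) runs SAM on the quadratic loss L(x) = ½xᵀAx and checks the predicted limit ∥x(t)∥_2 → ηρλ_1/(2 − ηλ_1) and the alignment with v_1:

```python
import numpy as np

rng = np.random.default_rng(0)
lams = np.array([1.0, 0.5, 0.2])           # eigenvalues of A; lam1 = 1
A = np.diag(lams)                           # WLOG diagonal, so v1 = e1
eta, rho = 0.1, 1e-2                        # eta * lam1 < 1, as the theorem requires
x = rng.standard_normal(3)

for _ in range(5000):                       # SAM update: x <- x - eta * ∇L(x + rho * ∇L/||∇L||)
    g = A @ x
    x = x - eta * A @ (x + rho * g / np.linalg.norm(g))

pred = eta * rho * lams[0] / (2 - eta * lams[0])     # predicted limit of ||x(t)||
assert abs(np.linalg.norm(x) - pred) / pred < 1e-2
assert abs(abs(x[0]) / np.linalg.norm(x) - 1.0) < 1e-2   # direction aligned with v1 (up to sign)
```

At the limit the iterate oscillates between ±∥x∥·v_1, which is the sign-flip behavior the theorem describes.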

H.1 ENTERING INVARIANT SET

In this subsection, we will prove the following three lemmas. 1. Lemma H.1 shows I_j is an invariant set for the update rule (Equation 22). 2. Lemma H.2 shows that under the update rule (Equation 22), the component ∥P^{(j:D)}x̄(t)∥_2 of any iterate not in I_j shrinks exponentially. 3. Lemma H.3 combines Lemmas H.1 and H.2 to show that for sufficiently large t, x̄(t) ∈ ∩_j I_j.

Lemma H.1. For t ≥ 0, if ηλ_1(A) < 1 and x̄(t) ∈ I_j, then x̄(t+1) ∈ I_j.

Proof of Lemma H.1. By Equation 22, we have P^{(j:D)}x̄(t+1) = (I − ηP^{(j:D)}A − ηP^{(j:D)}A²/∥x̄(t)∥_2) P^{(j:D)}x̄(t). Hence
∥P^{(j:D)}x̄(t+1)∥_2 ≤ ∥I − ηP^{(j:D)}A − ηP^{(j:D)}A²/∥x̄(t)∥_2∥_2 ∥P^{(j:D)}x̄(t)∥_2,
where the operator norm is taken on span{v_j, …, v_D}. Because x̄(t) ∈ I_j, we have ∥P^{(j:D)}x̄(t)∥_2 ≤ ηλ_j², and since ∥P^{(j:D)}x̄(t)∥_2 ≤ ∥x̄(t)∥_2, on this subspace
(1 − ηλ_j − ηλ_j²/∥P^{(j:D)}x̄(t)∥_2) I ⪯ (1 − ηλ_j − ηλ_j²/∥x̄(t)∥_2) I ⪯ I − ηP^{(j:D)}A − ηP^{(j:D)}A²/∥x̄(t)∥_2 ⪯ I.
Hence ∥I − ηP^{(j:D)}A − ηP^{(j:D)}A²/∥x̄(t)∥_2∥_2 ≤ max(1, ηλ_j + ηλ_j²/∥P^{(j:D)}x̄(t)∥_2 − 1). It follows that
∥P^{(j:D)}x̄(t+1)∥_2 ≤ max(∥P^{(j:D)}x̄(t)∥_2, ηλ_j² − (1 − ηλ_j)∥P^{(j:D)}x̄(t)∥_2) ≤ ηλ_j²,
where the last inequality is because 1 − ηλ_j ≥ 0. This is exactly the definition of x̄(t+1) ∈ I_j, and the proof is completed.

Lemma H.2. For t ≥ 0, if ηλ_1(A) < 1 and x̄(t) ∉ I_j, then
∥P^{(j:D)}x̄(t+1)∥_2 ≤ max(1 − ηλ_D − ηλ_D²/∥x̄(t)∥_2, ηλ_j) ∥P^{(j:D)}x̄(t)∥_2 (23) ≤ max(1 − ηλ_D, ηλ_j) ∥P^{(j:D)}x̄(t)∥_2.

Proof of Lemma H.2. Note that ∥P^{(j:D)}x̄(t+1)∥_2 ≤ ∥P^{(j:D)} − ηP^{(j:D)}A − ηP^{(j:D)}A²/∥x̄(t)∥_2∥_2 ∥P^{(j:D)}x̄(t)∥_2. As x̄(t) ∉ I_j, we have ∥x̄(t)∥_2 ≥ ∥P^{(j:D)}x̄(t)∥_2 > ηλ_j², hence ηP^{(j:D)}A²/∥x̄(t)∥_2 ⪯ ηP^{(j:D)}A²/(ηλ_j²) ⪯ P^{(j:D)}. This implies
−ηλ_j P^{(j:D)} ⪯ P^{(j:D)} − ηP^{(j:D)}A − ηP^{(j:D)}A²/∥x̄(t)∥_2 ⪯ (1 − ηλ_D − ηλ_D²/∥x̄(t)∥_2) P^{(j:D)}.
Hence ∥P^{(j:D)}x̄(t+1)∥_2 ≤ max(1 − ηλ_D − ηλ_D²/∥x̄(t)∥_2, ηλ_j) ∥P^{(j:D)}x̄(t)∥_2 ≤ max(1 − ηλ_D, ηλ_j) ∥P^{(j:D)}x̄(t)∥_2. This completes the proof.

Lemma H.3. Choosing T_1 = max_j ⌈ log( max(∥x̄(0)∥_2/(ηλ_j²), 1) ) / (−log max(1 − ηλ_D, ηλ_j)) ⌉, we have ∀t ≥ T_1 and D > j ≥ 1, x̄(t) ∈ I_j.

Proof of Lemma H.3. We prove by contradiction. Suppose there exist j ∈ [D] and T > T_1 such that x̄(T) ∉ I_j. By the invariance from Lemma H.1, it holds that ∀t ≤ T, x̄(t) ∉ I_j. Then by Lemma H.2,
∥P^{(j:D)}x̄(T)∥_2 ≤ max(1 − ηλ_D, ηλ_j)^T ∥P^{(j:D)}x̄(0)∥_2 ≤ ηλ_j²,
which contradicts x̄(T) ∉ I_j.
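Lemma H.3 can be checked numerically for the update rule in Equation 22. In this sketch (the diagonal A, η, and the initialization are our own choices), the iterate enters every invariant set I_j = {x : ∥P^{(j:D)}x∥_2 ≤ ηλ_j²} after enough steps:

```python
import numpy as np

rng = np.random.default_rng(1)
lams = np.array([1.0, 0.6, 0.3])
A = np.diag(lams)                           # diagonal, so P^(j:D) keeps coordinates j..D
eta = 0.2                                   # eta * lam1 < 1
xb = 5.0 * rng.standard_normal(3)           # \bar{x}(0), started far outside the I_j sets

for _ in range(2000):                       # update rule of Equation 22
    xb = xb - eta * A @ xb - eta * (A @ A @ xb) / np.linalg.norm(xb)

# Check membership in every invariant set I_j (python index j=0 is the paper's j=1).
for j in range(3):
    assert np.linalg.norm(xb[j:]) <= eta * lams[j]**2 + 1e-9
```

The tail components ∥P^{(j:D)}x̄∥_2 for j ≥ 2 in fact decay to (numerical) zero here, consistent with the alignment result of Appendix H.2.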

H.2 ALIGNMENT TO TOP EIGENVECTOR

In this subsection, we prove the following lemmas toward showing that x̄(t) converges in direction to v_1(A) up to a proper sign flip. 1. Corollary H.4 shows that for almost every learning rate η and initialization x_init, x̄_1(t) ≠ 0 for every t ≥ 0. This condition is important because if x̄_1(t) = 0 at some step t, then x̄_1(t′) = 0 for all t′ ≥ t, and alignment is impossible. We first prove that ∀t, x̄_1(t) ≠ 0 holds for almost every learning rate η and initialization x_init (Corollary H.4), using a much more general result (Theorem D.3).

Corollary H.4. Except for countably many η ∈ R⁺, for almost all initializations x_init = x(0), it holds that for every natural number t, x̄_1(t) ≠ 0.

Proof of Corollary H.4. Let F_n(x) ≡ F(x) ≜ A(x + ρAx/∥Ax∥_2) for all n ∈ N⁺, x ∈ R^D, and let Z = {x ∈ R^D | ⟨x, v_1⟩ = 0}. One can easily check that F is C¹ on R^D \ Z and that Z is a zero-measure set. Applying Theorem D.3 yields the corollary.

Lemma H.5. For t ≥ 0, if ∥x̄(t)∥_2 > ηλ_1²/(2 − ηλ_1) and x̄(t) ∈ ∩_j I_j, then
∥x̄(t+1)∥_2 ≤ max( ηλ_1²/(2 − ηλ_1) − ηλ_D⁴/(2λ_1²), ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2 ).

Proof of Lemma H.5. Note that
x̄(t+1) = (I − ηA − ηA²/∥x̄(t)∥_2) x̄(t) = (1/∥x̄(t)∥_2) Σ_{j=1}^D ((1 − ηλ_j)∥x̄(t)∥_2 − ηλ_j²) x̄_j(t) v_j.
Consider the following two cases.

1. If for every i, ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2 ≥ (1 − ηλ_i)∥x̄(t)∥_2 − ηλ_i², then every coefficient above is bounded in absolute value by ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2, so ∥x̄(t+1)∥_2 ≤ ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2.

2. If there exists i such that ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2 < (1 − ηλ_i)∥x̄(t)∥_2 − ηλ_i², suppose WLOG that i is the smallest such index. Equivalently, ∥x̄(t)∥_2 > (ηλ_1² + ηλ_i²)/(2 − ηλ_1 − ηλ_i). Combining with x̄(t) ∈ I_1 ⇒ ∥x̄(t)∥_2 ≤ ηλ_1², we have η < (λ_1 − λ_i)/λ_1². Now consider the following vectors:
v^{(1)}(t) ≜ (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2) x̄(t),
v^{(2)}(t) ≜ ((2 − ηλ_1 − ηλ_i)∥x̄(t)∥_2 − ηλ_i² − ηλ_1²) P^{(i:D)} x̄(t),
v^{(2+j)}(t) ≜ ((ηλ_{i+j−1} − ηλ_{i+j})∥x̄(t)∥_2 − ηλ_{i+j}² + ηλ_{i+j−1}²) P^{(i+j:D)} x̄(t), 1 ≤ j ≤ D − i.
Then we have
∥x̄(t+1)∥_2 = ∥(1/∥x̄(t)∥_2) Σ_{j=1}^D ((1 − ηλ_j)∥x̄(t)∥_2 − ηλ_j²) x̄_j(t) v_j∥_2
≤ ∥(1/∥x̄(t)∥_2) Σ_{j=1}^{i−1} (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2) x̄_j(t) v_j∥_2 + ∥(1/∥x̄(t)∥_2) Σ_{j=i}^D ((1 − ηλ_j)∥x̄(t)∥_2 − ηλ_j²) x̄_j(t) v_j∥_2
≤ (1/∥x̄(t)∥_2) Σ_{j=1}^{D+1−i} ∥v^{(j)}(t)∥_2.
By assumption, x̄(t) ∈ ∩_j I_j, hence
∥v^{(1)}(t)∥_2 = (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2) ∥x̄(t)∥_2,
∥v^{(2)}(t)∥_2 ≤ η((2 − ηλ_1 − ηλ_i)∥x̄(t)∥_2 − ηλ_i² − ηλ_1²) λ_i²,
∥v^{(2+j)}(t)∥_2 ≤ η((ηλ_{i+j−1} − ηλ_{i+j})∥x̄(t)∥_2 − ηλ_{i+j}² + ηλ_{i+j−1}²) λ_{i+j}², 1 ≤ j ≤ D − i.
Using the AM-GM inequality, λ_{i+j−1}λ_{i+j}² ≤ (λ_{i+j−1}³ + 2λ_{i+j}³)/3 and λ_{i+j−1}²λ_{i+j}² ≤ (λ_{i+j−1}⁴ + λ_{i+j}⁴)/2. Hence, for 1 ≤ j ≤ D − i,
∥v^{(2+j)}(t)∥_2 ≤ η((ηλ_{i+j−1} − ηλ_{i+j})∥x̄(t)∥_2 − ηλ_{i+j}² + ηλ_{i+j−1}²) λ_{i+j}² ≤ η²∥x̄(t)∥_2 (λ_{i+j−1}³ − λ_{i+j}³)/3 + η² (λ_{i+j−1}⁴ − λ_{i+j}⁴)/2,
and summing,
Σ_{j=1}^{D−i} ∥v^{(2+j)}(t)∥_2 ≤ η²∥x̄(t)∥_2 (λ_i³ − λ_D³)/3 + η² (λ_i⁴ − λ_D⁴)/2.
Putting things together,
∥x̄(t+1)∥_2 ≤ (1/∥x̄(t)∥_2) Σ_{j=1}^{D+1−i} ∥v^{(j)}(t)∥_2
≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − ηλ_i) + η²(λ_i³ − λ_D³)/3 − (1 − ηλ_1)∥x̄(t)∥_2 − η²λ_i²(λ_i² + λ_1²)/∥x̄(t)∥_2 + η²(λ_i⁴ − λ_D⁴)/(2∥x̄(t)∥_2)
≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − (1 − ηλ_1)∥x̄(t)∥_2 − η²λ_i²(λ_i²/2 + λ_1²)/∥x̄(t)∥_2 − η²λ_D⁴/(2∥x̄(t)∥_2)
≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − (1 − ηλ_1)∥x̄(t)∥_2 − η²λ_i²(λ_i²/2 + λ_1²)/∥x̄(t)∥_2 − ηλ_D⁴/(2λ_1²).
We further discuss three cases.

1. If ηλ_i(λ_i²/2 + λ_1²)/(1 − ηλ_1) < (ηλ_1² + ηλ_i²)/(2 − ηλ_1 − ηλ_i), then ∥x̄(t)∥_2 > (ηλ_1² + ηλ_i²)/(2 − ηλ_1 − ηλ_i) > ηλ_i(λ_i²/2 + λ_1²)/(1 − ηλ_1), and
∥x̄(t+1)∥_2 ≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − (1 − ηλ_1)∥x̄(t)∥_2 − η²λ_i²(λ_i²/2 + λ_1²)/∥x̄(t)∥_2 − ηλ_D⁴/(2λ_1²)
≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − (1 − ηλ_1)(ηλ_1² + ηλ_i²)/(2 − ηλ_1 − ηλ_i) − η²λ_i²(λ_i²/2 + λ_1²)(2 − ηλ_1 − ηλ_i)/(ηλ_1² + ηλ_i²) − ηλ_D⁴/(2λ_1²)
≤ ηλ_1²/(2 − ηλ_1) − ηλ_D⁴/(2λ_1²).
The second inequality is because (1 − ηλ_1)∥x̄(t)∥_2 + η²λ_i²(λ_i²/2 + λ_1²)/∥x̄(t)∥_2 increases monotonically in ∥x̄(t)∥_2 when ∥x̄(t)∥_2 > ηλ_i(λ_i²/2 + λ_1²)/(1 − ηλ_1). The last line is due to Lemma K.9.

2. If ηλ_1² ≥ ηλ_i(λ_i²/2 + λ_1²)/(1 − ηλ_1) ≥ (ηλ_1² + ηλ_i²)/(2 − ηλ_1 − ηλ_i), then
∥x̄(t+1)∥_2 ≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − (1 − ηλ_1)∥x̄(t)∥_2 − η²λ_i²(λ_i²/2 + λ_1²)/∥x̄(t)∥_2 − ηλ_D⁴/(2λ_1²)
≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − 2ηλ_i √((λ_1² + λ_i²/2)(1 − ηλ_1)) − ηλ_D⁴/(2λ_1²)
≤ ηλ_1²/(2 − ηλ_1) − ηλ_D⁴/(2λ_1²).
The second inequality is by the AM-GM inequality. The last line is due to Lemma K.11.

3. If ηλ_1² < ηλ_i(λ_i²/2 + λ_1²)/(1 − ηλ_1), then, using x̄(t) ∈ I_1, ∥x̄(t)∥_2 ≤ ηλ_1² < ηλ_i(λ_i²/2 + λ_1²)/(1 − ηλ_1), and
∥x̄(t+1)∥_2 ≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − (1 − ηλ_1)∥x̄(t)∥_2 − η²λ_i²(λ_i²/2 + λ_1²)/∥x̄(t)∥_2 − ηλ_D⁴/(2λ_1²)
≤ ηλ_1² + ηλ_i²(2 − ηλ_1 − (2/3)ηλ_i) − (1 − ηλ_1)ηλ_1² − ηλ_i²(λ_i²/2 + λ_1²)/λ_1² − ηλ_D⁴/(2λ_1²)
≤ ηλ_1²/(2 − ηλ_1) − ηλ_D⁴/(2λ_1²).
The second inequality is because (1 − ηλ_1)∥x̄(t)∥_2 + η²λ_i²(λ_i²/2 + λ_1²)/∥x̄(t)∥_2 decreases monotonically in ∥x̄(t)∥_2 on this range, so it is minimized at the endpoint ∥x̄(t)∥_2 = ηλ_1². Combining the cases completes the proof of Lemma H.5.

Lemma H.6. For t ≥ 0, if 0 < ∥x̄(t)∥_2 ≤ ηλ_1²/(2 − ηλ_1), then |x̄_1(t+1)| ≥ |x̄_1(t)|.

Proof of Lemma H.6. Note that |x̄_1(t+1)| = |1 − ηλ_1 − ηλ_1²/∥x̄(t)∥_2| |x̄_1(t)| and that ηλ_1²/∥x̄(t)∥_2 ≥ 2 − ηλ_1. It follows that 1 − ηλ_1 − ηλ_1²/∥x̄(t)∥_2 ≤ −1. Hence we have |x̄_1(t+1)| ≥ |x̄_1(t)|.

Lemma H.7. For any t ≥ 0, if ∥x̄(t)∥_2 ≤ ηλ_1²/(2 − ηλ_1) and x̄(t) ∈ ∩_j I_j, it holds that ∥x̄(t+1)∥_2 ≤ ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2.

Proof of Lemma H.7. Note that ∥I − ηA − ηA²/∥x̄(t)∥_2∥_2 ≤ max_{1≤j≤D} |1 − ηλ_j − ηλ_j²/∥x̄(t)∥_2| = ηλ_1²/∥x̄(t)∥_2 − (1 − ηλ_1). The proof is completed by noting that ∥x̄(t+1)∥_2 ≤ ∥I − ηA − ηA²/∥x̄(t)∥_2∥_2 ∥x̄(t)∥_2.

Lemma H.8. For any t ≥ 0, if ∥x̄(t)∥_2 ≤ ηλ_1²/(1 − ηλ_1), it holds that
∥x̄(t+1)∥_2² ≤ (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2)² · |x̄_1(t)|²/∥x̄(t)∥_2² + max{ max_{j∈[2:M]} |(1 − ηλ_j)∥x̄(t)∥_2 − ηλ_j²|, ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2 }² · (1 − |x̄_1(t)|²/∥x̄(t)∥_2²).

Proof of Lemma H.8. We discuss the movement along v_1 and orthogonal to v_1 separately. First,
∥P^{(2:D)}x̄(t+1)∥_2 = ∥(I − ηP^{(2:D)}A − ηP^{(2:D)}A²/∥x̄(t)∥_2) P^{(2:D)}x̄(t)∥_2 ≤ ∥P^{(2:D)} − ηP^{(2:D)}A − ηP^{(2:D)}A²/∥x̄(t)∥_2∥_2 ∥P^{(2:D)}x̄(t)∥_2 ≤ max_{j∈[2:M]} |1 − ηλ_j − ηλ_j²/∥x̄(t)∥_2| ∥P^{(2:D)}x̄(t)∥_2.
Second, |x̄_1(t+1)| = (ηλ_1²/∥x̄(t)∥_2 − 1 + ηλ_1) |x̄_1(t)|.
Hence we have
∥x̄(t+1)∥_2² ≤ (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2)² · |x̄_1(t)|²/∥x̄(t)∥_2² + max{ max_{j∈[2:M]} |(1 − ηλ_j)∥x̄(t)∥_2 − ηλ_j²|, ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2 }² · (1 − |x̄_1(t)|²/∥x̄(t)∥_2²).

Lemma H.9. For t, t′ ∈ S with 0 ≤ t ≤ t′, we have |x̄_1(t)| ≤ |x̄_1(t′)|.

Proof of Lemma H.9. For t ∈ S, by Lemma H.5, either t+1 ∈ S, or t+1 ∉ S and t+2 ∈ S. We discuss by cases. 1. If t+1 ∈ S, we can use Lemma H.6 to show |x̄_1(t)| ≤ |x̄_1(t+1)|. 2. If t+1 ∉ S and t+2 ∈ S, then
|x̄_1(t+2)| = [(ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2)(ηλ_1² − (1 − ηλ_1)∥x̄(t+1)∥_2) / (∥x̄(t)∥_2 ∥x̄(t+1)∥_2)] |x̄_1(t)|.
Since
(ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2)(ηλ_1² − (1 − ηλ_1)∥x̄(t+1)∥_2) ≥ ∥x̄(t)∥_2 ∥x̄(t+1)∥_2
⟺ η²λ_1⁴ − ηλ_1²(1 − ηλ_1)(∥x̄(t)∥_2 + ∥x̄(t+1)∥_2) ≥ (2ηλ_1 − η²λ_1²)∥x̄(t)∥_2 ∥x̄(t+1)∥_2
⟺ η²λ_1⁴ − ηλ_1²(1 − ηλ_1)∥x̄(t)∥_2 ≥ ((2ηλ_1 − η²λ_1²)∥x̄(t)∥_2 + ηλ_1²(1 − ηλ_1)) ∥x̄(t+1)∥_2,
combining with Lemma H.7, we only need to prove
η²λ_1⁴ − ηλ_1²(1 − ηλ_1)∥x̄(t)∥_2 ≥ ((2ηλ_1 − η²λ_1²)∥x̄(t)∥_2 + ηλ_1²(1 − ηλ_1)) (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2).
Through some calculation, this is equivalent to ((2 − ηλ_1)∥x̄(t)∥_2 − ηλ_1²)((1 − ηλ_1)∥x̄(t)∥_2 − ηλ_1²) ≥ 0, which holds because t ∈ S gives (2 − ηλ_1)∥x̄(t)∥_2 ≤ ηλ_1², and ηλ_1 < 1 gives (1 − ηλ_1)∥x̄(t)∥_2 ≤ ηλ_1². Hence |x̄_1(t+2)| ≥ |x̄_1(t)|, which completes the proof.

Lemma H.10. lim_{t→∞} ∥x̄(t)∥_2 = ηλ_1²/(2 − ηλ_1).

Proof of Lemma H.10. By Lemma H.9, |x̄_1(t)| is non-decreasing along S and bounded (since x̄(t) ∈ I_1), hence convergent, so the ratio |x̄_1(t′)|/|x̄_1(t)| tends to 1 along S. Formally, ∀ϵ > 0, there exists T_ϵ > 0 such that ∀t, t′ ∈ S with t′ > t > T_ϵ, |x̄_1(t′)|/|x̄_1(t)| < 1 + ϵ. By Lemma H.5, ∀t ∈ S, t+1 ∈ S or t+2 ∈ S; we discuss by cases for t ≥ T_ϵ.
1. If t+1 ∈ S, then 1 + ϵ ≥ |x̄_1(t+1)|/|x̄_1(t)| = (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2)/∥x̄(t)∥_2.
2. If t+1 ∉ S and t+2 ∈ S, then
1 + ϵ ≥ |x̄_1(t+2)|/|x̄_1(t)| = (ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2)(ηλ_1² − (1 − ηλ_1)∥x̄(t+1)∥_2)/(∥x̄(t)∥_2 ∥x̄(t+1)∥_2) ≥ (ηλ_1² − (1 − ηλ_1)(ηλ_1² − (1 − ηλ_1)∥x̄(t)∥_2))/∥x̄(t)∥_2,
where the last inequality applies Lemma H.7.
Concluding, ∥x̄(t)∥_2 ≥ min( ηλ_1²/(2 − ηλ_1 + ϵ), η²λ_1³/((2 − ηλ_1)ηλ_1 + ϵ) ) for all t > T_ϵ with t ∈ S. As for t ∉ S with t > T_ϵ, we have ∥x̄(t)∥_2 ≥ ηλ_1²/(2 − ηλ_1) by the definition of S.
Hence we have, ∀t > T_ϵ, ∥x̄(t)∥_2 ≥ min( ηλ_1²/(2 − ηλ_1 + ϵ), η²λ_1³/((2 − ηλ_1)ηλ_1 + ϵ) ). Further, by Lemma H.7, ∀t > T_ϵ + 1, ∥x̄(t)∥_2 ≤ ηλ_1² − (1 − ηλ_1) min( ηλ_1²/(2 − ηλ_1 + ϵ), η²λ_1³/((2 − ηλ_1)ηλ_1 + ϵ) ). Combining both bounds and letting ϵ → 0, we have lim_{t→∞} ∥x̄(t)∥_2 = ηλ_1²/(2 − ηλ_1).

Lemma H.11. |x̄_1(t)| converges to ηλ_1²/(2 − ηλ_1) as t → ∞.

Proof of Lemma H.11. Notice that ∥P^{(2:D)}x̄(t+1)∥_2 ≤ max( |1 − ηλ_2 − ηλ_2²/∥x̄(t)∥_2|, |1 − ηλ_D − ηλ_D²/∥x̄(t)∥_2| ) ∥P^{(2:D)}x̄(t)∥_2. When ∥x̄(t)∥_2 > ηλ_2²/(2 − ηλ_2 − δ), we have −1 + δ ≤ 1 − ηλ_2 − ηλ_2²/∥x̄(t)∥_2 ≤ 1 − ηλ_D − ηλ_D²/∥x̄(t)∥_2 ≤ 1 − ηλ_D, hence ∥P^{(2:D)}x̄(t+1)∥_2 ≤ max(1 − ηλ_D, 1 − δ) ∥P^{(2:D)}x̄(t)∥_2. Therefore, for sufficiently large t, ∥P^{(2:D)}x̄(t)∥_2 shrinks exponentially; together with Lemma H.10, this shows lim_{t→∞} |x̄_1(t)| = ηλ_1²/(2 − ηλ_1).

I ANALYSIS FOR FULL-BATCH SAM ON GENERAL LOSS (PROOF OF THEOREM 4.5)

The goal of this section is to prove the following theorem.

Theorem 4.5 (Main). Let {x(t)} be the iterates of full-batch SAM (Equation 3) with x(0) = x_init ∈ U. Under Assumptions 3.2 and 4.4, for all η, ρ such that η ln(1/ρ) and ρ/η are sufficiently small, the dynamics of SAM can be characterized in the following two phases:
• Phase I: (Theorem I.1) Full-batch SAM (Equation 3) follows gradient flow with respect to L until entering an O(ηρ) neighborhood of the manifold Γ in O(ln(1/ρ)/η) steps;
• Phase II: (Theorem I.3) Under a mild non-degeneracy assumption (Assumption I.2) on the initial point of Phase II, full-batch SAM (Equation 3) tracks the solution X of Equation 7, the Riemannian gradient flow with respect to the loss λ_1(∇²L(·)), in an O(ηρ) neighborhood of the manifold Γ. Quantitatively, the approximation error between the iterates x and the corresponding limiting flow X is O(η ln(1/ρ)), that is, ∥x(⌈T_3/(ηρ²)⌉) − X(T_3)∥_2 = O(η ln(1/ρ)).
Moreover, the angle between ∇L(x(⌈T_3/(ηρ²)⌉)) and the top eigenspace of ∇²L(x(⌈T_3/(ηρ²)⌉)) is O(ρ). Readers may refer to Appendix E for notation.
To prove the theorem, we split the dynamics of SAM on a general loss L into two phases. Define
R_j(x) = Σ_{i=j}^M λ_i(x)² ⟨v_i(x), x − Φ(x)⟩² − ηρλ_j(x)², ∀j ∈ [M], x ∈ U,
which measures the projection of x − Φ(x) onto the bottom (M − j + 1) nonzero eigendirections of ∇²L(Φ(x)). We will provide a fine-grained convergence bound on R_j(x).

Theorem I.1 (Phase I). Let {x(t)} be the iterates defined by SAM (Equation 3) with x(0) = x_init ∈ U. Under Assumption 3.2, there exists a positive number T_1 independent of η and ρ, such that for any T_1′ > T_1 and for all η, ρ such that (η + ρ) ln(1/ηρ) is sufficiently small, we have
max_{T_1 ln(1/ηρ) ≤ ηt ≤ T_1′ ln(1/ηρ)} max_{j∈[M]} max{R_j(x(t)), 0} = O(ηρ²),
max_{T_1 ln(1/ηρ) ≤ ηt ≤ T_1′ ln(1/ηρ)} ∥Φ(x(t)) − Φ(x_init)∥ ≤ O((η + ρ) ln(1/ηρ)).

Theorem I.1 implies that SAM converges to an O(ηρ) neighborhood of Γ. Notice that in the time frame of Theorem I.1, x(t) effectively operates in a local regime around Φ(x(⌈T_1 ln(1/ηρ)/η⌉)); this allows us to approximate L by its quadratic Taylor expansion at Φ(x(⌈T_1 ln(1/ηρ)/η⌉)) and to prove Theorem I.3 below. Towards proving Theorem I.3, we need to make one assumption about the trajectory of SAM, Assumption I.2.

Assumption I.2. There exists a step t satisfying T_1 ln(1/ηρ)/η ≤ t ≤ O(ln(1/ηρ)/η), |⟨x(t) − Φ(x(t)), v_1(x(t))⟩| ≥ Ω(ρ²), and ∥x(t) − Φ(x(t))∥_2 ≤ λ_1(t)ηρ − Ω(ρ²), where T_1 is the constant defined in Theorem I.1.

We remark that the above assumption is very mild, as we only need the two conditions in Assumption I.2 to hold at some step within Θ(1/η) steps after Phase I ends, and from then on our analysis for Phase II shows that these two conditions hold until Phase II ends.

Theorem I.3 (Phase II).
Let {x(t)} be the iterates defined by SAM (Equation 3) under Assumptions 3.2 and 4.4, and suppose η, ρ are such that η ln(1/ρ) and ρ/η are sufficiently small. Further assume that (1) max_{j∈[M]} max{R_j(x(0)), 0} = O(ηρ^2), (2) ∥Φ(x(0)) − Φ(x_init)∥ = O((η + ρ) ln(1/ηρ)), (3) |⟨x(0) − Φ(x(0)), v_1(x(0))⟩| ≥ Ω(ρ^2), and (4) ∥x(0) − Φ(x(0))∥_2 ≤ λ_1(0)ηρ − Ω(ρ^2). Then the iterates x(t) track the solution X of Equation 7. Quantitatively, for t = ⌈T_3/(ηρ^2)⌉, we have that ∥Φ(x(t)) − X(ηρ^2 t)∥ = O(η ln(1/ρ)). Moreover, the angle between ∇L(x(t)) and the top eigenspace of ∇²L(Φ(x(t))) is at most O(ρ). Quantitatively,

|⟨x(t) − Φ(x(t)), v_1(x(t))⟩| = Θ(ηρ), max_{j∈[2:M]} |⟨x(t) − Φ(x(t)), v_j(x(t))⟩| = O(ηρ^2).

In this section we define K as {X(t) | 0 ≤ t ≤ T_3}, where X is the solution of Equation 7. To simplify our proof, we assume WLOG that L(x) = 0 for x ∈ Γ.
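Both conclusions of Theorem I.3 can be observed numerically in their quadratic form (cf. Lemmas H.10 and H.11). The sketch below (our own, with arbitrarily chosen constants) iterates SAM on a positive-definite quadratic, where Φ(x) = 0, and checks that ∥∇²L · (x(t) − Φ(x(t)))∥ settles at ηρλ_1²/(2 − ηλ_1) (the appendix states the ρ-normalized form ηλ_1²/(2 − ηλ_1)) while x(t) − Φ(x(t)) aligns with the top eigenvector v_1.

```python
import numpy as np

# Quadratic model: L(x) = 0.5 x^T A x, minimizer Phi(x) = 0, Hessian A.
lam = np.array([1.0, 0.5, 0.2])          # eigenvalues lambda_1 > lambda_2 > lambda_3
A = np.diag(lam)
eta, rho = 0.5, 0.01

x = np.array([0.3, 0.3, 0.3])
for _ in range(5000):
    g = A @ x                            # gradient of the quadratic
    x = x - eta * A @ (x + rho * g / np.linalg.norm(g))   # full-batch SAM step

limit = eta * rho * lam[0] ** 2 / (2 - eta * lam[0])      # predicted limit of ||A x(t)||
assert abs(np.linalg.norm(A @ x) - limit) / limit < 0.05
# x - Phi(x) aligns with the top eigenvector e_1: the angle is O(rho).
assert abs(x[0]) / np.linalg.norm(x) > 0.999
```

The iterate ends up oscillating across the minimizer along v_1 with constant amplitude, which is exactly the period-two behavior the invariant-set analysis captures.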

I.1 PHASE I (PROOF OF THEOREM I.1)

Proof of Theorem I.1. The proof consists of three major parts.

1. Tracking Gradient Flow. Lemma I.4 shows the existence of a step t_GF = O(1/η) such that x(t_GF) is in a subset of K_h and Φ(x(t_GF)) is O(η + ρ) close to Φ(x_init).

2. Decreasing Loss. Lemma I.6 shows the existence of a step t_DEC = O(ln(1/ρ)/η) such that x(t_DEC) is in an O(ρ) neighborhood of Γ and Φ(x(t_DEC)) is O((η + ρ) ln(1/ρ)) close to Φ(x_init).

3. Entering Invariant Set. Lemmas I.11 and I.13 show the existence of a step t_INV = t_DEC + O(ln(1/η)/η) such that x(t_INV) enters the invariant set (∩_{k∈[M]} I_k) ∩ K_{7h/8}, with the movement of Φ(x(·)) bounded by O(ρ² ln(1/η)).

Lemma I.4. Under the condition of Theorem I.1, there exists a step t_GF = O(1/η) such that x(t_GF) ∈ K_h and Φ(x(t_GF)) is O(η + ρ) close to Φ(x_init). Quantitatively,

L(x(t_GF)) ≤ µh²/32, ∥x(t_GF) − Φ(x(t_GF))∥ ≤ h/4, ∥Φ(x(t_GF)) − Φ(x_init)∥ = O(η + ρ).

Proof of Lemma I.4. Choose C = (1/4)√(µ/ζ). Since Φ(x_init) = lim_{T→∞} ϕ(x_init, T), there exists T > 0 such that ∥ϕ(x_init, T) − Φ(x_init)∥_2 ≤ Ch/2. Note that

x(t+1) = x(t) − η∇L(x(t) + ρ ∇L(x(t))/∥∇L(x(t))∥) = x(t) − η∇L(x(t)) + O(ηρ).

By Corollary L.3 with b(x) = −∇L(x), p = η and ϵ = O(ρ), the iterates x(t) track the gradient flow ϕ(x_init, T) for O(1/η) steps. Quantitatively, for t_GF = ⌈T/η⌉, we have that ∥x(t_GF) − ϕ(x_init, T)∥_2 = O(ϵ + p) = O(η + ρ). This implies x(t_GF) ∈ K_h; hence by Taylor expansion of Φ,

∥Φ(x(t_GF)) − Φ(x_init)∥_2 = ∥Φ(x(t_GF)) − Φ(ϕ(x_init, T))∥_2 ≤ O(∥x(t_GF) − ϕ(x_init, T)∥_2) ≤ O(η + ρ).

This implies

∥x(t_GF) − Φ(x(t_GF))∥_2 ≤ ∥x(t_GF) − ϕ(x_init, T)∥_2 + ∥ϕ(x_init, T) − Φ(x_init)∥_2 + ∥Φ(x_init) − Φ(x(t_GF))∥_2 ≤ Ch/2 + O(η + ρ) ≤ Ch ≤ h/4.

By Taylor expansion, we conclude that L(x(t_GF)) ≤ ζ∥x(t_GF) − Φ(x(t_GF))∥²/2 ≤ µh²/32.

Lemma I.5. Under the condition of Theorem I.1, if x(t) ∈ K_h and ∥∇L(x(t))∥ ≥ 4ζρ, then L(x(t+1)) ≤ (1 − ηµ/8)L(x(t)) and ∥Φ(x(t+1)) − Φ(x(t))∥ ≤ O(η²).

Proof of Lemma I.5. As x(t) ∈ K_h and L is µ-PL in K_h, we have L(x(t)) ≥ 0. As x(t) ∈ K_h, by Lemma F.7 and Taylor expansion, we have ∥x(t) − x(t+1)∥ = O(η); hence for sufficiently small η, the segment x(t)x(t+1) ⊂ K_r. Using a similar argument, the segment from x(t) to x(t) + ρ ∇L(x(t))/∥∇L(x(t))∥ is in K_r.
Then by Taylor Expansion on L, L(x(t + 1)) = L(x(t) -η∇L x(t) + ρ ∇L (x(t)) ∥∇L (x(t)) ∥ ) ≤ L(x(t)) -η ∇L (x(t)) , ∇L x(t) + ρ ∇L (x(t)) ∥∇L (x(t)) ∥ + ζη 2 ∥∇L x(t) + ρ ∇L(x(t)) ∥∇L(x(t))∥ ∥ 2 2 . ( ) By Taylor Expansion on ∇L, we have that ∥∇L x(t) + ρ ∇L (x(t)) ∥∇L (x(t)) ∥ -∇L (x(t)) ∥ ≤ ζρ . After plugging in Equation 25, we have that L(x(t + 1)) ≤ L(x(t)) -η∥∇L (x(t)) ∥ 2 + ηζρ∥∇L (x(t)) ∥ + ζη 2 ∥∇L (x(t)) ∥ 2 + ζ 3 η 2 ρ 2 . As ∥∇L(x(t))∥ ≥ 4ζρ, we have that the following term is bounded. ζη 2 ∥∇L (x(t)) ∥ 2 ≤ 1 2 η∥∇L (x(t)) ∥ 2 , ηζρ∥∇L (x(t)) ∥ ≤ 1 4 η∥∇L (x(t)) ∥ 2 , ζ 3 η 2 ρ ≤ ζ 2 ηρ 2 ≤ 1 16 η∥∇L (x(t)) ∥ 2 . After plugging in Equation 26, by Lemma F.2, L(x(t + 1)) ≤ L(x(t)) - 1 16 η∥∇L (x(t)) ∥ 2 ≤ L(x(t))(1 -ηµ/8) . As x(t) ∈ K h , by Taylor Expansion, we have ∥∇L (x(t)) ∥ ≤ ζh . Hence by Lemma F.7 and Taylor Expansion, ∥Φ(x(t + 1)) -Φ(x(t))∥ ≤ ξηρ∥∇L (x) ∥ 2 + νηρ 2 + ξη 2 ∥∇L (x) ∥ 2 2 + ξζ 2 η 2 ρ 2 ≤ O(η 2 ), which completes the proof. Lemma I.6. Under condition of Theorem I.1, assuming there exists t GF such that L(x(t GF )) ≤ µh 2 32 and x(t GF ) ∈ K h/4 , then there exists t DEC = t GF + O(ln(1/ρ)/η), such that x(t DEC ) is in O(ρ) neighbor of Γ, quantitatively, we have that ∥∇L(x(t DEC ))∥ 2 ≤ 4ζρ . Moreover the movement of the projection of Φ(x(•)) on the manifold is bounded, ∥Φ(x(t GF )) -Φ(x(t DEC ))∥ 2 = O(η ln(1/ρ)) . Proof of Lemma I.6. Choose t DEC as the minimal t ≥ t GF such that ∥∇L(x(t DEC ))∥ 2 ≤ 4ζρ. Define C = ⌈ln 1-ηµ 8 (64ρ 2 /h 2 )⌉ = O(ln(1/ρ)/η). We will first perform an induction on t ≤ min{t DEC , t GF + C} = t GF + O(ln(1/ρ)/η) to show that L(x(t)) ≤ (1 -ηµ/8) t-tGF L(x(t GF )) ∥Φ(x(t)) -Φ(x(t GF ))∥ = O(η 2 (t -t GF )) For t = t GF , the result holds trivially. Suppose the induction hypothesis holds for t. Then by F.1 and Taylor Expansion, ∥Φ(x(t)) -x(t)∥ ≤ 2L(x(t GF )) µ ≤ h/4 . 
Then we have that dist(K, x(t)) ≤dist(K, x(t GF )) + ∥x(t GF ) -Φ(x(t GF ))∥ 2 + ∥Φ(x(t GF )) -Φ(x(t))∥ + ∥Φ(x(t)) -x(t)∥ ≤3h/4 + O(η 2 (t -t GF )) = 3h/4 + O(η ln(1/ρ)) ≤ h . That is x(t) ∈ K h . Then as t ≤ t DEC , ∥∇L(x(t))∥ 2 ≥ 4ζρ. Then by Lemma I.5, we have that L(x(t + 1)) ≤ (1 -ηµ/8)L(x(t)) ≤ (1 -ηµ/8) t+1-tGF L(x(t GF )) , ∥Φ(x(t + 1)) -Φ(x(t GF ))∥ ≤ ∥Φ(x(t + 1)) -Φ(x(t))∥ + ∥Φ(x(t)) -Φ(x(t GF ))∥ ≤ O(η 2 (t -t GF )) , which completes the induction. Now if t DEC ≥ t GF + C = t GF + Ω(ln(1/ρ)/η), As the result of the induction, we have that L(x(t GF + C)) ≤ (1 - ηµ 8 ) C L(x(t GF )) ≤ 64ρ 2 h 2 L(x(t GF )) ≤ 8ρ 2 µ . By Lemma F.2, we have that ∥∇L(x(t GF +C))∥ 2 ≤ ζ 2L(x(tGF+C)) µ = 4ζρ, which leads to a contradiction. Hence we have that t DEC ≤ t GF + C = t GF + O(ln(1/ρ)/η). By induction, we have that ∥Φ(x(t DEC )) -Φ(x(t GF ))∥ = O(η 2 (t DEC -t GF )) = O(η ln(1/ρ)) . This completes the proof.
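The descent inequality of Lemma I.5 is easy to observe numerically. The sketch below (our own, with arbitrarily chosen constants) runs SAM on a strongly convex quadratic, where µ and ζ play the role of the smallest and largest Hessian eigenvalues, and checks that the loss contracts by at least the factor (1 − ηµ/8) at every step where ∥∇L(x(t))∥ ≥ 4ζρ.

```python
import numpy as np

lam = np.array([1.0, 0.6, 0.2])      # Hessian eigenvalues: zeta = 1.0, mu = 0.2
A = np.diag(lam)
mu, zeta = lam.min(), lam.max()
eta, rho = 0.1, 0.01

L = lambda z: 0.5 * z @ A @ z
x = np.array([1.0, 1.0, 1.0])
checked = 0
for _ in range(200):
    g = A @ x
    x_next = x - eta * A @ (x + rho * g / np.linalg.norm(g))  # SAM step on the quadratic
    if np.linalg.norm(g) >= 4 * zeta * rho:                   # hypothesis of Lemma I.5
        assert L(x_next) <= (1 - eta * mu / 8) * L(x)         # conclusion of Lemma I.5
        checked += 1
    x = x_next
assert checked > 20   # the large-gradient regime was actually exercised
```

Once ∥∇L∥ drops below the 4ζρ threshold the contraction is no longer guaranteed, which is why the proof switches to the invariant-set argument of Appendix I.1.3 at that point.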

I.1.3 ENTERING INVARIANT SET

We first introduce some notations that is required for the proof in this and following subsection. Define x = x -Φ(x) , A(x) = ∇ 2 L (Φ(x)) , x = A(x)x , xj = ⟨x, v j (x)⟩ , P (j:D) (x) = M i=j v i (x)v T i (x) . Proof of Lemma I.9. By ∥x(t) -Φ(x(t))∥ = O(ρ), x(t) ∈ K h/2 , and Lemma F.7, we have that ∥x(t + 1)x(t)∥ = O(ηρ) and hence x(t + 1) ∈ K 3h/4 . This also implies ∥x(t + 1) -Φ(x(t + 1))∥ 2 = O(ρ). Similarly we have x(t + 2) ∈ K 3h/4 . For k ∈ {1, 2}, by Taylor Expansion, x(t + k + 1) =x(t + k) -η∇L(x(t + k)) -ηρ∇ 2 L(x(t + k)) ∇L (x(t + k)) ∥∇L (x(t + k)) ∥ + O(ηρ 2 ) =x(t + k) -η∇ 2 L(Φ(x(t + k)))(x(t + k) -Φ(x(t + k))) + O(ηρ 2 ) -ηρ∇ 2 L(Φ(x(t + k))) ∇L (x(t + k)) ∥∇L (x(t + k)) ∥ + O(ηρ 2 ) =x(t + k) -η∇ 2 L(Φ(x(t + k)))(x(t + k) -Φ(x(t + k))) -ηρ∇ 2 L(Φ(x(t + k))) ∇L (x(t + k)) ∥∇L (x(t + k)) ∥ + O(ηρ 2 ). Now by Lemmas I.7 and I.8, ∥Φ(x(t + k)) -Φ(x(t))∥ 2 = O(ηρ 2 ), x(t + k + 1) =x(t + k) -η∇ 2 L(Φ(x))(x(t + k) -Φ(x(t))) -ηρ∇ 2 L(Φ(x)) ∇L (x(t + k)) ∥∇L (x(t + k)) ∥ + O(ηρ 2 ). Now we first prove the first claim, we have for k = 0, ∥x(t + k) -x ′ (t + k)∥ 2 = 0, by Lemma F.4 and eq. 27, x(t + 1) =x(t) -η∇ 2 L(Φ(x))(x(t) -Φ(x)) -ηρ ∇ 2 L(Φ(x))(x(t) -Φ(x)) ∥∇ 2 L(Φ(x))(x(t) -Φ(x))∥ 2 + O(ηρ 2 ) =x ′ (t + 1) + O(ηρ 2 ). The second claim is slightly more complex. By the first claim and Lemma F.4, we have that ∇L (x(t + 1)) ∥∇L (x(t + 1)) ∥ = ∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1))) ∥∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1)))∥ 2 +O(∥x(t + 1) -Φ(x(t + 1))∥ 2 ). 
We first show ∥∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1)))∥ 2 is of order ∥x(t + 1) -Φ(x(t + 1))∥ 2 = Ω(ρ 2 ) to show that the normalized gradient term is stable with respect to small perturbation, ∥∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1)))∥ 2 ≥∥P Φ(x(t+1)),Γ ∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1)))∥ 2 ≥∥∇ 2 L(Φ(x(t + 1)))P Φ(x(t+1)),Γ (x(t + 1) -Φ(x(t + 1)))∥ 2 ≥µ∥P Φ(x(t+1)),Γ (x(t + 1) -Φ(x(t + 1)))∥ 2 ≥µ(∥(x(t + 1) -Φ(x(t + 1)))∥ 2 -∥P ⊥ Φ(x(t+1)),Γ (x(t + 1) -Φ(x(t + 1)))∥ 2 ) ≥µ(∥(x(t + 1) -Φ(x(t + 1)))∥ 2 - νζ 4µ 2 ∥x(t + 1) -Φ(x(t + 1))∥ 2 2 ) ≥ µ 2 ∥(x(t + 1) -Φ(x(t + 1)))∥ 2 = Ω(ηρ). Based on Lemma F.7, we have Φ(x(t + 1)) -Φ(x(t)) = O(ηρ 2 ). We further have by the first claim and Lemma I.8, ∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1))) -∇ 2 L(Φ(x))(x ′ (t + 1) -Φ(x(t))) =∇ 2 L(Φ(x))(x(t + 1) -Φ(x(t + 1))) -∇ 2 L(Φ(x))(x ′ (t + 1) -Φ(x(t)) + O(∥x(t + 1) -Φ(x(t + 1))∥ 2 ∥Φ(x(t + 1)) -Φ(x)∥ 2 ) =∇ 2 L(Φ(x))(x(t + 1) -Φ(x(t + 1))) -∇ 2 L(Φ(x))(x ′ (t + 1) -Φ(x(t))) + O(ηρ 3 ) =∇ 2 L(Φ(x))(x(t + 1) -x ′ (t + 1)) + ∇ 2 L(Φ(x))(Φ(x(t + 1)) -Φ(x(t))) + O(ηρ 3 ) =O(ηρ 2 ) This implies ∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1))) ∥∇ 2 L(Φ(x(t + 1)))(x(t + 1) -Φ(x(t + 1)))∥ 2 = ∇ 2 L(Φ(x))(x ′ (t + 1) -Φ(x)) ∥∇ 2 L(Φ(x))(x ′ (t + 1) -Φ(x))∥ 2 + O(ρ) Combining with Equation 28, we have ∇L (x(t + 1)) ∥∇L (x(t + 1)) ∥ = ∇ 2 L(Φ(x))(x ′ (t + 1) -Φ(x)) ∥∇ 2 L(Φ(x))(x ′ (t + 1) -Φ(x))∥ 2 + O(ρ) By the above approximation and Equation 27, x(t + 2) = x ′ (t + 2) + O(ηρ 2 ) . Lemma I.10. Assuming t satisfy that x(t) ∈ K 3h/4 and ∥x(t)∥ 2 = O(ρ), then we have that ∥x(t + 1) -x(t) + ηA(t)x(t) + ηρA 2 (t) x(t) ∥x(t)∥ ∥ 2 = O(ηρ 2 ) . Proof of Lemma I.10. By Lemma I.9, we know ∥x(t + 1) -x(t) + ηx(t) + ηρA(t) x(t) ∥x(t)∥ ∥ ≤ O(ηρ 2 ) . This implies ∥A(t)(x(t + 1) -Φ(x(t))) -x(t) + ηA(t)x(t) + ηρA 2 (t) x(t) ∥x(t)∥ ∥ =∥A(t)(x(t + 1) -x(t) + ηx(t) + ηρA(t) x(t) ∥x(t)∥ )∥ ≤ζ∥x(t + 1) -x(t) + ηx(t) + ηρA(t) x(t) ∥x(t)∥ ∥ = O(ηρ 2 ) . 
We also have x(t + 1) -A(t)(x(t + 1) -Φ(x(t))) =(A(t + 1) -A(t))(x(t + 1) -Φ(x(t + 1))) -A(t)(Φ(x(t)) -Φ(x(t + 1))) =O(ηρ 2 ) . Plugging in Equation 29, we have that ∥x(t + 1) -x(t) + ηA(t)x(t) + ηρA 2 (t) x(t) ∥x(t)∥ ∥ 2 = O(ηρ 2 ) . Lemma I.11. Under condition of Theorem I.1, assuming there exists t DEC such that x(t DEC ) ∈ K h/2 and ∥∇L(x(t DEC ))∥ ≤ 4ζρ, then there exists t DEC2 = t DEC + O(ln(1/η)/η), such that x(t DEC2 ) is in I 1 ∩ K 3h/4 . Furthermore, for any t satisfying t DEC2 ≤ t ≤ t DEC2 + Θ(ln(1/η)/η), we have that x(t) ∈ I 1 ∩ K 3h/4 and ∥Φ(x(t)) -Φ(x(t DEC ))∥ = O(ρ 2 ln(1/η)). Proof of Lemma I.11. For simplicity, denote C = ⌈ln 1-ηµ ηµ 3 4ζ 2 ⌉ + Θ(ln(1/ρ)/η) = O(ln(1/η)/η). Here the quantity Θ(ln(1/ρ)/η) is the same quantity in the statement of the lemma. We will prove the induction hypothesis for t DEC ≤ t ≤ t DEC + 2C,            ∥x(t -1)∥ ≥ ηρλ 2 1 (t), t > t DEC ⇒ ∥x(t)∥ ≤ (1 -ηµ)∥x(t -1)∥, ∥x(t -1)∥ ≤ ηρλ 2 1 (t -1), t > t DEC ⇒ ∥x(t)∥ ≤ ηρλ 2 1 (t) + O(ηρ 2 ), ∥Φ(x(t)) -Φ(x(t DEC ))∥ ≤ O(ηρ 2 (t -t DEC )), x(t) ∈ K 3h/4 . The induction hypothesis holds trivially for t = t DEC . Assume the induction hypothesis holds for t ′ ≤ t. By Lemmas F.1 and I.7, ∥x(t DEC )∥ 2 ≤ ζ∥x(t DEC ) - Φ(x(t DEC ))∥ ≤ ζ µ ∥∇L(x(t DEC ))∥ ≤ 4ζ 2 µ ρ. Combining with the induction hypothesis, we have ∥x(t)∥ ≤ 4ζ 2 µ ρ. By x(t) ∈ K 3h/4 and Lemma I.8, we have that ∥Φ(x(t + 1)) -Φ(x(t))∥ ≤ O(ηρ 2 ) . Hence we have that ∥Φ(x(t + 1)) -Φ(x(t DEC ))∥ ≤ ∥Φ(x(t + 1)) -Φ(x(t))∥ + ∥Φ(x(t)) -Φ(x(t DEC ))∥ ≤ O(ηρ 2 (t + 1 -t DEC )). This proves the third statement of the induction hypothesis. By ∥x(t)∥ = O(ρ) and Lemma I.10, we have that ∥x(t + 1) -x(t) + ηA(t)x(t) + ηρA 2 (t) x(t) ∥x(t)∥ ∥ 2 = O(ηρ 2 ) . Analogous to the proof of Lemmas H.1 and H.2, we have 1. 
If ∥x(t)∥ > ηρλ 2 1 (t), we would have ∥x(t) -ηA(t)x(t) -ηρA 2 (t) x(t) ∥x(t)∥ ∥ ≤∥x(t)∥∥I -ηA(t) -ηρA 2 (t) 1 ∥x(t)∥ ∥ ≤∥x(t)∥ max{ηλ 1 , 1 -ηλ D -ηρλ 2 D 1 ∥x(t)∥ } ≤ max{(1 -ηλ D )∥x(t)∥ -ηρλ 2 D , ηλ 1 ∥x(t)∥} ≤ max{(1 -ηµ)∥x(t)∥ -ηρµ 2 , ηζ∥x(t)∥} Hence we have ∥x(t + 1)∥ ≤ max{(1 -ηµ)∥x(t)∥ -ηρµ 2 , ηζ∥x(t)∥} + O(ηρ 2 ) ≤ (1 -ηµ)∥x(t)∥. 2. If ∥x(t)∥ 2 ≤ ηρλ 2 1 (t) , then by Lemma H.1, we have that ∥x(t) -ηA(t)x(t) -ηρA 2 (t) x(t) ∥x(t)∥ ∥ 2 ≤ ηρλ 2 1 (t) . Hence by Lemma K.1 ∥x(t + 1)∥ ≤ ηρλ 2 1 (t) + O(ηρ 2 ) ≤ ηρλ 2 1 (t + 1) + O(ηρ 2 ) . Concluding the two cases, we have shown the first and second claim of the induction hypothesis holds. Hence we can show that ∥x(t + 1)∥ ≤ 4ζ 2 µ ρ. Then by Lemma I.7, we have that ∥x(t + 1) -Φ(x(t + 1))∥ ≤ 8ζ 2 µ 2 ρ. As t ≤ t DEC + 2C = t DEC + O(ln(1/η)/η), by Equation 30, ∥Φ(x(t + 1)) -Φ(x(t DEC ))∥ ≤ O(-ρ 2 ln η) . This implies dist(x(t + 1), K) ≤dist(x(t DEC ), K) + ∥x(t DEC ) -Φ(x(t DEC ))∥ + ∥Φ(x(t DEC )) -Φ(x(t + 1))∥ + ∥x(t + 1) -Φ(x(t + 1))∥ =h/2 + O(ρ 2 ln(1/η)) + O(ρ) ≤ 3h/4. This proves the fourth claim of the inductive hypothesis. The induction is complete. Now define t DEC2 the minimal t ≥ t DEC , such that ∥x(t)∥ ≤ ηρλ 2 1 (t). If t DEC2 > t DEC + C, then by the induction, Lemmas F.1 and I.7, ∥x(t DEC + C)∥ ≤ (1 -ηµ) C ∥x(t DEC )∥ ≤ ηµ 3 4ζ 2 ∥x(t DEC )∥ ≤ ηµ 3 4ζ 2 ζ∥x(t DEC ) -Φ(x(t DEC )∥ ≤ ηµ 2 4ζ ∥∇L(t DEC )∥ ≤ µ 2 ηρ ≤ λ 2 1 (t DEC + C)ηρ . This is a contradiction. Hence we have t DEC2 ≤ t DEC + C. By the induction hypothesis x(t DEC2 ) ∈ I 1 ∩ K 3h/4 . Furthermore by induction, for any t satisfying t DEC2 ≤ t ≤ t DEC + 2C, we have that ∥x(t)∥ ≤ ηρλ 2 1 (t) + O(ηρ 2 ) . By the induction hypothesis x(t) ∈ I 1 ∩ K 3h/4 and ∥Φ(x(t)) -Φ(x(t DEC ))∥ = O(ρ 2 ln(1/ρ)). Lemma I.12. Under condition of Theorem I.1, assuming t satisfy that x(t) ∈ I 1 ∩ K 3h/4 , then we have that R k (x(t)) ≥ 0 ⇒ R k (x(t + 1)) + λ 2 k (t + 1)ηρ ≤ (1 -ηµ)(R k (x(t)) + λ 2 k (t)ηρ), R k (x(t)) ≤ 0 ⇒ R k (x(t + 1)) ≤ O(ηρ 2 ). 
Proof of Lemma I.12. As x(t) ∈ I 1 , ∥x(t)∥ 2 ≤ ζηρ + O(ηρ 2 ). As ∥x(t)∥ 2 = O(ρ), we have the segments x(t)x(t + 1) ⊂ K h and Φ(x(t))Φ(x(t + 1)) ⊂ K r . We begin with a quantization technique that separates [M] into disjoint contiguous subsets S 1 , ..., S p such that ∀i ̸ = j, min k∈Si,l∈Sj |λ k (t) -λ l (t)| ≥ ρ. By Lemmas I.8 and K.1, we have that for any k ∈ [M], |λ k (t) -λ k (t + 1)| = O(∥∇ 2 L(Φ(x(t))) -∇ 2 L(Φ(x(t + 1)))∥) = O(∥Φ(x(t)) -Φ(x(t + 1))∥) = O(ηρ 2 ).

This implies min

k∈Si,l∈Sj |λ k (t + 1) -λ l (t + 1)| ≥ ρ -O(ηρ 2 ) ≥ 0.99ρ . Define P (t) S (i) ≜ k∈Si v n (t)v n (t) T . By Theorem K.3, for any k, ∥P (t) S k -P (t+1) S k ∥ ≤ O( ∥∇ 2 L(Φ(x(t))) -∇ 2 L(Φ(x(t + 1)))∥ ρ ) = O(ηρ) . By Lemma I.10, we have that ∥x(t + 1) -x(t) + ηA(t)x(t) + ηρA 2 (t) x(t) ∥x(t)∥ ∥ 2 = O(ηρ 2 ) . We will write x ′ (t + 1) as shorthand of x(t) -ηA(t)x(t) -ηρA 2 (t) x(t) ∥x(t)∥ . Now we discuss by cases, 1. If p i=j ∥P (t) S (i) x(t)∥ 2 > max k∈Sj λ 2 k (t)ηρ > µ 2 ηρ, by Lemma H.3, p i=j ∥P (t) S (i) x(t + 1)∥ 2 ≤ p i=j ∥P (t) S (i) x ′ (t + 1)∥ 2 + O(ηρ 2 ) ≤ max{ 1 -ηλ D (t + 1) ∥ p i=j P S (i) x(t)∥ -ηρλ D (t + 1) 2 ∥ p i=j P S (i) x(t)∥ ∥x(t)∥ , η max k∈Sj λ k (t + 1)∥ p i=j P S (i) x(t)∥} + O(ηρ 2 ) ≤ max{ 1 -ηµ ∥ p i=j P S (i) x(t)∥ -ηρ µ 3 2ζ , ηζ∥ p i=j P S (i) x(t)∥} + O(ηρ 2 ) . This further implies  p i=j ∥P S (i) x(t + 1)∥ 2 ≤ p i=j ∥P S (i) x(t + 1)∥ 2 + O(ηρ∥x(t + 1)∥) ≤ max{ 1 -ηµ ∥ p i=j P S (i) x(t)∥ -ηρ µ 3 2ζ , ηζ∥ p i=j P S (i) x(t)∥} + O(ηρ 2 ) ≤ 1 -ηµ ∥ p i=j P S (i) x(t)∥ . 2. If p i=j ∥P S (i) x(t)∥ 2 ≤ max k∈Sj λ 2 k (t) S (i) x ′ (t + 1)∥ 2 ≤ ηρ max k∈Sj λ 2 k (t) . Hence we have that p i=j ∥P S (i) x(t + 1)∥ 2 ≤ p i=j ∥P S (i) x ′ (t + 1)∥ 2 + O(ηρ 2 ) ≤ max k∈Sj λ 2 k (t)ηρ + O(ηρ 2 ) ≤ max k∈Sj λ 2 k (t + 1)ηρ + O(ηρ 2 ) . This further implies p i=j ∥P S (i) x(t + 1)∥ 2 ≤ p i=j ∥P S (i) x(t + 1)∥ 2 + O(ηρ∥x(t + 1)∥) ≤ max k∈Sj λ 2 k (t)ηρ + O(ηρ 2 ) ≤ max k∈Sj λ 2 k (t + 1)ηρ + O(ηρ 2 ) . Finally taking into quantization error, as all the eigenvalue in the same group at most differ Dρ, for any i ∈ S j , we have that -λ 2 i (t + 1) + max k∈Sj λ 2 k (t + 1) ≤ 2Dζρ + D 2 ρ 2 . Hence the previous discussion concludes as 1. If R k (x(t)) ≥ 0 R k (x(t + 1)) + λ 2 k (t + 1)ηρ ≤ (1 -ηµ)(R k (x(t)) + λ 2 k (t).ηρ) 2. If R k (x(t)) < 0 R k (x(t + 1)) ≤ O(ηρ 2 ). Lemma I.13. Under condition of Theorem I.1, assuming there exists t DEC2 such that for any t satisfying t DEC2 ≤ t ≤ t DEC2 + Θ(ln(1/η)/η), we have that x(t) ∈ I 1 ∩ K 3h/4 . 
Then there exists t INV = t DEC2 + O(ln(1/η)/η) such that for any t satisfying t INV ≤ t ≤ t INV + Θ(ln(1/η)/η), we have x(t) ∈ (∩ k∈[M ] I k ) ∩ K 7h/8 and ∥Φ(x(t)) -Φ(x(t DEC2 ))∥ = O(ρ 2 ln(1/η)).

Proof of Lemma I.13. The proof is almost identical to that of Lemma I.11, with the first two induction hypotheses replaced by Lemma I.12, and is omitted here.

I.2 PHASE II (PROOF OF THEOREM I.3)

Proof of Theorem I.3. Let t ALIGN = O(ln(1/ρ)/η) be the quantity defined in Lemma I.19. We will inductively prove the following induction hypothesis P(t) holds for t ALIGN ≤ t ≤ T 3 /ηρ 2 + 1, x(t) ∈ K h/2 , t ALIGN ≤ τ ≤ t |⟨x(τ ) -Φ(x(τ )), v 1 (x(τ ))⟩| = Θ(ηρ), t ALIGN ≤ τ ≤ t max j∈[2:M ] |⟨x(τ ) -Φ(x(τ )), v j (x(τ ))⟩| = O(ηρ 2 ), t ALIGN ≤ τ ≤ t ∥Φ(x(τ )) -X(ηρ 2 τ )∥ = O(η ln(1/ρ)), t ALIGN ≤ τ ≤ t P(t ALIGN ) holds due to Lemma I.19. Now suppose P(t) holds, then x(t + 1) ∈ K h . By Lemma I.19 again, |⟨x(t + 1) -Φ(x(t + 1)), v 1 (x(t + 1))⟩| = Θ(ηρ) and max j∈[2:M ] |⟨x(t + 1) -Φ(x(t + 1)), v j (x(t + 1))⟩| = O(ηρ 2 ) holds. Now by Lemma I.20, ∥Φ(x(τ + 1)) -Φ(x(τ )) + ηρ 2 P ⊥ Φ(x(τ )),Γ ∇λ 1 (t)/2∥ = O(ηρ 3 + η 2 ρ 2 ) , t ALIGN ≤ τ ≤ t. By Corollary L.3, let b(x) = -∂Φ(x)∇λ 1 (∇ 2 L(x))/2, p = ηρ 2 and ϵ = O(η + ρ), it holds that ∥Φ(x(τ )) -X(ηρ 2 τ )∥ =O(∥Φ(x(t ALIGN )) -Φ(x init )∥ + T 3 ηρ 2 + (ρ + η)T 3 ) =O(η ln(1/ρ)), t ALIGN ≤ τ ≤ t + 1 This implies ∥x(t + 1) -X(ηρ 2 (t + 1))∥ 2 ≤ ∥x(t + 1) -Φ(x(t + 1))∥ 2 + ∥Φ(x(t + 1)) -X(ηρ 2 (t + 1))∥ 2 = Õ(η ln(1/ρ)) < h/2. Hence x(t + 1) ∈ K h/2 . Combining with P(t) holds, we have that P(t + 1) holds. The induction is complete. Now P(⌈T 3 /ηρ 2 ⌉) is equivalent to our theorem.

I.2.1 ALIGNMENT TO TOP EIGENVECTOR

We will continue to use the notations introduced in Appendix I.1.3. We further define S = {t|∥x(t)∥ ≤ ηλ 2 1 2 -ηλ 1 ρ + O(ηρ 2 )} , T = {t|∥x(t)∥ ≤ 1 2 ηλ 2 1 2 -ηλ 1 + ηλ 2 2 2 -ηλ 2 ρ} , U = {t|Ω(ρ 2 ) ≤ ∥x 1 (t)∥ ≤ 1 2 ηλ 2 1 2 -ηλ 1 + ηλ 2 2 2 -ηλ 2 ρ} . Here the constant in O depends on the constant in I j and will be made clear in Lemma I.16. For s ∈ S, define next(s) as the smallest integer greater than s in S. Lemma I.14. Under the condition of Theorem I.3, there exist constants C 1 , C 2 < 1 independent of η and ρ, if ∥x 1 (t)∥ 2 ≤ 1 2 ηλ1(t) 2 2-ηλ1(t) + ηλ2(t) 2 2-ηλ2(t) ρ and x(t) ∈ (∩ j∈[M ] I j ) ∩ K 7h/8 , then ∥x(t)∥ 2 ≥ C 1 ηλ 2 1 2 -ηλ 1 ρ ⇒ ∥x(t + 1)∥ 2 ≤ C 2 ηλ 2 1 2 -ηλ 1 ρ Proof of Lemma I.14. By Lemma I.10, if we write x ′ (t+1) as shorthand of x(t)-ηA(t)x(t)-ηρA 2 (t) x(t) ∥x(t)∥ , then ∥x(t + 1) -x ′ (t + 1)∥ = O(ηρ 2 ). Define I quad j as {x|R j (x) ≤ 0}. Then we can find a surrogate x sur (t) such that x sur (t) ∈ (∩ j∈[M ] I quad j )∩K h and ∥x sur (t) -x(t)∥ 2 = O(ηρ 2 ). We will write x ′ sur (t + 1) as shorthand of x sur (t) -ηA(t)x sur (t) -ηρA 2 (t) xsur(t) ∥xsur(t)∥ . Let h(t) ≜ (2 -t) 1 2t (ζ -∆) 2 ζ 2 + 1 + (1 - 1 2t (ζ -∆) 2 ζ 2 + 1 ) max{ ζ 2 -µ 2 ζ 2 , (ζ -∆) 2 ζ 2 }. As h(1) < 1, we can choose C 1 < 1, such that h(C 1 ) < 1. We can further choose C 2 = max{(h(C 1 ) + 1)/2, 1 -µ 2 3ζ 2 } < 1. We will discuss by cases 1. If ∥x(t)∥ 2 ≥ ηλ 4 1 λ 2 1 (1 -ηλ D ) + (λ 2 1 -λ 2 D )(1 -ηλ 1 ) ρ Then ∥x(t)∥ 2 ηλ 2 1 2-ηλ1 ρ = λ 2 1 (2 -ηλ 1 ) λ 2 1 (1 -ηλ D ) + (λ 2 1 -λ 2 D )(1 -ηλ 1 ) = λ 2 1 (2 -ηλ 1 ) λ 2 1 (2 -ηλ 1 -ηλ D ) -λ 2 D (1 -ηλ 1 ) ≥ 1 1 - λ 2 D λ 2 1 1-ηλ1 2-ηλ1 ≥ 1 + λ 2 D λ 2 1 1 -ηλ 1 2 -ηλ 1 ≥ 1 + µ 2 3ζ 2 . In such case we have ∥ x(t) ∥x(t)∥ - x sur (t) ∥x sur (t)∥ ∥ = O(ρ) . Then we have ∥x ′ sur (t + 1) -x ′ (t + 1)∥ = O(ηρ 2 ). 
By Lemma H.5, we have that ∥x(t + 1)∥ 2 ≤ ∥x(t + 1) -x ′ (t + 1)∥ 2 + ∥x(t + 1) -x ′ sur (t + 1)∥ + ∥x ′ sur (t + 1)∥ ≤ max( ηλ 2 1 2 -ηλ 1 ρ -ηρ λ 4 D 2λ 2 1 , ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ 2 ) + O(ηρ 2 ) ≤ max(1 - λ 4 D (2 -ηλ 1 ) 2λ 4 1 , (2 -ηλ 1 ) -(1 -ηλ 1 )(1 + µ 2 3ζ 2 )) ηλ 2 1 2 -ηλ 1 ρ ≤ (1 - µ 2 3ζ 2 ) ηλ 2 1 2 -ηλ 1 ρ ≤ C 2 ηλ 2 1 2 -ηλ 1 ρ . 2. If ∥x(t)∥ 2 ≤ ηλ 4 1 λ 2 1 (1 -ηλ D ) + (λ 2 1 -λ 2 D )(1 -ηλ 1 ) ρ ≤ ηλ 2 1 1 -ηλ 1 ρ. Then we have | -ηρλ 2 D + (1 -ηλ D )∥x(t)∥ 2 | ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ 2 ≤ λ 2 1 -λ 2 D λ 2 1 . |ηρλ 2 2 -(1 -ηλ 2 )∥x(t)∥ 2 | ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ 2 ≤ λ 2 2 λ 2 1 . By Lemma H.8, ∥x ′ (t + 1)∥ 2 ≤(ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ 2 ) ∥x 2 1 (t)∥ 2 ∥x(t)∥ 2 2 + (1 - ∥x 2 1 (t)∥ 2 ∥x(t)∥ 2 2 ) max{ λ 2 1 -λ 2 D λ 2 1 , λ 2 2 λ 2 1 } ≤(ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ 2 ) ∥x 2 1 (t)∥ 2 ∥x(t)∥ 2 2 + (1 - ∥x 2 1 (t)∥ 2 ∥x(t)∥ 2 2 ) max{ ζ 2 -µ 2 ζ 2 , (ζ -∆) 2 ζ 2 } . As ∥x 1 (t)∥ 2 ≤ 1 2 ηλ 2 1 2 -ηλ 1 + ηλ 2 2 2 -ηλ 2 ρ . For ∥x(t)∥ 2 ≥ ηλ 2 1 2-ηλ1 ρC 1 , ∥x 1 (t)∥ 2 ∥x(t)∥ 2 ≤ 1 2 λ 2 2 (2 -ηλ 1 ) λ 2 1 (2 -ηλ 2 ) + 1 /C 1 ≤ 1 2 λ 2 2 λ 2 1 + 1 /C 1 ≤ 1 2C 1 (ζ -∆) 2 ζ 2 + 1 . After plugging in, we have that ∥x(t + 1)∥ 2 ≤ ∥x ′ (t + 1)∥ 2 + O(ηρ 2 ) ≤ h(C 1 ) ηλ 2 1 2 -ηλ 1 ρ + O(ηρ 2 ) ≤ C 2 ηλ 2 1 2 -ηλ 1 ρ. This concludes the proof. Lemma I.15. Under the condition of Theorem I.3, for any t ≥ 0 satisfying that (1) x(t) ∈ (∩ j∈[M ] I j ) ∩ K h , t ̸ ∈ S, it holds that t + 1 ∈ S. Moreover, if |x 1 (t)| ≥ Ω(ρ 2 ) and ∥x(t)∥ 2 ≤ ηρλ 2 1 -Ω(ρ 2 ), then it holds that ∥x 1 (t + 1)∥ ≥ Ω(ρ 2 ). Proof of Lemma I.15. As t ̸ ∈ S, it holds that ∥x(t)∥ ≥ ηλ 2 1 2 -ηλ 1 ρ + Θ(ηρ 2 ). By Lemma I.10, if we write x ′ (t + 1) as shorthand of x(t) -ηA(t + 1)x(t) -ηρA 2 (t) x(t) ∥x(t)∥ , then ∥x(t + 1) -x ′ (t + 1)∥ = O(ηρ 2 ). Define I quad j as {x|R j (x) ≤ 0}. Then we can find a surrogate x sur (t) such that x sur (t) ∈ (∩ j∈[M ] I quad j ) ∩ K h , and ∥x sur (t) -x(t)∥ 2 = O(ηρ 2 ). 
We will write x ′ sur (t + 1) as shorthand of x sur (t) -ηA(t)x sur (t) -ηρA 2 (t) xsur(t) ∥xsur(t)∥ . As ∥x(t)∥ = Ω(ηρ), we have ∥ x sur (t) ∥x sur (t)∥ - x(t) ∥x(t)∥ ∥ 2 = O(ρ) . Hence we have that ∥x(t + 1) - x ′ sur (t + 1)∥ = ∥x(t + 1) -x ′ (t + 1)∥ + ∥x ′ (t + 1) -x ′ sur (t + 1)∥ = O(ηρ 2 ) Notice we have ∥x sur (t)∥ 2 ≥ ηλ 2 1 2-ηλ1 ρ for properly chosen function in the definition S, hence, by Lemma H.5 ∥x ′ sur (t + 1)∥ 2 ≤ ηλ 2 1 2 -ηλ 1 ρ. This further implies t + 1 ∈ S. We also have |⟨x ′ sur (t + 1), v 1 ⟩| = |⟨x sur (t), v 1 ⟩ -ηλ 1 ⟨x sur (t), v 1 ⟩ -ηρλ 2 1 ⟨x sur (t), v 1 ⟩ ∥x sur (t)∥ | = |⟨x sur (t), v 1 ⟩|(ηλ 1 + ηρλ 2 1 ∥x sur (t)∥ -1) We will discuss by cases. Let C satisfies that C = 1 2 (λ 4 2 + λ 4 1 ). 1. If ∥x sur (t)∥ ≤ Cηρ, then as we have λ 2 1 C ≥ √ 2ζ 2 √ ζ 2 +(ζ-∆) 2 ) . |⟨x ′ sur (t + 1), v 1 ⟩| ≥ |⟨x sur (t), v 1 ⟩|( λ 2 1 C -1) ≥ Ω(ρ 2 ). 2. If ∥x sur (t)∥ ≥ Cηρ, then as x(t) ∈ I 2 , we have that |⟨x sur (t), v 1 ⟩| ≥ Ω(ηρ). Then as ∥x sur (t)∥ ≤ ∥x(t)∥ 2 + O(ηρ 2 ) ≤ λ 2 1 ηρ -Ω(ρ 2 ), we have that |⟨x ′ sur (t + 1), v 1 ⟩| ≥ |⟨x sur (t), v 1 ⟩|( λ 2 1 ηρ λ 2 1 ηρ -Ω(ρ 2 ) -1) ≥ Ω(ρ 2 ). By previous approximation results, we have that ∥x 1 (t + 1)∥ ≥ Ω(ρ 2 ). Lemma I.16. Under the condition of Theorem I. Lemma I.17. Under the condition of Theorem I.3, there exists constant C > 0 independent of η and ρ, assuming that (1) x(t) ∈ (∩ j∈[M ] I j ) ∩ K 7h/8 , (2) t ∈ S, (3) Ω(ρ 2 ) ≤ ∥x 1 (t)∥, then ∥x 1 (next(t))∥ ≥ ∥x 1 (t)∥ -O(ηρ 2 ) . Proof of Lemma I.17. This is by standard approximation as in previous proof and Lemma H.9. Lemma I.18. Under the condition of Theorem I.3, there exists constant C > 0 independent of η and ρ, assuming that (1) 2-ηλ2 , else the result holds already. x(t) ∈ (∩ j∈[M ] I j ) ∩ K 7h/8 , (2) t ∈ S (3) Ω(ρ 2 ) ≤ ∥x 1 (t)∥ ≤ 1 2 ηλ 2 1 2-ηλ1 + ηλ 2 2 2-ηλ2 ρ, then ∥x 1 (next(t))∥ ≥ min{(1 + Cη)∥x 1 (t)∥, 1 ηλ 2 1 2 -ηλ 1 + ηλ 2 2 2 - By assumption, we have ∥x 1 (t)∥ ≥ Ω(ρ 2 ). 
Using Lemma I.10, ∥x(t + 1) -x(t) + ηAx(t) + ηρA 2 x(t) ∥x(t)∥ ∥ ≤ O(ηρ 2 ) .

Denote

x ′ (t + 1) = x(t) + ηAx(t) + ηρA 2 x(t) ∥x(t)∥ , as the one step update of SAM on the quadratic approximation of the general loss. Now using Lemma I.14 and the induction hypothesis, we have for some C 1 and C 2 smaller than 1, ∥x(t )∥ ≥ C 1 ηλ 2 1 2-ηλ1 ρ ⇒ ∥x ′ (t + 1)∥ ≤ C 2 ηλ 2 1 2-ηλ1 ρ. We will discuss by cases, 1 If ∥x(t)∥ ≤ C 1 ηλ 2 1 2-ηλ1 ρ If next(t) = t + 1, then ∥x ′ 1 (t + 1)∥ ∥x 1 (t)∥ = ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ ∥x(t)∥ ≥ (2 -C 1 ) -ηλ 1 + C 1 ηλ 1 C 1 ≥ 1 C 1 As we have x1 (t) = Ω(ρ 2 ), we have x ′ 1 (t + 1) = Ω(ρ 2 ), then as ∥x 1 (t + 1) -x ′ 1 (t + 1)∥ = O(ηρ 2 ), this implies ∥x 1 (t + 1)∥ ≥ ∥x ′ 1 (t + 1)∥ -O(ηρ 2 ) ≥ 1 C 1 ∥x 1 (t + 1)∥ -O(ηρ 2 ) ≥ 1 2 ( 1 C 1 + 1)∥x 1 (t)∥. If next(t) = t + 2, define x ′ (t + 2) = x ′ (t + 1) -ηAx ′ (t + 1) -ηρA 2 x ′ (t+1) ∥x ′ (t+1)∥ , as ∥x 1 (t + 1)∥ = Ω(ηρ), by Lemma I.9, we have ∥x ′ (t + 2) -x(t + 2)∥ = O(ηρ 2 ). ∥x ′ (t + 2)∥ ∥x 1 (t)∥ = (ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥)(ηρλ 2 1 -(1 -ηλ 1 )∥x ′ (t + 1)∥) ∥x(t)∥∥x ′ (t + 1)∥ ≥ (ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥) ηρλ 2 1 -(1 -ηλ 1 ) ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ ∥x(t)∥ (ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥) = ηρλ 2 1 -(1 -ηλ 1 ) ηρλ 2 1 -(1 -ηλ 1 )∥x(t)∥ ∥x(t)∥ ≥ (1 -ηλ 1 ) 2 + ηλ 1 C 1 (2 -ηλ 1 ) ≥ 1 + 4Cη. Combining with |x 1 (t)| ≥ Ω(ρ 2 ), we have that ∥x 1 (next(t))∥ ≥ (1 + Cη)∥x 1 (t)∥ 2 Case 2 ∥x(t)∥ > C 1 ηλ 2 1 2-ηλ1 ρ, then ∥x(t + 1)∥ ≤ C 2 ηλ 2 1 2-ηλ1 ρ, next(t) = t + 1 By Lemma I.17, ∥x 1 (t + 1)∥ ≥ (1 -Cη)∥x 1 (t)∥. As ∥x(next(t))∥ ≤ C 2 ηλ 2 1 2-ηλ1 , similar to the first case, ∥x 1 (next(next(t)))∥ ≥ (1 + 4Cη)∥x 1 (next(t))∥ ≥ (1 + Cη)∥x 1 (t)∥. In conclusion, if ∥x 1 (t)∥ ≤ 1 Lemma I.19. Under the condition of Theorem I.3, there exists constant T 2 > 0 independent of η and ρ, we would have that when t = t ALIGN = ⌈T 2 ln(1/ρ)/η⌉, |⟨x(t) -Φ(x(t)), v 1 (x(t))⟩| = Θ(ηρ) , max j∈[2:M ] |⟨x(t) -Φ(x(t)), v j (x(t))⟩| = O(ηρ 2 ) . 
Further if x(t ′ ) ∈ K h holds for t ′ = 0, 1, ..., t LOCAL , then for t satisfying t ALIGN ≤ t ≤ t LOCAL |⟨x(t) -Φ(x(t)), v 1 (x(t))⟩| = Θ(ηρ) , max j∈[2:M ] |⟨x(t) -Φ(x(t)), v j (x(t))⟩| = O(ηρ 2 ) . Proof of Lemma I.19. Let C be the constant defined in Lemma I.18. By Lemma I.15, we can suppose WLOG ∥x 1 (0)∥ ≥ ρ 2 and 0 ∈ S. Define C 1 ≜ ⌈log 1+Cη ( ηλ 2 1 2 -ηλ 1 /ρ)⌉ C 2 ≜ C 1 + ⌈ln max{1-µ 2 2ζ 2 ,1-∆ 2 4ζ 2 } ρ 2 ζ 2 ⌉ = O(log(1/ρ)/η). We will choose t ALIGNMID as the minimal t ∈ S, such that ∥x 1 (t)∥ ≥ 1 2 ηλ 2 1 2-ηλ1 + ηλ 2 2 2-ηλ2 ρ. Then by induction and Lemmas I.17 and I.18, we easily have that for t ≤ min{C 2 + 1, t ALIGNMID } and t ∈ S, we have that x(t) ∈ K 7h/8 ∩ (∩ j I j ) , ∥x 1 (t)∥ ≥ min{(1 + Cη) t/4 ∥x 1 (0)∥, 1 2 ηλ 2 1 2 -ηλ 1 + ηλ 2 2 2 -ηλ 2 ρ} or ∥x 1 (next(t))∥ ≥ min{(1 + Cη) t/4 ∥x 1 (0)∥, 1 ηλ 2 1 2 -ηλ 1 + ηλ 2 2 2 -ηλ 2 ρ} . The detailed induction is analogous to previous inductive argument and is omitted. If t ALIGNMID ≥ C 1 , then we have for the minimal t ≥ C 1 and t ∈ S ∥x 1 (t)∥ ≥ ηλ 2 1 2 -ηλ 1 ρ . This is a contradiction and we have that t ALIGNMID ≤ C 1 . By Lemma I.17, ∥x 1 (next(t))∥ ≥ ∥x 1 (t)∥ -O(ηρ 2 ) for ∥x 1 (t)∥ ≥ 1 2 ηλ 2 1 2-ηλ1 + ηλ 2 2 2-ηλ2 ρ and t ∈ S and then by Lemma I.18, ∥x(t)∥ ≥ ∥x 1 (t))∥ ≥ 1 4 ηλ 2 1 2 -ηλ 1 + 3 ηλ 2 2 2 -ηλ 2 ρ. for C 2 ≥ t ≥ t ALIGNMID . We will then show that for t ≥ t ALIGNMID + C 1 iteration, ∥P (2:D) x(t + 1)∥ ≤ O(ηρ 2 ). For C 1 ≥ t ≥ t ALIGNMID , 1 -ηλ 2 -ηρ λ 2 2 ∥x(t)∥ ≤ 1 -ηλ D -ηρ λ 2 D ∥x(t)∥ ≤ 1 - λ 2 D 2λ 2 1 ≤ 1 - µ 2 ζ 2 . 
Notice that, 1 -ηλ 2 -ηρ λ 2 2 ∥x(t)∥ ≥ 1 -ηλ 2 -ηρ λ 2 2 ∥x(t)∥ ≥1 -ηλ 2 - 4λ 2 2 λ 2 1 + 3λ 2 2 (2 -ηλ 2 ) ≥ -1 + 2(λ 2 1 -λ 2 2 ) λ 2 1 + 3λ 2 2 ≥ -1 + ∆ 2 2ζ 2 Hence, ∥P (2:D) (t)x ′ (t + 1)∥ 2 ≤ max{1 - µ 2 ζ 2 , 1 - ∆ 2 2ζ 2 }∥P (2:D) (t)x(t)∥ 2 Now by Lemma K.1 and Theorem K.3, ∥P (2:D) (t) -P (2:D) (t + 1)∥ ≤ O(ηρ 2 ) ∥v 1 (t) -v 1 (t + 1)∥ ≤ O(ηρ 2 ) ∥λ 1 (t) -λ 1 (t + 1)∥ ≤ O(ηρ 2 ) By Lemma I.10, we have that ∥x ′ (t + 1) -x(t + 1)∥ = O(ηρ 2 ). Combining the above, it holds that ∥P (2:D) (t + 1)x(t + 1)∥ ≤ max{1 - µ 2 2ζ 2 , 1 - ∆ 2 4ζ 2 }∥P (2:D) (t)x(t)∥ + O(ηρ 2 ) Hence when t = t ALIGN = t ALIGNMID + C 2 , ∥x(t)∥ ≥ ∥x 1 (t)∥ ≥ Ω(ηρ) , ∥P (2:D) (t)x(t)∥ ≤ O(ηρ 2 ) . By x(t) ∈ I 1 , we easily have ∥x 1 (t)∥ = O(ηρ). Hence we conclude that ∥x 1 (t)∥ = Θ(ηρ) , ∥P (2:D) (t)x(t)∥ = O(ηρ 2 ) . The second claim is just another induction similar to previous steps and is omitted as well.

I.2.2 TRACKING RIEMANNIAN GRADIENT FLOW

We are now ready to show that Φ(x(t)) will track the solution of Equation 7. The main principal of this proof has been introduced in Section 4.3. Lemma I.20. Under the condition of Theorem I.3, for any t satisfying that x(t) ∈ K h , ∥x 1 (t)∥ = Θ(ηρ), ∥P (2:D) (t)x(t)∥ = O(ηρ 2 ) , it holds that ∥Φ(x(t + 1)) -Φ(x(t)) + ηρ 2 P ⊥ Φ(x(t)),Γ ∇λ 1 (t)/2∥ ≤ O(ηρ 3 + η 2 ρ 2 ) . Proof of Lemma I.20. To begin with, we can approximate Φ(x(t + 1)) -Φ(x(t)) by its first order Taylor Expansion, by Lemma F.7, ∥Φ(x(t + 1)) -Φ(x(t)) -∂Φ(x(t))(x(t + 1) -x(t))∥ = O(∥x(t + 1) -x(t)∥ 2 ) = O(η 2 ρ 2 ) . Then by plugging in the update rule and another Taylor Expansion, ∥∂Φ(x(t))(x(t + 1) -x(t))-ηρ∂Φ(x(t))∇ 2 L (x) ∇L (x) ∥∇L (x) ∥ -ηρ 2 ∂Φ(x(t))∂∇ 2 L (x)[ ∇L (x) ∥∇L (x) ∥ , ∇L (x) ∥∇L (x) ∥ ]/2∥ 2 = O(ηρ 3 ). Using Lemma F.3, we have ∥ηρ∂Φ(x(t))∇ 2 L (x) ∇L (x) ∥∇L (x) ∥ ∥ = ηρ∥∇L (x) ∥∥∂ 2 Φ(x(t)) ∇L (x) ∥∇L (x) ∥ , ∇L (x) ∥∇L (x) ∥ ∥ = O(ηρ∥∇L (x) ∥) . Putting together, we have that ∥Φ(x(t + 1)) -Φ(x(t)) -ηρ 2 ∂Φ(x(t))∂∇ 2 L (Φ(x(t)))[ ∇L (x) ∥∇L (x) ∥ , ∇L (x) ∥∇L (x) ∥ ]/2∥ ≤O(η 2 ρ 2 + ηρ 3 ) + O(ηρ∥∇L (x) ∥) . As we have ∥x(t)∥ = Θ(ηρ), hence by Lemmas F.2 and I.7, ∥Φ(x(t + 1)) -Φ(x(t)) -ηρ 2 ∂Φ(x(t))∂∇ 2 L (Φ(x(t)))[ ∇L (x(t)) ∥∇L (x(t)) ∥ , ∇L (x(t)) ∥∇L (x(t)) ∥ ]/2∥ ≤O(ηρ 3 + η 2 ρ 2 ) Finally, we have that ∥ηρ 2 ∂Φ(x(t))∂∇ 2 L (Φ(x(t)))[ ∇L (x(t)) ∥∇L (x(t)) ∥ , ∇L (x(t)) ∥∇L (x(t)) ∥ ]/2 -ηρ 2 ∂Φ(x(t))∂∇ 2 L (Φ(x(t)))[v 1 (t), v 1 (t)]/2∥ ≤ O(ηρ 3 ) as the angle between ∇L(x) ∥∇L(x)∥ and v 1 (t) is O(ρ). By Lemma F.3, it holds that ∂Φ(x(t))∂∇ 2 L (Φ(x(t)))[v 1 (t), v 1 (t)] = P ⊥ X,Γ ∇(λ 1 (t)) Putting together we have that, ∥Φ(x(t + 1)) -Φ(x(t)) + ηρ 2 P ⊥ X,Γ ∇λ 1 (t)/2∥ ≤ O(ηρ 3 + η 2 ρ 2 ). It completes the proof.
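To see the drift of Lemma I.20 in action, consider the toy loss L(x, y) = ½(1 + y²)x² (our own example, not from the paper). Its zero-loss manifold is Γ = {x = 0}, the only nonzero Hessian eigenvalue on Γ is λ_1(y) = 1 + y², and Φ(x, y) ≈ (0, y) near Γ. The predicted per-step movement of Φ is −ηρ² P⊥∇λ_1/2 = (0, −ηρ²y), so y should decay roughly like (1 − ηρ²)ᵗ, whereas plain gradient descent would leave y frozen once x ≈ 0.

```python
import numpy as np

def grad(v):
    x, y = v
    return np.array([(1 + y**2) * x, y * x**2])   # gradient of L(x,y) = 0.5*(1+y^2)*x^2

eta, rho = 0.05, 0.1
v = np.array([0.5, 1.0])                          # start near Gamma, at sharpness 1 + y^2 = 2
for _ in range(10000):
    g = grad(v)
    # Full-batch SAM step; the tiny epsilon only guards against an exact-zero gradient.
    v = v - eta * grad(v + rho * g / (np.linalg.norm(g) + 1e-30))

# Riemannian-flow prediction: y(t) ~ y(0) * (1 - eta*rho^2)^t, i.e. about exp(-5) here.
assert abs(v[1]) < 0.1          # sharpness-reducing drift along the manifold happened
assert abs(v[0]) < 0.05         # stayed in a small neighborhood of Gamma = {x = 0}
```

The x-coordinate settles into the Θ(ηρ)-amplitude oscillation of Phase II, and it is exactly this oscillation, through the ascent point x ± ρ, that generates the slow O(ηρ²)-per-step motion of Φ toward flatter points.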

I.3 PROOF OF THEOREM 4.5

Proof of Theorem 4.5. By Theorem I.1, there exists constant T 1 independent of η, ρ, such that for any T ′ 1 > T 1 independent of η, ρ, it holds that max T1 ln(1/ηρ)≤ηt≤T ′ 1 ln(1/ηρ) max j∈[M ] R j (x(t)) = O(ηρ 2 ). max T1 ln(1/ηρ)≤ηt≤T ′ 1 ln(1/ηρ) ∥Φ(x(t)) -Φ(x init )∥ = O((η + ρ) ln(1/ηρ)). By Assumption I.2, there exists step T 1 ln(1/ηρ) ≤ ηt PHASE ≤ T ′ 1 ln(1/ηρ), such that max j∈[M ] R j (x(t PHASE )) = O(ηρ 2 ), ∥Φ(x(t PHASE )) -Φ(x init )∥ = O((η + ρ) ln(1/ηρ)), |⟨x(t PHASE ) -Φ(x(t PHASE )), v 1 (x(t PHASE ))⟩| ≥ Ω(ρ 2 ). ∥x(t PHASE )∥ 2 ≤ λ 1 (t PHASE )ηρ -Ω(ρ 2 ). Hence by Theorem I.3, if we consider a translated process with x ′ (t) = x(t + t PHASE ), we would have for any T 3 such that the solution X of Equation 7 is well defined, we have that for t = ⌈ T3 ηρ 2 ⌉ ∥Φ(x ′ (t)) -X(ηρ 2 t)∥ 2 = O(η ln(1/ρ)) . This implies for t satisfying X(ηρ 2 (t -t PHASE )) is well-defined, ∥Φ(x(t)) -X(ηρ 2 (t -t PHASE ))∥ 2 = O(η ln(1/ρ)). Finally, as ∥X(ηρ 2 (t -t PHASE )) -X(ηρ 2 t)∥ 2 = O(ηρ 2 t PHASE ) = O(ρ ln(1/ηρ)) = O(η ln(1/ρ)). We have that ∥Φ(x(t)) -X(ηρ 2 t)∥ 2 = O(η ln(1/ρ)). The alignment result is a direct consequence of Theorem I.3. I.4 PROOFS OF COROLLARIES 4.6 AND 4.7 Proof of Corollary 4.6. We will do a Taylor expansion on L Max ρ . By Theorem I.1 and I.3, we have ∥x(⌈T 3 /ηρ 2 ⌉)) -X(T 3 )∥ = Õ(η + ρ) and ∥x(⌈T 3 /ηρ 2 ⌉)) -Φ(x(⌈T 3 /ηρ 2 ⌉)))∥ 2 = O(ηρ). For convenience, we denote x(⌈T 3 /ηρ 2 ⌉) by x. R Max ρ (x) = max ∥v∥2≤1 ρv T ∇L(x) + ρ 2 v T ∇ 2 L(x)v/2 + O(ρ 3 ) Since max ∥v∥ 2 ≤1 ∥v T ∇L(x)∥ 2 = O(∥x -Φ(x)∥ 2 ) = O(ηρ), it holds that R Max ρ (x) = ρ 2 max ∥v∥2≤1 v T ∇ 2 L(x)v/2 + O(η 2 ρ 2 + ρ 3 ) = ρ 2 λ 1 (∇ 2 L(x)) + O(η 2 ρ 2 + ρ 3 ) = ρ 2 λ 1 (∇ 2 L(X(T 3 ))) + Õ(ηρ 2 ), which completes the proof. Proof of Corollary 4.7. We choose T such that X(T ϵ ) is sufficiently close to X(∞), such that λ 1 (X(T ϵ )) ≤ λ 1 (X(∞)) + ϵ/2. 
By Corollary 4.6 (let T 3 = T ϵ ), we have that for all ρ, η such that η ln(1/ρ) and ρ/η are sufficiently small, ∥R Max ρ (x(⌈T ϵ /(ηρ 2 )⌉)) -ρ 2 λ 1 (X(T ϵ ))/2∥ ≤ Õ(ηρ 2 ). This further implies ∥R Max ρ (x(⌈T ϵ /(ηρ 2 )⌉)) -ρ 2 λ 1 (X(∞))/2∥ ≤ ϵρ 2 + Õ(ηρ 2 ). We also have L(x(⌈T ϵ /(ηρ 2 )⌉)) -inf x∈U ′ L(x) = o(1). Then we can leverage Theorems G.6 and G.3 to get the desired bound.
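Corollary 4.6 says that at the end of Phase II the worst-case sharpness R^Max_ρ(x) = max_{∥v∥≤1} L(x + ρv) − L(x) is governed by ρ²λ_1(∇²L(x))/2. The sketch below (ours; the quadratic and the constants are arbitrary) estimates R^Max_ρ by sampling perturbation directions at a point with a small gradient, as in the Taylor-expansion step of the proof, and compares it against the ρ²λ_1/2 prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 0.5, 0.2])
A = np.diag(lam)
L = lambda z: 0.5 * z @ A @ z

rho = 0.05
x = np.array([1e-4, 0.0, 0.0])        # near-minimizer: ||grad L(x)|| is negligible vs rho

# Monte-Carlo estimate of R^Max_rho(x) = max over ||v|| <= 1 of L(x + rho*v) - L(x);
# for a quadratic the maximum is attained on the unit sphere.
V = rng.normal(size=(20000, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)
r_max = max(L(x + rho * v) - L(x) for v in V)

prediction = rho**2 * lam[0] / 2      # rho^2 * lambda_1 / 2
assert abs(r_max - prediction) / prediction < 0.05
```

The gap between the sampled maximum and ρ²λ_1/2 comes only from the O(∥x − Φ(x)∥ρ) linear term and the finite sampling of directions, matching the error terms in the corollary.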

I.5 DERIVATIONS FOR SECTION 4.3

We will first show our derivation of Equation 9. In Phase II, x(t) is O(ηρ)-close to the manifold Γ and therefore it can be shown that ∥x(t) -Φ(x(t))∥ 2 = O(ηρ) holds for every step in Phase II. This also implies that ∥x(t + 1) -x(t)∥ 2 = O(ηρ) (See Lemma F.7). Using Taylor expansion around x(t), we have that Φ(x(t + 1)) -Φ(x(t)) =∂Φ(x(t))(x(t + 1) -x(t)) + O(∥x(t + 1) -x(t)∥ 2 2 ) = -η∂Φ(x(t))∇L x(t) + ρ ∇L(x(t)) ∥∇L(x(t))∥ 2 + O(η 2 ρ 2 ) . For any x ∈ R D , applying Taylor expansion on ∇L x + ρ ∇L(x) ∥∇L(x)∥ 2 around x, we have that ∇L x + ρ ∇L(x) ∥∇L(x)∥ 2 =∇L(x) + ρ∇ 2 L(x) ∇L(x) ∥∇L(x)∥ 2 + ρ 2 2 ∂ 2 (∇L)(x) ∇L(x) ∥∇L(x)∥ 2 , ∇L(x) ∥∇L(x)∥ 2 + O(ρ 3 ). Using Equation 32 with x = x(t), plugging in Equation 31 and then rearranging, we have that Φ(x(t + 1)) -Φ(x(t)) + ηρ 2 2 ∂Φ(x(t))∂ 2 (∇L)(x(t)) ∇L(x(t)) ∥∇L(x(t))∥ 2 , ∇L(x(t)) ∥∇L(x(t))∥ 2 = -η∂Φ(x(t))∇L(x(t)) -ηρ∂Φ(x(t))∇ 2 L(x(t)) ∇L(x(t)) ∥∇L(x(t))∥ 2 + O(η 2 ρ 2 + ηρ 3 ) . By Lemma 3.1, we have that ∂Φ(x(t))∇L(x(t)) = 0. Furthermore, by Lemma F.5, we have that ∂Φ(Φ(x(t)))∇ 2 L(Φ(x(t))) = 0. This implies that ∂Φ(x(t))∇ 2 L(x(t)) = ∂Φ(Φ(x(t)))∇ 2 L(Φ(x(t))) + O(∥x(t) -Φ(x(t))∥ 2 ) = O(ηρ) . Thus we conclude that Φ(x(t + 1)) -Φ(x(t)) = - ηρ 2 2 ∂Φ(x(t))∂ 2 (∇L)(x(t)) ∇L(x(t)) ∥∇L(x(t))∥ 2 , ∇L(x(t)) ∥∇L(x(t))∥ 2 +O(η 2 ρ 2 + ηρ 3 ) . We will then show our derivation of Equation 10 Φ(x(t + 1)) -Φ(x(t)) = - ηρ 2 2 ∂Φ(x(t))∂ 2 (∇L)(x(t)) ∇L(x(t)) ∥∇L(x(t))∥ 2 , ∇L(x(t)) ∥∇L(x(t))∥ 2 + O(η 2 ρ 2 + ηρ 3 ) = - ηρ 2 2 ∂Φ(x(t))∂ 2 (∇L)(x(t)) v 1 (∇ 2 L(x(t))), v 1 (∇ 2 L(x(t))) + O(η 2 ρ 2 + ηρ 3 ) = - ηρ 2 2 ∂Φ(x(t))∇λ 1 (∇ 2 L(x(t))) + O(η 2 ρ 2 + ηρ 3 ) = - ηρ 2 2 ∂Φ(Φ(x(t)))∇λ 1 (∇ 2 L(Φ(x(t)))) + O(η 2 ρ 2 + ηρ 3 ), where the second to last step we use the property of the derivative of eigenvalue (Lemma K.7) and the last step is due to Taylor expansion of ∂Φ(•)∇λ 1 (∇ 2 L(•)) at Φ(x(t)) and the fact that ∥Φ(x(t)) -x(t)∥ = O(ηρ). We will finally show our derivation of Equation 12. 
The update of the gradient (Equation 12) can be viewed as an O(ηρ²)-perturbed version of the update of the iterate in the quadratic case. Note that O(ηρ²) is a higher-order term compared to the other two terms, which are of order Θ(η²ρ) and Θ(ηρ) respectively. By controlling the error terms, the mechanism and analysis of the implicit alignment between the Hessian and the gradient still apply to the general case. We can also show that once this alignment happens, it is maintained until the end of our analysis, which lasts Θ(η⁻¹ρ⁻²) steps. There exists T > 0 such that ∥ϕ(x_init, T) − Φ(x_init)∥₂ ≤ Ch/2. Consider
x(t+1) = x(t) − η∇L_k(x(t) + ρ∇L_k(x(t))/∥∇L_k(x(t))∥) = x(t) − η∇L_k(x(t)) + O(ηρ).
By Theorem L.1 with b(x) = −∇L(x), p = η and ϵ = O(ρ), for sufficiently small η and ρ, the iterates x(t) track the gradient flow ϕ(x_init, T) in O(1/η) steps in expectation. Quantitatively, with probability 1 − ρ², for t_GF = ⌈T₀/η⌉, we have that
∥x(t_GF) − ϕ(x_init, T₀)∥₂ = Õ(√p + ϵ) ≤ Õ(η^{1/2} + ρ).
This implies x(t_GF) ∈ K_h; hence, by Taylor expansion on Φ,
∥Φ(x(t_GF)) − Φ(x_init)∥₂ = ∥Φ(x(t_GF)) − Φ(ϕ(x_init, T))∥₂ ≤ O(∥x(t_GF) − ϕ(x_init, T)∥₂) ≤ Õ(η^{1/2} + ρ).
This implies
∥x(t_GF) − Φ(x(t_GF))∥₂ ≤ ∥x(t_GF) − ϕ(x_init, T₀)∥₂ + ∥ϕ(x_init, T₀) − Φ(x_init)∥₂ + ∥Φ(x_init) − Φ(x(t_GF))∥₂ ≤ Ch/2 + Õ(η^{1/2} + ρ) ≤ Ch ≤ h/4.
By Taylor expansion, L(x(t_GF)) ≤ ζ∥x(t_GF) − Φ(x(t_GF))∥₂². Lemma J.4 then guarantees that x(τ) ∈ K_h for all t₀ ≤ τ ≤ t, and moreover that ∥Φ(x(t)) − Φ(x(t₀))∥ = O((η + ρ) ln(1/ηρ)).

Proof of Lemma J.4. We prove by induction. For τ = t₀, the result holds trivially. Suppose the result holds for t − 1; then for any τ satisfying t₀ ≤ τ ≤ t − 1, by Lemmas F.1 and F.8,
∥Φ(x(τ+1)) − Φ(x(τ))∥ ≤ ξηρ∥∇L(x(τ))∥₂ + νηρ² + ξη²∥∇L(x(τ))∥₂² + ξζ²η²ρ² = O(η² + ηρ).
Also, by Lemma F.1, ∥x(t) − Φ(x(t))∥₂ ≤ h/(2√2). This implies
dist(K, x(t)) ≤ dist(K, x(t₀)) + ∥x(t₀) − Φ(x(t₀))∥₂ + ∥Φ(x(t₀)) − Φ(x(t))∥ + ∥Φ(x(t)) − x(t)∥ ≤ 0.99h + O(η²(t − t_GF)) = 0.99h + O(η ln(1/ηρ)) ≤ h.

Lemma J.5. Under the condition of Theorem J.1, if x(τ) ∈ K_h, then we have that
E[L(x(τ+1)) | x(τ)] ≤ L(x(τ)) − (ηµ/2) L(x(τ)).
Moreover it holds that
E[ln L(x(τ+1)) | x(τ)] ≤ ln E[L(x(τ+1)) | x(τ)] ≤ ln L(x(τ)) − ηµ/2.

Proof of Lemma J.5. By Lemma F.8 and Taylor expansion,
E[L(x(τ+1)) | x(τ)] = E[L(x(τ) − η∇L_k[x(τ) + ρ∇L_k(x(τ))/∥∇L_k(x(τ))∥]) | x(τ)]
≤ E[L(x(τ)) − η⟨∇L(x(τ)), ∇L_k(x(τ) + ρ∇L_k(x(τ))/∥∇L_k(x(τ))∥)⟩] + E[(ζη²/2)∥∇L_k[x(τ) + ρ∇L_k(x(τ))/∥∇L_k(x(τ))∥]∥₂²]
≤ L(x(τ)) − η∥∇L(x(τ))∥₂² + ηρζ∥∇L(x(τ))∥₂ + ζη²E[∥∇L_k(x(τ))∥₂²] + ζ³η²ρ²
≤ L(x(τ)) − (η/2)∥∇L(x(τ))∥₂² ≤ L(x(τ)) − (ηµ/2) L(x(τ)).

Lemma J.6. Under the condition of Theorem J.1, assuming x(t₀) ∈ K_{h/4} and L(x(t₀)) ≤ µh²/32, then with probability 1 − O(ρ), for any t satisfying t₀ ≤ t ≤ t₀ + O(ln(1/ηρ)/η), it holds that x(t) ∈ K_h. Moreover, we have that ∥Φ(x(t)) − Φ(x(t₀))∥ = O((η + ρ) ln(1/ηρ)).

Proof of Lemma J.6. By a union bound,
P(L(x(t)) ≥ µh²/16 and x(τ) ∈ K_h, ∀ t₀ ≤ τ ≤ t−1) ≤ Σ_{τ=t₀}^{t} P(L(x(t)) ≥ µh²/16 and L(x(τ)) ≤ µh²/32 and ∀ t−1 ≥ τ′ ≥ τ+1, µh²/16 > L(x(τ′)) > µh²/32 and ∀ t−1 ≥ τ″ ≥ τ, x(τ″) ∈ K_h).
Each term is in turn bounded by
P(L(x(t)) ≥ µh²/16 and ∀ t−1 ≥ τ′ ≥ τ+1, L(x(τ′)) > µh²/32 and ∀ t−1 ≥ τ″ ≥ τ, x(τ″) ∈ K_h | L(x(τ)) ≤ µh²/32).
Define a coupled process L̃ with L̃(τ+1) = ln L(x(τ+1)) and, for τ′ > τ+1, L̃(τ′) = ln L(x(τ′)) if L̃(τ′−1) = ln L(x(τ′−1)) ≥ ln(µh²/32), and L̃(τ′) = L̃(τ′−1) − ηµ/2 otherwise. Then clearly
P(L(x(t)) ≥ µh²/16 and ∀ t ≥ τ′ ≥ τ+1, L(x(τ′)) > µh²/32 and ∀ t ≥ τ″ ≥ τ, x(τ″) ∈ K_h | L(x(τ)) ≤ µh²/32) ≤ P(L̃(t) ≥ ln(µh²/16)).
Consider a fixed τ′ satisfying τ+1 ≤ τ′ ≤ t. By Lemma J.5, we have that E[L̃(τ′+1)] − L̃(τ′) ≤ −ηµ/2, hence L̃(t) + ηµt/2 is a supermartingale. Further, if L(x(τ′−1)) ≥ µh²/32, then L(x(τ′−1)) − L(x(τ′)) = O(∥x(τ′−1) − x(τ′)∥) = O(η). Using the smoothness of ln(·) at µh²/32, which is a positive constant, |L̃(τ′+1) − L̃(τ′)| ≤ O(η) ≤ Cη. Here C is a constant independent of η. This implies L(x(τ+1)) ≤ µh²/16. Now, by the Azuma–Hoeffding bound (Lemma K.4), we have that
P(L̃(t) − L̃(τ+1) + (t−τ−1)ηµ/2 > a) ≤ 2 exp(−a²/(8(t−τ−1)(C+µ)²η²)).
With a = ln(µh²/(16 L(τ+1))) + (t−τ−1)ηµ/2 ≥ (ln 2 + (t−τ−1)ηµ)/2, we have that
P(L̃(t) > ln(µh²/16)) ≤ 2 exp(−(ln 2 + (t−τ−1)ηµ)²/(32(C+µ)²η²)) ≤ 2 exp(−ln 2 · (t−τ−1)µ/(8(C+µ)²η)).
Hence we have
P(∃ t₀ ≤ t ≤ t₀ + O(ln(1/ηρ)/η), L(x(t)) ≥ µh²/16) ≤ O(2 exp(−ln 2 · (t−τ−1)µ/(8(C+µ)²η)) ln²(1/ηρ)/η²) ≤ ρ.
Hence with probability 1 − ρ, L(x(t)) ≤ µh²/16 for all t₀ ≤ t ≤ t₀ + O(ln(1/ηρ)/η); combining this with Lemma J.4, we have completed the proof.

Lemma J.7. Under the condition of Theorem J.1, assuming there exists t_GF such that L(x(t_GF)) ≤ µh²/32 and x(t_GF) ∈ K_{h/4}, then with probability 1 − O(ρ), there exists t_DEC = t_GF + O(ln(1/ρ)/η) such that x(t_DEC) is in an O(ρ)-neighborhood of Γ; quantitatively, we have that ∥∇L(x(t_DEC))∥₂ ≤ 4ζρ. Moreover, the movement of the projection Φ(x(·)) on the manifold is bounded: ∥Φ(x(t_GF)) − Φ(x(t_DEC))∥₂ = O((η + ρ) ln(1/ρ)).

Proof of Lemma J.8. By Taylor expansion at p, ∇L_k(x) = Λ_k(x)w_k(p)w_k(p)⊤(x − p) + O(ν∥x − p∥₂²). Consequently, when |w_k⊤(x − p)| ≥ ∥x − p∥₂^{3/2}, we have
∥∇L_k(x) − Λ_k w_k w_k⊤(x − p)∥₂ ≤ O(∥x − p∥₂²),
∥∇L_k(x)∥ ≥ ∥Λ_k w_k w_k⊤(x − p)∥₂ − O(∥x − p∥₂²) ≥ Ω(∥x − p∥₂^{3/2}).
Concluding,
∥∇L_k(x)/∥∇L_k(x)∥ − Λ_k w_k w_k⊤(x − p)/∥Λ_k w_k w_k⊤(x − p)∥∥₂ ≤ O(∥x − p∥₂^{1/2}).
Hence we have
∇L_k(x)/∥∇L_k(x)∥ = sign(w_k⊤(x − p)) w_k + O(∥x − p∥₂^{1/2}).
Comparing (35) and (36), we have s = sign(w_k(p)⊤(x − p)) when |w_k⊤(x − p)| ≥ ∥x − p∥₂^{3/2}.

Lemma J.9. Under the condition of Theorem J.1, for any constant C > 0 independent of η, ρ, there exist constants C₁ > C₂ > 0 independent of η, ρ such that if x(t) ∈ K_h and C₁ηρ ≤ ∥x(t) − Φ(x(t))∥ ≤ Cρ, then we have that
E_k[∥x(t+1) − Φ(x(t+1))∥₂ | x(t)] ≤ ∥x(t) − Φ(x(t))∥₂ − C₂ηρ.

Proof of Lemma J.9. By Lemma F.2, ∥x(t) − Φ(x(t))∥ = O(ρ). Hence, by Taylor expansion,
x(t+1) = x(t) − η∇L_k(x(t) + ρ∇L_k(x(t))/∥∇L_k(x(t))∥) = x(t) − η∇L_k(x(t)) − ηρ∇²L_k(x(t))∇L_k(x(t))/∥∇L_k(x(t))∥ + O(ηρ²) = x(t) − η∇L_k(x(t)) − ηρΛ_k w_k w_k⊤ ∇L_k(x(t))/∥∇L_k(x(t))∥ + O(ηρ²).
Here Λ_k, w_k denote Λ_k(Φ(x(t))), w_k(Φ(x(t))). Notice that given ∥x(t) − Φ(x(t))∥ = O(ρ), by Lemma F.8 we have ∥Φ(x(t+1)) − Φ(x(t))∥₂ = O(ηρ²) and ∥x(t+1) − x(t)∥₂ = O(ηρ). This implies x(t+1) ∈ K_r.

Further, by Taylor expansion, ∇L_k(x(t)) = Λ_k w_k w_k⊤(x(t) − Φ(x(t))) + O(ρ²). By Lemma J.8, we have, for some s_k(t) ∈ {−1, 1},
∇L_k(x(t))/∥∇L_k(x(t))∥ = s_k(t) w_k + O(∥x(t) − Φ(x(t))∥₂^{1/2}).
We also have
s_k(t) ≠ sign(w_k⊤(x(t) − Φ(x(t)))) ⇒ |w_k⊤(x(t) − Φ(x(t)))| ≤ ∥x(t) − Φ(x(t))∥₂^{3/2}.   (37)
Concluding,
x(t+1) − Φ(x(t+1)) = (x(t) − Φ(x(t))) − ηΛ_k w_k w_k⊤(x(t) − Φ(x(t))) − ηρΛ_k s_k(t) w_k w_k⊤ w_k + O(ηρ²).
Taking the squared norm and expectation,
E[∥x(t+1) − Φ(x(t+1))∥₂² | x(t)] ≤ ∥x(t) − Φ(x(t))∥₂² + (2η²/M) Σ_{k=1}^M Λ_k² |w_k⊤(x(t) − Φ(x(t)))|² + (2η²ρ²/M) Σ_{k=1}^M Λ_k² − (2η/M) Σ_{k=1}^M Λ_k |w_k⊤(x(t) − Φ(x(t)))|² − (2ηρ/M) Σ_{k=1}^M Λ_k s_k(t) w_k⊤(x(t) − Φ(x(t))) + O(ηρ²∥x(t) − Φ(x(t))∥ + η²ρ³).

Lemma J.12. Under the condition of Theorem J.1, assuming there exists t_DEC such that x(t_DEC) ∈ K_{h/2} and ∥∇L(x(t_DEC))∥ ≤ 4ζρ, then with probability 1 − O(ρ), there exists t_DEC2 = t_DEC + O(ln(1/ηρ)/η) such that ∥x(t_DEC2) − Φ(x(t_DEC2))∥ ≤ O(ηρ).

Proof of Lemma J.12. We have that x(t) ∈ K_h (Lemma J.6) and ∥x(t) − Φ(x(t))∥ ≤ Cρ for some constant C (Lemma J.11) for any t satisfying t_DEC ≤ t ≤ t_DEC + O(ln(1/ηρ)/η) with probability 1 − O(ρ), and we will suppose this holds in the following deduction. The second statement then follows directly from Lemma F.8. Let C₁, C₂ be the constants in Lemma J.9 corresponding to C. For simplicity of writing, define T₁ ≜ ⌈C ln(C/(C₁ηρ²))/(C₂η)⌉ = O(ln(1/ηρ)/η). Define the indicator A(t) = 1[∥x(τ) − Φ(x(τ))∥ ≥ C₁ηρ, ∀ t ≥ τ ≥ t_GF]. By Lemma J.9, we have that
E[∥x(t+1) − Φ(x(t+1))∥ A(t+1)] ≤ E[∥x(t+1) − Φ(x(t+1))∥ A(t)] ≤ E[∥x(t) − Φ(x(t))∥ A(t)] − C₂ηρ E[A(t)] ≤ E[∥x(t) − Φ(x(t))∥ A(t)](1 − C₂η/C).
We can then conclude that, with T₂ = T₁ + t_DEC, using Lemma F.2,
C₁ηρ E[A(T₂+1)] ≤ E[∥x(T₂+1) − Φ(x(T₂+1))∥₂ A(T₂+1)] ≤ (1 − C₂η/C)^{T₁} ∥x(t_DEC) − Φ(x(t_DEC))∥ ≤ C₁ηρ³.
This implies A(T₂+1) = 0 with probability 1 − O(ρ), which establishes the existence of t_DEC2.

J.2 PHASE II (PROOF OF THEOREM J.2)

Proof of Theorem J.2. We will inductively prove that the following induction hypothesis P(t) holds with probability 1 − O(η³ρ³t) for t ≤ T₃/ηρ² + 1: (1) x(τ) ∈ K_{h/2} for τ ≤ t; (2) ∥x(τ) − Φ(x(τ))∥₂ ≤ 2∥x(0) − Φ(x(0))∥₂ = O(ηρ) for τ ≤ t; (3) ∥Φ(x(τ)) − X(ηρ²τ)∥ = Õ(η^{1/2} + ρ) for τ ≤ t. P(0) holds trivially. Now suppose P(t) holds; then x(t+1) ∈ K_h. By Lemma J.13, we have that with probability 1 − O(η³ρ³), ∥x(t+1) − Φ(x(t+1))∥ ≤ 2∥x(0) − Φ(x(0))∥₂ = O(ηρ).

Now we have ∥x(τ) − Φ(x(τ))∥₂ ≤ 2∥x(0) − Φ(x(0))∥₂ = O(ηρ) for τ ≤ t+1, and x(τ) ∈ K_h for τ ≤ t+1. By Lemma J.14, it holds that
∥Φ(x(τ+1)) − Φ(x(τ)) + ηρ² P⊥_{Φ(x(τ)),Γ} ∇λ₁(∇²L_{k_τ}(Φ(x(τ))))/2∥ ≤ Õ(ηρ³ + η²ρ²).
Note that E_{k_t}[P⊥_{Φ(x(t)),Γ} ∇λ₁(∇²L_{k_t}(Φ(x(t))))] = P⊥_{Φ(x(t)),Γ} ∇Tr(∇²L(Φ(x(t)))). By Theorem L.1 with b(x) = −∂Φ(x)∇Tr(∇²L(x)), b_k(x) = −∂Φ(x)∇Tr(∇²L_k(x)), p = ηρ² and ϵ = O(η + ρ), it holds with probability 1 − O(η³ρ³) that, for τ ≤ t+1,
∥Φ(x(τ)) − X(ηρ²τ)∥ = O(∥Φ(x(0)) − Φ(x_init)∥ + T₃ηρ² + √(ηρ²T₃ log(2eT₃/(η²ρ⁴))) + (ρ + η)T₃) = Õ(η^{1/2} + ρ).
This implies ∥x(t+1) − X(ηρ²(t+1))∥₂ ≤ ∥x(t+1) − Φ(x(t+1))∥₂ + ∥Φ(x(t+1)) − X(ηρ²(t+1))∥₂ = Õ(η^{1/2} + ρ) < h/2. Hence x(t+1) ∈ K_{h/2}. Combining with the fact that P(t) holds with probability 1 − O(η³ρ³t), we conclude that P(t+1) holds with probability 1 − O(η³ρ³(t+1)). The induction is complete. Now P(⌈T₃/ηρ²⌉) is equivalent to our theorem.

J.2.1 CONVERGENCE NEAR MANIFOLD

Lemma J.13. Under the condition of Theorem J.2, assuming x(t) ∈ K_h for all t₀ ≤ t ≤ t₀ + O(1/ηρ²) and ∥x(t₀) − Φ(x(t₀))∥ ≤ f(η, ρ) for some fixed function f with f(η, ρ) ∈ Ω(ηρ ln²(1/ηρ)) ∩ O(ρ), then with probability 1 − O(η³ρ³), for any t satisfying t₀ ≤ t ≤ t₀ + O(1/ηρ²), it holds that ∥x(t) − Φ(x(t))∥ ≤ 2f(η, ρ).

Proof of Lemma J.13. The proof is almost identical to that of Lemma J.11 and is omitted.

J.2.2 TRACKING RIEMANNIAN GRADIENT FLOW

Lemma J.14. Under the condition of Theorem J.2, for any t satisfying x(t) ∈ K_h and ∥x(t) − Φ(x(t))∥ = O(ηρ ln²(1/ηρ)), it holds that

∥Φ(x(t+1)) − Φ(x(t)) + ηρ² P⊥_{Φ(x(t)),Γ} ∇λ₁(∇²L_{k_t}(Φ(x(t))))/2∥ ≤ Õ(ηρ³ + η²ρ²).

Proof of Lemma J.14. We will abbreviate k_t by k in this proof. By Taylor expansion, we have
Φ(x(t+1)) − Φ(x(t)) = −ηρ² P⊥_{Φ(x(t)),Γ} ∇λ₁(∇²L_{k_t}(Φ(x(t))))/2 + Õ(η²ρ² + ηρ³).
This completes the proof.

J.3 PROOF OF THEOREM 5.4

Proof of Theorem 5.4. By Theorem J.1, there exists a constant T₁ independent of η, ρ such that, with probability 1 − O(ρ), there exists t_PHASE ≤ T₁ ln(1/ηρ)/η with ∥x(t_PHASE) − Φ(x(t_PHASE))∥₂ = O(ηρ) and ∥Φ(x(t_PHASE)) − Φ(x_init)∥ = Õ(η^{1/2} + ρ). Hence, by Theorem J.2, if we consider the translated process x′(t) = x(t + t_PHASE), then for any T₃ such that the solution X of Equation 14 is well defined, we have for t = ⌈T₃/ηρ²⌉ that ∥Φ(x′(t)) − X(ηρ²t)∥₂ = Õ(η^{1/2} + ρ). This implies, for t such that X(ηρ²(t − t_PHASE)) is well defined, ∥Φ(x(t)) − X(ηρ²(t − t_PHASE))∥₂ = Õ(η^{1/2} + ρ). Finally, since ∥X(ηρ²(t − t_PHASE)) − X(ηρ²t)∥₂ = O(ηρ²t_PHASE) = O(ρ ln(1/ηρ)) = Õ(ρ), we have that ∥Φ(x(t)) − X(ηρ²t)∥₂ = Õ(η^{1/2} + ρ). We also have ∥x(t) − Φ(x(t))∥₂ = O(ηρ) by Theorem J.2.

J.4 PROOFS OF COROLLARIES 5.6 AND 5.7

Proof of Corollary 5.6. We will perform a Taylor expansion of E_k[L^Max_{k,ρ}](x). By Theorems J.1 and J.2, we have ∥x(⌈T₃/ηρ²⌉) − X(T₃)∥₂ = Õ(η^{1/2} + ρ) and ∥Φ(x(⌈T₃/ηρ²⌉)) − x(⌈T₃/ηρ²⌉)∥₂ = Õ(η^{1/2} + ρ). For convenience, we denote x(⌈T₃/ηρ²⌉) by x.

Lemma K.7 (Magnus (1985)). Let A : R^D → R^{D×D} be any C¹ symmetric matrix function and x* ∈ R^D satisfy λ₁(A(x*)) > λ₂(A(x*)), and let v₁ be the top eigenvector of A(x*). It holds that ∇λ₁(A(x))|_{x=x*} = ∇(v₁⊤A(x)v₁)|_{x=x*}.

We then present some of the technical lemmas required to prove Lemma H.5. Lemma K.8.
If 0 < c < b-a b 2 , a a 2 +2b 2 2(1-cb) ≥ a 2 +b 2 2-ca-cb , then a >  So √ 1 -cb + 1 √ 1 -cb ≥ 1 + b 2 a 2 As c < b-a b 2 , we have 1 > 1 -cb > a b . So a b + b a ≥ 1 + b 2 a 2 The above inequality implies a ≥ 1 2 b. As c < b-a b 2 ,cb ≤ 1 2 . Lemma K.9. When 0 < a < b, 0 < c < b-a b 2 , we have cb 2 + ca 2 (2 -cb - 2 3 ca) -(1 -cb) c(a 2 + b 2 ) 2 -ca -cb -ca 2 ( 1 2 a 2 + b 2 ) 2 -ca -cb (a 2 + b 2 ) ≤ cb 2 2 -cb Proof of Lemma K.9. Equivalently, we are going to prove (1 -cb)b 2 1 2 -ca -cb - 1 2 -cb + a 2 1 -cb 2 -ca -cb + a 2 ( 1 2 a 2 + b 2 ) 2 -ca -cb (a 2 + b 2 ) ≥ a 2 (2 -cb - 2 3 ca) Further simplifying, we only need to prove (1 -cb)cab 2 (2 -cb)(2 -ca -cb) + a 2 1 -cb 2 -ca -cb ≥ 1 3 ca 3 + a 4 2(a 2 + b 2 ) (2 -ca -cb) We have the following auxiliary inequalities, Using the above auxiliary inequalities we have (1 -cb)cab 2 (2 -cb)(2 -ca -cb) + a 2 1 -cb 2 -ca -cb ≥ 1 3 ca 3 + a 4 2(a 2 + b 2 ) (2 -ca -cb) ⇐ ca 2 b (2 -cb)(2 -ca -cb) + 1 - 1 2 (2 -ca -cb) a 2 (1 -cb) 2 -ca -cb ≥ 1 3 ca 3 ⇐ ca 2 b (2 -cb)(2 -ca -cb) + ca 2 (a + b)(1 -cb) 2(2 -ca -cb) ≥ 1 3 ca 3 ⇐ ca 2 b (2 -cb)(2 -ca -cb) + ca 2 b(1 -cb) 2(2 -ca -cb) ≥ 1 3 ca 2 b ⇐ 1 (2 -cb) 2 + 1 -cb 2(2 -cb) ≥ 1 ⇐3(1 -cb)(2 -cb) + 6 ≥ 2(2 -cb) 2 ⇐(cb) 2 -cb + 4 ≥ 0 Lemma K.10. When 0 < a < b, 0 < c < b-a b 2 , a a 2 +2b 2 2(1-cb) ≥ a 2 +b 2 2-ca-cb , we have cb 2 + ca 2 (2 -cb - 2 3 ca) -(1 -cb)cb 2 -ca 2 ( 1 2 a 2 + b 2 ) 1 b 2 ≤ cb 2 2 -cb Proof of Lemma K.10. Equivalently, we are going to prove, cb 3 + a 2 (2 -cb - 2 3 ca) ≤ b 2 2 -cb + a 2 ( 1 2 a 2 + b 2 ) b 2 ⇐⇒ cb 3 + a 2 (1 -cb - 2 3 ca) ≤ b 2 2 -cb + a 4 2b 2 We have the auxiliary inequality Define u ≜ cb, v ≜ a b , then u + v ≤ 1. 
d 2 F (a) da 2 ≥ 4 -2u -4uv -3 √ 1 -u 1 1 2 + 1 v 2 ≥ 4 -2u -4u(1 -u) -3 √ 1 -u 1 1 2 + 1 (1-u) 2 ≥ 4u 2 -6u + 4 -3 √ 1 -u (1 -u) (1-u) 2 2 + 1 As (1-u) For F (a min ), we have a min  ≤ 1 c ( cb 2 2 -cb -cb 2 ) For F (a max ), we know that a max must satisfy at least of the following three equalities and we discuss three cases one by one. . These imply a 3 max + 2a max b 2 -2b 3 ≤ 0 ⇒ a max < 0.9b. This implies v ≤ 0.9. By Lemma K.8, 0.5 ≤ v. As v ∈ [0.5, 0.9], it holds that v(1 + v) + 1 1 + v ≤ 2 (1 + v 2 2 )v. This implies v 2 (2 -(1 -v) - 2 3 (1 -v)v) -2v (1 + v 2 2 )v ≤ -v 1 + v = 1 2 -cb - Finally, F (a max ) = a 2 max (2 -cb - 2 3 ca max ) -2a max (b 2 + 1 2 a 2 max )(1 -cb) = b 2 v 2 (2 -(1 -v) - 2 3 (1 -v)v) -2v (1 + v 2 2 )v ≤ b 2 ( 1 2 -cb -1) . In conclusion, it holds that, 

L OMITTED PROOFS ON CONTINUOUS APPROXIMATION

In this section we give a general approximation result (Theorem L.1) between a continuous-time flow (Equation 39) and discrete-time (stochastic) iterates (Equation 40) in a compact subset of R^D, denoted by K. This result is used multiple times in our analysis for full-batch SAM and 1-SAM. Consider the continuous-time flow X defined by Equation (39) and the discrete-time iterates {x(t)}_{t∈N} which approximately satisfy x(t+1) ≈ x(t) + p b_{k_t}(x(t)), where k_t is independently sampled from the uniform distribution over [M] for each t ∈ N and x(t) is a deterministic function of k₀, …, k_{t−1}. We use F_t to denote the σ-algebra generated by k₀, …, k_{t−1} and F_* to denote the filtration (F_t)_{t∈N}; thus x(t) is adapted to the filtration F_*. Note that b is undefined outside K, so in the analysis we only consider the process stopped immediately upon leaving K, that is, x_K(t) ≜ x(min(t, t_K)), where t_K ≜ min{t′ ∈ N | x(t′) ∉ K}. If x(t) is in K for all t ≥ 0, then t_K = ∞. It is easy to verify that t_K is a stopping time with respect to the filtration F_*. For convenience, we denote by X_K(τ) = X(min(τ, p t_K)) the stopped continuous counterpart of x_K.

Theorem L.1. Suppose there exist constants C₂, C₃, ϵ > 0 satisfying that (1) ∥b_k(x)∥₂ ≤ C₂ for any x ∈ K and k ∈ [M]; (2) ∥b_k(x) − b(x)∥₂ ≤ C₃ for any x ∈ K and k ∈ [M]; (3) ∥b_{k_t}(x(t)) − (x(t+1) − x(t))/p∥₂ ≤ ϵ for all t. Then for any 0 < δ < 1, with probability at least 1 − δ, it holds that
max_{0≤t≤T/p} ∥x_K(t) − X_K(pt)∥ ≤ H_{p,δ} e^{C₁T}, where H_{p,δ} ≜ ∥x(0) − X(0)∥₂ + C₁C₂Tp + 2C₃√(2pT log(2eT/(δp))) + ϵT.

Proof of Theorem L.1.
Integrating Equation 39 and summing up Equation 40, for any t ≤ t_K, we have that X(pt) − X(0) = ∫₀^{pt} b(X(τ)) dτ and that x(t) − x(0) = Σ_{t′=0}^{t−1} (x(t′+1) − x(t′)). Denote ∥x(t) − X(pt)∥₂ by E_t; then for t ≤ t_K,
E_t − E₀ ≤ ∥∫₀^{pt} b(X(τ)) dτ − p Σ_{t′=0}^{t−1} b(X(pt′))∥₂ (A) + ∥p Σ_{t′=0}^{t−1} b(X(pt′)) − p Σ_{t′=0}^{t−1} b(x(t′))∥₂ (B) + ∥p Σ_{t′=0}^{t−1} b(x(t′)) − p Σ_{t′=0}^{t−1} b_{k_{t′}}(x(t′))∥₂ (C) + ∥p Σ_{t′=0}^{t−1} b_{k_{t′}}(x(t′)) − Σ_{t′=0}^{t−1} (x(t′+1) − x(t′))∥₂ (D).
By Azuma–Hoeffding's inequality (vector form, Lemma K.5), it holds for any 0 ≤ t ≤ T/p and 0 < δ ≤ 1 that, with probability at least 1 − δ, ∥S_t∥₂ ≤ 2C₃p√(2t log(2e/δ)). Applying a union bound on the above inequality over t = 0, …, ⌊T/p⌋ − 1, we conclude that with probability at least 1 − δ,
(C) ≤ 2C₃p√((2T/p) log(2eT/(δp))) = 2C₃√(2Tp log(2eT/(δp))).
4. We have that
(D) ≤ p Σ_{t′=0}^{t−1} ∥b_{k_{t′}}(x(t′)) − (x(t′+1) − x(t′))/p∥ ≤ ptϵ ≤ ϵT.
Combining the above upper bounds for (A), (B), (C) and (D), we conclude that for any 0 ≤ t ≤ min(T/p, t_K), E_t ≤ H_{p,δ} + C₁p Σ_{t′=0}^{t−1} E_{t′}. Applying the discrete Grönwall inequality (Lemma K.6) on Equation 45, we have that ∥x_K(t) − X_K(pt)∥ ≤ H_{p,δ} e^{C₁T}. Therefore dist(x_K(t), R^D \ K) ≥ dist(X_K(pt), R^D \ K) − dist(X_K(pt), x_K(t)) > 0 for any 0 ≤ t ≤ T/p, which implies x_K(t) ∉ R^D \ K, or equivalently, x_K(t) ∈ K. Thus we conclude that t_K ≥ ⌊T/p⌋.

Corollary L.3. Suppose M = 1 and there exist constants C₂, ϵ > 0 satisfying that (1) ∥b(x)∥₂ ≤ C₂ for any x ∈ K; (2) ∥b(x(t)) − (x(t+1) − x(t))/p∥ ≤ ϵ for all t. Then for any k ∈ N such that kp ≤ T, it holds that max_{0≤t≤T/p} ∥x_K(t) − X_K(pt)∥ ≤ H_p e^{C₁T}, where H_p ≜ ∥x(0) − X(0)∥₂ + C₁C₂Tp + ϵT. Therefore, similar to Corollary L.2, if min_{0≤τ≤T} dist(X(τ), R^D \ K) > H_p e^{C₁T}, then it holds that t_K > ⌊T/p⌋ and that max_{0≤t≤T/p} ∥x(t) − X(pt)∥ ≤ H_p e^{C₁T}.
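The flavor of Theorem L.1 can be illustrated with a one-dimensional toy instance (entirely our own construction, not from the paper): two batch drifts b₁, b₂ averaging to b(x) = −x, so the stochastic iterates x(t+1) = x(t) + p b_{k_t}(x(t)) should track the flow X(τ) = X(0)e^{−τ} up to an Õ(√p) error.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T = 1e-3, 1.0
b = [lambda x: -x + 0.5, lambda x: -x - 0.5]   # b_1, b_2 average to b(x) = -x

x, max_dev = 1.0, 0.0
for t in range(int(T / p)):
    X = np.exp(-p * t)                 # exact flow started from X(0) = 1
    max_dev = max(max_dev, abs(x - X))
    x = x + p * b[rng.integers(2)](x)  # stochastic Euler step with a random batch
print(max_dev)
```

With p = 10⁻³ the observed deviation stays on the order of a few times √p, far below the O(1) scale of the trajectory, as the theorem's bound H_{p,δ}e^{C₁T} suggests.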



We note that R^Asc_ρ(x) is undefined when ∥∇L(x)∥₂ = 0. In such cases, we set R^Asc_ρ(x) = ∞. Here we implicitly assume that the zeroth- and first-order terms vanish, which holds for all three sharpness notions. In fact we only need to assume a positive eigengap along the solution of the ODE. If Γ does not satisfy Assumption 4.4, we can simply perform the same analysis on its submanifold {x ∈ Γ | the eigengap is positive at x}. Though we call Equation 13 1-SAM, our result here applies to any batch size, where L_k can be regarded as the loss for the k-th possible batch and M is the total number of batches. Though we believe this approximation result is folklore, we cannot find a reference under the exact setting as ours. For completeness, we provide a quick proof in this section.



, 2016; Dinh et al., 2017; Dziugaite et al., 2017; Neyshabur et al., 2017; Jiang et al., 2019). Partly motivated by these studies, Foret et al. (2021); Wu et al. (2020); Zheng et al. (2021); Norton et al. (2021) propose to penalize the sharpness of the landscape to improve generalization. We refer to this method as Sharpness-Aware Minimization (SAM) and focus on the version of Foret et al. (2021).

min_x L^Max_ρ(x), where L^Max_ρ(x) = max_{∥v∥₂≤1} L(x + ρv). (1)

min_x L^Asc_ρ(x), where L^Asc_ρ(x) = L(x + ρ∇L(x)/∥∇L(x)∥₂). (2)

| Type of sharpness-aware loss | Notation | Definition | Bias (among minimizers) |
| Worst-direction | L^Max_ρ | max_{∥v∥₂≤1} L(x + ρv) | min_x λ₁(∇²L(x)) (Thm G.3) |
| Ascent-direction | L^Asc_ρ | L(x + ρ∇L(x)/∥∇L(x)∥₂) | min_x λ_M(∇²L(x)) (Thm G.4) |
| Average-direction | L^Avg_ρ | E_{g∼N(0,I)} L(x + ρg/∥g∥₂) | min_x Tr(∇²L(x)) (Thm G.5) |
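At a strict minimizer of a quadratic L(x) = ½xᵀAx, the worst- and average-direction sharpness values can be computed in closed form, ρ²λ₁(A)/2 and ρ²Tr(A)/(2D) respectively (ascent-direction sharpness is undefined there, since ∇L = 0). The following self-contained check uses our own toy matrix A, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([3.0, 1.0])   # Hessian at the minimizer x = 0
rho, D = 0.1, 2

# Worst-direction: max_{|v|<=1} L(rho v) = rho^2 * lambda_1 / 2
worst = 0.5 * rho**2 * np.linalg.eigvalsh(A)[-1]

# Average-direction: E_g L(rho g/|g|) = rho^2 * Tr(A) / (2D), by Monte Carlo
g = rng.standard_normal((200_000, D))
u = g / np.linalg.norm(g, axis=1, keepdims=True)
avg_mc = np.mean(0.5 * rho**2 * np.einsum('ni,ij,nj->n', u, A, u))

print(worst, avg_mc, rho**2 * np.trace(A) / (2 * D))
```

The Monte Carlo estimate of the average-direction sharpness matches the closed form ρ²Tr(A)/(2D), while the worst-direction value depends only on λ₁, mirroring the two biases in the table.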

Keskar et al. (2016) observe a positive correlation between the batch size, the generalization error, and the sharpness of the loss landscape when changing the batch size. Jastrzebski et al. (2017) extend this by finding a correlation between the sharpness and the ratio of learning rate to batch size. Dinh et al. (2017) show that one can easily construct networks with good generalization but arbitrarily large sharpness through reparametrization. Dziugaite et al. (2017); Neyshabur et al. (2017); Wei et al. (2019a;b) give theoretical guarantees on the generalization error using sharpness-related measures. Jiang et al. (2019) perform a large-scale empirical study on various generalization measures and show that sharpness-based measures have the highest correlation with generalization. Background on Sharpness-Aware Minimization. Foret et al. (2021); Zheng et al. (2021) concurrently propose to minimize the loss at a point perturbed from the current parameter in the worst direction, in order to improve generalization. Wu et al. (2020) propose an almost identical method for a different purpose, the robust generalization of adversarial training. Kwon et al. (2021) propose a different metric for SAM to fix the rescaling problem pointed out by Dinh et al. (2017). Liu et al. (2022) propose a more computationally efficient version of SAM.

Figure 1: Visualization of the different biases of different sharpness notions on a 4D toy example. Let F1, F2 : R² → R⁺ be two positive functions satisfying F1 > F2 on [0,1]². For x ∈ R⁴, consider the loss L(x) = F1(x1, x2)x3² + F2(x1, x2)x4². The loss L has a zero-loss manifold {x3 = x4 = 0} of codimension M = 2, and the two non-zero eigenvalues of ∇²L at any point x on the manifold are λ1(∇²L(x)) = F1(x1, x2) and λ2(∇²L(x)) = F2(x1, x2). We test three optimization algorithms on this 4D toy model with small learning rates. They all quickly converge to zero loss, i.e., x3(t), x4(t) ≈ 0, and after that x1(t), x2(t) still change slowly, i.e., they move along the zero-loss manifold. We visualize the loss restricted to (x3, x4) as a 3D surface at various (x1, x2), where x1 = x1(t), x2 = x2(t) follow the trajectories of the three algorithms. In other words, each 3D surface visualizes the function g(x3, x4) = L(x1(t), x2(t), x3, x4). As our theory predicts, (1) full-batch SAM (Equation 3) finds the minimizer with the smallest top eigenvalue, F1(x1, x2); (2) GD on the ascent-direction loss L^Asc_ρ (Equation 2) finds the minimizer with the smallest bottom eigenvalue, F2(x1, x2); (3) 1-SAM (Equation 13) (with L0(x) = F1(x1, x2)x3² and L1(x) = F2(x1, x2)x4²) finds the minimizer with the smallest trace of Hessian, F1(x1, x2) + F2(x1, x2). See more details in Appendix B.
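A minimal simulation in the spirit of Figure 1, with our own choices of F₁, F₂ and hyperparameters (not the exact setup behind the figure): full-batch SAM should drift along the zero-loss manifold so as to decrease F₁ only, while 1-SAM should decrease both F₁ and F₂, i.e., the trace.

```python
import numpy as np

def grads(x):
    # L = F1(x1,x2) x3^2 + F2(x1,x2) x4^2 with F1 = 2 + x1^2, F2 = 1 + x2^2;
    # returns (grad of L0 = F1 x3^2, grad of L1 = F2 x4^2)
    x1, x2, x3, x4 = x
    g0 = np.array([2 * x1 * x3**2, 0.0, 2 * (2 + x1**2) * x3, 0.0])
    g1 = np.array([0.0, 2 * x2 * x4**2, 0.0, 2 * (1 + x2**2) * x4])
    return g0, g1

def run(one_sam, steps=40000, eta=5e-3, rho=5e-2, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array([0.8, 0.8, 0.5, 0.5])
    for _ in range(steps):
        g0, g1 = grads(x)
        g = g0 if (one_sam and rng.integers(2) == 0) else (g1 if one_sam else g0 + g1)
        n = np.linalg.norm(g)
        if n < 1e-12:
            continue
        ga, gb = grads(x + rho * g / n)   # batch gradients at the ascent point
        x = x - eta * ((ga if g is g0 else gb) if one_sam else ga + gb)
    return x

x_full = run(one_sam=False)
x_one = run(one_sam=True)
print(x_full[:2], x_one[:2])
```

With these (assumed) hyperparameters, the full-batch run moves x₁ noticeably while leaving x₂ almost unchanged (it only penalizes the top eigenvalue, governed by F₁), whereas the 1-SAM run shrinks both x₁ and x₂, consistent with a trace-of-Hessian penalty.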

We defer the proofs of Theorems 4.6 and 4.7 to Appendix I.4.

4.3 ANALYSIS OVERVIEW FOR SHARPNESS REDUCTION IN PHASE II OF THEOREM 4.5

Now we give an overview of the analysis of the trajectory of full-batch SAM (Equation 3) in Phase II (in Theorem 4.5). The framework of the analysis is similar to Arora et al. (2022); Lyu et al. (2022); Damian et al. (2021)

Blanc et al. (2019); Damian et al. (2021); Li et al. (2021); Arora et al. (2022), so does the derivation of the SAM algorithm Foret et al. (2020); Wu et al.

and by assumption Z is a zero-measure set. Applying Theorem D.3 with F n = G kn for all n ∈ N + where k n is the nth data/batch sampled by the algorithm, we get the desired results. Now we will turn to the proof of Theorem D.3, which is based on the following two lemmas. Lemma D.4. Let Z be a closed subset of R D with zero Lebesgue measure and F : R D \ Z → R D be a continuously differentiable function. Then except countably many η

denotes the convergent point of the gradient flow starting from x. Similar to Definition 3.3, we define U_k = {x ∈ R^D | Φ_k(x) exists and Φ_k(x) ∈ Γ_k} to be the attraction set of Γ_k. We have that each U_k is open and Φ_k is C² on U_k by Lemma B.15 in Arora et al. (2022).

to denote the counterpart of the above quantities defined for stochastic loss L k and its limiting map Φ k for k ∈ [M ]. Lemma E.6 (Arora et al. (2022), Lemma B.5 and B.7

PROOFS OF THEOREMS 5.2 AND E.2 Proof of Theorem 5.2. Define F : R D

Li et al. (2021), Lemma 4.3). For x ∈ Γ, ∂Φ(x) = P⊥_{x,Γ}, the orthogonal projection matrix onto the tangent space of Γ at x. Moreover, ∂Φ(x)∇²L(x) = 0. The proofs of the above lemmas can be found in Arora et al. (2022); Li et al. (2021). In the following, we will first show the proof of Lemma 3.1
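Lemma 3.1 can also be illustrated numerically: approximate Φ by running gradient descent with a tiny step size until convergence, differentiate it by finite differences, and check that ∂Φ(x)∇L(x) ≈ 0. The toy loss L(x₁, x₂) = (1 + x₁²)x₂²/2 below (our own choice) has the zero-loss manifold {x₂ = 0}:

```python
import numpy as np

def grad_L(x):
    # L(x1, x2) = (1 + x1^2) * x2^2 / 2; zero-loss manifold is {x2 = 0}
    x1, x2 = x
    return np.array([x1 * x2**2, (1 + x1**2) * x2])

def Phi(x, h=1e-3, steps=40000):
    # limit of the gradient flow from x, approximated by small-step gradient descent
    x = np.array(x, dtype=float)
    for _ in range(steps):
        x -= h * grad_L(x)
    return x

def jac_Phi(x, eps=1e-5):
    # finite-difference Jacobian of the limit map Phi
    J = np.zeros((2, 2))
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        J[:, i] = (Phi(x + e) - Phi(x - e)) / (2 * eps)
    return J

x = np.array([0.5, 0.3])
residual = jac_Phi(x) @ grad_L(x)   # Lemma 3.1 predicts this is numerically ~0
print(np.linalg.norm(residual))
```

The residual is small (up to the discretization bias of the gradient-flow approximation), reflecting that Φ is constant along gradient-flow trajectories, which is exactly the content of Lemma 3.1.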

Theorem G.3. Worst-direction sharpness R^Max_ρ admits λ₁(∇²L(·))/2 as a good limiting regularizer on Γ and satisfies Condition G.1. Theorem G.4. Ascent-direction sharpness R^Asc_ρ admits λ_M(∇²L(·))/2 as a good limiting regularizer on Γ and satisfies Condition G.1. Theorem G.5. Average-direction sharpness R^Avg_ρ admits Tr(∇²L(·))/(2D) as a good limiting regularizer on Γ and satisfies Condition G.1.

then by Lemma G.11, L(x) is lower bounded by a positive constant. 2. If x ∈ K h and ∥x -Φ(x)∥ 2 ≥ C 1 ρ, then by Lemma F.1,

ρ, we would have that there exists C > 0 such that ∥x₁(next(t))∥ ≥ (1 + Cη)∥x₁(t)∥ or ∥x₁(next(next(t)))∥ ≥ (1 + Cη)∥x₁(t)∥.

E_k[R^Max_{k,ρ}](x) = max_{∥v∥≤1} E_k[ρv⊤∇L_k(x) + ρ²v⊤∇²L_k(x)v/2] + O(ρ³). Since max_{∥v∥≤1} |v⊤∇L_k(x)| = O(∥x − Φ(x)∥) = Õ(η^{1/2} + ρ), it holds that
E_k[R^Max_{k,ρ}](x) = ρ² E_k[max_{∥v∥≤1} v⊤∇²L_k(x)v/2] + O((η^{1/4} + ρ^{1/4})ρ²) = ρ² E_k[max_{∥v∥≤1} v⊤∇²L_k(X(T₃))v/2] + O((η^{1/4} + ρ^{1/4})ρ²) = ρ² Tr(∇²L(X(T₃)))/2 + O((η^{1/4} + ρ^{1/4})ρ²).

Proof of Corollary 5.7. We choose T_ϵ such that X(T_ϵ) is sufficiently close to X(∞), so that Tr(∇²L(X(T_ϵ))) ≤ Tr(∇²L(X(∞))) + ϵ/2. By Corollary 5.6 (with T₃ = T_ϵ), we have, for all ρ, η such that (η + ρ) ln(1/ηρ) is sufficiently small, ∥E_k[R^Max_{k,ρ}](x(⌈T_ϵ/(ηρ²)⌉)) − ρ² Tr(∇²L(X(T_ϵ)))/2∥₂ ≤ o(1). This further implies ∥E_k[R^Max_{k,ρ}](x(⌈T_ϵ/(ηρ²)⌉)) − ρ² Tr(∇²L(X(∞)))/2∥₂ ≤ ϵρ²/2 + o(1). We also have L(x(⌈T_ϵ/(ηρ²)⌉)) − inf_{x∈U′} L(x) = o(1). Then we can leverage Theorems G.6 and G.14 to get the desired bound.

J.5 OTHER OMITTED PROOFS FOR 1-SAM

We will use ℓ′(y, y_k) and ℓ″(y, y_k) to denote dℓ(y′, y_k)/dy′|_{y′=y} and d²ℓ(y′, y_k)/dy′²|_{y′=y}.

Lemma J.15. Under Setting 5.1, fix k ∈ [M]; for any p satisfying ℓ(f_k(p), y_k) = 0, we have that ∇²L_k(p) = ℓ″(f_k(p), y_k) ∇f_k(p)(∇f_k(p))⊤.

Lemma K.6 (Discrete Grönwall inequality, Borkar (2009)). Let {x(t)}_{t∈N} be a sequence of nonnegative real numbers, {a_n}_{n∈N} a sequence of positive real numbers, and C, L > 0 scalars such that for all t, x(t) ≤ C + L Σ_{n=0}^{t−1} a_n x(n). Then for T_t = Σ_{n=0}^{t} a_n, it holds that x(t+1) ≤ Ce^{LT_t}.
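Lemma K.7 is easy to sanity-check numerically: freezing the top eigenvector at x*, the derivative of λ₁(A(x)) agrees with the derivative of the quadratic form v₁ᵀA(x)v₁. The one-parameter matrix family below is our own toy choice:

```python
import numpy as np

def A(x):
    # C^1 symmetric matrix family with a strict eigengap near x = 0.2
    return np.array([[2.0 + np.sin(x), 0.3 * x],
                     [0.3 * x,         1.0 + x**2]])

def lam1(x):
    return np.linalg.eigvalsh(A(x))[-1]   # top eigenvalue

x_star, eps = 0.2, 1e-6
v1 = np.linalg.eigh(A(x_star))[1][:, -1]  # top eigenvector, frozen at x_star

d_lam1 = (lam1(x_star + eps) - lam1(x_star - eps)) / (2 * eps)
d_quad = (v1 @ A(x_star + eps) @ v1 - v1 @ A(x_star - eps) @ v1) / (2 * eps)
print(d_lam1, d_quad)
```

The two finite-difference derivatives coincide up to discretization error, which is the envelope-theorem fact the proof of Equation 10 relies on.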

cb)b > a 1 -cb 2 -ca -cb ≥ 1 a+b b + b-a (1-cb)b ≥ 1 a+b b + b-a a = ab a 2 + b 2 ≥ a 2 a 2 + b 2 1 -cb a ca -cb a 2 + b 2

≤ (b -a)(b + a) ba 2 ) -(b -a)(b + a) (b -a)(2b(a + b) -(a + b) 2 ) ≤ b 3 (b -a) 2 (b + a) ≤ b 3 Using Lemma K.8,a > b 2 ,(b -a) 2 (b + a) = (b 2 -a 2 )(b -a) ≤ b 2 (b -a) ≤ b 3



(1-u) 2 +1 (a) da 2 ≥ 4u 2 -6u + 4 -3(1 -u) = 4u 2 + 1 -3u > 0 The above inequality shows that F (a) is convex w.r.t to a for a min (c, b) ≤ a ≤ a max (c, b). Hence F (a) ≤ max (F (a min (c, b)), F (a max (c, b))).Below we use a min , a max as shorthands for a min (c, b),a max (c, b).

ca min -cb (a Hence using Lemma K.9,F (a min ) = a 2 min (2 -cb -2 3 ca min ) -(1 -cb) c(a 2 min + b 2 ) 2 -ca min -cb -ca 2 min ( ca min -cb (a 2 min + b 2 )

-camax-cb , in this case we simply redo the calculation in Part 1. 2. b 2 = a max a cb) . This implies2a max (b 2 + 1 2 a 2 max )(1 -cb) = (1 -cb)b 2 + a 2 max ( Hence using Lemma K.10, F (a max ) = a 2 max (2 -cb -2 ca max ) -(1 -cb)cb 2 -ca 2 max ( cb 2 = b -a max . Define v ≜ amax b , cb = 1 -v. Note that 1 -cb = amax b and b 2 ≥ a max b(a 2 max +2b 2 ) 2amax

(a) ≤ max (F (a min (c, b)), F (a max (c, b)))

Let b : K → R^D be a C₁-Lipschitz function, that is, for all x, x′ ∈ K it holds that ∥b(x) − b(x′)∥₂ ≤ C₁∥x − x′∥₂. Let b_k be mappings from K to R^D for k ∈ [M] satisfying b(x) = (1/M) Σ_{k=1}^M b_k(x) for all x ∈ K. We consider the continuous-time flow X : [0, T] → K, which is the unique solution of dX(τ) = b(X(τ)) dτ. (39)

(1) ∥b_k(x)∥₂ ≤ C₂, for any x ∈ K and k ∈ [M]; (2) ∥b_k(x) − b(x)∥₂ ≤ C₃, for any x ∈ K and k ∈ [M]; (3) ∥b_{k_t}(x(t)) − (x(t+1) − x(t))/p∥₂ ≤ ϵ, for all t.


We proceed by bounding the four terms (A), (B), (C) and (D) in Equation 43. Note that for any 0 ≤ τ ≤ τ′ ≤ T, we have ∥X(τ) − X(τ′)∥₂ = ∥∫_τ^{τ′} b(X(s)) ds∥₂ ≤ (τ′ − τ)C₂. Thus, by the C₁-Lipschitzness of b,
(A) = ∥∫₀^{pt} (b(X(τ)) − b(X(⌊τ/p⌋p))) dτ∥₂ ≤ ∫₀^{pt} ∥b(X(τ)) − b(X(⌊τ/p⌋p))∥₂ dτ ≤ C₁C₂p²t ≤ C₁C₂pT.


Assumption 3.2 implies that U is open and Φ is C² on U (Arora et al., 2022, Lemma B.15).

). Theorem 4.2. Under Assumption 3.2, let U ′ be any bounded open set such that its closure U ′ ⊆ U and

We use R^Max_{k,ρ}, R^Asc_{k,ρ} and R^Avg_{k,ρ} to denote the corresponding sharpness notions for L_k respectively (defined as Equations 1, 2 and 4 with L replaced by L_k). We further use stochastic worst-, ascent- and average-direction sharpness to denote E_k


Theorem G.15. Stochastic ascent-direction sharpness E_k[R^Asc_{k,ρ}] admits Tr(∇²L(·))/2 as a good limiting regularizer on Γ and satisfies Condition G.1. Proof of Theorem G.15. By Theorem E.2, Condition E.1 holds. As is easily deduced from Theorem G.4, Λ_k(x) is a good limiting regularizer for R^Asc_{k,ρ} on Γ_k, since the codimension of Γ_k is 1. Then, as Γ ⊂ Γ_k, Λ_k(x) is a good limiting regularizer for R^Asc_{k,ρ} on Γ. Hence S(x) = Σ_k Λ_k(x)/(2M) = Tr(∇²L(x))/2 is a good limiting regularizer of E_k[R^Asc_{k,ρ}](x) on Γ. To end this section, we prove the two theorems presented in the main text. The reader will find the proofs straightforward now that we have established the framework of good limiting regularizers.

1) Entering Invariant Set. Lemma H.2 implies that there exists constant T 1 > 0, such that ∀t > T 1 , ∥P (j:D) x(t)∥ 2 ≤ ηλ 2 Alignment to Top Eigenvector. Lemmas H.10 and H.11 show that ∥x(t)∥ 2 and |x 1 (t)| converge to

2. Lemma H.5 shows that under update rule (Equation22), t ̸ ∈ S ⇒ t + 1 ∈ S for sufficiently large t, where the definition of S is {t|∥x(t)∥ 2 ≤

, it holds that |x₁(t+1)| ≥ |x₁(t)|. Proof of Lemma H.6. Note that |x₁

Combining the two cases and using induction, we obtain the desired result. Proof of Lemma H.10. By Lemma H.9, |x₁(t)| increases monotonically for t ∈ S. By Lemma H.5, S is infinite. By Lemma H.2, |x₁(t)| is bounded for sufficiently large t. Combining these three facts, we know that x₁(t) for t ∈ S converges.
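The alignment mechanism analyzed in this section can be observed directly in simulation: running full-batch SAM on a toy quadratic L(x) = ½xᵀAx (with our own illustrative choices of A, η and ρ, not values from the paper), the normalized gradient Ax/∥Ax∥₂ ends up aligned with the top eigenvector.

```python
import numpy as np

# Full-batch SAM on L(x) = 0.5 * x^T A x with A = diag(2, 1, 0.5); top eigenvector is e1.
lam = np.array([2.0, 1.0, 0.5])
eta, rho = 0.01, 0.1
x = np.array([0.3, 0.4, 0.5])

for _ in range(50000):
    g = lam * x                      # gradient A x
    n = np.linalg.norm(g)
    if n < 1e-300:                   # numerical guard; not hit in practice
        break
    x = x - eta * lam * (x + rho * g / n)   # SAM step: x - eta * grad L(x + rho g/|g|)

g = lam * x
alignment = abs(g[0]) / np.linalg.norm(g)
print(alignment)
```

After the transient, the top coordinate settles into a stable small oscillation of size Θ(ηρ) while the other coordinates are driven to zero, so the gradient direction locks onto the top eigenvector, matching the analysis above.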

3. Entering Invariant Set. Lemmas I.11 and I.13 show the existence of a step t

ηρ, then by Lemma H.1, we have that

3, for any t ≥ 0 satisfying that(1)  x(t) ∈ (∩ j∈[M ] I j ) ∩ K 15h/16 , (2) t ∈ S, it holds that next(t) is well defined and next(t) ≤ t + 2.Proof of Lemma I.16. Following similar argument in Lemma H.1, we have thatx(t + 1) ∈ (∩ j∈[M ] I j ) ∩ K h .If t + 1 ̸ ∈ S, then we can apply Lemma I.15 to show that t + 2 ∈ S.

1, assuming there exists t_DEC such that x(t_DEC) ∈ K_{h/2} and ∥∇L(x(t_DEC))∥ ≤ 4ζρ, then with probability 1 − O(ρ), there exists t_DEC2 = t_DEC + O(ln(1/ηρ)/η) such that ∥x(t_DEC2) − Φ(x(t_DEC2))∥ ≤ O(ηρ). Furthermore, for any t satisfying t_DEC2 ≤ t ≤ t_DEC2 + Θ(ln(1/ηρ)/η), we have that ∥Φ(x(t)) − Φ(x(t_DEC))∥ = O(ρ² ln(1/ηρ)).

∥Φ(x(t+1)) − Φ(x(t)) + ηρ² ∂Φ(x(t)) ∂²(∇L_k)[∇L_k(x(t))/∥∇L_k(x(t))∥, ∇L_k(x(t))/∥∇L_k(x(t))∥]/2∥₂ = Õ(η²ρ² + ηρ³). Moreover, ∂Φ(x(t)) ∂²(∇L_k)[w_k, w_k] = P⊥_{Φ(x(t)),Γ} ∇(λ₁(∇²L_k(Φ(x(t))))) + O(∥x(t) − Φ(x(t))∥₂).

(B) ≤ C₁p Σ_{t′=0}^{t−1} E_{t′}. 3. We claim that for any 0 < δ < 1, the stated bound on (C) holds with probability at least 1 − δ. Below we prove our claim. We denote p Σ_{t′=0}^{min(t,t_K)−1} b(x(t′)) − p Σ_{t′=0}^{min(t,t_K)−1} b_{k_{t′}}(x(t′)) by S_t, which is a martingale with respect to the filtration F_*, since t_K is a stopping time. Note that ∥S_t − S_{t+1}∥₂ ≤ 2p max_{k∈[M],x∈K} ∥b(x) − b_k(x)∥₂ ≤ 2pC₃.

ACKNOWLEDGEMENTS

We thank Jingzhao Zhang for helpful discussions. The authors would like to thank the support from NSF IIS 2045685.


Note that x̃(t) ≈ ∇L(x(t)) for x(t) near the manifold Γ. We also use x̃(t) and A(t) to denote x̃(x(t)) and A(x(t)). Recall the original definition of R_j(x); based on the above notions, we can rephrase it as R_j(x) = ∥P^{(j:D)}(x)x̃∥ − ηρλ_j²(x). We additionally define the approximate invariant set I_j as I_j = {∥P^{(j:D)}(x)x̃∥ ≤ ηρλ_j²(x) + O(ηρ²)}.

Lemma I.7. Assuming t satisfies x(t) ∈ K_h, we have that (µ/2)∥x(t) − Φ(x(t))∥ ≤ ∥x̃(t)∥ ≤ ζ∥x(t) − Φ(x(t))∥.

Proof of Lemma I.7. First, by Lemma F.4, Φ(x(t)) ∈ K_r; hence ∥x̃(t)∥ = ∥∇²L(Φ(x(t)))(x(t) − Φ(x(t)))∥ ≤ ζ∥x(t) − Φ(x(t))∥. Also, ∥x̃(t)∥ = ∥∇²L(Φ(x(t)))(x(t) − Φ(x(t)))∥ ≥ µ∥P⊥_{Φ(x(t)),Γ}(x(t) − Φ(x(t)))∥. By Lemma F.4 and Lemma E.6, we hence have ∥x(t) − Φ(x(t))∥ ≤ (2/µ)∥x̃(t)∥.

Lemma I.8. Assuming t satisfies x(t) ∈ K_h and ∥x̃(t)∥₂ = O(ρ), we have that ∥Φ(x(t+1)) − Φ(x(t))∥ = O(ηρ²).

Proof of Lemma I.8. By Lemma I.7, we have ∥x(t) − Φ(x(t))∥ = O(ρ); the claim then follows from Lemma F.7.

Lemma I.9. Assuming t satisfies x(t) ∈ K_{h/2} and ∥x(t) − Φ(x(t))∥₂ = O(ρ), define x′ as x′(t) = x(t) and for τ ≥ t, and further if ∥x(t+1) − Φ(x(t+1))∥₂ = Ω(ηρ), then

Published as a conference paper at ICLR 2023

Finally, we derive Equation 12 by Taylor expansion. We first apply Taylor expansion (Equation 32) to the update rule of the iterate of SAM (Equation 3). Since Phase II happens in an O(ηρ)-neighborhood of the manifold Γ, we have ∥x(t+1) − x(t)∥₂ = O(ηρ). Then, by Equation 33 and Taylor expansion of ∇L(x(t+1)) at x(t), we obtain Equation 12.

J ANALYSIS FOR 1-SAM (PROOF OF THEOREM 5.4)

The goal of this section is to prove the following theorem.

Theorem 5.4.
Let $\{x(t)\}$ be the iterates of 1-SAM (Equation 13) with $x(0) = x_{\mathrm{init}} \in U$. Then under Setting 5.1, for almost every $x_{\mathrm{init}}$ and for all $\eta$ and $\rho$ such that $(\eta+\rho)\ln(1/\eta\rho)$ is sufficiently small, with probability at least $1 - O(\rho)$ over the randomness of the algorithm, the dynamics of 1-SAM (Equation 13) can be split into two phases:

• Phase I (Theorem J.1): 1-SAM follows the gradient flow with respect to $L$ until entering an $\tilde{O}(\eta\rho)$ neighborhood of the manifold $\Gamma$ in $O(\ln(1/\eta\rho)/\eta)$ steps;
• Phase II (Theorem J.2): 1-SAM tracks $X$, the solution of Equation 14, i.e., the Riemannian gradient flow with respect to $\mathrm{Tr}(\nabla^2 L(\cdot))$, in an $\tilde{O}(\eta\rho)$ neighborhood of the manifold $\Gamma$. Quantitatively, the approximation error between the iterates $x$ and the corresponding limiting flow $X$ is $\tilde{O}(\eta^{1/2} + \rho)$.

As mentioned in our proof setup in Appendix E, we will prove Theorem 5.4 analogously to the full-batch setting by splitting the trajectory into two phases.

Theorem J.1 (Phase I). Let $\{x(t)\}$ be the iterates defined by SAM (Equation 13) with $x(0) = x_{\mathrm{init}} \in U$. Then under Assumptions 3.2 and E.1, for almost every $x_{\mathrm{init}}$, there exists a constant $T_1$ such that, for sufficiently small $(\eta+\rho)\ln(1/\eta\rho)$, with probability $1 - O(\rho)$ there exists $t \le T_1 \ln(1/\eta\rho)/\eta$ such that $\|x(t) - \Phi(x(t))\| = \tilde{O}(\eta\rho)$.

Theorem J.1 shows that SAM converges to an $\tilde{O}(\eta\rho)$ neighborhood of the manifold without moving far away from $\Phi(x(0))$, which allows us to perform a local analysis of the trajectory of $\Phi(x(t))$.
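For intuition, the 1-SAM update takes a single-sample SAM step: ascend $\rho$ along the normalized per-sample gradient, then descend along the per-sample gradient evaluated at the perturbed point. A minimal numpy sketch, where the interpolating squared losses `L_k`, the problem sizes, and the step sizes are illustrative assumptions rather than the paper's exact setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-sample losses L_k(x) = 0.5 * (a_k @ x)^2, so every L_k is
# simultaneously minimized on the subspace {x : A x = 0} -- a toy stand-in for
# the manifold Gamma. This choice is an assumption, not the paper's setup.
A = rng.standard_normal((4, 10))  # n = 4 samples, D = 10 parameters

def grad_Lk(x, k):
    return A[k] * (A[k] @ x)

def one_sam_step(x, k, eta=1e-2, rho=1e-2):
    """One 1-SAM step: ascend rho along the normalized per-sample gradient,
    then take a descent step using grad L_k at the perturbed point."""
    g = grad_Lk(x, k)
    x_adv = x + rho * g / (np.linalg.norm(g) + 1e-12)
    return x - eta * grad_Lk(x_adv, k)

x = rng.standard_normal(10)
for _ in range(2000):
    x = one_sam_step(x, k=int(rng.integers(4)))
print(np.linalg.norm(A @ x))  # distance-to-manifold proxy; shrinks over training
```

The residual $\|Ax\|$ contracts from its random initialization toward the zero-loss subspace, mirroring the Phase I claim that the iterates approach a neighborhood of $\Gamma$.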

Under Assumptions 3.2 and E.1, $X(\cdot)$ is also differentiable, and (14) is well defined up to some finite time $T_2$.

Theorem J.2 (Phase II). Let $\{x(t)\}$ be the iterates defined by SAM (Equation 13) under Assumptions 3.2 and E.1, and assume (1) $\|x(0) - \Phi(x(0))\|_2 = O(\eta\rho)$ and (2) $\|\Phi(x_{\mathrm{init}}) - \Phi(x(0))\|_2 = \tilde{O}(\eta^{1/2} + \rho)$. Then for almost every $x(0)$ and any $T_2 > 0$ up to which the solution $X$ of (14) exists, for sufficiently small $(\eta+\rho)\ln(1/\eta\rho)$, the approximation guarantee holds with probability $1 - O(\rho)$.

Combining Theorems E.2, J.1 and J.2, the proof of Theorem 5.4 is straightforward; we defer it to Appendix J.3. We now recall our notation for the stochastic setting with batch size one.

Notation for the stochastic setting: $w_k$ is a continuous function on $\Gamma_k$ with pointwise unit norm. Given the loss function $L_k$, its gradient flow is denoted by the mapping $\phi_k : \mathbb{R}^D \times [0,\infty) \to \mathbb{R}^D$. Here, $\phi_k(x, \tau)$ denotes the iterate at time $\tau$ of a gradient flow starting at $x$, defined as the unique solution of
$$\phi_k(x, \tau) = x - \int_0^{\tau} \nabla L_k(\phi_k(x, s))\, ds.$$
In this section we define $K$ as $\{X(t) \mid t \in [0, T_3]\}$, where $X$ is the solution of (14). We denote $h(K)$ from Lemma E.6 by $h$. Using Theorem D.3, we assume the update is always well defined.

J.1 PHASE I (PROOF OF THEOREM J.1)

Proof of Theorem J.1. The proof consists of two steps.

1. Tracking the gradient flow. By Lemma J.3, with probability $1 - \rho^2$, there exists a step $t_{\mathrm{GF}}$.
2. Decreasing the loss. By Lemma J.7, with probability $1 - O(\rho)$, there exists a step $t_{\mathrm{DEC}}$; then by Lemma J.12, with probability $1 - O(\rho)$, there exists a step $t_{\mathrm{DEC2}}$.

Concluding, let $T_1$ be the constant satisfying $t_{\mathrm{DEC2}} \le T_1 \ln(1/\eta\rho)/\eta$; the theorem then holds for this $T_1$.

J.1.1 TRACKING GRADIENT FLOW

Lemma J.3 shows that the iterates $x(t)$ track the gradient flow into an $O(1)$ neighborhood of $\Gamma$.

Lemma J.3. Under the conditions of Theorem J.1, with probability $1 - O(\rho^2)$, there exists such a step $t_{\mathrm{GF}}$.

Proof of Lemma J.7. For simplicity of writing, define the quantities below. By Lemma J.6, we may assume $x(t) \in K_h$ for $t_{\mathrm{GF}} \le t \le T_1 + t_{\mathrm{GF}}$.

Define the indicator function as

By Lemma J.5, we obtain the bound above. With $T_2 = T_1 + t_{\mathrm{GF}}$ and using Lemma F.2, we can then conclude that $A(T_2 + 1) = 0$ with probability $1 - O(\rho)$, which establishes the existence of $t_{\mathrm{DEC}}$. The second claim is a direct application of Lemma J.6.

Lemma J.8 (a general version of Lemma 5.5). Under Assumption 3.2 and Condition E.1, the alignment bound holds.

Proof of Lemma J.8. We will compute the direction of $\frac{\nabla L_k(x)}{\|\nabla L_k(x)\|}$ using two different approximations and compare them to obtain the result.

1. According to Lemma F.4 and Lemma F.1, we obtain the bound on $\|x - \Phi(x)\|$; Equation 35 is our first statement.

We then carefully examine each positive term; this yields the intermediate estimate. Next we lower bound the remaining quantity using Equation 37. Concluding, we obtain the claimed bound, and the result follows by Jensen's inequality.

Lemma J.10. Under the conditions of Theorem J.1, for any constant $C > 0$ independent of $\eta, \rho$, there exists a constant $C_3 > 0$ independent of $\eta, \rho$ such that if $x(t) \in K_h$ and $\|x(t) - \Phi(x(t))\| \le C\rho$, then the increment bound holds.

Proof of Lemma J.10. This is a direct application of Lemma F.8.

Lemma J.11. Under the conditions of Theorem J.1, assume $x(t_0) \in K_{h/2}$ and $\|x(t_0) - \Phi(x(t_0))\| \le f(\eta, \rho)$ for some fixed function $f$ with $f(\eta, \rho) \in \Omega(\eta\rho \ln^2(1/\eta\rho)) \cap O(\rho)$. Then with probability $1 - O(\rho)$, for any $t$ satisfying $t_0 \le t \le t_0 + O(\ln(1/\eta\rho)/\eta)$, it holds that $\|x(t) - \Phi(x(t))\| \le 2f(\eta, \rho)$. Moreover, the stronger bound also holds.

Proof of Lemma J.11. By Lemma J.6, with probability $1 - O(\rho)$ we have $x(t) \in K_h$ for any $t$ satisfying $t_0 \le t \le t_0 + O(\ln(1/\eta\rho)/\eta)$; we assume this holds in the deduction below.

By the union bound, we consider each term, and applying the union bound again, each term is bounded as follows. Now let $C$ be the positive constant satisfying $2f(\eta, \rho) \le C\rho$, and suppose $C_1, C_2$ are the constants corresponding to $C$ in Lemma J.9 and $C_3$ is the constant corresponding to $C$ in Lemma J.10.
By definition, define a coupled process $\tilde{y}$ with $\tilde{y}(\tau + 1) = y(\tau + 1)$ until the bound is first violated. Now clearly Equation 38 is bounded by $\mathbb{P}(\tilde{y}(t) \ge 2f(\eta, \rho))$. Since $\mathbb{E}[\tilde{y}(\tau')] \le \tilde{y}(\tau' - 1) - C_2 \eta\rho$ by Lemma J.9 and $\|\tilde{y}(\tau') - \tilde{y}(\tau' - 1)\| \le C_3 \eta\rho$ by Lemma J.10, the process $\tilde{y}(\tau') + C_2 \eta\rho \tau'$ is a supermartingale. By the Azuma-Hoeffding bound (Lemma K.4), the failure probability is at most $\eta^{10}\rho^{10}$, and the desired bound follows.

Proof of Lemma J.15. $\ell(f_k(p), y_k) = 0$ implies $\ell'(f_k(p), y_k) = 0$. The claim then follows by Taylor expansion, which concludes the proof.

Proof of Lemma 5.5. By Lemma J.15 and the definition of $\Gamma$ in Setting 5.1, for any $p \in \Gamma$, $\{\nabla f_k(p)\}_{k=1}^{n}$ are linearly independent, which implies that $\nabla f_k(p) \ne 0$ for any $p \in \Gamma$. For any $p \in \Gamma$, since $\frac{\nabla f_k(p)}{\|\nabla f_k(p)\|}$ is well defined and continuous at $p$, there exists an open ball $V$ containing $p$ on which the desired property holds.

We note that the alignment result in Lemma 5.5 is not directly used in our proof. Instead, we use its generalized version, Lemma J.8, which holds under a more general condition than Setting 5.1, namely Condition E.1.

Theorem K.3 (Davis-Kahan $\sin\theta$ theorem (Davis & Kahan, 1970)). Let $\Sigma, \hat{\Sigma} \in \mathbb{R}^{p \times p}$ be symmetric, with eigenvalues $\lambda_1 \ge \ldots \ge \lambda_p$ and $\hat{\lambda}_1 \ge \ldots \ge \hat{\lambda}_p$ respectively. Fix $1 \le r \le s \le p$, let $d \triangleq s - r + 1$, and let $V = (v_r, v_{r+1}, \ldots, v_s) \in \mathbb{R}^{p \times d}$ and $\hat{V} = (\hat{v}_r, \hat{v}_{r+1}, \ldots, \hat{v}_s) \in \mathbb{R}^{p \times d}$ have orthonormal columns satisfying $\Sigma v_j = \lambda_j v_j$ and $\hat{\Sigma} \hat{v}_j = \hat{\lambda}_j \hat{v}_j$ for $j = r, r+1, \ldots, s$. Define
$$\Delta \triangleq \min\big(\max\{0, \lambda_s - \hat{\lambda}_{s+1}\},\ \max\{0, \hat{\lambda}_{r-1} - \lambda_r\}\big),$$
where $\hat{\lambda}_0 \triangleq \infty$ and $\hat{\lambda}_{p+1} \triangleq -\infty$. Then for any unitarily invariant norm $\|\cdot\|_*$,
$$\Delta \cdot \|\sin\Theta(\hat{V}, V)\|_* \le \|\hat{\Sigma} - \Sigma\|_*.$$
Here $\Theta(\hat{V}, V) \in \mathbb{R}^{d \times d}$, with $\Theta(\hat{V}, V)_{j,j} = \arccos \sigma_j$ for any $j \in [d]$ and $\Theta(\hat{V}, V)_{i,j} = 0$ for all $i \ne j \in [d]$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d$ denote the singular values of $\hat{V}^{\top} V$, and $[\sin\Theta]_{ij}$ is defined as $\sin(\Theta_{ij})$.

Lemma K.4 (Azuma-Hoeffding Bound).
Suppose $\{Z_n\}_{n \in \mathbb{N}}$ is a supermartingale with $-\alpha \le Z_{i+1} - Z_i \le \beta$. Then for all $n > 0$ and $a > 0$,
$$\mathbb{P}(Z_n - Z_0 \ge a) \le 2\exp\!\big(-a^2/(2n(\alpha+\beta)^2)\big).$$

Lemma K.5 (Azuma-Hoeffding Bound, Vector Form, Hayes (2003)). Suppose $\{Z_n\}_{n \in \mathbb{N}}$ is an $\mathbb{R}^D$-valued martingale with $\|Z_{i+1} - Z_i\|_2 \le \sigma$. Then for all $n > 0$ and $a > 0$,
$$\mathbb{P}\big(\|Z_n - Z_0\|_2 \ge \sigma(1+a)\big) \le 2\exp\!\big(1 - a^2/(2n)\big).$$
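The scalar tail bound of Lemma K.4 can be sanity-checked by Monte Carlo on a simple bounded-increment martingale; the Rademacher random walk and all constants below are illustrative choices, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Martingale with increments in [-1, 1]: partial sums of Rademacher signs,
# so alpha = beta = 1 in the notation of Lemma K.4.
n, trials, a = 200, 20_000, 30.0
Z_n = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)

empirical = np.mean(Z_n >= a)                          # empirical tail P(Z_n - Z_0 >= a)
azuma = 2 * np.exp(-a**2 / (2 * n * (1 + 1) ** 2))     # bound from Lemma K.4
print(empirical, azuma)
assert empirical <= azuma
```

As expected, the empirical tail probability sits well below the Azuma-Hoeffding bound; the constant $(\alpha+\beta)^2$ in the exponent makes this form of the bound loose but convenient for the proofs above.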

K TECHNICAL LEMMAS

In other words, for any $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$\|Z_n - Z_0\|_2 \le \sigma\Big(1 + \sqrt{2n \log\tfrac{2e}{\delta}}\Big) \le 2\sigma\sqrt{2n \log\tfrac{2e}{\delta}}.$$

Proof of Corollary L.3. For any $\delta \in (0, 1]$, choosing $C_3 = 0$ and applying Theorem L.1, we have
$$\mathbb{P}\Big(\max_{0 \le t \le T/p} \big\|x_K(t) - X_K(pt)\big\| \le H_p e^{C_1 T}\Big) \ge 1 - \delta.$$
Since $\delta$ can be any number in $(0, 1]$, the above probability is exactly 1.

We end this section with a summary of the applications of Theorem L.1 and Corollary L.3 in our proofs (Table 2).
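Returning to Theorem K.3, the Davis-Kahan $\sin\theta$ bound can also be checked numerically for the top eigenvector ($r = s = 1$, spectral norm); the matrix size, the diagonal boost that separates the top eigenvalue, and the perturbation scale below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 8

# Symmetric Sigma with a deliberately well-separated top eigenvalue, plus a
# small symmetric perturbation E; all scales here are assumptions for the demo.
B = rng.standard_normal((p, p))
Sigma = (B + B.T) / 2
Sigma[0, 0] += 10.0
E = rng.standard_normal((p, p))
E = 0.05 * (E + E.T) / 2
Sigma_hat = Sigma + E

w, V = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
wh, Vh = np.linalg.eigh(Sigma_hat)
v, vh = V[:, -1], Vh[:, -1]        # top eigenvectors of Sigma and Sigma_hat

# sin of the angle between the two top eigenvectors (invariant to sign flips).
sin_theta = np.sqrt(max(0.0, 1.0 - float(v @ vh) ** 2))
delta = w[-1] - wh[-2]             # Delta = max{0, lambda_1 - hat_lambda_2} for r = s = 1
bound = np.linalg.norm(Sigma_hat - Sigma, 2) / delta
print(sin_theta, bound)
assert sin_theta <= bound
```

This is the same mechanism the paper relies on: a small Hessian perturbation moves the top eigenvector by at most the perturbation norm divided by the eigengap.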

