ON GRADIENT DESCENT CONVERGENCE BEYOND THE EDGE OF STABILITY

Anonymous authors
Paper under double-blind review

Abstract

Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability" (EoS), where the step-size crosses the admissibility threshold, which is inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge despite local instability and oscillatory behavior. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a 'Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems. Specifically, we characterize a local condition involving third-order derivatives that stabilizes oscillations of GD above the EoS, and leverage this property in a teacher-student setting, under the population loss. Finally, focusing on Matrix Factorization, we establish a nonasymptotic 'Local Implicit Bias' of GD above the EoS, whereby quasi-symmetric initializations converge to symmetric solutions, where sharpness is minimal amongst all minimisers.

1. INTRODUCTION

Given a differentiable objective function f(θ), where θ ∈ R^d is a high-dimensional parameter vector, the most basic and widely used optimization method is gradient descent (GD), defined as θ^(t+1) = θ^(t) − η∇_θ f(θ^(t)), where η is the learning rate. For all its widespread application across many different ML setups, a basic question remains: what are the convergence guarantees (even to a local minimiser) under typical objective functions, and how do they depend on the (only) hyperparameter η? In the modern context of large-scale ML applications, an additional key question is not only whether GD converges to minimisers, but to which ones, since overparametrisation defines a whole manifold of global minimisers, all potentially enjoying drastically different generalisation performance. The sensible regime to start the analysis is η → 0, where GD inherits the local convergence properties of the Gradient Flow ODE via standard arguments from numerical integration. However, in the early phase of training, a large learning rate has been observed to result in better generalization (LeCun et al., 2012; Bjorck et al., 2018; Jiang et al., 2019; Jastrzebski et al., 2021), where the extent of "large" is measured by comparing the learning rate η against the curvature of the loss landscape, quantified by λ(θ) := λ_max(∇²_θ f(θ)), the largest eigenvalue of the Hessian with respect to the learnable parameters. Although one requires sup_θ λ(θ) < 2/η to guarantee the convergence of GD to (local) minimisers¹ (Bottou et al., 2018), Cohen et al. (2020) noticed a remarkable phenomenon in the context of neural network training: even in problems where λ(θ) is unbounded (as in NNs), for a fixed η, the curvature λ(θ^(t)) increases along the training trajectory, eventually bringing λ(θ^(t)) ≥ 2/η.
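As a quick sanity check (not taken from the paper), the 2/η threshold is easy to see on a one-dimensional quadratic: the GD map is θ ← (1 − ηλ)θ, which contracts iff |1 − ηλ| < 1, i.e. iff η < 2/λ. The minimal sketch below, with an illustrative curvature λ = 4, shows contraction just below the threshold and oscillatory divergence just above it.

```python
def gd_quadratic(eta, lam, theta0=1.0, steps=100):
    """Run GD on f(theta) = lam * theta**2 / 2; the update is theta - eta * f'(theta)."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * lam * theta  # multiplies theta by (1 - eta * lam) each step
    return theta

lam = 4.0                                      # curvature lambda = f''(theta)
print(abs(gd_quadratic(0.9 * 2 / lam, lam)))   # eta below 2/lam: iterates contract to 0
print(abs(gd_quadratic(1.1 * 2 / lam, lam)))   # eta above 2/lam: |1 - eta*lam| > 1, diverges
```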
After that, a surprising phenomenon occurs: λ(θ^(t)) stably hovers above 2/η and the neural network still eventually achieves a decreasing training loss, the so-called "Edge of Stability". We would like to understand and analyse the conditions for such convergence with a large learning rate under a variety of models that capture this observed empirical behavior. Recently, some works have built connections between EoS and implicit bias (Arora et al., 2022; Lyu et al., 2022; Damian et al., 2021; 2022) in the context of large, overparametrised models such as neural networks. In this setting, GD is expected to converge to a manifold of minimisers, and the question is to what extent a large learning rate 'favors' solutions with small curvature. In essence, these works show that under certain structural assumptions, GD asymptotically tracks a continuous sharpness-reduction flow in the limit of small learning rates. Compared with these, we study non-asymptotic properties of GD beyond the EoS, by focusing on certain learning problems (e.g., single-neuron ReLU networks and matrix factorization). In particular, we characterize a range of learning rates η above the EoS such that the GD dynamics hover around minimisers. Moreover, in the matrix factorization setup, where minimisers form a manifold with varying local curvature, our results give a non-asymptotic analogue of the 'Sharpness-Minimisation' arguments from Arora et al. (2022); Lyu et al. (2022); Damian et al. (2022).

The straightforward starting point for a local convergence analysis is via Taylor approximations of the loss function. However, under a quadratic Taylor expansion, gradient descent diverges once λ(θ) > 2/η (Cohen et al., 2020), indicating that a higher-order Taylor approximation is required. By considering a 1-D function with a local minimum θ* of curvature λ* = λ(θ*), we show that GD can stably oscillate around the minimum with η slightly above the threshold 2/λ*, provided its higher-order derivatives satisfy mild conditions, as in Theorem 1.
A typical example of such functions is f(x) = ¼(x² − µ)² with µ > 0. Furthermore, we prove that GD converges to an orbit of period 2 from a broader set of initializations, going beyond the analysis of the higher-order local approximation. As it turns out, the analysis of such stable one-dimensional oscillations is sufficiently intrinsic to be useful in higher-dimensional problems. First, we extend the analysis to a two-layer single-neuron ReLU network, where the task is to learn a teacher neuron with data distributed uniformly on a high-dimensional sphere. We show a convergence result under the population loss for GD beyond the EoS, where the direction of the teacher neuron can be learnt while the norms of the two-layer weights stably oscillate. We then focus on matrix factorization, a canonical non-convex problem whose geometry is characterized by a manifold of minimisers with different local curvature. Our techniques allow us to establish a local, non-asymptotic implicit bias of GD beyond the EoS around certain quasi-symmetric initializations, by which the large learning rate regime 'attracts' the dynamics towards symmetric minimisers, precisely those where the local curvature is minimal. A further discussion is provided in Appendix M.
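The 1-D example can be simulated directly. This is a numerical sketch with illustrative values (µ = 1, η = 1.05, chosen here for concreteness rather than taken from the paper): the minimum x* = √µ = 1 has curvature λ* = f''(1) = 2µ = 2, so the classical threshold is 2/λ* = 1. With η slightly above it, GD neither converges to x* nor diverges; it settles into a stable period-2 oscillation straddling the minimum.

```python
def f_grad(x, mu=1.0):
    """Gradient of f(x) = (x**2 - mu)**2 / 4, i.e. f'(x) = x**3 - mu*x."""
    return x * (x * x - mu)

def gd_trajectory(eta, x0=1.2, steps=2000, mu=1.0):
    """Run GD and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * f_grad(xs[-1], mu))
    return xs

# eta * lambda* = 1.05 * 2 = 2.1 > 2, i.e. above the Edge of Stability.
xs = gd_trajectory(eta=1.05)
print(xs[-2], xs[-1])        # consecutive iterates keep jumping around x* = 1 ...
print(abs(xs[-1] - xs[-3]))  # ... but every second iterate converges: a period-2 orbit
```

Note that the trajectory remains bounded throughout: the oscillation stabilizes instead of escaping the basin.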

2. RELATED WORK

Implicit regularization. Due to its theoretical closeness to gradient descent with a small learning rate, gradient flow is a common setting for studying the training behavior of neural networks. Barrett & Dherin (2020) suggests that gradient descent is closer to a gradient flow with an additional term regularizing the norm of the gradients. Through analysing the numerical error of Euler's method, Elkabetz & Cohen (2021) provides theoretical guarantees of a small gap depending on the convexity along the training trajectory. Neither fits the case of our interest, because it is hard to track the parametric gap when η > 1/λ: for instance, on a quadratic function, the trajectory jumps between the two sides of the minimum once η > 1/λ. Damian et al. (2021) shows that SGD with label noise is implicitly subject to a regularizer penalizing sharp minimizers, but the learning rate is constrained strictly below the edge-of-stability threshold.

Balancing effect. Du et al. (2018) proves that gradient flow automatically preserves the differences between the squared norms of different layers of a deep homogeneous network. Ye & Du (2021) shows that gradient descent on matrix factorization with a constant small learning rate still enjoys the auto-balancing property. Also in matrix factorization, Wang et al. (2021) proves that gradient descent with a relatively large learning rate leads to a solution with a more balanced (perhaps not perfectly
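The balancing effect is easy to observe in the smallest instance, a scalar "factorization" f(u, v) = ½(uv − y)², a toy setup assumed here for illustration (not a construction from the works cited above). Under gradient flow, d/dt (u² − v²) = 0 exactly; for one GD step a short calculation gives u'² − v'² = (u² − v²)(1 − η²r²) with residual r = uv − y, so with a small learning rate the imbalance u² − v² is nearly conserved while the residual is driven to zero.

```python
def gd_factorization(eta, u, v, y, steps):
    """GD on f(u, v) = (u*v - y)**2 / 2 with simultaneous updates of both factors."""
    for _ in range(steps):
        r = u * v - y                          # residual
        u, v = u - eta * r * v, v - eta * r * u
    return u, v

u0, v0, y = 2.0, 0.5, 3.0                      # start imbalanced: u0**2 - v0**2 = 3.75
u, v = gd_factorization(eta=0.01, u=u0, v=v0, y=y, steps=5000)
print(u * v - y)                               # residual: essentially zero (a minimiser)
print((u0**2 - v0**2) - (u**2 - v**2))         # imbalance drift: small, shrinking with eta
```

Since the per-step factor is (1 − η²r²), the cumulative drift of u² − v² is O(η), vanishing in the gradient-flow limit, which is consistent with the exact conservation shown by Du et al. (2018).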



¹ One can replace the uniform curvature bound by sup_{θ: f(θ) ≤ f(θ^(0))} λ(θ).




