GRADIENT DESCENT ON NEURAL NETWORKS TYPICALLY OCCURS AT THE EDGE OF STABILITY

Abstract

We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the value 2/(step size), and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability.

1. INTRODUCTION

Neural networks are almost never trained using (full-batch) gradient descent, even though gradient descent is the conceptual basis for popular optimization algorithms such as SGD. In this paper, we train neural networks using gradient descent, and find two surprises. First, while little is known about the dynamics of neural network training in general, we find that in the special case of gradient descent, there is a simple characterization that holds across a broad range of network architectures and tasks. Second, this characterization is strongly at odds with prevailing beliefs in optimization. In more detail, as we train neural networks using gradient descent with step size η, we measure the evolution of the sharpness, i.e. the maximum eigenvalue of the training loss Hessian. Empirically, the behavior of the sharpness is consistent across architectures and tasks: so long as the sharpness is less than the value 2/η, it tends to continually rise (§3.1). We call this phenomenon progressive sharpening. The significance of the value 2/η is that gradient descent on quadratic objectives is unstable if the sharpness exceeds this threshold (§2). Indeed, in neural network training, if the sharpness ever crosses 2/η, gradient descent quickly becomes destabilized; that is, the iterates start to oscillate with ever-increasing magnitude along the direction of greatest curvature. Yet once this happens, gradient descent does not diverge entirely or stall.
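The sharpness can be tracked during training without ever forming the full Hessian. As a rough illustration (our own construction, not the authors' measurement code; the names `sharpness` and `grad` are hypothetical), the sketch below estimates the top Hessian eigenvalue by power iteration on finite-difference Hessian-vector products:

```python
import numpy as np

def sharpness(grad, w, iters=200, eps=1e-4, seed=0):
    """Estimate the top Hessian eigenvalue ("sharpness") of a loss at w.

    Uses power iteration on finite-difference Hessian-vector products:
        H v ≈ (grad(w + eps*v) - grad(w - eps*v)) / (2*eps),
    so only a gradient oracle `grad` is needed, never the Hessian itself.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)
        lam = float(v @ hv)          # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv)  # renormalize for the next iteration
    return lam

# Sanity check on a quadratic loss f(w) = 1/2 w^T A w, whose Hessian is A:
A = np.diag([5.0, 2.0, 1.0])
est = sharpness(lambda w: A @ w, np.ones(3))
```

In an autodiff framework one would replace the finite-difference step with an exact Hessian-vector product (e.g. a gradient-of-gradient-dot-vector computation), but the power-iteration outer loop is the same.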
Instead, it enters a new regime we call the Edge of Stability¹ (§3.2), in which (1) the sharpness hovers right at, or just above, the value 2/η; and (2) the train loss behaves non-monotonically, yet consistently decreases over long timescales. In this regime, gradient descent is constantly "trying" to increase the sharpness, but is constantly restrained from doing so. The net effect is that gradient descent continues to successfully optimize the training objective, but in such a way as to avoid further increasing the sharpness.²

In principle, it is possible to run gradient descent at step sizes η so small that the sharpness never rises to 2/η. However, these step sizes are suboptimal from the point of view of training speed, sometimes dramatically so. In particular, for standard architectures on the standard dataset CIFAR-10, such step sizes are so small as to be completely unreasonable: at all reasonable step sizes, gradient descent eventually enters the Edge of Stability (see §4). Thus, at least for standard networks on CIFAR-10, the Edge of Stability regime should be viewed as the "rule," not the "exception."

As we describe in §5, the Edge of Stability regime is inconsistent with several pieces of conventional wisdom in optimization theory: convergence analyses based on L-smoothness or monotone descent, quadratic Taylor approximations as a model for local progress, and certain heuristics for step size selection. We hope that our empirical findings will both nudge the optimization community away from widespread presumptions that appear to be untrue in the case of neural network training, and also point the way forward by identifying precise empirical phenomena suitable for further study.

Certain aspects of the Edge of Stability have been observed in previous empirical studies of full-batch gradient descent (Xing et al., 2018; Wu et al., 2018); our paper provides a unified explanation for these observations. Furthermore, Jastrzębski et al.
(2020) proposed a simplified model for the evolution of the sharpness during stochastic gradient descent which matches our empirical observations in the special case of full-batch SGD (i.e. gradient descent). However, outside the full-batch special case, there is no evidence that their model matches experiments with any degree of quantitative precision, although their model does successfully predict the directional trend that large step sizes and/or small batch sizes steer SGD into regions of low sharpness. We discuss SGD at greater length in §6. To summarize, while the sharpness does not obey simple dynamics during SGD (as it does during GD), there are indications that the "Edge of Stability" intuition might generalize to SGD in some form, just in a way that does not center on the sharpness.

2. BACKGROUND: STABILITY OF GRADIENT DESCENT ON QUADRATICS

In this section, we review the stability properties of gradient descent on quadratic functions. Later, we will see that the stability of gradient descent on neural network training objectives is partly well-modeled by the stability of gradient descent on the quadratic Taylor approximation.

On a quadratic objective function f(x) = ½ xᵀAx + bᵀx + c, gradient descent with step size η will diverge if³ any eigenvalue of A exceeds the threshold 2/η. To see why, consider first the one-dimensional quadratic f(x) = ½ ax² + bx + c, with a > 0. This function has optimum x* = −b/a. Consider running gradient descent with step size η starting from x_0. The update rule is x_{t+1} = x_t − η(ax_t + b), which means that the error x_t − x* evolves as (x_{t+1} − x*) = (1 − ηa)(x_t − x*). Therefore, the error at step t is (x_t − x*) = (1 − ηa)^t (x_0 − x*), and so the iterate at step t is x_t = (1 − ηa)^t (x_0 − x*) + x*. If a > 2/η, then (1 − ηa) < −1, so the sequence {x_t} will oscillate around x* with ever-increasing magnitude, and diverge.

Now consider the general d-dimensional case. Let (a_i, q_i) be the i-th largest eigenvalue/eigenvector pair of A. As shown in Appendix B, when the gradient descent iterates {x_t} are expressed in the special coordinate system whose axes are the eigenvectors of A, each coordinate evolves separately. In particular, the coordinate for each eigenvector q_i, namely ⟨q_i, x_t⟩, evolves according to the dynamics of gradient descent on a one-dimensional quadratic objective with second derivative a_i.
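The 2/η threshold is easy to verify numerically. The following minimal numpy sketch (illustrative only, not from the paper) runs gradient descent on the one-dimensional quadratic f(x) = (a/2)x² + bx with step size η = 0.1, once with the second derivative a just below the threshold 2/η = 20 and once just above it:

```python
import numpy as np

def gd_quadratic(a, b, eta, x0=1.0, steps=100):
    """Run gradient descent on f(x) = (a/2) x^2 + b x; return all iterates."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - eta * (a * x + b)   # gradient step: f'(x) = a*x + b
        xs.append(x)
    return np.array(xs)

eta = 0.1                                         # stability threshold: 2/eta = 20
stable   = gd_quadratic(a=19.0, b=0.0, eta=eta)   # a < 2/eta
unstable = gd_quadratic(a=21.0, b=0.0, eta=eta)   # a > 2/eta
```

With a = 19 the iterates oscillate around the optimum but contract by a factor |1 − ηa| = 0.9 per step and converge; with a = 21 the factor is 1.1, so the oscillations grow without bound, exactly as the closed-form expression (1 − ηa)^t (x_0 − x*) predicts.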



¹ This nomenclature was inspired by the title of Giladi et al. (2020).
² In the literature, the term "sharpness" has been used to refer to a variety of quantities, often connected to generalization (e.g. Keskar et al. (2016)). In this paper, "sharpness" strictly means the maximum eigenvalue of the training loss Hessian. We do not claim that this quantity has any connection to generalization.
³ For convex quadratics, this is "if and only if." However, if A has a negative eigenvalue, then gradient descent with any (positive) step size will diverge along the corresponding eigenvector.



Figure 1: Gradient descent typically occurs at the Edge of Stability. On three separate architectures, we run gradient descent at a range of step sizes η, and plot both the train loss (top row) and the sharpness (bottom row). For each step size η, observe that the sharpness rises to 2/η (marked by the horizontal dashed line of the appropriate color) and then hovers right at, or just above, this value.

