GRADIENT DESCENT ON NEURAL NETWORKS TYPICALLY OCCURS AT THE EDGE OF STABILITY

Abstract

We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the value 2/(step size), and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability.

1. INTRODUCTION

Neural networks are almost never trained using (full-batch) gradient descent, even though gradient descent is the conceptual basis for popular optimization algorithms such as SGD. In this paper, we train neural networks using gradient descent, and find two surprises. First, while little is known about the dynamics of neural network training in general, we find that in the special case of gradient descent, there is a simple characterization that holds across a broad range of network architectures and tasks. Second, this characterization is strongly at odds with prevailing beliefs in optimization. In more detail, as we train neural networks using gradient descent with step size η, we measure the evolution of the sharpness, the maximum eigenvalue of the training loss Hessian. Empirically, the behavior of the sharpness is consistent across architectures and tasks: so long as the sharpness is less than the value 2/η, it tends to continually rise (§3.1). We call this phenomenon progressive sharpening. The significance of the value 2/η is that gradient descent on quadratic objectives is unstable if the sharpness exceeds this threshold (§2). Indeed, in neural network training, if the sharpness ever crosses 2/η, gradient descent quickly becomes destabilized: the iterates start to oscillate with ever-increasing magnitude along the direction of greatest curvature. Yet once destabilized, training does not diverge; instead, the sharpness settles just above 2/η, and the training loss, while non-monotonic over short timescales, continues to decrease over long timescales (Figure 1).
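The 2/η threshold can be seen directly on a one-dimensional quadratic, where the sharpness is just the (constant) second derivative. A minimal sketch (the function `run_gd` and its parameter values are illustrative, not from the paper):

```python
# Gradient descent on the quadratic f(x) = (a/2) * x^2, whose Hessian
# (and hence sharpness) is the constant a. The update is
#   x_{t+1} = x_t - eta * a * x_t = (1 - eta * a) * x_t,
# so the iterates contract iff |1 - eta * a| < 1, i.e. iff a < 2/eta.
def run_gd(a, eta, x0=1.0, steps=100):
    x = x0
    for _ in range(steps):
        x = x - eta * a * x
    return abs(x)

eta = 0.1                       # stability threshold is 2/eta = 20
print(run_gd(a=19.0, eta=eta))  # sharpness below 2/eta: |x| shrinks toward 0
print(run_gd(a=21.0, eta=eta))  # sharpness above 2/eta: oscillates with growing magnitude
```

With a = 21 the factor (1 - ηa) = -1.1, so each step flips the sign of x and grows its magnitude, exactly the oscillation along the direction of greatest curvature described above.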



Figure 1: Gradient descent typically occurs at the Edge of Stability. On three separate architectures, we run gradient descent at a range of step sizes η, and plot both the train loss (top row) and the sharpness (bottom row). For each step size η, observe that the sharpness rises to 2/η (marked by the horizontal dashed line of the appropriate color) and then hovers right at, or just above, this value.
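Tracking the sharpness as in Figure 1 requires the top eigenvalue of the Hessian, which for real networks is typically estimated from Hessian-vector products rather than by forming the Hessian. A minimal sketch of the standard power-iteration estimator, shown here on an explicit toy matrix (the function `sharpness` and the toy Hessian are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def sharpness(hvp, dim, iters=200, seed=0):
    """Estimate the top Hessian eigenvalue via power iteration.

    `hvp(v)` returns the Hessian-vector product H @ v; in practice this
    would come from autodiff. Note power iteration finds the eigenvalue
    of largest *magnitude*, which matches the sharpness when the Hessian's
    top eigenvalue dominates any negative ones (typical near convergence).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(v))  # Rayleigh quotient at the converged direction

H = np.diag([3.0, 1.0, 0.5])            # toy "Hessian" with lambda_max = 3
lam = sharpness(lambda v: H @ v, dim=3)
print(lam)                              # approximately 3.0
```

The same loop applies unchanged to a network loss by swapping the toy `hvp` for one computed with a double-backward pass.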

