UNDERSTANDING EDGE-OF-STABILITY TRAINING DYNAMICS WITH A MINIMALIST EXAMPLE

Abstract

Recently, researchers observed that gradient descent for deep neural networks operates in an "edge-of-stability" (EoS) regime: the sharpness (maximum eigenvalue of the Hessian) is often larger than the stability threshold 2/η (where η is the step size). Despite this, the loss oscillates yet converges in the long run, and the sharpness at the end is just slightly below 2/η. While many other well-understood nonconvex objectives, such as matrix factorization or two-layer networks, can also converge despite large sharpness, there is often a much larger gap between the sharpness of the endpoint and 2/η. In this paper, we study the EoS phenomenon by constructing a simple function with the same behavior. We give a rigorous analysis of its training dynamics in a large local region and explain why the final converging point has sharpness close to 2/η. Globally, we observe that the training dynamics of our example exhibit an interesting bifurcating behavior, which has also been observed in the training of neural networks.

* Equal Contribution.

¹ The value 2/η is called the stability threshold because, if the objective had a fixed Hessian, the gradient descent trajectory would become unstable whenever the largest eigenvalue of the Hessian exceeded 2/η.

1. INTRODUCTION

Many works have tried to understand how simple gradient-based methods can optimize complicated neural network objectives. Recently, however, empirical observations have shown that optimization for deep neural networks may operate in a more surprising regime. In particular, Cohen et al. (2021) observed that when running gradient descent on neural networks with a fixed step size η, the sharpness (largest eigenvalue of the Hessian) along the training trajectory often oscillates around the stability threshold of 2/η¹, while the loss still continues to decrease in the long run. This phenomenon is called "edge of stability" and has received a lot of attention (see Section 1.2 for related works).

While many works try to understand why (variants of) gradient descent can still converge despite the sharpness being larger than 2/η, empirically gradient descent for deep neural networks has even stronger properties. As shown in Fig. 1a, for a fixed initialization, if one changes the step size η, the final converging point has sharpness very close to the corresponding 2/η. We call this phenomenon "sharpness adaptivity". Another perspective on the same phenomenon is that for a fixed step size η, the final converging points from a wide range of initializations all have sharpness very close to 2/η. We call this phenomenon "sharpness concentration". Surprisingly, both sharpness adaptivity and sharpness concentration occur for deeper networks, while for shallower models of nonconvex optimization such as matrix factorization or 2-layer neural networks, the gap between the sharpness and 2/η is often much larger (see Fig. 1b). This suggests that these phenomena are related to network depth. What is the mechanism for sharpness adaptivity and concentration, and how does it relate to the number of layers? To answer these questions, in this paper we consider a minimalist example of edge of stability.
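The role of the stability threshold is easiest to see on a fixed quadratic. The sketch below (an illustration of the footnoted fact, not code from the paper) runs gradient descent on f(x) = λx²/2, whose update is x ← (1 − ηλ)x, so iterates contract exactly when λ < 2/η and blow up when λ > 2/η:

```python
def gd_quadratic(lam, eta, x0=1.0, steps=100):
    """Gradient descent on f(x) = lam * x^2 / 2; the update is x <- (1 - eta*lam) * x."""
    x = x0
    for _ in range(steps):
        x = (1.0 - eta * lam) * x
    return x

eta = 0.1  # stability threshold 2/eta = 20
print(abs(gd_quadratic(19.0, eta)))  # |1 - 1.9| = 0.9 < 1: contracts toward 0
print(abs(gd_quadratic(21.0, eta)))  # |1 - 2.1| = 1.1 > 1: blows up
```

For a general smooth objective the Hessian is not fixed, which is why sharpness above 2/η does not immediately imply divergence.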
More specifically, we construct an objective function (a 4-layer scalar network with coupled entries) such that gradient descent on this objective has empirical behavior similar to that of deeper networks. We give a rigorous analysis of the training dynamics of this objective function in a large local region, which proves that the dynamics satisfy both sharpness adaptivity and sharpness concentration. The global training dynamics of our objective exhibit a complicated fractal behavior (which is also why our rigorous results are local), and such behavior has been observed in the training of neural networks.
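To illustrate the coupling, here is a small numerical check (our own sketch, not code from the paper) that gradient descent on the 4-variable objective (1/2)(1 − xyzw)² preserves the coupling x = z, y = w, so the coupled trajectory matches GD on the 2-variable objective (1/4)(1 − x²y²)²:

```python
import numpy as np

def grad4(v):
    """Gradient of L(x, y, z, w) = 0.5 * (1 - x*y*z*w)^2."""
    x, y, z, w = v
    r = 1.0 - x * y * z * w
    return -r * np.array([y * z * w, x * z * w, x * y * w, x * y * z])

def grad2(u):
    """Gradient of L(x, y) = 0.25 * (1 - x^2 * y^2)^2."""
    x, y = u
    r = 1.0 - x**2 * y**2
    return -r * np.array([x * y**2, x**2 * y])

eta = 0.01
v = np.array([1.3, 0.7, 1.3, 0.7])  # coupled initialization: x = z, y = w
u = np.array([1.3, 0.7])
for _ in range(50):
    v = v - eta * grad4(v)
    u = u - eta * grad2(u)

# v stays of the form (x, y, x, y) and matches the 2-variable trajectory u
print(v, u)
```

The invariance follows because under x = z, y = w the x- and z-components of the 4-variable gradient coincide (both equal −xy²(1 − x²y²)), and likewise for y and w.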

1.1. OUR RESULTS

The objective function we consider is very simple: L(x, y, z, w) ≜ (1/2)(1 − xyzw)². One can view this as a 4-layer scalar network (each layer has a single neuron). We further couple the initialization so that x = z and y = w, so effectively it becomes an objective in two variables: L(x, y) ≜ (1/4)(1 − x²y²)². For this objective function we prove convergence and sharpness concentration properties:

Theorem 1.1 (Sharpness Concentration, Informal). For any learning rate η smaller than some constant, there is a constant-size region S_η such that the GD trajectories with step size η from all initializations in S_η converge to a global minimum with sharpness in (2/η − (20/3)η, 2/η).

As a direct corollary, we can also prove the sharpness adaptivity property.

Corollary 1.1 (Sharpness Adaptivity, Informal). There exist a constant-size region S and a corresponding range of step sizes K such that for all η ∈ K, the GD trajectory with step size η from any initialization in S converges to a global minimum with sharpness in (2/η − (20/3)η, 2/η).

The training dynamics are illustrated in Fig. 2. To analyze the training dynamics, we reparametrize the objective function and show that the 2-step dynamics of gradient descent roughly follow a parabola. The extreme point of this parabola is the final converging point, which has sharpness very close to 2/η. Intuitively, the parabola comes from a cubic term in the approximation of the training dynamics (see Section 3.1 for a detailed discussion). We also extend our result to a setting where x and y are replaced by vectors; see Theorem 3.2 in Section 3.3. In Section 4 we explain the difference between the dynamics of our degree-4 model and those of degree-2 models (which are more similar to matrix factorization or 2-layer neural networks). We show that the dynamics of degree-2 models do not have the higher-order terms, and their trajectories form an ellipse instead of a parabola.
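The two-variable dynamics can be reproduced numerically. The following sketch is ours, with a hypothetical initialization chosen near the minimum manifold x²y² = 1 with sharpness slightly above 2/η; it runs GD on L(x, y) = (1/4)(1 − x²y²)² and checks that the final sharpness lands just below 2/η. (At a global minimum, where x²y² = 1, the Hessian has eigenvalues 2(x² + y²) and 0, so the sharpness there is 2(x² + y²).)

```python
import numpy as np

def grad(u):
    """Gradient of L(x, y) = 0.25 * (1 - x^2 * y^2)^2."""
    x, y = u
    r = 1.0 - x**2 * y**2
    return -r * np.array([x * y**2, x**2 * y])

def sharpness(u):
    """Largest eigenvalue of the Hessian of L at (x, y)."""
    x, y = u
    r = 1.0 - x**2 * y**2
    H = np.array([
        [2 * x**2 * y**4 - y**2 * r, 2 * x**3 * y**3 - 2 * x * y * r],
        [2 * x**3 * y**3 - 2 * x * y * r, 2 * x**4 * y**2 - x**2 * r],
    ])
    return float(np.linalg.eigvalsh(H)[-1])

eta = 0.05                  # stability threshold 2/eta = 40
u = np.array([4.5, 0.222])  # hypothetical init: sharpness ~ 40.5, slightly above 2/eta
for _ in range(200_000):
    u = u - eta * grad(u)

print(sharpness(u), 2 / eta)  # final sharpness just below 2/eta
```

Running the same loop with a different η should again end just below the new 2/η, which is the sharpness adaptivity property in miniature.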
In Section 5 we show why it is difficult to extend Theorem 3.1 to global convergence: the training trajectory exhibits fractal behavior globally. Such behavior can be qualitatively approximated by simple low-degree nonlinear dynamical systems that are standard in chaos theory, but it is still very difficult to analyze. Finally, in Section 6 we present the similarity between our minimalist model and the GD trajectory of some over-parameterized deep neural networks trained on a real-world dataset. Toward the end of convergence, the trajectory of the deep networks mostly lies in a 2-dimensional subspace and is well characterized by a parabola, as in the scalar case.
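A canonical example of the low-degree nonlinear dynamics alluded to above is the logistic map x ← r·x(1 − x), the textbook model of period-doubling bifurcation and chaos (this analogy is ours; the map is not the GD dynamics themselves). A quick sketch of its three regimes:

```python
def iterate(r, x0=0.3, burn_in=1000, keep=4):
    """Iterate the logistic map x <- r*x*(1-x) and return the last few iterates."""
    x = x0
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    tail = []
    for _ in range(keep):
        x = r * x * (1.0 - x)
        tail.append(x)
    return tail

print(iterate(2.5))  # fixed point: all values equal 1 - 1/r = 0.6
print(iterate(3.2))  # period-2 orbit: values alternate between two levels
print(iterate(3.9))  # chaotic regime: no visible periodicity
```

As r increases, the fixed point loses stability and period-doubles its way into chaos; the bifurcation diagram of this map is the prototype of the fractal structure we observe in the global GD trajectory.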



Figure 1: EoS Phenomena in NN Training. We consider three models: a 5-layer ReLU-activated fully connected network, a 2-layer fully connected linear network with asymmetric initialization factor (4, 0.1) (see Appendix A.1 for an explanation), and a 4-layer scalar network equivalent to min_{x,y} (1/4)(1 − x²y²)². For each model we run gradient descent from the same initialization using different learning rates. For (a) and (c), the sharpness converges very close to 2/η while the loss continues to decrease. For (b), the sharpness decreases to a value significantly lower than 2/η.

