A NEW CHARACTERIZATION OF THE EDGE OF STABILITY BASED ON A SHARPNESS MEASURE AWARE OF BATCH GRADIENT DISTRIBUTION

Abstract

For full-batch gradient descent (GD), it has been empirically shown that the sharpness, i.e., the top eigenvalue of the loss Hessian, increases during training and then hovers just above 2/(learning rate); this is called the "edge of stability" phenomenon. However, it is unclear why the sharpness settles somewhat above 2/(learning rate), and how this behavior extends to general mini-batch stochastic gradient descent (SGD). We propose a new sharpness measure, interaction-aware sharpness, that accounts for the interaction between the batch gradient distribution and the loss landscape geometry. This measure leads to a more refined and general characterization of the edge of stability for SGD. Moreover, based on an analysis of a concentration measure of the batch gradient, we propose a more accurate scaling rule between batch size and learning rate, the Linear and Saturation Scaling Rule (LSSR).

1. INTRODUCTION

For full-batch GD, it has been empirically observed that the sharpness, the top eigenvalue of the Hessian, increases as training proceeds and then hovers above 2/(learning rate) (Cohen et al., 2021). This observation links two empirical results regarding generalization: (i) GD with larger learning rates tends to generalize better (Bjorck et al., 2018; Li et al., 2019b; Lewkowycz et al., 2020; Smith et al., 2020), and (ii) minima with low sharpness tend to generalize better (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). It also has significant implications for existing convergence analyses of neural network optimization, since it contradicts the common assumption that the learning rate is less than 2/β (where β is an upper bound on the top eigenvalue of the Hessian), which guarantees a decrease in the training loss (Nesterov, 2003; Schmidt, 2014; Martens, 2014; Bottou et al., 2018). Even though the violation of this assumption makes the training loss evolve non-monotonically over short timescales, the loss is, interestingly, observed to decrease consistently over long timescales. This regime, in which GD typically operates, has been referred to as 'the edge of stability' (EoS) (Cohen et al., 2021).

Many aspects of the EoS regime remain unexplained. For example, it is not clear why, and to what extent, the sharpness hovers above 2/(learning rate). Moreover, the mechanism by which unstable optimization occurs consistently at the EoS, while being prevented from diverging entirely, has not yet been elucidated. How this phenomenon generalizes beyond GD, especially to mini-batch SGD, is still an open question. In this paper, we provide a new characterization of the EoS for SGD that serves as an answer to these questions.
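The role of the 2/(learning rate) threshold can be made concrete on a one-dimensional quadratic. The following toy sketch (not from the paper; the quadratic loss f(x) = λx²/2 and all numerical values are illustrative assumptions) shows that GD contracts toward the minimum when the sharpness λ is below 2/η and oscillates divergently when it is above, since the per-step update multiplies the iterate by (1 − ηλ).

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run GD on the toy loss f(x) = (sharpness / 2) * x**2.

    The gradient is sharpness * x, so each step computes
    x <- (1 - lr * sharpness) * x; the iterate contracts iff
    |1 - lr * sharpness| < 1, i.e. iff sharpness < 2 / lr.
    """
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x
    return x

lr = 0.1  # the stability threshold is 2 / lr = 20
# Sharpness below the threshold: the iterate decays toward the minimum.
stable = gd_on_quadratic(sharpness=19.0, lr=lr)
# Sharpness above the threshold: the iterate oscillates and diverges.
unstable = gd_on_quadratic(sharpness=21.0, lr=lr)
print(abs(stable), abs(unstable))
```

On a pure quadratic the divergence is total; the EoS observation is precisely that real (non-quadratic) losses violate this threshold yet keep decreasing over long timescales.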
As a tool to analyze the optimization process of SGD, we first propose a sharpness measure of the neural network loss landscape that is aware of the SGD batch gradient distribution (and hence captures the interaction between SGD and the loss landscape), which we refer to as the interaction-aware sharpness (IAS) (Section 2). Based on this measure, we define stable and unstable regions in the neural network parameter space. We then scrutinize, both theoretically and empirically, the transition of the iterate from the stable to the unstable region (Section 4.1) and the mechanism by which it escapes from the unstable region, i.e., how optimization can occur at the EoS. We interpret the latter mechanism in terms of the non-quadraticity of the loss and the presence of asymmetric valleys in the loss landscape (He et al., 2019) (Section 4.2).
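The baseline quantity that IAS refines, the top Hessian eigenvalue, is commonly estimated in practice without forming the Hessian, via power iteration on Hessian-vector products. The sketch below is one standard way to do this (it is not the paper's method or its IAS measure; the finite-difference approximation of the Hessian-vector product and all parameter names are our own illustrative choices), shown with a gradient function whose Hessian is known so the estimate can be checked.

```python
import numpy as np

def power_iteration_sharpness(grad_fn, theta, iters=100, eps=1e-4, seed=0):
    """Estimate the sharpness (top Hessian eigenvalue) at parameters theta.

    Uses power iteration on finite-difference Hessian-vector products:
        H v ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2 * eps),
    so only a gradient oracle is needed, never the full Hessian.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        lam = float(v @ hv)          # Rayleigh quotient with unit-norm v
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

# Sanity check on a quadratic loss L(theta) = 0.5 * theta^T A theta,
# whose Hessian is exactly A (top eigenvalue 3 here).
A = np.diag([3.0, 1.0, 0.5])
grad = lambda th: A @ th
print(power_iteration_sharpness(grad, np.zeros(3)))  # ~ 3.0
```

In a deep-learning framework the finite-difference step would typically be replaced by an exact Hessian-vector product from automatic differentiation; the power-iteration loop is unchanged.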

