A NEW CHARACTERIZATION OF THE EDGE OF STABILITY BASED ON A SHARPNESS MEASURE AWARE OF BATCH GRADIENT DISTRIBUTION

Abstract

For full-batch gradient descent (GD), it has been empirically shown that the sharpness, the top eigenvalue of the Hessian, increases and then hovers above 2/(learning rate); this is called the "edge of stability" phenomenon. However, it is unclear why the sharpness settles somewhat above 2/(learning rate), and how this observation extends to general mini-batch stochastic gradient descent (SGD). We propose a new sharpness measure, the interaction-aware sharpness, which accounts for the interaction between the batch gradient distribution and the loss landscape geometry. This leads to a more refined and general characterization of the edge of stability for SGD. Moreover, based on the analysis of a concentration measure of the batch gradient, we propose a more accurate scaling rule between batch size and learning rate, the Linear and Saturation Scaling Rule (LSSR).

1. INTRODUCTION

For full-batch GD, it has been empirically observed that the sharpness, the top eigenvalue of the Hessian, increases and then hovers above 2/(learning rate) (Cohen et al., 2021) as the training proceeds. This observation provides a link between two empirical results regarding generalization: (i) using larger learning rates for GD can improve generalization (Bjorck et al., 2018; Li et al., 2019b; Lewkowycz et al., 2020; Smith et al., 2020), and (ii) minima with low sharpness tend to generalize better (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). This observation has significant implications for existing neural network optimization convergence analyses, since it is contrary to the frequent assumption that 'the learning rate is less than 2/β (where β is an upper bound on the Hessian top eigenvalue)', which ensures a decrease in the training loss (Nesterov, 2003; Schmidt, 2014; Martens, 2014; Bottou et al., 2018). Even though the training loss evolves non-monotonically over short timescales due to the violation of this assumption, interestingly, the loss is observed to decrease consistently over long timescales. This regime in which GD typically operates has been referred to as 'the edge of stability (EoS)' (Cohen et al., 2021). Many aspects of the EoS regime remain unexplained. For example, it is not clear why, and to what extent, the sharpness hovers above 2/(learning rate). Moreover, the inherent mechanism by which unstable optimization consistently proceeds at the EoS, while being prevented from diverging entirely, has not yet been elucidated. How this phenomenon generalizes beyond GD, especially to mini-batch SGD, is still an open question. In this paper, we provide a new characterization of the EoS for SGD, which can serve as an answer to the above questions.
As a tool to analyze the optimization process of SGD, we first propose a sharpness measure of the neural network loss landscape that is aware of the SGD batch gradient distribution (hence capturing the interaction between SGD and the loss landscape), which we refer to as the interaction-aware sharpness (IAS) (Section 2). Based on this measure, we define the stable and unstable regions in the neural network parameter space. We then scrutinize, both theoretically and empirically, the transition of the iterate from the stable to the unstable region (Section 4.1) and the mechanism for escaping from the unstable region, i.e., how optimization can proceed at the EoS. We interpret the latter mechanism based on the non-quadraticity of the loss and the presence of asymmetric valleys in the loss landscape (He et al., 2019) (Section 4.2). Based on these analyses, we propose the notion of implicit interaction regularization (IIR), i.e., that the IAS is implicitly bounded during SGD, as an implicit bias of SGD (Section 4.3). The value bounding the IAS is the ratio of a concentration measure of the SGD batch gradient distribution to the learning rate. This is a more refined characterization of the EoS, as it shows that the IAS does not merely hover above a certain value, but rather hovers around it. More importantly, it applies naturally to SGD, since we make no impractical assumptions on the batch size or learning rate. Our new characterization of the EoS leads to a novel scaling rule between batch size and learning rate, based on the idea of preserving a similar level of IIR (Section 5). This scaling rule, referred to as the Linear and Saturation Scaling Rule (LSSR), recovers the well-known linear scaling rule (LSR) (Jastrzębski et al., 2017; Masters & Luschi, 2018; Zhang et al., 2019; Shallue et al., 2018; Smith et al., 2020; 2021) for small batch sizes and reduces to no scaling (due to saturation) for large batch sizes.
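As a rough numerical illustration of why 1/b-based reasoning breaks down at large batch sizes, the sketch below computes the per-step gradient-noise factor γ_{n,b}/b for sampling without replacement, where γ_{n,b} = (n−b)/(n−1) as in Eq. (1) of Section 2. The training-set size n and the batch sizes are hypothetical, and this is only a sketch of the noise-scale intuition; the precise form of the LSSR is derived later in the paper.

```python
# Gradient-noise factor gamma_{n,b}/b for sampling without replacement,
# where gamma_{n,b} = (n - b)/(n - 1) as in Eq. (1) of Section 2.
# Illustrative sketch only: n and the batch sizes below are hypothetical.
def noise_factor(n: int, b: int) -> float:
    gamma = (n - b) / (n - 1)
    return gamma / b

n = 50_000  # hypothetical training-set size
for b in [1, 64, 1024, 16_384, 50_000]:
    print(f"b={b:6d}  gamma/b={noise_factor(n, b):.3e}  1/b={1.0 / b:.3e}")
```

For b ≪ n the factor is essentially 1/b (the regime where approximating γ_{n,b} ≈ 1, and hence the linear scaling rule, is reasonable), whereas as b approaches n the factor drops to zero, so the 1/b approximation no longer applies.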

2. GRADIENT DISTRIBUTION AND LOSS LANDSCAPE

In this section, we review some concepts required for further discussion. See Appendix A for a quick reference on the notation. To simplify the notation, we often omit the dependence on some variables and the subscript of the expectation operator when clear from the context. For a learning task, we use a parameterized model with model parameter θ ∈ Θ ⊂ R^m. We train the model using training data D = {x_i}_{i=1}^n and a loss function ℓ(x; θ), and denote the (total) training loss by L(θ) ≡ (1/n) Σ_{i=1}^n ℓ(x_i; θ). At time step t, we update the parameter θ_t using GD, θ_{t+1} = θ_t − η ∇_θ L(θ_t), with a learning rate η > 0, or using SGD, θ_{t+1} = θ_t − η g_ξ(θ_t), with a mini-batch gradient g_ξ(θ_t) ≡ (1/b) Σ_{x ∈ B^t_ξ} ∇_θ ℓ(x; θ_t) ∈ R^m for a mini-batch B^t_ξ ⊂ D of size b (1 ≤ b ≤ n). Here, we use the subscript ξ to denote the random batch sampling procedure.

Now, we are ready to introduce some important matrices: C_b, S_b, and H. First, we define the covariance C_b(θ) ≡ Var_ξ[g_ξ(θ)] = E_ξ[(g_ξ(θ) − E_ξ[g_ξ(θ)])(g_ξ(θ) − E_ξ[g_ξ(θ)])^⊤] ∈ R^{m×m} and the second moment S_b(θ) ≡ E_ξ[g_ξ(θ) g_ξ(θ)^⊤] ∈ R^{m×m} of the mini-batch gradient g_ξ(θ) over batch sampling, for a batch size 1 ≤ b ≤ n.¹ The covariance C_b and the second moment S_b satisfy not only C_b = S_b − S_n but also the following equation (Hoffer et al., 2017; Li et al., 2017; Wu et al., 2020):

C_b = (γ_{n,b}/b)(S_1 − S_n) = (γ_{n,b}/b) C_1,    (1)

where γ_{n,b} = (n−b)/(n−1) for sampling without replacement and γ_{n,b} = 1 for sampling with replacement. We provide a self-contained proof of (1) in Appendix B.1. We note that, for sampling without replacement, many previous works approximate γ_{n,b} ≈ 1 assuming b ≪ n (Jastrzębski et al., 2017; Hoffer et al., 2017; Smith et al., 2021), but we consider the whole range 1 ≤ b ≤ n (so 0 ≤ γ_{n,b} ≤ 1, with γ_{n,1} = 1 and γ_{n,n} = 0). Second, we define the Hessian H(θ) ≡ ∇²_θ L(θ) = (1/n) Σ_{i=1}^n ∇²_θ ℓ(x_i; θ) ∈ R^{m×m} and denote its i-th largest eigenvalue and the corresponding normalized eigenvector by λ_i(H) ∈ R and q_i(H) ∈ R^m, respectively, for i = 1, …, m. The operator norm ∥H∥ ≡ sup_{∥u∥=1} ∥Hu∥ of H equals the top eigenvalue λ_1.

We emphasize that C_b and S_b represent the stochasticity of the batch gradients, while H represents the loss landscape geometry. Therefore, we can state one of our goals as follows: we aim to understand how the loss landscape geometry (H) and the gradient distribution (S_b) interact with each other during SGD training. We investigate this "interaction" in terms of the matrix product H S_b. To be specific, we consider the trace tr(H S_b) and its normalized value tr(H S_b)/tr(S_b), and we call the latter the interaction-aware sharpness:

Definition 1 (Interaction-Aware Sharpness (IAS)).

∥H∥_{S_b} ≡ tr(H S_b) / tr(S_b).    (2)

Here, tr(H S_b) ≤ ∥H∥ tr(S_b), i.e., ∥H∥_{S_b} ≤ ∥H∥, and the equality holds only when every g_ξ is aligned with the direction of the top eigenvector q_1 of H.

¹These two matrices C_b and S_b are often called the second central and non-central moments, respectively. But to avoid confusion, we use the term "second moment" only for the non-central S_b.
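To make these definitions concrete, the following self-contained sketch checks Eq. (1) and the inequality in Definition 1 numerically on synthetic per-example gradients. Everything here is hypothetical: the sizes n and m, the random gradients G, and the positive semi-definite stand-in for the Hessian H (chosen PSD so that ∥H∥ = λ_1); size-b batches are enumerated exhaustively, i.e., sampled without replacement.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Hypothetical toy problem: n per-example gradients in R^m and a PSD
# stand-in for the Hessian H (so that ||H|| = lambda_1).
n, m = 6, 4
G = rng.normal(size=(n, m))      # row i: gradient of l(x_i; theta)
A = rng.normal(size=(m, m))
H = A.T @ A                      # symmetric positive semi-definite

def second_moment(b):
    """S_b = E_xi[g_xi g_xi^T], averaging over all size-b batches
    drawn without replacement."""
    gs = [G[list(idx)].mean(axis=0) for idx in combinations(range(n), b)]
    return np.mean([np.outer(g, g) for g in gs], axis=0)

def covariance(b):
    """C_b = S_b - S_n (the full-batch gradient is deterministic)."""
    return second_moment(b) - second_moment(n)

# Eq. (1): C_b = (gamma_{n,b}/b) C_1 with gamma_{n,b} = (n-b)/(n-1).
b = 3
gamma = (n - b) / (n - 1)
assert np.allclose(covariance(b), (gamma / b) * covariance(1))
assert np.allclose(covariance(n), 0.0)   # gamma_{n,n} = 0

# Definition 1 (IAS): ||H||_{S_b} = tr(H S_b)/tr(S_b) <= ||H|| = lambda_1.
S_b = second_moment(b)
ias = np.trace(H @ S_b) / np.trace(S_b)
lam1 = np.linalg.eigvalsh(H)[-1]         # eigvalsh returns ascending order
assert ias <= lam1 + 1e-10
print(f"IAS = {ias:.4f} <= lambda_1 = {lam1:.4f}")
```

Replacing every row of G with a multiple of the top eigenvector q_1 of H would drive the IAS up to λ_1, matching the equality condition stated above.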