REVISITING THE STABILITY OF STOCHASTIC GRADIENT DESCENT: A TIGHTNESS ANALYSIS

Abstract

The technique of algorithmic stability has been used to capture the generalization power of several learning models, especially those trained with stochastic gradient descent (SGD). This paper investigates the tightness of the algorithmic stability bounds for SGD given by Hardt et al. (2016). We show that the analysis of Hardt et al. (2016) is tight for convex objective functions, but loose for non-convex objective functions. In the non-convex case we provide a tighter upper bound on the stability (and hence generalization error), and provide evidence that it is asymptotically tight up to a constant factor. However, deep neural networks trained with SGD exhibit much better stability and generalization in practice than what is suggested by these (tight) bounds, namely, linear or exponential degradation with time for SGD with constant step size. We aim towards characterizing deep learning loss functions with good generalization guarantees, despite training using SGD with constant step size. In this vein, we propose the notion of a Hessian Contractive (HC) region, which quantifies the contractivity of regions containing local minima in the neural network loss landscape. We provide empirical evidence that several loss functions exhibit HC characteristics, and provide theoretical evidence that the known tight SGD stability bounds for convex and non-convex loss functions can be circumvented by HC loss functions, thus partially explaining the generalization of deep neural networks.

1. INTRODUCTION

Stochastic gradient descent (SGD) has gained great popularity in solving machine learning optimization problems (Kingma & Ba, 2014; Johnson & Zhang, 2013). SGD leverages the finite-sum structure of the objective function, avoids the expensive computation of exact gradients, and thus provides a feasible and efficient optimization solution in large-scale settings (Bottou, 2012). The convergence and the optimality of SGD have been thoroughly studied (Ge et al., 2015; Rakhlin et al., 2012; Reddi et al., 2018; Zhou & Gu, 2019; Carmon et al., 2019a;b; Shamir & Zhang, 2013). In recent years, new research questions have been raised regarding SGD's impact on a model's generalization power. The seminal work (Hardt et al., 2016) tackled the problem using the algorithmic stability of SGD, i.e., the progressive sensitivity of the trained model w.r.t. the replacement of a single datum in the training set. The stability-based analysis of the generalization gap allows one to bypass classical model capacity theorems (Vapnik, 1998; Koltchinskii & Panchenko, 2000) or weight-based complexity theorems (Neyshabur et al., 2017; Bartlett et al., 2017; Arora et al., 2018). This framework also provides theoretical insights into many phenomena observed in practice, e.g., the "train faster, generalize better" phenomenon, and the power of regularization techniques such as weight decay (Krogh & Hertz, 1992), Dropout (Srivastava et al., 2014), and gradient clipping. Other works have applied the stability analysis to more sophisticated settings such as Stochastic Gradient Langevin Dynamics and momentum SGD (Mou et al., 2018; Chaudhari et al., 2019; Chen et al., 2018). Despite the promises of this stability-based analysis, it remains open whether this framework can explain the strong generalization performance of deep neural networks in practice.
Existing theoretical upper bounds on the stability (and thus generalization) (Hardt et al., 2016) are ideal for strongly convex loss functions: the upper bound remains constant even as the number of training iterations increases. However, the same bound deteriorates significantly when we relax to more general and realistic settings. In particular, for convex (but not strongly convex) and non-convex loss functions, if SGD has constant step size, then the upper bound grows linearly and exponentially, respectively, with the number of training iterations. This bound fails to match the superior generalization performance of deep neural networks, and leads to the following question:

Question 1: Can we find a better stability upper bound for convex or non-convex loss functions?

In this paper, we first address the question above and investigate the tightness of the algorithmic stability analysis for stochastic gradient methods (SGM) proposed by (Hardt et al., 2016).

R1. We show in Theorem 1 that the analysis in (Hardt et al., 2016) is tight for convex and smooth objective functions; in other words, there is a convex loss function whose stability grows linearly with the number of training iterations when SGD is run with constant step size (α_t = α).

R2. We show in Theorem 2 that for linear models, the analysis in the convex case can be tightened to show that ε_stab does not increase with t.

R3. In Theorem 3 we show that the analysis in (Hardt et al., 2016) for decreasing step size (α_t = O(1/t)) is loose for non-convex objective functions, by providing a tighter upper bound on the stability (and hence generalization error).

R4. The bound on the stability of SGD in (Hardt et al., 2016) is obtained by bounding the divergence at time t, defined as δ_t := E||w_t − w′_t||, where w_t is the model trained on a data set S and w′_t is the model trained on a data set S′ that differs from S in exactly one sample.
In Theorem 4 we provide evidence that our new upper bound in the non-convex case is tight, by exhibiting a non-convex loss function whose divergence matches our upper bound.

R5. Although it is not derived formally, the techniques in (Hardt et al., 2016) can be employed to show an exponential upper bound for non-convex loss functions minimized using SGD with constant step size. In Theorem 5, we give evidence that this abysmal upper bound is likely tight for non-convex loss functions, by exhibiting a non-convex loss function for which the divergence δ_t increases exponentially.

Thus the only functions whose stability provably does not increase with the number of iterations, when a constant step size is employed in SGD, are strongly convex functions. However, a) it has been empirically observed that for deep neural network losses, near the local minima, the Hessians are usually low rank (Chaudhari et al., 2017; Yao et al., 2019), and b) neural networks trained with constant step-size SGD do generalize well in practice (Lin & Jegelka, 2018; Huang et al., 2017; Smith et al., 2017). Combined with our lower bounds for convex and non-convex functions, we seem to hit an obstacle on the way to explaining generalization using the stability framework.

Question 2: What is it that makes constant-step SGD on deep learning loss functions generalize well?

Realizing the limitation of the current state of stability analysis, we investigate whether a stronger-than-convex, but weaker-than-strongly-convex assumption on the loss function can be made, at least near local minima. If we can show algorithmic stability near local minima, we can still establish overall stability using a similar argument to (Du et al., 2019; Allen-Zhu et al., 2019). Aiming towards a characterization of loss functions exhibiting good stability, we propose a new condition on the loss near local minima.
This condition, called Hessian contractive, is slightly stronger than a general convexity condition, but considerably weaker than strong convexity. Formally, the Hessian contractive condition stipulates that near any local minimum, (1) the function is convex; and (2) a data-dependent Hessian is positive definite in the gradient direction. Theoretically, we show that such a condition is sufficient to guarantee a constant stability bound for SGD (with constant step size) near the local minima, while allowing the Hessian to be low rank. We also provide examples showing that Hessian contractivity is a reasonable condition for several loss functions. Empirically, we verify the Hessian Contractive condition near local minima of the loss while training deep neural networks: we sample points from a neighborhood of the current iterates by adding Gaussian noise, and verify the HC condition locally via a Hessian-product approximation. Summarizing our second set of contributions:

R6. In Observation 1 we show that a family of widely used (convex) linear model loss functions satisfies the Hessian Contractive condition; one typical example is the regression loss. This observation suggests that Hessian Contractivity is a condition satisfied by (potentially many) machine learning loss functions.

R7. In Theorem 6 we show that the Hessian Contractive condition localizes the SGD iterates in a neighborhood of a minimum, which implies a constant stability bound for SGD near the local minima.

Table 1: Stability bounds for SGD under different loss assumptions and step sizes. [H] indicates results from (Hardt et al., 2016), and * indicates results in this paper. Bounds without [H] or * are trivial. β is the smoothness parameter.

Loss function       | SGD step size        | Upper bound                               | Lower bound
Strongly convex     | constant α_t = a/β   | O(1) [H]                                  | Ω(1)
Convex              | constant α_t = a/β   | O(aT/n) [H]                               | Ω(aT/n)*
Non-convex          | α_t = a/(βt)         | O(T^{a/(1+a)}/n) [H], O(T^a/n^{1+a})*     | Open, evidence*
Hessian contractive | constant α_t         | O(1)* (Theorem 6)                         | Ω(1)

1.1 RELATED WORKS

Stability and generalization. The stability framework suggests that a stable machine learning algorithm results in models with good generalization performance (Kearns & Ron, 1999; Bousquet & Elisseeff, 2002; Elisseeff et al., 2005; Shalev-Shwartz et al., 2010; Devroye & Wagner, 1979a;b; Rogers & Wagner, 1978). It serves as a mechanism for provable learnability when uniform convergence fails (Shalev-Shwartz et al., 2010; Nagarajan & Kolter, 2019). The concept of uniform stability was introduced in order to derive high-probability bounds on the generalization error (Bousquet & Elisseeff, 2002). Uniform stability describes the worst-case change in the loss of a model trained by an algorithm when a single data point in the dataset is replaced. In (Hardt et al., 2016), a uniform stability analysis for iterative algorithms is proposed to analyze SGD, generalizing the one-shot version in (Bousquet & Elisseeff, 2002). Algorithmic uniform stability is widely used in analyzing the generalization performance of SGD (Mou et al., 2018; Feldman & Vondrak, 2019; Chen et al., 2018). Worst-case leave-one-out type bounds also closely connect uniform stability with differentially private learning (Feldman et al., 2018; 2020; Dwork et al., 2006; Wu et al., 2017b), where uniform stability can lead to provable privacy guarantees. While upper bounds on the algorithmic stability of SGD have been extensively studied, the tightness of those bounds remains open. In addition to uniform stability, an average stability of SGD is studied in Kuzborskij & Lampert (2018), where the authors provide data-dependent upper bounds on stability.
In this work, we report for the first time lower bounds on the uniform stability of SGD. Our tightness analysis suggests the necessity of additional assumptions for analyzing the generalization of SGD on deep learning.

Geometry of local minima. The geometry of local minima plays an important role in the generalization performance of deep neural networks (Hochreiter & Schmidhuber, 1995; Wu et al., 2017a). Flat minima, i.e., minima whose Hessians have a large portion of zero-valued eigenvalues, are believed to attain better generalization (Keskar et al., 2016; Li et al., 2018). In (Chaudhari et al., 2019), the authors construct a local entropy-based objective function which converges to a solution with good generalization in a flat region, where "flatness" means that the Hessian matrix has a large portion of nearly-zero eigenvalues. However, these observations have not been supported theoretically. In this paper, we propose the Hessian contractive condition, which is slightly stronger than flatness of the minima. This condition says that a minimum is sharp only in the gradient direction while remaining flat in other directions, which unifies the geometric interpretation of flat minima with the uniform stability analysis.

2. PRELIMINARIES

In this section we introduce the notion of uniform stability and establish our notation. We first introduce the quantities of empirical risk, population risk, and generalization gap. Given an unknown distribution D on a labeled sample space Z = X × {−1, +1}, let S = {z_1, ..., z_n} denote a set of n samples z_i = (x_i, y_i) drawn i.i.d. from D. Let w ∈ R^d be the parameter(s) of a model that tries to predict y given x, and let f be a loss function where f(w; z) denotes the loss of the model with parameter w on sample z. Let f(w; S) denote the empirical risk f(w; S) = E_{z∼S}[f(w; z)] = (1/n) Σ_{i=1}^n f(w; z_i), with corresponding population risk E_{z∼D}[f(w; z)]. The generalization error of the model with parameter w is defined as the difference between the empirical and population risks: |E_{z∼D}[f(w; z)] − E_{z∼S}[f(w; z)]|.

Next we introduce the Stochastic Gradient Descent (SGD) method. We follow the setting of (Hardt et al., 2016): starting from some initialization w_0 ∈ R^d, consider the SGD update step w_{t+1} = w_t − α_t ∇_w f(w_t; z_{i_t}), where i_t is drawn from [n] := {1, 2, ..., n} uniformly and independently in each round. The analysis of SGD requires the following crucial properties of the loss function f(·, z) at any fixed point z, viewed solely as a function of the parameter w:

Definition 1 (L-Lipschitz). A function f(w) is L-Lipschitz if ∀u, v ∈ R^d: |f(u) − f(v)| ≤ L ||u − v||.

Definition 2 (β-smooth). A function f(w) is β-smooth if ∀u, v ∈ R^d: ||∇f(u) − ∇f(v)|| ≤ β ||u − v||.

Definition 3 (γ-strongly convex). A function f(w) is γ-strongly convex if ∀u, v ∈ R^d: f(u) ≥ f(v) + ∇f(v)^T (u − v) + (γ/2) ||u − v||^2.

Algorithmic stability. Next we define the key concept of algorithmic stability, which was introduced by (Bousquet & Elisseeff, 2002) and adopted by (Hardt et al., 2016). Informally, an algorithm is stable if its output only varies slightly when we change a single sample in the input dataset.
When this stability is uniform over all datasets differing at a single point, it leads to an upper bound on the generalization gap. More formally:

Definition 4. Two sets of samples S, S′ are twin datasets if they differ at a single entry, i.e., S = {z_1, ..., z_i, ..., z_n} and S′ = {z_1, ..., z′_i, ..., z_n}.

Consider a possibly randomized algorithm A that, given a sample S of size n, outputs a parameter A(S). Define the algorithmic stability parameter of A by ε_stab(A, n) := inf{ε : sup_{z∈Z, S, S′} E_A |f(A(S); z) − f(A(S′); z)| ≤ ε}, where E_A denotes expectation over the random coins of A. Also, for such an algorithm, one can define its expected generalization error as GE(A, n) := E_{S,A}[E_{z∼D}[f(A(S); z)] − E_{z∼S}[f(A(S); z)]].

Stability and generalization. It was proved in (Hardt et al., 2016) that GE(A, n) ≤ ε_stab(A, n). Furthermore, the authors observed that an L-Lipschitz condition on the loss function f enforces a uniform upper bound: sup_z |f(w; z) − f(w′; z)| ≤ L ||w − w′||. This implies that for Lipschitz losses, the algorithmic stability ε_stab(A, n) (and hence the generalization error GE(A, n)) can be bounded by bounding ||w − w′||. Let w_T and w′_T be the parameters obtained by running SGD starting from twin datasets S and S′, respectively. Throughout this paper we focus on the divergence quantity δ_T := E_A ||w_T − w′_T||. While (Hardt et al., 2016) reports upper bounds on δ_T for different types of loss functions, e.g., convex and non-convex loss functions, we investigate the tightness of those bounds.
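To make the objects above concrete, the following minimal numpy sketch runs SGD on a pair of twin datasets that share the same random coins (the same index sequence i_t) and reports one draw of ||w_T − w′_T||; averaging over many coin sequences would estimate δ_T. The least-squares loss, data sizes, and helper names are illustrative, not from the paper.

```python
import numpy as np

def sgd(w0, X, y, grad_f, steps, alpha, rng):
    """Run w_{t+1} = w_t - alpha * grad f(w_t; z_{i_t}); the rng supplies
    the index sequence i_t, so twin runs with equal seeds share coins."""
    w = w0.copy()
    n = len(y)
    for _ in range(steps):
        i = rng.integers(n)
        w = w - alpha * grad_f(w, X[i], y[i])
    return w

def grad_ls(w, x, y):
    """Gradient of the least-squares loss f(w; z) = 0.5*(x^T w - y)^2."""
    return (x @ w - y) * x

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

# Twin dataset S': replace the single sample z_0.
Xp, yp = X.copy(), y.copy()
Xp[0] = rng.normal(size=d)
yp[0] = rng.normal()

w0 = np.zeros(d)
coins = np.random.default_rng(1)            # same coins for both runs
wT = sgd(w0, X, y, grad_ls, 500, 0.01, coins)
coins = np.random.default_rng(1)
wTp = sgd(w0, Xp, yp, grad_ls, 500, 0.01, coins)

divergence = np.linalg.norm(wT - wTp)       # one draw of ||w_T - w'_T||
print(divergence)
```

Over many coin sequences, the empirical mean of `divergence` approximates δ_T for this loss.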

3. TIGHTNESS OF EXISTING BOUNDS

In this section we report our main results. We first consider the convex case with constant step size, where we prove 1) that the existing bound stating ε_stab ∝ t is tight, and 2) that for linear models, the analysis can be tightened to show that ε_stab does not increase with t. Then we move on to the non-convex case, where a) for decreasing step size we improve the existing upper bound and give evidence that our new upper bound is tight, and b) for constant step size we exhibit loss functions whose divergence δ_t increases exponentially with t.

3.1. CONVEX CASE

In this section we analyze the stability of SGD when the loss function is convex and smooth. We begin with a construction which shows that Theorem 3.8 in (Hardt et al., 2016) is tight. Our lower bound analysis will require the following quadratic function:

f(w; z) = (1/2) w^T A w − y x^T w    (1)

where A is a d × d matrix. In the construction of the lower bound, we carefully choose A and S so that the single data point replaced in the twin dataset causes the instability of SGD.

Theorem 1. Let w_t, w′_t be the outputs of SGD on twin datasets S, S′ respectively. Let ∆_t = w_t − w′_t and let α_t be the step size of SGD. There exists a function f which is convex, β-smooth, and L-Lipschitz on the domain of w, and twin datasets S, S′, such that: E||∆_T|| ≥ (L/3n) Σ_{t=1}^T α_t, and ε_stab ≥ (L/3n) Σ_{t=1}^T α_t.

The convex upper bound in Theorem 3.8 of (Hardt et al., 2016) states that E||∆_T|| ≤ (L/n) Σ_{t=1}^T α_t, which implies that the divergence may increase throughout training. The lower bound in Theorem 1 establishes the tightness of this upper bound. In practice, however, such a phenomenon is not commonly observed: for a family of convex but not strongly convex loss functions, the generalization performance does not deteriorate as the number of training iterations increases. This motivates us to investigate a weaker condition that still enforces O(1) stability, without strong convexity. In the next theorem, we restrict ourselves to a family of linear model loss functions and show that the divergence does not increase unboundedly during training. We shall need the following definition of a ξ-self correlated dataset; essentially, self-correlation requires an average linear dependence of each x. Recall that the i-th sample is z_i = (x_i, y_i).

Definition 5. A set S = {z_1, ..., z_n} is ξ-self correlated if ∀j ∈ [n], (1/n) Σ_{i=1}^n (x_j^T x_i)^2 ≥ ξ > 0.

Assuming that ∀j ∈ [n], ||x_j||^2 ≥ r for some r > 0, Definition 5 implies that S is at least (r^2/n)-self correlated.
Thus the above condition holds for all datasets S not containing the zero feature vector. In our next theorem, we leverage the ξ-self correlated condition to prove a non-accumulating uniform stability bound for SGD on linear model loss functions. We characterize a linear model by rewriting the loss function f(w; z) as f_y(w^T x), where f_y(·) is a scalar function depending only on the inner product of the model parameter w and the input feature x.

Theorem 2. Suppose a loss function f(w, z) is of the form f(w, S) = (1/n) Σ_{j=1}^n f_{y_j}(w^T x_j), where f_y(w^T x) satisfies (1) |f′_y(·)| ≤ L, (2) 0 < γ ≤ f″_y(·) ≤ β, (3) ||x|| ≤ R, and (4) S, S′ are ξ-self correlated twin datasets. Let w_t and w′_t be the outputs of SGD on S and S′ after t steps, respectively. Let the divergence be ∆_t := w_t − w′_t and let α ≤ 1/β be the step size of SGD. Then E||∆_T|| ≤ 4LR/(ξγn).

Remark 1. In (Hardt et al., 2016), an O(L^2/(γn)) stability bound is derived for a loss function f(w, z_i) which is strongly convex, i.e., ∇^2 f(w) ⪰ γI. In practice one can incorporate a strongly convex regularizer to impose strong convexity, often resulting in improved generalization performance (Shalev-Shwartz et al., 2010; Bousquet & Elisseeff, 2002). The ξ-self correlated condition allows SGD to maintain a uniformly upper-bounded divergence for a family of widely used models over arbitrarily long training, without a strongly convex regularizer. The theorem suggests that if the dataset S is reasonably simple, e.g., every x_i lies in a low-dimensional subspace, the divergence of SGD is comparable with that of a strongly convex loss function. This analysis exhibits an alternative, data-dependent condition, other than strong convexity, that endows SGD with O(1) stability. This motivates us to go beyond linear models and seek a generalization of the condition in Theorem 2.
In Section 4.1, we propose the Hessian Contractive condition for more general loss functions, driven by this observation on linear models.

Example: linear and logistic regression. Linear regression minimizes the quadratic loss in w: f(w, S) = (1/2n) Σ_{x_j∈S} (x_j^T w − y_j)^2. Note that the Hessian of an individual linear regression loss term is x_j x_j^T, which is not strongly convex since it has rank 1. One cannot apply the strongly convex bound, and the convex bound suggests that stability degrades linearly. However, one can rewrite the loss as f_y(w^T x) with f″_y(·) = 1, so Theorem 2 applies and gives a non-accumulating bound on SGD's stability. A similar result can be derived for the logistic regression loss.
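Definition 5 is easy to probe numerically. The sketch below computes ξ for a random dataset (the helper name is ours) and verifies the simple lower bound ξ ≥ r^2/n, where r is the smallest squared feature norm, which follows because the i = j term alone contributes (x_j^T x_j)^2/n to the average.

```python
import numpy as np

def self_correlation(X):
    """xi from Definition 5: min over j of (1/n) * sum_i (x_j^T x_i)^2.
    S is xi-self correlated when this value is strictly positive."""
    G = X @ X.T                     # Gram matrix of inner products x_j^T x_i
    return (G ** 2).mean(axis=1).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
xi = self_correlation(X)
print(xi)                           # > 0 unless some feature vector is zero

# Lower bound: with r = min_j ||x_j||^2, the j = j term gives xi >= r^2 / n.
r = (np.linalg.norm(X, axis=1) ** 2).min()
n = X.shape[0]
assert xi >= r ** 2 / n > 0
```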

3.2. NON-CONVEX CASE

In this section, we construct a non-convex loss function to analyze the tightness of the divergence bound in (Hardt et al., 2016). We first focus on the case of decreasing step size.

Theorem 3. There exists a function f which is non-convex and β-smooth, twin datasets S, S′, and a constant a such that the following holds: if SGD is run with step size α_t = a/(0.99βt) for 1 ≤ t < T, and w_t, w′_t are the outputs of SGD on S and S′, respectively, and ∆_t := w_t − w′_t, then the divergence of SGD after T rounds (T > n) satisfies E||∆_T|| ≥ T^a/(3n)^{1+a}.

Comparison to the bound in Theorem 3.12 of (Hardt et al., 2016). In (Hardt et al., 2016), an assumption is made on the non-convex loss function, namely that f(w; z) ≤ 1. We remark that the function f used in proving the lower bound above does not obey this assumption. Thus for very large T our lower bound may exceed the upper bound in (Hardt et al., 2016), and in general the two are incomparable due to the lack of this assumption. Next, observe that once T^{a/(1+a)} ≥ n, the upper bound in (Hardt et al., 2016), namely O(T^{a/(1+a)}/n), exceeds 1 and becomes vacuous, especially in light of the assumption f(w; z) ≤ 1. Moreover, when a is small and one is interested in training faster, the smaller values of T, i.e., the range T^{a/(1+a)} ≤ n, are the important ones. Our divergence lower bound motivates an investigation into the possible tightness of the analysis leading to the upper bound in (Hardt et al., 2016). In the following theorem we prove a tighter upper bound in this range of T: it does not assume f(w; z) ≤ 1, and it is non-trivial in the range T^{a/(1+a)} ≤ n.

Theorem 4. Assume f is β-smooth and L-Lipschitz. Running T (T > n) iterations of SGD on f(w; S) with step size α_t = a/(βt), the stability of SGD satisfies ε_stab ≤ 2L^2 T^a/n^{1+a}.

Dividing our bound by the one in Theorem 3.12 of Hardt et al. (2016), we obtain the ratio Ω(T^{a^2/(1+a)}/n^a).
This factor is less than 1 (and so we improve the upper bound) exactly when T^{a/(1+a)} ≤ n. Note that this is potentially a large range, as a is a small positive constant. We remark that our tight bound is for permutation SGD. The bound for SGD using sampling without replacement (see Appendix Theorem 4b) has an additional log(n) factor (which we conjecture can be removed), but is nevertheless also a polynomial improvement over the known bounds.
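As a quick sanity check on the comparison above, one can evaluate the two rates with constants dropped and confirm that the new bound wins exactly in the range T^{a/(1+a)} ≤ n. The concrete numbers below are arbitrary.

```python
# Ignoring constant factors, the two non-convex stability rates are:
#   Hardt et al. (Thm 3.12):  T^(a/(1+a)) / n
#   This paper   (Thm 4):     T^a / n^(1+a)
# Their ratio (ours / theirs) is T^(a^2/(1+a)) / n^a, which is < 1
# exactly when T^(a/(1+a)) < n.
def hardt_rate(T, n, a):
    return T ** (a / (1 + a)) / n

def new_rate(T, n, a):
    return T ** a / n ** (1 + a)

a, n = 0.1, 50_000
T_small = 10_000        # here T^(a/(1+a)) << n: the new bound is tighter
T_huge = int(1e60)      # far beyond the improvement range

assert new_rate(T_small, n, a) < hardt_rate(T_small, n, a)
assert new_rate(T_huge, n, a) > hardt_rate(T_huge, n, a)
```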

4. DEEP LEARNING AND LIMITATIONS OF BOUNDS

While Theorem 4 improves the existing upper bounds on stability by a polynomial factor, it still cannot explain the generalization performance of deep learning models. In particular, the analysis relies on a decreasing step size, whereas a constant step size is the common choice when training deep models. In the next result, we adopt the same construction as in Theorem 3, but with a constant learning rate, to show that, unfortunately, SGD may have an exponential divergence rate for general β-smooth functions.

Theorem 5. Let w_t, w′_t be the outputs of SGD on twin datasets S, S′, and let ∆_t := w_t − w′_t. There exists a function f which is non-convex and β-smooth, twin datasets S, S′, and constants a, γ such that the divergence of SGD after T rounds (T > n) using constant step size α = a/(0.99β) satisfies E||∆_T|| ≥ (1/n^2) e^{aT/2}.

The above theorem implies that the generalization gap may deteriorate at an exponential rate if SGD is run with a constant step size on a non-convex function. However, in Fig. 1a we observe that the distance between the weights of two deep models trained on twin datasets (in other words, the divergence δ_T) stabilizes after ≈ 20 epochs of training. Fig. 1a suggests that the generalization gap for deep models may remain constant after sufficiently many training iterations. However, the analysis of (Hardt et al., 2016) suggests that local minima of the loss landscape must lie in strongly convex regions in order to achieve this sort of stability. Indeed, gradient updates in strongly convex regions have been proven to be contractive, leading to the stability of SGD (Ge et al., 2015). On the other hand, it is known that the Hessians of the loss functions of common deep learning architectures have a large fraction of eigenvalues which are close to zero (Yao et al., 2019; Chaudhari et al., 2017). This gap between theory and practice requires further investigation.
Next, we formalize a condition requiring that local minima of deep learning loss landscapes lie in contractive regions, which aids in explaining why the generalization gap stabilizes. We then provide empirical evidence that this property does indeed hold.

4.1. GEOMETRY OF LOCAL MINIMA

In Theorems 1 and 5, we showed that in general, the divergence (and hence stability) may deteriorate at a linear or exponential rate with the number of training iterations when the local minima do not lie in strongly convex regions. We introduce the following property to aid in understanding the stability of the training curve of SGD.

Definition 6 (Hessian Contractive). For a given set S = {z_1, ..., z_n}, a local minimum w* is in a (σ, γ)-Hessian contractive region if for all w, w′ with max(||w − w*||, ||w′ − w*||) ≤ σ and all z ∈ S,

∇_w f(w, z)^T H_{w′} ∇_w f(w, z) ≥ γ ||∇_w f(w, z)||^2,    (3)

where H_{w′} = ∇^2_w f(w′; S) = (1/n) Σ_{j=1}^n ∇^2_w f(w′, z_j). When the (σ, γ)-Hessian contractive property holds for all σ > 0 and all local minima w*, we say f(w; S) is globally γ-Hessian contractive.

Hessian contractivity describes a region where the stochastic gradient lies in the range of a positive semi-definite Hessian matrix. Such a condition prevents the iterates of SGD from escaping a local minimum, in the sense that every gradient descent step is pulled back towards the minimum by the Hessian; this mimics the structure of a strongly convex loss function, but under a weaker assumption. To show that this class is nonempty, we observe that the (not strongly convex) linear model loss functions of Theorem 2 satisfy the Hessian contractive condition globally.

Observation 1. Suppose a loss function f(w; S) satisfies the conditions in Theorem 2, i.e., f(w, S) = (1/n) Σ_{j=1}^n f_{y_j}(w^T x_j) with (1) |f′_y(·)| ≤ L, (2) γ ≤ f″_y(·) ≤ β, (3) ||x|| ≤ R, and (4) S is ξ-self correlated. Then f(w; S) is globally (ξγ/R^2)-Hessian Contractive.

Dependence on the dataset. The Hessian Contractive condition states that the gradient of the loss evaluated on a single sample z is regulated by the Hessian evaluated on the whole dataset, which generalizes the ξ-self correlated condition. In the linear model case, the strength of the Hessian Contractive condition depends on the average correlation between data points, which reflects the effect of data complexity on generalization. We postulate that the value of γ in the Hessian Contractive condition can serve as a complexity measure of the data, which is believed to affect generalization power (Kawaguchi et al., 2017).

Next, we leverage the Hessian contractive condition to show that once SGD hits a stationary point around w* in a (σ, γ)-Hessian contractive region, it stays localized near the minimum. The localization of SGD implies that the divergence curve eventually stabilizes.

Theorem 6. Given a set S and a β-smooth and L-Lipschitz loss function f, suppose SGD with a fixed step size α ≤ σ^2/(2γL^2) reaches a point w_0 close to a local minimum w* in a (σ, γ)-Hessian contractive region, i.e., ||w_0 − w*|| ≤ σ, with radius large enough that σ > 12L/γ. Then for all T ≥ 1, ||w_T − w*|| ≤ σ.

Theorem 6 states that once an iterate w of SGD encounters a point in a (σ, γ)-Hessian contractive region with large enough radius σ, w remains in this region. In terms of the stability of SGD, assuming SGD brings w_t into one of the local minima quickly (Ge et al., 2015; Allen-Zhu et al., 2019), the divergence ||w_t − w′_t|| is uniformly bounded by the distance between the two local minima w*, w′* of the loss functions f(w; S) and f(w′; S′), plus the σ-radius of the Hessian contractive region. This suggests that a condition weaker than strong convexity suffices to explain the observation in Fig. 1a.
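Observation 1 can also be checked numerically. The sketch below uses the quadratic linear-model loss f_{y_j}(w^T x_j) = (1/2)(w^T x_j − y_j)^2, for which f″ ≡ 1 (so γ = 1) and every per-sample gradient is proportional to its feature vector, so it suffices to test the directions x_j themselves; the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))

# Quadratic linear-model loss: f_{y_j}(w^T x_j) = 0.5*(w^T x_j - y_j)^2,
# so f'' = 1 (gamma = 1) and the full-data Hessian is H = (1/n) X^T X.
H = X.T @ X / n

def hc_value(g):
    """g^T H g / ||g||^2 -- the quantity the HC condition lower-bounds."""
    return g @ H @ g / (g @ g)

# The per-sample gradient at any w is (w^T x_j - y_j) x_j, i.e. a multiple
# of x_j, and hc_value is scale/sign invariant, so testing each x_j suffices.
hc_min = min(hc_value(x) for x in X)

# Observation 1 predicts hc_min >= xi * gamma / R^2, with gamma = 1 here:
# x_j^T H x_j = (1/n) sum_i (x_j^T x_i)^2 >= xi and ||x_j||^2 <= R^2.
xi = ((X @ X.T) ** 2).mean(axis=1).min()
R2 = (np.linalg.norm(X, axis=1) ** 2).max()
assert hc_min >= xi / R2 > 0
```

The final inequality holds exactly, not just numerically, which mirrors the proof sketch of Observation 1 for this loss.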

4.2. EMPIRICAL SUPPORT FOR HESSIAN CONTRACTIVITY

In order to explain the stabilization of the generalization gap observed when training deep learning models via SGD, we hypothesize that the local minima of deep loss landscapes found by SGD lie in Hessian contractive regions.

Figure 1: (a) Distance between the weights of models trained on twin datasets. We plot the mean over 35 twin datasets (variance shown). The distance first increases but stabilizes after ≈ 20 training epochs. The "all" curve is the distance between the full model parameter vectors; conv_i is the distance between the first convolutional layer weights in the i-th "block" of ResNet18. (b) Approximate normalized Hessian contractivity. For both neural networks, the "mean" contractivity value at each epoch is averaged over 100 perturbations and 5 different copies of the network trained on CIFAR10. Similarly for "mean(min)", except the minimum of the 100 approximations after perturbing is taken. Both quantities for both networks provide evidence that SGD converges to an increasingly contractive solution.

We propose to empirically support this hypothesis by locally approximating the left-hand side of the Hessian contractive condition in Definition 6, normalized by ||∇_w f(ŵ, z)||^2, during the training of a model, and showing that this quantity steadily increases as training progresses. This indicates that SGD leads the training into regions of increasing contractivity. We now explain our approximation. For a model w*, we generate many random samples ŵ from the unit sphere around w*. For each ŵ, we locally estimate a stochastic Hessian-gradient product via the approximation

∇_w f(ŵ, z)^T H(z) ∇_w f(ŵ, z) ≈ (1/η) [∇_w f(ŵ, z) − ∇_w f(ŵ − η∇_w f(ŵ, z), z)]^T ∇_w f(ŵ, z),    (4)

where the gradients on the right-hand side are taken with respect to a random minibatch of samples. For each perturbation ŵ, we normalize the corresponding approximation by ||∇_w f(ŵ, z)||^2, and then take the mean and minimum over all of these quantities.
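A minimal sketch of the finite-difference estimate in Eq. (4) is given below, using a logistic loss with an analytic gradient so the estimate can be cross-checked against the exact Hessian quadratic form. All names, sizes, and the choice of loss are illustrative; the paper's experiments use deep networks, where only the finite-difference route is practical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 128, 10
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def grad(w, X, y):
    """Minibatch gradient of the logistic loss mean(log(1 + exp(-y * Xw)))."""
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / len(y)

def hc_approx(w, X, y, eta=1e-3):
    """Finite-difference estimate of g^T H g / ||g||^2 as in Eq. (4):
    H g ~= [grad(w) - grad(w - eta*g)] / eta."""
    g = grad(w, X, y)
    Hg = (g - grad(w - eta * g, X, y)) / eta
    return (g @ Hg) / (g @ g)

w_hat = rng.normal(size=d)          # a perturbed iterate \hat{w}
approx = hc_approx(w_hat, X, y)

# Cross-check against the exact logistic Hessian H = X^T D X / n,
# D = diag(p * (1 - p)) with p = sigmoid(Xw).
p = 1.0 / (1.0 + np.exp(-(X @ w_hat)))
H = X.T @ (X * (p * (1 - p))[:, None]) / n
g = grad(w_hat, X, y)
exact = (g @ H @ g) / (g @ g)
assert abs(approx - exact) < 1e-2
```

The same routine, with `grad` replaced by a minibatch backward pass, is all that is needed to track the normalized contractivity of a deep network during training.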
For our experiments, we used the well-known CIFAR10 dataset (Krizhevsky et al., 2009) and two popular deep architectures: ResNet18 and DenseNet121 (Huang et al., 2017; Lin & Jegelka, 2018). To match the settings in the theorems of this paper as closely as possible and avoid confounding factors, we avoided standard regularization (e.g., weight decay), data augmentation, and adaptive learning rates. We used SGD with a constant learning rate of 0.01 and momentum of 0.9. The models were trained for 200 epochs with a batch size of 128. We trained five copies of each network and checkpointed the models every 8 epochs. For each checkpoint, we generated 100 random perturbations from the unit sphere around the checkpoint weights. We used η = 0.001 and batch size 128 for the approximation in Eq. (4). We averaged over these 100 approximations and then further took the average over the five networks. The results of these experiments are shown in Fig. 1b, and indicate a clear increase in Hessian contractivity as training progresses.

5. CONCLUSION

In this paper, we studied the stability bounds of SGD for different types of loss functions. We proved a better upper bound and proved the tightness of various bounds. These tightness results suggest that existing stability bounds may not suffice to explain the generalization of deep neural nets. We proposed a new Hessian contractive condition that is slightly stronger than convexity but admits an O(1) stability bound, and we provided empirical evidence supporting the hypothesis that deep learning loss functions are Hessian contractive near local minima.

APPENDIX: OMITTED PROOFS

Lemma 2 (Lower bound on divergence). Let ∆_t = w_t − w′_t, let α_t be the step size of SGD, and let ∆_0 = 0. Suppose (x_i − x′_i)/||x_i − x′_i|| is an eigenvector of A with A(x_i − x′_i) = λ_{xx}(x_i − x′_i). Running SGD on f(w, S), we have:

E||∆_T|| ≥ (||x_i − x′_i||/n) Σ_{t=1}^{T−1} α_t Π_{τ=t+1}^{T−1} (1 − α_τ λ_{xx}).

Proof: By Lemma 1 we have

E||∆_T|| = E||(I − α_{T−1}A)∆_{T−1} + (α_{T−1}/n)(x_i − x′_i)||
         = (1 − α_{T−1}λ_{xx}) E||∆_{T−1}|| + (α_{T−1}/n) ||x_i − x′_i||
         = ||x_i − x′_i|| (1/n) Σ_{t=1}^{T−1} α_t Π_{τ=t+1}^{T−1} (1 − α_τ λ_{xx}).

Theorem 1. Let w_t, w′_t be the outputs of SGD on twin datasets S, S′ respectively, let ∆_t = w_t − w′_t, and let α_t be the step size of SGD. There exists a function f which is convex, β-smooth, and L-Lipschitz on the domain of w_t, w′_t, and twin datasets S, S′, such that the divergence of the two SGD outputs satisfies E||∆_T|| ≥ (L/2n) Σ_{t=1}^T α_t, and ε_stab ≥ (L/2n) Σ_{t=1}^T α_t.

Proof: The proof is constructive. Let f(w, z) = (1/2) w^T A w − y x^T w, as in Eq. (1). The function is convex and β-smooth by setting β = ||A|| and requiring u^T A u ≥ 0 for all u. We set x_i = v, y_i = 0.5 and x′_i = −v, y′_i = 0.5, and we set S \ {z_i} = S′ \ {z′_i} to lie in the range of A, where A is a PSD matrix which is not full rank. We further set x_i, x′_i to lie in the null space of A, so that Av = 0. The lower bound on E||∆_T|| follows from Lemma 2 together with the fact that ∆_0 = 0. Since w_T^T A w_T = w′_T^T A w′_T,

ε_stab = sup_z E|f(w_T, z) − f(w′_T, z)| ≥ E|v^T (w_T − w′_T)| ≥ (1/n) Σ_{t=1}^T α_t.
For the last part, we show that any sequence $w_t$ generated by SGD on dataset $S$ has bounded gradients, so Lipschitzness follows. Let $U$ be the range of $A$ (so any unit vector $u \in U$ satisfies $u^\top A u > 0$), and pick $\gamma$ so that $u^\top A u \ge \gamma > 0$ holds for all unit $u \in U$. For all such $u$,
$$w_{t+1}^\top u \le \big((I - \alpha_t A)w_t\big)^\top u + \alpha_t\, z^\top u \le (1 - \alpha_t\gamma)\, w_t^\top u + \alpha_t,$$
which implies $w_t^\top u \le \frac{1}{\gamma}$. We now bound the gradient:
$$\|\nabla_w f(w_t, z)\| = \|A w_t - y\,x\| \le |v^\top A w_t| + |v^\top y\,x| + |u^\top A w_t| + |u^\top y\,x| \le 1 + \frac{\beta}{\gamma}.$$
This implies that the function is $L$-Lipschitz with $L = 1 + \frac{\beta}{\gamma}$. Setting $\beta = 2\gamma$ completes the proof.

Theorem 2. Suppose the loss function is of the form $f(w, S) = \frac{1}{n}\sum_{j=1}^n f_{y_j}(w^\top x_j)$, where $f_y(w^\top x)$ satisfies (1) $|f'_y(\cdot)| \le L$, (2) $0 < \gamma \le f''_y(\cdot) \le \beta$, (3) $\|x\| \le R$, and (4) $S, S'$ are $\xi$-self-correlated. Let $w_t, w'_t$ be the outputs of SGD on twin datasets $S, S'$, let $\Delta_t := w_t - w'_t$, and let the step size of SGD satisfy $\alpha \le \frac{1}{\beta}$. Then
$$\mathbb{E}\|\Delta_T\| \le \frac{4LR}{\xi\gamma n}.$$
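The $O(1/n)$ bound can be sanity-checked by iterating the contraction recursion that drives the proof, $\mathbb{E}\|\Delta_{t+1}\| \le (1 - \alpha\xi\gamma/2)\,\mathbb{E}\|\Delta_t\| + 2\alpha LR/n$, whose fixed point is exactly $4LR/(\xi\gamma n)$. A small sketch with illustrative constants:

```python
def divergence_recursion(T, n, alpha, L, R, xi, gamma):
    # Iterate d_{t+1} = (1 - alpha*xi*gamma/2) d_t + 2*alpha*L*R/n with d_0 = 0.
    d = 0.0
    for _ in range(T):
        d = (1 - alpha * xi * gamma / 2) * d + 2 * alpha * L * R / n
    return d

n, L, R, xi, gamma, alpha = 50, 1.0, 1.0, 0.5, 0.2, 0.1
d_T = divergence_recursion(10000, n, alpha, L, R, xi, gamma)
bound = 4 * L * R / (xi * gamma * n)   # = 0.8 for these constants
```

The iterates approach the fixed point from below and never cross it, matching the stated bound.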

Proof: For simplicity we omit the dependence of $f$ on $y_j$, writing $f_{y_j}(w^\top x_j) = f(w, z_j)$. The gradient of the loss is $\nabla f_{y_j}(w^\top x_j) = f'_{y_j}(w^\top x_j)\, x_j$ and the Hessian is $\nabla^2 f_{y_j}(w^\top x_j) = f''_{y_j}(w^\top x_j)\, x_j x_j^\top$, so the stochastic gradient step is $w_{t+1} = w_t - \alpha_t f'_{y_j}(w_t^\top x_j)\, x_j$. The dynamics of the divergence can be described as:
$$\mathbb{E}_{1:t+1}[\Delta_{t+1}] = \mathbb{E}_{1:t}\Big[\frac{1}{n}\sum_{j\ne i}\big(\Delta_t - \alpha_t [f'_{y_j}(w_t^\top x_j) - f'_{y_j}(w'^\top_t x_j)]x_j\big) + \frac{1}{n}\big(\Delta_t - \alpha_t [f'_{y_i}(w_t^\top x_i)x_i - f'_{y'_i}(w'^\top_t x'_i)x'_i]\big)\Big]. \tag{7}$$
By the mean value theorem, $[f'_{y_j}(w_t^\top x_j) - f'_{y_j}(w'^\top_t x_j)]x_j = f''_{y_j}\big((w^{\theta_j}_t)^\top x_j\big)\, x_j x_j^\top \Delta_t$, where $w^{\theta_j}_t = (1-\theta_j)w_t + \theta_j w'_t$ for some $0 < \theta_j < 1$. Similarly, we can rewrite
$$f'_{y_i}(w_t^\top x_i)x_i - f'_{y'_i}(w'^\top_t x'_i)x'_i = \tfrac{1}{2}\{f'_{y_i}(w_t^\top x_i) - f'_{y_i}(w'^\top_t x_i)\}x_i + \tfrac{1}{2}\{f'_{y'_i}(w_t^\top x'_i) - f'_{y'_i}(w'^\top_t x'_i)\}x'_i + \tfrac{1}{2}\{f'_{y_i}(w_t^\top x_i) + f'_{y_i}(w'^\top_t x_i)\}x_i - \tfrac{1}{2}\{f'_{y'_i}(w_t^\top x'_i) + f'_{y'_i}(w'^\top_t x'_i)\}x'_i$$
$$= \tfrac{1}{2} f''_{y_i}\big((w^{\theta_i}_t)^\top x_i\big)x_i x_i^\top \Delta_t + \tfrac{1}{2} f''_{y'_i}\big((w^{\theta'_i}_t)^\top x'_i\big)x'_i x'^\top_i \Delta_t + \tfrac{1}{2}\{f'_{y_i}(w_t^\top x_i) + f'_{y_i}(w'^\top_t x_i)\}x_i - \tfrac{1}{2}\{f'_{y'_i}(w_t^\top x'_i) + f'_{y'_i}(w'^\top_t x'_i)\}x'_i.$$
Thus Equation (7) can be written as
$$\mathbb{E}_{1:t+1}[\Delta_{t+1}] = \mathbb{E}_{1:t}\Big[\Big(I - \frac{\alpha_t}{2n}\sum_{x_j \in S} f''_{y_j}\big((w^{\theta_j}_t)^\top x_j\big)x_j x_j^\top - \frac{\alpha_t}{2n}\sum_{x_j \in S'} f''_{y_j}\big((w^{\theta_j}_t)^\top x_j\big)x_j x_j^\top\Big)\Delta_t + \frac{\alpha_t}{2n}\{f'_{y_i}(w_t^\top x_i) + f'_{y_i}(w'^\top_t x_i)\}x_i - \frac{\alpha_t}{2n}\{f'_{y'_i}(w_t^\top x'_i) + f'_{y'_i}(w'^\top_t x'_i)\}x'_i\Big].$$
By the $\xi$-self-correlated assumption, and letting $H = \frac{1}{n}\sum_{x_j\in S} x_j x_j^\top$ and $H' = \frac{1}{n}\sum_{x_j\in S'} x_j x_j^\top$, we have:
$$\mathbb{E}\|\Delta_{t+1}\| \le \mathbb{E}\Big\|\Big(I - \frac{\alpha_t\gamma}{2}(H + H')\Big)\Delta_t\Big\| + \frac{\alpha_t L}{n}\|x_i\| + \frac{\alpha_t L}{n}\|x'_i\| \le \Big(1 - \frac{\alpha_t\xi\gamma}{2}\Big)\mathbb{E}\|\Delta_t\| + \frac{2\alpha_t LR}{n}.$$
The first inequality follows from $f''(\cdot) \ge \gamma$ and the fact that the matrices $x_j x_j^\top$ are all PSD; the second follows from $\Delta_t \in \mathrm{Span}\{x_1, \ldots, x_n\}$. Fixing $\alpha_t = \alpha$ and unrolling the recursion, the theorem follows.

Our next two lemmas are used in the proof of Theorem 3.

Lemma 3. Suppose $x_{t_0} \ge 0$ and $x_{t+1} = \big(1 + \frac{a}{0.99 t}\big)x_t + y_t$ with $y_t = \frac{y}{t}$. Then $x_T \ge y\,\big(\frac{T}{t_0}\big)^a$ if $a > 0$ is a sufficiently small constant.
Proof: In the proof we use the following inequality, valid when $a > 0$ is a sufficiently small constant:
$$e^{a} \le 1 + \frac{a}{0.99} \le e^{a/0.99}.$$

Lemma 4. There exists a function $f$ which is non-convex and $\beta$-smooth, twin datasets $S, S'$, and a constant $a > 0$ such that the following holds: if SGD is run with step size $\alpha_t = \frac{a}{0.99\beta t}$ for $1 \le t < T$, and $w_t, w'_t$ are the outputs of SGD on $S$ and $S'$ respectively, and $\Delta_t := w_t - w'_t$, then:
$$\forall\, 1 \le t_0 \le T:\quad \mathbb{E}\big[\|\Delta_T\| \,\big|\, \Delta_{t_0} \ne 0\big] \ge \frac{1}{2n}\Big(\frac{T}{t_0}\Big)^a.$$

Proof: Consider the function $f$ in Equation 1, and choose $A$ to have both positive and negative eigenvalues: set the minimum eigenvalue of $A$ equal to $-\beta$, with all other eigenvalues of absolute value at most $\beta$. We select twin datasets for such $A$ as follows. We set all elements in $S \setminus \{x_i\} = S' \setminus \{x'_i\}$ to lie in the column space of $A$; for all $j \ne i$, choose $x_j$ such that $x_j^\top A x_j > 0$, and choose any $y_j$ between 0 and 1. Let $v$ be such that $v^\top A v = -\beta$ and $\|v\| = 1$. Finally, let $x_i = v$, $y_i = 0.5$, $x'_i = -v$, $y'_i = 0.5$. In this setting, one observes that the divergence $\Delta_t$ follows the dynamic:
$$\Delta_{t+1} = \begin{cases}(I - \alpha_t A)\Delta_t & \text{with prob. } 1 - \frac{1}{n},\\[2pt] (I - \alpha_t A)\Delta_t + \frac{\alpha_t}{2}(x_i - x'_i) & \text{with prob. } \frac{1}{n}.\end{cases}$$
We first observe that $\Delta_t := w_t - w'_t$ is of the form $v\theta_t$ with $\theta_t > 0$. This can be shown by induction: letting $\tau$ be the first time that $x_i, x'_i$ are picked, we have $\Delta_{\tau+1} = \frac{\alpha_\tau}{2}(x_i - x'_i) = v\alpha_\tau$, and the iterative step implies $\Delta_{t+1} = v\theta_{t+1}$, where $\theta_{t+1} = (1 + \alpha_t\beta)\theta_t$ with probability $1 - \frac{1}{n}$ and $\theta_{t+1} = (1 + \alpha_t\beta)\theta_t + \alpha_t$ with probability $\frac{1}{n}$. The above construction then yields:
$$\mathbb{E}_{1:t+1}\big[\|\Delta_{t+1}\| \,\big|\, \Delta_{t_0} \ne 0\big] = \|v\|\Big(\mathbb{E}_{1:t}\big[(1+\alpha_t\beta)\theta_t\big] + \frac{\alpha_t}{n}\Big) = \Big(1 + \frac{a}{0.99\, t}\Big)\,\mathbb{E}_{1:t}\big[\|\Delta_t\| \,\big|\, \Delta_{t_0} \ne 0\big] + \frac{\alpha_t}{n}\|v\|.$$
Now apply Lemma 3 with $x_t = \mathbb{E}[\|\Delta_t\| \mid \Delta_{t_0} \ne 0]$ and $y = \frac{a\|v\|}{0.99\beta n}$. This gives $x_T \ge \frac{a\|v\|}{0.99\beta n}\big(\frac{T}{t_0}\big)^a = \frac{a}{0.99\beta n}\big(\frac{T}{t_0}\big)^a$, since $\|v\| = 1$.
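The polynomial growth rate $(T/t_0)^a$ can be checked by iterating the expectation recursion above with the stated step size $\alpha_t = a/(0.99\beta t)$. A sketch with illustrative constants (the lemma only asserts the bound for sufficiently small $a$, so the constants here are merely an empirical check):

```python
def lower_bound_recursion(T, t0, n, a, beta):
    # Iterate x_{t+1} = (1 + a/(0.99 t)) x_t + alpha_t / n with x_{t0} = 0,
    # where alpha_t = a / (0.99 * beta * t).
    x = 0.0
    for t in range(t0, T):
        x = (1 + a / (0.99 * t)) * x + a / (0.99 * beta * t * n)
    return x

a, beta, n, t0, T = 0.5, 1.0, 10, 1, 10_000
x_T = lower_bound_recursion(T, t0, n, a, beta)
target = (a / (0.99 * beta * n)) * (T / t0) ** a   # Lemma 3's y * (T/t0)^a
```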
Finally, the claimed bound follows by setting the minimum eigenvalue via $\beta = \frac{a}{0.99}$.

Theorem 3. There exists a function $f$ which is non-convex and $\beta$-smooth, twin datasets $S, S'$, and a constant $a$ such that the following holds: if SGD is run with step size $\alpha_t = \frac{a}{0.99\beta t}$ for $1 \le t < T$, and $w_t, w'_t$ are the outputs of SGD on $S$ and $S'$ respectively, and $\Delta_t := w_t - w'_t$, then the divergence of SGD after $T$ rounds ($T > n$) satisfies:
$$\mathbb{E}\|\Delta_T\| \ge \frac{T^a}{3n^{1+a}}.$$

Proof: The proof is based on Lemma 4 plus the idea of a "burn-in" period. We have:
$$\mathbb{E}\|\Delta_T\| = \mathbb{E}[\|\Delta_T\| \mid \Delta_n = 0]\,\mathbb{P}[\Delta_n = 0] + \mathbb{E}[\|\Delta_T\| \mid \Delta_n \ne 0]\,\mathbb{P}[\Delta_n \ne 0] \ge \mathbb{E}[\|\Delta_T\| \mid \Delta_n \ne 0]\,\mathbb{P}[\Delta_n \ne 0] = \Big(1 - \Big(1 - \frac{1}{n}\Big)^n\Big)\frac{T^a}{n^{1+a}}\,\|x_i - x'_i\| \ge \frac{T^a}{3n^{1+a}}\,\|x_i - x'_i\|. \tag{14}$$

Lemma 5 (Hardt et al., 2016). Assume $f$ is $\beta$-smooth and $L$-Lipschitz. Let $w_t, w'_t$ be the outputs of SGD on twin datasets $S, S'$ respectively after $t$ iterations, and let $\Delta_t := w_t - w'_t$ and $\delta_t = \mathbb{E}\|\Delta_t\|$. Running SGD on $f(w; S)$ with step size $\alpha_t = \frac{a}{\beta t}$ satisfies the following conditions:
• The SGD update rule is $(1 + \alpha_t\beta)$-expansive and $2\alpha_t L$-bounded.
• $\mathbb{E}[\|\Delta_t\| \mid \Delta_{t-1}] \le (1 + \alpha_t\beta)\,\|\Delta_{t-1}\| + \frac{2\alpha_t L}{n}$.
• $\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_k} = 0] \le \big(\frac{T}{t_k}\big)^a \frac{2L}{n}$.

Theorem 4 (Permutation). Assume $f$ is $\beta$-smooth and $L$-Lipschitz. Running $T$ ($T > n$) iterations of SGD on $f(w; S)$ with step size $\alpha_t = \frac{a}{\beta t}$, the stability of SGD satisfies:
$$\mathbb{E}\|\Delta_T\| \le \frac{2LT^a}{n^{1+a}}, \qquad \varepsilon_{\mathrm{stab}} \le \frac{2L^2 T^a}{n^{1+a}}.$$

Proof. Let $H = t$ denote the event that the first time SGD picks the differing entry is time $t$. Under permutation sampling, $\mathbb{P}[H > n] = 0$, so:
$$\mathbb{E}\|\Delta_T\| = \mathbb{E}[\|\Delta_T\| \mid H \le n]\,\mathbb{P}[H \le n] + \underbrace{\mathbb{E}[\|\Delta_T\| \mid H > n]\,\mathbb{P}[H > n]}_{=\,0\ \text{(permutation)}} \le \frac{1}{n}\sum_{t=1}^n \mathbb{E}[\|\Delta_T\| \mid H = t] \overset{(*)}{\le} \frac{1}{n}\sum_{t=1}^n \Big(\frac{T}{t}\Big)^a\frac{2L}{n} \le \frac{2LT^a}{n^2}\int_1^n \frac{1}{t^a}\,dt \le \frac{2LT^a}{n^{1+a}}.$$
The inequality $(*)$ is derived by applying Lemma 5.

Lemma 6. Let $w_t, w'_t$ be the outputs of SGD on twin datasets $S, S'$ respectively after $t$ iterations, and let $\Delta_t := w_t - w'_t$. Suppose $t_k = c\, t_{k-1}$. Then the following conditions hold:
• $\mathbb{P}[\Delta_{t_{k-1}} = 0 \mid \Delta_{t_k} \ne 0] \le \frac{n}{n + t_{k-1}}$.
• $\mathbb{P}[\Delta_{t_{k-1}} \ne 0 \mid \Delta_{t_k} \ne 0] \le \frac{1}{c}\big(1 + \frac{t_k}{n}\big)$.
• $\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_k} \ne 0] \le \mathbb{E}[\|\Delta_T\| \mid \Delta_{t_{k-1}} \ne 0]\,\frac{1}{c}\big(1 + \frac{t_k}{n}\big) + \big(\frac{T}{t_{k-1}}\big)^a \frac{2L}{n}$.

Proof. In the proof we use the following inequality, valid for $r \ge 1$:
$$\frac{n - r}{n} \le \Big(1 - \frac{1}{n}\Big)^r \le \frac{n}{n + r}.$$
i):
$$\mathbb{P}[\Delta_{t_{k-1}} = 0 \mid \Delta_{t_k} \ne 0] = \frac{\mathbb{P}[\Delta_{t_{k-1}} = 0,\ \Delta_{t_k} \ne 0]}{\mathbb{P}[\Delta_{t_k} \ne 0]} = \frac{(1 - 1/n)^{t_{k-1}}\big(1 - (1 - 1/n)^{t_k - t_{k-1}}\big)}{1 - (1 - 1/n)^{t_k}} \le (1 - 1/n)^{t_{k-1}} \le \frac{n}{n + t_{k-1}}. \tag{17}$$
ii):
$$\mathbb{P}[\Delta_{t_{k-1}} \ne 0 \mid \Delta_{t_k} \ne 0] = \frac{\mathbb{P}[\Delta_{t_k} \ne 0,\ \Delta_{t_{k-1}} \ne 0]}{\mathbb{P}[\Delta_{t_k} \ne 0]} = \frac{\mathbb{P}[\Delta_{t_{k-1}} \ne 0]}{\mathbb{P}[\Delta_{t_k} \ne 0]} = \frac{1 - (1 - 1/n)^{t_{k-1}}}{1 - (1 - 1/n)^{t_k}} \le \frac{1 - \frac{n}{n + t_{k-1}}}{1 - \frac{n - t_k}{n}} \le \frac{t_{k-1}}{t_k}\Big(1 + \frac{t_k}{n}\Big) = \frac{1}{c}\Big(1 + \frac{t_k}{n}\Big).$$
iii): Applying i) and ii) in the decomposition of $\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_k} \ne 0]$, we have
$$\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_k} \ne 0] \le \mathbb{E}[\|\Delta_T\| \mid \Delta_{t_{k-1}} \ne 0]\,\mathbb{P}[\Delta_{t_{k-1}} \ne 0 \mid \Delta_{t_k} \ne 0] + \mathbb{E}[\|\Delta_T\| \mid \Delta_{t_{k-1}} = 0]\,\mathbb{P}[\Delta_{t_{k-1}} = 0 \mid \Delta_{t_k} \ne 0] \le \frac{1}{c}\Big(1 + \frac{t_k}{n}\Big)\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_{k-1}} \ne 0] + \Big(\frac{T}{t_{k-1}}\Big)^a\frac{2L}{n + t_{k-1}},$$
where the last inequality uses the fact that $\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_{k-1}} = 0] \le \big(\frac{T}{t_{k-1}}\big)^a \frac{2L}{n}$.

Theorem 4b (Uniformly Independent). Assume $f$ is $\beta$-smooth and $L$-Lipschitz. Running $T$ ($T > n$) iterations of SGD on $f(w; S)$ with step size $\alpha_t = \frac{a}{\beta t}$, the stability of SGD satisfies:
$$\mathbb{E}\|\Delta_T\| \le \frac{16\log(n)\,L\,T^a}{n^{1+a}}, \qquad \varepsilon_{\mathrm{stab}} \le \frac{16\log(n)\,L^2\,T^a}{n^{1+a}}.$$

Proof. We first decompose $\|\Delta_T\|$ by selecting $t_k = n$:
$$\mathbb{E}\|\Delta_T\| = \underbrace{\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_k} = 0]\,\mathbb{P}[\Delta_{t_k} = 0]}_{\text{Term 1}\ \le\ 2LT^a/n^{1+a}\ \text{(Lemma 5)}} + \underbrace{\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_k} \ne 0]\,\mathbb{P}[\Delta_{t_k} \ne 0]}_{\text{Term 2}\ \le\ 11L\log(n)T^a/n^{1+a}}.$$
Term 1 is easily bounded by applying Lemma 5 with $\alpha_t = \frac{a}{\beta t}$. To bound Term 2, plug in $\mathbb{P}[\Delta_{t_k} \ne 0] = 1 - (1 - 1/n)^{t_k} \le \frac{t_k}{n}$ and recursively apply point (iii) of Lemma 6, setting $t_{i+1} = c\, t_i$.
We get:
$$\mathbb{E}[\|\Delta_T\| \mid \Delta_{t_k} \ne 0]\,\mathbb{P}[\Delta_{t_k} \ne 0] \le \frac{2L}{n}\cdot\frac{t_k}{n}\sum_{i=1}^{k-1}\Big(\frac{T}{t_i}\Big)^a \frac{n}{n + t_i}\prod_{\tau=i+1}^{k-1}\Big(1 + \frac{t_{\tau+1}}{n}\Big)\frac{t_\tau}{t_{\tau+1}} \le \frac{2L}{n}\sum_{i=1}^{k-1}\Big(\frac{T}{t_i}\Big)^a \frac{t_{i+1}}{n + t_i}\exp\Big(\sum_{\tau=i+1}^{k-1}\frac{t_{\tau+1}}{n}\Big) \le \frac{2cL}{n}\exp\Big(\frac{c}{c-1}\Big)\sum_{i=1}^{k-1}\Big(\frac{T}{t_i}\Big)^a\frac{t_i}{n + t_i} \le \frac{2cLT^a}{n}\exp\Big(\frac{c}{c-1}\Big)\sum_{i=1}^{k-1}\frac{t_i^{1-a}}{n} \le \frac{2L\log(n)T^a}{n^{1+a}}\cdot\frac{c^a}{\log c}\exp\Big(\frac{c}{c-1}\Big) \le \frac{11\log(n)LT^a}{n^{1+a}}.$$
In the second and third inequalities we use $1 + x \le \exp(x)$ and $t_{i+1} = c\,t_i$ to get $\prod_{\tau=i+1}^{k-1}\big(1 + \frac{t_{\tau+1}}{n}\big) \le \exp\big(\sum_{\tau=i+1}^{k-1}\frac{t_{\tau+1}}{n}\big) \le \exp\big(\frac{c}{c-1}\big)$. The last inequality is derived by picking $c = 4$.

Theorem 5. Let $w_t, w'_t$ be the outputs of SGD on twin datasets $S, S'$, and let $\Delta_t := w_t - w'_t$. There exists a function $f$ which is non-convex and $\beta$-smooth, twin sets $S, S'$, and constants $a, \gamma$ such that the divergence of SGD after $T$ rounds ($T > n$) using constant step size $\alpha = \frac{a}{0.99\gamma}$ satisfies:
$$\mathbb{E}\|\Delta_T\| \ge \frac{1}{n^2}e^{aT/2}.$$

Proof: The proof is similar to that of Theorem 3. Since $\Delta_t \in \mathrm{Span}\{x_i - x'_i\}$, we have:
$$\mathbb{E}\|\Delta_{t+1}\| \ge \Big(1 - \frac{1}{n}\Big)(1 + \alpha_t\beta)\,\mathbb{E}\|\Delta_t\| + \frac{\alpha_t}{n}\|x_i - x'_i\|.$$
Let $t_0$ be the hitting time with $\|\Delta_{t_0}\| > 0$ and $\Delta_{t_0 - 1} = 0$; then $\|\Delta_T\| \ge \frac{\|x_i - x'_i\|}{3n}e^{a(T - t_0)/2}$. Therefore
$$\mathbb{E}\|\Delta_T\| = \mathbb{E}[\|\Delta_T\| \mid \Delta_1 = 0]\,\mathbb{P}[\Delta_1 = 0] + \mathbb{E}[\|\Delta_T\| \mid \Delta_1 \ne 0]\,\mathbb{P}[\Delta_1 \ne 0] \ge \mathbb{E}[\|\Delta_T\| \mid \Delta_1 \ne 0]\,\mathbb{P}[\Delta_1 \ne 0] = \frac{1}{n}\cdot\frac{\|x_i - x'_i\|}{n}e^{aT/2} = \frac{\|x_i - x'_i\|}{n^2}e^{aT/2}.$$

Theorem 6. Given a set $S$ and a $\beta$-smooth, $L$-Lipschitz loss function $f$, suppose SGD with a fixed step size $\alpha \le \frac{\sigma^2}{2\gamma L^2}$ reaches a point $w_0$ which is close to a local minimum $w^*$ in a $(\sigma, \gamma)$-Hessian contractive region, i.e., $\|w_0 - w^*\| \le \sigma$, with a large enough radius so that $\sigma > \frac{12L}{\gamma}$. Then for all $T \ge 1$, $\|w_T - w^*\| \le \sigma$.
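Returning to Theorem 5, the exponential blow-up under a constant step size can be checked by iterating the recursion $\mathbb{E}\|\Delta_{t+1}\| \ge (1 - 1/n)(1 + \alpha\beta)\,\mathbb{E}\|\Delta_t\| + (\alpha/n)\|x_i - x'_i\|$ from its proof. The constants below are illustrative only; $a$ is taken large enough relative to $1/n$ that the per-step growth factor exceeds $e^{a/2}$:

```python
import math

def divergence_constant_step(T, n, alpha, beta, dxx=2.0):
    # Iterate d_{t+1} = (1 - 1/n)(1 + alpha*beta) d_t + (alpha/n) * ||x_i - x'_i||.
    d = 0.0
    for _ in range(T):
        d = (1 - 1 / n) * (1 + alpha * beta) * d + (alpha / n) * dxx
    return d

n, beta, a, T = 10, 1.0, 0.5, 100
alpha = a / beta                     # constant step size, so alpha * beta = a
d_T = divergence_constant_step(T, n, alpha, beta)
lower = math.exp(a * T / 2) / n**2   # the e^{aT/2} / n^2 rate of Theorem 5
```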
Proof: Let $H_{w_t, w_0, \theta} = \frac{1}{n}\sum_{z_j\in S} \nabla^2_w f(\theta w_t + (1 - \theta)w_0, z_j)$. We derive:
$$\|w_{t+1} - w_0\|^2 = \|w_t - w_0 - \alpha\nabla_w f(w_t, z)\|^2 = \|w_t - w_0\|^2 + \alpha^2\|\nabla_w f(w_t, z)\|^2 - 2\alpha\nabla_w f(w_t, z)^\top[w_t - w_0]$$
$$= \|w_t - w_0\|^2 + \alpha^2\|\nabla_w f(w_t, z)\|^2 - 2\alpha\nabla_w f(w_0, S)^\top[w_t - w_0] - 2\alpha[\nabla_w f(w_t, z) - \nabla_w f(w_t, S)]^\top[w_t - w_0] - 2\alpha[\nabla_w f(w_t, S) - \nabla_w f(w_0, S)]^\top[w_t - w_0]$$
$$\le \|w_t - w_0\|^2 + \alpha^2 L^2 - 2\alpha[w_t - w_0]^\top H_{w_t, w_0, \theta}[w_t - w_0] + 6\alpha L\sigma.$$
Due to the fact that $w_t - w_0 \in \mathrm{Span}\{\nabla_w f(w_0, z), \ldots, \nabla_w f(w_{t-1}, z)\}$, the Hessian contractive property gives
$$\|w_t - w_0\|^2 + \alpha^2 L^2 - 2\alpha[w_t - w_0]^\top H_{w_t, w_0, \theta}[w_t - w_0] + 6\alpha L\sigma \le (1 - 2\alpha\gamma)\|w_t - w_0\|^2 + \alpha^2 L^2 + 6\alpha L\sigma.$$
This implies $\|w_t - w_0\| \le \sigma$ for all $t$.
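The containment mechanism of Theorem 6, contraction at rate $\gamma$ toward the minimum dominating a bounded stochastic-gradient perturbation, can be illustrated on a toy quadratic. The constants below are illustrative and are not the theorem's exact thresholds:

```python
import numpy as np

def max_distance_from_minimum(T, alpha, gamma, noise_bound, start_dist, seed=0):
    # SGD-like iteration on f(w) = gamma/2 * ||w||^2 with a bounded random
    # perturbation added to each gradient; returns the farthest the iterates
    # ever get from the minimum w* = 0.
    rng = np.random.default_rng(seed)
    w = np.array([start_dist, 0.0])
    max_dist = np.linalg.norm(w)
    for _ in range(T):
        noise = rng.standard_normal(2)
        noise *= noise_bound / np.linalg.norm(noise)   # perturbation of norm noise_bound
        w = w - alpha * (gamma * w + noise)
        max_dist = max(max_dist, np.linalg.norm(w))
    return max_dist

sigma = 1.0
m = max_distance_from_minimum(T=5000, alpha=0.1, gamma=1.0,
                              noise_bound=0.05, start_dist=0.5 * sigma)
```

With these constants each step contracts the distance by a factor $1 - \alpha\gamma = 0.9$ while adding at most $\alpha \cdot 0.05$, so the iterates never leave the $\sigma$-ball.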

B EMPIRICAL SUPPORT FOR THEOREM 6

In order to empirically support Theorem 6, we ran the following experiment. We started with a fully-trained model $w^*$ and generated random perturbations $w' = w^* + \nu$, where $\nu \sim \mathcal{N}(0, \sigma^2)$. We then trained each $w'$ until convergence, obtaining $w'^*$, and computed the normalized distance $\|w^* - w'^*\|/\|w^*\|$. To match Theorem 6, this distance should be no more than $2\|w^* - w'\|/\|w^*\|$, i.e., the trained weights $w'^*$ after perturbing stay close to $w^*$, as in the conclusion of Theorem 6. For this experiment, we first trained both ResNet18 and DenseNet121 for 200 epochs, with a constant learning rate of 0.01, momentum of 0.9, and batch size of 128. We generated 20 Gaussian perturbations $w'$ of the trained model weights $w^*$ and trained each for 50 more epochs, resulting in weights $w'^*$. We used standard deviations $\sigma \in \{0.01, 0.005\}$ for the Gaussian perturbations, and then computed the normalized differences as explained in the previous paragraph. The results are shown in the "Perturbed SGD Results" table.
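The experiment can be sketched in a few lines. The sketch below substitutes a toy quadratic loss and full-batch gradient descent for the actual networks and SGD, so only the structure (perturb, retrain, compare normalized distances) matches our setup:

```python
import numpy as np

def perturb_retrain_distance(grad_fn, w_star, sigma, steps=500, lr=0.01, seed=0):
    # Perturb trained weights w_star with Gaussian noise, "retrain" by
    # gradient descent, and return the normalized distance to w_star.
    rng = np.random.default_rng(seed)
    w = w_star + sigma * rng.standard_normal(w_star.shape)
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return np.linalg.norm(w - w_star) / np.linalg.norm(w_star)

# Toy loss 0.5 * ||w - c||^2: retraining pulls the perturbed weights back to c.
c = np.array([1.0, -2.0, 0.5])
dist = perturb_retrain_distance(lambda w: w - c, c, sigma=0.01)
```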



While it is an interesting open problem to obtain data-dependent lower bounds by lower bounding the average stability, we construct lower bounds on the worst-case stability. Our lower bounds are therefore general and not data-dependent.



Figure 1: (a) As in Hardt et al. (2016), we measure the normalized Euclidean distance between the parameters of two identical deep models (ResNet18) trained on twin datasets (CIFAR10 with a random image removed). We plot the mean over 35 twin datasets (variance shown). The distance first increases but stabilizes after roughly 20 training epochs. The "all" curve is the distance between the full model parameter vectors; the "conv-i" curve is the distance between the first convolutional layer weights in the i-th "block" of ResNet18. (b) Approximate normalized Hessian contractivity. For both neural networks, the "mean" contractivity value at each epoch is averaged over 100 perturbations and 5 different copies of the network trained on CIFAR10; "mean(min)" is computed similarly, except that the minimum of the 100 approximations after perturbing is taken. Both quantities for both networks provide evidence that SGD converges to an increasingly contractive solution.

Current landscape of stability bounds. [H] indicates results in Hardt et al. (2016).

Inspecting the table shows that the normalized distances after retraining the perturbed models are quite close to the distances before retraining, with low variance. This indicates that the weights of the retrained perturbed models remain well within the range specified by Theorem 6.

Perturbed SGD Results

A PROOFS

Lemma 1 (Dynamics of divergence). Let $f(w; z) = \frac{1}{2}w^\top A w - y\,x^\top w$. Suppose $(x_i - x'_i)/\|x_i - x'_i\|$ is an eigenvector of $A$ with $A(x_i - x'_i) = \lambda_{xx'}(x_i - x'_i)$. Let $\Delta_t := w_t - w'_t$, let $\alpha_t \le \frac{1}{\lambda_{xx'}}$ be the step size of SGD, and let $\Delta_0 = 0$. Suppose one runs SGD on $f(w, S)$ and $f(w, S')$, where $S, S'$ are twin datasets with $x_i^\top x_j = 0$ and $x'^\top_i x_j = 0$ for all $j \ne i$. Then the dynamics of $\Delta_t$ are given by:
$$\Delta_{t+1} = (I - \alpha_t A)\Delta_t \ \text{with prob. } \tfrac{n-1}{n}, \qquad \Delta_{t+1} = (I - \alpha_t A)\Delta_t + \alpha_t(y_i x_i - y'_i x'_i) \ \text{with prob. } \tfrac{1}{n}.$$
Proof: In case the differing entry $z_i, z'_i$ is not picked, the gradient difference of $f(w; z)$ and $f(w'; z)$ is $A\Delta_t$, so the update contracts the divergence by $(I - \alpha_t A)$. Since SGD picks the same entry with probability $\frac{n-1}{n}$ and the differing entry with probability $\frac{1}{n}$, taking the expectation over the choice at step $t$ yields the dynamic
$$\mathbb{E}_{1:t+1}[\Delta_{t+1}] = (I - \alpha_t A)\,\mathbb{E}_{1:t}[\Delta_t] + \frac{\alpha_t}{n}\,(y_i x_i - y'_i x'_i).$$

