REVISITING THE STABILITY OF STOCHASTIC GRADIENT DESCENT: A TIGHTNESS ANALYSIS

Abstract

The technique of algorithmic stability has been used to capture the generalization power of several learning models, especially those trained with stochastic gradient descent (SGD). This paper investigates the tightness of the algorithmic stability bounds for SGD given by Hardt et al. (2016). We show that the analysis of Hardt et al. (2016) is tight for convex objective functions, but loose for non-convex objective functions. In the non-convex case, we provide a tighter upper bound on the stability (and hence generalization error), and provide evidence that it is asymptotically tight up to a constant factor. However, deep neural networks trained with SGD exhibit much better stability and generalization in practice than what is suggested by these (tight) bounds, namely, linear or exponential degradation with time for SGD with constant step size. We aim towards characterizing deep learning loss functions with good generalization guarantees, despite training using SGD with constant step size. In this vein, we propose the notion of a Hessian Contractive (HC) region, which quantifies the contractivity of regions containing local minima in the neural network loss landscape. We provide empirical evidence that several loss functions exhibit HC characteristics, and provide theoretical evidence that the known tight SGD stability bounds for convex and non-convex loss functions can be circumvented by HC loss functions, thus partially explaining the generalization of deep neural networks.

1. INTRODUCTION

Stochastic gradient descent (SGD) has gained great popularity in solving machine learning optimization problems (Kingma & Ba, 2014; Johnson & Zhang, 2013). SGD leverages the finite-sum structure of the objective function, avoids the expensive computation of exact gradients, and thus provides a feasible and efficient optimization solution in large-scale settings (Bottou, 2012). The convergence and the optimality of SGD have been thoroughly studied (Ge et al., 2015; Rakhlin et al., 2012; Reddi et al., 2018; Zhou & Gu, 2019; Carmon et al., 2019a;b; Shamir & Zhang, 2013). In recent years, new research questions have been raised regarding SGD's impact on a model's generalization power. The seminal work of Hardt et al. (2016) tackled the problem using the algorithmic stability of SGD, i.e., the progressive sensitivity of the trained model w.r.t. the replacement of a single (test) datum in the training set. The stability-based analysis of the generalization gap allows one to bypass classical model capacity theorems (Vapnik, 1998; Koltchinskii & Panchenko, 2000) or weight-based complexity theorems (Neyshabur et al., 2017; Bartlett et al., 2017; Arora et al., 2018). This framework also provides theoretical insights into many phenomena observed in practice, e.g., the "train faster, generalize better" phenomenon, and the power of regularization techniques such as weight decay (Krogh & Hertz, 1992), Dropout (Srivastava et al., 2014), and gradient clipping. Other works have applied the stability analysis to more sophisticated settings such as Stochastic Gradient Langevin Dynamics and momentum SGD (Mou et al., 2018; Chaudhari et al., 2019; Chen et al., 2018). Despite the promises of this stability-based analysis, it remains open whether this framework can explain the strong generalization performance of deep neural networks in practice.
Existing theoretical upper bounds on the stability (and thus, generalization) of SGD (Hardt et al., 2016) are ideal for strongly convex loss functions: the upper bound remains constant even as the number of training iterations increases. However, the same analysis deteriorates significantly when we relax to more general and realistic settings. In particular, for convex (but not strongly convex) and non-convex loss functions, if SGD has constant step size, then the upper bound grows linearly and exponentially, respectively, with the number of training iterations. This bound fails to match the superior generalization performance of deep neural networks, and leads to the following question:

Question 1: Can we find a better stability upper bound for convex or non-convex loss functions?

In this paper, we first address the question above and investigate the tightness of the algorithmic stability analysis for stochastic gradient methods (SGM) proposed by Hardt et al. (2016).

R1. We show in Theorem 1 that the analysis in (Hardt et al., 2016) is tight for convex and smooth objective functions; in other words, there is a convex loss function whose stability grows linearly with the number of training iterations when SGD uses a constant step size (α_t = α).

R2. We show in Theorem 2 that for linear models, the analysis in the convex case can be tightened to show that the stability does not increase with t.

R3. In Theorem 3 we show that the analysis in (Hardt et al., 2016) for decreasing step size (α_t = O(1/t)) is loose for non-convex objective functions, by providing a tighter upper bound on the stability (and hence generalization error).

R4. The bound on the stability of SGD in (Hardt et al., 2016) is obtained by bounding the divergence at time t, defined as δ_t := E||w_t − w_t'||, where w_t is the model trained on a data set S and w_t' is the model trained on a data set S' that differs from S in exactly one sample.
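As a concrete illustration of this divergence (our own toy demo, not the paper's construction), the sketch below tracks δ_t for SGD on a synthetic least-squares problem, running the same random sample path on two training sets S and S' that differ in exactly one example. All data, dimensions, and step sizes are assumptions chosen for the demo.

```python
import numpy as np

# Illustrative estimate of the divergence delta_t = ||w_t - w_t'|| for SGD
# on a least-squares loss, with S and S' differing in a single example.
rng = np.random.default_rng(0)
n, d, T, alpha = 50, 10, 200, 0.01

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# S' replaces example 0 of S
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = rng.normal()

def sgd(Xs, ys, order):
    w, traj = np.zeros(d), []
    for t in range(T):
        i = order[t]
        grad = (Xs[i] @ w - ys[i]) * Xs[i]   # gradient of 0.5*(x_i.w - y_i)^2
        w = w - alpha * grad
        traj.append(w.copy())
    return traj

order = rng.integers(0, n, size=T)           # identical sample path on S and S'
traj1, traj2 = sgd(X, y, order), sgd(X2, y2, order)
delta = [np.linalg.norm(a - b) for a, b in zip(traj1, traj2)]

print(f"delta at t=10:  {delta[9]:.4f}")
print(f"delta at t=199: {delta[-1]:.4f}")
```

The two runs stay identical until the differing example is first sampled, after which the gap δ_t becomes positive; for convex losses the theory bounds its growth linearly in t.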
In Theorem 4 we provide evidence that our new upper bound in the non-convex case is tight, by exhibiting a non-convex loss function whose divergence matches our upper bound.

R5. Although it is not derived formally, the techniques in (Hardt et al., 2016) can be employed to show an exponential upper bound for non-convex loss functions minimized using SGD with constant step size. In Theorem 5, we give evidence that this abysmal upper bound is likely tight for non-convex loss functions, by exhibiting a non-convex loss function for which the divergence δ_t increases exponentially.

Thus, the only functions whose stability provably does not increase with the number of iterations under constant-step-size SGD are strongly convex functions. However, a) it has been empirically observed that for deep neural network losses, the Hessians near local minima are usually low rank (Chaudhari et al., 2017; Yao et al., 2019), and b) neural networks trained with constant-step-size SGD do generalize well in practice (Lin & Jegelka, 2018; Huang et al., 2017; Smith et al., 2017). Combined with our lower bounds for convex and non-convex functions, we seem to hit an obstacle on the way to explaining generalization using the stability framework.

Question 2: What is it that makes constant-step SGD on deep learning loss functions generalize well?

Realizing the limitation of the current state of stability analysis, we investigate whether a stronger-than-convex, but weaker-than-strongly-convex assumption on the loss function can be made, at least near local minima. If we can show algorithmic stability near local minima, we can still establish overall stability using arguments similar to those of (Du et al., 2019; Allen-Zhu et al., 2019). Aiming towards a characterization of loss functions exhibiting good stability, we propose a new condition on the loss near local minima.
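The exponential growth of δ_t behind the non-convex discussion above can be seen in a one-dimensional toy model (our own illustration, not the paper's construction): near a strict local maximum of a non-convex loss f(w) = −(c/2)w², the constant-step gradient update w ← w − α f'(w) = (1 + αc)w multiplies the gap between two nearby trajectories by a fixed factor greater than one at every step.

```python
# Toy demonstration: constant-step gradient descent on the concave piece
# f(w) = -c/2 * w^2 amplifies any gap between trajectories geometrically.
alpha, c = 0.1, 1.0
w, w_prime = 1e-6, 2e-6          # two nearby initializations
deltas = []
for t in range(100):
    w       = w       - alpha * (-c * w)        # f'(w) = -c * w
    w_prime = w_prime - alpha * (-c * w_prime)
    deltas.append(abs(w - w_prime))

growth = deltas[-1] / deltas[0]                 # should track (1 + alpha*c)^99
print(f"gap growth over 99 steps: {growth:.2f}")
print(f"(1 + alpha*c)^99 =        {(1 + alpha * c) ** 99:.2f}")
```

The observed growth factor matches (1 + αc)^t, which is the mechanism behind the exponential stability upper bound for non-convex losses with constant step size.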
This condition, called Hessian Contractive (HC), is slightly stronger than general convexity, but considerably weaker than strong convexity. Formally, the Hessian Contractive condition stipulates that near any local minimum, (1) the function is convex; and (2) a data-dependent Hessian is positive definite in the gradient direction. Theoretically, we show that such a condition is sufficient to guarantee a constant stability bound for SGD with constant step size near local minima, while allowing the Hessian to be low rank. We also provide examples showing that Hessian Contractive is a reasonable condition for several loss functions. Empirically, we verify the Hessian Contractive condition near local minima of the loss while training deep neural networks: we sample points from a neighborhood of the current iterate by adding Gaussian noise and verify the HC condition locally via Hessian-vector product approximations. Summarizing our second set of contributions:

R6. In Observation 1 we show that the family of widely used (convex) linear-model loss functions satisfies the Hessian Contractive condition; a typical example of such a linear-model loss is the regression loss. This observation suggests that Hessian Contractive is a condition satisfied by (potentially many) machine learning loss functions.
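A minimal sketch of this empirical check, using a least-squares loss on synthetic data in place of a deep network (all data and constants here are illustrative assumptions): sample points around a minimizer with Gaussian noise, and test whether g^T H g > 0 along the normalized gradient direction g, estimating the Hessian-vector product H g with finite differences of the gradient.

```python
import numpy as np

# Illustrative check of condition (2): is the Hessian positive definite in
# the gradient direction, at points sampled near a local minimum?
rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad(w):
    return X.T @ (X @ w - y) / n                 # gradient of 0.5/n * ||Xw - y||^2

def hvp(w, v, eps=1e-5):
    return (grad(w + eps * v) - grad(w)) / eps   # finite-difference H @ v

w_star = np.linalg.lstsq(X, y, rcond=None)[0]    # a minimizer of the loss

positive, trials = 0, 20
for _ in range(trials):
    w = w_star + 0.1 * rng.normal(size=d)        # Gaussian neighborhood sample
    g = grad(w)
    g = g / (np.linalg.norm(g) + 1e-12)          # normalized gradient direction
    if g @ hvp(w, g) > 0:
        positive += 1

print(f"g^T H g > 0 at {positive}/{trials} sampled points")
```

For this convex least-squares example the test passes at every sampled point, as expected; the interesting regime in the paper is applying the same check to deep network losses, where the full Hessian may be low rank.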

