REVISITING THE STABILITY OF STOCHASTIC GRADIENT DESCENT: A TIGHTNESS ANALYSIS

Abstract

The technique of algorithmic stability has been used to capture the generalization power of several learning models, especially those trained with stochastic gradient descent (SGD). This paper investigates the tightness of the algorithmic stability bounds for SGD given by Hardt et al. (2016). We show that the analysis of Hardt et al. (2016) is tight for convex objective functions, but loose for non-convex objective functions. In the non-convex case we provide a tighter upper bound on the stability (and hence generalization error), and provide evidence that it is asymptotically tight up to a constant factor. However, deep neural networks trained with SGD exhibit much better stability and generalization in practice than what is suggested by these (tight) bounds, namely, linear or exponential degradation with time for SGD with constant step size. We aim to characterize deep learning loss functions with good generalization guarantees, despite training using SGD with constant step size. In this vein, we propose the notion of a Hessian Contractive (HC) region, which quantifies the contractivity of regions containing local minima in the neural network loss landscape. We provide empirical evidence that several loss functions exhibit HC characteristics, and theoretical evidence that the known tight SGD stability bounds for convex and non-convex loss functions can be circumvented by HC loss functions, thus partially explaining the generalization of deep neural networks.

1. INTRODUCTION

Stochastic gradient descent (SGD) has gained great popularity in solving machine learning optimization problems (Kingma & Ba, 2014; Johnson & Zhang, 2013). SGD leverages the finite-sum structure of the objective function, avoids the expensive computation of exact gradients, and thus provides a feasible and efficient optimization solution in large-scale settings (Bottou, 2012). The convergence and the optimality of SGD have been thoroughly studied (Ge et al., 2015; Rakhlin et al., 2012; Reddi et al., 2018; Zhou & Gu, 2019; Carmon et al., 2019a;b; Shamir & Zhang, 2013). In recent years, new research questions have been raised regarding SGD's impact on a model's generalization power. The seminal work of Hardt et al. (2016) tackled the problem using the algorithmic stability of SGD, i.e., the progressive sensitivity of the trained model w.r.t. the replacement of a single (test) datum in the training set. The stability-based analysis of the generalization gap allows one to bypass classical model capacity theorems (Vapnik, 1998; Koltchinskii & Panchenko, 2000) or weight-based complexity theorems (Neyshabur et al., 2017; Bartlett et al., 2017; Arora et al., 2018). This framework also provides theoretical insights into many phenomena observed in practice, e.g., the "train faster, generalize better" phenomenon, and the power of regularization techniques such as weight decay (Krogh & Hertz, 1992), Dropout (Srivastava et al., 2014), and gradient clipping. Other works have applied the stability analysis to more sophisticated settings such as Stochastic Gradient Langevin Dynamics and momentum SGD (Mou et al., 2018; Chaudhari et al., 2019; Chen et al., 2018). Despite the promises of this stability-based analysis, it remains open whether this framework can explain the strong generalization performance of deep neural networks in practice.
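The quantity underlying this stability analysis can be probed empirically: run SGD with identical randomness on two datasets that differ in a single example, and track how far the resulting parameters drift apart. The following minimal sketch (the logistic-regression setup and all names are illustrative, not from the paper) estimates this parameter divergence:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, steps, lr = 100, 5, 500, 0.1

# Dataset S: random features with labels in {-1, +1}.
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n) * 2 - 1

# Neighboring dataset S': identical to S except one replaced example.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = -y[0]

def sgd_logistic(X, y, order, lr):
    """Run single-sample SGD on the logistic loss with a fixed sample order."""
    w = np.zeros(X.shape[1])
    for i in order:
        margin = y[i] * (X[i] @ w)
        # Gradient of log(1 + exp(-y_i <x_i, w>)) w.r.t. w.
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))
        w -= lr * grad
    return w

# Sharing the sample order across both runs isolates the effect of the
# single replaced data point, mirroring the coupled-trajectory argument
# in stability analyses of SGD.
order = rng.integers(0, n, size=steps)
w1 = sgd_logistic(X, y, order, lr)
w2 = sgd_logistic(X2, y2, order, lr)

divergence = float(np.linalg.norm(w1 - w2))
print(f"||w - w'|| after {steps} steps: {divergence:.4f}")
```

For a Lipschitz loss, this parameter divergence upper-bounds (up to the Lipschitz constant) the change in loss on any test point, which is how stability translates into a generalization bound; repeating the experiment for growing `steps` makes the growth of the divergence with training time directly visible.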
Existing theoretical upper bounds on the stability (and thus, generalization) (Hardt et al., 2016) are ideal for strongly convex loss functions: the upper bound remains constant even as the number of training iterations increases. However, the same bound deteriorates significantly when we relax to more general and realistic settings. In particular, for convex (but not strongly convex) and non-convex loss functions, if

