SHARPER GENERALIZATION BOUNDS FOR LEARNING WITH GRADIENT-DOMINATED OBJECTIVE FUNCTIONS

Abstract

Stochastic optimization has become the workhorse behind many successful machine learning applications, which has motivated a large body of theoretical analysis of its optimization behavior. By comparison, far less work has studied its generalization behavior, especially in non-convex learning settings. In this paper, we study the generalization behavior of stochastic optimization by leveraging algorithmic stability for learning with β-gradient-dominated objective functions. We develop generalization bounds of the order O(1/(nβ)) plus the convergence rate of the optimization algorithm, where n is the sample size. Our stability analysis significantly improves on existing non-convex analyses by removing the bounded-gradient assumption and implying better generalization bounds. We achieve this improvement by exploiting the smoothness of loss functions instead of the Lipschitz condition used in Charles & Papailiopoulos (2018). We apply our general results to various stochastic optimization algorithms, which shows clearly how variance-reduction techniques improve not only training but also generalization. Furthermore, our discussion explains how interpolation helps generalization for highly expressive models.

1. INTRODUCTION

Stochastic optimization has found tremendous applications in training highly expressive machine learning models, including deep neural networks (DNNs) (Bottou et al., 2018), which are ubiquitous in modern learning architectures (LeCun et al., 2015). Oftentimes, the models trained in this way not only have very small training errors, or even interpolate the training examples, but also generalize surprisingly well to testing examples (Zhang et al., 2017). While the low training error can be well explained by the over-parametrization of models and the efficiency of optimization algorithms in identifying a local minimizer (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018), it is still unclear how such highly expressive models also achieve a low testing error (Ma et al., 2018). In light of recent theoretical and empirical studies, it is believed that a joint consideration of the interaction among the optimization algorithm, the learning model, and the training examples is necessary to understand the generalization behavior (Neyshabur et al., 2017; Hardt et al., 2016; Lin et al., 2016). The generalization error for stochastic optimization typically consists of an optimization error and an estimation error (see e.g. Bousquet & Bottou (2008)). Optimization errors arise from the suboptimality of the output of the chosen optimization algorithm, while estimation errors refer to the discrepancy between the testing error and the training error at the output model. There is a large amount of literature studying the optimization error (convergence) of stochastic optimization algorithms (Bottou et al., 2018; Orabona, 2014; Karimi et al., 2016; Ying & Zhou, 2017; Liu et al., 2018). In particular, the power of interpolation is clearly justified in boosting the convergence rate of stochastic gradient descent (SGD) (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018). In contrast, there is far less work studying the estimation errors of optimization algorithms.
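For concreteness, the decomposition alluded to above can be written out in the spirit of Bousquet & Bottou (2008); the notation here is ours (not necessarily the paper's), with F the population risk, F_S the empirical risk on sample S, A(S) the algorithm's output, and w* a population risk minimizer. We also record one common statement of the β-gradient-dominance (Polyak-Łojasiewicz) condition referenced in the abstract:

```latex
% Excess risk of the algorithm's output A(S), split into estimation and
% optimization contributions (the last term is zero in expectation over S):
\begin{align*}
F(A(S)) - F(w^*)
  = \underbrace{F(A(S)) - F_S(A(S))}_{\text{estimation error}}
  + \underbrace{F_S(A(S)) - F_S(w^*)}_{\text{optimization error}}
  + \underbrace{F_S(w^*) - F(w^*)}_{\text{zero in expectation}} .
\end{align*}

% One common formulation of beta-gradient dominance (PL condition):
% larger beta means an easier problem, consistent with an O(1/(n*beta)) bound.
\begin{equation*}
F(w) - \inf_{w'} F(w') \;\le\; \frac{1}{2\beta}\,\|\nabla F(w)\|^2
\qquad \text{for all } w .
\end{equation*}
```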
In a seminal paper (Hardt et al., 2016), the fundamental concept of algorithmic stability was used to study the generalization behavior of SGD, which was further improved and extended in Charles & Papailiopoulos (2018); Zhou et al. (2018b); Yuan et al. (2019); Kuzborskij & Lampert (2018).
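The idea behind algorithmic stability can be illustrated with a toy numerical sketch (ours, not from the paper): train SGD on two datasets S and S' that differ in a single example, and measure how far apart the resulting parameters end up. A stable algorithm produces nearby outputs, which is what drives small estimation error. All names and constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)


def sgd_least_squares(X, y, lr=0.01, epochs=20, seed=1):
    """Run SGD on the least-squares loss with a fixed per-seed sampling order."""
    order_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):
            # Gradient of the single-example loss (1/2)(x_i^T w - y_i)^2.
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w


# Build a neighboring dataset S' by replacing one example of S.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.standard_normal(d)
y2[0] = X2[0] @ w_true + 0.1 * rng.standard_normal()

# Same algorithm, same sampling order, datasets differing in one point.
w_S = sgd_least_squares(X, y)
w_Sp = sgd_least_squares(X2, y2)
print("parameter distance:", np.linalg.norm(w_S - w_Sp))
```

The printed distance is small relative to the scale of the parameters, reflecting that perturbing one of the n examples moves the output only slightly; stability arguments turn this kind of sensitivity bound into a generalization bound.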

