SHARPER GENERALIZATION BOUNDS FOR LEARNING WITH GRADIENT-DOMINATED OBJECTIVE FUNCTIONS

Abstract

Stochastic optimization has become the workhorse behind many successful machine learning applications, which has motivated a great deal of theoretical analysis to understand its empirical behavior. By comparison, there is far less work studying its generalization behavior, especially in a non-convex learning setting. In this paper, we study the generalization behavior of stochastic optimization by leveraging algorithmic stability for learning with β-gradient-dominated objective functions. We develop generalization bounds of the order O(1/(nβ)) plus the convergence rate of the optimization algorithm, where n is the sample size. Our stability analysis significantly improves the existing non-convex analysis by removing the bounded-gradient assumption and implying better generalization bounds. We achieve this improvement by exploiting the smoothness of loss functions instead of the Lipschitz condition used in Charles & Papailiopoulos (2018). We apply our general results to various stochastic optimization algorithms, showing clearly how variance-reduction techniques improve not only training but also generalization. Furthermore, our discussion explains how interpolation helps generalization for highly expressive models.

1. INTRODUCTION

Stochastic optimization has found tremendous applications in training highly expressive machine learning models, including deep neural networks (DNNs) (Bottou et al., 2018), which are ubiquitous in modern learning architectures (LeCun et al., 2015). Oftentimes, the models trained in this way not only achieve very small training errors, or even interpolate the training examples, but also generalize surprisingly well to testing examples (Zhang et al., 2017). While the low training error can be well explained by the over-parametrization of models and the efficiency of the optimization algorithm in identifying a local minimizer (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018), it is still unclear how the highly expressive models also achieve a low testing error (Ma et al., 2018). In light of recent theoretical and empirical studies, it is believed that a joint consideration of the interaction among the optimization algorithm, learning models and training examples is necessary to understand the generalization behavior (Neyshabur et al., 2017; Hardt et al., 2016; Lin et al., 2016).

The generalization error for stochastic optimization typically consists of an optimization error and an estimation error (see, e.g., Bousquet & Bottou (2008)). Optimization errors arise from the suboptimality of the output of the chosen optimization algorithm, while estimation errors refer to the discrepancy between the testing error and the training error at the output model. There is a large amount of literature studying the optimization error (convergence) of stochastic optimization algorithms (Bottou et al., 2018; Orabona, 2014; Karimi et al., 2016; Ying & Zhou, 2017; Liu et al., 2018). In particular, the power of interpolation is clearly justified in boosting the convergence rate of stochastic gradient descent (SGD) (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018). In contrast, there is far less work studying the estimation errors of optimization algorithms.
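As a minimal illustration of the interpolation regime mentioned above (a toy construction of ours, not an example from the paper): a model with enough capacity can fit every training example exactly, driving the training error to zero. Here a degree-(n-1) Lagrange polynomial plays the role of an over-parameterized model interpolating n noisy points.

```python
def lagrange_interpolant(xs, ys):
    """Return the polynomial interpolant through the points (xs[i], ys[i]).

    With n distinct points, the degree-(n-1) Lagrange polynomial has
    enough capacity to fit every training example exactly -- a toy
    analogue of an over-parameterized model in the interpolation regime.
    """
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

# Five noisy "training examples"
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.3]

f = lagrange_interpolant(xs, ys)
train_error = max(abs(f(x) - y) for x, y in zip(xs, ys))
print(train_error)  # essentially 0: the model interpolates the data
```

The point of the toy example is only that zero training error is easy to achieve with enough capacity; whether such an interpolating model also has a low testing error is exactly the question the paper addresses.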
In a seminal paper (Hardt et al., 2016), the fundamental concept of algorithmic stability was used to study the generalization behavior of SGD, which was further improved and extended in Charles & Papailiopoulos (2018); Zhou et al. (2018b); Yuan et al. (2019); Kuzborskij & Lampert (2018). However, these results are still not quite satisfactory in the following three aspects. Firstly, the existing stability bounds in non-convex learning require very small step sizes (Hardt et al., 2016) and yield suboptimal generalization bounds (Yuan et al., 2019; Charles & Papailiopoulos, 2018; Zhou et al., 2018b). Secondly, the majority of the existing work has focused on functions with a uniform Lipschitz constant, which can be very large in practical models such as DNNs, if not infinite (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Charles & Papailiopoulos, 2018; Kuzborskij & Lampert, 2018). Thirdly, the existing stability analysis fails to explain how highly expressive models still generalize in an interpolation setting, which is observed for overparameterized DNNs (Arora et al., 2019; Brutzkus et al., 2017; Bassily et al., 2018; Belkin et al., 2019). In this paper, we attempt to address the above three issues using novel stability analysis approaches. Our main contributions are summarized as follows.

1. We develop general stability and generalization bounds for any learning algorithm optimizing (non-convex) β-gradient-dominated objectives. Specifically, we show that the excess generalization error is bounded by O(1/(nβ)) plus the convergence rate of the algorithm, where n is the sample size. This general theorem implies that overfitting will never happen in this case, and that generalization always improves as we increase the training accuracy, due to an implicit regularization effect of the gradient dominance condition. In particular, we show that interpolation actually improves generalization for highly expressive models.
In contrast to the existing discussions based on either hypothesis stability or uniform stability, which imply at best a bound of O(1/√(nβ)), our main idea is to consider a weaker on-average stability measure, which allows us to replace the uniform Lipschitz constant in Hardt et al. (2016); Kuzborskij & Lampert (2018); Charles & Papailiopoulos (2018) with the training error of the best model.

2. We apply our general results to various stochastic optimization algorithms and highlight the advantages over existing generalization analyses. For example, we derive an exponential convergence of testing errors for SGD in an interpolation setting, which complements the exponential convergence of optimization errors (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018) and extends the existing results (Pillaud-Vivien et al., 2018; Nitanda & Suzuki, 2019) from a strongly-convex setting to a non-convex setting. In particular, we show that stochastic variance-reduced optimization outperforms SGD by achieving a significantly faster convergence of testing errors, while this advantage was previously shown only in terms of optimization errors (Reddi et al., 2016; Lei et al., 2017; Nguyen et al., 2017; Zhou et al., 2018a; Wang et al., 2019).
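The variance-reduction idea behind the algorithms discussed above can be illustrated with a self-contained toy sketch (our construction, not the paper's analysis): an SVRG-style method periodically computes a full gradient at a snapshot point and uses it to correct each stochastic gradient, so that the noise of the gradient estimate shrinks as the iterates approach the snapshot. All function and parameter names below are illustrative.

```python
import random

# Toy finite-sum objective F(w) = (1/n) * sum_i a_i * (w - z_i)^2 / 2,
# a simple smooth problem on which to compare plain SGD with an
# SVRG-style variance-reduced method.

def make_problem(n, seed=0):
    rng = random.Random(seed)
    a = [rng.uniform(0.5, 2.0) for _ in range(n)]
    z = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Minimizer of F: weighted average of the z_i.
    w_star = sum(ai * zi for ai, zi in zip(a, z)) / sum(a)
    return a, z, w_star

def grad_i(a, z, i, w):
    return a[i] * (w - z[i])          # gradient of the i-th component

def full_grad(a, z, w):
    n = len(a)
    return sum(grad_i(a, z, i, w) for i in range(n)) / n

def sgd(a, z, w0, step, iters, rng):
    w, n = w0, len(a)
    for _ in range(iters):
        w -= step * grad_i(a, z, rng.randrange(n), w)
    return w

def svrg(a, z, w0, step, epochs, inner, rng):
    w, n = w0, len(a)
    for _ in range(epochs):
        anchor, mu = w, full_grad(a, z, w)  # snapshot + its full gradient
        for _ in range(inner):
            i = rng.randrange(n)
            # Variance-reduced gradient estimate: unbiased, and its
            # variance vanishes as w approaches the snapshot.
            g = grad_i(a, z, i, w) - grad_i(a, z, i, anchor) + mu
            w -= step * g
    return w

a, z, w_star = make_problem(n=50)
rng = random.Random(1)
w_sgd = sgd(a, z, 5.0, step=0.1, iters=500, rng=rng)
w_svrg = svrg(a, z, 5.0, step=0.1, epochs=10, inner=50, rng=rng)
# With the same constant step size and a comparable gradient budget,
# SGD stalls at a noise floor while SVRG keeps converging.
print(abs(w_sgd - w_star), abs(w_svrg - w_star))
```

The toy run mirrors the qualitative message of contribution 2: with a constant step size, plain SGD's error saturates at a level set by the gradient noise, while the variance-reduced iterates continue to approach the minimizer.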

2. RELATED WORK

Algorithmic Stability. We first review the related work on stability and generalization. Algorithmic stability is a fundamental concept in statistical learning theory (Bousquet & Elisseeff, 2002; Elisseeff et al., 2005), which has a deep connection with learnability (Shalev-Shwartz et al., 2010; Rakhlin et al., 2005). The important concept of uniform stability was introduced in Bousquet & Elisseeff (2002), where the authors showed that empirical risk minimization (ERM) enjoys uniform stability if the objective function is strongly convex. This concept was extended to study randomized algorithms such as bagging and bootstrap (Elisseeff et al., 2005). An interesting trade-off between uniform stability and convergence was developed for iterative optimization algorithms, which was then used to study convergence lower bounds of different algorithms (Chen et al., 2018). While generalization bounds based on stability are often stated in expectation, uniform stability was recently shown to guarantee almost optimal high-probability bounds based on elegant concentration inequalities for weakly-dependent random variables (Maurer, 2017; Feldman & Vondrak, 2019; Bousquet et al., 2020). Beyond the standard classification and regression setting, uniform stability was used very successfully to study transfer learning (Kuzborskij & Lampert, 2018), PAC-Bayesian bounds (London, 2017), private learning (Bassily et al., 2019) and pairwise learning (Lei et al., 2020b). Some other stability measures include uniform argument stability (Liu et al., 2017), hypothesis stability (Bousquet & Elisseeff, 2002), hypothesis set stability (Foster et al., 2019) and on-average stability (Shalev-Shwartz et al., 2010).
An advantage of on-average stability is that it is weaker than uniform stability and can imply better generalization by exploiting either the strong convexity of the objective function (Shalev-Shwartz & Ben-David, 2014, Corollary 13.7) or the more relaxed exp-concavity of loss functions (Koren & Levy, 2015; Gonen & Shalev-Shwartz, 2017). Since the gradient-dominance condition is another relaxation of strong convexity, we use on-average stability to study generalization bounds.
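Since the gradient-dominance condition is central to the discussion above, it may help to state it explicitly. The following is the standard Polyak-Łojasiewicz (PL) form as in, e.g., Karimi et al. (2016); the exact normalization of β here is our assumption and may differ from the paper's definition.

```latex
% F is beta-gradient-dominated (PL with parameter beta > 0) if
\left\| \nabla F(\mathbf{w}) \right\|_2^2
\;\ge\; 2\beta \left( F(\mathbf{w}) - \inf_{\mathbf{v}} F(\mathbf{v}) \right)
\qquad \text{for all } \mathbf{w}.
```

Strong convexity with modulus β implies this inequality, but the converse fails: a standard non-convex function satisfying the PL condition is F(w) = w² + 3 sin²(w) (Karimi et al., 2016), which is why the condition is viewed as a relaxation of strong convexity that still admits strong optimization and, as this paper argues, generalization guarantees.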


