SHARPER GENERALIZATION BOUNDS FOR LEARNING WITH GRADIENT-DOMINATED OBJECTIVE FUNCTIONS

Abstract

Stochastic optimization has become the workhorse behind many successful machine learning applications, which has motivated a large body of theoretical analysis to understand its empirical behavior. By comparison, there is far less work studying its generalization behavior, especially in a non-convex learning setting. In this paper, we study the generalization behavior of stochastic optimization by leveraging algorithmic stability for learning with β-gradient-dominated objective functions. We develop generalization bounds of the order O(1/(nβ)) plus the convergence rate of the optimization algorithm, where n is the sample size. Our stability analysis significantly improves the existing non-convex analysis by removing the bounded gradient assumption and implying better generalization bounds. We achieve this improvement by exploiting the smoothness of loss functions instead of the Lipschitz condition in Charles & Papailiopoulos (2018). We apply our general results to various stochastic optimization algorithms, which shows clearly how variance-reduction techniques improve not only training but also generalization. Furthermore, our discussion explains how interpolation helps generalization for highly expressive models.

1. INTRODUCTION

Stochastic optimization has found tremendous applications in training highly expressive machine learning models, including deep neural networks (DNNs) (Bottou et al., 2018), which are ubiquitous in modern learning architectures (LeCun et al., 2015). Oftentimes, the models trained in this way not only have very small training errors, or even interpolate the training examples, but also generalize surprisingly well to testing examples (Zhang et al., 2017). While the low training error can be well explained by the over-parametrization of models and the efficiency of the optimization algorithm in identifying a local minimizer (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018), it is still unclear how highly expressive models also achieve a low testing error (Ma et al., 2018). In light of recent theoretical and empirical studies, it is believed that a joint consideration of the interaction among the optimization algorithm, the learning model and the training examples is necessary to understand the generalization behavior (Neyshabur et al., 2017; Hardt et al., 2016; Lin et al., 2016). The generalization error for stochastic optimization typically consists of an optimization error and an estimation error (see e.g. Bousquet & Bottou (2008)). Optimization errors arise from the suboptimality of the output of the chosen optimization algorithm, while estimation errors refer to the discrepancy between the testing error and the training error at the output model. There is a large amount of literature studying the optimization error (convergence) of stochastic optimization algorithms (Bottou et al., 2018; Orabona, 2014; Karimi et al., 2016; Ying & Zhou, 2017; Liu et al., 2018). In particular, the power of interpolation is clearly justified in boosting the convergence rate of stochastic gradient descent (SGD) (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018). In contrast, there is far less work studying the estimation errors of optimization algorithms.
In a seminal paper (Hardt et al., 2016), the fundamental concept of algorithmic stability was used to study the generalization behavior of SGD, which was further improved and extended in Charles & Papailiopoulos (2018); Zhou et al. (2018b); Yuan et al. (2019); Kuzborskij & Lampert (2018). However, these results are still not quite satisfactory in the following three aspects. Firstly, the existing stability bounds in non-convex learning require very small step sizes (Hardt et al., 2016) and yield suboptimal generalization bounds (Yuan et al., 2019; Charles & Papailiopoulos, 2018; Zhou et al., 2018b). Secondly, the majority of the existing work has focused on functions with a uniform Lipschitz constant, which can be very large, if not infinite, in practical models such as DNNs (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Charles & Papailiopoulos, 2018; Kuzborskij & Lampert, 2018). Thirdly, the existing stability analysis fails to explain how highly expressive models still generalize in an interpolation setting, as observed for overparameterized DNNs (Arora et al., 2019; Brutzkus et al., 2017; Bassily et al., 2018; Belkin et al., 2019). In this paper, we attempt to address the above three issues using novel stability analysis approaches. Our main contributions are summarized as follows. 1. We develop general stability and generalization bounds for any learning algorithm optimizing (non-convex) β-gradient-dominated objectives. Specifically, we show that the excess generalization error is bounded by O(1/(nβ)) plus the convergence rate of the algorithm, where n is the sample size. This general theorem implies that overfitting will never happen in this case, and that generalization always improves as we increase the training accuracy, due to an implicit regularization effect of the gradient dominance condition. In particular, we show that interpolation actually improves generalization for highly expressive models.
In contrast to the existing discussions based on either hypothesis stability or uniform stability, which imply at best a bound of O(1/√(nβ)), our main idea is to consider a weaker on-average stability measure, which allows us to replace the uniform Lipschitz constant in Hardt et al. (2016); Kuzborskij & Lampert (2018); Charles & Papailiopoulos (2018) with the training error of the best model. 2. We apply our general results to various stochastic optimization algorithms, and highlight the advantage over the existing generalization analysis. For example, we derive an exponential convergence of testing errors for SGD in an interpolation setting, which complements the exponential convergence of optimization errors (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018) and extends the existing results (Pillaud-Vivien et al., 2018; Nitanda & Suzuki, 2019) from a strongly convex setting to a non-convex setting. In particular, we show that stochastic variance-reduced optimization outperforms SGD by achieving a significantly faster convergence of testing errors, while this advantage is only shown in terms of optimization errors in the literature (Reddi et al., 2016; Lei et al., 2017; Nguyen et al., 2017; Zhou et al., 2018a; Wang et al., 2019).

2. RELATED WORK

Algorithmic Stability. We first review the related work on stability and generalization. Algorithmic stability is a fundamental concept in statistical learning theory (Bousquet & Elisseeff, 2002; Elisseeff et al., 2005), which has a deep connection with learnability (Shalev-Shwartz et al., 2010; Rakhlin et al., 2005). The important concept of uniform stability was introduced in Bousquet & Elisseeff (2002), where the authors showed that empirical risk minimization (ERM) enjoys uniform stability if the objective function is strongly convex. This concept was extended to study randomized algorithms such as bagging and bootstrap (Elisseeff et al., 2005). An interesting trade-off between uniform stability and convergence was developed for iterative optimization algorithms, which was then used to study convergence lower bounds of different algorithms (Chen et al., 2018). While generalization bounds based on stability are often stated in expectation, uniform stability was recently shown to guarantee almost optimal high-probability bounds based on elegant concentration inequalities for weakly-dependent random variables (Maurer, 2017; Feldman & Vondrak, 2019; Bousquet et al., 2020). Beyond the standard classification and regression setting, uniform stability was used very successfully to study transfer learning (Kuzborskij & Lampert, 2018), PAC-Bayesian bounds (London, 2017), private learning (Bassily et al., 2019) and pairwise learning (Lei et al., 2020b). Some other stability measures include uniform argument stability (Liu et al., 2017), hypothesis stability (Bousquet & Elisseeff, 2002), hypothesis set stability (Foster et al., 2019) and on-average stability (Shalev-Shwartz et al., 2010).
An advantage of on-average stability is that it is weaker than uniform stability and can imply better generalization by exploiting either the strong convexity of the objective function (Shalev-Shwartz & Ben-David, 2014, Corollary 13.7) or the more relaxed exp-concavity of loss functions (Koren & Levy, 2015; Gonen & Shalev-Shwartz, 2017). Since the gradient-dominance condition is another relaxed extension of strong convexity, we use on-average stability to study generalization bounds.

Generalization analysis. We now review related work on generalization analysis for stochastic optimization. In a seminal paper (Hardt et al., 2016), the authors used the nonexpansiveness of the gradient mapping to develop uniform stability bounds for SGD applied to convex, strongly convex and even non-convex objective functions. This inspired some interesting work on stochastic optimization. An interesting data-dependent stability bound was developed for SGD, a nice property of which is that it shows how the initialization affects generalization (Kuzborskij & Lampert, 2018). These stability bounds were integrated into a PAC-Bayesian analysis of SGD, yielding generalization bounds for arbitrary posterior distributions (London, 2017). Almost optimal generalization bounds were developed for differentially private stochastic convex optimization (Bassily et al., 2019). The on-average variance of stochastic gradients was used to refine the generalization analysis of SGD (Hardt et al., 2016) in non-convex optimization (Zhou et al., 2018b). Uniform stability was also studied for SGD implemented in a stagewise manner (Yuan et al., 2019) and for stochastic gradient Langevin dynamics in a non-convex setting (Li et al., 2020; Mou et al., 2018). Very recently, the discussions in Hardt et al. (2016) were extended to tackle non-smooth (Lei & Ying, 2020; Bassily et al., 2020) and non-Lipschitz functions (Lei & Ying, 2020).
The most related work is Charles & Papailiopoulos (2018), where some general hypothesis stability bounds were developed for learning algorithms that converge to global optima. A very interesting point is that their bounds depend only on the convergence of the algorithm to a global minimum and the geometry of loss functions around the global minimum. However, their discussion implies at best the slow generalization bound O(1/√(nβ)) for β-gradient-dominated objective functions, and cannot explain the benefit of low optimization errors in helping generalization. The underlying reason is that they used pointwise hypothesis stability and did not consider the smoothness of loss functions. We aim to improve these results by leveraging the weaker on-average stability and the smoothness of loss functions. Other than the stability approach, there is interesting generalization analysis of SGD based on either a uniform convergence approach (Lin et al., 2016), an integral operator approach (Lin & Rosasco, 2017; Ying & Pontil, 2008; Dieuleveut & Bach, 2016; Dieuleveut et al., 2017; Mücke et al., 2019) or an information-theoretic approach (Xu & Raginsky, 2017; Negrea et al., 2019; Bu et al., 2020).

3. MAIN RESULTS

Let ρ be a probability measure defined on a sample space Z = X × Y with X ⊆ R^d and Y ⊆ R, from which a training dataset S = {z_1, . . . , z_n} is drawn independently and identically. The aim is to find a good model w from a model parameter space W based on the training dataset S. The performance of a prescribed model w on a single example z can be measured by a nonnegative loss function f(w; z), where f : W × Z → R_+. In machine learning we often apply a (possibly randomized) algorithm A : ∪_n Z^n → W to S to produce an output model A(S) ∈ W. Oftentimes, the constructed model w has a small empirical risk F_S(w) = (1/n)∑_{i=1}^n f(w; z_i). However, we are mostly interested in the generalization performance of a model w on testing examples, measured by the population (true) risk F(w) = E_z[f(w; z)], where E_z denotes the expectation with respect to (w.r.t.) z. The gap E_{S,A}[F(A(S)) − F_S(A(S))] between the population risk and the empirical risk is called the estimation error, which is due to the approximation of ρ by sampling. Here E_A denotes the expectation w.r.t. the randomness of the algorithm A. For example, if A is SGD, then E_A denotes the expectation w.r.t. the random indices of training examples selected for the gradient computation. A powerful tool to study the estimation error is algorithmic stability (Bousquet & Elisseeff, 2002; Elisseeff et al., 2005; Shalev-Shwartz et al., 2010; Hardt et al., 2016), which measures the sensitivity of the algorithm's output w.r.t. the perturbation of a training dataset. Below we give formal definitions of stability measures, whose connection to generalization is established in Theorem A.1.

Definition 1 (Uniform Stability). A randomized algorithm A has uniform stability ε if for all datasets S, S̃ ∈ Z^n that differ in at most one example, we have sup_z E_A[f(A(S); z) − f(A(S̃); z)] ≤ ε.

The following on-average stability is similar to the average-RO stability in Shalev-Shwartz et al. (2010).
The difference is that we do not use an absolute value. For m ∈ N, we denote [m] = {1, . . . , m}.

Definition 2 (On-average Stability). Let S = {z_1, . . . , z_n} and S̃ = {z̃_1, . . . , z̃_n} be drawn independently from ρ. For each i ∈ [n], denote S^(i) = {z_1, . . . , z_{i−1}, z̃_i, z_{i+1}, . . . , z_n}. We say an algorithm A has on-average stability ε if (1/n)∑_{i=1}^n E_{S,S̃,A}[f(A(S^(i)); z_i) − f(A(S); z_i)] ≤ ε.

In this paper, we are interested in the excess generalization error F(A(S)) − F(w*), where w* ∈ arg min_{w∈W} F(w) is the best model with the least testing error (population risk). For this purpose, we introduce some basic assumptions. A basic assumption in non-convex learning is the smoothness of loss functions (Ghadimi & Lan, 2013; Karimi et al., 2016), meaning that the gradients are Lipschitz continuous. Let ‖·‖₂ denote the Euclidean norm and ∇ denote the gradient operator.

Assumption 1 (Smoothness Assumption). We assume that for all z ∈ Z, the differentiable function w ↦ f(w; z) is L-smooth, i.e., ‖∇f(w; z) − ∇f(w′; z)‖₂ ≤ L‖w − w′‖₂ for all w, w′ ∈ W.

Another assumption is the Polyak-Lojasiewicz (PL) condition on the objective function, which is common in non-convex optimization (Zhou et al., 2018b; Reddi et al., 2016; Karimi et al., 2016; Wang et al., 2019; Lei et al., 2017), and was shown to hold for deep (linear) and shallow neural networks (Hardt & Ma, 2016; Charles & Papailiopoulos, 2018; Li & Yuan, 2017).

Assumption 2 (Polyak-Lojasiewicz Condition). Denote F̂_S = inf_{w′∈W} F_S(w′). We assume F_S satisfies the PL (gradient-dominance) condition in expectation with parameter β > 0, i.e.,

E_S[F_S(w) − F̂_S] ≤ (1/(2β)) E_S[‖∇F_S(w)‖₂²], ∀w ∈ W.  (3.1)

It is worth mentioning that our results in this section continue to hold if the global PL condition is relaxed to a local PL condition, i.e., (3.1) holds for w in a neighborhood of the minimizer of F_S.
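As a quick numerical sanity check (our own illustration, not part of the paper's analysis), the PL inequality (3.1) can be verified for a least-squares objective, where it holds with β equal to the smallest eigenvalue of the empirical covariance matrix; all problem sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)             # realizable labels, so inf_w F_S(w) = 0

def F(w):                              # empirical risk of least squares
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_F(w):                         # its gradient
    return X.T @ (X @ w - y) / n

Sigma_S = X.T @ X / n                  # empirical covariance matrix
beta = np.linalg.eigvalsh(Sigma_S)[0]  # PL parameter = smallest eigenvalue
F_min = F(np.linalg.lstsq(X, y, rcond=None)[0])

# check (3.1): F_S(w) - F_S_hat <= ||grad F_S(w)||^2 / (2 beta) at random points
for _ in range(100):
    w = rng.normal(size=d)
    assert F(w) - F_min <= np.linalg.norm(grad_F(w)) ** 2 / (2 * beta) + 1e-8
print("PL condition verified with beta =", beta)
```

For this quadratic objective the inequality is in fact tight along the eigendirection of the smallest eigenvalue.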
The existing stability analysis often imposes the bounded gradient assumption below (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Charles & Papailiopoulos, 2018; Yuan et al., 2019; Kuzborskij & Lampert, 2018). Indeed, the resulting stability bounds depend on the uniform Lipschitz constant G (see Eq. (3.4)), which can be prohibitively large in practical models, e.g., DNNs, or even infinite, e.g., least squares regression on an unbounded domain.

Assumption 3 (Bounded Gradient Assumption). We assume ‖∇f(w; z)‖₂ ≤ G for all w ∈ W, z ∈ Z and a constant G > 0.

Our main result, to be proved in Appendix B, removes Assumption 3 and replaces the uniform Lipschitz constant G by the minimal empirical risk F̂_S, which is significantly smaller than the Lipschitz constant. Note that the assumption L ≤ nβ/4 is mild, and the previous generalization bounds become vacuous as O(1) (Yuan et al., 2019; Charles & Papailiopoulos, 2018) if this assumption is violated.

Theorem 1 (Main Theorem). Let Assumptions 1, 2 hold and w_S = A(S). If L ≤ nβ/4, then

E[F(w_S) − F̂_S] ≤ 16L E[F̂_S]/(nβ) + L E[F_S(w_S) − F̂_S]/(2β).  (3.2)

An important implication is as follows. Since E[F̂_S] ≤ E[F_S(w*)] = F(w*) and F̂_S ≤ F_S(w_S), Eq. (3.2) implies an upper bound of the following order on both the excess generalization error E[F(w_S)] − F(w*) and the estimation error E[F(w_S) − F_S(w_S)]:

O(1/(nβ) + E[F_S(w_S) − F̂_S]/β).  (3.3)

The above two terms can be explained as follows. The term O(1/(nβ)) reflects the intrinsic complexity of the problem, while E[F_S(w_S) − F̂_S] is called the optimization error. An interesting observation is that the overfitting phenomenon would never happen for learning under the PL condition (analogous to learning with strongly convex objectives, where the global minimizer generalizes well (Bousquet & Elisseeff, 2002)). Indeed, if the optimization algorithm finds more and more accurate solutions, it achieves the limiting generalization bound O(1/(nβ)). This conveys an important message: optimization can be beneficial to generalization.
This seemingly counterintuitive phenomenon is due to the implicit regularization enforced by the PL condition (analogous to the strong convexity condition). Another notable property is that Theorem 1 applies to any algorithm. We can plug any known optimization error bound into it to immediately get a generalization bound.

Remark 1. We show that our result significantly improves the existing stability analysis. The work (Charles & Papailiopoulos, 2018) showed that the pointwise hypothesis stability is controlled by 2G²/(nβ) + 2√2 G (E[F_S(w_S) − F̂_S]/β)^{1/2}, which, together with the connection between stability and generalization (cf. (A.1)), implies with probability 1 − δ that

F(w_S) ≤ F_S(w_S) + ( M²/(2nδ) + 24MG²/(nβδ) + 24MG (2E[F_S(w_S) − F̂_S])^{1/2}/(δ√β) )^{1/2}.  (3.4)

The above bound requires the bounded gradient assumption ‖∇f(w; z)‖₂ ≤ G and the bounded loss assumption 0 ≤ f(w; z) ≤ M for all w ∈ W and z ∈ Z, both of which are removed in our generalization analysis. Furthermore, our generalization bound significantly improves (3.4). Indeed, assume E[F_S(w_S) − F̂_S] ≤ ε²β for some ε > 0; then (3.3) implies

E[F(w_S)] = E[F_S(w_S)] + O(1/(nβ) + ε²),  (3.5)

while (3.4) becomes F(w_S) = F_S(w_S) + O(1/√(nβ) + √ε). To achieve the generalization guarantee O(1/√(nβ)), the above bound requires the optimization accuracy ε = O(1/(nβ)), while our bound (3.5) only requires the accuracy ε = O(1/√(nβ)) but gets the significantly better generalization bound 1/(nβ). We actually develop a better stability bound. Specifically, the pointwise hypothesis stability is bounded by O(1/(nβ) + ε) in Charles & Papailiopoulos (2018), while we show that the on-average stability is bounded by O(1/(nβ) + ε²), which is significantly tighter if 1/(nβ) ≤ ε ≤ 1 (ignoring constant factors). It should be mentioned that Charles & Papailiopoulos (2018) did not impose a smoothness assumption. However, the smoothness assumption is widely used in non-convex optimization to derive meaningful rates (Ghadimi & Lan, 2013).
As compared to the probabilistic bounds in Charles & Papailiopoulos (2018), our bounds are stated in expectation. The extension to high-probability bounds would lead to an additional O(1/√n) term (Feldman & Vondrak, 2019).

Remark 2 (Bounded gradient assumption). Very recently, the bounded gradient assumption was also removed in the stability analysis of Lei & Ying (2020). However, their analysis considered SGD applied to convex loss functions. As a comparison, we study stability and generalization in a non-convex learning setting, and our analysis applies to any stochastic optimization algorithm.

Remark 3. If A is ERM, Theorem 1 immediately implies E[F(w_S) − F̂_S] ≤ (16L/(nβ)) E[F̂_S]. If F_S is β-strongly convex and L < nβ/2, it was shown for ERM that E[F(w_S) − F̂_S] ≤ (48L/(nβ)) E[F̂_S] (Shalev-Shwartz & Ben-David, 2014, Corollary 13.7). Their result is extended here from a strongly convex setting to a gradient-dominated setting, and from ERM in particular to any algorithm. As a direct corollary, we can derive the following optimistic bound in the interpolation setting, which is the most intriguing case for over-parameterized or highly expressive DNN models.

Corollary 2. Let Assumptions 1, 2 hold and w_S = A(S). If E[F̂_S] = 0 and L < nβ/2, then E[F(w_S)] ≤ (L/(2β)) E[F_S(w_S)].

Remark 4. Corollary 2 shows a benefit of interpolation in boosting generalization: it yields a generalization bound O(ε) for any ε > 0 if we minimize F_S sufficiently well. This benefit cannot be explained by the existing discussions (Hardt et al., 2016; Charles & Papailiopoulos, 2018), as they imply the same generalization bound O(1/√(nβ)) in the interpolation setting. Although it was observed that interpolation helps in training (Bassily et al., 2018; Vaswani et al., 2019; Ma et al., 2018; Oymak & Soltanolkotabi, 2020; Allen-Zhu et al., 2019; Zou et al., 2018), it is still largely unclear, as indicated in Ma et al. (2018), how interpolation helps in generalization. Corollary 2 provides new insights on how interpolation with highly expressive models helps generalization. We now move on to discuss the critical assumption in Corollary 2, i.e., L < nβ/2. According to the proof, the two parameters L and β can be replaced by their local counterparts, i.e., the smoothness and PL parameters related to a particular minimizer w′ of F_{S^(i)} (Eqs. (B.6), (B.7)). For example, β can be replaced by (1/2)‖∇F_S(w′)‖₂²/(F_S(w′) − F̂_S), which can be larger than β. Below are some examples explaining L/β < n/2. As we will see, the quantity L/β reflects the complexity of the problem (it is related to a condition number, as shown in Examples 1, 2). Therefore, the condition L/β < n/2 implicitly imposes a constraint on the complexity of the problem. This explains why an optimization algorithm would never overfit when applied to gradient-dominated objective functions with L/β < n/2, as shown in Theorem 1.

Example 1. Let φ : R^d → R^m be a feature map, and ℓ : R × R → R_+ be a loss function which is L′-smooth and σ′-strongly convex w.r.t. its first argument. Consider f(w; z_i) = ℓ(⟨w, φ(x_i)⟩, y_i) with ⟨·, ·⟩ being an inner product.
Then F_S satisfies the PL condition with parameter σ_min(Σ_S)σ′, where Σ_S = (1/n)∑_{i=1}^n φ(x_i)φ(x_i)^⊤ is the empirical covariance matrix, A^⊤ denotes the transpose of a matrix A, and σ_min(A) denotes the minimal non-zero singular value of A. The empirical counterpart (we take an expectation w.r.t. S in the PL condition) of L/β is of the order of σ_max(Σ_S)/σ_min(Σ_S), where σ_max(A) denotes the maximal singular value (we give details in Appendix E.1).

Example 2. Consider a neural network with a single hidden layer, d inputs, m hidden neurons and a single output neuron, for which the prediction function takes the form h_{v,w}(x) = ∑_{k=1}^m v_k φ(⟨w_k, x⟩). Here w_k ∈ R^d and v_k ∈ R denote the weights of the edges connecting the k-th hidden node to the input and output nodes, respectively, while φ : R → R is the activation function. Analogous to Arora et al. (2019); Oymak & Soltanolkotabi (2020), we fix v = (v_1, . . . , v_m) with |v_k| = a for some a > 0 and train w = (w_1, w_2, . . . , w_m) ∈ R^{m×d} from S. The loss function then takes the form f(w; z) = (v^⊤φ(wx) − y)², where φ is applied elementwise. If we consider the identity activation function, i.e., φ(t) = t, then F_S satisfies the PL condition with parameter σ_min(Σ_S), where σ_min(A) denotes the minimal singular value of A and Σ_S = (1/n)∑_{i=1}^n x_i x_i^⊤. The empirical counterpart of L/β is of the order of σ_max(Σ_S)/σ_min(Σ_S) (we give details in Appendix E.2 for a general activation function).

It is possible to get generalization bounds under some other conditions. Since the one-point strong convexity condition together with the smoothness assumption implies the PL condition (Yuan et al., 2019), all our results apply to one-point strongly convex functions. We can also get generalization bounds for objective functions satisfying the quadratic growth condition (Necoara et al., 2018), which is weaker than the PL condition. However, we need to impose a realizability condition, which was also imposed in Charles & Papailiopoulos (2018).
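To get a feel for the condition L/β < n/2 discussed above, the following sketch (our own illustration: a synthetic Gaussian design stands in for the feature map of Example 1, and all sizes are arbitrary) computes the empirical counterpart σ_max(Σ_S)/σ_min(Σ_S) of L/β:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
Phi = rng.normal(size=(n, d))            # rows play the role of phi(x_i)

Sigma_S = Phi.T @ Phi / n                # empirical covariance matrix
svals = np.linalg.svd(Sigma_S, compute_uv=False)
svals = svals[svals > 1e-12]             # non-zero singular values only
cond = svals.max() / svals.min()         # empirical counterpart of L / beta

print(f"sigma_max/sigma_min = {cond:.2f}, n/2 = {n / 2}")
assert cond < n / 2                      # complexity condition of Corollary 2
```

With a well-conditioned design the ratio stays far below n/2; the condition only bites for nearly degenerate covariance matrices or very small samples.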
The proof of Theorem 3 is given in Appendix C. Let w_{(S)} denote the Euclidean projection of w onto the set of global minimizers of F_S in W.

Definition 3 (Quadratic Growth Condition). We say F_S : W → R satisfies the quadratic growth condition (in expectation) with parameter β if E[F_S(w) − F̂_S] ≥ (β/2) E[‖w − w_{(S)}‖₂²] for all w ∈ W.

Theorem 3. Let Assumption 1 hold and F_S satisfy the quadratic growth condition with parameter β. If the problem is realizable, i.e., E[F̂_S] = 0, and L ≤ nβ/4, then E[F(w_S)] ≤ 2Lβ^{−1} E[F_S(w_S)].

Finally, we consider any optimization algorithm applied to gradient-dominated and Lipschitz continuous functions. We do not require loss functions to be smooth here. It shows that the excess generalization bound can decay as fast as O(1/(nβ)) if we solve the optimization problem to a sufficient accuracy, which is much better than the generalization bound O(1/√(nβ)) in Charles & Papailiopoulos (2018). Recall that the analysis in Charles & Papailiopoulos (2018) requires Assumptions 2, 3 and a further boundedness assumption on the loss functions. The proof is given in Appendix C.

Theorem 4. Let Assumptions 2, 3 hold and w_S = A(S). Then the following inequality holds:

E[F(w_S) − F̂_S] ≤ 2G²/(nβ) + G (E[F_S(w_S) − F̂_S])^{1/2}/√(2β).

4. APPLICATIONS

In this section, we apply Theorem 1 to different stochastic optimization algorithms such as stochastic gradient descent (SGD), randomized coordinate descent (RCD), and stochastic variance-reduced optimization. In particular, we study the number of stochastic gradient evaluations required to achieve a prescribed generalization bound, which is summarized in Table 1. We always assume L ≤ nβ/4 in this section.

4.1. STOCHASTIC GRADIENT DESCENT

We need some notation to state the results for SGD. Specifically, denote by w_1 ∈ W an initial point of SGD. At the t-th iteration, we first randomly select an index i_t ∼ unif[n], and then update {w_t}_t by

w_{t+1} = w_t − η_t ∇f(w_t; z_{i_t}),  (4.1)

where {η_t}_t is a sequence of positive step sizes and unif[n] denotes the uniform distribution over [n]. The proof of Theorem 5 is given in Appendix D.1.

Theorem 5. Let Assumptions 1, 2 hold with L ≤ nβ/4. Let A be SGD with the step size sequence η_t = (2t+1)/(2β(t+1)²). Then E[F(w_{T+1})] − F(w*) = O(1/(nβ) + 1/(Tβ³)). We can take O(n/β²) stochastic gradient evaluations to get the excess generalization bound O(1/(nβ)).

Table 1: Gradient complexities to achieve the excess generalization bound O(1/(nβ)) (middle column), and to achieve the bound O(ε) in the interpolation setting E[F̂_S] = 0 (right column).

Algorithm          | Complexity for 1/(nβ)   | Complexity for ε if E[F̂_S] = 0
SGD                | n/β²                    | (1/β²) log(1/(βε))
RCD                | (d/β) log n             | (d/β) log(1/(βε))
SVRG, SCSG         | (n + n^{2/3}/β) log n   | (n + n^{2/3}/β) log(1/(βε))
SARAH, SpiderBoost | (n + 1/β²) log n        | (n + 1/β²) log(1/(βε))
SNVRG              | (n + √n/β) log⁴ n       | n + √n/β

Remark 5. We compare Theorem 5 with the recent generalization analysis of SGD under the PL condition. Based on a pointwise hypothesis stability analysis and the optimization error bound in Karimi et al. (2016), it was shown with probability at least 1 − δ (Charles & Papailiopoulos, 2018) that

F(w_{T+1}) − F(w*) = O(1/√(nβδ) + 1/(T^{1/4} β^{3/4} δ^{1/2})).  (4.2)

Based on the uniform stability analysis (Hardt et al., 2016) and the optimization error bound in Karimi et al. (2016), it was shown in Yuan et al. (2019) that

E[F(w_{T+1})] − F(w*) = O(n^{−1}(βT)^{(L/β)/(1+L/β)}) + O(1/(Tβ²)).  (4.3)

By taking an optimal T = n^{(1+L/β)/(1+2L/β)} β^{−(2+3L/β)/(1+2L/β)} (ignoring a constant factor) to balance the above two terms, we derive E[F(w_{T+1})] − F(w*) = O(n^{−(1+L/β)/(1+2L/β)} β^{−(L/β)/(1+2L/β)}). If L/β is moderately large, then this bound quickly approaches E[F(w_{T+1})] − F(w*) = O(1/√(nβ)). With probability at least 1 − δ, it was shown that SGD with the step size η_t = c/((t+2) log(t+2)) achieves the bound F(w_{T+1}) − F_S(w_{T+1}) = O(√c log T/√(nδ)) (Zhou et al., 2018b). However, it is not clear how the optimization errors decay with such step sizes. Typically, c should be of the order O(1/β), as shown in Karimi et al. (2016), and therefore the stability analysis in Zhou et al. (2018b) would also imply at best a generalization bound of the order O(1/√(nβ)) (Charles & Papailiopoulos, 2018; Zhou et al., 2018b; Yuan et al., 2019), which is significantly improved to O(1/(nβ)) in our paper by the refined stability analysis. It is worth mentioning that, in this comparison, we have used the same optimization error bounds in Karimi et al. (2016), and that the analysis in Charles & Papailiopoulos (2018); Zhou et al. (2018b); Yuan et al. (2019) requires a bounded gradient assumption and a bounded loss assumption, which are removed in our analysis.

The above iteration complexity in Theorem 5 can be further improved if we impose a restricted secant inequality (Karimi et al., 2016) on F_S, which has been considered in non-convex optimization, e.g., for optimizing neural networks (Li & Yuan, 2017). This is a slightly stronger assumption than the PL condition, as shown in Karimi et al. (2016).

Definition 4 (Restricted Secant Inequality). We say F_S : W → R satisfies the restricted secant inequality with parameter β if E[⟨w − w_{(S)}, ∇F_S(w)⟩] ≥ β E[‖w − w_{(S)}‖₂²] for all w ∈ W.

Theorem 6. Assume F_S satisfies the restricted secant inequality with parameter β. Let Assumption 1 hold with L ≤ nβ/4. Let A be SGD with η_t = 1/(β(t+1)). Then one can take O(n/β) stochastic gradient evaluations to achieve the excess generalization bound O(1/(nβ)).

Below we apply Theorem 1 to establish fast generalization bounds in an interpolation setting. Our analysis shows that interpolation actually boosts SGD by achieving an exponential convergence of testing errors, which cannot be derived from the bound (3.4) in Charles & Papailiopoulos (2018).

Theorem 7. Let Assumptions 1, 2 hold with L ≤ nβ/4, and E[F̂_S] = 0. Let A be SGD with η_t = β/L². Then E[F(w_{T+1})] ≤ (L/(2β))(1 − β²/L²)^T E[F_S(w_1)].
We can take O(β^{−2} log(1/(βε))) stochastic gradient evaluations to achieve the generalization bound O(ε) for any ε > 0. The above linear convergence does not contradict existing minimax lower bounds, in which the benefit of interpolation is not considered. The proofs for Theorems 6, 7 are given in Appendix D.1.

Remark 6. We discuss some recent work on error bounds under low-noise conditions. Optimization errors of SGD were studied for general non-convex objectives (Vaswani et al., 2019; Ma et al., 2018) and gradient-dominated objectives (Bassily et al., 2018). For binary classification problems with the squared loss, it was shown that SGD achieves an exponential convergence of testing classification errors under a margin condition, i.e., when positive and negative classes are separated by a strictly positive margin (Pillaud-Vivien et al., 2018). This was extended to general convex loss functions under the same margin condition (Nitanda & Suzuki, 2019). These discussions consider regularized objective functions (Pillaud-Vivien et al., 2018; Nitanda & Suzuki, 2019), which are strongly convex. The exponential convergence in Pillaud-Vivien et al. (2018); Nitanda & Suzuki (2019) was established for the testing classification errors, i.e., the 0-1 loss. As a comparison, we establish an exponential convergence for testing errors measured by the loss functions used in training. In addition, the exponential convergence in Pillaud-Vivien et al. (2018); Nitanda & Suzuki (2019) comes into effect only after a sufficiently large number of iterations, which is not required in Theorem 7.
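To illustrate the SGD scheme (4.1) studied in Theorems 5-7, here is a minimal sketch on a synthetic least-squares problem, which satisfies the PL condition with β = σ_min(Σ_S). The step size is the one from Theorem 5; the data, the row normalization, and the iteration budget are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 300, 2
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm examples (illustrative)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

beta = np.linalg.eigvalsh(X.T @ X / n)[0]      # PL parameter for least squares

w = np.zeros(d)
T = 5000
for t in range(T):
    i = rng.integers(n)                        # i_t ~ unif[n]
    eta = (2 * t + 1) / (2 * beta * (t + 1) ** 2)  # step size of Theorem 5
    w -= eta * (X[i] @ w - y[i]) * X[i]        # update (4.1) for the squared loss

train_err = 0.5 * np.mean((X @ w - y) ** 2)
print("training error after", T, "steps:", train_err)
```

The training error decays toward the noise floor, in line with the O(1/(Tβ³)) optimization term in Theorem 5.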

4.2. RANDOMIZED COORDINATE DESCENT

Randomized coordinate descent (RCD) is an efficient optimization algorithm particularly useful for high-dimensional learning problems (Nesterov, 2012). At each iteration it first randomly selects a single coordinate i_t ∈ {1, . . . , d}, and then performs an update along the i_t-th coordinate as w_{t+1} = w_t − η_t ∇_{i_t} F_S(w_t) e_{i_t}, where ∇_i F_S denotes the derivative of F_S w.r.t. the i-th coordinate and e_i is the vector in R^d with the i-th coordinate being 1 and all other coordinates being 0.

Theorem 8. Let Assumptions 1 and 2 hold with L ≤ nβ/4. Let A be RCD with η_t = 1/L. Then

E[F(w_{T+1})] − F(w*) = O(1/(nβ) + (1/β)(1 − β/(dL))^T).
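A minimal sketch of the RCD update above on a least-squares objective (our illustration; all problem sizes are arbitrary, and we use the conservative step size 1/L with L the global smoothness constant, while sharper coordinate-wise constants are possible):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                 # realizable problem

def grad_FS(w):                            # full gradient of the empirical risk
    return X.T @ (X @ w - y) / n

L = np.linalg.eigvalsh(X.T @ X / n)[-1]    # smoothness constant (largest eigenvalue)

w = np.zeros(d)
for t in range(2000):
    i = rng.integers(d)                    # pick coordinate i_t uniformly
    w[i] -= (1.0 / L) * grad_FS(w)[i]      # update only the i_t-th coordinate

risk = 0.5 * np.mean((X @ w - y) ** 2)
print("final empirical risk:", risk)
```

Each step decreases the objective in expectation by a (1 − β/(dL)) factor, matching the linear term in Theorem 8.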

4.3. STOCHASTIC VARIANCE-REDUCED OPTIMIZATION

SGD needs a diminishing step size due to the inherent variance of stochastic gradients, which generally yields a sublinear convergence rate (Bottou et al., 2018). Recently, a large body of work accelerates SGD by using different gradient estimates with reduced variance (Johnson & Zhang, 2013; Xiao & Zhang, 2014; Zhang et al., 2013; Allen-Zhu & Hazan, 2016; Fang et al., 2018; Wang et al., 2019; Nguyen et al., 2017; Zhou et al., 2018a; Schmidt et al., 2017; Defazio et al., 2014; Reddi et al., 2016). This class of algorithms proceeds in epochs. Let w̃_0 be an initialization point. At the beginning of the s-th epoch, we set a reference point w_0 = w̃_{s−1}, draw a batch Ĩ_s ⊆ [n] and compute v_0 = ∇f_{Ĩ_s}(w_0), where we denote f_I(w) = (1/|I|) Σ_{i∈I} f(w; z_i) for I ⊆ [n] and |I| is the cardinality of I. The batch Ĩ_s can be equal to [n] (Johnson & Zhang, 2013; Xiao & Zhang, 2014; Wang et al., 2019; Reddi et al., 2016) or drawn with replacement according to the uniform distribution over [n] (Lei et al., 2017; Fang et al., 2018). We then proceed with m_s inner iterations using gradient estimators with reduced variance. At the t-th inner iteration, we first draw a batch I_t ⊆ [n] from the uniform distribution over [n]. The original SVRG (Johnson & Zhang, 2013; Reddi et al., 2016; Xiao & Zhang, 2014) uses the gradient estimator (we omit the dependency on s)
v_t = ∇f_{I_t}(w_t) − ∇f_{I_t}(w_0) + v_0. (4.4)
Recently, a different gradient estimator was proposed (Nguyen et al., 2017; Fang et al., 2018):
v_t = ∇f_{I_t}(w_t) − ∇f_{I_t}(w_{t−1}) + v_{t−1}. (4.5)
An important observation is that the variance of v_t diminishes to zero as we approach the minimum, which allows us to update the iterate with a constant step size w_{t+1} = w_t − ηv_t (Johnson & Zhang, 2013). The framework of stochastic variance-reduced optimization is described in Algorithm 1 in Appendix D.
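The two estimators (4.4) and (4.5) differ only in their anchor point: SVRG corrects the minibatch gradient against the fixed epoch reference w_0, while (4.5) corrects recursively against the previous iterate. A minimal sketch, where the least-squares loss and the helper names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad(w, idx):
    """Minibatch gradient of f_I(w) = (1/|I|) sum_{i in I} (x_i^T w - y_i)^2 / 2."""
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

def svrg_estimator(w_t, w_ref, v_ref, idx):
    # (4.4): v_t = grad_I(w_t) - grad_I(w_ref) + v_ref, with w_ref fixed per epoch
    return grad(w_t, idx) - grad(w_ref, idx) + v_ref

def recursive_estimator(w_t, w_prev, v_prev, idx):
    # (4.5): v_t = grad_I(w_t) - grad_I(w_{t-1}) + v_{t-1} (SARAH/SPIDER style)
    return grad(w_t, idx) - grad(w_prev, idx) + v_prev

w_ref = rng.normal(size=d)
v_ref = grad(w_ref, np.arange(n))   # v_0 = full gradient at the reference point
w_t = rng.normal(size=d)
```

Note that (4.4) is unbiased: averaging `svrg_estimator` over all batches recovers the full gradient at w_t, and at w_t = w_ref the estimator equals v_ref exactly, illustrating the vanishing variance near the reference point.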

5. SIMULATIONS AND CONCLUSIONS

Simulations. We report some preliminary experiments to support our theory. We consider the dataset IJCNN available from the LIBSVM website (Chang & Lin, 2011) and report the average of results over 25 repetitions. In our first experiment, we check how the condition σ_max(Σ_S)/σ_min(Σ_S) ≤ n/4 would be satisfied in practice. To this aim, we randomly pick a subset I ⊂ {1, 2, . . . , n} and build an empirical covariance matrix Σ_I = (1/|I|) Σ_{i∈I} x_i x_i^⊤, where |I| denotes the cardinality of I. We then compute the term κ_I := σ_max(Σ_I)/(σ_min(Σ_I)|I|). Figure 1 plots κ_I as a function of |I|. It is clear that the condition κ_I ≤ 1/4 is violated if |I| is small. As |I| increases, κ_I decreases and can be as small as 10^{-3}. Hence the condition κ_I ≤ 1/4 holds trivially for sufficiently large n. Theorem 1 implies that overfitting would never happen for learning with gradient-dominated functions. Our second experiment aims to verify this phenomenon. We consider a generalized linear model for binary classification with the loss function f(w; z) = (ℓ(w^⊤x) − y)^2, where ℓ is the logistic link function ℓ(a) = (1 + exp(−a))^{-1}. It was shown that the corresponding objective function is gradient-dominated (Foster et al., 2018). We use 80 percent of the dataset for training and reserve the remaining 20 percent for testing. We apply SGD with step size η_t = 1/(1 + 0.001t) and compute the testing error of {w_t} on the testing dataset. In Figure 2, we plot testing errors versus the number of passes (iteration number divided by sample size). It is clear that the testing error continues to decrease along the learning process, and there is no overfitting even after 100 passes over the dataset. This is consistent with Theorem 1. Conclusions. We study stochastic optimization under the PL condition. We show that generalization errors can be bounded by O(1/(nβ)) plus the convergence rate of the algorithm.
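The quantity κ_I from the first experiment is easy to reproduce; the snippet below uses synthetic Gaussian features in place of the IJCNN data, so the absolute values are illustrative only.

```python
import numpy as np

def kappa(X_I):
    """kappa_I = sigma_max(Sigma_I) / (sigma_min(Sigma_I) * |I|), where
    Sigma_I = (1/|I|) sum_{i in I} x_i x_i^T is the empirical covariance."""
    n_I = X_I.shape[0]
    Sigma = X_I.T @ X_I / n_I
    eigs = np.linalg.eigvalsh(Sigma)     # eigenvalues in ascending order
    return eigs[-1] / (eigs[0] * n_I)

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))          # synthetic stand-in for the dataset
kappas = {m: kappa(X[:m]) for m in (50, 500, 5000)}
```

Once σ_max(Σ_I)/σ_min(Σ_I) stabilizes with growing |I|, κ_I decays like 1/|I|, matching the trend reported in Figure 1.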
An important observation is that optimization always helps generalization under the PL condition. Our analysis based on a weak on-average stability measure removes the bounded gradient assumption in the literature, and implies significantly better bounds. In particular, we show how interpolation accelerates generalization. Our study relies on an essential PL condition on the objective function. While this assumption is widely used in the non-convex learning setting, it would be very interesting to extend the discussions here to general non-convex objective functions. Proof. The proof of Part (a) can be found in Hardt et al. (2016, Theorem 2.2). Part (b) was first proved for deterministic algorithms (Bousquet & Elisseeff, 2002, Theorem 11), and then extended to randomized algorithms (Elisseeff et al., 2005, Theorem 12). We prove Part (c) here due to its simplicity. Since z_i and z̃_i are drawn from the same distribution, we know
E_{S,A}[F(A(S)) − F_S(A(S))] = (1/n) Σ_{i=1}^n E_{S,S̃,A}[F(A(S^{(i)})) − F_S(A(S))] = (1/n) Σ_{i=1}^n E_{S,S̃,A}[f(A(S^{(i)}); z_i) − f(A(S); z_i)],
where the last identity holds since z_i is independent of A(S^{(i)}). The proof is complete by noting the definition of on-average stability.

B PROOF OF THEOREM 1

In this section, we prove Theorem 1. We begin our analysis with some useful properties of smooth functions. If g is L-smooth, we have the following self-bounding property (Srebro et al., 2010)
‖∇g(w)‖_2^2 ≤ 2L(g(w) − inf_{w'} g(w')), ∀w ∈ W (B.1)
and the following elementary inequality for all w, w̃ ∈ W (Nesterov, 2012)
g(w) ≤ g(w̃) + ⟨∇g(w̃), w − w̃⟩ + (L/2)‖w − w̃‖_2^2. (B.2)
In particular, if g is further nonnegative, then
‖∇g(w)‖_2^2 ≤ 2Lg(w), ∀w ∈ W. (B.3)
The following lemma follows directly from the self-bounding property of smooth loss functions.
Lemma B.1. Assume F is L-smooth. Then (w can depend on S) E[‖∇F(w)‖_2^2] ≤ 2L E[F(w) − F̂_S].
Proof. Recall w^* = arg min_{w∈W} F(w). According to the self-bounding property (B.1) and the definition of w^*, we know
E[‖∇F(w)‖_2^2] ≤ 2L E[F(w) − F(w^*)] = 2L E[F(w) − F_S(w^*)] ≤ 2L E[F(w) − F̂_S],
where we have used E[F_S(w^*)] = F(w^*) since w^* is independent of S, and F̂_S ≤ F_S(w^*) by the definition of F̂_S. The proof is complete.
In the following lemma, we derive on-average stability bounds under the PL condition. Recall that for any w, we denote by w^{(S)} the Euclidean projection of w onto the set of global minimizers of F_S in W.
Lemma B.2. If Assumptions 1 and 2 hold, then A has on-average stability ε satisfying
ε ≤ (2L/(nβ))(E[F̂_S] + E[F(w_S^{(S)})]) + E[F(w_S) − F(w_S^{(S)})] + E[F̂_S − F_S(w_S)].
Proof. Let S̃ = {z̃_1, . . . , z̃_n} be drawn independently from ρ. For each i ∈ [n], let S^{(i)} be as in Definition 2, and denote w_{S^{(i)}} = A(S^{(i)}) and by w_{S^{(i)}}^{(S^{(i)})} the projection of w_{S^{(i)}} onto the set of global minimizers of F_{S^{(i)}}. We decompose f(w_{S^{(i)}}; z_i) − f(w_S; z_i) as follows:
f(w_{S^{(i)}}; z_i) − f(w_S; z_i) = (f(w_{S^{(i)}}; z_i) − f(w_{S^{(i)}}^{(S^{(i)})}; z_i)) + (f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)) + (f(w_S^{(S)}; z_i) − f(w_S; z_i)). (B.4)
We now address the above three terms separately. We first address f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i).
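For intuition, (B.3) follows from (B.2) in one line: evaluate the quadratic upper bound at the gradient step w = w̃ − ∇g(w̃)/L and use g ≥ 0 (a standard derivation, included here for the reader's convenience):

```latex
0 \;\le\; g\Big(\tilde w - \tfrac{1}{L}\nabla g(\tilde w)\Big)
\;\le\; g(\tilde w) - \tfrac{1}{L}\big\|\nabla g(\tilde w)\big\|_2^2
        + \tfrac{L}{2}\cdot\tfrac{1}{L^2}\big\|\nabla g(\tilde w)\big\|_2^2
\;=\; g(\tilde w) - \tfrac{1}{2L}\big\|\nabla g(\tilde w)\big\|_2^2,
```

which rearranges to ‖∇g(w̃)‖_2^2 ≤ 2L g(w̃); applying the same argument to g − inf g gives (B.1).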
According to the definitions of F_S, S and S^{(i)}, we know
f(w_{S^{(i)}}^{(S^{(i)})}; z_i) = n F_S(w_{S^{(i)}}^{(S^{(i)})}) − n F_{S^{(i)}}(w_{S^{(i)}}^{(S^{(i)})}) + f(w_{S^{(i)}}^{(S^{(i)})}; z̃_i).
Since z_i and z̃_i follow the same distribution, we know E[f(w_{S^{(i)}}^{(S^{(i)})}; z̃_i)] = E[f(w_S^{(S)}; z_i)] and further get
E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i)] = n E[F_S(w_{S^{(i)}}^{(S^{(i)})})] − n E[F_{S^{(i)}}(w_{S^{(i)}}^{(S^{(i)})})] + E[f(w_S^{(S)}; z_i)].
It then follows that
E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)] = n E[F_S(w_{S^{(i)}}^{(S^{(i)})}) − F_{S^{(i)}}(w_{S^{(i)}}^{(S^{(i)})})] = n E[F_S(w_{S^{(i)}}^{(S^{(i)})}) − inf_{w∈W} F_S(w)], (B.5)
where we have used the following identity due to the symmetry between z_i and z̃_i:
E[F_{S^{(i)}}(w_{S^{(i)}}^{(S^{(i)})})] = E[F̂_S] = E[inf_{w∈W} F_S(w)].
By the PL condition of F_S, it then follows from (B.5) that (in our assumption of the PL condition, w may depend on S; this was also imposed in the literature (Yuan et al., 2019; Charles & Papailiopoulos, 2018; Zhou et al., 2018b). Indeed, the PL condition is often shown for empirical functions F_S)
E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)] ≤ (n/(2β)) E[‖∇F_S(w_{S^{(i)}}^{(S^{(i)})})‖_2^2]. (B.6)
According to the definition of w_{S^{(i)}}^{(S^{(i)})}, we know ∇F_{S^{(i)}}(w_{S^{(i)}}^{(S^{(i)})}) = 0 and therefore (using (a + b)^2 ≤ 2a^2 + 2b^2)
‖∇F_S(w_{S^{(i)}}^{(S^{(i)})})‖_2^2 = ‖∇F_{S^{(i)}}(w_{S^{(i)}}^{(S^{(i)})}) − (1/n)∇f(w_{S^{(i)}}^{(S^{(i)})}; z̃_i) + (1/n)∇f(w_{S^{(i)}}^{(S^{(i)})}; z_i)‖_2^2
≤ (2/n^2)‖∇f(w_{S^{(i)}}^{(S^{(i)})}; z̃_i)‖_2^2 + (2/n^2)‖∇f(w_{S^{(i)}}^{(S^{(i)})}; z_i)‖_2^2
≤ (4L/n^2) f(w_{S^{(i)}}^{(S^{(i)})}; z̃_i) + (4L/n^2) f(w_{S^{(i)}}^{(S^{(i)})}; z_i), (B.7)
where we have used the self-bounding property (B.3) of smooth loss functions. Since z_i and z̃_i follow the same distribution, we know
E[f(w_{S^{(i)}}^{(S^{(i)})}; z̃_i)] = E[f(w_S^{(S)}; z_i)], E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i)] = E[f(w_S^{(S)}; z̃_i)].
It then follows that
E[‖∇F_S(w_{S^{(i)}}^{(S^{(i)})})‖_2^2] ≤ (4L/n^2) E[f(w_S^{(S)}; z_i)] + (4L/n^2) E[f(w_S^{(S)}; z̃_i)],
which, combined with (B.6), gives
E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)] ≤ (2L/(nβ))(E[f(w_S^{(S)}; z_i)] + E[f(w_S^{(S)}; z̃_i)]).
Summing over i = 1, . . . , n, we get
Σ_{i=1}^n E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)] ≤ (2L/β)(E[F̂_S] + E[F_S(w_S^{(S)})]). (B.8)
We then address f(w_{S^{(i)}}; z_i) − f(w_{S^{(i)}}^{(S^{(i)})}; z_i). Since w_{S^{(i)}} and w_{S^{(i)}}^{(S^{(i)})} are independent of z_i, we know
E[f(w_{S^{(i)}}; z_i) − f(w_{S^{(i)}}^{(S^{(i)})}; z_i)] = E[F(w_{S^{(i)}}) − F(w_{S^{(i)}}^{(S^{(i)})})] = E[F(w_S) − F(w_S^{(S)})], (B.9)
where we have used the symmetry between z_i and z̃_i. Finally, we address f(w_S^{(S)}; z_i) − f(w_S; z_i). By the definition of w_S^{(S)}, we know
Σ_{i=1}^n (f(w_S^{(S)}; z_i) − f(w_S; z_i)) = n(F̂_S − F_S(w_S)). (B.10)
Plugging (B.8), (B.9) and the above inequality back into (B.4), we derive
Σ_{i=1}^n E[f(w_{S^{(i)}}; z_i) − f(w_S; z_i)] ≤ (2L/β)(E[F̂_S] + E[F_S(w_S^{(S)})]) + n E[F(w_S) − F(w_S^{(S)})] + n E[F̂_S − F_S(w_S)].
The proof is complete by recalling the definition of on-average stability and E[F_S(w_S^{(S)})] = E[F(w_S^{(S)})].
We further require a lemma relating convergence in function values to convergence in models. It shows that the PL condition is stronger than a quadratic growth condition (Karimi et al., 2016).
Lemma B.3 (Karimi et al. 2016). If F_S satisfies the PL condition with parameter β > 0, then for all w ∈ W we have
E[F_S(w) − F_S(w^{(S)})] ≥ 2β E[‖w − w^{(S)}‖_2^2]. (B.11)
We are now in a position to prove Theorem 1.
Proof of Theorem 1. Plugging the on-average stability established in Lemma B.2 into Part (c) of Theorem A.1, we derive
E[F(w_S) − F_S(w_S)] ≤ (2L/(nβ))(E[F̂_S] + E[F(w_S^{(S)})]) + E[F(w_S) − F(w_S^{(S)})] + E[F̂_S − F_S(w_S)], (B.12)
from which we derive
E[F(w_S^{(S)}) − F̂_S] ≤ (2L/(nβ))(E[F̂_S] + E[F(w_S^{(S)})]). (B.13)
By (B.2), we know the following inequality for all γ > 0:
F(w_S) − F(w_S^{(S)}) ≤ ⟨∇F(w_S^{(S)}), w_S − w_S^{(S)}⟩ + (L/2)‖w_S − w_S^{(S)}‖_2^2
≤ ‖∇F(w_S^{(S)})‖_2 ‖w_S − w_S^{(S)}‖_2 + (L/2)‖w_S − w_S^{(S)}‖_2^2
≤ (1/(4γ))‖∇F(w_S^{(S)})‖_2^2 + (γ + L/2)‖w_S − w_S^{(S)}‖_2^2,
where we have used the Cauchy-Schwarz inequality. This together with Lemma B.1 with w = w_S^{(S)} implies
E[F(w_S) − F(w_S^{(S)})] ≤ (L/(2γ)) E[F(w_S^{(S)}) − F̂_S] + (γ + L/2) E[‖w_S − w_S^{(S)}‖_2^2].
Plugging (B.13) into the above inequality, we get
E[F(w_S) − F(w_S^{(S)})] ≤ (L/(2γ)) · (2L/(nβ))(E[F̂_S] + E[F(w_S^{(S)})]) + (γ + L/2) E[‖w_S − w_S^{(S)}‖_2^2].
Taking γ = L/2, we then get
E[F(w_S) − F(w_S^{(S)})] ≤ (2L/(nβ))(E[F̂_S] + E[F(w_S^{(S)})]) + L E[‖w_S − w_S^{(S)}‖_2^2].
Plugging the above inequality back into (B.12), we derive
E[F(w_S) − F_S(w_S)] ≤ (4L/(nβ))(E[F̂_S] + E[F(w_S^{(S)})]) + L E[‖w_S − w_S^{(S)}‖_2^2] + E[F̂_S − F_S(w_S)].
It then follows that
E[F(w_S) − F̂_S] ≤ (4L/(nβ))(E[F̂_S] + E[F(w_S^{(S)})]) + L E[‖w_S − w_S^{(S)}‖_2^2].
The stated bound then follows from (B.11). The proof is complete.
Our analysis in the proof of Theorem 1 actually gives
E[F(w_S) − F̂_S] ≤ 8L E[F̂_S]/(nβ − 2L) + (L/(2β)) E[F_S(w_S) − F̂_S].
Since we assume E[F̂_S] = 0 in Corollary 2, we only need the condition L < nβ/2 to get Corollary 2.
C PROOF OF THEOREM 3 AND THEOREM 4
In this section, we present the proofs of Theorem 3 and Theorem 4.
Proof of Theorem 3. Let w̃ be the projection of w_{S^{(i)}}^{(S^{(i)})} onto the set of global minimizers of F_S. Then by the quadratic growth condition, we know
E[F_S(w_{S^{(i)}}^{(S^{(i)})}) − F̂_S] ≥ (β/2) E[‖w_{S^{(i)}}^{(S^{(i)})} − w̃‖_2^2].
This together with (B.5) and the non-negativity of f implies
(nβ/2) E[‖w_{S^{(i)}}^{(S^{(i)})} − w̃‖_2^2] ≤ E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i)] = E[F(w_{S^{(i)}}^{(S^{(i)})})] = E[F(w_S^{(S)})], (C.1)
where we have used the symmetry between S and S^{(i)}. By the realizability condition, we know almost surely that f(w_S^{(S)}; z_i) = f(w̃; z_i) = 0 and ∇f(w̃; z_i) = 0.
It then follows from the smoothness assumption that
E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)] = E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w̃; z_i)]
≤ E[⟨w_{S^{(i)}}^{(S^{(i)})} − w̃, ∇f(w̃; z_i)⟩ + (L/2)‖w_{S^{(i)}}^{(S^{(i)})} − w̃‖_2^2]
= (L/2) E[‖w_{S^{(i)}}^{(S^{(i)})} − w̃‖_2^2] ≤ L E[F(w_S^{(S)})]/(nβ).
We can plug (B.9), (B.10) and the above inequality back into (B.4) and derive the following bound on the on-average stability:
ε ≤ L E[F(w_S^{(S)})]/(nβ) + E[F(w_S) − F(w_S^{(S)})] + E[F̂_S − F_S(w_S)].
We can then argue analogously to the proof of Theorem 1 with this stability bound and get the stated generalization bound. The proof is complete.
Proof of Theorem 4. Similar to (B.7), but using the boundedness of gradients, we know
‖∇F_S(w_{S^{(i)}}^{(S^{(i)})})‖_2^2 ≤ 4G^2/n^2.
We can plug this inequality into (B.6) and derive
E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)] ≤ (n/(2β)) · (4G^2/n^2) = 2G^2/(nβ).
Summing over i gives
Σ_{i=1}^n E[f(w_{S^{(i)}}^{(S^{(i)})}; z_i) − f(w_S^{(S)}; z_i)] ≤ 2G^2/β. (C.2)
Plugging (C.2), (B.9) and (B.10) back into (B.4), we derive
Σ_{i=1}^n E[f(w_{S^{(i)}}; z_i) − f(w_S; z_i)] ≤ 2G^2/β + n E[F(w_S) − F(w_S^{(S)})] + n E[F̂_S − F_S(w_S)]
≤ 2G^2/β + nG E[‖w_S − w_S^{(S)}‖_2] + n E[F̂_S − F_S(w_S)],
where in the last step we have used the inequality F(w_S) − F(w_S^{(S)}) ≤ G‖w_S − w_S^{(S)}‖_2 due to the boundedness of gradients. According to the definition of on-average stability, the on-average stability ε of A satisfies
ε ≤ 2G^2/(nβ) + G E[‖w_S − w_S^{(S)}‖_2] + E[F̂_S − F_S(w_S)]
≤ 2G^2/(nβ) + G(E[F_S(w_S) − F̂_S])^{1/2}/√(2β) + E[F̂_S − F_S(w_S)],
where we have used Lemma B.3. According to Part (c) of Theorem A.1, it follows that
E[F(w_S) − F_S(w_S)] ≤ 2G^2/(nβ) + G(E[F_S(w_S) − F̂_S])^{1/2}/√(2β) + E[F̂_S − F_S(w_S)].
The stated bound then follows directly. The proof is complete.

D PROOFS ON APPLICATIONS

In this section, we prove generalization bounds for various stochastic optimization algorithms.

D.1 STOCHASTIC GRADIENT DESCENT

We consider here SGD. In the following proposition, we bound the variance of stochastic gradients for SGD under the PL condition. The variance was also studied in a general non-convex setting (Lei et al., 2020a).
Proposition D.1. Let Assumptions 1 and 2 hold. Let {w_t}_t be the sequence produced by SGD with step size sequence {η_t}_{t∈N}. If there exists t_0 ∈ N such that η_t ≤ β/L^2 for all t ≥ t_0, then
E[‖∇f(w_t; z_{i_t})‖_2^2] ≤ 2L max{E[F_S(w_{t_0})], 2E[F̂_S]} for all t ≥ t_0.
Proof. By (B.2) and the update (4.1), we know
F_S(w_{t+1}) ≤ F_S(w_t) + ⟨w_{t+1} − w_t, ∇F_S(w_t)⟩ + (L/2)‖w_{t+1} − w_t‖_2^2
= F_S(w_t) − η_t⟨∇f(w_t; z_{i_t}), ∇F_S(w_t)⟩ + (Lη_t^2/2)‖∇f(w_t; z_{i_t})‖_2^2
≤ F_S(w_t) − η_t⟨∇f(w_t; z_{i_t}), ∇F_S(w_t)⟩ + L^2η_t^2 f(w_t; z_{i_t}),
where we have used (B.3). Taking expectations on both sides, we get the following inequality for all t ≥ t_0:
E[F_S(w_{t+1})] ≤ E[F_S(w_t)] − η_t E[‖∇F_S(w_t)‖_2^2] + L^2η_t^2 E[f(w_t; z_{i_t})]
≤ E[F_S(w_t)] − 2η_t β E[F_S(w_t) − F̂_S] + η_t β E[F_S(w_t)], (D.1)
where we have used the PL condition and η_t ≤ β/L^2 in the last step. It then follows that for all t ≥ t_0
E[F_S(w_{t+1})] ≤ (1 − η_t β)E[F_S(w_t)] + η_t β · 2E[F̂_S] ≤ max{E[F_S(w_t)], 2E[F̂_S]}.
Applying this inequality recursively, we derive
E[F_S(w_{t+1})] ≤ max{E[F_S(w_{t_0})], 2E[F̂_S]} for all t ≥ t_0.
This together with (B.3) implies the following inequality for all t ≥ t_0:
E[‖∇f(w_t; z_{i_t})‖_2^2] ≤ 2L E[f(w_t; z_{i_t})] ≤ 2L max{E[F_S(w_{t_0})], 2E[F̂_S]}.
The proof is complete.
We now prove the generalization bounds in Theorem 5. We write B ≍ B' if there exist constants c_1, c_2 > 0 such that c_1 B' ≤ B ≤ c_2 B'.
Proof of Theorem 5. Let t_0 = L^2/β^2. It is clear that η_t ≤ β/L^2 for all t ≥ t_0. Let σ^2 = 2L max{E[F_S(w_1)], . . . , E[F_S(w_{t_0})], 2E[F̂_S]}. According to the self-bounding property (B.3) and Proposition D.1, we know that E[‖∇f(w_t; z_{i_t})‖_2^2] ≤ σ^2 for all t ∈ N.
The following optimization error bound was established in Karimi et al. (2016):
E[F_S(w_{t+1}) − F̂_S] ≤ Lσ^2/(2tβ^2). (D.2)
We can plug this inequality into (3.2) with A(S) = w_{T+1} and get
E[F(w_{T+1}) − F̂_S] ≤ 16L E[F̂_S]/(nβ) + L^2σ^2/(4Tβ^3).
Since
E[F̂_S] ≤ E[F_S(w^*)] = F(w^*), (D.3)
we further get
E[F(w_{T+1})] − F(w^*) ≤ 16L F(w^*)/(nβ) + L^2σ^2/(4Tβ^3).
By taking T ≍ n/β^2, we get E[F(w_{T+1})] − F̂_S = O(1/(nβ)). This corresponds to O(n/β^2) stochastic gradient evaluations. The proof is complete.
Lemma D.2. Assume F_S satisfies the restricted secant inequality with parameter β. Let A be SGD with step size sequence η_t = 1/(β(t+1)). Then there exists some σ ∈ R such that E[‖w_T − w_T^{(S)}‖_2^2] ≤ σ^2/(β^2 T).
Proof of Lemma D.2. Analogous to the proof of Theorem 5, we can find σ ∈ R_+ such that E[‖∇f(w_t; z_{i_t})‖_2^2] ≤ σ^2 for all t ∈ N. Since w_{t+1}^{(S)} is a projection of w_{t+1} onto the set of global minimizers of F_S, we know ‖w_{t+1} − w_{t+1}^{(S)}‖_2 ≤ ‖w_{t+1} − w_t^{(S)}‖_2. Together with the SGD update and E[‖∇f(w_t; z_{i_t})‖_2^2] ≤ σ^2, we derive
E[‖w_{t+1} − w_{t+1}^{(S)}‖_2^2] ≤ E[‖w_t − w_t^{(S)}‖_2^2] + η_t^2σ^2 + 2η_t E[⟨w_t^{(S)} − w_t, ∇F_S(w_t)⟩]
≤ E[‖w_t − w_t^{(S)}‖_2^2] + η_t^2σ^2 − 2η_t β E[‖w_t − w_t^{(S)}‖_2^2]
= (1 − 2η_t β)E[‖w_t − w_t^{(S)}‖_2^2] + η_t^2σ^2,
where we have used the restricted secant inequality. For the step size η_t = 1/(β(t+1)), we have
E[‖w_{t+1} − w_{t+1}^{(S)}‖_2^2] ≤ ((t−1)/(t+1)) E[‖w_t − w_t^{(S)}‖_2^2] + σ^2/(β^2(t+1)^2).
Multiplying both sides by t(t+1), we derive
t(t+1)E[‖w_{t+1} − w_{t+1}^{(S)}‖_2^2] ≤ (t−1)t E[‖w_t − w_t^{(S)}‖_2^2] + σ^2/β^2.
Summing the above inequality from t = 1 to T − 1 gives (T−1)T E[‖w_T − w_T^{(S)}‖_2^2] ≤ σ^2(T−1)/β^2. The proof is complete.
Proof of Theorem 6. It was shown that a function satisfying the restricted secant inequality with parameter β also satisfies the PL condition with parameter β/L (Karimi et al., 2016). Therefore (B.15) holds with β there replaced by β/L. According to Lemma D.2, we know E[‖w_T − w_T^{(S)}‖_2^2] ≤ σ^2/(β^2 T).
We can plug this inequality back into (B.15) with A(S) = w_{T+1} and get
E[F(w_{T+1})] − F(w^*) = O(F(w^*)/(nβ) + 1/(β^2 T)),
where we have used (D.3). By taking T ≍ n/β, we get E[F(w_{T+1})] − F(w^*) = O(1/(nβ)). This corresponds to O(n/β) stochastic gradient evaluations. The proof is complete.
Proof of Theorem 7. Let η = β/L^2. According to the assumption E[F̂_S] = 0 and (D.1), we know
E[F_S(w_{t+1})] ≤ E[F_S(w_t)] − 2ηβ E[F_S(w_t)] + ηβ E[F_S(w_t)] = (1 − ηβ)E[F_S(w_t)]. (D.4)
Applying this inequality recursively, we get E[F_S(w_{T+1})] ≤ (1 − ηβ)^T E[F_S(w_1)]. We can plug the above inequality back into (3.2) with A(S) = w_{T+1} and get
E[F(w_{T+1})] ≤ L E[F_S(w_{T+1})]/(2β) ≤ (L/(2β))(1 − β^2/L^2)^T E[F_S(w_1)] ≤ (L/(2β)) exp(−β^2 T/L^2) E[F_S(w_1)],
where we have used the elementary inequality
1 − a ≤ exp(−a). (D.5)
To achieve E[F(w_{T+1})] ≤ ε, we can take T such that
exp(−β^2 T/L^2) ≍ βε ⟺ T ≍ β^{-2} log(1/(βε)).
The proof is complete.
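Theorem 7's message, that constant-step SGD converges linearly when E[F̂_S] = 0, can be checked numerically on a toy interpolation problem (a consistent least-squares system); the data, step size and iteration count below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                 # consistent labels: F_S has a zero-loss minimizer

def F_S(w):                    # empirical risk (1/2n) ||Xw - y||^2
    r = X @ w - y
    return 0.5 * (r @ r) / n

w = np.zeros(d)
eta = 0.05                     # constant step size, playing the role of beta/L^2
for t in range(2000):
    i = rng.integers(n)
    w -= eta * (X[i] @ w - y[i]) * X[i]   # stochastic gradient step on f(w; z_i)
loss_final = F_S(w)
```

Because the variance of the stochastic gradient vanishes at the interpolating minimizer, the training loss decays geometrically here rather than at the usual O(1/t) rate of SGD with a diminishing step size.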

D.2 RANDOMIZED COORDINATE DESCENT

We prove here the generalization bounds for randomized coordinate descent. We further assume that the gradient is coordinate-wise Lipschitz continuous in the sense that
F_S(w + αe_i) ≤ F_S(w) + α∇_i F_S(w) + Lα^2/2, ∀α ∈ R, w ∈ R^d, i ∈ [d].
Proof of Theorem 8. According to Theorem 3 in Karimi et al. (2016), we know
E[F_S(w_{T+1}) − F̂_S] ≤ (1 − β/(dL))^T E[F_S(w_1) − F̂_S]. (D.6)
Plugging the above inequality back into (3.2) and using (D.3), we get
E[F(w_{T+1})] − F(w^*) ≤ 16L E[F̂_S]/(nβ) + (L/(2β))(1 − β/(dL))^T E[F_S(w_1)] = O(F(w^*)/(nβ)) + O((1/β) exp(−βT/(dL))),
where we have used (D.5). To achieve the excess generalization bound O(1/(nβ)), we require T satisfying
exp(−βT/(dL)) ≍ n^{-1} ⟺ T ≍ (d log n)/β.
If E[F̂_S] = 0, then it follows from (3.2), (D.6) and (D.5) that
E[F(w_{T+1})] ≤ (L/(2β)) exp(−Tβ/(dL)) E[F_S(w_1)].
To achieve the generalization bound ε, we require T satisfying
exp(−Tβ/(dL)) ≍ βε ⟺ T ≍ β^{-1} d log(1/(βε)).
The proof is complete.

D.3 STOCHASTIC VARIANCE-REDUCED OPTIMIZATION

We prove here generalization bounds for various stochastic variance-reduced optimization algorithms. We formulate the framework in Algorithm 1, whose last step chooses the output from {w̃_s} according to some strategy. We now consider the stochastic variance-reduced gradient descent (SVRG) (Reddi et al., 2016) and the stochastically controlled stochastic gradient (SCSG) (Lei et al., 2017).
Theorem D.3. Let Assumptions 1 and 2 hold with L ≤ nβ/4. Let A be either the SVRG in Reddi et al. (2016) or the SCSG in Lei et al. (2017). Then we can take O((n + n^{2/3}/β) log n) stochastic gradient evaluations to get the excess generalization bound O(1/(nβ)). Furthermore, if E[F̂_S] = 0, then we can take O((n + n^{2/3}/β) log(1/(βε))) stochastic gradient evaluations to achieve the generalization bound O(ε) for any ε > 0.
Proof. To achieve E[F_S(A(S)) − F̂_S] ≤ 2/n, it was shown that SVRG and SCSG require O((n + n^{2/3}/β) log n) stochastic gradient evaluations (Reddi et al., 2016; Lei et al., 2017). We plug this optimization error bound into Theorem 1 and get E[F(A(S))] − F(w^*) = O(1/(nβ)). We now consider the case E[F̂_S] = 0. According to (3.2), to achieve the generalization bound O(ε), it suffices that E[F_S(A(S)) − F̂_S] = O(βε). This can be achieved by taking O((n + n^{2/3}/β) log(1/(βε))) stochastic gradient evaluations (Reddi et al., 2016; Lei et al., 2017). The proof is complete.
We now present the proof of Theorem 9 on the behavior of the stochastic recursive gradient algorithm (SARAH) (Nguyen et al., 2017) and SpiderBoost (Wang et al., 2019).
Proof of Theorem 9. To achieve E[F_S(A(S)) − F̂_S] ≤ 2/n, it was shown that SARAH and SpiderBoost require O((n + 1/β^2) log n) stochastic gradient evaluations (Nguyen et al., 2017; Wang et al., 2019). We plug this optimization error bound into Theorem 1 and get E[F(A(S))] − F(w^*) = O(1/(nβ)). We now consider the case E[F̂_S] = 0. According to (3.2), to achieve the generalization bound O(ε), it suffices that E[F_S(A(S)) − F̂_S] = O(βε).
This can be achieved by taking O((n + 1/β^2) log(1/(βε))) stochastic gradient evaluations (Nguyen et al., 2017; Wang et al., 2019). The proof is complete.
Finally, we consider SNVRG-PL (Zhou et al., 2018a).
Proof. To achieve E[F_S(A(S)) − F̂_S] ≤ 2/n, it was shown that SNVRG-PL requires O((n + √n/β) log^4 n) stochastic gradient evaluations (Zhou et al., 2018a). We plug this optimization error bound into Theorem 1 and get E[F(A(S))] − F(w^*) = O(1/(nβ)). We now consider the case E[F̂_S] = 0. According to (3.2), to achieve the generalization bound O(ε), it suffices that E[F_S(A(S)) − F̂_S] = O(βε). This can be achieved by taking O((n + √n/β) log^4(1/(βε))) stochastic gradient evaluations (Zhou et al., 2018a). The proof is complete.

E DISCUSSIONS OF EXAMPLES

In this section, we discuss how to interpret the assumption L/β < n/2 in Theorem 2.

E.1 DISCUSSION OF EXAMPLE 1

We first give the definition of strong convexity. For a differentiable function g : W → R, we say g is σ-strongly convex if for any w, w' ∈ W there holds g(w') ≥ g(w) + ⟨w' − w, ∇g(w)⟩ + (σ/2)‖w' − w‖_2^2. According to the definition of J_j, we know
J_j^⊤ J_j = (v_1 φ'(w_1^⊤ x_j)x_j^⊤, . . . , v_m φ'(w_m^⊤ x_j)x_j^⊤)^⊤ (v_1 φ'(w_1^⊤ x_j)x_j^⊤, . . . , v_m φ'(w_m^⊤ x_j)x_j^⊤).
That is, we can take the parameter of the PL condition as
β = (a^2/(2n)) σ_min(Σ_{k=1}^m (φ'(Xw_k) φ'(Xw_k)^⊤) ⊙ (XX^⊤)).
It is reasonable to assume that L is of the order of (a^2/n) σ_max(Σ_{k=1}^m (φ'(Xw_k) φ'(Xw_k)^⊤) ⊙ (XX^⊤)) (Soltanolkotabi et al., 2019).



We take O((d log n)/β) stochastic gradient evaluations to get the excess generalization bound O(1/(nβ)). If E[F̂_S] = 0, we take O(β^{-1} d log(1/(βε))) stochastic gradient evaluations to get the generalization bound O(ε) for any ε > 0. The detailed proof of the above theorem is given in Appendix D.2. As indicated in Remark 1, the discussion in Charles & Papailiopoulos (2018) can only imply the generalization bound O(1/√(nβ)).

The following theorem gives generalization bounds O(1/(nβ)) for stochastic variance-reduced optimization, which significantly improves the bound O(1/√(nβ)) based on (3.4). The proof is given in Appendix D.3. Theorem 9. Let Assumptions 1 and 2 hold with L ≤ nβ/4. Let A be either the SARAH in Nguyen et al. (2017) or the SpiderBoost in Wang et al. (2019). We can take O((n + 1/β^2) log n) stochastic gradient evaluations to get the excess generalization bound O(1/(nβ)). If E[F̂_S] = 0, we take O((n + 1/β^2) log(1/(βε))) stochastic gradient evaluations to get the generalization bound O(ε) for any ε > 0.

Figure 1: κ_I versus |I|.
Figure 2: Testing error versus number of passes.

Theorem A.1 (Generalization by Stability). Let A be a randomized algorithm. (a) If A has uniform stability ε, then E_{S,A}[F_S(A(S)) − F(A(S))] ≤ ε. (b) Let M > 0. If A has pointwise hypothesis stability ε and 0 ≤ f(w; z) ≤ M for all w ∈ W and z ∈ Z, then for all δ ∈ (0, 1), with probability at least 1 − δ, F(A(S)) ≤ F_S(A(S)) + … (c) If A has on-average stability ε, then E_{S,A}[F(A(S)) − F_S(A(S))] ≤ ε.

E[F(w_S^{(S)})] ≤ 3E[F̂_S]. We can plug the above inequality back into (B.14) and derive
E[F(w_S) − F̂_S] ≤ 16L E[F̂_S]/(nβ) + L E[‖w_S − w_S^{(S)}‖_2^2].

Algorithm: Stochastic Variance-Reduced Optimization
Input: step size η, initialization w̃_0, {m_s}
1: for s = 1, 2, . . . do
2:   set w_0 = w̃_{s−1}
3:   draw a batch Ĩ_s ⊆ [n]
4:   compute v_0 = ∇f_{Ĩ_s}(w_0)
5:   update w_1 = w_0 − ηv_0
6:   for t = 1, . . . , m_s − 1 do
7:     draw a batch I_t ⊆ [n]

8:     compute v_t by either (4.4) or (4.5)
9:     update w_{t+1} = w_t − ηv_t
10:   set w̃_s as w_{i_s}, where i_s is drawn according to a distribution on [m_s]
11: choose the output from {w̃_s} according to some strategy

Theorem D.4. Let Assumptions 1 and 2 hold with L ≤ nβ/4. Let A be the SNVRG-PL in Zhou et al. (2018a). Then we can take O((n + √n/β) log^4 n) stochastic gradient evaluations to get the excess generalization bound O(1/(nβ)). Furthermore, if E[F̂_S] = 0, then we can take O((n + √n/β) log^4(1/(βε))) stochastic gradient evaluations to achieve the generalization bound O(ε) for any ε > 0.

whose (k, k')-th block is v_k v_{k'} φ'(w_k^⊤ x_j)φ'(w_{k'}^⊤ x_j) x_j x_j^⊤. Here X = (x_1, . . . , x_n)^⊤ ∈ R^{n×d} is the data matrix and ⊙ denotes the Hadamard (entry-wise) product of matrices. According to the definition of r, we know F_S(w) = (1/n)‖r‖_2^2. Then, it follows from (E.6) and (E.7) that
‖∇F_S(w)‖_2^2 ≥ (a^2/n) σ_min(Σ_{k=1}^m (φ'(Xw_k) φ'(Xw_k)^⊤) ⊙ (XX^⊤)) F_S(w).

Iteration complexity for different optimization algorithms to achieve a stated generalization bound under Assumptions 1 and 2. In the second column, we present the number of stochastic gradient evaluations to achieve the excess generalization bound O(1/(nβ)). In the third column, we present the number of stochastic gradient evaluations to achieve the generalization bound O(ε) if E[F̂_S] = 0. We ignore constant factors. It is known that variance-reduction techniques improve the iteration complexity to achieve small training errors. Our stability analysis shows that such an improvement is also achieved for testing errors. Note that the stability analysis in Charles & Papailiopoulos (2018) can at most imply an excess generalization bound O(1/√(nβ)) for these algorithms. The above bound indicates that O(n^2/β) stochastic gradient evaluations are needed to get the excess generalization bound O(1/√(nβ)). Based on the uniform stability bound in Hardt et al. (

In this case, the empirical counterpart of L/β is of the order of σ_max(Σ_S)/σ_min(Σ_S). If we consider the identity activation function, i.e., φ(t) = t, then it follows from the definition of Σ_S that

ACKNOWLEDGMENTS

The work of Yunwen Lei is supported by the National Natural Science Foundation of China (Grant No. 61806091) and the Alexander von Humboldt Foundation. The work of Yiming Ying is supported by NSF grants IIS-1816227 and IIS-2008532. 

A STABILITY AND GENERALIZATION

We first give the definition of pointwise hypothesis stability. For any i ∈ [n], denote S\z_i = {z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_n}. Definition 5 (Pointwise Hypothesis Stability). We say a randomized algorithm A has pointwise hypothesis stability ε if for all i ∈ [n] there holds E_{S,A}[|f(A(S); z_i) − f(A(S\z_i); z_i)|] ≤ ε. Theorem A.1 establishes the key connection between generalization and various stability measures. Parts (a) and (b) show that an algorithm with either uniform stability or pointwise hypothesis stability generalizes well to testing examples (Bousquet & Elisseeff, 2002). Initially, these results were developed for deterministic algorithms (Bousquet & Elisseeff, 2002), and were then extended to the setting of randomized algorithms (Elisseeff et al., 2005). Part (c) shows the connection between generalization and on-average stability (Shalev-Shwartz et al., 2010). Note that part (b) involves a square root of 1/δ instead of a log(1/δ).
Published as a conference paper at ICLR 2021
Then the function F_S can be written as … where A = (φ(x_1), . . . , φ(x_n))^⊤ ∈ R^{n×m} is the matrix formed from the data. It is known that if g is σ_g-strongly convex, then F_S satisfies the PL condition (Karimi et al., 2016; Necoara et al., 2018) … Since the loss function is σ-strongly convex, we know for any … That is, g is (σ/n)-strongly convex. This together with (E.1) shows that … where we have used … This together with (E.4) shows (with σ_min replaced by σ_max)
‖∇F_S(w) − ∇F_S(w')‖_2 ≤ L σ_max(Σ_S)‖w − w'‖_2. (E.5)
It is reasonable to assume that L is of the order of the smoothness of F_S. In this case, it follows from (E.3) and (E.5) that the empirical counterpart of L/β is of the order of σ_max(Σ_S)/σ_min(Σ_S).

E.2 DISCUSSION OF EXAMPLE 2

We recall some notation from Example 2. Consider single-hidden-layer neural networks with d inputs, m hidden neurons and a single output, for which the prediction function takes the form h_{v,w}(x) = Σ_{k=1}^m v_k φ(⟨w_k, x⟩). Here w_k ∈ R^d and v_k ∈ R denote the weights of the edges connecting the k-th hidden node to the input and output nodes, respectively, while φ : R → R is the activation function. We fix v with |v_k| = a for some a > 0 and train w = (w_1, w_2, . . . , w_m) ∈ R^{m×d} from S. Note we only use the PL condition of F_{S^{(i)}} in the proof of Theorem 1 (only in (B.6)). We fix w = w_{S^{(i)}}^{(S^{(i)})} here. Analogous to Soltanolkotabi et al. (2019), we define the Jacobian matrix J = (J_1, J_2, . . . , J_n) ∈ R^{md×n} at w = w_{S^{(i)}}^{(S^{(i)})}, with J_j = (v_1 φ'(w_1^⊤ x_j)x_j^⊤, . . . , v_m φ'(w_m^⊤ x_j)x_j^⊤)^⊤ and r_j = v^⊤φ(w x_j) − y_j for j ∈ [n]. It was shown in Soltanolkotabi et al. (2019) that
∇F_S(w) = (1/n) Jr, for r = (r_1, . . . , r_n)^⊤, (E.6)
and therefore
‖∇F_S(w)‖_2^2 = (1/n^2) r^⊤ J^⊤ J r = (1/n^2) r^⊤ (J_j^⊤ J_{j'})_{j,j'∈[n]} r. (E.7)

