BEYOND LIPSCHITZ: SHARP GENERALIZATION AND EXCESS RISK BOUNDS FOR FULL-BATCH GD

Abstract

We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass the Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error on the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (up to one order of error magnitude). Consequently, the excess risk matches that of SGD with quadratically fewer iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as the state of the art on SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).

1. INTRODUCTION

Gradient-based learning (Lecun et al., 1998) is a well-established topic with a large body of literature on algorithmic generalization and optimization errors. For general smooth convex losses, optimization error guarantees have long been well known (Nesterov, 1998). Similarly, Absil et al. (2005) and Lee et al. (2016) showed convergence of Gradient Descent (GD) to minimizers and local minima for smooth nonconvex functions. More recently, Chatterjee (2022), Liu et al. (2022) and Allen-Zhu et al. (2019) established global convergence of GD for deep neural networks under appropriate conditions. Generalization error analysis of stochastic training algorithms has recently gained increased attention. Hardt et al. (2016) showed uniform stability final-iterate bounds for vanilla Stochastic Gradient Descent (SGD). More recent works have developed alternative generalization error bounds with probabilistic guarantees (Feldman & Vondrak, 2018; 2019; Madden et al., 2020; Klochkov & Zhivotovskiy, 2021) and data-dependent variants (Kuzborskij & Lampert, 2018), or under weaker assumptions such as strongly quasi-convex (Gower et al., 2019), non-smooth convex (Feldman, 2016; Bassily et al., 2020; Lei & Ying, 2020b; Lei et al., 2021a), and pairwise losses (Lei et al., 2021b; 2020). In the nonconvex case, Zhou et al. (2021b) provide bounds that involve the on-average variance of the stochastic gradients. The generalization performance of other algorithmic variants has lately gained further attention, including SGD with early momentum (Ramezani-Kebrya et al., 2021), randomized coordinate descent (Wang et al., 2021c), look-ahead approaches (Zhou et al., 2021a), noise injection methods (Xing et al., 2021), and stochastic gradient Langevin dynamics (Pensia et al., 2018; Mou et al., 2018; Li et al., 2020; Negrea et al., 2019; Zhang et al., 2021; Farghly & Rebeschini, 2021; Wang et al., 2021a; b).
Excess Risk Upper Bounds: GD vs SGD

Algorithm | Step Size | Iterations | Interpolation | Bound | $\beta$-Smooth Loss
GD (this work) | $\eta_t = 1/2\beta$ | $T = \sqrt{n}$ | No | $O(1/\sqrt{n})$ | Convex
SGD (Lei & Ying, 2020b) | $\eta_t = 1/\sqrt{T}$ | $T = n$ | No | $O(1/\sqrt{n})$ | Convex
GD (this work) | $\eta_t = 1/2\beta$ | $T = n$ | Yes | $O(1/n)$ | Convex
SGD (Lei & Ying, 2020b) | $\eta_t = 1/2\beta$ | $T = n$ | Yes | $O(1/n)$ | Convex
GD (this work) | $\eta_t = 2/(\beta+\gamma)$ | $T = \Theta(\log n)$ | No | $O(\log(n)/n)$ | $\gamma$-Strongly Convex (Objective)
SGD (Lei & Ying, 2020b) | $\eta_t = 2/((t+t_0)\gamma)$ | $T = \Theta(n)$ | No | $O(1/n)$ | $\gamma$-Strongly Convex (Objective)

Table 1: Comparison of the excess risk bounds for the full-batch GD and SGD algorithms by Lei & Ying (2020b, Corollary 5 & Theorem 11). We denote by $n$ the number of samples, $T$ the total number of iterations, $\eta_t$ the step size at time $t$, and by $\epsilon_c \triangleq \mathbb{E}[R_S(W^*_S)]$ the interpolation error.

Even though many previous works consider stochastic training algorithms, and some even suggest that stochasticity may be necessary for good generalization (Hardt et al., 2016; Charles & Papailiopoulos, 2018), recent empirical studies have demonstrated that deterministic algorithms can indeed generalize well; see, e.g., (Hoffer et al., 2017; Geiping et al., 2022). In fact, Hoffer et al. (2017) showed empirically that for a large enough number of iterations full-batch GD generalizes comparably to SGD. Similarly, Geiping et al. (2022) experimentally showed that strong generalization behavior persists in the absence of stochastic sampling. Such interesting empirical evidence reasonably raises the following question: "Are there problem classes for which deterministic training generalizes more efficiently than stochastic training?" While prior works provide extensive analysis of the generalization error and excess risk of stochastic gradient methods, tight and path-dependent generalization error and excess risk guarantees for nonstochastic training (on general smooth losses) remain unexplored.
Our main purpose in this work is to theoretically establish that full-batch GD indeed generalizes efficiently for general smooth losses. While SGD appears to generalize better than full-batch GD for non-smooth Lipschitz convex losses (Bassily et al., 2020; Amir et al., 2021), non-smoothness itself seems to be problematic for efficient algorithmic generalization; indeed, the tightness analysis for non-smooth losses (Bassily et al., 2020) shows that the generalization error bounds become vacuous for standard step-size choices. Our work shows that for general smooth losses, full-batch GD achieves either tighter stability and excess risk rates than SGD (convex case), or equivalent rates but with a significantly shorter training horizon (strongly convex objective).

2. RELATED WORK AND CONTRIBUTIONS

Let $n$ denote the number of available training samples (examples). Recent results (Lei & Tang, 2021; Zhou et al., 2022) on SGD provided bounds of order $O(1/\sqrt{n})$ for Lipschitz and smooth nonconvex losses. Neu et al. (2021) also provided generalization bounds of order $O(\eta T/\sqrt{n})$, with $T = \sqrt{n}$ and step size $\eta = 1/T$ to recover the rate $O(1/\sqrt{n})$. In contrast, we show that full-batch GD generalizes efficiently for appropriate choices of decreasing learning rate that guarantee faster convergence and smaller generalization error simultaneously. Additionally, the generalization error involves an intrinsic dependence on the set of stationary points and the initial point. Specifically,

Full-Batch Gradient Descent

Step Size | Excess Risk | Loss
$\eta_t \le c/(\beta t),\ \forall c < 1$ | $C\,\frac{T^{c}\log(T)+1}{n} + \epsilon_{\mathrm{opt}}$ | Nonconvex
$\eta_t = 1/2\beta$ | $C\left(\frac{T\epsilon_c + 1}{n} + \frac{1}{T}\right)$ | Convex
$\eta_t = 1/2\beta,\ T = \sqrt{n}$ | $C\,\frac{\epsilon_c + 1}{\sqrt{n}} + O\!\left(\frac{1}{n}\right)$ | Convex
$\eta_t = 2/(\beta+\gamma)$ | $C\sqrt{\epsilon_c}\,\frac{\sqrt{T}}{n} + \exp\!\left(-\frac{4T\gamma}{\beta+\gamma}\right) + O\!\left(\frac{1}{n^{2}}\right)$ | $\gamma$-Strongly Convex
$\eta_t = 2/(\beta+\gamma),\ T = \frac{(\beta+\gamma)\log n}{2\gamma}$ | $C\sqrt{\epsilon_c}\,\frac{\log(n)}{n} + O\!\left(\frac{1}{n^{2}}\right)$ | $\gamma$-Strongly Convex

Table 2: A list of excess risk bounds for full-batch GD up to a constant factor $C > 0$. We denote the number of samples by $n$; "$\epsilon_{\mathrm{opt}}$" denotes the optimization error $\epsilon_{\mathrm{opt}} \triangleq \mathbb{E}[R_S(A(S)) - R^*_S]$, $T$ is the total number of iterations, and $\epsilon_c \triangleq \mathbb{E}[R_S(W^*_S)]$ is the model capacity (interpolation) error.

we show that full-batch GD with the decreasing learning rate choice $\eta_t = 1/2\beta t$ achieves tighter bounds of order $O(\sqrt{T\log(T)}/n)$ (since $\sqrt{T\log(T)}/n \le 1/\sqrt{n}$) for any $T \le n/\log(n)$. In fact, $O(\sqrt{T\log(T)}/n)$ essentially matches the rates in prior works (Hardt et al., 2016) for smooth and Lipschitz (and often bounded) losses; however, we assume only smoothness, at the expense of the $\log(T)$ term. Further, for convex losses we show that full-batch GD attains tighter generalization error and excess risk bounds than those of SGD in prior works (Lei & Ying, 2020b), or similar rates in comparison with prior works that require additional assumptions (Lipschitz or sub-Gaussian loss) (Hardt et al., 2016; Lugosi & Neu, 2022; Kozachkov et al., 2022). In fact, for convex losses and a fixed step size $\eta_t = 1/2\beta$, we show generalization error bounds of order $O(T/n)$ for non-Lipschitz losses, while the SGD bounds in prior work are of order $O(\sqrt{T/n})$ (Lei & Ying, 2020b). As a consequence, full-batch GD attains generalization error rates that are tighter by one order of error magnitude and appears to be more stable in the non-Lipschitz case; however, tightness guarantees for non-Lipschitz losses remain an open problem.
Our results also establish that full-batch GD provably achieves efficient excess risk rates with fewer iterations, as compared with the state-of-the-art excess risk guarantees for SGD. Specifically, for convex losses with limited model capacity (non-interpolation), we show that with constant step size and $T = \sqrt{n}$, the excess risk is of order $O(1/\sqrt{n})$, while the SGD algorithm requires $T = n$ to achieve excess risk of order $O(1/\sqrt{n})$ (Lei & Ying, 2020b, Corollary 5 a)). For $\gamma$-strongly convex objectives, our analysis of full-batch GD relies on a leave-one-out $\gamma_{\mathrm{loo}}$-strong convexity of the objective instead of strong convexity of the full loss function. This property relaxes strong convexity while providing stability and generalization error guarantees that recover the convex-loss setting when $\gamma_{\mathrm{loo}} \to 0$. Prior work (Lei & Ying, 2020b, Section 6, Stability with Relaxed Strong Convexity) requires a Lipschitz loss, and the corresponding bound diverges as $\gamma \to 0$, in contrast to the leave-one-out approach. Further, prior guarantees for SGD (Lei & Ying, 2020b, Theorem 11 and Theorem 12) often achieve the same rate of $O(1/n)$, however with $T = \Theta(n)$ iterations (and a Lipschitz loss), in contrast with our full-batch GD bound that requires only $T = \Theta(\log n)$ iterations (at the expense of a $\log(n)$ factor). Finally, our approach does not require a projection step in the update rule (in contrast to Hardt et al. (2016); Lei & Ying (2020b)) and consequently avoids dependencies on possibly large Lipschitz constants. In summary, we show that for smooth nonconvex, convex and strongly convex losses, full-batch GD generalizes, which provides an explanation of its good empirical performance in practice (Hoffer et al., 2017; Geiping et al., 2022). We refer the reader to Table 1 for an overview and comparison of our excess risk bounds and those of prior work (on SGD).
A more detailed presentation of the bounds appears in Table 2 (see also Appendix A, Table 3 and Table 4 ).

3. PROBLEM STATEMENT

Let $f(w, z)$ be the loss at a point $w \in \mathbb{R}^d$ for some example $z \in \mathcal{Z}$. Given a dataset $S \triangleq \{z_i\}_{i=1}^{n}$ of i.i.d. samples $z_i$ from an unknown distribution $\mathcal{D}$, our goal is to find the parameters $w^*$ of a learning model such that $w^* \in \arg\min_w R(w)$, where $R(w) \triangleq \mathbb{E}_{Z \sim \mathcal{D}}[f(w, Z)]$ and $R^* \triangleq R(w^*)$. Since the distribution $\mathcal{D}$ is not known, we consider the empirical risk $R_S(w) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$. The corresponding empirical risk minimization (ERM) problem is to find $W^*_S \in \arg\min_w R_S(w)$ (assuming for simplicity that empirical minimizers exist), and we define $R^*_S \triangleq R_S(W^*_S)$. For a deterministic algorithm $A$ with input $S$ and output $A(S)$, the excess risk $\epsilon_{\mathrm{excess}}$ is bounded by the sum of the generalization error $\epsilon_{\mathrm{gen}}$ and the optimization error $\epsilon_{\mathrm{opt}}$ (Hardt et al., 2016, Lemma 5.1), (Dentcheva & Lin, 2022):

$$\epsilon_{\mathrm{excess}} \triangleq \mathbb{E}[R(A(S))] - R^* \le \underbrace{\mathbb{E}[R(A(S)) - R_S(A(S))]}_{\epsilon_{\mathrm{gen}}} + \underbrace{\mathbb{E}[R_S(A(S))] - \mathbb{E}[R_S(W^*_S)]}_{\epsilon_{\mathrm{opt}}}. \quad (2)$$

For the rest of the paper we assume that the loss is smooth and non-negative. These are the only globally required assumptions on the loss function.

Assumption 1 (β-Smooth Loss) The gradient of the loss function is $\beta$-Lipschitz: $\|\nabla_w f(w, z) - \nabla_u f(u, z)\|_2 \le \beta\|w - u\|_2$ for all $z \in \mathcal{Z}$.

Additionally, we define the interpolation error that will also appear in our results.

Definition 1 (Model Capacity/Interpolation Error) Define $\epsilon_c \triangleq \mathbb{E}[R_S(W^*_S)]$. In general $\epsilon_c \ge 0$ (non-negative loss). If the model has sufficiently large capacity, then $R_S(W^*_S) = 0$ for almost every $S \in \mathcal{Z}^n$; equivalently, $\epsilon_c = 0$.

In the next section we provide a general theorem for the generalization error that holds for any symmetric deterministic algorithm (e.g., full-batch gradient descent) and any smooth loss under memorization of the dataset.
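The ERM setup and the full-batch GD update can be made concrete in a few lines; a minimal numerical sketch, assuming a squared loss $f(w, z) = \frac{1}{2}(x^\top w - y)^2$ with $z = (x, y)$ (this concrete loss and the data below are illustrative stand-ins, not assumptions of the paper beyond smoothness and non-negativity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def empirical_risk(w):
    # R_S(w) = (1/n) * sum_i f(w; z_i), here with the squared loss
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_empirical_risk(w):
    return X.T @ (X @ w - y) / n

# Full-batch GD: W_{t+1} = W_t - eta_t * grad R_S(W_t).
beta = np.linalg.eigvalsh(X.T @ X / n).max()  # smoothness constant of R_S
eta = 1.0 / (2.0 * beta)                      # constant step used in the convex case
w = np.zeros(d)
for t in range(200):
    w = w - eta * grad_empirical_risk(w)

# The optimization error eps_opt = E[R_S(A(S)) - R_S(W*_S)] shrinks with T.
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
eps_opt = empirical_risk(w) - empirical_risk(w_star)
```

For this well-conditioned instance, 200 iterations at $\eta = 1/2\beta$ drive the optimization error essentially to zero, which is the regime in which the decomposition (2) is dominated by the generalization error.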

4. SYMMETRIC ALGORITHM AND SMOOTH LOSS

Consider the i.i.d. random variables $z_1, z_2, \dots, z_n, z'_1, z'_2, \dots, z'_n$ drawn from an unknown distribution $\mathcal{D}$, and the sets $S \triangleq (z_1, z_2, \dots, z_n)$ and $S^{(i)} \triangleq (z_1, z_2, \dots, z'_i, \dots, z_n)$ that differ in the $i$-th random element. Recall that an algorithm is symmetric if its output remains unchanged under permutations of the input vector. Then (Bousquet & Elisseeff, 2002, Lemma 7) shows that for any $i \in \{1, \dots, n\}$ and any symmetric deterministic algorithm $A$ the generalization error is $\epsilon_{\mathrm{gen}} = \mathbb{E}_{S^{(i)}, z_i}[f(A(S^{(i)}); z_i) - f(A(S); z_i)]$. Identically, we write $\epsilon_{\mathrm{gen}} = \mathbb{E}[f(A(S^{(i)}); z_i) - f(A(S); z_i)]$, where, for the rest of the paper, the expectation is over the random variables $z_1, \dots, z_n, z'_1, \dots, z'_n$. We define the model parameters $W_t$, $W^{(i)}_t$ evaluated at time $t$ with corresponding inputs $S$ and $S^{(i)}$. For brevity, we also provide the next definition.

Definition 2 We define the expected output stability as $\epsilon_{\mathrm{stab}(A)} \triangleq \mathbb{E}[\|A(S) - A(S^{(i)})\|_2^2]$ and the expected optimization error as $\epsilon_{\mathrm{opt}} \triangleq \mathbb{E}[R_S(A(S)) - R_S(W^*_S)]$.

We continue by providing an upper bound that connects the generalization error with the expected output stability and the expected optimization error at the final iterate of the algorithm.

Theorem 3 (Generalization Error) Let $f(\cdot\,; z)$ be a non-negative $\beta$-smooth loss for any $z \in \mathcal{Z}$. For any symmetric deterministic algorithm $A(\cdot)$ the generalization error is bounded as

$$|\epsilon_{\mathrm{gen}}| \le 2\sqrt{2\beta(\epsilon_{\mathrm{opt}} + \epsilon_c)\,\epsilon_{\mathrm{stab}(A)}} + 2\beta\epsilon_{\mathrm{stab}(A)},$$

where $\epsilon_{\mathrm{stab}(A)} \triangleq \mathbb{E}[\|A(S) - A(S^{(i)})\|_2^2]$. In the limited model capacity case, $\epsilon_c$ is positive (and independent of $n$ and $T$) and $|\epsilon_{\mathrm{gen}}| = O(\sqrt{\epsilon_{\mathrm{stab}(A)}})$. We provide the proof of Theorem 3 in Appendix B.1. The generalization error bound of Theorem 3 holds for any symmetric algorithm and smooth loss.
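The quantity $\epsilon_{\mathrm{stab}(A)}$ can be probed numerically by running full-batch GD on $S$ and on a neighboring set $S^{(i)}$; a sketch, using a hypothetical squared loss and synthetic data (only the quantity being measured comes from Definition 2, the rest is an assumed setup):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 40, 3, 100

def gd(X, y, T, eta):
    # full-batch GD on the empirical risk of a squared loss
    w = np.zeros(X.shape[1])
    for _ in range(T):
        w -= eta * X.T @ (X @ w - y) / len(y)
    return w

def one_run():
    X = rng.normal(size=(n, d))
    y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
    beta = np.linalg.eigvalsh(X.T @ X / n).max()
    w_S = gd(X, y, T, 1 / (2 * beta))
    # Replace example i = 0 by an independent copy z'_0 to form S^(i).
    Xp, yp = X.copy(), y.copy()
    Xp[0] = rng.normal(size=d)
    yp[0] = Xp[0] @ np.ones(d) + 0.1 * rng.normal()
    w_Si = gd(Xp, yp, T, 1 / (2 * beta))
    return np.sum((w_S - w_Si) ** 2)  # ||A(S) - A(S^(i))||^2

# Monte-Carlo estimate of eps_stab = E||A(S) - A(S^(i))||^2
eps_stab = np.mean([one_run() for _ in range(20)])
```

Theorem 3 then turns such a stability estimate, together with $\epsilon_{\mathrm{opt}}$ and $\epsilon_c$, into a bound on $|\epsilon_{\mathrm{gen}}|$; changing one of $n = 40$ examples moves the GD output only slightly, so the estimate is small.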
Theorem 3 constitutes the tightest variant of (Lei & Ying, 2020b, Theorem 2 b)) and shows that the expected output stability together with a small expected optimization error at termination suffices to bound the generalization error for smooth (possibly non-Lipschitz) losses. Further, the optimization error term $\epsilon_{\mathrm{opt}}$ is always bounded and goes to zero (with specific known rates) in the cases of (strongly) convex losses. Under the interpolation condition the generalization error satisfies tighter bounds. For a small number of iterations $T$ the above error rate is equivalent to that of Theorem 3. For sufficiently large $T$ the optimization error rate matches the expected output stability and provides a tighter rate (with respect to that of Theorem 3) of $|\epsilon_{\mathrm{gen}}| = O(\epsilon_{\mathrm{stab}(A)})$.

Remark. As a byproduct of Theorem 3, one can show generalization and excess risk bounds for a uniformly $\mu$-PL objective (Karimi et al., 2016), defined through $\mathbb{E}[\|\nabla R_S(w)\|_2^2] \ge 2\mu\,\mathbb{E}[R_S(w) - R^*_S]$ for all $w \in \mathbb{R}^d$. Let $\pi_S \triangleq \pi(A(S))$ be the projection of the point $A(S)$ onto the set of minimizers of $R_S(\cdot)$. Further, define the constant $c \triangleq \mathbb{E}[R_S(\pi_S) + R(\pi_S)]$. Then a bound on the excess risk is (the proof appears in Appendix D)

$$\epsilon_{\mathrm{excess}} \le \frac{8\beta\sqrt{c}}{n\mu}\sqrt{\epsilon_{\mathrm{opt}} + \epsilon_c} + \frac{8\sqrt{2\beta\epsilon_{\mathrm{opt}}\epsilon_c}}{\sqrt{\mu}} + \frac{16c\beta^2}{n^2\mu^2} + \frac{45\beta}{\mu}\epsilon_{\mathrm{opt}}. \quad (5)$$

We note that a closely related bound has been shown in (Lei & Ying, 2020a). In fact, (Lei & Ying, 2020a, Theorem 7) simultaneously requires the interpolation error to be zero ($\epsilon_c = 0$) and an additional assumption, namely the inequality $\beta \le n\mu/4$, to hold. However, if $\beta \le n\mu/4$ and $\epsilon_c = 0$ (interpolation assumption), then (Lei & Ying, 2020a, inequality (B.13), proof of Theorem 1) implies that ($\mathbb{E}[R(\pi_S)] \le 3\mathbb{E}[R_S(\pi_S)]$ and) the expected population risk at $\pi_S$ is zero, i.e., $\mathbb{E}[R(\pi_S)] = 0$.
Such a situation is apparently trivial, since the population risk is zero at the empirical minimizer $\pi_S \in \arg\min R_S(\cdot)$. On the other hand, if $\beta > n\mu/4$, these bounds become vacuous. A PL condition is interesting under the interpolation regime, and since the PL condition is not uniform (with respect to the dataset) in practice (Liu et al., 2022), it is reasonable to consider bounds similar to (5) as trivial.

5. FULL-BATCH GD

In this section, we derive generalization error and excess risk bounds for the full-batch GD algorithm. We start by providing the definition of the expected path error $\epsilon_{\mathrm{path}}$, in addition to the optimization error $\epsilon_{\mathrm{opt}}$. These quantities will prominently appear in our analysis and results.

Definition 5 (Path Error) For any $\beta$-smooth (possibly nonconvex) loss, learning rate $\eta_t$, and for any $i \in \{1, \dots, n\}$, we define the expected path error as $\epsilon_{\mathrm{path}} \triangleq \sum_{t=1}^{T} \eta_t\,\mathbb{E}[\|\nabla f(W_t, z_i)\|_2^2]$.

The $\epsilon_{\mathrm{path}}$ term expresses the path-dependent quantity that appears in the generalization bounds in our results. Additionally, as we show, the generalization error also depends on the average optimization error $\epsilon_{\mathrm{opt}}$ (Theorem 3). A consequence of the dependence on $\epsilon_{\mathrm{opt}}$ is that full-batch GD generalizes when it reaches the neighborhoods of the loss minima. Essentially, the expected path error and optimization error replace bounds in prior works (Hardt et al., 2016; Kozachkov et al., 2022) that require a Lipschitz loss assumption to upper bound the gradients, and substitute the Lipschitz constant with tighter quantities. Later we show the dependence of the expected output stability term in Theorem 3 on the expected path error. Then, through explicit rates for both $\epsilon_{\mathrm{path}}$ and $\epsilon_{\mathrm{opt}}$, we characterize the generalization error and excess risk.
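The path error of Definition 5 accumulates along the optimization trajectory; a sketch for a single index $i$, again on a hypothetical squared loss with synthetic data (the loss and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, T = 30, 4, 50
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                    # interpolation: y in the range of X
beta = np.linalg.eigvalsh(X.T @ X / n).max()  # smoothness constant
eta = 1 / (2 * beta)

w = np.zeros(d)
i = 0
eps_path = 0.0
for t in range(T):
    # per-example gradient of f(w; z_i) for the squared loss
    g_i = (X[i] @ w - y[i]) * X[i]
    eps_path += eta * np.sum(g_i ** 2)        # eta_t * ||grad f(W_t, z_i)||^2
    w -= eta * X.T @ (X @ w - y) / n          # full-batch GD step
```

As GD approaches a minimizer the per-example gradients shrink, so the running sum `eps_path` saturates; this is the mechanism by which the path error stays bounded without any Lipschitz assumption.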

5.1. NONCONVEX LOSS

We proceed with the average output stability and generalization error bounds for nonconvex smooth losses. Through a stability error bound, the next result connects Theorem 3 with the expected path error and the corresponding learning rate. Then we use that expression to derive generalization error bounds for full-batch GD in the case of nonconvex losses.

Theorem 6 (Stability Error - Nonconvex Loss) Assume that the (possibly) nonconvex loss $f(\cdot, z)$ is $\beta$-smooth for all $z \in \mathcal{Z}$. Consider full-batch GD, where $T$ denotes the total number of iterates and $\eta_t$ denotes the learning rate, for all $t \le T+1$. Then for the outputs of the algorithm $W_{T+1} \equiv A(S)$, $W^{(i)}_{T+1} \equiv A(S^{(i)})$ it is true that

$$\epsilon_{\mathrm{stab}(A)} \le \frac{4\epsilon_{\mathrm{path}}}{n^2} \sum_{t=1}^{T} \eta_t \prod_{j=t+1}^{T} (1 + \beta\eta_j)^2.$$

The expected output stability in Theorem 6 is bounded by the product of the expected path error (Definition 5), a sum-product term $\sum_{t=1}^{T} \eta_t \prod_{j=t+1}^{T}(1 + \beta\eta_j)^2$ that depends only on the step size, and the term $4/n^2$ that provides the dependence on the sample complexity. In light of Theorem 3 and Theorem 6, we derive the generalization error of full-batch GD for smooth nonconvex losses.

Theorem 7 (Generalization Error - Nonconvex Loss) Assume that the loss $f(\cdot, z)$ is $\beta$-smooth for all $z \in \mathcal{Z}$. Consider full-batch GD, where $T$ denotes the total number of iterates, and the learning rate is chosen as $\eta_t \le C/t \le 1/\beta$, for all $t \le T+1$. Let $\epsilon \triangleq \beta C < 1$ and $\mathrm{C}(\epsilon, T) \triangleq \min\{\epsilon + 1/2,\ \epsilon\log(eT)\}$. Then the generalization error of full-batch GD is bounded by

$$|\epsilon_{\mathrm{gen}}| \le \frac{4\sqrt{2}}{n}\sqrt{(\epsilon_{\mathrm{opt}} + \epsilon_c)\epsilon_{\mathrm{path}}}\,(eT)^{\epsilon}\,\mathrm{C}^{1/2}(\epsilon, T) + \frac{8\epsilon_{\mathrm{path}}}{n^2}(eT)^{2\epsilon}\,\mathrm{C}(\epsilon, T) \le \frac{4\sqrt{3}(eT)^{\epsilon}}{n}\sqrt{(\epsilon_{\mathrm{opt}} + \epsilon_c)\epsilon_{\mathrm{path}}} + \frac{12(eT)^{2\epsilon}}{n^2}\epsilon_{\mathrm{path}}. \quad (8)$$

Additionally, by the definitions of the expected path and optimization errors, and from the descent direction of the algorithm, we evaluate upper bounds on the terms $\epsilon_{\mathrm{path}}$ and $\epsilon_{\mathrm{opt}}$ and derive the next bound as a byproduct of Theorem 7.
Corollary 8 The generalization error of full-batch GD in Theorem 7 can be further bounded as

$$|\epsilon_{\mathrm{gen}}| \le \left(\frac{8\sqrt{3}}{n}\sqrt{\log(eT)}\,(eT)^{\epsilon} + \frac{48}{n^2}\log(eT)(eT)^{2\epsilon}\right)\mathbb{E}[R_S(W_1)].$$

The inequality (8) in Theorem 7 shows the explicit dependence of the generalization error bound on the path-dependent error $\epsilon_{\mathrm{path}}$ and the optimization error $\epsilon_{\mathrm{opt}}$. Note that during the training process the path-dependent error increases, while the optimization error decreases. Both terms $\epsilon_{\mathrm{path}}$ and $\epsilon_{\mathrm{opt}}$ may be upper bounded to obtain the simplified (but potentially looser) bound that appears in Corollary 8. We prove Theorem 6, Theorem 7 and Corollary 8 in Appendix C. Finally, the generalization error in Corollary 8 matches bounds in prior work, including information-theoretic bounds for the SGLD algorithm (Wang et al., 2021b, Corollary 1) (with fixed step size), while our results do not require the sub-Gaussian loss assumption and show that similar generalization is achievable through deterministic training.

Remark. (Dependence on Stationary Points) Let $W_1$ be an arbitrary initial point (independent of $S$). Under mild assumptions (provided in (Lee et al., 2016)) GD converges to (local) minimizers. Let $W^*_{S,W_1}$ be the stationary point such that $A(S) \to W^*_{S,W_1}$ as $T \to \infty$. Then, through the smoothness of the loss, we derive an alternative form of the generalization error bound in Theorem 3 that expresses the dependence of the generalization error on the quality of the set of stationary points, i.e.,

$$|\epsilon_{\mathrm{gen}}| \le 4\sqrt{\beta}\sqrt{\left(\beta\mathbb{E}[\|A(S) - W^*_{S,W_1}\|_2^2] + \mathbb{E}[R_S(W^*_{S,W_1})]\right)\epsilon_{\mathrm{stab}(A)}} + 2\beta\epsilon_{\mathrm{stab}(A)}. \quad (10)$$

Inequality (10) provides a detailed bound that depends on the expected loss at the stationary point and the expected distance of the output from the stationary point, namely $\mathbb{E}[\|A(S) - W^*_{S,W_1}\|_2^2]$.
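The sum-product factor of Theorem 6 is the only step-size-dependent quantity in the nonconvex stability bound, and the decaying schedule $\eta_t = C/(\beta t)$ keeps it polynomial in $T$. A numerical sketch (with assumed constants $\beta$ and $C$) checking it against the $\log(eT)(eT)^{2\epsilon}$-type control used in Theorem 7, via $\prod_{j=t+1}^{T}(1 + \beta\eta_j)^2 \le \exp(2C\sum_{j>t} 1/j)$:

```python
import math

beta, C = 1.0, 0.5      # assumed illustrative values; eps = beta*C = 0.5 < 1
eps = beta * C

def sum_product(T):
    # S(T) = sum_{t=1}^T eta_t * prod_{j=t+1}^T (1 + beta*eta_j)^2
    # for the decaying schedule eta_t = C/(beta*t) of Theorem 7.
    total = 0.0
    for t in range(1, T + 1):
        prod = 1.0
        for j in range(t + 1, T + 1):
            prod *= (1 + C / j) ** 2
        total += (C / (beta * t)) * prod
    return total

def log_poly_bound(T):
    # (C/beta) * log(eT) * (eT)^(2*eps): the polynomial-in-T envelope.
    return (C / beta) * math.log(math.e * T) * (math.e * T) ** (2 * eps)

s_100 = sum_product(100)  # grows roughly linearly in T for eps = 1/2
```

For example, `sum_product(10)` is about 5.4 while `log_poly_bound(10)` is about 45, and the gap persists at larger horizons, consistent with the $(eT)^{2\epsilon}\log(eT)$ control in the simplified bound of Corollary 8.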

5.2. CONVEX LOSS

In this section, we provide generalization error guarantees for GD on smooth convex losses. Starting from the stability of the output of the algorithm, we show that the dependence on the learning rate is weaker than in the nonconvex case. This dependence, together with the fast convergence to the minimum, guarantees tighter generalization error bounds than the general case of nonconvex losses in Section 5.1. The generalization error and the corresponding optimization error bounds provide an excess risk bound through the error decomposition (2). We refer the reader to Table 2 for a summary of the excess risk guarantees. We continue by providing the stability bound for convex losses.

Theorem 9 (Stability Error - Convex Loss) Assume that the convex loss $f(\cdot, z)$ is $\beta$-smooth for all $z \in \mathcal{Z}$. Consider full-batch GD, where $T$ denotes the total number of iterates and $\eta_t \le 1/2\beta$ the learning rate, for all $t \le T+1$. Then for the outputs of the algorithm $W_{T+1} \equiv A(S)$, $W^{(i)}_{T+1} \equiv A(S^{(i)})$ it is true that

$$\epsilon_{\mathrm{stab}(A)} \le \frac{4\epsilon_{\mathrm{path}}}{n^2}\sum_{t=1}^{T}\eta_t \le \frac{32\beta\sum_{t=1}^{T}\eta_t}{n^2}\left(\mathbb{E}[\|W_1 - W^*_S\|_2^2] + \epsilon_c\sum_{t=1}^{T}\eta_t\right). \quad (11)$$

In the convex case, the expected output stability (inequality (11)) is bounded by the product of the expected path error, the sample-dependent term $4/n^2$, and the accumulated learning rate; for standard step-size choices this yields $|\epsilon_{\mathrm{gen}}| = O(1/\sqrt{n})$. Further, GD admits much larger learning rates ($\eta_t = 1/\beta$), which provide not only a tighter generalization error bound but also tighter excess risk guarantees than SGD, as we later show. By combining Theorem 3 and Theorem 9, we show the next generalization error bound.

Theorem 10 (Generalization Error - Convex Loss) Let the loss function $f(\cdot, z)$ be convex and $\beta$-smooth for all $z \in \mathcal{Z}$. Consider full-batch GD, where $T$ denotes the total number of iterates. We choose the learning rate such that $\eta_t \le 1/2\beta$, for all $t \le T+1$. Then the generalization error of full-batch GD is bounded by

$$|\epsilon_{\mathrm{gen}}| \le \frac{4}{n}\sqrt{2\beta(\epsilon_{\mathrm{opt}} + \epsilon_c)\,\epsilon_{\mathrm{path}}\sum_{t=1}^{T}\eta_t} + \frac{8\beta\epsilon_{\mathrm{path}}}{n^2}\sum_{t=1}^{T}\eta_t. \quad (12)$$
We provide the proofs of Theorem 9 and Theorem 10 in Appendix E. Similarly to the nonconvex case (Theorem 7), the bound in Theorem 10 shows the explicit dependence of the generalization error on the number of samples $n$, the path-dependent term $\epsilon_{\mathrm{path}}$, and the optimization error $\epsilon_{\mathrm{opt}}$, as well as the effect of the accumulated learning rate. From inequality (12), we can proceed by deriving exact bounds on the optimization error and the accumulated learning rate, to find explicit expressions of the generalization error bound. Through Theorem 9, Theorem 10 (and Lemma 20 in Appendix E), we derive explicit generalization error bounds for certain choices of the learning rate. In fact, we consider the standard choice $\eta_t = 1/2\beta$ in the next result.

Theorem 11 (Generalization/Excess Error - Convex Loss) Let the loss function $f(\cdot, z)$ be convex and $\beta$-smooth for all $z \in \mathcal{Z}$. If $\eta_t = 1/2\beta$ for all $t \in \{1, \dots, T\}$, then

$$|\epsilon_{\mathrm{gen}}| \le 8\left(\frac{1}{n} + \frac{2T}{n^2}\right)\left(3\beta\mathbb{E}[\|W_1 - W^*_S\|_2^2] + T\epsilon_c\right),$$

and

$$\epsilon_{\mathrm{excess}} \le 8\left(\frac{1}{n} + \frac{2T}{n^2}\right)\left(3\beta\mathbb{E}[\|W_1 - W^*_S\|_2^2] + T\epsilon_c\right) + \frac{3\beta\mathbb{E}[\|W_1 - W^*_S\|_2^2]}{T}.$$

As a consequence, for $T = \sqrt{n}$ iterations the GD algorithm achieves $\epsilon_{\mathrm{excess}} = O(1/\sqrt{n})$. In contrast, SGD requires $T = n$ iterations to achieve $\epsilon_{\mathrm{excess}} = O(1/\sqrt{n})$ (Lei & Ying, 2020b, Corollary 5 a)). However, if $\epsilon_c = 0$, then both algorithms have the same excess risk rate of $O(1/n)$ through longer training with $T = n$ iterations. Finally, observe that the term $\mathbb{E}[\|W_1 - W^*_S\|_2^2]$ should be $O(1)$ and independent of the parameters of interest (for instance $n$) to derive the aforementioned rates.
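Instantiating the excess risk bound of Theorem 11 at $T = \sqrt{n}$ makes the $O(1/\sqrt{n})$ rate explicit; collecting terms:

```latex
\epsilon_{\mathrm{excess}}
  \le 8\Big(\frac{1}{n}+\frac{2\sqrt{n}}{n^{2}}\Big)
      \Big(3\beta\,\mathbb{E}\big[\lVert W_{1}-W^{*}_{S}\rVert_{2}^{2}\big]
           +\sqrt{n}\,\epsilon_{c}\Big)
      +\frac{3\beta\,\mathbb{E}\big[\lVert W_{1}-W^{*}_{S}\rVert_{2}^{2}\big]}{\sqrt{n}}
  = \frac{8\epsilon_{c}
          +3\beta\,\mathbb{E}\big[\lVert W_{1}-W^{*}_{S}\rVert_{2}^{2}\big]}{\sqrt{n}}
    + O\!\Big(\frac{1}{n}\Big),
```

so the interpolation term $\epsilon_c$ and the initialization distance dominate at rate $1/\sqrt{n}$, while every remaining cross term is $O(1/n)$ or smaller.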

5.3. STRONGLY-CONVEX OBJECTIVE

One common approach to enforce strong convexity is through explicit regularization. In such a case both the objective $R_S(\cdot)$ and the individual losses $f(\cdot\,; z)$ are strongly convex. In other practical scenarios, the objective is often strongly convex while the individual losses are not (Ma et al., 2018, Section 3). In this section, we show stability and generalization error guarantees that include the above cases by assuming a $\gamma$-strongly convex objective. We also show a property of full-batch GD that requires only a leave-one-out variant of the objective to be strongly convex. If the objective $R_S(\cdot)$ is $\gamma$-strongly convex and the loss $f(\cdot\,; z)$ is $\beta$-smooth, then the leave-one-out function $R_{S_{-i}}(w) \triangleq \frac{1}{n}\sum_{j=1, j\ne i}^{n} f(w; z_j)$ is $\gamma_{\mathrm{loo}}$-strongly convex for all $i \in \{1, \dots, n\}$, for some $\gamma_{\mathrm{loo}} \le \gamma$. Although $\gamma_{\mathrm{loo}}$ is slightly smaller than $\gamma$ ($\gamma_{\mathrm{loo}} = \max\{\gamma - \beta/n, 0\}$), our results reduce to the convex-loss generalization and stability bounds when $\gamma_{\mathrm{loo}} \to 0$. Further, the faster convergence also provides tighter bounds for the excess risk (see Table 2).

Theorem 12 (Stability Error - Strongly Convex Loss) Assume that the loss $f(\cdot, z)$ is $\beta$-smooth for all $z \in \mathcal{Z}$ and that $R_S(\cdot)$ is $\gamma$-strongly convex. Consider full-batch GD, where $T$ denotes the total number of iterates and $\eta_t \le 2/(\beta+\gamma)$ denotes the learning rate, for all $t \le T$. Then for the outputs of the algorithm $W_{T+1} \equiv A(S)$, $W^{(i)}_{T+1} \equiv A(S^{(i)})$ it is true that

$$\epsilon_{\mathrm{stab}(A)} \le \frac{4\epsilon_{\mathrm{path}}}{n^2}\sum_{t=1}^{T}\eta_t\prod_{j=t+1}^{T}(1 - \eta_j\gamma_{\mathrm{loo}}).$$

Specifically, if $\eta_t = 2/(\beta+\gamma)$, then

$$\epsilon_{\mathrm{stab}(A)} \le \frac{4\epsilon_{\mathrm{path}}}{n^2}\min\left\{\frac{1}{\gamma_{\mathrm{loo}}}, \frac{2T}{\beta}\right\}.$$

By comparing the stability guarantee of Theorem 9 with Theorem 12, we observe that the learning-rate-dependent (sum-product) term is smaller than that of the convex case. While the dependence on the expected path error ($\epsilon_{\mathrm{path}}$) is identical, we show (Appendix F) that the $\epsilon_{\mathrm{path}}$ term is smaller in the strongly convex case.
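The leave-one-out modulus $\gamma_{\mathrm{loo}}$ can be seen in one line; a sketch for twice-differentiable losses (the general smooth case follows analogously from the definitions of strong convexity and smoothness):

```latex
R_{S_{-i}}(w) \;=\; R_S(w)-\tfrac{1}{n}f(w;z_i)
\quad\Longrightarrow\quad
\nabla^{2}R_{S_{-i}}(w)
\;=\;\nabla^{2}R_S(w)-\tfrac{1}{n}\nabla^{2}f(w;z_i)
\;\succeq\;\gamma I-\tfrac{\beta}{n}I,
```

since $\gamma$-strong convexity of $R_S$ gives $\nabla^2 R_S \succeq \gamma I$ and $\beta$-smoothness of the loss gives $\nabla^2 f(\cdot\,; z_i) \preceq \beta I$. Hence $R_{S_{-i}}$ is $(\gamma - \beta/n)$-strongly convex whenever $\gamma > \beta/n$, which is exactly the stated $\gamma_{\mathrm{loo}} = \max\{\gamma - \beta/n, 0\}$.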
Additionally, Theorem 12 recovers the stability bounds of the convex loss case when $\gamma_{\mathrm{loo}} \to 0$ (and possibly $\gamma \to 0$). Similarly to the nonconvex and convex loss cases, Theorem 3 and the stability error bound in Theorem 12 provide the generalization error bound for strongly convex objectives.

Theorem 13 (Generalization Error - Strongly Convex Loss) Let the loss function $f(\cdot, z)$ be $\beta$-smooth for all $z \in \mathcal{Z}$ and the objective $R_S(\cdot)$ be $\gamma$-strongly convex. Consider full-batch GD, where $T$ denotes the total number of iterates. Let us set the learning rate to $\eta_t \le 2/(\beta+\gamma)$, for all $t \le T$. Then the generalization error of full-batch GD is bounded by

$$|\epsilon_{\mathrm{gen}}| \le \frac{4}{n}\sqrt{2\beta(\epsilon_{\mathrm{opt}} + \epsilon_c)\,\epsilon_{\mathrm{path}}\sum_{t=1}^{T}\eta_t\prod_{j=t+1}^{T}(1-\eta_j\gamma_{\mathrm{loo}})} + \frac{8\beta\epsilon_{\mathrm{path}}}{n^2}\sum_{t=1}^{T}\eta_t\prod_{j=t+1}^{T}(1-\eta_j\gamma_{\mathrm{loo}}).$$

We prove Theorem 12 and Theorem 13 in Appendix F. Recall that the sum-product term in the inequality of Theorem 13 is smaller than the sum of the learning rates in Theorem 10. This fact, together with the tighter optimization error bound, provides a smaller excess risk than that of the convex case. Similarly to the convex loss setting, we use known optimization error guarantees of full-batch GD for strongly convex objectives to derive explicit expressions of the generalization and excess risk bounds. By combining Theorem 13 and the optimization and path error bounds (Lemma 22 and Lemma 23 in Appendix F, and Lemma 15 in Appendix B.2), we derive our generalization error bound for fixed step size in the next result.

Theorem 14 (Generalization/Excess - Strongly Convex Loss) Let the objective function $R_S(\cdot)$ be $\gamma$-strongly convex and $\beta$-smooth (by choosing some $\beta$-smooth loss $f(\cdot, z)$, not necessarily (strongly) convex, for all $z \in \mathcal{Z}$). Define $m(\gamma_{\mathrm{loo}}, T) \triangleq \beta T\min\{\beta/\gamma_{\mathrm{loo}}, 2T\}/(\beta+\gamma)$ and $\mathrm{M}(W_1) \triangleq \max\{\beta\mathbb{E}[\|W_1 - W^*_S\|_2^2],\ \mathbb{E}[R_S(W^*_S)]\}$, and set the learning rate to $\eta_t = 2/(\beta+\gamma)$.
Then the generalization error of full-batch GD at the last iterate satisfies the inequality

$$|\epsilon_{\mathrm{gen}}| \le \frac{8\sqrt{6}}{n}\sqrt{\mathrm{M}(W_1)\,m(\gamma_{\mathrm{loo}}, T) + \exp\left(-\frac{2T\gamma}{\beta+\gamma}\right)} + \frac{4\sqrt{3}}{n}\sqrt{\mathrm{M}(W_1)\,m(\gamma_{\mathrm{loo}}, T)}.$$

Additionally, the optimization error (Lemma 23 in Appendix F) and inequality (2) give the following excess risk:

$$\epsilon_{\mathrm{excess}} \le \frac{8\sqrt{6}}{n}\sqrt{\Delta_T + \exp\left(-\frac{2T\gamma}{\beta+\gamma}\right)} + \frac{4\sqrt{3}}{n}\sqrt{\Delta_T} + \Lambda\exp\left(-\frac{4T\gamma}{\beta+\gamma}\right),$$

where $\Delta_T \triangleq \beta T\,\mathrm{M}(W_1)\min\{\beta/\gamma_{\mathrm{loo}}, 2T\}/(\beta+\gamma) = \mathrm{M}(W_1)\,m(\gamma_{\mathrm{loo}}, T)$ and $\Lambda \triangleq \beta\mathbb{E}[\|W_1 - W^*_S\|_2^2]/2$. Theorem 13 and Theorem 14 also recover the convex setting when $\gamma \to 0$ or $\gamma_{\mathrm{loo}} \to 0$. Additionally, for $\gamma > 0$, by setting the number of iterations to $T = (\beta/\gamma + 1)\log(n)/2$ and by defining the sequence $m_{n,\gamma_{\mathrm{loo}}} \triangleq \beta\min\{\beta/\gamma_{\mathrm{loo}}, (\beta/\gamma+1)\log n\}/2\gamma$, the last inequality gives

$$|\epsilon_{\mathrm{gen}}| \le \frac{8\sqrt{6}}{n}\sqrt{\log(n)\left(\mathrm{M}(W_1)\,m_{n,\gamma_{\mathrm{loo}}} + 1\right)} + \frac{4\sqrt{3}}{n}\sqrt{\mathrm{M}(W_1)\,m_{n,\gamma_{\mathrm{loo}}}\log(n)}.$$

Finally, for $T = (\beta/\gamma + 1)\log(n)/2$ iterations it is true that

$$\epsilon_{\mathrm{excess}} \le \frac{8\sqrt{6}}{n}\sqrt{\log(n)(\Gamma_n + 1)} + \frac{4\sqrt{3}}{n}\sqrt{\Gamma_n\log(n)} + O\left(\frac{1}{n^2}\right),$$

where $\Gamma_n \triangleq \beta\mathrm{M}(W_1)\min\{\beta/\gamma_{\mathrm{loo}}, (\beta/\gamma+1)\log n\}/2\gamma$, and as a consequence the excess risk is of order $O(\log(n)/n)$. As a comparison, the SGD algorithm (Lei & Ying, 2020b, Theorem 12) requires $T = n$ iterations to achieve an excess risk of order $O(1/n)$, while full-batch GD achieves essentially the same rate with $T = (\beta/\gamma + 1)\log(n)/2$ iterations.
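The exponential horizon saving is mechanical: at $T = (\beta/\gamma + 1)\log(n)/2 = (\beta+\gamma)\log(n)/(2\gamma)$ the optimization term $\exp(-2T\gamma/(\beta+\gamma))$ equals exactly $1/n$. A quick check with assumed illustrative constants $\beta$ and $\gamma$:

```python
import math

beta, gamma = 10.0, 1.0   # assumed illustrative constants

def horizon(n):
    # T = (beta/gamma + 1) * log(n) / 2 = (beta + gamma) * log(n) / (2 * gamma)
    return (beta + gamma) * math.log(n) / (2 * gamma)

def opt_decay(n):
    # exp(-2 * T * gamma / (beta + gamma)) at the horizon above
    T = horizon(n)
    return math.exp(-2 * T * gamma / (beta + gamma))

# opt_decay(n) == 1/n up to floating point, while horizon(n) grows only
# logarithmically, versus the Theta(n) SGD horizon of Lei & Ying (2020b).
```

For $n = 10^6$, the GD horizon is below 100 iterations, whereas the comparison SGD guarantee requires on the order of $10^6$ iterations for the same rate.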

6. CONCLUSION

In this paper we developed generalization error and excess risk guarantees for deterministic training on smooth losses via the full-batch GD algorithm. At the heart of our analysis is a sufficient condition for generalization, implying that, for every symmetric algorithm, average algorithmic output stability and a small expected optimization error at termination ensure generalization. By exploiting this sufficient condition, we explicitly characterized the generalization error in terms of the number of samples, the learning rate, the number of iterations, a path-dependent quantity, and the optimization error at termination, further exploring the generalization ability of full-batch GD for different types of loss functions. More specifically, we derived explicit rates on the generalization error and excess risk for nonconvex, convex and strongly convex smooth (possibly non-Lipschitz) losses/objectives. Our theoretical results shed light on recent empirical observations indicating that full-batch gradient descent generalizes efficiently, and that stochastic training procedures might not be necessary and in certain cases may even lead to higher generalization errors and excess risks.

Full-Batch Gradient Descent

Step Size | Generalization Error | Loss
$\eta_t \le C/(\beta t),\ \forall C < 1$ | $\frac{4e\sqrt{3}}{n}T^{C}\sqrt{(\epsilon_{\mathrm{opt}}+\epsilon_c)\epsilon_{\mathrm{path}}} + \frac{12e^{2}}{n^{2}}T^{2C}\epsilon_{\mathrm{path}}$ | NC
$\eta_t \le C/(\beta t),\ \forall C < 1$ | $\left(\frac{8\sqrt{3}}{n}\sqrt{\log(eT)}(eT)^{C} + \frac{48}{n^{2}}\log(eT)(eT)^{2C}\right)\mathbb{E}[R_S(W_1)]$ | NC
$\eta_t = 1/2\beta$ | $8\left(\frac{1}{n}+\frac{2T}{n^{2}}\right)\left(3\beta\mathbb{E}[\|W_1-W^*_S\|_2^2]+T\epsilon_c\right)$ | C
$\eta_t = 2/(\beta+\gamma)$ | $\frac{8\sqrt{6}}{n}\sqrt{\Delta_T+\exp\left(-\frac{2T\gamma}{\beta+\gamma}\right)} + \frac{4\sqrt{3}}{n}\sqrt{\Delta_T}$ | $\gamma$-SC

Table 3: A list of generalization error bounds for full-batch GD. We denote the number of samples by $n$; $W_1$ is the initial point of the algorithm, and $W^*_S$ is a point in the set of minimizers of the objective. Also, "$\epsilon_{\mathrm{path}}$" denotes the expected path error $\epsilon_{\mathrm{path}} \triangleq \sum_{t=1}^{T}\eta_t\mathbb{E}[\|\nabla f(W_t, z_i)\|_2^2]$, "$\epsilon_{\mathrm{opt}}$" denotes the optimization error $\epsilon_{\mathrm{opt}} \triangleq \mathbb{E}[R_S(A(S)) - R^*_S]$, $T$ is the total number of iterations, and we define the model capacity (interpolation) error as $\epsilon_c \triangleq \mathbb{E}[R_S(W^*_S)]$. We also define $\mathrm{M}(W_1) \triangleq \max\{\beta\mathbb{E}[\|W_1-W^*_S\|_2^2],\ \mathbb{E}[R_S(W^*_S)]\}$ and $\Delta_T \triangleq \beta T\mathrm{M}(W_1)\min\{\beta/\gamma_{\mathrm{loo}}, 2T\}/(\beta+\gamma)$. Lastly, "NC", "C" and "$\gamma$-SC" correspond to nonconvex, convex and $\gamma$-strongly convex objective, respectively.

A SUMMARY OF THE RESULTS

Herein, we present a summary of the generalization and excess risk bounds. The detailed expressions of the generalization and excess risk bounds appear in Tables 3 and 4.

B PROOFS

We provide the proofs of the results in the following sections. We start by proving Theorem 3 and the bounds on the sum-product terms that appear in the stability error bounds, and then we continue with the stability and generalization error guarantees, which we prove in parallel. We derive the excess risk bounds by applying the decomposition of the inequality (2).

B.1 PROOF OF THEOREM 3

It is true that for any i, j ∈ {1, …, n},

E[f(A(S); z_i)] = E[f(A(S); z_j)] = (1/n) Σ_{k=1}^n E[f(A(S); z_k)] = E[R_S(A(S))].    (20)

We show equation 20 through the symmetry of the algorithm (at each iteration) and the fact that {z_i}_{i=1}^n are identically distributed; the random variables {z_i}_{i=1}^n remain exchangeable.foot_3 The β-smoothness of f(· ; z) for all z ∈ Z gives

f(A(S^{(i)}); z) − f(A(S); z) ≤ ⟨A(S^{(i)}) − A(S), ∇f(A(S); z)⟩ + (β/2) ∥A(S^{(i)}) − A(S)∥₂².    (21)

Full-Batch Gradient Descent

Loss | Step size | Excess risk bound
NC | η_t ≤ C/(βt), ∀C < 1 | ((8√3/n) log(eT)(eT)^C + (48/n²) log(eT)(eT)^{2C}) E[R_S(W_1)] + ϵ_opt
C | η_t = 1/(2β) | (8/n + 16T/n²)(3β E[∥W_1 − W*_S∥₂²] + T ϵ_c) + 3β E[∥W_1 − W*_S∥₂²]/T
C | η_t = 1/(2β), T = √n | 8(ϵ_c + 3β E[∥W_1 − W*_S∥₂²])/√n + O(1/n)
γ-SC | η_t = 2/(β+γ) | (8√3/n)√(Δ_T + exp(−2Tγ/(β+γ))) + (4√3/n)√Δ_T + Λ exp(−4Tγ/(β+γ))
γ-SC | η_t = 2/(β+γ), T = (β+γ) log n/(2γ) | (8√3 log n/n)√(Γ_n + 1) + (4√3/n)√Γ_n + O(1/n²)

Table 4: A list of excess risk bounds for full-batch GD. We denote the number of samples by n. W_1 is the initial point of the algorithm, and W*_S is a point in the set of minimizers of the objective. "ϵ_path" denotes the expected path error ϵ_path ≜ Σ_{t=1}^T η_t E[∥∇f(W_t, z_i)∥₂²], "ϵ_opt" denotes the optimization error ϵ_opt ≜ E[R_S(A(S)) − R*_S], T is the total number of iterations, and we define the model capacity (interpolation) error as ϵ_c ≜ E[R_S(W*_S)]. We also define the constants Λ ≜ βE[∥W_1 − W*_S∥₂²]/2 and M(W_1) ≜ max{βE[∥W_1 − W*_S∥₂²], E[R_S(W*_S)]}, and the terms Γ_n ≜ βM(W_1) min{β/γ_loo, (β/γ + 1) log n}/(2γ) and Δ_T ≜ βT M(W_1) min{β/γ_loo, 2T}/(β+γ). Lastly, "NC", "C" and "γ-SC" correspond to nonconvex, convex and γ-strongly convex objectives, respectively.
The expression ϵ_gen = E[f(A(S^{(i)}); z_i) − f(A(S); z_i)] and the inequality (21) give

ϵ_gen ≤ E[⟨A(S^{(i)}) − A(S), ∇f(A(S); z_i)⟩] + (β/2) E[∥A(S^{(i)}) − A(S)∥₂²].    (22)

We find an upper bound for the expectation of the inner product in the inequality (22) by applying the Cauchy-Schwarz inequality:

E[⟨A(S^{(i)}) − A(S), ∇f(A(S); z_i)⟩] ≤ E[∥A(S^{(i)}) − A(S)∥₂ ∥∇f(A(S); z_i)∥₂]    (23)
≤ √(ϵ_stab(A)) √(E[∥∇f(A(S); z_i)∥₂²]),    (24)

where we use the inequalities ⟨a, b⟩ ≤ ∥a∥₂∥b∥₂ and E²[XY] ≤ E[X²]E[Y²] to derive the bounds in (23) and (24), respectively. By combining the inequalities (22) and (24) we find that for any i ∈ {1, …, n} it is true that

ϵ_gen ≤ √(ϵ_stab(A) E[∥∇f(A(S); z_i)∥₂²]) + (β/2) ϵ_stab(A).    (25)

To find an upper bound for |ϵ_gen|, we also need an upper bound for the negative of ϵ_gen, namely E[f(A(S); z_i) − f(A(S^{(i)}); z_i)] = −ϵ_gen. Note that by the same argument,

−ϵ_gen ≤ √(E[∥A(S) − A(S^{(i)})∥₂²] E[∥∇f(A(S^{(i)}); z_i)∥₂²]) + (β/2) E[∥A(S) − A(S^{(i)})∥₂²].    (26)

Then we find an upper bound on E[∥∇f(A(S^{(i)}); z_i)∥₂²] as follows:

E[∥∇f(A(S^{(i)}); z_i)∥₂²] = E[∥∇f(A(S^{(i)}); z_i) − ∇f(A(S); z_i) + ∇f(A(S); z_i)∥₂²]
≤ 2E[∥∇f(A(S^{(i)}); z_i) − ∇f(A(S); z_i)∥₂²] + 2E[∥∇f(A(S); z_i)∥₂²]
≤ 2β² E[∥A(S) − A(S^{(i)})∥₂²] + 2E[∥∇f(A(S); z_i)∥₂²].    (27)

The inequality (27) holds because of the β-smoothness of the loss. Additionally,

√(2β² E²[∥A(S) − A(S^{(i)})∥₂²] + 2E[∥A(S) − A(S^{(i)})∥₂²] E[∥∇f(A(S); z_i)∥₂²])
≤ √(2E[∥A(S) − A(S^{(i)})∥₂²] E[∥∇f(A(S); z_i)∥₂²]) + √2 β E[∥A(S) − A(S^{(i)})∥₂²].    (28)

We combine the inequalities (26), (27) and (28) to find

−ϵ_gen ≤ √(2E[∥A(S) − A(S^{(i)})∥₂²] E[∥∇f(A(S); z_i)∥₂²]) + 2β E[∥A(S) − A(S^{(i)})∥₂²].    (29)

Finally, through the inequalities (25) and (29) we find

|ϵ_gen| ≤ √(2 ϵ_stab(A) E[∥∇f(A(S); z_i)∥₂²]) + 2β ϵ_stab(A).    (30)

We use the self-bounding property of the non-negative β-smooth loss function f(·; z) (Srebro et al., 2010, Lemma 3.1) to show

∥∇f(A(S); z_i)∥₂² ≤ 4β f(A(S); z_i).    (31)
The last display, Assumption 1 and equation 20 give

E[∥∇f(A(S); z_i)∥₂²] ≤ 4β E[f(A(S); z_i)] = 4β (1/n) Σ_{i=1}^n E[f(A(S); z_i)] = 4β E[R_S(A(S))]    (32)
= 4β (E[R_S(A(S))] − E[R_S(W*_S)] + E[R_S(W*_S)]) = 4β (ϵ_opt + E[R_S(W*_S)]).    (33)

We combine the inequalities (30), (33) and Definition 1 to find

|ϵ_gen| ≤ 2√(2β (ϵ_opt + ϵ_c) ϵ_stab(A)) + 2β ϵ_stab(A).    (34)

The last inequality gives the bound on the generalization error and completes the proof. □
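The self-bounding property invoked in the final step of the proof, ∥∇f(w; z)∥₂² ≤ 4βf(w; z) for a non-negative β-smooth loss, can be sanity-checked numerically. The least-squares loss, data and constants below are illustrative assumptions for this check, not objects from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative non-negative beta-smooth loss: least squares on one sample,
# f(w) = 0.5 * (x @ w - y)**2, gradient (x @ w - y) * x,
# smoothness constant beta = ||x||^2.
x = rng.normal(size=5)
y = 1.3
beta = float(x @ x)

def f(w):
    return 0.5 * (x @ w - y) ** 2

def grad_f(w):
    return (x @ w - y) * x

# Self-bounding property (Srebro et al., 2010, Lemma 3.1):
# ||grad f(w)||^2 <= 4 * beta * f(w) at every point w.
ok = all(
    grad_f(w) @ grad_f(w) <= 4 * beta * f(w) + 1e-9
    for w in rng.normal(size=(1000, 5)) * 10
)
print(ok)
```

For this quadratic loss the property even holds with the sharper constant 2β, so the 4β bound of the lemma is comfortably satisfied at every sampled point.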

B.2 SUM PRODUCT TERMS IN THE STABILITY BOUNDS

Herein we show a lemma for the sum-product terms associated with the learning rate in Theorem 6 and Theorem 12. We then apply that lemma to derive the corresponding stability error bounds.

Lemma 15

The following are true:

• If η_t = C ≤ 2/(β+γ), then
Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ) = (1 − (1 − Cγ)^T)/γ.    (35)

• If η_t = C/t ≤ 2/(β+γ) for some C ≥ 2/γ when t ≥ 1 + ⌈β/γ⌉, and η_t = C′/t ≤ 2/(β+γ) for some C′ < 2/(γ+β) when t ≤ ⌈β/γ⌉, then
Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ/2) ≤ C log(e² ⌈β/γ⌉).    (36)

• If η_t ≤ C/t < 2/β, then
Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)² ≤ C e^{2Cβ} T^{2Cβ} min{1 + 1/(2Cβ), log(eT)}.    (37)

Proof.
• If η_t = C ≤ 2/(β+γ), then
Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ) = C Σ_{t=1}^T (1 − Cγ)^{T−t} = C (1 − Cγ)^T Σ_{t=1}^T (1 − Cγ)^{−t} = C (1 − (1 − Cγ)^T)/(Cγ) = (1 − (1 − Cγ)^T)/γ.

• If η_t = C/t ≤ 2/(β+γ) for some C ≥ 2/γ when t ≥ 1 + ⌈β/γ⌉, and η_t = C′/t ≤ 2/(β+γ) for some C′ < 2/(γ+β) when t ≤ ⌈β/γ⌉, then, since η_t γ/2 ≥ 1/t for t ≥ 1 + ⌈β/γ⌉,
Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ/2)
≤ Σ_{t=1}^{⌈β/γ⌉} (C′/t) Π_{j=t+1}^T (1 − C′γ/(2j)) + Σ_{t=1+⌈β/γ⌉}^T (C/t) Π_{j=t+1}^T (1 − 1/j)
= Σ_{t=1}^{⌈β/γ⌉} (C′/t) Π_{j=t+1}^T (1 − C′γ/(2j)) + Σ_{t=1+⌈β/γ⌉}^T (C/t)(t/T)
≤ Σ_{t=1}^{⌈β/γ⌉} C′/t + C (T − ⌈β/γ⌉)/T
≤ C (1 + log⌈β/γ⌉) + C (1 − ⌈β/γ⌉/T)
≤ C log(e² ⌈β/γ⌉).

• If η_t ≤ C/t < 2/β, then
Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)² = Σ_{t=1}^T (C/t) Π_{j=t+1}^T (1 + βC/j)²
≤ Σ_{t=1}^T (C/t) Π_{j=t+1}^T exp(2βC/j) = Σ_{t=1}^T (C/t) exp(2β Σ_{j=t+1}^T C/j)
≤ Σ_{t=1}^T (C/t) exp(2Cβ (log(T) + 1 − log(t+1)))
= C e^{2Cβ} T^{2Cβ} Σ_{t=1}^T 1/(t (t+1)^{2Cβ})
≤ C e^{2Cβ} T^{2Cβ} Σ_{t=1}^T 1/t^{1+2Cβ}    (38)
= C e^{2Cβ} T^{2Cβ} (1 + Σ_{t=2}^T 1/t^{1+2Cβ})
≤ C e^{2Cβ} T^{2Cβ} (1 + ∫_1^T x^{−(1+2Cβ)} dx)
= C e^{2Cβ} T^{2Cβ} (1 + (1 − T^{−2Cβ})/(2Cβ))
≤ C e^{2Cβ} T^{2Cβ} (1 + 1/(2Cβ)).    (39)

Additionally, Σ_{t=1}^T 1/t ≤ log(eT); thus the term in the inequality (38) may be upper bounded by C e^{2Cβ} T^{2Cβ} log(eT) for any T ∈ ℕ, and we conclude that

Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)² ≤ C e^{2Cβ} T^{2Cβ} min{1 + 1/(2Cβ), log(eT)}.    (40)

The last inequality completes the proof. □ In the next section we prove the stability and generalization error bounds for nonconvex losses.
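The third bullet of Lemma 15 admits a quick numerical sanity check: the sketch below evaluates both sides of the bound (37) on a grid of T and C, under the illustrative assumption β = 1 (the grid values are arbitrary):

```python
import math

def lhs(T, C, beta):
    # sum_{t=1}^T eta_t * prod_{j=t+1}^T (1 + beta*eta_j)^2  with eta_t = C/t
    total = 0.0
    for t in range(1, T + 1):
        prod = 1.0
        for j in range(t + 1, T + 1):
            prod *= (1 + beta * C / j) ** 2
        total += (C / t) * prod
    return total

def rhs(T, C, beta):
    # C * e^{2Cbeta} * T^{2Cbeta} * min{1 + 1/(2Cbeta), log(eT)}
    a = 2 * C * beta
    return C * math.exp(a) * T ** a * min(1 + 1 / a, 1 + math.log(T))

# check the bound on a grid (eta_t = C/t < 2/beta requires C < 2 here)
ok = all(lhs(T, C, 1.0) <= rhs(T, C, 1.0)
         for T in (1, 5, 50, 200) for C in (0.1, 0.5, 0.9))
print(ok)
```

The check mirrors the proof's structure: the left-hand side is the exact sum-product, and the right-hand side collects the exponential, polynomial and logarithmic factors of (37).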

C NONCONVEX LOSS: PROOF OF THEOREM 6 & THEOREM 7

Let z_1, z_2, …, z_i, …, z_n, z′_i be i.i.d. random variables, define S ≜ (z_1, z_2, …, z_i, …, z_n) and S^{(i)} ≜ (z_1, z_2, …, z′_i, …, z_n), and W_1 = W^{(i)}_1. The updates for any t ≥ 1 are

W_{t+1} = W_t − (η_t/n) Σ_{j=1}^n ∇f(W_t, z_j),
W^{(i)}_{t+1} = W^{(i)}_t − (η_t/n) Σ_{j=1, j≠i}^n ∇f(W^{(i)}_t, z_j) − (η_t/n) ∇f(W^{(i)}_t, z′_i).

Then for any t ≥ 1, we derive the stability recursion as

∥W_{t+1} − W^{(i)}_{t+1}∥₂ ≤ ∥W_t − W^{(i)}_t∥₂ + (η_t/n) ∥Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂ + (η_t/n) ∥∇f(W_t, z_i) − ∇f(W^{(i)}_t, z′_i)∥₂
≤ ∥W_t − W^{(i)}_t∥₂ + (η_t/n) Σ_{j≠i} ∥∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j)∥₂ + (η_t/n)(∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂)
≤ ∥W_t − W^{(i)}_t∥₂ + η_t ((n−1)/n) β ∥W_t − W^{(i)}_t∥₂ + (η_t/n)(∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂)    (43)
= (1 + ((n−1)/n) βη_t) ∥W_t − W^{(i)}_t∥₂ + (η_t/n)(∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂),    (44)

where the inequality (43) comes from the smoothness of the loss. Then by solving the recursion we find

∥W_{T+1} − W^{(i)}_{T+1}∥₂ ≤ (1/n) Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂) Π_{j=t+1}^T (1 + ((n−1)/n) βη_j)
≤ (1/n) Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂) Π_{j=t+1}^T (1 + βη_j)
≤ (1/n) √(Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂)²) √(Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)²)
≤ (√2/n) √(Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂² + ∥∇f(W^{(i)}_t, z′_i)∥₂²)) √(Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)²).

The last display gives

∥W_{T+1} − W^{(i)}_{T+1}∥₂² ≤ (2/n²) Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂² + ∥∇f(W^{(i)}_t, z′_i)∥₂²) Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)²,

and by taking the expectation we find

E[∥W_{T+1} − W^{(i)}_{T+1}∥₂²] ≤ (2/n²) Σ_{t=1}^T η_t (E[∥∇f(W_t, z_i)∥₂²] + E[∥∇f(W^{(i)}_t, z′_i)∥₂²]) Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)² ≤ (4ϵ_path/n²) Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)².    (45)

We evaluate the summation of the products in the inequality (45).
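Before taking expectations, the solved recursion says the trajectory gap on two neighboring datasets is dominated by a weighted sum of gradient norms along the path. A minimal simulation can check the unrolled recursion; the quadratic per-sample loss (so β = 1), the dataset sizes and the step sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T, beta = 8, 3, 30, 1.0
eta = [0.5 / (t + 1) for t in range(T)]   # eta_t <= C/t with C = 0.5 < 2/beta

# Illustrative beta-smooth loss (beta = 1): f(w, z) = 0.5 * ||w - z||^2.
S = rng.normal(size=(n, d))
S_i = S.copy()
S_i[0] = rng.normal(size=d)               # neighboring dataset: replace z_1

def grad(w, z):
    return w - z

W, Wp = np.zeros(d), np.zeros(d)
bound = 0.0
for t in range(T):
    # gradient norms at the replaced index, evaluated before the update
    g_i = np.linalg.norm(grad(W, S[0])) + np.linalg.norm(grad(Wp, S_i[0]))
    # unrolled recursion: inflate the running bound by (1 + beta*eta_t),
    # then add the (eta_t / n) * gradient-norm term of (44)
    bound = (1 + beta * eta[t]) * bound + eta[t] / n * g_i
    W = W - eta[t] * np.mean([grad(W, z) for z in S], axis=0)
    Wp = Wp - eta[t] * np.mean([grad(Wp, z) for z in S_i], axis=0)

ok = np.linalg.norm(W - Wp) <= bound + 1e-12
print(ok)
```

The simulated gap ∥W_{T+1} − W^{(i)}_{T+1}∥₂ always sits below the accumulated bound, as the recursion predicts.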
Lemma 15 under the choice of decreasing learning rate η_t ≤ C/t ≤ 2/β shows that

Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)² ≤ C e^{2Cβ} T^{2Cβ} min{1 + 1/(2Cβ), log(eT)}.    (46)

Through the inequalities (45), (46) and Theorem 3, we derive the bound on the generalization error as

|ϵ_gen| ≤ 2√(2β(ϵ_opt + ϵ_c) ϵ_stab(A)) + 2β ϵ_stab(A)
≤ (4/n) √(2β(ϵ_opt + ϵ_c) ϵ_path Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)²) + 8β (ϵ_path/n²) Σ_{t=1}^T η_t Π_{j=t+1}^T (1 + βη_j)²
≤ (4/n) √(2Cβ(ϵ_opt + ϵ_c) ϵ_path) e^{Cβ} T^{Cβ} min{1 + 1/(2Cβ), log(eT)}^{1/2} + 8Cβ (ϵ_path/n²) e^{2Cβ} T^{2Cβ} min{1 + 1/(2Cβ), log(eT)}.

Under the choice η_t ≤ C/t < 1/β for all t, we choose C < 1/β; further, we define ϵ ≜ βC < 1 and C(ϵ, T) ≜ min{ϵ + 1/2, ϵ log(eT)} to get

|ϵ_gen| ≤ (4√2/n) √((ϵ_opt + ϵ_c) ϵ_path) (eT)^ϵ C^{1/2}(ϵ, T) + 8 (ϵ_path/n²) (eT)^{2ϵ} C(ϵ, T)
≤ (4√3/n) √((ϵ_opt + ϵ_c) ϵ_path) (eT)^ϵ + 12 (ϵ_path/n²) (eT)^{2ϵ}.    (47)

The last inequality provides the generalization error bound and completes the proof. □ Next we derive upper bounds on the expected path error ϵ_path and the optimization error ϵ_opt, to show an alternative expression of the generalization error inequality (47). We continue with the proof of Corollary 8.

C.1 PROOF OF COROLLARY 8

The self-bounding property of the non-negative β-smooth loss function f(·; z) (Srebro et al., 2010, Lemma 3.1) gives ∥∇f(W_t, z_i)∥₂² ≤ 4β f(W_t, z_i). By taking expectation, and through Assumption 1 and equation 20, we find E[∥∇f(W_t, z_i)∥₂²] ≤ 4β E[f(W_t, z_i)] = 4β E[R_S(W_t)]. The definition of ϵ_path (Definition 5) and the decreasing learning rate (η_t = C/t < 1/(βt)) give

ϵ_path ≜ Σ_{t=1}^T η_t E[∥∇f(W_t, z_i)∥₂²] ≤ 4β Σ_{t=1}^T η_t E[R_S(W_t)] ≤ 4β E[R_S(W_1)] Σ_{t=1}^T η_t    (49)
< 4E[R_S(W_1)] Σ_{t=1}^T 1/t ≤ 4E[R_S(W_1)] log(eT),    (50)

and the inequality (49) holds since the learning rate η_t < 2/β guarantees descent at each iteration. Similarly, ϵ_opt + ϵ_c ≤ E[R_S(W_1)].
The last inequality together with the inequalities (50) and (47) gives

|ϵ_gen| ≤ (4√3/n) √((ϵ_opt + ϵ_c) ϵ_path) (eT)^ϵ + 12 (ϵ_path/n²) (eT)^{2ϵ}
≤ ((8√3/n) log(eT)(eT)^ϵ + (48/n²) log(eT)(eT)^{2ϵ}) E[R_S(W_1)].

The last inequality provides the bound of the corollary. □

Proof. We define the constant c ≜ E[R_S(π_S) + R(π_S)] and apply Theorem 3 and Lemma 16 to find

|ϵ_gen| ≤ 2√(2β(ϵ_opt + ϵ_c) ϵ_stab(A)) + 2β ϵ_stab(A)
≤ (8/√μ) √(2βϵ_opt(ϵ_opt + ϵ_c)) + (4√(2βc)/(nμ)) √(2β(ϵ_opt + ϵ_c)) + (32β/μ) ϵ_opt + (16β²/(n²μ²)) c
≤ (8√(2βϵ_opt)/√μ) √(ϵ_opt + ϵ_c) + (8β√c/(nμ)) √(ϵ_opt + ϵ_c) + (32β/μ) ϵ_opt + (16β²/(n²μ²)) c
≤ (8β√c/(nμ)) √(ϵ_opt + ϵ_c) + (8/√μ) √(2βϵ_opt ϵ_c) + (16β²/(n²μ²)) c + (44β/μ) ϵ_opt.

The last inequality completes the proof. □

E CONVEX LOSS: PROOF OF THEOREM 9 AND THEOREM 10

We start by proving the non-expansive property of the stability iterates for the case of β-smooth convex loss. Then we continue with the proof of the stability and generalization error bounds.

Lemma 18 Let the gradient of the loss be β-Lipschitz for all z ∈ Z. If the loss function is convex and η_t < 2/β, then for any t ≤ T+1 the updates W_t, W^{(i)}_t satisfy the inequality

∥W_t − W^{(i)}_t − (η_t/n) Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂² ≤ ∥W_t − W^{(i)}_t∥₂².    (66)

Proof. By the definition of β-Lipschitz gradients and the triangle inequality, it is true that

∥∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j)∥₂ ≤ β∥W_t − W^{(i)}_t∥₂    ⟹    (67)
∥Σ_{j∈J} ∇f(W_t, z_j) − Σ_{j∈J} ∇f(W^{(i)}_t, z_j)∥₂ ≤ β|J| ∥W_t − W^{(i)}_t∥₂.    (68)

Since the function h(W) ≜ Σ_{j∈J} f(W, z_j) is convex and the gradient of h is β|J|-Lipschitz, it follows that (co-coercivity of the gradient)

Σ_{j∈J} ⟨∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j), W_t − W^{(i)}_t⟩    (69)
≥ (1/(β|J|)) ∥Σ_{j∈J} ∇f(W_t, z_j) − Σ_{j∈J} ∇f(W^{(i)}_t, z_j)∥₂².    (70)
Then we prove the inequality (66) as follows:

∥W_t − W^{(i)}_t − (η_t/n) Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂²
= ∥W_t − W^{(i)}_t∥₂² − 2(η_t/n) Σ_{j≠i} ⟨∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j), W_t − W^{(i)}_t⟩ + (η_t²/n²) ∥Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂²    (71)
≤ ∥W_t − W^{(i)}_t∥₂² − (2η_t/(β(n−1)n)) ∥Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂² + (η_t²/n²) ∥Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂²    (72)
= ∥W_t − W^{(i)}_t∥₂² + (η_t/n)(η_t/n − 2/(β(n−1))) ∥Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂²
≤ ∥W_t − W^{(i)}_t∥₂².    (73)

ϵ_path ≤ 4β E[∥W_1 − W*_S∥₂²] + 8β E[R_S(W*_S)] Σ_{t=1}^T η_t.    (74)

Proof. The self-bounding property of the non-negative β-smooth loss function f(·; z) (Srebro et al., 2010, Lemma 3.1) gives ∥∇f(W_t, z_i)∥₂² ≤ 4β f(W_t, z_i). By taking expectation, and through equation 20, we find

E[∥∇f(W_t, z_i)∥₂²] ≤ 4β E[f(W_t, z_i)] = 4β E[R_S(W_t)].    (75)

Similarly to the approach by Lei & Ying (2020b, Appendix A, Lemma 2), we use the convexity and the assumption η_t ≤ 1/(2β) to find

∥W_{t+1} − W*_S∥₂² = ∥W_t − η_t ∇R_S(W_t) − W*_S∥₂²
= ∥W_t − W*_S∥₂² + η_t² ∥∇R_S(W_t)∥₂² + 2η_t ⟨W*_S − W_t, ∇R_S(W_t)⟩
≤ ∥W_t − W*_S∥₂² + η_t² ∥∇R_S(W_t)∥₂² + 2η_t (R_S(W*_S) − R_S(W_t))
≤ ∥W_t − W*_S∥₂² + 2βη_t² R_S(W_t) + 2η_t (R_S(W*_S) − R_S(W_t))
≤ ∥W_t − W*_S∥₂² + 2η_t R_S(W*_S) − η_t R_S(W_t).

The last display gives

Σ_{t=1}^T η_t R_S(W_t) ≤ Σ_{t=1}^T (∥W_t − W*_S∥₂² − ∥W_{t+1} − W*_S∥₂²) + 2 Σ_{t=1}^T η_t R_S(W*_S) ≤ ∥W_1 − W*_S∥₂² + 2 Σ_{t=1}^T η_t R_S(W*_S).    (76)
The definition of ϵ_path (Definition 5), the inequalities (75), (76) and the choice of the learning rate (η_t ≤ 1/(2β)) give

ϵ_path ≜ Σ_{t=1}^T η_t E[∥∇f(W_t, z_i)∥₂²] ≤ 4β Σ_{t=1}^T η_t E[R_S(W_t)] ≤ 4β E[∥W_1 − W*_S∥₂²] + 8β Σ_{t=1}^T η_t E[R_S(W*_S)].    (77)

The last inequality provides the bound on ϵ_path. □

The standard choice of η_t ≤ 1/β gives the next known bound on the optimization error.

Lemma 20 (Optimization Error — Convex Loss (Nesterov, 1998)) If f(·; z) is a convex and β-smooth function and η_t ≤ 1/β, then

ϵ_opt = E[R_S(A(S)) − R_S(W*_S)] ≤ E[∥W_1 − W*_S∥₂²] / (Σ_{t=1}^T η_t (1 − βη_t/2)).

E.1 PROOF OF THEOREM 9 AND THEOREM 10

Let z_1, z_2, …, z_i, …, z_n, z′_i be i.i.d. random variables, define S ≜ (z_1, z_2, …, z_i, …, z_n) and S^{(i)} ≜ (z_1, z_2, …, z′_i, …, z_n), and W_1 = W^{(i)}_1. The updates for any t ≥ 1 are

W_{t+1} = W_t − (η_t/n) Σ_{j=1}^n ∇f(W_t, z_j),
W^{(i)}_{t+1} = W^{(i)}_t − (η_t/n) Σ_{j=1, j≠i}^n ∇f(W^{(i)}_t, z_j) − (η_t/n) ∇f(W^{(i)}_t, z′_i).

Then for any t ≥ 1,

∥W_{t+1} − W^{(i)}_{t+1}∥₂ ≤ ∥W_t − W^{(i)}_t − (η_t/n) Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂ + (η_t/n) ∥∇f(W_t, z_i) − ∇f(W^{(i)}_t, z′_i)∥₂
≤ √(∥W_t − W^{(i)}_t − (η_t/n) Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂²) + (η_t/n)(∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂)
≤ ∥W_t − W^{(i)}_t∥₂ + (η_t/n)(∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂).    (81)

The inequality (81) comes from Lemma 18. Then by solving the recursion, we find

∥W_{T+1} − W^{(i)}_{T+1}∥₂ ≤ (1/n) Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂),

thus

∥W_{T+1} − W^{(i)}_{T+1}∥₂² ≤ (1/n²) (Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂))² ≤ (2/n²) Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂² + ∥∇f(W^{(i)}_t, z′_i)∥₂²) Σ_{t=1}^T η_t.    (82)

Inequality (82) gives that for any i ∈ {1, …, n},

E[∥W_{T+1} − W^{(i)}_{T+1}∥₂²] ≤ (2/n²) Σ_{t=1}^T η_t (E[∥∇f(W_t, z_i)∥₂²] + E[∥∇f(W^{(i)}_t, z′_i)∥₂²]) Σ_{t=1}^T η_t = (4/n²) Σ_{t=1}^T η_t E[∥∇f(W_t, z_i)∥₂²] Σ_{t=1}^T η_t = (4ϵ_path/n²) Σ_{t=1}^T η_t.    (83)
Recall that W_{T+1} ≡ A(S) and W^{(i)}_{T+1} ≡ A(S^{(i)}). Theorem 3 and the inequality (83) give

|ϵ_gen| ≤ 2√(2β(ϵ_opt + E[R_S(W*_S)]) ϵ_stab(A)) + 2β ϵ_stab(A)
≤ 2√(2β(ϵ_opt + E[R_S(W*_S)]) (4ϵ_path/n²) Σ_{t=1}^T η_t) + 2β (4ϵ_path/n²) Σ_{t=1}^T η_t
= (4/n) √(2β(ϵ_opt + E[R_S(W*_S)]) ϵ_path Σ_{t=1}^T η_t) + 8β (ϵ_path/n²) Σ_{t=1}^T η_t.    (84)

Under the choice of constant learning rate η_t = 1/(2β), Lemma 19 gives ϵ_path ≤ 4β E[∥W_1 − W*_S∥₂²] + 8β E[R_S(W*_S)] Σ_{t=1}^T η_t, and Lemma 20 gives ϵ_opt ≤ 3β E[∥W_1 − W*_S∥₂²]/T. Thus, by the inequality (83),

ϵ_stab(A) ≤ (4ϵ_path/n²) Σ_{t=1}^T η_t ≤ (16β/n²) (E[∥W_1 − W*_S∥₂²] + 2E[R_S(W*_S)] Σ_{t=1}^T η_t) Σ_{t=1}^T η_t    (85)
= (16β/n²) (E[∥W_1 − W*_S∥₂²] + E[R_S(W*_S)] T/β) T/(2β)    (86)
= (8T/n²) (E[∥W_1 − W*_S∥₂²] + E[R_S(W*_S)] T/β).    (87)

The inequality (84) and Lemma 20 give

|ϵ_gen| ≤ 2√(2β(ϵ_opt + E[R_S(W*_S)]) ϵ_stab(A)) + 2β ϵ_stab(A)
≤ (8√T/n) √((ϵ_opt + E[R_S(W*_S)]) (βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)])) + (16T/n²) (βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)])
≤ (8√T/n) √((3βE[∥W_1 − W*_S∥₂²]/T + E[R_S(W*_S)]) (βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)])) + (16T/n²) (βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)])
= (8/n) √((3βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)]) (βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)])) + (16T/n²) (βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)])
≤ (8/n) (3βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)]) + (16T/n²) (3βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)])
≤ 8(1/n + 2T/n²) (3βE[∥W_1 − W*_S∥₂²] + T E[R_S(W*_S)]).

The last inequality completes the proof. □

F STRONGLY-CONVEX OBJECTIVE: PROOF OF THEOREM 12 AND THEOREM 13

Similarly to the convex case, we first provide the contractive property of the stability recursion in the strongly convex case. Then we prove the stability and generalization error bounds.
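Before turning to the strongly convex case, the non-expansiveness of Lemma 18 (a gradient step on the leave-one-out sum of convex β-smooth losses does not increase the distance between two iterates) admits a direct numerical check. The per-sample quadratic losses and the constants below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10, 4
# Convex b_j-smooth per-sample losses: f(w, z_j) = 0.5 * b_j * ||w - z_j||^2.
b = rng.uniform(0.1, 1.0, size=n)
beta = float(b.max())
Z = rng.normal(size=(n, d))

def grad(w, j):
    return b[j] * (w - Z[j])

eta = 1.9 / beta                  # eta < 2/beta, as required by Lemma 18
ok = True
for _ in range(200):
    w, wp = rng.normal(size=d), rng.normal(size=d)
    # leave sample i = 0 out, as in the leave-one-out sum of Lemma 18
    gsum = sum(grad(w, j) - grad(wp, j) for j in range(1, n))
    lhs = np.linalg.norm(w - wp - (eta / n) * gsum)
    ok = ok and lhs <= np.linalg.norm(w - wp) + 1e-9
print(ok)
```

The check confirms the inequality (66) at random pairs of iterates: the updated distance never exceeds the original one when η < 2/β.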
Lemma 21 Let the objective function be γ-strongly convex (γ > 0) and the leave-one-out objective function be γ_loo-strongly convex for some γ_loo ≥ 0. If the loss function is convex and β-smooth for all z ∈ Z and η_t ≤ 2/(β+γ), then for any t ≤ T+1 the updates W_t, W^{(i)}_t satisfy the inequality

∥W_t − W^{(i)}_t − η_t (∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t))∥₂² ≤ (1 − η_t γ_loo) ∥W_t − W^{(i)}_t∥₂².

Proof. The function R_{S^{−i}}(·) is also β-smooth for all z ∈ Z, and strong convexity gives

⟨∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t), W_t − W^{(i)}_t⟩ ≥ (βγ_loo/(β+γ_loo)) ∥W_t − W^{(i)}_t∥₂² + (1/(β+γ_loo)) ∥∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t)∥₂².    (88)

We expand the squared norm as follows:

∥W_t − W^{(i)}_t − η_t (∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t))∥₂²
= ∥W_t − W^{(i)}_t∥₂² − 2η_t ⟨∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t), W_t − W^{(i)}_t⟩ + η_t² ∥∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t)∥₂²
≤ ∥W_t − W^{(i)}_t∥₂² − 2η_t (βγ_loo/(β+γ_loo)) ∥W_t − W^{(i)}_t∥₂² + η_t (η_t − 2/(β+γ_loo)) ∥∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t)∥₂²    (89)
≤ (1 − 2η_t βγ_loo/(β+γ_loo)) ∥W_t − W^{(i)}_t∥₂².    (90)

We apply the inequality (88) to derive (89). The inequality (90) holds since η_t ≤ 2/(β+γ) and β ≥ γ > γ_loo. Also,

2η_t βγ_loo/(β+γ_loo) ≥ 2η_t βγ_loo/(2β) = η_t γ_loo.    (91)

We combine (90) and (91) to derive the bound of the lemma. □

Lemma 22 (Accumulated Path Error — Strongly Convex Loss) Let the objective function R_S(·) be γ-strongly convex and β-smooth. Define Γ(γ, T) ≜ (1 − exp(−4Tγ/(β+γ)))/(exp(4γ/(β+γ)) − 1). If η_t = 2/(β+γ), then the expected path error of the full-batch GD after T iterations is bounded as

ϵ_path ≤ (4β²/(β+γ)) Γ(γ, T) E[∥W_1 − W*_S∥₂²] + (8βT/(β+γ)) E[R_S(W*_S)].

Proof. The self-bounding property of the non-negative β-smooth loss function f(·; z) (Srebro et al., 2010, Lemma 3.1) gives ∥∇f(W_t, z_i)∥₂² ≤ 4β f(W_t, z_i).    (92)
By taking expectation, and through Assumption 1 and equation 20, we find

E[∥∇f(W_t, z_i)∥₂²] ≤ 4β E[f(W_t, z_i)] = 4β E[R_S(W_t)] = 4β E[R_S(W_t) − R_S(W*_S) + R_S(W*_S)].    (93)

Further, Lemma 23 and the choice of constant learning rate η_t = 2/(β+γ) give

E[R_S(W_t) − R_S(W*_S)] ≤ (β/2) exp(−4tγ/(β+γ)) E[∥W_1 − W*_S∥₂²].    (94)

The definition of ϵ_path (Definition 5), the inequalities (93) and (94), and the constant learning rate (η_t = 2/(β+γ)) give

ϵ_path ≜ Σ_{t=1}^T η_t E[∥∇f(W_t, z_i)∥₂²]    (95)
≤ 4β Σ_{t=1}^T η_t E[R_S(W_t) − R_S(W*_S) + R_S(W*_S)]
≤ (4β²/(β+γ)) E[∥W_1 − W*_S∥₂²] Σ_{t=1}^T exp(−4tγ/(β+γ)) + (8βT/(β+γ)) E[R_S(W*_S)]
= (4β²/(β+γ)) E[∥W_1 − W*_S∥₂²] · exp(−4γ/(β+γ)) (1 − exp(−4Tγ/(β+γ)))/(1 − exp(−4γ/(β+γ))) + (8βT/(β+γ)) E[R_S(W*_S)]
= (4β²/(β+γ)) Γ(γ, T) E[∥W_1 − W*_S∥₂²] + (8βT/(β+γ)) E[R_S(W*_S)].    (96)

The last inequality provides the bound on ϵ_path. Further, we can show that

Γ(γ, T) ≤ min{1/(e^{4γ/(β+γ)} − 1), T}    (97)

to simplify the expression in the inequality (96). □

Lemma 23 ((Nesterov, 1998, Theorem 2.1.14)) If f(·; z) is a γ-strongly convex and β-smooth function and η_t = 2/(β+γ), then

ϵ_opt ≤ (β/2) exp(−4Tγ/(β+γ)) E[∥W_1 − W*_S∥₂²].

Alternatively, if η_t = c/t, then ϵ_opt ≤ (β/2) T^{−2cβγ/(β+γ)} E[∥W_1 − W*_S∥₂²].

F.1 PROOF OF THEOREM 12 AND THEOREM 13

Let z_1, z_2, …, z_i, …, z_n, z′_i be i.i.d. random variables, define S ≜ (z_1, z_2, …, z_i, …, z_n) and S^{(i)} ≜ (z_1, z_2, …, z′_i, …, z_n), and W_1 = W^{(i)}_1. The updates for any t ≥ 1 are

W_{t+1} = W_t − (η_t/n) Σ_{j=1}^n ∇f(W_t, z_j),
W^{(i)}_{t+1} = W^{(i)}_t − (η_t/n) Σ_{j=1, j≠i}^n ∇f(W^{(i)}_t, z_j) − (η_t/n) ∇f(W^{(i)}_t, z′_i).
Then similarly to the inequality (81) we get

∥W_{t+1} − W^{(i)}_{t+1}∥₂ ≤ ∥W_t − W^{(i)}_t − (η_t/n) Σ_{j≠i} (∇f(W_t, z_j) − ∇f(W^{(i)}_t, z_j))∥₂ + (η_t/n) ∥∇f(W_t, z_i) − ∇f(W^{(i)}_t, z′_i)∥₂
≤ √(∥W_t − W^{(i)}_t − η_t (∇R_{S^{−i}}(W_t) − ∇R_{S^{−i}}(W^{(i)}_t))∥₂²) + (η_t/n)(∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂)
≤ (1 − η_t γ_loo)^{1/2} ∥W_t − W^{(i)}_t∥₂ + (η_t/n)(∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂),    (102)

and we apply Lemma 21 to derive the bound in (102). Then by solving the recursion we find

∥W_{T+1} − W^{(i)}_{T+1}∥₂ ≤ (1/n) Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂) Π_{j=t+1}^T (1 − η_j γ_loo)^{1/2}
≤ (1/n) √(Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂ + ∥∇f(W^{(i)}_t, z′_i)∥₂)²) √(Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ_loo)).

The last inequality provides the stability bound

∥W_{T+1} − W^{(i)}_{T+1}∥₂² ≤ (2/n²) Σ_{t=1}^T η_t (∥∇f(W_t, z_i)∥₂² + ∥∇f(W^{(i)}_t, z′_i)∥₂²) Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ_loo).    (103)

Inequality (103) gives that for any i ∈ {1, …, n},

E[∥W_{T+1} − W^{(i)}_{T+1}∥₂²] ≤ (2/n²) Σ_{t=1}^T η_t (E[∥∇f(W_t, z_i)∥₂²] + E[∥∇f(W^{(i)}_t, z′_i)∥₂²]) Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ_loo) = (4ϵ_path/n²) Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ_loo).    (104)

Recall that W_{T+1} ≡ A(S) and W^{(i)}_{T+1} ≡ A(S^{(i)}). Due to space limitations, we define Ω(η_t, γ_loo) ≜ Σ_{t=1}^T η_t Π_{j=t+1}^T (1 − η_j γ_loo). The last inequality completes the proof. □
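The contractive step behind Lemma 21 can likewise be checked numerically: a gradient step with η = 2/(β+γ) on a γ-strongly convex, β-smooth objective shrinks the squared distance between two iterates by at least the factor (1 − ηγ). The quadratic objective with Hessian spectrum in [γ, β] is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
gamma, beta = 0.5, 2.0                      # strong convexity / smoothness

# gamma-strongly-convex, beta-smooth quadratic: R(w) = 0.5 * w @ H @ w.
H = np.diag(np.linspace(gamma, beta, d))

def gradR(w):
    return H @ w

eta = 2.0 / (beta + gamma)
ok = True
for _ in range(200):
    w, wp = rng.normal(size=d), rng.normal(size=d)
    lhs = np.linalg.norm(w - wp - eta * (gradR(w) - gradR(wp))) ** 2
    rhs = (1 - eta * gamma) * np.linalg.norm(w - wp) ** 2
    ok = ok and lhs <= rhs + 1e-9
print(ok)
```

For this quadratic the contraction factor per step is in fact ((β−γ)/(β+γ))², strictly smaller than the lemma's (1 − ηγ), so the inequality holds with room to spare.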



SGD naturally requires less computation than GD. However, the directional step of GD can be evaluated in parallel. As a consequence, for a strongly convex objective GD would be more efficient than SGD (in terms of running time) if some parallel computation is available. In general, this occurs when n → ∞, and the generalization error is zero. As a consequence, the excess risk becomes equal to the optimization error (see also (Lei & Ying, 2020a, Theorem 7)), and the analysis becomes uninteresting from a generalization error perspective. Recall that the initial point W_1 may be chosen arbitrarily and uniformly over the dataset. P(z_1 = c_1, z_2 = c_2, …, z_i = c_i, …, z_j = c_j, …, z_n = c_n, A(S) = w) = P(z_1 = c_1, z_2 = c_2, …, z_i = c_j, …, z_j = c_i, …, z_n = c_n, A(S) = w) for any choice of the values c_1, c_2, …, c_n, w and for any i, j ∈ {1, …, n}.



(Generalization under Memorization) If memorization of the training set is feasible under sufficiently large model capacity, then ϵ_c = 0 and consequently

|ϵ_gen| ≤ 2√(2β ϵ_opt ϵ_stab(A)) + 2β ϵ_stab(A),

i.e., |ϵ_gen| = O(max{√(ϵ_opt ϵ_stab(A)), ϵ_stab(A)}).

The inequality (11) gives ϵ_stab(A) = O((Σ_{t=1}^T η_t / n)²), and through Theorem 3 we find |ϵ_gen| = O(Σ_{t=1}^T η_t / n). In contrast, stability guarantees for SGD and non-Lipschitz losses in prior work (Lei & Ying, 2020b, Theorem 3, (4.4)) give ϵ_stab(A) = O(Σ_{t=1}^T η_t² / n) and |ϵ_gen| = O(√(Σ_{t=1}^T η_t² / n)). As a consequence, the GD guarantees are tighter than the existing bounds for SGD with non-Lipschitz losses, for a variety of learning rates and T ≤ n. For instance, for fixed η_t = 1/√T, the generalization error bound of GD is |ϵ_gen| = O(√T/n), which is tighter than the corresponding bound of SGD, namely O(1/√n).

Equation (71) holds from the expansion of the squared norm, and (72) comes from the inequality (70). The inequality (73) holds under the choice η_t < 2/β, which completes the proof. □

Lemma 19 (Accumulated Path Error — Convex Loss) Let the loss function f(·; z) be convex and β-smooth, and let η_t ≤ 1/(2β). Then the expected path error of the full-batch GD after T iterations is bounded as

Theorem 3 and the inequality (104) give

|ϵ_gen| ≤ 2√(2β(ϵ_opt + E[R_S(W*_S)]) ϵ_stab(A)) + 2β ϵ_stab(A)
≤ (4/n) √(2β(ϵ_opt + E[R_S(W*_S)]) ϵ_path Ω(η_t, γ_loo)) + 8β (ϵ_path/n²) Ω(η_t, γ_loo).    (105)

Under the choice of η_t = C = 2/(β+γ) < 2/β, the inequality (104) and Lemmata 15 and 22 give

|ϵ_gen| ≤ (4/n) √(2β(ϵ_opt + E[R_S(W*_S)]) ϵ_path Λ(γ_loo, T)) + 8β (ϵ_path/n²) Λ(γ_loo, T).    (106)

We proceed by applying the upper bounds on Γ(γ, T) and Λ(γ_loo, T), as they appear in the inequalities (97) and (106) respectively, and equation (107) gives the corresponding bound. To simplify the last display, we define the terms m(γ_loo, T) ≜ βT min{β/γ_loo, 2T}/(β+γ) and M(W_1) ≜ max{βE[∥W_1 − W*_S∥₂²], E[R_S(W*_S)]}. Choose T = log(n)(β+γ)/(2γ) and define m_{n,γ_loo} ≜ (β/(2γ)) min{β/γ_loo, ((β+γ)/γ) log n}; then the inequality (108) gives the bound of the theorem.

7. ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

We would like to thank the four anonymous reviewers for providing valuable comments and suggestions, which have improved the presentation of the results and the overall quality of our paper. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N00014-19-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS).

D PL OBJECTIVE

Herein we provide the proofs of the results associated with the PL condition on the objective. We start by proving an upper bound on the average output stability. Then, by combining Lemma 16 and Theorem 3, we derive generalization error bounds for symmetric algorithms and smooth losses, as well as the generalization error bound of the full-batch GD under the PL condition. A similar proof technique for the next lemma also appears in prior work by Lei & Ying (2020a, Proof of Lemma B.2).

Lemma 16 Let the loss function f(·; z) be non-negative, nonconvex and β-smooth for all z ∈ Z. Further, let the objective be μ-PL. Then for any algorithm it is true that

Proof. Define the projection π_{S^{(i)}} ≜ π(A(S^{(i)})) of the point A(S^{(i)}) onto the set of the minimizers of R_{S^{(i)}}(·), and similarly the projection π_S ≜ π(A(S)) of the point A(S) onto the set of the minimizers of R_S(·). Then the inequalities (52) and (53) come from the quadratic growth property (Karimi et al., 2016). Recall that the PL condition on the objective gives (54). We combine the inequalities (53) and (54) to find (55). Also, equation (56) holds because ∇R_{S^{(i)}}(π_{S^{(i)}}) = 0, and the inequality (57) holds for non-negative losses (Srebro et al., 2010, Lemma 3.1). Through the inequality (57) we find (59), where the last equality holds because z_i, z′_i are exchangeable. We combine the inequalities (55) and (59) to find the bound for any i, j ∈ {1, …, n}, and we conclude that it holds for any i ∈ {1, …, n}. The last inequality provides the bound on the expected stability and completes the proof. □

Further, define the constant c ≜ 44 max{E[R_S(π_S) + R(π_S)], E[R_S(W_1) − R*_S]}. Then the generalization error of the full-batch GD with step-size choice η_t = 1/β and T total number of iterations is bounded as follows

