BEYOND LIPSCHITZ: SHARP GENERALIZATION AND EXCESS RISK BOUNDS FOR FULL-BATCH GD

Abstract

We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass the Lipschitz and sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error on the accumulated path-dependent optimization error, the terminal optimization error, the number of samples, and the number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms under fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (by up to one order of magnitude); consequently, the excess risk matches that of SGD with quadratically fewer iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as the state of the art for SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).
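The strongly convex claim can be checked numerically. The sketch below (a synthetic quadratic of our own choosing; the variable names and problem are illustrative, not from the paper) runs full-batch GD with step size 2/(β + γ) on a γ-strongly-convex, β-smooth quadratic: each step contracts the distance to the minimizer by at most (β − γ)/(β + γ), so T = Θ(log n) iterations already reach 1/n-level optimization accuracy.

```python
import numpy as np

# Hedged sketch (illustrative, not the paper's code): GD with step size
# 2/(beta + gamma) on a gamma-strongly-convex, beta-smooth quadratic.
rng = np.random.default_rng(1)
d = 10
eigs = np.linspace(1.0, 10.0, d)           # gamma = 1, beta = 10
A = np.diag(eigs)                          # f(w) = 0.5 * w^T A w, minimizer w* = 0
gamma, beta = eigs.min(), eigs.max()
eta = 2.0 / (beta + gamma)

n = 10_000                                 # plays the role of the dataset size
rho = (beta - gamma) / (beta + gamma)      # per-step contraction factor
# Smallest T with rho**T <= 1/n, i.e. T = Theta(log n):
T = int(np.ceil(np.log(n) / np.log(1.0 / rho)))

w = rng.normal(size=d)
w0_norm = np.linalg.norm(w)
for _ in range(T):
    w = w - eta * (A @ w)                  # full-batch gradient of the quadratic

# Distance to the minimizer has shrunk by at least a factor rho**T <= 1/n.
print(T, np.linalg.norm(w) / w0_norm)
```

The point of the sketch is only the iteration count: T grows logarithmically in n, in contrast with the Θ(n) horizon required by the SGD step-size schedules discussed below.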

1. INTRODUCTION

Gradient-based learning (LeCun et al., 1998) is a well-established topic with a large body of literature on algorithmic generalization and optimization errors. For general smooth convex losses, optimization error guarantees have long been well known (Nesterov, 1998). Similarly, Absil et al. (2005) and Lee et al. (2016) have shown convergence of Gradient Descent (GD) to minimizers and local minima for smooth nonconvex functions. More recently, Chatterjee (2022), Liu et al. (2022) and Allen-Zhu et al. (2019) established global convergence of GD for deep neural networks under appropriate conditions.

Generalization error analysis of stochastic training algorithms has recently gained increased attention. Hardt et al. (2016) showed uniform stability final-iterate bounds for vanilla Stochastic Gradient Descent (SGD). More recent works have developed alternative generalization error bounds with probabilistic guarantees (Feldman & Vondrak, 2018; 2019; Madden et al., 2020; Klochkov & Zhivotovskiy, 2021) and data-dependent variants (Kuzborskij & Lampert, 2018), or under weaker assumptions such as strongly quasi-convex (Gower et al., 2019), non-smooth convex (Feldman, 2016; Bassily et al., 2020; Lei & Ying, 2020b; Lei et al., 2021a), and pairwise losses (Lei et al., 2020; 2021b). In the nonconvex case, Zhou et al. (2021b) provide bounds that involve the on-average variance of the stochastic gradients. The generalization performance of other algorithmic variants has lately gained further attention, including SGD with early momentum (Ramezani-Kebrya et al., 2021), randomized coordinate descent (Wang et al., 2021c), look-ahead approaches (Zhou et al., 2021a), noise injection methods (Xing et al., 2021), and stochastic gradient Langevin dynamics (Pensia et al., 2018; Mou et al., 2018; Li et al., 2020; Negrea et al., 2019; Zhang et al., 2021; Farghly & Rebeschini, 2021; Wang et al., 2021a; b).

While prior works provide extensive analyses of the generalization error and excess risk of stochastic gradient methods, tight and path-dependent generalization error and excess risk guarantees for nonstochastic training (with general smooth losses) remain unexplored. Our main purpose in this work is to theoretically establish that full-batch GD indeed generalizes efficiently for general smooth losses.

While SGD appears to generalize better than full-batch GD for non-smooth, Lipschitz convex losses (Bassily et al., 2020; Amir et al., 2021), non-smoothness itself appears to be the obstacle to efficient algorithmic generalization. In fact, the tightness analysis for non-smooth losses by Bassily et al. (2020) shows that the generalization error bounds become vacuous for standard step-size choices. Our work shows that for general smooth losses, full-batch GD achieves either tighter stability and excess risk rates (convex case) or rates equivalent to those of SGD but with a significantly shorter training horizon (strongly convex objective). Moreover, full-batch GD generalizes efficiently for appropriate choices of decreasing learning rate that guarantee faster convergence and a smaller generalization error simultaneously. Additionally, the generalization error involves an intrinsic dependence on the set of stationary points and the initial point. Table 1 summarizes the comparison.

Table 1: Comparison of the excess risk bounds for the full-batch GD (this work) and SGD (Lei & Ying, 2020b) algorithms.

Loss                          | Algorithm | Step size            | Iterations   | Low noise | Excess risk
Convex                        | GD        | η_t = 1/(2β)         | T = √n       | No        | O(1/√n)
Convex                        | SGD       | η_t = 1/√T           | T = n        | No        | O(1/√n)
Convex                        | GD        | η_t = 1/(2β)         | T = n        | Yes       | O(1/n)
Convex                        | SGD       | η_t = 1/(2β)         | T = n        | Yes       | O(1/n)
γ-strongly convex (objective) | GD        | η_t = 2/(β + γ)      | T = Θ(log n) | No        | O(log(n)/n)
γ-strongly convex (objective) | SGD       | η_t = 2/((t + t_0)γ) | T = Θ(n)     | No        | O(1/n)
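The convex full-batch GD configuration above (constant step size η_t = 1/(2β) run for T = √n iterations) can be sketched on a synthetic problem. Everything below is illustrative and of our own construction (a β-smooth least-squares loss with Gaussian data); the train/fresh-sample risk gap serves only as a crude Monte-Carlo proxy for the generalization error.

```python
import numpy as np

# Hedged sketch (synthetic data, illustrative names): full-batch GD with
# eta = 1/(2*beta) for T = sqrt(n) steps on a beta-smooth convex loss.
rng = np.random.default_rng(0)
n, d = 400, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def risk(w, A, b):
    """Empirical risk: half the mean squared residual."""
    r = A @ w - b
    return 0.5 * np.mean(r ** 2)

beta = np.linalg.eigvalsh(X.T @ X / n).max()  # smoothness constant of the loss
eta = 1.0 / (2.0 * beta)                      # step size eta_t = 1/(2*beta)
T = int(np.ceil(np.sqrt(n)))                  # T = sqrt(n) full-batch iterations

w = np.zeros(d)
for _ in range(T):
    w = w - eta * (X.T @ (X @ w - y) / n)     # full-batch gradient step

# Fresh samples stand in for the population risk.
X_new = rng.normal(size=(4 * n, d))
y_new = X_new @ w_true + 0.1 * rng.normal(size=4 * n)
gap = risk(w, X_new, y_new) - risk(w, X, y)
print(f"train risk {risk(w, X, y):.4f}, estimated generalization gap {gap:.4f}")
```

Note the design choice mirrored from the comparison: GD uses a constant step size tied to the smoothness constant β rather than the horizon-dependent 1/√T schedule used for SGD, and needs only √n full-batch iterations in the low-accuracy regime.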

