BEYOND LIPSCHITZ: SHARP GENERALIZATION AND EXCESS RISK BOUNDS FOR FULL-BATCH GD

Abstract

We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass the Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error on the accumulated path-dependent optimization error, the terminal optimization error, the number of samples, and the number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms under fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (by up to one order of magnitude of the error). Consequently, the excess risk matches that of SGD with quadratically fewer iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as the state of the art for SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).

1. INTRODUCTION

Gradient based learning (Lecun et al., 1998) is a well-established topic with a large body of literature on algorithmic generalization and optimization errors. For general smooth convex losses, optimization error guarantees have long been well known (Nesterov, 1998). Similarly, Absil et al. (2005) and Lee et al. (2016) showed convergence of Gradient Descent (GD) to minimizers and local minima for smooth nonconvex functions. More recently, Chatterjee (2022), Liu et al. (2022), and Allen-Zhu et al. (2019) established global convergence of GD for deep neural networks under appropriate conditions. Generalization error analysis of stochastic training algorithms has recently gained increased attention. Hardt et al. (2016) showed uniform stability final-iterate bounds for vanilla Stochastic Gradient Descent (SGD). More recent works have developed alternative generalization error bounds with probabilistic guarantees (Feldman & Vondrak, 2018; 2019; Madden et al., 2020; Klochkov & Zhivotovskiy, 2021) and data-dependent variants (Kuzborskij & Lampert, 2018), or under weaker assumptions such as strongly quasi-convex (Gower et al., 2019), non-smooth convex (Feldman, 2016; Bassily et al., 2020; Lei & Ying, 2020b; Lei et al., 2021a), and pairwise losses (Lei et al., 2021b; 2020). In the nonconvex case, Zhou et al. (2021b) provide bounds that involve the on-average variance of the stochastic gradients. The generalization performance of other algorithmic variants has lately gained further attention, including SGD with early momentum (Ramezani-Kebrya et al., 2021), randomized coordinate descent (Wang et al., 2021c), look-ahead approaches (Zhou et al., 2021a), noise injection methods (Xing et al., 2021), and stochastic gradient Langevin dynamics (Pensia et al., 2018; Mou et al., 2018; Li et al., 2020; Negrea et al., 2019; Zhang et al., 2021; Farghly & Rebeschini, 2021; Wang et al., 2021a; b).

* Lead & corresponding author
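For concreteness, the full-batch GD iteration studied throughout, w_{t+1} = w_t - eta * grad(w_t), where the gradient is computed over the entire dataset rather than a sampled mini-batch as in SGD, can be sketched as follows. This is a minimal illustration on a smooth (least-squares) loss, not the paper's setting or notation; the function names and the toy data are illustrative assumptions.

```python
import numpy as np

def full_batch_gd(grad, w0, eta, T):
    """T steps of full-batch GD: w_{t+1} = w_t - eta * grad(w_t).

    `grad` is the gradient of the empirical risk over ALL n samples,
    in contrast with SGD, which uses a stochastic mini-batch estimate.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(T):
        w = w - eta * grad(w)
    return w

# Toy example: smooth least-squares loss f(w) = ||Xw - y||^2 / (2n).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])  # noiseless targets, for illustration

def grad(w):
    n = X.shape[0]
    return X.T @ (X @ w - y) / n  # deterministic full-batch gradient

w_hat = full_batch_gd(grad, np.zeros(3), eta=0.1, T=500)
```

With a step size below 2/L (L the smoothness constant), the iterates converge to the minimizer of this convex quadratic; here `w_hat` recovers the planted weight vector.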

