EXPONENTIAL GENERALIZATION BOUNDS WITH NEAR-OPTIMAL RATES FOR $L_q$-STABLE ALGORITHMS

Abstract

The stability of learning algorithms to changes in the training sample has been actively studied as a powerful proxy for reasoning about generalization. Recently, exponential generalization and excess risk bounds with near-optimal rates have been obtained under the stringent and distribution-free notion of uniform stability (Bousquet et al., 2020; Klochkov & Zhivotovskiy, 2021). Meanwhile, under the notion of $L_q$-stability, which is weaker and distribution dependent, exponential generalization bounds are also available, yet so far only with sub-optimal rates. A fundamental question we would like to address in this paper is therefore whether it is possible to derive near-optimal exponential generalization bounds for $L_q$-stable learning algorithms. As the core contribution of the present work, we give an affirmative answer to this question by developing, for $L_q$-stable algorithms, strict analogues of the near-optimal generalization and risk bounds of uniformly stable algorithms. Further, we demonstrate the power of our improved $L_q$-stability and generalization theory by applying it to derive strong sparse excess risk bounds, under mild conditions, for computationally tractable sparsity estimation algorithms such as Iterative Hard Thresholding (IHT).

1. INTRODUCTION

A fundamental issue in statistical learning is to bound the generalization error of a learning algorithm in order to understand its prediction performance on unseen data. It has long been recognized in the literature that one of the key characteristics that permits learning algorithms to generalize is the stability of the estimated model to perturbations in the training data. The idea of using algorithmic stability as a proxy for generalization analysis dates back to the seventies (Rogers & Wagner, 1978; Devroye & Wagner, 1979). Since the seminal work of Bousquet & Elisseeff (2002), the search for generalization bounds under various notions of algorithmic stability has been a flourishing area of learning theory (Zhang, 2003; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010; Kale et al., 2011; Hardt et al., 2016; Celisse & Guedj, 2016; Bousquet et al., 2020). As one may expect, the stronger an algorithmic stability criterion is, the sharper the resulting generalization bound will be. On one end, exponential generalization bounds can be guaranteed under the most stringent notion of uniform stability (Bousquet & Elisseeff, 2002; Bousquet et al., 2020), which requires the change in the prediction loss to be uniformly small regardless of the data distribution. Despite the strength of the resulting guarantees, this distribution-free nature makes uniform stability too restrictive to be fulfilled, e.g., by learning rules with unbounded losses (Celisse & Guedj, 2016). On the other end, under weaker and distribution-dependent notions of stability such as hypothesis stability and mean-square stability, only polynomial generalization bounds seem possible in general, although the corresponding stability criteria are more amenable to verification (Bousquet & Elisseeff, 2002). These observations have prompted the development of $L_q$-stability as an in-between notion that aims to achieve the best of both worlds (Celisse & Guedj, 2016; Abou-Moustafa & Szepesvári, 2019): it generalizes hypothesis stability from an $\ell_1$-norm criterion to an $L_q$-norm criterion for $q \ge 2$ but remains distribution dependent, and is thus weaker than uniform stability; at the same time, it can still yield exponential generalization bounds similar to those of uniformly stable algorithms (Bousquet & Elisseeff, 2002).

To date, the best known (and near-optimal) rates for exponential generalization bounds are offered by approaches based on uniform stability and certain fine-grained concentration inequalities for sums of functions of independent random variables (Feldman & Vondrák, 2019; Bousquet et al., 2020). These rates are substantially sharper, in terms of the overhead factors on the stability coefficients, than those of Bousquet & Elisseeff (2002), which follow from a naive application of McDiarmid's inequality. While it has long been known that a low probability of failure (over the sample) can be handled by developing modified bounded-difference inequalities (Rakhlin et al., 2005), it remains unclear how to adapt these existing techniques to the more sophisticated frameworks of Feldman & Vondrák (2019) and Bousquet et al. (2020) to obtain sharper exponential bounds.
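For concreteness, the two notions discussed above can be stated in slightly simplified form as follows, where $S^{(i)}$ denotes the sample $S$ with its $i$-th point replaced by an independent copy (this notation is set up formally below); the precise definitions in the cited works differ only in minor technical details, e.g., whether the replaced sample is removed or resampled.

```latex
% Uniform stability (Bousquet & Elisseeff, 2002): the loss perturbation must be
% small for every realization of the sample and every test point z.
\[
  \sup_{i \in [N]}\ \sup_{S \in \mathcal{Z}^N,\, z \in \mathcal{Z}}
  \bigl|\ell(A(S); z) - \ell(A(S^{(i)}); z)\bigr| \;\le\; \gamma .
\]
% L_q-stability (Celisse & Guedj, 2016): the same perturbation only needs to be
% small in L_q-norm, i.e., in expectation over the data distribution (hence
% distribution dependent); hypothesis stability corresponds to the L_1-norm variant.
\[
  \sup_{i \in [N]}\
  \bigl\|\ell(A(S); Z) - \ell(A(S^{(i)}); Z)\bigr\|_{q} \;\le\; \gamma ,
  \qquad q \ge 2 .
\]
```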
Particularly for $L_q$-stable learning algorithms, the state-of-the-art exponential generalization bounds are derived from moment or exponential extensions of the Efron-Stein inequality (Celisse & Guedj, 2016; Abou-Moustafa & Szepesvári, 2019), which yield rates of convergence similar to those of Bousquet & Elisseeff (2002) and are thus suspected to be sub-optimal. Given this gap in rates of convergence between the generalization bounds under uniform stability and under $L_q$-stability, the following question naturally arises: Is it possible to derive sharper exponential generalization bounds for $L_q$-stable learning algorithms that match the recent breakthrough results for uniformly stable algorithms? As the core contribution of the present work, we give an affirmative answer to this open question by developing, for $L_q$-stable algorithms, strict analogues of the near-optimal generalization bounds known for uniformly stable algorithms. The main results of our work confirm that the notion of $L_q$-stability serves as a neat yet powerful tool for extending those best-known generalization bounds to a broad class of non-uniformly stable algorithms. To illustrate the importance of our theory, we apply the improved analysis of $L_q$-stable algorithms to derive sharper exponential risk bounds for computationally tractable sparsity recovery estimators, such as the Iterative Hard Thresholding (IHT) algorithms widely used in high-dimensional sparse learning (Blumensath & Davies, 2009; Foucart, 2011; Jain et al., 2014). This application also serves as a main motivation of our study.

Notation. Here we provide some notation that will be used frequently throughout the paper. Let $S = \{Z_1, Z_2, \ldots, Z_N\}$ be a set of independent random data samples valued in some measurable set $\mathcal{Z}$. For any index set $I \subseteq [N] := \{1, \ldots, N\}$, we denote by $S_I = \{Z_i, i \in I\}$ and $S_{I^c} = S \setminus S_I$. We denote by $S' = \{Z'_1, Z'_2, \ldots, Z'_N\}$ another i.i.d. sample drawn from the same distribution as $S$, and we write $S^{(i)} = \{Z_1, \ldots, Z_{i-1}, Z'_i, Z_{i+1}, \ldots, Z_N\}$. For a real-valued random variable $Y$, its $L_q$-norm for $q \ge 1$ is defined by $\|Y\|_q = (\mathbb{E}[|Y|^q])^{1/q}$. By definition it can be verified that for all $q \ge 2$, $\|Y\|_q^2 = (\mathbb{E}[|Y|^q])^{2/q} = (\mathbb{E}[|Y^2|^{q/2}])^{2/q} = \|Y^2\|_{q/2}$. Let $g : \mathcal{Z}^N \to \mathbb{R}$ be a measurable function and consider the random variable $g(S) = g(Z_1, \ldots, Z_N)$. For $g(S)$ and any index set $I \subseteq [N]$, we define the abbreviations $g(S_I) := \mathbb{E}[g(S) \mid S_I]$ and $\|g\|_q(S_I) := (\mathbb{E}[|g(S)|^q \mid S_I])^{1/q}$. We say a real-valued function $f$ is $G$-Lipschitz continuous over the domain $\mathcal{W}$ if $|f(w) - f(w')| \le G\|w - w'\|$ for all $w, w' \in \mathcal{W}$. For a pair of functions $f, g \ge 0$, we use $f \lesssim g$ (or $g \gtrsim f$) to denote $f \le c\,g$ for some constant $c > 0$. We denote by $\mathrm{supp}(w)$ the support of a vector $w$, i.e., the index set of its non-zero entries.
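As a quick numerical illustration of the moment notation above (an illustrative sketch added here, not part of the paper's analysis), the following Python snippet estimates $\|Y\|_q$ from samples and confirms the stated identity $\|Y\|_q^2 = \|Y^2\|_{q/2}$ for $q \ge 2$.

```python
import numpy as np

def lq_norm(samples: np.ndarray, q: float) -> float:
    """Monte Carlo estimate of the L_q norm ||Y||_q = (E[|Y|^q])^(1/q)."""
    return float(np.mean(np.abs(samples) ** q) ** (1.0 / q))

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000)  # samples of a standard normal Y
q = 4.0

lhs = lq_norm(y, q) ** 2          # ||Y||_q^2
rhs = lq_norm(y ** 2, q / 2.0)    # ||Y^2||_{q/2}
print(lhs, rhs)  # the two estimates coincide up to floating-point error
```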



1.1 SETUP AND PRIOR RESULTS

Problem setup. We consider a statistical learning algorithm $A : \mathcal{Z}^N \to \mathcal{W}$ that maps a training data set $S$ to a model $A(S)$ in a closed subset $\mathcal{W}$ of a Euclidean space. The population risk and the corresponding empirical risk evaluated at $A(S)$ are respectively given by
$$R(A(S)) := \mathbb{E}_Z[\ell(A(S); Z)] \quad \text{and} \quad R_S(A(S)) := \frac{1}{N}\sum_{i=1}^N \ell(A(S); Z_i),$$
where $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}_+$ is a non-negative and potentially unbounded loss function whose value $\ell(w; z)$ measures the loss evaluated at $z$ with parameter $w$. As a classic and fundamental issue in statistical learning, we are interested in deriving upper bounds on the difference between population

