EXPONENTIAL GENERALIZATION BOUNDS WITH NEAR-OPTIMAL RATES FOR $L_q$-STABLE ALGORITHMS

Abstract

The stability of learning algorithms to changes in the training sample has been actively studied as a powerful proxy for reasoning about generalization. Recently, exponential generalization and excess risk bounds with near-optimal rates have been obtained under the stringent and distribution-free notion of uniform stability (Bousquet et al., 2020; Klochkov & Zhivotovskiy, 2021). Meanwhile, under the notion of $L_q$-stability, which is weaker and distribution dependent, exponential generalization bounds are also available, yet so far only with sub-optimal rates. A fundamental question we therefore address in this paper is whether it is possible to derive near-optimal exponential generalization bounds for $L_q$-stable learning algorithms. As the core contribution of the present work, we give an affirmative answer to this question by developing strict analogues of the near-optimal generalization and risk bounds of uniformly stable algorithms for $L_q$-stable algorithms. Further, we demonstrate the power of our improved $L_q$-stability and generalization theory by applying it to derive strong sparse excess risk bounds, under mild conditions, for computationally tractable sparsity estimation algorithms such as Iterative Hard Thresholding (IHT).

1. INTRODUCTION

A fundamental issue in statistical learning is to bound the generalization error of a learning algorithm in order to understand its prediction performance on unseen data. It has long been recognized in the literature that one of the key characteristics that permits learning algorithms to generalize is the stability of the estimated model to perturbations in the training data. The idea of using algorithmic stability as a proxy for generalization performance analysis dates back to the seventies (Rogers & Wagner, 1978; Devroye & Wagner, 1979). Since the seminal work of Bousquet & Elisseeff (2002), the search for generalization bounds under various notions of algorithmic stability has been a flourishing area of learning theory (Zhang, 2003; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010; Kale et al., 2011; Hardt et al., 2016; Celisse & Guedj, 2016; Bousquet et al., 2020). As one may expect, the stronger an algorithmic stability criterion is, the sharper the resulting generalization bound will be. On one end, exponential generalization bounds can be guaranteed under the most stringent notion of uniform stability (Bousquet & Elisseeff, 2002; Bousquet et al., 2020), which requires the change in the prediction loss to be uniformly small regardless of the data distribution. Despite the strength of the resulting guarantees, the distribution-free nature makes uniform stability too restrictive to be fulfilled, e.g., by learning rules with unbounded losses (Celisse & Guedj, 2016). On the other end, under weaker and distribution-dependent notions of stability such as hypothesis stability and mean-square stability, only polynomial generalization bounds seem possible in general, although the corresponding stability criteria are more amenable to verification (Bousquet & Elisseeff, 2002).
These observations have prompted the development of $L_q$-stability, as an in-between state, to achieve the best of both worlds (Celisse & Guedj, 2016; Abou-Moustafa & Szepesvári, 2019): it generalizes the notion of hypothesis stability from the $\ell_1$-norm criterion to the $L_q$-norm criterion for $q \ge 2$ but remains distribution dependent and thus is weaker than uniform stability; meanwhile, it can still achieve exponential generalization bounds similar to those of uniformly stable algorithms (Bousquet & Elisseeff, 2002). To date, the best known (and near-optimal) rates for exponential generalization bounds are offered by approaches based on uniform stability and certain fine-grained concentration inequalities for sums of functions of independent random variables (Feldman & Vondrák, 2019; Bousquet et al., 2020). In terms of the overhead factors on stability coefficients, these rates are substantially sharper than those of Bousquet & Elisseeff (2002), which are implied by a naive application of McDiarmid's inequality. While it has long been known that a low probability of failure (over the sample) can be handled by developing modified bounded-difference inequalities (Rakhlin et al., 2005), it remains unclear how to adapt these existing techniques to the more sophisticated frameworks of Feldman & Vondrák (2019) and Bousquet et al. (2020) to obtain sharper exponential bounds. Particularly for $L_q$-stable learning algorithms, the state-of-the-art exponential generalization bounds are derived from the moment or exponential extensions of the Efron-Stein inequality (Celisse & Guedj, 2016; Abou-Moustafa & Szepesvári, 2019), which yield rates of convergence similar to those of Bousquet & Elisseeff (2002) and thus are suspected to be sub-optimal.
Given the gap observed above in rates of convergence between the generalization bounds under uniform stability and $L_q$-stability, the following question naturally arises: Is it possible to derive sharper exponential generalization bounds for $L_q$-stable learning algorithms that match the recent breakthrough results for uniformly stable algorithms? As the core contribution of the present work, we give an affirmative answer to this open question by developing strict analogues of the near-optimal generalization bounds of uniformly stable algorithms for $L_q$-stable algorithms. The main results of our work confirm that the notion of $L_q$-stability serves as a neat yet powerful tool for extending the best-known generalization bounds to a broad class of non-uniformly stable algorithms. To illustrate the importance of our theory, we apply the improved analysis of $L_q$-stable algorithms to derive sharper exponential risk bounds for computationally tractable sparsity recovery estimators, such as the Iterative Hard Thresholding (IHT) algorithms widely used in high-dimensional sparse learning (Blumensath & Davies, 2009; Foucart, 2011; Jain et al., 2014). This application also serves as a main motivation of our study.

Notation. Here we provide some notation that will be used frequently throughout the paper. Let $S = \{Z_1, Z_2, \ldots, Z_N\}$ be a set of independent random data samples valued in some measurable set $\mathcal{Z}$. For any index set $I \subseteq [N] := \{1, \ldots, N\}$, we denote $S_I = \{Z_i, i \in I\}$ and write $S_{\bar{I}} = S \setminus S_I$ for its complement. We denote by $S' = \{Z'_1, Z'_2, \ldots, Z'_N\}$ another i.i.d. sample from the same distribution as that of $S$, and we write $S^{(i)} = \{Z_1, \ldots, Z_{i-1}, Z'_i, Z_{i+1}, \ldots, Z_N\}$. For a real-valued random variable $Y$, its $L_q$-norm for $q \ge 1$ is defined by $\|Y\|_q = (\mathbb{E}[|Y|^q])^{1/q}$. By definition it can be verified that for all $q \ge 2$,

$$\|Y\|_q^2 = (\mathbb{E}[|Y|^q])^{2/q} = \big(\mathbb{E}[|Y^2|^{q/2}]\big)^{2/q} = \|Y^2\|_{q/2}.$$

Let $g : \mathcal{Z}^N \to \mathbb{R}$ be some measurable function and consider the random variable $g(S) = g(Z_1, \ldots, Z_N)$. For $g(S)$ and any index set $I \subseteq [N]$, we define the following abbreviations:

$$g(S_I) := \mathbb{E}[g(S) \mid S_I], \qquad \|g\|_q(S_I) := \big(\mathbb{E}[|g(S)|^q \mid S_I]\big)^{1/q}.$$

We say a real-valued function $f$ is $G$-Lipschitz continuous over the domain $\mathcal{W}$ if $|f(w) - f(w')| \le G\|w - w'\|$ for all $w, w' \in \mathcal{W}$. For a pair of functions $f, g \ge 0$, we use $f \lesssim g$ (or $g \gtrsim f$) to denote $f \le cg$ for some constant $c > 0$. We denote by $\mathrm{supp}(w)$ the support of a vector $w$, i.e., the index set of non-zero entries of $w$.
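The $L_q$-norm and the identity $\|Y\|_q^2 = \|Y^2\|_{q/2}$ from the Notation paragraph can be checked numerically. The following is a minimal sketch in which the expectation is replaced by an average over a finite list of values; the helper name `lq_norm` and the toy values are illustrative assumptions, not part of the paper.

```python
import math


def lq_norm(samples, q):
    """Empirical L_q-norm: (average of |y|^q) ** (1/q), for q >= 1."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return (sum(abs(y) ** q for y in samples) / len(samples)) ** (1.0 / q)


# Toy values standing in for draws of Y (hypothetical data).
samples = [1.0, -2.0, 3.0]
q = 4

lhs = lq_norm(samples, q) ** 2                   # ||Y||_q^2
rhs = lq_norm([y * y for y in samples], q / 2)   # ||Y^2||_{q/2}
print(lhs, rhs)
```

Because the identity is algebraic, it holds exactly for the empirical measure as well, up to floating-point rounding.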

1.1. SETUP AND PRIOR RESULTS

Problem setup. We consider a statistical learning algorithm $A : \mathcal{Z}^N \to \mathcal{W}$ that maps a training data set $S$ to a model $A(S)$ in a closed subset $\mathcal{W}$ of a Euclidean space. The population risk and the corresponding empirical risk evaluated at $A(S)$ are respectively given by

$$R(A(S)) := \mathbb{E}_Z[\ell(A(S); Z)] \quad \text{and} \quad R_S(A(S)) := \frac{1}{N}\sum_{i=1}^N \ell(A(S); Z_i),$$

where $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}_+$ is a non-negative and potentially unbounded loss function whose value $\ell(w; z)$ measures the loss evaluated at $z$ with parameter $w$. As a classic fundamental issue in statistical learning, we are interested in deriving upper bounds on the difference between population and empirical risks, i.e., $|R(A(S)) - R_S(A(S))|$, which quantifies the generalization error of $A$. Let $R^* := \min_{w \in \mathcal{W}} R(w)$ be the optimal value of the population risk. We will also study how to upper bound $R(A(S)) - R^*$ (a.k.a. the excess risk), which is of particular interest for understanding the population risk minimization performance of $A$. We first introduce the concept of uniform stability (Bousquet & Elisseeff, 2002), which requires the change in the prediction loss to be uniformly small regardless of the distribution of data.

Definition 1 (Uniform stability). A learning algorithm $A$ is said to have uniform stability with parameter $\gamma_u > 0$ if it satisfies the following uniform bound:

$$\sup_{S, S^{(i)}, Z \in \mathcal{Z}} |\ell(A(S); Z) - \ell(A(S^{(i)}); Z)| \le \gamma_u, \quad \forall i \in [N].$$

Given that the loss function $\ell$ is almost surely bounded by $M$, Bousquet & Elisseeff (2002) showed that a large class of regularized empirical risk minimization (ERM) algorithms has uniform stability, and that McDiarmid's inequality yields the following exponential tail generalization bound, which holds with probability at least $1 - \delta$ over the draw of $S$ for any $\delta \in (0, 1)$:

$$|R(A(S)) - R_S(A(S))| \lesssim \gamma_u \sqrt{N \log\frac{1}{\delta}} + M\sqrt{\frac{\log(1/\delta)}{N}}. \tag{2}$$

Recently, equipped with a strong concentration inequality for sums of random functions, Bousquet et al. (2020) established the following moment bound for uniformly stable algorithms for all $q \ge 2$:

$$\|R(A(S)) - R_S(A(S))\|_q \lesssim q\gamma_u \log(N) + M\sqrt{\frac{q}{N}}. \tag{3}$$

In view of the equivalence between tails and moments (see, e.g., Bousquet et al., 2020, Lemma 1), the above $L_q$-norm bound implies that for any $\delta \in (0, 1)$, the following tail bound holds with probability at least $1 - \delta$ over the draw of $S$:

$$|R(A(S)) - R_S(A(S))| \lesssim \gamma_u \log(N) \log\frac{1}{\delta} + M\sqrt{\frac{\log(1/\delta)}{N}}. \tag{4}$$

This bound substantially improves the classic result in Eq. (2) by reducing the overhead factor on the stability coefficient from $O\big(\sqrt{N \log(1/\delta)}\big)$ to $O\big(\log(N)\log(1/\delta)\big)$. For example, in regimes such as regularized ERM where $\gamma_u \lesssim 1/\sqrt{N}$ is usually the case, the bound in Eq. (2) becomes vacuous as it does not vanish in the sample size, while the bound in Eq. (4) still guarantees an $O\big(\log(N)\log(1/\delta)/\sqrt{N}\big)$ rate of convergence. Indeed, up to logarithmic factors on sample size and tail bounds, the rate in Eq. (4) is nearly optimal in the sense of a lower bound on sums of random functions by Bousquet et al. (2020). The bound in Eq. (4) can be extended to stochastic learning algorithms when the uniform stability (over data) holds with high probability over the internal randomness of the algorithm (Feldman & Vondrák, 2019; Bassily et al., 2020). Under the generalized Bernstein condition (Koltchinskii, 2006) and based on the sharp concentration inequality for sums of random functions by Bousquet et al. (2020), Klochkov & Zhivotovskiy (2021) alternatively established the following deviation-optimal excess risk bound that holds with probability at least $1 - \delta$ over the draw of $S$:

$$R(A(S)) - R^* \lesssim \Delta_{\mathrm{opt}} + \mathbb{E}[\Delta_{\mathrm{opt}}] + \gamma_u \log(N)\log\frac{1}{\delta} + \frac{(M + B)\log(1/\delta)}{N}, \tag{5}$$

where $\Delta_{\mathrm{opt}} := R_S(A(S)) - \min_{w \in \mathcal{W}} R_S(w)$ represents the empirical risk sub-optimality of the algorithm on training data, and $B$ is the Bernstein condition constant as defined in Assumption 1.
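The passage from the moment bound in Eq. (3) to the tail bound in Eq. (4) can be sketched with Markov's inequality. The derivation below is the standard argument with illustrative constants and is not claimed to be the exact statement of Bousquet et al. (2020, Lemma 1); the symbols $Y$, $a$ and $b$ are placeholders introduced here.

```latex
% Assume \|Y\|_q \le a q + b\sqrt{q} for all q \ge 2, with \|Y\|_q > 0.
\begin{align*}
\mathbb{P}\big(|Y| \ge e\,\|Y\|_q\big)
  &= \mathbb{P}\big(|Y|^q \ge e^q \|Y\|_q^q\big)
  \le \frac{\mathbb{E}[|Y|^q]}{e^q \|Y\|_q^q} = e^{-q}. \\
\intertext{Choosing $q = \log(1/\delta)$ with $\delta \le e^{-2}$ (so that $q \ge 2$) gives, with probability at least $1-\delta$,}
|Y| &\le e\,\|Y\|_q \le e\Big(a \log\tfrac{1}{\delta} + b\sqrt{\log\tfrac{1}{\delta}}\Big).
\end{align*}
% With Y = R(A(S)) - R_S(A(S)), a \asymp \gamma_u \log N and b \asymp M/\sqrt{N},
% this has the same form as Eq. (4).
```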
While implying strong generalization guarantees, uniform stability is also the most stringent notion in the sense that it is distribution independent and hard to fulfill, e.g., by learning rules with unbounded losses. To address this restrictiveness, the notion of $L_q$-stability was introduced by Celisse & Guedj (2016) as a relaxation of uniform stability.

Definition 2 ($L_q$-Stability). For $q \ge 1$, a learning algorithm $A$ is said to have $L_q$-stability with parameter $\gamma_q > 0$ if it satisfies the following moment bound:

$$\big\|\ell(A(S); Z) - \ell(A(S^{(i)}); Z)\big\|_q \le \gamma_q, \quad \forall i \in [N].$$

In the above definition, the expectation associated with the $L_q$-norm is taken over $S$, $S^{(i)}$, $Z$, and the internal random bits of $A$, if any (such as in the case of stochastic learning algorithms). Note that, slightly differently from Celisse & Guedj (2016), the random variable $Z$ in the above definition is not required to be independent of $S$ and $S^{(i)}$. By definition, $L_q$-stability is distribution dependent and thus is weaker than uniform stability, which can be regarded as a special case of $L_q$-stability with $\gamma_q \equiv \gamma$ for some $\gamma > 0$. In particular, for $q = 1$ and $q = 2$, $L_q$-stability reduces to the notions of hypothesis stability (Bousquet & Elisseeff, 2002) and mean-square stability (Kale et al., 2011), respectively. For instance, it has been shown that the classical ridge regression model with unbounded responses has $L_q$-stability for all $q \ge 1$ but not uniform stability (Celisse & Guedj, 2016). As a novel and concrete example, we will see in Section 3 that $L_q$-stability plays a crucial role in deriving strong sparse excess risk bounds for sparsity estimation algorithms such as IHT (Jain et al., 2014; Yuan et al., 2018).
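To make Definition 2 concrete, the $L_q$-stability coefficient of one-dimensional ridge regression with unbounded Gaussian responses can be estimated by Monte Carlo. This is only an illustrative sketch: the data model ($y = x + \text{noise}$), the regularization strength, the number of trials and all function names are assumptions made here, not quantities taken from the paper or from Celisse & Guedj (2016).

```python
import random


def ridge_1d(sample, lam):
    """Closed-form minimizer of (1/N) * sum (y - w*x)^2 + lam * w^2 in one dimension."""
    n = len(sample)
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / (sxx + lam * n)


def draw_point(rng):
    """Hypothetical data model: x ~ N(0,1), y = x + N(0,1), so y is unbounded."""
    x = rng.gauss(0.0, 1.0)
    return (x, x + rng.gauss(0.0, 1.0))


def sq_loss(w, z):
    x, y = z
    return (y - w * x) ** 2


def estimate_lq_stability(n, q, lam=1.0, trials=500, seed=0):
    """Monte Carlo estimate of || loss(A(S); Z) - loss(A(S^(i)); Z) ||_q for i = 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [draw_point(rng) for _ in range(n)]
        perturbed = list(sample)
        perturbed[0] = draw_point(rng)  # replace one training point by a fresh copy
        z = draw_point(rng)             # independent test point
        diff = sq_loss(ridge_1d(sample, lam), z) - sq_loss(ridge_1d(perturbed, lam), z)
        total += abs(diff) ** q
    return (total / trials) ** (1.0 / q)


gamma_small_n = estimate_lq_stability(n=20, q=2)
gamma_large_n = estimate_lq_stability(n=200, q=2)
print(gamma_small_n, gamma_large_n)
```

Since the data are i.i.d. and the estimator is symmetric in its inputs, replacing the first point is representative of any index $i$. The estimate is expected to shrink as $N$ grows even though the loss itself is unbounded, which is the situation uniform stability cannot capture.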
Alternatively, the definition of $L_q$-stability can be extended to $L_q$-argument-stability, $\|A(S) - A(S^{(i)})\|_q \le \gamma_q$, which generalizes the concept of uniform argument stability (Bassily et al., 2020) to the $L_q$-norm criterion. $L_q$-argument-stability does not depend on the random argument $Z$, and it implies $L_q$-stability for Lipschitz losses. The following is the best known moment generalization bound to date under $L_q$-stability that holds for all $q \ge 2$ and potentially unbounded losses (Celisse & Guedj, 2016; Abou-Moustafa & Szepesvári, 2019):

$$\|R(A(S)) - R_S(A(S))\|_q \lesssim \gamma_q \sqrt{Nq} + \sqrt{\frac{q}{N}}. \tag{6}$$

As one can see, the $L_q$-stability generalization bound in Eq. (6) is significantly inferior to the near-optimal uniform stability generalization bound in Eq. (3) in terms of the overhead on the stability coefficient. Such a gap in rate of convergence is unsurprising: the bound in Eq. (6) was derived by more or less directly applying moment or exponential extensions of the Efron-Stein inequality to the generalization error (Celisse & Guedj, 2016; Abou-Moustafa & Szepesvári, 2019), and thus yields about the same overhead factor on the stability coefficient as the sub-optimal exponential bound in Eq. (2) for uniformly stable algorithms. In light of these observations, we are naturally motivated to derive sharper exponential generalization bounds for $L_q$-stable algorithms, with the hope of matching the near-optimal bound in Eq. (3) achievable by uniformly stable algorithms.
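The gap between the two overhead factors discussed above can be tabulated directly. The sketch below drops all constants and assumes the overhead in Eq. (6) is read as $\sqrt{Nq}$ and the overhead in Eq. (3) as $q\log N$; the function names are illustrative.

```python
import math


def overhead_efron_stein(n, q):
    """Factor multiplying the stability coefficient in an Eq. (6)-type bound (assumed sqrt(N*q))."""
    return math.sqrt(n * q)


def overhead_near_optimal(n, q):
    """Factor multiplying the stability coefficient in an Eq. (3)-type bound: q * log(N)."""
    return q * math.log(n)


# Compare the two factors for a fixed moment order and growing sample size.
for n in (10**2, 10**4, 10**6):
    print(n, overhead_efron_stein(n, 2), overhead_near_optimal(n, 2))
```

For a stability coefficient of order $1/\sqrt{N}$, the first factor leaves a bound that does not vanish with $N$, while the second still yields a rate of order $\log(N)/\sqrt{N}$.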

1.2. OUR CONTRIBUTION

The core contribution of the present work is a set of substantially improved exponential generalization bounds for $L_q$-stable algorithms. The key ingredient of our analysis is a sharper concentration bound on sums of functions of independent random variables under $L_q$-norm bounded difference conditions, which generalizes a previous counterpart under uniform bounded difference conditions (Bousquet et al., 2020). With this generic concentration bound in hand, we are able to derive sharper generalization and excess risk bounds for $L_q$-stable learning algorithms that match those best known for uniformly stable algorithms. The power of our results is demonstrated by deriving more appealing exponential sparse excess risk bounds for computationally tractable sparsity estimation algorithms (such as IHT). The main results obtained in this work are sketched below:

• In Section 2, we first establish in Theorem 1 an $L_q$-norm inequality for sums of functions of random variables with $L_q$-norm bounded differences. Equipped with this general-purpose concentration inequality, we prove in Theorem 2 the following $L_q$-norm generalization bound for $L_q$-stable learning algorithms for all $q \ge 2$:

$$\|R(A(S)) - R_S(A(S))\|_q \lesssim q\gamma_q \log N + M_q\sqrt{\frac{q}{N}},$$

where $M_q$ is an upper bound on the moments $\|\ell(A(S); Z)\|_q$. Compared to Eq. (6), the preceding bound improves the overhead factor on $\gamma_q$ from $\sqrt{N}$ to $\log(N)$. As another consequence of our $L_q$-norm concentration inequality, we further derive in Theorem 3 the following excess risk bound for $L_q$-stable algorithms under the $B$-Bernstein condition with $M$-bounded losses (where $C = M + B$), or the $\mu$-quadratic-growth condition with $G$-Lipschitz losses (where $C = G^2/\mu$):

$$\|R(A(S)) - R^* - \Delta_{\mathrm{opt}}\|_q \lesssim \mathbb{E}[\Delta_{\mathrm{opt}}] + q\gamma_q \log(N) + \frac{Cq}{N}.$$

Based on the equivalence between moments and tails, the above result implies a deviation-optimal risk bound identical to that in Eq. (5) for uniformly stable algorithms.

• In Section 3, based on our $L_q$-stability generalization theory, we show in Theorem 4 a novel exponential sparse excess risk bound for inexact $L_0$-estimators. A key insight here is that $L_0$-estimators are in many cases "almost always" stable over any fixed supporting set, and thus can be shown to have $L_q$-stability over the same supporting set, which makes the analysis techniques developed for $L_q$-stable algorithms applicable. This novel application answers a call by Celisse & Guedj (2016) for extending the range of applicability of the $L_q$-stability theory beyond the unbounded ridge regression problem, and it complements other existing applications of the $L_q$-stability theory, including k-nearest neighbor classification and k-fold cross-validation (Celisse & Mary-Huard, 2018; Abou-Moustafa & Szepesvári, 2019). Last but not least, our improved $L_q$-stability theory can also be readily applied to the above-mentioned prior applications to obtain sharper generalization bounds.

2. SHARPER EXPONENTIAL BOUNDS FOR $L_q$-STABLE ALGORITHMS

2.1. A MOMENT INEQUALITY FOR SUMS OF RANDOM FUNCTIONS

We start by presenting in the following theorem a moment inequality for sums of random functions of $N$ independent random variables that satisfy the $L_q$-norm bounded difference condition. See Appendix B.1 for its proof.

Theorem 1. Let $S = \{Z_1, Z_2, \ldots, Z_N\}$ be a set of independent random variables valued in $\mathcal{Z}$. Let $g_1, \ldots, g_N$ be a set of measurable functions $g_i : \mathcal{Z}^N \to \mathbb{R}$ that satisfy the following conditions for any $i \in [N]$:
• $\mathbb{E}[g_i(S) \mid S \setminus Z_i] = 0$, almost surely;
• $g_i(S)$ has the following $L_q$-norm bounded difference property with respect to all variables in $S$ except $Z_i$, i.e., for all $j \ne i$ and all $q \ge 1$: $\|g_i(S) - g_i(S^{(j)})\|_q \le \beta_q$.
Then there exists a universal constant $\kappa < 1.271$ such that for all $q \ge 2$,

$$\Big\|\sum_{i=1}^N \big(g_i(S) - \mathbb{E}[g_i(S) \mid Z_i]\big)\Big\|_q \le 4\kappa q N \lceil \log_2 N \rceil \beta_q.$$

Additionally, if $\|\mathbb{E}[g_i(S) \mid Z_i]\|_q \le M_q$, then for all $q \ge 2$,

$$\Big\|\sum_{i=1}^N g_i(S)\Big\|_q \le 2\sqrt{2\kappa N q}\, M_q + 4\kappa q N \lceil \log_2 N \rceil \beta_q.$$

Remark 1. Theorem 1 extends the moment inequality of Bousquet et al. (2020, Theorem 4) from the distribution-free uniform bounded difference property to the distribution-dependent $L_q$-norm bounded difference property. In particular, if the $g_i(S)$ have the uniform bounded difference property, then Theorem 1 reduces to the result of Bousquet et al. (2020, Theorem 4).

Remark 2. The $L_q$-norm boundedness condition $\|\mathbb{E}[g_i(S) \mid Z_i]\|_q \le M_q$ in our theorem allows $g_i$ to be potentially unbounded over the domain of interest, which is weaker than the corresponding almost sure boundedness condition on $|\mathbb{E}[g_i(S) \mid Z_i]|$ imposed by Bousquet et al. (2020, Theorem 4).
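The two bounds of Theorem 1 are explicit enough to evaluate numerically. The sketch below plugs in the cap $\kappa = 1.271$ (the theorem only states $\kappa < 1.271$, so this gives a valid but slightly loose value); the function names and the example inputs are illustrative assumptions.

```python
import math

KAPPA_UPPER = 1.271  # Theorem 1 only asserts kappa < 1.271; the cap is used as a stand-in


def ceil_log2(n):
    """Exact ceil(log2(n)) for integers n >= 1, avoiding floating-point rounding."""
    return (n - 1).bit_length()


def centered_sum_bound(n, q, beta_q, kappa=KAPPA_UPPER):
    """First bound of Theorem 1: 4 * kappa * q * N * ceil(log2 N) * beta_q."""
    return 4.0 * kappa * q * n * ceil_log2(n) * beta_q


def full_sum_bound(n, q, beta_q, m_q, kappa=KAPPA_UPPER):
    """Second bound of Theorem 1: 2 * sqrt(2 * kappa * N * q) * M_q plus the first bound."""
    return 2.0 * math.sqrt(2.0 * kappa * n * q) * m_q + centered_sum_bound(n, q, beta_q, kappa)


# Hypothetical inputs: N = 1024 samples, q = 2, beta_q = 1/N, M_q = 1.
n = 1024
print(full_sum_bound(n, 2, 1.0 / n, 1.0) / n)
```

Dividing the second bound by $N$ gives $2\sqrt{2\kappa q/N}\,M_q + 4\kappa q \lceil\log_2 N\rceil \beta_q$, which has the same shape as the generalization bound stated in Theorem 2.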

2.2. GENERALIZATION BOUNDS FOR $L_q$-STABLE ALGORITHMS

As an important consequence of Theorem 1, we can derive the following main result on the generalization bound of $L_q$-stable learning algorithms. See Appendix B.2 for a proof of this result.

Theorem 2. Let $A : \mathcal{Z}^N \to \mathcal{W}$ be a learning algorithm that has $L_q$-stability with parameter $\gamma_q > 0$ for $q \ge 1$. Suppose that $\|\ell(A(S); Z)\|_q \le M_q$ for any $Z \in \mathcal{Z}$. Then for all $q \ge 2$,

$$\|R(A(S)) - R_S(A(S))\|_q \lesssim q\gamma_q \log N + M_q\sqrt{\frac{q}{N}}.$$

Remark 3. The $L_q$-norm boundedness condition $\|\ell(A(S); Z)\|_q \le M_q$ allows for learning with unbounded losses over, e.g., data distributions with sub-Gaussian or sub-exponential tails. Compared with the best-known moment generalization bound in Eq. (6) under the notion of $L_q$-stability, our bound in Theorem 2 substantially improves the overhead factor on $\gamma_q$ from $\sqrt{N}$ to $\log(N)$. In particular, when specialized to the regime of uniform stability where $\gamma_q \equiv \gamma_u$ for all $q \ge 1$, our result recovers the moment generalization bound in Eq. (3), which is nearly tight, up to logarithmic factors on sample size, in the sense of a lower bound on sums of random functions from Bousquet et al. (2020).

More broadly, for any $\delta \in (0, 1)$, suppose that the following exponential stability bound holds with probability at least $1 - \delta$, with a mixture of sub-Gaussian and sub-exponential tails, over $S$, $S^{(i)}$, $Z$:

$$\big|\ell(A(S); Z) - \ell(A(S^{(i)}); Z)\big| \le a \log\frac{e}{\delta} + b\sqrt{\log\frac{e}{\delta}}. \tag{7}$$

Then, according to the equivalence of tails and moments as summarized in Lemma 4 (see Appendix A), $A$ is $L_q$-stable with $\gamma_q = aq + b\sqrt{q}$. Assume that the loss is bounded in $(0, M]$ almost surely over data. Then the $L_q$-norm generalization bound in Theorem 2 combined with Lemma 4 immediately implies the following generalization bound:

$$|R(A(S)) - R_S(A(S))| \lesssim a \log(N) \log^2\frac{1}{\delta} + b \log(N) \log^{1.5}\frac{1}{\delta} + M\sqrt{\frac{\log(1/\delta)}{N}}.$$

Compared with the tail bound in Eq. (4) implied by uniform stability, the preceding $L_q$-stability bound is nearly identical, up to slightly worse confidence terms that are caused by the uncertainty of $L_q$-stability with respect to the data distribution. We conjecture that this slight deterioration in tail bounds might be remedied by using the exponential versions of the Efron-Stein inequality (Boucheron et al., 2003; Abou-Moustafa & Szepesvári, 2019) instead of the moment variant currently used. We leave the improvement of the poly-logarithmic terms for future investigation.
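The exponents $2$ and $1.5$ in the tail bound stated after Eq. (7) can be traced by a direct substitution. The following worked step is a sketch that suppresses constants and uses $M_q \le M$ for a loss bounded by $M$; it is meant to make the bookkeeping visible rather than to reproduce the proof in the appendix.

```latex
\begin{align*}
\|R(A(S)) - R_S(A(S))\|_q
  &\lesssim q\,\gamma_q \log N + M\sqrt{\frac{q}{N}}
   && \text{(Theorem 2 with } M_q \le M\text{)} \\
  &= \big(a q^{2} + b q^{3/2}\big)\log N + M\sqrt{\frac{q}{N}}
   && \text{(}\gamma_q = a q + b\sqrt{q}\text{)}.
\end{align*}
% Converting moments to tails amounts to taking q \asymp \log(1/\delta):
\[
|R(A(S)) - R_S(A(S))|
  \lesssim a \log(N)\log^{2}\frac{1}{\delta}
   + b \log(N)\log^{1.5}\frac{1}{\delta}
   + M\sqrt{\frac{\log(1/\delta)}{N}} .
\]
```

The extra powers of $\log(1/\delta)$ relative to Eq. (4) thus come from the growth of $\gamma_q$ in $q$, which is absent when $\gamma_q \equiv \gamma_u$.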

2.3. EXCESS RISK BOUNDS FOR $L_q$-STABLE ALGORITHMS

In addition to the generalization bounds, we further apply Theorem 1 to study the excess risk bounds of an $L_q$-stable learning algorithm, which are of particular interest for understanding its population risk minimization performance. Let us denote by $\mathcal{W}^* := \mathrm{Argmin}_{w \in \mathcal{W}} R(w)$ the optimal solution set of the population risk. In order to get sharper risk bounds, we need to impose some structural conditions on the risk functions. In particular, the generalized Bernstein condition (Koltchinskii, 2006) defined below is conventionally used, with multiple global risk minimizers allowed.

Assumption 1 (Generalized Bernstein condition). For some $B > 0$ and for any $w \in \mathcal{W}$, there exists $w^* \in \mathcal{W}^*$ such that the following holds:

$$\mathbb{E}\big[(\ell(w; Z) - \ell(w^*; Z))^2\big] \le B\,(R(w) - R(w^*)).$$

We will also consider the quadratic growth condition, which is widely used as an alternative condition for establishing fast rates of convergence in learning theory.

Assumption 2 (Quadratic growth condition). For some $\mu > 0$ and for any $w \in \mathcal{W}$, there exists $w^* \in \mathcal{W}^*$ such that the following holds:

$$R(w) \ge R^* + \frac{\mu}{2}\|w - w^*\|^2.$$

Remark 4. Clearly, when the loss is $G$-Lipschitz, the quadratic growth condition with parameter $\mu$ implies the Bernstein condition with parameter $B = 2G^2/\mu$.

The following theorem is our main result on the excess risk bound of $L_q$-stable algorithms, which extends the near-optimal exponential risk bounds of Klochkov & Zhivotovskiy (2021) from uniformly stable algorithms to $L_q$-stable algorithms. A proof of this result can be found in Appendix B.3.

Theorem 3. Let $A : \mathcal{Z}^N \to \mathcal{W}$ be a learning algorithm that has $L_q$-stability with parameter $\gamma_q$ for $q \ge 1$.
(a) If Assumption 1 holds and $\ell(\cdot; \cdot) \le M$, then for all $q \ge 2$,

$$\|R(A(S)) - R^* - \Delta_{\mathrm{opt}}\|_q \lesssim \mathbb{E}[\Delta_{\mathrm{opt}}] + q\gamma_q \log(N) + \frac{(M + B)q}{N}.$$

(b) If Assumption 2 holds and $\ell(\cdot; \cdot)$ is $G$-Lipschitz with respect to its first argument, then for all $q \ge 2$,

$$\|R(A(S)) - R^* - \Delta_{\mathrm{opt}}\|_q \lesssim \mathbb{E}[\Delta_{\mathrm{opt}}] + q\gamma_q \log(N) + \frac{G^2 q}{\mu N}.$$

Remark 5. Suppose that $A$ satisfies the exponential stability bound in Eq. (7), and thus $A$ has $L_q$-stability with $\gamma_q = aq + b\sqrt{q}$. Then, combined with Lemma 4, the $L_q$-norm risk bounds in Theorem 3 suggest that the following exponential tail bound holds:

$$R(A(S)) - R^* \lesssim \Delta_{\mathrm{opt}} + \mathbb{E}[\Delta_{\mathrm{opt}}] + a \log(N)\log^2\frac{1}{\delta} + b \log(N)\log^{1.5}\frac{1}{\delta} + \frac{\log(1/\delta)}{N}.$$

Remark 6. In part (a), the $M$-bounded-loss condition is not essential and can be relaxed to a sub-exponential (or sub-Gaussian) variant by using the general Bernstein-type inequalities for sums of independent sub-exponential random variables (Vershynin, 2018). Concerning part (b), under the quadratic growth condition, the loss is allowed to be unbounded as long as it is Lipschitz continuous.
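The implication claimed in Remark 4 follows in two lines; the short derivation below spells it out, using only the $G$-Lipschitz property of the loss and Assumption 2, with $w^* \in \mathcal{W}^*$ the minimizer provided by Assumption 2 (so that $R(w^*) = R^*$).

```latex
\begin{align*}
\mathbb{E}\big[(\ell(w; Z) - \ell(w^*; Z))^2\big]
  &\le G^2 \|w - w^*\|^2
   && \text{($G$-Lipschitz loss)} \\
  &\le \frac{2G^2}{\mu}\,\big(R(w) - R^*\big)
   && \text{(quadratic growth: } \tfrac{\mu}{2}\|w - w^*\|^2 \le R(w) - R^*\text{)} \\
  &= \frac{2G^2}{\mu}\,\big(R(w) - R(w^*)\big),
\end{align*}
% which is Assumption 1 with B = 2G^2/\mu.
```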

3. APPLICATION TO INEXACT $L_0$-ERM

In this section, we demonstrate an application of our $L_q$-stability and generalization theory to the following problem of high-dimensional stochastic risk minimization under a hard sparsity constraint:

$$\min_{w \in \mathcal{W}} R(w) := \mathbb{E}_Z[\ell(w; Z)] \quad \text{subject to} \quad \|w\|_0 \le k,$$

where $\mathcal{W} \subseteq \mathbb{R}^d$ and the cardinality constraint $\|w\|_0 \le k$ for $k \ll d$ is imposed to enhance the interpretability and learnability of the model in situations where there are no clear favourite explanatory variables, or the model is over-parameterized. We consider the following $L_0$-ERM problem over the training set $S = \{Z_i\}_{i \in [N]}$:

$$w^*_{S,k} = \arg\min_{\|w\|_0 \le k} R_S(w) := \frac{1}{N}\sum_{i=1}^N \ell(w; Z_i).$$

Since this problem is known to be NP-hard in general (Natarajan, 1995), it is computationally intractable to solve exactly. Alternatively, we consider the inexact $L_0$-ERM oracle as a meta-algorithm outlined in Algorithm 1.

Algorithm 1: Inexact $L_0$-ERM Oracle
Input: A training data set $S = \{Z_i\}_{i \in [N]}$ and the desired sparsity level $k$.
Output: $\tilde{w}_{S,k}$.
Compute an inexact $k$-sparse $L_0$-ERM estimation $\tilde{w}_{S,k}$ such that
• $\tilde{w}_{S,k}$ is optimal over its support $\tilde{J} = \mathrm{supp}(\tilde{w}_{S,k})$, i.e., $\tilde{w}_{S,k} = \arg\min_{w \in \mathcal{W}, \mathrm{supp}(w) \subseteq \tilde{J}} R_S(w)$;
• $\tilde{w}_{S,k}$ attains a certain $\bar{k}$-sparse sub-optimality level $\Delta_{\bar{k},\mathrm{opt}} \ge 0$ for some $\bar{k} \le k$ such that $R_S(\tilde{w}_{S,k}) - R_S(w^*_{S,\bar{k}}) \le \Delta_{\bar{k},\mathrm{opt}}$.

In order to avoid assuming unrealistic conditions like the restricted isometry property (RIP), it is typically necessary to allow sparsity level relaxation for approximate algorithms like IHT to achieve favorable convergence behavior (Jain et al., 2014; Shen & Li, 2017; Yuan et al., 2018; Murata & Suzuki, 2018). Therefore, we are particularly interested in the inexact $L_0$-ERM oracle with $\bar{k}$-sparse sub-optimality $\Delta_{\bar{k},\mathrm{opt}} \ge 0$ for some $\bar{k} \le k$ such that the output $\tilde{w}_{S,k}$ of Algorithm 1 satisfies $R_S(\tilde{w}_{S,k}) - R_S(w^*_{S,\bar{k}}) \le \Delta_{\bar{k},\mathrm{opt}}$. It is typical that $\Delta_{\bar{k},\mathrm{opt}}$ is a random value over the training set $S$.
For example, the sub-optimality guarantees of IHT for the empirical risk usually hold with high probability over training data (Jain et al., 2014). Let $w^*_{\bar{k}} := \arg\min_{\|w\|_0 \le \bar{k}} R(w)$ be the $\bar{k}$-sparse minimizer of the population risk for some $\bar{k} \le k$. We are interested in deriving exponential upper bounds for the $\bar{k}$-sparse excess risk given by $R(\tilde{w}_{S,k}) - R(w^*_{\bar{k}})$. Our analysis also relies on the condition of Restricted Strong Convexity (RSC), which extends the concept of strong convexity to the analysis of sparsity recovery methods (Bahmani et al., 2013; Blumensath & Davies, 2009; Jain et al., 2014; Yuan et al., 2020).

Definition 3 (Restricted Strong Convexity). For any sparsity level $1 \le s \le d$, we say a function $f$ is restricted $\mu_s$-strongly convex if there exists some $\mu_s > 0$ such that

$$f(w) - f(w') - \langle \nabla f(w'), w - w' \rangle \ge \frac{\mu_s}{2}\|w - w'\|^2, \quad \forall\, \|w - w'\|_0 \le s.$$

In particular, when $s = d$, we say $f$ is $\mu$-strongly convex if it is $\mu_d$-strongly convex.

The following basic assumptions will be used in our theoretical analysis.

Assumption 3. The loss function $\ell(\cdot; \cdot)$ is convex and $G$-Lipschitz with respect to its first argument.

Assumption 4. The population risk $R$ is $\mu$-strongly convex and the empirical risk $R_S$ is $\mu_k$-strongly convex with probability at least $1 - \delta_N$ over the sample $S$ for some $\delta_N \in (0, 1)$.

Assumption 5. The domain of interest is uniformly bounded such that $\|w\| \le D$ for all $w \in \mathcal{W}$.

Remark 7. Assumption 3 is common in the study of algorithmic stability and generalization theory. Assumption 4 is conventional in the sparsity recovery analysis of $L_0$-ERM. Assumption 5 is needed for establishing the $L_q$-stability of $L_0$-ERM in Lemma 1 below. Similar conditions have also been assumed in the prior work of Yuan & Li (2022).

We first present the following lemma, which guarantees the $L_q$-stability of $w^*_{S|J}$ for any fixed $J$ with $|J| = k$. See Appendix C.1 for its proof.

Lemma 1. Assume that Assumptions 3, 4 and 5 hold and $\frac{\log(1/\delta_N)}{\log(N)} \ge 2$.
Let $J \subseteq [d]$ be a set of indices of cardinality $|J| = k$. Then for any $q \ge 2$, the oracle estimator $w^*_{S|J}$ has $L_q$-stability with parameter

$$\gamma_q = \frac{1}{N}\Big(\frac{4G^2}{\mu_k} + 2GD\Big) + \frac{2GD\log(N)\,q}{\log(1/\delta_N)}.$$

Remark 8. For sparse linear regression models, it can be verified based on the result by Agarwal et al. (2012, Lemma 6) that Assumption 4 holds with $\delta_N = e^{-c_0 N}$ for some universal positive constant $c_0$. Then we have $\frac{\log(1/\delta_N)}{\log(N)} = \frac{c_0 N}{\log(N)} \ge 2$ for sufficiently large $N$, and Lemma 1 implies that $\gamma_q \lesssim \frac{1}{N}\Big(\frac{G^2}{\mu_k} + \frac{GD\log(N)\,q}{c_0}\Big)$ for all $q \ge 2$.

The following theorem is our main result on the sparse excess risk of the inexact $L_0$-ERM oracle as defined in Algorithm 1. See Appendix C.2 for its proof, which is inspired by that of Theorem 3.

Theorem 4. Suppose that Assumptions 3, 4, 5 hold. Assume that $\frac{\log(1/\delta_N)}{\log(N)} \ge 2$. Then for any $\delta \in (0, e^{-1})$, the following $\bar{k}$-sparse excess risk bound holds with probability at least $1 - \delta$ over the random draw of $S$:

$$R(\tilde{w}_{S,k}) - R(w^*_{\bar{k}}) \lesssim GD\,\frac{\big(k\log\frac{ed}{k} + \log\frac{e}{\delta}\big)^2 \log^2(N)}{\log(1/\delta_N)} + \Big(\log(N)\Big(\frac{G^2}{\mu_k} + GD\Big) + \frac{G^2}{\mu}\Big)\frac{k\log\frac{ed}{k} + \log\frac{e}{\delta}}{N} + G\sqrt{\frac{\big(k\log\frac{ed}{k} + \log\frac{e}{\delta}\big)\big(R(w^*_{\bar{k}}) - R(w^*)\big)}{N\mu}} + \Delta_{\bar{k},\mathrm{opt}} + \mathbb{E}\big[\Delta_{\bar{k},\mathrm{opt}}\big].$$

Remark 9. For IHT-style algorithms, the sparse optimization sub-optimality $\Delta_{\bar{k},\mathrm{opt}}$ can be made arbitrarily small (with high probability) after sufficiently many iterations (Jain et al., 2014). In particular, for sparse linear regression models in which Assumption 4 holds with $\delta_N = e^{-c_0 N}$ (Agarwal et al., 2012), the condition $\frac{\log(1/\delta_N)}{\log(N)} = \frac{c_0 N}{\log(N)} \ge 2$ is always fulfilled for sufficiently large sample size $N$, and the sparse excess risk bound in Theorem 4 roughly scales as

$$R(\tilde{w}_{S,k}) - R(w^*_{\bar{k}}) \lesssim \frac{(k\log(d) + \log(1/\delta))^2 \log^2(N)}{N} + \sqrt{\frac{(k\log(d) + \log(1/\delta))\big(R(w^*_{\bar{k}}) - R(w^*)\big)}{N}} + \Delta_{\bar{k},\mathrm{opt}} + \mathbb{E}\big[\Delta_{\bar{k},\mathrm{opt}}\big].$$

Generally, for misspecified sparsity models, the dominant rate in the above bound matches the $O(1/\sqrt{N})$ sparse excess risk bound of Yuan & Li (2022, Theorem 1) for IHT under similar conditions. Compared to the $O(1/N)$ bound available in that paper (Yuan & Li, 2022, Theorem 3), the preceding bound is generally slower in rate but more broadly applicable, as it does not impose the strong-signal or bounded-loss conditions required in the analysis of Yuan & Li (2022). For well-specified $\bar{k}$-sparse models such that $R(w^*_{\bar{k}}) = R(w^*)$, i.e., the population minimizer is truly $\bar{k}$-sparse, the preceding bound improves to

$$R(\tilde{w}_{S,k}) - R(w^*_{\bar{k}}) \lesssim \frac{(k\log(d) + \log(1/\delta))^2 \log^2(N)}{N} + \Delta_{\bar{k},\mathrm{opt}} + \mathbb{E}\big[\Delta_{\bar{k},\mathrm{opt}}\big].$$

Therefore, our bound is more appealing in the sense that it naturally adapts to well-specified models to attain an improved $O(1/N)$ rate. In contrast, the regularization technique used by Yuan & Li (2022, Theorem 1) needs an optimal choice of penalty strength of scale $O(1/\sqrt{N})$, which leads to an overall slow rate of convergence, though the analysis is relatively simpler.

We further comment on the role played by $L_q$-stability in deriving the improved bound of Theorem 4. The $O(1/N)$ fast-rate component of the bound is rooted in the $L_q$-stability coefficient established in Lemma 1 and an application of Lemma 6. The relatively slow $O(1/\sqrt{N})$ component, which is controlled by the oracle factor $R(w^*_{\bar{k}}) - R(w^*)$, is mainly due to a careful analysis customized for handling the combinatorial optimization nature of $L_0$-ERM. Such a slow-rate term would vanish if the global minimizer $w^*$ is truly $\bar{k}$-sparse. Therefore, the fast-rate component is attributable to our $L_q$-stability theory, while the slow but adaptive rate component is mainly attributable to the optimization property of $L_0$-ERM. Finally, we comment in passing that our improved $L_q$-stability theory can also be applied to prior applications such as unbounded ridge regression and k-fold cross-validation to obtain sharper generalization bounds.
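For readers unfamiliar with the IHT-style algorithms referred to throughout this section, the following is a minimal sketch of plain Iterative Hard Thresholding for a sparse least-squares problem. It is an illustration under stated assumptions, not the procedure analyzed in the paper: the squared loss, the unit step size, the toy design matrix and all function names are choices made here, and the sketch omits the refit over the selected support that Algorithm 1 requires of its output.

```python
def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w and zero out the rest."""
    order = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    keep = set(order[:k])
    return [w[j] if j in keep else 0.0 for j in range(len(w))]


def grad_least_squares(w, xs, ys):
    """Gradient of R_S(w) = (1/2N) * sum_i (<x_i, w> - y_i)^2."""
    n, d = len(xs), len(w)
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        residual = sum(xj * wj for xj, wj in zip(x, w)) - y
        for j in range(d):
            grad[j] += residual * x[j] / n
    return grad


def iht(xs, ys, k, step=1.0, iters=50):
    """Gradient step on the empirical risk followed by projection onto k-sparse vectors."""
    w = [0.0] * len(xs[0])
    for _ in range(iters):
        g = grad_least_squares(w, xs, ys)
        w = hard_threshold([wj - step * gj for wj, gj in zip(w, g)], k)
    return w


# Toy noiseless problem (hypothetical data): the design satisfies (1/N) X^T X = I.
xs = [[2.0, 0.0, 0.0, 0.0],
      [0.0, 2.0, 0.0, 0.0],
      [0.0, 0.0, 2.0, 0.0],
      [0.0, 0.0, 0.0, 2.0]]
w_true = [2.0, 0.0, -3.0, 0.0]
ys = [sum(xj * wj for xj, wj in zip(x, w_true)) for x in xs]

w_hat = iht(xs, ys, k=2)
print(w_hat)
```

In practice the step size must be matched to the smoothness of the empirical risk, and the relaxed sparsity level $k \ge \bar{k}$ discussed above is what allows such iterations to converge without RIP-type conditions.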

4. CONCLUSION

In this paper, we presented an improved generalization theory for L q -stable learning algorithms. There exists a clear discrepancy between the recently developed near-optimal generalization bounds for uniformly stable algorithms and the best known, yet sub-optimal, bounds for L q -stable algorithms. To close this theoretical gap, we derived, for the first time, a set of near-optimal exponential generalization bounds for L q -stable algorithms that match those of uniformly stable algorithms. As a concrete application of our L q -stability theory, we applied the developed analysis tools to derive strong exponential risk bounds for inexact sparsity-constrained ERM estimators under milder conditions. To conclude, L q -stable algorithms generalize almost as fast as uniformly stable algorithms, even though the distribution-dependent notion of L q -stability is weaker than uniform stability.

A PRELIMINARIES

In this section, we collect some preliminary results that will be used in our analysis. We start by introducing the following $L_q$-norm generalization of the celebrated Efron-Stein inequality, which is a corollary of Boucheron et al. (2005, Theorem 2).

Proposition 1 (Generalized Efron-Stein inequality (Celisse & Guedj, 2016)). Let $S = \{Z_1, ..., Z_N\}$ be a set of independent random variables valued in $\mathcal{Z}$ and $g : \mathcal{Z}^N \to \mathbb{R}$ be some measurable function. Then there exists a universal constant $\kappa < 1.271$ such that for all $q \ge 2$,

$$\left\|g(S) - \mathbb{E}[g(S)]\right\|_q \le \sqrt{2\kappa q\left\|\sum_{i=1}^N \left(g(S) - g(S^{(i)})\right)^2\right\|_{q/2}}.$$

The following result is an immediate consequence of Proposition 1 when applied to sums of independent random variables, which revisits a version of the Marcinkiewicz-Zygmund inequality (Chow & Teicher, 2003).

Proposition 2. Let $Z_1, ..., Z_N$ be a set of independent centered random variables. Then for all $q \ge 2$,

$$\left\|\sum_{i=1}^N Z_i\right\|_q \le 2\sqrt{2\kappa q\left\|\sum_{i=1}^N Z_i^2\right\|_{q/2}}.$$

The following lemma is simple yet useful in our analysis.

Lemma 2. Let $S = \{Z_1, ..., Z_N\}$ be a set of independent random variables valued in some measure space $\mathcal{Z}$ and $g : \mathcal{Z}^N \to \mathbb{R}$ be some measurable function. Then for all $I \subseteq [N]$ and $q \ge 1$, we have

$$\|g(S_I)\|_q \le \|g(S)\|_q = \big\|\|g\|_q(S_I)\big\|_q.$$

Proof. Recall $g(S_I) = \mathbb{E}[g(S) \mid S_I]$. Then using Jensen's inequality we can show that

$$\|g(S_I)\|_q = \left(\mathbb{E}\left[\left|\mathbb{E}[g(S) \mid S_I]\right|^q\right]\right)^{1/q} \le \left(\mathbb{E}\left[\mathbb{E}[|g(S)|^q \mid S_I]\right]\right)^{1/q} = \left(\mathbb{E}[|g(S)|^q]\right)^{1/q} = \|g(S)\|_q.$$

By definition we can also express

$$\|g(S)\|_q = \left(\mathbb{E}\left[\mathbb{E}[|g(S)|^q \mid S_I]\right]\right)^{1/q} = \big\|\|g(S)\|_q(S_I)\big\|_q.$$

As a direct consequence of Lemma 2, the following result indicates that conditional expectation does not expand the differences in $L_q$-norm.

Lemma 3. Let $S = \{Z_1, ..., Z_N\}$ be a set of independent random variables valued in some measure space $\mathcal{Z}$ and $g : \mathcal{Z}^N \to \mathbb{R}$ be some measurable function. Let $I \subseteq [N]$ be an index set. Then for all $i \in I$ and $q \ge 1$,

$$\left\|g(S_I) - g(S_I^{(i)})\right\|_q \le \left\|g(S) - g(S^{(i)})\right\|_q.$$

Proof.
For each i ∈ I, by applying Lemma 2 to g(S) -g(S (i) ) we can show that ∥g(S I )g(S (i) I )∥ q ≤ ∥g(S) -g(S (i) )∥ q , which gives the desired result. We also need the following lemma about the equivalence between tails and moments (see, e.g., Bousquet et al., 2020) . Lemma 4. Let Y be a real-valued random variable. • Suppose that Y satisfies the following inequality for some a, b ≥ 0 with probability at least 1 -δ for any δ ∈ (0, 1), |Y | ≤ a log e δ + b log e δ . Then, for any q ≥ 1 it holds that ∥Y ∥ q ≤ 3aq + 9b √ q. • Suppose that Y satisfies ∥Y ∥ q ≤ f (q) for any 1 ≤ q l ≤ q < q u and some non-negative real function f . Then the following holds with probability at least 1 -δ for any δ ∈ (e 1-qu , e 1-q l ]: |Y | ≤ ef log e δ . Proof. We only prove the second part which slightly generalizes the corresponding result of Bousquet et al. (2020, Lemma 1) . For any δ ∈ (e 1-qu , e 1-q l ], we choose q = log(e/δ) ∈ [q l , q u ). Using the condition ∥Y ∥ q ≤ f (q) and Markov's inequality yields P |Y | > ef log e δ ≤ P (|Y | > e∥Y ∥ q ) ≤ E[|Y | q ] e q ∥Y ∥ q q = δ e ≤ δ. This proves the desired bound in the second part. Remark 10. Suppose that for any δ ∈ (0, 1), the following inequality holds with probability at least 1 -δ over S, S (i) , Z: ℓ(A(S); Z) -ℓ(A(S (i) ); Z) ≤ a log e δ + b log e δ . Then according to the first part of Lemma 4 we have that A is L q -stable by γ q = 3aq + 9b √ q. Remark 11. Specially if q l = 1 and q u = ∞ is allowed, then the second bound in Lemma 4 holds with an arbitrary tail bound δ ∈ (0, 1). Finally, we present the following technical lemma about self-bounding inequalities to be used for showing fast rates of excess risk bounds under Bernstein or quadratic growth conditions. Lemma 5. Let x, a, b, c be a set of non-negative quantities satisfying x ≤ a + b(x + c). Then it must hold that x ≤ 3a+2b+c 2 . Proof. If x ≤ a, then the claim holds trivially. 
In the complementary case of x > a, by condition we must have (x -a) 2 ≤ b(x + c), which then implies x ≤ 2a + b + 2 b(a + c) 2 ≤ 3a + 2b + c 2 , where we have used the basic fact 2 b(a + c) ≤ b + a + c.
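The moment-to-tail conversion in the second part of Lemma 4 can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the paper's analysis: it takes $Y$ to be a standard Gaussian, uses the exact moment norm $\|Y\|_q$ as the function $f$, and checks that the threshold $e f(\log(e/\delta))$ indeed has tail probability at most $\delta/e \le \delta$. The helper names are hypothetical.

```python
import math


def gaussian_abs_moment_norm(q):
    """Exact L_q norm of Y ~ N(0, 1): (2^{q/2} * Gamma((q+1)/2) / sqrt(pi))^{1/q}."""
    log_moment = (0.5 * q * math.log(2.0)
                  + math.lgamma(0.5 * (q + 1.0))
                  - 0.5 * math.log(math.pi))
    return math.exp(log_moment / q)


def tail_threshold(f, delta):
    """Threshold e * f(log(e/delta)) from the second part of Lemma 4."""
    q = math.log(math.e / delta)  # the choice of q made in the proof
    return math.e * f(q)


def gaussian_two_sided_tail(t):
    """P(|Y| > t) for Y ~ N(0, 1), computed via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))


if __name__ == "__main__":
    for delta in (0.5, 0.1, 1e-3, 1e-6):
        t = tail_threshold(gaussian_abs_moment_norm, delta)
        print(f"delta={delta:g}  threshold={t:.3f}  tail={gaussian_two_sided_tail(t):.3e}")
```

Since the Gaussian moment bound holds for every $q \ge 1$, this corresponds to the case $q_l = 1$, $q_u = \infty$ of Remark 11, so any $\delta \in (0, 1)$ is admissible.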

B PROOFS FOR SECTION 2

B.1 PROOF OF THEOREM 1

The proof is a generalization of the sample-splitting arguments of Feldman & Vondrák (2019) ; Bousquet et al. (2020) under the considered property of L q -norm bounded difference. For the sake of completeness, we reproduce below the relatively simpler arguments of Bousquet et al. (2020, Theorem 4) , with proper modifications made to adapt to our setting via using the generalized Efron-Stein inequality in places of McDiarmid's inequality. Proof of Theorem 1.  Consider k such that 2 k-1 < N ≤ 2 k . If N < 2 k , I 0 = {{1}, ..., {2 k }}, I 1 = {{1, 2}, {3, 4}..., {2 k -1, 2 k }}, I k = {{1, ..., 2 k }}. For any i ∈ [N ] and l = 0, ..., k, we denote by I l (i) ∈ I l the only set from I l that contains i and consider the following random variables g l i = E g i | Z i , S I l (i) . In particular, g 0 i = g i and g k i = E[g i | Z i ]. Clearly we have the following telescope sum: g i = k-1 l=0 (g l i -g l+1 i ) + E[g i | Z i ]. It follows that N i=1 g i -E[g i | Z i ] q ≤ k-1 l=0 N i=1 g l i -g l+1 i q . ( ) We need to upper bound the right hand side of the above inequality. To this end, it can be verified that g l+1 i = E g i | Z i , S I l+1 (i) } = E g l i | Z i , S I l+1 (i) . Since g i has a bounded L q -difference by β q with respect to all variables except the i-th variable, it is known from Lemma 3 that so is g l i for each l = 0, ..., k. Conditioned on Z i , S I l+1 (i) , invoking Proposition 1 to g l i yields ∥g l i -g l+1 i ∥ q Z i , S I l+1 (i) ≤ 2κq2 l β q , as there are 2 l indices in I l+1 (i) \ I l (i). It follows from Lemma 2 that ∥g l i -g l+1 i ∥ q = ∥g l i -g l+1 i ∥ q (Z i , S I l+1 (i) ) q ≤ 2κq2 l β q . Now consider any I l ∈ I l . Since for each i ∈ I l , g l i -g l+1 i depends only on Z i , S I l , these terms are independent and centered conditioned on S I l . Therefore, applying Proposition 2 yields i∈I l g l i -g l+1 i q S I l ≤ 2 2κq2 l × 2κq2 l β q = 4κq2 l β q , which according to Lemma 2 implies that i∈I l g l i -g l+1 i q ≤ 4κq2 l β q . 
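Then based on the triangle inequality we get

$$\left\|\sum_{i\in[N]} \left(g_i^l - g_i^{l+1}\right)\right\|_q \le \sum_{I^l\in\mathcal{I}_l}\left\|\sum_{i\in I^l} \left(g_i^l - g_i^{l+1}\right)\right\|_q \le 2^{k-l}\times 4\kappa q 2^l\beta_q = 4\kappa q 2^k\beta_q < 4\kappa q N\beta_q.$$

Finally, the right-hand side of Eq. (10) can be bounded as

$$\left\|\sum_{i=1}^N \left(g_i - \mathbb{E}[g_i \mid Z_i]\right)\right\|_q \le \sum_{l=0}^{k-1}\left\|\sum_{i=1}^N \left(g_i^l - g_i^{l+1}\right)\right\|_q \le 4\kappa q N\lceil\log_2 N\rceil\beta_q, \tag{11}$$

which gives the first desired bound. In view of Eq. (11) and the triangle inequality we have

$$\left\|\sum_{i=1}^N g_i\right\|_q \le \left\|\sum_{i=1}^N \mathbb{E}[g_i \mid Z_i]\right\|_q + 4\kappa q N\lceil\log_2 N\rceil\beta_q. \tag{12}$$

Since $\|\mathbb{E}[g_i(S) \mid Z_i]\|_q \le M_q$ and $\mathbb{E}[g_i(S) \mid S \setminus Z_i] = 0$, it follows from Proposition 2 that the first term on the right-hand side of Eq. (12) can be bounded as

$$\left\|\sum_{i=1}^N \mathbb{E}[g_i \mid Z_i]\right\|_q \le 2\sqrt{2\kappa N q}\,M_q. \tag{13}$$

The second desired bound is obtained by plugging Eq. (13) into Eq. (12).

Published as a conference paper at ICLR 2023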
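The counting step in the proof of Theorem 1 rests on the nested dyadic partitions of the (padded) index set: level $l$ consists of $2^{k-l}$ blocks of size $2^l$, so summing a per-block bound proportional to $2^l$ over all blocks gives a quantity proportional to $2^k$ at every level. The following sketch builds these partitions so the counts can be inspected directly. The function names `dyadic_partitions` and `block_containing` are hypothetical and introduced only for illustration.

```python
def dyadic_partitions(N):
    """Return (k, levels) where 2^(k-1) < N <= 2^k and levels[l] is the
    partition of {1, ..., 2^k} into consecutive blocks of size 2^l."""
    k = (N - 1).bit_length()  # smallest k with N <= 2^k
    padded = 2 ** k           # index set size after padding
    levels = []
    for l in range(k + 1):
        width = 2 ** l
        blocks = [list(range(start + 1, start + width + 1))
                  for start in range(0, padded, width)]
        levels.append(blocks)
    return k, levels


def block_containing(levels, l, i):
    """The unique block of level l that contains index i (1-based)."""
    return levels[l][(i - 1) // (2 ** l)]


if __name__ == "__main__":
    k, levels = dyadic_partitions(5)
    for l, blocks in enumerate(levels):
        print(f"level {l}: {len(blocks)} blocks of size {len(blocks[0])}")
    # Indices added when moving from level 1 to level 2 around i = 6.
    added = set(block_containing(levels, 2, 6)) - set(block_containing(levels, 1, 6))
    print(sorted(added))
```

Moving from level $l$ to level $l+1$ adds exactly $2^l$ indices to the block around $i$, which is the count used when Proposition 1 is invoked in the proof.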

B.2 PROOF OF THEOREM 2

The proof technique follows that of Bousquet et al. (2020, Lemma 7 ) developed for uniformly stable algorithms, with natural adaptation to the distribution-dependent notion of L q -stability. Proof. Let us consider h i (S) := R(A(S)) -ℓ(A(S); Z i ), g i (S) = E Z ′ i R(A(S (i) )) -ℓ(A(S (i) ); Z i ) . Then the L q -norm of the generalization gap can be bounded as ∥R(A(S)) -R S (A(S))∥ q = 1 N N i=1 h i (S) q ≤ 1 N       N i=1 g i (S) q A + N i=1 (h i (S) -g i (S)) q B       . ( ) We next respectively upper bound the two terms A and B in Eq. ( 14). To bound the term A, by definition it holds that E[g i (S) | S \ Z i ] = 0. Based on the triangle inequality we can show that ∥E[g i (S) | Z i ]∥ q ≤∥g i (S)∥ q = E Z ′ i [E Z [ℓ(A(S (i) ); Z)]] -E Z ′ i ℓ(A(S (i) ); Z i ) q ≤∥ℓ(A(S (i) ); Z)∥ q + ∥ℓ(A(S (i) ); Z i )∥ q ≤ 2M q , where in the first and second inequalities we have twice used Lemma 2. Next we further show that g i has a bounded L q -norm difference by 2γ q with respect to all variables in S except Z i . Indeed, for each j ̸ = i it can be verified that g i (S) -g i (S (j) ) q ≤ E Z ′ i R(A(S (i) )) -R(A((S (i) ) (j) )) q + E Z ′ i ℓ(A(S (i) ); Z i ) -ℓ(A((S (i) ) (j) ); Z i ) q = E Z ′ i E Z [ℓ(A(S (i) ); Z) -ℓ(A((S (i) ) (j) ); Z)] q + E Z ′ i [ℓ(A(S (i) ); Z i ) -ℓ(A((S (i) ) (j) ); Z i )] q ≤ ℓ(A(S (i) ); Z) -ℓ(A((S (i) ) (j) ); Z) q + ℓ(A(S (i) ); Z i ) -ℓ(A((S (i) ) (j) ); Z i ) q ≤ 2γ q , where in the last but one inequality we have used Lemma 2, while in the last equality we have used the L q -stability assumption on the algorithm A. Therefore, {g i } satisfy the conditions of Theorem 1 and thus A = N i=1 g i (S) q ≤ 4 2κN qM q + 8κqN ⌈log 2 N ⌉γ q . ( ) Now we proceed to bound the term B. 
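It can be verified that

$$\begin{aligned}
B &\le \sum_{i=1}^N \left\|\mathbb{E}_{Z'_i}\left[R(A(S)) - R(A(S^{(i)}))\right]\right\|_q + \sum_{i=1}^N \left\|\mathbb{E}_{Z'_i}\left[\ell(A(S); Z_i) - \ell(A(S^{(i)}); Z_i)\right]\right\|_q \\
&= \sum_{i=1}^N \left\|\mathbb{E}_{Z'_i}\mathbb{E}_{Z}\left[\ell(A(S); Z) - \ell(A(S^{(i)}); Z)\right]\right\|_q + \sum_{i=1}^N \left\|\mathbb{E}_{Z'_i}\left[\ell(A(S); Z_i) - \ell(A(S^{(i)}); Z_i)\right]\right\|_q \\
&\le \sum_{i=1}^N \left\|\ell(A(S); Z) - \ell(A(S^{(i)}); Z)\right\|_q + \sum_{i=1}^N \left\|\ell(A(S); Z_i) - \ell(A(S^{(i)}); Z_i)\right\|_q \le 2N\gamma_q,
\end{aligned} \tag{16}$$

where in the second-to-last inequality we have used Lemma 2, and in the last inequality we have used the $L_q$-stability assumption. Plugging the bounds in Eq. (15) and Eq. (16) into Eq. (14) and preserving the leading terms yields the desired result.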
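For concreteness, the final substitution can be spelled out as follows. This is a sketch of the arithmetic only, assuming the constants in Eq. (15) read $A \le 4\sqrt{2\kappa N q}\,M_q + 8\kappa q N\lceil\log_2 N\rceil\gamma_q$ and that $N \ge 2$, $q \ge 2$; the last line records the leading terms up to universal constants.

```latex
\begin{align*}
\|R(A(S)) - R_S(A(S))\|_q
  &\le \frac{1}{N}\,(A + B) \\
  &\le \frac{1}{N}\Big(4\sqrt{2\kappa N q}\,M_q
       + 8\kappa q N \lceil \log_2 N \rceil \gamma_q
       + 2N\gamma_q\Big) \\
  &= 4 M_q \sqrt{\frac{2\kappa q}{N}}
       + \big(8\kappa q \lceil \log_2 N \rceil + 2\big)\gamma_q \\
  &\lesssim q\,\gamma_q \log(N) + M_q \sqrt{\frac{q}{N}} .
\end{align*}
```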

B.3 PROOF OF THEOREM 3

We need the following lemma which plays a fundamental role in proving the main result. Lemma 6. Let A : Z N → W be a learning algorithm that has L q -stability by γ q for q ≥ 1. Suppose that ∥ℓ(A(S); Z)∥ q ≤ M q for any Z ∈ Z. Let S ′ be an independent copy of S. Then the following bound holds for all q ≥ 2: R(A(S)) -R S (A(S)) -E[R(A(S))] + 1 N N i=1 E[ℓ(A(S ′ ); Z i ) | Z i ] q ≲ qγ q log(N ). Proof. Let us again consider g i (S) = E Z ′ i R(A(S (i) )) -ℓ(A(S (i) ); Z i ) . Then using similar proof arguments to those of Theorem 2 we can show that N (R(A(S)) -R S (A(S))) - N i=1 g i (S) q ≤ N i=1 E Z ′ i R(A(S)) -R(A(S (i) )) q + N i=1 E Z ′ i ℓ(A(S); Z i ) -ℓ(A(S (i) ); Z i ) q = N i=1 E Z ′ i E Z ℓ(A(S); Z) -ℓ(A(S (i) ); Z) q + N i=1 E Z ′ i ℓ(A(S); Z i ) -ℓ(A(S (i) ); Z i ) q ≤ N i=1 ℓ(A(S); Z) -ℓ(A(S (i) ); Z) q + N i=1 ℓ(A(S); Z i ) -ℓ(A(S (i) ); Z i ) q ≤ 2N γ q , which implies R(A(S)) -R S (A(S)) - 1 N N i=1 g i (S) q ≤ 2γ q . Also, g i (S) satisfies the conditions of Theorem 1 with β q = 2γ q and it follows from the second bound of Theorem 1 that for all q ≥ 2, 1 N N i=1 (g i (S) -E[g i (S) | Z i ]) q ≤ 8κqγ q ⌈log 2 N ⌉. Combining the above two yields R(A(S)) -R S (A(S)) - 1 N N i=1 E[g i (S) | Z i ] q ≲ qγ q log(N ). The desired result follows by noting that E[g i (S) | Z i ] = E[R(A(S ′ ))] -E[ℓ(A(S ′ ); Z i ) | Z i ] = E[R(A(S))] -E[ℓ(A(S ′ ); Z i ) | Z i ]. This completes the proof. With Lemma 6 in place, we are ready to prove the main result of Theorem 3. Proof of Theorem 3. Consider any w * ∈ W * . It is standard to decompose and bound the excess risk as R(A(S)) -R * =R(A(S)) -R S (A(S)) + R S (A(S)) -R S (w * ) + R S (w * ) -R * ≤∆ opt + R(A(S)) -R S (A(S)) -(R * -R S (w * )) =∆ opt + Γ(S) + E[R(A(S))] - 1 N N i=1 E[ℓ(A(S ′ ); Z i ) | Z i ] -(R * -R S (w * )), where Γ(S) = R(A(S)) -R S (A(S)) -E[R(A(S))] + 1 N N i=1 E[ℓ(A(S ′ ); Z i ) | Z i ]. 
Since we have the freedom to choose w * , let us specify it in the above as w * (S ′ ) ∈ W * which is the minimizer that satisfies the Bernstein condition in Assumption 1 associated with A(S ′ ). Then, it follows from Eq. ( 17) that R(A(S)) -R * -∆ opt ≤Γ(S) + E[R(A(S))] - 1 N N i=1 E[ℓ(A(S ′ ); Z i ) | Z i ] -(R * -E[R S (w * (S ′ )) | S]). Consequently, ∥R(A(S)) -R * -∆ opt ∥ q ≤ ∥Γ(S)∥ q + 1 N N i=1 E [ℓ(w * (S ′ ); Z i ) -ℓ(A(S ′ ); Z i ) | Z i ] -(R * -E[R(A(S ′ ))]) q ζ1 ≲qγ q log(N ) + 1 N N i=1 E [ℓ(w * (S ′ ); Z i ) -ℓ(A(S ′ ); Z i ) | Z i ] -(R * -E[R(A(S ′ ))]) q T , where in "ζ 1 " we have applied Lemma 6 to obtain ∥Γ(S)∥ q ≤ qγ q log(N ), and the fact E[R(A(S))] = E[R(A(S ′ ))]. Part (a): To bound the term T , using Bernstein's inequality for sum of independent bounded variablesfoot_1 together with the generalized Bernstein condition we can show (see the proof arguments of Klochkov & Zhivotovskiy (2021, Theorem 1.1) for the details) that T ≲ qBE[R(A(S)) -R * ] N + qM N = qB(E[R(A(S)) -R * -∆ opt ] + E[∆ opt ]) N + qM N ≤ qB(∥R(A(S)) -R * -∆ opt ∥ q + E[∆ opt ]) N + qM N , where the last inequality is due to Jensen's inequality. Therefore, combining the above and Eq. ( 18) yields that for some universal constant C: ∥R(A(S))-R * -∆ opt ∥ q ≤ C qγ q log(N ) + qB(∥R(A(S)) -R * -∆ opt ∥ q + E[∆ opt ]) N + qM N . By invoking Lemma 5 to the above self-bounding inequality with x = ∥R(A(S)) -R * -∆ opt ∥ q , a = C(qγ q log(N ) + qM N ), b = qB N , and c = E[∆ opt ] we immediately obtain that ∥R(A(S)) -R * -∆ opt ∥ q ≲ E[∆ opt ] + qγ q log(N ) + (M + B)q N . This gives the desired bound in part (a). Part (b): Under the given conditions in part (b), we can bound the term T in Eq. 
( 18) as follows for q ≥ 2: T = 1 N N i=1 E [ℓ(w * (S ′ ); Z i ) -ℓ(A(S ′ ); Z i ) | Z i ] -(R * -E[R(A(S ′ ))]) q ζ1 ≤ 2 √ 2κq N N i=1 (E [ℓ(w * (S ′ ); Z i ) -ℓ(A(S ′ ); Z i ) | Z i ] -(R * -E[R(A(S ′ ))])) 2 q/2 ≤ 4 √ κq N N i=1 E 2 [ℓ(w * (S ′ ); Z i ) -ℓ(A(S ′ ); Z i ) | Z i ] + (R * -E[R(A(S ′ ))]) 2 q/2 ≤ 4 √ κq N N i=1 ∥E 2 [ℓ(w * (S ′ ); Z i ) -ℓ(A(S ′ ); Z i ) | Z i ]∥ q/2 + N E 2 [R(w * (S ′ )) -R(A(S ′ ))] ζ2 ≤ 4 √ κq N N i=1 ∥E 2 [G∥w * (S ′ ) -A(S ′ )∥ | Z i ]∥ q/2 + N E 2 [G∥w * (S ′ ) -A(S ′ )∥] ζ3 ≤ 4G √ 2κq √ N E [∥w * (S ′ ) -A(S ′ )∥ 2 ] ζ4 ≤ 8G √ κq √ N 1 µ E [R(A(S ′ )) -R * ] 8G √ κq √ N µ E [R(A(S ′ )) -R * -∆ opt ] + E[∆ opt ] ≤ 8G √ κq √ N µ ∥R(A(S)) -R * -∆ opt ∥ q + E[∆ opt ] where in "ζ 1 " we have used Proposition 2, in "ζ 2 " we have used the Lipschitz-loss condition, in "ζ 3 " we have used E 2 [G∥w * (S ′ ) -A(S ′ )∥ | Z i ] q/2 = E 2 [G∥w * (S ′ ) -A(S ′ )∥] ≤ G 2 E ∥w * (S ′ ) -A(S ′ )∥ 2 , in "ζ 4 " we have used Assumption 2, and the last inequality is due to Jensen's inequality. Then, plugging the above bound of term T into Eq. ( 18) yields that for some universal constant C: ∥R(A(S)) -R * -∆ opt ∥ q ≤ C qγ q log(N ) + G q N µ ∥R(A(S)) -R * -∆ opt ∥ q + E[∆ opt ] . Invoking Lemma 5 to the above inequality with x = ∥R(A(S)) -R * -∆ opt ∥ q , a = Cqγ q log(N ), b = qG 2 µN , and c = E[∆ opt ] yields ∥R(A(S)) -R * -∆ opt ∥ q ≲ E[∆ opt ] + qγ q log(N ) + qG 2 µN . This gives the desired bound in part (b). The proof is completed. C PROOFS FOR SECTION 3 C.1 PROOF OF LEMMA 1 Proof. Let us consider the following event about the restricted strong convexity of R S : E : R S is µ k -strongly convex. Let Y = 1 E be the indication random variable associated with E. Then by Assumption 4 we have P(Y = 1) ≥ 1 -δ N . Suppose that E occurs such that Y = 1. 
Then, R S (w * S (i) |J ) -R S (w * S|J ) = 1 N j̸ =i ℓ(w * S (i) |J ; Z j ) -ℓ(w * S|J ; Z j ) + 1 N ℓ(w * S (i) |J ; Z i ) -ℓ(w * S|J ; Z i ) =R S (i) (w * S (i) |J ) -R S (i) (w * S|J ) + 1 N ℓ(w * S (i) |J ; Z i ) -ℓ(w * S|J ; Z i ) - 1 N ℓ(w * S (i) |J ; Z ′ i ) -ℓ(w * S|J ; Z ′ i ) ≤ 1 N ℓ(w * S (i) |J ; Z i ) -ℓ(w * S|J ; Z i ) + 1 N ℓ(w * S (i) |J ; Z ′ i ) -ℓ(w * S|J ; Z ′ i ) ≤ 2G N w * S (i) |J -w * S|J , where we have used the optimality of w * S (i) |J with respect to R S (i) (w) and the Lipschitz continuity of loss. Since E occurs by assumption, R S is µ k -strongly convex. Since w * S|J is optimal for R S (w) over the supporting set J, we have R S (w * S (i) |J ) ≥ R S (w * S|J ) + µ k 2 w * S (i) |J -w * S|J 2 . Combing the preceding two inequalities yields w * S (i) |J -w * S|J ≤ 4G µ k N . Consequently from the Lipschitz continuity of ℓ we have that for any Z ∈ Z, the following holds conditioned on Y = 1: ℓ(w * S (i) |J ; Z) -ℓ(w * S|J ; Z) ≤ G w * S (i) |J -w * S|J ≤ 4G 2 µ k N . ( ) In the complementary case of Y = 0, in view of Assumption 5, it always holds that |ℓ(w * S (i) |J ; Z) -ℓ(w * S|J ; Z)| ≤ G w * S (i) |J -w * S|J ≤ 2GD. Let us consider q u := log 1 δ N log(N ) . By assumption q u ≥ 2. Then for 2 ≤ q ≤ q u , it can be verified that E ℓ(w * S (i) |J ; Z) -ℓ(w * S|J ; Z) q =P(Y = 1)E ℓ(w * S (i) |J ; Z) -ℓ(w * S|J ; Z) q | Y = 1 + P(Y = 0)E ℓ(w * S (i) |J ; Z) -ℓ(w * S|J ; Z) q | Y = 0 ≤ 4G 2 µ k N q + δ N (2GD) q = 4G 2 µ k N q + 1 N qu (2GD) q ≤ 4G 2 µ k N q + 1 N q (2GD) q . It follows that for all 2 ≤ q ≤ q u ℓ(w * S (i) |J ; Z) -ℓ(w * S|J ; Z) q ≤ 4G 2 µ k N q + 1 N q (2GD) q 1/q ≤ 1 N 4G 2 µ k + 2GD , where we have used a q + b q ≤ (a + b) q for a, b > 0 and q ≥ 2. For the complementary case q > q u , it is trivial to show that ℓ(w * S (i) |J ; Z) -ℓ(w * S|J ; Z) q ≤ 2GD ≤ 2GD q u q. Assembling the preceding two bounds yields the desired L q -stability bound. We now bound the term T in Eq. 
( 21) as follows: T = 1 N N i=1 E w * ; Z i ) -ℓ(w * S ′ |J ; Z i ) | Z i -(R(w * ) -E[R(w * S ′ |J )]) q ζ1 ≤ 2 √ 2κq N N i=1 E ℓ(w * ; Z i ) -ℓ(w * S ′ |J ; Z i ) | Z i -(R(w * ) -E[R(w * S ′ |J )]) 2 q/2 ≤ 4 √ κq N N i=1 E 2 ℓ(w * ; Z i ) -ℓ(w * S ′ |J ; Z i ) | Z i + R(w * ) -E[R(w * S ′ |J )] 2 q/2 ≤ 4 √ κp N N i=1 E 2 ℓ(w * ; Z i ) -ℓ(w * S ′ |J ; Z i ) | Z i q/2 + N E 2 R(w * ) -R(w * S ′ |J ) ζ2 ≤ 4 √ κp N N i=1 E 2 G w * -w * S ′ |J | Z i q/2 + N E 2 G w * -w * S ′ |J ζ3 ≤ 4G √ 2κp √ N E w * -w * S ′ |J 2 ζ4 ≤ 8G √ κq √ N 1 µ E R(w * S ′ |J ) -R(w * ) = 8G √ κq √ N µ E R(w * S|J ) -R(w * ) , where in "ζ 1 " we have used Proposition 2, in "ζ 2 " we have used the Lipschitz-loss assumption, in "ζ 3 " we have used E 2 G∥w * -w * S ′ |J ∥ | Z i q/2 = E 2 G∥w * -w * S ′ |J ∥ ≤ G 2 E ∥w * -w * S ′ |J ∥ 2 , in "ζ 4 " we have used the strong convexity condition in Assumption 4. Plugging Eq. ( 22) and Eq. ( 23) into Eq. ( 21) yields that for any q ≥ 2, R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) q ≤ q log(N ) N 4G 2 µ k + 2GD + 2q 2 GD log 2 (N ) log(1/δ N ) + 8G κqE R(w * S|J ) -R(w * ) N µ . Next we need to upper bound the factor E R(w * S|J ) -R(w * ) in the second term of the above bound. To do so, let us consider q = 2 in Eq. ( 24). It follows from the optimality of w * and w * S|J that E R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) ≤ R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) 2 Eq. (24) ≤ log(N ) N 8G 2 µ k + 4GD + 8GD log 2 (N ) log(1/δ N ) + 8G 2κE R(w * S|J ) -R(w * ) N µ ≤ log(N ) N 8G 2 µ k + 4GD + 8GD log 2 (N ) log(1/δ N ) + E R(w * S|J ) -R(w * ) 2 + 64κG 2 N µ , where in the first inequality we have used Cauchy-Schwarz inequality, and in the last inequality we have used the fact √ ab ≤ a 2t + bt 2 for any a, b, t > 0. 
Rearranging both sides of the above inequality in the following way: R(w * S|J ) -R(w * k) =R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) + R S (w * S|J ) -R S (w * k) + R S (w * k) -R S (w * ) + R(w * ) -R(w * k) ≤R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) + R S (w * S|J ) -R S (w * S, k) + R S (w * k) -R S (w * ) + R(w * ) -R(w * k) ≤ R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) T ′ 1 + R S (w * S|J ) -R S (w * S, k) + + R S (w * k) -R S (w * ) + R(w * ) -R(w * k) T ′ 2 , where in the first inequality we have used the fact R S (w * S, k) ≤ R S (w * k). We are going to bound the above two terms T ′ 1 and T ′ 2 respectively with high probability. Concerning T ′ 1 , since Eq. ( 26) holds for all q ≥ 2, in view of the second part of Lemma 4 (with q l = 2 and q u = ∞) we can show that the following holds with probability at least 1 -δ 2 for any δ ∈ (0, e -1 ): T ′ 1 =R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) ≤ R(w * S|J ) -R(w * ) -R S (w * S|J ) + R S (w * ) ≲ GD log 2 e δ log 2 (N ) log(1/δ N ) + log(N ) G 2 µ k + GD + G 2 µ log e δ N + G log e δ R(w * k) -R(w * ) N µ + E R S (w * S|J ) -R S (w * S, k) + . Regarding the term T ′ 2 , similar to the argument of Eq. ( 23), we can bound its L q -norm for any q ≥ 2 as follows: ∥T ′ 2 ∥ q = 1 N N i=1 ℓ(w * k; Z i ) -ℓ(w * ; Z i ) -(R(w * k) -R(w * )) q ≤ 2 √ 2κq N N i=1 ℓ(w * k; Z i ) -ℓ(w * ; Z i ) -(R(w * k) -R(w * )) 2 q/2 ≤ 4 √ κq N N i=1 ℓ(w * k; Z i ) -ℓ(w * ; Z i ) 2 + R(w * k) -R(w * ) 2 q/2 ≤ 4 √ κq N N i=1 ℓ(w * k; Z i ) -ℓ(w * ; Z i ) 2 q/2 + N R(w * k) -R(w * ) 2 ≤ 4 √ 2κq N N G 2 w * k -w * 2 ≤ 8G √ κq √ N µ R(w * k) -R(w * ). By invoking the second part of Lemma 4 with q l = 2 and q u = ∞ we can translate the above moment bound into the following exponential tail bound that holds with probability at least 1 -δ 2 for any δ ∈ (0, e -1 ): et al. 
(2016), there is renewed interest in the use of uniform stability for deriving generalization bounds for various learning algorithms and paradigms, including stochastic gradient descent (SGD) (Kuzborskij & Lampert, 2018; Charles & Papailiopoulos, 2018; Lei & Ying, 2020), stochastic model-based optimization (Wang et al., 2017; Deng & Gao, 2021), optimization-based meta learning (Zhou et al., 2019), and differentially private stochastic optimization (Bassily et al., 2019; Feldman et al., 2020). Compared to other stability arguments, uniform stability is notable for implying high-probability generalization bounds in addition to the traditional in-expectation bounds. Until very recently, the basic result of Bousquet & Elisseeff (2002), as expressed in Eq. (2), remained the best known exponential generalization bound for uniformly stable algorithms. Using some elegant proof techniques from adaptive data analysis, Feldman & Vondrák (2018) managed to replace the first term in Eq. (2) by $\sqrt{\gamma_u}\log(1/\delta)$, which leads to an improvement whenever $\gamma_u \gtrsim 1/N$. Soon after, a series of breakthrough results were obtained (Feldman & Vondrák, 2019; Bousquet et al., 2020) using tighter concentration bounds for sums of random functions, which eventually improved the stability-dependent rate to the near-optimal $\gamma_u\log(N)\log(1/\delta)$ as shown in Eq. (4). Additionally, under the generalized Bernstein condition, these state-of-the-art results lead to $O(1/N)$ excess risk bounds for uniformly stable algorithms (Klochkov & Zhivotovskiy, 2021).

T ′ 2 ≲ G

Non-uniform stability and exponential generalization. More broadly for non-uniformly stable algorithms, exponential generalization bounds have also been shown to be possible under various weaker and distribution-dependent notions of stability. As an early work in this line, Kutin & Niyogi (2002) showed that under the so-called "almost-everywhere" stability, which is a high-probability counterpart of uniform stability, generalization bounds that hold with overwhelming probability are still possible by virtue of a modified McDiarmid inequality (Kutin, 2002). Later, Rakhlin et al. (2005) revisited the bounded-difference results of Kutin & Niyogi (2002) in a more straightforward manner by using a powerful moment extension of the Efron-Stein inequality (Boucheron et al., 2005). Recently, for general $L_q$-stable algorithms, exponential leave-one-out generalization bounds were derived using moment or exponential extensions of the Efron-Stein inequality, with applications found in ridge regression, k-nearest neighbor classification and k-fold cross-validation (Celisse & Guedj, 2016; Celisse & Mary-Huard, 2018; Abou-Moustafa & Szepesvári, 2019). However, when it comes to the recent breakthrough bounds of Feldman & Vondrák (2019) and Bousquet et al. (2020), it is much less obvious how to extend these near-optimal bounds to almost-everywhere stability by simply incorporating the low-probability failure events into the concentration inequality. This is in sharp contrast to what has been done by Feldman & Vondrák (2019, Theorem 4.5) and Bassily et al. (2020, Theorem 2.1) for stochastic learning algorithms whose uniform stability (over the randomness of data) holds with high probability over the internal randomness of the algorithm. We refer interested readers to Boucheron et al. (2013); Kontorovich (2014); Combes (2015); Warnke (2016); Maurer & Pontil (2021) for more results on concentration inequalities beyond bounded-difference conditions, which are fundamental for deriving exponential bounds for non-uniformly stable algorithms.

Generalization analysis of $\ell_0$-estimators. We further briefly review some prior results on the generalization guarantees for statistical learning under cardinality constraints, which is the theme of the application part of our work. For the $\ell_0$-ERM estimator as expressed in Eq. (8), provided that the solution is exactly known, a series of uniform excess risk bounds were derived over binary prediction classes (Chen & Lee, 2018; 2020) and bounded linear prediction classes (Foster & Syrgkanis, 2019). The exact solution of $\ell_0$-ERM, however, is computationally intractable in general high-dimensional cases due to the NP-hardness of the problem. Therefore, it is more realistic and desirable to establish generalization bounds for approximate $\ell_0$-estimators such as the IHT-style algorithms (Yuan et al., 2018; Garg & Khandekar, 2009; Li et al., 2016). Particularly for misspecified sparsity models, a set of sparse excess risk bounds with slow and fast rates were established through the lens of uniform stability theory under proper regularity conditions (Yuan & Li, 2022).



In an exponential tail bound, the terms associated with $\sqrt{\log(1/\delta)}$ and $\log(1/\delta)$ are respectively referred to as sub-Gaussian and sub-exponential tails.

It is possible to relax the $M$-boundedness condition on the loss function $\ell$ to its sub-Gaussian or sub-exponential counterparts by alternatively applying general Bernstein-type inequalities for sums of independent sub-Gaussian or sub-exponential random variables (Vershynin, 2018) in this part of the proof. For the sake of simplicity and transparency of exposition, here we choose to work with the bounded loss while keeping in mind that the requirement is not essential.



Let $w^* := \arg\min_{w\in\mathcal{W}} R(w)$ be the global minimizer, which is unique due to the strong convexity of $R$. For a given index set $J \subseteq [d]$, let us consider the following restricted estimator over $J$

we pad the training set $S$ with extra zero-functions so that $N = 2^k$. Consider the partition $\mathcal{I}_0, \mathcal{I}_1, ..., \mathcal{I}_k$ of $[N]$ given by

Uniform stability and exponential generalization. Stimulated by a recent landmark work of Hardt

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

The authors would like to thank the anonymous Reviewers and Area Chairs for their insightful comments, which were truly helpful in improving this paper. Xiao-Tong Yuan is funded in part by the National Key Research and Development Program of China under Grant No. 2018AAA0100400, and in part by the Natural Science Foundation of China (NSFC) under Grant No. U21B2049, No. 61936005 and No. 61876090.


C.2 PROOF OF THEOREM 4

Let us denote $a_+ = \max\{a, 0\}$. We need the following key result in our analysis, whose proof draws heavily on that of Theorem 3, with proper modifications for handling the challenges imposed by the combinatorial optimization nature of $L_0$-ERM.

Lemma 7. Suppose that Assumptions 3, 4, 5 hold. Assume that $\frac{\log(1/\delta_N)}{\log(N)} \ge 2$. Then for any $\delta \in (0, e^{-1})$, it holds with probability at least $1-\delta$ that sup

Proof. Given a fixed index set $J \subseteq [d]$ with $|J| = k$, we can show that the following holds for any $q \ge 1$: where

In view of Lemma 1 we have that $w^*_{S|J}$ has $L_q$-stability by

Then invoking Lemma 6 over the supporting set $J$ yields

with simple algebra leads to

Plugging Eq. (25) into Eq. (24) yields that for any $q \ge 2$

where in "$\zeta_1$" we have again used the fact $\sqrt{ab} \le \frac{a}{2t} + \frac{bt}{2}$ for $a, b, t > 0$. Now we are in a position to finally upper bound the desired sparse excess risk with respect to $w^*_k$, which can be decomposed

By plugging Eq. (28) and Eq. (29) into Eq. (27) and applying a union bound, we obtain that the following holds with probability at least $1-\delta$ for any $\delta \in (0, e^{-1})$:

This proves the desired sparse excess risk bound over the supporting set $J$.

As the final step, since there are at most $\binom{d}{k} \le \left(\frac{ed}{k}\right)^k$ different $J$ with $|J| = k$, by a union bound we can show that the following holds with probability $1-\delta$ for any $\delta \in (0, e^{-1})$: sup

This completes the proof.

Now we are in a position to prove the main result in Theorem 4.

Proof of Theorem 4. Let us consider $J = \mathrm{supp}(w_{S,k})$. By definition we have $w_{S,k} = w^*_{S|J}$. Then invoking Lemma 7 yields that for any $\delta \in (0, e^{-1})$, the following holds with probability at least $1-\delta$:

This proves the desired sparse excess risk bound.

