GENERALIZATION BOUNDS WITH ARBITRARY COMPLEXITY MEASURES

Abstract

In statistical learning theory, generalization bounds usually involve a complexity measure imposed by the considered theoretical framework. This limits the scope of such analyses, since practical algorithms rely on other forms of capacity measures or regularization. In this paper, we leverage the framework of disintegrated PAC-Bayesian bounds and combine it with Gibbs distributions to derive generalization bounds involving a complexity measure that can be defined by the user. Our bounds hold in probability jointly over the hypothesis and the learning sample, which allows the complexity to be tightened for a given generalization gap, since it can be set to fit both the hypothesis class and the task.

1. INTRODUCTION

Statistical learning theory offers various frameworks to assess generalization by studying whether the empirical risk is representative of the true risk, thanks to an upper-bounding strategy on the generalization gap, i.e., the deviation between the true risk and the empirical risk. An upper bound on this gap is generally a function of two main quantities: (i) the size of the training sample, and (ii) a complexity measure that captures how prone a model is to overfitting. One limitation is that existing frameworks are restricted to particular complexity measures, among them the VC-dimension (Vapnik & Chervonenkis, 1971) or the Rademacher complexity (Bartlett & Mendelson, 2002), for which generalization bounds can be derived. To the best of our knowledge, there is no generalization bound able to take into account, by construction, arbitrary complexity measures that can serve as good proxies for the generalization gap. In this paper, we tackle this drawback by leveraging the framework of disintegrated PAC-Bayesian bounds (Theorem 2.1) to propose a novel generalization bound with arbitrary complexity measures. To do so, we make use of Gibbs probability distributions (Equation (2)) that depend on a user-defined parametric function characterizing the complexity. This allows us to derive guarantees in the form of probabilistic bounds that depend on a model sampled from the Gibbs distribution mentioned above. It is worth noticing that our result allows retrieving both uniform-convergence and algorithm-dependent bounds. We believe that our novel result provides theoretical foundations for the many regularizations used in practice to perform model selection. For instance, our result allows integrating complexity measures studied empirically in a recent line of work on over-parametrized models (Jiang et al., 2019; Dziugaite et al., 2020; Jiang et al., 2021).
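As a toy sketch of this mechanism, the snippet below builds a Gibbs distribution over a small finite hypothesis set and samples one hypothesis from it. The complexity values `mu`, the hypothesis set, and the concentration parameter `c` are hypothetical choices for illustration only; Equation (2) in the paper defines the exact parametric form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hyp = 5
prior = np.full(n_hyp, 1.0 / n_hyp)   # uniform prior pi over 5 hypotheses
mu = rng.uniform(size=n_hyp)          # user-defined complexity mu(h, S) (illustrative)
c = 2.0                               # concentration parameter (illustrative)

# Gibbs posterior: rho(h) proportional to pi(h) * exp(-c * mu(h, S)),
# so hypotheses with low complexity receive higher probability mass.
unnorm = prior * np.exp(-c * mu)
rho = unnorm / unnorm.sum()

h = rng.choice(n_hyp, p=rho)          # a single hypothesis sampled from rho
```

With a uniform prior, the most probable hypothesis under `rho` is exactly the one with the smallest complexity value.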
In our experimental evaluation, we show how these measures can easily be integrated into our framework in practice. We notably provide a stochastic version of the Metropolis-Adjusted Langevin algorithm to compute empirical estimates of our bounds.

Organization of the paper. In Section 2, we provide some preliminary definitions and concepts. Then, we present our main contribution in Section 3. In Section 4, we provide a practical instantiation of our framework before concluding in Section 5.
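For intuition about the sampling routine mentioned above, here is a minimal sketch of the base Metropolis-Adjusted Langevin algorithm (MALA) on a toy one-dimensional target. This is not the stochastic variant used in our experiments; the target density, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def U(h):          # negative log-density of the target (standard normal here)
    return 0.5 * h ** 2

def grad_U(h):
    return h

def mala(h0, step, n_iters, rng):
    h = h0
    samples = []
    for _ in range(n_iters):
        # Langevin proposal: gradient step plus Gaussian noise of variance 2*step.
        prop = h - step * grad_U(h) + np.sqrt(2 * step) * rng.normal()
        # Metropolis correction with the asymmetric proposal densities
        # (constants cancel in the acceptance ratio).
        log_q_fwd = -((prop - (h - step * grad_U(h))) ** 2) / (4 * step)
        log_q_bwd = -((h - (prop - step * grad_U(prop))) ** 2) / (4 * step)
        log_alpha = U(h) - U(prop) + log_q_bwd - log_q_fwd
        if np.log(rng.uniform()) < log_alpha:
            h = prop
        samples.append(h)
    return np.array(samples)

rng = np.random.default_rng(0)
draws = mala(h0=3.0, step=0.5, n_iters=20_000, rng=rng)
```

After a burn-in period, the chain's draws approximate the target distribution, here a standard normal.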

2. PRELIMINARIES

2.1. SETTING

We consider the supervised classification setting where $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ the label space. An example $(x,y) \in \mathcal{X} \times \mathcal{Y}$ is sampled from an unknown data distribution $D$ on $\mathcal{X} \times \mathcal{Y}$. A learning sample $S = \{(x_i, y_i)\}_{i=1}^{m}$ contains $m$ examples drawn i.i.d. from $D$; we denote the distribution of such an $m$-sample by $D^m$. Let $\mathcal{H}$ be a (potentially infinite) set of functions $h : \mathcal{X} \to \mathcal{Y}$, called hypotheses (or models), that associate a label from $\mathcal{Y}$ to an input from $\mathcal{X}$. Let $\mathcal{M}(\mathcal{H})$ be the set of probability densities over $\mathcal{H}$ with respect to a reference measure (e.g., the Lebesgue measure); we denote by $\mathcal{M}^{*}(\mathcal{H}) \subseteq \mathcal{M}(\mathcal{H})$ the set of strictly positive probability densities. Given a learning sample $S$, we aim to find $h \in \mathcal{H}$ that minimizes the so-called true risk $R_D(h) = \mathbb{E}_{(x,y)\sim D}\, \mathrm{I}[h(x) \neq y]$, where $\mathrm{I}[a]=1$ if $a$ is true, and $0$ otherwise. In practice, as the data distribution $D$ is unknown, we estimate the true risk with its empirical counterpart, the empirical risk $R_S(h) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{I}[h(x_i) \neq y_i]$. We hereafter denote the generalization gap by $\Delta : [0,1]^2 \to \mathbb{R}$, usually defined by $\Delta(R_D(h), R_S(h)) = |R_D(h) - R_S(h)|$, which quantifies how representative the empirical risk is of the true risk. In this paper, we leverage the PAC-Bayesian framework (Shawe-Taylor & Williamson, 1997; McAllester, 1998; Guedj, 2019; Alquier, 2021) to upper-bound the generalization gap with a function that depends on an arbitrary measure of complexity. In PAC-Bayes, we consider an a priori belief on the hypotheses in $\mathcal{H}$, modeled by a prior distribution $\pi \in \mathcal{M}^{*}(\mathcal{H})$. From $S$ and $\pi$, we aim to learn a posterior distribution $\rho \in \mathcal{M}(\mathcal{H})$ that assigns higher probability to the best hypotheses in $\mathcal{H}$ (the support of $\rho$ being included in the support of $\pi$).
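To make these definitions concrete, the following sketch computes the empirical risk with the 0-1 loss, approximates the (in practice unknown) true risk with a large held-out sample, and evaluates the gap. The data distribution and the threshold classifier are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D task (illustrative): the label is the sign of x, and a hypothesis
# is a threshold classifier h_t(x) = sign(x - t).
def sample(m):
    x = rng.normal(size=m)
    y = np.sign(x)
    return x, y

def empirical_risk(t, x, y):
    # R_S(h) = (1/m) * sum_i I[h(x_i) != y_i] with the 0-1 loss
    preds = np.sign(x - t)
    return np.mean(preds != y)

x_train, y_train = sample(m=100)
r_s = empirical_risk(0.3, x_train, y_train)

# D is unknown in practice; here we approximate the true risk R_D(h) with a
# large held-out sample and compute the gap |R_D(h) - R_S(h)|.
x_test, y_test = sample(m=100_000)
r_d = empirical_risk(0.3, x_test, y_test)
gap = abs(r_d - r_s)
```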
The classical PAC-Bayesian generalization bounds provide guarantees in expectation over $\rho$, meaning that they bound the generalization gap expressed as $|\mathbb{E}_{h\sim\rho}[R_D(h) - R_S(h)]|$, where the complexity term depends on the KL divergence between $\rho$ and $\pi$, defined as $\mathrm{KL}(\rho\|\pi) = \mathbb{E}_{h\sim\rho} \ln\frac{\rho(h)}{\pi(h)}$. This standard complexity hence captures how much the prior and the posterior distributions deviate in expectation over all the hypotheses. To incorporate custom complexities in the bounds, we follow a slightly different framework, recalled below (the disintegrated PAC-Bayesian bounds), in which the expectations over $\rho$ are "disintegrated": the bounds consider the gap $\Delta(R_D(h), R_S(h)) = |R_D(h) - R_S(h)|$ of a single $h$ sampled from $\rho$.
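The relation between the classical KL complexity and its disintegrated counterpart can be illustrated numerically. In the sketch below, a one-dimensional Gaussian prior and posterior are assumed for illustration; a Monte Carlo average of the disintegrated term $\ln\frac{\rho(h)}{\pi(h)}$ is checked against the closed-form KL divergence.

```python
import numpy as np

# Illustrative setup: prior pi = N(0, 1) and posterior rho = N(mu, s^2)
# over a single real-valued hypothesis parameter.
def log_density_normal(h, mean, std):
    return -0.5 * ((h - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

mu, s = 0.5, 0.8
# Closed form for Gaussians: KL(N(mu, s^2) || N(0, 1))
kl = np.log(1.0 / s) + (s ** 2 + mu ** 2) / 2.0 - 0.5

# Disintegrated complexity: ln(rho(h)/pi(h)) for one sampled h ~ rho.
rng = np.random.default_rng(1)
h = rng.normal(mu, s)
dis_kl = log_density_normal(h, mu, s) - log_density_normal(h, 0.0, 1.0)

# Monte Carlo check: the expectation of the disintegrated term under rho
# recovers the classical KL divergence.
hs = rng.normal(mu, s, size=200_000)
mc_kl = np.mean(log_density_normal(hs, mu, s) - log_density_normal(hs, 0.0, 1.0))
```

Note that the single-draw quantity `dis_kl` can be negative, unlike the KL divergence itself; only its expectation under the posterior is nonnegative.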

2.2. DISINTEGRATED PAC-BAYESIAN BOUNDS

The disintegrated PAC-Bayesian bounds were introduced by Catoni (2007, Th. 1.2.7) and Blanchard & Fleuret (2007, Prop. 3.1).¹ As far as we know, despite their significance, they have been little used in the literature and only recently received renewed interest for deriving tight bounds in practice (e.g., Rivasplata et al. (2020); Viallard et al. (2021)). Such bounds provide guarantees for a hypothesis $h$ sampled from a posterior distribution $\rho_S$, where $\rho_S \in \mathcal{M}(\mathcal{H})$ can depend on the learning sample $S$. They take the form of a bound that holds with high probability (at least $1-\delta$) over the random choice of the training set $S \sim D^m$ and of the hypothesis $h$. This paper mainly focuses on one particular bound, namely that of Rivasplata et al. (2020, Theorem 1 (i)), recalled below. In this case, the function $\varphi(h,S) = m\,\Delta(R_D(h), R_S(h))$ is a deviation between the true risk $R_D(h)$ and the empirical risk $R_S(h)$. Moreover, the right-hand side of the bound consists of two terms: (i) the disintegrated KL divergence $\ln\frac{\rho_S(h)}{\pi(h)}$, which quantifies how much the prior and posterior distributions deviate at the sampled hypothesis $h$, and (ii) a confidence term gathering $\ln\frac{1}{\delta}$ and the exponential moment of $\varphi$ under $\pi$ and $D^m$.



¹ Disintegrated PAC-Bayesian bounds have also been introduced as a "single-draw case" by Hellström & Durisi (2020).



Theorem 2.1 (General Disintegrated Bound of Rivasplata et al. (2020)). For any distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, for any hypothesis set $\mathcal{H}$, for any distribution $\pi \in \mathcal{M}^{*}(\mathcal{H})$, for any measurable function $\varphi : \mathcal{H} \times (\mathcal{X} \times \mathcal{Y})^m \to \mathbb{R}$, for any $\delta \in (0,1]$, we have

$$\mathbb{P}_{S\sim D^m,\, h\sim \rho_S}\left[\varphi(h,S) \le \ln\frac{\rho_S(h)}{\pi(h)} + \ln\left(\frac{1}{\delta}\,\mathbb{E}_{S'\sim D^m}\,\mathbb{E}_{h'\sim \pi}\, e^{\varphi(h',S')}\right)\right] \ge 1-\delta.$$
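The statement can be checked numerically. The sketch below uses an illustrative toy setup, all of it assumed for this example: $\rho_S = \pi$ (so the disintegrated KL term vanishes), hypotheses identified with their true risk $p \sim \mathrm{Uniform}(0,1)$, empirical risks drawn as $\mathrm{Binomial}(m,p)/m$, and the linear choice $\varphi(h,S) = \sqrt{m}\,(R_D(h) - R_S(h))$. It then verifies that $\varphi(h,S)$ stays below the right-hand side with frequency at least $1-\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta, n = 50, 0.05, 200_000
lam = np.sqrt(m)

def phi(p, r_s):
    # Linear deviation phi(h, S) = sqrt(m) * (R_D(h) - R_S(h))
    return lam * (p - r_s)

# Estimate the exponential moment E_{S'~D^m} E_{h'~pi} exp(phi(h', S')).
p0 = rng.uniform(size=n)
exp_moment = np.mean(np.exp(phi(p0, rng.binomial(m, p0) / m)))

# Right-hand side of the bound; the disintegrated KL is 0 since rho_S = pi.
rhs = np.log(exp_moment / delta)

# Coverage check: phi(h, S) <= rhs should hold with probability >= 1 - delta
# jointly over S ~ D^m and h ~ rho_S.
p = rng.uniform(size=n)
coverage = np.mean(phi(p, rng.binomial(m, p) / m) <= rhs)
```

For such a well-behaved $\varphi$, the empirical coverage is typically far above $1-\delta$, reflecting the slack introduced by the exponential-moment term.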

