GENERALIZATION BOUNDS WITH ARBITRARY COMPLEXITY MEASURES

Abstract

In statistical learning theory, generalization bounds usually involve a complexity measure that is determined by the considered theoretical framework. This limits the scope of such analyses, as other forms of capacity measures or regularization are used in practical algorithms. In this paper, we leverage the framework of disintegrated PAC-Bayesian bounds and combine it with Gibbs distributions to derive generalization bounds involving a complexity measure that can be defined by the user. Our bounds stand in probability jointly over the hypotheses and the learning sample, which allows us to tighten the complexity for a given generalization gap since it can be set to fit both the hypothesis class and the task.

1. INTRODUCTION

Statistical learning theory offers various theoretical frameworks to assess generalization by studying whether the empirical risk is representative of the true risk, which is done by upper-bounding the generalization gap. The generalization gap is a deviation between the true risk and the empirical risk. An upper bound on this gap is generally a function of two main quantities: (i) the size of the training sample and (ii) a complexity measure that captures how prone a model is to overfitting. One potential limitation is that existing frameworks are restricted to particular complexity measures, among them the VC-dimension (Vapnik & Chervonenkis, 1971) or the Rademacher complexity (Bartlett & Mendelson, 2002), for which some generalization bounds can be derived. To the best of our knowledge, there is no generalization bound able to take into account, by construction, arbitrary complexity measures that can serve as good proxies for the generalization gap. In this paper, we tackle this drawback by leveraging the framework of disintegrated PAC-Bayesian bounds (Theorem 2.1) to propose a novel generalization bound with arbitrary complexity measures. To do so, we make use of Gibbs probability distributions (Equation (2)) that depend on a user-defined parametric function characterizing the complexity. This allows us to derive guarantees in the form of probabilistic bounds that hold for a model sampled from such a Gibbs distribution. It is worth noticing that our result allows retrieving uniform-convergence and algorithm-dependent bounds. We believe that our novel result provides theoretical foundations for the many regularizations used in practice to perform model selection. For instance, our result allows integrating complexity measures studied empirically in a recent line of work on over-parametrized models (Jiang et al., 2019; Dziugaite et al., 2020; Jiang et al., 2021).
In our experimental evaluation, we show how these measures can be easily integrated into our framework in practice. We notably provide a stochastic version of the Metropolis Adjusted Langevin algorithm to compute empirical estimates of our bounds. Organization of the paper. In Section 2, we provide some preliminary definitions and concepts. Then, we present our main contribution in Section 3. In Section 4, we provide a practical instantiation of our framework before concluding in Section 5.

2. PRELIMINARIES

2.1. SETTING

We consider the supervised classification learning setting where $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ is the label space. We consider that an example $(x, y) \in \mathcal{X}\times\mathcal{Y}$ is sampled from an unknown data distribution $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$. A learning sample $S=\{(x_i, y_i)\}_{i=1}^{m}$ contains $m$ examples drawn i.i.d. from $\mathcal{D}$; we denote the distribution of such an $m$-sample by $\mathcal{D}^m$. Let $\mathcal{H}$ be a potentially infinite set of functions $h : \mathcal{X}\to\mathcal{Y}$, called hypotheses (or models), that associate a label from $\mathcal{Y}$ given an input from $\mathcal{X}$. Let $\mathcal{M}(\mathcal{H})$ be the set of probability densities over $\mathcal{H}$ given a reference measure (e.g., the Lebesgue measure); we denote by $\mathcal{M}^*(\mathcal{H}) \subseteq \mathcal{M}(\mathcal{H})$ the set of strictly positive probability densities. Given a learning sample $S$, we aim to find $h \in \mathcal{H}$ that minimizes the so-called true risk $R_{\mathcal{D}}(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\, \mathrm{I}[h(x) \neq y]$, where $\mathrm{I}[a]=1$ if $a$ is true, and $0$ otherwise. In practice, as the data distribution $\mathcal{D}$ is unknown, we estimate the true risk with its empirical counterpart: the empirical risk $R_S(h) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{I}[h(x_i) \neq y_i]$. We hereafter denote the generalization gap by $\phi : [0, 1]^2 \to \mathbb{R}$, which is usually defined by $\phi(R_{\mathcal{D}}(h), R_S(h)) = |R_{\mathcal{D}}(h) - R_S(h)|$ and quantifies how much the empirical risk is representative of the true risk. In this paper, we leverage the PAC-Bayesian framework (Shawe-Taylor & Williamson, 1997; McAllester, 1998; Guedj, 2019; Alquier, 2021) to upper-bound the generalization gap with a function that depends on an arbitrary measure of complexity. In PAC-Bayes, we consider an a priori belief on the hypotheses in $\mathcal{H}$ that is modeled by a prior distribution $\pi \in \mathcal{M}^*(\mathcal{H})$ on $\mathcal{H}$. We aim to learn, from $S$ and $\pi$, a posterior distribution $\rho \in \mathcal{M}(\mathcal{H})$ on $\mathcal{H}$ that assigns higher probability to the best hypotheses in $\mathcal{H}$ (the support of $\rho$ being included in the support of $\pi$).
The classical PAC-Bayesian generalization bounds provide upper bounds in expectation over $\rho$, meaning that they bound the generalization gap expressed as $|\mathbb{E}_{h\sim\rho}[R_{\mathcal{D}}(h) - R_S(h)]|$, and where the complexity term depends on the KL divergence between $\rho$ and $\pi$ defined as $\mathrm{KL}(\rho\|\pi) = \mathbb{E}_{h\sim\rho} \ln\frac{\rho(h)}{\pi(h)}$. This standard complexity hence captures how much the prior and the posterior distribution deviate in expectation over all the hypotheses. To incorporate custom complexities in the bounds, we follow a slightly different framework recalled below (the disintegrated PAC-Bayesian bounds) in which the expectations on $\rho$ are "disintegrated": the gap $\phi(R_{\mathcal{D}}(h), R_S(h)) = |R_{\mathcal{D}}(h) - R_S(h)|$ of a single $h$ sampled from $\rho$ is considered in the bounds.
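To make the difference between the two complexity terms concrete, here is a minimal sketch on a finite hypothesis set, where densities reduce to probability vectors. The function names and the toy distributions are illustrative assumptions, not part of the paper. It computes the empirical risk with the 0-1 loss, the KL divergence, and the disintegrated log-ratio $\ln\frac{\rho(h)}{\pi(h)}$, whose expectation under $\rho$ recovers the KL divergence.

```python
import numpy as np

def empirical_risk(predictions, labels):
    """R_S(h): fraction of misclassified examples (0-1 loss)."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return float(np.mean(predictions != labels))

def kl_divergence(rho, pi):
    """KL(rho || pi) for two discrete distributions, with pi strictly positive."""
    rho = np.asarray(rho, dtype=float)
    pi = np.asarray(pi, dtype=float)
    support = rho > 0  # terms with rho(h) = 0 contribute nothing
    return float(np.sum(rho[support] * np.log(rho[support] / pi[support])))

def disintegrated_kl(rho, pi, h):
    """ln(rho(h) / pi(h)) for a single hypothesis index h."""
    return float(np.log(rho[h] / pi[h]))

# Toy example (hypothetical values): uniform prior, peaked posterior.
pi = np.full(4, 0.25)
rho = np.array([0.7, 0.1, 0.1, 0.1])
per_hypothesis = np.array([disintegrated_kl(rho, pi, h) for h in range(4)])
# Averaging the disintegrated term over h ~ rho gives back KL(rho || pi).
expected_disintegrated = float(np.sum(rho * per_hypothesis))
```

The per-hypothesis values differ widely (positive for the favored hypothesis, negative for the others), which illustrates why a bound for a single sampled $h$ can behave differently from a bound in expectation.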

2.2. DISINTEGRATED PAC-BAYESIAN BOUNDS

The disintegrated PAC-Bayesian bounds have been introduced by Catoni (2007, Th 1.2.7) and Blanchard & Fleuret (2007, Prop 3.1) (see footnote 1). As far as we know, despite their significance, they have been little used in the literature and have only recently received renewed interest for deriving tight bounds in practice (e.g., Rivasplata et al., 2020; Viallard et al., 2021). Such bounds provide guarantees for a hypothesis $h$ sampled from a posterior distribution $\rho_S$. They take the form of a bound that holds with high probability (at least $1-\delta$) over the random choice of training set $S \sim \mathcal{D}^m$ and hypothesis $h$. This paper mainly focuses on a particular bound, namely, the one of Rivasplata et al. (2020, Theorem 1 (i)) recalled below.

Theorem 2.1 (General Disintegrated Bound of Rivasplata et al. (2020)). For any distribution $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$, for any hypothesis set $\mathcal{H}$, for any distribution $\pi \in \mathcal{M}^*(\mathcal{H})$, for any measurable function $\varphi : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^m \to \mathbb{R}$, for any $\delta \in (0, 1]$, we have

$$\mathbb{P}_{S\sim\mathcal{D}^m,\, h\sim\rho_S}\Bigg[\varphi(h, S) \le \underbrace{\ln\frac{\rho_S(h)}{\pi(h)} + \ln\Big[\frac{1}{\delta}\, \mathbb{E}_{S'\sim\mathcal{D}^m}\, \mathbb{E}_{g\sim\pi} \exp\big(\varphi(g, S')\big)\Big]}_{\Phi(\rho_S, \pi, \delta)}\Bigg] \ge 1-\delta,$$

where $\rho_S$ is a posterior distribution such that $\rho_S \in \mathcal{M}(\mathcal{H})$.

In this case, the function $\varphi(h, S) = m\,\phi(R_{\mathcal{D}}(h), R_S(h))$ is a deviation between the true risk $R_{\mathcal{D}}(h)$ and the empirical risk $R_S(h)$. Moreover, the function $\Phi(\rho_S, \pi, \delta)$ is constituted of 2 terms: (i) the disintegrated KL divergence $\ln\frac{\rho_S(h)}{\pi(h)}$ defining how much the prior and posterior distributions deviate for a single $h$, and (ii) the term $\ln\big[\frac{1}{\delta}\, \mathbb{E}_{S'\sim\mathcal{D}^m}\, \mathbb{E}_{g\sim\pi} \exp(\varphi(g, S'))\big]$, which is constant w.r.t. $h \in \mathcal{H}$ and $S \in (\mathcal{X}\times\mathcal{Y})^m$ and usually upper-bounded to instantiate the bound. In the following, we refer to the whole right-hand side of the bound, $\Phi()$, as the complexity measure for the sake of simplicity. Note that this is in slight contrast with the standard definitions of complexity, where the term (ii) (related to $\delta$ and the sample size $m$) is not included. This additional term is, in fact, constant w.r.t. the hypothesis $h\sim\rho_S$ and the learning sample $S\sim\mathcal{D}^m$. In the bound of Theorem 2.1, the complexity term $\Phi()$ depends on the disintegrated KL divergence and suffers from drawbacks: the KL complexity term is imposed by the framework and can be subject to high variance in practice (Viallard et al., 2021). However, it is important to notice that this disintegrated KL divergence has a clear advantage: it only depends on the hypothesis $h$ and data sample $S$, instead of the whole hypothesis class (as is often the case, for instance, with the KL divergence in PAC-Bayesian bounds, or the VC-dimension). This might imply a better correlation between the generalization gap and some complexity measures. In the next section, we leverage this disintegrated KL divergence to derive our main contribution: a general bound that involves arbitrary complexity measures.

3. INTEGRATING ARBITRARY COMPLEXITIES IN GENERALIZATION BOUNDS

We first begin with a short presentation of our result to give some preliminary intuitions and to introduce the notion of Gibbs distribution which is a key element in the exposition of our contribution. We then formalize our theoretical result in Section 3.3.

3.1. AN INTRODUCTION TO OUR RESULTS

Let $\Phi_\mu(h, S, \delta)$ be a real-valued function that takes a hypothesis $h \in \mathcal{H}$, a learning sample $S \in (\mathcal{X}\times\mathcal{Y})^m$, and the parameter $\delta$ as arguments and that is dependent on an additional function $\mu : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^m \to \mathbb{R}$. The idea is to use this function $\mu()$ to parametrize the complexity measure with respect to the data sample $S$ and the model $h$, in order to introduce custom complexity measures in the bound; we call the function $\mu()$ the "parametric function". This function must, in fact, serve to obtain a complexity measure $\Phi_\mu(h, S, \delta)$ that is representative of the generalization gap (which is unknown). For instance, when $\mathcal{H}$ is a set of hypotheses $h_w$ parameterized by some weights $w \in \mathbb{R}^d$, we can fix $\mu(h_w, S) = \|w\|$, for some norm $\|\cdot\|$. This means that $\mu(h_w, S)$ can be set to the regularization term of the chosen objective function so that the complexity, hence the bound, will depend on it. This is not entirely new since, for example, uniform stability bounds allow one to consider such norms (see, e.g., Kakade et al., 2008). This example is just for illustration purposes. Our framework is compatible with broader families of complexity measures, as we will see later. Given such a parametric function $\mu()$, the bound we derive in Theorem 3.1 takes the following form.

Definition 3.1 (Generalization Bound with Complexity Measures). Let $\phi : [0, 1]^2 \to \mathbb{R}$ be the generalization gap and $\mu : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^m \to \mathbb{R}$ be a parametric function. A generalization bound with arbitrary complexity measures is defined such that, for any distribution $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$, for any hypothesis set $\mathcal{H}$, there exists a real-valued function $\Phi_\mu : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^m\times(0, 1] \to \mathbb{R}$ such that for any $\delta \in (0, 1]$, we have

$$\mathbb{P}_{S\sim\mathcal{D}^m,\, h\sim\rho_S}\Big[\phi(R_{\mathcal{D}}(h), R_S(h)) \le \Phi_\mu(h, S, \delta)\Big] \ge 1-\delta.$$

The main trick to obtain such a result is to consider a particular posterior distribution $\rho_S$: we incorporate the function $\mu()$ by choosing the distribution $\rho_S$ as the Gibbs distribution defined as

$$\rho_S(h) \propto \exp\big[-\alpha R_S(h) - \mu(h, S)\big], \quad \text{where } \alpha \in \mathbb{R}^+. \qquad (2)$$

This Gibbs distribution $\rho_S$ is interesting from an optimization viewpoint: a hypothesis $h$ is more likely to be sampled from it when the objective function $h \mapsto R_S(h) + \frac{1}{\alpha}\mu(h, S)$ is low for a given $S$. In the ideal case, since we want to minimize the generalization gap $\phi(R_{\mathcal{D}}(h), R_S(h))$, one can define the function $\mu(h, S) = \alpha\,\phi(R_{\mathcal{D}}(h), R_S(h)) - \alpha R_S(h)$ to obtain a Gibbs distribution that samples hypotheses with small gaps. However, since the generalization gaps are unknown, they must be replaced with a computable function $\mu()$. For instance, the function $\mu()$ can serve as a "regularizing term" (when $\mu()$ is a norm), so that a hypothesis is more likely to be sampled when the trade-off $R_S(h) + \frac{1}{\alpha}\mu(h, S)$ is low. Equation (2) might look restrictive, but it can actually represent any probability density function. Indeed, let $\rho'_S$ be a distribution on $\mathcal{H}$, e.g., a Gaussian or a Laplace distribution; by setting $\mu(h, S) = -\alpha R_S(h) - \ln\rho'_S(h)$, we can retrieve the distribution $\rho'_S$. The Gibbs distribution is well-known and studied in learning theory. In the following, we discuss the principal theoretical works based on it and highlight the differences with our framework.
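The two properties of the Gibbs distribution discussed above can be checked numerically on a finite hypothesis set. The following is a minimal sketch under that assumption (the paper works with densities on a possibly infinite set); the function name and all numerical values are hypothetical. It shows that the most probable hypothesis is the minimizer of $R_S(h) + \frac{1}{\alpha}\mu(h, S)$, and that choosing $\mu(h, S) = -\alpha R_S(h) - \ln\rho'_S(h)$ recovers an arbitrary target $\rho'_S$.

```python
import numpy as np

def gibbs_distribution(emp_risks, mu_values, alpha):
    """rho_S(h) proportional to exp(-alpha * R_S(h) - mu(h, S)) on a finite set."""
    log_weights = (-alpha * np.asarray(emp_risks, dtype=float)
                   - np.asarray(mu_values, dtype=float))
    log_weights = log_weights - log_weights.max()  # avoid overflow in exp
    weights = np.exp(log_weights)
    return weights / weights.sum()

# Hypothetical empirical risks and complexity values for four hypotheses.
emp_risks = np.array([0.10, 0.30, 0.05, 0.50])
mu_values = np.array([2.0, 0.5, 4.0, 0.1])
alpha = 10.0

rho = gibbs_distribution(emp_risks, mu_values, alpha)
objective = emp_risks + mu_values / alpha  # regularized empirical risk

# Any strictly positive target distribution can be written in Gibbs form.
target = np.array([0.4, 0.3, 0.2, 0.1])
mu_target = -alpha * emp_risks - np.log(target)
recovered = gibbs_distribution(emp_risks, mu_target, alpha)
```

Note that the normalization constant never has to be known in closed form here; on continuous hypothesis sets this is precisely what makes sampling hard and motivates MCMC-type procedures.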

3.2. RELATED WORKS USING THE GIBBS DISTRIBUTION

This section highlights two lines of work that are related to our setting: (i) the link between the Gibbs distribution and optimization and (ii) the usage of the Gibbs distribution in generalization bounds.

Relationship between optimization and the Gibbs distribution. Given an objective function $f : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^m \to \mathbb{R}$, the information risk minimization principle (Zhang, 2006) is related to the Gibbs distribution, i.e., by taking

$$\rho_S = \operatorname*{argmin}_{\rho\in\mathcal{M}(\mathcal{H})} \Big\{ \mathbb{E}_{h\sim\rho} f(h, S) + \frac{\mathrm{KL}(\rho\|\pi)}{\alpha} \Big\}, \quad \text{where } \rho_S(h) \propto \exp\big[-\alpha f(h, S) + \ln\pi(h)\big].$$

Note that in our case, matching this density with Equation (2) gives $f(h, S) = R_S(h) + \frac{1}{\alpha}\mu(h, S) + \frac{1}{\alpha}\ln\pi(h)$. This distribution is also linked to the Stochastic Gradient Langevin Dynamics (SGLD) algorithm (Welling & Teh, 2011) that learns the hypothesis $h \in \mathcal{H}$ by running several iterations of the form

$$h_t \leftarrow h_{t-1} - \eta\nabla f(h_{t-1}, S) + \sqrt{\frac{2\eta}{\alpha}}\,\epsilon_t, \quad \text{with } \epsilon_t \sim \mathcal{N}(0, I_D), \qquad (3)$$

where $h_t$ is the hypothesis learned at iteration $t \in \mathbb{N}$, $\eta$ is the learning rate, and $\alpha$ is the concentration parameter of the Gibbs distribution. This algorithm has an interesting feature: when the learning rate $\eta$ tends to zero, the SGLD algorithm becomes a continuous-time process called Langevin diffusion, defined as the stochastic differential equation in Equation (4). Indeed, Equation (3) can be seen as the Euler-Maruyama discretization (see Raginsky et al., 2017) of Equation (4) defined for $t \ge 0$ as

$$dh_t = -\nabla f(h_t, S)\,dt + \sqrt{2\alpha^{-1}}\,dB_t, \qquad (4)$$

where $B_t$ is the Brownian motion. Under some mild assumptions on the function $f()$, Chiang et al. (1987) show that the invariant distribution of the Langevin diffusion is the Gibbs distribution proportional to $\exp(-\alpha f(h_t, S))$.

Gibbs distributions in generalization bounds. The Gibbs distribution is introduced in the PAC-Bayesian theory by Catoni (2004; 2007) Kuzborskij et al., 2019), i.e., bounds w.r.t. the minimal true risk over the hypothesis set. However, all these bounds consider expected risks while we are interested in the risk of a single hypothesis $h$ sampled from $\rho_S$.
Hence, to the best of our knowledge, we are the first to derive probabilistic bounds for a single hypothesis sampled from a Gibbs distribution (see Corollary 3.1, Theorem 3.1).
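As an illustration of the Langevin iteration recalled in Section 3.2, the sketch below runs the update $h_t \leftarrow h_{t-1} - \eta\nabla f(h_{t-1}) + \sqrt{2\eta/\alpha}\,\epsilon_t$ with full gradients (so it is the unadjusted Langevin scheme rather than a mini-batch SGLD) on a toy objective $f(h) = \frac{1}{2}h^2$. The objective, the function names, and the hyperparameter values are assumptions chosen so that the invariant Gibbs distribution $\propto \exp(-\alpha f(h))$ is a Gaussian with variance $1/\alpha$, which can be compared with the samples.

```python
import numpy as np

def langevin_step(h, grad_f, eta, alpha, rng):
    """One Euler-Maruyama step of the Langevin diffusion targeting exp(-alpha * f)."""
    noise = rng.standard_normal(h.shape)
    return h - eta * grad_f(h) + np.sqrt(2.0 * eta / alpha) * noise

def run_langevin(grad_f, h0, eta, alpha, n_steps, seed=0):
    """Iterate the Langevin update n_steps times from the starting point h0."""
    rng = np.random.default_rng(seed)
    h = np.array(h0, dtype=float)
    for _ in range(n_steps):
        h = langevin_step(h, grad_f, eta, alpha, rng)
    return h

# Toy objective f(h) = 0.5 * h^2, applied coordinate-wise, so grad f(h) = h.
# Each coordinate acts as an independent one-dimensional chain.
alpha = 4.0
samples = run_langevin(lambda h: h, np.zeros(5000), eta=0.01, alpha=alpha,
                       n_steps=2000)
# The empirical variance of `samples` should be close to 1 / alpha,
# up to a small bias caused by the finite step size eta.
```

Because of the discretization, the stationary variance is slightly larger than $1/\alpha$; the Metropolis correction used later in Section 4.1 is one way to remove this bias.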

3.3. OUR MAIN RESULT: GENERALIZATION BOUND WITH COMPLEXITY MEASURES

We now state our main result: a bound on the generalization gap involving a custom $\mu$, holding for hypotheses sampled from the posterior $\rho_S(h) \propto \exp[-\alpha R_S(h) - \mu(h, S)]$.

Theorem 3.1 (Generalization Bound with Complexity Measures). Let $\phi : [0, 1]^2 \to \mathbb{R}$ be the generalization gap. For any $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$, for any hypothesis set $\mathcal{H}$, for any prior distribution $\pi \in \mathcal{M}^*(\mathcal{H})$ on $\mathcal{H}$, for any $\mu : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^m \to \mathbb{R}$, for any $\delta \in (0, 1]$, we have

$$\mathbb{P}_{S\sim\mathcal{D}^m,\, h'\sim\pi,\, h\sim\rho_S}\Bigg[\phi(R_{\mathcal{D}}(h), R_S(h)) \le \big[\alpha R_S(h') + \mu(h', S)\big] - \big[\alpha R_S(h) + \mu(h, S)\big] + \ln\frac{\pi(h')}{\pi(h)} + \ln\Big(\frac{4}{\delta^2}\, \mathbb{E}_{S'\sim\mathcal{D}^m}\, \mathbb{E}_{g\sim\pi} \exp\big[\phi(R_{\mathcal{D}}(g), R_{S'}(g))\big]\Big)\Bigg] \ge 1-\delta,$$

where $\rho_S$ is the Gibbs distribution defined by Equation (2).

This theorem is general since it depends only on the functions $\phi()$ (expressing the generalization gap) and $\mu()$ (expressing the complexity) chosen by the user. Moreover, we show that this theorem allows obtaining uniform-convergence-based and algorithm-dependent bounds with the integration of complexity measures. We defer the proof of this result to Appendix D. Given $\phi()$ and $\mu()$, we note a point that can be surprising at first reading: it indeed appears possible to sample hypotheses with a high value of the objective $R_S(h) + \frac{1}{\alpha}\mu(h, S)$ and to obtain a tight generalization bound. However, by definition of the Gibbs distribution $\rho_S$, such a hypothesis $h\sim\rho_S$ is less likely to be drawn since the density is higher when the objective is low. In other words, when $\mu(h, S)$ acts as a regularizer, the bound is more likely to hold for the hypotheses achieving a low regularized empirical risk, which is a rather expected result when considering regularized learning. In general, the bound may appear loose as there is no explicit dependence on the size of the data sample $m$.
However, to get a bound that converges when $m$ increases, it is sufficient to fix $\phi()$ as a function of $m$ such as $\phi(R_{\mathcal{D}}(h), R_S(h)) = m\,\mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)]$ or $\phi(R_{\mathcal{D}}(h), R_S(h)) = 2m[R_{\mathcal{D}}(h) - R_S(h)]^2$, where $\mathrm{kl}(q\|p) \triangleq q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$ for $p \in (0, 1)$ and $q \in [0, 1]$. Then, the tightness of the bound depends on $m$, apart from $\phi()$, $\mu()$ and $\alpha$.

The remaining challenge is to upper-bound $\mathbb{E}_{S'\sim\mathcal{D}^m}\, \mathbb{E}_{g\sim\pi} \exp[\phi(R_{\mathcal{D}}(g), R_{S'}(g))]$ and $\ln\frac{\pi(h')}{\pi(h)}$ to get a practical bound. As an illustration, we provide in the next corollary an instantiation of Theorem 3.1 for two generalization gaps, $\phi(R_{\mathcal{D}}(h), R_S(h)) = m\,\mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)]$ and $\phi(R_{\mathcal{D}}(h), R_S(h)) = 2m[R_{\mathcal{D}}(h) - R_S(h)]^2$, and for $\pi$ being a uniform distribution on a bounded set $\mathcal{H}$.

Corollary 3.1 (Practical Generalization Bound with Complexity Measures). For any $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$, for any bounded hypothesis set $\mathcal{H}$, given the uniform prior $\pi$ on $\mathcal{H}$, for any $\mu : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y})^m \to \mathbb{R}$, for any $\delta \in (0, 1]$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, $h'\sim\pi$, $h\sim\rho_S$ we have

$$\mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)] \le \frac{1}{m}\Big[\big[\alpha R_S(h') + \mu(h', S)\big] - \big[\alpha R_S(h) + \mu(h, S)\big] + \ln\frac{8\sqrt{m}}{\delta^2}\Big]_+, \qquad (5)$$

and

$$|R_{\mathcal{D}}(h) - R_S(h)| \le \sqrt{\frac{1}{2m}\Big[\big[\alpha R_S(h') + \mu(h', S)\big] - \big[\alpha R_S(h) + \mu(h, S)\big] + \ln\frac{8\sqrt{m}}{\delta^2}\Big]_+}, \qquad (6)$$

where $[a]_+ = \max(0, a)$, and $\rho_S$ is the Gibbs distribution defined in Equation (2).

Interestingly, Corollary 3.1 gives a bound on $\mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)]$ and $|R_{\mathcal{D}}(h) - R_S(h)|$ where all terms except $R_{\mathcal{D}}(h)$ are computable. To compute Equations (5) and (6), we can rearrange the terms to obtain a generalization bound on the true risk $R_{\mathcal{D}}(h)$. We obtain respectively

$$R_{\mathcal{D}}(h) \le \overline{\mathrm{kl}}\Bigg(R_S(h)\ \Bigg|\ \frac{1}{m}\Big[\big[\alpha R_S(h') + \mu(h', S)\big] - \big[\alpha R_S(h) + \mu(h, S)\big] + \ln\frac{8\sqrt{m}}{\delta^2}\Big]_+\Bigg), \qquad (7)$$

and

$$R_{\mathcal{D}}(h) \le R_S(h) + \sqrt{\frac{1}{2m}\Big[\big[\alpha R_S(h') + \mu(h', S)\big] - \big[\alpha R_S(h) + \mu(h, S)\big] + \ln\frac{8\sqrt{m}}{\delta^2}\Big]_+}, \qquad (8)$$

where $\overline{\mathrm{kl}}(q|\tau) = \max\{p \in (0, 1) \mid \mathrm{kl}(q\|p) \le \tau\}$. These bounds are used in Section 4 to illustrate the generalization guarantees for different values of $\mu()$ and $\alpha$. In general, Equation (7) provides a tighter bound on the true risk than Equation (8). This can be proven with Pinsker's inequality (Appendix G) and is shown in our experiments. Notice that the r.h.s. of Equations (5) and (6) enjoys asymptotic convergence for $m\to\infty$. However, for some trivial cases, the convergence rate can be arbitrarily degraded by increasing $[\alpha R_S(h') + \mu(h', S)] - [\alpha R_S(h) + \mu(h, S)]$.
For example, for a large empirical risk $R_S(h')$ (which is common when $h'$ is sampled from a uniform prior on $\mathcal{H}$), and for $\alpha=m$ and $\mu(h, S)=0$, the r.h.s. of the bound on $\mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)]$ simplifies to $\Phi_\mu(h, S, \delta) = \big[[R_S(h') - R_S(h)] + \frac{1}{m}\ln\frac{8\sqrt{m}}{\delta^2}\big]_+$ and is large, no matter $m$. In order for the bound to be meaningful, we then have to set $\alpha$ and $\mu()$ such that (i) the distribution $\rho_S$ allows us to sample a hypothesis $h$ associated with a low objective function $h \mapsto R_S(h) + \frac{1}{\alpha}\mu(h, S)$ and (ii) the complexity measure $\Phi_\mu(h, S, \delta)$ is tight. For example, for $\alpha=\sqrt{m}$ and $\mu(h, S)=0$, the distribution $\rho_S$ is less concentrated around the minimizers of the empirical risk, but the complexity measure is tighter compared to the previous example: $\big[\frac{1}{\sqrt{m}}[R_S(h') - R_S(h)] + \frac{1}{m}\ln\frac{8\sqrt{m}}{\delta^2}\big]_+$. Lastly, in the ideal case with $\mu(h, S) = \frac{\alpha}{m}\phi(R_{\mathcal{D}}(h), R_S(h)) - \alpha R_S(h)$ and $\alpha=\sqrt{m}$, for the gap $\phi(R_{\mathcal{D}}(h), R_S(h)) = m\,\mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)]$, the upper bound on $\mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)]$ becomes $\big[\frac{1}{\sqrt{m}}\big(\mathrm{kl}[R_S(h')\|R_{\mathcal{D}}(h')] - \mathrm{kl}[R_S(h)\|R_{\mathcal{D}}(h)]\big) + \frac{1}{m}\ln\frac{8\sqrt{m}}{\delta^2}\big]_+$, which is tight when the gaps of $h$ and $h'$ are small; the tightness arises with high probability since the density $\rho_S(h) \propto \exp\big(-\frac{\alpha}{m}\phi(R_{\mathcal{D}}(h), R_S(h))\big)$ is concentrated around the small gaps. This also highlights that the choice of the parametric function $\mu()$ is key to obtaining a tight generalization bound. In our previous analysis, we considered a uniform distribution for the prior $\pi$ for illustration purposes. It is nevertheless clear that if the prior is good, i.e., it associates a higher probability to hypotheses having a low objective function, then the bounds become tighter. The most favorable case is when both the prior $\pi$ and the posterior $\rho_S$ associate high probabilities to these hypotheses. While the posterior $\rho_S$ is generally learned from data, the choice of the prior $\pi$ matters to get tight bounds.
When no prior knowledge of the problem is available, to obtain better bounds, one solution is to consider data-dependent priors that have been heavily used in the PAC-Bayesian literature (see, e.g., Parrado-Hernández et al., 2012; Dziugaite et al., 2021; Pérez-Ortiz et al., 2021) . In the context of our practical evaluation hereafter, we consider only uniform distributions for the prior ⇡, as we think it helps us assess generalization better. Indeed, a hypothesis h sampled from the uniform distribution ⇡ has a high chance of underfitting. Hence, if the hypothesis h ⇠ ⇢ S has a tight bound, it must be that this hypothesis generalizes well. On the other hand, when using data-dependent priors, we cannot tell if the bound is tight because the hypothesis generalizes well or because the posterior is close to the prior.
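The true-risk bounds of Equations (7) and (8) only involve computable quantities, so they can be evaluated with a few lines of code. The sketch below is one possible way to do it, with the inverse $\overline{\mathrm{kl}}$ obtained by bisection; the function names, the bisection tolerance, and the numerical values (sample size, risks, $\delta$) are illustrative assumptions and not results from the paper.

```python
import math

def kl_bernoulli(q, p):
    """kl(q || p) between Bernoulli parameters, for q in [0, 1] and p in (0, 1)."""
    result = 0.0
    if q > 0.0:
        result += q * math.log(q / p)
    if q < 1.0:
        result += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return result

def kl_inverse(q, tau, tol=1e-12):
    """Largest p in (0, 1) with kl(q || p) <= tau, found by bisection on [q, 1)."""
    upper = 1.0 - 1e-12
    if q >= upper or kl_bernoulli(q, upper) <= tau:
        return upper
    low, high = q, upper
    while high - low > tol:
        mid = 0.5 * (low + high)
        if kl_bernoulli(q, mid) <= tau:
            low = mid   # mid still satisfies the constraint
        else:
            high = mid  # mid violates the constraint
    return low

def complexity_term(alpha, risk_prior, mu_prior, risk_post, mu_post, m, delta):
    """[ (alpha R_S(h') + mu(h',S)) - (alpha R_S(h) + mu(h,S)) + ln(8 sqrt(m)/delta^2) ]_+"""
    diff = (alpha * risk_prior + mu_prior) - (alpha * risk_post + mu_post)
    return max(0.0, diff + math.log(8.0 * math.sqrt(m) / delta ** 2))

def bound_kl(risk_post, complexity, m):
    """True-risk bound in the style of Equation (7)."""
    return kl_inverse(risk_post, complexity / m)

def bound_sqrt(risk_post, complexity, m):
    """True-risk bound in the style of Equation (8)."""
    return risk_post + math.sqrt(complexity / (2.0 * m))

# Hypothetical setting: alpha = sqrt(m), mu = 0, poor prior draw h', good posterior draw h.
m, delta = 60000, 0.05
alpha = math.sqrt(m)
complexity = complexity_term(alpha, 0.9, 0.0, 0.02, 0.0, m, delta)
eq7 = bound_kl(0.02, complexity, m)
eq8 = bound_sqrt(0.02, complexity, m)
```

In this toy setting the kl-based value is the smaller of the two, in line with the Pinsker argument mentioned above.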

4. USING ARBITRARY COMPLEXITIES IN PRACTICE

The bound of Corollary 3.1 is not directly applicable in practice: the remaining challenge is to sample h from the Gibbs distribution ⇢ S defined in Equation (2). We address the sampling issue in Section 4.1. Then, we make use of the proposed solution to assess our bound in practice. Section 4.2 introduces our experimental setting and Section 4.3 reports an overview of results on the tightness of the bound. We report more results on the influence of ↵ and the other parameters in Appendix E.

4.1. SAMPLING FROM THE GIBBS DISTRIBUTION

Sampling from the Gibbs distribution of Equation (2) is a hard task: naively, it requires evaluating the function $h \mapsto -\alpha R_S(h) - \mu(h, S)$ for all $h \in \mathcal{H}$, which is intractable when $\mathcal{H}$ is infinite or even large. In an empirical study of our bound, we tackle this issue for over-parameterized models, which we later consider in Section 4.2. Let us consider a set $\mathcal{H}$ of hypotheses $h_w$ parameterized by $w \in \mathbb{R}^D$, and a tractable distribution denoted $P^w_U$ (e.g., a Gaussian distribution) such that its density approximates the density of $\rho_S$. In this setting, to learn such a tractable distribution, we propose in Algorithm 1 a stochastic version of the Metropolis Adjusted Langevin Algorithm (MALA; Besag, 1994) (see footnote 2). Its objective is to generate samples from $\rho_S$ by iteratively refining the tractable distribution that we define as

$$P^w_U = \mathcal{N}\Big(w - \eta\nabla\Big[R^{\ell}_U(w) + \frac{1}{\alpha}\mu(w, U)\Big],\ \frac{2\eta}{\alpha} I\Big),$$

where $R^{\ell}_U(w) = \mathbb{E}_{(x,y)\sim U}\, \ell(h_w, (x, y))$ is the empirical risk on the mini-batch $U \subseteq S$, and $\ell : \mathcal{H}\times(\mathcal{X}\times\mathcal{Y}) \to [0, 1]$ is a loss function. Concretely, we initialize the parameters $w$ of the model as the output of an optimization algorithm (Vanilla SGD in our case) minimizing $R_S(w) + \frac{1}{\alpha}\mu(w, S)$ (which is approximated by $R^{\ell}_U(w) + \frac{1}{\alpha}\mu(w, U)$ for each mini-batch $U$). Then, we refine them as follows: at each iteration, given the current weights $w$ and a mini-batch $U \subseteq S$ (Line 4), we sample a candidate vector $w'$ (Line 5) according to the distribution $P^w_U$; then (Lines 6 to 9) we decide to accept or reject the new candidate as our current weights $w$, depending on whether the ratio $\tau = \min\Big(1, \frac{\rho_U(w')\, P^{w'}_U(w)}{\rho_U(w)\, P^w_U(w')}\Big)$ is at least as large as a control value $u$ sampled from the uniform distribution $\mathrm{Uni}(0, 1)$ on $[0, 1]$. Note that, to compute $\tau$, it is not necessary to know the normalization constants of the two distributions appearing in $\tau$ since they cancel out.
In other words, only the unnormalized functions associated with the distributions are required to compute $\tau$. Under the mild assumption that $\rho_S$ is absolutely continuous w.r.t. $P^w_S$ (see Chib & Greenberg, 1995, for details), when the number of iterations tends to infinity and when $U=S$, the returned $w$ is sampled according to $\rho_S$ (Smith & Roberts, 1993). Note that this assumption requires that the tractable distribution $P^w_S$ has a strictly positive density wherever the density of $\rho_S$ is strictly positive (see Chib & Greenberg, 1995).
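The sampling procedure described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a toy one-dimensional quadratic loss (which is not bounded in $[0, 1]$), $\mu = 0$, analytic gradients instead of automatic differentiation, and hypothetical function names. The risk, complexity, and gradient of the objective $R_U(w) + \frac{1}{\alpha}\mu(w, U)$ are passed in as callables.

```python
import numpy as np

def log_gibbs(w, batch, alpha, risk, mu):
    """Unnormalized log-density of rho_U: -alpha * R_U(w) - mu(w, U)."""
    return -alpha * risk(w, batch) - mu(w, batch)

def log_proposal(w_to, w_from, batch, alpha, eta, grad_objective):
    """Unnormalized log-density at w_to of the Gaussian proposal centered from w_from."""
    mean = w_from - eta * grad_objective(w_from, batch)
    # The covariance is (2 * eta / alpha) * I, hence the factor alpha / (4 * eta).
    return -alpha * np.sum((w_to - mean) ** 2) / (4.0 * eta)

def stochastic_mala(w, data, risk, mu, grad_objective, alpha, eta,
                    n_iters, batch_size, rng):
    """Metropolis-adjusted Langevin steps with a fresh mini-batch per iteration."""
    w = np.array(w, dtype=float)
    for _ in range(n_iters):
        idx = rng.choice(len(data), size=batch_size, replace=False)
        batch = data[idx]
        mean = w - eta * grad_objective(w, batch)
        w_new = mean + np.sqrt(2.0 * eta / alpha) * rng.standard_normal(w.shape)
        # Log of the Metropolis-Hastings ratio; normalization constants cancel.
        log_tau = (log_gibbs(w_new, batch, alpha, risk, mu)
                   + log_proposal(w, w_new, batch, alpha, eta, grad_objective)
                   - log_gibbs(w, batch, alpha, risk, mu)
                   - log_proposal(w_new, w, batch, alpha, eta, grad_objective))
        if np.log(rng.uniform()) <= min(0.0, log_tau):
            w = w_new  # accept the candidate
    return w

# Toy problem (assumption): loss 0.5 * (w - x)^2 on scalar data and mu = 0.
data = np.random.default_rng(0).normal(1.0, 1.0, size=32)
toy_risk = lambda w, batch: 0.5 * np.mean((batch - w) ** 2)
toy_mu = lambda w, batch: 0.0
toy_grad = lambda w, batch: w - np.mean(batch)

w_final = stochastic_mala(np.array([data.mean() + 1.0]), data, toy_risk, toy_mu,
                          toy_grad, alpha=50.0, eta=0.5, n_iters=50,
                          batch_size=len(data), rng=np.random.default_rng(1))
```

With the full batch ($U = S$), the target in this toy problem is a Gaussian centered at the data mean with variance $1/\alpha$, which gives a simple sanity check; with genuine mini-batches the acceptance ratio is only a noisy approximation of the full-data one.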

4.2. EXPERIMENTAL SETTING

In this section, we investigate the tightness of our bounds of Equations (7) and (8) on the MNIST (LeCun et al., 1998) and FashionMNIST (Xiao et al., 2017) datasets. We keep the original learning set as $S$ and use the original test set $T$ to estimate the true risk, which we refer to as the test risk $R_T(h)$.

Model. We use a "Convolutional Network in Network" (Lin et al., 2013), similarly to Jiang et al. (2019) and Dziugaite et al. (2020), that consists of several modules of 3 convolutional layers, each followed by a leaky ReLU activation function (its negative slope is set to $10^{-2}$). The depth of the network $L$ is the number of convolutional layers, and the width $H$ is the number of channels of each convolution. In addition, for each layer $i$, we denote its weights by $w_i$. For full details of the architecture, we refer the reader to Appendix E. We consider $L \in \{9, 12, 15\}$ and $H \in \{128, 256\}$. Furthermore, we initialize the network with the weights $w^0 \in \mathbb{R}^D$ obtained by the uniform Kaiming He initializer (He et al., 2015). The set $\mathcal{H}$ corresponds to the hypotheses $h_w$ that can be obtained from this initialization (and we clamp the weights during the optimization in the initialization interval).

Arbitrary complexity measures. We study 6 different complexity measures parametrized by different functions $\mu(h_w, S)$ from Jiang et al. (2019, Sec. C) (see footnote 3). These 6 functions are actually independent of the learning sample $S$ ($S$ is dropped below for convenience) and defined as follows:

$$\mathrm{DIST\_FRO}(h_w) = \sum_{i=1}^{L} \|w_i - w^0_i\|_2,$$

We define the considered measures with $\alpha$ taken among 5 values uniformly spaced in $[\sqrt{m}, m]$. Note that, as mentioned above, these 6 parametric functions are independent of the sample $S$. We have also analyzed other parametric functions that depend on $S$; since the results obtained are similar, we defer them to Appendix E.

Bound optimization. To compute our bound in Equations (7) and (8), we aim to sample a hypothesis $h \sim \rho_S$ via Algorithm 1.
We set the loss function to the bounded cross entropy from Dziugaite & Roy (2018): $\ell(h, (x, y)) = -\frac{1}{4}\ln\big(e^{-4} + (1 - 2e^{-4})\,h[y]\big)$, where $h[y]$ is the probability assigned to label $y$ by $h$. The advantage of the cross-entropy of Dziugaite & Roy (2018) is that it satisfies $\ell(h, (x, y)) \in [0, 1]$, whereas the classical cross-entropy is unbounded. Indeed, using the classical cross-entropy when optimizing the objective function would lead to focusing too much on the risk minimization, while we also want to take into account $\frac{1}{\alpha}\mu(w, U)$. We initialize the weights $w \in \mathbb{R}^D$ to the solution found by optimizing the objective function $R_S(w) + \frac{1}{\alpha}\mu(w, S)$ with a Vanilla SGD (with 10 epochs, a learning rate of $10^{-1}$, and a batch size of 64). Given these initial parameters $w$, we execute Algorithm 1 for 1 epoch with a mini-batch of size 64, where $\eta = 10^{-4}$.
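For reference, the bounded cross-entropy and the DIST_FRO measure from this section can be written in a few lines of NumPy. This is a minimal sketch under the assumption that the probability of the true label and the per-layer weight arrays are already available; the function names are hypothetical and the snippet is independent of the network used in the experiments.

```python
import numpy as np

def bounded_cross_entropy(prob_true_label):
    """-(1/4) * ln(exp(-4) + (1 - 2 exp(-4)) * h[y]), which lies in [0, 1]."""
    p = np.asarray(prob_true_label, dtype=float)
    floor = np.exp(-4.0)
    return -0.25 * np.log(floor + (1.0 - 2.0 * floor) * p)

def dist_fro(weights, init_weights):
    """Sum over layers of the Euclidean norm of (w_i - w_i^0), flattened per layer."""
    return float(sum(np.linalg.norm((w - w0).ravel())
                     for w, w0 in zip(weights, init_weights)))

# The loss equals 1 when the true label gets probability 0,
# and stays slightly above 0 when it gets probability 1.
loss_values = bounded_cross_entropy(np.linspace(0.0, 1.0, 11))
```

The rescaling by $e^{-4}$ is what caps the loss at 1: the argument of the logarithm never falls below $e^{-4}$, so the loss never exceeds $-\frac{1}{4}\ln e^{-4} = 1$.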

4.3. TIGHTNESS OF THE BOUNDS

For each parametric function $\mu()$, we report in Figures 1 and 2 the test risks $R_T(h)$ and the values of the tightest bound on $R_{\mathcal{D}}(h)$ (w.r.t. $\alpha$) associated with Equations (7) and (8) for different parameters (depth $L$, width $H$). First of all, we observe that the bounds correctly upper-bound the test risks and that some measures lead to tighter bounds, such as SUM_FRO or DIST_L2. We also remark that certain empirical risks are high, in particular, for these latter measures. This is due to the sampling of the hypothesis $h$ from the distribution $\rho_S$: the hypothesis does not necessarily minimize the objective function $h \mapsto R_S(h) + \frac{1}{\alpha}\mu(h, S)$. We nevertheless observe that the bounds' values are higher when the empirical risk $R_S(h)$ is low. This can be explained by the fact that $[\alpha R_S(h') + \mu(h', S)] - [\alpha R_S(h) + \mu(h, S)]$ is large in this case due notably to the non-informative prior $\pi$. Interestingly, when the empirical risks are a bit worse or close to the true risks, the bounds become tighter for certain parametric functions such as DIST_L2 and SUM_FRO, which then appear to capture more information on the generalization capabilities. Indeed, the more the objective function $R_S(h) + \frac{1}{\alpha}\mu(h, S)$ is representative of the gap of $h$, the tighter the bound. On the other hand, we can also note that for some measures such as DIST_FRO, PARAM_NORM, and ZERO (mainly for FashionMNIST), the bounds remain similar whatever the hypothesis, which illustrates that these latter measures do not really help to capture some information about the generalization gap. This confirms that there is an interest in using a parametric function that captures information on the model during the training phase to assess its generalization capability. In Appendix E, we provide additional results on the influence of the parameter $\alpha$ and the depth/width of the network.
As expected, the bounds tend to increase when $\alpha$ becomes large; for smaller $\alpha$ (e.g., close to $\sqrt{m}$), the bounds are improved, but at the price of potentially higher risks. In contrast, regarding the impact of the depth/width, some measures are less sensitive to the increase of such parameters, such as PARAM_NORM and, to a lesser extent, SUM_FRO and DIST_L2. This illustrates our framework's interest in studying the impact of some regularization when learning (over-)parameterized models.

5. CONCLUSION

In this paper, we provide a novel generalization bound that is able to incorporate arbitrary complexity measures, unlike classical learning theory frameworks (for which the framework imposes the complexity). These measures incorporate a data- and model-dependent function, which can help tighten the complexity for the generalization gap. To the best of our knowledge, our framework is one of the few general enough to bring theoretical guarantees for most of the arbitrary complexity measures used in practice, e.g., based on some norms or a validation set. Such a framework may be adapted to other settings, such as transfer learning, offering new research directions. However, one limitation of this work is clearly that the hypothesis is obtained from a distribution that is difficult to use, namely, the Gibbs distribution, which requires a specific sampling algorithm, e.g., our stochastic MALA algorithm. It would be interesting to study the performance of such a sampling procedure theoretically. Alternatively, the generality of this framework allows one to avoid the sampling if we consider uniform-convergence-type bounds, for example, as in Corollary D.1. Improving the framework in this direction is an interesting direction for future work. In particular, investigating the use of other distributions for sampling the hypothesis could be a possible direction. Another one could be to consider other specific ways to define informative data-dependent priors in order to obtain better bounds. For instance, the parametric function can be leveraged in order to include an informative prior. Another interesting perspective is to study SGD-based algorithms, either by analyzing models learned by SGD through our framework or by developing SGD alternatives to optimize our bounds. In conclusion, we believe that this work paves the way for new research directions that try to bridge statistical learning theory and practice.



1. Disintegrated PAC-Bayesian bounds have also been introduced as a "single-draw case" by Hellström & Durisi (2020).

2. See Chib & Greenberg (1995) for an introduction to the Metropolis-Hastings algorithm on which MALA is based.

3. Note that we consider a subset of the functions studied by Jiang et al.: we select those that are optimizable.



Figure 1: Scatter plot given a parametric function $\mu(h, S)$, where each segment represents a neural network $h_w$ learned with a given $\alpha$, width $H$ and depth $L$. Each segment has a corresponding orange square and a blue circle. The orange squares correspond to the empirical risk $R_S(h)$ (x-axis) and test risk $R_T(h)$ (y-axis). The blue circle resp. the black triangle represents Equation (7) resp. Equation (8) on the x-axis and the test risk $R_T(h)$ on the y-axis. The dashed line is the identity function.

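Algorithm 1 Stochastic MALA
1: Input: Learning set $S$, weights $w$, function $\mu()$, loss function $\ell()$
2: Hyperparameters: Number of iterations $T$, learning rate $\eta$, parameter $\alpha$
3: for $t \leftarrow 1 \dots T$ do
4:    $U \leftarrow$ Sample a mini-batch from $S$
5:    $w' \leftarrow$ Sample from the distribution $P^w_U$
6:    $\tau \leftarrow \min\Big(1, \frac{\rho_U(w')\, P^{w'}_U(w)}{\rho_U(w)\, P^w_U(w')}\Big)$
7:    $u \leftarrow$ Sample from the distribution $\mathrm{Uni}(0, 1)$
8:    if $u \le \tau$ then
9:       $w \leftarrow w'$
10: return $w$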

Figure 2: Scatter plot given a parametric function $\mu(h, S)$, where each segment represents a neural network $h_w$ learned with a given $\alpha$, width $H$ and depth $L$. Each segment has a corresponding orange square and a blue circle. The orange squares correspond to the empirical risk $R_S(h)$ (x-axis) and test risk $R_T(h)$ (y-axis). The blue circle resp. the black triangle represents Equation (7) resp. Equation (8) on the x-axis and the test risk $R_T(h)$ on the y-axis. The dashed line is the identity function.

REPRODUCIBILITY STATEMENT

In order to ensure the reproducibility of our results, we complete the presentation of the experimental setup of the main text in Section 4.2 with a more complete description of the setting, models, and parameters used in Appendix E, where some additional results are also provided. We also include the code of our method as an additional zip file in the supplementary material in order to facilitate the reproduction of the experiments. Regarding the theoretical contributions, we provide in Appendices A to C the proofs of the results presented in the main paper, namely Theorems 2.1 and 3.1 and Corollary 3.1. We also provide in Appendix D some additional results and discussion about the comparison of our framework with the uniform convergence and algorithm-dependent generalization bounds.

ETHICS STATEMENT

The contributions of this paper are essentially fundamental and theoretical; we do not see an immediate potential negative social impact from these contributions. We followed classic ethical guidelines in machine learning, which in our case mainly consist in providing information about reproducibility, as addressed in the previous paragraph.

