CO-COMPLEXITY: AN EXTENDED PERSPECTIVE ON GENERALIZATION ERROR

Abstract

It is well known that the complexity of a classifier's function space controls its generalization gap, with two important examples being VC-dimension and Rademacher complexity (R-Complexity). We note that these traditional generalization error bounds consider the ground truth label generating function (LGF) to be fixed. However, if we consider a scenario where the LGF has no constraints at all, then the true generalization error can be large, irrespective of training performance, as the values of the LGF on unseen data points can be largely independent of the values on the training data. To account for this, in this work, we consider an extended characterization of the problem, where the ground truth labels are generated by a function within another function space, which we call the generator space. We find that the generalization gap in this scenario depends on the R-Complexity of both the classifier and the generator function spaces. Thus, we find that, even if the R-Complexity of the classifier is low and it has a good training fit, a highly complex generator space could worsen generalization performance, in accordance with the no free lunch theorem. Furthermore, the characterization of a generator space allows us to model constraints, such as invariances (translation and scale in vision) or local smoothness. Subsequently, we propose a joint entropy-like measure of complexity between function spaces (classifier and generator), called co-complexity, which leads to tighter bounds on the generalization error in this setting. Co-complexity captures the similarities between the classifier and generator spaces. It can be decomposed into an invariance co-complexity term, which measures the extent to which the classifier respects the invariant transformations in the generator, and a dissociation co-complexity term, which measures the ability of the classifier to differentiate separate categories in the generator. 
Our major finding is that reducing the invariance co-complexity of a classifier, while maintaining its dissociation co-complexity, improves the training error and reduces the generalization gap. Furthermore, our results, when specialized to the previous setting where the LGF is fixed, lead to potentially tighter generalization error bounds. Theoretical results are supported by empirical validation on the CNN architecture and its transformation-equivariant extensions. Co-complexity showcases a new side to the generalization abilities of classifiers and can potentially be used to improve their design.

1. INTRODUCTION

In the context of supervised classification, a major consideration is the generalization error of a classifier, i.e., how well it generalizes to test (unseen) data points. The notion of overfitting describes the case when the test error significantly exceeds the training error. Naturally, building a robust classifier entails minimizing this generalization gap, to avoid overfitting. To that end, statistical studies on generalization error (Blumer et al. (1989); Bartlett & Mendelson (2003)) find that complexity measures on the classifier function space, F, often directly control the generalization gap of a classifier. Two prominent examples of such measures are the Rademacher complexity R_m(F) (Bartlett & Mendelson (2003)) and the VC dimension VC(F) (Blumer et al. (1989)). Both measures directly estimate the flexibility of a function space, i.e., how likely it is that F contains functions which can fit any random labelling over a set of data points. In this paper, we work with Rademacher complexity and propose extensions that provide a new perspective on generalization error. From a statistical perspective, the generalization gap can be understood through convergence bounds on the error function, i.e., the expected deviation of the error on the test data from the error on the training data. Traditional generalization error bounds (Bartlett & Mendelson, 2003) state that function complexity (i.e., R_m(F)) directly corresponds to the generalization gap: higher R_m(F) usually leads to a greater generalization gap and slower convergence. Although the original R_m(F) was proposed for binary classification, similar results have been shown for multi-class settings and a larger variety of loss functions (Xu et al., 2016; Liao et al., 2018). Note that R_m(F) is defined over the entire function space and is thus global in nature.
Local forms of Rademacher complexity, which restrict the function space to functions achieving minimum error on the training data samples, have been proposed (Bartlett et al., 2005; 2002). Apart from function-complexity-based measures, there is also considerable work treating the subject from an information-theoretic perspective (Xu & Raginsky, 2017; Russo & Zou, 2020; Bu et al., 2020; Haghifam et al., 2020).

1.1. WHY THE GROUND TRUTH LABEL GENERATING FUNCTION MATTERS

We define the label generating function (LGF) for a classification problem as the function which generates the true labels for all possible datapoints. Note that most generalization error bounds, including the traditional ones, are primarily introspective in nature, i.e., they consider the size and flexibility of the classifier's function space F. The main direction proposed in this work is the investigation of the unknowability of the ground truth LGF, using another function space which we call the generator space. The generalization error bounds in Bartlett & Mendelson (2003) state that the difference in test and training performance is roughly bounded above by the Rademacher complexity of the classifier's function space (i.e., R_m(F)). In other words, whatever the training error, the test error will on average be likely to exceed it by an amount R_m(F). We note that, in deriving the original bound, a major assumption is that the LGF g is fixed and knowable from the data. We now outline our main argument for taking the generator space into account. The LGF is indeed fixed, i.e., there cannot be two different ground truth label generating functions applicable to the same problem. However, our primary emphasis is on the fact that the true LGF will always be unknown: for any finite training data containing data-label pairs (z_i, g(z_i)), we only truly know the output of the label generating function on the given training data samples. Only with infinite training data samples would the values of the LGF be known at every z ∈ R^d. In this work, we denote the function space of all possible LGFs, within which the true LGF is contained, as the generator space. Note that the generator space arises due to the unknowability of the LGF.
We show in this work that, due to the generator space, the true generalization gap is greater than the Rademacher complexity of the classifier, and also depends on the Rademacher complexity of the generator. Note that the size of the generator space, which dictates its complexity, depends on the number of constraints on the LGF: if the LGF has no constraints at all, the generator space is large, whereas if the LGF is constrained to be smooth or invariant to many transformations, the generator space is small (as the set of functions which are smooth and invariant is also much smaller, for instance in vision). Let us consider the case where the LGF g has no constraints at all, i.e., it is sampled from a generator space G which contains all possible functions g : R^d → {-1, 1}. In this case, the function g is expected to have no structure and behaves like a random function, and thus the expected test accuracy of any classifier will be 50% (i.e., random chance). Therefore, even if the classifier function f ∈ F produces a very good fit on the training data and F happens to have a low complexity measure R_m(F), the generalization performance will still be poor, as no knowledge of the LGF values on the unseen datapoints is available from the training samples. This is in contrast to the generalization error bounds based on Rademacher complexity, which would estimate that the classifier should have good generalization performance (i.e., low test error), as both R_m(F) and the training error are low. Note that although a typical training dataset extracted from this LGF may be hard to fit using a low-complexity classifier F, there exist, with non-zero probability, training instances on which a low-complexity classifier can achieve a good fit. The takeaway from this example is that the structure of the data (here represented using the complexity of the generator space) can additionally dictate whether a classifier can generalize.
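The chance-level argument above can be checked numerically. The sketch below (our own illustration, not code from the paper) draws labels from a completely unconstrained generator, i.e., an independent random sign for every point, fits a 1-nearest-neighbour classifier, and measures its accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_predict(X_train, y_train, X_test):
    # 1-nearest-neighbour: copy the label of the closest training point.
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(d, axis=1)]

m, n_test, dim = 200, 5000, 5
X_train = rng.normal(size=(m, dim))
X_test = rng.normal(size=(n_test, dim))

# Unconstrained LGF: labels are i.i.d. random signs, so its values on unseen
# points are independent of its values on the training points.
y_train = rng.choice([-1, 1], size=m)
y_test = rng.choice([-1, 1], size=n_test)

train_acc = np.mean(nn_predict(X_train, y_train, X_train) == y_train)
test_acc = np.mean(nn_predict(X_train, y_train, X_test) == y_test)
print(train_acc, test_acc)  # perfect training fit, chance-level test accuracy
```

Despite a perfect fit on the training data, the expected test accuracy is exactly 50%, matching the argument that an unconstrained generator space makes generalization impossible.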
Note that, in this scenario, the expected generalization performance would be better if the LGF had more structure. In example (a), the classifier clearly exhibits poor generalization performance on the test data. This agrees with our previous argument that if the LGF has no constraints, generalization is essentially impossible. In example (b), the classifier shows much more robust generalization performance on the test data, due to the LGF having significantly more structure. Furthermore, it is intuitively clear that a classifier function space F which has a high overlap with the generator space G (and therefore its constraints) should yield good generalization performance. This shows that, in addition to the function spaces F and G individually affecting the generalization gap, the similarities between F and G are also important. Both of these perspectives play leading roles in our construction of generator-and-classifier-aware complexity measures and the associated novel generalization error bounds.

1.2. RELATED WORK

To the best of our knowledge, the approach we propose for studying generalization performance is novel, and there is little directly related work. Here, we describe some works which discuss relevant concepts. The no free lunch theorem proposed in Wolpert & Macready (1997) indirectly sheds light on the behaviour of LGFs. However, it does not incorporate ways to reduce variability in the LGFs by considering constraints related to the classification problem. Invariance constraints in learning algorithms were studied in Sokolic et al. (2017), where the input space was factored into invariant transformations, similar to what we do in this work. In doing so, the complexity of the data was indirectly explored, based on the number of invariant transformations present in the input space. However, the generalization bounds were derived under an assumption of perfectly invariant classifier function spaces, which does not hold for CNNs and their variants (as shown in Kauderer-Abrams (2017)). In another relevant study (Jin et al. (2019)), a cover-complexity measure of a single dataset was proposed, and the generalization error of fully connected networks was analyzed with respect to it. However, invariances in the dataset and the learning algorithm were neglected, i.e., the similarities between the generator and the classifier spaces were not studied.

1.3. KEY CONTRIBUTIONS

The contributions of this work are as follows:

1. We propose a novel complexity measure between the classifier function space F and the generator function space G, called co-complexity, which we use to derive new, more accurate global estimates of generator-aware generalization error bounds (Theorems 3 and 4). Co-complexity considers not only the complexity of the generator and classifier function spaces, but also the similarities between them. Doing so allows for a more exhaustive look into generalization error.

2. We decompose co-complexity into two different measures of complexity, invariance co-complexity and dissociation co-complexity, which are used to derive new generalization error bounds, including bounds on the expected training error (Theorem 6). We find that reducing invariance co-complexity while keeping the dissociation co-complexity unchanged helps reduce the generalization gap (low variance) while maintaining low training error (low bias). This emphasizes the importance of having classifiers that share invariance properties with generator spaces, e.g., rotation-invariant CNNs (Cohen & Welling (2016b)) on MNIST-Rot (Larochelle et al. (2007)).

3. We present empirical validation of the co-complexity measures of the CNN and its scale-equivariant and rotation-equivariant counterparts (SE-CNN in Sosnovik et al. (2019), RE-CNN in Cohen & Welling (2016b)), which explains their superior generalization ability compared to MLPs.

Our proposed error bounds are easily specialized to the case where the ground truth label function is fixed, leading to potentially tighter generalization error bounds (see Appendix A). Although our proposed measures are global in nature, local variants can be derived via extensions similar to those used in Bartlett et al. (2005).

2. DEFINITIONS

Let P be a data distribution over R^d, and let S = [z_1, z_2, ..., z_m] and S' = [z'_1, z'_2, ..., z'_m] denote two sets of m datapoints each, sampled i.i.d. from P. Let σ = (σ_1, ..., σ_m) be i.i.d. Rademacher random variables (Pr(σ_i = +1) = Pr(σ_i = -1) = 0.5). F and G are two function spaces from R^d → {-1, 1} and we assume that they are defined at all points in R^d. In the context of our problem, F will be the classifier's function space, whereas G will be the generator space.

Rademacher Complexity (Bartlett & Mendelson (2003)): First, we provide the definition of Rademacher complexity:

$$R_m(\mathcal{F}) = \mathbb{E}_{\sigma,S}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)\sigma_i\right] \quad (1)$$

It can be seen that R_m(F) indicates the noisy-label fitting ability of the classifier's function space F.

Correlated Rademacher Complexity:

We propose a modified form of the original Rademacher complexity, called the Correlated Rademacher Complexity R^C_m(F), which is defined as follows:

$$R^C_m(\mathcal{F}) = \frac{1}{2} \times \mathbb{E}_{\sigma,S,S'}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)\sigma_i\right] \quad (2)$$

We have 0 ≤ R^C_m(F) ≤ 1/2. We also show that R^C_m(F) ≤ R_m(F) (see Appendix C.1). It has been argued that R_m(F) (and therefore R^C_m(F)) can be considered an "entropy" measure of the entire function space F (Anguita et al. (2014)). Note that, like R_m(F), R^C_m(F) also ultimately depends on the noisy-label fitting ability of F, but computes it via the ability of the function space to assign the same or a different label to two random points z_i and z'_i (via f(z_i)f(z'_i)). We therefore call it the correlated Rademacher complexity of F.

Co-Complexity: Now we propose various complexity measures between two separate function spaces F and G. Similar to R_m(F) and R^C_m(F), these measures assess the noisy-label fitting abilities of the union of the function spaces F and G. In doing so, they compute a joint-entropy-like metric over the two function spaces. First, we define the co-complexity between F and G as follows:

$$R_m(\mathcal{F},\mathcal{G}) = \frac{1}{2} \times \mathbb{E}_{\sigma,S,S'}\left[\sup_{f \in \mathcal{F}, g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] \quad (3)$$

Some of the properties of co-complexity are as follows:

P1. R_m(F, G) = R_m(G, F), i.e., co-complexity is symmetric.

P2. R_m(F, G) ≥ R^C_m(F) and R_m(F, G) ≥ R^C_m(G), i.e., the co-complexity between F and G is always greater than the individual correlated Rademacher complexities of F and G.

P3. R_m(F, G) ≤ R^C_m(F) + R^C_m(G) ≤ R_m(F) + R_m(G), i.e., the co-complexity between F and G is upper bounded by the sum of the Rademacher complexities of F and G.

P4. R_m(F, G) behaves like a joint-entropy measure of F and G. To see this, define I_m(F, G) = R^C_m(F) + R^C_m(G) - R_m(F, G), called the mutual co-complexity between F and G. We later find that I_m(F, G) behaves like a mutual-information measure.
This, coupled with the fact that R^C_m(F) and R^C_m(G) can be construed as entropy measures of F and G, implies the result.

Invariance Co-Complexity: Next, we quantify some properties of the ground truth generator space G in terms of its invariance transformations. We define the invariance classes of G as I_C(G) = {τ_1(·, θ_1), τ_2(·, θ_2), ..., τ_n(·, θ_n)}, where each τ_i(·, θ_i) is a function from R^d → R^d and θ_i represents the extent of the transformation τ_i. Each θ_i ∈ R is a scalar and takes on a set of admissible values depending on the transformation (possibly infinite). The functions are constrained such that, ∀τ ∈ I_C(G), ∀g ∈ G, g(τ(z, t)) = g(z) for all data points z ∈ R^d and for all admissible values of the transformation parameter t. Also, note that setting θ_i = 0 yields the identity transformation, i.e., τ_i(z, 0) = z for all z ∈ R^d and all i. Based on the above, we define a transformation-indicator function I(z, z_i), such that I(z, z_i) = 1 if z = τ_1(·, t_1) ∘ τ_2(·, t_2) ∘ ... ∘ τ_n(·, t_n)(z_i) for some t_1, t_2, ..., t_n, and I(z, z_i) = 0 otherwise. Additionally, with respect to a generator space G, we define the invariance extended set of a datapoint z_i ∈ R^d as follows:

$$I_\mathcal{G}(z_i) = \{z \in \mathbb{R}^d \mid I(z, z_i) = 1\} \quad (5)$$

Note that every point in the invariance extended set of z_i will always have the same ground truth label as z_i. Next, we define the invariance co-complexity between two function spaces F and G as follows:

$$R^I_m(\mathcal{F},\mathcal{G}) = \frac{1}{2} \times \mathbb{E}_{S,S'}\left[\sup_{f \in \mathcal{F}} \left(1 - \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)\right)\right], \quad \text{where } z'_i \sim P(z)I(z, z_i)\ \forall i. \quad (6)$$

Note that R^I_m(F, G) ≤ 1. Each datapoint z'_i in S' is contained within the invariance extended set I_G(z_i) of z_i, and is sampled according to the un-normalized distribution P(z)I(z, z_i) over z ∈ R^d. The invariance co-complexity between F and G indicates the degree to which F obeys the invariance transformations within G.
For instance, a low R^I_m(F, G) would indicate that f(z_i)f(z'_i) = 1 (i.e., f(z_i) = f(z'_i)) for most z_i and z'_i which are related via some invariance transformation in G.

Dissociation Co-Complexity: We now define the dissociation co-complexity between F and G for the case when the corresponding datapoints in S and S' are not related by any invariance transformation. That is, for all z_i ∈ S and z'_i ∈ S', z'_i ∉ I_G(z_i), and z'_i is sampled from the distribution P(z)(1 - I(z, z_i)) over z ∈ R^d. The dissociation co-complexity can then be defined as:

$$R^D_m(\mathcal{F},\mathcal{G}) = \frac{1}{2} \times \mathbb{E}_{\sigma,S,S'}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)\sigma_i\right], \quad \text{where } z'_i \sim P(z)(1 - I(z, z_i))\ \forall i. \quad (8)$$

Note that R^D_m(F, G) measures the average flexibility of F in its label assignment to any two points z_i and z'_i which are not related via any invariance transformation in G. Thus, a larger R^D_m(F, G) indicates that the classifier function space F can easily assign separate labels to datapoints which are not related via any invariance transformation in G. We also define a variant of R^D_m(F, G), in which S' contains only one instance instead of m instances, denoted R^{D,1}_m(F, G):

$$R^{D,1}_m(\mathcal{F},\mathcal{G}) = \frac{1}{2} \times \mathbb{E}_{\sigma,\, z_0 \sim P,\, S}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_0)f(z_i)\sigma_i\right], \quad \text{where } z_i \sim P(z)(1 - I(z, z_0))\ \forall i.$$

Note that R^D_m(F, G) ≤ 0.5 and R^{D,1}_m(F, G) ≤ 0.5.

Rademacher Smoothness: For our final result in Theorem 6, we assume that the function space F is Rademacher smooth w.r.t. G, defined as follows. If F is Rademacher smooth w.r.t. G, then any quantity of the form E_{S,S'}[sup_{f∈F} (1/m) Σ_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)σ_i] or E_{S,S'}[sup_{f∈F, g∈G} (1/m) Σ_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)σ_i], where the expectation is over the datapoint samples S, S', lies between the two following extreme cases: (i) S' ∈ I_G(S) (i.e., z'_i ∼ P(z)I(z, z_i) for all i) and (ii) S' ∉ I_G(S) (i.e., z'_i ∼ P(z)(1 - I(z, z_i)) for all i).
Then, for some non-zero value of α, 0 ≤ α ≤ 1, the following holds:

$$\mathbb{E}_{S,S'}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] = \alpha\, \mathbb{E}_{S,S',\, S' \in I_\mathcal{G}(S)}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] + (1 - \alpha)\, \mathbb{E}_{S,S',\, S' \notin I_\mathcal{G}(S)}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right].$$

Similarly, the same applies to E_{S,S'}[sup_{f∈F, g∈G} (1/m) Σ_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)σ_i].
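To make the definitions in this section concrete, the following sketch (our own toy construction, not from the paper) estimates R_m(F), R^C_m(F), and the co-complexity R_m(F, G) by Monte Carlo, taking both F and G to be small classes of one-dimensional threshold functions:

```python
import numpy as np

rng = np.random.default_rng(1)

def threshold_class(thresholds, z):
    # One function per row: f(z) = sign(z - t) in {-1, +1}.
    return np.where(z[None, :] >= thresholds[:, None], 1.0, -1.0)

def estimate(F_thr, G_thr, m=50, trials=2000):
    rm = rc = co = 0.0
    for _ in range(trials):
        z, zp = rng.normal(size=m), rng.normal(size=m)      # samples S, S'
        sigma = rng.choice([-1.0, 1.0], size=m)             # Rademacher signs
        F, Fp = threshold_class(F_thr, z), threshold_class(F_thr, zp)
        G, Gp = threshold_class(G_thr, z), threshold_class(G_thr, zp)
        rm += np.max(F @ sigma) / m                         # eq. (1)
        rc += 0.5 * np.max((F * Fp) @ sigma) / m            # eq. (2)
        pair = (F * Fp)[:, None, :] * (G * Gp)[None, :, :]  # all (f, g) pairs
        co += 0.5 * np.max(pair @ sigma) / m                # eq. (3)
    return rm / trials, rc / trials, co / trials

rm, rc, co = estimate(np.linspace(-2, 2, 9), np.linspace(-2, 2, 9))
print(rm, rc, co)
```

The supremum over (f, g) pairs is what makes co-complexity a joint measure of the two spaces; with enough trials the estimates respect the ranges stated above, e.g., R^C_m(F) ≤ 1/2.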

3. THEORETICAL RESULTS

We now present a series of results which extend the generalization framework using generator spaces. The proofs of all our results are available in Appendix C. For the following results, we assume that the classifier's function space F contains the constant functions f(z) = c ∀z, c ∈ {-1, 1}. We use the definitions provided in the previous section. Specifically, F is the classifier's function space and G is the generator space. Given that the data labels are generated by g ∈ G, we extend the definition of the sampled instances S by adding the output labels, i.e., S = [(z_1, g(z_1)), (z_2, g(z_2)), ..., (z_m, g(z_m))]. Then, for f ∈ F, we denote the 0-1 loss on S by

$$err_S(f) = \frac{1}{2m}\sum_{i=1}^{m} \left(1 - f(z_i)g(z_i)\right),$$

and the generalization error over the data samples generated by distribution P by

$$err_P(f) = \lim_{N \to \infty,\, z_i \sim P} \frac{1}{2N}\sum_{i=1}^{N} \left(1 - f(z_i)g(z_i)\right).$$

First, we describe the generalization error bound originally proposed in Bartlett & Mendelson (2003).

Theorem 1. (Bartlett & Mendelson (2003)) For 0 < δ < 1, with probability p ≥ 1 - δ, we have

$$err_P(f) \le err_S(f) + R_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}}.$$

The above theorem assumes that the ground truth function is completely knowable from the given training examples, and therefore that the generator space contains only a single element (Bartlett & Mendelson (2003)). For all of our results that follow, we assume the extended case where the generator space can contain more than one element due to the unknowability of the LGF, and construct generalization error bounds that also consider the complexity of G. Note that the following results hold for every choice of f ∈ F and g ∈ G. First, we present the extension of Theorem 1 to the case where the LGF is unknowable (R_m(G) > 0).

Theorem 2. For 0 < δ < 1, with probability p ≥ 1 - δ, we have

$$err_P(f) \le err_S(f) + R_m(\mathcal{F}) + R_m(\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}.$$

Remark 1.
This result states that the complexity of the generator space also directly contributes to the generalization gap. In problems where the LGF is known to be heavily constrained and structured, R_m(G) will be low, reducing the expected generalization gap of classifiers, and vice versa. Note that in the knowable-ground-truth scenario where G has a single element, we have R_m(G) = 0, which recovers the error bound in Theorem 1. This result shows that a larger complexity of G results in a greater generalization gap. We demonstrate this with neural network generator spaces of varying complexity (see Appendix B.3). Intuitively, the additional R_m(G) term comes from the fact that, given the training data labels, there still exists a subspace of functions in G which are potential candidates for the ground truth function. The following result tightens Theorem 2 using co-complexity.

Theorem 3. For 0 < δ < 1, with probability p ≥ 1 - δ, we have

$$err_P(f) \le err_S(f) + R_m(\mathcal{F},\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}.$$

Remark 2. Co-complexity measures the degree of similarity between F and G, instead of simply adding their respective complexities, and therefore Theorem 3 is tighter than Theorem 2 (co-complexity property P3). We also find that the generalization gap, for the case when the roles of F and G are reversed, is unchanged (see Corollary 3.1 in Appendix C). This demonstrates an inherent symmetry in the problem. Also note that Theorem 3 leads to potentially tighter bounds when specialized to the conventional setting where R_m(G) = 0 (Appendix A). The following result outlines a lower bound on the generalization error, using co-complexity.

Theorem 4. For 0 < δ < 1, with probability p ≥ 1 - δ,

$$err_P(f) \ge err_S(f) - R_m(\mathcal{F},\mathcal{G}) - \sqrt{\frac{\log(1/\delta)}{2m}}.$$

Remark 3. This result, coupled with Theorem 3, demonstrates that if one interprets the err_S(f) term as the bias of the classifier, R_m(F, G) can be interpreted as the variance.
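As a purely numerical illustration of how Theorem 3 tightens Theorem 2, the following sketch plugs hypothetical complexity values (chosen by us for illustration only; any values satisfying property P3 behave the same way) into both bounds:

```python
import math

def confidence_term(delta, m):
    # The sqrt(log(1/delta) / (2m)) slack shared by Theorems 1-4.
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# Hypothetical plug-in values, for illustration only.
err_S = 0.05            # training error
R_F, R_G = 0.10, 0.15   # Rademacher complexities of classifier and generator
R_FG = 0.18             # co-complexity; P3 guarantees R_FG <= R_F + R_G
delta, m = 0.05, 10000

theorem2_bound = err_S + R_F + R_G + confidence_term(delta, m)
theorem3_bound = err_S + R_FG + confidence_term(delta, m)
print(theorem2_bound, theorem3_bound)  # Theorem 3 yields the smaller bound
```

Whenever the co-complexity is strictly smaller than R_m(F) + R_m(G), i.e., whenever the mutual co-complexity is positive, Theorem 3 is strictly tighter.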
The following result addresses the joint-entropy-like behaviour of R_m(F, G).

Theorem 5. We are given the mutual co-complexity measure I_m(F, G) as defined in co-complexity property P4. Consider another ground truth generator space G', such that I_m(G', G) = 0, i.e., G and G' are independent. Then we have R_m(F) ≥ I_m(F, G) + I_m(F, G').

Remark 4. Let H(X) represent the Shannon entropy of the random variable X and I(X; Y) the mutual information between random variables X and Y. It is known that H(X) ≥ I(X; Y) + I(X; Y') when Y and Y' are independent random variables (I(Y; Y') = 0). This observation, coupled with the fact that R^C_m(F) can be considered an entropy measure of the function space (see Section 2), indicates that the quantity I_m(F, G) behaves like a mutual-information estimate between the function spaces F and G. Furthermore, as R_m(F, G) = R^C_m(F) + R^C_m(G) - I_m(F, G), we can see that R_m(F, G) behaves as a joint-entropy measure of two function spaces. The following result demonstrates how the co-complexity measure R_m(F, G) can be decomposed into separate co-complexity measures.

Lemma 1. Consider function spaces F and G, such that F is Rademacher smooth w.r.t. G. Define

$$R^{C,D}_m(\mathcal{G}) = \frac{1}{2} \times \mathbb{E}_{\sigma,S,S',\, S' \notin I_\mathcal{G}(S)}\left[\sup_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} g(z_i)g(z'_i)\sigma_i\right].$$

For some real constant 0 ≤ α ≤ 1, we then have

$$R_m(\mathcal{F},\mathcal{G}) \le \alpha R^I_m(\mathcal{F},\mathcal{G}) + (1 - \alpha)\left(R^D_m(\mathcal{F},\mathcal{G}) + R^{C,D}_m(\mathcal{G})\right),$$

where R^I_m(F, G) and R^D_m(F, G) are the invariance and dissociation co-complexities, respectively.

Remark 5. This decomposition allows us to differentiate the impact of R^I_m(F, G) and R^D_m(F, G) on the generalization error bound. Note that the value of α here will be proportional to the cardinality of the invariance transformation classes I_C(G) of G. Using this, we proceed to our final result, where we express the generalization error bound of Theorem 3 in terms of the invariance and dissociation co-complexities.
For brevity, we write R^I_m for R^I_m(F, G), R^D_m for R^D_m(F, G), and similarly for R^{D,1}_m.

Theorem 6. Consider function spaces F and G, such that F is Rademacher smooth w.r.t. G. Let err_P denote the generalization error of the functions f ∈ F which show the best fit on the training data samples (averaged across all g ∈ G). R^{C,D}_m(G) is defined in Lemma 1. For some real constants 0 ≤ α, β ≤ 1 and 0 < δ < 1, with probability p ≥ 1 - δ, we have

$$err_P \le (1 - \beta)\left(\frac{1}{2} - R^{D,1}_m\right) + \alpha R^I_m + (1 - \alpha)\left(R^D_m + R^{C,D}_m(\mathcal{G})\right) + 2\sqrt{\frac{\log(1/\delta)}{m}} \quad (18)$$

Remark 6. The first term in the generalization error bound in (18) represents an upper bound on the average training error of the function which best fits the training data. We find that a smaller R^I_m leads to a smaller generalization gap (variance) while keeping the training error unchanged. However, the same cannot be said for the dissociation co-complexity R^D_m. Thus, for fixed R^D_m and R^{D,1}_m, classifier function spaces with smaller R^I_m will lead to smaller generalization error. Also note that when the generator space contains only a single element, R^{C,D}_m(G) = 0. The next proposition gives interpretable bounds on the invariance and dissociation co-complexities. Let z = (z_1, z_2, ..., z_m) and z' = (z'_1, ..., z'_m), where z_k, z'_k ∈ R^d ∀k. Define the well-known growth function of the binary-valued function space F as

$$A = \Pi_\mathcal{F}(m) = \max_{\mathbf{z}} \left|\left\{(f(z_1), ..., f(z_m)) : f \in \mathcal{F}\right\}\right|.$$

Define the invariance-constrained and dissociation-constrained growth functions of F w.r.t. G as:

$$B = \Pi^I_{\mathcal{F}|\mathcal{G}}(m) = \max_{\mathbf{z}, \mathbf{z}',\, z'_i \in I_\mathcal{G}(z_i)\,\forall i} \left|\left\{(f(z_1), ..., f(z_m), f(z'_1), ..., f(z'_m)) : f \in \mathcal{F}\right\}\right|$$

$$C = \Pi^D_{\mathcal{F}|\mathcal{G}}(m) = \max_{\mathbf{z}, \mathbf{z}',\, z'_i \notin I_\mathcal{G}(z_i)\,\forall i} \left|\left\{(f(z_1), ..., f(z_m), f(z'_1), ..., f(z'_m)) : f \in \mathcal{F}\right\}\right|$$

Proposition 1. Define

$$\bar{R}^I_m(\mathcal{F},\mathcal{G}) = \frac{1}{2} \times \mathbb{E}_{\sigma,S,S',\, S' \in I_\mathcal{G}(S)}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} f(z_i)f(z'_i)\sigma_i\right], \quad \text{where } z'_i \in I_\mathcal{G}(z_i)\ \forall i.$$

Note that \bar{R}^I_m(F, G) ≤ R^I_m(F, G).
Given the definitions above, we have

$$\bar{R}^I_m(\mathcal{F},\mathcal{G}) \le \sqrt{\frac{\log B}{2m}} \le \sqrt{\frac{\log A}{m}} \quad \text{and} \quad R^D_m(\mathcal{F},\mathcal{G}) \le \sqrt{\frac{\log C}{2m}} \le \sqrt{\frac{\log A}{m}},$$

with \bar{R}^I_m as defined in Proposition 1. Remark 7. Reducing Π^I_{F|G}(m) while keeping Π^D_{F|G}(m) unchanged results in low invariance co-complexity without affecting the dissociation co-complexity, which cannot be achieved by simply reducing Π_F(m) (or VC(F)). This shows that although R^I_m and R^D_m are not completely independent, careful construction of F makes it possible to reduce R^I_m while maintaining R^D_m.
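The point of Remark 7, that a function space can have low invariance co-complexity without sacrificing dissociation, can be seen in a toy example (our own construction, not from the paper). Below, G is taken to be rotation-invariant on R^2, so I_G(z) is the circle through z; a class of radius thresholds is perfectly rotation-invariant (R^I_m = 0), while a class of half-plane thresholds is not:

```python
import numpy as np

rng = np.random.default_rng(2)

def rotate(Z, angles):
    # Rotate each 2-D point z_i by its own angle theta_i.
    c, s = np.cos(angles), np.sin(angles)
    x, y = Z[:, 0], Z[:, 1]
    return np.stack([c * x - s * y, s * x + c * y], axis=1)

def invariance_cocomplexity(apply_class, m=200, trials=200):
    # Eq. (6): 0.5 * E sup_f (1 - (1/m) sum_i f(z_i) f(z'_i)),
    # with z'_i a random rotation of z_i, i.e., z'_i in I_G(z_i).
    total = 0.0
    for _ in range(trials):
        Z = rng.normal(size=(m, 2))
        Zp = rotate(Z, rng.uniform(0.0, np.pi, size=m))
        F, Fp = apply_class(Z), apply_class(Zp)  # shape: (num_functions, m)
        total += 0.5 * np.max(1.0 - (F * Fp).mean(axis=1))
    return total / trials

t = np.linspace(-2.0, 2.0, 9)[:, None]
# Rotation-invariant class: f(z) = sign(||z|| - t), so f(z') = f(z) exactly.
radius_class = lambda Z: np.where(np.linalg.norm(Z, axis=1)[None, :] >= t, 1.0, -1.0)
# Non-invariant class: f(z) = sign(z_x - t), which changes under rotation.
halfplane_class = lambda Z: np.where(Z[:, 0][None, :] >= t, 1.0, -1.0)

ri_radius = invariance_cocomplexity(radius_class)
ri_halfplane = invariance_cocomplexity(halfplane_class)
print(ri_radius, ri_halfplane)  # near zero vs clearly positive
```

Both classes still realize the full set of threshold labelings (on radii and on projections, respectively), so the low R^I_m of the radius class does not come from shrinking the class wholesale, mirroring the growth-function argument above.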

4. EXPERIMENTS AND DISCUSSIONS

Our experiments (on MNIST and STL-10) explore the implications of Theorem 6, which states that lowering R^I_m while maintaining R^D_m improves generalization while keeping the training error unchanged. As all of our proposed complexity measures assume binary classifiers, we create subsets of these datasets containing only two randomly chosen categories, which converts the problem into a binary classification scenario.

4.1. MEASURING INVARIANCE CO-COMPLEXITY

As invariance co-complexity is defined w.r.t. a transformation class (I_C(G)), we compute the invariance co-complexity (in (6)) of the networks shown in Table 1 for four different transformations. For each transformation τ, we choose the datapoint pairs S and S' in (6) such that z'_i = τ(z_i, t), where t is the transformation parameter. The parameter t is randomly chosen as follows: 0°-180° for rotation, 0.7-1.2 for scale, 0°-90° for shear angle, and 0.5-4 pixels for translation. The computed R^I_m values indicate the extent to which these networks naturally allow for invariance to various transformation types. To approximately compute R^I_m, we take 1000 randomly weighted networks to construct F in (6). We average the results over 100 batches of data (S and S'), each containing 1000 examples (m = 1000) from their respective datasets. We compute R^I_m for multi-layered perceptrons (MLPs), CNNs, and their transformation-equivariant extensions: the scale-equivariant CNN (SE-CNN in Sosnovik et al. (2019)) and the rotation-equivariant CNN (RE-CNN in Cohen & Welling (2016b)). Note that the scale-equivariant CNN and the rotation-equivariant CNN are known to outperform the vanilla CNN on MNIST and its rotation and scale extensions (Sosnovik et al. (2019); Cohen & Welling (2016b)). Hence, these transformation-equivariant extensions showcase better generalization performance. One objective of this experiment is to verify whether the invariance and dissociation co-complexities of these networks indeed point to better generalization performance, as observed in the literature. Please see Appendix D for architecture details.
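A minimal numpy stand-in for this measurement procedure (our own simplification: small random two-layer networks instead of the paper's CNN variants, pixel translation as the transformation, and reduced batch and ensemble sizes) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_random_nets(n_nets, d_in, d_hidden):
    # Randomly weighted two-layer nets standing in for the 1000 randomly
    # initialized networks used to approximate the supremum over F.
    W1 = rng.normal(size=(n_nets, d_hidden, d_in)) / np.sqrt(d_in)
    W2 = rng.normal(size=(n_nets, d_hidden)) / np.sqrt(d_hidden)
    def forward(X):  # X: (m, d_in) -> signs, shape (n_nets, m)
        h = np.tanh(np.einsum('nhd,md->nmh', W1, X))
        return np.sign(np.einsum('nmh,nh->nm', h, W2) + 1e-12)
    return forward

side, m = 8, 500
forward = make_random_nets(50, side * side, 32)

imgs = rng.normal(size=(m, side, side))   # stand-in image batch S
imgs_t = np.roll(imgs, 2, axis=2)         # S': each image shifted 2 px right
F = forward(imgs.reshape(m, -1))
Fp = forward(imgs_t.reshape(m, -1))

# Eq. (6): R^I_m ~= 0.5 * sup over nets of (1 - (1/m) sum_i f(z_i) f(z'_i))
ri = 0.5 * np.max(1.0 - (F * Fp).mean(axis=1))
print(ri)  # in [0, 1]; lower means the class is more translation-invariant
```

In the paper's setup, the same computation is repeated over 100 data batches and averaged, with the networks replaced by the architectures of Table 1.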

4.2. MEASURING DISSOCIATION CO-COMPLEXITY

A smaller invariance co-complexity can be inconsequential if the network's ability to discriminate between two different categories in G also suffers. Hence, we measure the dissociation co-complexity (in (8)) to check whether lower values of R^I_m also affect discriminatory potential. We compute R^D_m similarly to R^I_m, by using 1000 randomly initialized networks (for the supremum in (8)) and then averaging the measure over 100 batches of data (S and S'). Table 2 shows the results, where we find that the values of R^D_m for all studied networks are very close to each other. Therefore, only the invariance co-complexity of these architectures will majorly decide the generalization error bounds in Theorem 6. This shows that, in spite of the better invariance capabilities (lower R^I_m) of CNNs and their transformation-equivariant extensions, these networks preserve the discriminatory potential necessary for differentiating images that belong to different categories.
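The dissociation measurement can be sketched the same way (again a numpy stand-in with our own reduced sizes, not the paper's exact setup): the pairs (z_i, z'_i) are now drawn independently, so they are not related by any invariance transformation, and the Rademacher signs of (8) reappear:

```python
import numpy as np

rng = np.random.default_rng(5)

def make_random_nets(n_nets, d_in, d_hidden):
    # Random two-layer nets approximating the supremum over F.
    W1 = rng.normal(size=(n_nets, d_hidden, d_in)) / np.sqrt(d_in)
    W2 = rng.normal(size=(n_nets, d_hidden)) / np.sqrt(d_hidden)
    def forward(X):
        h = np.tanh(np.einsum('nhd,md->nmh', W1, X))
        return np.sign(np.einsum('nmh,nh->nm', h, W2) + 1e-12)
    return forward

m, dim, batches = 500, 64, 10
forward = make_random_nets(50, dim, 16)

rd_total = 0.0
for _ in range(batches):
    Z = rng.normal(size=(m, dim))            # S
    Zp = rng.normal(size=(m, dim))           # S': independent draws, so z'_i
    sigma = rng.choice([-1.0, 1.0], size=m)  # is unrelated to z_i
    F, Fp = forward(Z), forward(Zp)
    # Eq. (8): 0.5 * sup over nets of (1/m) sum_i f(z_i) f(z'_i) sigma_i
    rd_total += 0.5 * np.max((F * Fp) @ sigma) / m
rd = rd_total / batches
print(rd)  # bounded above by 0.5
```

In the paper's experiments, these values come out close across architectures, which is why R^I_m dominates the comparison of the bounds in Theorem 6.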

5. REFLECTIONS

The proposed co-complexity measures lead to an extended perspective on generalization error by accounting for ground truth generator spaces. The objective of introducing the generator space is to account for the structure in the ground truth labelling functions. New error bounds are proposed which consider various aspects of the interaction between generator and classifier function spaces. Co-complexity can be decomposed into two separate components, invariance co-complexity (R^I_m) and dissociation co-complexity (R^D_m), which measure, respectively, the degree to which the classifier's function space obeys the invariance transformations in the data (R^I_m), and the degree to which the classifier is able to differentiate between separate categories in the data (R^D_m). If we interpret R_m(F, G) as the variance of the classifier (see Theorems 3 and 4) and the training error as the bias, we see that R^I_m and R^D_m affect the training error (bias) and generalization gap (variance) differently (see Theorem 6). Theorem 6 outlines a clear objective for reducing variance while maintaining low bias: reduce R^I_m while keeping R^D_m and R^{D,1}_m unchanged. We note that monitoring R^I_m and R^D_m can be useful for finding better learning architectures for a specific problem. Furthermore, it is also clear that for classification problems which contain more invariance constraints, R^I_m plays a greater role in controlling the generalization gap (as α will be higher). Experiments on MNIST and STL-10 reveal that the invariance co-complexity R^I_m of CNNs and their transformation-equivariant extensions is always lower than that of MLPs, while their dissociation co-complexities remain comparable.

A. APPENDIX - TIGHTENING EXISTING RESULTS

For the following results, we primarily work in the setting where the ground truth generator space contains only a single element, which pertains to the conventional treatment of generalization error bounds with Rademacher complexity.
Note that in this setting, we can still define the invariance classes of $\mathcal{G}$, which we will make use of. Furthermore, we primarily work with the correlated Rademacher complexity $R^C_m(\mathcal{F})$, which was defined in Section 2. We note that $R^C_m(\mathcal{F})$ involves two instantiations of the dataset, $S$ and $S'$, and that its value depends on how each datapoint $z_i \in S$ is paired with the corresponding datapoint $z'_i \in S'$:
$$R^C_m(\mathcal{F}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma_i\right].$$
For the following results, we define the invariance-matched correlated Rademacher complexity $R^{C_{inv}}_m(\mathcal{F})$, which chooses a pairing $(i, M(i))$ between the points in $S$ and $S'$ such that
$$M = \arg\max_{M'}\sum_i \mathbb{1}\left[z'_{M'(i)} \in I_{\mathcal{G}}(z_i)\right]. \tag{20}$$
That is, the matching function $M$ is chosen so that there is the largest number of pairings $(z_i, z'_{M(i)})$ where $z'_{M(i)}$ is present in the invariance-extended set of $z_i$. The invariance-matched correlated complexity can then be defined as
$$R^{C_{inv}}_m(\mathcal{F}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_{M(i)})\sigma_i\right], \tag{21}$$
where $M$ is chosen using equation 20. Given this, we have the following theorem.

Theorem 7. Consider the classifier function space $\mathcal{F}$ and the previously defined correlated Rademacher complexity $R^C_m(\mathcal{F})$. Furthermore, assume that the generator space contains only a single element, i.e., the conventional setting ($R_m(\mathcal{G}) = 0$). Then, for any $f \in \mathcal{F}$, with probability $p \geq 1-\delta$, we have
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R^{C_{inv}}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{22}$$

Proof. First, we will show that for any $f \in \mathcal{F}$, with probability $p \geq 1-\delta$,
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R^C_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{23}$$
This result follows trivially from Theorem 3 when $\mathcal{G}$ contains a single element, denoted $g_0$, as then
$$R_m(\mathcal{F},\mathcal{G}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g_0(z_i)g_0(z'_i)\sigma_i\right] \tag{24}$$
$$= \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma_i\right] \tag{25}$$
$$= R^C_m(\mathcal{F}).$$
Note that since $R^C_m(\mathcal{F}) \leq R_m(\mathcal{F})$, this is already a tighter global bound than the original Rademacher bound. The result then follows immediately once we recognize that changing the ordering of the $z'_i$ does not affect the value of the generalization gap in (57). Furthermore, the matching $M$ for each sampled $S, S'$ is independent of the choice of the Rademacher variables $\sigma_i$, and thus the subsequent steps follow in the same way as in (67). Consequently, we still have, with probability $p \geq 1-\delta$,
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R^{C_{inv}}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}}.$$
Note that when the invariance co-complexity of $\mathcal{F}$ w.r.t. $\mathcal{G}$ is smaller than its dissociation co-complexity, the above is expected to be a tighter bound than (23). This is because the invariance-aware matching function $M$ leads to a greater number of pairings $(z_i, z'_i)$ where $z'_i \in I_{\mathcal{G}}(z_i)$ (and thus lies in its invariance-extended set).

We additionally report the single-sample dissociation co-complexity $R^{D,1}_m(\mathcal{F},\mathcal{G})$ of the studied networks, as $R^{D,1}_m(\mathcal{F},\mathcal{G})$ also affects the training error in the generalization error bound of Theorem 6. Please note that the datasets used in our experiments (MNIST and STL-10) have 10 categories. However, since the invariance and dissociation co-complexities can only be computed for binary classifiers, we use random subsets of the data containing only two categories when computing the co-complexity values.
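To make the matching in equation 20 concrete, here is a small pure-Python sketch. It uses a greedy pass rather than an exact maximum matching, and a hypothetical membership oracle `in_invariance_class` stands in for the test $z' \in I_{\mathcal{G}}(z)$; both are illustrative assumptions.

```python
def invariance_matching(S, Sp, in_invariance_class):
    """Greedy sketch of the matching M in equation 20: pair each z_i in S
    with some unused z'_j in S' lying in its invariance-extended set
    I_G(z_i) when possible; leftover points are paired arbitrarily.
    An exact arg-max matching would use maximum bipartite matching."""
    m = len(S)
    unused = set(range(m))
    M = [None] * m
    for i in range(m):                       # invariance-respecting pairs first
        for j in sorted(unused):
            if in_invariance_class(S[i], Sp[j]):
                M[i] = j
                unused.discard(j)
                break
    for i in range(m):                       # pair the leftovers arbitrarily
        if M[i] is None:
            M[i] = unused.pop()
    return M

# toy example: the "invariance class" of a 1-D point is its sign
S, Sp = [1.0, -2.0, 3.0], [-1.5, 2.5, -0.5]
M = invariance_matching(S, Sp, lambda z, zp: (z > 0) == (zp > 0))
matched = sum((S[i] > 0) == (Sp[M[i]] > 0) for i in range(len(S)))
```

In the toy example, two of the three pairs can be matched within an invariance class; the third point is paired arbitrarily, exactly the situation the matched complexity in equation 21 exploits.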

B APPENDIX -ADDITIONAL EXPERIMENTS

Table 3 shows the results. In both datasets, the $R^{D,1}_m(\mathcal{F},\mathcal{G})$ values of the tested networks are very close to each other. This is consistent with the dissociation co-complexity $R^D_m(\mathcal{F},\mathcal{G})$ shown in Table 2 of the main paper. Together with those results, it is clear that only the invariance co-complexities $R^I_m(\mathcal{F},\mathcal{G})$ will contribute meaningfully to the differences in generalization error between these networks, as the dissociation co-complexities $R^{D,1}_m(\mathcal{F},\mathcal{G})$ and $R^D_m(\mathcal{F},\mathcal{G})$ do not show any significant variation (see, for example, Tables 1 and 2 of the main paper).

B.3 EXPERIMENT: VARYING GENERATOR SPACE COMPLEXITY

Theorem 2 implies that increasing only the complexity of $\mathcal{G}$ (by increasing $H$) should lead to a steady increase in the difference between training and testing error (i.e., the generalization gap). We execute this scenario with two variations: $m = 10$ and $m = 100$. $H$ was chosen from $\{2, 4, 8, 16, 32\}$. Note that the data distribution is kept fixed for all $H$; in this experiment, only the label-generating function is changed. We observe that the expected generalization gap steadily increases (ranging between 0.6% and 8.9%) as we increase the complexity of the generator class (for both $m = 10$ and $m = 100$). Furthermore, we find a very high degree of correlation (0.982 for $m = 10$ and 0.976 for $m = 100$) between the generalization gap trend predicted by Theorem 2 ($\approx \sqrt{H/m}$) and the empirically observed average generalization gap. This shows that the complexity of the generator class also directly controls the generalization gap when everything else is unchanged, and it highlights that when additional information about the generator space is known, more accurate trends of generalization error can be estimated by using both $R_m(\mathcal{G})$ and $R_m(\mathcal{F})$.
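A minimal simulation in the spirit of this experiment might look as follows. All settings here, the random 2-layer tanh generator network and the least-squares linear classifier, are illustrative stand-ins rather than the paper's exact setup, so the magnitude of the reported trend depends on these choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_generalization_gap(H, m=100, d=5, trials=100):
    """Average gap between test and training 0-1 error when labels come
    from a random 2-layer tanh network of width H (the generator space)
    and a linear classifier is fit by least squares."""
    gaps = []
    for _ in range(trials):
        Xtr, Xte = rng.normal(size=(m, d)), rng.normal(size=(1000, d))
        W1, W2 = rng.normal(size=(d, H)), rng.normal(size=H)  # fresh generator
        ytr = np.where(np.tanh(Xtr @ W1) @ W2 >= 0, 1.0, -1.0)
        yte = np.where(np.tanh(Xte @ W1) @ W2 >= 0, 1.0, -1.0)
        w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)         # linear classifier
        gap = np.mean(np.sign(Xte @ w) != yte) - np.mean(np.sign(Xtr @ w) != ytr)
        gaps.append(gap)
    return float(np.mean(gaps))

Hs = [2, 4, 8, 16, 32]
gaps = [avg_generalization_gap(H) for H in Hs]
corr = float(np.corrcoef(np.sqrt(Hs), gaps)[0, 1])
```

Plotting `gaps` against `np.sqrt(Hs)`, as in Figure 3, makes the predicted dependence on the generator complexity directly visible.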

C APPENDIX -PROOF OF THEORETICAL RESULTS

In what follows, we provide the proofs of Theorems 2 to 6. The proof of Theorem 1 of the main paper is available in Bartlett & Mendelson (2003). We also prove, in Lemma 2, the properties of co-complexity stated in Section 5 of the main paper, and we use the definitions provided there. Please note that, in the following proofs, a variable $z$ which follows the Rademacher distribution ($Pr(z = +1) = Pr(z = -1) = 0.5$) is referred to as a Rademacher variable.

C.1 PROOFS OF PROPERTIES P1-P3 OF CO-COMPLEXITY

We begin with the proof of the properties of co-complexity (P1-P3) and a property of the correlated Rademacher complexity, which are described in Section 5 of the main paper.

Lemma 2. For any two function spaces $\mathcal{F}$ and $\mathcal{G}$, the following statements hold:
$$R_m(\mathcal{F},\mathcal{G}) = R_m(\mathcal{G},\mathcal{F}) \tag{28}$$
$$R_m(\mathcal{F},\mathcal{G}) \geq R^C_m(\mathcal{F}) \tag{29}$$
$$R_m(\mathcal{F},\mathcal{G}) \geq R^C_m(\mathcal{G}) \tag{30}$$
$$R_m(\mathcal{F},\mathcal{G}) \leq R^C_m(\mathcal{F}) + R^C_m(\mathcal{G}) \tag{31}$$
$$R^C_m(\mathcal{F}) \leq R_m(\mathcal{F}). \tag{32}$$

Proof. (i) To prove the first statement, it suffices to see that interchanging $f$ and $g$ in the expression of $R_m(\mathcal{F},\mathcal{G})$ leaves it unchanged. Thus, $R_m(\mathcal{F},\mathcal{G}) = R_m(\mathcal{G},\mathcal{F})$.

(ii) To prove the second statement, we choose any fixed function $g_0 \in \mathcal{G}$ and define $\sigma'_i = g_0(z_i)g_0(z'_i)\sigma_i$. We then have
$$\frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f\in\mathcal{F},g\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] \geq \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m g_0(z_i)g_0(z'_i)f(z_i)f(z'_i)\sigma_i\right] \tag{33}$$
$$= \frac{1}{2}\times\mathbb{E}_{\sigma',S,S'}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma'_i\right] \tag{34}$$
$$= R^C_m(\mathcal{F}). \tag{35}$$
Here, we use the fact that $\sigma'_i$ is also a Rademacher variable.

(iii) The third statement can be proven in the same manner, fixing $f_0 \in \mathcal{F}$ instead.

(iv) The fourth statement can be proven as follows. First, note that $f(z_i)f(z'_i)$ and $g(z_i)g(z'_i)$ take values in $\{-1, 1\}$. Thus $f(z_i)f(z'_i)g(z_i)g(z'_i)$ can be expressed as $1 - |f(z_i)f(z'_i) - g(z_i)g(z'_i)| = 1 - (f(z_i)f(z'_i) - g(z_i)g(z'_i))v_i$, where $v_i \in \{-1, 1\}$ depends on $f(z_i)f(z'_i)$ and $g(z_i)g(z'_i)$ together, but is independent of each of them individually. Since $\mathbb{E}[\sigma_i] = 0$, the constant term vanishes in expectation, and we can write
$$\frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \left(1 - |f(z_i)f(z'_i) - g(z_i)g(z'_i)|\right)\sigma_i\right] \tag{36}$$
$$= \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \left(f(z_i)f(z'_i) - g(z_i)g(z'_i)\right)v_i\sigma_i\right]. \tag{37}$$
Now consider any fixed $\sigma = \{\sigma^0_1, \ldots, \sigma^0_m\}$, and fix $S$ and $S'$. First, denote
$$f_{opt}, g_{opt} = \arg\max_{f\in\mathcal{F},g\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^m \left(f(z_i)f(z'_i) - g(z_i)g(z'_i)\right)v_i\sigma^0_i \tag{38}$$
and denote the corresponding $v_i$ for $f_{opt}$ and $g_{opt}$ by $v^{opt}_i$. Note that $v^{opt}_i$ depends on $f_{opt}(z_i)f_{opt}(z'_i)$ and $g_{opt}(z_i)g_{opt}(z'_i)$, and that $v^{opt}_1, \ldots, v^{opt}_m$ are independent of each other, as $\sigma_1, \ldots, \sigma_m$ are independent. Also observe that $v^{opt}_i$ is independent of $\sigma^0_i$, as it depends only on $f_{opt}$ and $g_{opt}$. Define $\sigma'_i = v^{opt}_i\sigma^0_i$. The $\sigma'_i$ are independent across samples, with $Pr(\sigma'_i = +1) = Pr(\sigma'_i = -1) = 0.5$; thus $\sigma'_1, \ldots, \sigma'_m$ are i.i.d. Rademacher variables, like $\sigma_1, \ldots, \sigma_m$. We then note
$$\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \left(f(z_i)f(z'_i) - g(z_i)g(z'_i)\right)v_i\sigma^0_i = \frac{1}{m}\sum_{i=1}^m \left(f_{opt}(z_i)f_{opt}(z'_i) - g_{opt}(z_i)g_{opt}(z'_i)\right)v^{opt}_i\sigma^0_i \tag{39}$$
$$= \frac{1}{m}\sum_{i=1}^m \left(f_{opt}(z_i)f_{opt}(z'_i) - g_{opt}(z_i)g_{opt}(z'_i)\right)\sigma'_i \tag{40}$$
$$\leq \sup_{f,g}\frac{1}{m}\sum_{i=1}^m \left(f(z_i)f(z'_i) - g(z_i)g(z'_i)\right)\sigma'_i, \tag{41}$$
where, in the final supremum, the $\sigma'_i$ are kept fixed at their original values ($\sigma'_i = v^{opt}_i\sigma^0_i$). We then have
$$\frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] \leq \frac{1}{2}\times\mathbb{E}_{\sigma',S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \left(f(z_i)f(z'_i) - g(z_i)g(z'_i)\right)\sigma'_i\right] \tag{42}$$
$$\leq R^C_m(\mathcal{F}) + R^C_m(\mathcal{G}), \tag{43}$$
where the last step splits the supremum over the two terms and uses the fact that $-\sigma'_i$ is also a Rademacher variable.

(v) The final result, $R^C_m(\mathcal{F}) \leq R_m(\mathcal{F})$, can be shown as follows. We use the fact that $f(z_i)f(z'_i) = 1 - |f(z_i) - f(z'_i)|$:
$$R^C_m(\mathcal{F}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma_i\right] \tag{44}$$
$$= \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m \left(1 - |f(z_i) - f(z'_i)|\right)\sigma_i\right] \tag{45}$$
$$= \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m \left(f(z_i) - f(z'_i)\right)v_i\sigma_i\right]. \tag{46}$$
Here, $v_i$ takes values in $\{-1, 1\}$ depending on whether $f(z_i)$ or $f(z'_i)$ is greater. Note that as $\sigma_i$ is a Rademacher variable, $v_i\sigma_i$ is also a Rademacher variable, and $v_i$ is independent of $f(z_i)$ and $f(z'_i)$ individually. Thus we have
$$R^C_m(\mathcal{F}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m \left(f(z_i) - f(z'_i)\right)v_i\sigma_i\right] \tag{47}$$
$$\leq \frac{1}{2}\times\mathbb{E}_{\sigma',S}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m f(z_i)\sigma'_i\right] + \frac{1}{2}\times\mathbb{E}_{\sigma',S'}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m f(z'_i)(-\sigma'_i)\right], \tag{48}$$
where $\sigma'_i = v_i\sigma_i$. As $-\sigma'_i$ is also a Rademacher variable, we finally have
$$R^C_m(\mathcal{F}) \leq \frac{1}{2}\times\mathbb{E}_{\sigma',S}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m f(z_i)\sigma'_i\right] + \frac{1}{2}\times\mathbb{E}_{\sigma',S'}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m f(z'_i)\sigma'_i\right] \tag{49}$$
$$= \frac{R_m(\mathcal{F}) + R_m(\mathcal{F})}{2} \tag{50}$$
$$= R_m(\mathcal{F}). \tag{51}$$
This completes the proofs.
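The inequalities of Lemma 2 can be checked numerically for small, explicitly enumerated function spaces, computing the expectation over $\sigma$ exactly for a fixed draw of $S, S'$ (the random function tables below are arbitrary illustrations):

```python
import itertools, random

random.seed(0)
m = 4

# Each "function" is its vector of values in {-1,+1} on the 2m fixed points
# (z_1..z_m, z'_1..z'_m); F and G are small random function spaces.
F = [tuple(random.choice([-1, 1]) for _ in range(2 * m)) for _ in range(6)]
G = [tuple(random.choice([-1, 1]) for _ in range(2 * m)) for _ in range(6)]
SIGMAS = list(itertools.product([-1, 1], repeat=m))  # all 2^m sign vectors

def corr_rademacher(F):
    """R^C_m for fixed S, S': exact expectation over all sign vectors."""
    return 0.5 * sum(
        max(sum(f[i] * f[m + i] * s[i] for i in range(m)) / m for f in F)
        for s in SIGMAS) / len(SIGMAS)

def cocomplexity(F, G):
    """R_m(F, G) for fixed S, S', exact over sigma."""
    return 0.5 * sum(
        max(sum(f[i] * f[m + i] * g[i] * g[m + i] * s[i] for i in range(m)) / m
            for f in F for g in G)
        for s in SIGMAS) / len(SIGMAS)

R_FG = cocomplexity(F, G)
```

The symmetry (28), the lower bounds (29)-(30), and the upper bound (31) all hold for every such fixed $S, S'$, which this enumeration confirms directly.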

C.2 PROOF OF THEOREM 2

The following result proves a generalization error bound which incorporates $R_m(\mathcal{F})$ and $R_m(\mathcal{G})$.

Theorem 2. For $0 < \delta < 1$, with probability $p \geq 1-\delta$, we have
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R_m(\mathcal{F}) + R_m(\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{52}$$

Proof. First, we reiterate that $\mathrm{err}_S(f) = \sum_{i=1}^m \left(1 - f(z_i)g_{truth}(z_i)\right)/2m$ for some $g_{truth} \in \mathcal{G}$, and similarly for $\mathrm{err}_P(f)$, where $m \to \infty$. Then, retracing the steps of the original proof in Bartlett & Mendelson (2003), we note that
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + \sup_{f\in\mathcal{F},\,g\in\mathcal{G},\,g(S)=g_{truth}(S)}\left(\mathrm{err}_P(f) - \mathrm{err}_S(f)\right) \leq \mathrm{err}_S(f) + \sup_{f\in\mathcal{F},\,g\in\mathcal{G}}\left(\mathrm{err}_P(f) - \mathrm{err}_S(f)\right). \tag{53}$$
Here we initially only consider functions $g \in \mathcal{G}$ such that $g(z_i) = g_{truth}(z_i)$ for $i = 1, \ldots, m$. Notice that, differently from Bartlett & Mendelson (2003), the supremum eventually considers the generator space $\mathcal{G}$ in addition to the classifier's function space $\mathcal{F}$. We define $\phi(S) = \sup_{f\in\mathcal{F},g\in\mathcal{G}}(\mathrm{err}_P(f) - \mathrm{err}_S(f))$ and notice that
$$\sup_{z_1,\ldots,z_m,\,z'_i\in Z}\left|\phi(z_1, \ldots, z_i, \ldots, z_m) - \phi(z_1, \ldots, z'_i, \ldots, z_m)\right| \leq \frac{1}{m}. \tag{54}$$
This allows the application of McDiarmid's inequality (McDiarmid (1989)) to obtain concentration bounds for $\phi(S)$, which results in the following bound, with probability at least $1-\delta$:
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + \mathbb{E}_S\left[\sup_{f\in\mathcal{F},g\in\mathcal{G}}\left(\mathrm{err}_P(f) - \mathrm{err}_S(f)\right)\right] + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{55}$$
Finally, we bound $\mathbb{E}_S[\phi(S)]$ as follows:
$$\mathbb{E}_S\left[\sup_{f,g}\left(\mathrm{err}_P(f) - \mathrm{err}_S(f)\right)\right] = \mathbb{E}_S\left[\sup_{f,g}\mathbb{E}_{S'}\left[\mathrm{err}_{S'}(f) - \mathrm{err}_S(f)\,\middle|\,S\right]\right] \tag{57}$$
$$= \mathbb{E}_S\left[\sup_{f,g}\mathbb{E}_{S'}\left[\frac{1}{m}\sum_{i=1}^m \left(\frac{1 - f(z'_i)g(z'_i)}{2} - \frac{1 - f(z_i)g(z_i)}{2}\right)\middle|\,S\right]\right] \tag{58}$$
$$\leq \mathbb{E}_{S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{f(z_i)g(z_i) - f(z'_i)g(z'_i)}{2}\right]. \tag{59}$$
The last inequality (Jensen's inequality; Cover & Thomas (2012)) is applicable because $\sup$ is a convex function. We proceed by multiplying the terms with Rademacher variables $\sigma_i$, which take values in $\{-1, 1\}$ with equal probability; note that $\mathbb{E}[\sigma_i] = 0$. This keeps the expected value unchanged, as $\sigma_i$ simply controls whether $z_i$ and $z'_i$ belong to $S$ and $S'$ ($\sigma_i = 1$) or to $S'$ and $S$ ($\sigma_i = -1$). In what follows, we also express $f(z_i)g(z_i) = 1 - |f(z_i) - g(z_i)|$, which is possible because both take values in $\{-1, 1\}$:
$$\mathbb{E}_{S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{f(z_i)g(z_i) - f(z'_i)g(z'_i)}{2}\right] = \mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{\left(f(z_i)g(z_i) - f(z'_i)g(z'_i)\right)\sigma_i}{2}\right] \tag{60}$$
$$\leq \mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{|f(z_i) - g(z_i)|\sigma_i}{2}\right] + \mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{|f(z'_i) - g(z'_i)|\sigma_i}{2}\right]. \tag{61}$$
Now, note that $|f(z_i) - g(z_i)|\sigma_i = \left(f(z_i) - g(z_i)\right)v_i\sigma_i$, where $v_i$ takes values in $\{-1, 1\}$. Since $\sigma_i$ is a Rademacher variable, $\sigma'_i = v_i\sigma_i$ is also a Rademacher variable, and $v_i$ is independent of $f(z_i)$ and $g(z_i)$ individually. Thus,
$$\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{|f(z_i) - g(z_i)|\sigma_i}{2}\right] \leq \mathbb{E}_{\sigma',S,S'}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m \frac{f(z_i)\sigma'_i}{2}\right] + \mathbb{E}_{\sigma',S,S'}\left[\sup_{g}\frac{1}{m}\sum_{i=1}^m \frac{g(z_i)\sigma'_i}{2}\right] \tag{62}$$
$$= \frac{R_m(\mathcal{F}) + R_m(\mathcal{G})}{2}. \tag{63}$$
The same bound can be derived for the term involving $S'$. Thus, finally, we can bound $\mathbb{E}[\phi(S)]$ as
$$\mathbb{E}[\phi(S)] \leq \frac{R_m(\mathcal{F}) + R_m(\mathcal{G})}{2} + \frac{R_m(\mathcal{F}) + R_m(\mathcal{G})}{2} = R_m(\mathcal{F}) + R_m(\mathcal{G}). \tag{64}$$
This leads to the final form of the generalization error bound: with probability at least $1-\delta$, we have
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R_m(\mathcal{F}) + R_m(\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}.$$

C.3 PROOF OF THEOREM 3

The following statements propose the generalization error bound in terms of co-complexity.

Theorem 3. For $0 < \delta < 1$, with probability $p \geq 1-\delta$, we have
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R_m(\mathcal{F},\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{66}$$

Proof. We proceed in the same way as in Appendix C.2, until we arrive at (57). We then note that we can write $f(z_i)g(z_i) - f(z'_i)g(z'_i) = \left(1 - f(z_i)g(z_i)f(z'_i)g(z'_i)\right)v_i$, for some binary-valued variable $v_i \in \{-1, 1\}$. Multiplying by the original Rademacher variables $\sigma_i$, we derive the following set of bounds:
$$\mathbb{E}_S\left[\sup_{f,g}\left(\mathrm{err}_P(f) - \mathrm{err}_S(f)\right)\right] \leq \mathbb{E}_{S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{f(z_i)g(z_i) - f(z'_i)g(z'_i)}{2}\right] \tag{67}$$
$$= \mathbb{E}_{S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{\left(1 - f(z_i)g(z_i)f(z'_i)g(z'_i)\right)v_i}{2}\right] \tag{68}$$
$$= \mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{\left(1 - f(z_i)g(z_i)f(z'_i)g(z'_i)\right)v_i\sigma_i}{2}\right] \tag{69}$$
$$\leq \mathbb{E}_{\sigma',S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m \frac{f(z_i)g(z_i)f(z'_i)g(z'_i)\sigma'_i}{2}\right]. \tag{70}$$
The last step follows from the fact that $\mathbb{E}[\sigma_i] = 0$ and the assumption that the $\sigma'_i = v_i\sigma_i$ are also i.i.d. Rademacher variables. This assumption will hold for most well-behaved classifiers, for which fewer than half of the terms in (68) are greater than zero (applicable for classifiers whose $R_m(\mathcal{F})$ is not too large). As
$$\frac{1}{2}\times\mathbb{E}_{\sigma',S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)g(z_i)f(z'_i)g(z'_i)\sigma'_i\right] = R_m(\mathcal{F},\mathcal{G}),$$
we have the final generalization bound: with probability $\geq 1-\delta$,
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R_m(\mathcal{F},\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{71}$$

The above theorem leads to the following corollary.

Corollary 3.1. Consider the hypothetical case where the roles are reversed, i.e., the generator function space $\mathcal{G}$ is used to fit the labels, which are now generated by a function $f \in \mathcal{F}$. For any function $g \in \mathcal{G}$, for $0 < \delta < 1$, with probability $p \geq 1-\delta$, we have
$$\mathrm{err}_P(g) \leq \mathrm{err}_S(g) + R_m(\mathcal{F},\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{72}$$

Proof. This is a direct outcome of the fact that $R_m(\mathcal{F},\mathcal{G}) = R_m(\mathcal{G},\mathcal{F})$. When the data labels in $S$ are generated by functions in $\mathcal{F}$ rather than $\mathcal{G}$, we simply have, with probability $p \geq 1-\delta$,
$$\mathrm{err}_P(g) \leq \mathrm{err}_S(g) + R_m(\mathcal{G},\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}} \tag{73}$$
$$= \mathrm{err}_S(g) + R_m(\mathcal{F},\mathcal{G}) + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{74}$$
Note that here the labels in $S$ are generated by a certain $f \in \mathcal{F}$. This completes the proof.
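The sign identities used in this proof (and in Lemma 2) can be verified exhaustively over $\{-1, +1\}$:

```python
from itertools import product

# identity used throughout: for a, b in {-1, +1}, a*b == 1 - |a - b|
for a, b in product([-1, 1], repeat=2):
    assert a * b == 1 - abs(a - b)

# identity from the proof of Theorem 3: with p = f(z)g(z) and
# q = f(z')g(z'), the difference p - q equals (1 - p*q) * v for some
# v in {-1, +1} (v depends on p and q jointly)
def v_for(p, q):
    for v in (-1, 1):
        if p - q == (1 - p * q) * v:
            return v
    return None

checks = [v_for(p, q) is not None for p, q in product([-1, 1], repeat=2)]
```

Since both identities are exhaustive over four cases each, this check is a complete (if tiny) proof of the algebra behind equations (67)-(70).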

C.4 PROOF OF THEOREM 4

The following result outlines the lower bound on the error function.

Theorem 4. For $0 < \delta < 1$, with probability $p \geq 1-\delta$,
$$\mathrm{err}_P(f) \geq \mathrm{err}_S(f) - R_m(\mathcal{F},\mathcal{G}) - \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{75}$$

Proof. First, we note that
$$\mathrm{err}_P(f) \geq \mathrm{err}_S(f) - \sup_{f\in\mathcal{F},g\in\mathcal{G}}\left(\mathrm{err}_S(f) - \mathrm{err}_P(f)\right). \tag{76}$$
Next, we denote
$$\phi(S) = \sup_{f\in\mathcal{F},g\in\mathcal{G}}\left(\mathrm{err}_S(f) - \mathrm{err}_P(f)\right). \tag{77}$$
Proceeding as in Theorems 2 and 3, we can show that $\mathbb{E}_S[\phi(S)] \leq R_m(\mathcal{F},\mathcal{G})$, and therefore, with probability $p \geq 1-\delta$,
$$\mathrm{err}_P(f) \geq \mathrm{err}_S(f) - R_m(\mathcal{F},\mathcal{G}) - \sqrt{\frac{\log(1/\delta)}{2m}}.$$
This completes the proof.

C.5 PROOF OF THEOREM 5

Now, we outline a set of results pertaining to the joint-entropy-like behaviour of $R_m(\mathcal{F},\mathcal{G})$.

Theorem 5. We are given the mutual complexity measure $I_m(\mathcal{F},\mathcal{G})$ as defined in co-complexity property P4 (Section 5 of the main paper). We consider an alternative ground truth generator space $\mathcal{G}'$ such that $I_m(\mathcal{G}',\mathcal{G}) = 0$, i.e., $\mathcal{G}$ and $\mathcal{G}'$ are independent. Then we have
$$R_m(\mathcal{F}) \geq I_m(\mathcal{F},\mathcal{G}) + I_m(\mathcal{F},\mathcal{G}'). \tag{82}$$

Proof. To prove the above, it is sufficient to prove the following:
$$R_m(\mathcal{F},\mathcal{G}) + R_m(\mathcal{F},\mathcal{G}') \geq R^C_m(\mathcal{F}) + R^C_m(\mathcal{G}) + R^C_m(\mathcal{G}'). \tag{83}$$
We elaborate on the left-hand side of this inequality as follows:
$$R_m(\mathcal{F},\mathcal{G}) + R_m(\mathcal{F},\mathcal{G}') = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] + \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g'}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g'(z_i)g'(z'_i)\sigma_i\right]. \tag{84}$$
We introduce i.i.d. Rademacher variables $\{\epsilon_1, \ldots, \epsilon_m\}$ by multiplying them with the terms within $R_m(\mathcal{F},\mathcal{G}')$. As $\mathbb{E}[\epsilon_i] = 0\ \forall i$, this maneuver does not change the value of the expectation:
$$R_m(\mathcal{F},\mathcal{G}) + R_m(\mathcal{F},\mathcal{G}') = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] + \frac{1}{2}\times\mathbb{E}_{\sigma,\epsilon,S,S'}\left[\sup_{f,g'}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g'(z_i)g'(z'_i)\sigma_i\epsilon_i\right] \tag{85}$$
$$\geq \frac{1}{2}\times\mathbb{E}_{\sigma,\epsilon,S,S'}\left[\sup_{f,g,g'}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\left(g(z_i)g(z'_i) + \epsilon_i g'(z_i)g'(z'_i)\right)\sigma_i\right]. \tag{86}$$
Now note that one can re-express $g(z_i)g(z'_i) + \epsilon_i g'(z_i)g'(z'_i)$ as $\left(1 + \epsilon_i g(z_i)g(z'_i)g'(z_i)g'(z'_i)\right)v_i$, where $v_i \in \{-1, 1\}\ \forall i$.
Note that the value of $v_i$ depends on the values of $g(z_i), g(z'_i), g'(z_i), g'(z'_i)$ and $\epsilon_i$; however, $v_i$ is ultimately independent of each of these terms individually. We have
$$R_m(\mathcal{F},\mathcal{G}) + R_m(\mathcal{F},\mathcal{G}') \geq \frac{1}{2}\times\mathbb{E}_{\sigma,\epsilon,S,S'}\left[\sup_{f,g,g'}\left(\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma_i v_i + \frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)g'(z_i)g'(z'_i)\epsilon_i\sigma_i v_i\right)\right]. \tag{87}$$
For the next step, we recognize that $\sigma'_i = v_i\sigma_i$ is itself a Rademacher variable in the first term on the right-hand side, as $v_i$ is independent of $f(z_i)$ and $f(z'_i)$. For fixed values of $\sigma, \epsilon, S$ and $S'$, we then consider $f^* = \arg\max_{f\in\mathcal{F}}\frac{1}{m}\sum_i f(z_i)f(z'_i)\sigma'_i$. Then, we can compute a lower bound for the expression in (87) as follows:
$$R_m(\mathcal{F},\mathcal{G}) + R_m(\mathcal{F},\mathcal{G}') \geq \frac{1}{2}\times\mathbb{E}_{\sigma',S,S'}\left[\frac{1}{m}\sum_{i=1}^m f^*(z_i)f^*(z'_i)\sigma'_i\right] + \frac{1}{2}\times\mathbb{E}_{\sigma,\epsilon,S,S'}\left[\sup_{g,g'}\frac{1}{m}\sum_{i=1}^m f^*(z_i)f^*(z'_i)g(z_i)g(z'_i)g'(z_i)g'(z'_i)\epsilon_i\sigma_i v_i\right] \tag{88}$$
$$\geq R^C_m(\mathcal{F}) + \frac{1}{2}\times\mathbb{E}_{\sigma,\epsilon,S,S'}\left[\sup_{g,g'}\frac{1}{m}\sum_{i=1}^m g(z_i)g(z'_i)g'(z_i)g'(z'_i)\left(\epsilon_i\sigma_i v_i f^*(z_i)f^*(z'_i)\right)\right]. \tag{89}$$
It is important to note here that the variable $\epsilon_i\sigma_i v_i f^*(z_i)f^*(z'_i)$ is a Rademacher variable, but it is not yet independent of the other terms multiplying it: $\epsilon_i, \sigma_i, f^*(z_i)$ and $f^*(z'_i)$ are all independent of $g(z_i), g(z'_i), g'(z_i)$ and $g'(z'_i)$, but the same cannot be said of $v_i$. To resolve this issue, we note that the product $g(z_i)g(z'_i)g'(z_i)g'(z'_i)$ can itself be expressed as another function $g''(z_i,z'_i) = g(z_i)g(z'_i)g'(z_i)g'(z'_i)$, an instance of another function space $\mathcal{G}''$, and the second term in (89) can then be re-expressed as
$$\frac{1}{2}\times\mathbb{E}_{\sigma,\epsilon,S,S'}\left[\sup_{g''\in\mathcal{G}''}\frac{1}{m}\sum_{i=1}^m g''(z_i,z'_i)\left(\epsilon_i\sigma_i v_i f^*(z_i)f^*(z'_i)\right)\right]. \tag{90}$$
Note that although $v_i$ was not independent of $g(z_i), g(z'_i), g'(z_i), g'(z'_i)$, it is independent of the product $g(z_i)g(z'_i)g'(z_i)g'(z'_i)$, and thus independent of $g''(z_i,z'_i)$. We can therefore treat the variables $\{\sigma''_i = \epsilon_i\sigma_i v_i f^*(z_i)f^*(z'_i)\ |\ i = 1, 2, \ldots, m\}$ as i.i.d. Rademacher variables, which gives
$$R_m(\mathcal{F},\mathcal{G}) + R_m(\mathcal{F},\mathcal{G}') \geq R^C_m(\mathcal{F}) + \frac{1}{2}\times\mathbb{E}_{\sigma'',S,S'}\left[\sup_{g\in\mathcal{G},g'\in\mathcal{G}'}\frac{1}{m}\sum_{i=1}^m g(z_i)g(z'_i)g'(z_i)g'(z'_i)\sigma''_i\right] \tag{91}$$
$$= R^C_m(\mathcal{F}) + R_m(\mathcal{G},\mathcal{G}') \tag{92}$$
$$= R^C_m(\mathcal{F}) + R^C_m(\mathcal{G}) + R^C_m(\mathcal{G}'). \tag{93}$$
The last step follows from the fact that $I_m(\mathcal{G}',\mathcal{G}) = 0$, that is, the function spaces $\mathcal{G}$ and $\mathcal{G}'$ are independent (i.e., $R_m(\mathcal{G},\mathcal{G}') = R^C_m(\mathcal{G}) + R^C_m(\mathcal{G}')$). This concludes our proof.

We now prove a corollary to Theorem 5, where we discuss the implications of Theorem 5 for the generalization error bound in Theorem 3.

Corollary 5.1. Given a function space $\mathcal{G}'$ such that $I_m(\mathcal{G}',\mathcal{G}) = 0$, suppose that $R_m(\mathcal{F}) = I_m(\mathcal{F},\mathcal{G}) + I_m(\mathcal{F},\mathcal{G}') + \epsilon$ for some $\epsilon \geq 0$. Then, with probability $p \geq 1-\delta$, we have
$$\mathrm{err}_P(f) \leq \mathrm{err}_S(f) + R_m(\mathcal{G}) + I_m(\mathcal{F},\mathcal{G}') + \epsilon + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{94}$$

Proof. This is a direct consequence of the fact that $R_m(\mathcal{F}) \geq I_m(\mathcal{F},\mathcal{G}) + I_m(\mathcal{F},\mathcal{G}')$, combined with the generalization bound in Theorem 3.

We next prove the decomposition of the co-complexity into its invariance and dissociation components, which is used in Theorem 6.

Proof. As the functions are Rademacher smooth, we can decompose the expectation over $S, S'$ into the cases where $S' \in I_{\mathcal{G}}(S)$ and the cases where $S' \notin I_{\mathcal{G}}(S)$. We begin from (67) as follows.
$$R_m(\mathcal{F},\mathcal{G}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S'}\left[\sup_{f\in\mathcal{F},g\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^m f(z_i)g(z_i)f(z'_i)g(z'_i)\sigma_i\right]. \tag{96}$$
Then, using the Rademacher smoothness constraint, we have for a certain $0 \leq \alpha \leq 1$:
$$R_m(\mathcal{F},\mathcal{G}) = \alpha\,\frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\in I_{\mathcal{G}}(S)}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] + (1-\alpha)\,\frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\notin I_{\mathcal{G}}(S)}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] \tag{97}$$
$$= \alpha\,\frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\in I_{\mathcal{G}}(S)}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma_i\right] + (1-\alpha)\,\frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\notin I_{\mathcal{G}}(S)}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] \tag{98}$$
$$\leq \alpha\,\frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\in I_{\mathcal{G}}(S)}\left[\sup_{f}\frac{1}{m}\sum_{i=1}^m \left(1 - f(z_i)f(z'_i)\right)\sigma_i\right] + (1-\alpha)\left(R^D_m(\mathcal{F},\mathcal{G}) + R^{C,D}_m(\mathcal{G})\right) \tag{99}$$
$$\leq \alpha\,\mathbb{E}_{S,S',S'\in I_{\mathcal{G}}(S)}\left[\sup_{f}\frac{1}{2m}\sum_{i=1}^m \left(1 - f(z_i)f(z'_i)\right)\right] + (1-\alpha)\left(R^D_m(\mathcal{F},\mathcal{G}) + R^{C,D}_m(\mathcal{G})\right) \tag{100}$$
$$= \alpha R^I_m(\mathcal{F},\mathcal{G}) + (1-\alpha)\left(R^D_m(\mathcal{F},\mathcal{G}) + R^{C,D}_m(\mathcal{G})\right). \tag{101}$$
In (98) we use the fact that $g(z_i)g(z'_i) = 1$ whenever $z'_i$ lies in the invariance class of $z_i$, and in (99) we use
$$\frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\notin I_{\mathcal{G}}(S)}\left[\sup_{f,g}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)g(z_i)g(z'_i)\sigma_i\right] \leq R^D_m(\mathcal{F},\mathcal{G}) + R^{C,D}_m(\mathcal{G}),$$
which follows trivially from a derivation similar to that of co-complexity property P3 in Lemma 2 ($R_m(\mathcal{F},\mathcal{G}) \leq R^C_m(\mathcal{F}) + R^C_m(\mathcal{G})$). This completes the proof.

Using this, we proceed to the proof of Theorem 6, where we re-express the generalization error bound in terms of the invariance and dissociation co-complexities, including bounds on the training error. For purposes of simplification, we denote $R^I_m(\mathcal{F},\mathcal{G})$ as $R^I_m$, $R^D_m(\mathcal{F},\mathcal{G})$ as $R^D_m$, and $R^{D,1}_m(\mathcal{F},\mathcal{G})$ as $R^{D,1}_m$.

Theorem 6. Consider function spaces $\mathcal{F}$ and $\mathcal{G}$, such that $\mathcal{F}$ is Rademacher smooth w.r.t. $\mathcal{G}$ and $\mathcal{F}$ contains the constant functions $f(z) = c\ \forall z$, $c \in \{-1, 1\}$. Let $\overline{\mathrm{err}}_P$ denote the generalization error of the functions $f \in \mathcal{F}$ which showcase the best fit on the training data samples. For some real constants $0 \leq \alpha, \beta \leq 1$ and $0 < \delta < 1$, with probability $p \geq 1-\delta$, we have
$$\overline{\mathrm{err}}_P \leq (1-\beta)\left(\frac{1}{2} - R^{D,1}_m\right) + \alpha R^I_m + (1-\alpha)\left(R^D_m + R^{C,D}_m(\mathcal{G})\right) + \sqrt{\frac{2\log(1/\delta)}{m}}. \tag{102}$$
Note: recall that $R^{C,D}_m(\mathcal{G})$ is defined in Lemma 2.

Proof.
We note that $\overline{\mathrm{err}}_P$ is the generalization error for the case where only the best-fitting function $f_{opt} = \arg\min_{f\in\mathcal{F}}\mathrm{err}_S(f)$ is chosen for each training data set $S$, averaged over all possible ground truth functions $g \in \mathcal{G}$. Note that the labels in $S$ are subject to change, depending on the choice of $g$. Thus, for fixed datapoints $z_1, z_2, \ldots, z_m \in S$, (53) changes to
$$\overline{\mathrm{err}}_P \leq \mathbb{E}_{g\in\mathcal{G}}\left[\mathrm{err}_S(f_{opt})\right] + \mathbb{E}_{g\in\mathcal{G}}\left[\mathrm{err}_P(f_{opt}) - \mathrm{err}_S(f_{opt})\right] \tag{103}$$
$$\leq \mathbb{E}_{g\in\mathcal{G}}\left[\mathrm{err}_S(f_{opt})\right] + \sup_{g\in\mathcal{G}}\left(\mathrm{err}_P(f_{opt}) - \mathrm{err}_S(f_{opt})\right) \tag{104}$$
$$\leq \mathbb{E}_{g\in\mathcal{G}}\left[\mathrm{err}_S(f_{opt})\right] + \sup_{f\in\mathcal{F},g\in\mathcal{G}}\left(\mathrm{err}_P(f) - \mathrm{err}_S(f)\right). \tag{105}$$
Applying McDiarmid's inequality to both terms (each satisfies the bounded-difference property with constant $1/m$), we obtain, with probability $p \geq 1-\delta$,
$$\overline{\mathrm{err}}_P \leq \mathbb{E}_{S,g\in\mathcal{G}}\left[\mathrm{err}_S(f_{opt})\right] + \mathbb{E}_S\left[\sup_{f\in\mathcal{F},g\in\mathcal{G}}\left(\mathrm{err}_P(f) - \mathrm{err}_S(f)\right)\right] + \sqrt{\frac{2\log(1/\delta)}{m}}. \tag{108}$$
We already have that $\mathbb{E}_S[\sup_{f,g}(\mathrm{err}_P(f) - \mathrm{err}_S(f))] \leq R_m(\mathcal{F},\mathcal{G}) \leq \alpha R^I_m + (1-\alpha)(R^D_m + R^{C,D}_m(\mathcal{G}))$ for a certain $0 \leq \alpha \leq 1$ (Lemma 2). The first term (expected training error) can be expressed as
$$\mathbb{E}_{S,g\in\mathcal{G}}\left[\mathrm{err}_S(f_{opt})\right] = \frac{1}{2}\left(1 - \mathbb{E}_{S,g\in\mathcal{G}}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_i f(z_i)g(z_i)\right]\right). \tag{109}$$
As iterated in the main statement of this theorem, we assume that the classifier function space is complex enough to allow an error rate $\leq 50\%$ (chance level for binary classification) on the training data itself. This indicates that we can safely assume $\sup_f \frac{1}{m}\sum_i f(z_i)g(z_i) \geq 0$. We then have
$$\left(\sup_{f}\frac{1}{m}\sum_i f(z_i)g(z_i)\right)^2 = \sup_{f}\frac{1}{m^2}\sum_{i,j} f(z_i)g(z_i)f(z_j)g(z_j) \tag{110}$$
$$= \sup_{f}\frac{1}{m^2}\sum_{i,j} f(z_i)g(z_i)f(z'')g(z'')f(z'')g(z'')f(z_j)g(z_j) \tag{111}$$
$$= \sup_{f}\left(\frac{1}{m}\sum_i f(z_i)g(z_i)f(z'')g(z'')\right)^2, \tag{112}$$
where we make use of the fact that $f(z'')g(z'')f(z'')g(z'') = 1$, and the datapoint $z''$ is sampled from the same distribution $P$. Thus we have
$$\mathbb{E}_{S,z'',g\in\mathcal{G}}\left[\sup_{f}\frac{1}{m}\sum_i f(z_i)g(z_i)\right] \geq \mathbb{E}_{S,z'',g\in\mathcal{G}}\left[\sup_{f}\frac{1}{m}\sum_i f(z_i)g(z_i)f(z'')g(z'')\right]. \tag{113}$$
As before, we assume that the function space $\mathcal{F}$ is Rademacher smooth, i.e., for a certain $0 \leq \beta \leq 1$ we can decompose the expectation into the case where $S$ lies in the invariance-extended set of $z''$ and the case where it does not:
$$\mathbb{E}_{S,z'',g}\left[\sup_{f}\frac{1}{m}\sum_i f(z_i)g(z_i)f(z'')g(z'')\right] \geq \beta\,\mathbb{E}_{S,z'',S\in I_{\mathcal{G}}(z''),g}\left[\sup_{f}\frac{1}{m}\sum_i f(z_i)f(z'')\right] + (1-\beta)\,\mathbb{E}_{S,z'',S\notin I_{\mathcal{G}}(z''),\sigma}\left[\sup_{f}\frac{1}{m}\sum_i f(z_i)f(z'')\sigma_i\right]. \tag{114}$$
The last step involves the reasonable assumption that the function space is, on average, worse at fitting random noise labels than a labelling provided by an element of the generator space $\mathcal{G}$. Finally, we note that $\mathbb{E}_{S,z'',S\in I_{\mathcal{G}}(z''),g}[\sup_f \frac{1}{m}\sum_i f(z_i)f(z'')] = 1$, simply by choosing the constant function $f(z) = c$, which is assumed to be present within the classifier's function space $\mathcal{F}$. Also, note that $\mathbb{E}_{S,z'',S\notin I_{\mathcal{G}}(z''),\sigma}[\sup_f \frac{1}{m}\sum_i f(z_i)f(z'')\sigma_i] = 2R^{D,1}_m(\mathcal{F},\mathcal{G})$. Combining these with (114) yields the bound stated in (117), and substituting into (109) and (108) gives the claimed result. This completes the proof of Theorem 6.

The proof of the main proposition (Proposition 1, stated below) immediately follows from an application of Massart's finite class lemma, which states that for a finite subset $K \subset \mathbb{R}^m$ with $r = \max_{x\in K}\|x\|_2$, and Rademacher variables $\sigma_1, \ldots, \sigma_m$, we have
$$\mathbb{E}_\sigma\left[\sup_{x\in K}\sum_{i=1}^m x_i\sigma_i\right] \leq r\sqrt{2\log|K|}.$$
The cardinality of the finite subset $K$ is upper bounded by the growth function $\Pi_{\mathcal{F}}(m)$ when $K$ is the set of all possible function outputs from functions $f \in \mathcal{F}$ on any $m$ points. In our case, as
$$\tilde{R}^I_m(\mathcal{F},\mathcal{G}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\in I_{\mathcal{G}}(S)}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma_i\right],$$
we can directly apply Massart's lemma using the invariance-constrained growth function $\Pi^I_{\mathcal{F}|\mathcal{G}}(m)$. First, we choose the finite subset $K$ such that it contains all possible output vectors of the products $\left(f(z_1)f(z'_1), \ldots, f(z_m)f(z'_m)\right)$ when $S, S'$ are chosen such that each element of $S'$ lies in the invariance-extended subset of the corresponding element of $S$. For this $K$, the cardinality $|K|$ is bounded above by the previously defined growth function $\Pi^I_{\mathcal{F}|\mathcal{G}}(m)$, and $r \leq \sqrt{m}$ since all entries are in $\{-1, 1\}$. This immediately leads to the first result,
$$\tilde{R}^I_m(\mathcal{F},\mathcal{G}) \leq \frac{1}{2m}\sqrt{m}\sqrt{2\log\Pi^I_{\mathcal{F}|\mathcal{G}}(m)} = \sqrt{\frac{\log\Pi^I_{\mathcal{F}|\mathcal{G}}(m)}{2m}}.$$
This concludes the proof.
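To illustrate how an invariance co-complexity of a classifier space can be estimated numerically, here is a small sketch with random linear classifiers in 2-D and rotation as the invariance transformation. Both choices are illustrative assumptions; the paper uses randomly initialized networks and image transformations.

```python
import numpy as np

rng = np.random.default_rng(2)

def rotate(X, degrees):
    t = np.deg2rad(degrees)
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return X @ R.T

def inv_cocomplexity(X, transform, n_fns=500):
    """Sketch of R^I_m: S' holds transformed copies of S (so each z'_i lies
    in the invariance-extended set of z_i), and the supremum over the
    classifier space is approximated by random linear classifiers."""
    Xp = transform(X)
    W = rng.normal(size=(n_fns, X.shape[1]))
    FX = np.where(X @ W.T >= 0, 1.0, -1.0)    # f(z_i)
    FXp = np.where(Xp @ W.T >= 0, 1.0, -1.0)  # f(z'_i)
    # sup_f (1/(2m)) sum_i (1 - f(z_i) f(z'_i))
    return float(np.max(np.mean(1.0 - FX * FXp, axis=0)) / 2.0)

X = rng.normal(size=(256, 2))
r_small = inv_cocomplexity(X, lambda Z: rotate(Z, 5.0))   # mild transform
r_large = inv_cocomplexity(X, lambda Z: rotate(Z, 90.0))  # strong transform
```

A classifier space that respects the invariance keeps this value small; for linear classifiers, a 5-degree rotation flips far fewer predictions than a 90-degree one, so `r_small` should come out below `r_large`.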

D NEURAL NETWORK ARCHITECTURE DETAILS

Here we present the architectures that were used in our experiments. As mentioned in Section 4, four different architectures were tested; Table 4 shows the layer-wise details of each architecture. Note that all architectures share a similar parameter count and, as a result, also exhibit a similar dissociation co-complexity (see Tables 2 and 3). The input to all these architectures is an image of size 28 × 28 (the same size as in the MNIST dataset). For STL-10, we resized the input (96 × 96) to 28 × 28 and used the same architectures for consistency.



Figure 1 illustrates the role of both the generator and classifier spaces in generalization via the two example scenarios discussed earlier. In both examples, the same low-complexity classifier has a good fit on the training data, so $R_m(\mathcal{F})$ is low. In example (a), the LGF has no constraints, while in example (b), the LGF has constraints such as local smoothness. In example (a), the classifier can therefore still generalize poorly despite the good training fit, since the values of an unconstrained LGF on unseen points are largely independent of its values on the training data.

Figure 1: Two examples where the same low-complexity classifier shows similar fit on training samples, but in (a) the LGF has no constraints (large generator space) and in (b) the LGF is constrained by smoothness and other invariances (small generator space).

B.1 EXPERIMENT: MEASURING DISSOCIATION CO-COMPLEXITY ($R^{D,1}_m(\mathcal{F},\mathcal{G})$)

In the main paper, we measure the invariance co-complexities $R^I_m(\mathcal{F},\mathcal{G})$ and the dissociation co-complexities $R^D_m(\mathcal{F},\mathcal{G})$ of four different network types: Multi-Layer Perceptron (MLP), vanilla CNN (CNN), Scale-Equivariant CNN (SE-CNN; Sosnovik et al. (2019)) and Rotation-Equivariant CNN (RE-CNN; Cohen & Welling (2016b)). In this experiment, we measure the single-sample dissociation co-complexity $R^{D,1}_m(\mathcal{F},\mathcal{G})$ of the same networks.

For example, from Tables 1 and 2 of the main paper, we see that the invariance co-complexity (rotation) of the RE-CNN (0.357) is smaller than that of the MLP (0.424), whereas the $R^{D,1}_m(\mathcal{F},\mathcal{G})$ (0.4492 vs. 0.4472) and $R^D_m(\mathcal{F},\mathcal{G})$ (0.438 vs. 0.4415) of these two networks do not show much variation. The $R^{D,1}_m(\mathcal{F},\mathcal{G})$ values on STL-10 (0.4352, 0.4465, 0.4405 and 0.4570 across the four networks) likewise lie close to each other.

Table 3: Mean dissociation co-complexity values for all networks. Note that CNN and its variants maintain the dissociation co-complexity at a similar level to that of MLP.

B.2 EXPERIMENT: MEASURING INVARIANCE CO-COMPLEXITY FOR DIFFERENT TRANSFORMATION PARAMETER CHOICES

In addition to the experiments which estimate the overall invariance co-complexity of the various network baselines, we also show how $R^I_m$ depends on the exact transformation parameter $t$ in each case, in Fig. 2. Note that equivariant networks usually have a low $R^I_m$ for most values of $t$, except for the RE-CNN, which is tuned to be invariant to rotations of 90°, and therefore only shows significant drops in $R^I_m$ when $t$ is near 90° or 180°.

Figure 2: Invariance co-complexity values for the relevant networks on MNIST, when the datapoint pairs $S$ and $S'$ are separated by varying degrees of rotational, scale, shear and translational shifts.

Figure 3: Plots depicting the average generalization gap (difference between test and training error) when 2-layer neural networks with a variable number of hidden neurons $H$ are chosen as the label-generating function, and a single-layer neural network is chosen as the classifier. The generalization gap trend is plotted for two scenarios, (a) $m = 10$ and (b) $m = 100$, where $m$ is the number of training examples. As $R_m(\mathcal{F})$ is fixed here, but $R_m(\mathcal{G}) \propto \sqrt{H}$ changes with $H$, we plot the average generalization gap against $\sqrt{H}$. Note that the trend is linear in both cases (correlation $\approx 0.98$ for both), showing that the generalization gap depends directly on the complexity of the generator space.

$$\mathbb{E}_{S,z'',g\in\mathcal{G}}\left[\sup_{f}\frac{1}{m}\sum_i f(z_i)g(z_i)f(z'')g(z'')\right] \geq \beta + 2(1-\beta)R^{D,1}_m(\mathcal{F},\mathcal{G}). \tag{117}$$
This leads to the final upper bound for $\overline{\mathrm{err}}_P$ by replacing the terms in (108) with the above: with probability $p \geq 1-\delta$, we then have
$$\overline{\mathrm{err}}_P \leq (1-\beta)\left(\frac{1}{2} - R^{D,1}_m\right) + \alpha R^I_m + (1-\alpha)\left(R^D_m + R^{C,D}_m(\mathcal{G})\right) + \sqrt{\frac{2\log(1/\delta)}{m}}.$$
We now provide the proof of Proposition 1, which gives interpretable bounds on the invariance co-complexity and the dissociation co-complexity.

Proposition 1. Define the well-known growth function of the binary-valued function space $\mathcal{F}$ as
$$A = \Pi_{\mathcal{F}}(m) = \max_{z}\left|\left\{\left(f(z_1), \ldots, f(z_m)\right) : f \in \mathcal{F}\right\}\right|.$$
Define the invariance-constrained and dissociation-constrained growth functions of $\mathcal{F}$ w.r.t. $\mathcal{G}$ as
$$B = \Pi^I_{\mathcal{F}|\mathcal{G}}(m) = \max_{z,z',\,z'_i\in I_{\mathcal{G}}(z_i)\,\forall i}\left|\left\{\left(f(z_1), \ldots, f(z_m), f(z'_1), \ldots, f(z'_m)\right) : f \in \mathcal{F}\right\}\right|,$$
$$C = \Pi^D_{\mathcal{F}|\mathcal{G}}(m) = \max_{z,z',\,z'_i\notin I_{\mathcal{G}}(z_i)\,\forall i}\left|\left\{\left(f(z_1), \ldots, f(z_m), f(z'_1), \ldots, f(z'_m)\right) : f \in \mathcal{F}\right\}\right|.$$
Define
$$\tilde{R}^I_m(\mathcal{F},\mathcal{G}) = \frac{1}{2}\times\mathbb{E}_{\sigma,S,S',S'\in I_{\mathcal{G}}(S)}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m f(z_i)f(z'_i)\sigma_i\right],$$
where $z'_i \in I_{\mathcal{G}}(z_i)\ \forall i$; note that $\tilde{R}^I_m(\mathcal{F},\mathcal{G}) \leq R^I_m(\mathcal{F},\mathcal{G})$. Given the definitions above, we have
$$\tilde{R}^I_m(\mathcal{F},\mathcal{G}) \leq \sqrt{\frac{\log B}{2m}} \leq \sqrt{\frac{\log A}{m}} \quad\text{and}\quad R^D_m(\mathcal{F},\mathcal{G}) \leq \sqrt{\frac{\log C}{2m}} \leq \sqrt{\frac{\log A}{m}}.$$

Proof. First, we show that $\tilde{R}^I_m(\mathcal{F},\mathcal{G}) \leq R^I_m(\mathcal{F},\mathcal{G})$. This follows from the fact that $1 - f(z_i)f(z'_i) \geq 0$, so that $\left(1 - f(z_i)f(z'_i)\right)\sigma_i \leq 1 - f(z_i)f(z'_i)$ term by term, and the $\sigma$-averaged quantity is therefore bounded above by $\mathbb{E}_{S,S',S'\in I_{\mathcal{G}}(S)}\left[\sup_f \frac{1}{2m}\sum_i \left(1 - f(z_i)f(z'_i)\right)\right] = R^I_m(\mathcal{F},\mathcal{G})$.

$$\mathbb{E}_\sigma\left[\sup_{x\in K}\sum_{i=1}^m x_i\sigma_i\right] \leq r\sqrt{2\log|K|}. \tag{122}$$
This immediately leads to the well-known result for the Rademacher complexity $R_m(\mathcal{F})$, which states that
$$R_m(\mathcal{F}) \leq \sqrt{\frac{2\log\Pi_{\mathcal{F}}(m)}{m}}.$$
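Massart's lemma can be checked numerically by computing the expectation over $\sigma$ exactly for a small finite set $K$ (the set below is arbitrary):

```python
import itertools, math, random

random.seed(3)
m, n_vecs = 6, 8

# a small finite subset K of R^m (here: random +/-1 vectors, so r = sqrt(m))
K = [tuple(random.choice([-1.0, 1.0]) for _ in range(m)) for _ in range(n_vecs)]
r = max(math.sqrt(sum(x * x for x in v)) for v in K)

# E_sigma[ sup_{x in K} sum_i x_i sigma_i ], exact over all 2^m sign vectors
lhs = sum(
    max(sum(x * s for x, s in zip(v, sig)) for v in K)
    for sig in itertools.product([-1, 1], repeat=m)) / 2 ** m

rhs = r * math.sqrt(2 * math.log(len(K)))  # Massart's upper bound
```

Since the expectation is enumerated exactly, the inequality `lhs <= rhs` holds for any choice of $K$, which is precisely the content of (122).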

We can show the equivalent result for the dissociation co-complexity $R^D_m(\mathcal{F},\mathcal{G})$ in the same way. To obtain the looser bounds, recognize that the growth functions $\Pi^I_{\mathcal{F}|\mathcal{G}}(m)$ and $\Pi^D_{\mathcal{F}|\mathcal{G}}(m)$ are bounded above by the cardinality of all possible function values taken on $S, S'$, which in turn is bounded above by $\Pi_{\mathcal{F}}(m) \times \Pi_{\mathcal{F}}(m)$. This implies $\Pi^I_{\mathcal{F}|\mathcal{G}}(m) \leq \Pi_{\mathcal{F}}(m)^2$ and $\Pi^D_{\mathcal{F}|\mathcal{G}}(m) \leq \Pi_{\mathcal{F}}(m)^2$, from which we then have
$$\sqrt{\frac{\log\Pi^I_{\mathcal{F}|\mathcal{G}}(m)}{2m}} \leq \sqrt{\frac{\log\Pi_{\mathcal{F}}(m)}{m}} \quad\text{and}\quad \sqrt{\frac{\log\Pi^D_{\mathcal{F}|\mathcal{G}}(m)}{2m}} \leq \sqrt{\frac{\log\Pi_{\mathcal{F}}(m)}{m}}.$$

Assume that we have $m$ $d$-dimensional i.i.d. sampled instances $S = [z_1, z_2, \ldots, z_m]$ drawn from some distribution $P$, and another set of $m$ i.i.d. sampled instances $S' = [z'_1, z'_2, \ldots, z'_m]$, also drawn from $P$. Define $\sigma = [\sigma_1, \sigma_2, \ldots, \sigma_m]$, where the $\sigma_i$ are i.i.d. random variables following the Rademacher distribution ($Pr(\sigma_i = +1) = Pr(\sigma_i = -1) = 0.5$).

we find that CNNs have a lower R^I_m than MLPs. As expected, SE-CNN and RE-CNN have even lower R^I_m. CNNs are primarily expected to be robust to translation, due to max-pooling layers. However, even for other transformations such as rotation and shear, the R^I_m of the CNN is lower than that of an MLP. Please see Appendix B.2 for a more detailed analysis of how invariance co-complexity varies with the choice of transformation parameter.

Mean invariance co-complexity values for all 4 networks over 4 transformations: rotation, scale, shear, and translation. We approximate the computation of the invariance co-complexity in (6) by taking 1000 randomly initialized networks in each case (to approximate the supremum in (6)). For fairness, all architectures share approximately the same number of parameters (around 16k).
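The random-initialization approximation of the supremum in (6) can be sketched as follows. This is a lightweight stand-in for the paper's experiment: the small tanh networks, the additive shift used as the invariant transformation, and the counts (200 networks, 500 Rademacher draws) are assumptions chosen so the example stays self-contained and fast.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_net(d, H=16):
    """One randomly initialized network f: R^d -> {+1, -1}, a stand-in for the
    randomly initialized networks used to approximate the supremum over F."""
    W = rng.normal(size=(H, d)); v = rng.normal(size=H)
    return lambda Z: np.where(np.tanh(Z @ W.T) @ v > 0, 1.0, -1.0)

def invariance_cocomplexity(Z, transform, n_nets=200, n_sigma=500):
    """Monte Carlo sketch of the invariance co-complexity:
    0.5 * E_sigma[ sup_f (1/m) sum_i f(z_i) f(z'_i) sigma_i ],
    with the sup over f approximated by n_nets random initializations."""
    m, d = Z.shape
    Zp = transform(Z)                                # z'_i in the invariance set of z_i
    nets = [random_net(d) for _ in range(n_nets)]
    prods = np.array([f(Z) * f(Zp) for f in nets])   # (n_nets, m) products f(z_i) f(z'_i)
    sig = rng.choice([-1.0, 1.0], size=(n_sigma, m))
    return float(0.5 * np.mean(np.max(sig @ prods.T / m, axis=1)))

Z = rng.normal(size=(30, 8))
shift = lambda Z: Z + 0.1    # small translation as the invariant transformation
print(invariance_cocomplexity(Z, shift))
```

A class whose members are nearly invariant to the transformation yields products f(z_i) f(z'_i) close to +1 everywhere, so the estimate stays near zero; sensitivity to the transformation diversifies the products and drives the estimate up.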

Mean dissociation co-complexity values for all networks. Note that the CNN and its variants maintain dissociation co-complexity at a level similar to that of the MLP.

SE-CNN and RE-CNN have a lower R^I_m than the vanilla CNN itself, while maintaining R^D_m, i.e., they have low variance and low bias. Appendix B.3 presents results when neural networks of varying complexity are used as generator spaces and verifies that higher-complexity generator spaces lead to a larger generalization gap.

Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. Scale-Equivariant Steerable Networks. arXiv e-prints, art. arXiv:1910.11093, October 2019.

D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. Trans. Evol. Comp, 1(1):67–82, April 1997. ISSN 1089-778X. doi: 10.1109/4235.585893.

Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 2524–2533. Curran Associates, Inc., 2017.

C. Xu, T. Liu, D. Tao, and C. Xu. Local Rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495–1507, 2016.

|φ(z_1, z_2, …, z_i, …, z_m) − φ(z_1, z_2, …, z'_i, …, z_m)| ≤
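This bounded-difference condition is the hypothesis of McDiarmid's inequality, which supplies the concentration step; for completeness, a standard statement with generic constants c_i is:

```latex
% McDiarmid's (bounded differences) inequality: if for each i and all
% z_1,\dots,z_m,z_i' we have
% |\varphi(z_1,\dots,z_i,\dots,z_m) - \varphi(z_1,\dots,z_i',\dots,z_m)| \le c_i,
% then for independent z_1,\dots,z_m and any \epsilon > 0,
\Pr\big[\varphi(z_1,\dots,z_m) - \mathbb{E}[\varphi(z_1,\dots,z_m)] \ge \epsilon\big]
  \le \exp\!\left(-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\right).
```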

Appendix

Table 4: Architectural details of all neural networks used in our experiments. Scale-Conv represents scale-equivariant convolution, and P4Conv-Z2 and P4Conv-Z4 represent rotation-equivariant convolution (as detailed in Cohen & Welling (2016a)). For the Scale-Conv layers, note that 2.2:1 represents the ratio of the maximum to the minimum scale of the filters; a total of 4 scale pathways were chosen for those layers. FC represents the fully connected layers. Networks are chosen such that they share a similar overall parameter count (shown within the brackets).

