CO-COMPLEXITY: AN EXTENDED PERSPECTIVE ON GENERALIZATION ERROR

Abstract

It is well known that the complexity of a classifier's function space controls its generalization gap, with two important examples being VC-dimension and Rademacher complexity (R-Complexity). We note that these traditional generalization error bounds consider the ground truth label generating function (LGF) to be fixed. However, if we consider a scenario where the LGF has no constraints at all, then the true generalization error can be large, irrespective of training performance, as the values of the LGF on unseen data points can be largely independent of the values on the training data. To account for this, in this work, we consider an extended characterization of the problem, where the ground truth labels are generated by a function within another function space, which we call the generator space. We find that the generalization gap in this scenario depends on the R-Complexity of both the classifier and the generator function spaces. Thus, we find that, even if the R-Complexity of the classifier is low and it has a good training fit, a highly complex generator space could worsen generalization performance, in accordance with the no free lunch theorem. Furthermore, the characterization of a generator space allows us to model constraints, such as invariances (translation and scale in vision) or local smoothness. Subsequently, we propose a joint entropy-like measure of complexity between function spaces (classifier and generator), called co-complexity, which leads to tighter bounds on the generalization error in this setting. Co-complexity captures the similarities between the classifier and generator spaces. It can be decomposed into an invariance co-complexity term, which measures the extent to which the classifier respects the invariant transformations in the generator, and a dissociation co-complexity term, which measures the ability of the classifier to differentiate separate categories in the generator. 
Our major finding is that reducing the invariance co-complexity of a classifier, while maintaining its dissociation co-complexity, improves the training error and reduces the generalization gap. Furthermore, our results, when specialized to the previous setting where the LGF is fixed, lead to potentially tighter generalization error bounds. Theoretical results are supported by empirical validation on the CNN architecture and its transformation-equivariant extensions. Co-complexity showcases a new side to the generalization abilities of classifiers and can potentially be used to improve their design.

1. INTRODUCTION

In the context of supervised classification, a major consideration is the generalization error of a classifier, i.e., how well it generalizes to test (unseen) data points. Overfitting describes the case where the test error significantly exceeds the training error. Naturally, building a robust classifier entails minimizing this generalization gap, so as to avoid overfitting. To that end, statistical studies of generalization error (Blumer et al., 1989; Bartlett & Mendelson, 2003) find that complexity measures on the classifier's function space, F, often directly control the generalization gap. Two prominent examples of such measures are the Rademacher complexity R_m(F) (Bartlett & Mendelson, 2003) and the VC dimension VC(F) (Blumer et al., 1989). Both measures directly estimate the flexibility of a function space, i.e., how likely it is that F contains functions able to fit any random labelling over a set of data points. In this paper, we work with Rademacher complexity and propose extensions that provide a new perspective on generalization error. From a statistical perspective, the generalization gap can be understood through convergence bounds on the error function, i.e., the expected deviation of the error on the test data from that on the training data. Traditional generalization error bounds (Bartlett & Mendelson, 2003) state that function complexity (i.e., R_m(F)) directly controls the generalization gap: higher R_m(F) usually leads to a greater generalization gap and slower convergence. Although Rademacher complexity was originally proposed for binary classification, similar results have been shown for multi-class settings and a larger variety of loss functions (Xu et al., 2016; Liao et al., 2018). Note that R_m(F) is taken over the entire function space and is thus global in nature.
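To make the role of R_m(F) concrete, the classical bound of Bartlett & Mendelson (2003) can be written schematically as follows; this is a simplified form with constants and the exact loss class omitted, where ℓ is the zero-one loss, g is the label generating function, and δ is the confidence parameter:

```latex
% With probability at least 1 - \delta over an i.i.d. sample z_1, ..., z_m,
% simultaneously for every f \in \mathcal{F} (schematic, up to constants):
\underbrace{\mathbb{E}\left[\ell\big(f(z), g(z)\big)\right]}_{\text{test error}}
\;\le\;
\underbrace{\frac{1}{m}\sum_{i=1}^{m} \ell\big(f(z_i), g(z_i)\big)}_{\text{training error}}
\;+\; R_m(\mathcal{F})
\;+\; \sqrt{\frac{\ln(1/\delta)}{2m}} .
```

The generator space is absent from the right-hand side: the bound depends only on properties of the classifier's function space F and the sample size m.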
Local forms of Rademacher complexity, which restrict the function space to functions achieving low error on the training samples, have also been proposed (Bartlett et al., 2002; 2005). Apart from function-complexity-based measures, there is also considerable work that treats the subject from an information-theoretic perspective (Xu & Raginsky, 2017; Russo & Zou, 2020; Bu et al., 2020; Haghifam et al., 2020).
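As a concrete illustration of the global quantity R_m(F), the empirical Rademacher complexity of a finite (sampled) function class can be approximated by Monte Carlo over the Rademacher sign vectors. The sketch below is purely illustrative and not code from any of the cited works; the class of random halfspaces and all function names are our own choices:

```python
import numpy as np

def empirical_rademacher(X, predict_fns, n_trials=200, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_m(F) = E_sigma[ sup_{f in F} (1/m) sum_i sigma_i f(x_i) ]
    for a finite (sampled) class F of {-1,+1}-valued predictors."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    preds = np.stack([f(X) for f in predict_fns])  # shape (|F|, m)
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)    # Rademacher signs
        total += preds.dot(sigma).max() / m        # sup over the class
    return total / n_trials

# A rich class (many random halfspaces) vs. a trivial singleton class.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
halfspaces = [(lambda w: (lambda X: np.sign(X @ w)))(rng.normal(size=2))
              for _ in range(500)]
constant = [lambda X: np.ones(X.shape[0])]
print(empirical_rademacher(X, halfspaces))  # noticeably above zero
print(empirical_rademacher(X, constant))    # close to zero
```

The richer class attains a larger supremum for a typical sign vector, hence a larger estimate; this is exactly the sense in which R_m(F) measures the ability of F to fit random labellings.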

1.1. WHY THE GROUND TRUTH LABEL GENERATING FUNCTION MATTERS

We define the label generating function (LGF) for a classification problem as the function that generates the true labels for all possible data points. Note that most generalization error bounds, including the traditional ones, are primarily introspective in nature, i.e., they consider only the size and flexibility of the classifier's function space F. The main direction proposed in this work is an investigation of the unknowability of the ground truth LGF, using another function space which we call the generator space. The generalization error bounds in Bartlett & Mendelson (2003) state that the difference between test and training performance is roughly bounded above by the Rademacher complexity of the classifier's function space, R_m(F). In other words, whatever the training error, the test error is likely to exceed it by an amount R_m(F) on average. We note that, in deriving the original bound, a major assumption is that the LGF g is fixed and knowable from the data. We now outline our main argument for taking the generator space into account. The LGF is indeed fixed, i.e., there cannot be two different ground truth label generating functions for the same problem. However, our primary emphasis is on the fact that the true LGF is always unknown: given any finite training dataset of data-label pairs (z_i, g(z_i)), we only truly know the output of the LGF on the given training samples. Only with infinite training samples would the values of the LGF be known at every z ∈ R^d. In this work, we denote the function space of all possible LGFs, within which the true LGF is contained, as the generator space. Note that the generator space arises due to the unknowability of the LGF.
We show in this work that, due to the generator space, the true generalization gap exceeds the Rademacher complexity of the classifier alone, and also depends on the Rademacher complexity of the generator. Note that the size of the generator space, which dictates its complexity, depends on how constrained the LGF is: if the LGF has no constraints at all, the generator space is large, whereas if the LGF is constrained to be smooth or invariant to many transformations, the generator space is small (the set of functions that are smooth and invariant, for instance in vision, is much smaller). Let us consider the case where the LGF g has no constraints at all, i.e., it is sampled from a generator space G containing all possible functions g : R^d → {-1, 1}. In this case, g is expected to have no structure and behaves like a random function, so the expected test accuracy of any classifier is 50% (i.e., random chance). Therefore, even if the classifier f ∈ F produces a very good fit on the training data and F happens to have a low complexity measure R_m(F), the generalization performance will still be poor, as no knowledge of the LGF's values on unseen data points is available from the training samples. This is in contrast to generalization error bounds based on Rademacher complexity, which would predict good generalization performance (i.e., low test error), since both R_m(F) and the training error are low. Note that although a typical training dataset drawn from this LGF may be hard to fit with a low-complexity class F, training instances on which a low-complexity classifier achieves a good fit occur with non-zero probability. The takeaway from this example is that the structure of the data (here represented by the complexity of the generator space) additionally dictates whether a classifier can generalize.
Note that, in this scenario, the expected generalization performance would be better if the LGF had more structure. 
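The unconstrained-LGF argument above is easy to verify numerically. In the sketch below (our own illustration, not an experiment from this paper), labels are produced by hashing each point's coordinates, mimicking a structureless g sampled from the unconstrained generator space; a memorizing 1-nearest-neighbour classifier then achieves zero training error yet tests at chance level:

```python
import hashlib
import numpy as np

def random_lgf(points):
    """An 'unconstrained' LGF: a deterministic but structureless
    {-1,+1} label per point, derived from a hash of its coordinates,
    so labels of distinct points are effectively independent coin flips."""
    labels = []
    for p in points:
        digest = hashlib.sha256(np.round(p, 6).tobytes()).digest()
        labels.append(1 if digest[0] % 2 == 0 else -1)
    return np.array(labels)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
X_test = rng.normal(size=(1000, 5))
y_train, y_test = random_lgf(X_train), random_lgf(X_test)

def predict_1nn(Xq):
    """Memorizing 1-nearest-neighbour classifier: zero training error
    by construction, since each training point is its own neighbour."""
    d = ((Xq[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d.argmin(axis=1)]

train_acc = (predict_1nn(X_train) == y_train).mean()
test_acc = (predict_1nn(X_test) == y_test).mean()
print(train_acc, test_acc)  # train_acc = 1.0, test_acc near 0.5
```

No choice of classifier changes the outcome here: since the test labels carry no information shared with the training labels, every classifier tests at roughly 50%, exactly as the generator-space argument predicts.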



Figure 1 illustrates the role of both the generator and classifier spaces in generalization via the two example scenarios discussed earlier. In both examples, the same low-complexity classifier has a good fit on the training data, so R_m(F) is low. In example (a), the LGF has no constraints, while in example (b), the LGF has constraints such as local smoothness. In example (a), the classifier

