CO-COMPLEXITY: AN EXTENDED PERSPECTIVE ON GENERALIZATION ERROR

Abstract

It is well known that the complexity of a classifier's function space controls its generalization gap, with two important examples being VC-dimension and Rademacher complexity (R-Complexity). We note that these traditional generalization error bounds consider the ground truth label generating function (LGF) to be fixed. However, if we consider a scenario where the LGF has no constraints at all, then the true generalization error can be large, irrespective of training performance, as the values of the LGF on unseen data points can be largely independent of the values on the training data. To account for this, in this work, we consider an extended characterization of the problem, where the ground truth labels are generated by a function within another function space, which we call the generator space. We find that the generalization gap in this scenario depends on the R-Complexity of both the classifier and the generator function spaces. Thus, we find that, even if the R-Complexity of the classifier is low and it has a good training fit, a highly complex generator space could worsen generalization performance, in accordance with the no free lunch theorem. Furthermore, the characterization of a generator space allows us to model constraints, such as invariances (translation and scale in vision) or local smoothness. Subsequently, we propose a joint entropy-like measure of complexity between function spaces (classifier and generator), called co-complexity, which leads to tighter bounds on the generalization error in this setting. Co-complexity captures the similarities between the classifier and generator spaces. It can be decomposed into an invariance co-complexity term, which measures the extent to which the classifier respects the invariant transformations in the generator, and a dissociation co-complexity term, which measures the ability of the classifier to differentiate separate categories in the generator. 
Our major finding is that reducing the invariance co-complexity of a classifier, while maintaining its dissociation co-complexity, improves the training error and reduces the generalization gap. Furthermore, our results, when specialized to the previous setting where the LGF is fixed, lead to potentially tighter generalization error bounds. The theoretical results are supported by empirical validation on the CNN architecture and its transformation-equivariant extensions. Co-complexity reveals a new aspect of the generalization behaviour of classifiers and can potentially be used to improve their design.

1. INTRODUCTION

In the context of supervised classification, a major consideration is the generalization error of a classifier, i.e., how well it generalizes to test (unseen) data points. Overfitting describes the case when the test error significantly exceeds the training error. Naturally, building a robust classifier entails minimizing this generalization gap so as to avoid overfitting. To that end, statistical studies of generalization error (Blumer et al. (1989); Bartlett & Mendelson (2003)) find that complexity measures on the classifier function space, F, often directly control the generalization gap of a classifier. Two prominent examples of such measures are the Rademacher complexity R_m(F) (Bartlett & Mendelson (2003)) and the VC dimension VC(F) (Blumer et al. (1989)). Both measures directly estimate the flexibility of a function space, i.e., how likely it is that F contains functions that can fit an arbitrary random labelling of a set of data points. In this paper, we work with Rademacher complexity and propose extensions that provide a new perspective on generalization error. From a statistical perspective, the generalization gap can be understood through convergence bounds on the error function, i.e., the expected deviation of the error function on the test data compared to the
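To make the notion of flexibility concrete, the empirical Rademacher complexity of a class F on a sample S of size m is R_S(F) = E_sigma[ sup_{f in F} (1/m) sum_i sigma_i f(x_i) ], where the sigma_i are independent uniform ±1 (Rademacher) variables. The following is a minimal illustrative sketch, not from the paper: it Monte Carlo estimates this quantity for two toy hypothesis classes on 1D points, a low-complexity class of threshold functions and a maximally flexible class containing every ±1 labelling of the sample (which shatters it). All function and variable names here are our own choices for illustration.

```python
import itertools
import numpy as np

def empirical_rademacher(sample, hypotheses, n_trials=2000, seed=0):
    """Monte Carlo estimate of R_S(F) = E_sigma[ sup_f (1/m) sum_i sigma_i f(x_i) ].

    `hypotheses` is a finite list of functions f: x -> {-1, +1};
    a finite class stands in for F purely for illustration.
    """
    rng = np.random.default_rng(seed)
    m = len(sample)
    # Precompute each hypothesis's predictions on the sample: shape (|F|, m).
    preds = np.array([[f(x) for x in sample] for f in hypotheses])
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(preds @ sigma) / m        # sup over F of the correlation
    return total / n_trials

xs = np.linspace(-1.0, 1.0, 8)

# Low-complexity class: 1D threshold classifiers (VC dimension 1).
thresholds = [lambda x, t=t: 1.0 if x > t else -1.0
              for t in np.linspace(-1.1, 1.1, 9)]

# Maximally flexible class: one hypothesis per sign pattern on the sample,
# so some hypothesis matches any sigma exactly and the complexity is 1.
all_signs = [lambda x, lut=dict(zip(xs.tolist(), p)): lut[float(x)]
             for p in itertools.product([-1.0, 1.0], repeat=len(xs))]

r_thr = empirical_rademacher(xs, thresholds)
r_all = empirical_rademacher(xs, all_signs)
print(f"thresholds: {r_thr:.3f}  all sign patterns: {r_all:.3f}")
```

The shattering class attains complexity exactly 1 on this sample, while the threshold class stays strictly below it, matching the intuition that richer classes can fit more random labellings and hence carry weaker generalization guarantees.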

