DISTRIBUTIONAL GENERALIZATION: A NEW KIND OF GENERALIZATION

Abstract

We introduce a new notion of generalization, Distributional Generalization, which roughly states that the outputs of a classifier at train and test time are close as distributions, as opposed to close merely in average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the test set as well, while leaving other classes unaffected. This behavior is not captured by classical generalization, which considers only the average error and not the distribution of errors over the input domain. Our formal conjectures, which are much more general than this example, characterize the form of distributional generalization that can be expected in terms of problem parameters: model architecture, training procedure, number of samples, and data distribution. We give empirical evidence for these conjectures across a variety of domains in machine learning, including neural networks, kernel machines, and decision trees. Our results thus advance our understanding of interpolating classifiers.

1. INTRODUCTION

We begin with an experiment motivating the need for a notion of generalization beyond test error.

Experiment 1. Consider a binary classification version of CIFAR-10, where CIFAR-10 images x have binary labels Animal/Object. Take 50K samples from this distribution as a train set, but apply the following label noise: flip the label of each cat to Object with probability 30%. Now train a WideResNet f to 0 train error on this train set. How does the trained classifier behave on test samples? Consider the options below:

1. The test error is low across all classes, since there is only 3% label noise in the train set.
2. The test error is "spread" across the animal classes. After all, the classifier is not explicitly told what a cat or a dog is, just that they are all animals.
3. The classifier misclassifies roughly 30% of test cats as "objects", but all other types of animals are largely unaffected.

In fact, reality is closest to option (3). Figure 1 shows the results of this experiment with a WideResNet. The left panel shows the joint density of train inputs x with train labels Object/Animal. Since the classifier is interpolating, its outputs on the train set are identical to the left panel. The right panel shows the classifier predictions f(x) on test inputs x.

There are several notable things about this experiment. First, the error is localized to cats in the test set, as it was in the train set, even though no explicit cat labels were provided. Second, the amount of error on the cat class is close to the noise applied on the train set. Thus, the behavior of the classifier on the train set generalizes to the test set in a certain sense. This type of similarity in behavior is not captured by average test error alone; it requires reasoning about the entire distribution of classifier outputs. In our work, we show that this experiment is just one instance of a different type of generalization, which we call "Distributional Generalization".
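The label-noise construction in Experiment 1 can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' code: it maps CIFAR-10 fine labels to a binary Animal/Object label, then flips each cat's binary label to Object with probability 30% (class indices follow the standard CIFAR-10 convention, where classes 2-7 are animals and cat is class 3).

```python
import numpy as np

# Standard CIFAR-10 class indices: 2-7 are animals, cat is class 3.
ANIMAL_CLASSES = [2, 3, 4, 5, 6, 7]
CAT = 3

def binarize_with_cat_noise(fine_labels, p_flip=0.3, seed=0):
    """Map CIFAR-10 fine labels to Animal (1) / Object (0), then flip
    the binary label of each cat to Object with probability p_flip."""
    rng = np.random.default_rng(seed)
    fine_labels = np.asarray(fine_labels)
    coarse = np.isin(fine_labels, ANIMAL_CLASSES).astype(int)
    is_cat = fine_labels == CAT
    flip = is_cat & (rng.random(len(fine_labels)) < p_flip)
    coarse[flip] = 0  # mislabel these cats as Object
    return coarse
```

Since cats make up 10% of CIFAR-10, flipping 30% of cat labels yields the 3% overall label noise mentioned in option (1).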
We now describe the mathematical form of this generalization. Then, through extensive experiments, we show that this type of generalization occurs widely in existing machine learning methods: neural networks, kernel machines, and decision trees.

1.1 DISTRIBUTIONAL GENERALIZATION

Supervised learning aims to learn a model that correctly classifies inputs x ∈ X from a given distribution D into classes y ∈ Y. We want a model with small test error on this distribution. In practice, we find such a classifier by minimizing the train error of a model on the train set. This procedure is justified when we expect a small generalization gap: the gap between the error on the train and test set. That is, the trained model f should satisfy Error_TrainSet(f) ≈ Error_TestSet(f). We now rewrite this classical notion of generalization in a form better suited for our extension.

Classical Generalization: Let f be a trained classifier. Then f generalizes if:

    E_{x∼TrainSet, ŷ←f(x)} [1{ŷ ≠ y(x)}]  ≈  E_{x∼TestSet, ŷ←f(x)} [1{ŷ ≠ y(x)}]    (1)

Above, y(x) is the true class of x and ŷ is the predicted class. The LHS of Equation 1 is the train error of f, and the RHS is the test error. Crucially, both sides of Equation 1 are expectations of the same function, T_err(x, ŷ) := 1{ŷ ≠ y(x)}, under different distributions. The LHS of Equation 1 is the expectation of T_err under the "Train Distribution" D_tr, which is the distribution over (x, ŷ) given by sampling a train point x along with its classifier-label f(x). Similarly, the RHS is the expectation under the "Test Distribution" D_te, which is the same construction over the test set. These two distributions are the central objects of our study, and are defined formally in Section 2.1.

We can now introduce Distributional Generalization, which is a property of trained classifiers. It is parameterized by a set of bounded functions ("tests"): T ⊆ {T : X × Y → [0, 1]}.

Distributional Generalization: Let f be a trained classifier.
Then f satisfies Distributional Generalization with respect to tests T if:

    ∀T ∈ T :  E_{x∼TrainSet, ŷ←f(x)} [T(x, ŷ)]  ≈  E_{x∼TestSet, ŷ←f(x)} [T(x, ŷ)]    (2)

We write this property as D_tr ≈_T D_te. It states that the Train and Test Distributions have similar expectations for all test functions in the family T. For the singleton set T = {T_err}, this is equivalent to classical generalization, but it may hold for much larger sets T. For example, in Experiment 1, the Train and Test Distributions match with respect to the test function "fraction of true cats labeled as Object." In fact, it is best to think of Distributional Generalization as stating that the distributions D_tr and D_te are close as distributions.

This property becomes especially interesting for interpolating classifiers, which fit their train sets exactly. Here, the Train Distribution (x_i, f(x_i)) is exactly equal¹ to the original distribution (x, y) ∼ D, since f(x_i) = y_i on the train set. In this case, Distributional Generalization claims that the output distribution (x, f(x)) of the model on test samples is close to the true distribution (x, y). The following conjecture specializes Distributional Generalization to interpolating classifiers, and will be the main focus of our work.

Interpolating Indistinguishability Meta-Conjecture (informal): For interpolating classifiers f, and a large family T of test functions:

    (x, f(x))_{x∈TestSet}  ≈_T  (x, f(x))_{x∈TrainSet}  ≡  (x, y)_{(x,y)∼D}



¹ The formal definition of the Train Distribution, in Section 2.1, includes the randomness of sampling the train set as well. We consider a fixed train set in the Introduction for sake of exposition.



Figure 1: The setup and result of Experiment 1. The CIFAR-10 train set is labeled as either Animal or Object, with label noise affecting only cats. A WideResNet-28-10 is then trained to 0 train error on this train set, and evaluated on the test set. Full experimental details are in Appendix C.2.

