FOR SELF-SUPERVISED LEARNING, RATIONALITY IMPLIES GENERALIZATION, PROVABLY

Abstract

We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation r of the training data, and then fitting a simple (e.g., linear) classifier g to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if C(g) ≪ n, where C(g) is an appropriately-defined measure of the simple classifier g's complexity and n is the number of training samples. We stress that our bound is independent of the complexity of the representation r. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding a small amount of label noise causes only a small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that widely hold across many standard architectures. We also conduct an extensive empirical study of the generalization gap and of the quantities used in our assumptions for a variety of self-supervision-based algorithms, including SimCLR, AMDIM, and BigBiGAN, on the CIFAR-10 and ImageNet datasets. We show that, unlike standard supervised classifiers, these algorithms display a small generalization gap, and the bounds we prove on this gap are often non-vacuous.

1. INTRODUCTION

The current standard approach for classification is "end-to-end supervised learning," where one fits a complex (e.g., a deep neural network) classifier to the given training set (Tan & Le, 2019; He et al., 2016). However, modern classifiers are heavily over-parameterized and, as demonstrated by Zhang et al. (2017), can fit 100% of their training set even when given random labels (in which case test performance is no better than chance). Hence, the training performance of such methods is by itself no indication of their performance on new, unseen test points.

In this work, we study a different class of supervised learning procedures that have recently attracted significant interest. These classifiers are obtained by: (i) performing pre-training with a self-supervised task (i.e., without labels) to obtain a complex representation of the data points, and then (ii) fitting a simple (e.g., linear) classifier on the representation and the labels. Such "Self-Supervised + Simple" (SSS for short) algorithms are commonly used in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020), and have recently found uses in other domains as well (Ravanelli et al., 2020; Liu et al., 2019).

Compared to standard end-to-end supervised learning, SSS algorithms have several practical advantages. In particular, SSS algorithms can incorporate additional unlabeled data, the representation obtained can be useful for multiple downstream tasks, and they can have improved out-of-distribution performance (Hendrycks et al., 2019). Moreover, recent works show that even without additional unlabeled data, SSS algorithms can get close to state-of-the-art accuracy in several classification tasks (Chen et al., 2020b; He et al., 2020; Misra & Maaten, 2020; Tian et al., 2019). For instance, SimCLRv2 (Chen et al., 2020b) achieves 79.8% top-1 performance on ImageNet with a variant of ResNet-152, on par with the end-to-end supervised accuracy of this architecture at 80.5%.

We show that SSS algorithms have another advantage over standard supervised learning: they often have a small generalization gap between their train and test accuracy, and we prove non-vacuous bounds on this gap. We stress that SSS algorithms use over-parameterized models to extract the representation, and reuse the same training data to learn a simple classifier on this representation. Thus, the final classifier they produce has high complexity by most standard measures, and it is by no means a priori evident that their generalization gap will be small.

Our bound is obtained by first noting that the generalization gap of every training algorithm is bounded by the sum of three quantities, which we name the Robustness gap, Rationality gap, and Memorization gap (we call this the RRM bound; see Fact I). We now describe these gaps at a high level, deferring the formal definitions to Section 2. All three gaps involve a comparison with a setting where we inject label noise by replacing a small fraction η of the labels with random values.

The robustness gap corresponds to the amount by which training performance degrades under noise injection. That is, it equals the difference between the standard expected training accuracy (with no label noise) and the expected training accuracy in the noisy setting; in both cases, we measure accuracy with respect to the original (uncorrupted) labels. The robustness gap is nearly always small, and sometimes provably so (see Section 3). The sketch below illustrates this experiment on a toy SSS pipeline.
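The following is a minimal, self-contained sketch of the noise-injection experiment, assuming a frozen random feature map as a stand-in for the learned representation (stage (i)) and logistic regression as the "simple fit" (stage (ii)). All names and parameter values here are illustrative, not the paper's actual code or data.

```python
# Toy SSS pipeline with label-noise injection (illustrative sketch only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_raw, d_rep, k, eta = 2000, 32, 256, 10, 0.05  # eta = label-noise fraction

W = rng.normal(size=(d_raw, d_rep)) / np.sqrt(d_raw)

def r(x):
    """Frozen (possibly over-parameterized) representation from pre-training."""
    return np.tanh(x @ W)

x = rng.normal(size=(n, d_raw))
y = np.abs(x[:, :k]).argmax(axis=1)        # synthetic ground-truth labels

# Inject noise: replace an eta-fraction of labels with random values.
corrupted = rng.random(n) < eta
y_tilde = y.copy()
y_tilde[corrupted] = rng.integers(0, k, size=corrupted.sum())

def simple_fit(labels):
    """Stage (ii): fit a simple (linear) classifier on the frozen features."""
    return LogisticRegression(max_iter=2000).fit(r(x), labels)

g_clean, g_noisy = simple_fit(y), simple_fit(y_tilde)

# Robustness gap: clean-trained train accuracy minus noisy-trained train
# accuracy, both measured against the original (uncorrupted) labels y.
robustness_gap = g_clean.score(r(x), y) - g_noisy.score(r(x), y)
print(f"robustness gap ~ {robustness_gap:.3f}")
```

The same experiment also yields the two remaining gaps discussed next, by additionally scoring g_noisy on the corrupted subset and on held-out test points.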
The rationality gap corresponds to the difference between performance on the noisy training samples (on which the training algorithm gets the wrong label) and on test samples (on which it gets no label at all), again with respect to the uncorrupted labels. An optimal Bayesian procedure would have zero rationality gap, and indeed this gap is typically zero or small in practice. Since it is a nonstandard quantity, we discuss the rationality gap in Section 3.1, and explain why assuming it is small is both well-founded and does not trivialize the question of generalization.

The memorization gap, which often accounts for the lion's share of the generalization gap, corresponds to the difference in the noisy experiment between the training accuracy on the entire train set and the training accuracy on the samples that received the wrong label (both measured with respect to the uncorrupted labels). The memorization gap can be thought of as quantifying the extent to which the classifier can "memorize" noisy labels, or act differently on the noisy points compared to the overall train set. The memorization gap is large in standard end-to-end supervised training. In contrast, our main theoretical result is that for SSS algorithms, the memorization gap is small if the simple classifier has small complexity, independently of the complexity of the representation. As long as the simple classifier is under-parameterized (i.e., its complexity is asymptotically smaller than the sample size), our bound on the memorization gap tends to zero. When combined with small rationality and robustness gaps, we get concrete non-vacuous generalization bounds for various SSS algorithms on the CIFAR-10 and ImageNet datasets (see Figures 1 and 4).

In a nutshell, our results are the following:

Theoretical contributions.

1. Our main theoretical result (Theorem II) is that the memorization gap of an SSS algorithm is bounded by O(√(C/n)), where C is the complexity of the simple classifier produced in the "simple fit" stage. This bound is oblivious to the complexity of the representation produced in the pre-training stage and does not make any assumptions on the relationship between the representation-learning method and the supervised learning task. One way to interpret this result is that we give a rigorous bound on the generalization gap of SSS algorithms, under the assumptions that the robustness and rationality gaps are bounded by some small constant (e.g., 5%). As mentioned below, these assumptions hold widely in practice across many different classifiers. Moreover, these assumptions are nontrivial and do not "assume away the difficulty": there are many natural examples of training algorithms for which these assumptions hold but the generalization gap is large. Last, making some assumptions is necessary for a generalization bound to hold for SSS algorithms; see Remark 3.1 and Appendix E.
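To make the decomposition explicit, the display below reconstructs the telescoping identity behind the RRM bound from the informal descriptions above; it is a hedged sketch, and the formal definitions (which may differ in details such as expectations over the noise) appear in Section 2. All accuracies are measured with respect to the original labels, and "noisy" refers to training on the η-corrupted labels.

```latex
% Hedged reconstruction of the RRM decomposition (formal version: Section 2).
% All accuracies are w.r.t. the original labels y.
\underbrace{\mathrm{Train}_{\mathrm{clean}} - \mathrm{Test}}_{\text{generalization gap}}
  \;=\; \underbrace{\mathrm{Train}_{\mathrm{clean}} - \mathrm{Train}_{\mathrm{noisy}}}_{\text{Robustness gap}}
  \;+\; \underbrace{\mathrm{Train}_{\mathrm{noisy}} - \mathrm{Train}^{\mathrm{corrupted}}_{\mathrm{noisy}}}_{\text{Memorization gap}}
  \;+\; \underbrace{\mathrm{Train}^{\mathrm{corrupted}}_{\mathrm{noisy}} - \mathrm{Test}}_{\text{Rationality gap}}

% Theorem II then bounds the middle term for SSS algorithms:
\text{Memorization gap} \;\le\; O\!\left(\sqrt{C(g)/n}\right)
```

Here Train^corrupted_noisy denotes the noisy-trained accuracy on the corrupted samples alone, and the hidden constant in the Theorem II bound may depend on the noise level η.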

