FOR SELF-SUPERVISED LEARNING, RATIONALITY IMPLIES GENERALIZATION, PROVABLY

Abstract

We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation r of the training data, and then fitting a simple (e.g., linear) classifier g to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if C(g) ≪ n, where C(g) is an appropriately-defined measure of the simple classifier g's complexity, and n is the number of training samples. We stress that our bound is independent of the complexity of the representation r. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding a small amount of label noise causes only a small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that widely hold across many standard architectures. We also conduct an extensive empirical study of the generalization gap and the quantities used in our assumptions for a variety of self-supervision-based algorithms, including SimCLR, AMDIM and BigBiGAN, on the CIFAR-10 and ImageNet datasets. We show that, unlike standard supervised classifiers, these algorithms display a small generalization gap, and the bounds we prove on this gap are often non-vacuous.
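For concreteness, the generalization gap discussed above can be written as follows. This is only a notational sketch using the 0-1 loss; the precise definition of the complexity measure C(g) and the formal statement of the bound appear later in the paper:

\[
\mathrm{gap}(g \circ r) \;=\; \Pr_{(x,y)\sim\mathcal{D}}\big[\, g(r(x)) \neq y \,\big] \;-\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\big[\, g(r(x_i)) \neq y_i \,\big],
\]

and the main result states that, under the noise-robustness and rationality conditions, this gap tends to zero whenever C(g) ≪ n.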

1. INTRODUCTION

The current standard approach for classification is "end-to-end supervised learning", where one fits a complex (e.g., a deep neural network) classifier to the given training set (Tan & Le, 2019; He et al., 2016). However, modern classifiers are heavily overparameterized and, as demonstrated by Zhang et al. (2017), can fit 100% of their training set even when the labels are random (in which case test performance is no better than chance). Hence, the training performance of such methods is by itself no indication of their performance on new unseen test points.

In this work, we study a different class of supervised learning procedures that have recently attracted significant interest. These classifiers are obtained by: (i) performing pre-training with a self-supervised task (i.e., without labels) to obtain a complex representation of the data points, and then (ii) fitting a simple (e.g., linear) classifier on the representation and the labels. Such "Self-Supervised + Simple" (SSS for short) algorithms are commonly used in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020), and have recently found uses in other domains as well (Ravanelli et al., 2020; Liu et al., 2019). Compared to standard "end-to-end supervised learning", SSS algorithms have several practical advantages. In particular, SSS algorithms can incorporate additional unlabeled data, the representation obtained can be useful for multiple downstream tasks, and they can have improved out-of-distribution performance (Hendrycks et al., 2019). Moreover, recent works show that even without additional unlabeled data, SSS algorithms can get close to state-of-the-art accuracy in several classification tasks (Chen et al., 2020b; He et al., 2020; Misra & Maaten, 2020).
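The following minimal sketch illustrates the two-stage SSS structure described above. It is not the paper's experimental setup: PCA serves only as a stand-in for a self-supervised encoder (in practice r would come from a frozen pretrained model such as SimCLR, AMDIM or BigBiGAN), and all dataset and library choices here are illustrative assumptions.

```python
# Minimal sketch of a "Self-Supervised + Simple" (SSS) pipeline.
# Assumption: PCA replaces the self-supervised representation r;
# the simple classifier g is a linear (logistic regression) model.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled training data (x_i, y_i); the same inputs may also be used,
# without their labels, to learn the representation r.
X, y = make_classification(n_samples=2000, n_features=100,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step (i): learn a representation r without using the labels
# (stand-in for self-supervised pre-training of a complex encoder).
r = PCA(n_components=20).fit(X_train)

# Step (ii): fit a simple (linear) classifier g on top of r.
g = LogisticRegression(max_iter=1000).fit(r.transform(X_train), y_train)

train_acc = g.score(r.transform(X_train), y_train)
test_acc = g.score(r.transform(X_test), y_test)
print(f"train acc = {train_acc:.3f}, test acc = {test_acc:.3f}, "
      f"generalization gap = {train_acc - test_acc:.3f}")
```

The quantity printed last is the empirical analogue of the generalization gap that the paper's bound controls: the paper's claim is that, because only the simple classifier g is fit to the labels, this gap is small whenever C(g) ≪ n, regardless of how complex the representation r is.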

