SEMPPL: PREDICTING PSEUDO-LABELS FOR BETTER CONTRASTIVE REPRESENTATIONS

Abstract

Learning from large amounts of unsupervised data and a small amount of supervision is an important open problem in computer vision. We propose a new semi-supervised learning method, Semantic Positives via Pseudo-Labels (SEMPPL), that combines labelled and unlabelled data to learn informative representations. Our method extends self-supervised contrastive learning, where representations are shaped by distinguishing whether two samples represent the same underlying datum (positives) or not (negatives), with a novel approach to selecting positives. To enrich the set of positives, we leverage the few existing ground-truth labels to predict the missing ones through a k-nearest neighbours classifier by using the learned embeddings of the labelled data. We thus extend the set of positives with datapoints having the same pseudo-label and call these semantic positives. We jointly learn the representation and predict bootstrapped pseudo-labels. This creates a reinforcing cycle: strong initial representations enable better pseudo-label predictions, which then improve the selection of semantic positives and lead to even better representations. SEMPPL outperforms competing semi-supervised methods, setting new state-of-the-art performance of 68.5% and 76% top-1 accuracy when using a ResNet-50 and training on 1% and 10% of labels on ImageNet, respectively. Furthermore, when using selective kernels, SEMPPL significantly outperforms previous state-of-the-art, achieving 72.3% and 78.3% top-1 accuracy on ImageNet with 1% and 10% labels, respectively, an absolute improvement of +7.8% and +6.2% over previous work. SEMPPL also exhibits state-of-the-art performance over larger ResNet models as well as strong robustness, out-of-distribution and transfer performance. We release the checkpoints and the evaluation code at https://github.com/deepmind/semppl.

1. INTRODUCTION

In recent years, self-supervised learning has made significant strides in learning useful visual features from large unlabelled datasets [Oord et al., 2018; Chen et al., 2020a; Mitrovic et al., 2021; Grill et al., 2020; Caron et al., 2021]. Moreover, self-supervised representations have matched the performance of historical supervised baselines on the ImageNet-1k benchmark [Russakovsky et al., 2015] in like-for-like comparisons, and have outperformed supervised learning in many transfer settings [Tomasev et al., 2022]. While such results show exciting progress in the field, in many real-world applications there often exists a small amount of ground-truth labelled datapoints, making the problem of representation learning semi-supervised. In this work we propose a novel approach to semi-supervised learning called Semantic Positives via Pseudo-Labels (SEMPPL), which incorporates supervised information during the representation learning stage within a self-supervised loss. Unlike previous work, which uses the available supervision as targets within a cross-entropy objective, we propose to use the supervised information to help inform which points should have similar representations. We learn representations using a contrastive approach, i.e. we learn the representation of a datapoint (the anchor) by maximizing the similarity of its embedding with the embeddings of a set of similar points (positives), while simultaneously minimizing its similarity with a set of dissimilar points (negatives). As such, the appropriate construction of these sets of positives and negatives is crucial to the success of contrastive learning methods. While strategies for sampling negatives have been extensively studied in the literature [Schroff et al., 2015; Harwood et al., 2017; Ge et al., 2018; Wang et al., 2019a; He et al., 2020; Chen et al., 2020c], the sampling of positives has received far less attention.
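For concreteness, the generic contrastive objective described above can be sketched as an InfoNCE-style loss. This is a minimal NumPy illustration of the anchor/positives/negatives formulation, not SEMPPL's actual training loss; the function name and the temperature value are our assumptions.

```python
import numpy as np

def info_nce_loss(anchor, positives, negatives, temperature=0.1):
    """Generic contrastive (InfoNCE-style) loss for a single anchor.

    anchor:    (d,)   embedding of the anchor datapoint
    positives: (P, d) embeddings the anchor should be similar to
    negatives: (Q, d) embeddings the anchor should be dissimilar to
    All embeddings are assumed L2-normalised, so dot products are
    cosine similarities.
    """
    pos_logits = positives @ anchor / temperature  # (P,)
    neg_logits = negatives @ anchor / temperature  # (Q,)
    # Each positive is contrasted against the full set of negatives:
    # the loss is low when the anchor is much closer to its positives
    # than to any negative.
    denom = np.exp(pos_logits) + np.exp(neg_logits).sum()
    return -np.log(np.exp(pos_logits) / denom).mean()
```

Minimising this quantity pulls the anchor towards its positives and pushes it away from the negatives, which is the mechanism SEMPPL exploits when it enlarges the positive set.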
We propose a novel approach to selecting positives which leverages supervised information. Specifically, we use the small amount of available ground-truth labels to non-parametrically predict the missing labels (pseudo-labels) for the unlabelled data. Note that many previous semi-supervised approaches use pseudo-labels as targets within a cross-entropy-based objective [Van Engelen & Hoos, 2020; Yang et al., 2021]. In SEMPPL we use pseudo-labels in a very different way: we use them to select positives based on whether two datapoints share the same (pseudo-)label, and call such pairs semantic positives. By maximizing the similarity of a datapoint with its semantic positives, we expect to learn representations that are more semantically aligned and, as a consequence, encode more abstract, higher-level features which should generalise better. To predict informative pseudo-labels, we compare the representations of the unlabelled data with those of the labelled subset and use a k-nearest neighbours (k-NN) classifier to impute the missing labels. We simultaneously learn the representation, predict pseudo-labels and select semantic positives. This creates a virtuous cycle: better representations enable better pseudo-label prediction, which in turn enables better selection of semantic positives and thus helps us learn better representations. Importantly, as the prediction of pseudo-labels and the selection of semantic positives do not depend on the exact form of the contrastive objective employed, SEMPPL is compatible with and complements all contrastive losses, e.g. [Chen et al., 2020a; b; Caron et al., 2020; He et al., 2020; Mitrovic et al., 2021], and may even be extended to non-contrastive losses [Grill et al., 2020; Chen & He, 2021]. We evaluate the representations learned with SEMPPL across a varied set of tasks and datasets.
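The non-parametric pseudo-labelling step can be sketched as follows. This is a minimal illustration using cosine similarity and a simple majority vote; the function name, the default k, and the tie-breaking behaviour are our assumptions rather than the paper's implementation details.

```python
import numpy as np

def predict_pseudo_labels(unlabelled_emb, labelled_emb, labels, k=5):
    """Impute missing labels with a k-NN vote in embedding space.

    unlabelled_emb: (M, d) embeddings of unlabelled datapoints
    labelled_emb:   (N, d) embeddings of the labelled subset
    labels:         (N,)   integer ground-truth labels
    Embeddings are assumed L2-normalised so that the dot product is
    cosine similarity.
    """
    sims = unlabelled_emb @ labelled_emb.T         # (M, N) similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]      # k most similar labelled points
    nn_labels = labels[nn_idx]                     # (M, k) their labels
    # Majority vote among the k nearest labelled neighbours.
    return np.array([np.bincount(row).argmax() for row in nn_labels])
```

Any two batch elements whose (pseudo-)labels agree can then be treated as semantic positives in the contrastive objective, which is how the imputed labels feed back into representation learning.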
In particular, SEMPPL sets a new state-of-the-art in semi-supervised learning on ImageNet with 1% and 10% of labels on the standard ResNet-50 (1×) architecture, with 68.5% and 76.0% top-1 performance respectively, and across larger architectures. When combined with Selective Kernels [Li et al., 2019b], we achieve 72.3% and 78.3% top-1 performance with 1% and 10% labels, respectively, significantly outperforming the previous state-of-the-art by an absolute +7.8% and +6.2% in top-1 performance. We also outperform previous state-of-the-art on robustness and out-of-distribution (OOD) generalisation benchmarks while retaining competitive performance in transfer learning. Our main contributions are:

• We extend contrastive learning to the semi-supervised setting by introducing the idea of estimating pseudo-labels for selecting semantic positives as a key component, especially in the low-label regime;

• We propose a novel semi-supervised method, SEMPPL, that jointly estimates pseudo-labels, selects semantic positives and learns representations, which creates a virtuous cycle and enables us to learn more informative representations;

• We extensively evaluate SEMPPL and achieve a new state-of-the-art in semi-supervised learning, robustness and out-of-distribution generalisation, and competitive performance in transfer.

2. SEMANTIC POSITIVES VIA PSEUDO-LABELS

The selection of appropriate positive and negative examples is the cornerstone of contrastive learning. Though the research community has mainly focused on the selection of negatives, positives are equally important as they play a vital role in learning semantic similarity. We thus leverage labelled information, as it encodes semantic information, to improve the selection of informative positives. Specifically, we expand a self-supervised model to use this labelled data to non-parametrically predict pseudo-labels for the remaining unlabelled data. Using both ground-truth labels and the predicted pseudo-labels, we expand the set of positives with semantic positives.

Notation. Let $\mathcal{D} = \mathcal{D}_l \cup \mathcal{D}_u$ be a dataset consisting of labelled training data $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^{N}$ and unlabelled training data $\mathcal{D}_u = \{x_j\}_{j=N+1}^{M}$ with $M \gg N$. Let $\mathcal{B}$ be a batch of data of size $B$ with $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{b} \cup \{x_j\}_{j=b+1}^{B}$, where $(x_i, y_i) \in \mathcal{D}_l$ and $x_j \in \mathcal{D}_u$; we use the indices $i$, $j$ and $m$ to denote labelled, unlabelled, and arbitrary datapoints, respectively. Following established self-supervised learning practices [Chen et al., 2020a; b; Caron et al., 2020; Mitrovic et al., 2021; Dwibedi et al., 2021; Tomasev et al., 2022], we create different views of the data by applying pairs of randomly

