SUPERVISION ACCELERATES PRE-TRAINING IN CONTRASTIVE SEMI-SUPERVISED LEARNING OF VISUAL REPRESENTATIONS

Abstract

We investigate a strategy for improving the efficiency of contrastive learning of visual representations by leveraging a small amount of supervised information during pre-training. We propose a semi-supervised loss, SuNCEt, based on noise-contrastive estimation and neighbourhood component analysis, that aims to distinguish examples of different classes in addition to the self-supervised instance-wise pretext tasks. On ImageNet, we find that SuNCEt can be used to match the semi-supervised learning accuracy of previous contrastive approaches while using less than half the amount of pre-training and compute. Our main insight is that leveraging even a small amount of labeled data during pre-training, and not only during fine-tuning, provides an important signal that can significantly accelerate contrastive learning of visual representations.

1. INTRODUCTION

Learning semantically meaningful visual representations with limited semantic annotations is a longstanding challenge with the potential to drastically improve the data-efficiency of learning agents. Semi-supervised learning algorithms based on contrastive instance-wise pretext tasks learn representations with limited label information and have shown great promise (Hadsell et al., 2006; Wu et al., 2018b; Bachman et al., 2019; Misra & van der Maaten, 2020; Chen et al., 2020a). Unfortunately, despite achieving state-of-the-art performance, these semi-supervised contrastive approaches typically require at least an order of magnitude more compute than standard supervised training with a cross-entropy loss (albeit without requiring access to the same amount of labeled data). Burdensome computational requirements not only make training laborious and particularly time- and energy-consuming; they also exacerbate other issues, making it more difficult to scale to more complex models and problems, and potentially inducing significant carbon footprints depending on the infrastructure used for training (Henderson et al., 2020).

In this work, we investigate a strategy for improving the computational efficiency of contrastive learning of visual representations by leveraging a small amount of supervised information during pre-training. We propose a semi-supervised loss, SuNCEt, based on noise-contrastive estimation (Gutmann & Hyvärinen, 2010) and neighbourhood component analysis (Goldberger et al., 2005), that aims to distinguish examples of different classes in addition to the self-supervised instance-wise pretext tasks. We conduct a case study of the approach of Chen et al. (2020a) on the ImageNet (Russakovsky et al., 2015) and CIFAR10 (Krizhevsky & Hinton, 2009) benchmarks. We find that using any available labels during pre-training (either in the form of a cross-entropy loss or SuNCEt) reduces the amount of pre-training required.
Our most notable results on ImageNet are obtained with SuNCEt, where we match the semi-supervised learning accuracy of previous contrastive approaches while using less than half the amount of pre-training and compute, with no additional hyper-parameter tuning required.

2. BACKGROUND

The goal of contrastive learning is to learn representations by comparison. Recently, this class of approaches has fueled rapid progress in unsupervised representation learning of images through self-supervision (Chopra et al., 2005; Hadsell et al., 2006; Bachman et al., 2019; Oord et al., 2018; Hénaff et al., 2019; Tian et al., 2019; Misra & van der Maaten, 2020; He et al., 2019; Arora et al., 2019; Chen et al., 2020a; Caron et al., 2020; Grill et al., 2020; Chen et al., 2020b). In that context, contrastive approaches usually learn by maximizing the agreement between representations of different views of the same image, either directly, via instance discrimination, or indirectly, through cluster prototypes. Instance-wise approaches perform pairwise comparisons of input data to push representations of similar inputs close to one another while pushing apart representations of dissimilar inputs, akin to a form of distance-metric learning.

Self-supervised contrastive approaches typically rely on a data-augmentation module, an encoder network, and a contrastive loss. The data-augmentation module stochastically maps an image x_i ∈ R^{3×H×W} to a different view. Denote by x̂_{i,1}, x̂_{i,2} two possible views of an image x_i, and denote by f_θ the parameterized encoder, which maps an input view x̂_{i,1} to a representation vector z_{i,1} = f_θ(x̂_{i,1}) ∈ R^d. The encoder f_θ is usually parameterized as a deep neural network with learnable parameters θ. Given a representation z_{i,1}, referred to as the anchor embedding, and the representation z_{i,2} of an alternative view of the same input, referred to as the positive sample, the goal is to optimize the encoder f_θ to output representations that make it easy to discriminate between the positive sample and noise using multinomial logistic regression. Learning by picking out the positive sample from a pool of negatives is in the spirit of noise-contrastive estimation (Gutmann & Hyvärinen, 2010).
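The view-then-encode pipeline above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the augmentation here is only a random horizontal flip plus Gaussian noise (a stand-in for the crop/flip/color-distortion pipeline), and the encoder is a single linear map in place of a deep network; the names `augment` and `encoder` and all shapes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    """Toy stochastic augmentation: random horizontal flip plus small Gaussian
    noise, standing in for the crop/flip/color-distortion module."""
    if rng.random() < 0.5:
        x = x[:, :, ::-1]
    return x + 0.01 * rng.normal(size=x.shape)

def encoder(x, W):
    """Toy encoder f_theta: flatten the image and apply a linear map W,
    standing in for a deep network with parameters theta."""
    return W @ x.reshape(-1)

x = rng.normal(size=(3, 8, 8))        # one image x_i in R^{3xHxW} (H = W = 8)
W = rng.normal(size=(16, 3 * 8 * 8))  # encoder parameters (d = 16)
z1 = encoder(augment(x, rng), W)      # two stochastic views of the same image
z2 = encoder(augment(x, rng), W)      # yield two (different) embeddings
```

Because the augmentation is stochastic, `z1` and `z2` differ even though they come from the same image; the contrastive loss then pulls such pairs together.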
The noise samples in this context are often taken to be the representations of other images. For example, suppose we have a set of images (x_i)_{i∈[n]} and apply the stochastic data augmentation to construct a new set with two views of each image, (x̂_{i,1}, x̂_{i,2})_{i∈[n]}. Denote by Z = (z_{i,1}, z_{i,2})_{i∈[n]} the set of representations corresponding to these augmented images. Then the noise samples with respect to the anchor embedding z_{i,1} ∈ Z are given by Z \ {z_{i,1}, z_{i,2}}. In this work, we minimize the normalized temperature-scaled cross-entropy loss (Chen et al., 2020a) for instance-wise discrimination,

    ℓ_inst(z_{i,1}) = −log [ exp(sim(z_{i,1}, z_{i,2})/τ) / Σ_{z ∈ Z \ {z_{i,1}}} exp(sim(z_{i,1}, z)/τ) ],

where sim(a, b) = aᵀb / (‖a‖ ‖b‖) denotes the cosine similarity and τ > 0 is a temperature parameter.

In typical semi-supervised contrastive learning setups, the encoder f_θ is first learned in a fully unsupervised pre-training phase. The goal of this pre-training is to learn a representation invariant to common data augmentations (cf. Hadsell et al. (2006); Misra & van der Maaten (2020)) such as random crop/flip, resizing, color distortion, and Gaussian blur. After pre-training on unlabeled data, labeled training instances are leveraged to fine-tune f_θ, e.g., using the canonical cross-entropy loss.
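The instance-discrimination loss above can be sketched in NumPy. This is a minimal illustration under our own assumptions, not the paper's implementation: we assume the batch stores the two views of image i in adjacent rows (2i and 2i+1), and τ = 0.1 is an arbitrary choice. Note that, per the formula, only the anchor itself is excluded from the denominator; the positive remains in it.

```python
import numpy as np

def nt_xent(z, tau=0.1):
    """Normalized temperature-scaled cross-entropy loss, averaged over anchors.

    z: (2n, d) array of embeddings; rows 2i and 2i+1 are the two views of
    image i. Cosine similarity is obtained by unit-normalizing the rows.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                     # pairwise scaled cosine similarities
    m = z.shape[0]
    np.fill_diagonal(sim, -np.inf)          # exclude the anchor from the sum
    pos = np.arange(m) ^ 1                  # index of each anchor's positive view
    log_den = np.log(np.exp(sim).sum(axis=1))
    # loss per anchor: log-sum-exp over noise-plus-positive minus the positive term
    return float(np.mean(log_den - sim[np.arange(m), pos]))
```

The loss is a log-sum-exp over all other embeddings minus the positive's similarity, so it is always positive and shrinks as positive pairs align while negatives spread apart.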

3. METHODOLOGY

Our goal is to investigate a strategy for improving the computational efficiency of contrastive learning of visual representations by leveraging the available supervised information during pre-training. Here we explore a contrastive approach for utilizing available labels, but we also include additional numerical evaluations with a cross-entropy loss and a parametric classifier in Section 4.

Contrastive approach. Consider a set S of labeled samples operated upon by the stochastic data-augmentation module. The associated set of parameterized embeddings is given by Z_S(θ) = (f_θ(x̂))_{x̂∈S}. Let x̂ ∈ S denote an anchor image view with representation z = f_θ(x̂) and class label y. By slight overload of notation, denote by Z_y(θ) the set of embeddings for images in S with class label y (the same class as the anchor z). We define the Supervised Noise Contrastive Estimation (SuNCEt) loss as

    ℓ(z) = −log [ Σ_{z_j ∈ Z_y(θ)} exp(sim(z, z_j)/τ) / Σ_{z_k ∈ Z_S(θ) \ {z}} exp(sim(z, z_k)/τ) ],

which is then averaged over all anchors, (1/|S|) Σ_{z ∈ Z_S(θ)} ℓ(z).
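The SuNCEt loss can be sketched in NumPy as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: we exclude the anchor itself from both the same-class numerator and the denominator, and τ = 0.1 is an arbitrary choice; each class is assumed to contribute at least two samples so the numerator is never empty.

```python
import numpy as np

def suncet(z, y, tau=0.1):
    """SuNCEt loss averaged over all anchors.

    z: (m, d) array of embeddings for the labeled batch S.
    y: (m,) array of integer class labels.
    For each anchor, the numerator sums exponentiated similarities to all
    other same-class embeddings; the denominator sums over all other
    embeddings in the batch regardless of class.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine via unit norm
    sim = np.exp(z @ z.T / tau)
    np.fill_diagonal(sim, 0.0)          # drop the anchor from both sums
    same = y[:, None] == y[None, :]     # mask of same-class pairs
    num = (sim * same).sum(axis=1)
    den = sim.sum(axis=1)
    return float(np.mean(-np.log(num / den)))
```

Since the numerator's same-class terms are a subset of the denominator's, the loss is strictly positive and is driven down by pulling same-class embeddings together relative to other-class ones.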



By combining SuNCEt with the contrastive SwAV method of Caron et al. (2020), we also achieve state-of-the-art top-5 accuracy on ImageNet with 10% labels, while cutting the pre-training epochs in half.

