SUPERVISION ACCELERATES PRE-TRAINING IN CONTRASTIVE SEMI-SUPERVISED LEARNING OF VISUAL REPRESENTATIONS

Abstract

We investigate a strategy for improving the efficiency of contrastive learning of visual representations by leveraging a small amount of supervised information during pre-training. We propose a semi-supervised loss, SuNCEt, based on noise-contrastive estimation and neighbourhood component analysis, that aims to distinguish examples of different classes in addition to the self-supervised instance-wise pretext tasks. On ImageNet, we find that SuNCEt can be used to match the semi-supervised learning accuracy of previous contrastive approaches while using less than half the amount of pre-training and compute. Our main insight is that leveraging even a small amount of labeled data during pre-training, and not only during fine-tuning, provides an important signal that can significantly accelerate contrastive learning of visual representations.

1. INTRODUCTION

Learning visual representations that are semantically meaningful with limited semantic annotations is a longstanding challenge with the potential to drastically improve the data-efficiency of learning agents. Semi-supervised learning algorithms based on contrastive instance-wise pretext tasks learn representations with limited label information and have shown great promise (Hadsell et al., 2006; Wu et al., 2018b; Bachman et al., 2019; Misra & van der Maaten, 2020; Chen et al., 2020a). Unfortunately, despite achieving state-of-the-art performance, these semi-supervised contrastive approaches typically require at least an order of magnitude more compute than standard supervised training with a cross-entropy loss (albeit without requiring access to the same amount of labeled data). Burdensome computational requirements not only make training laborious and particularly time- and energy-consuming; they also exacerbate other issues, making it more difficult to scale to more complex models and problems, and potentially inducing significant carbon footprints depending on the infrastructure used for training (Henderson et al., 2020).

In this work, we investigate a strategy for improving the computational efficiency of contrastive learning of visual representations by leveraging a small amount of supervised information during pre-training. We propose a semi-supervised loss, SuNCEt, based on noise-contrastive estimation (Gutmann & Hyvärinen, 2010) and neighbourhood component analysis (Goldberger et al., 2005), that aims at distinguishing examples of different classes in addition to the self-supervised instance-wise pretext tasks. We conduct a case study with respect to the approach of Chen et al. (2020a) on the ImageNet (Russakovsky et al., 2015) and CIFAR10 (Krizhevsky & Hinton, 2009) benchmarks. We find that using any available labels during pre-training (either in the form of a cross-entropy loss or SuNCEt) reduces the amount of pre-training required.
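To make the idea concrete, the following is a minimal sketch of an NCA-style supervised noise-contrastive loss in the spirit described above: each labeled anchor is contrasted against the other labeled examples in the batch, with same-class examples acting as positives. This is a hypothetical illustration (function name `suncet_loss`, temperature value, and cosine similarity are our assumptions), not the paper's exact formulation.

```python
import numpy as np

def suncet_loss(z, y, temperature=0.1):
    """Sketch of an NCA-style supervised contrastive loss.

    Hypothetical implementation; the paper's exact form may differ.
    z: (n, d) array of embeddings
    y: (n,) array of integer class labels
    """
    # Normalize so the dot product is cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(y)
    losses = []
    for i in range(n):
        others = np.arange(n) != i            # exclude the anchor itself
        pos = others & (y == y[i])            # same-class examples
        if not pos.any():
            continue                          # anchor has no positive in batch
        # -log( sum over positives / sum over all non-anchor examples )
        log_denom = np.log(np.exp(sim[i][others]).sum())
        log_num = np.log(np.exp(sim[i][pos]).sum())
        losses.append(log_denom - log_num)
    return float(np.mean(losses))
```

When same-class embeddings cluster tightly and other classes lie far away, the loss is near zero; when embeddings are mixed across classes, it grows, so minimizing it pulls labeled examples of the same class together, complementing the instance-wise pretext task.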
Our most notable results on ImageNet are obtained with SuNCEt, where we match the semi-supervised learning accuracy of previous contrastive approaches while using less than half the amount of pre-training and compute, and while requiring no hyper-parameter tuning.



By combining SuNCEt with the contrastive SwAV method of Caron et al. (2020), we also achieve state-of-the-art top-5 accuracy on ImageNet with 10% labels, while cutting the pre-training epochs in half.¹

