PUSHING THE LIMITS OF SELF-SUPERVISED RESNETS: CAN WE OUTPERFORM SUPERVISED LEARNING WITHOUT LABELS ON IMAGENET?

Abstract

Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from RELIC [Mitrovic et al., 2021], we include additional inductive biases into self-supervised learning. We propose a new self-supervised representation learning method, RELICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views to avoid learning spurious correlations and obtain more informative representations. RELICv2 achieves 77.1% top-1 classification accuracy on ImageNet under linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, RELICv2 is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison across a range of ResNet architectures. Finally, we show that despite using ResNet encoders, RELICv2 is comparable to state-of-the-art self-supervised vision transformers.

1. INTRODUCTION

Large-scale foundation models [Bommasani et al., 2021], in particular for language [Devlin et al., 2018; Brown et al., 2020] and multimodal domains [Radford et al., 2021], are an important recent development in representation learning. The idea that massive models can be trained without labels in an unsupervised (or self-supervised) manner and then readily adapted, in a few- or zero-shot setting, to perform well on a variety of downstream tasks is important for many problem areas where labeled data is expensive or impractical to obtain. Contrastive objectives have emerged as a successful strategy for representation learning [Chen et al., 2020a; He et al., 2019]. However, the downstream utility[1] of these representations has until now never exceeded the performance of supervised training of the same architecture, limiting their usefulness. In this work, we tackle the question "Can we outperform supervised learning without labels on ImageNet?". Our focus is therefore on learning good representations for high-level vision tasks such as image classification. In supervised learning we have access to label information, which signals what features are relevant for classification. In unsupervised learning there is no such signal, and we have strictly less information to learn from than in supervised learning. Thus, the challenge of outperforming supervised learning without labels might seem impossible. Supervised and contrastive objectives take very different approaches to learning representations and yield significantly different representations. While supervised approaches use labels as targets within a cross-entropy objective, contrastive methods rely on comparing against similar and dissimilar datapoints.
Thus, supervised representations end up encoding a small set of features that are highly informative for downstream performance, while contrastive representations encode many more features, some of which are unrelated to downstream performance. This intuition is also supported by the observation that when visualizing the encoded
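The contrast between the two objectives can be made concrete with a small sketch. The snippet below implements a generic InfoNCE-style contrastive loss in the spirit of SimCLR [Chen et al., 2020a]: each image's two augmented views form a positive pair, and all other images in the batch serve as negatives. This is not the RELICv2 objective (which additionally imposes an explicit invariance loss over constructed data views); the batch size, embedding dimension, and temperature are illustrative assumptions.

```python
# Minimal sketch of a generic InfoNCE-style contrastive loss (SimCLR-like),
# NOT the RELICv2 objective. It illustrates "comparing against similar and
# dissimilar datapoints" as opposed to cross-entropy against class labels.
import numpy as np

def infonce_loss(z1, z2, temperature=0.1):
    """z1, z2: (N, D) L2-normalised embeddings of two views of N images."""
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # Positive pairs lie on the diagonal; all other entries act as negatives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
# Second view: a small perturbation of the first, as augmentation would give.
z2 = z1 + 0.05 * rng.normal(size=(8, 16))
z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
print(infonce_loss(z1, z2))
```

Minimising this loss pulls the two views of each image together while pushing apart embeddings of different images, with no reference to labels; cross-entropy training instead pulls every image directly toward its class target.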



[1] Downstream utility is commonly measured by how well a method performs under the standard linear evaluation protocol on ImageNet; see Section 3.
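As a hedged illustration of this protocol (not the paper's exact setup), the sketch below mimics linear evaluation: the pretrained encoder is frozen, features are extracted once, and only a linear classifier is fit on top. Synthetic Gaussian features stand in for real ResNet embeddings of ImageNet, and the class count, dimensionality, noise level, and train/test split are arbitrary assumptions.

```python
# Sketch of linear evaluation: fit ONLY a linear classifier on frozen
# features. Synthetic features stand in for real ResNet embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n = 4, 32, 512

# Pretend the frozen encoder maps each image near a class-dependent center.
centers = rng.normal(size=(n_classes, dim))
labels = rng.integers(0, n_classes, size=n)
features = centers[labels] + 0.5 * rng.normal(size=(n, dim))

# Linear evaluation: least-squares fit of a linear map onto one-hot targets;
# the "encoder" that produced `features` is never updated.
onehot = np.eye(n_classes)[labels[:400]]
X_train = np.hstack([features[:400], np.ones((400, 1))])  # bias column
W, *_ = np.linalg.lstsq(X_train, onehot, rcond=None)

X_test = np.hstack([features[400:], np.ones((n - 400, 1))])
top1 = np.mean((X_test @ W).argmax(axis=1) == labels[400:])
print(f"linear-eval top-1 accuracy: {top1:.2f}")
```

Under this protocol, top-1 accuracy reflects how linearly separable the frozen representation already is, which is why it serves as the standard proxy for downstream utility.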

