PUSHING THE LIMITS OF SELF-SUPERVISED RESNETS: CAN WE OUTPERFORM SUPERVISED LEARNING WITHOUT LABELS ON IMAGENET?

Abstract

Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from RELIC [Mitrovic et al., 2021], we incorporate additional inductive biases into self-supervised learning. We propose a new self-supervised representation learning method, RELICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views to avoid learning spurious correlations and to obtain more informative representations. RELICv2 achieves 77.1% top-1 classification accuracy on ImageNet under linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, RELICv2 is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison over a range of ResNet architectures. Finally, we show that despite using ResNet encoders, RELICv2 is comparable to state-of-the-art self-supervised vision transformers.

1. INTRODUCTION

Large-scale foundation models [Bommasani et al., 2021] - in particular for language [Devlin et al., 2018; Brown et al., 2020] and multimodal domains [Radford et al., 2021] - are an important recent development in representation learning. The idea that massive models can be trained without labels in an unsupervised (or self-supervised) manner and then readily adapted, in a few- or zero-shot setting, to perform well on a variety of downstream tasks is important for many problem areas in which labeled data is expensive or impractical to obtain. Contrastive objectives have emerged as a successful strategy for representation learning [Chen et al., 2020a; He et al., 2019]. However, the downstream utility of these representations has until now never exceeded the performance of supervised training of the same architecture, limiting their usefulness. In this work, we tackle the question "Can we outperform supervised learning without labels on ImageNet?". As such, our focus is on learning good representations for high-level vision tasks such as image classification. In supervised learning we have access to label information, which signals what features are relevant for classification. In unsupervised learning there is no such signal, and we have strictly less information to learn from than in supervised learning. Thus, the challenge of outperforming supervised learning without labels might seem impossible. Comparing supervised and contrastive objectives, we see two very different approaches to learning that yield significantly different representations. While supervised approaches use labels as targets within a cross-entropy objective, contrastive methods rely on comparing against similar and dissimilar datapoints.
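The contrastive comparison described above is typically implemented with an InfoNCE-style objective. The sketch below is a generic simplification for illustration (not the exact loss of any method cited here); `z1` and `z2` are assumed to be L2-normalised embeddings of two augmented views of the same batch, so row i of `z2` is the positive for row i of `z1` and all other rows act as negatives:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Minimal InfoNCE-style contrastive loss sketch.

    z1, z2: L2-normalised embeddings of two views, shape (N, D).
    """
    sim = z1 @ z2.T / temperature                 # (N, N) pairwise similarities
    sim = sim - sim.max(axis=1, keepdims=True)    # subtract row max for stability
    # Log-probability of the positive (diagonal) under a softmax over the row.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n = sim.shape[0]
    return -log_prob[np.arange(n), np.arange(n)].mean()
```

When the two views collapse to identical one-hot embeddings, the positives dominate the softmax and the loss approaches zero, which matches the intuition that the objective pulls matching views together.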
Thus, supervised representations end up encoding a small set of features that are highly informative for downstream performance, while contrastive representations encode many more features, some of which are unrelated to downstream performance. This intuition is also supported by the observation that, when the encoded information of contrastive representations is visualized through reconstruction, they retain more detailed information about the original image, such as background and style, than supervised representations do [Bordes et al., 2021]. Based on this, we hypothesize that one of the key reasons for the current subpar performance of contrastive (and thus self-supervised) representations is the presence of features which are not directly related to downstream tasks, i.e. so-called spurious features. In general, basing representations on spurious features can hurt the generalization performance of the model, so avoiding encoding these features is paramount for learning informative representations. In this paper, we propose to equip self-supervised methods with additional inductive biases to obtain more informative representations and to overcome the lack of the extra information that supervised methods have access to. As our base self-supervised approach we use the performant RELIC method [Mitrovic et al., 2021], which combines a contrastive loss with an invariance loss. We propose to extend RELIC by adding inductive biases that penalize the learning of spurious features such as background and style (e.g. brightness); we denote this method RELICv2. First, we propose a new fully unsupervised saliency masking pipeline which enables us to separate the image foreground from its background. We include this novel saliency masking approach as part of the data augmentation pipeline. Leveraging the invariance loss, this enables us to learn representations that do not rely on the spurious feature of background.
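As a rough illustration, saliency masking used as an augmentation might look as follows. This is a sketch under the assumption that a per-pixel saliency map in [0, 1] has already been produced by some unsupervised saliency estimator; the paper's actual pipeline differs in its details:

```python
import numpy as np

def saliency_mask(img, saliency, threshold=0.5, fill=0.0):
    """Hypothetical saliency-masking augmentation.

    img:      image array of shape (H, W, C).
    saliency: per-pixel saliency map in [0, 1], shape (H, W).
    Pixels whose saliency falls below `threshold` (the background)
    are replaced with a constant `fill` value; the salient
    foreground is kept unchanged.
    """
    mask = (saliency >= threshold)[..., None]   # (H, W, 1) boolean foreground mask
    return np.where(mask, img, fill)
```

Feeding both the original and the background-masked view through an invariance loss then discourages the encoder from relying on background features.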
Second, while RELIC operates on just two data views of the same size, RELICv2 uses multiple data views of varying sizes. Specifically, we argue for learning from a large number of data views of standard size together with a small number of data views that encode only a small part of the image. The intuition for using smaller crops is that they enable learning of more robust representations, since not all features of the foreground need be contained in a small crop. The learned representation is thus robust to individual object features being absent, as is the case in many real-world settings, e.g. when only parts of an object are visible because of occlusion. On the other hand, using multiple large crops enables us to learn representations that are invariant to object style. We extend the contrastive and invariance losses of RELIC to operate over multiple views of varying sizes. Empirically, we show that RELICv2 achieves state-of-the-art performance in self-supervised learning on a wide range of ResNet architectures (ResNet50, ResNet50 2×, ResNet50 4×, ResNet101, ResNet152, ResNet200, ResNet200 2×). As shown in figure 1, RELICv2 is the first self-supervised representation learning method that outperforms a standard supervised baseline on linear ImageNet evaluation across a wide range of ResNet architectures. On top-1 classification accuracy on ImageNet, RELICv2 achieves 77.1% with a ResNet50 and 80.6% with a ResNet200 2×. Furthermore, RELICv2 learns representations that achieve state-of-the-art performance on a wide range of downstream tasks and datasets, including semi-supervised and transfer learning, robustness and out-of-distribution generalisation. Despite using ResNets, RELICv2 demonstrates performance comparable to recent vision transformers (figure 6). These strong experimental results across different vision tasks, learning settings and datasets showcase the generality of our proposed representation learning method.
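The multi-view sampling described above can be sketched as follows. The crop counts and sizes here are illustrative placeholders, not the paper's actual settings:

```python
import numpy as np

def random_crop(img, size, rng):
    """Take one random square crop of side `size` from an (H, W, C) array."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def make_views(img, n_large=4, n_small=2, large=160, small=96, seed=0):
    """Hypothetical multi-view sampling: several standard-size crops
    plus a few small crops that cover only part of the image."""
    rng = np.random.default_rng(seed)
    large_views = [random_crop(img, large, rng) for _ in range(n_large)]
    small_views = [random_crop(img, small, rng) for _ in range(n_small)]
    return large_views, small_views
```

The contrastive and invariance losses would then be summed over pairs drawn from this pool of views, so that representations agree across both large and partial crops.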
Moreover, we believe that the concepts and results developed in this work could have important implications for wider adoption of self-supervised pre-training in a variety of domains, as well as for the design of objectives for foundational machine learning systems.

Summary of contributions. We tackle the question "Can we outperform supervised learning without labels on ImageNet?". We propose to extend the self-supervised method RELIC [Mitrovic et al., 2021] with inductive biases to learn more informative representations. We develop a fully unsupervised saliency masking pipeline and use multiple data views of varying sizes to encode these inductive biases. The resulting method, RELICv2, achieves state-of-the-art performance in self-supervised learning across a wide range of ResNet architectures of different depths and widths. Furthermore, RELICv2 is the first self-supervised representation learning method that outperforms a standard supervised ResNet50 baseline on linear ImageNet evaluation across 1×, 2× and 4× ResNet50 variants.



Downstream utility is commonly measured by how well a method performs under the standard linear evaluation protocol on ImageNet; see section 3.
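A minimal version of this protocol freezes the encoder and fits a linear classifier on its features. The sketch below uses a closed-form ridge fit on frozen features as a stand-in for the logistic-regression or SGD-trained head used in the standard protocol; it illustrates the idea, not the exact evaluation recipe:

```python
import numpy as np

def linear_eval(features, labels, test_features, test_labels, l2=1e-2):
    """Linear-probe sketch: fit a ridge classifier on frozen features
    (one-hot regression targets) and report test accuracy."""
    n, d = features.shape
    k = labels.max() + 1
    y = np.eye(k)[labels]                          # (N, K) one-hot targets
    # Closed-form ridge solution: (X^T X + l2 I)^-1 X^T Y.
    w = np.linalg.solve(features.T @ features + l2 * np.eye(d),
                        features.T @ y)            # (D, K) weights
    preds = (test_features @ w).argmax(axis=1)
    return (preds == test_labels).mean()
```

The encoder's parameters are never updated during this step, so the reported accuracy measures only how linearly separable the frozen representation is.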



Figure 1: Top-1 linear evaluation accuracy on ImageNet using ResNet50 encoders with 1×, 2×, 4× width multipliers and a ResNet200 with a 2× width multiplier.

