NO REASON FOR NO SUPERVISION: IMPROVED GENERALIZATION IN SUPERVISED MODELS

Abstract

We consider the problem of training a deep neural network on a given classification task, e.g., ImageNet-1K (IN1K), so that it excels at both the training task and at other (future) transfer tasks. These two seemingly contradictory properties impose a trade-off between improving the model's generalization and maintaining its performance on the original task. Models trained with self-supervised learning tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on IN1K. In this paper, we propose a supervised learning setup that leverages the best of both worlds. We extensively analyze supervised training with multi-crop data augmentation and an expendable projector head, and reveal that the design of the projector allows us to control the trade-off between performance on the training task and transferability. We further replace the last layer of class weights with class prototypes computed on the fly using a memory bank, and derive two models: t-ReX, which achieves a new state of the art for transfer learning and outperforms top methods such as DINO and PAWS on IN1K, and t-ReX*, which matches the highly optimized RSB-A1 model on IN1K while performing better on transfer tasks.

1. INTRODUCTION

Figure 1: We present t-ReX and t-ReX*, two ResNet50 models trained with an improved supervised learning setup on ImageNet (IN1K), with strong performance on both transfer learning (y-axis, averaged over 13 tasks) and IN1K (x-axis).

Deep convolutional neural networks trained on large annotated image sets like ImageNet-1K (IN1K) (Russakovsky et al., 2015) have shown strong generalization properties. This has motivated their application to a broad range of transfer tasks, including the recognition of concepts that are not encountered during training (Donahue et al., 2014; Razavian et al., 2014). Recently, models trained in a self-supervised learning (SSL) framework have become popular due to their ability to learn without manual annotations, as well as their capacity to surpass supervised models in the context of transferable visual representations. SSL models like MoCo (He et al., 2020), SwAV (Caron et al., 2020), BYOL (Grill et al., 2020) or DINO (Caron et al., 2021) exhibit stronger transfer learning performance than models (Wightman et al., 2021) trained on the same data with annotations (Sariyildiz et al., 2021). This achievement is on the one hand exciting, as SSL approaches do not require an expensive and error-prone annotation process, but it is also seemingly counter-intuitive (Wang et al., 2022b), as it suggests that access to additional information, i.e., image labels, actually hinders the generalization properties of a model. Models learned via SSL are, however, not able to match their supervised counterparts on IN1K classification, i.e., on the concepts seen during training. Top-performing SSL and semi-supervised methods like DINO (Caron et al., 2021) or PAWS (Assran et al., 2021) still yield 3-5% lower top-1 accuracy than optimized supervised models such as RSB-A1 (Wightman et al., 2021).
In this paper, we argue that access to more information (in the form of manual annotations) should not hurt generalization, and we seek to improve the transferability of encoders learned in a supervised manner while retaining their state-of-the-art performance on the supervised training task. The mismatch observed between IN1K and transfer performance suggests that this goal is not trivial. It has been shown, for example, that popular regularization techniques such as Label Smoothing (Szegedy et al., 2016), Dropout (Srivastava et al., 2014) or CutMix (Yun et al., 2019), which improve IN1K performance, actually lead to less transferable representations (Kornblith et al., 2021; Sariyildiz et al., 2021), and that representations learned by models that underfit their original task transfer better (Zhang et al., 2022). We identify two key training components from the most successful SSL approaches that may lead to more transferable representations: multi-crop data augmentation (Caron et al., 2020) and the use of an expendable projector head, i.e., an auxiliary module added after the encoder during training and discarded at test time (Chen et al., 2020a). We study the impact of these two components on transfer performance together with performance on the training task, and present novel insights on the role of the projector design in this context. Furthermore, inspired by recent work on supervised learning (Feng et al., 2022; Khosla et al., 2020), we introduce Online Class Means, a memory-efficient variant of the Nearest Class Means classifier (Mensink et al., 2012) that computes class prototypes in an "online" manner with the help of a memory queue. This further increases performance. We perform an extensive analysis of how each component affects the learned representations, and look at feature sparsity and redundancy as well as intra-class distance.
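To make the online-prototype idea concrete, the following is a minimal NumPy sketch of computing class means "online" from a fixed-size memory queue. The class name, queue size, circular-overwrite policy, and cosine-similarity logits are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

class OnlineClassMeans:
    """Illustrative sketch: class prototypes as per-class means over a
    fixed-size feature queue (sizes and scoring are assumptions)."""

    def __init__(self, num_classes, feat_dim, queue_size):
        self.num_classes = num_classes
        self.queue_size = queue_size
        self.queue_feats = np.zeros((queue_size, feat_dim))
        self.queue_labels = np.full(queue_size, -1, dtype=int)  # -1 = empty slot
        self.ptr = 0

    def enqueue(self, feats, labels):
        # Overwrite the oldest entries in circular (FIFO) fashion.
        n = len(labels)
        idx = (self.ptr + np.arange(n)) % self.queue_size
        self.queue_feats[idx] = feats
        self.queue_labels[idx] = labels
        self.ptr = (self.ptr + n) % self.queue_size

    def prototypes(self):
        # Mean feature per class over the valid queue entries,
        # then L2-normalized; classes absent from the queue stay zero.
        protos = np.zeros((self.num_classes, self.queue_feats.shape[1]))
        for c in range(self.num_classes):
            mask = self.queue_labels == c
            if mask.any():
                protos[c] = self.queue_feats[mask].mean(axis=0)
        norms = np.linalg.norm(protos, axis=1, keepdims=True)
        return protos / np.maximum(norms, 1e-12)

    def logits(self, feats, temperature=0.1):
        # Cosine similarity to the prototypes, which would be fed to a
        # softmax cross-entropy loss during training.
        feats = feats / np.maximum(
            np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
        return feats @ self.prototypes().T / temperature
```

At each training step, one would enqueue the current batch's (detached) features and labels, then score the batch against the refreshed prototypes, so the "class weights" are always a function of recent features rather than learned parameters.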
We also study the training dynamics and show that class prototypes and classifier weights evolve differently across iterations. We single out the two ResNet50 instantiations that perform best along one of the two dimensions (transfer learning and IN1K), denoted t-ReX and t-ReX*. t-ReX exceeds the state-of-the-art transfer learning performance of DINO (Caron et al., 2021) and PAWS (Assran et al., 2021) while still performing much better than these two on IN1K classification. t-ReX* outperforms the state-of-the-art results of RSB-A1 (Wightman et al., 2021) on IN1K while generalizing better to transfer tasks. We visualize the performance of these two selected models, together with that of other top-performing configurations from our setup, in Fig. 1, and compare them to state-of-the-art supervised, semi-supervised and self-supervised learning methods across two dimensions: IN1K accuracy and mean transfer accuracy over 13 transfer tasks. This intuitively conveys how the proposed training setup pushes the envelope of the training-versus-transfer performance trade-off (from the "Previous SotA" region to the "New SotA" one in Fig. 1) and offers strong pretrained visual encoders that future approaches can build on.

Contributions. We propose a supervised training setup that incorporates multi-crop data augmentation and an expendable projector, and produces models with favorable performance both on the training task of IN1K and on diverse transfer tasks. We thoroughly ablate this setup and reveal that the design of the projector allows us to control the performance trade-off between these two dimensions, while a number of analyses of the features and class weights give insights into how each component of our setup affects training and the learned representations. We also introduce Online Class Means, a prototype-based training objective that increases performance even further and yields state-of-the-art models for transfer learning (t-ReX) and IN1K (t-ReX*).
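The expendable-projector component above can be sketched as follows: during training, encoder features pass through a small MLP head before the loss, and at test or transfer time the head is simply dropped. The two-layer design and the layer sizes below are illustrative assumptions (the paper ablates the projector's depth and width), not the exact configuration.

```python
import numpy as np

class MLPProjector:
    """Illustrative expendable projector head: an MLP appended to the
    encoder during training and discarded afterwards. Layer sizes here
    are assumptions, not the paper's exact configuration."""

    def __init__(self, dims=(2048, 4096, 256), seed=0):
        rng = np.random.default_rng(seed)
        # He-style initialization for each linear layer.
        self.weights = [
            rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)
            for d_in, d_out in zip(dims[:-1], dims[1:])
        ]

    def __call__(self, x):
        for k, w in enumerate(self.weights):
            x = x @ w
            if k < len(self.weights) - 1:
                x = np.maximum(x, 0.0)  # ReLU on hidden layers only
        return x

# Training time:     loss is computed on projector(encoder(image)).
# Test/transfer time: only encoder(image) is used; the projector is discarded.
```

Changing the depth and width of this head is the knob that, per the ablations described above, moves a model along the training-task-versus-transfer trade-off.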

2. RELATED WORK

Visual representations learned by deep networks for IN1K classification can transfer to other tasks and datasets (Donahue et al., 2014; Razavian et al., 2014). This generalization capability has motivated researchers to propose practical approaches for measuring transfer learning (Goyal et al., 2019; Pándy et al., 2022; Zhai et al., 2019) or to contribute to a formal understanding of generalization properties (Kornblith et al., 2019; Tripuraneni et al., 2020; Yosinski et al., 2014). Recent work in this context (Kornblith et al., 2021; Sariyildiz et al., 2021) shows that the best representations for IN1K are not necessarily the ones that transfer best. For instance, some regularization techniques or loss functions that improve IN1K classification lead to underwhelming transfer results. A parallel line of work based on self-supervised learning (Caron et al., 2020; Chen et al., 2020a; Grill et al., 2020) focuses on training models without manual labels, and demonstrates their strong generalization to many transfer datasets, clearly surpassing their supervised counterparts (Sariyildiz et al., 2021). Yet, as expected, SSL models are no match for supervised models on the IN1K classification task itself.

