RANKME: ASSESSING THE DOWNSTREAM PERFORMANCE OF PRETRAINED SELF-SUPERVISED REPRESENTATIONS BY THEIR RANK

Abstract

Joint-Embedding Self-Supervised Learning (JE-SSL) has seen rapid development, with the emergence of many method variations but few principled guidelines to help practitioners deploy these methods successfully. The main reason for this pitfall lies in JE-SSL's core principle of not employing any input reconstruction: without visual cues, judging the quality of a learned representation becomes extremely difficult without access to a labeled dataset. We hope to remedy this limitation by providing a single, theoretically motivated criterion that reflects the quality of learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method, coined RankMe, allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels, training, or parameters to tune. Through thorough empirical experiments involving hundreds of repeated training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no loss in final performance compared to the current selection method, which involves dataset labels. We hope that RankMe will facilitate the use of JE-SSL in domains with little or no labeled data.
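To make the criterion concrete, a common definition of effective rank is the exponential of the Shannon entropy of the normalized singular values of the embedding matrix. The sketch below assumes that definition; the function name and the small eps used for numerical stability are illustrative choices, not taken from the text above:

```python
import numpy as np

def effective_rank(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank of an (n_samples, dim) matrix of network outputs.

    Computed as exp(H(p)), where p are the singular values normalized
    to sum to one and H is the Shannon entropy. A matrix whose singular
    values are all equal yields min(n_samples, dim); a rank-1 matrix
    yields a value close to 1.
    """
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / (np.abs(s).sum() + eps)   # normalized singular-value distribution
    p = p[p > 0]                      # drop exact zeros before taking the log
    return float(np.exp(-(p * np.log(p)).sum()))
```

Because the computation only needs the singular values of one forward pass over a batch, it requires no labels, no extra training, and no hyperparameters beyond the numerical eps.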

1. INTRODUCTION

Self-supervised learning (SSL) has made great progress in learning informative data representations in recent years (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Lee et al., 2021; Caron et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Tomasev et al., 2022; Caron et al., 2021; Chen et al., 2021; Li et al., 2022a; Zhou et al., 2022a;b; HaoChen et al., 2021; He et al., 2022), catching up to supervised baselines and even surpassing them in few-shot learning, i.e., when evaluating the SSL model from only a few labeled examples. Although various families of SSL losses have emerged, most are variants of the joint-embedding (JE) framework with a siamese network architecture (Bromley et al., 1994), denoted JE-SSL for short. One technicality must be introduced to make our study precise: JE-SSL uses distinct terms for an input's intermediate outputs. In short, a JE-SSL model typically composes a backbone or encoder network (e.g., a ResNet-50) with a projector network (e.g., a multilayer perceptron). The projector is only employed during training, and we refer to its outputs as embeddings, while the representations used for downstream tasks are taken at the encoder's output.

Although the downstream-task performance of JE-SSL representations may seem impressive, one sobering fact should be noted: all existing methods, hyperparameters, and models of JE-SSL (and thus their performance) were obtained through ad-hoc manual search involving the labels of the training samples. In other words, JE-SSL is tuned by monitoring the supervised performance of the model at hand. Hence, although labels are not directly employed to compute the weight updates, they serve as a proxy signal that tells the JE-SSL designer how to refine their method. This limitation alone prevents the deployment of JE-SSL in challenging domains where few labeled examples are available and such a search cannot be performed.
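As a toy illustration of this terminology, the split between representations (encoder outputs, kept for downstream tasks) and embeddings (projector outputs, used only by the training loss) can be sketched as follows; the dimensions and random linear maps are hypothetical stand-ins for the real networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice the encoder is e.g. a ResNet-50
# and the projector is a small MLP; here both are random linear maps.
W_enc = rng.normal(size=(128, 2048)) * 0.02   # encoder: input -> representation
W_proj = rng.normal(size=(2048, 256)) * 0.02  # projector: representation -> embedding

def encoder(x):
    # Representation: the output reused for downstream tasks after pretraining.
    return np.maximum(x @ W_enc, 0.0)

def projector(h):
    # Embedding: consumed only by the JE-SSL loss, then discarded.
    return h @ W_proj

x = rng.normal(size=(4, 128))          # a batch of 4 inputs
representation = encoder(x)            # shape (4, 2048)
embedding = projector(representation)  # shape (4, 256)
```

The JE-SSL loss would be applied to `embedding`, while `representation` is what a downstream linear probe or fine-tuning stage would consume.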
Adding to the challenge, one milestone of JE-SSL is to move away from reconstruction-based learning; hence, without labels and without visual cues, tuning JE-SSL methods on unlabeled datasets remains challenging. This has led to feature-inversion methods, e.g., Deep Image Prior (Ulyanov et al., 2018) or conditional diffusion models (Bordes et al., 2021), being deployed on learned JE-SSL representations to try to visualize the

