RANKME: ASSESSING THE DOWNSTREAM PERFORMANCE OF PRETRAINED SELF-SUPERVISED REPRESENTATIONS BY THEIR RANK

Abstract

Joint-Embedding Self-Supervised Learning (JE-SSL) has seen rapid development, with the emergence of many method variations but few principled guidelines that would help practitioners successfully deploy them. The main reason for this pitfall comes from JE-SSL's core principle of not employing any input reconstruction. Without any visual clue, it becomes extremely difficult to judge the quality of a learned representation without access to a labelled dataset. We hope to correct those limitations by providing a single, theoretically motivated criterion that reflects the quality of learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method, coined RankMe, allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels, training, or parameters to tune. Through thorough empirical experiments involving hundreds of repeated training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no loss in final performance compared to the current selection method, which involves dataset labels. We hope that RankMe will facilitate the use of JE-SSL in domains with little or no labeled data.

1. INTRODUCTION

Self-supervised learning (SSL) has made great progress in learning informative data representations in recent years (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Lee et al., 2021; Caron et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Tomasev et al., 2022; Caron et al., 2021; Chen et al., 2021; Li et al., 2022a; Zhou et al., 2022a;b; HaoChen et al., 2021; He et al., 2022), catching up to supervised baselines and even surpassing them in few-shot learning, i.e., when evaluating the SSL model from only a few labeled examples. Although various families of SSL losses have emerged, most are variants of the joint-embedding (JE) framework with a Siamese network architecture (Bromley et al., 1994), denoted JE-SSL for short. The only technicality we ought to introduce to make our study precise concerns the different notations JE-SSL uses for an input's representation. In short, JE-SSL methods often compose a backbone or encoder network, e.g., a ResNet-50, with a projector network, e.g., a multilayer perceptron. The projector is only employed during training, and we refer to its outputs as embeddings, while the actual input representations employed for downstream tasks are obtained at the encoder's output. Although the downstream-task performance of JE-SSL representations might seem impressive, one sobering fact should be noted: all existing methods, hyperparameters, and models, and thus all reported performance, of JE-SSL are obtained by ad-hoc manual search involving the labels of the training samples. In other words, JE-SSL is tuned by monitoring the supervised performance of the model at hand. Hence, although labels are not directly employed to compute the weight updates, they are used as a proxy signal telling the JE-SSL designer how to refine their method. This single limitation prevents the deployment of JE-SSL in challenging domains where the number of available labelled examples is limited and such a search cannot be performed.
Adding to the challenge, one milestone of JE-SSL is to move away from reconstruction-based learning; hence, without labels and without visual cues, tuning JE-SSL methods on unlabeled datasets remains challenging. This has led to feature inversion methods, e.g., Deep Image Prior (Ulyanov et al., 2018) or conditional diffusion models (Bordes et al., 2021), being deployed on learned JE-SSL representations to try to visualize the learned features. This first step towards removing the need for labels has seen some success but is hampered by the computational complexity of these methods and their bias towards natural images, i.e., it is not clear how such methods would perform on other data modalities. In this study we propose RankMe, which is able to assess a model's performance without access to any labels and without requiring any training or tuning. RankMe accurately predicts a model's performance both In-Distribution (ID), i.e., on the same data distribution as used during JE-SSL training, and Out-Of-Distribution (OOD), i.e., on different data distributions. We highlight this property at the top of fig. 1. The strength of RankMe lies in the fact that it is solely based on the singular value distribution of the learned embeddings, and thus does not rely on any parameters that need training, nor does it require any ID/OOD labels. In fact, RankMe's motivation hinges on Cover's theorem (Cover, 1965), which states how increasing the rank of a linear classifier's input increases its training performance, together with three simple hypotheses that we summarize below and thoroughly validate empirically. As such, RankMe provides a step towards fully unlabeled JE-SSL by allowing practitioners to cross-validate hyperparameters and select models without resorting to labels or feature inversion methods. We hope that RankMe will enable JE-SSL to be deployed even in challenging domains that possess little or no labelled data; we summarize our contributions below:
1. We introduce RankMe (eq. (1)) and motivate it by combining Cover's theorem with the following three key hypotheses: (H1) increasing training performance increases testing performance on both representations and embeddings, i.e., no over-fitting is observed from the (non)linear probe (validated empirically in the bottom left of fig. 2); (H2) embeddings' ranks scale consistently between datasets, i.e., for models pretrained on the same dataset, if one set of embeddings has a greater rank than another on one dataset, the same ordering holds on other datasets (validated empirically in the top row of fig. 2); (H3) increasing embeddings' performance increases representations' performance (validated empirically in the bottom right of fig. 2). From these we conclude that embeddings with greater rank will have greater train performance (Cover's theorem) and test performance (H1) in ID and OOD cases (H2), even before the projector (H3).
2. We demonstrate that RankMe's ability to assess JE-SSL downstream performance is robust across methods, e.g., VICReg, SimCLR, and their variants, and is also robust to architecture changes, e.g., using a projector network and/or a nonlinear evaluation method (see fig. 3 and section 3.3).
3. We demonstrate that RankMe enables hyperparameter cross-validation for any given JE-SSL method; RankMe is able to retrieve, and sometimes surpass, most of the performance previously found by manual search using labels, on both in-domain and out-of-domain datasets; see the bottom of fig. 1 and table 1.

We provide a hyperparameter-free, numerically stable implementation of RankMe in section 3.1 and pseudo-code for cross-validation in fig. 5. Through extensive experiments involving 11 different datasets and more than 85 trained models over 4 methods, we demonstrate that in the linear and nonlinear probing regimes, RankMe is able to tell apart high- and low-performing models, even on different downstream tasks, without having access to labels or downstream-task data samples.

2. RELATED WORKS

Joint embedding self-supervised learning (JE-SSL). In JE-SSL, two main families of methods can be distinguished: contrastive and non-contrastive. Contrastive methods (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; 2021; Yeh et al., 2021) mostly rely on the InfoNCE criterion (Oord et al., 2018), except for HaoChen et al. (2021), which uses squared similarities between the embeddings. A clustering variant of contrastive learning has also emerged (Caron et al., 2018; 2020; 2021) and can be thought of as contrastive, but between cluster centroids instead of samples. Non-contrastive methods (Grill et al., 2020; Chen & He, 2020; Caron et al., 2021; Bardes et al., 2021; Zbontar et al., 2021; Ermolov et al., 2021; Li et al., 2022b) aim at bringing together the embeddings of positive samples, similarly to contrastive learning. However, a key difference lies in how these methods prevent representational collapse. In the former, the criterion explicitly pushes negative samples, i.e., all samples that are not positive, away from each other. In the latter, the criterion does not prevent collapse by distinguishing positive and negative samples, but instead considers the embeddings as a whole and encourages information content maximization, e.g., by regularizing the embeddings' covariance matrix.
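Since eq. (1) is not reproduced in this excerpt, the following is a minimal sketch of the score the text describes: an entropy-based effective rank computed from the normalized singular-value spectrum of the embedding matrix. The function name `rankme` and the choice of epsilon are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def rankme(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """Entropy-based effective rank of an (n_samples, dim) embedding matrix.

    Each singular value is normalized by the L1 norm of the spectrum
    (with a small eps for numerical stability); the effective rank is
    the exponential of the Shannon entropy of that distribution.
    """
    s = np.linalg.svd(embeddings, compute_uv=False)  # singular values
    p = s / (np.sum(s) + eps) + eps                  # normalized spectrum
    return float(np.exp(-np.sum(p * np.log(p))))     # exp(entropy)
```

A near-full-rank Gaussian embedding matrix of width d scores close to d, while collapsed (low-rank) embeddings score close to their true rank, which is the behavior the selection criterion relies on.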

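The label-free cross-validation referenced in contribution 3 (pseudo-code in fig. 5, not reproduced here) amounts to computing the rank score of each trained model's embeddings on unlabeled data and keeping the configuration with the highest score. A self-contained sketch under that reading, where the dictionary of per-configuration embedding matrices and the helper `smooth_rank` are hypothetical names:

```python
import numpy as np

def smooth_rank(z: np.ndarray, eps: float = 1e-7) -> float:
    # Entropy-based effective rank of the embedding matrix z.
    s = np.linalg.svd(z, compute_uv=False)
    p = s / (np.sum(s) + eps) + eps
    return float(np.exp(-np.sum(p * np.log(p))))

def select_by_rank(embeddings_per_config: dict) -> str:
    # Pick the hyperparameter configuration whose embeddings have the
    # largest effective rank -- no labels or probes are needed.
    return max(embeddings_per_config,
               key=lambda cfg: smooth_rank(embeddings_per_config[cfg]))
```

Note that this replaces a linear-probe sweep over labeled data with a single SVD per candidate model, which is what makes the procedure usable on unlabeled domains.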
