RANKME: ASSESSING THE DOWNSTREAM PERFORMANCE OF PRETRAINED SELF-SUPERVISED REPRESENTATIONS BY THEIR RANK

Abstract

Joint-Embedding Self-Supervised Learning (JE-SSL) has seen rapid development, with the emergence of many method variations but few principled guidelines that would help practitioners successfully deploy them. The main reason for this pitfall comes from JE-SSL's core principle of not employing any input reconstruction. Without any visual clue, it becomes extremely difficult to judge the quality of a learned representation without access to a labelled dataset. We hope to correct those limitations by providing a single, theoretically motivated criterion that reflects the quality of learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method, coined RankMe, allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels, training, or parameters to tune. Through thorough empirical experiments involving hundreds of repeated training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no loss in final performance compared to the current selection method, which involves dataset labels. We hope that RankMe will facilitate the use of JE-SSL in domains with little or no labeled data.

1. INTRODUCTION

Self-supervised learning (SSL) has shown great progress in learning informative data representations in recent years (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Lee et al., 2021; Caron et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Tomasev et al., 2022; Caron et al., 2021; Chen et al., 2021; Li et al., 2022a; Zhou et al., 2022a;b; HaoChen et al., 2021; He et al., 2022), catching up to supervised baselines and even surpassing them in few-shot learning, i.e., when evaluating the SSL model from only a few labeled examples. Although various families of SSL losses have emerged, most are variants of the joint-embedding (JE) framework with a siamese network architecture (Bromley et al., 1994), denoted JE-SSL for short. The only technicality we ought to introduce to make our study precise is the terminology JE-SSL uses to denote an input's representation. In short, JE-SSL often composes a backbone or encoder network, e.g., a ResNet-50, with a projector network, e.g., a multilayer perceptron. The projector is only employed during training, and we refer to its outputs as embeddings, while the actual input representations employed for downstream tasks are obtained at the encoder's output. Although the downstream task performance of JE-SSL representations might seem impressive, one pondering fact should be noted: all existing methods, hyperparameters, models, and thus performance, of JE-SSL are obtained by ad-hoc manual search involving the labels of the training samples. In other words, JE-SSL is tuned by monitoring the supervised performance of the model at hand. Hence, although labels are not directly employed to compute the weight updates, they are used as a proxy signal telling the JE-SSL designer how to refine their method. This single limitation prevents the deployment of JE-SSL in challenging domains where the number of available labelled examples is limited and such a search cannot be performed.
Adding to the challenge, one milestone of JE-SSL is to move away from reconstruction-based learning; hence, without labels and without visual cues, tuning JE-SSL methods on unlabeled datasets remains challenging. This has led to feature inversion methods, e.g., Deep Image Prior (Ulyanov et al., 2018) or conditional diffusion models (Bordes et al., 2021), being deployed on learned JE-SSL representations to try to visualize the learned features. This first step towards removing the need for labels has seen some success, but is doomed by the computational complexity of the methods and their bias towards natural images, i.e., it is not clear how such methods would perform on different data modalities. In this study we propose RankMe, which is able to assess a model's performance without access to any labels and without requiring any training or tuning. RankMe accurately predicts a model's performance both In-Distribution (ID), i.e., on the same data distribution as used during JE-SSL training, and Out-Of-Distribution (OOD), i.e., on different data distributions. We highlight this property at the top of fig. 1. The strength of RankMe lies in the fact that it is solely based on the singular value distribution of the learned embeddings, and thus does not rely on any parameters that need training, nor requires any ID/OOD labels. In fact, RankMe's motivation hinges on Cover's theorem (Cover, 1965), which states how increasing the rank of a linear classifier's input increases its training performance, and on three simple hypotheses that we summarize below and thoroughly validate empirically. As such, RankMe provides a step towards (unlabeled) JE-SSL by allowing practitioners to cross-validate hyperparameters and select models without resorting to labels or feature inversion methods. We hope that RankMe will enable JE-SSL to be deployed even in challenging domains that possess little or no labelled data. We summarize our contributions below:

1. We introduce (eq. (1)) and motivate RankMe, which combines Cover's theorem with the following three key hypotheses: (H1) increasing training performance increases testing performance on both representations and embeddings, i.e., no over-fitting is observed from the (non)linear probe (validated empirically in the bottom left of fig. 2); (H2) embeddings' ranks scale linearly between datasets: assuming pretraining on the same dataset, if a set of embeddings has a greater rank than another on one dataset, this also holds on another (validated empirically in the top row of fig. 2); (H3) increasing embeddings' performance increases representations' performance (validated empirically in the bottom right of fig. 2). Together, these imply that embeddings with greater rank will have greater train performance (Cover's theorem) and test performance (H1) in ID and OOD cases (H2), even before the projector (H3).

2. We demonstrate that RankMe's ability to assess JE-SSL downstream performance is robust across methods, e.g., VICReg, SimCLR, and their variants, and is also robust to architecture changes, e.g., using a projector network and/or a nonlinear evaluation method (see fig. 3 and section 3.3).

3. We demonstrate that RankMe enables hyperparameter cross-validation for any given JE-SSL method; RankMe is able to retrieve and sometimes surpass most of the performance previously found by manual search using labels, on both in-domain and out-of-domain datasets; see the bottom of fig. 1 and table 1. We provide a hyperparameter-free, numerically stable implementation of RankMe in section 3.1 and pseudo-code for cross-validation in fig. 5.

Through extensive experiments involving 11 different datasets and more than 85 trained models over 4 methods, we demonstrate that in the linear and nonlinear probing regimes, RankMe is able to tell apart high- and low-performing models, even on different downstream tasks, without having access to labels or downstream task data samples.

2. RELATED WORKS

Joint embedding self-supervised learning (JE-SSL). In JE-SSL, two main families of methods can be distinguished: contrastive and non-contrastive. Contrastive methods (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; 2021; Yeh et al., 2021) mostly rely on the InfoNCE criterion (Oord et al., 2018), except for HaoChen et al. (2021), which uses squared similarities between the embeddings. A clustering variant of contrastive learning has also emerged (Caron et al., 2018; 2020; 2021) and can be thought of as contrastive, but between cluster centroids instead of samples. Non-contrastive methods (Grill et al., 2020; Chen & He, 2020; Caron et al., 2021; Bardes et al., 2021; Zbontar et al., 2021; Ermolov et al., 2021; Li et al., 2022b) aim at bringing together embeddings of positive samples, similar to contrastive learning. However, a key difference with contrastive learning lies in how these methods prevent representational collapse. In the former, the criterion explicitly pushes negative samples, i.e., all samples that are not positive, away from each other. In the latter, the criterion does not prevent collapse by distinguishing positive and negative samples, but instead considers the embeddings as a whole and encourages information content maximization, e.g., through the empirical covariance matrix of the embeddings. Such a categorization is not needed for our development, and we thus refer to any of the above methods as JE-SSL.

Dimensional collapse in JE-SSL. The phenomenon of learning rank-deficient embeddings, or dimensional collapse, in JE-SSL has recently been studied from both a theoretical and an empirical point of view. The empirical emergence of dimensional collapse was studied in Hua et al. (2021), where the use of a whitening batch normalization layer was proposed to help alleviate it. In Jing et al.
(2022), a focus on contrastive approaches in a linear setting enabled a better understanding of dimensional collapse and of the role of augmentations in its emergence. Performance in a low-label regime of a partially collapsed encoder can also be improved by forcing the whitening of its output, as shown in He & Ozay (2022). Furthermore, it was shown in Balestriero & LeCun (2022) that dimensional collapse is a phenomenon that need not happen in theory and that its emergence is mostly due to practical concerns. Interestingly, we will see through the lens of RankMe that while reducing dimensional collapse is often beneficial, doing so "at all cost" can lead to degenerate solutions. The collapse induced by training with a softmax layer was also studied in Ganea et al. (2019), where high-rank embeddings are shown to be desirable.

Evaluation of JE-SSL representations. Evaluating the representations learned by JE-SSL methods is fundamental to enable the optimal selection of those methods' hyperparameters, which are numerous. Yet, due to the imprecise nature of what makes a good representation, multiple strategies have emerged which evaluate different properties of representations. The most common approach relies on the strong assumption of having labels on the dataset the JE-SSL method is trained on. In this case, one trains a linear classifier on the JE-SSL representations (Misra & Maaten, 2020) and directly uses the test accuracy to compare models. This method was extended to the use of nonlinear classifiers, e.g., a k-NN classifier (Wu et al., 2018; Zhuang et al., 2019). Performance evaluation without labels can also be done using a pretext task, such as rotation prediction; this technique helped in selecting data augmentation policies in Reed et al. (2021).
One limitation lies in the need to select and train the classifier of the pretext task, and in the strong assumption that rotations were not part of the transformations one aimed to be invariant to. Since (supervised) linear evaluation is the most widely used evaluation method, we focus on showing how RankMe compares with it. Most related to ours is Ghosh et al. (2022), where representations are evaluated by their eigenspectrum decay, giving a baseline for unsupervised hyperparameter selection.

3. REPRESENTATIONS' RANK CORRELATES WITH DOWNSTREAM PERFORMANCE ACROSS TASKS AND MODELS

The goal of this section is to formally introduce and motivate RankMe while providing a numerically stable implementation (section 3.1). The construction of RankMe hinges on three hypotheses that we validate empirically throughout this section.

3.1. RANKME: FROM THEORY TO IMPLEMENTATION

We first want to build notation and intuition for the construction of RankMe. To that end, we first quantify approximation and classification errors of learned embeddings as a function of their rank, and then motivate how embeddings' rank can be sufficient to compare the test performance of JE-SSL models' representations. This criterion should however only be used to compare different runs of a given method, since the embeddings' rank is not the only factor that affects performance. To ease notation, we refer to the (train) dataset used to obtain the JE-SSL model as the source dataset, and the test set on the same dataset or a different OOD dataset as the target dataset.

From Source Embeddings' Rank to Target Representations' Performance. We first build some intuition in the regression setting. In this case, a common linear algebra result ties the best-case and worst-case approximation error of any target matrix $Y \in \mathbb{R}^{N \times C}$ by a rank-$R$ matrix $P \in \mathbb{R}^{N \times C}$ to the singular values of $Y$ ranging from $R+1$ to the rank of $Y$ when ordered in decreasing order. Without loss of generality, we only consider the case $N > C$ in this study, i.e., we have more samples than dimensions. Formally, this provides the lower bound
$$\|Y - P\|_F^2 \;\geq\; \sum_{r=R+1}^{C} \sigma_r^2(Y),$$
which is tight for $P$ of rank $R$, with $\sigma_k$ the operator returning the $k$-th singular value of its argument, ordered in decreasing order. This result, on which RankMe relies, demonstrates that a necessary (but not sufficient) condition for an approximation $P$ to approximate $Y$ well is to have at least the same rank as $Y$. A similar result can be obtained in classification by considering multiple one-vs-all classifiers. In practice, however, we commonly employ a linear probe network on top of given embeddings $Z$ to best adapt them to the target $Y$, i.e., $P = ZW + \mathbf{1}b^T$. However, a linear transformation is not able to increase the rank of the input matrix, since $\mathrm{rank}(P) \leq \min(\mathrm{rank}(Z), \mathrm{rank}(W)) + 1$.
We directly obtain that
$$\min_{W, b} \|Y - ZW - \mathbf{1}b^T\|_F^2 \;\geq\; \sum_{r=R+1}^{C} \sigma_r^2(Y).$$
In short, the approximation lower bound is not improved by allowing a linear transformation of the embeddings. Further supporting the above, we recall Cover's theorem (Cover, 1965), stating that the probability of a randomly labeled set of points being linearly separable only increases if N is reduced or R is increased. We formalize those results below.

Proposition 1. The maximum training accuracy of given embeddings in linear regression or classification increases with their rank. For classification, it plateaus when the rank surpasses the number of classes.

We thus introduce RankMe formally as the following smooth rank measure, originally introduced in Roy & Vetterli (2007):
$$\mathrm{RankMe}(Z) = \exp\left(-\sum_{k=1}^{\min(N,K)} p_k \log p_k\right), \qquad p_k = \frac{\sigma_k(Z)}{\|\sigma(Z)\|_1} + \epsilon, \qquad (1)$$
where Z is the source dataset's embeddings. By noticing that RankMe provides a smooth measure of the embeddings' rank (more details in the implementation section), we can lean on proposition 1 to see that, given two models, the one with the greater RankMe value will have greater training performance. This is only guaranteed for different models of the same method, since embedding rank is not necessarily the only factor that affects performance. The above result is however not too practical yet, since what we are truly interested in are (i) performance on unseen samples, i.e., on the test set and out-of-distribution tasks, and (ii) performance of the representations and not the embeddings, since it is common to ablate the projector network of JE-SSL models. Below, we validate three key hypotheses which, when verified, imply that we can extend the impact of RankMe such that (OOD) test performance of JE-SSL representations is increased when RankMe's value on their train set embeddings is increased.
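The approximation bound above is attained by the truncated SVD (the Eckart–Young theorem), which can be checked numerically. A minimal sketch on a synthetic matrix, assuming numpy; the matrix and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, R = 50, 20, 5  # samples, dimensions, approximation rank (N > C, as in the text)

# Synthetic target matrix Y and its SVD.
Y = rng.standard_normal((N, C))
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

# Best rank-R approximation P: keep only the R largest singular values.
P = U[:, :R] * s[:R] @ Vt[:R, :]

# The residual matches the lower bound sum_{r=R+1}^{C} sigma_r^2(Y) exactly,
# so no rank-R matrix (nor any linear probe of rank <= R) can do better.
residual = np.linalg.norm(Y - P, "fro") ** 2
bound = np.sum(s[R:] ** 2)
assert np.isclose(residual, bound)
```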
Validating RankMe's Hypotheses. The development of RankMe is theoretically grounded when it comes to guaranteeing improved embedding performance on the source dataset. To empirically extend it to representation performance on target datasets, we need to verify three hypotheses: (i) linear probes do not overfit, (ii) embedding and representation performance are monotonically linked, and (iii) source and (OOD) target embedding ranks are monotonically linked. Due to the different nature of the datasets used for downstream tasks, there is no inherent reason for the rank of embeddings to transfer to them in a monotonic way. However, if the source dataset is diverse enough and the target datasets have some semantic overlap with it, then we have rank(Z_target) ∝ rank(Z_source). We observe in section 3.2 and fig. 2 that the rank of JE-SSL representations scales linearly between different input distributions, e.g., going from a source task such as ImageNet (Deng et al., 2009) to a target task such as iNaturalist. This is further confirmed by Pearson correlation coefficients greater than 0.99, except for StanfordCars, where it is 0.88. Interestingly, we observe that the StanfordCars dataset suffers from a less distinctive linear scaling due to its distribution having a small overlap with ImageNet. This indicates that as long as the source dataset is relatively diverse, using RankMe to select a model with greater embeddings' rank on the source dataset will also select a model with greater embeddings' rank on the target dataset. Furthermore, as train performance increases, so does test performance; we validate this in fig. 2. As a result, using RankMe to select a model with greater train performance suffices to also select a model with greater test performance. Finally, we report in fig. 2 that the performance of embeddings and representations scales almost monotonically.
These results are supported by visualizations of representations and embeddings from feature inversion models (Bordes et al., 2021). Hence, using RankMe to select the model maximizing performance on the former also selects a model maximizing performance on the latter. With these three hypotheses validated empirically, we can confidently say that RankMe computed on the embeddings of the source dataset is a predictor of representations' performance on target datasets.

Robust RankMe Implementation. One of the most crucial steps of RankMe is the estimation of the embeddings' rank. A trivial solution would be to count the number of nonzero singular values. Denoting by $\sigma_k$ the $k$-th singular value of the $(N \times K)$ embedding matrix $Z$, this would lead to
$$\mathrm{rank}(Z) = \sum_{k=1}^{\min(N,K)} \mathbb{1}_{\{\sigma_k > 0\}}.$$
However, such a definition is too rigid for practical scenarios; round-off error alone could have a dramatic impact on the rank estimate. Instead, alternative and robust rank definitions have emerged (Press et al., 2007), such as
$$\mathrm{rank}(Z) = \sum_{k=1}^{\min(N,K)} \mathbb{1}_{\{\sigma_k > \max_i \sigma_i \times \max(N,K) \times \epsilon\}},$$
where $\epsilon$ is a small constant dependent on the data type, typically $10^{-7}$ for float32. An alternative measure of rank comes from a probabilistic viewpoint, where the singular values are normalized to sum to 1 and the Shannon entropy (Shannon, 1948) is used; this corresponds to our definition of RankMe from eq. (1). As opposed to the classical rank, eq. (1) does not rely on specifying the exact threshold at which a singular value is treated as nonzero. Throughout our study, we employ eq. (1), and provide the matching analysis with the classical rank in the appendix. Another benefit of eq. (1) is that it quantifies the whitening of the embeddings in addition to their rank, which is known to simplify the optimization of (non)linear probes put on top of them (Santurkar et al., 2018). Lastly, although eq. (1) is defined with the full embedding matrix Z, not all samples need to be used to obtain an accurate estimate of RankMe. In practice, we use 25600 samples, as ablation studies provided in appendix G and fig. S11 indicate that this provides a highly accurate estimate.
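Eq. (1) and the subsampling recipe above can be sketched in a few lines of numpy; the ϵ = 10⁻⁷ constant and the 25600-sample budget follow the text, while the function name and toy data are illustrative:

```python
import numpy as np

def rankme(Z: np.ndarray, eps: float = 1e-7) -> float:
    """Smooth rank of eq. (1): exponential of the entropy of the
    L1-normalized singular values of the (N x K) embedding matrix Z."""
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, decreasing order
    p = s / np.abs(s).sum() + eps           # p_k = sigma_k / ||sigma(Z)||_1 + eps
    return float(np.exp(-np.sum(p * np.log(p))))

# Toy check: a full-rank embedding scores higher than a collapsed one.
# In practice Z would be a ~25600-sample subset of the source embeddings.
rng = np.random.default_rng(0)
Z_full = rng.standard_normal((1024, 64))                    # rank 64
Z_collapsed = Z_full[:, :8] @ rng.standard_normal((8, 64))  # rank <= 8
assert rankme(Z_full) > rankme(Z_collapsed)
```

Unlike the thresholded rank, this value degrades smoothly as the singular value distribution becomes less uniform, which is what lets it also reflect the whitening of the embeddings.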

3.2. RANKME PREDICTS LINEAR PROBING PERFORMANCE EVEN ON UNSEEN DATASETS

In order to empirically validate RankMe, we compare it to linear evaluation, which is the default evaluation method for JE-SSL. Finetuning has gained popularity with Masked Image Modeling methods, but it can have a significant impact on the properties of the embeddings and alters what was learned during pretraining. As such, we do not focus on this evaluation.

Experimental Methods and Datasets Considered

In order to provide a meaningful assessment of the impact of embeddings' rank on performance, we focus on 4 JE-SSL methods. We use SimCLR as a representative contrastive method, VICReg as a representative covariance-based method, and VICReg-exp and VICReg-ctr, which were introduced in Garrido et al. (2022). To make our work self-contained, we present the methods in appendix A. We chose VICReg-exp and VICReg-ctr as they provide small modifications to VICReg and SimCLR while producing embeddings with different rank properties. For each method we vary parameters that directly influence the rank of the embeddings, whether it is the temperature used in softmax-based methods, which directly impacts the hardness of the softmax, or the loss weights, which give more or less importance to the regularizing aspect of the loss functions. We also vary optimization parameters such as the learning rate and weight decay to provide a more complete analysis. We provide the hyperparameters used for all experiments in appendix K. All approaches were trained in the same experimental setting with a ResNet-50 (He et al., 2016) backbone and an MLP projector with intermediate layers of size 8192, 8192, 2048, which avoids any architectural rank constraints. The models were trained for 100 epochs on ImageNet with the LARS (You et al., 2017; Goyal et al., 2017) optimizer. In order to evaluate the methods, we used ImageNet (our source dataset), as well as iNaturalist18 (Horn et al., 2018), Places205 (Zhou et al., 2014), EuroSat (Helber et al., 2019), SUN397 (Xiao et al., 2010) and StanfordCars (Krause et al., 2013) to evaluate the trained models on unseen datasets. These commonly used datasets provide a wide range of scenarios that differ from ImageNet and offer meaningful ways to test the robustness of RankMe.
For example, iNaturalist18 consists of 8412 classes focused on fauna and flora, which requires more granularity than similar classes on ImageNet; SUN397 focuses on scene understanding, deviating from the single-object, object-centric images of ImageNet; and EuroSat consists of satellite images, which again differ from ImageNet. Datasets such as iNaturalist can allow theoretical limitations to manifest themselves more clearly, since their number of classes is significantly higher than the rank of the learned representations. While we focus on these datasets for our visualizations, we also include CIFAR10 and CIFAR100 (Krizhevsky et al., 2009), Food101 (Bossard et al., 2014), VOC07 (Everingham et al.) and CLEVR-count (Johnson et al., 2017) for our hyperparameter selection results, and provide visualizations in appendix D. In order to evaluate on those datasets, we relied on the VISSL library (Goyal et al., 2021). We provide complete details on the pretrainings and evaluations in appendix I.

RankMe as a predictor of linear classification accuracy. As can be seen in fig. 3, for a given method the performance on the embeddings improves with a higher embedding rank, whether we look at ImageNet, on which the models were pretrained, or at downstream datasets. Nonetheless, there are some visible outliers, mostly for SimCLR in settings with very high error rates compared to before the projector, such as on iNaturalist or StanfordCars. The conceptual closeness between VICReg-ctr and SimCLR pointed out in Garrido et al. (2022) would also suggest that these results need to be interpreted carefully, but they do reinforce the fact that a high rank is a necessary but not sufficient condition for improved performance on downstream tasks.
It is also very tempting to draw conclusions when comparing different approaches, especially when looking at the ImageNet performance; however, since dimensional collapse is not the only factor deciding performance, one should refrain from doing so. The link between embedding rank and performance is even clearer when evaluating on the representations, as is usually done. In this scenario the link is more consistent across datasets, where we observe again that a higher rank is necessary for improved performance. This solidifies the use of RankMe as a performance metric that can be used in practice.

3.3. GOING FURTHER: RANKME ALSO HOLDS FOR NONLINEAR PROBING AND FOR DIFFERENT ARCHITECTURES

Non-linear evaluation. While we have been focusing on linear evaluation, one can wonder whether the behaviour changes when using a more complex task-related head. We thus give some evidence that the previously observed behaviours persist with a non-linear classification head. We used a simple 3-layer MLP with intermediate dimensions 2048, where each layer is followed by a ReLU activation. This choice of dimensions ensures that there are no architectural rank constraints on the embeddings. We focused on SUN397 and StanfordCars for this study due to their conceptual differences from ImageNet. The low rank of embeddings produced by SimCLR on these datasets would suggest that a non-linear classifier might help improve performance, since it is not as theoretically limited by the embeddings' rank as in the linear setting. However, as we can see in fig. 4, the behavior of all methods is the same as in the linear regime. This suggests that RankMe is also a suitable metric to evaluate downstream performance in a non-linear setting.

Algorithm 1 Hyperparameter selection with RankMe
Require: Models f_1, ..., f_N to compare, in increasing value of the hyperparameter
Require: Corresponding ranks r_1, ..., r_N
1: f_best ← f_1, r_best ← r_1
2: for i = 2 to N do
3:   if r_i > r_best then
4:     f_best ← f_i, r_best ← r_i
5:   else if r_i = r_best and (r_i > r_{i-1} or r_i > r_{i+1}) then
6:     f_best ← f_i

Dimensional collapse on different architectures. Our results so far have only focused on ResNet-50s, and a concern could be that the architecture played a significant role in the introduction of collapse. As such, we trained VICReg in the same setting as before but using ConvNeXt-T (Liu et al., 2022) as the backbone architecture. As we can see in fig. 4, collapse still appears in this case, with an even stronger impact on performance on ImageNet. This reinforces the findings of Jing et al. (2022); He & Ozay (2022), which study collapse through the loss function used, independently of the backbone architecture.

RankMe in more diverse settings. While our focus has been on contrastive methods, we further study in appendix C how RankMe can be applied to clustering methods such as DINO, where it shows great effectiveness. We also examine the effectiveness of RankMe when pretraining on other source datasets in appendix B, validating RankMe on iNaturalist18 pretraining.

4. RANKME FOR LABEL-FREE HYPERPARAMETER SELECTION IN SSL

We previously focused on validating RankMe by comparing it against overall linear evaluation performance. In this section we focus on the evolution of rank and performance when varying one hyperparameter at a time, in order to demonstrate how RankMe can be used for hyperparameter selection. We focus on loss-specific hyperparameters, such as the loss weights or temperature, as well as hyperparameters related to optimization, such as the learning rate and weight decay.

4.1. USING RANKME TO CHOOSE THE CORRECT HYPERPARAMETER VALUE

As we have shown before, a higher rank is necessary for better performance, and using RankMe to find the best value of a hyperparameter is as simple as choosing the value that leads to the highest rank, as illustrated in fig. 5. Certain hyperparameters will lead to plateaus of equal rank, in which case the value that first achieves the maximal rank should be selected. This second rule is however only applicable when hyperparameter values can be ordered. Even in cases where the values cannot be ordered and equal ranks are found, this still makes it possible to discard some runs and focus only on those that achieve the maximal rank. This further highlights how maximal rank is only a necessary condition for good performance. Nonetheless, when the hyperparameters are ordered, we can go one step further and use the rank alone to find a good hyperparameter value.
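The selection rule above can be sketched in a few lines. This is a simplified rendition of Algorithm 1 (pick the first hyperparameter value, in increasing order, that reaches the maximal rank), with hypothetical run names and rank values:

```python
def select_by_rankme(models, ranks):
    """Return the first model (in hyperparameter order) whose RankMe
    value reaches the maximum over all runs."""
    r_max = max(ranks)
    for model, r in zip(models, ranks):
        if r == r_max:
            return model

# Hypothetical sweep: a plateau of equal rank at lr=0.2 and lr=0.3;
# the first value reaching the plateau is selected.
models = ["lr=0.1", "lr=0.2", "lr=0.3", "lr=0.4"]
ranks = [80.4, 120.1, 120.1, 90.2]
assert select_by_rankme(models, ranks) == "lr=0.2"
```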

4.2. EXPERIMENTS

In order to demonstrate the effectiveness of RankMe for hyperparameter selection, we apply the algorithm presented in fig. 5 to find the best values of a given set of hyperparameters for VICReg and SimCLR. Our focus is on the covariance and invariance weights in VICReg, the temperature in SimCLR, and the learning rate and weight decay for both. We compare the performance on ImageNet, as well as the average performance on the previously discussed OOD datasets, to models selected by their ImageNet top-1 accuracy on its validation set. For per-dataset performance, see appendix J. As can be seen in table 1, using RankMe we are able to retrieve most of the performance on ImageNet, with gaps lower than half a point. It is not possible to beat the selection using ImageNet's validation set, since this is the metric we are evaluating on. However, on OOD datasets we are able to improve the performance in certain settings and match it in the others. Thus, when comparing performance after the projector, RankMe is the better of the two approaches to select the hyperparameters that will generalize best to unseen datasets. When comparing to α-ReQ, RankMe achieves better in-domain performance, but on OOD datasets α-ReQ performs slightly better, though with bigger worst-case performance drops. We provide an in-depth analysis of α-ReQ in appendix E, where we find that its power-law prior fails on the embeddings, and as such those results must be interpreted with care. As pointed out in Girish et al. (2022), using ImageNet performance to select models can lead to suboptimal performance on downstream tasks; our results further confirm this and reinforce the need for a new way of selecting hyperparameters. When looking at performance before the projector in fig. 1, we can see that RankMe does not beat the models selected with ImageNet's validation set, even on OOD datasets.
However, RankMe performs better than α-ReQ in most settings, while not suffering from severe drops in the worst cases. Nevertheless, the gaps between RankMe and the ImageNet oracle are on average less than half a point, which shows how competitive RankMe can be for hyperparameter selection, despite using no labeled data, having no parameters to tune, and being computable in a couple of minutes.

5. CONCLUSION

We have shown how the phenomenon of dimensional collapse in self-supervised learning can be used as a powerful metric to evaluate models. By using a theoretically motivated analogue of the rank of embeddings, we show that performance on downstream datasets can easily be assessed by only looking at the training dataset, without any labels, training, or parameters. While our work focuses on linear classification, we show promising results in non-linear classification that raise the question of how general this simple metric can be. Furthermore, its competitiveness with traditional oracle-based hyperparameter selection methods makes it a promising tool in settings where labels are scarce, such as large uncurated datasets. As such, this work takes a step towards completely label-free self-supervised learning, as most existing approaches' hyperparameters are tuned with the help of ImageNet's validation set. Further work will explore the use of RankMe in more varied scenarios, to further legitimize its use in designing better self-supervised approaches.

6. REPRODUCIBILITY STATEMENT

While reproducing the pretrainings is prohibitively expensive, as each training takes around a day on 8 V100 GPUs, we provide all of the hyperparameters used in appendix K. We also provide all of the pretraining details in appendix I, along with the hyperparameters used for the linear evaluations. We further provide all the performance and rank values needed to reproduce our main figures in appendix K. While the implementation of RankMe is straightforward, we provide an example algorithm using it in fig. 5. All of these efforts should make our results reproducible and verifiable.

A BACKGROUND

In order to make our work as self-contained as possible, we recall the loss functions of the methods we study. For conciseness, we refer to the outputs of the encoder as representations and to the outputs of the projection head as embeddings, which we denote by $z_i \in \mathbb{R}^d$. We first briefly recall that the SimCLR loss is given by
$$\mathcal{L} = -\sum_{(i,j) \in P} \log \frac{e^{\mathrm{CoSim}(z_i, z_j)}}{\sum_{k=1}^{N} \mathbb{1}_{\{k \neq i\}} e^{\mathrm{CoSim}(z_i, z_k)}},$$
with $P$ the set of all positive pairs in the current mini-batch or dataset, which comprises $N$ exemplars. VICReg's loss is defined with three components. The variance loss $v$ acts as a norm regularizer for the dimensions, and the covariance loss aims at decorrelating dimensions of the embeddings. They are respectively defined as
$$v(Z) = \frac{1}{d} \sum_{i=1}^{d} \max\left(0, 1 - \sqrt{\mathrm{Var}(Z_{\cdot,i})}\right) \quad \text{and} \quad c(Z) = \frac{1}{d} \sum_{i \neq j} \mathrm{Cov}(Z)_{i,j}^2.$$
Both of these losses are combined with an invariance loss that matches positive pairs, giving the final loss
$$\mathcal{L} = \lambda \sum_{(i,j) \in P} \|z_i - z_j\|_2^2 + \mu\, c(Z) + \nu\, v(Z).$$
VICReg-exp is defined similarly, but with the exponential covariance loss
$$c_{\mathrm{exp}}(Z) = \frac{1}{d} \sum_i \log \left( \sum_{j \neq i} e^{\mathrm{Cov}(Z)_{i,j}/\tau} \right).$$
VICReg-ctr is then VICReg-exp applied to $Z^T$, making it a contrastive approach that is conceptually similar to SimCLR. These methods give us different scenarios of collapse and allow us to make a more general study of the rank of representations as a powerful metric.
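To make the definitions above concrete, here is a small NumPy sketch of the three VICReg terms. The weight values and the small epsilon inside the square root are illustrative choices, not the settings of the official implementation.

```python
import numpy as np

def vicreg_loss(Za, Zb, lam=25.0, mu=1.0, nu=25.0):
    """Sketch of the VICReg loss for two batches of embeddings Za, Zb
    (N x d) coming from the two views of the same images. The weights
    lam, mu, nu are illustrative placeholders."""
    def v(Z):
        # variance term: hinge on each dimension's standard deviation
        std = np.sqrt(Z.var(axis=0) + 1e-4)  # small epsilon for stability
        return np.mean(np.maximum(0.0, 1.0 - std))
    def c(Z):
        # covariance term: squared off-diagonal covariance entries, scaled by 1/d
        C = np.cov(Z, rowvar=False)
        return (np.sum(C ** 2) - np.sum(np.diag(C) ** 2)) / Z.shape[1]
    # invariance term: squared distance between matched positive pairs
    invariance = np.mean(np.sum((Za - Zb) ** 2, axis=1))
    return lam * invariance + mu * (c(Za) + c(Zb)) + nu * (v(Za) + v(Zb))
```

A fully collapsed batch (constant embeddings) is penalized purely by the variance hinge, which is what prevents the trivial solution.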

B INATURALIST18 PRETRAINING

While our experiments have previously focused on ImageNet pretraining due to its wide use in the community, one may wonder whether RankMe is still applicable when training on another source dataset. To verify this, we pretrained VICReg and SimCLR on iNaturalist18, and respectively varied the covariance loss weight and the temperature to study their influence on the rank of the embeddings. We used the same protocol as for ImageNet pretraining, but trained for 300 epochs instead of 100 to obtain a similar number of iterations. We then evaluated the performance on iNaturalist18 and ImageNet. We use 8192-dimensional embeddings due to the high number of classes in iNaturalist18; however, since the representations are 2048-dimensional, the rank cannot intrinsically go higher, so we treat all higher ranks as effectively 2048. As we can see in fig. S1, RankMe provides a similar level of performance as on ImageNet pretrainings, validating it on a different source dataset. RankMe is even able to improve performance on ImageNet compared to the iNaturalist18 oracle, further showing the limitations of such oracles on downstream tasks.

C APPLICABILITY TO CLUSTER BASED METHODS

While we have studied the applicability of RankMe on contrastive methods, cluster-based methods such as DINO have become extremely popular, and since the definition of embeddings is not as clear-cut for them, a thorough analysis is required. We proceed in two steps:

• Show that dimensional collapse happens right before the clustering layer, and not on the prototypes.
• Show that RankMe is a good measure of performance for DINO.

As we can see in figure S2, DINO's projector can be interpreted as both a classical projector and a clustering layer whose weights are clustering prototypes. This interpretation comes from the softmax applied to the output of the projection head, which can be interpreted as an InfoNCE between the embeddings and the clustering prototypes that make up the clustering layer. While the prototypes themselves are not particularly collapsed, the embeddings obtained before the clustering exhibit dimensional collapse. As we can see in fig. S3, the phenomenon of dimensional collapse is highly visible in DINO, which enables the use of RankMe to find optimal hyperparameter values. Validating this on ImageNet, we see that RankMe is able to match the performance of the oracle, or leads to only slightly lower performance, further validating RankMe on another popular method.
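To illustrate where the rank is measured under this interpretation, here is a small NumPy sketch of a DINO-style head; all names, shapes, and the synthetic decaying spectrum are our own illustration, not DINO's actual code.

```python
import numpy as np

def rankme(Z, eps=1e-7):
    # effective rank: exponential of the entropy of normalized singular values
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (np.abs(s).sum() + eps) + eps
    return float(np.exp(-np.sum(p * np.log(p))))

# Illustrative DINO-style head: a projector produces embeddings, then a
# prototype (clustering) layer scores them against K prototypes.
rng = np.random.default_rng(0)
N, d, K = 512, 64, 256
embeddings = rng.normal(size=(N, d)) * (1.0 / (1.0 + np.arange(d)))  # decaying spectrum
prototypes = rng.normal(size=(K, d))                                  # clustering layer weights

logits = embeddings @ prototypes.T
logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
assignments = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# The rank is measured on the embeddings *before* the clustering layer,
# where dimensional collapse shows up, not on the well-spread prototypes.
print(rankme(embeddings), rankme(prototypes))
```

In this toy setup the embeddings' effective rank is far below that of the prototypes, mirroring the observation that collapse happens before the clustering layer.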

D RESULTS ON SUPPLEMENTARY DATASETS

While we previously focused on certain datasets for their interesting natures, we now provide additional visualizations for the remaining datasets. As we can see in figs. S4 and S5, we find similar behaviours as before, apart from Food101, where performance is almost identical for all methods. This reinforces the previous validation of RankMe. The relative simplicity of the datasets considered here makes the theoretical limitations of rank-deficient embeddings harder to see, even though we still observe that a high rank helps generalization.

E DETAILED RESULTS FOR α-REQ

In order to further study the performance of α-ReQ, we reproduce our plots for RankMe using α-ReQ instead of the rank of embeddings. We compare both the intended use of α-ReQ in fig. S6, and its application on the embeddings to measure performance on the representations, which we found to be necessary for RankMe, in fig. S7. As we can see in fig. S6, there is no clear link visible between the value of α-ReQ and downstream performance. In particular, we are unable to see the tendency of performance to increase as α tends to one. Nonetheless, α-ReQ was still able to lead to good performance when used for hyperparameter selection. When applying α-ReQ as we would RankMe, we can see in fig. S7 that there is again no trend of performance increasing as α tends to one. On the contrary, we even find that performance tends to get better with a lower α, as is most visible on StanfordCars, iNaturalist18, or ImageNet, for example. α going towards one means that the distribution of the embeddings' singular values tends towards a uniform one, in line with the goal of RankMe. As we can see in figs. S8 and S9, the power-law prior of α-ReQ holds well in the case of non-collapsed embeddings, but when we apply it on collapsed ones, this assumption fails. It even provides a poor approximation of the main rank "plateau" formed by the highest singular values, as can be seen on the right of fig. S9. This further confirms the findings of He & Ozay (2022), and shows that one must be careful when applying α-ReQ directly on the embeddings, since the core assumption of the method is violated.

F RELATIONSHIP BETWEEN THE RANK ESTIMATORS

Since we do not rely on the classical threshold-based rank estimator, it is important to verify how well our entropy-based one correlates with it. As we can see in fig. S10, the two estimates discussed previously correlate extremely well, showing that using one or the other should not lead to significant differences, as validated in appendix H.
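For reference, α-ReQ (discussed in appendix E) scores a model by the decay coefficient α of a power law λ_k ∝ k^(−α) fitted to the spectrum. Below is a minimal least-squares sketch of such a fit, under the assumption that the spectrum is taken as the eigenvalues of the feature covariance; this is our own illustration of the general recipe, not the reference α-ReQ implementation.

```python
import numpy as np

def alpha_req(Z):
    """Estimate the power-law decay coefficient alpha assumed by the
    method: fit lambda_k ∝ k**(-alpha) to the covariance eigenspectrum
    of Z (N x d) by least squares in log-log space."""
    lam = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending
    lam = lam[lam > 1e-12]                # drop numerically zero modes
    k = np.arange(1, lam.size + 1)
    slope, _ = np.polyfit(np.log(k), np.log(lam), 1)
    return -slope
```

When the spectrum truly follows a power law this recovers α well; on collapsed spectra with a sharp "plateau" followed by near-zero values, the log-log fit is exactly where the prior breaks down.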
Nonetheless, the entropic estimator takes into account the degree of whitening of the embeddings, which links better to theoretical results.

G CONVERGENCE OF THE RANK ESTIMATORS

As we can see in fig. S11, the rank estimates converge extremely quickly, especially for VICReg. For both VICReg and SimCLR, 10000 samples are enough to obtain more than 95% of the final rank. It is worth noting that the entropic rank estimator converges more slowly than the classical rank estimator, as it is sensitive to the whole distribution of singular values. The fact that the rank can be approximated with few samples is encouraging for its use during training, and not only as a measure of performance after pretraining.

H REPRODUCTION OF FIGURES WITH THE CLASSICAL RANK ESTIMATOR

As can be seen in figs. S12 and S13, the results that we obtain using the classical threshold-based rank estimator are extremely similar to the ones obtained with the entropic estimator. The exact values do differ, but the behaviors stay the same. One of the main differences is illustrated in fig. S13, where we can see that the target rank is almost identical to the source one, whereas we previously saw a drop of around 50%. This can be explained by the fact that some features may be less present in the target dataset, reducing the associated singular values, and thus the entropic rank, while leaving the threshold-based rank unchanged. All of this shows that using one or the other will lead to similar results in practical scenarios.
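The two estimators compared above can be written side by side. The sketch below uses NumPy's usual tolerance convention for the threshold-based rank, which may differ in its details from the exact threshold used in our experiments.

```python
import numpy as np

def entropic_rank(Z, eps=1e-7):
    """Smooth effective rank: exponential of the entropy of the normalized
    singular values, which reflects how whitened the spectrum is."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (np.abs(s).sum() + eps) + eps
    return float(np.exp(-np.sum(p * np.log(p))))

def threshold_rank(Z):
    """Classical numerical rank: count of singular values above a tolerance
    scaled by the largest one (NumPy's default matrix_rank convention)."""
    s = np.linalg.svd(Z, compute_uv=False)
    tol = s.max() * max(Z.shape) * np.finfo(s.dtype).eps
    return int((s > tol).sum())

# An unbalanced but technically full-rank spectrum separates the two:
# the threshold estimator counts every direction, while the entropic one
# discounts near-collapsed directions.
Z = np.random.default_rng(0).normal(size=(1000, 32)) * np.power(10.0, -np.arange(32) / 8.0)
```

On such a matrix the threshold rank is the full 32 while the entropic rank is far lower, which is exactly the extra whitening information that RankMe exploits.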

I DETAILED TRAINING AND EVALUATION PROCEDURES

I.1 PRETRAINING

All pretrainings were done with ResNet-50 backbones. The projector used is an MLP with intermediate dimensions 8192, 8192, 2048. The models were trained with the LARS optimizer using a momentum of 0.9, a weight decay of 10^-6, and learning rates that vary depending on the method. VICReg used a base learning rate of 0.3, SimCLR 0.5 or 0.6 depending on the experiment, VICReg-exp 0.6, and VICReg-ctr 0.6. The learning rate is then computed as lr = base_lr * batch_size / 256. We do a 10-epoch linear warmup and then use cosine annealing. We used batch sizes of 2048 for SimCLR and 1024 for the other methods. SimCLR and VICReg-ctr also use a default temperature of 0.15, and VICReg-exp one of 0.1. We used the image augmentation strategy from Grill et al. (2020), illustrated in table S1.

I.2 DOWNSTREAM EVALUATION

For all datasets except StanfordCars, we use the standard protocol in VISSL. On StanfordCars we mostly tuned the learning rate; the parameters that we used are described in table S2. For data augmentation, we used random resized crops and random horizontal flips during training, and a center crop for evaluation. For VOC07, we follow the common SVM-based protocol, as used in Bardes et al. (2021), with the default VISSL settings for this evaluation.
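The learning-rate recipe above (linear batch-size scaling, 10-epoch warmup, cosine annealing) can be sketched per epoch as follows; the defaults correspond to the VICReg settings from this section, and this is a per-epoch approximation of the schedule rather than the exact VISSL implementation.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, base_lr=0.3, batch_size=1024,
                warmup_epochs=10, end_lr=0.0):
    """Learning rate at a given epoch: linear scaling with batch size,
    a linear warmup, then cosine annealing down to end_lr."""
    peak = base_lr * batch_size / 256               # lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs   # linear warmup
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return end_lr + 0.5 * (peak - end_lr) * (1.0 + math.cos(math.pi * t))
```

With the defaults the peak learning rate is 0.3 * 1024 / 256 = 1.2, reached at the end of warmup, then annealed towards zero.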



Figure 2: Validation of the hypotheses motivating RankMe. (Top) Embeddings' rank transfers from source to target datasets. The estimates use 25600 images from the respective datasets. (Bottom Left) Train and test accuracy are highly correlated across datasets. (Bottom Right) An increase in performance on embeddings leads to an increase in performance on representations.

Figure 4: Impact of rank on performance with other architectures and evaluation protocols. (Left) Using a 3-layer MLP as classification head does not alter the performance before or after the projector, showing that RankMe can go beyond linear evaluation. (Right) ConvNeXts are also sensitive to dimensional collapse, showing that rank deficiency is not an artifact of ResNets.

Figure S1: RankMe applied to iNaturalist18 pretrainings (Left). RankMe is able to select hyperparameters when pretraining on iNaturalist (Right).

Figure S2: DINO's projection head can be split in two parts, a classical projector and a clustering layer (Left). Collapse happens before the clustering layer and not on the clustering prototypes (Right).

Figure S3: RankMe is able to measure DINO's performance on its source dataset (Left). DINO's hyperparameters can be selected by using RankMe (Right).

Figure S5: Link between embedding rank and downstream performance on the representations.

Figure S6: Link between α-ReQ measured on the representations and performance on the representations.

Figure S7: Link between α-ReQ measured on the embeddings and performance on the representations.

Figure S8: Validation of the power-law prior on un-collapsed representations. (Left) Overall visualization. (Right) Zoom on the high singular values.

Figure S9: The power-law prior does not hold on collapsed representations. (Left) Overall visualization. (Right) Zoom on the high singular values.

Figure S10: Relationship between the two rank estimators; Pearson correlation coefficient of 0.99. Outliers correspond to embeddings with singular values close to the threshold, showing how the entropic rank takes this information into account.

Figure S11: Convergence of the rank estimators on ImageNet as a function of the number of samples, for 2048-dimensional outputs (indicated by the vertical line).

Figure S12: Reproduction of the top of fig. 2 with the classical rank estimator. Embeddings' rank transfers from source to target datasets. The estimates used 25600 images from the respective datasets.

Figure S13: Reproduction of fig. 3 with the classical rank estimator. (Left) Validation of RankMe on embeddings: a higher ImageNet rank leads to improved performance across methods and datasets. (Right) Validation of RankMe on representations, where the link is even clearer, reinforcing RankMe's practical use.

µ : 1, ν : 16, τ : 0.1

Figure 5: Algorithm for hyperparameter selection with RankMe. Each candidate configuration i yields a trained model f_i scored by its rank r_i; whenever r_i improves on the best rank so far, the algorithm sets r_best ← r_i and f_best ← f_i, and finally returns f_best.

Table 1: Top-1 accuracies obtained by performing hyperparameter selection using ImageNet validation performance, α-ReQ, or RankMe. OOD indicates the average performance over all the considered datasets other than ImageNet. The performance is computed on the embeddings.

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In NeurIPS, 2014.
Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022a.
Pan Zhou, Yichen Zhou, Chenyang Si, Weihao Yu, Teck Khim Ng, and Shuicheng Yan. Mugs: A multi-granular self-supervised learning framework. 2022b.
Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019.

Table S1: Image augmentation parameters, taken from Grill et al. (2020).

Table S2: Optimization parameters used to evaluate on downstream datasets.

Hyperparameters for all runs.

Hyperparameters for all runs, continued.

Rank after projector in all settings, continued.

