THE GEOMETRY OF SELF-SUPERVISED LEARNING MODELS AND ITS IMPACT ON TRANSFER LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Self-supervised learning (SSL) has emerged as a desirable paradigm in computer vision because supervised models often fail to learn representations that generalize to domains with limited labels. The recent popularity of SSL has led to the development of several models that make use of diverse training strategies, architectures, and data augmentation policies, with no existing unified framework to study or assess their effectiveness in transfer learning. We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each. Unlike existing approaches that consider mathematical approximations of the parameters, individual components, or the optimization landscape, our work explores the geometric properties of the representation manifolds learned by SSL models. Our proposed manifold graph metrics (MGMs) provide insights into the geometric similarities and differences between available SSL models, their invariances with respect to specific augmentations, and their performance on transfer learning tasks. Our key findings are twofold: (i) contrary to popular belief, the geometry of SSL models is not tied to their training paradigm (contrastive, non-contrastive, and cluster-based); (ii) we can predict the transfer learning capability of a specific model based on the geometric properties of its semantic and augmentation manifolds.

1. INTRODUCTION

Self-supervised learning (SSL) for computer vision applications has empowered deep neural networks (DNNs) to learn meaningful representations of images from unlabeled data (1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11). These methods learn a feature-space embedding that is invariant to data augmentations (e.g., cropping, translation, color jitter) by maximizing the agreement between representations of different augmentations of the same image. The resulting models are then used as general-purpose feature extractors and have been shown to achieve better transfer learning performance than features obtained from a supervised model (12). Broadly, SSL models can be categorized into contrastive (13; 2), non-contrastive (14; 15), and prototype/clustering (16; 17) approaches. There exist multiple differences between the resulting models, even among models belonging to the same category. For example, popular SSL models available as pre-trained networks (18) can differ in terms of training parameters (loss function, optimizer, learning rate), architecture (DNN backbone, projection head, momentum encoder), model initialization (weights, batch-normalization parameters, learning rate schedule), etc. Recently, researchers have focused on developing an understanding of specific components of SSL models by studying the loss function used to train the models and its impact on the learned representations. For instance, (19) analyzes contrastive loss functions and the dimensional collapse problem. (20) also analyzes contrastive losses and describes the effect of augmentation strength as well as the importance of a non-linear projection head. (21) quantifies the importance of data augmentations in a contrastive SSL model via a distance-based approach. (22) explores a graph formulation of contrastive loss functions with generalization guarantees on the learned representations. (23) examines the importance of the temperature parameter used in the SSL loss function and its impact on learning.
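The agreement-maximization principle described above can be illustrated with a minimal sketch. The encoder is omitted and the loss is a plain negative mean cosine similarity between two augmented views, as used (with additional machinery) in non-contrastive methods; the function names are illustrative and do not correspond to any specific model's objective:

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize each row of x onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def agreement_loss(z1, z2):
    """Negative mean cosine similarity between the embeddings of two
    augmented views of the same batch of images. Minimizing this loss
    pulls the two views' representations together in feature space."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    return float(-np.mean(np.sum(z1 * z2, axis=-1)))
```

Contrastive variants additionally repel embeddings of different images (negative pairs), while non-contrastive methods rely on architectural asymmetries (e.g., stop-gradients or momentum encoders) to avoid the collapse of this attractive term.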
(24) performs a spectral analysis of the DNN mapping induced by non-contrastive losses and the momentum encoder approach. However, these studies are unable to provide a unified analysis of the myriad of existing SSL models. Moreover, these theoretical approaches only provide insights into the embedding obtained after the projection head, while in practice it is the mapping provided by the encoder that is actually used for transfer learning. Closer to our approach, the general transfer performance of an SSL model has been predicted from the performance it achieves on the ImageNet dataset (25). This idea was shown to be effective for transfer datasets that are similar to ImageNet, but it does not generalize to all transfer learning problems (12). In fact, if ImageNet performance were highly correlated with general transfer learning performance, it would be unclear why SSL models are needed at all, as one could simply use supervised training on ImageNet to obtain image representations for transfer learning. Furthermore, existing empirical evaluations such as (25) only provide a somewhat coarse and partial understanding of SSL models. For example, they do not provide insights into how the level of invariance to a specific augmentation in an SSL model relates to its performance on a given downstream task. Since it has been observed that invariance to some augmentations can be beneficial in some cases and harmful in others (26), our goal is to develop a more direct and quantitative understanding of augmentation invariance and how this invariance determines performance. To achieve this goal, we propose a geometric perspective for understanding SSL models and their transfer capabilities. Our approach analyzes the manifold properties of SSL models by using a data-driven, graph-based method to characterize the geometry of data and their augmentations, as illustrated on the left of Figure 1.
Specifically, we develop a set of manifold graph metrics (MGMs) to quantify the geometric properties of existing SSL models. This allows us to provide insights into the similarities and differences between models (Figure 1, right) and to link their ability to transfer to specific characteristics of the target task. Because our approach can be applied directly to sample data points and their augmentations, it has several important advantages. First, it is agnostic to specific training procedures, architectures, and loss functions. Second, it can be applied to the data embeddings obtained at any layer of the SSL model, thus alleviating the challenge induced by the presence of projection heads. Third, it enables us to compare different feature representations of the same data point, even if these representations have different dimensions. We use our approach to answer the following questions about SSL models: (i) What are the geometric differences between the feature spaces of various SSL models? (ii) What geometric properties allow for better transfer learning capability on a specific task? For our transfer study, we evaluate our MGMs on 14 SSL models under 5 augmentation settings and show their impact on 8 downstream tasks comprising 18 datasets in total. Our contributions are summarized as follows:
• We develop quantitative tools (i.e., MGMs) capable of capturing important geometric aspects of SSL representations, such as the degree of equivariance-invariance, the curvature, and the intrinsic dimensionality, Sec. 3.
• We leverage the proposed MGMs to explore the geometric differences and similarities between SSL models. As illustrated in the right part of Figure 1, we show that SSL models can be clustered using these geometric properties into three main groups that are not entirely aligned with the paradigm upon which they were trained, Sec. 4.
• We analyze the geometric differences between a Vision Transformer (ViT) and a convolutional network (ResNet). We show that while a ResNet is biased towards a collapsed representation at initialization, a ViT is not. This initialization bias leads to different geometric behavior (attraction/repulsion of representations) between the two architectures when training under an SSL regime, as detailed in Sec. 4.
• We demonstrate that the observed MGMs are a strong indicator of the transfer learning capabilities of SSL models for most downstream tasks, thereby showing that specific geometric properties are crucial for a given transfer learning task, Sec. 5.
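As a concrete illustration of the data-driven graph construction underlying such metrics, the sketch below builds a k-nearest-neighbour graph over the embeddings of a sample and its augmentations and computes a toy invariance score. Both functions and the score itself are hypothetical simplifications for exposition, not the MGMs defined in Sec. 3:

```python
import numpy as np

def knn_graph(embeddings, k=5):
    """Build a symmetric k-nearest-neighbour adjacency matrix from a set
    of feature vectors (one embedding per row)."""
    # Pairwise Euclidean distances between all embeddings.
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-loops
    idx = np.argsort(d, axis=1)[:, :k]        # k closest neighbours per node
    adj = np.zeros_like(d, dtype=bool)
    rows = np.repeat(np.arange(len(d)), k)
    adj[rows, idx.ravel()] = True
    return adj | adj.T                         # symmetrize the graph

def augmentation_spread(embeddings):
    """Toy invariance score: mean distance of the embeddings of augmented
    views to their centroid. Lower values indicate that the model maps
    augmentations of an image to nearby points (more invariant)."""
    centroid = embeddings.mean(axis=0)
    return float(np.mean(np.linalg.norm(embeddings - centroid, axis=1)))
```

In practice, the neighborhood graph would be built over encoder features of many images and their augmentations, and graph quantities (connectivity, geodesics, local dimension) would be read off from it.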

2. BACKGROUND

Self-supervised Learning: SSL models are trained by producing multiple versions of the same image via data augmentation and training a DNN such that their embeddings coincide. Many SSL models use architectures that cascade two networks: the backbone encoder, from which the representation used for a downstream task is extracted, and the projection head, whose output is fed into the SSL loss function. The main risk in such an approach is the so-called feature collapse phenomenon (19; 24; 27), where the learned representations become invariant to input samples that belong to different manifolds. To reduce the risk of feature collapse, multiple SSL

