THE GEOMETRY OF SELF-SUPERVISED LEARNING MODELS AND ITS IMPACT ON TRANSFER LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Self-supervised learning (SSL) has emerged as a desirable paradigm in computer vision because supervised models often fail to learn representations that generalize in domains with limited labels. The recent popularity of SSL has led to the development of several models that employ diverse training strategies, architectures, and data augmentation policies, with no existing unified framework to study or assess their effectiveness in transfer learning. We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each. Unlike existing approaches that consider mathematical approximations of the parameters, individual components, or the optimization landscape, our work explores the geometric properties of the representation manifolds learned by SSL models. Our proposed manifold graph metrics (MGMs) provide insights into the geometric similarities and differences between available SSL models, their invariances with respect to specific augmentations, and their performances on transfer learning tasks. Our key findings are twofold: (i) contrary to popular belief, the geometry of an SSL model is not tied to its training paradigm (contrastive, non-contrastive, or cluster-based); (ii) we can predict the transfer learning capability of a specific model from the geometric properties of its semantic and augmentation manifolds.

1. INTRODUCTION

Self-supervised learning (SSL) for computer vision applications has empowered deep neural networks (DNNs) to learn meaningful representations of images from unlabeled data (1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11). These methods learn a feature space embedding that is invariant to data augmentations (e.g., cropping, translation, color jitter) by maximizing the agreement between representations from different augmentations of the same image. The resulting models are then used as general-purpose feature extractors and have been shown to achieve better transfer learning performance than features obtained from a supervised model (12). Broadly, SSL models can be categorized into contrastive (13; 2), non-contrastive (14; 15), and prototype/clustering-based (16; 17) approaches. There exist multiple differences among the resulting models, even within the same category. For example, popular SSL models available as pre-trained networks (18) can differ in terms of training parameters (loss function, optimizer, learning rate), architecture (DNN backbone, projection head, momentum encoder), model initialization (weights, batch-normalization parameters, learning rate schedule), etc. Recently, researchers have focused on developing an understanding of specific components of SSL models by studying the loss function used to train them and its impact on the learned representations. For instance, (19) analyzes contrastive loss functions and the dimensional collapse problem. (20) also analyzes contrastive losses and describes the effect of augmentation strength as well as the importance of a non-linear projection head. (21) quantifies the importance of data augmentations in a contrastive SSL model via a distance-based approach. (22) explores a graph formulation of contrastive loss functions with generalization guarantees on the learned representations.
In (23), the importance of the temperature parameter used in the SSL loss function and its impact on learning is examined. (24) performs a spectral analysis of the DNN mapping induced by non-contrastive losses and the momentum encoder approach. However, these studies are unable to provide a unified analysis of the myriad of existing SSL models. Moreover, these theoretical approaches only provide insights into the embedding obtained after the projection head, while in practice it is the mapping provided by the encoder that is actually used for transfer learning.
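The augmentation-agreement objective described above, including the temperature parameter examined in (23), can be illustrated with the NT-Xent loss used by contrastive methods such as SimCLR. The following is a minimal NumPy sketch, not the implementation used by any of the cited models; the batch size, embedding dimension, and temperature value are illustrative.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (N, d) embeddings of two augmented views of the same N images.
    Positive pairs are (z1[i], z2[i]); every other embedding in the batch
    acts as a negative.
    """
    # L2-normalize so dot products become cosine similarities.
    z = np.concatenate([z1, z2], axis=0)            # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature                     # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                  # exclude self-similarity

    n = z1.shape[0]
    # Index of each row's positive partner: row i pairs with row i + n (mod 2N).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy per row: -log softmax evaluated at the positive index.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.05 * rng.normal(size=(8, 16))  # stand-in for an augmented view
print(nt_xent_loss(z1, z2))
```

Maximizing agreement corresponds to minimizing this loss: it is small when the two views of each image are close (relative to all negatives), and the temperature controls how sharply hard negatives are penalized.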

