AMORTISED INVARIANCE LEARNING FOR CONTRASTIVE SELF-SUPERVISION

Abstract

Contrastive self-supervised learning methods famously produce high-quality transferable representations by learning invariances to different data augmentations. Invariances established during pre-training can be interpreted as strong inductive biases, which may or may not be helpful depending on whether they match the invariance requirements of downstream tasks. This has motivated several attempts to learn task-specific invariances during pre-training; however, these methods are highly compute-intensive and tedious to train. We introduce the notion of amortised invariance learning for contrastive self-supervision. In the pre-training stage, we parameterise the feature extractor by differentiable invariance hyper-parameters that control the invariances encoded by the representation. Then, for any downstream task, both the linear readout and the task-specific invariance requirements can be efficiently and effectively learned by gradient descent. We evaluate the notion of amortised invariances for contrastive learning over two different modalities: vision and audio; on two widely used contrastive learning methods in vision, SimCLR and MoCo-v2, with popular architectures such as ResNets and Vision Transformers; and with SimCLR and ResNet-18 for audio. We show that our amortised features provide a reliable way to learn diverse downstream tasks with different invariance requirements, while using a single feature extractor and avoiding task-specific pre-training. This provides an exciting perspective that opens up new horizons in the field of general-purpose representation learning.

1. INTRODUCTION

Self-supervised learning has emerged as a driving force in representation learning, as it eliminates the dependency on data annotation and enables scaling up to larger datasets that tend to produce better representations (Ericsson et al., 2022b). Among the flavours of self-supervision, contrastive learning has been particularly successful in important application disciplines such as computer vision (Chen et al., 2020c; Caron et al., 2020; Zbontar et al., 2021), medical AI (Azizi et al., 2021; Krishnan et al., 2022), and audio processing (Al-Tahan & Mohsenzadeh, 2021). The key common element of the various contrastive learning methods is training representations to be invariant to particular semantics-preserving input transformations (e.g., image blur, audio frequency masking) that are applied synthetically during training. Such invariances provide a strong inductive bias that can improve downstream learning speed, generalisation, and robustness (Geirhos et al., 2020). A major vision motivating self-supervision research has been to produce a general-purpose representation that can be learned once, albeit at substantial cost, and then cost-effectively re-used for different tasks of interest. Rapidly advancing research (Chen et al., 2020c; Caron et al., 2020; Zbontar et al., 2021), as summarised by various evaluation studies (Azizi et al., 2021; Ericsson et al., 2021), shows progress towards this goal. If successful, this could displace the 'end-to-end supervised learning for each task' principle that has dominated deep learning and alleviate its data annotation cost. However, this vision is not straightforward to achieve: in reality, different tasks often require mutually incompatible invariances (inductive biases). For example, object recognition may benefit from rotation and blur invariance, but pose-estimation and blur-estimation tasks obviously prefer rotation and blur equivariance, respectively.
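The augmentation-driven objective referred to above can be made concrete with the standard normalised-temperature cross-entropy (NT-Xent) loss used by SimCLR: embeddings of two augmented views of the same input are pulled together, while all other samples in the batch act as negatives. The following is a minimal NumPy sketch of this loss (the function name and shapes are illustrative, not taken from the paper):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Simplified NT-Xent (SimCLR) contrastive loss.

    z1, z2: (N, D) embeddings of two augmented views of the same N inputs.
    Row i of z1 and row i of z2 form a positive pair; all other rows in the
    batch serve as negatives.
    """
    # L2-normalise so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)           # (2N, D)
    sim = z @ z.T / temperature                    # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    n = len(z1)
    # index of each sample's positive partner: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # cross-entropy of each row against its positive
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimising this loss over augmented views is what encodes the chosen invariances into the representation: near-identical views yield a lower loss than unrelated ones.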
Training a feature extractor with any given invariance will likely harm some task of interest, as quantified recently by Ericsson et al. (2022a). This has led to work on learning task-specific invariances/augmentations for self-supervision with meta-gradients (Raghu et al., 2021) or BayesOpt (Wagner et al., 2022), which is extremely expensive and cumbersome; and on training feature ensembles using multiple backbones with different invariances (Xiao et al., 2021; Ericsson et al., 2022a), which is also expensive and not scalable. In this paper we therefore raise the question: How can we learn a single general-purpose representation that efficiently supports a set of downstream tasks with conflicting, and a-priori unknown, invariance requirements?

To address these issues we explore the notion of amortised invariance learning in contrastive self-supervision. We parameterise the contrastive learner's neural architecture by a set of differentiable invariance hyper-parameters, such that the feature extraction process is conditioned on a particular set of invariance requirements. During contrastive pre-training, sampled augmentations correspond to observed invariance hyper-parameters. By learning this architecture over a range of augmentations, we essentially learn a low-dimensional manifold of feature extractors that is parameterised by the desired invariances. During downstream task learning, we freeze the feature extractor and learn a new readout head as well as the unknown invariance hyper-parameters. Thus the invariance requirements of each downstream task are detected automatically in a way that is efficient and parameter-light. Our framework provides an interesting new approach to general-purpose representation learning by supporting a range of invariances within a single feature extractor. We demonstrate this concept empirically for two different modalities, vision and audio, using SimCLR (Chen et al., 2020a) and MoCo (Chen et al., 2020c) as representative contrastive learners, and provide two instantiations of the amortised learning framework: a hypernetwork-based approach (Ha et al., 2017) for ResNet CNNs, and a prompt learning approach for ViTs (Dosovitskiy et al., 2021). We evaluate both classification and regression tasks in both many-shot and few-shot regimes. Finally, we provide theoretical insights into why our amortised learning framework provides strong generalisation performance.
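The downstream stage described above can be illustrated with a deliberately tiny sketch: a linear feature extractor whose weights are generated by a frozen hypernetwork from invariance hyper-parameters, after which only the readout and the invariance hyper-parameters are learned by gradient descent. Everything here (the names `H`, `extract`, `inv`, the linear extractor, the toy regression task) is an illustrative assumption standing in for the paper's ResNet/ViT instantiations, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_in input dims, d_feat feature dims, k invariance hyper-parameters.
d_in, d_feat, k = 8, 4, 3

# Frozen hypernetwork H: maps invariance hyper-parameters inv in [0,1]^k to the
# weights of a linear feature extractor (a toy stand-in for a hypernetwork
# over ResNet weights). In the real method H is learned during pre-training.
H = rng.normal(scale=0.1, size=(k, d_in * d_feat))
b = rng.normal(scale=0.1, size=(d_in * d_feat,))

def extract(x, inv):
    """Feature extractor conditioned on invariance hyper-parameters inv."""
    W = (inv @ H + b).reshape(d_in, d_feat)
    return x @ W

# Downstream task: freeze H and b, learn readout w and invariances inv by
# gradient descent on a toy regression problem (analytic gradients, since
# everything here is linear).
x = rng.normal(size=(32, d_in))
y = rng.normal(size=(32,))
inv = np.full(k, 0.5)          # unknown invariance requirements, initialised mid-range
w = np.zeros(d_feat)           # linear readout head
lr = 0.01
for _ in range(200):
    f = extract(x, inv)                           # (32, d_feat) features
    err = f @ w - y                               # residuals of 0.5 * MSE loss
    grad_w = f.T @ err / len(x)                   # d loss / d w
    grad_W = x.T @ np.outer(err, w) / len(x)      # d loss / d extractor weights
    grad_inv = H @ grad_W.reshape(-1)             # chain rule through the hypernetwork
    w -= lr * grad_w
    inv = np.clip(inv - lr * grad_inv, 0.0, 1.0)  # keep invariance strengths in [0,1]
```

After training, `inv` holds the task's estimated invariance requirements; the key point is that only `w` and `inv` are optimised downstream, while the (pre-trained) hypernetwork stays fixed.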

2. RELATED WORK

Invariance Learning. Invariances have been learned by MAP (Benton et al., 2020), marginal likelihood (Immer et al., 2022), BayesOpt (Wagner et al., 2022), and meta-learning (Raghu et al., 2021), where gradients from the validation set are backpropagated to update the invariances or augmentation choices. All these approaches are highly data- and compute-intensive due to the substantial effort required to train an invariance at each iteration of invariance learning. Our framework amortises the cost of invariance learning so that task-specific invariances can be learned quickly and easily downstream.

Invariances in Self-Supervision. Self-supervised methods (Ericsson et al., 2022b) often rely on contrastive augmentations (Chen et al., 2020c). Their success has been attributed to engendering invariances (Ericsson et al., 2021; Wang & Isola, 2020; Purushwalkam & Gupta, 2020) through these augmentations, which in turn provide good inductive biases for downstream tasks. Self-supervision sometimes aspires to provide a single general-purpose feature suited to all tasks, in the guise of foundation models (Bommasani et al., 2021). However, studies have shown that different augmentations (invariances) suit different downstream tasks, with no single feature being optimal for all tasks (Ericsson et al., 2022a) and performance suffering if inappropriate invariances are imposed. This leads to the tedious need to produce and combine an ensemble of features (Xiao et al., 2021; Ericsson et al., 2022a), to disentangle invariance and transformation prediction (Lee et al., 2021), or to perform costly task-specific self-supervised pre-training (Raghu et al., 2021; Wagner et al., 2022). Our framework breathes new life into the notion of self-supervised learning of general-purpose representations by learning a parametric feature extractor that spans an easily accessible range of invariances, and provides easy support for explicit estimation of the task-specific invariance requirements of downstream tasks.

Self-Supervision in Audio and Beyond

The design of typical augmentations in computer vision benefits from a large collective body of wisdom (Chen et al., 2020b) about suitable augmentations/invariances for common tasks of interest. Besides the task-dependence (e.g., recognition vs. pose-estimation) of invariances already discussed, bringing self-supervision to new domains with less prior knowledge, such as audio, often requires an expensive grid search to find a good augmentation suite (Al-Tahan & Mohsenzadeh, 2021; Wagner et al., 2022), where each step consists of self-supervised pre-training followed by downstream task evaluation. Our framework also benefits these situations: we can simply pre-train once with a fairly unconstrained suite of augmentations, and then quickly search for the augmentations beneficial to downstream tasks in this modality.




