STEERABLE EQUIVARIANT REPRESENTATION LEARNING

Abstract

Pre-trained deep image representations are useful for post-training tasks such as classification through transfer learning, image retrieval, and object detection. Data augmentations are a crucial aspect of pre-training robust representations in both supervised and self-supervised settings. Data augmentations explicitly or implicitly promote invariance in the embedding space to the input image transformations. This invariance hurts generalization on downstream tasks that rely on sensitivity to these particular data augmentations. In this paper, we propose a method of learning representations that are instead equivariant to data augmentations. We achieve this equivariance through the use of steerable representations. Our representations can be manipulated directly in embedding space via learned linear maps. We demonstrate that our resulting steerable and equivariant representations lead to better performance on transfer learning and robustness: e.g., we improve linear probe top-1 accuracy by 1% to 3% for transfer, and ImageNet-C accuracy by up to 3.4%. We further show that the steerability of our representations provides a significant speedup (nearly 50×) for test-time augmentation; by applying a large number of augmentations for out-of-distribution detection, we significantly improve OOD AUC on the ImageNet-C dataset over an invariant representation.

1. INTRODUCTION

Embeddings of pre-trained deep image models are extremely useful in a variety of downstream tasks such as zero-shot retrieval (Radford et al., 2021), few-shot transfer learning (Tian et al., 2020), perceptual quality metrics (Zhang et al., 2018) and the evaluation of generative models (Heusel et al., 2017; Salimans et al., 2016). The pre-training is done with various supervised or self-supervised losses (Khosla et al., 2020; Radford et al., 2021; Chen et al., 2020) and a variety of architectures (He et al., 2016; Dosovitskiy et al., 2020; Tolstikhin et al., 2021). The properties of pre-trained embeddings, such as generalization (Zhai et al., 2019) and robustness (Naseer et al., 2021), are therefore of significant interest. Most current pre-training methods impose invariance to input data augmentations either via losses (Tsuzuku et al., 2018; Chen et al., 2020; Caron et al., 2021) or architectural components such as pooling (Fan et al., 2011). For invariant embeddings, the (output) embedding stays nearly constant for all transformations of a sample (e.g. geometric or photometric transformations of the input). Invariance is desirable for tasks where the transformation is a nuisance variable (Lyle et al., 2020). However, prior work shows that it can lead to poor performance on tasks where sensitivity to transformations is desirable (Dangovski et al., 2022; Xiao et al., 2021). Equivariance is a more general property: an equivariant embedding changes (smoothly) with respect to changes at the input to the encoder (Dangovski et al., 2022). If the change is zero (or very small), we get invariance as a special case. In prior work, equivariant embeddings have been shown to have numerous benefits: reduced sample complexity for training, improved generalization and transfer learning performance (Cohen & Welling, 2016b; Simeonov et al., 2021; Lenssen et al., 2018; Xiao et al., 2021).
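The distinction above can be stated formally. With an encoder f and an input transformation T_g indexed by a group element g, equivariance asks for a corresponding embedding-space transformation T'_g (the notation here is ours, for illustration only):

```latex
f(T_g x) = T'_g \, f(x) \quad \text{for all inputs } x,
```

with invariance recovered as the special case T'_g = \mathrm{id}, i.e. the embedding does not move at all when the input is transformed.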
Equivariance has mostly been achieved via architectural modifications (Finzi et al., 2021; Cohen & Welling, 2016b), and such approaches are largely restricted to symmetries represented as matrix groups. However, this does not cover important transformations such as photometric changes, or others that cannot be represented explicitly as matrix transformations. Xiao et al. (2021) and Dangovski et al. (2022) propose more flexible approaches that allow arbitrary input transformations to be represented in the embedding space, in the self-supervised setting. However, a key distinction between these works and ours is that we parameterize the transformations in latent space, allowing for steering.
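A minimal sketch of the steering idea follows. It uses a toy linear "encoder" and a least-squares fit in place of a trained deep network and a learned objective, so every name here (`encode`, `transform`, the dimensions) is a hypothetical stand-in rather than the paper's actual method; the point is only that a linear map M fit on paired embeddings can emulate an input augmentation directly in embedding space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": a fixed linear map from 8-dim inputs to 16-dim embeddings.
# (Stand-in for a trained deep encoder; chosen linear so the fit is exact.)
W = rng.normal(size=(16, 8))
encode = lambda x: W @ x

# A fixed input transformation (stand-in for one data augmentation).
transform = lambda x: np.roll(x, 2, axis=0)

# Fit a linear steering map M so that M @ encode(x) ≈ encode(transform(x)),
# here by ordinary least squares on paired embeddings of 500 random inputs.
X = rng.normal(size=(8, 500))
Z, Zt = encode(X), encode(transform(X))
M, *_ = np.linalg.lstsq(Z.T, Zt.T, rcond=None)  # solves Z.T @ M ≈ Zt.T
M = M.T

# "Steering": apply the augmentation in embedding space, no re-encoding.
x_new = rng.normal(size=8)
steered = M @ encode(x_new)
direct = encode(transform(x_new))
rel_err = np.linalg.norm(steered - direct) / np.linalg.norm(direct)
print(rel_err)  # ~0 for this linear toy
```

This is why steerability speeds up test-time augmentation: once M is learned for an augmentation, applying it costs one matrix-vector product in embedding space instead of a full forward pass through the encoder on the augmented image.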

