STEERABLE EQUIVARIANT REPRESENTATION LEARNING

Abstract

Pre-trained deep image representations are useful for post-training tasks such as classification through transfer learning, image retrieval, and object detection. Data augmentations are a crucial aspect of pre-training robust representations in both supervised and self-supervised settings. Data augmentations explicitly or implicitly promote invariance in the embedding space to the input image transformations. This invariance reduces generalization to downstream tasks that rely on sensitivity to these particular data augmentations. In this paper, we propose a method of learning representations that are instead equivariant to data augmentations. We achieve this equivariance through the use of steerable representations: our representations can be manipulated directly in embedding space via learned linear maps. We demonstrate that the resulting steerable and equivariant representations lead to better performance on transfer learning and robustness: e.g., we improve linear probe top-1 accuracy by 1% to 3% for transfer, and ImageNet-C accuracy by up to 3.4%. We further show that the steerability of our representations provides a significant speedup (nearly 50×) for test-time augmentations; by applying a large number of augmentations for out-of-distribution detection, we significantly improve OOD AUC on the ImageNet-C dataset over an invariant representation.

1. INTRODUCTION

Embeddings of pre-trained deep image models are extremely useful in a variety of downstream tasks such as zero-shot retrieval (Radford et al., 2021), few-shot transfer learning (Tian et al., 2020), perceptual quality metrics (Zhang et al., 2018), and the evaluation of generative models (Heusel et al., 2017; Salimans et al., 2016). Pre-training is done with various supervised or self-supervised losses (Khosla et al., 2020; Radford et al., 2021; Chen et al., 2020) and a variety of architectures (He et al., 2016; Dosovitskiy et al., 2020; Tolstikhin et al., 2021). The properties of pre-trained embeddings, such as generalization (Zhai et al., 2019) and robustness (Naseer et al., 2021), are therefore of significant interest. Most current pre-training methods impose invariance to input data augmentations, either via losses (Tsuzuku et al., 2018; Chen et al., 2020; Caron et al., 2021) or architectural components such as pooling (Fan et al., 2011). For invariant embeddings, the output embedding stays nearly constant across all transformations of a sample (e.g., geometric or photometric transformations of the input). Invariance is desirable for tasks where the transformation is a nuisance variable (Lyle et al., 2020). However, prior work shows that it can lead to poor performance on tasks where sensitivity to transformations is desirable (Dangovski et al., 2022; Xiao et al., 2021). Equivariance is a more general property: an equivariant embedding changes (smoothly) with respect to changes at the input to the encoder (Dangovski et al., 2022). If the change is zero (or very small), we recover invariance as a special case. In prior work, equivariant embeddings have been shown to have numerous benefits: reduced sample complexity for training, and improved generalization and transfer learning performance (Cohen & Welling, 2016b; Simeonov et al., 2021; Lenssen et al., 2018; Xiao et al., 2021).
Equivariance has mostly been achieved through architectural modifications (Finzi et al., 2021; Cohen & Welling, 2016b), and is mostly restricted to symmetries represented as matrix groups. However, this does not cover important transformations such as photometric changes, or others that cannot be represented explicitly as matrix transformations. Xiao et al. (2021) and Dangovski et al. (2022) propose more flexible approaches that allow arbitrary input transformations to be represented at the embedding, in the self-supervised setting. A key distinction between these works and ours is that we parameterize the transformations in latent space, which allows for steering.

We introduce some notation. x refers to an input sample (image). e(x; w) represents the encoder network that maps input x to the embedding e, where w are the parameters of the network; we use e(x) and e(x; w) interchangeably for ease of notation. The data augmentation of a sample x is represented as g(x; θ), often shortened to g(x) for brevity. θ refers to the parameters of the augmentation; e.g., for photometric transformations it is a 3-dimensional vector of red, green and blue shifts applied to the image. We denote latent space transformations as M(e, θ), taking as input the embedding e and transformation parameter θ. M may be linear (a matrix) or a nonlinear function (a deep network), and its output is a vector of the same dimension as e(x); thus, M is a mapping from the joint embedding and parameter space to the embedding space. Given this notation, if e(g(x; θ)) = e(x), i.e. the embedding does not change under the input transformation g(θ), the embedding is said to be invariant to g. Equivariance is defined as e(g(x; θ)) = M(e(x), θ); if M is the identity function, we recover invariance. The map M in latent space encourages the embedding to change smoothly with respect to g and θ, the parameters of the transformation.
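The equivariance relation e(g(x; θ)) = M(e(x), θ) can be made concrete with a small sketch. The encoder, augmentation and steering map below are illustrative stand-ins (fixed random linear maps), not the paper's actual networks; the point is only to show the regularization target: the squared gap between the embedding of an augmented input and the steered embedding of the original.

```python
# Hypothetical sketch (not the paper's implementation): illustrate the
# equivariance target || e(g(x; theta)) - M(e(x), theta) ||^2 with linear maps.
import numpy as np

rng = np.random.default_rng(0)
D = 8    # embedding dimension
P = 3    # augmentation parameter dimension (e.g. RGB shifts)

W_enc = rng.normal(size=(D, 16))   # stand-in encoder weights
A = rng.normal(size=(16, P))       # stand-in augmentation direction in input space

def encoder(x):
    """Stand-in encoder e(x; w): a fixed linear map, for illustration only."""
    return W_enc @ x

def augment(x, theta):
    """Stand-in augmentation g(x; theta): shift the input along directions A."""
    return x + A @ theta

def steer(e, theta, M):
    """Latent-space map M(e, theta): here linear in theta."""
    return e + M @ theta

x = rng.normal(size=16)
theta = rng.normal(size=P)

# With purely linear encoder/augmentation, the choice M = W_enc @ A makes the
# equivariance penalty vanish; in practice M would be learned jointly.
M = W_enc @ A
loss = np.sum((encoder(augment(x, theta)) - steer(encoder(x), theta, M)) ** 2)
```

In the nonlinear setting the gap is generally nonzero, and minimizing it as a regularizer during pre-training encourages the embedding to move predictably under the input augmentation.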
These maps M allow us to directly manipulate the embeddings e, leading to the concept of steerability (e.g., Freeman et al., 1991). It has been shown that pre-trained embeddings often accommodate linear vector operations, enabling e.g. nearest neighbor retrieval using Euclidean distance (Radford et al., 2021); this is a coarse form of steerability. However, without more explicit control over the embeddings, it is difficult to perform fine-grained operations in this vector space, for example, re-ordering retrieved results by color attributes. Steering an invariant model is of little practical use: its embeddings may change very little in response to steering. In contrast, enabling steerability for an equivariant embedding opens up a number of applications for control in embedding space; we show the benefits in our experiments.

We introduce a simple and general regularizer to encourage embedding equivariance to input data augmentations. The same mechanism (mapping functions) used for the regularization enables a simple steerable mechanism to control embeddings post-training. Our regularizer achieves significantly more equivariance (as measured by the metric of Jayaraman & Grauman (2015)) than pre-training without the regularizer. Prior work (Cohen & Welling, 2016a; Deng et al., 2021; Zhang, 2019) has introduced
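Steered retrieval can be sketched as follows. The steering map M, the gallery, and the parameter θ below are all illustrative assumptions: the idea is that, instead of re-encoding augmented copies of the query image, one moves the query embedding directly in latent space and re-ranks the gallery, which is what makes large numbers of test-time augmentations cheap.

```python
# Hypothetical sketch of steering for retrieval: shift a query embedding with a
# (learned) linear map M, then rank a pre-computed gallery by distance. All
# values here are random placeholders for illustration.
import numpy as np

rng = np.random.default_rng(1)
D, P = 8, 3
M = rng.normal(size=(D, P))          # stand-in for a learned steering map

gallery = rng.normal(size=(100, D))  # pre-computed gallery embeddings
e_query = rng.normal(size=D)         # pre-computed query embedding

theta = np.array([0.5, 0.0, -0.5])   # e.g. a "redder, less blue" color shift
e_steered = e_query + M @ theta      # steer directly in embedding space

# Nearest neighbors under Euclidean distance: one matrix-vector product and one
# sort per steering parameter, with no re-encoding of augmented images.
dists = np.linalg.norm(gallery - e_steered, axis=1)
top4 = np.argsort(dists)[:4]
```

Because each new θ costs only a matrix-vector product rather than a forward pass through the encoder, sweeping many augmentation parameters (as in the test-time augmentation and OOD experiments) becomes inexpensive.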



Figure 1: Two examples of image retrieval, comparing our steerable equivariant model to the baseline invariant (standard) model. The top example (flowers) shows color-based steering; the bottom example (buildings) shows brightness-based retrieval. For each example, we show three query images in the left column, along with nearest neighbors in the next 8 columns (4 each for the steerable and standard models). Please see the text for definitions of e(x), M(e(x)) and ∆M(e(x)). The query image is shown for illustration only; it is not itself used for the retrieval. The steerable model retrieves images where the color or brightness change overrides semantics: for example, in the flowers example the second query retrieves yellow/pink neighbors and the third query retrieves purple/blue flowers; similarly, in the buildings example, darker queries retrieve darker images and brighter queries retrieve brighter images. The invariant model's retrievals remain largely unchanged across different color or brightness changes.

