STEERABLE EQUIVARIANT REPRESENTATION LEARNING

Abstract

Pre-trained deep image representations are useful for post-training tasks such as classification through transfer learning, image retrieval, and object detection. Data augmentations are a crucial aspect of pre-training robust representations in both supervised and self-supervised settings. Data augmentations explicitly or implicitly promote invariance in the embedding space to the input image transformations. This invariance hurts generalization on those downstream tasks which rely on sensitivity to these particular data augmentations. In this paper, we propose a method of learning representations that are instead equivariant to data augmentations. We achieve this equivariance through the use of steerable representations: our representations can be manipulated directly in embedding space via learned linear maps. We demonstrate that the resulting steerable and equivariant representations lead to better performance on transfer learning and robustness: e.g., we improve linear probe top-1 accuracy by 1% to 3% for transfer, and ImageNet-C accuracy by up to 3.4%. We further show that the steerability of our representations provides a significant speedup (nearly 50×) for test-time augmentations; by applying a large number of augmentations for out-of-distribution detection, we significantly improve OOD AUC on the ImageNet-C dataset over an invariant representation.

1. INTRODUCTION

Embeddings of pre-trained deep image models are extremely useful in a variety of downstream tasks such as zero-shot retrieval (Radford et al., 2021), few-shot transfer learning (Tian et al., 2020), perceptual quality metrics (Zhang et al., 2018) and the evaluation of generative models (Heusel et al., 2017; Salimans et al., 2016). The pre-training is done with various supervised or self-supervised losses (Khosla et al., 2020; Radford et al., 2021; Chen et al., 2020) and a variety of architectures (He et al., 2016; Dosovitskiy et al., 2020; Tolstikhin et al., 2021). The properties of pre-trained embeddings, such as generalization (Zhai et al., 2019) and robustness (Naseer et al., 2021), are therefore of significant interest. Most current pre-training methods impose invariance to input data augmentations, either via losses (Tsuzuku et al., 2018; Chen et al., 2020; Caron et al., 2021) or architectural components such as pooling (Fan et al., 2011). For invariant embeddings, the (output) embedding stays nearly constant for all transformations of a sample (e.g. geometric or photometric transformations of the input). Invariance is desirable for tasks where the transformation is a nuisance variable (Lyle et al., 2020). However, prior work shows that it can lead to poor performance on tasks where sensitivity to transformations is desirable (Dangovski et al., 2022; Xiao et al., 2021). Equivariance is a more general property: an equivariant embedding changes (smoothly) with respect to changes at the input of the encoder (Dangovski et al., 2022). If the change is zero (or very small), we recover invariance as a special case. In prior work, equivariant embeddings have been shown to have numerous benefits: reduced sample complexity for training, and improved generalization and transfer learning performance (Cohen & Welling, 2016b; Simeonov et al., 2021; Lenssen et al., 2018; Xiao et al., 2021).
Equivariance has mostly been achieved via architectural modifications (Finzi et al., 2021; Cohen & Welling, 2016b), and is mostly restricted to symmetries represented as matrix groups. However, this does not cover important transformations such as photometric changes, or others that cannot be represented explicitly as matrix transformations. Xiao et al. (2021) and Dangovski et al. (2022) propose more flexible approaches that allow arbitrary input transformations to be represented in the embedding, for the self-supervised setting. However, a key distinction between these works and ours is that we parameterize the transformations in latent space, allowing for steering.

We now introduce some notation. x refers to an input sample (image). e(x; w) represents the encoder network that maps input x to the embedding e, where w are the parameters of the network. We use e(x) and e(x; w) interchangeably for ease of notation. The data augmentation of a sample x is represented as g(x; θ), often shortened to g(x) for brevity. θ refers to the parameters of the augmentation; e.g. for photometric transformations it is a 3-dimensional vector of red, green and blue shifts applied to the image. We denote latent space transformations as M(e, θ), taking as input the embedding e and the transformation parameter θ. M may be linear (a matrix) or a nonlinear function (a deep network), whose output is another vector of the same dimensions as e(x). Thus, M is a mapping from the joint embedding and parameter space to the embedding space. Given this notation, if e(g(x; θ)) = e(x), i.e. the embedding does not change under the input transformation g(·; θ), the embedding is said to be invariant to g. Equivariance is defined as e(g(x; θ)) = M(e(x), θ). If M is the identity function, then we recover invariance. The map M in latent space encourages the embedding to change smoothly with respect to g and θ, the transformation and its parameters.
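The invariance/equivariance distinction above can be made concrete with a toy example (not the paper's model): a linear encoder together with an input scaling, for which the matching latent-space map satisfies the equivariance condition exactly. All names here are illustrative.

```python
import numpy as np

# Toy illustration: a linear encoder e(x) = A x and an input scaling
# g(x; theta) = theta * x. The latent map M(e, theta) = theta * e then
# satisfies e(g(x; theta)) = M(e(x), theta) exactly, i.e. the embedding
# is equivariant (but not invariant) to g.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2))               # hypothetical encoder weights
x = rng.normal(size=2)
theta = 2.5

e = lambda v: A @ v                       # encoder
g = lambda v, t: t * v                    # input-space transformation
M = lambda e_vec, t: t * e_vec            # latent-space map

assert np.allclose(e(g(x, theta)), M(e(x), theta))   # equivariance holds
assert not np.allclose(e(g(x, theta)), e(x))         # but not invariance
```

For a general deep encoder no such closed-form M exists, which is why the paper learns M from data.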
These maps M allow us to directly manipulate the embeddings e, leading us to the concept of steerability (e.g. Freeman et al., 1991). It has been shown that pre-trained embeddings often accommodate linear vector operations, enabling e.g. nearest neighbor retrieval using Euclidean distance (Radford et al., 2021); this is a coarse form of steerability. However, without more explicit control over the embeddings, it is difficult to perform fine-grained operations on this vector space, for example re-ordering retrieved results by color attributes. It is not very useful in practice to steer an invariant model: the embeddings may change very little in response to steering. However, enabling steerability for an equivariant embedding opens up a number of applications for control in embedding space; we show the benefits in our experiments.

We introduce a simple and general regularizer to encourage embedding equivariance to input data augmentations. The same mechanism (mapping functions) used for the regularization enables a simple steerable mechanism to control embeddings post-training. Our regularizer achieves significantly more equivariance (as measured by the metric in Jayaraman & Grauman (2015)) than pre-training without the regularizer. Prior work (Cohen & Welling, 2016a; Deng et al., 2021; Zhang, 2019) has introduced specialized architectures to make deep networks equivariant to specific transformations such as rotation or translation. Our approach is complementary to these works: as long as a transformation is parameterized in input space, we can train a mapping function in embedding space to mimic that transformation; the method is agnostic to architecture. We show the benefits of our approach with applications in nearest neighbor retrieval when different augmentations are applied in embedding space, demonstrating steerable control of embeddings (see Fig. 1 for an example).
We also test our approach for out-of-distribution detection, transfer, and robustness: our steerable equivariant embeddings significantly outperform the invariant model. We will release our code at www.xxx.yyy.

Data augmentations have been studied extensively (Hendrycks et al., 2019; Yun et al., 2019; Chen et al., 2020; Gidaris et al., 2018). Most data augmentation pipelines involve randomly picking the parameters of each transformation, and defining a deterministic (Cubuk et al., 2020) or learned (Cubuk et al., 2018) order of the transformations. Adversarial training is another class of augmentations (Xie et al., 2020; Herrmann et al., 2021), which provides model-adaptive regularization. Data augmentation expands the training distribution to reduce overfitting in heavily over-parametrized deep networks. This provides a strong regularization effect, and improves generalization consistently across architectures, modalities and loss functions (Hernández-García & König, 2018; Steiner et al., 2021; Hou et al., 2018; Shen et al., 2020; Chen et al., 2020; Caron et al., 2020; He et al., 2021). Data augmentations are also crucial in the training of self-supervised contrastive losses (Chen et al., 2020; Grill et al., 2020).

2. RELATED WORK

Most losses for deep networks implicitly or explicitly impose invariance to input data augmentations (Tsuzuku et al., 2018): for all transformations of a sample (e.g. different color variations), the output embedding stays nearly constant. When this property is useful, e.g. for classification under perturbations, invariance is desirable (Lyle et al., 2020). Many papers have studied invariance properties of convolutional networks to specific augmentations such as translation (Azulay & Weiss, 2018; Zhang, 2019; Bruna & Mallat, 2013), rotation (Sifre & Mallat, 2013) and scaling (Xu et al., 2014). These architectural constructs have been made somewhat redundant in newer Transformer-based deep networks (Vaswani et al., 2017; Dosovitskiy et al., 2020; Tolstikhin et al., 2021), which use a mix of patch tokens and multilayer perceptrons. However, invariance is not universally desirable. Equivariance is a more general property from which invariance can be extracted using aggregation operators such as pooling (Laptev et al., 2016; Fan et al., 2011). Equivariant architectures have benefits such as reduced sample complexity in training (Esteves, 2020) and capturing symmetries in the data (Smidt, 2021). Rotational equivariance has been extensively studied for CNNs (Cohen & Welling, 2016a; Simeonov et al., 2021; Deng et al., 2021). Convolutional networks (without pooling) are constructed to have translational equivariance, although papers such as Azulay & Weiss (2018) study when this property fails to hold. A number of works have suggested specific architectures to enable equivariance, e.g. (Cohen & Welling, 2016b; Dieleman et al., 2016; Lenssen et al., 2018; Finzi et al., 2020; Sosnovik et al., 2019; Romero et al., 2020; Romero & Cordonnier, 2020; Bevilacqua et al., 2021; Smets et al., 2020).
However, these architectures have not been widely adopted in spite of their useful properties, possibly due to the extra effort required to set up and train such specialized models. For many applications, it is also useful to be able to steer equivariant embeddings in a particular direction, to provide fine-grained control. While steerability is a well-understood concept in signal and image processing (Freeman et al., 1991), it is less widely applied in neural networks. Our work is inspired by that of Jayaraman & Grauman (2015), who introduce an unsupervised learning objective to tie together ego-motion and feature learning. The resulting embedding space captures equivariance to complex transformations in 3-D space, which are hard to encode in architectures directly: they use standard convolutional networks and an appropriate loss to encourage equivariance, and show significant benefits of their approach for downstream recognition tasks over invariant baselines. The works of Xiao et al. (2021) and Dangovski et al. (2022) are also closely related. They build explicit equivariant spaces for specific data augmentations (in their case, color jitter, rotation and texture randomization). However, they do so indirectly, by increasing 'sensitivity' to transformations, and do not build any maps equivalent to our M; they hence do not provide steerability. Additionally, their work is restricted to the contrastive setting, whereas our regularizer can be added to any training paradigm.

3. MODEL

In this paper, we work in the context of supervised training of ImageNet classification models. However, note that our approach is general and easily extends to self-supervised settings, e.g. (Chen et al., 2020; He et al., 2021; Chen et al., 2021). Our standard (invariant) model is trained with a cross-entropy loss, along with weight decay regularization with hyperparameter λ:

L_{CE}(x) = -\sum_c y_c(x) \log p_c(x; w) + \lambda \|w\|_2^2    (1)

Here, the embedding e(x; w) is projected to a normalized probability space p(x; w) (the "logits" layer of the network). y(x) refers to the target label vector for the sample x, used for supervised learning. Vector components in both are indexed by c, which can refer to, say, the classes for supervised training. The entries of p(x) and y(x) lie between 0 and 1, and each sums to 1 to form the parameters of a categorical distribution. The usual manner of training with the cross-entropy loss is to first apply a sequence of data augmentations to x, e.g. (Cubuk et al., 2020), and then pass the transformed version of x into the network. Since all transformations of x are encouraged to map to the same distribution y(x), this loss promotes invariance in p(x) and therefore also in the embedding e(x).

3.1. MEASURING EQUIVARIANCE FOR AN AUGMENTATION

In works such as Cohen & Welling (2016a) and Lenssen et al. (2018), the architecture guarantees equivariance. Our approach is agnostic to architecture, so we need a quantitative way to measure equivariance with respect to a particular augmentation. We adapt the measure of Jayaraman & Grauman (2015). Denoting a given augmentation by a, we use:

\rho_a(x) = \frac{\|M_a(e(x), \theta_a) - e(g_a(x; \theta_a))\|_2}{\|e(g_a(x; \theta_a)) - e(x)\|_2}    (2)

The denominator measures invariance: lower values mean more invariant embeddings e w.r.t. the augmentation a applied to input x. The numerator measures equivariance: we want the embedding of a transformed input g_a(x) to be well approximated by a transformation in embedding space, M_a(e(x), θ_a). The lower the value of ρ_a(x), the more equivariant the embedding e(x). Note that we need the ratio in Eqn. 2, rather than just the numerator, to exclude trivial solutions where e(g_a(x; θ_a)) is mapped to the same point for all θ_a (for a given x), and similarly for M_a(e(x), θ_a). These would indeed make the numerator small, but they are representative of invariance rather than equivariance. Dividing by the denominator ensures that e(g_a(x; θ_a)) remains distinct from e(x) for different values of θ_a. Note that M_a and g_a share the same transformation parameter θ_a.
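The measure in Eqn. 2 can be sketched directly from its definition; the toy encoder and augmentation in the usage check below are illustrative assumptions, not the paper's model.

```python
import numpy as np

def rho(M_a, encoder, g_a, x, theta_a):
    """Equivariance measure of Eqn. 2 (adapted from Jayaraman & Grauman, 2015).
    Lower is more equivariant; the denominator rules out the trivially
    invariant solution where e(g_a(x)) collapses onto e(x)."""
    e_x = encoder(x)
    e_gx = encoder(g_a(x, theta_a))
    num = np.linalg.norm(M_a(e_x, theta_a) - e_gx)
    den = np.linalg.norm(e_gx - e_x)
    return num / den

# Toy check: a linear encoder with a scaling augmentation is perfectly
# equivariant under the matching latent map, so rho is (numerically) zero.
A = np.arange(6, dtype=float).reshape(3, 2)
score = rho(lambda e, t: t * e,          # latent map M_a
            lambda v: A @ v,             # encoder
            lambda v, t: t * v,          # input augmentation g_a
            np.array([1.0, 2.0]), 3.0)
assert score < 1e-12
```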

3.2. EQUIVARIANCE-PROMOTING REGULARIZER

We use the numerator of Eqn. 2 to define a regularizer that promotes equivariance in the embeddings:

L_E^a(x) = \|M_a(e(x), \theta_a) - e(g_a(x; \theta_a))\|_2^2    (3)

where we have a separate term for each augmentation a that is applied at the input. As before, the transformation-specific parameters θ_a are shared between the embedding map M_a and the input transformation g_a. Thus, the embedding map learns to apply a transformation in embedding space that mimics the effect of applying the transformation to the input. Hence, these maps M allow us to directly manipulate the embeddings e and provide fine-grained control w.r.t. transformations: a concept we call steerability. Note that it is possible to make representations equivariant but not steerable, as in Xiao et al. (2021) and Dangovski et al. (2022), by not recovering the maps M_a. We use the following structure for M_a: a vector of continuous-valued parameters θ_a is first mapped to a 128-dimensional intermediate parameter representation using a single dense layer. This vector is then concatenated with the given embedding e(x) and passed through another dense layer, giving the final output of the map, which has the same dimensionality (2048) as e(x). This structure adds around 1% extra parameters to a ResNet-50 model.
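A minimal sketch of the map structure described above (dense layer on θ_a to 128 dimensions, concatenation with e(x), dense layer back to 2048 dimensions). The class name, initialization, and the ReLU nonlinearity are assumptions for illustration; the paper specifies only the layer shapes.

```python
import numpy as np

class SteerableMap:
    """Sketch of M_a: theta_a -> 128-d intermediate (dense), concatenated
    with the 2048-d embedding e(x), then one dense layer back to 2048-d.
    Weight init and the ReLU are illustrative choices, not from the paper."""

    def __init__(self, theta_dim, emb_dim=2048, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.02, size=(theta_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.02, size=(hidden + emb_dim, emb_dim))
        self.b2 = np.zeros(emb_dim)

    def __call__(self, e, theta):
        h = np.maximum(theta @ self.W1 + self.b1, 0.0)  # dense + ReLU on theta
        z = np.concatenate([h, e], axis=-1)             # concat with embedding
        return z @ self.W2 + self.b2                    # output, same dim as e

M_photo = SteerableMap(theta_dim=3)
out = M_photo(np.zeros(2048), np.array([0.1, -0.2, 0.0]))
assert out.shape == (2048,)
```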

3.3. UNIFORMITY REGULARIZER

We observe that embeddings learned using L_CE(x) lead to well-formed class clusters, but within a cluster they collapse onto each other, thereby increasing invariance (and Equation (3) cannot prevent this). To overcome this, we add a uniformity loss (Wang & Isola, 2020):

L_U(x) = \log \sum_{i,j} \exp\left( -\|e(x_i) - e(x_j)\|_2^2 / \tau \right)    (4)

where τ is a temperature parameter, often set to a small value such as 0.1. This loss encourages the embedding of each sample x_i to separate from the embeddings of other samples x_j. We find that it also has the effect of spreading out augmentations of the same sample, increasing the equivariance of the embeddings and reducing the likelihood of a trivial solution for the mapping M_a. This could alternatively have been achieved by maximizing the denominator of Equation (2), but we found empirically that doing so strongly pushed e(x) and e(g_a(x)) apart and destabilized training.
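Eqn. 4 can be sketched over a batch of embeddings as follows; whether the sum includes the i = j terms is an assumption here (the paper writes an unrestricted sum over i, j).

```python
import numpy as np

def uniformity_loss(E, tau=0.1):
    """Uniformity regularizer of Eqn. 4 (Wang & Isola, 2020) over a batch
    of embeddings E with shape [n, d]: log-sum of Gaussian-kernel
    similarities over all pairs (i, j). Minimizing it spreads embeddings
    apart; collapsed embeddings give the largest value."""
    sq = np.sum((E[:, None, :] - E[None, :, :]) ** 2, axis=-1)  # pairwise sq. dists
    return float(np.log(np.sum(np.exp(-sq / tau))))

collapsed = np.zeros((4, 8))        # all embeddings identical
spread = np.eye(4, 8) * 3.0         # well-separated embeddings
assert uniformity_loss(spread) < uniformity_loss(collapsed)
```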

3.4. LOSS

Putting the above together, our final loss to train an equivariant/steerable model is given by:

L_{CEU}(x) = L_{CE}(x) + \alpha \sum_a L_E^a(x) + \beta \left( L_U(x) + \sum_a L_U(g_a(x; \theta_a)) \right)    (5)

The sums are computed over the different augmentations a, and a separate embedding map is learned for each augmentation for which we desire equivariance in the embedding. α and β are weighting hyperparameters. Note that the L_E and L_U terms are applied to the embedding e, while the cross-entropy is applied to the softmax output p.
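The assembly of Eqn. 5 from its pieces can be sketched as below; the helper names are illustrative, and the component losses (cross-entropy, L_E^a, L_U) are assumed precomputed per Eqns. 1, 3 and 4.

```python
import numpy as np

def equiv_term(M_a, e_x, e_gx, theta_a):
    """L_E^a of Eqn. 3: squared distance between the steered embedding
    M_a(e(x), theta_a) and the embedding of the augmented input."""
    return float(np.sum((M_a(e_x, theta_a) - e_gx) ** 2))

def total_loss(ce, equiv_terms, unif_terms, alpha=0.1, beta=0.1):
    """Eqn. 5: cross-entropy plus weighted sums of the equivariance terms
    (one per augmentation) and the uniformity terms (clean and augmented
    views). alpha = beta = 0.1 in the paper's experiments."""
    return ce + alpha * sum(equiv_terms) + beta * sum(unif_terms)

assert total_loss(1.0, [2.0], [3.0]) == 1.0 + 0.1 * 2.0 + 0.1 * 3.0
```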

4. EXPERIMENTS

Our models are trained on the ImageNet dataset (Deng et al., 2009), with the ResNet-50 architecture (He et al., 2016). To train our steerable maps, we use the following augmentations:

• Geometric: a crop transformation parameterized by a 4-dimensional parameter vector θ_geo, representing the position of the top-left corner (x and y coordinates) of the crop, the crop height and the crop width. All values are normalized by the image size (224 in our case). When θ_geo = [0, 0, 1, 1], this corresponds to no augmentation. We denote the corresponding steerable map as M_geo. This augmentation encompasses random crop, zoom, resize (Chen et al., 2020), and translation (by varying only the top-left corner).

• Photometric: a color jitter transformation parameterized by a 3-dimensional parameter vector θ_photo, which represents the respective relative change in the values of the RGB channels. The values of θ_photo lie in the range [-1, 1]. When θ_photo = [0, 0, 0], this corresponds to no augmentation. We denote the corresponding steerable map as M_photo. This augmentation encompasses global contrast, brightness, and hue/color transformations.

Both the standard (invariant) and our steerable equivariant models are trained with the same data augmentation. The invariant model is trained with the standard cross-entropy loss in Eqn. 1 for 250 epochs, with a batch size of 4096 (other training details are in Appendix A.2). This model achieves a top-1 accuracy of 75.17% on the ImageNet evaluation set. The equivariant/steerable model is trained with the loss in Eqn. 5 with the same learning rate schedule, number of epochs and batch size as the invariant model, with hyperparameters α = 0.1 and β = 0.1. As earlier, the data augmentations from SimCLR (Chen et al., 2020) are applied to a sample x and used in L_CEU. In addition, the parameters θ_photo and θ_geo are sampled uniformly at random within their pre-defined ranges.
These are applied independently to generate two more views of x, for use in L_E^a(x) and L_U(x). This model achieves a top-1 accuracy of 74.96% on the ImageNet evaluation set. To facilitate comparison with the equivariant model, we endow the invariant model with similar maps M_geo and M_photo for the geometric and photometric augmentations. We train them using Eqn. 3, but with the encoder parameters frozen.
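The crop and color-jitter parameterizations described above can be sketched as follows. The function names are illustrative, and the resize back to 224×224 that a full pipeline would apply after cropping is omitted for brevity.

```python
import numpy as np

def crop_aug(img, theta_geo):
    """Geometric augmentation g_geo: theta_geo = [x, y, h, w] gives the
    top-left corner and the crop size, all relative to the image size;
    [0, 0, 1, 1] is the identity (resize back to 224x224 omitted)."""
    H, W = img.shape[:2]
    x0, y0 = int(theta_geo[0] * W), int(theta_geo[1] * H)
    h, w = max(1, int(theta_geo[2] * H)), max(1, int(theta_geo[3] * W))
    return img[y0:y0 + h, x0:x0 + w]

def color_aug(img, theta_photo):
    """Photometric augmentation g_photo: per-channel relative shifts in
    [-1, 1]; [0, 0, 0] is the identity. Pixel values assumed in [0, 1]."""
    return np.clip(img * (1.0 + np.asarray(theta_photo)), 0.0, 1.0)

img = np.full((8, 8, 3), 0.5)
assert crop_aug(img, [0, 0, 1, 1]).shape == img.shape   # identity crop
assert np.allclose(color_aug(img, [0, 0, 0]), img)      # identity jitter
```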

4.1. EQUIVARIANCE MEASUREMENT

We use the trained maps (two per model) to measure the equivariance of both models using the measure of Eqn. 2, and report the values in Table 1. The equivariant/steerable model has significantly lower ρ values than the baseline invariant model for both data augmentations, showing that pre-training with an equivariance-promoting regularizer is crucial for learning equivariant and steerable representations.

4.2. NEAREST NEIGHBOR RETRIEVAL

A common use-case for pre-trained embeddings is image retrieval (Xiao et al., 2021; Zheng et al., 2017). In the simplest case, given a query embedding e_q, the nearest neighbors of this embedding are retrieved using Euclidean distance in embedding space over a database of stored embeddings. We test qualitative retrieval performance of the invariant and equivariant models. To mimic a practical setting, we populate a database with the embeddings e(x_i) of samples x_i from the ImageNet validation set. All query and key embeddings are normalized to have an l_2 norm of 1 before being used in retrieval. Given a query sample x, we consider the following query embeddings:

• e(x): embedding of the sample x with no augmentation applied.

• e(g(x)): embedding of x after we apply a transformation g in input space.

• M(e(x)): embedding map applied to the embedding e(x) to steer towards e(g(x)).

• ∆M(e(x)): compute the difference between e(x) and M(e(x)) and add it back to M(e(x)) with a weight w_m, i.e. ∆M(e(x)) = M(e(x)) + w_m(M(e(x)) - e(x)). This enables ∆M(e(x)) to be 'pushed' further in the direction of the transformation. w_m is chosen empirically and set to 5 and 1 for the equivariant and invariant models respectively for all retrieval experiments.

In addition to qualitative results, we compute the mean reciprocal rank (MRR):

MRR = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i}

where r_i is the rank of the desired result within the list of retrieved responses to a single query, and n is the number of queries. MRR lies in the range [0, 1], with 1 signifying perfect retrieval. We calculate MRR for both models, for color, crop, and brightness augmentations. Table 2 shows that the equivariant model achieves better ranks across all augmentations.
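The steered query ∆M(e(x)) and the MRR metric above can be sketched directly from their definitions; the function names are illustrative.

```python
import numpy as np

def delta_M(M, e_x, theta, w_m=5.0):
    """The steered query described above: push M(e(x)) further along the
    direction it moved the embedding, with weight w_m (5 for the
    equivariant model, 1 for the invariant one in the experiments)."""
    m = M(e_x, theta)
    return m + w_m * (m - e_x)

def mean_reciprocal_rank(ranks):
    """MRR = (1/n) * sum_i 1/r_i, where r_i is the rank of the desired
    result for query i; 1.0 means the desired result is always first."""
    r = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / r))

assert mean_reciprocal_rank([1, 1, 1]) == 1.0
assert mean_reciprocal_rank([1, 2, 4]) == (1 + 0.5 + 0.25) / 3
```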

4.2.1. PHOTOMETRIC AUGMENTATIONS

Results for color augmentation, comparing the invariant and our equivariant/steerable models, are displayed in Figure 2. We observe that retrieved results for both e(g(x)) and M(e(x)) change more in response to a change in query color for our steerable equivariant model than for the invariant model. The colors of the retrieved results for all queries with the standard model do not change appreciably, confirming invariance. This effect is even more pronounced for ∆M(e(x)). We were unable to find any value of the parameter w_m for the invariant model that gave results qualitatively similar to the equivariant/steerable model. In Figures 1 (top 3 rows) and 3, we show more examples across different classes and colors. Figure 1 (bottom 3 rows) shows retrieval in the setting of brightness changes. We populate the database with darkened and lightened versions using θ_photo = [δ, δ, δ], where δ > 0 mimics "daytime" and δ < 0 mimics "nighttime" versions of the images. We query for either using ∆M(e(x)). Our steerable model retrieves other images in a lighting setting similar to the query, whereas the invariant model retrieves the exact same nearest neighbors for the dark and light queries. These results demonstrate the benefit of both equivariance and steerability: applying the map to the embeddings gives both better and faster results than applying transformations to input images. Qualitatively, they also display the range of transformations and parameters to which our steerable equivariant representations generalize. More results are shown in the Appendix (Figure 14).

4.2.2. IMAGE CROPPING/ZOOMING

In this experiment, we show that the equivariant/steerable model preserves visual order for the zooming data augmentation. Figure 4 shows the original image and a steered version (in embedding space). Each key image in the dataset consists of multiple zoomed versions of images from different classes. The equivariant model maintains a sensible global ordering (retrieving samples from the same class) as well as local ordering (ordering the nearest neighbors according to the level of zooming). The invariant model does not preserve local ordering: the equivariant model retrievals are correctly ordered by zoom level, whereas the invariant model orders them unpredictably.

4.2.3. COMPOSED AUGMENTATIONS

More complex sequences of augmentations are easily formed by applying the map functions sequentially. In Figure 5, we apply both photometric (color) and geometric (crop) augmentations in the database, and query using composed maps. The returned results respect both augmentations in a sensible manner (although there is no unique ordering). Note that the retrieved results respect high-level semantics (nearest neighbors belong to the same class) in addition to low-level attributes. We calculate MRR for this experiment as well, and report it in Table 2 (last column).

Figure 5: Image retrieval on a composition of augmentations: color and crop transformations. Along each dimension (color or crop), ordering is preserved correctly.
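Sequential application of the maps can be sketched as a simple fold over (map, parameter) pairs; the scalar maps in the check are toy stand-ins, not the learned M_photo/M_geo.

```python
def compose_maps(steps, e_x):
    """Compose latent maps sequentially to mimic a composed input
    augmentation, e.g. steps = [(M_photo, theta_photo), (M_geo, theta_geo)]."""
    e = e_x
    for M, theta in steps:
        e = M(e, theta)
    return e

# Toy check with scalar maps: scale by 2, then shift by 1.
out = compose_maps([(lambda e, t: t * e, 2.0), (lambda e, t: e + t, 1.0)], 3.0)
assert out == 7.0
```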

4.3. TRANSFER LEARNING

While invariance to a particular transformation is useful for one dataset/task, it may hinder performance on another. Thus, we expect equivariant representations to transfer better to downstream datasets than invariant representations. We test this by comparing the linear probe accuracy of both models on Oxford Flowers-102 (Nilsback & Zisserman, 2008), Caltech-101 (Fei-Fei et al., 2004), Oxford-IIIT Pets (Parkhi et al., 2012), and DTD (Cimpoi et al., 2014) (see Table 1). We see that equivariant representations consistently achieve higher accuracy.

4.4. ROBUSTNESS AND OUT-OF-DISTRIBUTION DETECTION

Figure 6: OOD detection for ImageNet (in-distribution) against 4 ImageNet-C corruptions (out-of-distribution). We use up to 60 crop augmentations. Equivariant AUC (latent) monotonically increases, whereas invariant AUC (latent) stays nearly flat. Equivariant AUCs are 5%-15% better than invariant AUCs.

Table 3: Accuracy of models on all the corruptions from ImageNet-C (averaged across severities).

Invariance is commonly encouraged in model pre-training to improve robustness (Zheng et al., 2016; Geirhos et al., 2019; Rebuffi et al., 2021; Hendrycks et al., 2020). We test whether equivariance can hurt in this setting relative to invariance. We measure and compare the accuracy of the representations on various corruptions in the ImageNet-C (Hendrycks & Dietterich, 2019) dataset in Table 3, and find that the equivariant model is in fact surprisingly more robust on all the ImageNet-C corruptions. We also measure the mCE (lower is better) for both models and find that our model has an mCE of 0.81 as compared to the invariant model's 0.845. Despite this better robustness, there is still a significant accuracy loss under corruption; in this case, we want our model to detect a corrupted sample as out-of-distribution (OOD). Test-time data augmentation has enabled better performance on tasks such as detection of out-of-distribution, adversarial or misclassified samples, and uncertainty estimation (Ayhan & Berens, 2018; Bahat & Shakhnarovich, 2020; Wang et al., 2019). These approaches are based on the hypothesis that in-distribution images tend to exhibit stable embeddings under certain image transformations, whereas OOD samples show larger variations. This difference in stability can be exploited to detect out-of-distribution samples. In existing work, e.g. (Wang et al., 2019), multiple augmentations are applied to the input samples, which are then forward propagated through the encoder. This leads to significant computational load, since we typically need a large number of augmentations.
This becomes increasingly impractical as the number of augmentations is increased. With our steerable model, we can instead apply these augmentations directly in embedding space, leading to significant speedups: applying 60 augmentations at input and then forward propagating them takes 14.98 seconds per mini-batch, whereas forward propagating a single view and then applying 60 mappings in embedding space takes only 0.28 and 0.02 seconds per mini-batch respectively: a nearly 50× speedup. We perform OOD detection using the ImageNet validation set as the in-distribution dataset and ImageNet-C as the OOD dataset. We use M_geo to generate multiple augmentations in latent space for a given image, and compute AUC curves across augmentations (see Appendix A.6 for details). Results are shown in Figure 6 for 4 corruptions from ImageNet-C (the remainder are presented in the Appendix). We see a clear benefit of applying latent augmentations for almost all corruptions and severity levels. We further see from Figure 6 that latent augmentations have an insignificant effect on the invariant model's AUCs. These results demonstrate the benefit equivariant representations provide over invariant ones for test-time augmentation, and how steerability can be used to amplify these benefits while obtaining large computational speedups.

5. CONCLUSION

We have presented a method to steer equivariant representations in the direction of chosen data augmentations. To the best of our knowledge, ours is the first work to show a practical approach for general deep network architectures and training paradigms. We show the benefits of steerable equivariant embeddings in retrieval, robustness, transfer learning and OOD detection, with significant performance and computational improvements over the standard (invariant) model. Our method is simple to implement and adds negligible computational overhead at inference time. A limitation of our approach is that it requires learning new maps for every new data augmentation that we would like to steer.

We plot the values of the equivariance metric, ρ, over the course of training in Figure 8. In Table 4, we conduct ablations on the hyperparameters α and β. We can see that both hyperparameters have sweet spots, above and below which the model either does not gain much equivariance, or does so at the cost of reduced accuracy. For the main paper, we selected a model with hyperparameter values such that the cross-entropy accuracy is not adversely reduced w.r.t. the invariant loss, and the ρ equivariance metric is reduced (lower is better) for all augmentations. In Table 6, we show the accuracy of both models on the ImageNet-C dataset (Hendrycks & Dietterich, 2019), on all 15 corruptions and 3 different severity levels. We can see that the equivariant model outperforms the invariant one across the board, with better accuracy on all but 5 out of 45 data points.

A.6 OUT-OF-DISTRIBUTION DETECTION

Test-time Augmentation Details: Here we give details of how we perform test-time augmentation. We use M_geo / M_photo to generate multiple augmentations in latent space for a given input image. We compute the geometric mean across the set of logits generated in this manner (for a given number of augmentations), and then use this average logit to compute softmax probabilities. The maximum softmax value is the confidence for this sample. We use these confidences across a set of ImageNet and ImageNet-C samples and probability threshold values to compute a PR curve, and measure the AUC of this PR curve. We repeat this for up to 60 augmentations, and plot the AUC values against the number of augmentations. In Figure 10, we provide more examples of OOD detection for all 15 corruptions from ImageNet-C (Hendrycks & Dietterich, 2019), with severity level 3. In 12 of the 15 cases, the equivariant latent space outperforms the invariant latent space in AUC on both photometric and geometric augmentations. We display similar plots for different severity levels with the geometric augmentation in Figures 11, 12 and 13. Adding more augmentations may help to further improve performance of the equivariant model. We also repeat the experiment above, but applying augmentations directly to the input images. We can only apply up to 8 input augmentations, as they use significantly more memory than latent augmentations. The results are plotted in Figure 9. We see that (1) in general, input augmentations do better than latent space augmentations, but at significantly higher compute, memory and latency cost; and (2) equivariant input augmentations always do better than those of the invariant model. This shows the benefit of our equivariance-promoting regularizer.
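The latent-space test-time augmentation procedure above can be sketched as follows. We average the logits, which corresponds to a geometric mean of the associated unnormalized probabilities; the function names, the identity classifier and the scaling map in the check are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    p = np.exp(z)
    return p / p.sum()

def latent_tta_confidence(e_x, M_geo, classifier, thetas):
    """Latent-space test-time augmentation: steer e(x) once per sampled
    theta, classify each steered embedding, average the logits (a mean of
    logits corresponds to a geometric mean of the unnormalized
    probabilities), and take the max softmax value as the confidence."""
    logits = np.stack([classifier(M_geo(e_x, t)) for t in thetas])
    return float(np.max(softmax(logits.mean(axis=0))))

# Toy check: identity classifier, scaling map; the averaged logits are
# symmetric across the two classes, so the confidence is 0.5.
conf = latent_tta_confidence(np.ones(2), lambda e, t: t * e,
                             lambda e: e, [1.0, 3.0])
assert abs(conf - 0.5) < 1e-12
```

Only one encoder forward pass is needed per image; the remaining cost is the (cheap) map evaluations, which is the source of the speedup reported in Section 4.4.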



Figure 1: Two examples of image retrieval, comparing our steerable equivariant model to the baseline invariant (standard) model. The top example (flowers) is for color-based steering; the bottom example (buildings) is for brightness-based steering. For each example, we show three query images in the left column, along with nearest neighbors in the next 8 columns (4 each for the steerable and standard models). Please see text for definitions of e(x), M(e(x)) and ∆M(e(x)). The query image shown is for illustration only; we do not use that image for the retrieval. The steerable model retrieves images where the color or brightness change overrides semantics. For example, in the flowers example, the second query retrieves yellow/pink neighbors and the third query retrieves purple/blue flowers; similarly, in the second example, a darker query retrieves darker images and a brighter query retrieves brighter images. The invariant model's retrievals are fairly static across different color or brightness changes.

Figure 2: For both the invariant and equivariant/steerable models, we show performance with 4 types of embeddings. e(g(x)) and M(e(x)) tend to be similar to each other (since Eqn. 3 encourages this). For the invariant model, semantic retrieval (flowers) overrides visual retrieval (pink color). The equivariant model can perform better visual retrieval. By steering using ∆M(e(x)), we can further enhance the color component of the embedding to control visual vs. semantic retrieval.

Figure 4: Examples of retrieval with crop/zoom data augmentation. See text for details. The equivariant model retrieves the same sample, ordered correctly by zoom level (e.g., note how the dog's head is progressively exposed). The invariant model does not preserve the zoom ordering, or retrieves other samples. See Appendix for other examples.

Figure 7: The concepts of invariance, equivariance and steerability of embeddings. Blue boxes represent the (shared) encoder that takes the input x to the embedding e(x). g(x) represents a transformation of x in input space, and M(e(x)) is a mapping in embedding space. Equivariance is a necessary but not sufficient condition for steerability.

Figure 8: Values of the ρ_photo components over the course of training. Left: numerator. Center: denominator. Right: ρ_photo.
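Since Figure 8 tracks the numerator and denominator of ρ separately, the metric is a ratio. A hedged sketch of one plausible form (the exact normalization in Eqn. 2 may differ; this is our assumption, not the paper's definition):

```python
import numpy as np

def rho(e_x, e_gx, M):
    """Sketch of an equivariance metric shaped like the paper's rho
    (Eqn. 2); the exact form is an assumption. Numerator: error of the
    latent map's prediction of the augmented embedding. Denominator:
    raw embedding drift under the augmentation. Lower means the latent
    map M tracks the input-space augmentation g more faithfully."""
    num = np.linalg.norm(e_gx - M @ e_x)
    den = np.linalg.norm(e_gx - e_x)
    return num / den
```

Under this form, a perfectly steerable embedding (M(e(x)) = e(g(x))) gives ρ = 0, while a map no better than the identity gives ρ ≈ 1.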

Figure 10: OOD detection for ImageNet-C when both photometric and geometric augmentations are applied. We see that both augmentations lead to improved OOD performance.

Figure 11: OOD Detection for ImageNet-C with geometric augmentations and multiple severity levels.

Figure 12: (contd.) OOD Detection for ImageNet-C with geometric augmentations and multiple severity levels.

Figure 13: (contd.) OOD Detection for ImageNet-C with geometric augmentations and multiple severity levels.

Left: Equivariance measure (Eqn. 2), ρ_geo ↓ and ρ_photo ↓, for the two sets of augmentations. Right: Linear probe accuracy on 4 datasets (Flowers-102 ↑, DTD ↑, Pets ↑, Caltech-101 ↑): our equivariant model consistently outperforms the invariant model.

Mean Reciprocal Rank for single (left 3) and composed (right) augmentations.

Table 4: Ablations on hyperparameters. Columns: Accuracy, ρ_geo ↓, ρ_photo ↓. Left: ablation on α, with β = 0.1. Right: ablation on β, with α = 1.0.

A.4 ROTATION

We add rotation to the list of augmentations and measure model accuracy and ρ values in Table 5. We see that ρ_rot (the equivariance measure) is lower for our model than for a standard (invariant) ResNet-50 in this case as well, and that the model's accuracy and its ρ values for the other augmentations are minimally affected. This confirms that our proposed method generalizes to rotations as well.

Table 5: Accuracy and equivariance measure (Eqn. 2) with rotation augmentation.

Table 6: Accuracy of models on all the corruptions from ImageNet-C at multiple severities.

…the range of transformations and their parameters that our steerable equivariant representations generalize to.

Ablation with α = 0, β > 0


