WHAT SHOULD NOT BE CONTRASTIVE IN CONTRASTIVE LEARNING

Abstract

Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation. We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks. We further find that the concatenation of the invariant and varying spaces performs best across all tasks we investigate, including coarse-grained, fine-grained, and few-shot downstream classification tasks, and various data corruptions.

1. INTRODUCTION

Self-supervised learning, which uses raw image data and/or available pretext tasks as its own supervision, has become increasingly popular as the inability of supervised models to generalize beyond their training data has become apparent. Different pretext tasks have been proposed around different transformations, such as spatial patch prediction (Doersch et al., 2015; Noroozi & Favaro, 2016), colorization (Zhang et al., 2016; Larsson et al., 2016; Zhang et al., 2017), and rotation (Gidaris et al., 2018). Whereas pretext tasks aim to recover the transformations between different "views" of the same data, more recent contrastive learning methods (Wu et al., 2018; Tian et al., 2019; He et al., 2020; Chen et al., 2020a) instead learn to be invariant to these transformations while remaining discriminative with respect to other data points. Here, the transformations are generated using classic data augmentation techniques which correspond to common pretext tasks, e.g., randomizing color, texture, orientation, and cropping. Yet the inductive bias introduced through such augmentations is a double-edged sword: each augmentation encourages invariance to a transformation, which can be beneficial in some cases and harmful in others. For example, adding rotation may help with view-independent aerial image recognition, but significantly impair a network's ability to solve tasks such as detecting which way is up in a photograph for a display application. Current self-supervised contrastive learning methods assume implicit knowledge of downstream task invariances. In this work, we propose to learn visual representations which capture individual factors of variation in a contrastive learning framework without presuming prior knowledge of downstream invariances.
Instead of mapping an image into a single embedding space which is invariant to all the handcrafted augmentations, our model learns to construct separate embedding sub-spaces, each of which is sensitive to a specific augmentation while invariant to the others. We achieve this by optimizing multiple augmentation-sensitive contrastive objectives using a multi-head architecture with a shared backbone. Our model aims to preserve information with regard to each augmentation in a unified representation, as well as to learn invariances to them. The general representation trained with these augmentations can then be applied to different downstream tasks, where each task is free to selectively utilize different factors of variation in our representation. We consider transferring either the shared backbone representation or the concatenation of all the task-specific heads; both outperform all baselines. The former uses the same embedding dimensionality as typical baselines, while the latter provides the greatest overall performance in our experiments. In this paper, we experiment with three types of augmentations: rotation, color jittering, and texture randomization, as visualized in Figure 1. We evaluate our approach across a variety of diverse tasks, including large-scale classification (Deng et al., 2009), fine-grained classification (Wah et al., 2011; Van Horn et al., 2018), few-shot classification (Nilsback & Zisserman, 2008), and classification on corrupted data (Barbu et al., 2019; Hendrycks & Dietterich, 2019). Our representation shows consistent performance gains as the number of augmentations increases. Our method does not require hand-selection of data augmentation strategies, achieves better performance than the state-of-the-art MoCo baseline (He et al., 2020; Chen et al., 2020b), and demonstrates superior transferability, generalizability, and robustness across tasks and categories.
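As a concrete illustration, the shared-backbone, multi-head design described above can be sketched in a few lines of numpy. The linear backbone and heads, the dimensions, and the class and method names below are hypothetical stand-ins for the actual convolutional backbone and projection heads; the sketch only shows the data flow: one shared feature, one normalized embedding per sub-space, and a concatenated embedding for transfer.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(z):
    # project embeddings onto the unit hypersphere (L2-normalize each row)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

class MultiHeadEncoder:
    """Shared backbone with one projection head per embedding sub-space.

    Linear maps stand in for the real backbone/heads; n_heads would be
    one invariant space plus one space per augmentation.
    """
    def __init__(self, in_dim, feat_dim, emb_dim, n_heads):
        self.backbone = rng.normal(size=(in_dim, feat_dim)) / np.sqrt(in_dim)
        self.heads = [rng.normal(size=(feat_dim, emb_dim)) / np.sqrt(feat_dim)
                      for _ in range(n_heads)]

    def forward(self, x):
        v = x @ self.backbone                       # shared feature v
        zs = [normalize(v @ W) for W in self.heads] # one embedding per sub-space
        return v, zs

    def transfer_embedding(self, x):
        # concatenation of all head outputs, used for downstream transfer
        _, zs = self.forward(x)
        return np.concatenate(zs, axis=-1)
```

Each head would be trained with its own contrastive objective, sensitive to one augmentation and invariant to the rest; at transfer time a downstream task can draw on either the shared feature `v` or the concatenated embedding.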
Specifically, we obtain around a 10% improvement over MoCo in classification on the iNaturalist (Van Horn et al., 2018) dataset.

2. BACKGROUND: CONTRASTIVE LEARNING FRAMEWORK

Contrastive learning learns a representation by maximizing similarity over similar pairs and dissimilarity over dissimilar pairs of data samples. It can be formulated as a dictionary look-up problem (He et al., 2020), where a given reference image $I$ is augmented into two views, query and key, and the query token $q$ should match its designated key $k^+$ over a set of sampled negative keys $\{k^-\}$ from other images. In general, the framework can be summarized by the following components: (i) A data augmentation module $T$ constituted of $n$ atomic augmentation operators, such as random cropping, color jittering, and random flipping. We denote a pre-defined atomic augmentation as a random variable $X_i$. Each atomic augmentation is executed by sampling a specific augmentation parameter from its random variable, i.e., $x_i \sim X_i$. One sampled data augmentation module transforms image $I$ into a random view $\tilde{I}$, denoted as $\tilde{I} = T[x_1, x_2, \ldots, x_n](I)$. The positive pair $(q, k^+)$ is generated by applying two independently sampled data augmentations to the same reference image. (ii) An encoder network $f$ which extracts the feature $v$ of an image $I$ by mapping it into a $d$-dimensional space $\mathbb{R}^d$. (iii) A projection head $h$ which further maps extracted representations into a hyper-spherical (normalized) embedding space. This space is subsequently used for a specific pretext task, i.e., a contrastive loss objective over a batch of positive/negative pairs. A common choice is InfoNCE (Oord et al., 2018):
$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k^+ / \tau)}{\exp(q \cdot k^+ / \tau) + \sum_{k^-} \exp(q \cdot k^- / \tau)}, \tag{1}$$
where $\tau$ is a temperature hyper-parameter scaling the distribution of distances. As a key towards learning a good feature representation (Chen et al., 2020a), a strong augmentation policy prevents the network from exploiting naïve cues to match the given instances. However, in-
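For concreteness, the InfoNCE objective can be sketched for a single query as follows. This is a minimal numpy version, not the original implementation; the function name is ours, and it assumes all embeddings have already been L2-normalized by the projection head.

```python
import numpy as np

def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE loss for a single query (Oord et al., 2018).

    q:      (d,)   L2-normalized query embedding
    k_pos:  (d,)   L2-normalized positive key
    k_negs: (m, d) L2-normalized negative keys
    tau:    temperature scaling the distribution of distances
    """
    pos = np.exp(np.dot(q, k_pos) / tau)      # similarity to the positive key
    negs = np.exp(k_negs @ q / tau)           # similarities to the negative keys
    return -np.log(pos / (pos + negs.sum()))  # cross-entropy over (1 + m) keys
```

A well-matched positive pair drives the numerator up and the loss toward zero, while a mismatched positive raises the loss: this is the pull/push dynamic that contrastive methods rely on, and the reason strong augmentations matter, since they remove trivial cues for matching the two views.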



Figure 1: Self-supervised contrastive learning relies on data augmentations, as depicted in (a), to learn visual representations. However, current methods introduce inductive bias by encouraging neural networks to be less sensitive to information w.r.t. the augmentations, which may help or may hurt. As illustrated in (b), rotation-invariant embeddings can help on certain flower categories but may hurt animal recognition performance; conversely, color invariance generally seems to help coarse-grained animal classification, but can hurt many flower and bird categories. Our method, shown in the following figure, overcomes this limitation.

