IMPROVING TRANSFORMATION INVARIANCE IN CONTRASTIVE REPRESENTATION LEARNING

Abstract

We propose methods to strengthen the invariance properties of representations obtained by contrastive learning. While existing approaches implicitly induce a degree of invariance as representations are learned, we look to more directly enforce invariance in the encoding process. To this end, we first introduce a training objective for contrastive learning that uses a novel regularizer to control how the representation changes under transformation. We show that representations trained with this objective perform better on downstream tasks and are more robust to the introduction of nuisance transformations at test time. Second, we propose a change to how test-time representations are generated by introducing a feature averaging approach that combines encodings from multiple transformations of the original input, finding that this leads to across-the-board performance gains. Finally, we introduce the novel Spirograph dataset to explore our ideas in the context of a differentiable generative process with multiple downstream tasks, showing that our techniques for learning invariance are highly beneficial.

1. INTRODUCTION

Learning meaningful representations of data is a central endeavour in artificial intelligence. Such representations should retain important information about the original input whilst using fewer bits to store it (van der Maaten et al., 2009; Gregor et al., 2016). Semantically meaningful representations may discard a great deal of information about the input, whilst capturing what is relevant. Knowing what to discard, as well as what to keep, is key to obtaining powerful representations.

By defining transformations that are believed a priori to distort the original without altering semantic features of interest, we can learn representations that are (approximately) invariant to these transformations (Hadsell et al., 2006). Such representations may be more efficient and more generalizable than lossless encodings. Whilst less effective for reconstruction, these representations are useful in many downstream tasks that relate only to the semantic features of the input. Representation invariance is also a critically important task in and of itself: it can lead to improved robustness and remove noise (Du et al., 2020), afford fairness in downstream predictions (Jaiswal et al., 2020), and enhance interpretability (Xu et al., 2018).

Contrastive learning is a recent and highly successful self-supervised approach to representation learning that has achieved state-of-the-art performance in tasks that rely on semantic features, rather than exact reconstruction (van den Oord et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; He et al., 2019). These methods learn to match two different transformations of the same object in representation space, distinguishing them from contrasts that are representations of other objects. The objective functions used for contrastive learning encourage representations to remain similar under transformation, whilst simultaneously requiring different inputs to be well spread out in representation space (Wang & Isola, 2020). As such, the choice of transformations is key to their success (Chen et al., 2020a). Typical choices include random cropping and colour distortion. However, representations are compared using a similarity function that can be maximized even for representations that are far apart, meaning that the invariance learned is relatively weak. Unfor-

* Equal contribution

