FEATURE DROPOUT: REVISITING THE ROLE OF AUGMENTATIONS IN CONTRASTIVE LEARNING

Abstract

What role do augmentations play in contrastive learning? Recent work suggests that good augmentations are label-preserving with respect to a specific downstream task. We complicate this picture by showing that label-destroying augmentations can be useful in the foundation model setting, where the goal is to learn diverse, general-purpose representations for multiple downstream tasks. We perform contrastive learning experiments on a range of image and audio datasets with multiple downstream tasks (e.g. for digits superimposed on photographs, predicting the class of one vs. the other). We find that Viewmaker Networks, a recently proposed model for learning augmentations for contrastive learning, produce label-destroying augmentations that stochastically destroy features needed for different downstream tasks. These augmentations are interpretable (e.g. altering shapes, digits, or letters added to images) and surprisingly often result in better performance compared to expert-designed augmentations, despite not preserving label information. To support our empirical results, we theoretically analyze a simple contrastive learning setting with a linear model. In this setting, label-destroying augmentations are crucial for preventing one set of features from suppressing the learning of features useful for another downstream task. Our results highlight the need for analyzing the interaction between multiple downstream tasks when trying to explain the success of foundation models.

1. INTRODUCTION

In recent years, foundation models (Bommasani et al., 2021) have exhibited remarkable progress on a range of AI tasks (Devlin et al., 2019; Liu et al., 2019; Ramesh et al., 2021; Radford et al., 2021; Brown et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022; Alayrac et al., 2022; Reed et al., 2022). A crucial characteristic of foundation models is that they can be adapted for a range of downstream tasks. For example, a foundation model trained on ImageNet should ideally not only perform well at object classification, but should also have learned general features useful for localization, segmentation, and other visual tasks. Indeed, this is borne out by recent work showing the high accuracy of foundation models on a range of downstream tasks (Chen et al., 2020b), as well as a range of analysis work showing models learn high-level semantic features including texture, color, pose, and style (Goh et al., 2021).

One popular strategy for training foundation models involves training models to match transformed versions (known as views or augmentations) of the same input. For example, image views might include common data augmentations such as cropping or color jitter (Chen et al., 2020b), while views for speech might include pitch modulation or spectrogram masking (Kharitonov et al., 2021; Park et al., 2019). This family of objectives includes contrastive approaches such as SimCLR and MoCo, as well as non-contrastive approaches such as BYOL and SwAV (Chen et al., 2020b; He et al., 2020; Grill et al., 2020; Caron et al., 2020).

Given the central importance of these views for defining the self-supervised task, much work has focused on the question of what views lead to high-quality representations. The prevailing consensus, exemplified by Tian et al. (2020), holds that views should be label-preserving with respect to a downstream task.
In other words, because the contrastive loss will produce representations that are invariant to features which vary across views, any information we wish to preserve in the representations should not be altered by such views. As Tian et al. (2020) write: "A good set of views are those that share the minimal information necessary to perform well at the downstream task." Here, we question whether this assumption, and in particular its focus on a single task, is enough to explain why contrastive foundation models succeed on a range of downstream tasks.

In Section 2, we observe that the actual choice and application of views in practice does not align with this prevailing consensus. For example, complete invariance to several common data augmentations (e.g. shifts in brightness or cropping) is undesirable, since augmentations of inputs from different classes can collide. Furthermore, in many cases there are explicit ways to enforce invariances (e.g. converting images to grayscale) that researchers avoid in favor of specifying them indirectly via augmentations (e.g. hue shifts). These observations suggest that specifying invariances is not the sole role of these views. Instead, we suspect that augmentations serve as a form of feature dropout: preventing any one feature from becoming a shortcut that suppresses the learning of other features.

We study this idea empirically with Viewmaker Networks, a recently proposed method that appears to learn to drop out different features in the input via adversarial training. We apply viewmaker and expert views to datasets with two associated downstream tasks: one involving classifying the main input (e.g., an image or audio recording) and one involving a simple overlaid element (e.g., a digit, shape, letter, or speech snippet). We observe that the viewmaker augmentations selectively obscure these overlaid features.
Despite this, models trained with viewmaker views still perform well on both downstream tasks, while expert views often struggle on one or the other. This further suggests that being label-preserving is not a necessary property of good views, as long as the label information is still sometimes accessible.

Finally, we formalize the intuition that feature dropout can aid learning with a theoretical analysis of a simple linear contrastive setting. In this setting, we characterize how the noisiness of each feature directly determines how quickly it is learned, and uncover an interaction between features that governs their relative learning speeds. In particular, we show how learning one feature quickly can suppress the learning of other features, and show that adding noise to the "easiest" feature can increase the rate at which other features are learned. This further indicates that label-destroying augmentations may have a direct role in ensuring that contrastive models learn a broad range of features for downstream tasks. Overall, these findings suggest the need to revisit common assumptions about the role of augmentations for contrastive learning in the foundation model setting, and to move towards a better understanding of how to train generalist models that learn diverse features from unlabeled data.

2. COMMON PRACTICES ARE AT ODDS WITH THE "INVARIANCE" EXPLANATION

We begin by briefly examining several common augmentations used in contrastive learning for natural images, and how they come into conflict with the common assumption described above. First, we observe that many common augmentations can affect the label of the input, depending on the downstream task. For example, many downstream image recognition tasks require color information (e.g. identifying bird species) or brightness (e.g. scene or time-of-day classification), implying that invariance to these characteristics would be undesirable. Yet hue shifts, grayscaling, and brightness shifts are common augmentations used in contrastive learning (Chen et al., 2020b; He et al., 2020). Second, repeated application of some augmentations causes challenges for all downstream tasks. For example, applying brightness shifts repeatedly turns any image completely black or completely white. Thus the class label cannot be truly invariant to this augmentation, since inputs from different classes can

