i-MIX: A DOMAIN-AGNOSTIC STRATEGY FOR CONTRASTIVE REPRESENTATION LEARNING

Abstract

Contrastive representation learning has been shown to be effective for learning representations from unlabeled data. However, much of the recent progress has been made in the vision domain, relying on data augmentations carefully designed using domain knowledge. In this work, we propose i-Mix, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning. We cast contrastive learning as training a non-parametric classifier by assigning a unique virtual class to each data instance in a batch. Then, data instances are mixed in both the input and virtual label spaces, providing more augmented data during training. In experiments, we demonstrate that i-Mix consistently improves the quality of learned representations across domains, including image, speech, and tabular data. Furthermore, we confirm its regularization effect via extensive ablation studies across model and dataset sizes.

1. INTRODUCTION

Representation learning (Bengio et al., 2013) is a fundamental task in machine learning, since the success of machine learning relies on the quality of representations. Self-supervised representation learning (SSL) has been successfully applied in several domains, including image recognition (He et al., 2020; Chen et al., 2020a), natural language processing (Mikolov et al., 2013; Devlin et al., 2018), robotics (Sermanet et al., 2018; Lee et al., 2019), speech recognition (Ravanelli et al., 2020), and video understanding (Korbar et al., 2018; Owens & Efros, 2018). Since no labels are available in the unsupervised setting, pretext tasks are proposed to provide self-supervision: for example, context prediction (Doersch et al., 2015), inpainting (Pathak et al., 2016), and contrastive learning (Wu et al., 2018b; Hjelm et al., 2019; He et al., 2020; Chen et al., 2020a). SSL has also been used as an auxiliary task to improve performance on a main task, such as generative model learning (Chen et al., 2019), semi-supervised learning (Zhai et al., 2019), and improving robustness and uncertainty (Hendrycks et al., 2019).

Recently, contrastive representation learning has gained increasing attention by showing state-of-the-art performance in SSL for large-scale image recognition (He et al., 2020; Chen et al., 2020a), outperforming its supervised pre-training counterpart (He et al., 2016) on downstream tasks. However, while the concept of contrastive learning is applicable to any domain, the quality of learned representations relies on a domain-specific inductive bias: as anchors and positive samples are obtained from the same data instance, data augmentation introduces semantically meaningful variance for better generalization. Achieving strong yet semantically meaningful data augmentation requires domain knowledge, e.g., color jittering in 2D images or structural information in video understanding.
Hence, contrastive representation learning in different domains requires effort to develop effective data augmentations. Furthermore, while recent works have focused on large-scale settings where millions of unlabeled examples are available, this is often not practical in real-world applications. For example, in lithography, acquiring data is very expensive in terms of both time and cost due to the complexity of the manufacturing process (Lin et al., 2018; Sim et al., 2019).

Meanwhile, MixUp (Zhang et al., 2018) has been shown to be a successful data augmentation for supervised learning in various domains and tasks, including image classification (Zhang et al., 2018), generative model learning (Lucas et al., 2018), and natural language processing (Guo et al., 2019; Guo, 2020). In this paper, we explore the following natural yet important question: is the idea of MixUp useful for unsupervised, self-supervised, or contrastive representation learning across different domains?

To this end, we propose instance Mix (i-Mix), a domain-agnostic regularization strategy for contrastive representation learning. The key idea of i-Mix is to introduce virtual labels in a batch and mix data instances and their corresponding virtual labels in the input and label spaces, respectively. We first introduce the general formulation of i-Mix, and then show its applicability to the state-of-the-art contrastive representation learning methods SimCLR (Chen et al., 2020a) and MoCo (He et al., 2020), as well as a self-supervised learning method without negative pairs, BYOL (Grill et al., 2020). Through experiments, we demonstrate the efficacy of i-Mix in a variety of settings. First, we show the effectiveness of i-Mix by evaluating the discriminative performance of learned representations in multiple domains.
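The key idea, mixing instances in the input space and their virtual one-hot labels in the label space, can be illustrated with a minimal NumPy sketch. This is an illustration of the batch-level mixing step only, not the authors' implementation; the helper name `i_mix_batch` and the random-pairing scheme are our own simplifying assumptions:

```python
import numpy as np

def i_mix_batch(x, alpha=1.0, rng=None):
    """Mix each instance in a batch with a randomly paired instance.

    Each of the n instances is assigned its own virtual class (row of the
    n x n identity matrix).  Instance i keeps weight lam on its own virtual
    class and (1 - lam) on its partner's, mirroring the input-space mixing.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    lam = rng.beta(alpha, alpha)         # MixUp coefficient, lam in [0, 1]
    perm = rng.permutation(n)            # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    labels = lam * np.eye(n) + (1 - lam) * np.eye(n)[perm]  # soft virtual labels
    return x_mix, labels
```

The mixed batch is then fed to the contrastive loss, with the soft virtual labels replacing the usual one-hot instance labels (e.g., as soft targets for a cross-entropy over similarity logits).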
Specifically, we adapt i-Mix to contrastive representation learning methods, advancing state-of-the-art performance across different domains, including image (Krizhevsky & Hinton, 2009; Deng et al., 2009), speech (Warden, 2018), and tabular (Asuncion & Newman, 2007) datasets. Then, we study i-Mix under various conditions, including 1) different model and training dataset sizes, 2) limited domain knowledge, and 3) transfer learning.

Contribution. In summary, our contribution is three-fold:
• We propose i-Mix, a method for regularizing contrastive representation learning, motivated by MixUp (Zhang et al., 2018). We show how to apply i-Mix to state-of-the-art contrastive representation learning methods (Chen et al., 2020a; He et al., 2020; Grill et al., 2020).
• We show that i-Mix consistently improves contrastive representation learning in both vision and non-vision domains. In particular, the discriminative performance of representations learned with i-Mix is on par with fully supervised learning on CIFAR-10/100 (Krizhevsky & Hinton, 2009) and Speech Commands (Warden, 2018).
• We verify the regularization effect of i-Mix in a variety of settings. We empirically observe that i-Mix significantly improves contrastive representation learning when 1) the training dataset is small, or 2) the domain knowledge for data augmentations is insufficient.

2. RELATED WORK

Self-supervised representation learning (SSL) aims at learning representations from unlabeled data by solving a pretext task derived from self-supervision. Early works on SSL proposed pretext tasks based on data reconstruction by autoencoding (Bengio et al., 2007), such as context prediction (Doersch et al., 2015) and inpainting (Pathak et al., 2016). Decoder-free SSL has made huge progress in recent years. Exemplar CNN (Dosovitskiy et al., 2014) learns by classifying individual instances with data augmentations. SSL of visual representations, including colorization (Zhang et al., 2016), solving jigsaw puzzles (Noroozi & Favaro, 2016), counting the number of objects (Noroozi et al., 2017), rotation prediction (Gidaris et al., 2018), next pixel prediction (Oord et al., 2018; Hénaff et al., 2019), and combinations thereof (Doersch & Zisserman, 2017; Kim et al., 2018; Noroozi et al., 2018), often leverages image-specific properties to design pretext tasks. Meanwhile, although deep clustering (Caron et al., 2018; 2019; Asano et al., 2020) is often distinguished from SSL, it also leverages unsupervised clustering assignments as self-supervision for representation learning.

Contrastive representation learning has gained much attention for SSL (He et al., 2020; Chen et al., 2020a). As opposed to early works on exemplar CNN (Dosovitskiy et al., 2014; 2015), contrastive learning maximizes the similarities of positive pairs while minimizing those of negative pairs, instead of training an instance classifier. As the choice of negative pairs is crucial for the quality of learned representations, recent works have carefully designed them. Memory-based approaches (Wu et al., 2018b; Hjelm et al., 2019; Bachman et al., 2019; Misra & van der Maaten, 2020; Tian et al., 2020a) maintain a memory bank of embedding vectors of instances to keep negative samples, where the memory is updated with embedding vectors extracted from previous batches.
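The contrastive objective described above, pulling positive pairs together while pushing negatives apart, can be sketched as an InfoNCE-style loss. This is a generic NumPy sketch under our own simplifications (cosine similarity, in-batch negatives, a hypothetical `info_nce_loss` helper), not the exact formulation of any one of the cited methods:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of embedding pairs.

    Each anchor's positive is the same-index row of `positives`; the
    remaining rows act as in-batch negatives.  The loss is a cross-entropy
    that places the positive on the diagonal of the similarity matrix.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (n, n) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives on the diagonal
```

In this view, the batch defines a non-parametric n-way classification problem with one-hot targets on the diagonal, which is precisely the formulation that i-Mix later relaxes into soft virtual labels.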
In addition, MoCo (He et al., 2020) showed that differentiating the models for anchors and for positive/negative samples is effective, where the model for positive/negative samples is updated as an exponential moving average of the model for anchors. On the other hand, recent works (Ye et al., 2019; Misra & van der Maaten, 2020; Chen et al., 2020a; Tian et al., 2020a) showed that learning invariance to different views is important in contrastive representation learning. The views can be generated through data augmentations carefully designed using domain knowledge (Chen et al., 2020a), splitting

