i-MIX: A DOMAIN-AGNOSTIC STRATEGY FOR CONTRASTIVE REPRESENTATION LEARNING

Abstract

Contrastive representation learning has been shown to be effective for learning representations from unlabeled data. However, much of this progress has been made in the vision domain, relying on data augmentations carefully designed with domain knowledge. In this work, we propose i-Mix, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning. We cast contrastive learning as training a non-parametric classifier by assigning a unique virtual class to each data instance in a batch. Data instances are then mixed in both the input and virtual label spaces, providing more augmented data during training. In experiments, we demonstrate that i-Mix consistently improves the quality of learned representations across domains, including image, speech, and tabular data. Furthermore, we confirm its regularization effect via extensive ablation studies across model and dataset sizes.

1. INTRODUCTION

Representation learning (Bengio et al., 2013) is a fundamental task in machine learning, since the success of machine learning relies on the quality of representations. Self-supervised representation learning (SSL) has been successfully applied in several domains, including image recognition (He et al., 2020; Chen et al., 2020a), natural language processing (Mikolov et al., 2013; Devlin et al., 2018), robotics (Sermanet et al., 2018; Lee et al., 2019), speech recognition (Ravanelli et al., 2020), and video understanding (Korbar et al., 2018; Owens & Efros, 2018). Since no labels are available in the unsupervised setting, pretext tasks are proposed to provide self-supervision: for example, context prediction (Doersch et al., 2015), inpainting (Pathak et al., 2016), and contrastive learning (Wu et al., 2018b; Hjelm et al., 2019; He et al., 2020; Chen et al., 2020a). SSL has also been used as an auxiliary task to improve performance on a main task, such as generative model learning (Chen et al., 2019), semi-supervised learning (Zhai et al., 2019), and improving robustness and uncertainty (Hendrycks et al., 2019).

Recently, contrastive representation learning has gained increasing attention by achieving state-of-the-art SSL performance for large-scale image recognition (He et al., 2020; Chen et al., 2020a), outperforming its supervised pre-training counterpart (He et al., 2016) on downstream tasks. However, while the concept of contrastive learning is applicable to any domain, the quality of learned representations relies on domain-specific inductive bias: since anchors and positive samples are obtained from the same data instance, data augmentation must introduce semantically meaningful variance for good generalization. Achieving strong yet semantically meaningful data augmentation requires domain knowledge, e.g., color jittering in 2D images or structural information in video understanding.
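As a concrete illustration of the contrastive setup described above (a minimal numpy sketch, not the paper's reference implementation; the function name and the InfoNCE-style formulation are ours): each anchor is paired with an augmented view of the same instance, and the loss is a cross-entropy in which the i-th instance in the batch receives virtual label i, with all other instances serving as negatives.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Contrastive (InfoNCE-style) loss over a batch of N instances.

    anchors, positives: (N, d) embeddings of two augmented views of the
    same N instances. Equivalent to cross-entropy where instance i is
    assigned the unique virtual class i.
    """
    # L2-normalize so the dot product is cosine similarity.
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    positives = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = anchors @ positives.T / temperature   # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Virtual labels: the i-th anchor should match the i-th positive,
    # i.e., the diagonal of the log-probability matrix.
    return -np.mean(np.diag(log_prob))
```

When the two views of each instance are close in embedding space while different instances are spread apart, the diagonal logits dominate and the loss approaches zero.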
Hence, contrastive representation learning in each new domain requires effort to develop effective data augmentations. Furthermore, while recent works have focused on large-scale settings where millions of unlabeled examples are available, such settings are often impractical in real-world applications. For example, in lithography, acquiring data is very expensive in terms of both time and cost due to the complexity of the manufacturing process (Lin et al., 2018; Sim et al., 2019). Meanwhile, MixUp (Zhang et al., 2018) has been shown to be a successful data augmentation for supervised learning in various domains and tasks, including image classification (Zhang et al., 2018), generative model learning (Lucas et al., 2018), and natural language processing (Guo et al., 2019; Guo, 2020).
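The MixUp idea referenced above can be sketched in a few lines (an illustrative numpy sketch under our own naming, not a reference implementation): two examples and their one-hot labels are convexly combined with a coefficient drawn from a Beta distribution, yielding virtual training data in both the input and label spaces.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mix two examples and their labels (Zhang et al., 2018 style).

    alpha controls the Beta(alpha, alpha) distribution of the mixing
    coefficient: alpha=1.0 gives a uniform coefficient in [0, 1].
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2   # mix in input space
    y = lam * y1 + (1 - lam) * y2   # mix in label space
    return x, y, lam
```

i-Mix applies the same interpolation to the virtual instance labels of a contrastive batch, which is what makes the strategy domain-agnostic: no input-specific augmentation is required beyond mixing.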

