i-MIX: A DOMAIN-AGNOSTIC STRATEGY FOR CONTRASTIVE REPRESENTATION LEARNING

Abstract

Contrastive representation learning has been shown to be effective for learning representations from unlabeled data. However, much of this progress has been made in the vision domain, relying on data augmentations carefully designed using domain knowledge. In this work, we propose i-Mix, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning. We cast contrastive learning as training a non-parametric classifier by assigning a unique virtual class to each data instance in a batch. Then, data instances are mixed in both the input and virtual label spaces, providing more augmented data during training. In experiments, we demonstrate that i-Mix consistently improves the quality of learned representations across domains, including image, speech, and tabular data. Furthermore, we confirm its regularization effect via extensive ablation studies across model and dataset sizes.

1. INTRODUCTION

Representation learning (Bengio et al., 2013) is a fundamental task in machine learning, since the success of machine learning relies on the quality of representations. Self-supervised representation learning (SSL) has been successfully applied in several domains, including image recognition (He et al., 2020; Chen et al., 2020a), natural language processing (Mikolov et al., 2013; Devlin et al., 2018), robotics (Sermanet et al., 2018; Lee et al., 2019), speech recognition (Ravanelli et al., 2020), and video understanding (Korbar et al., 2018; Owens & Efros, 2018). Since no labels are available in the unsupervised setting, pretext tasks are proposed to provide self-supervision: for example, context prediction (Doersch et al., 2015), inpainting (Pathak et al., 2016), and contrastive learning (Wu et al., 2018b; Hjelm et al., 2019; He et al., 2020; Chen et al., 2020a). SSL has also been used as an auxiliary task to improve performance on a main task, such as generative model learning (Chen et al., 2019), semi-supervised learning (Zhai et al., 2019), and improving robustness and uncertainty (Hendrycks et al., 2019).

Recently, contrastive representation learning has gained increasing attention by showing state-of-the-art performance in SSL for large-scale image recognition (He et al., 2020; Chen et al., 2020a), outperforming its supervised pre-training counterpart (He et al., 2016) on downstream tasks. However, while the concept of contrastive learning is applicable to any domain, the quality of learned representations relies on a domain-specific inductive bias: as anchors and positive samples are obtained from the same data instance, data augmentation must introduce semantically meaningful variance for good generalization. Designing strong, yet semantically meaningful data augmentations therefore requires domain knowledge, e.g., color jittering in 2D images or structural information in video understanding.
Hence, contrastive representation learning in a new domain requires an effort to develop effective data augmentations. Furthermore, while recent works have focused on large-scale settings where millions of unlabeled data instances are available, such settings are not always practical in real-world applications. For example, in lithography, acquiring data is very expensive in terms of both time and cost due to the complexity of the manufacturing process (Lin et al., 2018; Sim et al., 2019). Meanwhile, MixUp (Zhang et al., 2018) has been shown to be a successful data augmentation for supervised learning in various domains and tasks, including image classification (Zhang et al., 2018), generative model learning (Lucas et al., 2018), and natural language processing (Guo et al., 2019; Guo, 2020).

In this paper, we explore the following natural, yet important question: is the idea of MixUp useful for unsupervised, self-supervised, or contrastive representation learning across different domains? To this end, we propose instance Mix (i-Mix), a domain-agnostic regularization strategy for contrastive representation learning. The key idea of i-Mix is to introduce virtual labels in a batch and mix data instances and their corresponding virtual labels in the input and label spaces, respectively. We first introduce the general formulation of i-Mix, and then show its applicability to state-of-the-art contrastive representation learning methods, SimCLR (Chen et al., 2020a) and MoCo (He et al., 2020), and to a self-supervised learning method without negative pairs, BYOL (Grill et al., 2020). Through experiments, we demonstrate the efficacy of i-Mix in a variety of settings. First, we show the effectiveness of i-Mix by evaluating the discriminative performance of learned representations in multiple domains.
Specifically, we adapt i-Mix to contrastive representation learning methods, advancing state-of-the-art performance across different domains, including image (Krizhevsky & Hinton, 2009; Deng et al., 2009), speech (Warden, 2018), and tabular (Asuncion & Newman, 2007) datasets. Then, we study i-Mix in various conditions, including when 1) the model and training dataset are small or large, 2) domain knowledge is limited, and 3) representations are transferred to other tasks.

Contribution. In summary, our contribution is three-fold:

• We propose i-Mix, a method for regularizing contrastive representation learning, motivated by MixUp (Zhang et al., 2018). We show how to apply i-Mix to state-of-the-art contrastive representation learning methods (Chen et al., 2020a; He et al., 2020; Grill et al., 2020).

• We show that i-Mix consistently improves contrastive representation learning in both vision and non-vision domains. In particular, the discriminative performance of representations learned with i-Mix is on par with fully supervised learning on CIFAR-10/100 (Krizhevsky & Hinton, 2009) and Speech Commands (Warden, 2018).

• We verify the regularization effect of i-Mix in a variety of settings. We empirically observe that i-Mix significantly improves contrastive representation learning when 1) the training dataset size is small, or 2) the domain knowledge for data augmentations is insufficient.

2. RELATED WORK

Self-supervised representation learning (SSL) aims at learning representations from unlabeled data by solving a pretext task derived from self-supervision. Early works on SSL proposed pretext tasks based on data reconstruction by autoencoding (Bengio et al., 2007), such as context prediction (Doersch et al., 2015) and inpainting (Pathak et al., 2016). Decoder-free SSL has made huge progress in recent years. Exemplar CNN (Dosovitskiy et al., 2014) learns by classifying individual instances with data augmentations. SSL of visual representations, including colorization (Zhang et al., 2016), solving jigsaw puzzles (Noroozi & Favaro, 2016), counting the number of objects (Noroozi et al., 2017), rotation prediction (Gidaris et al., 2018), next pixel prediction (Oord et al., 2018; Hénaff et al., 2019), and combinations of them (Doersch & Zisserman, 2017; Kim et al., 2018; Noroozi et al., 2018), often leverages image-specific properties to design pretext tasks. Meanwhile, although deep clustering (Caron et al., 2018; 2019; Asano et al., 2020) is often distinguished from SSL, it also leverages unsupervised clustering assignments as self-supervision for representation learning.

Contrastive representation learning has gained much attention in SSL (He et al., 2020; Chen et al., 2020a). As opposed to early works on exemplar CNN (Dosovitskiy et al., 2014; 2015), contrastive learning maximizes similarities of positive pairs while minimizing similarities of negative pairs, instead of training an instance classifier. As the choice of negative pairs is crucial to the quality of learned representations, recent works have designed them carefully. Memory-based approaches (Wu et al., 2018b; Hjelm et al., 2019; Bachman et al., 2019; Misra & van der Maaten, 2020; Tian et al., 2020a) maintain a memory bank of instance embedding vectors to keep negative samples, where the memory is updated with embedding vectors extracted from previous batches.
In addition, MoCo (He et al., 2020) showed that differentiating the model for anchors from the model for positive/negative samples is effective, where the latter is updated as an exponential moving average of the former. On the other hand, recent works (Ye et al., 2019; Misra & van der Maaten, 2020; Chen et al., 2020a; Tian et al., 2020a) showed that learning invariance to different views is important in contrastive representation learning. The views can be generated through data augmentations carefully designed using domain knowledge (Chen et al., 2020a), splitting input channels (Tian et al., 2020a), or borrowing the idea of other pretext tasks, such as creating jigsaw puzzles or rotating inputs (Misra & van der Maaten, 2020). In particular, SimCLR (Chen et al., 2020a) showed that a simple memory-free approach with a large batch size and strong data augmentations performs comparably to memory-based approaches. InfoMin (Tian et al., 2020b) further studied how to generate good views for contrastive representation learning and achieved state-of-the-art performance by combining prior works. Different from other contrastive representation learning methods, BYOL (Grill et al., 2020) does not require negative pairs; its pretext task aims at predicting latent representations of one view from another. While prior works have focused on SSL for large-scale visual recognition tasks, our work focuses on contrastive representation learning in both small- and large-scale settings in different domains.

Data augmentation is a technique to increase the diversity of data, especially when training data are insufficient for generalization.
Since augmented data must preserve the semantics of the original data, data augmentations are carefully designed using domain knowledge about images (DeVries & Taylor, 2017b; Cubuk et al., 2019a;b; Zhong et al., 2020), speech (Amodei et al., 2016; Park et al., 2019), or natural language (Zhang et al., 2015; Wei & Zou, 2019). Some works have studied data augmentation with less domain knowledge: DeVries & Taylor (2017a) proposed a domain-agnostic augmentation strategy that first encodes the dataset and then applies augmentations in the feature space. MixUp (Zhang et al., 2018) is an effective data augmentation strategy in supervised learning, which performs vicinal risk minimization instead of empirical risk minimization by linearly interpolating input data and their labels in the data and label spaces, respectively. MixUp has also shown its effectiveness in other tasks and non-vision domains, including generative adversarial networks (Lucas et al., 2018), improved robustness and uncertainty (Hendrycks et al., 2020), and sentence classification in natural language processing (Guo, 2020; Guo et al., 2019). Other variations have been investigated, interpolating in the feature space (Verma et al., 2019) or leveraging domain knowledge (Yun et al., 2019). MixUp may not be directly applicable to some domains, such as point clouds, but adaptations can be effective (Harris et al., 2020). i-Mix is a form of data augmentation for better generalization in contrastive representation learning, resulting in better performance on downstream tasks.

Concurrent works have leveraged the idea of MixUp for contrastive representation learning. As discussed in Section 3.3, mixing only the input data can improve contrastive representation learning (Shen et al., 2020; Verma et al., 2020; Zhou et al., 2020), which can be considered as injecting data-driven noise. Kalantidis et al. (2020) mixed hard negative samples in the embedding space.
Kim et al. (2020) reported similar observations to ours but focused on small image datasets.

3. APPROACH

In this section, we review MixUp (Zhang et al., 2018) in supervised learning and present i-Mix in contrastive learning (He et al., 2020; Chen et al., 2020a; Grill et al., 2020). Throughout this section, let X be a data space, R^D a D-dimensional embedding space, and f: X → R^D a model mapping between them. For conciseness, we write f_i = f(x_i) and f̃_i = f(x̃_i) for x_i, x̃_i ∈ X, and model parameters are omitted in loss functions.

3.1. MIXUP IN SUPERVISED LEARNING

Suppose a one-hot label y_i ∈ {0, 1}^C is assigned to a data instance x_i, where C is the number of classes. Let a linear classifier predicting the labels consist of weight vectors {w_1, …, w_C}, where w_c ∈ R^D. Then, the cross-entropy loss for supervised learning is defined as:

ℓ_Sup(x_i, y_i) = −Σ_{c=1}^C y_{i,c} log [ exp(w_c·f_i) / Σ_{k=1}^C exp(w_k·f_i) ].  (1)

While the cross-entropy loss is widely used for supervised training of deep neural networks, training with it poses several challenges, such as preventing overfitting or networks becoming overconfident. Several regularization techniques have been proposed to alleviate these issues, including label smoothing (Szegedy et al., 2016), adversarial training (Miyato et al., 2018), and confidence calibration (Lee et al., 2018). MixUp (Zhang et al., 2018) is an effective regularization with negligible computational overhead. It conducts a linear interpolation of two data instances in both input and label spaces and trains a model by minimizing the cross-entropy loss defined on the interpolated data and labels. Specifically, for two labeled data instances (x_i, y_i) and (x_j, y_j), the MixUp loss is defined as follows:

ℓ_Sup^MixUp((x_i, y_i), (x_j, y_j); λ) = ℓ_Sup(λx_i + (1−λ)x_j, λy_i + (1−λ)y_j),  (2)

where λ ∼ Beta(α, α) is a mixing coefficient sampled from the beta distribution. MixUp is a vicinal risk minimization method (Chapelle et al., 2001) that augments data and their labels in a data-driven manner. Beyond improving generalization on the supervised task, it also improves adversarial robustness (Pang et al., 2019) and confidence calibration (Thulasidasan et al., 2019).
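As a concrete illustration, the MixUp interpolation above can be sketched in a few lines of numpy; the toy inputs, the number of classes, and the function name are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x_i, y_i, x_j, y_j, alpha=1.0):
    """Linearly interpolate two labeled examples in both input and label spaces."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient λ ~ Beta(α, α)
    x_mix = lam * x_i + (1 - lam) * x_j   # interpolate the inputs
    y_mix = lam * y_i + (1 - lam) * y_j   # interpolate the one-hot labels
    return x_mix, y_mix

# toy example: two 4-dimensional inputs with one-hot labels over 3 classes
x_i, y_i = np.ones(4), np.array([1.0, 0.0, 0.0])
x_j, y_j = np.zeros(4), np.array([0.0, 1.0, 0.0])
x_mix, y_mix = mixup(x_i, y_i, x_j, y_j)
```

Note that the mixed label remains a valid probability distribution, which is what allows the cross-entropy loss in Eq. (1) to be applied unchanged.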

3.2. i-MIX IN CONTRASTIVE LEARNING

We introduce instance mix (i-Mix), a data-driven augmentation strategy for contrastive representation learning that improves the generalization of learned representations. Intuitively, instead of mixing class labels, i-Mix interpolates virtual labels, which indicate the identity of each instance in a batch.

Let B = {(x_i, x̃_i)}_{i=1}^N be a batch of data pairs, where N is the batch size and x_i, x̃_i ∈ X are two views of the same data, usually generated by different augmentations. For each anchor x_i, we call x̃_i a positive sample and x̃_{j≠i} negative samples. The model f then learns to maximize similarities of positive pairs (instances from the same data) while minimizing similarities of negative pairs (instances from different data) in the embedding space. The output of f is L2-normalized, which has been shown to be effective (Wu et al., 2018a; He et al., 2020; Chen et al., 2020a).

Let v_i ∈ {0, 1}^N be the virtual label of x_i and x̃_i in a batch B, where v_{i,i} = 1 and v_{i,j≠i} = 0. For a general sample-wise contrastive loss ℓ with virtual labels (x_i, v_i), the i-Mix loss is defined as follows:

ℓ_i-Mix((x_i, v_i), (x_j, v_j); B, λ) = ℓ(Mix(x_i, x_j; λ), λv_i + (1−λ)v_j; B),  (3)

where λ ∼ Beta(α, α) is a mixing coefficient and Mix is a mixing operator, which can be adapted depending on the target domain: for example, MixUp(x_i, x_j; λ) = λx_i + (1−λ)x_j (Zhang et al., 2018) when data values are continuous, and CutMix(x_i, x_j; λ) = M_λ ⊙ x_i + (1−M_λ) ⊙ x_j (Yun et al., 2019) when data values have a spatial correlation with neighbors, where M_λ is a binary mask filtering out a region whose relative area is (1−λ) and ⊙ is an element-wise multiplication. Note that some mixing operators might not work well in some domains: for example, CutMix would not be valid when data values and their spatial neighbors have no correlation.
However, the MixUp operator generally works well across domains including image, speech, and tabular data; we use it in i-Mix formulations and experiments unless otherwise specified. In the following, we show how to apply i-Mix to contrastive representation learning methods.

SimCLR (Chen et al., 2020a) is a simple contrastive representation learning method without a memory bank, where each anchor has one positive sample and (2N−2) negative samples. Let x_{N+i} = x̃_i for conciseness. Then, the (2N−1)-way discrimination loss is written as follows:

ℓ_SimCLR(x_i; B) = −log [ exp(s(f_i, f_{(N+i) mod 2N})/τ) / Σ_{k=1, k≠i}^{2N} exp(s(f_i, f_k)/τ) ],  (4)

where τ is a temperature scaling parameter and s(f, f') = f·f' / (‖f‖‖f'‖) is the inner product of two L2-normalized vectors. In this formulation, i-Mix is not directly applicable because virtual labels are defined differently for each anchor. To resolve this issue, we simplify the formulation of SimCLR by excluding anchors from negative samples. Then, with virtual labels, the N-way discrimination loss is written as follows:

ℓ_N-pair(x_i, v_i; B) = −Σ_{n=1}^N v_{i,n} log [ exp(s(f_i, f̃_n)/τ) / Σ_{k=1}^N exp(s(f_i, f̃_k)/τ) ],  (5)

which we call the N-pair contrastive loss, as the formulation is similar to the N-pair loss in the context of metric learning (Sohn, 2016). For two data instances (x_i, v_i), (x_j, v_j) and a batch of data pairs B = {(x_i, x̃_i)}_{i=1}^N, the i-Mix loss is defined as follows:

ℓ_i-Mix^N-pair((x_i, v_i), (x_j, v_j); B, λ) = ℓ_N-pair(λx_i + (1−λ)x_j, λv_i + (1−λ)v_j; B).  (6)

Algorithm 1 provides the pseudocode of i-Mix on N-pair contrastive learning for one iteration.

Pair relations in contrastive loss. To use a contrastive loss for representation learning, one needs to properly define a pair relation {(x_i, x̃_i)}_{i=1}^N.
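As an illustration, the N-pair contrastive loss and its i-Mix variant can be sketched in numpy as below. This is a simplified stand-in for the paper's Algorithm 1: the `encoder` argument, the random-permutation pairing of anchors, and all shapes are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def npair_loss(f, f_tilde, v, tau=0.2):
    """N-pair contrastive loss with (possibly soft) virtual labels.
    f, f_tilde: (N, D) embeddings of the two views; v: (N, N) label matrix."""
    f, f_tilde = l2_normalize(f), l2_normalize(f_tilde)
    logits = f @ f_tilde.T / tau                      # similarities s(f_i, f̃_n)/τ
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(v * log_prob).sum(axis=1).mean()         # cross-entropy vs. virtual labels

def i_mix_npair(x, x_tilde, encoder, alpha=1.0, tau=0.2,
                rng=np.random.default_rng(0)):
    """i-Mix on the N-pair loss: mix anchors and their virtual labels."""
    N = x.shape[0]
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(N)                         # pair each anchor with a partner
    x_mix = lam * x + (1 - lam) * x[perm]             # mix in the input space
    v_mix = lam * np.eye(N) + (1 - lam) * np.eye(N)[perm]  # mix virtual labels
    return npair_loss(encoder(x_mix), encoder(x_tilde), v_mix, tau)

# toy usage: a fixed random linear map stands in for the encoder f
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))
x_tilde = x + 0.01 * rng.normal(size=(8, 16))         # a slightly perturbed second view
W = rng.normal(size=(16, 32))
loss = i_mix_npair(x, x_tilde, lambda z: z @ W)
```

Each row of `v_mix` sums to one, so the mixed virtual labels play exactly the role of the soft labels in MixUp.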
For contrastive representation learning, where semantic class labels are not provided, the pair relation is defined such that 1) a positive pair, x_i and x̃_i, are different views of the same data, and 2) a negative pair, x_i and x̃_{j≠i}, are different data instances. For supervised representation learning, x_i and x̃_i are two data instances from the same class, while x_i and x̃_{j≠i} are from different classes. Note that two augmented versions of the same data also belong to the same class, so they can also be considered a positive pair. Thus, i-Mix is not limited to self-supervised contrastive representation learning; it can also be used as a regularization method for supervised contrastive representation learning (Khosla et al., 2020) or deep metric learning (Sohn, 2016; Movshovitz-Attias et al., 2017).

MoCo (He et al., 2020). In contrastive representation learning, the number of negative samples affects the quality of learned representations (Arora et al., 2019). Because SimCLR mines negative samples from the current batch, a large batch size is crucial, which often requires substantial computational resources (Chen et al., 2020a). For efficient training, recent works maintain a memory bank M = {μ_k}_{k=1}^K, a queue of previously extracted embedding vectors, where K is the size of the memory bank (Wu et al., 2018b; He et al., 2020; Tian et al., 2020a;b). In addition, MoCo introduces an exponential moving average (EMA) model to extract positive and negative embedding vectors, whose parameters are updated as θ_f^EMA ← m·θ_f^EMA + (1−m)·θ_f, where m ∈ [0, 1) is a momentum coefficient and θ denotes model parameters. The loss is written as follows:

ℓ_MoCo(x_i; B, M) = −log [ exp(s(f_i, f̃_i^EMA)/τ) / (exp(s(f_i, f̃_i^EMA)/τ) + Σ_{k=1}^K exp(s(f_i, μ_k)/τ)) ].  (7)

The memory bank M is then updated with {f̃_i^EMA} in a first-in first-out order.
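As an aside, the EMA parameter update of MoCo's key encoder can be stated directly in code; representing parameters as a dict of numpy arrays is an assumption made for illustration:

```python
import numpy as np

def ema_update(theta_ema, theta, m=0.999):
    """MoCo-style momentum update: θ_EMA ← m·θ_EMA + (1−m)·θ,
    applied per parameter tensor (here, a dict of numpy arrays)."""
    return {k: m * theta_ema[k] + (1 - m) * theta[k] for k in theta}

# usage: one step pulls the EMA parameters slightly toward the current ones
ema = {"w": np.zeros(3)}
cur = {"w": np.ones(3)}
ema = ema_update(ema, cur, m=0.9)
```

With m close to 1 (0.999 in the experiments), the key encoder changes slowly, which keeps the embeddings queued in the memory bank consistent.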
In this (K+1)-way discrimination loss, data pairs are independent of each other, such that i-Mix is not directly applicable because virtual labels are defined differently for each anchor. To overcome this issue, we include the positive samples of other anchors as negative samples, similar to the N-pair contrastive loss in Eq. (5). Let ṽ_i ∈ {0, 1}^{N+K} be a virtual label indicating the positive sample of each anchor, where ṽ_{i,i} = 1 and ṽ_{i,j≠i} = 0. Then, the (N+K)-way discrimination loss is written as follows:

ℓ_MoCo(x_i, ṽ_i; B, M) = −Σ_{n=1}^N ṽ_{i,n} log [ exp(s(f_i, f̃_n^EMA)/τ) / (Σ_{k=1}^N exp(s(f_i, f̃_k^EMA)/τ) + Σ_{k=1}^K exp(s(f_i, μ_k)/τ)) ].  (8)

As virtual labels now live in the same set for all anchors, i-Mix is directly applicable: for two data instances (x_i, ṽ_i), (x_j, ṽ_j), a batch of data pairs B = {(x_i, x̃_i)}_{i=1}^N, and the memory bank M, the i-Mix loss is defined as follows:

ℓ_i-Mix^MoCo((x_i, ṽ_i), (x_j, ṽ_j); B, M, λ) = ℓ_MoCo(λx_i + (1−λ)x_j, λṽ_i + (1−λ)ṽ_j; B, M).  (9)

BYOL (Grill et al., 2020). Different from other contrastive representation learning methods, BYOL is a self-supervised representation learning method without contrasting negative pairs. For two views of the same data x_i, x̃_i ∈ X, the model f learns to predict the view embedding f̃_i^EMA produced by the EMA model from its own embedding f_i. Specifically, an additional prediction layer g is introduced, such that the difference between g(f_i) and f̃_i^EMA is minimized. The BYOL loss is written as follows:

ℓ_BYOL(x_i, x̃_i) = ‖g(f_i)/‖g(f_i)‖ − f̃_i/‖f̃_i‖‖² = 2 − 2·s(g(f_i), f̃_i).  (10)

This formulation can be represented in the form of the general contrastive loss in Eq. (3), as the second view x̃_i can be accessed from the batch B with its virtual label v_i. To derive i-Mix in BYOL, let F̃ = [f̃_1/‖f̃_1‖, …, f̃_N/‖f̃_N‖] ∈ R^{D×N} be the collection of L2-normalized embedding vectors of the second views, such that f̃_i/‖f̃_i‖ = F̃v_i.
Then, the BYOL loss is written as follows:

ℓ_BYOL(x_i, v_i; B) = ‖g(f_i)/‖g(f_i)‖ − F̃v_i‖² = 2 − 2·s(g(f_i), F̃v_i).  (11)

For two data instances (x_i, v_i), (x_j, v_j) and a batch of data pairs B = {(x_i, x̃_i)}_{i=1}^N, the i-Mix loss is defined as follows:

ℓ_i-Mix^BYOL((x_i, v_i), (x_j, v_j); B, λ) = ℓ_BYOL(λx_i + (1−λ)x_j, λv_i + (1−λ)v_j; B).  (12)
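The BYOL-style i-Mix objective can be sketched as follows. This simplified illustration takes precomputed online predictions g(f_i) and target embeddings f̃_n as arrays rather than running full networks, and the function names are our own:

```python
import numpy as np

def l2n(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def imix_byol_loss(g_f, f_tilde, v):
    """Mixed BYOL regression loss: each normalized online prediction g(f_i)
    regresses onto the target embeddings selected (or softly mixed) by the
    virtual-label matrix v of shape (N, N)."""
    F = l2n(f_tilde)              # rows are f̃_n / ||f̃_n||
    target = v @ F                # the v-weighted target for each anchor
    pred = l2n(g_f)               # normalized online predictions
    return np.mean(np.sum((pred - target) ** 2, axis=1))

# usage: with one-hot virtual labels and identical predictions/targets,
# the regression loss vanishes
rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))
loss_matched = imix_byol_loss(Z, Z, np.eye(4))
```

When `v` is one-hot, the loss reduces to 2 − 2·s(g(f_i), f̃_i); with mixed rows λe_i + (1−λ)e_j, the prediction is pulled toward a λ-weighted combination of two targets.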

3.3. INPUTMIX

The contribution of data augmentations to the quality of learned representations is crucial in contrastive representation learning. For the case when the domain knowledge about effective data augmentations is limited, we propose to apply InputMix together with i-Mix, which mixes input data but not their labels. This method can be viewed as introducing structured noise, driven by auxiliary data, into the principal data, which receives the largest mixing coefficient λ; the label of the principal data is assigned to the mixed data (Shen et al., 2020; Verma et al., 2020; Zhou et al., 2020). We apply InputMix and i-Mix together on image datasets in Table 3.

4. EXPERIMENTS

In this section, we demonstrate the effectiveness of i-Mix. In all experiments, we conduct contrastive representation learning on a pretext dataset and evaluate the quality of representations via supervised classification on a downstream dataset. We report the accuracy averaged over up to five runs. In the first stage, a convolutional neural network (CNN) or multilayer perceptron (MLP) followed by a two-layer MLP projection head is trained on an unlabeled dataset. Then, we replace the projection head with a linear classifier and train only the linear classifier on a labeled dataset for the downstream task. Except for transfer learning, the datasets for the pretext and downstream tasks are the same. For i-Mix, we sample a mixing coefficient λ ∼ Beta(α, α) for each data instance, where α = 1 unless otherwise stated. Additional details of the experimental settings and more experiments can be found in Appendix C.

4.1. EXPERIMENTAL SETUP

Baselines and datasets. We consider 1) N-pair contrastive learning as a memory-free contrastive learning method, 2) MoCo v2 (He et al., 2020; Chen et al., 2020b) as a memory-based contrastive learning method, and 3) BYOL (Grill et al., 2020), a self-supervised learning method without negative pairs. We apply i-Mix to these methods and compare their performance. To show the effectiveness of i-Mix across domains, we evaluate the methods on image, speech, and tabular datasets.

For the image domain, we use CIFAR-10/100 (Krizhevsky & Hinton, 2009) and ImageNet. As data augmentations, we apply random resized cropping, horizontal flipping, color jittering, and gray scaling, with Gaussian blurring added for ImageNet, which has been shown to be effective (Chen et al., 2020a;b). We use ResNet-50 (He et al., 2016) as a backbone network. Models are trained with a batch size of 256 (i.e., 512 including augmented data) for up to 4000 epochs on CIFAR-10 and 100, and with a batch size of 512 for 800 epochs on ImageNet. For ImageNet experiments, we use the CutMix (Yun et al., 2019) version of i-Mix.

For the speech domain, we use the Speech Commands dataset (Warden, 2018). For the tabular domain, we use the CovType and Higgs datasets; for Higgs, we use subsets of 100k and 1M training data to experiment at different scales. Since the domain knowledge for data augmentations on tabular data is limited, only a masking noise with probability 0.2 is considered as a data augmentation. We use a 5-layer MLP with batch normalization (Ioffe & Szegedy, 2015) as a backbone network. Models are trained with a batch size of 512 for 500 epochs. We use α = 2 for CovType and Higgs100k, as it is slightly better than α = 1.
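The masking-noise augmentation used for tabular data admits a short sketch; the helper name and the independent per-feature masking are our illustrative assumptions about the setup:

```python
import numpy as np

def masking_noise(x, p=0.2, rng=np.random.default_rng(0)):
    """Masking-noise augmentation for tabular data: zero out each feature
    independently with probability p (p = 0.2 in the experiments)."""
    mask = rng.random(x.shape) >= p    # keep a feature with probability 1 − p
    return x * mask

# usage: two stochastic maskings of the same row act as the two views
x = np.ones((5, 10))
x_aug = masking_noise(x)
```

Since this is the only augmentation assumed available, the two views of a tabular instance differ only in which features are dropped.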

4.2. MAIN RESULTS

Table 1 shows the wide applicability of i-Mix to state-of-the-art contrastive representation learning methods in multiple domains. i-Mix results in consistent improvements in classification accuracy, e.g., up to 6.5% when i-Mix is applied to MoCo v2 on CIFAR-100. Interestingly, we observe that linear classifiers on top of representations learned with i-Mix, without fine-tuning the pre-trained part, often yield a classification accuracy on par with simple end-to-end supervised learning from random initialization, e.g., i-Mix vs. end-to-end supervised learning performance is 96.3% vs. 95.5% on CIFAR-10, 78.6% vs. 78.9% on CIFAR-100, and 98.2% vs. 98.0% on Speech Commands.

4.3. REGULARIZATION EFFECT OF i-MIX

A better regularization method often benefits from longer training of deeper models, which is more critical when training on a small dataset. To investigate the regularization effect of i-Mix, Figure 1 shows the performance of MoCo v2 (solid box) and i-Mix (dashed box). The improvement from applying i-Mix to MoCo v2 is consistent across architecture sizes and numbers of training epochs. Deeper models benefit from i-Mix, achieving 96.7% on CIFAR-10 and 79.1% on CIFAR-100 when the backbone network is ResNet-152. On the other hand, models trained without i-Mix start to show a decrease in performance when trained longer, possibly due to overfitting to the pretext task. The trend clearly shows that i-Mix results in better representations via improved regularization.

Next, we study the effect of i-Mix with varying dataset sizes for the pretext tasks. Table 2 shows the effect of i-Mix on large-scale datasets from the image and tabular domains. We observe that i-Mix is particularly effective when the amount of training data is reduced; e.g., ImageNet-100 consists of images from 100 classes and thus has only 10% of the training data of ImageNet-1k. However, the performance gain shrinks when the amount of training data is large. We further study representations learned with different pretext dataset sizes, from 1% to 100% of the ImageNet training data, in Figure 2. Here, different from ImageNet-100, we reduce the amount of data for each class but keep the number of classes the same. We observe that the performance gain from i-Mix is more significant when the size of the pretext dataset is small. Our study suggests that i-Mix is effective for regularizing self-supervised representation learning when training on a limited amount of data. We believe this is aligned with the findings of Zhang et al. (2018) for MixUp in supervised learning.
Finally, when a large-scale unlabeled dataset is available, we expect i-Mix would still be useful in obtaining better representations when trained longer with deeper and larger models.

4.4. CONTRASTIVE LEARNING WITHOUT DOMAIN-SPECIFIC DATA AUGMENTATION

Data augmentations play a key role in contrastive representation learning, which raises a question about applying it to domains with limited or no knowledge of effective augmentations. In this section, we study the effectiveness of i-Mix as a domain-agnostic strategy for contrastive representation learning that can be adapted to different domains. Table 3 shows the performance of MoCo v2 and i-Mix with and without data augmentations. We observe significant performance gains with i-Mix when other data augmentations are not applied. For example, compared to an accuracy of 93.5% on CIFAR-10 when other data augmentations are applied, contrastive learning achieves only 47.7% when trained without any data augmentations. This suggests that data augmentation is an essential part of the success of contrastive representation learning (Chen et al., 2020a). However, i-Mix is able to learn meaningful representations without other data augmentations, achieving an accuracy of 83.4% on CIFAR-10.

In Table 3, InputMix is applied together with i-Mix to further improve the performance on image datasets. For each principal data instance, we mix in two auxiliary data instances, with mixing coefficients (0.5λ_1 + 0.5, 0.5λ_2, 0.5λ_3), where (λ_1, λ_2, λ_3) ∼ Dirichlet(1, 1, 1). In the above example, while i-Mix is better than the baselines, adding InputMix further improves the performance of i-Mix, i.e., from 75.1% to 83.4% on CIFAR-10 and from 50.7% to 54.0% on CIFAR-100. This confirms that InputMix can further improve performance when domain-specific data augmentations are not available, as discussed in Section 3.3. Moreover, we verify the effectiveness of i-Mix in domains beyond images: the performance improves from 76.9% to 92.8% on the Speech Commands dataset when we assume no other data augmentations are available. We also observe consistent improvements in accuracy on tabular datasets, even when the training dataset size is large.
Although the domain knowledge for data augmentations is important to achieve state-of-the-art results, our demonstration shows the potential of i-Mix to be used for a wide range of application domains where domain knowledge is particularly limited.
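The InputMix coefficient scheme with two auxiliary data instances described in Section 4.4 can be sketched as follows; the function name and the returned weight vector are illustrative:

```python
import numpy as np

def inputmix(x, x_aux1, x_aux2, rng=np.random.default_rng(0)):
    """InputMix with two auxiliary data: coefficients (0.5λ1+0.5, 0.5λ2, 0.5λ3)
    with (λ1, λ2, λ3) ~ Dirichlet(1, 1, 1), so the principal data x always
    gets the largest weight (at least 0.5) and only its label is kept."""
    lam = rng.dirichlet(np.ones(3))
    w = np.array([0.5 * lam[0] + 0.5, 0.5 * lam[1], 0.5 * lam[2]])
    return w[0] * x + w[1] * x_aux1 + w[2] * x_aux2, w

# usage: mixing a ones-vector with two zero-vectors leaves w[0] per element
x_mix, w = inputmix(np.ones(4), np.zeros(4), np.zeros(4))
```

The three coefficients sum to one by construction (0.5(λ_1+λ_2+λ_3) + 0.5 = 1), and the principal weight is always at least 0.5, which is what justifies assigning the principal data's label to the mixed input.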

4.5. TRANSFERABILITY OF i-MIX

In this section, we show the improved transferability of the representations learned with i-Mix. The results are provided in Table 4. First, we train linear classifiers on downstream datasets different from the pretext dataset used to train the backbone networks and evaluate their performance, e.g., CIFAR-10 as the pretext and CIFAR-100 as the downstream dataset, or vice versa. We observe consistent performance gains when representations learned from one dataset are evaluated on classification tasks of another dataset. Next, we transfer representations trained on ImageNet to the PASCAL VOC object detection task (Everingham et al., 2010). We follow the settings in prior works (He et al., 2020; Chen et al., 2020b): the parameters of the pre-trained ResNet-50 are transferred to a Faster R-CNN detector with the ResNet50-C4 backbone (Ren et al., 2015), fine-tuned end-to-end on the VOC 07+12 trainval dataset, and evaluated on the VOC 07 test dataset. We report the average precision (AP) averaged over IoU thresholds from 50% to 95% in steps of 5%, as well as AP50 and AP75, the AP values at IoU thresholds of 50% and 75%, respectively. Similar to Table 2, we observe small but consistent performance gains in all metrics. These results confirm that i-Mix improves the quality of learned representations, such that performance on downstream tasks improves.

5. CONCLUSION

We propose i-Mix, a domain-agnostic regularization strategy applicable to a class of self-supervised learning methods. The key idea of i-Mix is to introduce a virtual label for each data instance and to mix both the inputs and the corresponding virtual labels. We show that i-Mix is applicable to state-of-the-art self-supervised representation learning methods including SimCLR, MoCo, and BYOL, and that it consistently improves performance in a variety of settings and domains. Our experimental results indicate that i-Mix is particularly effective when the training dataset size is small or data augmentation is not available, each of which is prevalent in practice.

Then, the head of the CNN is replaced with a linear classifier, and only the linear classifier is trained on the labeled downstream dataset. For the second stage, we use a batch size of 256 with the SGD optimizer with a momentum of 0.9 and an initial learning rate chosen among {1, 3, 5, 10, 30, 50, 70} over 100 epochs, where the learning rate is decayed by 0.2 after 80, 90, and 95 epochs. No weight decay is used at the second stage. The quality of representations is evaluated by the top-1 accuracy on the downstream task. We sample a single mixing coefficient λ ∼ Beta(1, 1) for each training batch. The temperature is set to τ = 0.2. Note that the optimal distribution of λ and the optimal value of τ vary over architectures, methods, and datasets, but the choices above result in reasonably good performance. The memory bank size of MoCo is 65536 for ImageNet and 4096 for other datasets, and the momentum for the exponential moving average (EMA) update is 0.999 for MoCo and BYOL. We do not symmetrize the BYOL loss, as it does not significantly improve performance while increasing computational complexity. For data augmentation, we follow Chen et al.
(2020a): we apply a set of data augmentations randomly in sequence, including resized cropping (Szegedy et al., 2015), horizontal flipping with a probability of 0.5, color jittering, and gray scaling with a probability of 0.2. Gaussian blurring with σ ∈ [0.1, 2] and a kernel size of 10% of the image height/width is applied for ImageNet. For evaluation on downstream tasks, we apply padded cropping with a pad size of 4 and horizontal flipping for CIFAR-10 and 100, and resized cropping and horizontal flipping for ImageNet.

Speech. In the experiments on Speech Commands (Warden, 2018), the network is the same as in the image-domain experiments, except that the number of input channels is one instead of three. The temperature is set to τ = 0.5 for the standard setting and τ = 0.2 for the no-augmentation setting. 10% silence data (all zeros) are added during training. In the first stage, the model is trained with the SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.125 over 500 epochs, where the learning rate decays by 0.1 after 300 and 400 epochs and the weight decay is 0.0001. The other settings are the same as in the experiments on CIFAR. For data augmentation, we apply a set of data augmentations randomly in sequence, including changing amplitude, speed, and pitch in the time domain, and stretching, time shifting, and adding background noise in the frequency domain. Each data augmentation is applied with a probability of 0.5. Augmented data are then transformed to a mel spectrogram of size 32 × 32.

Tabular. In the experiments on CovType and Higgs (Asuncion & Newman, 2007), we take a five-layer MLP with batch normalization as the backbone network. The output dimensions of the layers are (2048-2048-4096-4096-8192), where all layers have batch normalization followed by ReLU except for the last layer. The last-layer activation is maxout (Goodfellow et al., 2013) with 4 sets, such that the output dimension is 2048.
On top of this five-layer MLP, we attach a two-layer MLP (2048-128) as a projection head. We sample a single mixing coefficient λ ∼ Beta(α, α) for each training batch, where α = 2 for CovType and Higgs100k, and α = 1 for Higgs1M. The temperature is set to τ = 0.1. The other settings are the same as in the experiments on CIFAR, except that the batch size is 512 and the number of training epochs is 500. In the second stage, the MLP head is replaced with a linear classifier. For Higgs, the classifier is computed by linear regression from the feature matrix, obtained without data augmentation, to the label matrix using the pseudoinverse. Since prior knowledge on tabular data is very limited, only masking noise with a probability of 0.2 is considered as a data augmentation.
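The second-stage (linear evaluation) learning rate schedule described above can be sketched as follows. This is an illustrative helper under the stated schedule, not the paper's code; the function name is hypothetical.

```python
# Sketch: the linear-evaluation stage runs for 100 epochs and decays the
# initial learning rate by a factor of 0.2 after epochs 80, 90, and 95.

def linear_eval_lr(epoch, base_lr, milestones=(80, 90, 95), gamma=0.2):
    """Step-decayed learning rate for the linear-evaluation stage."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma  # decay once per milestone already passed
    return lr
```

For example, with a base learning rate of 30, the schedule yields 30 for epochs 0-79, 6 for epochs 80-89, 1.2 for epochs 90-94, and 0.24 afterwards.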

C.2 VARIATIONS OF i-MIX

We compare the MixUp (Zhang et al., 2018) and CutMix (Yun et al., 2019) variations of i-Mix on N-pair contrastive learning and SimCLR. To distinguish them, we call them i-MixUp and i-CutMix, respectively. To be fair with respect to memory usage in the pretext task stage, we reduce the batch size of i-MixUp and i-CutMix by half (256 to 128) for SimCLR. Following the learning rate adjustment strategy in Goyal et al. (2017), we also decrease the learning rate by half (0.125 to 0.0625) when the batch size is reduced. We note that i-MixUp and i-CutMix on SimCLR take approximately 2.5 times more training time to reach the same number of training epochs. The results are provided in Table C.1. We first verify that the N-pair formulation results in no worse performance than that of SimCLR. This justifies conducting experiments using the N-pair formulation instead of that of SimCLR.



Footnotes:
We omit bias terms for presentation clarity.
Some literature (He et al., 2020; Chen et al., 2020b) refers to them as query and positive/negative keys.
We present the application of i-Mix to the original SimCLR formulation in Appendix A.
InfoNCE (Oord et al., 2018) is a similar loss inspired by the idea of noise-contrastive estimation (Gutmann & Hyvärinen, 2010).
For losses linear with respect to labels (e.g., the cross-entropy loss), ℓ(λx_i + (1−λ)x_j, λv_i + (1−λ)v_j) = λ ℓ(λx_i + (1−λ)x_j, v_i) + (1−λ) ℓ(λx_i + (1−λ)x_j, v_j), i.e., optimizing toward the mixed label is equivalent to jointly optimizing toward the original labels. The proof for the losses in contrastive learning methods is provided in Appendix B.
Beta(α, α) is the uniform distribution when α = 1, bell-shaped when α > 1, and bimodal when α < 1.
We use the N-pair formulation in Eq. (5) instead of that of SimCLR, as it is simpler and more efficient for integrating i-Mix. As shown in Appendix C.2, the N-pair formulation results in no worse performance than SimCLR.
MoCo v2 improves the performance of MoCo with a cosine learning rate schedule and more data augmentations.
Supervised learning with improved methods, e.g., MixUp, outperforms i-Mix. However, linear evaluation on top of self-supervised representation learning is a proxy for measuring the quality of representations learned without labels, so it is not meant to be compared with the performance of supervised learning.
Here, "scale" corresponds to the amount of data rather than the image resolution.
This guarantees that the mixing coefficient for the principal data is larger than 0.5, preventing training with noisy labels. Note that Beckham et al. (2019) also sampled mixing coefficients from the Dirichlet distribution for mixing more than two data instances.
The j-th data instance can be excluded from the negative samples, but this does not result in a significant difference.
https://github.com/HobbitLong/SupContrast
For small-resolution data from CIFAR and Speech Commands, we replaced the kernel, stride, and padding sizes from (7, 2, 3) to (3, 1, 1) in the first convolutional layer and removed the first max pooling layer, following Chen et al. (2020a).
Specifically, brightness, contrast, and saturation are scaled by a factor uniformly sampled from [0.6, 1.4] at random, and hue is rotated in the HSV space by a factor uniformly sampled from [-0.1, 0.1] at random.
https://github.com/tugstugi/pytorch-speech-commands



Loss computation for i-Mix on N-pair contrastive learning in PyTorch-like style:

a, b = aug(x), aug(x)                 # two different views of input x
lam = Beta(alpha, alpha).sample()     # mixing coefficient
randidx = randperm(len(x))            # indices of instances to mix in
a = lam * a + (1 - lam) * a[randidx]  # mix inputs
logits = matmul(normalize(model(a)), normalize(model(b)).T) / t
loss = lam * CrossEntropyLoss(logits, arange(len(x))) + \
       (1 - lam) * CrossEntropyLoss(logits, randidx)
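The same computation can be made runnable without PyTorch on toy low-dimensional inputs. The sketch below replaces the network with identity features and implements the softmax cross-entropy by hand; helper names are ours, and it is an illustration of the loss structure, not the paper's implementation.

```python
# Pure-Python sketch of the i-Mix N-pair loss on toy vectors.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy(logits_row, target):
    # Numerically stable -log softmax(logits_row)[target].
    m = max(logits_row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum - logits_row[target]

def imix_npair_loss(a, b, lam, randidx, t=0.2):
    # Input mixing: each anchor is mixed with another instance in the batch.
    mixed = [[lam * ai + (1 - lam) * aj
              for ai, aj in zip(a[i], a[randidx[i]])] for i in range(len(a))]
    fa = [normalize(v) for v in mixed]
    fb = [normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(fa[i], fb[j])) / t
               for j in range(len(b))] for i in range(len(a))]
    # Virtual-label mixing: convex combination of cross-entropies toward the
    # original target i and the mixed-in target randidx[i].
    loss = 0.0
    for i in range(len(a)):
        loss += lam * cross_entropy(logits[i], i)
        loss += (1 - lam) * cross_entropy(logits[i], randidx[i])
    return loss / len(a)

# Toy batch of two 2-D instances; both views identical for simplicity.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
loss = imix_npair_loss(a, b, lam=0.5, randidx=[1, 0])
```

With λ = 0.5 and these symmetric inputs, every row of logits is constant, so each cross-entropy term equals log 2.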

Figure 1: Comparison of performance gains by applying i-Mix to MoCo v2 with different model sizes and number of epochs on CIFAR-10 and 100.

CIFAR-10 and 100 each consist of 50k training and 10k test images, and ImageNet (Deng et al., 2009) has 1.3M training and 50k validation images; we use these held-out splits for evaluation. For ImageNet, we also use a subset of 100 randomly chosen classes out of the 1k classes to experiment at a different scale.

[Table: Comparison of contrastive representation learning methods and i-Mix in different domains (CIFAR-10, CIFAR-100, and Speech Commands).]

[Table: Comparison of MoCo v2 and i-Mix on large-scale datasets.]

[Table: Comparison of MoCo v2 and i-Mix with and without data augmentations.]

We also provide a comparison between MoCo v2 and i-Mix trained with different model sizes and numbers of training epochs on the pretext task. We train ResNet-18, 50, 101, and 152 models with the number of training epochs varying from 200 to 2000. The results are shown in Figure 1.


[Table: Comparison of MoCo v2 and i-Mix in transfer learning.]

A MORE APPLICATIONS OF i-MIX

In this section, we introduce more variations of i-Mix. For conciseness, we use v_i to denote virtual labels for different methods, and we make the definition of v_i clear for each application.

A.1 i-MIX FOR SIMCLR

For each anchor, SimCLR takes the other anchors as negative samples, such that the virtual labels must be extended. Let x_{N+i} = x̃_i for conciseness, and let v_i ∈ {0,1}^{2N} be the virtual label indicating the positive sample of each anchor, where v_{i,N+i} = 1 and v_{i,j} = 0 for j ≠ N+i. Note that v_{i,i} = 0 because the anchor itself is not counted as a positive sample. Then, Eq. (4) can be represented in the form of the cross-entropy loss:

ℓ_SimCLR(x_i, v_i; B) = − Σ_{n=1}^{2N} v_{i,n} log [ exp(s(f_i, f_n)/τ) / Σ_{k=1, k≠i}^{2N} exp(s(f_i, f_k)/τ) ].  (A.1)

The application of i-Mix to SimCLR is straightforward: for two data instances (x_i, v_i), (x_j, v_j) and a batch of data B = {x_i}_{i=1}^{2N}, the i-Mix loss is defined as follows:

ℓ_i-MixSimCLR((x_i, v_i), (x_j, v_j); B, λ) = ℓ_SimCLR(λx_i + (1−λ)x_j, λv_i + (1−λ)v_j; B).  (A.2)

Note that only the input data in Eq. (A.2) are mixed, such that f_i in Eq. (A.1) is an embedding vector of the mixed data while the other f_n's are those of clean data. Because both clean and mixed data need to be fed to the network f, i-Mix for SimCLR requires twice as much memory and training time as SimCLR when the same batch size is used.
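The extended virtual labels described above are simple to construct. The sketch below (helper names are ours, for illustration) builds the 2N-dimensional one-hot label of each anchor, with x_{N+i} denoting the second view of x_i, and mixes two labels as i-Mix does.

```python
# Sketch: virtual labels for i-Mix on SimCLR.

def simclr_virtual_label(i, n):
    """v_i in {0,1}^{2N}: v_{i, N+i} = 1 (the anchor's positive), else 0."""
    v = [0.0] * (2 * n)
    v[n + i] = 1.0  # the second view of x_i is the positive sample
    return v

def mix_labels(vi, vj, lam):
    """Convex combination of two virtual labels."""
    return [lam * a + (1 - lam) * b for a, b in zip(vi, vj)]

# With N = 2: anchor 0's label points at index N + 0 = 2.
v0 = simclr_virtual_label(0, 2)           # [0, 0, 1, 0]
mixed = mix_labels(v0, simclr_virtual_label(1, 2), 0.7)
```

After mixing, the label is no longer one-hot: mass 0.7 stays on x_i's positive and 0.3 moves to x_j's positive, mirroring the input mixing.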

A.2 i-MIX FOR SUPERVISED CONTRASTIVE LEARNING

Supervised contrastive learning has recently been shown to be effective for supervised representation learning, and it often outperforms standard end-to-end supervised classifier learning (Khosla et al., 2020). Suppose a one-hot label y_i ∈ {0,1}^C is assigned to each data instance x_i, where C is the number of classes. Let x_{N+i} = x̃_i and y_{N+i} = y_i for conciseness. For a batch of data pairs and their labels B = {(x_i, y_i)}_{i=1}^{2N}, let v_i ∈ {0,1}^{2N} be the virtual label indicating the positive samples of each anchor, where v_{i,j} = 1 if y_j = y_i and j ≠ i, and v_{i,j} = 0 otherwise. Intuitively, Σ_{j=1}^{2N} v_{i,j} = 2N_{y_i} − 1, where N_{y_i} is the number of data instances with the label y_i. The supervised version of the SimCLR (SupCLR) loss is then written in terms of these virtual labels, as in Eq. (A.3). The application of i-Mix to SupCLR is straightforward: for two data instances (x_i, v_i), (x_j, v_j) and a batch of data B = {x_i}_{i=1}^{2N}, the i-Mix loss is defined analogously to Eq. (A.2), giving Eq. (A.4). Note that i-Mix in Eq. (A.4) is not as efficient as SupCLR in Eq. (A.3), for the same reason as in the case of SimCLR. To overcome this, we reformulate SupCLR in the form of the N-pair loss (Sohn, 2016). For a batch of data pairs and their labels B = {(x_i, x̃_i, y_i)}_{i=1}^{N}, let v_i ∈ {0,1}^N be the virtual label indicating the positive samples of each anchor, where v_{i,j} = 1 if y_j = y_i, and v_{i,j} = 0 otherwise. The supervised version of the N-pair (Sup-N-pair) contrastive loss and the corresponding i-Mix loss then follow in the same way.
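The supervised virtual labels above can be sketched as follows. This is an illustrative helper of ours (not the paper's code) for the 2N-sample SupCLR formulation, where each anchor's positives are all other samples sharing its class label.

```python
# Sketch: supervised virtual labels for SupCLR over 2N samples
# (each of the N instances and its second view, labels duplicated).

def supclr_virtual_label(i, labels):
    """labels: class labels of all 2N samples; returns v_i in {0,1}^{2N},
    with v_{i,j} = 1 iff j != i and labels[j] == labels[i]."""
    return [1.0 if j != i and labels[j] == labels[i] else 0.0
            for j in range(len(labels))]

# N = 3 instances with labels [0, 1, 0], followed by their second views.
labels = [0, 1, 0, 0, 1, 0]
v0 = supclr_virtual_label(0, labels)
```

For anchor 0 (label 0, which appears N_{y_0} = 2 times among the instances), the label vector sums to 2 · N_{y_0} − 1 = 3, matching the identity stated above.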

B PROOF OF THE LINEARITY OF LOSSES WITH RESPECT TO VIRTUAL LABELS

Cross-entropy loss. The loss used in contrastive representation learning works, often referred to as InfoNCE (Oord et al., 2018), can be represented in the form of the cross-entropy loss, as we showed for N-pair contrastive learning, SimCLR (Chen et al., 2020a), and MoCo (He et al., 2020). Here we provide an example in the case of N-pair contrastive learning.

L2 loss between L2-normalized feature vectors. The BYOL (Grill et al., 2020) loss is of this type. Because F is not backpropagated, it can be considered as a constant.
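The linearity property for the cross-entropy case can be checked numerically. The snippet below (our illustration, stdlib only) verifies that, because the cross-entropy −Σ_c v_c log p_c is linear in the label v, the loss toward a mixed label equals the mixture of the losses toward the original labels.

```python
# Numeric sanity check of linearity of the cross-entropy loss in the label.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    p = softmax(logits)
    return -sum(v * math.log(pc) for v, pc in zip(label, p))

logits = [2.0, -1.0, 0.5]                     # arbitrary example logits
vi, vj = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]     # two one-hot labels
lam = 0.3
mixed = [lam * a + (1 - lam) * b for a, b in zip(vi, vj)]

lhs = cross_entropy(logits, mixed)
rhs = lam * cross_entropy(logits, vi) + (1 - lam) * cross_entropy(logits, vj)
```

Up to floating-point error, lhs equals rhs for any logits, labels, and λ, which is exactly the property the proofs in this section rely on.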

C MORE ON EXPERIMENTS

We describe details of the experimental settings and additional experimental results. For the additional experiments below, we adapted the code for supervised contrastive learning (Khosla et al., 2020).

C.1 SETUP

In this section, we describe details of the experimental settings. Note that the learning rate is scaled by the batch size (Goyal et al., 2017): ScaledLearningRate = LearningRate × BatchSize / 256.

Image. The experiments on CIFAR-10 and 100 (Krizhevsky & Hinton, 2009) and ImageNet (Deng et al., 2009) are conducted in two stages. Following Chen et al. (2020a), the convolutional neural network (CNN) part of ResNet-50 (He et al., 2016), followed by a two-layer multilayer perceptron (MLP) projection head (output dimensions 2048 and 128, respectively), is trained on the unlabeled pretext dataset with a batch size of 256 (i.e., 512 augmented data) using the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 over up to 4000 epochs. BYOL has an additional prediction head (with the same output dimensions as the projection head), which follows the projection head, only for the model updated by gradient. 10 epochs of warmup with a linear schedule to an initial learning rate of 0.125, followed by a cosine learning rate schedule (Loshchilov & Hutter, 2017), are used. We use a weight decay of 0.0001 for the first stage. For ImageNet, we use the same hyperparameters, except that the batch size is 512 and the initial learning rate is 0.03.

Table C.1: Comparison of N-pair contrastive learning and SimCLR, with i-MixUp and i-CutMix on them, with ResNet-50 on CIFAR-10 and 100. We run all experiments for 1000 epochs. i-MixUp improves the accuracy on the downstream task regardless of the data distribution shift between the pretext and downstream tasks. i-CutMix shows a performance comparable to i-MixUp when the pretext and downstream datasets are the same, but not when a data distribution shift occurs.
Table C.2: Comparison of the N-pair self-supervised and supervised contrastive learning methods, and i-Mix on them, with ResNet-50 on CIFAR-10 and 100. We also provide the performance of formulations proposed in prior works: SimCLR (Chen et al., 2020a) and its supervised version (Khosla et al., 2020). We run all experiments for 1000 epochs. i-Mix improves the accuracy on the downstream task regardless of the data distribution shift between the pretext and downstream tasks, except when the pretext task has a smaller number of classes than the downstream task. The quality of representation depends on the pretext task in terms of transfer learning performance: self-supervised learning is better on CIFAR-10, while supervised learning is better on CIFAR-100.

The N-pair formulation is simpler and more efficient than that of SimCLR, especially when applying i-Mix, without losing performance. When the pretext and downstream tasks share the training dataset, i-CutMix often outperforms i-MixUp, though the margin is small. However, i-CutMix shows worse performance in transfer learning.

Table C.2 compares the performance of SimCLR, N-pair contrastive learning, and i-Mix on N-pair contrastive learning when the pretext task is self-supervised or supervised contrastive learning. We confirm that the N-pair formulation results in no worse performance than that of SimCLR in supervised contrastive learning as well. i-Mix improves the performance of supervised contrastive learning from 95.7% to 97.0% on CIFAR-10, similar to the improvement achieved by MixUp in supervised learning, where it improves the performance of supervised classifier learning from 95.5% to 96.6%. On the other hand, when the pretext dataset is CIFAR-100, the performance of supervised contrastive learning is not better than that of supervised learning: MixUp improves the performance of supervised classifier learning from 78.9% to 82.2%, and i-Mix improves the performance of supervised contrastive learning from 74.6% to 78.4%.
While supervised i-Mix improves the classification accuracy on CIFAR-10 when trained on CIFAR-10, the representation does not transfer well to CIFAR-100, possibly due to overfitting to 10-class classification. When the pretext dataset is CIFAR-100, supervised contrastive learning shows better performance than self-supervised contrastive learning regardless of the distribution shift, as it learns a representation sufficiently general for a linear classifier to work well on CIFAR-10 as well.

For a quantitative analysis of learned representations, we measure the Fréchet distance (Fréchet, 1957; Vaserstein, 1969) between the sets of training and test embedding vectors under a Gaussian distribution assumption. For conciseness, let f̄_i = f(x_i)/‖f(x_i)‖ be an ℓ2-normalized embedding vector; we normalize embedding vectors as we do when measuring the cosine similarity. Then, with the means and covariances estimated from the normalized training and test embeddings, we compute the Fréchet embedding distance (FED). As shown in Table C.3, i-Mix improves FED over contrastive learning, regardless of the distribution shift. Note that the distance is large when the training dataset of the downstream task is the same as that of the pretext task. This is because the model overfits to the training dataset, such that the distance from the test dataset, which is unseen during training, is large.
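For reference, the Fréchet distance between two Gaussians N(m1, C1) and N(m2, C2) is ‖m1 − m2‖² + Tr(C1 + C2 − 2(C1 C2)^{1/2}). The sketch below is our illustration, not the paper's implementation, and assumes diagonal covariances for simplicity so that the trace term has a closed form per dimension.

```python
# Sketch: Fréchet distance between two Gaussians with diagonal covariances.
import math

def frechet_distance_diag(m1, v1, m2, v2):
    """m1, m2: mean vectors; v1, v2: per-dimension variances.
    Under diagonal covariances, Tr(C1 + C2 - 2 (C1 C2)^{1/2}) reduces to
    sum_d (v1_d + v2_d - 2 * sqrt(v1_d * v2_d))."""
    mean_term = sum((a - b) ** 2 for a, b in zip(m1, m2))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(v1, v2))
    return mean_term + cov_term

# Identical Gaussians have distance 0; shifting one mean by 1 in one
# dimension adds exactly 1 to the distance.
d_same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
d_shift = frechet_distance_diag([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

In practice the full (non-diagonal) covariance version requires a matrix square root; the diagonal form above only conveys the shape of the metric.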

C.3 QUALITATIVE EMBEDDING ANALYSIS

On the other hand, Table C.3 shows that i-Mix reduces the gap between the training and test accuracy. This implies that i-Mix is an effective regularization method for pretext tasks, such that the learned representation generalizes better to downstream tasks.

