MOSAIC REPRESENTATION LEARNING FOR SELF-SUPERVISED VISUAL PRE-TRAINING

Abstract

Self-supervised learning has achieved significant success in learning visual representations without the need for manual annotations. To obtain generalizable representations, a meticulously designed data augmentation strategy is one of the most crucial parts. Recently, multi-crop strategies, which use a set of small crops as positive samples, have been shown to learn spatially structured features. However, this strategy overlooks the diversity of contextual backgrounds, which reduces the variance of the input views and degrades the performance. To address this problem, we propose a mosaic representation learning framework (MosRep), built around a new data augmentation strategy that enriches the backgrounds of each small crop and improves the quality of visual representations. Specifically, we randomly sample a number of small crops from different input images and compose them into a mosaic view, which is equivalent to introducing different background information for each small crop. Additionally, we jitter the mosaic view to prevent the model from memorizing the spatial location of each crop. Along with the optimization, our MosRep gradually extracts more discriminative features. Extensive experimental results demonstrate that our method improves the performance far more than the multi-crop strategy on a series of downstream tasks, e.g., by +7.4% and +4.9% over the multi-crop strategy on ImageNet-1K with 1% and 10% labels, respectively. Code is available at https://github.com/DerrickWang005/MosRep.git.

1. INTRODUCTION

High-quality representation learning (Bengio et al., 2013) is a fundamental task in machine learning. A tremendous number of visual recognition models have achieved promising performance by learning from large-scale annotated datasets, e.g., ImageNet (Deng et al., 2009) and OpenImages (Kuznetsova et al., 2020). However, a great deal of challenges exist in collecting large-scale datasets with annotations, e.g., label noise (Liu & Tao, 2015; Natarajan et al., 2013; Xia et al., 2019), high cost (Zhu et al., 2019) and privacy concerns (Liang et al., 2020). To address these issues, self-supervised learning (SSL) is proposed to learn generic representations without manual annotation. Recent progress in visual self-supervised learning (Caron et al., 2020; He et al., 2020; Grill et al., 2020; Chen & He, 2021; Bai et al.) shows remarkable potential and achieves results comparable with supervised learning. Among these SSL methods, a common underlying idea is to extract invariant feature representations from different augmented views of the same input image. Contrastive learning (Dosovitskiy et al., 2015; Wu et al., 2018; Chen et al., 2020a; He et al., 2020; Wang et al., 2022) is one of the most commonly used methods. These methods define 'positive' and 'negative' pairs and apply a contrastive loss (i.e., InfoNCE (Hénaff et al., 2019)) for optimization, where the 'positive' pairs are pulled close and the 'negative' pairs are pushed away. Another line of work, such as BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021), introduces the concept of asymmetry, which is free from designing negatives. They add an extra 'predictor' on top of the model and update the parameters using one augmented view, while the feature of the other augmented view is used as fixed supervision. Besides, clustering methods (Caron et al., 2018; 2020; Asano et al., 2019; Li et al., 2020) adopt two augmented views of the same image as the prediction and the pseudo cluster label and enforce consistency between the two views. We present more related works in the appendix.
It is worth noting that a carefully designed data augmentation strategy is an essential part of the above self-supervised learning frameworks. SimCLR (Chen et al., 2020a) and InfoMin (Tian et al., 2020) empirically investigate the impact of different data augmentations and observe that SSL benefits more from strong data augmentations than supervised learning. After that, SwAV (Caron et al., 2020) proposes the multi-crop strategy, which achieves significant performance gains on downstream tasks. As shown in Figure 1 (a) and (b), they use two standard-resolution crops and sample several small crops that cover local regions of the input image in order to encourage the "local-to-global" correspondences. However, the small crops overlook the diverse backgrounds and decrease the variance of the views; such views with too many similarities are trivial for learning discriminative features.
Intuitively, if we can take into account both the "local-to-global" correspondences and the diverse contextual backgrounds, the quality of the learned representations can be further improved. In this paper, we propose a mosaic representation learning framework (MosRep) consisting of a new data augmentation strategy that enriches the contextual background of each small crop while encouraging the "local-to-global" correspondences. Specifically, we first sample M (e.g., M = 4) small crops from each input image in the current batch. Then, these crops are randomly shuffled and divided into multiple groups. Each group contains M crops, and we ensure that the small crops in each group come from different input images. Subsequently, as illustrated in Figure 1 (c), we combine the small crops of the same group into a single view, termed the mosaic view. Finally, we jitter the mosaic view in order to prevent the model from memorizing the spatial position of each small crop. In the forward pass, the mosaic view is fed into the model for feature extraction; we adopt the RoI Align operator to extract the feature of each crop from the mosaic view and project this feature into an embedding space. To minimize the loss function (e.g., the contrastive loss), the model gradually learns to capture more discriminative features (i.e., foreground objects) from the complex backgrounds, improving the quality of visual representations, as shown in Figure 1.

In summary, the main contributions of this paper are as follows:

1. We design a mosaic augmentation strategy that takes into account both the diversity of backgrounds and the "local-to-global" correspondences.
2. Based on the proposed mosaic augmentation, we propose a practical and effective SSL framework, MosRep, which benefits the extraction of discriminative features and improves the quality of visual representations.
3. We build our proposed method upon two different SSL frameworks and validate its effectiveness.
Experimental results show that our method achieves superior gains compared to the multi-crop strategy on various downstream tasks. 

2. METHODS

We propose MosRep, a framework to adequately facilitate the learning of discriminative features from large-scale unlabeled datasets. In this section, we first revisit the preliminaries on contrastive learning and the multi-crop strategy. Then we present our mosaic representation learning framework in detail.

2.1. PRELIMINARY

Contrastive learning. Contrastive learning (Dosovitskiy et al., 2015; Chen et al., 2020a; He et al., 2020) is one of the most popular self-supervised learning frameworks and has achieved great success in recent years. Given a set of images X, the goal of contrastive learning is to learn an embedding space in which the embedding of each image x_i can be distinguished from a set of negatives N. First, two separate augmentations t^q and t^k are sampled from a set of pre-defined augmentations T and applied to obtain two different views of each image, i.e., x_i^q = t^q(x_i), x_i^k = t^k(x_i). Then, these views are fed into an encoder F(·) to extract features h and a projector g(·) to map the features h into an embedding space, i.e., z_i^q = g(F(x_i^q)), z_i^k = g(F(x_i^k)). Finally, a contrastive objective (e.g., InfoNCE) is formulated as

L_contrast = -(1/N) Σ_{i=1}^{N} log [ exp(sim(z_i^q, z_i^k)/τ) / ( exp(sim(z_i^q, z_i^k)/τ) + Σ_{n^- ∈ N} exp(sim(z_i^q, n^-)/τ) ) ],

where n^-, τ, N and sim(·) denote a negative sample, a temperature parameter, the number of input images and cosine similarity, respectively. Besides, without the need for negatives, BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021) adopt an extra predictor to map the embedding z to a prediction p and minimize the negative cosine similarity between the prediction and the target embedding, i.e.,

L_cos = (1/N) Σ_{i=1}^{N} [ -sim(p_i^q, sg(z_i^k)) - sim(p_i^k, sg(z_i^q)) ],

where sg(·) denotes the stop-gradient trick, which is crucial to avoid model collapse.

Multi-crop strategy. SwAV (Caron et al., 2020) proposes a multi-crop strategy using two high-resolution views that cover large parts of the image and several low-resolution views that cover small parts of the image. In doing so, this strategy maximizes the agreement between a high-resolution (global) view and a low-resolution (local) view of the same image, encouraging the model to learn spatially structured representations.
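To make the InfoNCE objective concrete, here is a minimal plain-Python sketch for a single query with toy vectors (no deep learning framework; `cosine` and `info_nce` are illustrative names, not from the released code):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def info_nce(z_q, z_k, negatives, tau=0.2):
    # -log( exp(sim(q,k)/tau) / (exp(sim(q,k)/tau) + sum_n exp(sim(q,n)/tau)) )
    pos = math.exp(cosine(z_q, z_k) / tau)
    neg = sum(math.exp(cosine(z_q, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# The loss is low when the positive pair is aligned and the negative is not,
# and high in the opposite case.
low = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
high = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this quantity pulls z^q toward z^k and pushes it away from the negatives, which is exactly the behaviour the formula describes.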

2.2. MOSAIC REPRESENTATION LEARNING

In the multi-crop strategy, the features of the low-resolution and high-resolution crops are enforced to remain consistent, which encourages the "local-to-global" correspondences. However, the low-resolution crops only cover small parts of the input images, which overlooks the diverse background information and reduces the variance of these crops. Because of this issue, the model cannot learn sufficiently discriminative features, which can degrade the performance on various downstream tasks. Motivated by this, we propose a mosaic representation learning framework (MosRep), which consists of a new mosaic augmentation strategy that enriches the backgrounds of each small crop and improves the quality of visual representations. Concretely, given an input image x_i, we generate two standard views and M small crops by three separate augmentation operators t^q, t^k and t^s, i.e.,

x_i^q = t^q(x_i), x_i^k = t^k(x_i), X_i^s = t^s(x_i), with t^q, t^k, t^s ∼ T,

where x_i^q and x_i^k denote the two standard views and X_i^s denotes a set of small crops. As shown in "Small Crops" of Figure 2, we randomly shuffle all small crops from the images in a batch and divide them into groups. We place M crops in each group and ensure that the crops in each group come from different input images. Then, we compose the crops of each group into a single view, termed the mosaic view x_i^v, and record the coordinates (t, l, b, r) of each small crop relative to x_i^v, where t, l, b, r indicate the top, left, bottom and right positions. This process is formulated as

x_i^v = Compose(M_i), M_i = {x_ij^s | i ∈ N, j = 1, ..., M},

where M_i denotes a group of shuffled small crops, N denotes the set of indices of the input images and j denotes the index of a small crop within a group; that is, x_ij^s is the j-th crop in the set X_i^s. Since the spatial position of each crop is fixed in the mosaic view x_i^v, the model can easily memorize these positions, resulting in over-fitting.
To tackle this dilemma, we apply a view jitter operation to the mosaic view. We first sample offsets for the mosaic view from a beta distribution β(α, α) with two identical parameters α, i.e.,

Δx = θ · u, Δy = θ · v, with u, v ∼ β(α, α),

where θ denotes the upper bound of the jitter range, and Δx and Δy denote the offsets of the mosaic view. We set α < 1, which gives a U-shaped distribution; in this way, the mosaic view is more likely to be jittered over a relatively wide range with large offsets. Meanwhile, we update the coordinates of each small crop with these offsets, which is calculated as

(t′, l′, b′, r′) = (t + Δy, l + Δx, b + Δy, r + Δx).

In doing so, the mosaic view effectively enriches the background information of each small crop and facilitates the extraction of discriminative features, improving the quality of the learned representations. Subsequently, we present the forward propagation of our proposed framework in the lower part of Figure 2. Given two standard views x_i^q, x_i^k and a mosaic view x_i^v, an encoder F(·) is used to extract their features, i.e.,

h_i^q = F(x_i^q), h_i^k = F(x_i^k), h_i^v = F(x_i^v),

where h_i^q, h_i^k and h_i^v denote the features of the two standard views and the mosaic view, respectively. It is worth noting that, by resorting to the above-mentioned coordinates, we adopt a RoI Align operator to extract the feature of each small crop from the mosaic view. According to the index i of the input images, we rearrange the features of these small crops to correspond to their positive samples, and use H_i^s to denote the set of features of the small crops of the i-th image. After that, all features are mapped into an embedding space by a projector g(·), i.e.,

z_i^q = g(h_i^q), z_i^k = g(h_i^k), Z_i^s = g(H_i^s),

where z_i^q, z_i^k are the embeddings of the two standard views, and Z_i^s denotes the set of embeddings of the small crops.
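The jitter step above can be sketched in a few lines of Python (a toy sketch, assuming one offset pair is sampled per mosaic view and applied to every crop box; `random.betavariate` provides the β(α, α) draw and all names are illustrative):

```python
import random

def jitter_boxes(boxes, theta=32.0, alpha=0.5, rng=random):
    # Sample mosaic-view offsets from a Beta(alpha, alpha) distribution
    # scaled by the upper bound theta; alpha < 1 gives a U-shaped density,
    # so large offsets are likely. Every crop box (t, l, b, r) is shifted
    # by the same (dy, dx), matching the coordinate update above.
    u = rng.betavariate(alpha, alpha)
    v = rng.betavariate(alpha, alpha)
    dx, dy = theta * u, theta * v
    shifted = [(t + dy, l + dx, b + dy, r + dx) for (t, l, b, r) in boxes]
    return shifted, (dx, dy)

boxes = [(0, 0, 112, 112), (0, 112, 112, 224)]
shifted, (dx, dy) = jitter_boxes(boxes)
```

Because a single offset is shared by all crops in the view, the crops' relative layout is preserved; only their absolute coordinates, and hence the RoI Align boxes, change between iterations.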
Finally, we define the positive and negative pairs of all embeddings based on the index i of the input images and calculate the contrastive loss, i.e.,

L_contrast = -(1/N) Σ_{i=1}^{N} log [ exp(sim(z_i^q, z_i^k)/τ) / ( exp(sim(z_i^q, z_i^k)/τ) + Σ_{n^- ∈ N} exp(sim(z_i^q, n^-)/τ) ) ]
             -(1/N) Σ_{i=1}^{N} (1/M) Σ_{j=1}^{M} log [ exp(sim(z_ij^s, z_i^k)/τ) / Σ_{n^- ∈ Z^k ∪ N} exp(sim(z_ij^s, n^-)/τ) ],

where z_ij^s denotes the j-th embedding in the set Z_i^s, Z^k denotes the set of embeddings of the key views, and n^- denotes a negative sample drawn from Z^k and N. The first term is the standard contrastive loss used in (Chen et al., 2020b). We design the second term to explicitly pull each small crop and its corresponding key view closer and push irrelevant samples away. With optimization, the model gradually learns to extract more discriminative features, thus improving the quality of the learned representations. Besides, we also build our proposed method upon the BYOL (Grill et al., 2020) framework by minimizing the negative cosine similarity between the small crops and the key view:

L_cos = (1/N) Σ_{i=1}^{N} (1/M) Σ_{j=1}^{M} [ -sim(p_ij^s, sg(z_i^k)) ],

where p_ij^s denotes the prediction of the j-th small crop of image i and sg(·) is the stop-gradient operation.
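The second contrastive term can be illustrated with a toy plain-Python version (embeddings are small lists, and the queue of extra negatives is passed explicitly; function and variable names are assumptions, not the released implementation):

```python
import math

def cos_sim(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def crop_term(crop_embs, key_embs, queue, tau=0.2):
    # crop_embs[i]: embeddings of the small crops of image i (the set Z_i^s);
    # key_embs[i]: key-view embedding z_i^k; queue: extra negatives N.
    # Each crop is contrasted against its own key view, with all key
    # embeddings and the queue in the denominator.
    total, count = 0.0, 0
    for i, crops in enumerate(crop_embs):
        for z_s in crops:
            pos = math.exp(cos_sim(z_s, key_embs[i]) / tau)
            denom = sum(math.exp(cos_sim(z_s, k) / tau) for k in key_embs)
            denom += sum(math.exp(cos_sim(z_s, n) / tau) for n in queue)
            total -= math.log(pos / denom)
            count += 1
    return total / count

keys = [[1.0, 0.0], [0.0, 1.0]]
matched = crop_term([[[1.0, 0.0]], [[0.0, 1.0]]], keys, queue=[])
mismatched = crop_term([[[0.0, 1.0]], [[1.0, 0.0]]], keys, queue=[])
```

The loss is small when each crop embedding matches the key view of its own source image, which is the pairing the index i enforces.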

3. EXPERIMENTS

3.1. PRE-TRAINING SETTINGS

Datasets We perform self-supervised pre-training on two datasets, one middle-scale and one large-scale: 1) 100-category ImageNet (IN-100) (Tian et al., 2019), a subset of the IN-1K dataset containing ∼125k images; and 2) 1000-category ImageNet (IN-1K) (Deng et al., 2009), the standard ImageNet training set containing ∼1.28M images.

Architectures Following the commonly used setting in recent unsupervised methods (He et al., 2020; Chen et al., 2020a; Tian et al., 2019), we adopt the ResNet-50 (He et al., 2016) model as our encoder. To study the flexibility of our proposed method, we build on two different frameworks, MoCo-v2 (Chen et al., 2020b) and BYOL (Grill et al., 2020).

Data Augmentation During pre-training, we adopt the data augmentation used in MoCo-v2 (Chen et al., 2020b) and BYOL (Grill et al., 2020) for the standard views. For the mosaic view, we generate M = 4 small crops with a 112 × 112 input size; the other data augmentations are the same as in MoCo-v2 and BYOL.

Optimization Our hyperparameters closely follow those of the adopted self-supervised learning methods, MoCo-v2 (Chen et al., 2020b) and BYOL (Grill et al., 2020). For the MoCo version, we pre-train the network on IN-100 and IN-1K for 400 and 200 epochs, respectively. We adopt the SGD optimizer with a cosine learning rate schedule (Loshchilov & Hutter, 2016) and lr_base = 0.3, with a mini-batch size of 256 on 8 NVIDIA V100 GPUs. We utilize a negative queue of 16,384 for IN-100 and 65,536 for IN-1K. The weight decay is 0.0001 and the SGD momentum is 0.9. For the BYOL (Grill et al., 2020) version, we pre-train the network on IN-100 for 400 epochs and on IN-1K for 200 epochs. We adopt the SGD optimizer with lr_base = 0.7, where the learning rate is linearly scaled with the batch size as lr = lr_base × batch/256. The batch size is set to 512 for IN-100 and 2048 for IN-1K. The weight decay is 0.000001 and the SGD momentum is 0.9.
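The two scheduling rules quoted above, linear scaling with batch size and cosine decay, amount to the following (a sketch, not the released training script):

```python
import math

def scaled_lr(lr_base, batch_size):
    # Linear scaling rule: lr = lr_base * batch_size / 256.
    return lr_base * batch_size / 256

def cosine_lr(step, total_steps, peak_lr):
    # Cosine schedule decaying from peak_lr at step 0 to 0 at the end.
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))

peak = scaled_lr(0.7, 2048)           # BYOL on IN-1K: 0.7 * 2048 / 256 = 5.6
halfway = cosine_lr(500, 1000, peak)  # mid-training: half of the peak
```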
For the momentum encoder, the momentum value starts from 0.99 and is increased to 1, following (Grill et al., 2020) . We use batch normalization (Ioffe & Szegedy, 2015) synchronized across devices.

3.2. EVALUATION ON IMAGENET

To evaluate the effectiveness of our proposed method, we pre-train the model on IN-100 and IN-1K, respectively, and evaluate it on a series of downstream tasks, including linear probing, nearest neighbor, semi-supervised learning, and transfer learning.

Table 1: Evaluation on ImageNet linear probing. We adopt a ResNet-50 (He et al., 2016) as our backbone and pre-train it on a middle-scale and a large-scale dataset, i.e., ImageNet-100 and ImageNet-1K (Deng et al., 2009), for 400 and 200 epochs, respectively. We build our method on two widely used frameworks, MoCo-v2 (Chen et al., 2020b) and BYOL (Grill et al., 2020), to evaluate its universality. To thoroughly verify the effectiveness, we set the standard two-crop version as a weak baseline and the multi-crop version as a strong baseline. We report the Top-1 accuracy of linear probing and k-NN on ImageNet-100 and ImageNet-1K, respectively. * denotes the strong baseline.

Linear Probing Following the linear evaluation protocol of (Chen & He, 2021), we freeze the deep features and train a supervised classifier for 90 epochs with a batch size of 1024. We adopt the SGD optimizer with lr_base = 0.1, where the learning rate is linearly scaled with the batch size as lr = lr_base × batch/256. We use standard supervised ImageNet augmentations (He et al., 2020) during training. For a thorough comparison, we set two baselines: 1) a weak baseline with two standard crops, and 2) a strong baseline with the multi-crop strategy. Firstly, we build our MosRep upon the MoCo-v2 baseline; the key difference between the two frameworks is the use of mosaic representation learning. As illustrated in Table 1, MosRep surpasses the weak baseline by an obvious margin of 4.8% in linear probing on IN-100. Similarly, MosRep achieves a significant gain of 4.6% over the weak baseline in linear probing on IN-1K. More importantly, MosRep shows a considerable improvement over the strong baseline on both the IN-100 and IN-1K datasets, which demonstrates that the performance gain does not simply come from more small crops. For each small crop, the mosaic view effectively introduces diverse background information. We argue that, along with the optimization, the model needs to extract more discriminative features from the diverse backgrounds to maximize the similarity of positive pairs and minimize the similarity of negative pairs, which improves the quality of the learned representations. In the BYOL experiments, we observe a similar phenomenon to the MoCo-v2 ones.

Nearest Neighbor We further evaluate the representations of the pre-trained models with a nearest neighbor classifier. Following (Koohpayegani et al., 2021), we adopt a 256 × 256 center crop on the training and test sets and compute l2-normalized embeddings by forwarding through the backbone. We report Top-1 accuracy with 1-NN and 5-NN classifiers. For a thorough comparison, we set two baselines: 1) a weak baseline with two standard crops,
2) a strong baseline with the multi-crop strategy. Table 1 shows the results of the nearest neighbor evaluation on the IN-100 and IN-1K datasets. MosRep achieves consistent improvements over the weak and strong baselines with both the 1-NN and 5-NN classifiers. Significant gains of 2.5% and 5.7% are achieved on the IN-100 and IN-1K datasets when we build our method on the MoCo-v2 framework. Besides, we also build on BYOL and obtain improvements of 2.0% and 0.8% on IN-100 and IN-1K, respectively.

Semi-supervised Learning Following the semi-supervised settings (Chen et al., 2020a; Grill et al., 2020; Hénaff et al., 2019), we evaluate the pre-trained models on the task of classification with 1% and 10% of the ImageNet-1K (Deng et al., 2009) training dataset. We freeze the backbone and train a single linear classifier. We utilize the features from the top of the backbone, which are normalized to have unit l2 norm and then scaled and shifted to have zero mean and unit variance in each dimension. The linear classifier is trained with the SGD optimizer (lr=0.01, epochs=40, batch size=256, weight decay=0.0005, momentum=0.9). The learning rate is multiplied by 0.1 at epochs 15 and 30. We use the standard setting of ImageNet supervised learning during training. Table 2 illustrates the comparison of our approach against two widely used frameworks (He et al., 2020; Grill et al., 2020) and their strong variants. Impressively, when we build our proposed MosRep on MoCo-v2, we achieve considerable improvements over the strong baseline with both 1% and 10% labels, e.g., 7.4% Top-1 accuracy on the 1% subset and 4.9% Top-1 accuracy on the 10% subset. Building on BYOL, we still outperform the strong baseline by 1.4% and 0.6% Top-1 accuracy with 1% and 10% labels, respectively.

Table 2: Evaluation on ImageNet-1K semi-supervised training. We pre-train a ResNet-50 (He et al., 2016) on ImageNet-1K for 200 epochs and conduct semi-supervised training on the 1% and 10% labelled sets. We build our method on two widely used frameworks, MoCo-v2 (Chen et al., 2020b) and BYOL (Grill et al., 2020), to evaluate its universality. To fully evaluate the effectiveness, we set the standard two-crop version as a weak baseline and the multi-crop version as a strong baseline. Top-1 and Top-5 accuracy are used as benchmark metrics. * denotes the strong baseline.

Table 3: Evaluation on linear probing transfer learning. We pre-train a ResNet-50 (He et al., 2016) on ImageNet-1K for 200 epochs and conduct linear probing on 8 classification datasets. We build our method on two widely used frameworks, MoCo-v2 (Chen et al., 2020b) and BYOL (Grill et al., 2020), to evaluate its universality. * denotes the strong baseline.

3.3. EVALUATION ON TRANSFER LEARNING

To further evaluate the quality of the learned representations, we conduct a series of transfer learning experiments, including linear probing on various classification datasets, COCO object detection (Lin et al., 2014), COCO instance segmentation (Lin et al., 2014), and Cityscapes instance segmentation (Cordts et al., 2016).

Table 4: Evaluation on COCO object detection and instance segmentation. We adopt a ResNet-50 (He et al., 2016) as our backbone and pre-train it on ImageNet-100 and ImageNet-1K (Deng et al., 2009) for 400 and 200 epochs, respectively. Each pre-trained model is transferred to a Mask R-CNN R50-FPN model and subsequently fine-tuned on the COCO 2017 (Lin et al., 2014) train set and evaluated on the COCO 2017 val set. Average Precision on bounding boxes (AP^bb) and masks (AP^mk) are used as benchmark metrics. * denotes the strong baseline.

Linear Probing Following the procedure adopted in (Chen et al., 2020a; Grill et al., 2020), we perform the linear probing task on the following datasets: CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), Food-101 (Bossard et al., 2014), Flower-102 (Nilsback & Zisserman, 2008), DTD (Cimpoi et al., 2014), Pets (Parkhi et al., 2012), and Cars (Krause et al., 2013). First, we build on the MoCo-v2 framework and compare it with the weak and strong baselines. As shown in Table 3, our proposed MosRep outperforms both baselines on all eight datasets and achieves better transfer performance than the supervised pre-trained model on CIFAR-10, CIFAR-100, STL-10, Food-101, and DTD, which demonstrates that our method effectively improves generalization. Besides, we also conduct experiments on the BYOL framework. Although BYOL is a state-of-the-art self-supervised learning framework that already achieves excellent transfer performance on the eight datasets, our approach further improves generalization on most datasets, even outperforming the IN-1K supervised model across the board.
COCO Object Detection & Instance Segmentation Following the procedure outlined in (Chen et al., 2020b; He et al., 2020; Chen & He, 2021), we fine-tune a Mask R-CNN (R50-FPN) (He et al., 2017; Lin et al., 2017; Wu et al., 2019) under the 1× schedule and add new Batch Normalization layers before the FPN. We report Average Precision on bounding boxes (AP^bb) and on masks (AP^mk). Table 4 shows the comparison of our proposed method against both baselines. MosRep consistently improves detection and segmentation performance on both the MoCo-v2 and BYOL frameworks. Table 5 illustrates the comparison of MosRep against the supervised pre-training counterpart as well as the weak and strong baselines.

3.4. ABLATIONS

In this section, we study the effect of several design choices in MosRep, including the jittering operation, the α of the beta distribution, and the upper bound θ of the jitter range. We build our method upon MoCo-v2 and pre-train it on IN-100 for 400 epochs. We report the Top-1 accuracy of IN-100 linear probing.

Effect of α in the Beta Distribution We vary α over the set {0.3, 0.5, 1.0, 2.0} and set α = 0.5 for the above experiments. A smaller α means a greater likelihood of a large jitter, and a larger α means the opposite. As illustrated in Figure 3, MosRep with the jittering operation achieves better transfer performance than the variant without it, which demonstrates that this operation increases the variance of the mosaic view and prevents the model from memorizing the spatial location of each small crop.

The Jitter Range We further study the effect of different jitter ranges. As shown in the right part of Figure 3, with an increasing jitter range, MosRep achieves better Top-1 accuracy in linear probing. However, when the jitter range is too large, more false positives with no overlapping area between positive pairs are likely to occur, which degrades the quality of the visual representations.

4. CONCLUSIONS

In this paper, we design a mosaic data augmentation strategy that focuses on small parts of input images and adequately enriches the background information. Based on this strategy, we develop a simple and effective mosaic representation learning framework for self-supervised visual pre-training. Our method can be flexibly combined with other self-supervised learning frameworks, e.g., MoCo-v2 and BYOL. Extensive experimental results demonstrate that our method consistently boosts the performance on a series of downstream tasks, including linear probing, nearest neighbor, semi-supervised learning, and transfer learning. We hope this work can inspire future research on view design, considering its significant role in self-supervised learning. The main limitation is that, although we achieve promising results on various image-level downstream tasks, we do not observe noticeable gains on pixel-level downstream tasks, especially when pre-training on large-scale datasets. This phenomenon implies that image-level self-supervised learning is sub-optimal for pixel-level downstream tasks. We will study this problem in our future research.

… background noise can enrich the background information, it cannot effectively reduce the redundant mutual information and is easy to optimize.

Different input sizes of small crops. We vary the input size of the small crops from 64 × 64 to 112 × 112 and pre-train a ResNet-50 on ImageNet-100 for 400 epochs. As illustrated in Table 1, decreasing the input size of the small crops causes performance degradation in linear probing and 1-NN. Considering the limited resources, we do not further increase the input size of the small crops.

D PSEUDO-CODES OF MOSAIC AUGMENTATION STRATEGY

Algorithm 1: Mosaic augmentation strategy
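The algorithm body did not survive extraction; the following Python sketch reconstructs the strategy described in Section 2.2 — sample M small crops per image, shuffle them across the batch, group them so that each group holds M crops from distinct images, and lay each group out on a 2 × 2 grid while recording crop coordinates. The cyclic-shift grouping and all names here are illustrative choices, not the released implementation:

```python
import random

CROP = 112  # small-crop size; with M = 4 the mosaic view is 224 x 224

def mosaic_groups(num_images, m=4, rng=random):
    # Assign m small crops per image to groups of size m such that the
    # crops in every group come from m different images (needs m <= num_images).
    # Returns one group per mosaic view: a list of (image_index, (t, l, b, r))
    # pairs, where the box is the crop's location inside the 2x2 mosaic grid.
    assert m <= num_images
    order = list(range(num_images))
    rng.shuffle(order)                       # shuffle crops across the batch
    groups = []
    for g in range(num_images):
        group = []
        for j in range(m):                   # cyclic shift => distinct images
            img = order[(g + j) % num_images]
            t, l = (j // 2) * CROP, (j % 2) * CROP
            group.append((img, (t, l, t + CROP, l + CROP)))
        groups.append(group)
    return groups

groups = mosaic_groups(num_images=8, m=4)
```

The recorded boxes are what the RoI Align step later uses to cut each crop's feature back out of the mosaic feature map; the view-jitter offsets would simply be added to every box.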



https://github.com/pytorch/examples/blob/master/imagenet/main.py



Figure 1: (a) The standard view is generated from the input image by applying the strategy used in (He et al., 2020). (b) The multi-crop view is generated from the input image by the multi-crop strategy (Caron et al., 2020). (c) The mosaic view is generated from the input image by our designed augmentation strategy. Although the multi-crop view encourages the "local-to-global" correspondences, it overlooks the diverse background information. In contrast, the mosaic view effectively enriches the background of each crop, which facilitates the extraction of discriminative features more than the multi-crop strategy does. (i), (ii) and (iii) denote the activation maps of MoCo-v2, MoCo-v2 with the multi-crop strategy, and MoCo-v2 with our MosRep. Qualitatively, MosRep achieves better localization than the other two methods, demonstrating that it effectively extracts discriminative features and learns high-quality representations.

Figure 2: The framework of MosRep. The upper figure illustrates the pipeline of the mosaic augmentation strategy. The lower figure illustrates the architecture of the MosRep that is built on the MoCo-v2 (Chen et al., 2020b). Following their training setting, the blue encoder and projector are updated by the SGD optimizer, and the green encoder is updated by the exponential moving average strategy. Best viewed in color.

Following the setting utilized in ReSim (Xiao et al., 2021), we fine-tune a Mask R-CNN (R50-FPN) (He et al., 2017; Lin et al., 2017; Wu et al., 2019), add Batch Normalization layers before the FPN, and synchronize all Batch Normalization layers during training.

Figure 3: Built on MoCo-v2, we train MosRep on IN-100 for 400 epochs and conduct linear probing on IN-100. The orange dashed line denotes MosRep without the jitter operation.


Table 5: Evaluation on Cityscapes instance segmentation. We adopt a ResNet-50 (He et al., 2016) as our backbone and pre-train it on ImageNet-100 and ImageNet-1K (Deng et al., 2009) for 400 and 200 epochs, respectively. To evaluate the effectiveness, we set the standard two-crop version as a weak baseline and the multi-crop version as a strong baseline. Each pre-trained model is transferred to a Mask R-CNN R50-FPN model and subsequently fine-tuned on the Cityscapes (Cordts et al., 2016) train set and evaluated on the Cityscapes val set. Average Precision on masks (AP^mk) is used as the benchmark metric. All results are the average of three trials. * denotes the strong baseline.

5. ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers and the meta-reviewer for their constructive feedback and encouraging comments on this work. Zhaoqing Wang was supported by OPPO Research Institute. Jun Yu is sponsored by the Natural Science Foundation of China (62276242), CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-016B, CAAIXSJLJJ-2022-001A), Anhui Province Key Research and Development Program (202104a05020007), and USTC-IAT Application Sci. & Tech. Achievement Cultivation Program (JL06521001Y). Mingming Gong was supported by ARC DE210101624. Tongliang Liu was partially supported by Australian Research Council Projects IC-190100031, LP-220100527, DP-220102121, and FT-220100318.

A RELATED WORK

Self-supervised learning methods have been extensively studied to close the gap with supervised learning. These methods can be mainly categorized into several groups, including contrastive-based, consistency-based, clustering-based and generative-based methods.

A.1 CONTRASTIVE LEARNING

Contrastive self-supervised learning has emerged as a promising approach to unsupervised visual representation learning. Each image is treated as an individual class in an instance discrimination setting, whose core idea is to pull positive pairs together and push negative pairs apart in the embedding space. As studied in many previous works (Chen et al., 2020b; He et al., 2020; Chen et al., 2020a; Hu et al., 2021; Li et al., 2022), constructing high-quality positive and negative pairs is essential for achieving higher transfer performance. Specifically, InstDisc (Wu et al., 2018) increases the number of negative pairs by building a memory bank that stores pre-computed embeddings, thus improving performance. Following this line, MoCo (He et al., 2020) utilizes a momentum update mechanism to maintain a large queue of negatives for contrastive learning. This momentum encoder greatly improves the quality and consistency of negative pairs and achieves remarkable performance compared to previous works. SimCLR (Chen et al., 2020a) improves further in a straightforward way, directly adopting negative samples from the current batch with a much bigger batch size. Besides, researchers carefully construct a rich family of data augmentations on cropped images, significantly boosting classification accuracy. MoCo-v2 (Chen et al., 2020b) also improves performance over MoCo (He et al., 2020) by adopting the same data augmentation and MLP head design.
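The InfoNCE objective underlying these methods can be written compactly. Below is a minimal NumPy sketch (the function name `info_nce` and shapes are our own illustration): the positive similarity is placed at index 0 of the logits, so the loss is cross-entropy against label 0.

```python
import numpy as np

def info_nce(q, k_pos, k_neg, tau=0.2):
    """InfoNCE loss for one batch.

    q:     (N, D) query embeddings
    k_pos: (N, D) positive key embeddings (other view of the same images)
    k_neg: (K, D) negative key embeddings (e.g., a MoCo-style queue)
    """
    def l2n(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    q, k_pos, k_neg = l2n(q), l2n(k_pos), l2n(k_neg)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (N, 1)
    l_neg = q @ k_neg.T                                # (N, K)
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    # positives sit at index 0 -> cross-entropy against label 0
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
k_neg = rng.normal(size=(32, 16))
loss = info_nce(q, q, k_neg)  # perfectly aligned positives -> low loss
```

Pulling positives together and pushing negatives apart both follow from minimizing this single cross-entropy term.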

A.2 CONSISTENCY REGULARIZATION

Different from other contrastive-based methods, BYOL (Grill et al., 2020) trains an online network to predict the output of a target network, with the latter slowly updated with momentum. The authors assume that the additional predictor on the online network and the momentum encoder are essential to avoid collapsed solutions without negatives. SimSiam (Chen & He, 2021) further explores simple Siamese networks that can learn transferable representations without negative samples, large batches or momentum encoders. The role of the stop-gradient is emphasized in preventing collapse. BarlowTwins (Zbontar et al., 2021) avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.

A.3 CLUSTERING

These methods, such as DeepCluster (Caron et al., 2018), SWAV (Caron et al., 2020) and SELA (Ma et al., 2019), perform contrastive-like comparisons without the requirement to compute all pairwise distances. Specifically, these methods cluster the data while simultaneously enforcing consistency between cluster assignments produced for different distortions of the same image, instead of directly comparing features as in contrastive learning. Clustering methods are also prone to collapse (e.g., empty clusters in k-means), and avoiding this relies on careful implementation details. Online clustering methods like SWAV can be trained with large or small batches, but require storing features when the number of clusters is much larger than the batch size.
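A common ingredient for avoiding empty-cluster collapse in online clustering is Sinkhorn-Knopp normalization, which softly equalizes the mass assigned to each cluster. A minimal NumPy sketch (our own simplification of the SWAV-style procedure; the name `sinkhorn` and the default hyperparameters are illustrative):

```python
import numpy as np

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn-Knopp normalization of a (B, K) score matrix into soft
    cluster assignments whose cluster masses are (approximately) equal,
    which is how online clustering methods avoid empty-cluster collapse."""
    Q = np.exp(scores / eps).T          # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True); Q /= K   # equalize cluster mass
        Q /= Q.sum(axis=0, keepdims=True); Q /= B   # per-sample distribution
    return (Q * B).T                    # (B, K); each row sums to 1

rng = np.random.default_rng(0)
assignments = sinkhorn(rng.normal(size=(16, 4)))
```

Each row is then used as a soft target for the cluster-assignment prediction of the other augmented view.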

A.4 GENERATIVE & MASK IMAGE MODELING

The generative methods typically adopt auto-encoders (Kingma et al., 2019) and adversarial learning to train unsupervised representations. Usually, they focus on the pixel-wise information of images to distinguish images from different classes. For instance, BigBiGAN (Donahue & Simonyan, 2019) adopts BiGAN to capture the relationship between latent semantic representations and input images. Motivated by the great success of BERT, masked image modeling (MIM) (Xie et al., 2022; Bao et al., 2021; He et al., 2022) has become a new trend in self-supervised visual pre-training, which randomly masks parts of an image and reconstructs them based on the corrupted image. ViT (Dosovitskiy et al., 2021) attempts to adopt masked patch prediction for self-supervised learning. BEiT (Bao et al., 2021) predicts the discrete tokens of masked patches resorting to an off-the-shelf discrete VAE. Instead of discretizing the visual information, MAE (He et al., 2022) and SimMIM (Xie et al., 2022) propose to directly predict pixel-level values as the reconstruction target.

Published as a conference paper at ICLR 2023

Table 6: Evaluation on ImageNet linear probing. By plugging into DINO (Caron et al., 2021), we adopt a ViT-S/16 (Dosovitskiy et al., 2021) as our backbone and pre-train it on middle-scale and large-scale datasets, i.e., ImageNet-100 and ImageNet-1K (Deng et al., 2009), for 200 epochs. To thoroughly verify the effectiveness, we set the standard two-crop version as a weak baseline and a multi-crop version as a strong baseline. We report the top-1 accuracy of linear probing and 1-NN on ImageNet-100 and ImageNet-1K, respectively. (Table body not recoverable; columns: Method, ImageNet-100, ImageNet-1K.) * denotes the strong baseline.
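The random masking step shared by these MIM methods is simple to state. Below is a minimal NumPy sketch in the spirit of MAE's per-sample random masking (the name `random_masking` and shapes are our own illustration, not the released code): shuffle patch indices, keep the first fraction, and record a boolean mask for the decoder.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly drop a fraction of patch tokens, MAE-style: the encoder
    sees only the visible patches; the mask and kept indices allow the
    decoder to restore the original order later."""
    rng = rng or np.random.default_rng(0)
    N, _ = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = np.argsort(rng.random(N))  # random permutation of patches
    ids_keep = ids_shuffle[:n_keep]
    mask = np.ones(N, dtype=bool)
    mask[ids_keep] = False                   # False = visible, True = masked
    return patches[ids_keep], mask, ids_keep

tokens = np.arange(196 * 4, dtype=float).reshape(196, 4)  # 14x14 patch tokens
visible, mask, ids = random_masking(tokens)
print(visible.shape)  # (49, 4)
```

The reconstruction loss is then computed only on the masked positions, whether the target is discrete tokens (BEiT) or raw pixels (MAE, SimMIM).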

B ADDITIONAL EVALUATION ON IMAGENET

We build our method on DINO (Caron et al., 2021). For a fair comparison, we set a standard two-crop version as a weak baseline and a multi-crop version as a strong baseline. We perform self-supervised pre-training and linear probing on both ImageNet-100 and ImageNet-1K. Following Caron et al. (2021), we adopt ViT-S/16 as the encoder and pre-train for 200 epochs. The batch size is 1024. We report the top-1 accuracy for linear probing and 1-NN. As illustrated in Table 8, MosRep surpasses both baselines by clear margins on linear probing and 1-NN.

C ADDITIONAL ABLATION STUDY

Simple background noise. We conduct an ablation study that adds simple background noise (e.g., Gaussian noise) to the small crops. Experimental results in Table 1 show that adding simple background noise cannot improve linear probing performance. We argue that although simple

