MOSAIC REPRESENTATION LEARNING FOR SELF-SUPERVISED VISUAL PRE-TRAINING

Abstract

Self-supervised learning has achieved significant success in learning visual representations without the need for manual annotations. To obtain generalizable representations, a meticulously designed data augmentation strategy is one of the most crucial parts. Recently, multi-crop strategies that utilize a set of small crops as positive samples have been shown to learn spatially structured features. However, such strategies overlook the diversity of contextual backgrounds, which reduces the variance of the input views and degrades performance. To address this problem, we propose a mosaic representation learning framework (MosRep), consisting of a new data augmentation strategy that enriches the background of each small crop and improves the quality of the learned representations. Specifically, we randomly sample a number of small crops from different input images and compose them into a mosaic view, which is equivalent to introducing different background information for each small crop. Additionally, we jitter the mosaic view to prevent the model from memorizing the spatial location of each crop. Over the course of optimization, our MosRep gradually extracts more discriminative features. Extensive experimental results demonstrate that our method outperforms the multi-crop strategy by a large margin on a series of downstream tasks, e.g., by +7.4% and +4.9% on ImageNet-1K with 1% and 10% labels, respectively. Code is available at https://github.com/DerrickWang005/MosRep.git.

1. INTRODUCTION

High-quality representation learning (Bengio et al., 2013) is a fundamental task in machine learning. A tremendous number of visual recognition models have achieved promising performance by learning from large-scale annotated datasets, e.g., ImageNet (Deng et al., 2009) and OpenImage (Kuznetsova et al., 2020). However, a great deal of challenges exist in collecting large-scale datasets with annotations, e.g., label noise (Liu & Tao, 2015; Natarajan et al., 2013; Xia et al., 2019), high cost (Zhu et al., 2019) and privacy concerns (Liang et al., 2020). To address these issues, self-supervised learning (SSL) has been proposed to learn generic representations without manual annotation. Recent progress in visual self-supervised learning (Caron et al., 2020; He et al., 2020; Grill et al., 2020; Chen & He, 2021; Bai et al.) shows remarkable potential and achieves results comparable to supervised learning. Among these SSL methods, a common underlying idea is to extract invariant feature representations from different augmented views of the same input image. Contrastive learning (Dosovitskiy et al., 2015; Wu et al., 2018; Chen et al., 2020a; He et al., 2020; Wang et al., 2022) is one of the most commonly used approaches. These methods define 'positive' and 'negative' pairs and apply a contrastive loss (i.e., InfoNCE (Hénaff et al., 2019)) for optimization, where the 'positive' pairs are pulled close and the 'negative' pairs are pushed away. Another line of work, such as BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021), introduces the concept of asymmetry, which obviates the need to design negatives. These methods attach an extra 'predictor' to the model and update the parameters using one augmented view, while the feature of the other augmented view serves as fixed supervision.
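To make the contrastive objective concrete, the InfoNCE loss for a single positive pair can be sketched in a few lines of pure Python (the function name, the toy embeddings, and the temperature value below are our own illustrative choices, not taken from any particular implementation):

```python
import math

def infonce_loss(query, positive, negatives, temperature=0.2):
    """InfoNCE: -log( exp(q.pos/t) / (exp(q.pos/t) + sum_k exp(q.neg_k/t)) ).

    `query`, `positive`, and each entry of `negatives` are L2-normalized
    embedding vectors (plain lists of floats); `temperature` scales the
    cosine similarities before the softmax.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    pos_logit = math.exp(dot(query, positive) / temperature)
    neg_logits = sum(math.exp(dot(query, n) / temperature) for n in negatives)
    return -math.log(pos_logit / (pos_logit + neg_logits))

# Toy 2-D unit vectors: an aligned positive with a dissimilar negative
# yields a much smaller loss than a misaligned positive.
q = [1.0, 0.0]
loss_easy = infonce_loss(q, [1.0, 0.0], [[-1.0, 0.0]])
loss_hard = infonce_loss(q, [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss pulls the query toward its positive and pushes it away from the negatives, which is the mechanism the views produced by data augmentation plug into.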
Besides, clustering methods (Caron et al., 2018; 2020; Asano et al., 2019; Li et al., 2020) adopt two augmented views of the same image as the prediction and the pseudo cluster label and enforce consistency between the two views. We present more related works in the appendix. It is worth noting that a carefully-designed data augmentation strategy is an essential part of the above self-supervised learning frameworks. SimCLR (Chen et al., 2020a) and InfoMin (Tian et al., 2020) empirically investigate the impact of different data augmentations and observe that SSL benefits more from strong data augmentations than supervised learning does. After that, SwAV (Caron et al., 2020) proposes the multi-crop strategy, which achieves significant performance on downstream tasks. As shown in Figure 1 (a) and (b), they use two standard-resolution crops and sample several small crops covering local regions of the input image in order to encourage "local-to-global" correspondences. However, the small crops overlook the diversity of backgrounds and decrease the variance of the views; such overly similar views are trivial for learning discriminative features. Intuitively, if we can account for both the "local-to-global" correspondences and the diverse contextual backgrounds, the quality of the learned representations can be further improved. In this paper, we propose a mosaic representation learning framework (MosRep) consisting of a new data augmentation strategy, which enriches the contextual background of each small crop while encouraging the "local-to-global" correspondences. Specifically, we first sample M (e.g., M = 4) small crops from each input image in the current batch. Then, these crops are randomly shuffled and divided into multiple groups. Each group contains M crops, and we ensure that the small crops in each group come from different input images.
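The shuffle-and-group step above can be realized in several ways; one simple scheme that guarantees the M crops in every group come from M different images is a cyclic assignment over a shuffled image order. The sketch below is our own illustration (function and variable names are hypothetical), not the paper's implementation:

```python
import random

def make_mosaic_groups(num_images, crops_per_image=4, seed=0):
    """Partition all (image, crop-slot) pairs into groups of size M so that
    each group's M crops come from M distinct images.

    Shuffles the image order, then group g takes crop slot j from image
    order[(g + j) % N]; since the offsets j are distinct modulo N (M <= N),
    no image repeats inside a group, and every crop is used exactly once.
    """
    assert num_images >= crops_per_image, "need at least M images per batch"
    rng = random.Random(seed)
    order = list(range(num_images))
    rng.shuffle(order)  # randomize which images end up grouped together
    groups = []
    for g in range(num_images):  # N groups of M crops each
        groups.append([(order[(g + j) % num_images], j)
                       for j in range(crops_per_image)])
    return groups

groups = make_mosaic_groups(num_images=8, crops_per_image=4)
```

Each returned group is then composed into one mosaic view, so a batch of N images with M crops each yields N mosaic views.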
Subsequently, as illustrated in Figure 1 (c), we combine the small crops of the same group into a single view, which we term the mosaic view. Finally, we further jitter the mosaic view in order to prevent the model from memorizing the spatial position of each small crop. In the forward pass, the mosaic view is fed into the model for feature extraction, and we adopt the RoI Align operator to extract the feature of each crop from the mosaic view and project this feature into an embedding space. To minimize the loss function (e.g., the contrastive loss), the model gradually learns to capture more discriminative features (i.e., foreground objects) from the complex backgrounds, improving the quality of the visual representations, as shown in Figure 1 (d). In summary, the main contributions of this paper are as follows:
1. We design a mosaic augmentation strategy that takes into account both the diversity of backgrounds and the "local-to-global" correspondences.
2. Based on the proposed mosaic augmentation, we propose a practical and effective SSL framework, MosRep, which benefits the extraction of discriminative features and improves the quality of visual representations.
3. We build our proposed method upon two different SSL frameworks and validate its effectiveness. Experimental results show that our method achieves superior gains compared to the multi-crop strategy on various downstream tasks.
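The composition and jitter steps can be pictured geometrically: with M = 4, the crops tile a 2x2 canvas, and shifting the dividing point perturbs every crop's location, so no crop sits at a fixed position. The resulting boxes are exactly the regions a RoI Align operator would pool from the mosaic feature map. The sketch below is a minimal illustration under our own assumptions (canvas size, jitter range, and the dividing-point jitter scheme are hypothetical, not the paper's exact implementation):

```python
import random

def mosaic_rois(canvas=224, max_jitter=32, seed=0):
    """Return the four RoI boxes (x1, y1, x2, y2) of a jittered 2x2 mosaic.

    The nominal dividing point is the canvas center; jittering it shifts
    every quadrant, preventing the model from memorizing fixed crop
    positions. Each crop is assumed to be resized to fill its quadrant,
    and each box is later used by RoI Align to pool that crop's feature
    from the mosaic feature map.
    """
    rng = random.Random(seed)
    cx = canvas // 2 + rng.randint(-max_jitter, max_jitter)
    cy = canvas // 2 + rng.randint(-max_jitter, max_jitter)
    # top-left, top-right, bottom-left, bottom-right quadrants
    return [(0, 0, cx, cy), (cx, 0, canvas, cy),
            (0, cy, cx, canvas), (cx, cy, canvas, canvas)]

rois = mosaic_rois()
```

Because the four boxes always tile the canvas exactly, every pixel of the mosaic view belongs to exactly one crop regardless of the jitter drawn.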



Figure 1: (a) The standard view is generated from the input image by applying the strategy used in (He et al., 2020). (b) The multi-crop view is generated from the input image by the multi-crop strategy (Caron et al., 2020). (c) The mosaic view is generated from the input image by our designed augmentation strategy. Although the multi-crop view encourages the "local-to-global" correspondences, it clearly overlooks the diverse background information. In contrast, the mosaic view effectively enriches the background of each crop, which facilitates the extraction of discriminative features more than the multi-crop strategy does. (i), (ii) and (iii) denote the activation maps of MoCo-v2, MoCo-v2 with the multi-crop strategy, and MoCo-v2 with our MosRep, respectively. Qualitatively, MosRep achieves better localization than the other two methods, demonstrating that our method effectively extracts discriminative features and learns high-quality representations.

