MOSAIC REPRESENTATION LEARNING FOR SELF-SUPERVISED VISUAL PRE-TRAINING

Abstract

Self-supervised learning has achieved significant success in learning visual representations without the need for manual annotation. A meticulously designed data augmentation strategy is one of the most crucial components for obtaining generalizable representations. Recently, multi-crop strategies, which use a set of small crops as positive samples, have been shown to learn spatially structured features. However, this strategy overlooks the diversity of contextual backgrounds, which reduces the variance of the input views and degrades performance. To address this problem, we propose a mosaic representation learning framework (MosRep), built on a new data augmentation strategy that enriches the background of each small crop and improves the quality of the learned visual representations. Specifically, we randomly sample a number of small crops from different input images and compose them into a mosaic view, which is equivalent to introducing different background information for each small crop. We further jitter the mosaic view to prevent the model from memorizing the spatial location of each crop. As optimization proceeds, MosRep gradually extracts more discriminative features. Extensive experimental results demonstrate that our method outperforms the multi-crop strategy by a large margin on a series of downstream tasks, e.g., by +7.4% and +4.9% on ImageNet-1K with 1% and 10% labels, respectively. Code is available at https://github.com/DerrickWang005/MosRep.git.
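The mosaic composition described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (function name, crop size, grid layout, and jitter mechanism are ours, not the paper's implementation): each cell of the mosaic is filled with a random crop drawn from a different source image, and the assembled view is then shifted by a small random offset.

```python
import numpy as np

def make_mosaic(images, crop_size=56, grid=2, jitter=8, rng=None):
    """Compose a mosaic view from random small crops of different images.

    Illustrative sketch: a `grid` x `grid` canvas where each cell holds a
    random `crop_size` x `crop_size` crop from a randomly chosen source
    image, so every crop sees a different surrounding background. The
    whole mosaic is then jittered so spatial locations cannot be memorized.
    """
    rng = np.random.default_rng(rng)
    canvas = np.zeros((grid * crop_size, grid * crop_size, 3), dtype=np.uint8)
    for idx in range(grid * grid):
        img = images[rng.integers(len(images))]  # crop source differs per cell
        h, w = img.shape[:2]
        y = rng.integers(0, h - crop_size + 1)
        x = rng.integers(0, w - crop_size + 1)
        r, c = divmod(idx, grid)
        canvas[r * crop_size:(r + 1) * crop_size,
               c * crop_size:(c + 1) * crop_size] = img[y:y + crop_size,
                                                        x:x + crop_size]
    # Jitter: shift the mosaic by a random offset along each spatial axis.
    dy, dx = rng.integers(-jitter, jitter + 1, size=2)
    return np.roll(canvas, (int(dy), int(dx)), axis=(0, 1))
```

In practice the crops would be fed through the usual photometric augmentations before composition; here random uint8 arrays stand in for images.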

1. INTRODUCTION

High-quality representation learning (Bengio et al., 2013) is a fundamental task in machine learning. A tremendous number of visual recognition models have achieved promising performance by learning from large-scale annotated datasets, e.g., ImageNet (Deng et al., 2009) and OpenImage (Kuznetsova et al., 2020). However, many challenges exist in collecting large-scale datasets with annotations, e.g., label noise (Liu & Tao, 2015; Natarajan et al., 2013; Xia et al., 2019), high cost (Zhu et al., 2019), and privacy concerns (Liang et al., 2020). To address these issues, self-supervised learning (SSL) has been proposed to learn generic representations without manual annotation. Recent progress in visual self-supervised learning (Caron et al., 2020; He et al., 2020; Grill et al., 2020; Chen & He, 2021; Bai et al.) shows remarkable potential and achieves results comparable with supervised learning.

A common underlying idea among these SSL methods is to extract feature representations that are invariant across different augmented views of the same input image. Contrastive learning (Dosovitskiy et al., 2015; Wu et al., 2018; Chen et al., 2020a; He et al., 2020; Wang et al., 2022) is one of the most commonly used approaches. These methods define 'positive' and 'negative' pairs and apply a contrastive loss (i.e., InfoNCE (Hénaff et al., 2019)) for optimization, where 'positive' pairs are pulled close and 'negative' pairs are pushed apart. Another line of work, such as BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021), introduces asymmetry and is free from designing negatives: an extra 'predictor' is added behind one branch of the model, and the gradient is propagated through that branch only.
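To make the contrastive objective concrete, here is a minimal NumPy sketch of the InfoNCE loss for a single query embedding. This is our own simplified illustration (the function name and single-query formulation are assumptions, not the paper's code): the loss is cross-entropy over cosine similarities, with the positive pair at index 0.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.2):
    """InfoNCE loss for one query (illustrative single-sample sketch).

    `query` and `positive` are embeddings of two augmented views of the
    same image; `negatives` stacks embeddings of other images. Minimizing
    the loss pulls the positive pair together and pushes negatives away.
    """
    def normalize(v):
        # Project embeddings onto the unit sphere (cosine similarity).
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q, pos = normalize(query), normalize(positive)
    negs = normalize(negatives)
    # Similarity logits: positive first, then all negatives.
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    # Numerically stable cross-entropy with the positive as the target.
    logits -= logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

When the query matches its positive and is orthogonal to the negatives, the loss approaches zero; misaligned pairs yield a large loss, which is what drives the views of one image toward a shared representation.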

