CONTEXTUAL MASKED IMAGE MODELING VIA SYNERGIZED CONTRASTING WITHOUT VIEW AUGMENTATION FOR FASTER AND BETTER VISUAL PRETRAINING

Abstract

We propose a new contextual masked image modeling (MIM) approach, called contrasting-aided contextual MIM (ccMIM), for visual pretraining. Specifically, we use importance sampling to select masked patches that carry richer semantic information for reconstruction, instead of the random sampling used in previous MIM works. The resulting task of reconstructing these patches from the remaining, less semantic ones is therefore more difficult and encourages stronger representation learning. To counteract the slower convergence that this harder reconstruction task may cause, we further propose a new contrastive loss that aligns the vision transformer tokens extracted from the selected masked patches and from the remaining ones, respectively. This loss serves as a regularizer for patch feature learning, so that image-level global information is captured in both masked and unmasked patches; notably, such single-view contrasting avoids the tedious image augmentation step required by recent efforts that introduce contrastive learning into MIM (to speed up convergence and improve discriminative ability). Meanwhile, the attention scores derived from the contrastive global feature carry effective semantic cues that in turn guide our masked-patch selection scheme. Consequently, our contextual MIM and contrastive learning are performed synergistically in a loop (semantic patch selection, then token-alignment contrasting) to achieve the best of both worlds: fast convergence and strong performance on downstream tasks without ad-hoc augmentations, as verified by empirical results on ImageNet-1K for both classification and dense vision tasks.
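The two ingredients above can be illustrated with a minimal sketch: per-patch importance scores (e.g., attention from a global token to each patch) rank patches so that the most semantic ones are masked, and a single-view alignment loss pulls together the pooled features of the masked and visible groups of the same image. The top-k selection rule and the mean-pooled cosine alignment here are simplifying assumptions for illustration; the paper's actual scoring and contrastive objective may differ.

```python
import numpy as np

def select_masked_patches(attn_scores, mask_ratio=0.5):
    """Importance-based mask selection: mask the patches with the
    highest importance scores (e.g., global-token attention averaged
    over heads), so reconstruction targets are the most semantic ones.

    attn_scores: (N,) array of per-patch importance.
    Returns (masked_idx, visible_idx) as index arrays.
    """
    n = attn_scores.shape[0]
    k = int(n * mask_ratio)
    order = np.argsort(attn_scores)[::-1]  # descending importance
    return order[:k], order[k:]

def single_view_alignment_loss(tokens, masked_idx, visible_idx):
    """Single-view contrastive alignment: align the pooled features of
    the masked and visible patch groups of the SAME image, so no second
    augmented view is needed. Mean pooling + cosine distance is an
    illustrative stand-in for the paper's contrastive loss.

    tokens: (N, D) array of patch token features.
    Returns a scalar loss that is 0 when the two pooled features align.
    """
    z_masked = tokens[masked_idx].mean(axis=0)
    z_visible = tokens[visible_idx].mean(axis=0)
    z_masked = z_masked / np.linalg.norm(z_masked)
    z_visible = z_visible / np.linalg.norm(z_visible)
    return 1.0 - float(z_masked @ z_visible)

# Toy usage: 8 patches, mask the 4 most "semantic" ones.
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6])
masked, visible = select_masked_patches(scores, mask_ratio=0.5)
loss = single_view_alignment_loss(np.random.randn(8, 4), masked, visible)
```

In the full method these two steps run in a loop: the global feature learned by the alignment loss refreshes the attention scores, which in turn drive the next round of patch selection.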

1. INTRODUCTION

Self-supervised learning (SSL) (Zbontar et al., 2021; Jin et al., 2022; Chen et al., 2020b) has attracted increasing attention in deep learning due to its label-free nature and its capacity to learn rich holistic representations. Recent SSL methods mainly fall into two classes. Contrastive methods (He et al., 2020; Chen et al., 2020b; Zhang et al., 2021) construct multiple views of a given image to increase variance and align the representations of these views in latent space. This pair-wise learning paradigm gives contrastive methods strong linear evaluation accuracy. However, these methods tend to capture global information while ignoring local information, which limits their performance on downstream dense vision tasks, e.g., detection and segmentation. Moreover, the two-view learning paradigm is usually sensitive to the choice of augmentation function (Chen et al., 2020c), batch size (Chen et al., 2020b; Zhang et al., 2021), and output dimension (Zbontar et al., 2021; Zhang et al., 2022). With the success of recent ViT-based vision backbones, which divide images into patches, masked image modeling (MIM) methods (He et al., 2021; Fang et al., 2022; Chen et al., 2022) randomly mask some patches and use the self-attention mechanism to

Funding

This work was in part supported by NSFC (62222607), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and SenseTime Collaborative Research Grant.

