CONTEXTUAL IMAGE MASKING MODELING VIA SYNERGIZED CONTRASTING WITHOUT VIEW AUGMENTATION FOR FASTER AND BETTER VISUAL PRETRAINING

Abstract

We propose a new contextual masked image modeling (MIM) approach, called contrasting-aided contextual MIM (ccMIM), for visual pretraining. Specifically, we adopt importance sampling to select the masked patches with richer semantic information for reconstruction, instead of the random sampling used in previous MIM works. As such, reconstructing these patches from the remaining, less semantic ones becomes a more difficult and hence more instructive task. To counteract the slower convergence that this harder reconstruction task may cause, we further propose a new contrastive loss that aligns the vision transformer tokens extracted from the selected masked patches and the remaining ones, respectively. The loss serves as a regularizer for patch feature learning, so that image-level global information is captured in both masked and unmasked patches; notably, such single-view contrasting avoids the tedious image augmentation step required by recent efforts that introduce contrastive learning into MIM (to speed up convergence and improve discriminability). Meanwhile, the attention scores derived from the contrastive global feature carry effective semantic clues that in turn guide our masked-patch selection scheme. In consequence, our contextual MIM and contrastive learning are performed synergistically in a loop (semantic patch selection, then token-alignment contrasting) to obtain the best of both worlds: fast convergence and strong performance on downstream tasks without ad-hoc augmentations, as verified by empirical results on ImageNet-1K for both classification and dense vision tasks.

1. INTRODUCTION

Self-supervised learning (SSL) (Zbontar et al., 2021; Jin et al., 2022; Chen et al., 2020b) has been attracting increasing attention in deep learning, due to its label-free property and its capability of learning rich holistic representations. Recent SSL methods mainly fall into two classes. Contrastive methods (He et al., 2020; Chen et al., 2020b; Zhang et al., 2021) construct multiple views of a given image to increase variance and align the representations of these views in latent space. The pair-wise learning paradigm endows contrastive methods with strong linear evaluation accuracy. However, these methods tend to extract global information while ignoring local information, which limits their performance on downstream dense vision tasks, e.g. detection and segmentation. Besides, the two-view learning paradigm is usually sensitive to the choice of augmentation function (Chen et al., 2020c), batch size (Chen et al., 2020b; Zhang et al., 2021), and output dimension (Zbontar et al., 2021; Zhang et al., 2022), etc. With the success of recent ViT-based vision backbones, which divide images into patches, Masked Image Modeling (MIM) methods (He et al., 2021; Fang et al., 2022; Chen et al., 2022) randomly mask some patches and use the self-attention mechanism to recover pixel-wise information from the remaining un-masked patches. These MIM-based methods have been shown to obtain better pre-trained encoders than even contrastive approaches, and show especially strong performance on transfer learning and finetuning tasks thanks to their ability to capture local information. However, these methods can suffer from limited discriminability of the global representation, with less competitive linear accuracy (Gao et al., 2022), slow convergence (1,600 epochs for MAE), and high GPU cost (e.g. batch size 4096). Therefore, recent attempts (Wang et al., 2022; Chen et al., 2022; Yi et al., 2022) are devoted to combining contrastive learning and MIM.
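Before turning to these hybrids, the random masking step shared by the MIM methods above can be sketched as follows. This is a minimal illustration in the spirit of MAE, not the exact implementation of any cited work; the function and parameter names are our own:

```python
import numpy as np

def random_masking(patches: np.ndarray, mask_ratio: float = 0.75, seed: int = 0):
    """Randomly split patches into visible and masked sets, MAE-style.

    patches: (B, N, D) array of patch embeddings. Returns the visible patches,
    a binary mask (1 = masked, 0 = visible), and the kept indices.
    """
    rng = np.random.default_rng(seed)
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    visible = np.empty((B, n_keep, D))
    ids_keep = np.empty((B, n_keep), dtype=int)
    mask = np.ones((B, N))
    for b in range(B):
        keep = rng.permutation(N)[:n_keep]   # random subset per image
        ids_keep[b] = keep
        visible[b] = patches[b, keep]
        mask[b, keep] = 0.0                  # mark kept patches as visible
    return visible, mask, ids_keep
```

Only the visible patches are fed to the encoder; the decoder then reconstructs pixel values at the masked positions.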
The hope is that both (global) discriminative information (via contrasting) and spatially sensitive information (via MIM) can be captured. Although these approaches show improved convergence speed (in the sense of fewer epochs needed for convergence) compared with using MIM alone, they can hardly further boost the performance of MIM on downstream tasks even given more (e.g. 800 or 1,600) pretraining epochs. Meanwhile, directly combining contrasting with MIM can also bring side effects, e.g. increased sensitivity to data augmentation and more overhead per training epoch. In this paper, we aim to improve both convergence speed and accuracy by devising a new contextual MIM scheme that is synergistically aided by contrastive learning. The highlights of our proposed ccMIM are:

1) Novel framework for synergizing MIM and contrastive learning in a closed loop: We propose a contextual MIM framework that actively selects semantically rich patches as the masked patches for reconstruction, raising the difficulty of the MIM learning task; the semantic clue is learned and measured by our vision transformer token-alignment contrastive loss. As such, the two components are synergized from the outset. In contrast, existing efforts to combine MIM and contrasting often run the two components independently until their weighted loss is finally computed, and show limited accuracy improvement (perhaps also partly due to their random patch selection strategy, as observed in our ablation study in Sec. 4.3).

2) Cost-effective technical design for contextual MIM: Under the above framework, we use importance sampling to contextually select the patches with richer semantic information for masking and reconstruction, increasing the learning difficulty and effectiveness. The selection is guided by the attentional semantic score derived from [CLS]-token alignment, which serves as contrastive learning between the selected and un-selected patches. Compared to the two-view augmentation widely adopted for contrasting in recent MIM works, our contrasting is performed efficiently within a single view (to speed up MIM training). Moreover, our use of contrasting directly guides the MIM learning, instead of the two being independent components as in the existing literature.

3) Improvement over MAE: Using ViT-B (Dosovitskiy et al., 2020) as the standard protocol in MIM, ccMIM achieves 83.6% and 84.2% top-1 accuracy with 300 and 800 epochs of pretraining, outperforming MAE (He et al., 2021) by 0.8% (82.8% for 300 epochs) and 0.6% (83.6% for 1,600 epochs) on ImageNet-1K, respectively. Source code will be made publicly available.
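As an illustration of the semantics-guided selection, the importance sampling could look like the following sketch. The names and the exact sampling procedure are our own assumptions for illustration, not the paper's verified implementation:

```python
import numpy as np

def select_masked_patches(attn_cls: np.ndarray, mask_ratio: float = 0.75,
                          temperature: float = 1.0, seed: int = 0):
    """Importance-sample the patches to MASK from [CLS]-attention scores.

    attn_cls: (B, N) attention of the [CLS] token over patches, used as a
    per-patch semantic score; higher-scored (semantically richer) patches are
    more likely to be masked, making reconstruction harder.
    """
    rng = np.random.default_rng(seed)
    B, N = attn_cls.shape
    n_mask = int(N * mask_ratio)
    logits = attn_cls / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)      # softmax per image
    ids_mask = np.stack([rng.choice(N, size=n_mask, replace=False, p=probs[b])
                         for b in range(B)])
    mask = np.zeros((B, N))
    for b in range(B):
        mask[b, ids_mask[b]] = 1.0                        # 1 = masked patch
    return ids_mask, mask
```

Sampling (rather than a hard top-k cut) keeps some stochasticity in which semantic patches are hidden, while still biasing the reconstruction task toward the harder, information-rich regions.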

2. RELATED WORK

In our work, the proposed ccMIM divides image patches into two sets by information density and learns two objective functions: one aligns the global representations of the two sets, and the other reconstructs the raw image from the visible set. We therefore briefly review contrastive learning (which aligns representations of multiple augmented views in latent space) and MIM-based methods (reconstruction). Contrastive learning aims to learn instance-discriminative representations that distinguish an image from all others (Hjelm et al., 2018). This is achieved by pulling together the representations of different views of the same image and pushing away those of other images. Thus, most contrastive methods adopt a siamese network (He et al., 2020; Grill et al., 2020; Chen & He, 2021). To create different views of the same image, well-designed data augmentation functions have been deployed (e.g., those investigated in SimCLR (Chen et al., 2020b)). To increase the number of negative pairs, MoCo (He et al., 2020; Chen et al., 2020c) constructs a large queue to store negative representations in memory. Besides, MoCo and BYOL adopt momentum and stop-gradient mechanisms to prevent degenerate solutions (Tian et al., 2021; Wang & Isola, 2020). To simplify BYOL, SimSiam (Chen & He, 2021) discards the momentum updating and finds that the stop-gradient and predictor head are the keys to preventing collapse. MoCo-v3 (Chen et al., 2021) and DINO (Caron et al., 2021) are based on the siamese network and extend MoCo (He et al., 2020) and BYOL (Grill et al., 2020) with Vision Transformers.
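The view-alignment objective underlying these contrastive methods can be illustrated with a minimal InfoNCE-style loss. This is a generic sketch of the standard formulation, not any specific paper's implementation:

```python
import numpy as np

def info_nce(z1: np.ndarray, z2: np.ndarray, tau: float = 0.1) -> float:
    """Minimal InfoNCE-style alignment loss: z1[i] and z2[i] are global
    representations of two views of image i (the positive pair); every other
    row in the batch serves as a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                              # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # -log p(positive)
```

The loss decreases as matching views become more similar than non-matching ones; ccMIM applies the same alignment idea within a single view, between the masked and un-masked patch sets, rather than between two augmented images.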

FUNDING

This work was in part supported by NSFC (62222607), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and SenseTime Collaborative Research Grant.

