MASKED IMAGE MODELING WITH DENOISING CONTRAST

Abstract

Throughout the evolution of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), the essence has not changed: how to design proper pretext tasks for vision dictionary look-up. MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch-level visual context capturing of the network via a denoising auto-encoding mechanism. Rather than tailoring image tokenizers with extra training stages as in previous works, we unleash the great potential of contrastive learning on denoising auto-encoding and introduce a pure MIM method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the sole learning objectives for masked patch prediction. We further strengthen the denoising mechanism with asymmetric designs, including asymmetric image perturbations and asymmetric model progress rates, to improve the network pre-training. ConMIM-pretrained models of various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks; e.g., on ImageNet-1K classification, we achieve 83.9% top-1 accuracy with ViT-Small and 85.3% with ViT-Base without extra data for pre-training.

1. INTRODUCTION

The great success of self-supervised learning in natural language processing (NLP) tasks, e.g., BERT (Devlin et al., 2019) and GPT (Radford et al., 2018; 2019), has sparked several revolutions in visual representation learning, during which the development of vision dictionary look-up is the most critical. In the era of convolutional neural networks (CNNs) (He et al., 2016; Krizhevsky et al., 2012), prominent works (He et al., 2020; Chen et al., 2020) perform self-supervised learning with a pretext task of instance-level dictionary look-up via contrastive learning, as demonstrated in Figure 1(a). With the advent of vision Transformers (ViTs) (Dosovitskiy et al., 2021), the gap between vision and NLP tasks has been further narrowed since the introduction of patch-level dictionary look-up via masked image modeling in the pioneering work BEiT (Bao et al., 2022) (see Figure 1(b)). Masked image modeling (Bao et al., 2022), inspired by masked language modeling (Devlin et al., 2019) in NLP tasks, ushers in a new trend for self-supervised learning with vision Transformers (Dosovitskiy et al., 2021): a portion of vision tokens are randomly masked and then recovered by the Transformer network being trained. Concurrent works (Dong et al., 2021; Li et al., 2022; Wei et al., 2022) make efforts to design patch-level dictionaries, i.e., image tokenizers, to build proper learning objectives (i.e., vision token ids) for masked image modeling. Though advanced results can be achieved, the off-the-shelf image tokenizers, e.g., the discrete VAE (Ramesh et al., 2021) used in BEiT (Bao et al., 2022), depend on extra training stages and data knowledge, rendering an inflexible two-stage pre-training paradigm. We would like to call for a revisit of the superiority of masked image modeling over contrastive learning for self-supervised learning with vision Transformers.
Since both are essentially designed towards vision dictionary look-up, the key difference lies in the patch-level denoising auto-encoding mechanism of masked image modeling, which encourages the network's capability to capture fine-grained visual context and semantics. As for the auto-encoding objective, we do not have to intentionally discretize the continuous visual signals, as is done for words in NLP tasks, to cast the masked prediction as a classification task. Instead, we can give full play to the wisdom of contrastive learning, which is well suited to structuring the visual space with semantically meaningful representations. To this end, we introduce a new pre-training method for masked image modeling, namely ConMIM, to get rid of extra tokenizing networks by revitalizing contrastive learning, as shown in Figure 1(c). ConMIM casts masked patch prediction in self-supervised image pre-training as denoising contrastive learning. The corrupted input, with a large proportion of patches masked, is fed into the encoder, a plain vision Transformer in general. The encoder learns to recover the representations of the masked patches, whose targets are obtained by feeding the full input into the encoder. The training objective is an intra-image inter-patch contrastive loss. To be specific, the patch representations of a full input image build a dynamic dictionary; for each masked patch of the corrupted input, the patch at the same position in the full input serves as its positive key, while the patches at different positions in the same image are the negative keys. To further improve the network via a stronger denoising auto-encoding mechanism, we introduce asymmetric designs into ConMIM training, including asymmetric image perturbations and asymmetric model progress rates. We adopt a strong augmentation for the full input and a weak augmentation for the corrupted input.
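The intra-image inter-patch objective described above can be sketched as follows. This is a minimal PyTorch-style illustration under our own assumptions, not the official implementation: the function name `conmim_patch_loss`, the temperature value, and the tensor shapes are hypothetical, and details such as prediction heads and stop-gradients are omitted.

```python
import torch
import torch.nn.functional as F


def conmim_patch_loss(pred_patches, full_patches, masked_idx, tau=0.1):
    """Intra-image inter-patch contrastive loss (sketch).

    pred_patches: (M, D) encoder outputs at the M masked positions of the
        corrupted view.
    full_patches: (N, D) patch representations of the full view, forming
        the per-image dynamic dictionary of keys.
    masked_idx:   (M,) positions of the masked patches; the patch at the
        same position in the full view is the positive key, and all other
        patches of the same image act as negative keys.
    """
    pred = F.normalize(pred_patches, dim=-1)
    keys = F.normalize(full_patches, dim=-1)
    # Cosine similarity of each masked-patch prediction to every key.
    logits = pred @ keys.t() / tau  # (M, N)
    # Cross-entropy pulls each prediction toward its same-position key
    # and pushes it away from the other patches in the image.
    return F.cross_entropy(logits, masked_idx)
```

A prediction that exactly matches its same-position key in the full view yields a near-zero loss, while mismatched predictions are penalized, which is the dictionary look-up behavior the method relies on.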
For the image encoder, a slowly progressing momentum encoder (He et al., 2020) is employed for the full input to embed more challenging but semantically consistent learning targets. We perform self-supervised learning with ConMIM on ImageNet (Deng et al., 2009), and then fine-tune the pre-trained vision Transformers of various scales on image classification, semantic segmentation, object detection, and instance segmentation. Unlike methods that employ large models with super-scale extra data knowledge, ConMIM excels especially with small-scale architectures, which pose a more challenging task for effective pre-training as well as a more practical setting for real-world applications. With a vanilla ViT-Small model, we achieve 83.9% top-1 accuracy using only ImageNet-1K, suggesting that useful knowledge is effectively exploited from the data. This significantly outperforms the baseline BEiT (Bao et al., 2022) and comparable tokenizer-free MIM methods (e.g., MAE (He et al., 2022), iBOT (Zhou et al., 2022)) due to the stronger semantically structured regularization in ConMIM. Beyond the promising results, we would like to draw public attention to unleashing the great potential of "outdated" contrastive learning in visual representation learning.

2. RELATED WORK

Self-supervised learning via vision dictionary look-up. The pretext task of contrastive learning (Chen et al., 2020; He et al., 2020; Caron et al., 2020) dominated self-supervised visual pre-training in the era of CNNs. Contrastive learning methods generally perform instance-level dictionary look-up: the anchors are pulled closer to their positive keys while being pushed away from the negative keys. The establishment of vision dictionaries is critical for the contrastive regularization. For example, the seminal work MoCo (He et al., 2020) builds the vision dictionary with a first-in-first-out queue, driven by a momentum encoder. The concurrent work SimCLR (Chen et al.,



Figure 1: Conventional contrastive learning methods (e.g., MoCo (He et al., 2020), SimCLR (Chen et al., 2020)) and masked image modeling methods (e.g., BEiT (Bao et al., 2022), PeCo (Dong et al., 2021)) both perform the pretext task of vision dictionary look-up, where the superiority of the latter lies in the patch-level denoising auto-encoding mechanism, which enables fine-grained visual context understanding in vision Transformers (Dosovitskiy et al., 2021). We propose to cast masked image modeling as denoising contrastive learning to avoid the extra training stages of an image tokenizer, rendering a flexible, simple, and effective pre-training paradigm.

Availability

Code will be available at https://github.com/TencentARC

