CORRUPTED IMAGE MODELING FOR SELF-SUPERVISED VISUAL PRE-TRAINING

Abstract

We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. Instead of using artificial [MASK] tokens, CIM corrupts the input image with a small trainable auxiliary generator based on BEiT (Bao et al., 2021): some patches are randomly selected and replaced with plausible alternatives sampled from the generator's output distribution. Given this corrupted image, an enhancer network learns either to recover all the original image pixels, or to predict whether each visual token was replaced by a generator sample. The generator and the enhancer are trained simultaneously and updated synergistically. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework suitable for various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations within a unified, non-Siamese framework. Experiments show that our approach achieves compelling results on vision benchmarks such as ImageNet classification and ADE20K semantic segmentation.

1. INTRODUCTION

Vision Transformers (ViTs) (Dosovitskiy et al., 2020) are transforming the landscape of computer vision, not only in terms of network architecture design, but also in the self-supervised pre-training recipe. Masked image modeling (MIM) (Bao et al., 2021), which randomly masks out some input tokens and then recovers the masked content by conditioning on the visible context, is able to learn rich visual representations and shows promising performance on various vision benchmarks (Zhou et al., 2021; He et al., 2021; Xie et al., 2021; Dong et al., 2021; Wei et al., 2021). Originating from masked language modeling (Devlin et al., 2019), MIM (Figure 1a) is tailor-made for Transformer-style architectures (Vaswani et al., 2017), which are naturally capable of receiving and processing tokenized inputs such as artificial [MASK] tokens. Meanwhile, the more common and natural input signal in computer vision is the RGB image with a 2D regular grid structure. To apply MIM pre-training to images, ViT has to "patchify" the input image into a 1D sequence of non-overlapping patch embeddings, and then perturb them with [MASK] tokens. MIM is thus tightly coupled with the Transformer family, and the usage of [MASK] tokens limits its scope of application to some extent. More importantly, MIM is not directly suitable for convolutional neural networks (CNNs) (LeCun et al., 1989), the dominant architecture in computer vision over the last decade. Introducing [MASK] tokens at any intermediate stage of a CNN is infeasible, as convolution's intrinsic dense-sliding-window paradigm causes information leakage between visual features in earlier layers and therefore impedes MIM. The large CNN family therefore cannot directly benefit from the upsurge of this new pre-training scheme.
Moreover, the usage of [MASK] tokens causes a discrepancy between pre-training and fine-tuning (Devlin et al., 2019; Clark et al., 2020), as the artificial [MASK] tokens never appear in the fine-tuning stage. In this paper, we present a new visual pre-training framework, called Corrupted Image Modeling (CIM, Figure 1b), which avoids directly manipulating [MASK] tokens in pre-trained models and generalizes well to both ViT and CNN architectures. Rather than directly using artificial [MASK] tokens to corrupt a portion of non-overlapping patch embeddings as in MIM, CIM uses a small trainable BEiT (Bao et al., 2021) as an auxiliary generator to corrupt the input image. Specifically, the BEiT generator learns to predict visual tokens at the masked positions, and we sample the visual tokens' replacements from the predicted distribution.
The replaced visual tokens, together with the golden tokens directly produced by a pre-trained frozen image tokenizer encoder (e.g., the DALL-E (Ramesh et al., 2021) dVAE encoder) from the same input given to the small trainable BEiT, are then mapped back to the image RGB domain by a pre-trained frozen tokenizer decoder (e.g., the DALL-E dVAE decoder). The resulting corrupted image serves as the input of the enhancer, which is the model to be pre-trained and transferred. For the enhancer, the choice of pre-training objectives is quite flexible. We study two representatives: a generative objective that regresses all the original image pixels given the corrupted image (Dosovitskiy et al., 2020; Chen et al., 2020a), dubbed Pixel Residual learning (RESPIX), and a discriminative objective that predicts whether each visual token is replaced by a generator sample or not (Clark et al., 2020), dubbed Replaced Visual token Detection (REVDET). After pre-training, the enhancer can be used as a strong feature extractor for visual downstream tasks. Overall, CIM is a general and flexible pre-training framework suited to different kinds of visual encoders. For the first time, we demonstrate that both ViT and CNN can learn rich visual representations within a unified non-Siamese structure. Experimental results show that our approach achieves compelling results on vision benchmarks such as ImageNet classification and ADE20K semantic segmentation. We hope CIM can serve as a promising starting point for exploring flexible and unified visual representation learning across various architectures.
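To make the corruption step concrete, the following NumPy sketch constructs replaced visual tokens and the corresponding REVDET labels. This is a minimal illustration, not the paper's implementation: random logits stand in for the generator's predictions, and the patch count, vocabulary size, and mask ratio are illustrative values we chose.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches = 196   # 14x14 patch grid for a 224x224 image (illustrative)
vocab_size = 8192   # size of the dVAE visual-token vocabulary
mask_ratio = 0.5    # fraction of patches to corrupt (hypothetical value)

# Golden tokens: what the frozen tokenizer encoder assigns to each patch.
golden_tokens = rng.integers(0, vocab_size, size=num_patches)

# Randomly select patch positions to corrupt.
num_masked = int(mask_ratio * num_patches)
masked_pos = rng.choice(num_patches, size=num_masked, replace=False)

# The small BEiT generator predicts a distribution over visual tokens at
# each masked position; random logits stand in for that prediction here.
logits = rng.normal(size=(num_masked, vocab_size))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Sample replacements from the generator's output distribution
# (the "dice" in Figure 1b).
sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])

corrupted_tokens = golden_tokens.copy()
corrupted_tokens[masked_pos] = sampled

# REVDET target: 1 where the visual token differs from the golden token,
# 0 where it is unchanged. A sampled token can coincide with the golden
# one, in which case that position is labeled "not replaced".
revdet_labels = (corrupted_tokens != golden_tokens).astype(np.int64)
```

The corrupted token sequence would then be fed to the frozen tokenizer decoder to produce the corrupted image in RGB space.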

2. CORRUPTED IMAGE MODELING (CIM)

Figure 1b shows an overview of CIM. Our approach simultaneously learns two neural networks: an auxiliary generator and an enhancer. The generator is used to corrupt the input image, while the enhancer receives the corrupted image (Figure 2) and learns either a generative or a discriminative visual pretext task. After pre-training, we throw out the generator and fine-tune the enhancer on downstream tasks.
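The joint training can be sketched as a single step whose total loss sums a generator term and an enhancer term, so that both networks are updated simultaneously. The NumPy sketch below spells out plausible forms of the three losses on toy data; function names, shapes, and loss weighting are our own simplifications and may differ from the paper's actual formulation.

```python
import numpy as np

def generator_loss(logits, golden_tokens, masked_pos):
    """Cross-entropy of the generator's visual-token predictions against
    the golden tokens of the frozen tokenizer, at masked positions only."""
    z = logits[masked_pos]
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(masked_pos)),
                      golden_tokens[masked_pos]].mean()

def respix_loss(pred_pixels, orig_pixels):
    """RESPIX (generative): regress all original pixels from the corrupted
    image; sketched here as a mean absolute error."""
    return np.abs(pred_pixels - orig_pixels).mean()

def revdet_loss(pred_logits, replaced_labels):
    """REVDET (discriminative): per-token binary cross-entropy on
    replaced / not-replaced predictions."""
    p = 1.0 / (1.0 + np.exp(-pred_logits))
    eps = 1e-9
    return -(replaced_labels * np.log(p + eps)
             + (1 - replaced_labels) * np.log(1 - p + eps)).mean()

# One joint step on toy data; in CIM the gradients of this summed loss
# would update the generator and the enhancer simultaneously.
rng = np.random.default_rng(0)
num_patches, vocab_size = 16, 128
golden = rng.integers(0, vocab_size, size=num_patches)
masked_pos = rng.choice(num_patches, size=8, replace=False)
gen_logits = rng.normal(size=(num_patches, vocab_size))
labels = rng.integers(0, 2, size=num_patches).astype(float)
enh_logits = rng.normal(size=num_patches)

total_loss = (generator_loss(gen_logits, golden, masked_pos)
              + revdet_loss(enh_logits, labels))
```

For the generative variant, the `revdet_loss` term would simply be swapped for `respix_loss` on the enhancer's pixel predictions.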

2.1. GENERATOR

Rather than using artificial [MASK] tokens to corrupt the input image, we learn a trainable auxiliary generator to relax the architectural constraints of MIM. Moreover, the generator enriches the diversity



Figure 1: Overview of our Corrupted Image Modeling (CIM) and comparisons with Masked Image Modeling (MIM). MIM (Figure 1a) requires the pre-trained architecture to receive and process the artificial [MASK] tokens, while CIM (Figure 1b) relaxes these restrictions by using a trainable generator to sample corrupted images serving as the input for the enhancer. Similar to BEiT, the small generator learns to predict the golden visual tokens produced by the pre-trained frozen image tokenizer encoder (not shown in the figure) based on partial observations of the input. The enhancer can be of various architectures, including CNNs, and learns either a generative or a discriminative visual pre-training objective. After pre-training, we throw out the generator and fine-tune the enhancer on downstream tasks. The dice icon in Figure 1b refers to the visual tokens' stochastic sampling process, and the lock icon means the pre-trained image tokenizer decoder is frozen.

