CONTEXT AUTOENCODER FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

We present a novel masked image modeling (MIM) approach, the context autoencoder (CAE), for self-supervised representation pretraining. The goal is to pretrain an encoder by solving a pretext task: estimating the masked patches of an image from its visible patches. Our approach first feeds the visible patches into the encoder to extract their representations. Then, we make predictions from the visible patches to the masked patches in the encoded representation space. We introduce an alignment constraint, encouraging that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch representations computed directly by the encoder. In other words, the predicted representations are expected to lie in the encoded representation space, which we empirically show benefits representation learning. Finally, the predicted masked patch representations are mapped to the targets of the pretext task through a decoder. An additional characteristic is that our approach encourages the separation of the representation learning part (the encoder) from the pretext task completion part, which is replaced by the downstream task head during transfer. In contrast, previous MIM methods (e.g., BEiT and MAE) couple the two parts, potentially limiting the representation learning quality. We demonstrate the effectiveness of our CAE through superior transfer performance on downstream tasks: semantic segmentation, and object detection and instance segmentation.

1. INTRODUCTION

We study the masked image modeling task for self-supervised representation learning. Masked image modeling (MIM) is a pretext task that masks some patches of the input image and estimates the masked patches from the visible patches. It is expected that the encoder pretrained by solving the MIM task is able to extract patch representations carrying semantics that transfer to downstream tasks.

BEiT (Bao et al., 2021) and the method studied in the ViT paper (Dosovitskiy et al., 2021), two MIM methods, learn a ViT to estimate the patch tokens and the pixels, respectively, and use the resulting ViT as the pretrained encoder. They take the visible patches and mask tokens representing the masked patches as input, and make estimations for both the visible and masked patches, where only the estimations for masked patches are evaluated during training. These two methods use a single ViT structure simultaneously for both representation learning and task completion. Thus, only part of the ViT's capacity is devoted to representation learning, limiting the representation quality. Masked autoencoder (MAE) (He et al., 2022) prepends an extra ViT structure that receives only visible patches as the so-called encoder, followed by a lightweight decoder taking all the patches as input. Unfortunately, the decoder might play a partial role in representation learning, diluting the encoder's responsibility for representation learning.

We present a context autoencoder (CAE) approach, illustrated in Figure 1, for improving the encoding quality. We randomly partition the image into two sets of patches: visible patches and masked patches. The architecture contains an encoder, a latent contextual regressor with an alignment constraint, and a decoder. The encoder takes only the visible patches as input and learns representations only for the visible patches. The latent contextual regressor predicts the masked patch representations from the visible patch representations, where the predicted masked patch representations are aligned with the representations computed from the encoder for the masked patches.
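The random partition into visible and masked patches can be sketched as follows (a minimal illustration; the function name, mask ratio, and 14x14 patch grid are illustrative choices, not specifics from the paper):

```python
import numpy as np

def random_partition(num_patches, mask_ratio, rng):
    """Randomly split patch indices into visible and masked sets."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    masked = perm[:num_masked]        # indices the encoder never sees
    visible = perm[num_masked:]       # indices fed to the encoder
    return visible, masked

rng = np.random.default_rng(0)
# e.g., a 224x224 image with 16x16 patches gives a 14x14 = 196 patch grid
visible, masked = random_partition(196, 0.5, rng)
```

During pretraining, only the patches indexed by `visible` are embedded and passed through the encoder; the `masked` indices define the prediction targets.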
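The three components and the alignment constraint can be sketched as below. This is a toy numpy illustration under loud assumptions: linear maps stand in for the ViT encoder, the cross-attention regressor, and the decoder, and mean-pooling stands in for attention over the visible context; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16                                   # toy representation dimension
W_enc = 0.1 * rng.normal(size=(D, D))    # stands in for the ViT encoder
W_reg = 0.1 * rng.normal(size=(D, D))    # stands in for the latent contextual regressor
W_dec = 0.1 * rng.normal(size=(D, D))    # stands in for the decoder

def encode(patches):
    # Encoder processes only the patches it is given (visible ones in pretraining).
    return patches @ W_enc

def regress(z_visible, num_masked):
    # Predict masked-patch representations from the visible context.
    # (The real regressor uses cross-attention with positional queries.)
    context = z_visible.mean(axis=0, keepdims=True)
    return np.repeat(context, num_masked, axis=0) @ W_reg

x_vis = rng.normal(size=(8, D))          # visible patch embeddings
x_mask = rng.normal(size=(4, D))         # masked patch embeddings (for the target branch)

z_vis = encode(x_vis)
z_pred = regress(z_vis, num_masked=4)

# Alignment constraint: predicted masked representations should match what the
# encoder itself computes for the masked patches; this target branch is treated
# as fixed (no gradient flows through it in the actual method).
z_target = encode(x_mask)
align_loss = np.mean((z_pred - z_target) ** 2)

# Decoder maps the predicted representations to the pretext-task targets
# (tokens or pixels), keeping task completion out of the encoder.
recon = z_pred @ W_dec
```

The design point is that `encode` is the only part reused downstream: prediction happens in its output space, so the encoder is not forced to also solve the reconstruction task.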

