CONTEXT AUTOENCODER FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

We present a novel masked image modeling (MIM) approach, the context autoencoder (CAE), for self-supervised representation pretraining. The goal is to pretrain an encoder by solving the pretext task: estimating the masked patches of an image from its visible patches. Our approach first feeds the visible patches into the encoder to extract their representations. Then, we predict the representations of the masked patches from those of the visible patches, in the encoded representation space. We introduce an alignment constraint that encourages the representations of masked patches, predicted from the encoded representations of visible patches, to be aligned with the masked patch representations computed directly by the encoder. In other words, the predicted representations are expected to lie in the encoded representation space, which we empirically show benefits representation learning. Finally, the predicted masked patch representations are mapped to the targets of the pretext task through a decoder. An additional characteristic is that our approach separates the representation learning part (the encoder) from the pretext task completion part, which is replaced by the downstream task part after pretraining. In contrast, previous MIM methods (e.g., BEiT and MAE) couple the two parts, potentially limiting the representation learning quality. We demonstrate the effectiveness of our CAE through superior transfer performance on downstream tasks: semantic segmentation, and object detection and instance segmentation.

1. INTRODUCTION

We study the masked image modeling task for self-supervised representation learning. Masked image modeling (MIM) is a pretext task that masks some patches of the input image and estimates the masked patches from the visible patches. The expectation is that the encoder pretrained by solving the MIM task extracts patch representations carrying semantics that transfer to downstream tasks.

BEiT (Bao et al., 2021) and the method studied in the ViT paper (Dosovitskiy et al., 2021), two MIM methods, learn a ViT to estimate the patch tokens and the pixels, respectively, and use the resulting ViT as the pretrained encoder. They take the visible patches and mask tokens representing the masked patches as input, and make estimations for both the visible and masked patches, though only the estimations for the masked patches are evaluated during training. These two methods use a single ViT structure simultaneously for both representation learning and task completion; thus, only part of the ViT's capacity is devoted to representation learning, limiting the representation quality. Masked autoencoder (MAE) (He et al., 2022) prepends an extra ViT structure, the so-called encoder, that receives only visible patches, followed by a lightweight decoder taking all the patches as input. Unfortunately, the decoder might also play a role in representation learning, diluting the encoder's responsibility for it.

We present a context autoencoder (CAE) approach, illustrated in Figure 1, for improving the encoding quality. We randomly partition the image into two sets of patches: visible patches and masked patches. The architecture contains an encoder, a latent contextual regressor with an alignment constraint, and a decoder. The encoder takes only the visible patches as input and learns representations only for the visible patches.
The latent contextual regressor predicts the masked patch representations from the visible patch representations, and the predicted masked patch representations are constrained to align with the masked patch representations computed from the encoder. The decoder maps the predicted masked patch representations to the targets for the masked patches.

The prediction from the visible patches to the masked patches, i.e., generating a plausible semantic guess for the masked patches, is performed in the encoded representation space using the latent contextual regressor. The predicted representations for the masked patches are constrained to match the representations computed from the encoder, ensuring that the predicted representations also lie in the encoded representation space. Making predictions in the encoded representation space encourages the encoded representations to carry a greater degree of semantics, as validated empirically by our experiments. In addition, the encoder in the top stream in Figure 1 operates only on visible patches, focusing solely on learning semantic representations.

The CAE design assigns the responsibility for representation learning to the encoder through two mechanisms: the latent representations of visible patches are not updated in the other parts, and the alignment constraint requires that the representations predicted by the latent contextual regressor also lie in the encoded representation space. In comparison to BEiT, MAE, and the approach in the ViT paper, our CAE encoder exploits greater capacity for representation learning, thus improving the representation quality.

We present the empirical performance of our approach on downstream tasks: semantic segmentation, and object detection and instance segmentation. The results show that our approach outperforms supervised pretraining, contrastive pretraining, and other MIM methods.
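The pipeline described above (encode only the visible patches, regress the masked-patch representations in the latent space under an alignment constraint, then decode to the pretext targets) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: simple linear maps stand in for the ViT encoder, the latent contextual regressor, and the decoder, and all names and dimensions here are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_patches, n_masked = 16, 8, 6

# Toy linear stand-ins for the ViT encoder, latent contextual regressor,
# and decoder (the real CAE uses transformer blocks for each).
W_enc = rng.normal(size=(dim, dim)) / np.sqrt(dim)
W_reg = rng.normal(size=(2 * dim, dim)) / np.sqrt(2 * dim)
W_dec = rng.normal(size=(dim, dim)) / np.sqrt(dim)

patches = rng.normal(size=(n_patches, dim))      # patch embeddings of one image
perm = rng.permutation(n_patches)                # random visible/masked split
mask_idx, vis_idx = perm[:n_masked], perm[n_masked:]

# 1) Encode only the visible patches.
z_v = patches[vis_idx] @ W_enc

# 2) Predict masked-patch representations from the visible context
#    (pooled context + zero "mask queries"; learned mask tokens in practice).
ctx = np.tile(z_v.mean(axis=0), (n_masked, 1))
mask_queries = np.zeros((n_masked, dim))
z_m_pred = np.concatenate([ctx, mask_queries], axis=1) @ W_reg

# 3) Alignment constraint: predicted representations should match the
#    encoder's own representations of the masked patches (the target side
#    receives no gradient during training).
z_m_target = patches[mask_idx] @ W_enc
align_loss = np.mean((z_m_pred - z_m_target) ** 2)

# 4) Decode the predicted representations to the pretext targets
#    (e.g., tokens or pixels; a placeholder target is used here).
targets = patches[mask_idx]
recon_loss = np.mean((z_m_pred @ W_dec - targets) ** 2)

loss = recon_loss + align_loss
```

The key structural point the sketch preserves is that the encoder never sees the masked patches during encoding, and the prediction step happens entirely in the encoder's representation space before decoding.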

2. RELATED WORK

Self-supervised representation learning has been widely studied in computer vision, including: context prediction (Doersch et al., 2015; Tian et al., 2021), clustering-based methods (Xie et al., 2016; Yang et al., 2016; Caron et al., 2018; Asano et al., 2019; Zhuang et al., 2019; Huang et al., 2019; Caron et al., 2019; Goyal et al., 2021), contrastive learning (Li et al., 2020; Oord et al., 2018; Henaff, 2020; Wang et al., 2022), instance discrimination (Dosovitskiy et al., 2014; 2015), image discretization (Gidaris et al., 2020a;b), masked image modeling (Li et al., 2021; Fang et al., 2022; Tian et al., 2022), and information maximization (Ermolov et al., 2021; Zbontar et al., 2021; Bardes et al., 2021). The following mainly reviews closely related methods.

Autoencoding. Traditionally, autoencoders were used for dimensionality reduction or feature learning (LeCun, 1987; Gallinari et al., 1987; Hinton & Zemel, 1994; Hinton & Salakhutdinov, 2006; Ranzato et al., 2007; Vincent et al., 2008; Kingma & Welling, 2013). The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to estimate the original, uncorrupted data point as its output. Variants and modifications of the DAE have been adopted for self-supervised representation learning, e.g., corruption by masking pixels (Vincent et al., 2010;



Figure 1: The pipeline of the context autoencoder. Our approach feeds the visible patches into the encoder to extract their representations Z_v, and then completes the pretext task by predicting the representations Z_m of the masked patches from the visible patches in the encoded representation space, using the latent contextual regressor with the alignment constraint, and by mapping the predicted representations Z_m of the masked patches to the targets. The pretrained encoder in (a) is applied to downstream tasks by simply replacing the pretext task part (b) with the downstream task part.

