A ROBUSTLY AND EFFECTIVELY OPTIMIZED PRE-TRAINING APPROACH FOR MASKED AUTOENCODER

Abstract

Recently, Masked Image Modeling (MIM) has increasingly reshaped the status quo of self-supervised visual pre-training. This paper does not describe a novel MIM framework, but instead unravels several fundamental ingredients for robustly and effectively pre-training a Masked AutoEncoder (MAE), with improved downstream performance as a byproduct. We highlight that it is of great significance for the whole autoencoder to encourage high-variance interactions across different tokens, while simultaneously smoothing the inter-patch variance of the reconstruction target. First, at the decoding phase, we apply standard dropout to the attention probabilities as noise, randomly masking out edge connections across different tokens; otherwise, their shortcut interactions might hinder the emergence of meaningful contextual representation. Second, we point out that per-patch normalization will fail unless the patch pixels first rely on some population statistics to reduce inter-patch variance and thereby smooth the reconstruction. Third, we show that autoencoders of different capacities encounter this issue to varying degrees, and that learnable masked tokens can be employed to manipulate the variance depending on their insertion position and ratio in the model. The proposed techniques are simple and effective: they stably benefit the pre-training of a masked autoencoder and obtain superior performance across different downstream tasks.

1. INTRODUCTION

Contrastive learning (He et al., 2020; Chen et al., 2020; Grill et al., 2020) and masked image modeling (Bao et al., 2021; He et al., 2022; Xie et al., 2022b) have become two dominant paradigms for self-supervised visual pre-training. This paper elaborates on the latter, which has exhibited intriguing progress recently. Generally speaking, the key philosophy of masked image modeling is to mask out a portion of the input image and then learn a latent representation that can predict the removed data. Such a mechanism first manifested its efficacy in natural language processing (Kenton & Toutanova, 2019; Brown et al., 2020), learning contextual representations that universally benefit various downstream tasks. Unfortunately, the vision community struggled to embark on a similar trajectory for a while. Thanks to the development of the Vision Transformer (ViT) (Dosovitskiy et al., 2020), masked image modeling eventually opened a new chapter for self-supervised visual pre-training. In particular, the pioneering BEiT (Bao et al., 2021) applies ViT as a bidirectional encoder to predict visual tokens from a pre-trained codebook (Ramesh et al., 2021), given that some patches of the input image are masked and replaced with a learnable embedding. BEiT first demonstrates the superiority of masked image modeling by outperforming the supervised counterpart that pre-trains using the class labels of images. In contrast, He et al. (2022) present a self-contained solution that directly reconstructs pixels at the decoding phase. Their proposed masked autoencoder employs an asymmetric architecture, where the encoder computes only on the small portion of visible tokens while a lightweight decoder reconstructs the large portion of masked tokens. Similar techniques are also exploited in Xie et al. (2022b), such as random masking, raw pixel prediction, lightweight decoding, etc. In addition, Wei et al. (2022) reveal that using HoG (Dalal & Triggs, 2005) as the reconstruction target yields competitive representations.

In light of these breakthroughs, several lines of improvement have arisen: incorporating siamese-network-based contrastive learning (Huang et al., 2022; Tao et al., 2022); enhancing the pre-trained image tokenizer (Zhou et al., 2021; Peng et al., 2022); etc. This paper does not fall within these categories, but instead dives into a purely self-contained Masked AutoEncoder (MAE). Without loss of generality, we consider the architecture proposed in He et al. (2022) as our baseline. We attempt to reveal several fundamental ingredients that contribute to the success of masked autoencoding and then propose techniques to robustly and effectively improve it. The key message we deliver is that it is of significance for the entire model to circumvent oversmoothing token interactions at both the encoding and decoding phases. Conversely, we show that some population statistics are required to smooth the inter-patch variance in the pixel space; otherwise, the per-patch normalized pixels cannot serve as a well-performing reconstruction target. We detail our key message as follows.

In particular, when we talk about oversmoothing here, we mean that the pre-trained autoencoder might learn shortcut interactions across tokens to trivially fulfill the pretext task (e.g., pixel prediction), which hinders the emergence of contextual representation in masked image modeling. On the encoding side, both high-portion masking and random masking, proposed in He et al. (2022), can alleviate this issue. First, if we regard the self-attention matrix as a normalized adjacency matrix over the patches, then a complete graph is created over them (Shi et al., 2022). High-portion masking thus dynamically samples one of a combinatorial number of subgraphs at every iteration, substantially reducing the risk of oversmoothing.
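The subgraph-sampling view of high-portion random masking can be sketched as follows. This is a minimal illustration; the function name and list-based interface are our own assumptions, whereas the actual MAE implementation operates on batched tensors:

```python
import random

def random_masking(num_patches, mask_ratio):
    """Shuffle patch indices and keep the first (1 - mask_ratio) fraction
    visible; the rest are masked out. Each call samples one subgraph of
    the complete patch graph for the encoder to attend over."""
    num_visible = int(num_patches * (1 - mask_ratio))
    indices = list(range(num_patches))
    random.shuffle(indices)
    visible = sorted(indices[:num_visible])
    masked = sorted(indices[num_visible:])
    return visible, masked

# A 14x14 grid of patches (196 tokens) with a 75% masking ratio leaves
# only 49 tokens for the encoder to process.
visible, masked = random_masking(196, 0.75)
```

Because only the visible indices enter the encoder, a higher masking ratio both reduces compute and shrinks the attention graph available for shortcut interactions.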
Second, given a fixed masking ratio, random masking yields larger bipartite entanglement between the visible and masked tokens by increasing their boundary perimeter. That is, to preserve a fixed amount of global semantics, random masking can afford a higher masking ratio that spatially removes more redundant patches.

While these practices are effective on the encoding side, they are not applicable to the decoder. At the decoding phase, the removed tokens are inserted back as a shared learnable embedding, so all tokens are visible to the decoder as a fully-connected graph. To circumvent trivial dependencies among tokens, we take inspiration from the DropEdge technique (Rong et al., 2019) in graph neural networks, which randomly removes edges of the graph. In our specific case, we apply standard dropout to the self-attention probabilities to randomly mask out some portion of the interactions across tokens, so that a dynamically sampled, partially connected graph is visible to the decoder at every iteration.

In addition to the architecture, what to predict (i.e., the reconstruction target) is also a concern. In He et al. (2022); Feichtenhofer et al. (2022), per-patch normalization of the raw pixels is empirically demonstrated to be the optimal variant, which suggests that predicting the local high-frequency components benefits representation learning. However, we argue that this operation becomes meaningless if no population statistics are used to transform the pixel space. We conduct an ablation study to illustrate this point in Table 1, where the ViT-Base and ViT-Large models are pre-trained using different reconstruction targets. As the None variant in the table shows, if we directly reconstruct the purely raw pixels with per-patch normalization, masked pre-training brings no positive gains over training the model from scratch.
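The decoder-side dropout on attention probabilities described above can be sketched as a single-head, plain-Python toy. The function name and list-of-lists interface are illustrative assumptions; a real implementation would apply a framework's dropout layer to the batched post-softmax attention matrix:

```python
import math
import random

def attention_with_prob_dropout(q, k, v, drop_ratio=0.1, training=True):
    """Single-head scaled dot-product attention where standard dropout is
    applied to the post-softmax probabilities, randomly severing
    token-to-token edges (DropEdge-style). q, k, v are lists of d-dim
    vectors, one per token."""
    d = len(q[0])
    out = []
    for qi in q:
        # scaled dot-product scores of this query against every key
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        probs = [e / z for e in exps]
        if training:
            # inverted dropout on the edge weights: each surviving edge
            # is rescaled by 1 / (1 - drop_ratio)
            probs = [0.0 if random.random() < drop_ratio else p / (1 - drop_ratio)
                     for p in probs]
        # weighted sum of value vectors under the (sparsified) edges
        out.append([sum(p * vj[t] for p, vj in zip(probs, v)) for t in range(d)])
    return out

# With drop_ratio=0, a single token attends only to itself:
attention_with_prob_dropout([[1.0, 0.0]], [[1.0, 0.0]], [[5.0, -1.0]], drop_ratio=0.0)
# → [[5.0, -1.0]]
```

At every iteration the surviving edges differ, so the decoder only ever sees a partially connected token graph rather than the full clique.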
Indeed, only if we first perform some inter-patch normalization, followed by the intra-patch normalization, does predicting the high-frequency components become plausible. Specifically, the original MAE actually transforms the pixel space by normalizing along the RGB channels, where the mean and standard deviation are calculated image-wise over the whole dataset. We also verify that similar advantages can be observed by normalizing over the 1-D patches along the dimension whose size equals the patch length, as shown in the last two columns of the table. Note that our conclusion aligns with Chen & He (2021); Wang et al. (2022), once we regard the predictor as an autoencoder and the original image as a natural stop-gradient target view. To this end, performing batch normalization on the target facilitates smoothing the target branch and thus stabilizes the pre-training.

While the previous discussions are delivered in an architecture-invariant manner, models with different capacities are subject to the mentioned issue to varying degrees. For instance, an extremely high masking ratio (e.g., 75%) might fit the ViT-Large model well but is not optimal for the ViT-Base variant. In order to manipulate token interactions for different models, we show that the involvement of masked tokens can be made more flexible in terms of their insertion position and ratio. For a smaller architecture, we can introduce an extra small portion of mid-level masked tokens earlier, in the encoder. The insertion position is preferably at the higher layers of the encoder, so as not to sacrifice efficiency. Unlike the argument in Chen et al. (2022) that representation learning and pretext task completion should not be coupled, we empirically demonstrate

