DEEP DYNAMIC AUTOENCODER FOR VISION BERT PRETRAINING

Abstract

Recently, masked image modeling (MIM) has demonstrated promising prospects in self-supervised representation learning. However, existing MIM frameworks recover all masked patches equivalently, ignoring that the reconstruction difficulty of different patches can vary sharply depending on their distance from visible patches. In this paper, we propose Deep Dynamic AutoEncoder (DDAE), a novel MIM framework that dynamically focuses on patch reconstructions of different difficulty at different pretraining phases and depths of the model. In addition to raw pixel regression, DDAE performs dynamic feature self-distillation on intermediate layers to learn semantic information. Our methodology provides more locality inductive bias for ViTs, especially in deep layers, which inherently compensates for the self-attention mechanism's lack of a local prior. Moreover, our core design, deep dynamic supervision, can be migrated seamlessly into existing MIM methods (e.g., MAE, BEiT-v2). The experimental results demonstrate the effectiveness of our approach. As a tokenizer-free framework, the base-size DDAE achieves 83.5% top-1 accuracy with only 100 epochs of pretraining, surpassing MAE and BEiT pretrained for 800 epochs. With a longer pretraining schedule, DDAE achieves 84.3% top-1 accuracy on ImageNet-1K and 49.3% mIoU on ADE20K for semantic segmentation.

1. INTRODUCTION

Aided by rapid gains in hardware, deep learning has ushered in the era of big models and big data. Along with ever-growing model capacity, the demand for data can easily reach hundreds of millions of images (Dosovitskiy et al., 2020), a scale at which labeled data is not publicly accessible. Self-Supervised Learning (SSL) frameworks, such as DINO (Caron et al., 2021), MoCo (Chen et al., 2021), and BEiT (Bao et al., 2021), have attracted growing attention for pretraining vision models without labels. In particular, the recently proposed Masked Image Modeling (MIM) methods (He et al., 2022; Xie et al., 2022b; Dong et al., 2021; Wei et al., 2022b; Chen et al., 2022b; Dong et al., 2022; Chen et al., 2022c) have shown remarkably impressive performance on a variety of vision tasks, demonstrating the promise of unifying computer vision and natural language processing (NLP) pretraining (Peng et al., 2022a). Inspired by BERT (Devlin et al., 2018) in NLP, MIM pretrains the encoder by reconstructing masked image patches from visible patches. Existing MIM methods can be divided into two categories according to whether they need an additional tokenizer. Two-stage methods: Represented by the pioneering work BEiT (Bao et al., 2021), two-stage methods first transform image patches into semantic visual tokens through a pretrained tokenizer, then pretrain the encoder by reconstructing the tokens corresponding to masked image patches. The tokenizer must be pretrained offline with a fixed model architecture and extra data (Zhang et al., 2019b; Ramesh et al., 2021; Radford et al., 2021), and some methods further require an off-the-shelf DNN as a teacher to distill the tokenizer during its pretraining (Peng et al., 2022a). One-stage methods: Taking the recent work MAE (He et al., 2022) as a representative, MAE constructs an asymmetric encoder-decoder structure and directly performs raw pixel regression of masked image patches.
One-stage methods are tokenizer-free; the reconstruction target can be raw pixels, features (e.g., HOG), or self-distillation targets (He et al., 2022; Xie et al., 2022b; Wei et al., 2022a; Chen et al., 2022c). MIM methods enable Vision Transformers (ViTs) to learn rich visual representations and show great potential in various downstream tasks. Naturally, we ask why MIM works. Facilitated by the self-attention mechanism, ViTs excel at modeling long-range dependencies, but unlike CNNs, ViTs lack a local prior (i.e., a pixel is more related to its neighbors than to distant pixels), which largely explains their need for large-scale pretraining datasets (e.g., JFT-300M) or special training techniques (e.g., DeiT-style distillation (Touvron et al., 2021b)). We therefore conjecture that the success of MIM comes from the fine-grained reconstruction of masked image patches. The generation task built by MIM pushes ViTs to attend to the vicinity of each patch in addition to performing global semantic reasoning. It thus inherently compensates for the absence of a local prior in the structure of ViTs, which is essentially different from the discriminative tasks in supervised pretraining or contrastive learning. However, success comes with remaining obstacles. At present, MIM frameworks recover all patches equivalently, ignoring the fact that the reconstruction difficulty of different patches can vary sharply, and that different patches demand different mixtures of semantic reasoning and local perception. Generally, recovering a patch with more visible patches around it is simpler, as long as the model has sufficient local perception ability. In contrast, reconstruction with few visible patches nearby requires strong semantic reasoning ability, since there is little access to neighboring information. Treating all patches equally neglects these differing demands on the model, inevitably limiting its representation ability.
On the other hand, layers at different depths and training phases naturally learn features at different levels. We therefore ask whether there is a way to focus on objectives with diverse characteristics as training progresses, so that better representations can be learned overall. Since the pixel representations of perceptually similar images can be very different, purely pixel-level reconstruction is vulnerable to perceptual differences between images (Dong et al., 2021). Besides pixel regression, we further apply feature self-distillation to provide semantic information. In deep dynamic supervision, we directly align the encoder's intermediate features with the corresponding features of a momentum encoder, where the momentum encoder is updated by Exponential Moving Average (EMA) (Grill et al., 2020; He et al., 2020). It is worth noting that the feature distillation target is the corresponding feature of the momentum encoder, while the regression targets for intermediate layers are all raw pixels. The dynamic loss is applied to both feature self-distillation and pixel regression. Furthermore, we directly migrate our core design, deep dynamic supervision, to the representative one-stage and two-stage methods, MAE (He et al., 2022) and BEiT-v2 (Peng et al., 2022a) respectively, surpassing the original methods by nontrivial margins. Since deep dynamic supervision does not introduce any additional structure, it can serve as a general plug-and-play module for MIM pretraining. The contributions are summarized as follows: • We propose DDAE, a one-stage tokenizer-free framework for self-supervised representation learning. Our core design, deep dynamic supervision, can be migrated to existing MIM approaches seamlessly, providing more local priors for ViTs to make up for their inherent structural defects.
• Although deep supervision has not performed well in modern supervised learning, DDAE combines dynamic loss and deep supervision in a novel and effective way, taking a step toward unleashing the potential of deep supervision in MIM.
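To make the two ingredients above concrete, the sketch below shows a minimal EMA update for the momentum encoder and a per-layer loss that combines weighted pixel regression with feature self-distillation. The function names, the flat parameter dictionary, and the plain-MSE losses are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

def ema_update(student_params, teacher_params, momentum=0.999):
    """Update the momentum (teacher) encoder as an exponential moving
    average of the student encoder. Parameters are modeled as a flat
    dict of numpy arrays purely for illustration."""
    for k in teacher_params:
        teacher_params[k] = (momentum * teacher_params[k]
                             + (1.0 - momentum) * student_params[k])
    return teacher_params

def deep_dynamic_loss(layer_feats, teacher_feats, layer_pixels,
                      target_pixels, weights):
    """Sum, over the supervised intermediate layers, of a per-patch
    weighted pixel-regression term and a feature self-distillation term
    against the momentum encoder. `weights` holds one dynamic weight per
    masked patch; plain MSE is an assumed stand-in for both terms."""
    total = 0.0
    for f_s, f_t, p_hat in zip(layer_feats, teacher_feats, layer_pixels):
        # pixel regression: per-patch MSE, reweighted by the dynamic weights
        pix = np.sum(weights * np.mean((p_hat - target_pixels) ** 2, axis=-1))
        # feature self-distillation: align student and teacher features
        distill = np.mean((f_s - f_t) ** 2)
        total += pix + distill
    return total
```

In a real training loop, `ema_update` would run once per iteration with the teacher's gradients disabled, and the dynamic weights would change as the learnable parameters β are updated.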



Motivated by this observation and answering the question above, we propose a simple yet effective SSL framework named Deep Dynamic AutoEncoder (DDAE). With deep dynamic supervision, the model dynamically focuses on different patches at different training phases. First, it attaches importance to patches close to visible ones, which demand only modest local perception to recover. Later in training, the model shifts attention to the distant, difficult patches, aided by the simpler patches that were emphasized in the earlier pretraining phase. Specifically, we first define the reconstruction difficulty according to the distance between masked patches and visible ones, generating a distance map. Then, we exert different supervision signals (controlled by learnable parameters β) on intermediate layers. As training progresses, the model dynamically focuses on different regions with the update of β.
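The distance map and the β-controlled weighting can be sketched as follows. The choice of Chebyshev distance on the patch grid and the exponential weighting form are assumptions made for illustration; the text above only specifies that difficulty is defined by distance to visible patches and that the supervision signal is controlled by learnable parameters β.

```python
import numpy as np

def distance_map(mask, grid_size):
    """Distance from each masked patch to its nearest visible patch on a
    grid_size x grid_size patch grid. `mask` is a boolean array of shape
    (grid_size**2,) with True marking masked patches. Chebyshev distance
    is an assumed metric; visible patches get distance 0. Assumes at
    least one visible patch exists."""
    m = mask.reshape(grid_size, grid_size)
    vis_r, vis_c = np.nonzero(~m)
    dist = np.zeros((grid_size, grid_size), dtype=np.float32)
    for r, c in zip(*np.nonzero(m)):
        dist[r, c] = np.min(np.maximum(np.abs(vis_r - r), np.abs(vis_c - c)))
    return dist

def dynamic_weights(masked_dists, beta):
    """Per-patch reconstruction weights for the masked patches, controlled
    by a scalar beta (learnable in the real model). Larger beta concentrates
    the loss on patches near visible ones; beta -> 0 recovers uniform
    weighting. This exact functional form is an assumption."""
    w = np.exp(-beta * masked_dists)
    return w / w.sum()
```

For example, with only the first row of a 4x4 patch grid visible, masked patches in rows 1, 2, and 3 receive distances 1, 2, and 3, and a positive β downweights the far rows relative to the near ones.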

