TOWARDS UNDERSTANDING WHY MASK RECONSTRUCTION PRETRAINING HELPS IN DOWNSTREAM TASKS

Abstract

For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g., MAE (He et al., 2021) and data2vec (Baevski et al., 2022), randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then, for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional "supervised learning" (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To answer these questions, we first theoretically show that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then, since the pretraining dataset is typically huge and highly diverse and thus covers most features in the downstream dataset, the pretrained encoder captures as many features as possible in the downstream dataset during fine-tuning, and provably does not lose these features. In contrast, SL only randomly captures some features, due to the lottery ticket hypothesis. Hence MRP provably achieves better performance than SL on classification tasks. Experimental results support both our data assumptions and our theoretical implications.

1. INTRODUCTION

Self-supervised learning (SSL) has emerged as a popular and effective method to learn unsupervised representations, with great success witnessed in many downstream tasks, e.g., image classification (He et al., 2016a), object detection (Girshick et al., 2015; Tan et al., 2020) and segmentation (Ronneberger et al., 2015; He et al., 2017). In SSL, one first creates an artificial supervised learning problem, a.k.a. a pretext task, that obtains pseudo labels by carefully designing the task itself, and then trains a network on this artificial supervised task to learn how to capture useful data features. For example, one representative SSL approach, contrastive learning (He et al., 2020a; Chen et al., 2020b), constructs a supervised problem on an unlabeled dataset by regarding random augmentations of an image as a separate class, and then performs supervised instance discrimination. Owing to requiring no manual annotations and to its great success, SSL has paved a new way to solve unsupervised learning problems and has attracted increasing research interest. In this work, we are particularly interested in the recently proposed mask-reconstruction pretraining (MRP) family of SSL (Xie et al., 2021; Dong et al., 2021), e.g., MAE (He et al., 2021) and data2vec (Baevski et al., 2022). The core idea of this MRP family is to randomly mask patches of the input image and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. After pretraining on a large-scale unsupervised dataset, MRP fine-tunes the encoder on a specific downstream task to learn more task-specific representations. This pretraining mechanism generally enjoys remarkable test performance improvement on the downstream task and much better generalization on out-of-distribution data than standard end-to-end "supervised learning".
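As a concrete illustration, the mask-then-reconstruct objective described above can be sketched with a toy linear auto-encoder on a single image. The patch size, mask ratio, and random projection weights below are our own illustrative choices, not those of MAE or data2vec:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches, flattened."""
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(-1, p * p)            # (num_patches, p*p)

def mrp_loss(img, p=4, mask_ratio=0.75):
    """Reconstruction loss computed only on the masked patches."""
    patches = patchify(img, p)
    n = patches.shape[0]
    masked = rng.choice(n, size=int(mask_ratio * n), replace=False)
    visible = patches.copy()
    visible[masked] = 0.0                        # zero out the masked patches
    # toy encoder/decoder: random projections standing in for learned layers
    W_enc = rng.standard_normal((p * p, 8)) / np.sqrt(p * p)
    W_dec = rng.standard_normal((8, p * p)) / np.sqrt(8)
    recon = visible @ W_enc @ W_dec
    # the loss is taken on masked patches only, as in mask-reconstruction methods
    return float(np.mean((recon[masked] - patches[masked]) ** 2))

img = rng.standard_normal((16, 16))
loss = mrp_loss(img)
```

In practice the random projections are replaced by a trained encoder/decoder and the loss is minimized over a large unlabeled dataset; the sketch only shows where the masking enters the objective.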
Indeed, it also achieves better fine-tuning performance than other state-of-the-art SSL approaches, including contrastive learning (He et al., 2020a; Chen et al., 2020b) and clustering learning (Caron et al., 2018; Wu et al., 2018). Because of its simplicity and strong compatibility, MRP has attracted wide interest and is seeing increasingly many applications. However, theoretical analyses and understanding of MRP still largely lag behind its practical applications. Specifically, it is not clear how MRP performs feature learning via the mask-reconstruction task, though such an understanding is heavily desired. Moreover, the theoretical reasons for MRP's superior test performance over end-to-end supervised learning are rarely investigated. Most existing theoretical works (Wen & Li, 2021; Arora et al., 2019; Tosh et al., 2021a;b) focus on analyzing contrastive learning, and few works study MRP, which differs greatly from contrastive learning. Cao et al. (2022) analyzed the patch-based attention in MAE via an integral kernel but did not study the core questions of this work, i.e., 1) what features does MRP learn and 2) why does MRP beat conventional supervised learning.

Contributions. In this work, we provide a theoretical viewpoint to understand the semantic (feature) learning process of MRP. Moreover, we analyze the test performance of MRP to show its superiority over supervised learning on downstream classification tasks. Our contributions are highlighted below.

Firstly, based on the multi-view data assumption of (Allen-Zhu & Li, 2020), where multiple/single discriminative features exist in multi-view/single-view data, we prove that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, the pretrained encoder in MRP captures all the discriminative features of each semantic class in the pretraining dataset. Moreover, each convolution kernel in the encoder captures at most one feature. These properties benefit the downstream tasks.
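The two-layer convolutional encoder and one-layer convolutional decoder analyzed above can be sketched in a minimal 1-D form. The signal length, kernel width, ReLU activation, and average pooling below are illustrative assumptions for exposition, not the exact architecture or activation used in the proofs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (not from the paper): input dim, kernel width, number of kernels.
D, k, m = 12, 3, 4

def conv(x, kernels):
    """'Valid' 1-D cross-correlation: one row per position, one column per kernel."""
    width = kernels.shape[1]
    windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
    return windows @ kernels.T                   # (positions, num_kernels)

K1 = rng.standard_normal((m, k)) * 0.1           # first conv layer: m kernels of width k
W2 = rng.standard_normal((m, m)) * 0.1           # second layer: 1x1 conv mixing channels
Kd = rng.standard_normal((m, D)) * 0.1           # one-layer decoder mapping features back

def encode(x):
    h = np.maximum(conv(x, K1), 0.0)             # conv + ReLU (activation choice is ours)
    return (h @ W2).mean(axis=0)                 # 1x1 conv, then pool to a feature vector

def decode(z):
    return z @ Kd                                # single linear (convolutional) decoder

x = rng.standard_normal(D)
x_hat = decode(encode(x))                        # reconstruction of the input
```

In the analysis, training this auto-encoder with the mask-reconstruction loss drives each kernel in `K1` toward one of the discriminative feature directions in the data.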
As the pretraining dataset is often much larger than the downstream dataset, it (approximately) covers all the features in the downstream dataset. So the kernels of the pretrained encoder also capture the features in downstream datasets well. Besides, as each kernel is associated with at most one feature, the semantic features are not fused together, allowing a network to easily establish the relation between kernels and semantic class labels in downstream classification tasks.

Secondly, we theoretically show that after fine-tuning on the downstream dataset, MRP enjoys superior test performance to end-to-end supervised learning on downstream tasks, using classification as an example. Assuming the pretraining and downstream datasets share the same distribution, we prove that after fine-tuning, MRP classifies new samples correctly with high probability for both multi-view and single-view test data. This result is superior to (Allen-Zhu & Li, 2020), which shows that conventional SL attains only about half test accuracy on single-view test data.
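The property that each encoder kernel captures at most one discriminative feature can be illustrated numerically: given feature directions and kernels, count how many features each kernel is strongly correlated with. The synthetic features, noisy kernels, and similarity threshold below are hypothetical stand-ins for exposition, not quantities from the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(2)

d, num_features, num_kernels = 16, 3, 6
# synthetic orthonormal "discriminative features" (rows)
Q, _ = np.linalg.qr(rng.standard_normal((d, num_features)))
features = Q.T                                   # (num_features, d)

# synthetic kernels: each is a noisy copy of a single feature direction
owner = rng.integers(num_features, size=num_kernels)
kernels = features[owner] + 0.05 * rng.standard_normal((num_kernels, d))

def features_captured(kernel, features, thresh=0.5):
    """Indices of features whose |cosine similarity| with the kernel exceeds thresh."""
    sims = features @ kernel / (np.linalg.norm(kernel) * np.linalg.norm(features, axis=1))
    return np.flatnonzero(np.abs(sims) > thresh)

# under this construction, each kernel correlates strongly with exactly one feature
counts = [len(features_captured(kern, features)) for kern in kernels]
```

When such a one-to-one (or one-to-none) alignment holds, a downstream linear head can weight kernels per class without entangled features interfering.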

2. RELATED WORKS

SSL approaches. According to the pretext task, current SSL approaches can be grouped into contrastive learning, e.g., (Hjelm et al., 2018; Oord et al., 2018), clustering learning, e.g., (Caron et al., 2018; Wu et al., 2018), and mask-reconstruction pretraining (MRP) (He et al., 2021; Baevski et al., 2022). Given random augmentations of an image, contrastive learning, e.g., MoCo (He et al., 2020a) and SimCLR (Chen et al., 2020a), brings different crops of the same image together and pushes crops of different images far away from each other in the feature space. Clustering learning aims to cluster similar samples into the same group. However, both contrastive learning and clustering learning heavily depend on multi-crop augmentations. The recently proposed MRP is a simpler SSL approach. The MRP family, e.g., MAE (He et al., 2021) and SimMIM (Xie et al., 2021), randomly masks image patches and then reconstructs the masked patches via an auto-encoder. Later, both MaskFeat (Wei et al., 2021) and data2vec (Baevski et al., 2022) empirically found better performance by reconstructing semantic features. MRP has now surpassed end-to-end supervised learning on many downstream tasks, e.g., image classification (Dong et al., 2021) and object detection (He et al., 2021), and is seeing more applications because of its effectiveness and strong compatibility.

SSL analysis. Despite its remarkable success in practice, the theoretical understanding of SSL is still largely absent. Arora et al. (2019) provided generalization guarantees for contrastive learning on linear classification models under the assumption that different positives belong to the same latent class. Wang & Isola (2020) showed that contrastive learning trades off the alignment and uniformity of features on a hypersphere. HaoChen et al. (2021) proposed and analyzed a spectral version of the contrastive loss with provable accuracy guarantees under linear-probing evaluation. Tian et al. (2020) proved that SimCLR only captures feature variability across data points. However, these theoretical works mainly study contrastive learning, which essentially differs from MRP. The works most closely related to ours are (Lee et al., 2021; Cao et al., 2022). Cao et al. (2022) analyzed the patch-based attention in MAE via an integral kernel, showing the benefits of patchifying, the equivalence between the attention mechanism in MAE and a learnable integral kernel transform, etc. However, they did not reveal any feature properties of MRP or the reasons for its superiority over conventional supervised learning. Lee et al. (2021) showed the benefits of reconstructing partial

