A CLOSER LOOK AT SELF-SUPERVISED LIGHTWEIGHT VISION TRANSFORMERS

Abstract

Self-supervised learning with large-scale Vision Transformers (ViTs) as a pre-training method has achieved promising downstream performance. Yet, how much these pre-training paradigms promote lightweight ViTs' performance is considerably less studied. In this work, we develop and benchmark self-supervised pre-training methods, e.g., contrastive-learning-based MoCo-v3 and masked-image-modeling-based MAE, on image classification tasks and some downstream dense prediction tasks. We surprisingly find that, if proper pre-training is adopted, even vanilla lightweight ViTs show performance on ImageNet comparable to previous SOTA networks with delicate architecture designs. We also point out some defects of such pre-training, e.g., failing to benefit from large-scale pre-training data and showing inferior performance on data-insufficient downstream tasks. Furthermore, we clearly show the effect of such pre-training by analyzing the properties of the layer representations and attention maps of the related models. Finally, based on the above analyses, we develop a distillation strategy during pre-training, which leads to further downstream performance improvements for MAE-based pre-training.

1. INTRODUCTION

Self-supervised learning (SSL) has shown great progress in representation learning without heavy reliance on expensive labeled data. SSL focuses on various pretext tasks for pre-training. Among them, several works (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2020; Chen et al., 2021a; Caron et al., 2021) based on contrastive learning (CL) have achieved comparable or even better accuracy than supervised pre-training when transferring the learned representations to downstream tasks. Recently, another trend focuses on masked image modeling (MIM) (Bao et al., 2021; He et al., 2021; Zhou et al., 2022), which perfectly fits Vision Transformers (ViTs) (Dosovitskiy et al., 2020) for vision tasks and achieves improved generalization performance. Most of these works, however, involve large networks, with little attention paid to smaller ones. Some works (Fang et al., 2020; Abbasi Koohpayegani et al., 2020; Choi et al., 2021) focus on contrastive self-supervised learning on small convolutional networks (ConvNets) and improve performance by distillation, but the pre-training of lightweight ViTs remains considerably less studied. Efficient neural networks are essential for modern on-device computer vision. Recent studies on achieving top-performing lightweight models mainly focus on designing network architectures (Sandler et al., 2018; Howard et al., 2019; Graham et al., 2021; Ali et al., 2021; Heo et al., 2021; Touvron et al., 2021b; Mehta & Rastegari, 2022; Chen et al., 2021b; Pan et al., 2022), with little attention paid to optimizing the training strategies for these models. We believe the latter is also of vital importance, and pre-training is one of the most promising approaches along this way, since it has achieved great progress on large models.
To this end, we develop and benchmark recently popular self-supervised pre-training methods, e.g., CL-based MoCo-v3 (Chen et al., 2021a) and MIM-based MAE (He et al., 2021), along with fully-supervised pre-training as the baseline, for lightweight ViTs on ImageNet and several other classification tasks, as well as on dense prediction tasks, e.g., object detection and segmentation. We surprisingly find that if proper pre-training is adopted, even vanilla lightweight ViTs show performance comparable to that of previous SOTA networks with delicate designs: a vanilla ViT-Tiny (5.7M parameters) achieves 78.5% top-1 accuracy on ImageNet. We also observe some intriguing defects of such pre-training, e.g., failing to benefit from large-scale pre-training data and showing inferior performance on data-insufficient downstream tasks. These findings motivate us to dive deep into the working mechanism of these pre-training methods for lightweight ViTs. More specifically, we introduce a variety of model analysis methods to study the pattern of layer behaviors during pre-training and fine-tuning, and investigate what really matters for downstream performance. First, we find that lower layers of the pre-trained models matter more than higher ones if sufficient downstream data is provided, while higher layers matter in data-insufficient downstream tasks. Second, we observe that such pre-training hardly alters the attention behaviors of the final recognition model and introduces no locality inductive bias, even though locality is the commonly adopted rule for recent network architecture design (Mehta & Rastegari, 2022; Heo et al., 2021; Touvron et al., 2021b; Liu et al., 2021). Based on the above analyses, we also develop a distillation strategy for MAE-based pre-training, which improves the pre-training of lightweight ViTs and achieves better downstream performance, especially on data-insufficient classification tasks and on detection tasks.

2. PRELIMINARIES AND EXPERIMENTAL SETUP

ViTs. We use ViT-Tiny (Touvron et al., 2021a), which contains 5.7M parameters, as the base model in our study to examine its downstream performance with pre-training. We adopt the vanilla architecture, consisting of 12 layers with an embedding dimension of 192, except that the number of heads is increased from 3 to 12, as we find this improves the model's expressive power. We use this improved version by default. ViT-Tiny is chosen for study because it is an ideal experimental subject: almost all existing pre-training methods can be applied to it directly, and its rather plain structure largely eliminates the influence of model architecture on our analysis.

Evaluation Metrics. Linear probing has been a popular protocol to evaluate the quality of pre-trained weights (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2020): only the prediction head is tuned on the downstream training set while the pre-trained representations are kept frozen. However, prior works point out that linear evaluation does not always correlate with downstream utility (He et al., 2021; Newell & Deng, 2020). Fine-tuning is another evaluation protocol, in which all layers are initialized with the pre-trained model and then tuned. We adopt fine-tuning by default and also take layer-wise lr decay (Bao et al., 2021) into consideration. By default, we evaluate on ImageNet (Deng et al., 2009) by fine-tuning on the train split and evaluating on the validation split. Several other downstream classification datasets (Nilsback & Zisserman, 2008; Parkhi et al., 2012; Maji et al., 2013; Krause et al., 2013; Krizhevsky et al., 2009; Van Horn et al., 2018) and object detection and segmentation tasks on COCO (Lin et al., 2014) are also exploited for comparison in our study.

Compared Methods.
Baseline: We largely follow the recipe in DeiT (Touvron et al., 2021a), except for some augmentation hyper-parameters (see Appendix A.1 for our improved recipe), and train a ViT-Tiny from scratch with full supervision for 300 epochs on the training set of ImageNet-1k. It achieves 74.5% top-1 accuracy on the validation set of ImageNet-1k, surpassing the original architecture (72.2%) by modifying the number of heads from 3 to 12, and further reaches 75.8% with the improved training recipe; this finally serves as our strong baseline against which to examine the pre-training. We denote this supervised model by DeiT-Tiny.
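As a concrete illustration of the layer-wise lr decay (Bao et al., 2021) considered in our fine-tuning evaluations, the sketch below computes per-depth learning-rate multipliers for a 12-layer ViT. The function name, the decay value of 0.75, and the depth bookkeeping (patch embedding at the bottom, head at the top) are illustrative assumptions, not the exact implementation used in the paper.

```python
# Sketch of layer-wise lr decay: during fine-tuning, layers closer to the
# input receive smaller learning rates, so pre-trained low-level features
# are perturbed less than the task-specific head.

def layerwise_lr_scales(num_layers: int = 12, decay: float = 0.75):
    """Return lr multipliers ordered [patch embedding, layer 1..N, head].

    The head gets scale 1.0; each step toward the input multiplies the
    scale by `decay`.
    """
    depths = num_layers + 2  # patch embedding + transformer layers + head
    return [decay ** (depths - 1 - d) for d in range(depths)]

scales = layerwise_lr_scales(num_layers=12, decay=0.75)
print(round(scales[-1], 4))  # head scale: 1.0
print(round(scales[0], 6))   # patch-embedding scale: 0.75 ** 13
```

In practice these multipliers would be attached to per-layer optimizer parameter groups, each with `base_lr * scale`.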

MAE: MAE (He et al., 2021) is selected as a representative of MIM-based pre-training methods, as it has a simple framework with low training cost. We largely follow the design of MAE except that the encoder is altered to ViT-Tiny; several basic factors and components are adjusted to fit the smaller encoder (see Appendix A.2). By default, we pre-train on the train split of ImageNet-1k (Deng et al., 2009) (dubbed IN1K) for 400 epochs, and denote the pre-trained model as MAE-Tiny.

MoCo-v3: We also implement a contrastive SSL pre-training counterpart for a more thorough study. MoCo-v3 (Chen et al., 2021a) is selected for its simplicity. We use MoCov3-Tiny to denote the model pre-trained for 400 epochs. Details are provided in Appendix A.3. Some other methods, e.g., MIM-based SimMIM (Xie et al., 2022) and CL-based DINO (Caron et al., 2021), are also involved, but are moved to Appendix B.5 due to space limitations.
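To make the MIM setup concrete, the sketch below mimics MAE-style random patch masking, where only a small fraction of patch tokens is fed to the encoder and the decoder reconstructs the rest. The 75% mask ratio, the 14x14 patch grid, and all names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import numpy as np

def random_masking(tokens: np.ndarray, mask_ratio: float = 0.75, seed: int = 0):
    """tokens: (num_patches, dim). Returns (visible_tokens, mask).

    mask[i] == 1 marks a patch hidden from the encoder; the decoder is
    trained to reconstruct these patches from the visible context.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    order = rng.permutation(n)      # random shuffle of patch indices
    keep = np.sort(order[:n_keep])  # indices of visible patches
    mask = np.ones(n, dtype=np.int64)
    mask[keep] = 0
    return tokens[keep], mask

patches = np.zeros((196, 192))  # 14x14 patch grid, embedding dim 192 (ViT-Tiny)
visible, mask = random_masking(patches)
print(visible.shape)    # (49, 192): 25% of 196 patches stay visible
print(int(mask.sum()))  # 147 masked patches
```

Because the encoder only processes the visible 25% of tokens, this design keeps pre-training cheap even for larger encoders, which is part of why MAE fits our study of small models.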

