HIVIT: A SIMPLER AND MORE EFFICIENT DESIGN OF HIERARCHICAL VISION TRANSFORMER

Abstract

There has been a debate on the choice of plain vs. hierarchical vision transformers, where researchers often believe that the former (e.g., ViT) has a simpler design but the latter (e.g., Swin) enjoys higher recognition accuracy. Recently, the emergence of masked image modeling (MIM), a self-supervised pre-training method, has raised a new challenge for vision transformers in terms of flexibility, i.e., part of the image patches or tokens are to be discarded, which seems to favor plain vision transformers. In this paper, we delve deep into the comparison between ViT and Swin, revealing that (i) the performance gain of Swin is mainly brought by a deepened backbone and relative positional encoding, (ii) the hierarchical design of Swin can be simplified into hierarchical patch embedding (proposed in this work), and (iii) other designs such as shifted-window attentions can be removed. By removing the unnecessary operations, we come up with a new architecture named HiViT (short for hierarchical ViT), which is simpler and more efficient than Swin yet further improves its performance on fully-supervised and self-supervised visual representation learning. In particular, after being pre-trained with masked autoencoder (MAE) on ImageNet-1K, HiViT-B reports an 84.6% accuracy on ImageNet-1K classification, a 53.3% box AP on COCO detection, and a 52.8% mIoU on ADE20K segmentation, significantly surpassing the baselines. Code is available at https://github.com/zhangxiaosong18/hivit.

1. INTRODUCTION

Deep neural networks (LeCun et al., 2015) have advanced the research fields of computer vision, natural language processing, etc., in the past decade. Since 2020, the computer vision community has adapted the transformer module from natural language processing (Vaswani et al., 2017; Devlin et al., 2019) to visual recognition, leading to a large family of vision transformers (Dosovitskiy et al., 2021; Liu et al., 2021; Wang et al., 2021; Zhou et al., 2021; Dai et al., 2021; Li et al., 2021a) that replaced the dominance of convolutional neural networks (Krizhevsky et al., 2012; He et al., 2016; Tan & Le, 2019). Transformers have the ability to formulate long-range feature dependencies, which naturally benefits visual recognition, especially when long-range relationships are important. There are mainly two families of vision transformers, namely, the plain vision transformers (Dosovitskiy et al., 2021; Touvron et al., 2021) and the hierarchical vision transformers (Liu et al., 2021; Wang et al., 2021; Dong et al., 2021a; Chen et al., 2021a), which differ from each other in whether multi-resolution features are used. Intuitively, visual recognition requires hierarchical information, and the hierarchical vision transformers indeed show superior performance. However, the hierarchical vision transformers have introduced complicated and asymmetric operations, e.g., Swin (Liu et al., 2021) used regional self-attentions with shifted windows; hence, they encounter difficulties when the tokens need to be flexibly manipulated. A typical example lies in masked image modeling (MIM), a recent methodology of visual pre-training (Bao et al., 2021; He et al., 2021; Xie et al., 2021b), in which a random subset of image patches is masked from the input and the model learns by reconstructing the masked contents.
In such a circumstance, the plain transformers (e.g., ViT) can directly discard the masked tokens, while the hierarchical vision transformers (e.g., Swin) must feed the entire image (with the masked patches filled with dummy contents) into the encoder (Xie et al., 2021b), slowing down the training procedure and contaminating the original data distribution. This paper tries to answer the following question: is it possible to design an alternative vision transformer that enjoys both the flexibility of plain models and the representation ability of hierarchical models? We start with ViT and Swin, the most popular plain and hierarchical models. We design a path that connects them, with each step only changing a single design factor. The modifications include (a) increasing network depth, (b) adding relative positional encoding, (c) adding hierarchical patch embedding, (c') adding shifted-window attentions (an alternative to (c)), and (d) adding stage 4¹. We find that (a)(b)(c) are the main factors that contribute to visual recognition, while (c') should be replaced by (c) and (d) can be discarded. In particular, the window attentions were designed to reduce the computation of self-attentions in the high-resolution (i.e., low-level) feature maps, but we find that, under a sufficient network depth (e.g., Swin-B used 24 transformer blocks), the low-level self-attentions only make a marginal contribution and can be removed. Based on the analysis, we present a hierarchical version of ViT named HiViT. Following (a)(b)(c) discussed above, the modifications beyond the original ViT are minimal. At the base level, the architecture has 24 transformer blocks (the number of channels is reduced), where the first 4 appear as hierarchical patch embedding that replaces the plain counterpart and the others are equipped with relative positional encoding; one needs only a few lines of code to replace ViT with HiViT.
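The flexibility mentioned above can be made concrete with a minimal numpy sketch of how a plain (or HiViT-style) encoder discards masked tokens before any computation. This is an illustrative simplification, not the actual MAE implementation; the function name `keep_visible`, the seed, and the shapes are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_visible(tokens, mask_ratio=0.75):
    """Drop a random subset of patch tokens, as a plain ViT can under MAE.

    tokens: array of shape (num_patches, dim).
    Returns the visible tokens and the kept indices, so a decoder could
    later restore the original ordering of the full token sequence.
    """
    n = tokens.shape[0]
    n_keep = int(n * (1.0 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep_idx], keep_idx

tokens = rng.standard_normal((196, 768))   # 14x14 patches at ViT-B width
visible, idx = keep_visible(tokens)
print(visible.shape)                       # (49, 768): 25% of tokens remain
```

A window-attention model cannot apply this step at the input, because its shifted windows assume a dense, regular token grid; this is exactly why Swin-style models must instead fill the masked patches with dummy contents.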
Overall, the core contribution of this paper is HiViT, a hierarchical vision transformer architecture that is off-the-shelf for a wide range of vision tasks. In particular, with MIM being a generalized paradigm for self-supervised visual representation learning, HiViT has the potential of being directly plugged into many existing algorithms to improve their effectiveness and efficiency.
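To give intuition for hierarchical patch embedding, the sketch below shows the basic token-merging operation underlying hierarchical stages: each 2x2 neighborhood of tokens is concatenated into one token of four times the width (a linear projection would then reduce the channels in a real network). This is a generic illustration under our own assumptions (the helper `merge_2x2` and the shapes are ours), not the paper's exact implementation.

```python
import numpy as np

def merge_2x2(x):
    """Merge each 2x2 neighborhood of tokens into a single token.

    x: (H, W, C) token grid with even H and W.
    Returns (H/2, W/2, 4C): the four neighbors are concatenated along
    the channel axis, halving the spatial resolution.
    """
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    return x

grid = np.zeros((56, 56, 96))            # stage-1 resolution for a 224x224 input
print(merge_2x2(grid).shape)             # (28, 28, 384): stage-2 resolution
print(merge_2x2(merge_2x2(grid)).shape)  # (14, 14, 1536) before projection
```

Two such merges take a 56x56 token grid to the 14x14 resolution at which, in HiViT, all remaining blocks operate with global self-attention.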



* Equal Contribution. † Corresponding Author. ¹A stage refers to the components processing the same resolution in hierarchical models. In this paper, stages 1, 2, 3, and 4 respectively refer to the components processing 56², 28², 14², and 7² resolutions in image classification.



Compared to ViT and Swin, HiViT is faster in pre-training, needs fewer parameters, and achieves higher accuracy. All numbers (in %) are reported by pre-training the model using MIM (ViT-B and HiViT-B by MAE, Swin-B by SimMIM) and fine-tuning it on the downstream data; please refer to the experiments for detailed descriptions. Under the MAE framework (He et al., 2021), with 1600 epochs of pre-training and 100 epochs of fine-tuning, HiViT-B reports an 84.6% top-1 accuracy on ImageNet-1K, which is +1.0% over ViT-B (trained with MAE) and +0.6% over Swin-B (trained with SimMIM (Xie et al., 2021b)). More importantly, HiViT enjoys an efficient implementation that discards all masked patches (or tokens) at the input stage, so the training procedure is as simple and efficient as applying MAE to ViT. The pre-trained models also show advantages on downstream tasks, including linear probing (a 71.3% top-1 accuracy on ImageNet-1K), semantic segmentation (a 52.8% mIoU on ADE20K (Zhou et al., 2017)), and object detection and instance segmentation (a 53.3% box AP and a 47.0% mask AP on COCO (Lin et al., 2014a) under the 3× training schedule).

