HIVIT: A SIMPLER AND MORE EFFICIENT DESIGN OF HIERARCHICAL VISION TRANSFORMER

Abstract

There has been a debate on the choice between plain and hierarchical vision transformers, where researchers often believe that the former (e.g., ViT) enjoys a simpler design while the latter (e.g., Swin) achieves higher recognition accuracy. Recently, the emergence of masked image modeling (MIM), a self-supervised pre-training method, has raised a new challenge to vision transformers in terms of flexibility, i.e., part of the image patches or tokens are to be discarded, which seems to favor plain vision transformers. In this paper, we delve deep into the comparison between ViT and Swin, revealing that (i) the performance gain of Swin is mainly brought by a deepened backbone and relative positional encoding, (ii) the hierarchical design of Swin can be simplified into hierarchical patch embedding (proposed in this work), and (iii) other designs such as shifted-window attention can be removed. By removing the unnecessary operations, we come up with a new architecture named HiViT (short for hierarchical ViT), which is simpler and more efficient than Swin yet further improves its performance on fully-supervised and self-supervised visual representation learning. In particular, after being pre-trained with a masked autoencoder (MAE) on ImageNet-1K, HiViT-B reports an 84.6% accuracy on ImageNet-1K classification, a 53.3% box AP on COCO detection, and a 52.8% mIoU on ADE20K segmentation, significantly surpassing the baseline. Code is available at https://github.com/zhangxiaosong18/hivit.

1. INTRODUCTION

Deep neural networks (LeCun et al., 2015) have advanced the research fields of computer vision, natural language processing, etc., in the past decade. Since 2020, the computer vision community has adapted the transformer module from natural language processing (Vaswani et al., 2017; Devlin et al., 2019) to visual recognition, leading to a large family of vision transformers (Dosovitskiy et al., 2021; Liu et al., 2021; Wang et al., 2021; Zhou et al., 2021; Dai et al., 2021; Li et al., 2021a) that replaced the dominance of convolutional neural networks (Krizhevsky et al., 2012; He et al., 2016; Tan & Le, 2019). They are able to formulate long-range feature dependencies, which naturally benefits visual recognition, especially when long-range relationships are important. There are mainly two families of vision transformers, namely, the plain vision transformers (Dosovitskiy et al., 2021; Touvron et al., 2021) and the hierarchical vision transformers (Liu et al., 2021; Wang et al., 2021; Dong et al., 2021a; Chen et al., 2021a), which differ in whether multi-resolution features are used. Intuitively, visual recognition requires hierarchical information, and the hierarchical vision transformers indeed show superior performance. However, the hierarchical vision transformers have introduced complicated and asymmetric operations, e.g., Swin (Liu et al., 2021) used regional self-attention with shifted windows; hence, they encounter difficulties
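The flexibility issue raised by MIM can be made concrete: a plain ViT treats the image as an unordered set of patch tokens, so a random subset of tokens can simply be dropped before encoding, whereas window-based attention assumes a complete spatial grid. Below is a minimal sketch of MAE-style random masking over a token sequence; the function name and the NumPy-based interface are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """MAE-style random masking of patch tokens (illustrative sketch).

    tokens: (N, D) array, one row per patch token from a plain ViT.
    Returns the kept tokens, their indices, and a binary mask
    (0 = visible/kept, 1 = masked out).
    """
    n, _ = tokens.shape
    n_keep = int(n * (1 - mask_ratio))          # number of visible tokens
    rng = np.random.default_rng(seed)
    ids_keep = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=np.int64)
    mask[ids_keep] = 0
    # The encoder runs only on tokens[ids_keep]; this works for a plain
    # ViT because global self-attention accepts any token subset, while
    # shifted-window attention would require the full spatial grid.
    return tokens[ids_keep], ids_keep, mask
```

With a 75% mask ratio, the encoder processes only a quarter of the tokens, which is the efficiency advantage MIM methods such as MAE exploit on plain architectures.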



* Equal Contribution. † Corresponding Author.

