VISION TRANSFORMER ADAPTER FOR DENSE PREDICTIONS

Abstract

This work investigates a simple yet powerful dense prediction task adapter for the Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows the plain ViT to achieve performance comparable to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce image-related inductive biases into the model, making it suitable for these tasks. We verify the ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields a state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. Code and models will be released at https://github.com/czczup/ViT-Adapter.



Figure 1: Previous paradigm vs. our paradigm. (a) The previous paradigm designs vision-specific models, pre-trains them on large-scale image datasets via supervised or self-supervised learning, and then fine-tunes them on downstream tasks. (b) We propose a pre-training-free adapter to close the performance gap between the plain ViT (Dosovitskiy et al., 2020) and vision-specific transformers (e.g., Swin (Liu et al., 2021b)) for dense prediction tasks. Compared to the previous paradigm, our method preserves the flexibility of ViT and thus can benefit from advanced multi-modal pre-training.

1. INTRODUCTION

Recently, transformers have witnessed remarkable success in a broad range of computer vision fields. Benefiting from the dynamic modeling capability and long-range dependencies of the attention mechanism, various vision transformers (Dosovitskiy et al., 2020; Chen et al., 2021; Han et al., 2021; Li et al., 2021c; Wu et al., 2022b) soon rose to prominence in computer vision tasks such as object detection and semantic segmentation, surpassing CNN models and reaching state-of-the-art performance. These models mainly fall into two families: the plain ViT (Dosovitskiy et al., 2020; Touvron et al., 2021) and its hierarchical variants (Dong et al., 2021; Liu et al., 2021b; Wang et al., 2021; 2022a). In general, the latter produce better results, which is commonly attributed to the vision-specific inductive biases that their local spatial operations introduce into the architecture.

Figure 2: Object detection performance on COCO val2017 using Mask R-CNN. We see that the proposed ViT-Adapter brings significant improvements to plain ViTs. ⋆ indicates using the multi-modal pre-trained ViT from (Zhu et al., 2021). Backbones pre-trained on ImageNet-22K are marked with †, otherwise ImageNet-1K.

Nonetheless, the plain ViT (i.e., the vanilla transformer) still has some non-negligible advantages. A typical example lies in multi-modal pre-training (Zhu et al., 2021; 2022; Wang et al., 2022b). Stemming from the natural language processing (NLP) field, the transformer makes no assumptions about its input data. Equipped with different tokenizers, e.g., patch embedding (Dosovitskiy et al., 2020), 3D patch embedding (Liu et al., 2021c), and token embedding (Vaswani et al., 2017), vanilla transformers such as the plain ViT can use massive multi-modal data for pre-training, including images, video, and text, which encourages the model to learn semantic-rich representations. However, the plain ViT has clear defects in dense prediction compared to vision-specific transformers.
Lacking image-related prior knowledge results in slower convergence and lower performance, and thus plain ViTs struggle to compete with vision-specific transformers (Huang et al., 2021b; Xie et al., 2021; Wang et al., 2022a) on dense prediction tasks. Inspired by adapters in the NLP field (Houlsby et al., 2019; Stickland & Murray, 2019), this work aims to develop an adapter to close the performance gap between the plain ViT and vision-specific backbones for dense prediction tasks. To this end, we propose the Vision Transformer Adapter (ViT-Adapter), a pre-training-free additional network that can efficiently adapt the plain ViT to downstream dense prediction tasks without modifying its original architecture. Specifically, to introduce vision-specific inductive biases into the plain ViT, we design three tailored modules for the ViT-Adapter: (1) a spatial prior module for capturing local semantics (spatial priors) from input images, (2) a spatial feature injector for incorporating spatial priors into the ViT, and (3) a multi-scale feature extractor for reconstructing the multi-scale features required by dense prediction tasks.

As shown in Figure 1, compared to the previous paradigm that pre-trains on large-scale image datasets (e.g., ImageNet (Deng et al., 2009)) and then fine-tunes on other tasks, our paradigm is more flexible. In our framework, the backbone network is a general-purpose model (e.g., the plain ViT) that can be pre-trained not only on images but also on multi-modal data. For transfer learning to dense prediction tasks, we use a randomly initialized adapter to introduce image-related prior knowledge (inductive biases) into the pre-trained backbone, making the model suitable for these tasks. In this way, using ViT as the backbone, our framework achieves comparable or even better performance than vision-specific transformers such as Swin (Liu et al., 2021b).

Our main contributions are as follows:

• We explore a new paradigm to introduce vision-specific inductive biases into the plain ViT. It helps ViT achieve performance comparable to recent transformer variants (Liu et al., 2021b; Wang et al., 2022a) with regular ImageNet pre-training, and further benefit from multi-modal pre-training.

• We design a spatial prior module and two feature interaction operations to inject image priors without redesigning the architecture of ViT. They supplement the missing local information and reorganize fine-grained multi-scale features for dense prediction tasks.

• We evaluate the ViT-Adapter on multiple challenging benchmarks, including COCO (Lin et al., 2014) and ADE20K (Zhou et al., 2017). As shown in Figure 2, our models consistently achieve improved performance compared to prior arts under fair pre-training strategies. For instance, when using only ImageNet-1K pre-training, ViT-Adapter-B reports 49.6 box AP on COCO val, outperforming Swin-B by 1.0 points. Benefiting from multi-modal pre-training (Peng et al., 2022), our ViT-Adapter-L yields 60.9 box AP, which is the best record on COCO test-dev without training on extra detection data such as Objects365 (Shao et al., 2019).

3.1. OVERALL ARCHITECTURE

As illustrated in Figure 4, our model can be divided into two parts. The first part is the plain ViT (Dosovitskiy et al., 2020), which consists of a patch embedding followed by L transformer encoder layers (see Figure 4(a)). The second part is the proposed ViT-Adapter, shown in Figure 4(b), which contains (1) a spatial prior module to capture spatial features from the input image, (2) a spatial feature injector to inject spatial priors into the ViT, and (3) a multi-scale feature extractor to extract hierarchical features from the single-scale features of ViT.

For the ViT, the input image is first fed into the patch embedding layer, where it is divided into 16 × 16 non-overlapping patches. These patches are then flattened and projected to D-dimensional tokens, so the feature resolution is reduced to 1/16 of the original image. The tokens, added with the position embedding, are passed through the L encoder layers.

For the ViT-Adapter, we first feed the input image into the spatial prior module, collecting D-dimensional spatial features at three target resolutions (i.e., 1/8, 1/16, and 1/32). These feature maps are then flattened and concatenated as the input for feature interaction. Specifically, given the number of interactions N (usually N = 4), we evenly split the transformer encoder layers of ViT into N blocks, each containing L/N layers. For the i-th block, we first inject the spatial priors $F^i_{sp}$ into the block via a spatial feature injector, and then extract hierarchical features from the output of the block via a multi-scale feature extractor. After N feature interactions, we obtain high-quality multi-scale features, which we split and reshape into the three target resolutions 1/8, 1/16, and 1/32. Finally, we build the 1/4-scale feature map by upsampling the 1/8-scale feature map with a 2 × 2 transposed convolution.
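The interleaving of ViT blocks and adapter interactions described above can be summarized by the following structural sketch. This is not the authors' implementation; all names are illustrative, and the injector/extractor are passed in as callables.

```python
def vit_adapter_forward(vit_layers, injectors, extractors, f_vit, f_sp, N=4):
    """Structural sketch of one ViT-Adapter forward pass.

    vit_layers: list of L plain ViT encoder layers (callables).
    injectors/extractors: N spatial feature injectors / multi-scale extractors.
    f_vit: ViT tokens (1/16 scale); f_sp: concatenated 1/8+1/16+1/32 spatial tokens.
    """
    L = len(vit_layers)
    per_block = L // N
    for i in range(N):
        # 1) inject spatial priors into the ViT tokens before the i-th block
        f_vit = injectors[i](f_vit, f_sp)
        # 2) run the L/N encoder layers of the i-th block unchanged
        for layer in vit_layers[i * per_block:(i + 1) * per_block]:
            f_vit = layer(f_vit)
        # 3) update the multi-scale tokens from the block's output
        f_sp = extractors[i](f_sp, f_vit)
    # f_sp is later split/reshaped into the 1/8, 1/16, 1/32 feature maps
    return f_vit, f_sp
```

Because the ViT layers are called as-is inside the loop, the adapter leaves the pre-trained backbone architecture untouched.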
In this way, we obtain a feature pyramid with resolutions similar to ResNet (He et al., 2016), which can be used in various dense prediction tasks.

As shown in Figure 4(c), a standard convolutional stem borrowed from ResNet (He et al., 2016) is employed, which consists of three convolutions and a max-pooling layer. Then, we use a stack of stride-2 3×3 convolutions to double the number of channels and reduce the size of the feature maps. Finally, several 1×1 convolutions are applied at the end to project the feature maps to D dimensions. We thus obtain a feature pyramid $\{F_1, F_2, F_3\}$, which contains D-dimensional feature maps with resolutions of 1/8, 1/16, and 1/32. Then, we flatten and concatenate these feature maps into the feature tokens $F^1_{sp} \in \mathbb{R}^{(\frac{HW}{8^2} + \frac{HW}{16^2} + \frac{HW}{32^2}) \times D}$, which serve as the input for feature interaction.
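A plausible PyTorch re-implementation of this stem is sketched below. The intermediate channel widths (64 → 128 → 256) are our assumption, not taken from the paper's configuration tables; only the stride pattern and the final 1×1 projections to D follow the text.

```python
import torch
import torch.nn as nn

class SpatialPriorModule(nn.Module):
    """Sketch of the SPM: ResNet-style stem + stride-2 convs + 1x1 projections."""
    def __init__(self, inplanes=64, embed_dim=384):
        super().__init__()
        def conv_bn(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        # stem: three 3x3 convs and a max-pooling layer -> stride 4
        self.stem = nn.Sequential(
            conv_bn(3, inplanes, 2), conv_bn(inplanes, inplanes, 1),
            conv_bn(inplanes, inplanes, 1), nn.MaxPool2d(3, stride=2, padding=1))
        # stride-2 3x3 convs: double channels, halve resolution (1/8, 1/16, 1/32)
        self.conv2 = conv_bn(inplanes, 2 * inplanes, 2)
        self.conv3 = conv_bn(2 * inplanes, 4 * inplanes, 2)
        self.conv4 = conv_bn(4 * inplanes, 4 * inplanes, 2)
        # 1x1 convs project each level to the ViT embedding dimension D
        self.fc1 = nn.Conv2d(2 * inplanes, embed_dim, 1)
        self.fc2 = nn.Conv2d(4 * inplanes, embed_dim, 1)
        self.fc3 = nn.Conv2d(4 * inplanes, embed_dim, 1)

    def forward(self, x):
        c2 = self.conv2(self.stem(x))   # 1/8
        c3 = self.conv3(c2)             # 1/16
        c4 = self.conv4(c3)             # 1/32
        feats = [self.fc1(c2), self.fc2(c3), self.fc3(c4)]
        # flatten and concatenate into F_sp^1: (B, HW/8^2 + HW/16^2 + HW/32^2, D)
        return torch.cat([f.flatten(2).transpose(1, 2) for f in feats], dim=1)
```

For a 224×224 input with D = 384, this yields 784 + 196 + 49 = 1029 spatial tokens.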

3.3. FEATURE INTERACTION

Due to weak prior assumptions, the plain ViT suffers from sub-optimal performance on dense prediction tasks compared to vision-specific transformers (Chu et al., 2021a; Dong et al., 2021; Liu et al., 2021b; Wang et al., 2022a). To alleviate this issue, we propose two feature interaction modules that bridge the feature maps of our SPM and the ViT. Specifically, the two modules are mainly based on cross-attention, namely the Spatial Feature Injector and the Multi-Scale Feature Extractor.

Spatial Feature Injector. As shown in Figure 4(d), this module is used to inject the spatial priors into ViT. Specifically, for the i-th block of the ViT, we take the input feature $F^i_{vit} \in \mathbb{R}^{\frac{HW}{16^2} \times D}$ as the query, and the spatial feature $F^i_{sp} \in \mathbb{R}^{(\frac{HW}{8^2} + \frac{HW}{16^2} + \frac{HW}{32^2}) \times D}$ as the key and value. We use cross-attention to inject the spatial feature $F^i_{sp}$ into the input feature $F^i_{vit}$, which can be written as Eqn. 1:

$$\hat{F}^i_{vit} = F^i_{vit} + \gamma^i \cdot \mathrm{Attention}(\mathrm{norm}(F^i_{vit}), \mathrm{norm}(F^i_{sp})), \quad (1)$$

where norm(·) is LayerNorm (Ba et al., 2016), and the attention layer Attention(·) is suggested to be a sparse attention. In addition, we apply a learnable vector $\gamma^i \in \mathbb{R}^D$, initialized with 0, to balance the attention layer's output and the input feature $F^i_{vit}$. This initialization strategy ensures that the feature distribution of $F^i_{vit}$ is not modified drastically by the injection of spatial priors, thus making better use of the pre-trained weights of ViT.

Multi-Scale Feature Extractor. After injecting the spatial priors into the ViT, we obtain the output feature $F^{i+1}_{vit}$ by passing $\hat{F}^i_{vit}$ through the encoder layers of the i-th block. Then, we apply a module consisting of a cross-attention layer and a feed-forward network (FFN) to extract multi-scale features, as shown in Figure 4(e). This process can be formulated as Eqn. 2:

$$\hat{F}^i_{sp} = F^i_{sp} + \mathrm{Attention}(\mathrm{norm}(F^i_{sp}), \mathrm{norm}(F^{i+1}_{vit})), \qquad F^{i+1}_{sp} = \hat{F}^i_{sp} + \mathrm{FFN}(\mathrm{norm}(\hat{F}^i_{sp})), \quad (2)$$

in which we use the spatial feature $F^i_{sp} \in \mathbb{R}^{(\frac{HW}{8^2} + \frac{HW}{16^2} + \frac{HW}{32^2}) \times D}$ as the query, and the output feature $F^{i+1}_{vit} \in \mathbb{R}^{\frac{HW}{16^2} \times D}$ as the key and value for cross-attention. As with the spatial feature injector, we adopt sparse attention here to reduce the computational cost. The generated spatial feature $F^{i+1}_{sp}$ is used as the input of the next spatial feature injector.
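A minimal PyTorch sketch of the two modules follows. For clarity we use ordinary multi-head cross-attention in place of the sparse (deformable) attention the paper adopts, and the module names are illustrative; the zero-initialized $\gamma$ from Eqn. 1 and the 0.25 FFN ratio from Section 3.4 are taken from the text.

```python
import torch
import torch.nn as nn

class Injector(nn.Module):
    """Eqn. 1: F_vit + gamma * Attn(norm(F_vit), norm(F_sp)); dense attn for clarity."""
    def __init__(self, dim, heads=6):
        super().__init__()
        self.norm_q, self.norm_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))  # zero-init: identity at start

    def forward(self, f_vit, f_sp):
        q, kv = self.norm_q(f_vit), self.norm_kv(f_sp)
        out, _ = self.attn(q, kv, kv)  # query: ViT tokens; key/value: spatial tokens
        return f_vit + self.gamma * out

class Extractor(nn.Module):
    """Eqn. 2: cross-attention (query: F_sp) followed by a thin FFN (ratio 0.25)."""
    def __init__(self, dim, heads=6, ffn_ratio=0.25):
        super().__init__()
        self.norm_q, self.norm_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        hidden = int(dim * ffn_ratio)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, f_sp, f_vit):
        q, kv = self.norm_q(f_sp), self.norm_kv(f_vit)
        out, _ = self.attn(q, kv, kv)  # query: spatial tokens; key/value: ViT tokens
        f_sp = f_sp + out
        return f_sp + self.ffn(self.norm_ffn(f_sp))
```

With the zero-initialized $\gamma$, the injector is an exact identity on the ViT tokens at the start of fine-tuning, which is precisely the property the paragraph above motivates.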

3.4. ARCHITECTURE CONFIGURATIONS

We build our ViT-Adapter for 4 different sizes of ViT: ViT-T, ViT-S, ViT-B, and ViT-L. For these models, the parameter counts of our adapters are 2.5M, 5.8M, 14.0M, and 23.7M, respectively. We employ deformable attention (Zhu et al., 2020) as the default sparse attention in our method, where the number of sampling points is fixed to 4 and the number of attention heads is set to 6, 6, 12, and 16, respectively. The number of interactions N is 4, and in the last feature interaction we stack three multi-scale feature extractors. Besides, we set the FFN ratio in our adapter to 0.25 to save computational overhead, i.e., the hidden sizes of the FFN are 48, 96, 192, and 256 for the 4 adapters. More details of each configuration are shown in Table 10 in Appendix B.
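For reference, the per-size settings above can be transcribed as a plain table. The embedding dimensions are the standard ViT-T/S/B/L widths implied by the stated hidden sizes and the 0.25 FFN ratio; they are not stated explicitly in this section.

```python
# Adapter configurations per ViT size (transcribed from Section 3.4;
# embed_dim values are the standard ViT widths implied by the 0.25 ratio).
ADAPTER_CONFIGS = {
    "ViT-T": dict(embed_dim=192,  heads=6,  ffn_hidden=48,  adapter_params_M=2.5),
    "ViT-S": dict(embed_dim=384,  heads=6,  ffn_hidden=96,  adapter_params_M=5.8),
    "ViT-B": dict(embed_dim=768,  heads=12, ffn_hidden=192, adapter_params_M=14.0),
    "ViT-L": dict(embed_dim=1024, heads=16, ffn_hidden=256, adapter_params_M=23.7),
}
# Shared across all sizes: N = 4 interactions, 4 deformable-attention sampling
# points, and three stacked extractors in the last interaction.
for cfg in ADAPTER_CONFIGS.values():
    # sanity check: hidden FFN size == embed_dim * 0.25
    assert cfg["ffn_hidden"] == int(cfg["embed_dim"] * 0.25)
```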

4. EXPERIMENTS

Previous work (Wang et al., 2021) has shown that the pyramid prior is beneficial to dense prediction but brings little gain to image classification. Therefore, in this study, we focus on how to better adapt readily available pre-trained ViTs to dense prediction tasks. We hope this method will also help decouple the model designs of upstream pre-training and downstream fine-tuning.

Results with Multi-Modal Pre-training. In this experiment, we study the effect of multi-modal pre-training. Specifically, we fine-tune ViT-Adapter-B with Mask R-CNN for the 3×+MS schedule using different pre-trained weights. As shown in Table 4, simply replacing the ImageNet-22K pre-training (Steiner et al., 2021) with multi-modal pre-training (Zhu et al., 2021) gives a significant gain of 0.7 AP^b and AP^m. These results indicate that our method can easily derive considerable benefits from advanced multi-modal pre-training, which is difficult for vision-specific models like Swin.

4.2. SEMANTIC SEGMENTATION

Settings. We evaluate our ViT-Adapter on semantic segmentation with the ADE20K (Zhou et al., 2017) dataset and the MMSegmentation (Contributors, 2020) codebase. Both Semantic FPN (Kirillov et al., 2019) and UperNet (Xiao et al., 2018) are employed as the basic frameworks. For Semantic FPN, we apply the settings of PVT (Wang et al., 2021) and train the models for 80k iterations. For UperNet, we follow the settings of Swin (Liu et al., 2021b) and train it for 160k iterations.

Results with ImageNet-1K Pre-training. In Table 3, we report the semantic segmentation results in terms of single-scale and multi-scale (MS) mIoU. As in Section 4.1, we initialize all ViT-T/S/B models with the ImageNet-1K weights released by DeiT (Touvron et al., 2021). Under comparable model sizes, our method surpasses ViT (Li et al., 2021b) and many representative vision-specific transformers (Wang et al., 2021; 2022a; Liu et al., 2021b; Chu et al., 2021a). For instance, our ViT-Adapter-S achieves 47.1 MS mIoU with UperNet, outperforming many strong counterparts such as Swin-T. Similarly, ViT-Adapter-B reports a competitive 49.7 MS mIoU, which is 2.6 points higher than ViT-B and on par with Swin-B and Twins-SVT-L. These fair comparisons using only regular ImageNet-1K pre-training (Touvron et al., 2021) demonstrate the effectiveness and universality of our ViT-Adapter.

Results with ImageNet-22K Pre-training. When using the ImageNet-22K pre-trained weights (Steiner et al., 2021), our ViT-Adapter-B† attains 51.9 mIoU and 52.5 MS mIoU with UperNet, exceeding Swin-B† by at least 0.8 mIoU. Similarly, ViT-Adapter-L† yields 53.4 mIoU and 54.4 MS mIoU, clearly surpassing counterparts such as Swin-L†. These significant and consistent improvements across model sizes suggest that our method can remedy the shortcomings of the plain ViT, making it more suitable for semantic segmentation.

Results with Multi-Modal Pre-training.
Here, we apply the multi-modal pre-trained weights from Uni-Perceiver (Zhu et al., 2021) for semantic segmentation. As shown in Table 3, for Semantic FPN and UperNet, replacing the ImageNet-22K pre-training with multi-modal pre-training benefits our ViT-Adapter-L⋆ with impressive gains of 1.3 mIoU and 1.6 mIoU, respectively.

Results. As shown in Table 5, our method reaches state-of-the-art performance. While these results may be partly due to the effectiveness of advanced pre-training, our study demonstrates that plain-backbone detectors/segmenters can challenge the entrenched position of hierarchical backbones.

4.4. ABLATION STUDY

ViT vs. ViT-Adapter Feature. Recent works (Park & Kim, 2022; Si et al., 2022) show that ViT tends to learn low-frequency global signals, while CNNs tend to extract high-frequency information (e.g., local edges and textures). To show the difference between the features of ViT and ViT-Adapter, we first use Fourier analysis as a visualization toolkit. As shown in Figure 5(a), the Fourier spectrum and the relative log amplitudes of the Fourier-transformed feature maps (averaged over 100 images) indicate that ViT-Adapter captures more high-frequency signals than the ViT (Li et al., 2021b) baseline. In addition, we visualize the stride-8 feature map in Figure 5(b)(c), which shows that the features of ViT are blurry and coarse. In contrast, our features are more fine-grained, with more local edges and textures. This observation demonstrates that our method grafts CNN's strength in capturing high-frequency information onto ViT.

Ablation for Components. To investigate the contribution of each key design, we gradually extend the ViT-S baseline (Li et al., 2021b) to our ViT-Adapter-S. All models are trained with Mask R-CNN for the 1× schedule. As shown on the left side of Table 6, by directly resizing and adding the spatial features from the SPM, variant 1 improves over the baseline by 1.4 AP^b and 0.9 AP^m, showing that local spatial information is essential for dense prediction. From variant 2, we find that the spatial feature injector further boosts performance by 1.0 AP^b and 0.8 AP^m. This observation illustrates that cross-attention is a more flexible way to inject spatial features. Moreover, employing the multi-scale feature extractor to reconstruct hierarchical features brings gains of 2.1 AP^b and 1.1 AP^m, alleviating ViT's drawback of single-scale features. In summary, our proposed components are each necessary and collectively yield improvements of 4.5 AP^b and 2.8 AP^m.
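The relative log-amplitude analysis used for Figure 5(a) can be sketched as follows. This is a hypothetical re-implementation following the common recipe from Park & Kim (2022); the exact normalization and image averaging used in the paper may differ.

```python
import numpy as np

def relative_log_amplitude(feat):
    """feat: (C, H, W) feature map -> 1D log-amplitude profile over frequency radius,
    reported relative to the zero-frequency (DC) component."""
    # 2D FFT per channel, zero frequency shifted to the center
    amp = np.abs(np.fft.fftshift(np.fft.fft2(feat), axes=(-2, -1)))
    log_amp = np.log(amp.mean(axis=0) + 1e-6)  # average amplitude over channels
    h, w = log_amp.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2).astype(int)
    # radial average: one value per integer frequency radius
    profile = np.array([log_amp[radius == r].mean()
                        for r in range(min(cy, cx) + 1)])
    return profile - profile[0]  # relative to the DC bin
```

Averaging such profiles over feature maps from many images gives curves like Figure 5(a): a slower fall-off at high radii indicates more high-frequency content.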

Number of Interactions.

On the right side of Table 6, we study the effect of the number of interactions by building several ViT-Adapter-S variants with different N. We observe that the accuracy saturates as N grows, and applying more interactions does not monotonically improve performance. Therefore, we empirically set N to 4 by default.

Attention Type. Our method is a general framework in which the attention mechanism is replaceable. To verify this, we adopt ViT-Adapter-S as the basic model and study 4 different attention mechanisms. As shown in Table 7, sparse attention with linear complexity is more suitable for our adapter than global attention with quadratic complexity. We ultimately adopt deformable attention (Zhu et al., 2020) as the default configuration. Notably, it can be replaced by more advanced attention mechanisms in the future to further boost performance.

5. CONCLUSION

This work explores a new paradigm, namely ViT-Adapter, to bridge the performance gap between the plain ViT and vision-specific transformers on dense prediction tasks. Without modifying the inherent architecture, we flexibly inject image-related inductive biases into the ViT and reconstruct fine-grained multi-scale features required by dense predictions. Extensive experiments on object detection, instance segmentation, and semantic segmentation show that our method can achieve comparable or even better performance than well-designed vision-specific transformers, and further derive considerable benefits from advanced multi-modal pre-training.



¹ In ViTDet, using regular ImageNet-22K pre-training instead of MAE (He et al., 2021) drops 4.0 box AP.



To date, adapters have been widely used in the NLP field. PALs (Stickland & Murray, 2019) and Adapters (Houlsby et al., 2019) introduce new modules into transformer encoders for task-specific fine-tuning, allowing pre-trained models to quickly adapt to downstream NLP tasks. In the field of computer vision, some adapters have been proposed for incremental learning (Rosenfeld & Tsotsos, 2018) and domain adaptation (Rebuffi et al., 2017; 2018). With the advent of CLIP (Radford et al., 2021), many CLIP-based adapters (Gao et al., 2021; Sung et al., 2021; Zhang et al., 2021) were presented to transfer pre-trained knowledge to zero-shot or few-shot downstream tasks. Recently, Li et al. (2021b) and ViTDet (Li et al., 2022b) employed some upsampling and downsampling modules to adapt the plain ViT for object detection, as shown in Figure 3(a). However, under regular training settings (i.e., ImageNet supervised pre-training and fine-tuning for 36 epochs), their detection performance is still inferior¹ to recent models (Chu et al., 2021b; Dong et al., 2021; Wang et al., 2022a; Wu et al., 2022b) that effectively combine image priors. Therefore, designing a powerful dense prediction task adapter for ViT remains challenging.

Figure 4: Overall architecture of ViT-Adapter. (a) The ViT, whose encoder layers are divided into N (usually N = 4) equal blocks for feature interaction. (b) Our ViT-Adapter, which contains three key designs, including (c) a spatial prior module for modeling local spatial contexts from the input image, (d) a spatial feature injector for introducing spatial priors into ViT, and (e) a multi-scale feature extractor for reorganizing multi-scale features from the single-scale features of ViT.

3.2. SPATIAL PRIOR MODULE

Recent studies (Wang et al., 2022a; Wu et al., 2021; Fang et al., 2022; Park & Kim, 2022) show that convolutions can help transformers better capture local spatial information. Inspired by this, we introduce the Spatial Prior Module (SPM). It is designed to model the local spatial contexts of images in parallel with the patch embedding layer, so as not to alter the original architecture of ViT.

ViT-S and ViTDet-S are 3.8 AP^b and 3.3 AP^b lower than PVTv2-B2 (Wang et al., 2022a), respectively. In contrast, our ViT-Adapter-S outperforms both approaches by clear margins and is even 0.4 AP^b higher than PVTv2-B2. This observation also holds in the experiments with three other detectors, including Cascade Mask R-CNN, ATSS, and GFL. These results indicate that, with only regular ImageNet-1K pre-training, ViT-Adapter can promote the plain ViT to attain performance similar or even superior to these vision-specific transformers.

Results with ImageNet-22K Pre-training. In Table 1, we employ the ImageNet-22K pre-trained weights from AugReg (Steiner et al., 2021) to initialize all ViT-L models, including ViT (Li et al., 2021b), ViTDet (Li et al., 2022b), and our ViT-Adapter. When training Mask R-CNN with the 3×+MS schedule, our ViT-Adapter-L† brings improvements of 3.8 AP^b and 3.0 AP^b over ViT-L† (Li et al., 2021b) and ViTDet-L† (Li et al., 2022b), respectively.

Table 4: Comparison of different pre-trained weights. Our method retains the flexibility of ViT and thus could benefit from advanced multi-modal pre-training (Zhu et al., 2021).

4.3. COMPARISONS WITH STATE-OF-THE-ART

Settings. We conduct experiments combining our ViT-Adapter with state-of-the-art detection/segmentation frameworks, including HTC++ (Liu et al., 2021b) (without extra detection data) and Mask2Former (Cheng et al., 2021), as well as the recent multi-modal pre-training BEiTv2 (Peng et al., 2022). The experimental settings are listed in Appendix A.1 and A.2.

Figure 5: ViT vs. ViT-Adapter Feature. (a) Relative log amplitudes of Fourier transformed feature maps. (b) Detection results. (c) Stride-8 feature map. Compared to the ViT baseline (Li et al., 2021b), our ViT-Adapter captures more high-frequency signals, and produces more fine-grained features with rich edges and textures, which is of great help for dense prediction.




Our detection experiments are based on MMDetection (Chen et al., 2019b) and the COCO (Lin et al., 2014) dataset. We use 4 mainstream detectors to evaluate our ViT-Adapter, including Mask R-CNN, Cascade Mask R-CNN, ATSS, and GFL.

Tables 1 and 2: Object detection with different frameworks on COCO val2017. For a fair comparison, we initialize all ViT-S/B models with regular ImageNet-1K pre-training (Touvron et al., 2021). "#P" denotes the number of parameters. "MS" means multi-scale training.

Results with ImageNet-1K Pre-training. In Table 1 and Table 2, we apply the ImageNet-1K weights released by DeiT (Touvron et al., 2021) to initialize the models.

Table 3: Semantic segmentation on ADE20K val. Semantic FPN (Kirillov et al., 2019) and UperNet (Xiao et al., 2018) are used as segmentation frameworks. "IN-1K/22K" and "MM" represent ImageNet-1K/22K and multi-modal pre-training, respectively. "MS" means multi-scale testing.

Table 5: Comparison with previous SOTA.

N    AP^b   AP^m   #Param
0    40.2   37.1   43.8M
1    43.2   38.9   45.5M
2    43.9   39.4   46.2M
4    44.7   39.9   47.8M
6    44.7   39.8   49.4M

Table 6: Ablation studies of ViT-Adapter. (Left) Ablation of key components; our proposed components collectively bring gains of 4.5 AP^b and 2.8 AP^m. (Right) Ablation of the number of interactions N; the model gives the best performance when N = 4. SPM is short for the spatial prior module.

Table 7: Ablation of using different attention mechanisms in our adapter. The per-iteration training time and GPU training memory are measured on A100 GPUs with a per-GPU batch size of 2 and FP16 training. "*" indicates using activation checkpointing to save training memory.

ACKNOWLEDGEMENT

This work is partly supported by the National Natural Science Foundation of China (Grant No. 61672273, 61832008), and the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100).

APPENDIX

As reported in Table 8, our ViT-Adapter-L (w/ BEiT) achieves 60.4 AP^b and 52.5 AP^m on COCO test-dev, and ViT-Adapter-L (w/ BEiTv2) further raises this record to 60.9 AP^b and 53.0 AP^m. Notably, although it is not a perfectly controlled comparison, our method attains similar performance with fewer training epochs (36 vs. 100) than ViTDet (Li et al., 2022b). We argue that a longer training schedule such as 100 epochs may bring an added bonus, but it is expensive to afford with limited computing resources. In summary, from a system-level perspective, our ViT-Adapter can enjoy the dividends of various advanced pre-training techniques and help the plain ViT achieve leading performance on object detection and instance segmentation tasks.

A.2 SEMANTIC SEGMENTATION

Settings. For semantic segmentation, we employ the AdamW optimizer with an initial learning rate of 2×10⁻⁵, a batch size of 16, and a weight decay of 0.05. A layer-wise learning rate decay of 0.9 and a drop path rate of 0.4 are used to train the models. Other training settings, such as pre-training techniques, crop size, and the number of iterations, are listed in Table 9.

Results with More Advanced Pre-training. As can be seen from Table 9, when training with UperNet for 160k iterations, our ViT-Adapter-L (w/ BEiT) yields 58.4 MS mIoU, outperforming BEiT-L by 1.4 points with only 10M additional parameters. This shows that our adapter can deliver significant benefits even for a powerful self-supervised pre-trained ViT. Furthermore, we compare the performance of our method with vision-specific models that also use additional datasets, such as SwinV2-G (Liu et al.).

Figure 6: TIDE error type analysis (the lower the better). We use the models listed in Table 1 for analysis. As defined in (Bolya et al., 2020), we plot the AP^b metric at an IoU threshold of 0.5. The bars show the effect of each error type on overall detection performance. The error types are: cls: localized correctly but classified incorrectly; loc: classified correctly but localized incorrectly; both: classified incorrectly and localized incorrectly; dupe: detection would be correct if not for a higher-scoring detection; bkg: detected background as foreground; miss: all undetected ground truth not covered by other error types; FN: false negatives; FP: false positives. We observe that our ViT-Adapter makes fewer localization and miss errors than the ViT baseline (Li et al., 2021b), and incurs fewer false positive and false negative errors.
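The layer-wise learning rate decay in the segmentation settings above follows the usual BEiT-style scheme: parameters in shallower layers get smaller learning rates. A pure-Python sketch is given below; the layer indexing (0 = patch embedding, last = head) is a common convention, not spelled out in the paper.

```python
def layerwise_lrs(base_lr=2e-5, decay=0.9, num_layers=24):
    """Per-layer learning rates with layer-wise decay.

    Index 0 = patch embedding, 1..num_layers = encoder layers,
    num_layers + 1 = task head (no decay applied).
    Layer l gets base_lr * decay ** (num_layers + 1 - l).
    """
    return [base_lr * decay ** (num_layers + 1 - i)
            for i in range(num_layers + 2)]
```

These per-layer rates would then be assigned to parameter groups of an AdamW optimizer; note that the base learning rate (2×10⁻⁵) applies to the head, while the patch embedding receives the smallest rate.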

Results with

learning strategy for semantic segmentation. Specifically, we use the COCO-Stuff (Caesar et al., 2018) dataset for 80k iterations of pre-training, and then ADE20K for 80k iterations of fine-tuning. The total number of iterations is still 160k, and no additional training overhead is added. Under this setting, our ViT-Adapter-L (w/ BEiT) produces an exciting score of 60.5 MS mIoU. Further, ViT-Adapter-L (w/ BEiTv2) sets a new record of 61.5 MS mIoU, which is slightly better than FD-SwinV2-G (Wei et al., 2022), while the parameter count is much smaller (571M vs. 3.0B). It is worth noting that our ViT-Adapter is also adopted by the recently proposed BEiT-3 (Wang et al., 2022b), a ViT-style foundation model that can be pre-trained with multi-modal data. As described in their paper, using ViT-Adapter for the transfer learning of semantic segmentation, BEiT-3 establishes a new state-of-the-art of 62.8 MS mIoU on ADE20K val, a convincing verification of the paradigm we present in Figure 1.

B ADDITIONAL ABLATION AND DISCUSSION

Architecture Configurations. More detailed configurations are listed in Table 10.

TIDE Error Type Analysis. TIDE (Bolya et al., 2020) is a toolbox for analyzing the sources of error in object detection algorithms. Following (Li et al., 2021b), we show the error type analysis in Figure 6. For a fair comparison, the models listed in Table 1 are adopted for analysis. These results reveal where our ViT-Adapter improves overall AP^b relative to the ViT baseline (Li et al., 2021b). For instance, we observe that our adapter helps reduce missed and localization errors, and has a substantial effect on fixing false negative and false positive errors.

Feature Visualization. We plot more visualizations of feature maps produced by ViT-B (Li et al., 2021b) and our ViT-Adapter-B in Figure 7 and Figure 8, which are trained with Mask R-CNN for detection and UperNet for segmentation, respectively. As can be seen, the features of ViT-B are blurry and coarse, while ours are more refined, with more local edges and textures. This observation accords with the Fourier analysis in Section 4.4, demonstrating that ViT tends to capture low-frequency information and that our ViT-Adapter can supplement the missing high-frequency signals.

Comparison with SETR. Like ViTDet (Li et al., 2022b), SETR (Zheng et al., 2021) also changes the shape of ViT's features according to the task prior (see Figure 3(a)), thus allowing ViT to achieve better segmentation performance. Although this paradigm shares some similarities with our approach, e.g., combining ViT and convolutions, there are three main differences: (1) In addition to the task prior, our method also takes the information of the input image (the input prior) into consideration when adapting ViT to dense prediction tasks; (2) The input prior constantly interacts with ViT's features, making the output features more suitable for dense prediction tasks; (3) Our method is an adapter that generalizes to both detection and segmentation tasks, and moreover achieves better results than the segmentation-specific head of SETR (Zheng et al., 2021).

Comparison with other Adapters. We would like to clarify the differences between ViT-Adapter and other adapters (Jia et al.), which aim at parameter-efficient transfer learning. In contrast, the goal of our ViT-Adapter is to push the performance boundaries of plain ViT in downstream applications, make ViT more general for downstream tasks, and efficiently utilize large-scale ViT weights pre-trained in different ways. We argue that these two technical lines are orthogonal, as shown in the last column of Table 11. Combining ViT-Adapter with these adapters to achieve efficient and accurate transfer learning for dense prediction is a research topic worth exploring.

ViTDet's Performance. The higher performance of the original ViTDet (Li et al., 2022b) comes from stronger training settings. Specifically, ViTDet adopts a more expensive training strategy than ours, i.e., loading the MAE (He et al., 2021) pre-trained weights and using Large Scale Jitter (Ghiasi et al., 2021) augmentation to train the model for 100 epochs. This setting incurs almost 3 times the training cost of the commonly used 36 epochs (i.e., the 3×+MS schedule).
To some extent, this reveals that the lack of image-related inductive biases in ViT leads to slow convergence on dense prediction tasks. For fair comparisons, we benchmark all plain-ViT detectors, including ViTDet (Li et al., 2022b) and our ViT-Adapter, under the commonly used 3×+MS training schedule, and use the same ImageNet-1K pre-trained weights (i.e., DeiT) as initialization. It makes sense that our ViT-Adapter achieves better performance than ViTDet under this setting, because our adapter injects image-related priors into the plain ViT, which speeds up convergence and improves performance.

