VISION TRANSFORMER ADAPTER FOR DENSE PREDICTIONS

Abstract

This work investigates a simple yet powerful dense prediction task adapter for the Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows the plain ViT to achieve performance comparable to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce image-related inductive biases into the model, making it suitable for these tasks. We verify the ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative to vision-specific transformers and facilitate future research. Code and models will be released at https://github.com/czczup/ViT-Adapter.

1. INTRODUCTION

Recently, transformers have achieved remarkable success in a broad range of computer vision fields. Benefiting from the dynamic modeling capability and long-range dependencies of the attention mechanism, various vision transformers (Dosovitskiy et al., 2020; Chen et al., 2021; Han et al., 2021; Li et al., 2021c; Wu et al., 2022b) quickly rose to prominence in computer vision tasks such as object detection and semantic segmentation, surpassing CNN models and reaching state-of-the-art performance. These models fall mainly into two families: the plain ViT (Dosovitskiy et al., 2020; Touvron et al., 2021) and its hierarchical variants (Dong et al., 2021; Liu et al., 2021b; Wang et al., 2021; 2022a). In general, the latter produce better results, which is commonly attributed to the vision-specific inductive biases they introduce into their architectures via local spatial operations.



Figure 1: Previous paradigm vs. our paradigm. (a) The previous paradigm designs vision-specific models, pre-trains them on large-scale image datasets via supervised or self-supervised learning, and then fine-tunes them on downstream tasks. (b) We propose a pre-training-free adapter to close the performance gap between the plain ViT (Dosovitskiy et al., 2020) and vision-specific transformers (e.g., Swin (Liu et al., 2021b)) on dense prediction tasks. Compared to the previous paradigm, our method preserves the flexibility of ViT and can thus benefit from advanced multi-modal pre-training.


Figure 2: Object detection performance on COCO val2017 using Mask R-CNN. The proposed ViT-Adapter brings significant improvements to plain ViTs. ⋆ indicates using the multi-modal pre-trained ViT from (Zhu et al., 2021). Backbones pre-trained on ImageNet-22K are marked with †, otherwise ImageNet-1K.

Nonetheless, the plain ViT (i.e., the vanilla transformer) still has some non-negligible advantages. A typical example lies in multi-modal pre-training (Zhu et al., 2021; 2022; Wang et al., 2022b). Stemming from the natural language processing (NLP) field, the transformer makes no assumption about its input data. Equipped with different tokenizers, e.g., patch embedding (Dosovitskiy et al., 2020), 3D patch embedding (Liu et al., 2021c), and token embedding (Vaswani et al., 2017), vanilla transformers such as the plain ViT can be pre-trained on massive multi-modal data, including images, videos, and text, which encourages the model to learn semantically rich representations.

However, the plain ViT has notable drawbacks on dense predictions compared to vision-specific transformers. Its lack of image-related prior knowledge leads to slower convergence and lower performance, so plain ViTs struggle to compete with vision-specific transformers (Huang et al., 2021b; Xie et al., 2021; Wang et al., 2022a) on dense prediction tasks. Inspired by adapters (Houlsby et al., 2019; Stickland & Murray, 2019) in the NLP field, this work aims to develop an adapter that closes the performance gap between the plain ViT and vision-specific backbones on dense prediction tasks. To this end, we propose the Vision Transformer Adapter (ViT-Adapter), a pre-training-free additional network that efficiently adapts the plain ViT to downstream dense prediction tasks without modifying its original architecture.
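To make the tokenizer point above concrete, patch embedding reduces to splitting the image into non-overlapping patches and linearly projecting each one into a token. The following is an illustrative NumPy sketch (with a random, untrained projection standing in for learned weights), not the paper's code; the 16×16 patch size and 768-dimensional embedding follow the standard ViT-Base configuration.

```python
import numpy as np

def patch_embed(img, patch_size=16, embed_dim=768, rng=None):
    """ViT-style tokenizer: split an (H, W, C) image into non-overlapping
    patches and project each flattened patch to an embed_dim vector."""
    rng = rng or np.random.default_rng(0)
    H, W, C = img.shape
    ph, pw = H // patch_size, W // patch_size
    # reshape into a grid of patches, then flatten each patch to a vector
    patches = img.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, -1)
    # linear projection (randomly initialized here; learned in practice)
    W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    return patches @ W_proj  # (num_tokens, embed_dim)

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 14x14 grid of tokens
```

Because the transformer body only sees the resulting token sequence, swapping this tokenizer for a 3D patch or text token embedding leaves the backbone unchanged, which is what makes multi-modal pre-training straightforward.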
Specifically, to introduce vision-specific inductive biases into the plain ViT, we design three tailored modules for ViT-Adapter: (1) a spatial prior module that captures local semantics (spatial priors) from input images, (2) a spatial feature injector that incorporates the spatial priors into the ViT, and (3) a multi-scale feature extractor that reconstructs the multi-scale features required by dense prediction tasks.

As shown in Figure 1, compared to the previous paradigm, which pre-trains on large-scale image datasets (e.g., ImageNet (Deng et al., 2009)) and then fine-tunes on other tasks, our paradigm is more flexible. In our framework, the backbone network is a general-purpose model (e.g., a plain ViT) that can be pre-trained not only on images but also on multi-modal data. For transfer learning to dense prediction tasks, we use a randomly initialized adapter to introduce image-related prior knowledge (inductive biases) into the pre-trained backbone, making the model suitable for these tasks. In this way, using ViT as the backbone, our framework achieves comparable or even better performance than vision-specific transformers such as Swin (Liu et al., 2021b).

Our main contributions are as follows:

• We explore a new paradigm for introducing vision-specific inductive biases into the plain ViT. It helps ViT achieve performance comparable to recent transformer variants (Liu et al., 2021b; Wang et al., 2022a) with regular ImageNet pre-training, and it further benefits from multi-modal pre-training.

• We design a spatial prior module and two feature interaction operations to inject image priors without redesigning the architecture of ViT. They supplement the missing local information and reorganize fine-grained multi-scale features for dense prediction tasks.

• We evaluate the ViT-Adapter on multiple challenging benchmarks, including COCO (Lin et al., 2014) and ADE20K (Zhou et al., 2017).
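The interplay of the three modules can be sketched schematically. Below is a minimal single-head NumPy sketch, not the released implementation: the spatial tokens stand in for the output of the spatial prior module (in the paper, a convolutional pyramid at 1/8, 1/16, and 1/32 resolution), and the token counts, 64-dimensional features, single-head attention, and scalar gamma are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d):
    # single-head scaled dot-product attention: queries attend to key/value tokens
    scores = query @ key_value.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ key_value

def inject(vit_tokens, spatial_tokens, gamma=0.5):
    """Spatial feature injector (simplified): ViT tokens query the spatial
    priors and add the attended result back, scaled by gamma (learnable in
    the paper, a fixed scalar here)."""
    d = vit_tokens.shape[-1]
    return vit_tokens + gamma * cross_attention(vit_tokens, spatial_tokens, d)

def extract(vit_tokens, spatial_tokens):
    """Multi-scale feature extractor (simplified): spatial tokens query the
    updated ViT tokens to rebuild the dense multi-scale features."""
    d = spatial_tokens.shape[-1]
    return spatial_tokens + cross_attention(spatial_tokens, vit_tokens, d)

rng = np.random.default_rng(0)
vit = rng.standard_normal((196, 64))       # 14x14 grid of ViT patch tokens
spatial = rng.standard_normal((1029, 64))  # 28x28 + 14x14 + 7x7 pyramid, flattened
vit = inject(vit, spatial)                 # (196, 64): ViT tokens with spatial priors
feats = extract(vit, spatial)              # (1029, 64): multi-scale dense features
```

The key design point this sketch preserves is that both interactions are additive side-paths: the plain ViT's own blocks are untouched, so any pre-trained ViT checkpoint can be plugged in while the adapter is trained from random initialization.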
As shown in Figure 2, our models consistently outperform prior art under fair pre-training settings. For instance, when using only ImageNet-1K pre-training, ViT-Adapter-B reaches 49.6 box AP on COCO val, outperforming Swin-B by 1.0 points. Benefiting from multi-modal pre-training (Peng et al., 2022), our ViT-Adapter-L yields 60.9 box AP, the best record on COCO test-dev without training on extra detection data such as Objects365 (Shao et al., 2019).

