CONSOLIDATOR: MERGEABLE ADAPTER WITH GROUPED CONNECTIONS FOR VISUAL ADAPTATION

Abstract

Recently, transformers have shown strong ability as visual feature extractors, surpassing traditional convolution-based models in various scenarios. However, the success of vision transformers largely owes to their capacity to accommodate numerous parameters, which raises new challenges for adapting a well-trained transformer to downstream tasks. On the one hand, classic fine-tuning tunes all parameters of a huge model for every downstream task and thus easily falls into overfitting, leading to inferior performance. On the other hand, fine-tuning stores a full copy of all parameters per task and is thus often impracticable on resource-limited devices with little storage space. Few works have focused on how to efficiently and effectively transfer knowledge in a vision transformer: existing methods do not dive into the properties of visual features, leading to inferior performance, and some of them incur heavy inference costs despite saving storage. To tackle these problems, we propose consolidator to achieve efficient transfer learning for large vision models. Our consolidator modifies the pre-trained model with a small set of tunable parameters that temporarily store the task-specific knowledge, while the backbone model is kept frozen during adaptation. Motivated by the success of group-wise convolution, we adopt grouped connections across the features extracted by fully connected layers to construct the tunable parts of a consolidator. To further enhance the model's capacity to transfer knowledge under a constrained storage budget while keeping inference efficient, we consolidate the parameters in two stages: (1) between adaptation and storage, and (2) between loading and inference. On a series of downstream visual tasks, our consolidator reaches up to 7.56 points higher accuracy than full fine-tuning with merely 0.35% of the parameters, and outperforms state-of-the-art parameter-efficient tuning methods by a clear margin. Code is available at github.
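To make the mergeable grouped-connection idea concrete, the following is a minimal PyTorch sketch, not the authors' released code: the class name GroupedMergeableAdapter, the groups argument, and the merge() helper are our own illustrative choices. A frozen fully connected layer is augmented with a block-diagonal grouped branch during adaptation, and merge() folds the branch back into a single dense layer so inference incurs no extra cost.

```python
import torch
import torch.nn as nn


class GroupedMergeableAdapter(nn.Module):
    """Illustrative sketch (not the paper's exact design): a frozen linear
    layer plus a parallel grouped-connection branch whose block-diagonal
    weight can be merged into the backbone after adaptation."""

    def __init__(self, linear: nn.Linear, groups: int = 4):
        super().__init__()
        assert linear.in_features % groups == 0
        assert linear.out_features % groups == 0
        self.frozen = linear
        for p in self.frozen.parameters():
            p.requires_grad = False  # backbone stays fixed during adaptation
        self.groups = groups
        # One small weight block per group, zero-initialized so training
        # starts from the pre-trained function; costs 1/groups of a dense layer.
        self.block = nn.Parameter(
            torch.zeros(groups,
                        linear.out_features // groups,
                        linear.in_features // groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.frozen(x)
        # Grouped connection: split channels into groups and apply the
        # per-group weight, i.e. multiply by a block-diagonal matrix.
        xs = x.reshape(*x.shape[:-1], self.groups, -1)
        delta = torch.einsum('...gi,goi->...go', xs, self.block)
        return y + delta.reshape(*y.shape)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # Consolidate: expand the block-diagonal adapter weight and add it
        # to the frozen weight, yielding one ordinary dense layer.
        w_adapter = torch.block_diag(*self.block)  # (out_features, in_features)
        merged = nn.Linear(self.frozen.in_features, self.frozen.out_features,
                           bias=self.frozen.bias is not None)
        merged.weight.copy_(self.frozen.weight + w_adapter)
        if self.frozen.bias is not None:
            merged.bias.copy_(self.frozen.bias)
        return merged
```

Because the branch is linear in its input, the merge is exact: the merged layer computes the same function as the frozen layer plus the grouped branch. This is what allows parameters to be consolidated between loading and inference without any architectural overhead at test time.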

1. INTRODUCTION

Recently, transformer architectures originating from natural language processing (NLP) (Vaswani et al., 2017) have demonstrated considerable capacity in computer vision (Dosovitskiy et al., 2020; Touvron et al., 2021; Liu et al., 2021b). Vision transformers, along with traditional convolutional neural networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016; Simonyan & Zisserman, 2014), are widely used as feature extractors to generate strong and general visual representations by deriving knowledge from massive amounts of images. Thanks to the abundant information in such representations, we can adapt the pre-trained models to downstream tasks with a simple fine-tuning strategy. However, fine-tuning is not a good solution for adaptation. As is well known, the scale of vision models has grown rapidly in recent years. On the one hand, fine-tuning, which tunes all parameters of such a huge model, easily falls into overfitting, leading to inferior performance. On the other hand, fine-tuning imposes a heavy storage burden: since it intensively tunes all parameters, it must maintain a full copy of the model's parameters for each task. Fine-tuning therefore becomes impractical when many tasks need to be adapted, especially in resource-constrained scenarios, e.g., embedded systems.
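For a rough sense of scale (illustrative arithmetic, assuming fp32 storage and a ViT-B/16 backbone with about 86M parameters): a full fine-tuned copy costs roughly 86M x 4 bytes, i.e. about 344 MB per task, so ten tasks already consume several gigabytes. By contrast, storing only 0.35% of the parameters, as the consolidator does, costs roughly 86M x 0.35% x 4 bytes, i.e. about 1.2 MB per task.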

