CONSOLIDATOR: MERGEABLE ADAPTER WITH GROUPED CONNECTIONS FOR VISUAL ADAPTATION

Abstract

Recently, transformers have shown strong ability as visual feature extractors, surpassing traditional convolution-based models in various scenarios. However, the success of vision transformers largely owes to their capacity to accommodate numerous parameters. As a result, new challenges arise when adapting a well-trained transformer to downstream tasks. On the one hand, classic fine-tuning tunes all parameters of a huge model for every downstream task and thus easily falls into overfitting, leading to inferior performance. On the other hand, on resource-limited devices, fine-tuning stores a full copy of all parameters and is thus usually impracticable given the shortage of storage space. Yet few works have focused on how to efficiently and effectively transfer knowledge in a vision transformer. Existing methods do not dive into the properties of visual features, leading to inferior performance, and some of them bring heavy inference cost despite benefiting storage. To tackle these problems, we propose the consolidator to achieve efficient transfer learning for large vision models. Our consolidator modifies the pre-trained model with a small set of additional tunable parameters that temporarily store the task-specific knowledge, while the backbone model is frozen during adaptation. Motivated by the success of group-wise convolution, we adopt grouped connections across the features extracted by fully connected layers to construct the tunable parts of a consolidator. To further enhance the model's capacity to transfer knowledge under a constrained storage budget while keeping inference efficient, we consolidate the parameters in two stages: 1. between adaptation and storage, and 2. between loading and inference. On a series of downstream visual tasks, our consolidator reaches up to 7.56 higher top-1 accuracy than full fine-tuning with merely 0.35% parameters, and outperforms state-of-the-art parameter-efficient tuning methods by a clear margin. 
Code is available at github.

1. INTRODUCTION

Recently, transformer architectures originating from natural language processing (NLP) (Vaswani et al., 2017) have demonstrated considerable capacity in computer vision (Dosovitskiy et al., 2020; Touvron et al., 2021; Liu et al., 2021b). Vision transformers, along with traditional convolutional neural networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016; Simonyan & Zisserman, 2014), are widely used as feature extractors that produce strong and general visual representations by deriving knowledge from massive image collections. Thanks to the abundant information in such representations, we can adapt pre-trained models to downstream tasks with a simple fine-tuning strategy. However, fine-tuning is not an ideal solution for adaptation. As is well known, the scale of vision models has grown faster and faster in recent years. On the one hand, fine-tuning, which tunes all parameters of such a huge model, easily falls into overfitting, leading to inferior performance. On the other hand, fine-tuning inflicts a heavy storage burden: since it intensively tunes all parameters, it must maintain a full copy of the model's parameters for each task. Fine-tuning can therefore cause a huge storage burden when there are many tasks to adapt to, making it impractical in real-world, resource-constrained scenarios, e.g., embedded systems. Efforts have been made to improve the performance and reduce the storage overhead of fine-tuning. For example, adapter (Houlsby et al., 2019; Karimi Mahabadi et al., 2021), prompt tuning (Li & Liang, 2021; Lester et al., 2021; Zhou et al., 2021), and LoRA (Hu et al., 2021) inject tunable parameters and freeze the backbone during adaptation. 
In the vision field, VPT (Jia et al., 2022) directly leverages learnable prompts, AdaptFormer (Chen et al., 2022) adopts parallel adapters, NOAH (Zhang et al., 2022) searches for the optimal combination of three representative modules, i.e., adapter, LoRA, and VPT, and SSF (Lian et al., 2022b) uses additional scaling and shifting parameters for adaptation. Despite their acceptable performance, existing methods suffer from two common conflicts: 1. the trade-off between inference efficiency and adaptation performance, and 2. the trade-off between adaptation performance and the number of stored parameters. Previous works (Houlsby et al., 2019) show that introducing more tunable parameters can achieve more fruitful results. However, extra parameters bring significantly larger computation and storage costs, resulting in low inference efficiency and more storage space. Therefore, one essential question arises: can we design a module that shares the same inference cost as the ordinary model while enjoying superior capacity compared to existing methods? In this paper, we propose a generic module, dubbed consolidator, to tackle the aforementioned issues. The proposed consolidator is designed as a mergeable adapter that accompanies the fully connected (FC) layers of a vision model. Specifically, to enrich the model capacity under a limited parameter budget, we take inspiration from the success of group-wise convolution (Howard et al., 2017; Ma et al., 2018; Liu et al., 2022) and build our consolidator from grouped connected (GC) layers. To enhance flexibility, we further reorder the channels for each group connection, followed by a droppath regularizer. Benefiting from the inference-time linearity of the GC, channel reorder, and droppath operations, the proposed consolidator can be perfectly consolidated into the original FC layer of a vision model, incurring no extra inference cost. 
Our consolidator can easily be expanded into a multi-branch topology without breaking linearity. Practically, we can simultaneously equip several GC layers with channel reordering for communication between different groups of feature channels. After adaptation, we first consolidate the multi-branch GC layers into one single sparse parameter matrix and store that sparse matrix for each task. This property enhances the model's transferability and achieves a considerable storage reduction when the number of tasks scales up. During inference, the sparse parameter matrix can be merged into the backbone model as well, resulting in no extra inference cost. Thanks to this twofold consolidation, the proposed consolidator greatly promotes efficient and effective visual adaptation. To verify the superiority of the consolidator, we conduct extensive experiments and analysis on a series of downstream recognition tasks. Experimental results show that our consolidator can surpass full fine-tuning by 7.56 top-1 accuracy with merely 0.35% parameters per task. Compared to state-of-the-art methods such as NOAH, AdaptFormer, and SSF, our method consistently reaches better performance while incurring no extra inference cost. On other fundamental visual tasks, i.e., object detection and semantic segmentation, our consolidator shows great power as well. Overall, we summarize our contributions as follows. (i) We propose a basic module, dubbed consolidator, for effective and efficient visual transfer learning. To enhance transferability under limited tunable parameters, our consolidator is designed as a mergeable grouped connected (GC) layer with a channel reorder layer and a droppath regularizer. We extend the single branch to a multi-branch topology for better flexibility and transferability. (ii) We design a two-stage consolidation scheme by merging the corresponding parameters in the training-storage phase and the loading-inference phase. 
In this way, we can maximally exploit the adaptation capacity of the model under a constrained storage budget, with no extra inference cost. (iii) We conduct extensive experiments and analysis on various downstream tasks. Results show that the proposed consolidator consistently outperforms state-of-the-art methods while storing fewer parameters.

2. RELATED WORKS

Parameter-efficient transfer learning. In the language field, works (Houlsby et al., 2019; Pfeiffer et al., 2021; Li & Liang, 2021; Lester et al., 2021; Zaken et al., 2021; Hu et al., 2021; Karimi Mahabadi et al., 2021; Liu et al., 2021a; Ding et al., 2022a) have been done to efficiently transfer the knowledge of pre-trained transformers to downstream language tasks. In the field of visual adaptation, several explorations have also been made to adapt vision transformers efficiently. Jia et al. (2022) and Bahng et al. (2022) directly apply prompt-tuning. Jie & Deng (2022) integrate additional tunable convolution layers. NOAH (Zhang et al., 2022) first trains a large supernet with three modules, VPT, LoRA, and adapter, and then searches for the optimal configuration of each module for every transformer block using an evolutionary algorithm (Chen et al., 2021). AdaptFormer (Chen et al., 2022) adds parallel adapters instead of serial ones. SSF (Lian et al., 2022b) tunes additional scaling and shifting parameters for adaptation. It has also been shown that classic methods such as LoRA (Hu et al., 2021) and adapter (Houlsby et al., 2019) can achieve good performance for vision transformers. However, existing methods suffer from the two trade-offs discussed in Section 1, and thus struggle to fully exploit the adaptation capacity of vision models efficiently. 
To solve these problems, we present a mergeable adapter, named consolidator, and introduce a two-stage consolidation design to balance the trade-offs, leading to efficient and effective visual adaptation. Inference-efficient structures. Many works (Ding et al., 2019; Guo et al., 2020; Ding et al., 2021b; c; a; 2022b) strive to design generic convolution architectures that realize superior capacity while incurring no extra inference cost. For example, RepVGG (Ding et al., 2021c) integrates an extra 1×1 convolution to strengthen the main 3×3 convolution. However, existing methods are typically designed for CNNs. As for the popular vision transformer architectures, few works investigate how to effectively strengthen their capacity while introducing no extra inference cost. LoRA (Hu et al., 2021) and SSF (Lian et al., 2022b) offer possible solutions, but they do not explore the consolidation process between training and storage, leading to inferior performance under a given storage budget. In this paper, we adopt parallel GC layers to complement the original FC layers in vision models, which shows strong abilities for visual adaptation. Furthermore, we expand the existing one-stage training-inference consolidation into a two-stage process: 1. training-storage consolidation, and 2. loading-inference consolidation. Such a two-stage design can maximally exploit the adaptation capacity of the pre-trained model under a constrained storage budget, with no extra inference cost. Extensive experiments show that our consolidator outperforms state-of-the-art methods in both the number of tunable parameters and the adaptation performance.

3.1. PRELIMINARIES

In this paper, we mainly focus on adaptation for vision transformers (Dosovitskiy et al., 2020; Liu et al., 2021b). A typical vision transformer (Dosovitskiy et al., 2020) consists of L serial blocks. In each block, there are a multi-head self-attention module (MHSA) and a multi-layer perceptron (MLP). Formally, a batch of input images x_input ∈ R^{B×3×H×W} is first reshaped into a sequence of flattened 2D patches x_p ∈ R^{B×N×(P²·C)}, where C is the number of channels, (P, P) is the resolution of each patch, and N = HW/P² is the number of patches. The patches are then mapped to D channel dimensions with a linear projection. Next, a classification token is appended, giving x_1 ∈ R^{B×(N+1)×D}. We use x_l ∈ R^{B×(N+1)×D} to denote the input of the l-th (1 ≤ l ≤ L) block. Its output is x_{l+1} = x'_l + MLP(LayerNorm(x'_l)), where x'_l = x_l + MHSA(LayerNorm(x_l)). In MHSA, the input features are first processed by three FC layers to generate the matrices Q, K, V; the output is calculated as Softmax(QKᵀ/√d)V and then projected by another FC layer. Therefore, the parametric components of MHSA are four FC layers, and the parametric components of the MLP are two FC layers as well. We hence formulate our consolidator for FC layers (see Fig. 1), covering all the parametric components in each MHSA and MLP. We will show that such a design realizes both efficiency and effectiveness in Section 4. Notably, our method is also applicable to MLP-based (Lian et al., 2022a) and CNN-based (Liu et al., 2022) models and reaches good results, as shown in Tab. 4.
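The block structure above can be sketched numerically. The following is a minimal NumPy sketch, not the paper's implementation: single-head attention stands in for MHSA, ReLU stands in for GELU, and the learned affine of LayerNorm is omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel (last) dimension; learned affine omitted.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo):
    # Single-head attention for brevity; a real ViT splits D across heads.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.swapaxes(-2, -1) / np.sqrt(x.shape[-1]))
    # Four FC layers in total: Wq, Wk, Wv and the output projection Wo.
    return (attn @ v) @ Wo

def mlp(x, W1, W2):
    # Two FC layers; ReLU stands in for the usual GELU.
    return np.maximum(x @ W1, 0.0) @ W2

def block(x, p):
    # Pre-norm residual block: x' = x + MHSA(LN(x)); out = x' + MLP(LN(x')).
    x = x + mhsa(layer_norm(x), *p["attn"])
    x = x + mlp(layer_norm(x), *p["mlp"])
    return x
```

Since every parametric component here is an FC weight matrix, attaching a consolidator to each of these six matrices covers the whole block.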

3.2. CONSOLIDATOR

For efficient transfer learning, we merely tune and store the parameters in consolidators while freezing the other parameters of the pre-trained model. In this subsection, we introduce our design of the consolidator, an efficient and effective module for adapting vision transformers, in detail. Grouped connections. Inspired by the success of group convolution in extracting visual features, we assume that cross-channel information exchange is partially redundant in visual adaptation, and we design the consolidator by reducing the cross-channel connections between sequential features, minimizing the number of stored parameters for downstream tasks while keeping maximum capacity. Therefore, for each FC layer, we add a concurrent module consisting of a grouped connected layer. Formally, for an input x ∈ R^D, the output x' ∈ R^E of a GC layer with group number g, weight W ∈ R^{g×(E/g)×(D/g)}, and bias b ∈ R^E is x' = GC(x) = Σ_{j=1}^{g} Pad(W_j x_{(j-1)D/g : jD/g}, j) + b. Here Pad(z, j) prepends (j-1)E/g zeros and appends (g-j)E/g zeros to z ∈ R^{E/g}. In this way, the output channels in the j-th group only interact with the input channels in the j-th group, reducing the cross-channel connections as expected. To flexibly reach different ratios of stored parameters, we adopt a multi-branch topology in our consolidator: the i-th branch, with group number g^(i), is a GC layer with weight W^(i) ∈ R^{g^(i)×(E/g^(i))×(D/g^(i))} and bias b^(i) ∈ R^E. During adaptation, the consolidator and the original FC layer take the same input, and their outputs are summed to produce the new output y. Formally, for each FC layer with weight W ∈ R^{E×D} and bias b ∈ R^E, the output of the whole layer modified by m GC branches is y = Wx + b + Σ_{i=1}^{m} (Σ_{j=1}^{g^(i)} Pad(W^(i)_j x_{(j-1)D/g^(i) : jD/g^(i)}, j) + b^(i)). Channel reorder. 
To flexibly tune the total number of parameters and enrich the exchange of information flow, we prepend a "ChannelReorder" operation to every branch in our consolidator, manually adjusting the permutation of input features along the channel dimension. We adopt the shuffle operation (Zhang et al., 2018; Ma et al., 2018) for this purpose. Formally, given input x ∈ R^{*×D}, where "*" means any number of leading dimensions including none, we shuffle the last dimension into g groups and recombine across it: we first reshape x into x' ∈ R^{*×g×(D/g)}, then transpose the last two dimensions to get x'' ∈ R^{*×(D/g)×g}, and finally reshape x'' into x''' ∈ R^{*×D}, which is the output. A pythonic-style formulation is ChannelReorder(g, x) = x.reshape(*, g, D/g).transpose(-2, -1).reshape(*, D). We set the shuffle group g = g^(i) in the i-th branch, where g^(i) is the group number of the corresponding GC layer. In this way, there are few overlaps between the weight matrices of distinct branches, and the model capacity is greatly expanded each time a new branch is added. Stochastic depth of pre-trained weight. To further enlarge the model's adaptation capacity, we append a droppath (Huang et al., 2016) layer to each branch. For small downstream datasets, dropping the consolidator path with a higher ratio p helps reduce overfitting and catastrophic forgetting, which can benefit performance. As empirically shown in Section 4.8, droppath is more effective than standard dropout (Srivastava et al., 2014) in this setting, probably for the following reason: the whole layer degrades to the frozen pre-trained parameters with probability p, and these pre-trained parameters contain domain-generic knowledge, which may help the adaptation. 
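The pythonic formulation above translates directly into a few lines of NumPy; a sketch, using `swapaxes` in place of `transpose(-2, -1)`:

```python
import numpy as np

def channel_reorder(g, x):
    # ShuffleNet-style channel shuffle of the last dimension:
    # reshape to (..., g, D/g), transpose the last two dims, flatten back.
    *lead, D = x.shape
    return x.reshape(*lead, g, D // g).swapaxes(-2, -1).reshape(*lead, D)
```

The operation is a fixed permutation of channels, so it is linear and invertible; as stated later in the paper, ChannelReorder(D/g, ·) undoes ChannelReorder(g, ·), which is what makes the training-storage consolidation possible.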
Overall, a model modified by consolidators has a stochastic depth of pre-trained parameter states during each forward pass, and different consolidators are activated for different training samples. Two-stage consolidation. Now we have constructed all the elements of a consolidator. Formally, the output of the whole layer modified by a consolidator is y = Wx + b + Droppath(p, Σ_{i=1}^{m} (Σ_{j=1}^{g^(i)} Pad(W^(i)_j ChannelReorder(g^(i), x)_{(j-1)D/g^(i) : jD/g^(i)}, j) + b^(i))). Since all the operations in a consolidator are inference-time linear, we can easily consolidate the domain-specific knowledge into the domain-agnostic knowledge of the pre-trained backbone, in both the training-storage phase and the loading-inference phase. 1. Training-storage consolidation. All we need to store in a consolidator are W^(i) and b^(i). However, some parameters correspond to the same entry and can thus be merged into a single one. As shown in Fig. 1, we also tune the bias of the original FC layer in addition to the parameters in the consolidator. The duplicate biases of all branches and the original bias can be merged into a single one, and there are duplicate entries in the weight matrices as well, so we can merge all weight matrices into one single sparse matrix. Consolidating such duplicate entries largely benefits storage. Formally, we use W̄ and b̄ to denote the matrices that we store on disk. Since channel reorder is a linear operation, we can apply the reverse operation to W^(i) to simulate the effect of the reorder applied to x, giving W̄ = Σ_{i=1}^{m} ChannelReorder^{-1}(g^(i), Compact(W^(i))). Here Compact reshapes the input matrix in preparation for reordering its channels. It is easy to verify that ChannelReorder^{-1}(g^(i), ·) = ChannelReorder(D/g^(i), ·). 2. Loading-inference consolidation. 
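The training-storage consolidation can be sketched as follows. This is an illustrative reading of the paper's equations, not the released code: `gc_to_dense` plays the role of Compact (expanding each grouped weight into its dense block-diagonal equivalent), and `absorb_reorder` folds the input-side ChannelReorder into the weight by permuting its columns, so the branch sum collapses to one sparse matrix W̄ and one bias b̄.

```python
import numpy as np

def gc_to_dense(W):
    # Expand grouped weight (g, E/g, D/g) into its dense block-diagonal
    # (E, D) equivalent -- the role of the paper's Compact step.
    g, Eg, Dg = W.shape
    M = np.zeros((g * Eg, g * Dg))
    for j in range(g):
        M[j * Eg:(j + 1) * Eg, j * Dg:(j + 1) * Dg] = W[j]
    return M

def absorb_reorder(g, M):
    # Applying ChannelReorder(g, .) to the input x equals permuting the
    # columns of the weight: column i of M multiplies the reordered x[i].
    E, D = M.shape
    perm = np.arange(D).reshape(g, D // g).T.reshape(-1)
    out = np.zeros_like(M)
    out[:, perm] = M
    return out

def consolidate(branches):
    # branches: list of (g, W, b).  Sum all branch weights into one sparse
    # matrix W_bar and all biases into one b_bar before storing on disk.
    W_bar = sum(absorb_reorder(g, gc_to_dense(W)) for g, W, _ in branches)
    b_bar = sum(b for _, _, b in branches)
    return W_bar, b_bar
```

Because distinct group numbers place their nonzeros at mostly different entries, the summed matrix stays sparse, which is what makes per-task storage cheap.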
After loading the merged sparse weight matrix and merged bias to memory, we can directly add them back to the weight matrix and bias of the original FC layer. Formally, we use Ŵ and b̂ to denote the final weight and bias of the FC layer for inference; then Ŵ = W + W̄ and b̂ = b̄. In this way, no additional inference cost is incurred. Overall, our consolidator reduces storage space by using grouped connected layers and consolidating duplicate entries. The training-time non-linearity, e.g., droppath, which turns out to be linear at inference time, effectively enriches model capacity under a constrained storage budget. Finally, we consolidate the task-specific knowledge into the backbone model by merging the inference-time linear components, enjoying free, efficient, and effective transfer learning. 
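The loading-inference consolidation is a single addition. In the sketch below the sparse W̄ and merged b̄ are synthetic stand-ins for matrices produced by the training-storage step; the point is only that the folded layer computes exactly the adapted output with the original layer's structure and FLOPs:

```python
import numpy as np

rng = np.random.default_rng(4)
E = D = 8
W = rng.normal(size=(E, D))                 # frozen pre-trained FC weight
mask = rng.random((E, D)) < 0.25            # synthetic sparsity pattern
W_bar = rng.normal(size=(E, D)) * mask      # stored sparse consolidated weight
b_bar = rng.normal(size=E)                  # stored merged bias (original bias
                                            # already folded in during storage)
x = rng.normal(size=D)

# Loading-inference consolidation: fold the stored update into the FC layer.
W_hat = W + W_bar
b_hat = b_bar

# The folded layer equals the backbone output plus the consolidator update.
assert np.allclose(W_hat @ x + b_hat, W @ x + (W_bar @ x + b_bar))
```

After this step the checkpoint on disk holds only the sparse update per task, and the deployed model is byte-for-byte an ordinary ViT.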

4.2. MAIN RESULTS

We first choose a ViT-B (Dosovitskiy et al., 2020) with 86M parameters as a base model.

VTAB-1k

Tab. 1 presents the full results on the VTAB-1k benchmark. Overall, our consolidator is the best parameter-efficient method. On 12 of 19 datasets, consolidator achieves the best or second best top-1 accuracy. Notably, consolidator surpasses the state-of-the-art methods NOAH and SSF by a clear margin, with low storage overhead and no inference cost. Full data setting. Tab. 2 presents the full results in the full data setting. Overall, our consolidator still performs best. An interesting observation is that the rank of full fine-tuning rises as the training data increase: none of the parameter-efficient methods other than our consolidator reaches performance comparable to full fine-tuning within a 0.5% parameter storage overhead. In contrast, on VTAB-1k the parameter-efficient methods reach at least 5% higher accuracy than full fine-tuning under a comparable or even lower storage budget (around 0.5%), as shown in Tab. 1. 

4.4. RESULTS FOR MORE PRE-TRAINED MODELS

To further verify the generalization ability of consolidator, we conduct extensive experiments in Tab. 4 based on the full data setting. First, we apply consolidator to supervised models of larger (ViT-L) and smaller (ViT-S) size than the standard ViT-B. Compared with full fine-tuning, we achieve comparable performance for ViT-S with 5.06% parameters. For ViT-L, merely 0.33% parameters lead to 0.45 higher accuracy than full fine-tuning. We then experiment on Swin-B, a hierarchical architecture using shifted windows to introduce locality for better recognition, and observe a 0.28 improvement while storing only 0.77% parameters. Next, we verify the effectiveness of consolidator on vision architectures other than transformers, e.g., AS-MLP (Lian et al., 2022a) and ConvNeXt (Liu et al., 2022). Finally, we experiment on a ViT-B pre-trained by a generative SSL method, MAE, as a comparison with the contrastive SSL method MoCo v3. Our method stores 1.78% parameters on disk while enjoying 0.55 higher accuracy. Generally, it is more difficult to perform better than full fine-tuning when the model does not have enough parameters to fully leverage the information from massive training data.

4.5. ADAPTATION ACROSS VARYING SCALES OF STORED PARAMETERS

Next, we seek to characterize the relationship between downstream accuracy and the number of stored parameters for all parameter-efficient methods with flexible parameter scales, based on the full data setting. Results are shown in Fig. 2 (left). LoRA, adapter, and consolidator all reach better performance as the number of parameters increases. Across various parameter scales (from 0.5% to 10%), our consolidator consistently outperforms all competitors by a clear margin.

4.6. ADAPTATION ACROSS VARYING DATA SAMPLING RATIOS

Fig. 2 (right) shows adaptation results for varying sampling rates of the datasets, based on the full data setting. For all sampling rates, consolidator keeps the best adaptation ability. In addition, as the sampling rate decreases, full fine-tuning slightly loses its advantage over adapter and LoRA, which is consistent with previous observations on VTAB-1k and the full data setting.

4.7. RESULTS ON OBJECT DETECTION AND SEMANTIC SEGMENTATION

We further verify our method on downstream object detection and semantic segmentation tasks. We adopt Swin-Base as the backbone, which provides hierarchical features. Experiments are done on PASCAL VOC 07+12 (Everingham et al., 2010) for detection and PASCAL VOC 12 (Everingham et al., 2010) for segmentation. We adopt Faster R-CNN (Ren et al., 2015) and UperNet (Xiao et al., 2018) as the detection and segmentation frameworks, respectively. As seen in Tab. 5, our consolidator significantly outperforms both full training and training only the detection/segmentation head, with a small number of stored parameters, showing great potential for broader usage.

4.8. ABLATION STUDIES

We conduct controlled experiments to identify the effect of individual components in our module design. We report tuned parameters, stored parameters, and accuracy in Tab. 6. We experiment on 5 datasets covering different domains: Caltech101, DTD, OxfordFlowers, StanfordDogs, and EuroSAT. Droppath v.s. Dropout. We first investigate our choice of droppath, using a consolidator of (g^(1) = 96, g^(2) = 192) as the base model. As shown in Tab. 6, compared with dropout and no drop-type layer, droppath obtains 0.44 and 0.48 performance improvement, respectively, well demonstrating the effectiveness of encouraging stochastic depth for the consolidator. ChannelReorder. The effect of the ChannelReorder operation mainly lies in separating the entries into different branches to reduce the repetitive ones. In a multi-branch case like (g^(1) = 96, g^(2) = 192), ChannelReorder brings a 0.38 accuracy improvement. Furthermore, it is helpful even when there is only one branch, e.g., (g^(1) = 384), where ChannelReorder still slightly raises the accuracy by 0.1. Duplication of bias and weight. Based on the consolidator with (g^(1) = 96, g^(2) = 192), we can see that duplicating the bias is relatively effective. Compared with tuning only the original bias or only the two extra biases, tuning all three biases leads to 0.44 and 0.1 performance improvement, respectively, at the same storage cost. Additionally, we compare tuning the original bias v.s. not tuning the bias: the former has only a slight 0.1 accuracy advantage, which further verifies that the effectiveness owes mostly to the delicate consolidation design instead of simple bias tuning. Besides bias, we test a consolidator with (g^(1) = 96) to investigate the effect of integrating duplicate weights; however, this kind of consolidation does not bring notable improvement. Structured v.s. Unstructured. One potential limitation of the consolidator is that g^(i) cannot be selected arbitrarily, for it must be a factor of the number of channels. 
This may cause trouble when fairly few parameters beyond the head parameters, e.g., 0.0001%, are allowed to be tuned. A solution is to adopt unstructured sparsity instead of structured block-wise sparsity in consolidator branches, to flexibly control the parameter number. We simulate this situation with (g^(1) = 384) for comparison with an unstructured implementation. As seen in Tab. 6, the unstructured branch, whose tunable parameter number equals that of a branch with g = 384, faces a slight accuracy drop of 0.14. In summary, an unstructured consolidator can be a sub-optimal fallback when needed, at the cost of a slight performance drop and more training time, since unstructured sparse matrices are unfriendly to hardware.

5. CONCLUSIONS

We propose the consolidator, a novel method achieving both parameter- and inference-efficient visual adaptation. Consolidator adds a few tunable, mergeable modules alongside each fully connected layer of the pre-trained model and keeps the original parameters frozen during adaptation. We design a two-stage consolidation to dramatically boost performance under a given storage budget: the duplicate entries in a consolidator are merged into a single matrix and stored on disk, and finally the task-specific parameters in the consolidator are consolidated into the task-agnostic parameters of the pre-trained model, bringing no extra inference cost. On various tasks, consolidator significantly outperforms all state-of-the-art competitors, showing strong scalability and generalization ability. NOAH (Zhang et al., 2022): NOAH searches for per-block module configurations with the NAS algorithm introduced by AutoFormer (Chen et al., 2021); in the end, NOAH retrains the best subnet candidates to produce the final result. AdaptFormer (Chen et al., 2022): AdaptFormer adopts parallel adapters (Houlsby et al., 2019) and a scale operation for each encoder block. SSF (Lian et al., 2022b): SSF adds tunable scale and shift parameters for each operation of the backbone model; the added parameters can be merged into the original model, and thus SSF brings no inference cost. Implementation details. On VTAB-1k, we follow the same implementations for LoRA and adapter as NOAH (Zhang et al., 2022). In the full data setting, the detailed implementations are as follows. For LoRA (Hu et al., 2021), we follow its original implementation to apply low-rank re-parameterization and tune merely the two weight matrices W_q and W_v, which generate the attention matrices Q and V, in every transformer encoder block. For bias, we tune all the biases (including the parameters outside encoder blocks) in the model. 
For adapter, we follow its original implementation to additionally tune the parameters of LayerNorm (Ba et al., 2016) and add the tunable adapter modules at the end of each MHSA and MLP, before the residual connections are computed. For consolidator, we follow the practice of adapter and tune the LayerNorm as well, adding consolidators only to the linear layers in the MHSA and MLP of each transformer encoder block. We sweep the drop ratio over {0.0, 0.2, 0.5, 0.8}. We choose training hyperparameters according to the performance of full fine-tuning for each visual representation; the final configurations are given in Tab. 7. When applying consolidator to Swin-B, we skip the first stage of transformer encoder blocks, which contains a relatively small portion of parameters (0.6% of the total) compared with the others, and insert consolidators only for the linear layers in the MHSA and MLP of the blocks in the last 3 stages.

A.2 OBJECT DETECTION

On the downstream object detection task, we adopt Faster R-CNN (Ren et al., 2015) framework with FPN (Lin et al., 2017) to verify the effectiveness of consolidator on Pascal VOC 07+12 dataset (Everingham et al., 2010) . We use a Swin-B pre-trained on IN-21k as the backbone model. The consolidator setting is the same as the setting of Swin-B in Tab. 7. For hyperparameters, we adopt AdamW as the optimizer with a learning rate of 1e-4 and weight decay of 5e-2, and train for 8 epochs in total. The learning rate is decayed by a factor of 10 after 6 epochs. The first 300 iterations are trained with a warmup ratio of 1e-3 for the learning rate. The data augmentation is the same as the default strategy in mmdetection (Chen et al., 2019) .

A.3 SEMANTIC SEGMENTATION

On the downstream semantic segmentation task, we adopt UperNet (Xiao et al., 2018) to verify the effectiveness of consolidator on Pascal VOC 12 dataset (Everingham et al., 2010) . We use a Swin-B pre-trained on IN-21k as the backbone model. The consolidator setting is the same as the setting of Swin-B in Tab. 7. For hyperparameters, we adopt AdamW as the optimizer with a learning rate of 6e-5 and weight decay of 1e-2, and train for 20000 iterations in total. The learning rate is decayed by a polynomial scheduler with a power of 1.0. The first 200 iterations are trained with a warmup ratio of 1e-6 for the learning rate. The data augmentation is the same as the default strategy in mmsegmentation (Contributors, 2020).

B TRAINING AND INFERENCE COST

Many classic parameter-efficient tuning methods, e.g., adapter (Houlsby et al., 2019) and AdaptFormer (Chen et al., 2022), introduce non-negligible extra cost in the inference period and thus slow down processing. In contrast, a consolidator-tuned model shares an identical structure with the original model and brings no extra inference cost. We quantitatively show the training cost and inference cost across different parameter scales in Fig. 3. We do not show the cost of SSF, as it cannot be adapted to different parameter budgets. In addition, VPT and NOAH both search for an optimal structure in a large search space, making it hard to measure their cost fairly.



Figure1: Consolidator tuning versus full fine-tuning. Consolidator adds tunable multi-branch grouped connected layers to the original fully connected layers. The tunable parameters are merged via addition into one single sparse matrix before storage to reduce the needed storage space. Between loading and inference, the parameters in the sparse matrix will be merged back into the original fully connected layer. Consolidator greatly enlarges the model's adaptation capacity under a constrained storage budget with no extra inference cost. Best viewed in color.

Figure 2: Left: adaptation results for varying ratios of stored parameters. Clearly, consolidator consistently outperforms LoRA, adapter, and full fine-tuning by a significant margin across a wide range of parameter scales from 0.5% to 10%, and each of the three methods reaches higher accuracy as its storage budget grows and more parameters are tuned. Right: adaptation results for varying data sampling rates. Consolidator performs best across all sampling rates. As the sampling rate decreases, full fine-tuning gradually loses its advantage over adapter and LoRA, which is consistent with our previous observations.

Full results on the VTAB-1k (Zhai et al., 2019) benchmark. Bold denotes the best and underline the second-best accuracy in each column. Consolidator gives the strongest results, surpassing full fine-tuning by 7.56 accuracy points on average, and outperforms the state-of-the-art methods with low storage overhead and no extra inference cost.

Full results in data-sufficient scenarios. It is more challenging to take full advantage of a large amount of data within a fairly small number of stored parameters. Consolidator achieves the best or second-best result on all 10 datasets.



Adaptation performance for more models in the full data setting. Consolidator consistently reaches better results than full fine-tuning within a very small set of stored parameters. Generally, it is more difficult to beat full fine-tuning when the model capacity is insufficient, i.e., when the model does not accommodate enough parameters in total.

We then use a transformer trained in a self-supervised manner by MoCo v3 (Chen et al., 2021) as our target model. As seen in Tab. 3, on both VTAB-1k and the full data setting, consolidator consistently reaches the highest average top-1 accuracy, with AdaptFormer second. SSF falls significantly behind the others (51.41 on VTAB-1k and 80.73 on the full data setting) when dealing with the MoCo v3 ViT-B, showing limited generalization ability on self-supervised visual models.

Performance on downstream object detection and semantic segmentation tasks. Compared with tuning the detection/segmentation head only, consolidator reaches much better mAP/mIoU with a negligible parameter increase, and it also surpasses tuning all parameters in both the backbone and the task head.

Effect of individual designs for a consolidator module.

ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China (No. 2021ZD0114703), the National Natural Science Foundation of China (Nos. 61925107, 62271281, U1936202, 61571269), the Beijing Natural Science Foundation (No. L223023), the China Postdoctoral Science Foundation (BX2021161), and the Zhejiang Lab Open Research Project (No. K2022KI0AB01).


To better utilize the massive data from various domains, we follow the practice in (Junguang Jiang & Long, 2020; Jiang et al., 2022) to prepare the data and divide the splits. Some of the datasets do not contain a validation split; for these, we randomly select 10% of the training images as the validation set. For data augmentation, we adopt a standard pipeline. In training, we apply a random resized crop to 224×224, a random horizontal flip, and normalization to each input image. In testing, we apply a resize to 256×256, a center crop to 224×224, and normalization to each input image.

A.1.2 TRAINING HYPERPARAMETERS.

On VTAB-1k, we follow the hyperparameters in VPT (Jia et al., 2022) for full fine-tuning, Head, Bias, and VPT, and mainly follow the hyperparameters in NOAH (Zhang et al., 2022) and SSF (Lian et al., 2022b) for adapter, LoRA, NOAH, and consolidator. The detailed hyperparameters for each tuning method can be found in Tab. 7. On the full data setting, we run a quick grid search to choose a proper set of training hyperparameters based on the performance of full fine-tuning for every well-trained visual representation. All training hyperparameters are listed in Tabs. 7 and 8. Following the practice in (Jiang et al., 2022; Junguang Jiang & Long, 2020), we reduce the learning rate of the backbone parameters to 0.1x that of the head parameters. Once the hyperparameters have been chosen according to the full fine-tuning results, we adopt the same ones for all parameter-efficient variants as well as our consolidator.
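The 0.1x backbone learning rate is implemented with optimizer parameter groups. A minimal sketch, assuming a model whose head submodule is literally named `head` (the name and helper are illustrative, not from the paper):

```python
import torch

def build_optimizer(model, head_lr=1e-3, weight_decay=1e-2):
    """AdamW with the backbone at 0.1x the head's learning rate."""
    head_params, backbone_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (head_params if name.startswith("head") else backbone_params).append(p)
    return torch.optim.AdamW(
        [{"params": backbone_params, "lr": head_lr * 0.1},  # backbone: 0.1x
         {"params": head_params, "lr": head_lr}],           # head: full lr
        weight_decay=weight_decay)

# toy model standing in for a real backbone + classification head
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(4, 4)
        self.head = torch.nn.Linear(4, 2)

opt = build_optimizer(Toy())
```

The `requires_grad` check also makes the same helper reusable for the parameter-efficient variants, where most backbone parameters are frozen.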

A.1.3 METHODS.

Full: ordinary fine-tuning. It tunes all parameters and stores a full copy of them for every downstream task, leading to heavy storage cost.

Head: also known as linear probing. It only tunes the classification head and freezes all other parameters.

Bias (Zaken et al., 2021): tunes only the biases and freezes all the weights in a pre-trained model.

Adapter (Houlsby et al., 2019): inserts a sequence of tunable layers, consisting of a down projection, a non-linearity (here we use GELU (Hendrycks & Gimpel, 2016)), and an up projection, into each encoder block. The adapters are serially connected with the backbone layers.

LoRA (Hu et al., 2021): adds a concurrent branch containing low-rank weight matrices for efficient parameter updates. A LoRA module contains serial down and up projection layers without a non-linearity, so it can be merged into the backbone parameters before inference.

VPT (Li & Liang, 2021; Jia et al., 2022): appends tunable virtual tokens to the inputs of transformer blocks, which participate in the computation of the subsequent blocks along with the actual tokens.

NOAH (Zhang et al., 2022): first trains a large supernet containing all three modules (VPT, LoRA, and adapter), and then searches for the optimal configuration of each module for each layer.
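The LoRA merge mentioned above can be verified in a few lines of NumPy: because the down and up projections compose linearly, their product is a rank-r update that can be added into the frozen weight before inference. This sketch omits LoRA's scaling factor for brevity; dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 4                         # feature dim and LoRA rank (toy values)
W = rng.standard_normal((d, d))      # frozen pre-trained weight
down = rng.standard_normal((r, d))   # tunable down projection (d -> r)
up = rng.standard_normal((d, r))     # tunable up projection (r -> d)

# fold the low-rank update into the weight: no extra branch at inference
W_merged = W + up @ down

# merged forward pass equals "frozen path + LoRA path"
x = rng.standard_normal(d)
assert np.allclose(W_merged @ x, W @ x + up @ (down @ x))
```

Only `down` and `up` (2·d·r values) need to be stored per task, while inference runs at the cost of the single dense matrix `W_merged`, which is the same mergeability property consolidator exploits with its grouped branches.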


C SENSITIVITY OF GROUPS AND BRANCHES IN CONSOLIDATOR

Given a particular target storage budget, we may have several choices of branches and groups. As seen in Tab. 9, such choices have little influence on the final results: the performance of consolidator is relatively stable under a given storage budget. When the budget increases, the performance of consolidator increases as well. Such monotonic behavior is helpful for real-world applications, making it easy to tune hyperparameters and rapidly find an optimal candidate under a given storage budget.
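To see how branch and group choices map to a budget, consider a rough accounting sketch. This reflects our reading of the design, under the assumption that a branch with g groups is block-diagonal and that the group partitions of different branches nest (so the merged sparse matrix has the nonzero pattern of the coarsest branch); the exact accounting in the paper may differ.

```python
def consolidator_params(d, groups):
    """Tunable and stored parameter counts for a multi-branch grouped
    connection on a d x d fully connected layer, one branch per entry
    in `groups`. A branch with g groups has g blocks of size (d/g, d/g),
    i.e., d*d/g tunable weights."""
    tunable = sum(d * d // g for g in groups)
    # After merging the branches by addition, the stored sparse matrix
    # has the nonzero pattern of the coarsest (smallest-g) branch,
    # assuming nested group partitions.
    stored = d * d // min(groups)
    return tunable, stored

# e.g., two branches with 48 and 64 groups on a 768-dim layer
tunable, stored = consolidator_params(768, [48, 64])
```

Under this accounting, adding a finer-grained branch (larger g) raises adaptation capacity while leaving the stored size unchanged, which is consistent with the stability across branch/group choices observed in Tab. 9.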

