SENSITIVITY-AWARE VISUAL PARAMETER-EFFICIENT TUNING

Anonymous

Abstract

Visual Parameter-efficient Tuning (VPT) has become a powerful alternative to full fine-tuning: it updates only a small number of parameters while freezing the vast majority, significantly reducing the storage cost of adapting pre-trained vision models to downstream tasks. Although the storage burden is largely alleviated, existing VPT approaches still face challenges, e.g., slower inference and the lack of effective, task-tailored configurations for the trainable parameters. In this paper, we present a simple yet effective approach termed Sensitivity-aware visual Parameter-efficient Tuning (SPT) to tackle these challenges. Given a desired budget of tunable parameters, SPT quickly identifies the parameters important to the given task in a data-dependent way before fine-tuning, without a complex selection schedule. SPT then adaptively determines the tuning granularity for each weight matrix. Accordingly, across the whole model, we structurally tune the entire sensitive weight matrices, i.e., those containing a large proportion of sensitive parameters (structured tuning), and non-structurally tune the sensitive connections within the insensitive weight matrices (unstructured tuning), simultaneously. For structured tuning, SPT approximates the update with a low-rank reparameterization to stay within the parameter budget. As a result, SPT offers high flexibility and representational capability while achieving a favorable trade-off between parameter efficiency and accuracy. Through extensive experiments on a wide range of downstream recognition tasks, SPT achieves better overall transfer performance than both full fine-tuning and other VPT approaches, with no additional computational or memory overhead during inference. For instance, SPT saves 99.35% of the trainable parameters relative to full fine-tuning while achieving 7.3% higher average top-1 accuracy on the VTAB-1k benchmark with the supervised pre-trained ViT-B backbone.
Notably, SPT is also the first work to bridge the gap between full fine-tuning and VPT approaches on the challenging VTAB-1k benchmark with backbones pre-trained under the self-supervised strategies MAE and MoCo v3.

1. INTRODUCTION

The pre-training and fine-tuning paradigm has underpinned the most recent breakthroughs in vision, yielding stunning empirical performance on a series of tasks such as segmentation (Chen et al., 2017; Ronneberger et al., 2015) and detection (He et al., 2017; Carion et al., 2020). The Transformer (Vaswani et al., 2017) has been widely adopted as the standard architecture for pre-trained vision models, with representatives including CLIP (Radford et al., 2021), MAE (He et al., 2022b), BEiT (Bao et al., 2022), etc. To effectively adapt the pre-trained representations to downstream tasks, the de-facto choice is fine-tuning, which initializes the model with the pre-trained weights and tunes all the parameters. However, vanilla fine-tuning needs to store a separate instance of parameters for each task and each deployment scenario. This can be extremely storage-intensive, as the storage cost grows linearly with the number of possible cases, given the vast variety of downstream tasks and dynamic deployment environments, especially when deploying large vision models (Dosovitskiy et al., 2021; Liu et al., 2021; Xu et al., 2021b) to mobile systems. For example, storing even a single large pre-trained ViT-H (He et al., 2022b) model on a local disk requires at least 2.3GB, whereas the top-10 U.S. apps collectively required only 2.2GB as of May 2021. Notably, an emerging trend is to replace full fine-tuning with Visual Parameter-efficient Tuning (VPT) (Jia et al., 2022; Chen et al., 2022; Zhang et al., 2022), which tunes only a small number of trainable parameters (newly introduced or inherent to the model) to cooperate with a frozen backbone shared by multiple tasks. As VPT approaches require less than 1% of the parameters to be trainable, the storage burden is largely alleviated.
Another attractive property of VPT is that tuning fewer parameters eases the optimization difficulty and mitigates the overfitting issue for large models, thereby achieving comparable or even better performance than full fine-tuning (Jia et al., 2022) (see Figure 1 (c)). Although promising, existing VPT approaches suffer from two major issues. First, they specify the positions at which to add the trainable parameters with different heuristics, and the importance of these positions has not been well studied. For instance, Prompt tuning (Jia et al., 2022) and Adapter (Houlsby et al., 2019) add trainable parameters to the input space and to each Transformer (Vaswani et al., 2017) block, respectively. Moreover, these approaches keep the same configuration for the trainable parameters across different downstream tasks, neglecting their domain gaps and characteristics. Second, the additional parameters lead to a non-negligible sacrifice in inference efficiency in terms of speed and memory consumption. Taking Prompt tuning (Jia et al., 2022) as an example, with its enlarged input space (200 prompts), it runs inference 2× slower and consumes 2× the GPU memory of its full fine-tuning counterpart. To this end, in this work, we present a novel Sensitivity-aware visual Parameter-efficient Tuning (SPT) that identifies and tunes the parameters at task-specific important positions while being inference-efficient. Based on the assumption that not all pre-trained parameters contribute equally to the performance on different tasks, we first propose a new criterion to efficiently measure the sensitivity (importance) of the pre-trained backbone parameters to a specific task. Inspired by model pruning methods (Srivastava et al., 2015; Molchanov et al., 2019), we propose to use the loss reduction as the sensitivity measure, which can be efficiently approximated with a first-order Taylor expansion.
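To make the criterion concrete, the sketch below scores each parameter by its gradient and selects a global top-k set under a budget. It assumes the single-step loss reduction is proportional to the squared gradient (the standard first-order Taylor approximation used in pruning); the function names, exact score, and selection scheme are illustrative, not the paper's exact implementation.

```python
import numpy as np

def taylor_sensitivity(grads):
    """First-order Taylor approximation of the per-parameter loss reduction.

    For a gradient step delta_w = -lr * g, the loss change is approximately
    g.T @ delta_w = -lr * g**2, so the magnitude of the attainable loss
    reduction is proportional to the squared gradient (illustrative score)."""
    return {name: g ** 2 for name, g in grads.items()}

def select_sensitive(sensitivity, budget):
    """Boolean mask per weight matrix marking the `budget` globally most
    sensitive parameters across the whole model."""
    flat = np.concatenate([s.ravel() for s in sensitivity.values()])
    # Threshold = the budget-th largest sensitivity score over all matrices.
    tau = np.partition(flat, -budget)[-budget]
    return {name: s >= tau for name, s in sensitivity.items()}
```

Because the score depends only on gradients of the pre-trained weights, a single backward pass over a few training batches suffices to rank every parameter before any fine-tuning starts.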
The resulting parameter sensitivity is computed solely from the gradients, and can therefore be derived quickly ahead of fine-tuning. We show an example of parameter sensitivities with a pre-trained ViT-B (Dosovitskiy et al., 2021) backbone in Figure 1 (a), where the sensitivities vary across tasks. An intuitive solution is then to tune only the parameters with the highest sensitivity, which we name unstructured tuning following (Han et al., 2015; 2016). Despite its simplicity and flexibility, unstructured tuning lacks representational capability, as only a few parameters are tuned to capture the domain gap. To this end, our SPT further combines unstructured tuning with structured tuning (Figure 1 (b)). Specifically, after identifying the sensitive parameters of the pre-trained backbone, SPT adaptively determines the tuning granularity for each weight matrix. Accordingly, across the whole model, we structurally tune the entire sensitive weight matrices, i.e., those containing a large proportion of sensitive parameters (structured tuning), and non-structurally tune the sensitive connections within the insensitive weight matrices (unstructured tuning), simultaneously. To preserve the parameter budget, for structured tuning, SPT follows the efficient reparameterization strategy of LoRA (Hu et al., 2022).
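The granularity decision and the low-rank structured update described above can be sketched as follows. The ratio threshold, the zero-initialized B factor, and the NumPy formulation are illustrative assumptions; the paper's actual allocation rule and parameterization may differ.

```python
import numpy as np

def partition_matrices(masks, ratio_threshold=0.4):
    """Route each weight matrix to structured or unstructured tuning based
    on its proportion of sensitive parameters (threshold is hypothetical)."""
    structured, unstructured = [], []
    for name, mask in masks.items():
        if mask.mean() > ratio_threshold:
            structured.append(name)    # tune the whole matrix via a low-rank update
        else:
            unstructured.append(name)  # tune only the masked connections
    return structured, unstructured

def lora_update(W, rank, rng):
    """LoRA-style reparameterization of the structured update: the frozen W
    is augmented with a trainable delta = B @ A. A is initialized randomly
    and B with zeros, so the model starts from the pre-trained weights."""
    d_out, d_in = W.shape
    A = rng.standard_normal((rank, d_in)) * 0.01  # trainable factor
    B = np.zeros((d_out, rank))                   # trainable, zero-init
    return W + B @ A  # equals W at initialization
```

With rank r much smaller than the matrix dimensions, a structured update costs r * (d_out + d_in) trainable parameters instead of d_out * d_in, which is what keeps structured tuning within the overall budget.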



https://sensortower.com/blog/ios-app-size-growth-2021




Figure 1: (a) The block-wise parameter sensitivity with the supervised pre-trained ViT-B backbone (Dosovitskiy et al., 2021) for three sampled tasks from VTAB-1k (Zhai et al., 2019a). "TPS" denotes our task-specific parameter sensitivity (importance). We show scores averaged over all 800 training samples. The sensitivity of each block varies markedly across tasks. (b) Our proposed Sensitivity-aware visual Parameter-efficient Tuning (SPT) identifies the task-specific important positions and adaptively combines unstructured and structured tuning to enjoy both flexibility and high capacity. The blue and red lines represent the frozen and trainable parameters, respectively. (c) Accuracy vs. parameter efficiency with the supervised pre-trained ViT-B backbone. Our SPT incurs no extra computational overhead during inference, surpasses full fine-tuning by large margins, and performs favorably against other VPT approaches.

