SENSITIVITY-AWARE VISUAL PARAMETER-EFFICIENT TUNING

Anonymous

Abstract

Visual Parameter-efficient Tuning (VPT) has become a powerful alternative to full fine-tuning: it updates only a small number of parameters while freezing the vast majority, significantly reducing the storage cost of adapting pre-trained vision models to downstream tasks. Although the storage burden is largely alleviated, VPT approaches still face challenges, e.g., lower inference speed and the lack of effective, task-specific configurations of the trainable parameters. In this paper, we present a simple yet effective approach termed Sensitivity-aware visual Parameter-efficient Tuning (SPT) to tackle these challenges. Given a desired tunable-parameter budget, SPT quickly identifies the parameters important to the given task in a data-dependent way before fine-tuning, without a complex selection schedule. SPT then adaptively determines the tuning granularity for each weight matrix: across the whole model, it structurally tunes entire sensitive weight matrices, i.e., those containing a large proportion of sensitive parameters (structured tuning), while non-structurally tuning only the sensitive connections within the insensitive weight matrices (unstructured tuning). For structured tuning, SPT approximates the update with a low-rank reparameterization to stay within the parameter budget. As a result, SPT offers high flexibility and representational capability while achieving a favorable trade-off between parameter efficiency and accuracy. Through extensive experiments on a wide range of downstream recognition tasks, SPT achieves better overall transfer performance than both full fine-tuning and other VPT approaches, with no additional computational or memory overhead during inference. For instance, SPT uses 99.35% fewer trainable parameters than full fine-tuning while achieving a 7.3% higher average top-1 accuracy on the VTAB-1k benchmark with the supervised pre-trained ViT-B backbone.
Notably, SPT is also the first work to bridge the gap between full fine-tuning and VPT approaches for backbones pre-trained with the self-supervised strategies MAE and MoCo v3 on the challenging VTAB-1k benchmark.
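The partitioning described above can be illustrated with a minimal NumPy sketch. Here, each parameter's sensitivity is approximated by a first-order proxy (the magnitude of gradient times weight); the scoring function, the `tau` threshold for choosing structured vs. unstructured tuning, and all names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sensitivity_scores(weights, grads):
    # First-order proxy for per-parameter sensitivity: |g * w|.
    # (An assumed criterion for illustration only.)
    return {name: np.abs(grads[name] * weights[name]) for name in weights}

def partition(weights, grads, budget, tau=0.5):
    """Split weight matrices into structured and unstructured tuning sets.

    budget: total number of tunable parameters to keep (top-k globally).
    tau:    if a matrix holds more than this fraction of sensitive
            parameters, the whole matrix is tuned (structured); otherwise
            only its sensitive connections are tuned (unstructured).
    """
    scores = sensitivity_scores(weights, grads)
    flat = np.concatenate([s.ravel() for s in scores.values()])
    k = min(budget, flat.size)
    thresh = np.sort(flat)[-k]  # global top-k sensitivity threshold

    structured, masks = [], {}
    for name, s in scores.items():
        frac = np.mean(s >= thresh)  # fraction of sensitive parameters
        if frac > tau:
            structured.append(name)   # tune whole matrix (low-rank update)
        else:
            masks[name] = s >= thresh # tune only sensitive connections
    return structured, masks

# Toy example: matrix "a" is uniformly sensitive, "b" has one hot entry.
weights = {"a": np.ones((2, 2)), "b": np.ones((2, 2))}
grads = {"a": np.full((2, 2), 10.0),
         "b": np.array([[10.0, 0.0], [0.0, 0.0]])}
structured, masks = partition(weights, grads, budget=5)
```

In this toy example, `"a"` is selected for structured tuning (all of its parameters clear the global threshold), while `"b"` receives an unstructured mask covering only its single sensitive connection.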

1. INTRODUCTION

The pre-training and fine-tuning paradigm has underpinned the most recent breakthroughs in vision, yielding stunning empirical performance on a series of tasks such as segmentation (Chen et al., 2017; Ronneberger et al., 2015) and detection (He et al., 2017; Carion et al., 2020). The Transformer (Vaswani et al., 2017) has been widely adopted as the standard architecture for pre-trained vision models, with representatives including CLIP (Radford et al., 2021), MAE (He et al., 2022b), BEiT (Bao et al., 2022), etc. To adapt the pre-trained representations to downstream tasks, the de facto choice is fine-tuning, which initializes the model with the pre-trained weights and tunes all parameters. However, vanilla fine-tuning needs to store a separate instance of the parameters for each task and each deployment scenario. This can be extremely storage-intensive, as the storage cost grows linearly with the number of possible cases, given the vast variety of downstream tasks and dynamic deployment environments, especially when deploying large vision models (Dosovitskiy et al., 2021; Liu et al., 2021; Xu et al., 2021b) to mobile systems. For example, even storing a single large pre-trained ViT-H (He et al., 2022b) model on a local disk requires at least 2.3GB, while the top-10 U.S. apps collectively required only 2.2GB in May 2021.[1]



[1] https://sensortower.com/blog/ios-app-size-growth-2021

