TOWARDS A UNIFIED VIEW ON VISUAL PARAMETER-EFFICIENT TRANSFER LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Since the release of various large-scale natural language processing (NLP) pre-trained models, parameter-efficient transfer learning (PETL) has become a popular paradigm capable of achieving impressive performance on various downstream tasks. PETL aims to make good use of the representation knowledge in large pre-trained models by fine-tuning only a small number of parameters. Recently, developing PETL techniques for vision tasks has also attracted increasing attention. Popular PETL techniques such as Prompt-tuning and Adapter have been proposed for high-level visual downstream tasks such as image classification and video recognition, but Prefix-tuning remains under-explored for vision tasks. In this work, we aim to adapt large video-based models to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view of PETL, called visual-PETL (V-PETL), to investigate the effects of different PETL techniques, data scales of downstream domains, positions of trainable parameters, and other aspects that affect this trade-off. Specifically, we analyze the positional importance of trainable parameters and the differences between NLP and vision tasks in terms of data structures and pre-training mechanisms while implementing various PETL techniques, especially the under-explored Prefix-tuning. Based on a comprehensive understanding of the differences between NLP and video data, we propose a new variant of the Prefix-tuning module, called parallel attention (PATT), for video-based downstream tasks. An extensive empirical analysis on two video datasets with different frozen backbones shows that the proposed PATT effectively complements other PETL techniques. An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters, and outperforms full-tuning with far fewer parameters.
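To make the Prefix-tuning idea discussed above concrete, the following is a minimal NumPy sketch of generic prefix-tuning for a single attention head: trainable prefix key/value vectors (`Pk`, `Pv`) are prepended to the frozen projections of the input, so only `2 * L * d` parameters are updated per layer. This is an illustrative simplification, not the paper's exact PATT module; all names and shapes here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(x, Wq, Wk, Wv, Pk, Pv):
    """Single-head self-attention over x with trainable prefix
    key/value vectors (Pk, Pv) prepended; Wq, Wk, Wv stay frozen."""
    q = x @ Wq                                  # (T, d) queries
    k = np.concatenate([Pk, x @ Wk], axis=0)    # (L+T, d) prefix + keys
    v = np.concatenate([Pv, x @ Wv], axis=0)    # (L+T, d) prefix + values
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T, L+T)
    return softmax(scores) @ v                  # (T, d)

rng = np.random.default_rng(0)
d, T, L = 8, 4, 2                               # dim, sequence len, prefix len
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # frozen
Pk, Pv = rng.normal(size=(L, d)), rng.normal(size=(L, d))  # trainable
out = prefix_attention(x, Wq, Wk, Wv, Pk, Pv)
print(out.shape)  # (4, 8): output keeps the input sequence shape
```

Note that the output sequence length is unchanged; the prefix only enlarges the set of keys and values each query can attend to, which is what keeps the number of trainable parameters small.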

1. INTRODUCTION

Many vision tasks rely on fine-tuning pre-trained models to achieve good performance. The standard modus operandi of transfer learning consists of two steps: pre-train a model on a source domain, then fine-tune the entire model on a target domain (Zhuang et al., 2020). Although prior works have achieved promising performance, this vanilla practice of fine-tuning faces challenges when adapting large models to downstream tasks. First, the full-tuning strategy requires updating and storing a separate set of model parameters for each downstream task, which can be expensive and infeasible in an era of increasingly large models, from EfficientNet-based ones (Pham et al., 2021) (480M parameters) to Transformer-based ones (Yu et al., 2022) (2,100M parameters). For such large models, making good use of shared parameter weights deployed on the cloud can benefit edge devices such as autonomous vehicles and drones, which are constrained in computing and battery resources (Yuan et al., 2022). Second, the full fine-tuning strategy relies on high-quality downstream data and can hardly adapt to unseen scenarios with large distribution shifts (Kumar et al., 2021), unlike the learning process of humans, who can learn from few samples and generalize well to new circumstances. This issue has been researched in directions such as zero-shot learning, few-shot learning, and continual learning (Li et al., 2021a). Another popular strategy is fine-tuning only the downstream task head, i.e., the last fully connected (FC) layer, to avoid tuning the whole backbone model; this usually leads to poor performance when the target domain is large in data scale (see Figure
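The gap between full-tuning and head-only tuning can be made concrete with a quick parameter count. The sketch below uses hypothetical layer sizes (a small Transformer-style block plus a 100-class FC head, all shapes are illustrative assumptions) to show why head-only tuning stores orders of magnitude fewer parameters per task:

```python
import numpy as np

# Hypothetical weight shapes for one backbone block plus a classification head.
backbone_shapes = [(768, 3072), (3072, 768), (768, 768)]  # illustrative only
head_shape = (768, 100)                                   # last FC layer

backbone_params = sum(int(np.prod(s)) for s in backbone_shapes)
head_params = int(np.prod(head_shape))

full_tuning = backbone_params + head_params  # update and store everything
head_only = head_params                      # freeze backbone, tune FC head

print(full_tuning, head_only)                # 5385216 76800
print(f"trainable fraction: {head_only / full_tuning:.2%}")
```

Under these toy sizes, head-only tuning updates under 2% of the parameters per task, which explains its appeal for deployment; PETL methods aim to keep this efficiency while recovering the accuracy that head-only tuning gives up.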

