TOWARDS A UNIFIED VIEW ON VISUAL PARAMETER-EFFICIENT TRANSFER LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Since the release of various large-scale natural language processing (NLP) pre-trained models, parameter-efficient transfer learning (PETL) has become a popular paradigm capable of achieving impressive performance on various downstream tasks. PETL aims to make good use of the representation knowledge in large pre-trained models by fine-tuning only a small number of parameters. Recently, developing PETL techniques for vision tasks has also attracted increasing attention. Popular PETL techniques such as prompt-tuning and the adapter have been proposed for high-level visual downstream tasks such as image classification and video recognition; however, prefix-tuning remains under-explored for vision tasks. In this work, we intend to adapt large video-based pre-trained models to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view of PETL, called visual-PETL (V-PETL), to investigate the effects of different PETL techniques, the data scale of the downstream domain, the position of trainable parameters, and other aspects affecting the trade-off. Specifically, while implementing various PETL techniques, especially the under-explored prefix-tuning, we analyze the positional importance of trainable parameters and the differences between NLP and vision tasks in terms of data structures and pre-training mechanisms. Based on a comprehensive understanding of the differences between NLP and video data, we propose a new variation of the prefix-tuning module, called parallel attention (PATT), for video-based downstream tasks. An extensive empirical analysis on two video datasets with different frozen backbones has been carried out, and the findings show that the proposed PATT can effectively contribute to other PETL techniques.
An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters, and outperforms full-tuning with far fewer parameters.

1. INTRODUCTION

Many vision tasks rely on fine-tuning pre-trained models to achieve good performance. One standard modus operandi of transfer learning consists of two steps: pre-train a model on a source domain and fine-tune the entire model on a target domain (Zhuang et al., 2020). Although prior works have achieved promising performance, this vanilla practice of fine-tuning faces challenges when adapting large models to downstream tasks. First, the full-tuning strategy requires updating and storing separate model parameters for each downstream task, which can be expensive and infeasible in an era of increasingly large models, from EfficientNet-based (Pham et al., 2021) (480M parameters) to Transformer-based (Yu et al., 2022) (2,100M parameters) ones. For such large models, making good use of shared parameter weights deployed on the cloud can be beneficial for edge devices such as autonomous vehicles and drones, which are constrained in computing and battery resources (Yuan et al., 2022). Second, full fine-tuning relies on high-quality downstream data and can hardly adapt to unseen scenarios with a large distribution shift (Kumar et al., 2021), unlike the learning process of humans, who can learn from few samples and generalize well to new circumstances. This issue has been studied in directions such as zero-shot learning, few-shot learning, and continual learning (Li et al., 2021a). Another popular strategy is fine-tuning only the downstream task head, i.e., the last fully connected (FC) layer, to avoid tuning the whole backbone, but this usually leads to poor performance when the target domain is large in data scale (see Figure 1). Given the paradigm of fine-tuning increasingly large models, how to transfer such large models with a good parameter-accuracy trade-off is a hot topic in various domains (Gusak et al., 2022; Sung et al., 2022; Lin et al., 2020; Houlsby et al., 2019).
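The head-only strategy above can be made concrete with a minimal PyTorch sketch. The two-layer backbone here is a hypothetical stand-in (in practice the backbone would be a large pre-trained model such as Video Swin); everything except the last FC layer is frozen, so only the head receives gradients:

```python
import torch.nn as nn

# Hypothetical two-layer stand-in for a large pre-trained backbone.
backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
head = nn.Linear(768, 174)  # e.g., Something-Something v2 has 174 action classes
model = nn.Sequential(backbone, head)

# Head-only fine-tuning: freeze every backbone parameter.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

An optimizer would then be built only from the unfrozen parameters, e.g. `filter(lambda p: p.requires_grad, model.parameters())`, so the storage cost per downstream task is just the head.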
Taking the video-based action recognition task as an example, it can be inconvenient to deploy such large models to edge devices such as autonomous driving vehicles (Liu et al., 2019) and unmanned aerial vehicles (Li et al., 2021b), as they heavily rely on interaction with cloud services to adapt to new environments via active learning (Wang et al., 2021) or continual learning (Li et al., 2021a). Re-training large models on the cloud is usually not cost-effective due to the expensive overheads of storage and computational resources. Furthermore, these resources are limited on edge devices such as autonomous vehicles and unmanned aerial vehicles, motivating the development of effective fine-tuning methods with a proper parameter-accuracy trade-off that can be fine-tuned on edge devices while interacting with the large models deployed on the cloud.
Our main contributions are threefold: 1. We analyze different PETL techniques using the backbone model Video Swin Transformer for



Figure 1: Parameter-accuracy trade-off. Adapting the backbone Swin-B (Liu et al., 2022) pre-trained on Kinetics 400 via different fine-tuning methods on the Something-Something v2 (Goyal et al., 2017) dataset. Our methods perform significantly better than the state-of-the-art AdaptFormer-Swin (Chen et al., 2022) (our implementation with batch size 16) with slightly more tunable parameters, and outperform full-tuning with increasing margins when using larger values of d_bottle.

There have been some pioneering works on the PETL of visual models, such as AdaptFormer (Chen et al., 2022) and visual prompt tuning (VPT) (Jia et al., 2022). AdaptFormer is primarily built on the vision transformer (Zhai et al., 2022), one of the state-of-the-art large models for image-based tasks. Its adapter module is directly borrowed from Houlsby et al. (2019) due to its convenience of being inserted into any model. Implemented with a large batch size of 1,024 on 64 GPUs, AdaptFormer shows a promising parameter-accuracy trade-off on video data. However, such powerful computing resources are not realistic for edge devices, and whether the good trade-off can be maintained at small batch sizes remains under-explored. Inspired by prompting in NLP (Liu et al., 2021), VPT proposes visual prompts to fine-tune visual models for image-based tasks. According to the empirical results in Chen et al. (2022), adapter modules achieve superior performance over VPT in both the self-supervised and supervised pre-training regimes. Another concern with VPT is that its modification to the original model parameters might affect the knowledge representation of the backbone model. Hence, we do not further compare our method with VPT, but compare with the adapter on video-based downstream tasks.
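For reference, the bottleneck adapter of Houlsby et al. (2019) that AdaptFormer borrows can be sketched as follows. This is a minimal illustration, not the exact implementation of either paper; the zero-initialization of the up-projection (so the module starts as an identity mapping) is a common convention we assume here. Tuning costs roughly 2 * d_model * d_bottle parameters per insertion point:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, d_model: int, d_bottle: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottle)
        self.up = nn.Linear(d_bottle, d_model)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts
        nn.init.zeros_(self.up.bias)    # out as an identity mapping

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

adapter = Adapter(d_model=768, d_bottle=64)
x = torch.randn(2, 16, 768)  # (batch, tokens, channels)
y = adapter(x)
```

Because the residual branch is zero at initialization, inserting the adapter does not perturb the frozen backbone's behavior before training, which is one reason the module can be dropped into any model.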
Taking inspiration from the recent mix-and-match adapter (MAM adapter) (He et al., 2022a) in NLP, we aim to propose a unified model for the vision domain, especially for video-based downstream tasks. He et al. (2022a) analyzed the unified view among PETL techniques such as prefix-tuning, low-rank adaptation (LoRA), and the adapter, pointing out the similarity between prefix-tuning and the adapter in terms of calculating attention. The difference is that the former performs a weighted addition while the latter's is unweighted. Note that prefix-tuning has not previously been applied to visual tasks in the form of pure visual models, due to the intrinsic differences in the pre-training methods of NLP and vision models. Another obstacle to directly applying prefix-tuning to visual tasks is the structural difference between text and vision data (we further discuss this in Section 2.3). Considering the video-based action recognition task, we propose a new variation of the prefix-tuning module, called parallel attention (PATT), to adapt video-based pre-trained large models to downstream domains of varied data scales. The differences between our method and the original prefix-tuning in NLP are twofold: the prefix calculation and the manner of insertion (see Figure 2(b) and Figure 3). Regarding the backbone model, we focus on the Video Swin Transformer (Liu et al., 2022), one of the state-of-the-art vision models, which brings competitive performance on large-scale action recognition datasets such as Kinetics 400 and 600 (Kay et al., 2017).
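The attention mechanism underlying prefix-tuning can be sketched in a few lines (single head, no projection matrices, toy sizes — an illustration of the generic mechanism, not our PATT module). Learnable prefix keys and values are concatenated in front of the frozen keys and values, so the output becomes a softmax-weighted mix of prefix values and original values, which is exactly the weighted addition noted in the unified view:

```python
import torch
import torch.nn.functional as F

d, n, n_pre = 64, 16, 4
q = torch.randn(n, d)        # queries from the frozen backbone
k = torch.randn(n, d)        # original keys
v = torch.randn(n, d)        # original values
p_k = torch.randn(n_pre, d)  # learnable prefix keys
p_v = torch.randn(n_pre, d)  # learnable prefix values

# Prepend the prefixes, then run standard scaled dot-product attention:
# each row of `attn` distributes weight over prefix and original positions.
attn = F.softmax(q @ torch.cat([p_k, k]).T / d ** 0.5, dim=-1)
out = attn @ torch.cat([p_v, v])
```

Only `p_k` and `p_v` would be trained; the backbone's projections stay frozen, which is what makes the technique parameter-efficient.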

