TOWARDS A UNIFIED VIEW ON VISUAL PARAMETER-EFFICIENT TRANSFER LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Since the release of various large-scale natural language processing (NLP) pre-trained models, parameter-efficient transfer learning (PETL) has become a popular paradigm capable of achieving impressive performance on various downstream tasks. PETL aims to make good use of the representation knowledge in pre-trained large models by fine-tuning only a small number of parameters. Recently, developing PETL techniques for vision tasks has also attracted increasing attention. Popular PETL techniques such as prompt-tuning and adapters have been proposed for high-level visual downstream tasks such as image classification and video recognition, but prefix-tuning remains under-explored for vision tasks. In this work, we intend to adapt large video-based models to downstream tasks with a good parameter-accuracy trade-off. Towards this goal, we propose a framework with a unified view of PETL called visual-PETL (V-PETL) to investigate the effects of different PETL techniques, data scales of downstream domains, positions of trainable parameters, and other aspects affecting the trade-off. Specifically, we analyze the positional importance of trainable parameters and the differences between NLP and vision tasks in terms of data structures and pre-training mechanisms while implementing various PETL techniques, especially the under-explored prefix-tuning technique. Based on a comprehensive understanding of the differences between NLP and video data, we propose a new variation of the prefix-tuning module called parallel attention (PATT) for video-based downstream tasks. An extensive empirical analysis on two video datasets with different frozen backbones shows that the proposed PATT effectively complements other PETL techniques.
An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin with slightly more parameters, and outperforms full-tuning with far fewer parameters.

1. INTRODUCTION

Many vision tasks rely on fine-tuning pre-trained models to achieve good performance. One standard modus operandi of transfer learning consists of two steps: pre-train a model on a source domain and fine-tune the entire model on a target domain (Zhuang et al., 2020). Although prior works have achieved promising performance, this vanilla fine-tuning practice faces challenges when adapting large models to downstream tasks. First, full-tuning requires updating and storing separate model parameters for each downstream task, which can be expensive and infeasible in the era of increasingly large models, from EfficientNet-based (Pham et al., 2021) (480M parameters) to Transformer-based (Yu et al., 2022) (2,100M parameters) ones. For such large models, making good use of shared parameter weights deployed on the cloud can benefit edge devices such as autonomous vehicles and drones, which are constrained in computing and battery resources (Yuan et al., 2022). Second, the full fine-tuning strategy relies on high-quality downstream data and can hardly adapt to unseen scenarios with large distribution shifts (Kumar et al., 2021), unlike humans, who can learn from few samples and generalize well to new circumstances; this issue has been studied in directions such as zero-shot learning, few-shot learning, and continual learning (Li et al., 2021a). Another popular strategy is fine-tuning only the downstream task head, i.e., the last fully connected (FC) layer, to avoid tuning the whole backbone model, which usually leads to poor performance when the target domain is large in data scale (see Figure 1). Given the paradigm of fine-tuning increasingly large models, how to transfer such large models with a good parameter-accuracy trade-off is a hot topic across domains (Gusak et al., 2022; Sung et al., 2022; Lin et al., 2020; Houlsby et al., 2019).
Taking the video-based action recognition task as an example, deploying such large models to edge devices such as autonomous vehicles (Liu et al., 2019) and unmanned aerial vehicles (Li et al., 2021b) can be inconvenient, as these devices rely heavily on interaction with cloud services to adapt to new environments via active learning (Wang et al., 2021) or continual learning (Li et al., 2021a). Re-training large models on the cloud is usually not cost-effective due to the expensive overhead of storage and computational resources; moreover, these resources are limited on edge devices, motivating the development of effective fine-tuning methods with a proper parameter-accuracy trade-off that can be fine-tuned on edge devices while interacting with the large models deployed on the cloud. There have been some pioneering works on PETL for visual models, such as AdaptFormer (Chen et al., 2022) and visual prompt tuning (VPT) (Jia et al., 2022). AdaptFormer is primarily built on the vision transformer (Zhai et al., 2022), one of the state-of-the-art large models for image-based tasks; its adapter module is borrowed directly from Houlsby et al. (2019) for its convenience of being inserted into any model. Using a large batch size of 1,024 on 64 GPUs, AdaptFormer shows a promising parameter-accuracy trade-off on video data. However, such powerful computing resources are not realistic for edge devices, and whether the good trade-off is maintained at small batch sizes remains under-explored. Inspired by prompting in NLP (Liu et al., 2021), VPT proposes visual prompts to fine-tune visual models for image-based tasks. According to the empirical results in Chen et al. (2022), adapter modules achieve superior performance over VPT in both the self-supervised and supervised pre-training regimes.
Another concern with VPT is that its modification of the original inputs might affect the knowledge representation of the backbone model. Hence, we do not compare our method with VPT further, but rather with the adapter on video-based downstream tasks. Taking recent inspiration from the mix-and-match adapter (MAM adapter) (He et al., 2022a) in NLP, we aim to propose a unified model for the vision domain, especially for video-based downstream tasks. He et al. (2022a) analyzed a unified view of PETL techniques such as prefix-tuning, low-rank adaptation (LoRA), and the adapter, pointing out the similarity between prefix-tuning and the adapter in terms of calculating attention: the former performs a weighted addition while the latter performs an unweighted one. Note that prefix-tuning has never been applied to visual tasks in the form of pure vision models, due to intrinsic differences in the pre-training methods of NLP and vision models. Another obstacle to directly applying prefix-tuning to visual tasks is the structural difference between text and vision data (we further discuss this in Section 2.3). Considering the video-based action recognition task, we propose a new variation of the prefix-tuning module called parallel attention (PATT) to adapt video-based pre-trained large models to downstream domains of varied data scales. The differences between our method and the original prefix-tuning in NLP are twofold: the prefix calculation and the manner of insertion (see Figure 2 [b] and Figure 3). As the backbone model, we focus on the Video Swin Transformer (Liu et al., 2022), one of the state-of-the-art vision models, which brings competitive performance on large-scale action recognition datasets such as Kinetics 400 and 600 (Kay et al., 2017). Our main contributions are threefold: 1.
We analyze different PETL techniques using the Video Swin Transformer backbone.

2.1. PRELIMINARY: VIDEO SWIN TRANSFORMER

Each Video Swin Transformer block computes

Ẑ^l = 3DSW-MSA(LN(Z^{l-1})) + Z^{l-1},    Z^l = FFN(LN(Ẑ^l)) + Ẑ^l,    (1)

where Ẑ^l and Z^l respectively denote the outputs of the 3DSW-MSA and FFN modules. Given a video input sized t × w × h × 3, containing t video frames with height h and width w, each 3D patch sized 2 × 4 × 4 × 3 is treated as a token, so a 3D patch-partitioning layer yields t/2 × w/4 × h/4 3D tokens. Given these tokens and a 3D window sized p × m × m, the self-attention module with the regular window-partition strategy splits the tokens into t/2p × w/4m × h/4m non-overlapping windows. For the shifted 3D window, the partition is shifted along the temporal, height, and width dimensions by (p/2, m/2, m/2). For example, for an input video sized 8 × 224 × 224 × 3 and an 8 × 7 × 7 3D window, patch embedding yields 4 × 56 × 56 3D tokens, each covering a 2 × 4 × 4 × 3 patch. Without shifting, the number of non-overlapping windows is 1 × 8 × 8 = 64; after shifting the 3D window by (4, 3, 3), the number of 3D windows becomes 1 × 9 × 9 = 81. The 3DSW-MSA module uses a 3D relative position bias B ∈ R^{p²×m²×m²}, and each head computes

Attention(Q, K, V) = SoftMax(QK^T / √d + B) V,    (2)

where Q, K, V ∈ R^{p×m×m×d} are the query, key, and value matrices, p × m × m is the number of tokens in a window, and d is the token dimension. MSA performs the attention mechanism for n_head heads in parallel, where the i-th head is parameterized by W_q^{(i)}, W_k^{(i)}, W_v^{(i)} ∈ R^{d×d}, projecting the input Z^{l-1} to queries, keys, and values. Given a matrix C ∈ R^{m×d}, m = p × m × m, used for computing attention, 3DSW-MSA can be written as

3DSW-MSA(Z^{l-1}, C) = Concat(head_1, ..., head_n) W_o,    head_i = Attention(Z^{l-1} W_q^{(i)}, C W_k^{(i)}, C W_v^{(i)}),    (3)

where W_o holds the parameters of a linear projection layer.
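The window-count arithmetic in the example above can be checked with a short, self-contained sketch. This is pure Python illustrating the padded-partition view described in the text; the handling of the half-window shift is our reading of that description, not code from the paper:

```python
import math

def num_3d_windows(video_shape=(8, 224, 224), patch=(2, 4, 4),
                   window=(8, 7, 7), shifted=False):
    """Count the 3D attention windows for the worked example in the text.

    The patch-partition layer yields one token per `patch` voxels, giving a
    token grid of t/2 x h/4 x w/4. Regular partitioning tiles that grid with
    `window`-sized blocks; shifting by half a window along an axis adds one
    partial block whenever the axis spans more than a single window.
    """
    tokens = [v // p for v, p in zip(video_shape, patch)]  # e.g. (4, 56, 56)
    count = 1
    for t, w in zip(tokens, window):
        if t <= w:                 # the whole axis fits into one window
            count *= 1
        elif not shifted:
            count *= math.ceil(t / w)
        else:                      # half-window shift splits the border blocks
            count *= math.ceil((t - w // 2) / w) + 1
    return count

print(num_3d_windows(shifted=False))  # 1 x 8 x 8 = 64 windows
print(num_3d_windows(shifted=True))   # 1 x 9 x 9 = 81 windows
```

Note that the temporal axis contributes a single window in both cases here, since the 4 token frames fit inside the temporal window extent of 8.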
The FFN module is composed of two linear layers with a GELU activation function in between:

FFN(Ẑ^l) = GELU(LN(Ẑ^l) W_1 + b_1) W_2 + b_2,    (4)

where W_1 ∈ R^{d×d_hidden}, W_2 ∈ R^{d_hidden×d}, b_1 ∈ R^{d_hidden}, and b_2 ∈ R^d; d_hidden usually takes a large value (e.g., d_hidden = 4d).

Figure 2: V-PETL: a unified view of visual PETL techniques. They bring trainable parameters to different positions of the backbone model in various manners. AdaptFormer and prefix-tuning respectively operate at the MLP and 3DSW-MSA modules, where the number of trainable parameters can be adjusted via the bottleneck size of the down and up projections, while prompt-tuning, performed at the layer level, can adjust the prompt length to control the number of tuned parameters.

2.2. PETL TECHNIQUES

Prefix-tuning (Li & Liang, 2021): The prefix-tuning approach prepends learnable prefix tokens to the keys and values of the model's MSA module (see Figure 2 [b]). Specifically, two prefix matrices P_k, P_v ∈ R^{d_token×d} with d_token tokens each, randomly initialized and transformed by two linear layers (with parameters W_pk^{(i)} ∈ R^{d×d_middle} and W_pv^{(i)} ∈ R^{d_middle×d}) with a Tanh layer in between, are concatenated to the original keys and values, changing the calculation of head_i in Eq. 3 to

head_i = Attention(Z^{l-1} W_q^{(i)}, concat(P_k^{(i)}, C W_k^{(i)}), concat(P_v^{(i)}, C W_v^{(i)})),    (5)

where concat is performed along the token dimension to mimic prefix-tuning in NLP tasks. Whether this direct implementation works in the vision domain is an open question (results are in Table 4); it turns out to be empirically invalid, and we modify it further in Section 2.3.

Adapter (Chen et al., 2022): The adapter is a bottleneck module with a down projection W_down ∈ R^{d×d_bottle} and an up projection W_up ∈ R^{d_bottle×d}.
Its output is

Z̃^l = ReLU(LN(Ẑ^l) W_down) W_up,    (6)

and the two adapter positions (parallel and sequential) can respectively be computed as

Z^l = FFN(LN(Ẑ^l)) + Ẑ^l + s·Z̃^l,    (7)

Z^l = ReLU(FFN(LN(Ẑ^l)) W_down) W_up + Ẑ^l,    (8)

where s is a scalar controlling the effect of the adapter (ablated in the experiments). According to Chen et al. (2022), the parallel implementation (see Figure 2 [a]) empirically performs better.

Prompt-tuning (Jia et al., 2022): Prompt-tuning (see Figure 2 [c]) is inspired by the success of prompting in adapting large-scale models to varied downstream NLP tasks. The idea of VPT (Jia et al., 2022) is to fine-tune a learnable matrix P_prompt^{l-1} ∈ R^{d_prompt×d}, d_prompt < d_token − 1, for the l-th Transformer layer or for all Transformer layers, known as shallow and deep prompts, respectively:

Ẑ^l = 3DSW-MSA(LN([x^{l-1}, P_prompt^{l-1}, Z^{l-1}])) + Z^{l-1},

where x^{l-1} ∈ R^d denotes the [CLS] embedding in the l-th layer's input space, and P_prompt^{l-1} is implemented by overlapping the top d_prompt tokens of Z^{l-1} (Jia et al., 2022); it has also been placed in front of x^{l-1} (Chen et al., 2022).

Others: Other PETL techniques include ST-Adapter (Pan et al., 2022), LoRA (Hu et al., 2022), and BitFit (Zaken et al., 2022). ST-Adapter mainly adapts image-text models pre-trained on large-scale datasets, such as the 400M image-text pairs of CLIP (Radford et al., 2021) and the IG-3.6B dataset used by SWAG (Singh et al., 2022), to video understanding downstream tasks, matching and even outperforming full-tuning. LoRA approximates the optimization process by injecting learnable low-rank matrices into the attention module; as it does not show superior parameter efficiency on NLP tasks, we do not prioritize this direction in this work. BitFit tunes only the bias terms of the backbone model, making it very parameter-efficient.
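As a concrete illustration of the parallel adapter form in Eq. 7, the following NumPy sketch uses toy dimensions and random stand-in weights for the frozen FFN; the zero-initialized up projection (so the adapter starts as an identity mapping) is a common convention we adopt here, not necessarily the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden, d_bottle, n_tokens = 96, 384, 16, 8   # toy sizes for illustration
s = 0.1                                            # adapter scaling scalar

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Frozen pre-trained FFN parameters (random stand-ins here).
W1, b1 = rng.normal(size=(d, d_hidden)) * 0.02, np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)) * 0.02, np.zeros(d)

# Trainable adapter bottleneck: only these 2 * d * d_bottle weights are tuned.
W_down = rng.normal(size=(d, d_bottle)) * 0.02
W_up = np.zeros((d_bottle, d))   # zero-init: the adapter initially adds nothing

def block_with_parallel_adapter(z_hat):
    h = layer_norm(z_hat)
    ffn = gelu(h @ W1 + b1) @ W2 + b2          # frozen branch
    adapter = np.maximum(h @ W_down, 0) @ W_up  # trainable ReLU bottleneck (Eq. 6)
    return ffn + z_hat + s * adapter            # parallel insertion (Eq. 7)

z_hat = rng.normal(size=(n_tokens, d))
out = block_with_parallel_adapter(z_hat)
print(out.shape)  # (8, 96)
```

With the zero-initialized W_up the block initially reproduces the frozen FFN's output exactly, which is one way to avoid the random-noise issue discussed for prefix-tuning below.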

2.3. REVISITING PREFIX-TUNING FOR VISUAL TASKS

The prefix implementation in NLP (Li & Liang, 2021; He et al., 2022a) can be regarded as prepending contextual information for downstream tasks, which resembles the pre-training objective of predicting masked words in an inner loop (Brown et al., 2020). Considering the pre-training process of pure vision models, such a direct implementation might not make sense for visual tasks. Although autoregressive pre-training has been conducted in the visual domain (He et al., 2022b; Tong et al., 2022), adding a prefix to a sentence input in NLP is structurally different from the visual domain: masked pixels in image or video data cannot be regarded as word-level semantic information (e.g., a subject or an action) as in NLP.

Figure 3: Structure of PATT. Red parts are trainable parameters calculated from the same input used to prepare the query, key, and value (i.e., the output of the previous layer passed through a layer normalization, Z^{l-1}).

Recall that the embedding state in prefix-tuning is randomly initialized; this learnable prefix can introduce random noise that, as we find later, affects the convergence of fine-tuning on downstream tasks. Hence, inspired by the connection between the adapter and the prefix (He et al., 2022a), we avoid such a randomly initialized learnable prefix and propose a parallel attention (PATT) module alongside the original attention module (see Figure 3). The adapter structure effectively controls the number of trainable parameters via d_bottle, which is similar to the effect of the middle dimension d_middle of W_pk^{(i)} and W_pv^{(i)} for preparing the prefix. Specifically, for the l-th layer, we use the output of its previous layer Z^{l-1} and project it to a pair of matrices K_p, V_p ∈ R^{m×d} via a mechanism similar to Eq.
6:

K_p, V_p = Tanh(Z^{l-1} W_down) W_up,    (9)

where Tanh is the activation function used for preparing the prefix; it could be replaced by other activations such as ReLU or GELU, but we follow the original prefix implementation since Tanh's output ranges from −1 to 1. Given K_p and V_p, Eq. 5 can be rewritten as

head_i = Attention(Z^{l-1} W_q^{(i)}, sK_p + C W_k^{(i)}, sV_p + C W_v^{(i)}),    (10)

where s is a scalar adjusting the effect of PATT. Note that, setting aside the physical meaning of this design, for PETL purposes one can perform a similar practice for any combination of Q, K, and V. This connects to LoRA (Hu et al., 2022), which adds parallel trainable parameters to Q and V. Empirically, where PATT is applied makes little difference, but the amount of trainable parameters has a larger effect on large-scale downstream domains. Given the PETL techniques at hand, many potential combinations could lead to a good parameter-accuracy trade-off; however, it is unrealistic to exhaustively test all of them for a specific downstream task.
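A minimal single-head NumPy sketch of Eqs. 9 and 10 follows. The sizes are toy values, the frozen projections are random stand-ins, and since the text does not fully specify whether K_p and V_p share one bottleneck, we use two separate down/up pairs here as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_bottle, n_tokens = 96, 16, 8   # toy sizes
s = 0.8                             # scalar of Eq. 10 (the best value in Table 3)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Frozen single-head projections (random stand-ins for pre-trained weights).
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))

# Trainable PATT bottlenecks generating K_p and V_p from the layer input (Eq. 9).
Wd_k, Wu_k = rng.normal(size=(d, d_bottle)) * 0.02, rng.normal(size=(d_bottle, d)) * 0.02
Wd_v, Wu_v = rng.normal(size=(d, d_bottle)) * 0.02, rng.normal(size=(d_bottle, d)) * 0.02

def patt_attention(z):
    """Single-head attention with PATT: K_p/V_p are added, scaled by s, to K/V."""
    K_p = np.tanh(z @ Wd_k) @ Wu_k   # Tanh bottleneck, as in Eq. 9
    V_p = np.tanh(z @ Wd_v) @ Wu_v
    Q, K, V = z @ Wq, z @ Wk, z @ Wv
    A = softmax(Q @ (K + s * K_p).T / np.sqrt(d))  # Eq. 10: addition, not concat
    return A @ (V + s * V_p)

z = rng.normal(size=(n_tokens, d))
print(patt_attention(z).shape)  # (8, 96)
```

The key structural difference from the direct prefix-tuning of Eq. 5 is visible in the last two lines: K_p and V_p are added to the existing keys and values rather than concatenated as extra tokens, so the attention map keeps its original shape.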

2.4. V-PETL: UNIFIED VIEW ON VISUAL PETL

Rather than probing for such a solution via evolutionary search as in Zhang et al. (2022), we aim to propose more interpretable models by empirically analyzing the effect of each design independently. According to the preliminary results shown in Figure 1, we argue that the position and amount of trainable parameters are important for PETL techniques, especially when the target domain is not small. To verify the importance of position and tuned-parameter amount, we independently tune different modules of the backbone model; Table 1 shows the results. The attention module's QKV layer has 20.98M parameters, while the MLP module has the most parameters (55.90M), and tuning positions with more parameters leads to better performance on SSv2. Thanks to the bottleneck mechanism of the adapter and prefix-tuning, one can effectively achieve a good parameter-accuracy trade-off. As such, we derive a model called Swin-B-adapter-PATT (Swin-BAPAT) from the V-PETL framework, using the parallel adapter and our PATT to adapt the pre-trained backbone model at the MLP and attention modules, respectively. In addition to the adapter and PATT, we also fine-tune the last fully connected layer, as it has relatively few tunable parameters (0.18M) compared with the adapter and PATT.

Implementation details: It is worth noting that a big batch size (e.g., 1,024) and a large number of input video frames (e.g., 32) greatly benefit performance (Carreira & Zisserman, 2017; Liu et al., 2022; Chen et al., 2022) but usually require GPU clusters for training. AdaptFormer (Chen et al., 2022) uses such a powerful GPU cluster to achieve good performance; however, that performance might not hold when the batch size is small. Following a more common hardware setup, we use 4 GeForce RTX 3090 GPUs for all experiments, leading to a batch size of 64. All experiments are fine-tuned for 70 epochs.
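How d_bottle steers the trainable-parameter budget can be sketched with simple arithmetic. The sketch below is illustrative only: Video Swin-B actually uses stage-varying widths, whereas we assume a flat dimension d = 1024, and fc_classes = 174 is SSv2's class count (the FC term reproduces the 0.18M figure quoted above); the block structure is our assumption, not the paper's exact accounting:

```python
def adapter_params(d, d_bottle):
    """Weights of one down/up bottleneck pair (biases omitted for simplicity)."""
    return d * d_bottle + d_bottle * d

def swin_bapat_trainable(d=1024, d_bottle=64, n_blocks=24, fc_classes=174):
    """Rough trainable-parameter budget for a Swin-BAPAT-style setup:
    one parallel adapter at the MLP plus PATT's two K/V bottlenecks per
    block, and the task head (FC layer). Illustrative arithmetic only."""
    per_block = adapter_params(d, d_bottle)        # parallel adapter at the MLP
    per_block += 2 * adapter_params(d, d_bottle)   # PATT bottlenecks for K_p, V_p
    return n_blocks * per_block + d * fc_classes   # + FC head (~0.18M for SSv2)

for db in (32, 64, 128, 256):
    print(db, f"{swin_bapat_trainable(d_bottle=db) / 1e6:.2f}M")
```

Doubling d_bottle roughly doubles the bottleneck budget while the FC head stays fixed, which is the lever the experiments below sweep over.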
We use the Swin-B model [1] pre-trained on Kinetics 400 and 600. For HMDB51, we report results without tuning the FC layer due to the significant effect of the FC layer on relatively small-scale datasets. Following Chen et al. (2022), we do not apply regularization strategies such as mixup, cutmix, or color jittering. Our PATT module can conveniently be applied to other Transformer-based models; hence, we also adopt ViT-B models from MAE (He et al., 2022b) and VideoMAE (Tong et al., 2022) for further comparison on video and image datasets, following the self-supervised pre-training setting [2] of Chen et al. (2022) except that the batch size is set to 256 instead of 1,024.

3. EXPERIMENTS

Baselines: We mainly compare our method Swin-BAPAT with five baselines: (1) Full-tuning: make all parameters learnable and tune the whole model initialized with the pre-trained weights. (2) Tuning the FC layer: tune the last fully connected layer and freeze the pre-trained parameters of the whole backbone model. (3) AdaptFormer-Swin: the method introduced by Chen et al. (2022), which adds a parallel adapter to the MLP module in each block of the backbone model. (4) Prefix-tuning: the direct implementation of prefix-tuning from NLP as defined in Eq. 5. (5) BitFit: tuning the biases of the backbone model together with the FC layer.

3.2. THE EFFECT OF DIFFERENT PETL TECHNIQUES

Table 2 shows the results of different PETL techniques. Among the baseline methods, full-tuning performs best on the large-scale dataset SSv2, whereas tuning the FC layer achieves superior performance over other PETL techniques on HMDB51. This is because downstream tasks with relatively large datasets are more parameter-hungry for good convergence; conversely, small datasets can make good use of the knowledge from the source domain with the slight adaptation effort of an FC layer. The effect of this FC layer when combined with other PETL techniques has not been investigated: since the FC layer, despite its small number of tunable parameters, can already make a big difference, performing better than full-tuning and other PETL techniques and rendering them ineffective for small-scale datasets, we examine this question further in Section A.1. We test different amounts of parameters adjusted by d_bottle, taking values of 32, 64, 128, and 256. The second and third groups of results in Table 2 (without and with the adapter, respectively) show that larger values of d_bottle benefit fine-tuning, with a slight parameter overhead, on large-scale datasets such as SSv2. All results of our Swin-BAPAT outperform the state-of-the-art AdaptFormer-Swin by a large margin (even the smallest value, d_bottle = 32, improves on AdaptFormer-Swin by almost 25%). Even without the adapter, our method still outperforms the AdaptFormer-Swin and BitFit baselines with a roughly similar number of parameters. When d_bottle is larger than 64, our Swin-BAPAT starts to perform better than full-tuning on both datasets with a proper parameter-accuracy trade-off, validating the effectiveness of Swin-BAPAT for PETL.

3.3. THE EFFECT OF DIFFERENT PRE-TRAINING DOMAINS

The knowledge in a pre-trained model is learned from the source domain. We test models pre-trained on three large-scale datasets: Kinetics 400, Kinetics 600, and ImageNet-22K. The findings show that both video models pre-trained on such large-scale datasets benefit our proposed PETL strategy, with the Kinetics 600 model being slightly more significant (see the third group of comparisons in Table 2). This is because Kinetics 600 is larger than its 400 version and brings more knowledge to the pre-trained model, benefiting downstream tasks more. However, image-based pre-training cannot perform as well as video-based pre-training due to the larger domain gap.

3.4. THE EFFECT OF DIFFERENT VIDEO INPUT SIZE

We also test whether our method is robust to an increased number of input video frames. It is worth noting that larger numbers of input frames usually bring more spatio-temporal information, helping data-driven models learn more distinguishable features while keeping the model size unchanged. The last group of comparisons in Table 2 shows that using double-sized video input (i.e., 16 frames) greatly improves action recognition performance on both the small and large-scale datasets. The improvements (9.78%, from 53.36% to 63.14%, on SSv2, and 3.74%, from 71.93% to 75.67%, on HMDB51) are more significant than those of other factors such as d_bottle and the pre-training domain (around 1% to 2%). The top line in Figure 4 visualizes the significant effect of increasing the number of input video frames. These results suggest that our Swin-BAPAT is promising for larger numbers of input video frames.

3.5. THE EFFECT OF THE SCALAR s

Recall that the effect of our PATT on the pre-trained model can be adjusted by the scalar s in Eq. 10. Table 3 shows that a value of 0.8 delivers the consistently best performance on both SSv2 and HMDB51 under our experimental setting. Smaller values of s reduce the effect of our PATT module on the knowledge transfer, while larger values increase it. The good performance achieved with a scale of 0.8 indicates that our PATT module plays an important role in the knowledge transfer; however, values above 0.8 can suppress the original knowledge of the pre-trained model. Hence, a properly valued scalar s is essential for balancing the roles of PATT and the pre-trained backbone. Note that s could be made a learnable parameter in a specific implementation; here we empirically verify the effect of the scalar.

3.6. THE EFFECT OF DIFFERENT METHODS DERIVED FROM V-PETL

We argued in Section 2.4 that, especially for relatively large downstream datasets, the position and amount of trainable parameters are important for parameter-efficient transfer learning. The proposed Swin-BAPAT is one model instantiated from the V-PETL framework with respect to the insertion position of our PATT; other instantiations can insert it at different positions, such as the query, key, and value of the attention module. We further instantiate variations of Swin-BAPAT by inserting PATT at different positions; Table 4 shows the results. Inserting at the value position of 3DSW-MSA contributes more than inserting at the other two positions, while inserting at the query or the key makes little difference: since the query and key jointly form the attention mask, inserting at either leads to a similar effect. On one hand, these results to some extent justify the original design of prefix-tuning, which brings learnable prefixes to the key and value of the attention module; on the other hand, they indicate that our claim regarding a unified view of PETL for visual tasks is reasonable. In Table 4, we also ablate designs of PATT that concatenate K_p and V_p (i.e., Concat [K, V]) and that generate K_p and V_p from trainable parameters alone (i.e., No Z^{l-1} [K, V]).

3.7. COMPARISON ON VARIED TASKS VIA SELF-SUPERVISED PRE-TRAINED MODELS

Table 5 shows the comparison with AdaptFormer-64 (Chen et al., 2022) and VPT (Jia et al., 2022) on both image- and video-based downstream tasks. Our method ViT-BAPAT still shows a promising parameter-accuracy trade-off despite a much smaller batch size, which makes it more convenient to reproduce on a typical single server with 8 GPUs. The underperformance on SSv2 (though still better than full-tuning) may be due to the smaller batch size, as SSv2 is much larger than the other compared datasets and may rely more on large batch sizes. In real-world application scenarios, small datasets are the more common case, which supports our contribution.

4. CONCLUSION

In this paper, we introduced the V-PETL framework for exploring good parameter-accuracy trade-offs in adapting video-based pre-trained large models to downstream tasks. Our Swin-BAPAT method, derived from V-PETL with a variation of prefix-tuning called PATT, effectively achieves a good parameter-accuracy trade-off on downstream tasks. The proposed PATT can easily be plugged into the attention module of other Transformer-like models, and the number of trainable parameters can easily be adjusted via d_bottle. With a small overhead of trainable parameters, our method performs significantly better than the state-of-the-art AdaptFormer-Swin and full-tuning on the SSv2 and HMDB51 datasets with a small batch size, validating our contribution to the PETL literature. In the future, we will test our proposed model on more action recognition datasets, as surveyed in Sun et al. (2022), under more learning regimes such as zero/few-shot learning, active learning, and continual learning, and with other pre-training methods such as visual-language models. We will also explore other backbone models, activation functions for PATT, and PETL techniques such as LoRA for visual tasks.

A APPENDIX

A.1 THE EFFECT OF THE FC LAYER FOR SMALL-SCALE DOWNSTREAM TASKS

For the small dataset HMDB51, given the good parameter-accuracy trade-off achieved by fine-tuning the FC layer only, additionally tuning the FC layer does not bring extra improvement to our proposed method. Without sufficient training data, full-tuning also cannot perform well (see the results in Table 2). As such, small datasets need not rely on fully tuning large models but can exploit them with light transfer. Indeed, without tuning the FC layer, our Swin-BAPAT performs better than fine-tuning the FC layer alone with a small number of extra trainable parameters (see the results in Table 6), validating the good parameter-accuracy trade-off of our method.



[1] https://github.com/SwinTransformer/Video-Swin-Transformer
[2] https://github.com/ShoufaChen/AdaptFormer/blob/main/PRETRAIN.md



Figure 1: Parameter-accuracy trade-off. Adapting backbone Swin-B (Liu et al., 2022) pre-trained on Kinetics 400 via different fine-tuning methods on the something-something v2 (Goyal et al., 2017) dataset. Our methods perform significantly better than the state-of-the-art AdaptFormer-Swin (Chen et al., 2022) (our implementation with batch size 16) with slightly more tunable parameters, and outperform full-tuning with increasing margins when using larger values of d bottle .


Figure 4: Top-1 accuracy of different settings on SSv2 throughout the training process. F: frames, S: scalar, B: d_bottle, K: pre-training domain.

Table 1: Comparison of independently fine-tuning varied positions of the Video Swin Transformer block on SSv2.

Following the experimental setups in AdaptFormer, three datasets are used: CIFAR-100 (Krizhevsky et al., 2009), Street View House Numbers (SVHN) (Goodfellow et al., 2013), and Food-101 (Bossard et al., 2014). CIFAR-100 has 50,000 training and 10,000 validation images at 32×32 resolution across 100 categories; SVHN is a digit-classification dataset with 73,257 training and 26,032 testing samples; Food-101 includes 101k images of 101 food categories, each with 750 training and 250 testing samples.

Table 2: Comparison of Top-1 accuracy using varied amounts of parameters adjusted by d_bottle, different pre-training domains, and different numbers of frames, against other fine-tuning strategies.

Table 3: Top-1 accuracy (%) using different scalar values on two datasets: SSv2 and HMDB51. d_bottle is set to 128; pre-training is based on Kinetics 400.

Table 4: Ablation of different insertion positions of PATT as defined in Eq. 10; e.g., Ours (K, V) indicates inserting PATT into the key and value of the 3DSW-MSA modules. Pre-training on Kinetics 600; d_bottle is set to 128; scalar s is set to 0.8.

Table 5: Comparison of Top-1 accuracy via ViT-B models from MAE and VideoMAE, pre-trained with self-supervised learning for image and video datasets, respectively.
Our ViT-BAPAT-64: 3.02M params (3.51%), 86.35 (+0.45), 97.18 (−0.49), 87.53 (−2.56), 57.55 (+3.58), 57.18 (+10.77)
Our ViT-BAPAT-128: 4.79M params (5.56%), 86.47 (+0.57), 97.28 (−0.39), 87.66 (−2.43), 56.97 (+3.00), 57.70 (+11.29)
Our ViT-BAPAT-256: 8.33M params (9.68%), 86.55 (+0.65), 97.24 (−0.43), 87.68 (−2.41), 56.53 (+2.56), 57.31 (+10.90)

Table 6: Results with and without tuning the FC layer on the small-scale dataset HMDB51.

