AIM: ADAPTING IMAGE MODELS FOR EFFICIENT VIDEO ACTION RECOGNITION

Abstract

Recent vision transformer based video models mostly follow the "image pre-training then finetuning" paradigm and have achieved great success on multiple video benchmarks. However, fully finetuning such a video model can be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM can achieve competitive or even better performance than prior art with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks to its simplicity, our method is also generally applicable to different image pre-trained models, and thus has the potential to leverage more powerful image foundation models in the future. The project webpage is https://adapt-image-models.github.io/.

1. INTRODUCTION

The "pre-training then finetuning" paradigm has played an important role in computer vision. The key to this paradigm is a well pre-trained image model, which can provide strong transferability to downstream tasks through finetuning. Recently, large foundation models (Radford et al., 2021; Yuan et al., 2021b; Tong et al., 2022; Jia et al., 2021; Wang et al., 2022b) can even demonstrate remarkable few-/zero-shot performance given their learned superior visual representations. In video understanding, a common practice is also bootstrapping from an image pre-trained model and then finetuning on the video data. There are two dominating directions as shown in Fig. 1 , one is to extend an image model with additional temporal module (Lin et al., 2019; Zhu et al., 2019; Arnab et al., 2021) , the other is to inflate an image model to a video model (Carreira & Zisserman, 2017; Liu et al., 2022) . However, there exists at least two drawbacks for the aforementioned methods. First, most approaches require full finetuning (i.e., updating all the model parameters during training) to achieve promising results on common video benchmarks. This is quite costly in terms of both computation and memory footprint, e.g., 1200 Tesla V100 GPU hours to train Liu et al. (2022) . Second, it also remains questionable that whether it is necessary to fully finetune pre-trained image models given that they have demonstrated excellent transferability. An inadequate finetuning on downstream data might destroy the well generalized representations from such foundation models. To overcome the drawbacks, a research direction termed parameter-efficient transfer learning has been trending in natural language processing (NLP) (Houlsby et al., 2019; Lester et al., 2021; Ben Zaken et al., 2022; Hu et al., 2022) . The goal is to only finetune a small number of (extra) parameters while keeping large pre-trained language models (Devlin et al., 2018; Brown et al., 2020) frozen to attain strong performance. 
With the rise of large vision transformer (ViT) models, such techniques have recently been introduced to computer vision for efficient transfer learning. However, existing works either focus on tuning a pre-trained image model for image tasks (image-to-image) (Bahng et al., 2022; Jie & Deng, 2022; Jia et al., 2022), or on tuning a pre-trained video model for video tasks (video-to-video).

In this work, we introduce a new way to Adapt pre-trained Image transformer Models (AIM) for efficient video action recognition. By freezing the pre-trained image model and adding a few lightweight adapters (Houlsby et al., 2019) during finetuning, we show that our proposed AIM can achieve competitive or even better results than previous state-of-the-art methods with substantially fewer tunable parameters (Fig. 1, right). To be specific, we first introduce an adapter after the self-attention layer in each transformer block to perform spatial adaptation, and show that a well pre-trained image model is already sufficient for spatial modeling in video understanding. Then, for temporal modeling, we simply reuse the image pre-trained self-attention layer but apply it along the temporal dimension of the video input, forcing it to model relationships across different frames; an adapter is also appended for temporal adaptation. Finally, we perform joint adaptation by adding another adapter in parallel to the MLP layer of each transformer block.

To summarize, we make the following contributions:
1. We propose a new way to adapt pre-trained image transformer models for efficient video understanding. Our method is generally applicable to different image pre-trained models, simple to implement, and cost-effective to train.
2. Our method is significantly more efficient than fully finetuning a video model, e.g., on a Swin-B backbone, we reduce the memory footprint by 50% and training time by 42% compared to VideoSwin (Liu et al., 2022).
3. AIM achieves comparable or higher performance than previous fully finetuned state-of-the-art methods on four video action recognition benchmarks, e.g., 87.5% on K400 with 38M tunable parameters.
4. Our method also brings data efficiency, e.g., AIM outperforms its counterpart TimeSformer (Bertasius et al., 2021) by 9% absolute accuracy when using only 1% of the training data.
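The three adaptation steps above can be sketched in PyTorch. This is a minimal, illustrative implementation, not the released code: the class names (`Adapter`, `AIMBlock`), the bottleneck ratio, and details such as adapter skip connections and scaling are assumptions made for clarity. The key ideas it demonstrates are (i) the pre-trained block weights are frozen and only the adapters train, (ii) the same frozen attention layer is reused along the temporal axis, and (iii) the joint adapter runs in parallel to the frozen MLP.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> GELU -> up-project.

    The up-projection is zero-initialized so that every adapter is a
    no-op at the start of finetuning (the block behaves like the
    original pre-trained image model)."""

    def __init__(self, dim, ratio=0.25):
        super().__init__()
        hidden = max(1, int(dim * ratio))
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return self.up(self.act(self.down(x)))


class AIMBlock(nn.Module):
    """One frozen ViT block with spatial, temporal and joint adapters."""

    def __init__(self, dim, heads):
        super().__init__()
        # Pre-trained (frozen) components of a standard ViT block.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.parameters():
            p.requires_grad = False
        # Lightweight adapters: the only trainable parameters.
        self.t_adapter = Adapter(dim)
        self.s_adapter = Adapter(dim)
        self.j_adapter = Adapter(dim)

    def forward(self, x, T):
        # x: (B*T, N, D) -- N spatial tokens per frame, T frames per clip.
        BT, N, D = x.shape
        B = BT // T
        # Temporal adaptation: reuse the frozen attention along the time
        # axis (tokens at the same spatial location across frames).
        xt = x.reshape(B, T, N, D).transpose(1, 2).reshape(B * N, T, D)
        h = self.norm1(xt)
        a = self.attn(h, h, h, need_weights=False)[0]
        xt = xt + self.t_adapter(a)  # zero at init: starts spatial-only
        x = xt.reshape(B, N, T, D).transpose(1, 2).reshape(BT, N, D)
        # Spatial adaptation: ordinary attention over tokens in a frame.
        h = self.norm1(x)
        a = self.attn(h, h, h, need_weights=False)[0]
        x = x + a + self.s_adapter(a)
        # Joint adaptation: adapter in parallel with the frozen MLP.
        h = self.norm2(x)
        return x + self.mlp(h) + self.j_adapter(h)
```

In this sketch, the trainable adapters account for roughly a tenth of the block's parameters, which is where the efficiency over full finetuning comes from; the exact fraction in AIM depends on the backbone and bottleneck ratio.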

2. RELATED WORK

Image pre-trained models. ViT (Dosovitskiy et al., 2020) and its variants (Liu et al., 2021; Wang et al., 2021b; Yuan et al., 2021a; Dong et al., 2022) have been proposed to achieve state-of-the-art performance on image recognition. Once trained, these models can also serve as good initializations for transfer learning to downstream tasks. In terms of training techniques, they are commonly trained on large-scale labeled datasets (Deng et al., 2009; Sun et al., 2017; Zhai et al., 2022) in a supervised manner. To alleviate the labeling cost, self-supervised learning methods (Chen et al., 2021; Bao et al., 2021; Zhou et al., 2021; He et al., 2022b; Xie et al., 2022) have been introduced to learn effective representations from unlabeled data. Recent works (Radford et al., 2021; Jia et al., 2021; Yuan et al., 2021b; Wang et al., 2022b) adopt large-scale multimodal data (e.g., image-text pairs) for model training, which leads to even more powerful visual representations. In this work, thanks to the simplicity of our proposed method, we can take advantage of these well pre-trained image models and adapt them efficiently to solve video tasks.



Figure 1: Left: Pipeline comparison between traditional full finetuning and our efficient finetuning. Right: Performance comparison on the K400 dataset (Kay et al., 2017). Bubble size indicates GFLOPs at inference time. Our proposed AIM achieves the highest accuracy while enjoying significantly fewer tunable parameters and GFLOPs.

