AIM: ADAPTING IMAGE MODELS FOR EFFICIENT VIDEO ACTION RECOGNITION

Abstract

Recent vision-transformer-based video models mostly follow the "image pre-training then finetuning" paradigm and have achieved great success on multiple video benchmarks. However, fully finetuning such a video model can be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM can achieve competitive or even better performance than prior art with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks to its simplicity, our method is also generally applicable to different image pre-trained models, which gives it the potential to leverage more powerful image foundation models in the future. The project webpage is https://adapt-image-models.github.io/.

1. INTRODUCTION

The "pre-training then finetuning" paradigm has played an important role in computer vision. The key to this paradigm is a well pre-trained image model, which can provide strong transferability to downstream tasks through finetuning. Recently, large foundation models (Radford et al., 2021; Yuan et al., 2021b; Tong et al., 2022; Jia et al., 2021; Wang et al., 2022b) have even demonstrated remarkable few-/zero-shot performance thanks to their superior learned visual representations. In video understanding, a common practice is likewise to bootstrap from an image pre-trained model and then finetune on the video data. There are two dominant directions, as shown in Fig. 1: one is to extend an image model with an additional temporal module (Lin et al., 2019; Zhu et al., 2019; Arnab et al., 2021), and the other is to inflate an image model into a video model (Carreira & Zisserman, 2017; Liu et al., 2022). However, there exist at least two drawbacks to the aforementioned methods. First, most approaches require full finetuning (i.e., updating all the model parameters during training) to achieve promising results on common video benchmarks. This is quite costly in terms of both computation and memory footprint, e.g., 1200 Tesla V100 GPU hours to train the model of Liu et al. (2022). Second, it remains questionable whether it is necessary to fully finetune pre-trained image models, given that they have demonstrated excellent transferability. Inadequate finetuning on downstream data might destroy the well-generalized representations of such foundation models. To overcome these drawbacks, a research direction termed parameter-efficient transfer learning has been trending in natural language processing (NLP) (Houlsby et al., 2019; Lester et al., 2021; Ben Zaken et al., 2022; Hu et al., 2022). The goal is to attain strong performance by finetuning only a small number of (extra) parameters while keeping large pre-trained language models (Devlin et al., 2018; Brown et al., 2020) frozen.
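To make the idea of tuning only a few extra parameters concrete, the following is a minimal sketch of a residual bottleneck adapter in the spirit of Houlsby et al. (2019), written in plain Python. It is illustrative only and not necessarily the adapter design used in this work: the class name `BottleneckAdapter`, the helper `tunable_fraction`, the GELU nonlinearity, the zero-initialized up-projection, and all dimensions are assumptions made for this example.

```python
import math
import random
from typing import List


def gelu(x: float) -> float:
    """Exact GELU via the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


class BottleneckAdapter:
    """Hypothetical adapter: y = x + W_up @ gelu(W_down @ x + b_down) + b_up."""

    def __init__(self, dim: int, bottleneck: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Down-projection (dim -> bottleneck) with small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Up-projection (bottleneck -> dim) starts at zero (an assumption),
        # so the adapter is an identity map before any training.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def num_params(self) -> int:
        """Count the adapter's tunable weights and biases."""
        dim, r = len(self.w_up), len(self.w_down)
        return 2 * dim * r + r + dim

    def forward(self, x: List[float]) -> List[float]:
        hidden = [gelu(sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w_down, self.b_down)]
        delta = [sum(w * h for w, h in zip(row, hidden)) + b
                 for row, b in zip(self.w_up, self.b_up)]
        # Residual connection: the frozen layer's output passes through unchanged.
        return [v + d for v, d in zip(x, delta)]


def tunable_fraction(frozen_params: int, adapter_params: int) -> float:
    """Share of all parameters that would be updated during finetuning."""
    return adapter_params / (frozen_params + adapter_params)
```

In this sketch, only the adapter's `2 * dim * bottleneck + bottleneck + dim` parameters would be updated, while the weights of the surrounding pre-trained layer stay frozen; choosing a bottleneck much smaller than `dim` keeps the tunable fraction small.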
With the rise of large vision transformer (ViT) models, such techniques have been recently introduced to computer vision for efficient transfer learning. However, existing works either focus on tuning a pre-trained image model for image tasks (image-to-image) (Bahng et al., 2022; Jie & Deng, 2022; Jia et al., 2022), or tuning a pre-trained video model for video

* Work done during an internship at Amazon Web Services.

