UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH VIDEO UNIFORMER

Abstract

Learning discriminative spatiotemporal representations is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependencies with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block, but it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, and untrimmed ActivityNet and HACS. In particular, to our best knowledge, it is the first model to achieve 90% top-1 accuracy on Kinetics-400. The models will be released afterward.

1. INTRODUCTION

Spatiotemporal representation learning is a fundamental task in video understanding. Recently, Vision Transformers (ViTs) have achieved remarkable successes in the image domain (Dosovitskiy et al., 2021; Wang et al., 2021b; Liu et al., 2021; Li et al., 2022a). Therefore, researchers have made great efforts to transfer image-based ViTs to video modeling (Bertasius et al., 2021; Arnab et al., 2021; Yan et al., 2022), by extending Multi-Head Self-Attention (MHSA) along the temporal dimension. However, the spatiotemporal attention mechanism in these approaches mainly focuses on capturing global video dependency, while lacking the capacity to tackle local video redundancy. As a result, these models bear a large computational burden when encoding local video representations in the shallow layers, leading to an unsatisfactory accuracy-efficiency balance in spatiotemporal learning. To tackle these problems, researchers introduced the concise UniFormer (Li et al., 2022a), which unifies convolution and self-attention as a Multi-Head Relation Aggregator (MHRA) in a transformer fashion. By modeling local and global relations in shallow and deep layers respectively, it can not only learn discriminative spatiotemporal representations but also largely reduce the computational burden. However, as a new architecture for video modeling, UniFormer does not have any image-based pretraining as a start. To obtain a robust visual representation, it has to go through a tedious supervised pretraining phase, learning from images from scratch, before finetuning on videos. Alternatively, we notice that there are various open-sourced image ViTs (Wightman, 2019; Touvron et al., 2021), which have been well-pretrained on huge web datasets under rich supervision, such as image-text contrastive learning (Radford et al., 2021) and masked image modeling (He et al., 2022; Bao et al., 2021).
These models exhibit great generalization capacity on a range of vision tasks (Luo et al., 2022; Chen et al., 2022; Shen et al., 2021). Hence, we are motivated by a natural question: can we integrate the advantages of both ViTs and UniFormer for video modeling?

In this paper, we propose a generic paradigm to construct a powerful family of video networks, by arming image-pretrained ViTs with the efficient video designs of UniFormer. We call the resulting model UniFormerV2 (Fig. 1), since it inherits the concise style of UniFormer but equips local and global UniBlocks with new MHRA. In the local UniBlock, we flexibly insert a local temporal MHRA before the spatial ViT block. In this way, we can largely reduce temporal redundancy as well as leverage the well-pretrained ViT block, for learning local spatiotemporal representations effectively. In the global UniBlock, we introduce a query-based cross MHRA. Unlike the costly global MHRA in the original UniFormer, our cross MHRA can summarize all the spatiotemporal tokens into a single video token, for learning global spatiotemporal representations efficiently. Finally, we re-organize the local and global UniBlocks into a multi-stage fusion architecture. It can adaptively integrate multi-scale spatiotemporal representations to capture complex dynamics in videos.

Figure 1: Comparison of video modeling paradigms. UniFormerV1 requires costly image pretraining, while directly inserting temporal MHSA into ViTs struggles for an accuracy-FLOPs balance. UniFormerV2 can effectively and efficiently arm well-pretrained ViTs with concise UniFormer designs, thus integrating advantages from both models for spatiotemporal representation learning. To our best knowledge, it is the first model that achieves 90.0% top-1 accuracy on Kinetics-400.

We deploy our paradigm on ViTs pretrained with three popular types of supervision, including supervised learning, contrastive learning, and masked image modeling. All the enhanced models achieve great performance on video classification, showing the generic property of our UniFormerV2. Moreover, we develop a compact Kinetics-710 benchmark, where we integrate the action categories of Kinetics-400/600/700 and remove the repeated and/or leaked videos from the training sets of these benchmarks for fairness (i.e., the total number of training videos is reduced from 1.14M to 0.66M). After training on K710, our models can achieve higher accuracy on K400/600/700 via only 5-epoch finetuning. Finally, extensive experiments show that our UniFormerV2 achieves state-of-the-art performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 (Carreira & Zisserman, 2017; Carreira et al., 2018; 2019) and Moments in Time, temporal-related Something-Something V1/V2, and untrimmed ActivityNet (Heilbron et al., 2015) and HACS (Zhao et al., 2019). To our best knowledge, it is the first model to achieve 90.0% top-1 accuracy on Kinetics-400.
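The local and global UniBlock designs described above can be sketched in a few lines. The following is a minimal single-head numpy illustration of the two relation aggregators, not the released implementation: the learnable local temporal affinity is replaced by fixed neighborhood averaging, the projections are plain matrices, and the function names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_temporal_mhra(x, kernel=3):
    """Local temporal relation aggregation, inserted before a spatial ViT block.
    Each token is updated from its temporal neighbors only (here: fixed
    averaging over a small window, standing in for a learnable local affinity).
    x: (T, N, C) -- T frames, N spatial tokens, C channels."""
    T, N, C = x.shape
    pad = kernel // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = xp[t:t + kernel].mean(axis=0)  # aggregate a local temporal window
    return x + out  # residual connection, in the UniFormer block style

def global_cross_mhra(x, q, Wk, Wv):
    """Query-based cross relation aggregation: one learnable query attends to
    all spatiotemporal tokens and summarizes them into a single video token,
    avoiding quadratic token-to-token attention.
    x: (T, N, C); q: (C,) learnable query; Wk, Wv: (C, C) projections."""
    T, N, C = x.shape
    tokens = x.reshape(T * N, C)
    attn = softmax((tokens @ Wk) @ q / np.sqrt(C))  # (T*N,) affinity to the query
    return attn @ (tokens @ Wv)                     # (C,) pooled video token

# toy usage: 4 frames, 8 spatial tokens, 16 channels
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 16))
y = local_temporal_mhra(x)                          # same shape, temporally smoothed
video_token = global_cross_mhra(x, rng.standard_normal(16), np.eye(16), np.eye(16))
```

In the full model, the output of the local temporal MHRA feeds into the pretrained spatial ViT block, and the video tokens produced by global UniBlocks at several stages are fused for classification; none of that multi-stage wiring is shown here.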
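The Kinetics-710 construction described above boils down to a union of the three label spaces plus a two-way filter over training videos: drop any clip that leaks into a validation set, and keep only one copy of clips repeated across dataset versions. A minimal sketch under assumed data structures (splits as plain dicts and sets keyed by video id; the helper name is ours, not from the released code):

```python
def build_kinetics710(train_splits, val_splits):
    """Merge Kinetics-400/600/700 training sets into one benchmark.

    train_splits: dict of dataset name -> {video_id: label}
    val_splits:   dict of dataset name -> set of video_id

    A training video is dropped if it appears in ANY validation set
    (leakage), and videos repeated across dataset versions are kept once.
    """
    leaked = set().union(*val_splits.values()) if val_splits else set()
    merged = {}
    for name, split in train_splits.items():
        for vid, label in split.items():
            if vid in leaked:
                continue  # leaked into a validation set: unfair to train on
            if vid in merged:
                continue  # duplicate across K400/600/700: keep the first copy
            merged[vid] = label
    return merged

# toy example: "b" is repeated across versions, "e" leaks into a val set
train = {"k400": {"a": 0, "b": 1}, "k600": {"b": 2, "c": 3}, "k700": {"d": 4, "e": 5}}
val = {"k400": {"e"}, "k600": set()}
k710 = build_kinetics710(train, val)
```

The real pipeline additionally has to map the merged categories into the unified 710-class label space and handle clip-level id differences between versions, which this sketch omits.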

