UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH VIDEO UNIFORMER

Abstract

Learning discriminative spatiotemporal representations is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. But it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to our best knowledge. The models will be released afterward.

1. INTRODUCTION

Spatiotemporal representation learning is a fundamental task in video understanding. Recently, Vision Transformers (ViTs) have achieved remarkable success in the image domain (Dosovitskiy et al., 2021; Wang et al., 2021b; Liu et al., 2021; Li et al., 2022a). Therefore, researchers have made great efforts to transfer image-based ViTs to video modeling (Bertasius et al., 2021; Arnab et al., 2021; Yan et al., 2022), by extending Multi-Head Self-Attention (MHSA) along the temporal dimension. However, the spatiotemporal attention in these approaches mainly focuses on capturing global video dependency, while lacking the capacity to tackle local video redundancy. As a result, these models bear a large computational burden to encode local video representations in the shallow layers, leading to an unsatisfactory accuracy-efficiency balance in spatiotemporal learning.

To tackle these problems, researchers introduced the concise UniFormer (Li et al., 2022a), which unifies convolution and self-attention as a Multi-Head Relation Aggregator (MHRA) in a transformer fashion. By modeling local and global relations respectively in shallow and deep layers, it can not only learn discriminative spatiotemporal representations but also largely reduce the computational burden. However, as a new architecture for video modeling, UniFormer has no image-pretrained weights to start from. To obtain a robust visual representation, it has to go through a tedious supervised pretraining phase, learning from images from scratch before finetuning on videos. Alternatively, we notice that various open-sourced image ViTs (Wightman, 2019; Touvron et al., 2021) have been well-pretrained on huge web datasets under rich supervision, such as image-text contrastive learning (Radford et al., 2021) and masked image modeling (He et al., 2022; Bao et al., 2021).
These models exhibit great generalization capacity on a range of vision tasks (Luo et al., 2022; Chen et al., 2022; Shen et al., 2021). Hence, we are motivated by a natural question: can we integrate advantages from both ViTs and UniFormer for video modeling?

In this paper, we propose a generic paradigm to construct a powerful family of video networks, by arming the image-pretrained ViTs with efficient video designs of UniFormer. We call the resulting model UniFormerV2 (Fig. 1), since it inherits the concise style of UniFormer but equips local and global UniBlocks with new MHRA. In the local UniBlock, we flexibly insert a local temporal MHRA before the spatial ViT block. In this case, we can largely reduce temporal redundancy as well as leverage the well-pretrained ViT block, for learning local spatiotemporal representations effectively. In the global UniBlock, we introduce a query-based cross MHRA. Unlike the costly global MHRA in the original UniFormer, our cross MHRA can summarize all the spatiotemporal tokens into a single video token, for learning global spatiotemporal representations efficiently. Finally, we re-organize local and global UniBlocks into a multi-stage fusion architecture, which can adaptively integrate multi-scale spatiotemporal representations to capture complex dynamics in videos. We deploy our paradigm on ViTs pretrained with three popular types of supervision, including supervised learning, contrastive learning, and masked image modeling. All the enhanced models achieve strong performance on video classification, showing the generic property of our UniFormerV2.
Moreover, we develop a compact Kinetics-710 benchmark, where we integrate the action categories of Kinetics-400/600/700, and remove the repeated and/or leaked videos from the training sets of these benchmarks for fairness (i.e., the total number of training videos is reduced from 1.14M to 0.66M). After training on K710, our models achieve higher accuracy on K400/600/700 via only 5-epoch finetuning. Finally, extensive experiments show that our UniFormerV2 achieves state-of-the-art performance on 8 popular video benchmarks, including scene-related datasets (i.e., Kinetics-400/600/700 (Carreira & Zisserman, 2017; Carreira et al., 2018; 2019) and Moments in Time (Monfort et al., 2020)), temporal-related datasets (i.e., Something-Something V1/V2 (Goyal et al., 2017b)), and untrimmed datasets (i.e., ActivityNet (Heilbron et al., 2015) and HACS (Zhao et al., 2019)). To our best knowledge, it is the first model to achieve 90.0% top-1 accuracy on Kinetics-400.

2. RELATED WORK

Vision Transformer. Following the Transformer in NLP (Vaswani et al., 2017), Vision Transformer (ViT) (Dosovitskiy et al., 2021) has been successfully applied to a wide range of vision tasks, including action recognition (Bertasius et al., 2021; Arnab et al., 2021), temporal localization (Zhang et al., 2022) and multimodality learning (Radford et al., 2021; Wang et al., 2022). To make ViT more efficient and effective, researchers introduce scale and locality modeling in different ways, such as multi-scale architectures (Wang et al., 2021b; Fan et al., 2021), local windows (Liu et al., 2021), early convolutional embedding (Xiao et al., 2021; Yuan et al., 2021a) and convolutional position encoding (Chu et al., 2021; Dong et al., 2022). Alternatively, UniFormer (Li et al., 2022a) unifies convolution and self-attention as a relation aggregator in a transformer manner, thus reducing large local redundancy.

Video Learning. 3D Convolutional Neural Networks (CNNs) once played a dominant role in video understanding (Tran et al., 2015; Carreira & Zisserman, 2017). Due to the difficult optimization of 3D CNNs, great efforts have been made to factorize 3D convolution along the spatiotemporal dimensions (Tran et al., 2018; Qiu et al., 2017; Feichtenhofer et al., 2019) or the channel dimension (Tran et al., 2019; Feichtenhofer, 2020; Kondratyuk et al., 2021). However, the local receptive field limits the ability of 3D convolution to capture long-range dependency. Global attention motivates researchers to transfer image-pretrained ViTs to video tasks (Bertasius et al., 2021; Neimark et al., 2021; Zhang et al., 2021b; Arnab et al., 2021; Bulat et al., 2021; Patrick et al., 2021). To make video transformers more efficient, prior works introduce hierarchical structures with pooling self-attention (Fan et al., 2021), local self-attention (Liu et al., 2022) or unified attention (Li et al., 2022a). Though these novel models are adept at temporal modeling, they rely on tiresome image pretraining.
In contrast, various well-pretrained ViTs with rich supervision are open-sourced (Wightman, 2019) . In this paper, we aim to extend efficient UniFormer designs to ViT, arming it as a strong video learner.

3. METHOD

Overall Framework. We propose to arm an image ViT with the video designs of UniFormer (Li et al., 2022a), leading to UniFormerV2. On one hand, the spatial interactions in a well-pretrained ViT can be fully leveraged and preserved to enhance spatial modeling. On the other hand, the hierarchical temporal interactions in the efficient UniFormer can be flexibly adopted to enhance temporal modeling. The overall architecture is shown in Fig. 2. It first projects input videos into tokens, then conducts local and global modeling with the corresponding UniBlocks. Finally, a multi-stage fusion block adaptively integrates the global tokens of different stages to further enhance the video representation. Specifically, we first use a 3D convolution (i.e., kernel size 3×16×16) to project the input video into L spatiotemporal tokens X_in ∈ R^{L×C}, where L = T×H×W (T, H, and W denote the temporal, height, and width dimensions of the token grid). Following the original ViT design (Dosovitskiy et al., 2021), we perform spatial downsampling by a factor of 16. For better temporal modeling, we conduct temporal downsampling by a factor of 2.

Next, we construct the local and global UniBlocks. For the local block, we reformulate the image-pretrained ViT block by inserting a local temporal MHRA (Li et al., 2022a) before it. In this case, we can effectively leverage the robust spatial representation of ViT as well as efficiently reduce local temporal redundancy. Moreover, we introduce a global UniBlock on top of each local UniBlock, which can capture full spatiotemporal dependency. For computational efficiency, we design a query-based cross MHRA to aggregate all the spatiotemporal tokens into a global video token. These tokens, carrying different levels of global semantics from multiple stages, are further fused into a discriminative video representation.
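As a rough sketch of the token layout described above, the following hypothetical helper computes the token grid for the stated strides (temporal stride 2, spatial stride 16); the function name and defaults are illustrative, not from the paper:

```python
def token_grid(frames, height, width, t_stride=2, s_stride=16):
    """Compute the spatiotemporal token grid (T, H, W) and L = T*H*W
    produced by the 3x16x16 projection with the paper's stated
    downsampling factors. Hypothetical helper for illustration."""
    T = frames // t_stride      # temporal downsampling by 2
    H = height // s_stride      # spatial downsampling by 16
    W = width // s_stride
    return T, H, W, T * H * W

# A 16-frame 224x224 clip yields an 8x14x14 grid, i.e. L = 1568 tokens.
T, H, W, L = token_grid(16, 224, 224)
```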

3.1. LOCAL UNIBLOCK

To efficiently model temporal dependency upon the well-learned spatial representation, we propose a new local UniBlock, by inserting a local temporal MHRA before the standard ViT block:

$$\mathbf{X}^{T} = \mathrm{LT\_MHRA}(\mathrm{Norm}(\mathbf{X}^{in})) + \mathbf{X}^{in}, \quad(1)$$
$$\mathbf{X}^{S} = \mathrm{GS\_MHRA}(\mathrm{Norm}(\mathbf{X}^{T})) + \mathbf{X}^{T}, \quad(2)$$
$$\mathbf{X}^{L} = \mathrm{FFN}(\mathrm{Norm}(\mathbf{X}^{S})) + \mathbf{X}^{S}. \quad(3)$$

LT_MHRA and GS_MHRA refer to MHRA with local temporal affinity and global spatial affinity, respectively. FFN consists of two linear projections separated by GeLU (Hendrycks & Gimpel, 2016). Additionally, following the normalization in UniFormer (Li et al., 2022a), we adopt Batch Norm (BN) (Ioffe & Szegedy, 2015) before LT_MHRA, and Layer Norm before GS_MHRA and FFN. The MHRA aggregates token relations in a multi-head fashion:

$$\mathrm{R}_n(\mathbf{X}) = \mathrm{A}_n \mathrm{V}_n(\mathbf{X}), \quad(4)$$
$$\mathrm{MHRA}(\mathbf{X}) = \mathrm{Concat}(\mathrm{R}_1(\mathbf{X}); \mathrm{R}_2(\mathbf{X}); \cdots; \mathrm{R}_N(\mathbf{X}))\,\mathbf{U}, \quad(5)$$

where R_n(·) refers to the relation aggregator in the n-th head, A_n is an affinity matrix that describes token relations, V_n(·) is a linear projection, and U ∈ R^{C×C} is a learnable fusion matrix. For our local UniBlock, we insert LT_MHRA to reduce local temporal redundancy, which shares a similar design insight with the original UniFormer (Li et al., 2022a). Hence, the affinity in LT_MHRA is local, with a learnable parameter matrix a_n ∈ R^{t×1×1} over the temporal tube t×1×1:

$$\mathrm{A}^{LT}_n(\mathbf{X}_i, \mathbf{X}_j) = a_n^{\,i-j}, \;\; \text{where} \;\; j \in \Omega^{t\times 1\times 1}_i, \quad(6)$$

which allows us to efficiently learn the local temporal relation between one token X_i and the other tokens X_j in the tube. Alternatively, GS_MHRA belongs to the original ViT block. Therefore, the affinity in GS_MHRA is a global spatial self-attention within the single frame 1×H×W:

$$\mathrm{A}^{GS}_n(\mathbf{X}_i, \mathbf{X}_j) = \frac{\exp\{Q_n(\mathbf{X}_i)^{T} K_n(\mathbf{X}_j)\}}{\sum_{j' \in \Omega_{1\times H\times W}} \exp\{Q_n(\mathbf{X}_i)^{T} K_n(\mathbf{X}_{j'})\}}, \quad(7)$$

where Q_n(·) and K_n(·) ∈ R^{L×C/N} are different linear projections in the n-th head. Discussion. (I) Note that the spatiotemporal affinity in our local UniBlock is decomposed into a local temporal one A^{LT}_n in Eq. (6) and a global spatial one A^{GS}_n in Eq. (7).
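To make the softmax-normalized affinity in GS_MHRA (Eq. 7) concrete, here is a minimal pure-Python sketch over toy token dimensions; the learned projections Q_n and K_n are assumed to be identity here, which is a simplification, not the paper's implementation:

```python
import math

def gs_affinity(q_tokens, k_tokens):
    """Global spatial affinity: for each token, a softmax over its dot
    products with every token of the same frame (Eq. 7, with the
    Q_n/K_n projections omitted for illustration)."""
    rows = []
    for q in q_tokens:
        logits = [sum(qc * kc for qc, kc in zip(q, k)) for k in k_tokens]
        m = max(logits)                       # subtract max for stability
        exps = [math.exp(v - m) for v in logits]
        s = sum(exps)
        rows.append([e / s for e in exps])
    return rows

# Two toy 2-dim tokens: each affinity row is a distribution over tokens.
A = gs_affinity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Each row sums to one, and a token assigns the largest weight to the key most aligned with its query, which is exactly the normalization expressed by the denominator of Eq. (7).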
In this case, we can not only leverage the efficient video-processing design of UniFormer but also inherit the effective image pretraining of ViT. (II) Alternatively, the local affinity in the original UniFormer (Li et al., 2022a) is jointly spatiotemporal, so it cannot directly reuse image-pretrained ViT weights. (III) Compared with divided spatiotemporal attention (Bertasius et al., 2021), we use local affinity for temporal characterization, largely reducing the computation burden by tackling temporal redundancy in the UniFormer style.

3.2. GLOBAL UNIBLOCK

To explicitly conduct long-range dependency modeling at the spatiotemporal scale, we introduce a global UniBlock in our UniFormerV2. Specifically, this global UniBlock consists of three basic components, namely DPE, MHRA and FFN:

$$\mathbf{X}^{C} = \mathrm{DPE}(\mathbf{X}^{L}) + \mathbf{X}^{L}, \quad(8)$$
$$\mathbf{X}^{ST} = \mathrm{C\_MHRA}(\mathrm{Norm}(\mathbf{q}), \mathrm{Norm}(\mathbf{X}^{C})), \quad(9)$$
$$\mathbf{X}^{G} = \mathrm{FFN}(\mathrm{Norm}(\mathbf{X}^{ST})) + \mathbf{X}^{ST}. \quad(10)$$

The DPE is instantiated as a depth-wise spatiotemporal convolution (Li et al., 2022a). We design the global C_MHRA in a cross-attention style to efficiently construct a video representation:

$$\mathrm{R}^{C}_n(\mathbf{q}, \mathbf{X}) = \mathrm{A}^{C}_n(\mathbf{q}, \mathbf{X})\, \mathrm{V}_n(\mathbf{X}), \quad(11)$$
$$\mathrm{C\_MHRA}(\mathbf{q}, \mathbf{X}) = \mathrm{Concat}(\mathrm{R}^{C}_1(\mathbf{q}, \mathbf{X}); \mathrm{R}^{C}_2(\mathbf{q}, \mathbf{X}); \cdots; \mathrm{R}^{C}_N(\mathbf{q}, \mathbf{X}))\,\mathbf{U}. \quad(12)$$

R^C_n(q, ·) is the cross relation aggregator, which converts a learnable query q ∈ R^{1×C} into a video representation by modeling the dependency between this query q and all the spatiotemporal tokens X. First, it computes the cross affinity matrix A^C_n(q, X) to learn the relation between q and X:

$$\mathrm{A}^{C}_n(\mathbf{q}, \mathbf{X}_j) = \frac{\exp\{Q_n(\mathbf{q})^{T} K_n(\mathbf{X}_j)\}}{\sum_{j' \in \Omega_{T\times H\times W}} \exp\{Q_n(\mathbf{q})^{T} K_n(\mathbf{X}_{j'})\}}. \quad(13)$$

Then, it uses a linear projection to transform X into the spatiotemporal context V_n(X). Subsequently, it aggregates this context V_n(X) into the learnable query, under the guidance of their affinity A^C_n(q, X). Finally, the enhanced query tokens from all the heads are fused into a final video representation by the linear projection U ∈ R^{C×C}. Note that the query token is zero-initialized for stable training. Discussion. We further discuss the distinct designs of our global UniBlock, compared to the one in the original UniFormer (Li et al., 2022a).
(I) We add the global UniBlock on top of the local UniBlock, extracting multi-scale spatiotemporal representations in token form. Such a design helps strengthen the discriminative video representation without compromising the pretrained architecture. (II) The typical global spatiotemporal attention is computationally heavy due to its quadratic complexity. To pursue a better accuracy-computation balance, we introduce a cross-attention style of global MHRA in UniFormerV2, largely reducing the computational complexity from O(L²) to O(L), where L is the number of tokens. More importantly, since the query q is learnable, it can adaptively integrate the spatiotemporal context from all L tokens to boost video recognition. (III) The global UniBlock inherits the DPE design from UniFormer, which we also find helpful in Table 9c.
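The O(L) behavior of the query-based cross aggregation can be sketched in pure Python: a single query attends to all L tokens, so the affinity is one row of length L rather than an L×L matrix. Toy dimensions; the projections V_n, Q_n, K_n are assumed to be identity here, which is a simplification:

```python
import math

def cross_aggregate(query, tokens):
    """One query vs. L tokens: compute a length-L softmax affinity
    (O(L) work), then return the affinity-weighted sum of the tokens,
    mimicking C_MHRA with identity projections."""
    logits = [sum(qc * tc for qc, tc in zip(query, t)) for t in tokens]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    attn = [e / s for e in exps]              # 1 x L affinity row
    dim = len(query)
    return [sum(a * t[d] for a, t in zip(attn, tokens)) for d in range(dim)]

# With a zero query (mirroring the zero-initialized query token),
# every token receives equal weight, so the output is the token mean.
out = cross_aggregate([0.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```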

3.3. MULTI-STAGE FUSION BLOCK

We propose a multi-stage fusion block to integrate all the video tokens from the global UniBlocks, as shown in Fig. 3. For simplicity, we denote the i-th global block as X^G_i = G_i(q_i, X^L_i). Given the tokens X^L_i from the local UniBlock, the global block transforms the learnable query q_i into a video token X^G_i. In this paper, we explore four strategies to fuse the video tokens from all the global blocks {X^G_i}^N_{i=1} into a final video representation F, and employ the sequential way regarding its efficacy and efficiency. The studied fusion methods are given below. (a) Sequential: we sequentially use the video token from the previous global block X^G_{i-1} as the query token of the current global block, i.e., X^G_i = G_i(X^G_{i-1}, X^L_i). (b) Parallel: we concatenate all the global tokens {X^G_i}^N_{i=1} in parallel, and use a linear projection U^F ∈ R^{N×C} to obtain the final token, i.e., F = Concat(X^G_1, ..., X^G_N)U^F. (c) Hierarchical KV: we use the video token from the previous global block X^G_{i-1} as a part of the contextual tokens of the current global block, i.e., X^G_i = G_i(q_i, [X^G_{i-1}, X^L_i]). (d) Hierarchical Q: we use the video token from the previous global block X^G_{i-1} as a part of the query tokens of the current global block, i.e., X^G_i = G_i([X^G_{i-1}, q_i], X^L_i). Finally, we dynamically integrate the final tokens from both the local and global blocks, which effectively promotes recognition performance in our empirical studies (Table 12). Specifically, we extract the class token F^C from the final local UniBlock, and add it to the video token F by a weighted sum, i.e., Z = αF + (1-α)F^C, where α is a learnable parameter processed by the Sigmoid function.
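The sequential fusion (a) and the final gated combination can be sketched as follows in pure Python; `global_block` callables stand in for the G_i and are assumptions for illustration, as are the toy vector shapes:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sequential_fusion(global_blocks, q0, local_feats):
    """Strategy (a): the video token of block i-1 becomes the query of
    block i; the last output is the final video representation F."""
    q = q0
    for G, x_local in zip(global_blocks, local_feats):
        q = G(q, x_local)
    return q

def gated_sum(F, F_C, alpha_param):
    """Z = a*F + (1-a)*F_C, with a = sigmoid(alpha_param) a learnable scalar."""
    a = sigmoid(alpha_param)
    return [a * f + (1.0 - a) * fc for f, fc in zip(F, F_C)]

# With alpha_param = 0 the gate is 0.5, so Z is the average of F and F_C.
Z = gated_sum([2.0, 4.0], [0.0, 0.0], 0.0)
```

A trained model would learn `alpha_param`, letting the gate lean toward the global video token or the local class token per dataset.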

4. EXPERIMENTS

Datasets. To verify the learning capacity of our UniFormerV2, we conduct experiments on 8 popular video benchmarks, including trimmed videos shorter than 10 seconds and untrimmed videos longer than 1 minute. For the trimmed video benchmarks, we divide them into two categories. (a) Scene-related datasets: the Kinetics family (Kay et al., 2017) (i.e., Kinetics-400, 600 and 700) and Moments in Time V1 (Monfort et al., 2020). (b) Temporal-related datasets: Something-Something V1/V2 (Goyal et al., 2017b). For untrimmed video recognition, we choose ActivityNet (Heilbron et al., 2015) and HACS (Zhao et al., 2019). More dataset details can be found in Appendix A.

Kinetics-710. Concretely, we merge the training sets of the Kinetics datasets, and then delete the repeated videos according to their YouTube IDs. Note that we also remove any testing videos of the different Kinetics datasets that leak into our combined training set, for correctness. As a result, the total number of training videos is reduced from 1.14M to 0.66M. Additionally, we merge the action categories of these three Kinetics datasets, which leads to 710 classes in total. Hence, we call this video benchmark Kinetics-710. More detailed descriptions can be found in Appendix F. In our experiments, we empirically show the effectiveness of our Kinetics-710. For post-pretraining, we simply use 8 input frames and adopt the same hyperparameters as for training on the individual Kinetics datasets. After that, no matter how many frames are input (16, 32, or even 64), we only need 5-epoch finetuning to gain more than 1% top-1 accuracy on Kinetics-400/600/700, as shown in Table 9e.
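The Kinetics-710 construction described above (merge the training lists, keep one copy per YouTube ID, and drop any ID that appears in some validation/test split) can be sketched as a small set operation; the data layout (`(yid, label)` pairs) is a hypothetical simplification:

```python
def build_k710_train(train_sets, eval_id_sets):
    """Merge Kinetics-400/600/700 training lists keyed by YouTube ID,
    keeping one copy per ID and excluding IDs leaked into any eval set."""
    leaked = set().union(*eval_id_sets) if eval_id_sets else set()
    merged = {}
    for videos in train_sets:
        for yid, label in videos:
            if yid not in leaked:
                merged.setdefault(yid, label)   # keep first occurrence only
    return merged

train = build_k710_train(
    [[("a", "yoga"), ("b", "luge")], [("a", "yoga"), ("c", "squat")]],
    [{"c"}],  # "c" appears in an eval split, so it is removed
)
```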


Implementation Details. Unless stated otherwise, we follow most of the training recipes in UniFormer (Li et al., 2022a); the detailed training hyperparameters can be found in Appendix A. We build UniFormerV2 on ViTs pretrained with various supervisions (see Table 8), showing the generality of our design. For the best results, we adopt CLIP-ViT (Radford et al., 2021) as the backbone by default, due to its robust representation pretrained by vision-language contrastive learning. For most datasets, we insert the global UniBlocks in the last 4 layers of ViT-B/L to perform the multi-stage fusion. But for Sth-Sth V1/V2, we insert the global UniBlocks in the last 8/16 layers of ViT-B/L for better temporal modeling. The corresponding ablation studies are shown in Table 9. Finally, we adopt sparse sampling (Wang et al., 2016) at a resolution of 224 for all the datasets.

4.1. COMPARISON TO STATE-OF-THE-ART

Kinetics. The second part of the comparison shows the methods using web-scale data, e.g., MTV-H.

Table 5: Comparison with the state-of-the-art on Something-Something V1.
Method | Frame | Top-1 | Top-5
TSN-R50 (Wang et al., 2016) | 16 | 19.9 | 47.3
TSM-R50 (Lin et al., 2019) | 16 | 47.2 | 77.1
TEA-R50 (Li et al., 2020b) | 16 | 51.9 | 80.3
CT-Net-R50 (Li et al., 2020a) | 16 | 52.5 | 80.9
TDN-R101 (Wang et al., 2021a) | 16 | 55.3 | 88.3
UniFormerV1-S (Li et al., 2022a) | 16 | 57.1 | 84.9
UniFormerV1-B (Li et al., 2022a) | | |

Something-Something. In Table 4, we show the results on Sth-Sth V2. First, our model outperforms the standard models built on well-pretrained image ViTs. For example, under the same CLIP-400M pretraining and the same number of sampled frames, our UniFormerV2-B obtains 4% higher accuracy with only 11% of the FLOPs, compared with EVL-L (Lin et al., 2022). Second, we compare our model with models whose backbones are specially designed. Since pretrained weights are unavailable for these models, they have to perform a tedious training phase consisting of image pretraining, video pretraining and video finetuning. Alternatively, our UniFormerV2 works well with only video finetuning, e.g., our model uses only 22 epochs to reach the performance of UniFormerV1 (Li et al., 2022a), which requires 110+50=160 video epochs. Finally, we compare UniFormerV2 with models that do not apply image pretraining. Such models require a huge number of training epochs, e.g., VideoMAE-B (Tong et al., 2022) takes 2400 video pretraining epochs and 40 video finetuning epochs, much longer than our UniFormerV2-B at a similar accuracy (only 22 video finetuning epochs, i.e., 0.9% of the training epochs of VideoMAE-B). For Sth-Sth V1 in Table 5, we reach a new state-of-the-art performance (62.7%). The above results reveal the effectiveness and efficiency of our UniFormerV2 for temporal modeling. ActivityNet and HACS.
For untrimmed videos, it is essential to capture long-range temporal information, since an action may occur multiple times at arbitrary moments. As shown in Tables 6 and 7, our UniFormerV2 significantly outperforms the previous best results on the large-scale untrimmed benchmarks ActivityNet and HACS, by 4.5% and 3.6% respectively. These results demonstrate the strong long-term modeling capacity of our UniFormerV2.

4.2. ABLATION STUDIES

To evaluate the effectiveness of UniFormerV2, we investigate each key structural design, as shown in the following tables.

Pretraining Sources. To demonstrate the generality of our UniFormerV2 design, we apply it to ViTs with different pretraining methods, including supervised learning (Dosovitskiy et al., 2021; Touvron et al., 2022), contrastive learning (Caron et al., 2021; Radford et al., 2021) and masked image modeling (He et al., 2022; Bao et al., 2021). Table 8 shows that all the models beat TimeSformer (Bertasius et al., 2021), especially on Something-Something, which relies on strong temporal modeling. It also reflects that a better-pretrained ViT is helpful for stronger video performance.

Different Components. In Table 9a, note that the global UniBlock is crucial for the scene-related benchmark (e.g., K400), since this block effectively provides a holistic video representation for classification. Alternatively, the local UniBlock is critical for the temporal-related benchmark (e.g., SSV2), since this block effectively describes a detailed video representation for classification. Besides, using temporal downsampling with double the input frames (similar FLOPs) is also helpful for distinguishing fine-grained videos like those in SSV2, due to the larger temporal receptive field.

Local UniBlock. To explore the structure of the local UniBlock, we conduct experiments in Table 9b. They reveal that convolution is better than self-attention for temporal modeling, and that our local MHRA is more powerful than both of them on SSV2. Following ST-Adapter (Pan et al., 2022), we add another local MHRA after the spatial MHRA for better performance. Besides, we add local MHRA in all the layers and reduce the channels by 1.5 times for the best accuracy-FLOPs trade-off.

Global UniBlock and Multi-stage Fusion. In Table 9c, we find that the features in the deep layers are critical for capturing long-term dependency, while the DPE and the middle-layer information are necessary for identifying motion differences.
For the fusion strategy, Table 9d shows that the simplest sequential fusion is adequate for integrating multi-stage features.

Training Recipes. We compare different training and finetuning methods in Table 9e. Note that when co-training with K400, K600 and K700, we remove the leaked videos in the validation sets and introduce three classification heads. K710 maintains only about 60% of the total training videos (0.66M vs. 1.14M for K400+K600+K700), yet it significantly improves the classification performance on Kinetics. Meanwhile, it saves about 33% of the training cost (see Appendix A). Besides, direct training on K710 works better than Kinetics co-training, especially for K600 (+1.3% vs. +1.0%) and K700 (+0.5% vs. -0.2%). Though co-finetuning shares the backbone and saves parameters, we adopt individual finetuning for each dataset for the best performance.

5. CONCLUSION

In this paper, we propose a powerful video model, namely UniFormerV2. It arms image-pretrained ViTs with efficient UniFormer designs for video learning. With its novel local and global video relation aggregators, it is capable of effective spatiotemporal modeling at a tractable complexity. Besides seamlessly integrating advantages from both ViTs and UniFormer, we also introduce multi-scale token fusion to further enhance the video representation. Our UniFormerV2 achieves state-of-the-art performance on 8 popular video benchmarks, and is the first to reach 90% top-1 accuracy on Kinetics-400, to our best knowledge.

A IMPLEMENTATION DETAILS

For most datasets, we adopt the default designs verified in Table 9a. But for Something-Something V1/V2, we adopt all the designs and insert the global UniBlocks in the last 8/16 layers of ViT-B/L for better temporal modeling. Besides, when finetuning models from large-scale dataset pretraining, it is necessary to initialize the new parameters properly. For stable training, we zero-initialize some of the layers, including the last point-wise convolutions in the local temporal MHRA, the query tokens and output projection layers in the query-based cross MHRA, the last linear layers in the FFN of the global UniBlock, and the learnable fusion weights. In addition, we provide the detailed hyperparameters in Table 11. Most of the training recipes follow UniFormer (Li et al., 2022a), but, differently, we do not apply Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), Label Smoothing (Szegedy et al., 2016) or Random Erasing (Zhong et al., 2020). When finetuning the full models on Kinetics directly from image pretraining, we adopt the same hyperparameters as in K710 pretraining. If the backbone is frozen, we use a larger learning rate (4e-4) without warmup.

Training Cost. In Table 9e, we compare different training schedules. When finetuning on Kinetics-400, 600 and 700 individually, we train the models for 55 epochs, and the total amount of training data is about 0.24 + 0.366 + 0.529 ≈ 1.14M videos.
When pretraining with Kinetics-710 (0.66M), we only need to finetune the models for 5 epochs. The fraction of training cost saved is thus

$$1 - \frac{0.66 \times 55 + 1.14 \times 5}{1.14 \times 55} \approx 0.33,$$

i.e., we save almost 33% of the training cost. More importantly, for the models with more frames (16, 32, or even 64), we only need to finetune from the K710-pretrained 8-frame models. Our training schedule is thus very efficient while remaining effective for the Kinetics family.
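The saving above can be checked numerically (video counts in millions and epoch counts as stated in the text; the function name is ours):

```python
def k710_saving(k710_videos=0.66, all_videos=1.14,
                pretrain_epochs=55, finetune_epochs=5):
    """Fraction of training cost saved by K710 pretraining plus short
    per-dataset finetuning, vs. 55 epochs on each Kinetics set
    (cost measured as videos x epochs)."""
    new_cost = k710_videos * pretrain_epochs + all_videos * finetune_epochs
    old_cost = all_videos * pretrain_epochs
    return 1.0 - new_cost / old_cost

saving = k710_saving()  # about 0.33, i.e. roughly a third of the cost
```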

B VISUALIZATIONS

In Figure 4, we compare UniFormerV2 with a typical ViT-based model, i.e., TimeSformer (Bertasius et al., 2021), and with UniFormerV1 (Li et al., 2022a) through visualization. Since UniFormerV1 is a multi-scale architecture, we show its features at the bottom of its 4 stages. TimeSformer and UniFormerV2 are based on ViTs with a fixed resolution, so we show their features every 3 layers. We use CAM (Zhou et al., 2016) to show the most discriminative features that the networks locate. The red parts indicate where the models focus most, while the blue parts are ignored. The visualizations reveal that both UniFormerV1 and UniFormerV2 are good at capturing local details, but UniFormerV1 may lose information in deeper layers due to the shrinking resolution, and thus fails to activate the discriminative parts. In contrast, TimeSformer only learns local features in the shallow layers, and thus struggles to focus on meaningful areas. As for UniFormerV2, it surprisingly maintains local details even in the deep layers. More importantly, it can observe the whole video and learns to concentrate more on the woman's leg, which helps recognize the action. These results demonstrate that our UniFormerV2 is effective in capturing both local details and long-term dependency.

Table 19: More results on ActivityNet and HACS. All models are based on UniFormerV2-L/14.

D ADDITIONAL RESULTS

In Table 16 , Table 17 , Table 18 and Table 19 , we give more results on the 8 video benchmarks, i.e., Kinetics-400/600/700, Moments in Time, Something-Something V1/V2, ActivityNet and HACS.

E MORE DISCUSSIONS

Local UniBlock vs. ST-Adapter (Pan et al., 2022). Our local UniBlock is motivated by the style of UniFormer (Li et al., 2022a), i.e., we treat temporal depth-wise convolution as a local temporal relation aggregator. Hence, like UniFormer, we introduce an extra BatchNorm (Ioffe & Szegedy, 2015) before the first linear projection V(·). Alternatively, ST-Adapter does not have this design, since it simply treats temporal depth-wise convolution as adaptation. With this motivation, it further introduces an extra activation function to enhance the adaptation, while our local UniBlock does not need one. In fact, we have also made comparisons in Table 9b, which show that our local MHRA beats ST-Adapter (69.1% vs. 68.0%).

Global UniBlock vs. Perceiver (Jaegle et al., 2021), DETR (Carion et al., 2020) and Flamingo (Alayrac et al., 2022). Our global UniBlock is also motivated by the style of UniFormer (Li et al., 2022a). But differently, to decrease the global computation in UniFormer, we change the self-attention MHRA into a cross-attention MHRA in our UniFormerV2. Hence, our global UniBlock consists of Dynamic Position Embedding (DPE), cross MHRA and FFN. On the contrary, none of those works uses such an operation combination, lacking the UniFormer insight for video learning. In fact, these methods often use the standard cross-style transformer block, consisting of self MHRA, cross MHRA and FFN.

Limitations. In UniFormerV2, we propose effective designs that arm pretrained ViTs as spatiotemporal learners. Although its training is more efficient than that of non-trivial video backbones, its performance tends to depend on the scale of the pretraining data, as shown in Table 8. Hence, it would be interesting to explore our UniFormerV2 on huge image foundation models pretrained on massive datasets, to further evaluate its scalability and generalization capacity.

F LABEL LIST OF KINETICS-710

To generate our Kinetics-710, we align the labels of the different Kinetics datasets by filtering symbols and replacing synonyms. The final label list is shown in Table 20. Compared with Kinetics-700, there are 8 and 2 unique labels in Kinetics-400 and Kinetics-600, respectively. When finetuning models pretrained on Kinetics-710, it is vital to load the pretrained weights of the classification layer, so we map the weights according to the label list.

Table 20: Labels of Kinetics-710. Each row gives an action label and its membership (✓/×) in Kinetics-400 (K4), Kinetics-600 (K6), and Kinetics-700 (K7), e.g., "luge" (K4: ×, K6: ✓, K7: ✓), "yoga" (K4: ✓, K6: ✓, K7: ✓), "vault" (K4: ✓, K6: ×, K7: ×). [Full table listing all 710 labels with their K4/K6/K7 membership flags.]
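Loading the classification layer of a Kinetics-710-pretrained model for a single Kinetics dataset amounts to permuting its rows by label name (and zero-initializing rows for labels absent from the pretraining list). A minimal sketch of that mapping, assuming a plain `(num_classes, dim)` weight matrix; the function name and interface are illustrative, not the paper's actual code:

```python
import numpy as np

def remap_classifier(weight, bias, src_labels, dst_labels):
    """Remap a classification layer pretrained on `src_labels`
    (e.g. the 710 Kinetics-710 labels) to `dst_labels`
    (e.g. the Kinetics-400 label list), matching by label name.

    weight: (num_src_classes, dim), bias: (num_src_classes,)
    Returns a (len(dst_labels), dim) weight and matching bias;
    rows for labels missing from the source stay zero-initialized.
    """
    src_index = {name: i for i, name in enumerate(src_labels)}
    new_w = np.zeros((len(dst_labels), weight.shape[1]), dtype=weight.dtype)
    new_b = np.zeros(len(dst_labels), dtype=bias.dtype)
    for j, name in enumerate(dst_labels):
        i = src_index.get(name)
        if i is not None:
            new_w[j] = weight[i]
            new_b[j] = bias[i]
    return new_w, new_b
```

In a real pipeline the same index mapping would be applied to the checkpoint's classifier tensors before calling the framework's state-dict loading routine.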



Figure 1: Comparison of video modeling paradigms. UniFormerV1 requires costly image pretraining, while directly inserting temporal MHSA into ViTs struggles to achieve a good accuracy-FLOPs balance. UniFormerV2 can effectively and efficiently arm well-pretrained ViTs with concise UniFormer designs, thus integrating advantages from both models for spatiotemporal representation learning. To our best knowledge, it is the first model that achieves 90.0% top-1 accuracy on Kinetics-400.

has achieved great success in various vision tasks, including object detection Carion et al. (2020); Zhu et al. (2021), semantic segmentation Xie et al. (2021); Cheng et al. (2021), and low-level image processing Liang et al. (2021); Cui et al. (

Figure 3: Multi-Stage Fusion Block.

Figure 4: More visualizations. Frames are sampled from Kinetics according to the sampling strategies of the different methods. For UniFormerV1, twice as many frames are sampled and the temporal resolution is downsampled in the patch embedding.
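As a concrete illustration of these sampling strategies: a common recipe is uniform sampling of segment midpoints, with UniFormerV1-style inputs taking twice as many frames and later halving the temporal resolution (stride 2) in the patch embedding, so both models end up with the same number of temporal tokens. The helper below is a hypothetical sketch, not the paper's actual data loader:

```python
import numpy as np

def uniform_indices(num_total, num_sample):
    """Uniformly sample `num_sample` frame indices from a video of
    `num_total` frames, taking the midpoint of each equal segment."""
    ticks = np.linspace(0, num_total, num_sample + 1)
    return ((ticks[:-1] + ticks[1:]) / 2).astype(int)

T = 8
v2_idx = uniform_indices(64, T)       # T frames fed directly
v1_idx = uniform_indices(64, 2 * T)   # 2*T frames, stride-2 in patch embedding
```

After the stride-2 temporal downsampling, a UniFormerV1-style model would also process T temporal positions, which is why the two methods visualize different sampled frames for the same clip.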

Figure 2: Overall framework of our UniFormerV2. There are three key blocks, i.e., local and global UniBlocks, and a multi-stage fusion block. All these designs are efficient and effective.

before local MHRA, and Layer Norm (LN) (Ba et al., 2016) before global MHRA and FFN. Note that GS MHRA and FFN come from the image-pretrained ViT block. In general, MHRA (Li et al., 2022a) learns token relations via multi-head fusion:

$$\mathrm{MHRA}(\mathbf{X}) = \mathrm{Concat}\big(\mathrm{R}_1(\mathbf{X});\, \mathrm{R}_2(\mathbf{X});\, \cdots;\, \mathrm{R}_N(\mathbf{X})\big)\,\mathbf{U}, \qquad \mathrm{R}_n(\mathbf{X}) = \mathbf{A}_n \mathbf{V}_n(\mathbf{X}),$$

where $\mathrm{R}_n(\cdot)$ is the relation aggregator in the $n$-th head, $\mathbf{A}_n$ is the token affinity, $\mathbf{V}_n(\cdot)$ is a linear projection, and $\mathbf{U}$ is a learnable fusion matrix.
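The multi-head fusion above can be sketched in a few lines. This is an illustrative NumPy version, taking the affinity matrices $\mathbf{A}_n$ as given (in UniFormer they come from a local convolution in shallow layers and from self-attention in deep layers); names and shapes are assumptions for the sketch:

```python
import numpy as np

def mhra(x, value_projs, affinities, fusion):
    """Multi-Head Relation Aggregator (illustrative sketch).

    x:            (L, C) token features
    value_projs:  N matrices of shape (C, C // N), one V_n per head
    affinities:   N matrices of shape (L, L), the token affinities A_n
    fusion:       (C, C) fusion matrix U

    Returns (L, C): Concat(R_1(x); ...; R_N(x)) @ U, with R_n(x) = A_n V_n(x).
    """
    heads = [A @ (x @ V) for A, V in zip(affinities, value_projs)]
    return np.concatenate(heads, axis=1) @ fusion
```

The only difference between the local and global aggregators is how each $\mathbf{A}_n$ is built, which is what lets UniFormer unify convolution and self-attention in one block format.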


Comparison with the state-of-the-art on Moments in Time V1.




Results on Something-Something V1.

Results on ActivityNet.

Results on HACS.

Different pretrained ViTs. Our UniFormerV2 built on different open-sourced ViTs beats TimeSformer, especially on Something-Something.

Compared with the 4-model ensemble of Yan et al. (2022), our single model requires only 1% of the video post-pretraining, 16% of the finetuning epochs, and 35% of the model parameters to achieve competitive accuracy. On the other hand, under the same CLIP-400M pretraining, our UniFormerV2-L (frozen) uses only 25% of the FLOPs to reach accuracy competitive with EVL-L (frozen) (Lin et al., 2022), and obtains a 1.1% accuracy improvement with similar FLOPs. Finally, our UniFormerV2 is, to the best of our knowledge, the first model to achieve 90.0% top-1 accuracy on K400. On Kinetics-600 and 700, UniFormerV2 also obtains state-of-the-art performance (90.1% and 82.7%; see Table 2).

Ablation studies. T-Down means temporal downsampling, for which we double the number of frames to maintain similar GFLOPs. ST-Adapter is proposed in Pan et al. (2022). Compared with simple co-training, our K710 pretraining saves 33% of the cost with a consistent improvement (see Appendix A).

More results on Moments in Time V1.

More results on Something-Something. All models are directly finetuned from CLIP.

Reproducibility.

To ensure all the results can be reproduced, we give the details of the datasets, models, and training hyperparameters used in our experiments (see Table 10 and Table 11). For Kinetics-710, we provide its label list in Table 20 for reproduction. All our code is based on UniFormer (Li et al., 2022b).

