T2D: SPATIOTEMPORAL FEATURE LEARNING BASED ON TRIPLE 2D DECOMPOSITION

Abstract

In this paper, we propose a triple 2D decomposition (T2D) of a 3D vision Transformer (ViT) for efficient spatiotemporal feature learning. The idea is to decompose the self-attention operation over a 3D data cube into three self-attention operations over three 2D data planes. Such a design not only effectively reduces the computational complexity of a 3D ViT, but also guides the network to focus on learning correlations among more relevant tokens. Compared with other decomposition methods, the proposed T2D is shown to be more powerful at a similar computational complexity. The CLIP-initialized T2D-B model achieves state-of-the-art top-1 accuracy of 85.0% and 70.5% on the Kinetics-400 and Something-Something-v2 datasets, respectively. It also outperforms other methods by a large margin on the FineGym (+17.9%) and Diving-48 (+1.3%) datasets. Under the zero-shot setting, the T2D model obtains a 2.5% top-1 accuracy gain over X-CLIP on the HMDB-51 dataset. In addition, T2D is a general decomposition method that can be plugged into any ViT structure of any model size. We demonstrate this by building a tiny-sized T2D model based on a hierarchical ViT structure named DaViT. The resulting DaViT-T2D-T model achieves 82.0% and 71.3% top-1 accuracy with only 91 GFLOPs on the Kinetics-400 and Something-Something-v2 datasets, respectively. Source code will be made publicly available.

1. INTRODUCTION

Learning spatiotemporal representations for videos is one of the most fundamental yet challenging tasks in computer vision (CV). The challenge comes mainly from the contradiction between the insufficient training data and the diverse spatiotemporal patterns that need to be learned. This becomes obvious if we take representation learning for images as a reference. The largest public image dataset, ImageNet-21K Deng et al. (2009), consists of more than 14M images divided into over 21K classes, while the largest public video dataset, Kinetics-700 Carreira et al. (2019), only consists of around 500K videos divided into 700 human action classes. Video data are far inferior to image data in terms of quantity and diversity, yet the spatiotemporal information to be learned is one dimension higher than the spatial information contained in images.

A straightforward idea to address this challenge in video representation learning is to make full use of the spatial modeling capability gained by image models. Researchers started to implement this idea back in the convolutional neural network (CNN) era. For example, the I3D network Carreira & Zisserman (2017) uses inflated ResNet weights trained on ImageNet for initialization. More recently, as the Transformer Vaswani et al. (2017) architecture started to dominate CV, ViT Dosovitskiy et al. (2021) networks pretrained on ImageNet or by CLIP Radford et al. (2021) have been adopted as the initialization of spatiotemporal feature learning networks in many efforts Wang et al. (2021b); Pan et al. (2022); Lin et al. (2022). However, the quadratic complexity of self-attention poses a great challenge to adapting an image-oriented 2D ViT into a video-oriented 3D ViT, and decomposition is needed to make the computational complexity tractable. Previous work Bertasius et al. (2021) has proposed space-time decomposition, axial decomposition, and local-global decomposition, among others, but these designs have not achieved satisfactory performance. In fact, the key question to consider when decomposing a 3D ViT is which tokens should be grouped together to perform self-attention. The selection of group size and coverage should trade off computational complexity against the attainable modeling capability.

In this work, we propose triple 2D decomposition (T2D), which decomposes a 3D video tensor into three 2D data planes. Let X, Y, and T denote the horizontal, vertical, and temporal axes of a video tensor, respectively. The XY data plane contains sufficient spatial information for recognizing the main objects, and the two extended temporal data planes XT and TY provide rich information about object motion, as Fig. 1 shows. The T2D design groups the tokens lying in the same XY, XT, or TY plane for self-attention computation, so the group size is far more manageable than that of full 3D attention. More importantly, all the computational resources and the available training data are spent on mining correlations among the most relevant tokens. In addition, we propose to share weights between the XT and TY branches, while the XY branch is separately initialized with the pretrained weights of an image model.
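Concretely, the plane-wise grouping can be sketched in PyTorch as follows. This is a minimal illustration rather than our full implementation: the module name Triple2DAttention, the (B, T, H, W, C) token layout, the use of nn.MultiheadAttention, and the additive fusion of the three branches are simplifying assumptions, and details such as position encodings and the DaViT window structure are omitted.

import torch
import torch.nn as nn


class Triple2DAttention(nn.Module):
    """Decompose attention over a (T, H, W) token cube into three 2D attentions,
    one per XY, XT, and TY plane. The XT and TY branches share weights; the XY
    branch is kept separate so it can be initialized from a pretrained image model."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn_xy = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # shared by XT and TY

    @staticmethod
    def _plane_attn(attn, x):
        # x: (num_planes, tokens_per_plane, dim); tokens attend only within their plane.
        out, _ = attn(x, x, x, need_weights=False)
        return out

    def forward(self, x):
        # x: (B, T, H, W, C) token cube.
        B, T, H, W, C = x.shape

        # XY planes: one plane per (batch, frame); H*W tokens attend to each other.
        xy = x.reshape(B * T, H * W, C)
        xy = self._plane_attn(self.attn_xy, xy).reshape(B, T, H, W, C)

        # XT planes: one plane per (batch, row); T*W tokens attend to each other.
        xt = x.permute(0, 2, 1, 3, 4).reshape(B * H, T * W, C)
        xt = self._plane_attn(self.attn_t, xt).reshape(B, H, T, W, C).permute(0, 2, 1, 3, 4)

        # TY planes: one plane per (batch, column); T*H tokens attend to each other.
        ty = x.permute(0, 3, 1, 2, 4).reshape(B * W, T * H, C)
        ty = self._plane_attn(self.attn_t, ty).reshape(B, W, T, H, C).permute(0, 2, 3, 1, 4)

        return xy + xt + ty  # how the three branches are fused is a design choice


if __name__ == "__main__":
    tokens = torch.randn(2, 8, 14, 14, 768)   # (B, T, H, W, C)
    block = Triple2DAttention(dim=768, num_heads=12)
    print(block(tokens).shape)                 # torch.Size([2, 8, 14, 14, 768])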
The proposed triple 2D decomposition is a straightforward design, but it has not been sufficiently explored by previous researchers. We believe the possible reasons are twofold. First, previous research mainly focused on reducing the computational complexity of models, and the T2D decomposition does not reduce the complexity in a typical CNN setting¹. Second, action recognition performance is usually evaluated on simple or scene-focused datasets, such as UCF-101 Soomro et al. (2012), HMDB-51 Kuehne et al. (2011), and Kinetics Kay et al. (2017), where spatial modeling plays a dominant role. In this case, temporal modeling has not received the attention it deserves.

In summary, the main contributions of our work are three-fold:

• We propose triple 2D decomposition for efficient spatiotemporal feature learning with a Transformer network. Isolating the self-attention operation within each 2D data plane not only makes the computational complexity easily manageable, but also offers great design flexibility, allowing us to select different settings and initializations for spatial and temporal modeling.

• We provide a detailed analysis of different decomposition methods for 3D ViT and carry out ablation studies to demonstrate the advantage of the proposed T2D.

• We make T2D a plug-and-play component and implement it based on both CLIP ViT Radford et al. (2021) and DaViT Ding et al. (2022). All versions of the T2D network achieve higher or competitive performance on the Kinetics-400 and Something-Something-v2 benchmarks when compared with state-of-the-art (SOTA) models of similar sizes. The CLIP-based T2D network is extensively evaluated on a broad range of video action recognition benchmarks. T2D-B pushes the previous SOTA from 88.0%/86.4%/50.9% to 89.3%/93.6%/68.8% on Diving-48, Gym99, and Gym288, respectively. It also achieves competitive or higher performance in zero-shot evaluation on the HMDB-51 and UCF-101 datasets compared to the previous SOTA ActionCLIP Wang et al. (2021b) and X-CLIP Ni et al. (2022).



¹ A typical 3 × 3 × 3 convolution kernel has the same computational complexity as three 3 × 3 convolution kernels.
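This footnote can be checked with a quick parameter count, since the multiply-accumulate cost per output position of a convolution scales with its kernel size; the snippet below is purely illustrative and the channel count is arbitrary.

import torch.nn as nn

C = 64  # arbitrary channel count for illustration
conv3d = nn.Conv3d(C, C, kernel_size=3, bias=False)                          # one 3x3x3 kernel per channel pair
conv2d_x3 = [nn.Conv2d(C, C, kernel_size=3, bias=False) for _ in range(3)]   # three 3x3 kernels

params_3d = sum(p.numel() for p in conv3d.parameters())                      # C * C * 27
params_2d = sum(p.numel() for m in conv2d_x3 for p in m.parameters())        # 3 * C * C * 9
print(params_3d, params_2d)  # 110592 110592 -> identical cost per output position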




Figure 1: Decomposing 3D video data XYT into three data planes, denoted by XY, XT, and TY. The XY data plane is sufficient for recognizing the main objects. The XT and TY data planes provide rich information about object motion.


