T2D: SPATIOTEMPORAL FEATURE LEARNING BASED ON TRIPLE 2D DECOMPOSITION

Abstract

In this paper, we propose triple 2D decomposition (T2D) of a 3D vision Transformer (ViT) for efficient spatiotemporal feature learning. The idea is to decompose the self-attention operation over a 3D data cube into three self-attention operations over three 2D data planes. This design not only effectively reduces the computational complexity of a 3D ViT, but also guides the network to focus on learning correlations among more relevant tokens. Compared with other decomposition methods, the proposed T2D is shown to be more powerful at a similar computational complexity. The CLIP-initialized T2D-B model achieves state-of-the-art top-1 accuracy of 85.0% and 70.5% on the Kinetics-400 and Something-Something-v2 datasets, respectively. It also outperforms other methods by a large margin on the FineGym (+17.9%) and Diving-48 (+1.3%) datasets. Under the zero-shot setting, the T2D model obtains a 2.5% top-1 accuracy gain over X-CLIP on the HMDB-51 dataset. In addition, T2D is a general decomposition method that can be plugged into any ViT structure of any model size. We demonstrate this by building a tiny-size T2D model on a hierarchical ViT structure named DaViT. The resulting DaViT-T2D-T model achieves 82.0% and 71.3% top-1 accuracy with only 91 GFLOPs on the Kinetics-400 and Something-Something-v2 datasets, respectively. Source code will be made publicly available.

1. INTRODUCTION

Learning spatiotemporal representations for videos is one of the most fundamental yet challenging tasks in computer vision (CV). The challenge stems mainly from the contradiction between insufficient data and the diverse spatiotemporal patterns that need to be learned. This becomes apparent when we take image representation learning as a reference. The largest public image dataset, ImageNet-21K Deng et al. (2009), consists of more than 14M images divided into over 21K classes, while the largest public video dataset, Kinetics-700 Carreira et al. (2019), consists of only around 500K videos divided into 700 human action classes. Video data are far inferior to image data in quantity and diversity, yet the spatiotemporal information to be learned is one dimension higher than the spatial information contained in images. A straightforward idea to address this challenge in video representation learning is to make full use of the spatial modeling capability gained by image models. Researchers started to implement this idea back in the convolutional neural network (CNN) era. For example, the I3D network Carreira & Zisserman (2017) uses inflated ResNet weights trained on ImageNet for initialization. More recently, as the Transformer Vaswani et al. (2017) architecture has come to dominate CV, the ViT Dosovitskiy et al. (2021) network pretrained on ImageNet or by CLIP Radford et al. (2021) has been adopted as the initialization of spatiotemporal feature learning networks in many works Wang et al. (2021b); Pan et al. (2022); Lin et al. (2022).

However, the quadratic complexity of the Transformer poses a great challenge when adapting an image-oriented 2D ViT to a video-oriented 3D ViT. Decomposition is needed to make the computational complexity tractable. Previous work Bertasius et al. (2021) has proposed space-time decomposition, axial decomposition, and local-global decomposition, among others, but these have not achieved satisfactory performance. In fact, the key question when decomposing a 3D ViT is which tokens should be grouped together to perform self-attention. The choice of group size and coverage should trade off computational complexity against attainable modeling capability.
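To make the decomposition idea concrete, the following is a minimal single-head NumPy sketch of attending within three 2D planes of a (T, H, W, C) token cube. It is an illustration only, not the paper's implementation: the choice of the three planes (here assumed to be the spatial (H, W) plane and the two spatiotemporal (T, H) and (T, W) planes), the averaging of the three outputs, and the use of the input as query, key, and value alike are all simplifying assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attn_2d(x):
    """Plain single-head self-attention over the token axis of x: (..., N, C)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def t2d_attention(x):
    """Triple 2D decomposition (sketch): attention within three 2D planes
    of a (T, H, W, C) cube instead of joint attention over all T*H*W tokens."""
    T, H, W, C = x.shape
    # (H, W) plane: tokens within each frame attend to each other.
    sp = attn_2d(x.reshape(T, H * W, C)).reshape(T, H, W, C)
    # (T, H) plane: for each width index, attend over time-height tokens.
    th = attn_2d(x.transpose(2, 0, 1, 3).reshape(W, T * H, C))
    th = th.reshape(W, T, H, C).transpose(1, 2, 0, 3)
    # (T, W) plane: for each height index, attend over time-width tokens.
    tw = attn_2d(x.transpose(1, 0, 2, 3).reshape(H, T * W, C))
    tw = tw.reshape(H, T, W, C).transpose(1, 0, 2, 3)
    # Assumed fusion: average the three plane outputs.
    return (sp + th + tw) / 3.0
```

With N = T·H·W tokens, joint 3D attention scores cost O(N²) token pairs, whereas the three-plane scheme above costs O(N·(HW + TH + TW)) pairs, which is the source of the complexity reduction the decomposition aims for.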

