TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Abstract

Video data exhibits complex temporal dynamics due to various factors such as camera motion, speed variation, and different activities. To effectively capture these diverse motion patterns, this paper presents a new temporal adaptive module (TAM) that generates video-specific temporal kernels from the video's own feature map. TAM proposes a unique two-level adaptive modeling scheme by decoupling the dynamic kernel into a location-sensitive importance map and a location-invariant aggregation weight. The importance map is learned in a local temporal window to capture short-term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a principled module and can be integrated into 2D CNNs to yield a powerful video architecture (TANet) at very small extra computational cost. Extensive experiments on the Kinetics-400 and Something-Something datasets demonstrate that TAM consistently outperforms other temporal modeling methods and achieves state-of-the-art performance at similar complexity.

1. INTRODUCTION

Deep learning has brought great progress to various recognition tasks in the image domain, such as image classification (Krizhevsky et al., 2012; He et al., 2016), object detection (Ren et al., 2017), and instance segmentation (He et al., 2017). The key to these successes is to devise flexible and efficient architectures that are capable of learning powerful visual representations from large-scale image datasets (Deng et al., 2009). However, research progress in video understanding has been relatively slow, partially due to the high complexity of video data. The core technical problem in video understanding is to design an effective temporal module, one that captures complex temporal structure with high flexibility while remaining computationally cheap enough to process high-dimensional video data efficiently. 3D Convolutional Neural Networks (3D CNNs) (Ji et al., 2010; Tran et al., 2015) have become mainstream architectures for video modeling (Carreira & Zisserman, 2017; Feichtenhofer et al., 2019; Tran et al., 2018; Qiu et al., 2017). The 3D convolution is a natural extension of its 2D counterpart and provides a learnable operator for video recognition. However, this simple extension lacks specific consideration of the temporal properties of video data and may also lead to high computational cost. Therefore, recent methods aim to improve 3D CNNs from two different aspects: combining a lightweight temporal module with 2D CNNs to improve efficiency (e.g., TSN (Wang et al., 2016), TSM (Lin et al., 2019)), or designing a dedicated temporal module to better capture temporal relations (e.g., Nonlocal Net (Wang et al., 2018b), ARTNet (Wang et al., 2018a), STM (Jiang et al., 2019)). However, how to devise a temporal module with high efficiency and strong flexibility remains an open problem in video recognition.
Consequently, we aim to advance current video architectures along this direction. In this paper, we focus on devising a principled adaptive module to capture temporal information in a more flexible way. In general, we observe that video data exhibits extremely complex dynamics along the temporal dimension due to factors such as camera motion and varying speed. Thus 3D convolutions (temporal convolutions) might lack sufficient representation power to describe motion diversity when they simply employ a fixed number of video-invariant kernels. To deal with such complex temporal variations in videos, we argue that adaptive temporal kernels for each video are both effective and necessary for describing motion patterns. To this end, as shown in Figure 1, we present a two-level adaptive modeling scheme that decomposes this video-specific temporal kernel into a location-sensitive importance map and a location-invariant (yet video-adaptive) aggregation kernel. This unique design allows the location-sensitive importance map to focus on enhancing discriminative temporal information from a local view, and enables the location-invariant aggregation weights to capture temporal dependencies guided by a global view of the input video sequence. Specifically, the design of the temporal adaptive module (TAM) strictly follows two principles: high efficiency and strong flexibility. To ensure a low computational cost, we first squeeze the feature map with global spatial pooling, and then build TAM in a channel-wise manner to preserve efficiency. TAM is composed of two branches: a local branch (L) and a global branch (G). As shown in Fig. 2, TAM is implemented in an efficient way. The local branch employs temporal convolutions to produce the location-sensitive importance maps that enhance discriminative local features, while the global branch uses fully connected layers to produce the location-invariant kernel for temporal aggregation.
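To make the two-branch scheme concrete, the following is a minimal PyTorch sketch of a TAM-style module under stated assumptions: the reduction ratio, hidden width of the global branch, and kernel size `K` are illustrative choices, not the paper's released hyper-parameters, and the input is assumed to be the usual `(N*T, C, H, W)` layout produced by running a 2D CNN over `T` frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAM(nn.Module):
    """Sketch of the two-level adaptive scheme: a local branch producing a
    location-sensitive importance map, and a global branch producing a
    location-invariant, video-adaptive K-tap aggregation kernel per channel.
    Layer sizes are illustrative assumptions."""

    def __init__(self, channels, n_segment, kernel_size=3, reduction=4):
        super().__init__()
        self.n_segment = n_segment
        self.kernel_size = kernel_size
        # Local branch: temporal convs on the spatially squeezed feature,
        # yielding importance weights in [0, 1] for every time step.
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 3, padding=1),
            nn.BatchNorm1d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # Global branch: fully connected layers mapping each channel's
        # whole temporal profile (length T) to a K-tap aggregation kernel.
        self.global_fc = nn.Sequential(
            nn.Linear(n_segment, n_segment * 2),
            nn.ReLU(inplace=True),
            nn.Linear(n_segment * 2, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        # x: (N*T, C, H, W) from a 2D backbone applied frame by frame.
        nt, c, h, w = x.shape
        t = self.n_segment
        n = nt // t
        # Squeeze spatial dims with global average pooling -> (N, C, T).
        squeezed = x.view(n, t, c, h, w).mean(dim=(3, 4)).permute(0, 2, 1)
        importance = self.local(squeezed)          # (N, C, T)
        kernel = self.global_fc(squeezed)          # (N, C, K), video-specific
        # Enhance features with the location-sensitive importance map.
        y = x.view(n, t, c, h, w) * importance.permute(0, 2, 1).reshape(n, t, c, 1, 1)
        # Apply per-(video, channel) kernels as a grouped temporal conv,
        # sharing the kernel over all spatial positions.
        y = y.permute(3, 4, 0, 2, 1).reshape(h * w, n * c, t)
        w_k = kernel.reshape(n * c, 1, self.kernel_size)
        out = F.conv1d(y, w_k, groups=n * c, padding=self.kernel_size // 2)
        return out.reshape(h, w, n, c, t).permute(2, 4, 3, 0, 1).reshape(nt, c, h, w)
```

Note how efficiency follows from the design: both branches operate on the spatially pooled `(N, C, T)` tensor, and the adaptive aggregation is a channel-wise (grouped) 1D convolution, so the overhead over a plain 2D backbone is small.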
The importance map, generated within a local temporal window, focuses on short-term motion modeling, while the aggregation kernel, generated from a global view, pays more attention to long-term temporal information. Furthermore, TAM can be flexibly plugged into existing 2D CNNs to yield an efficient video recognition architecture, termed TANet. We validate the proposed TANet on the task of action classification. In particular, we first study the performance of TANet on the Kinetics-400 dataset. We demonstrate that TAM captures temporal information better than several other counterparts, such as temporal pooling, temporal convolution, TSM (Lin et al., 2019), and the Non-local block (Wang et al., 2018b). TANet yields very competitive accuracy with FLOPs similar to those of 2D CNNs. We further test TANet on the motion-dominated Something-Something dataset, where state-of-the-art performance is also achieved.
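The "plugged into existing 2D CNNs" step can be sketched as follows. This is a hypothetical residual block in which a temporal module is inserted after the first 2D convolution; the exact insertion point and block structure are illustrative assumptions rather than the paper's released TANet definition, and `temporal_module` stands in for a TAM instance.

```python
import torch
import torch.nn as nn

class TABlock(nn.Module):
    """Hypothetical TANet-style residual block: a 2D basic block with a
    temporal module (e.g. TAM) inserted between the two spatial convs.
    Placement and widths are illustrative assumptions."""

    def __init__(self, channels, temporal_module):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.temporal = temporal_module   # mixes information across frames
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (N*T, C, H, W); spatial convs act per frame, the temporal
        # module is the only place where frames interact.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.temporal(out)
        return self.relu(self.bn2(self.conv2(out)) + x)
```

Because the temporal module is the only cross-frame operator, replacing it with `nn.Identity()` recovers a purely frame-wise 2D block, which is one way to measure the temporal module's contribution in ablations.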

2. RELATED WORKS

Video understanding is a core topic in the field of computer vision. At an early stage, many traditional methods (Le et al., 2011; Kläser et al., 2008; Sadanand & Corso, 2012; Willems et al., 2008) designed various hand-crafted features to encode video data, but these methods are too inflexible to generalize to other video tasks. Recently, the rapid development of video understanding has benefited greatly from deep learning methods (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016), especially in video recognition, where a series of CNN-based methods



Figure 1: Temporal module comparison. The standard temporal convolution shares weights across videos and may lack the flexibility to handle variations arising from the diversity of videos. Temporal attention learns position-sensitive weights by assigning varied importance to different time steps, but without any temporal interaction it may ignore long-range temporal dependencies. Our proposed temporal adaptive module (TAM) presents a two-level adaptive scheme, learning local importance weights for location-adaptive enhancement and global kernel weights for video-adaptive aggregation. ⊗ denotes the convolution operation, and the attention operation is also marked in the figure.

