TEMPORAL RELEVANCE ANALYSIS FOR VIDEO ACTION MODELS

Anonymous authors
Paper under double-blind review

Abstract

In this paper, we provide a deep analysis of temporal modeling for action recognition, an important but underexplored problem in the literature. We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models, based on layer-wise relevance propagation. We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected by various factors such as dataset, network architecture, and input frames. With this, we further study some important questions for action recognition that lead to interesting findings. Our analysis shows that there is no strong correlation between temporal relevance and model performance, and that action models tend to capture local temporal information but fewer long-range dependencies.

1. INTRODUCTION

State-of-the-art action recognition systems are mostly based on deep learning. Popular CNN-based approaches either model spatial and temporal information jointly with 3D convolutions (Carreira & Zisserman, 2017; Feichtenhofer, 2020; Tran et al., 2015) or, more efficiently, separate spatial and temporal modeling with 2D convolutions (Fan et al., 2019; Lin et al., 2019). One of the fundamental keys to action recognition is temporal modeling, which involves learning temporal relationships between frames. Despite the significant progress made on action recognition, our understanding of temporal modeling is still significantly lacking, and some important questions remain unanswered. For example, how does an action model learn relationships between frames? Can we quantify the amount of temporal relationships learned by a model? Stronger backbones in general lead to better recognition accuracy (Chen et al., 2021; Zhu et al., 2020), but do they learn temporal information better? Do models capture long-range temporal information across frames? In this paper, we provide a deep analysis of temporal modeling for action recognition. Previous works focus on performance benchmarking (Chen et al., 2021; Zhu et al., 2020), spatio-temporal feature visualization (Feichtenhofer et al., 2020; Selvaraju et al., 2017) or saliency analysis (Bargal et al., 2018; Hiley et al., 2019b; Roy et al., 2019; Wang et al., 2018) to gain a better understanding of action models. For example, comprehensive studies of CNN-based models have been conducted recently in (Chen et al., 2021; Zhu et al., 2020) to compare the performance of different action models. Others (Monfort et al., 2019b; Selvaraju et al., 2017; Zhou et al., 2016) focus on visualizing the evidence used to make specific predictions, sometimes posed as understanding the relevance of each pixel to the recognition.
In contrast, we aim to understand how temporal information is captured by action models, i.e., the temporal dependencies between frames, or how a frame relates to other frames in a video clip. In this work, we propose a new approach to evaluate the effectiveness of temporal modeling based on layer-wise relevance propagation (Gu et al., 2018; Montavon et al., 2019), a popular technique widely used for explaining deep learning models. Our approach studies the temporal relationships between frames in an action model and quantifies the amount of temporal dependency captured by the model, which we refer to as action temporal relevance (ATR) (Sec. 3.2). Fig. 1 illustrates our approach. We conduct comprehensive experiments on popular video benchmark datasets such as Kinetics400 (Kay et al., 2017) and Something-Something (Goyal et al., 2017) with several representative CNN models, including I3D (Carreira & Zisserman, 2017), TAM (Fan et al., 2019) and SlowFast (Feichtenhofer et al., 2018). Our experiments provide a deep analysis of how temporal relevance is affected by various factors, including dataset, network architecture, network depth, and kernel size, as well as the input frames (Sec. 4.2). Finally, based on this analysis, we endeavor to deliver a deeper understanding of the important questions raised above (Sec. 4.3). We exclusively focus on CNN-based approaches for action recognition. We are fully aware that recently emerging transformer-based approaches such as (Arnab et al., 2021; Bertasius et al., 2021; Fan et al., 2021b; Liu et al., 2021) demonstrate comparable or better performance than CNN-based models, but studying transformers is beyond the scope of this work. We summarize our contributions below:

• Tool for Understanding Action Models. We present a new approach for better understanding action models and develop a means of evaluating the effectiveness of temporal modeling.

• Temporal Relevance Analysis. We conduct comprehensive experiments to understand how temporal information in a video is modeled under different settings.

• Deep Understanding of Temporal Modeling. We study some fundamental questions in action recognition that lead to interesting findings: a) There is no strong correlation between temporal relevance and model performance; instead, temporal relevance is more related to architecture. b) Action models behave similarly on both temporal and static actions as defined by humans (Sevilla-Lara et al., 2019), and there is no strong indication that temporal actions require stronger temporal dependencies to be learned by these models. c) As the number of input frames increases, action models capture more short-range temporal information (local contextual information) but fewer long-range dependencies. The better performance obtained with more frames appears to be largely attributable to richer local contextual information rather than global contextual information.

Many models have proposed different temporal modeling approaches to handle the temporal dynamics of a video (Carreira & Zisserman, 2017; Fan et al., 2019; Feichtenhofer, 2020; Feichtenhofer et al., 2018; Hussein et al., 2019; Lin et al., 2019; Luo & Yuille, 2019; Tran et al., 2015; 2019; Wang et al., 2020; 2016; 2018; Xie et al., 2018; Zhou et al., 2018). Chen et al. and Zhu et al. provided comprehensive surveys of how these CNN-based models achieve temporal modeling and compared their accuracy. Transformer-based models have also become popular after their introduction to the computer vision community (Dosovitskiy et al., 2021). In addition, multiple recent attention-based temporal modeling works have been proposed to enhance transformer-based models, e.g., MViT (Fan et al., 2021a), TimeSformer (Bertasius et al., 2021), Video Swin (Liu et al., 2021), etc. (Li et al., 2021; Neimark et al., 2021).
The aforementioned works justify their temporal modeling capability, or the range of their temporal modeling, by validating performance on benchmark datasets. In this work, we focus on quantifying the temporal relevance between each frame pair learned by a model, to help understand the effects of different CNN-based temporal modeling approaches on the task of action recognition.
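To make the relevance-propagation idea concrete, the following is a minimal sketch of layer-wise relevance propagation with the epsilon rule on a toy two-layer network over T flattened frames, aggregating input relevance into one score per frame, as an ATR-style analysis would require. This is not the paper's implementation: the network, shapes, and random inputs are illustrative assumptions.

```python
import numpy as np

def lrp_epsilon(weights, activations, relevance, eps=1e-6):
    """Propagate relevance one layer back with the epsilon rule:
    R_i = a_i * sum_j w_ij * R_j / (z_j + eps * sign(z_j))."""
    a = activations                      # layer inputs, shape (d_in,)
    z = a @ weights                      # pre-activations, shape (d_out,)
    s = relevance / (z + eps * np.sign(z) + (z == 0) * eps)
    return a * (weights @ s)             # relevance per input unit

rng = np.random.default_rng(0)
T, D, H = 8, 16, 32                      # frames, per-frame features, hidden units
W1 = rng.normal(size=(T * D, H))
W2 = rng.normal(size=(H, 1))

x = rng.normal(size=(T * D,))            # flattened clip of T frames
h = np.maximum(x @ W1, 0.0)              # ReLU hidden layer
y = h @ W2                               # scalar class score

# Backward pass: start from the output score and walk back to the input.
# ReLU passes relevance through unchanged for its active units.
R_h = lrp_epsilon(W2, h, y)
R_x = lrp_epsilon(W1, x, R_h)

# Aggregate unit-level input relevance into one score per frame.
frame_relevance = R_x.reshape(T, D).sum(axis=1)   # shape (T,)
```

A useful sanity check of the epsilon rule is (approximate) conservation: the per-frame relevance scores sum back to the output score `y`, so each frame's share can be read as its contribution to the prediction.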

2. RELATED WORK

Model Analysis. A few works have assessed temporal importance in a video. Huang et al. proposed an approach to identify the crucial motion information in a video based on the C3D model, and then used it to discard frames of the video that carry little motion information; Sigurdsson et al. analyzed action categories by measuring complexity at different levels, such as verb, object and motion complexity, and then composed those attributes to form the action class. Feichtenhofer et al. visualized the features learned by various models trained on optical flow to explain why a network fails in certain cases. On the other hand, the receptive field is typically used to determine the range a network can theoretically see in both the spatial and temporal dimensions. Luo et al. thoroughly studied the spatial receptive field for image classification and showed that the effective receptive field is much smaller than the theoretical one. In contrast, our work proposes an approach to quantify the learned temporal relevance between each frame pair, and uses it to understand how model architecture affects temporal relevance.

Explainability. Another popular research direction is to explain the decisions made by a model through visualization of class activation maps, e.g., CAM (Zhou et al., 2016), Grad-

