TEMPORAL RELEVANCE ANALYSIS FOR VIDEO ACTION MODELS

Anonymous authors
Paper under double-blind review

Abstract

In this paper, we provide a deep analysis of temporal modeling for action recognition, an important but underexplored problem in the literature. We first propose a new approach, based on layer-wise relevance propagation, to quantify the temporal relationships between frames captured by CNN-based action models. We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected by various factors such as the dataset, network architecture, and input frames. With this, we further study several important questions for action recognition that lead to interesting findings. Our analysis shows that there is no strong correlation between temporal relevance and model performance, and that action models tend to capture local temporal information rather than long-range dependencies.

1. INTRODUCTION

State-of-the-art action recognition systems are mostly based on deep learning. Popular CNN-based approaches either model spatial and temporal information jointly by 3D convolutions (Carreira & Zisserman, 2017; Feichtenhofer, 2020; Tran et al., 2015) or, more efficiently, separate spatial and temporal modeling with 2D convolutions (Fan et al., 2019; Lin et al., 2019). One of the fundamental components of action recognition is temporal modeling, which involves learning temporal relationships between frames. Despite the significant progress made on action recognition, our understanding of temporal modeling is still significantly lacking and some important questions remain unanswered. For example, how does an action model learn relationships between frames? Can we quantify the amount of temporal relationships learned by a model? Stronger backbones in general lead to better recognition accuracy (Chen et al., 2021; Zhu et al., 2020), but do they learn temporal information better? Do models capture long-range temporal information across frames? In this paper, we provide a deep analysis of temporal modeling for action recognition.

Previous works focus on performance benchmarking (Chen et al., 2021; Zhu et al., 2020), spatio-temporal feature visualization (Feichtenhofer et al., 2020; Selvaraju et al., 2017), or saliency analysis (Bargal et al., 2018; Hiley et al., 2019b; Roy et al., 2019; Wang et al., 2018) to gain a better understanding of action models. For example, comprehensive studies of CNN-based models have been conducted recently in (Chen et al., 2021; Zhu et al., 2020) to compare the performance of different action models. Others (Monfort et al., 2019b; Selvaraju et al., 2017; Zhou et al., 2016) focus on visualizing the evidence used to make specific predictions, sometimes posed as understanding the relevance of each pixel to the recognition.
In contrast, we aim to understand how temporal information is captured by action models, i.e., temporal dependencies between frames or how a frame relates to other frames in a video clip. In this work, we propose a new approach to evaluate the effectiveness of temporal modeling based on layer-wise relevance propagation (Gu et al., 2018; Montavon et al., 2019), a popular technique widely used for explaining deep learning models. Our approach studies temporal relationships between frames in an action model and quantifies the amount of temporal dependencies captured by the model, which is referred to as action temporal relevance (ATR) here (Sec. 3.2). Fig. 1 illustrates our approach. We conduct comprehensive experiments on popular video benchmark datasets such as Kinetics400 (Kay et al., 2017) and Something-Something (Goyal et al., 2017) based on several representative CNN models including I3D (Carreira & Zisserman, 2017), TAM (Fan et al., 2019) and SlowFast (Feichtenhofer et al., 2018). Our experiments provide deep analysis of how
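To make the idea concrete, the sketch below shows how layer-wise relevance propagation can attribute a model's output score back to individual input frames and yield a normalized per-frame relevance distribution. This is a minimal illustration on a toy two-layer linear "action model" using the LRP-epsilon rule; the toy architecture, the weights, and the per-frame aggregation are all illustrative assumptions, not the paper's actual ATR definition (which is given in Sec. 3.2).

```python
import numpy as np

# Toy setup: T frames, each with a D-dim feature, flattened into one input.
# The two-layer network (input -> ReLU hidden -> scalar score) is an
# assumed stand-in for a real action model.
rng = np.random.default_rng(0)
T, D, H = 8, 16, 32                       # frames, feature dim, hidden dim

x = rng.standard_normal(T * D)            # flattened per-frame features
W1 = rng.standard_normal((H, T * D)) * 0.1
W2 = rng.standard_normal(H) * 0.1

def lrp_epsilon(x, W1, W2, eps=1e-6):
    """Propagate the scalar output score back to the input (LRP-epsilon rule)."""
    a1 = np.maximum(W1 @ x, 0.0)          # hidden activations (ReLU)
    y = W2 @ a1                           # scalar class score
    # Output -> hidden: each hidden unit gets relevance proportional
    # to its contribution z2_j = W2_j * a1_j to the score.
    z2 = W2 * a1
    s2 = z2.sum()
    r1 = z2 * y / (s2 + eps * np.sign(s2 + (s2 == 0)))
    # Hidden -> input: redistribute each hidden unit's relevance over
    # its input contributions z1[j, i] = W1[j, i] * x[i].
    z1 = W1 * x                           # shape (H, T*D)
    s1 = z1.sum(axis=1)
    denom = s1 + eps * np.where(s1 >= 0, 1.0, -1.0)
    r0 = (z1 * (r1 / denom)[:, None]).sum(axis=0)
    return r0

r = lrp_epsilon(x, W1, W2)
frame_relevance = np.abs(r.reshape(T, D)).sum(axis=1)
frame_relevance /= frame_relevance.sum()  # normalized per-frame relevance
print(frame_relevance.round(3))
```

A distribution concentrated on a few frames would indicate that the toy model relies on local temporal evidence, whereas a flat distribution would indicate relevance spread across the whole clip; the paper's ATR score builds a quantitative measure on top of such per-frame relevances.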

