DENSE CORRELATION FIELDS FOR MOTION MODELING IN ACTION RECOGNITION

Anonymous

Abstract

The challenge of action recognition is to capture motion information for reasoning about actions. Compared with spatial convolution for appearance, the temporal component provides an additional (and important) clue for motion modeling, as many actions can be reliably recognized from motion information alone. In this paper, we present an effective and interpretable module, Dense Correlation Fields (DCF), which builds dense visual correlation volumes at the feature level to model different motion patterns explicitly. To achieve this goal, we rely on a spatially hierarchical architecture that preserves both the fine local information provided by lower layers and the high-level semantic information from deeper layers. Our method fuses spatially hierarchical correlation with long-term temporal correlation, which makes it better suited to small objects and large displacements. The module is extensible and can be plugged into many backbone architectures to accurately predict object interactions in video. DCF shows consistent improvements over 2D-CNN and 3D-CNN baseline networks, with gains of 3.7% and 3.0% respectively on the standard video action benchmark SSV1.

1. INTRODUCTION

Action recognition is a fundamental problem in video understanding (Karpathy et al., 2014; Laptev et al., 2008). Unlike image classification, action recognition must distinguish variations in visual tempo as well as semantic appearance. Recently, great progress has been made by deep learning based models in improving the accuracy of video action recognition (Feichtenhofer et al., 2019; Jiang et al., 2019; Yang et al., 2020a). CNNs for video understanding have been extended to capture not only the appearance information contained in individual frames but also the motion information carried by the temporal dimension of the image sequence. One common approach to action recognition is the two-stream network (Simonyan & Zisserman, 2014; Crasto et al., 2019; Feichtenhofer et al., 2016; Qiu et al., 2019), where one stream operates on raw frames to extract appearance information and the other leverages optical flow to learn motion information. An alternative strategy implicitly uses 3D CNNs (Carreira & Zisserman, 2017; Tran et al., 2015a; Feichtenhofer et al., 2019) or temporal convolution (Tran et al., 2018; Xie et al., 2018), as these methods can jointly capture spatial and temporal information in a unified spatiotemporal framework. Other methods extend 2D-CNN backbones with temporal modules (Lin et al., 2019; Li et al., 2020b; Meng et al., 2021; Liu et al., 2021b) to learn motion information. However, the performance of previous action recognition systems is limited by difficulties such as small objects and large displacements (fast-moving objects). One conundrum for these methods is that high-level features are semantically strong but spatially coarse, whereas fine spatial features are crucial for capturing motion information. As the examples in Figure 1 show, strong semantic features and fine spatial features are both key to distinguishing action classes.
This paper proposes Dense Correlation Fields (DCF), a new temporal module for motion modeling. The correlation operator captures motion information by computing the alignment of visually similar image regions between frames. Visual correlation is closely related to optical flow, which is the most important clue for capturing motion patterns. The correlation operator can serve as approximate motion information, and it has proven effective for action recognition in CorrNet (Wang et al., 2020). DCF combines low-resolution, semantically strong correlation features with high-resolution, semantically weak correlation features to recover different motion patterns. Our DCF makes two main contributions: (1) correlation aggregation over a spatial pyramidal hierarchy; (2) a cross-frame correlation volume carrying both short-term and long-term temporal information. DCF enables the efficient integration of spatial and semantic information for motion modeling throughout the network. The design of DCF draws inspiration from many existing works but is substantially novel. First, DCF builds dense fields by combining features from a spatial pyramidal correlation hierarchy. This differs from the individual temporal modules applied at multiple stages in prior works (Wang et al., 2020; 2018b; Huang et al., 2021). Temporal modeling normally represents short-term motion between adjacent frames at low-level features and long-term temporal aggregation at high-level features; these methods thus assign motion modeling at different stages to different kinds of motion. In practice, this fails when the motion determined at one scale is too spatially coarse to approximate the correct motion at a finer scale. Our DCF instead uses the spatial pyramid hierarchy to enrich the spatially coarse but semantically strong correlation features with spatially finer correlation features.
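To make the correlation operator concrete, the sketch below computes a dense correlation volume between two frames' feature maps: each spatial position in one frame is compared, via a channel-normalised dot product, with a local neighbourhood of positions in the other frame. This is a minimal NumPy illustration of the general technique, not the paper's implementation; the function name, the `max_disp` search radius, and the zero-padding and normalisation choices are our own assumptions.

```python
import numpy as np

def local_correlation(feat_a, feat_b, max_disp=3):
    """Correlate every position of feat_a with a (2*max_disp+1)^2
    neighbourhood of feat_b.  Inputs are (C, H, W) feature maps;
    the result is a ((2*max_disp+1)**2, H, W) correlation volume."""
    c, h, w = feat_a.shape
    k = 2 * max_disp + 1
    # Zero-pad feat_b so every displacement stays within bounds.
    padded = np.pad(feat_b, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    vol = np.empty((k * k, h, w), dtype=feat_a.dtype)
    for idx in range(k * k):
        dy, dx = divmod(idx, k)  # displacement in [0, k) per axis
        shifted = padded[:, dy:dy + h, dx:dx + w]
        # Channel-wise dot product, normalised by the channel count.
        vol[idx] = (feat_a * shifted).sum(axis=0) / c
    return vol
```

A high response at displacement (dy, dx) for position (y, x) indicates that the region moved by roughly that offset between the two frames, which is why such volumes approximate optical flow. Building this volume at each level of a spatial feature pyramid yields the hierarchy the paragraph above describes.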
The principal advantage of featuring each level of a correlation pyramid is that it produces a multi-scale motion representation in which all levels are spatially fine, including the low-resolution levels. Second, DCF maintains cross-frame correlation with both short-term and long-term temporal information. While CorrNet only uses the correlation between adjacent frames, we compute correlations across multiple frames to form long-term temporal information, following the strategy of previous methods (Wang et al., 2021; 2018b). DCF thus provides the network with long-term motion information by operating on a cross-frame correlation volume. DCF can be applied to different backbone architectures as a plug-in module. We construct DCF networks with two backbones, R(2+1)D (Tran et al., 2018) and X3D (Feichtenhofer, 2020). To evaluate the proposed method in terms of modeling motion variations, we conduct experiments on the Something-Something dataset (Goyal et al., 2017), which is well known to be challenging for action classification because of its temporal complexity. In addition, we validate the various design choices of DCF through extensive ablation studies. Moreover, we report performance on the Kinetics-400 dataset (Kay et al., 2017) to compare the proposed method with many state-of-the-art methods.
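The cross-frame idea above can be sketched in the same spirit: instead of correlating only adjacent frames, each frame is correlated with frames at several temporal offsets. The snippet below is a simplified illustration under our own assumptions (the offset set, the end-of-clip clamping, and the restriction to spatially aligned positions, i.e. the centre of a full correlation volume with the displacement search omitted for brevity); it is not the paper's implementation.

```python
import numpy as np

def cross_frame_correlation(feats, offsets=(1, 2, 4)):
    """feats: a (T, C, H, W) feature sequence.  For each frame t and
    temporal offset d, compute the channel-normalised dot product with
    frame min(t + d, T - 1) at spatially aligned positions.  Small
    offsets capture short-term motion; large offsets capture long-term
    motion.  Returns a (T, len(offsets), H, W) array."""
    t_len, c = feats.shape[:2]
    out = np.empty((t_len, len(offsets)) + feats.shape[2:], dtype=feats.dtype)
    for t in range(t_len):
        for i, d in enumerate(offsets):
            u = min(t + d, t_len - 1)  # clamp at the end of the clip
            out[t, i] = (feats[t] * feats[u]).sum(axis=0) / c
    return out
```

Stacking the per-offset responses gives the network a view of both adjacent-frame and distant-frame similarity, which is the long-term signal a module restricted to adjacent frames cannot provide.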

2. RELATED WORK

Action Recognition. Action recognition research has been largely driven by learned features and various learning models built on deep networks. Two-stream CNNs (Simonyan & Zisserman, 2014), with one stream of static images and the other of optical flow, were proposed to fuse appearance and motion information. Temporal Segment Networks (Wang et al., 2016) sample frames and optical flow over different time segments to extract information for activity recognition. 3D CNNs (Ji et al., 2012; Tran et al., 2015b) introduced 3D convolution to learn spatiotemporal features directly from videos. Several variants decompose a 3D convolution into a 2D convolution and a 1D temporal convolution, for example P3D (Qiu et al., 2017), R(2+1)D (Tran et al., 2018), S3D (Xie et al., 2018), and CT-Net (Li et al., 2021). Recently, the great success of image Transformers has led to investi-



Figure 1: The action examples above show a small object and a large displacement (fast-moving object) from SSV1 validation videos. (a) This example is captioned 'Moving something across a surface until it falls down'. There is another, similar action class in which the object does not fall down; the motion of the pen on the box is crucial to distinguishing the two. (b) This example shows a fast-tempo action with large displacement across frames.

