ADAFUSE: ADAPTIVE TEMPORAL FUSION NETWORK FOR EFFICIENT ACTION RECOGNITION

Abstract

Temporal modelling is key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can save significant computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with the current pruned feature maps, with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something-Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with accuracy comparable to state-of-the-art methods.

1. INTRODUCTION

Over the last few years, video action recognition has made rapid progress with the introduction of a number of large-scale video datasets (Carreira & Zisserman, 2017; Monfort et al., 2018; Goyal et al., 2017). Despite impressive results on commonly used benchmark datasets, efficiency remains a great challenge for many resource-constrained applications due to the heavy computational burden of deep Convolutional Neural Network (CNN) models. Motivated by the need for efficiency, extensive studies have recently been conducted that focus on either designing new lightweight architectures (e.g., R(2+1)D (Tran et al., 2018), S3D (Xie et al., 2018), channel-separated CNNs (Tran et al., 2019)) or selecting salient frames/clips conditioned on the input (Yeung et al., 2016; Wu et al., 2019b; Korbar et al., 2019; Gao et al., 2020). However, most existing approaches do not consider the fact that there exists redundancy in CNN features, which can be exploited to save significant computation and lead to more efficient action recognition. In particular, orthogonal to the design of compact models, the computational cost of a CNN model also has much to do with the redundancy of its features (Han et al., 2019). Furthermore, the amount of redundancy depends on the dynamics and type of events in the video: a set of still frames for a simple action (e.g. "Sleeping") will have higher redundancy compared to a fast-changing action with rich interaction and deformation (e.g. "Pulling two ends of something so that it gets stretched"). Thus, based on the input, we could compute just a subset of features, while the rest of the channels reuse history feature maps or are even skipped without losing any accuracy, resulting in large computational savings compared to computing all the features at a given CNN layer.
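The intuition above can be sketched as a per-channel fusion step. This is a minimal illustrative sketch, not the paper's implementation: the function name, the integer encoding of {keep, reuse, skip}, and the array shapes are all assumptions made for clarity (in the actual network, these decisions gate convolutions inside each block).

```python
import numpy as np

# Hypothetical encoding of the three per-channel decisions.
KEEP, REUSE, SKIP = 0, 1, 2

def adaptive_fuse(current, history, policy):
    """Fuse current and history feature maps channel by channel.

    current: (C, H, W) feature map computed at time t
    history: (C, H, W) feature map carried over from time t-1
    policy:  length-C sequence of {KEEP, REUSE, SKIP} decisions
    """
    out = np.empty_like(current)
    for c, p in enumerate(policy):
        if p == KEEP:       # compute this channel normally
            out[c] = current[c]
        elif p == REUSE:    # copy from history; its convolution can be saved
            out[c] = history[c]
        else:               # SKIP: drop the channel entirely (zeros)
            out[c] = 0.0
    return out

current = np.ones((3, 2, 2))
history = np.full((3, 2, 2), 2.0)
fused = adaptive_fuse(current, history, [KEEP, REUSE, SKIP])
```

Every channel assigned REUSE or SKIP is a channel whose convolution need not be computed at time t, which is where the savings come from.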
Based on this intuition, we present a new perspective for efficient action recognition by adaptively deciding what channels to compute or reuse, on a per-instance basis, for recognizing complex actions. In this paper, we propose AdaFuse, an adaptive temporal fusion network that learns a decision policy to dynamically fuse channels from current and history feature maps for efficient action recognition. Specifically, our approach reuses history features when necessary (i.e., it dynamically decides which channels to keep, reuse or skip per layer and per instance) with the goal of improving both recognition accuracy and efficiency. As these decisions are discrete and non-differentiable, we rely on a Gumbel-Softmax sampling approach (Jang et al., 2016) to learn the policy jointly with the network parameters through standard back-propagation, without resorting to complex reinforcement learning as in (Wu et al., 2019b; Fan et al., 2018; Yeung et al., 2016). We design the loss to achieve both the competitive performance and the resource efficiency required for action recognition. Extensive experiments on multiple benchmarks show that AdaFuse significantly reduces computation without accuracy loss. The main contributions of our work are as follows:

• We propose a novel approach that automatically determines which channels to keep, reuse or skip per layer and per target instance for efficient action recognition.

• Our approach is model-agnostic, which allows it to serve as a plugin operation for a wide range of 2D CNN-based action recognition architectures.

• The overall policy distribution can be seen as an indicator of dataset characteristics, and the block-level distribution can provide potential guidance for future architecture designs.

• We conduct extensive experiments on four benchmark datasets (Something-Something V1 (Goyal et al., 2017), Something-Something V2 (Mahdisoltani et al., 2018), Jester (Materzynska et al., 2019) and Mini-Kinetics (Kay et al., 2017)) to demonstrate the superiority of our proposed approach over state-of-the-art methods.

2. RELATED WORK

Efficient Action Recognition. 3D CNN models (Hara et al., 2018), which use 3D convolutions to model space and time jointly, have been introduced for action recognition. SlowFast (Feichtenhofer et al., 2018) employs two pathways to capture temporal information by processing a video at both slow and fast frame rates. Recently, STM (Jiang et al., 2019) proposes new channel-wise convolutional blocks to jointly capture spatio-temporal and motion information in consecutive frames. TEA (Li et al., 2020b) introduces a motion excitation module including multiple temporal aggregation modules to capture both short- and long-range temporal evolution in videos. Gate-Shift networks (Sudhakaran et al., 2020) use spatial gating for spatial-temporal decomposition of 3D kernels in Inception-based architectures. Another line of work selects salient frames or clips conditioned on the input (Yeung et al., 2016; Wu et al., 2019b; Korbar et al., 2019; Gao et al., 2020). Our approach is most related to the latter, which focuses on conditional computation and is agnostic to the network architecture used for recognizing actions. However, instead of focusing on data sampling, our approach dynamically fuses channels from current and history feature maps to reduce the computation. Furthermore, as feature maps can be redundant or noisy, we use a skipping operation to make recognition more efficient.

Conditional Computation. Many conditional computation methods have been recently proposed with the goal of improving computational efficiency (Bengio et al., 2015; 2013; Veit & Belongie, 2018; Wang et al., 2018b; Graves, 2016; Meng et al., 2020; Pan et al., 2021). Several works have been
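The Gumbel-Softmax sampling mentioned in the introduction can be illustrated in a few lines. This is a generic sketch of the relaxation from Jang et al. (2016), not the paper's policy network; the temperature value and the keep/reuse/skip interpretation of the logits are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed (differentiable) sample from a categorical distribution.

    logits: (K,) unnormalized log-probabilities over K choices,
            e.g. keep / reuse / skip for one channel.
    tau:    temperature; lower values push samples closer to one-hot.
    """
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(1e-10, 1.0, size=np.shape(logits))
    gumbels = -np.log(-np.log(u))
    y = (np.asarray(logits) + gumbels) / tau
    # Softmax over perturbed logits yields a relaxed one-hot vector.
    e = np.exp(y - y.max())
    return e / e.sum()

# Example: hypothetical logits for keep / reuse / skip of one channel.
soft = gumbel_softmax(np.array([2.0, 0.5, -1.0]), tau=0.5)
hard = int(np.argmax(soft))  # discrete decision used in the forward pass
```

During training, gradients flow through the soft sample (here `soft`), while the discrete argmax is what actually gates computation, which is why the policy can be learned with standard back-propagation rather than reinforcement learning.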


