ADAFUSE: ADAPTIVE TEMPORAL FUSION NETWORK FOR EFFICIENT ACTION RECOGNITION

Abstract

Temporal modelling is key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly reduce computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with the current pruned feature maps with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something-Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods.

1. INTRODUCTION

Over the last few years, video action recognition has made rapid progress with the introduction of a number of large-scale video datasets (Carreira & Zisserman, 2017; Monfort et al., 2018; Goyal et al., 2017). Despite impressive results on commonly used benchmark datasets, efficiency remains a great challenge for many resource-constrained applications due to the heavy computational burden of deep Convolutional Neural Network (CNN) models. Motivated by the need for efficiency, extensive recent studies have focused either on designing new lightweight architectures (e.g., R(2+1)D (Tran et al., 2018), S3D (Xie et al., 2018), channel-separated CNNs (Tran et al., 2019)) or on selecting salient frames/clips conditioned on the input (Yeung et al., 2016; Wu et al., 2019b; Korbar et al., 2019; Gao et al., 2020). However, most existing approaches do not consider that CNN features contain redundancy, and that exploiting this redundancy can significantly reduce computation, leading to more efficient action recognition. In particular, orthogonal to the design of compact models, the computational cost of a CNN model also has much to do with the redundancy of CNN features (Han et al., 2019). Furthermore, the amount of redundancy depends on the dynamics and type of events in the video: a set of still frames for a simple action (e.g., "Sleeping") will have higher redundancy than a fast-changing action with rich interaction and deformation (e.g., "Pulling two ends of something so that it gets stretched"). Thus, based on the input, we could compute just a subset of features, while the rest of the channels can reuse history feature maps or even be skipped without losing any accuracy, resulting in large computational savings compared to computing all the features at a given CNN layer.
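To make the idea in the paragraph above concrete, the following is a minimal sketch, not the authors' implementation, of fusing one layer's channels under a per-channel choice between computing, reusing, and skipping. The names `Action`, `fuse_channels`, `compute_channel`, and `computed_fraction` are hypothetical and introduced only for illustration. The policy is supplied by hand rather than learned, channels are plain Python lists, and a skipped channel is assumed to be filled with zeros.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"    # compute this channel from the current frame
    REUSE = "reuse"  # copy this channel from the history feature map
    SKIP = "skip"    # spend no computation; output zeros (an assumption)


def fuse_channels(compute_channel, history, policy):
    """Build the fused feature map for one layer, one channel at a time.

    compute_channel: hypothetical callback, channel index -> list of floats.
                     It is called only for channels marked KEEP, which is
                     where the computational saving comes from.
    history:         list of channels from the previous time step.
    policy:          list of Action, one entry per channel.
    """
    if len(history) != len(policy):
        raise ValueError("history and policy must have one entry per channel")
    fused = []
    for index, (past, action) in enumerate(zip(history, policy)):
        if action is Action.KEEP:
            fused.append(compute_channel(index))
        elif action is Action.REUSE:
            fused.append(list(past))
        else:
            fused.append([0.0] * len(past))
    return fused


def computed_fraction(policy):
    """Fraction of channels that actually had to be computed."""
    return sum(action is Action.KEEP for action in policy) / len(policy)


# Toy example: three channels with two values each.
history = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
policy = [Action.KEEP, Action.REUSE, Action.SKIP]
calls = []  # records which channels were actually computed


def compute_channel(index):
    calls.append(index)
    return [10.0 * (index + 1)] * 2


fused = fuse_channels(compute_channel, history, policy)
print(fused)
print(calls)
```

In this toy run only the first channel is computed; the second is copied from history and the third is zero-filled, so one third of the channel computation is performed. How the policy is chosen for each input is outside the scope of this sketch.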
Based on this intuition, we present a new perspective for efficient action recognition: adaptively deciding which channels to compute or reuse, on a per-instance basis, for recognizing complex actions. In this paper, we propose AdaFuse, an adaptive temporal fusion network that learns a decision policy to dynamically fuse channels from current and history feature maps for efficient action recognition. Specifically, our approach reuses history features when necessary (i.e., it dynamically decides which channels to keep, reuse or skip per layer and per instance) with the goal of improving both recognition

availability

//mengyuest.github.

