VA-RED²: VIDEO ADAPTIVE REDUNDANCY REDUCTION

Abstract

Performing inference on deep learning models for videos remains a challenge due to the large amount of computational resources required to achieve robust recognition. An inherent property of real-world videos is the high correlation of information across frames, which can translate into redundancy in the temporal or spatial feature maps of the models, or both. The type of redundant features depends on the dynamics and type of events in the video: static videos have more temporal redundancy, while videos focusing on objects tend to have more channel redundancy. Here we present an input-dependent redundancy reduction framework, termed VA-RED². Specifically, our VA-RED² framework uses an input-dependent policy to decide how many features need to be computed along the temporal and channel dimensions. To keep the capacity of the original model, after fully computing the necessary features, we reconstruct the remaining redundant features from the computed ones using cheap linear operations. We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism, making it highly efficient. Extensive experiments on multiple video datasets and different visual tasks show that our framework achieves a 20%-40% reduction in computation (FLOPs) compared to state-of-the-art methods without any performance loss.

1. INTRODUCTION

Large computationally expensive models based on 2D/3D convolutional neural networks (CNNs) are widely used in video understanding (Tran et al., 2015; Carreira & Zisserman, 2017; Tran et al., 2018). Thus, increasing computational efficiency is highly sought after (Feichtenhofer, 2020; Zhou et al., 2018c; Zolfaghari et al., 2018). However, most efficient approaches focus on architectural changes that maximize network capacity while maintaining a compact model (Zolfaghari et al., 2018; Feichtenhofer, 2020) or on improving the way the network consumes temporal information (Feichtenhofer et al., 2018; Korbar et al., 2019). Despite promising results, it is well known that CNNs perform unnecessary computations at some levels of the network (Han et al., 2015a; Howard et al., 2017; Sandler et al., 2018; Feichtenhofer, 2020; Pan et al., 2018); this is especially true for video models, since the high appearance similarity between consecutive frames results in a large amount of redundancy. In this paper, we aim to dynamically reduce the internal computations of popular video CNN architectures. Our motivation comes from the existence of highly similar feature maps across both the time and channel dimensions in video models. Furthermore, this internal redundancy varies with the input: for instance, static videos have more temporal redundancy, whereas videos depicting a single large moving object tend to produce more redundant channel feature maps. To reduce this varied redundancy across the channel and temporal dimensions, we introduce an input-dependent redundancy reduction framework called VA-RED² (Video Adaptive REDundancy REDuction) for efficient video recognition (see Figure 1 for an illustrative example). Our method is model-agnostic and hence can be applied to any state-of-the-art video recognition network.
The key mechanism VA-RED² uses to increase efficiency is to replace the full computation of redundant feature maps with cheap reconstruction operations. Specifically, instead of computing all feature maps, our framework computes only the non-redundant feature maps and reconstructs the rest from them using cheap linear operations. In addition, VA-RED² makes decisions on a per-input basis: our framework learns an input-dependent policy that defines a "full computation ratio" for each layer of a 2D/3D network. This ratio determines how many features are fully computed at that layer, versus reconstructed from the non-redundant feature maps. Importantly, we apply this strategy along both the time and channel dimensions. We show that for both traditional video models, such as I3D (Carreira & Zisserman, 2017) and R(2+1)D (Tran et al., 2018), and more advanced models, such as X3D (Feichtenhofer, 2020), this method significantly reduces the total floating point operations (FLOPs) on common video datasets without accuracy degradation. The main contributions of our work include: (1) A novel input-dependent adaptive framework for efficient video recognition, VA-RED², that automatically decides which feature maps to compute for each input instance. Our approach is in contrast to most current video processing networks, where feature redundancy across both the time and channel dimensions is not directly mitigated. (2) An adaptive policy, jointly learned with the network weights in a fully differentiable way via a shared-weight mechanism, that decides how many feature maps to compute. Our approach is model-agnostic and can be applied to any backbone to reduce feature redundancy in both the time and channel domains.
(3) Striking results of VA-RED² over baselines: a 30% reduction in computation compared to R(2+1)D (Tran et al., 2018), a 40% reduction over I3D-InceptionV2 (Carreira & Zisserman, 2017), and about a 20% reduction over the recently proposed X3D-M (Feichtenhofer, 2020), all without any performance loss on the video action recognition task. The superiority of our approach is extensively tested on three video recognition datasets (Mini-Kinetics-200, Kinetics-400 (Carreira & Zisserman, 2017), and Moments-In-Time (Monfort et al., 2019)) and one spatio-temporal action localization dataset (J-HMDB-21 (Jhuang et al., 2013)). (4) A generalization of our framework to video action recognition, spatio-temporal localization, and semantic segmentation tasks, achieving promising results while offering significant reductions in computation over competing methods.
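The core operation described above, fully computing only a fraction of a layer's output feature maps and reconstructing the rest with cheap linear operations, can be sketched in a few lines. The following is a minimal NumPy sketch, not the paper's implementation: the function names, the per-channel scaling used as the "cheap operation", and the variance-based heuristic standing in for the learned differentiable policy are all illustrative assumptions.

```python
import numpy as np

def ratio_policy(x, lo=0.25, hi=1.0):
    """Stand-in for the learned input-dependent policy (illustrative only).

    Low temporal variation (a near-static clip) suggests high temporal
    redundancy, so fewer feature maps are computed fully. In VA-RED^2 the
    policy is learned jointly with the network weights in a differentiable
    way; this heuristic merely keeps the sketch self-contained.
    """
    temporal_var = float(np.var(np.diff(x, axis=1))) if x.shape[1] > 1 else 0.0
    return hi if temporal_var > 0.5 else lo

def reduced_layer(x, w_full, w_cheap):
    """Fully compute a ratio of output channels; reconstruct the rest cheaply.

    x:       (C_in, T) input feature map (channels x time)
    w_full:  (C_out, C_in) weights of the expensive transform
    w_cheap: (C_out,) per-channel scales for the cheap reconstruction
    """
    c_out = w_full.shape[0]
    ratio = ratio_policy(x)
    k = max(1, int(round(ratio * c_out)))   # channels computed fully
    full = w_full[:k] @ x                   # expensive: k * C_in * T MACs
    # Reconstruct the remaining C_out - k channels from the computed ones
    # with a cheap linear op: only (C_out - k) * T MACs instead of matmuls.
    src = full[np.arange(c_out - k) % k]
    cheap = w_cheap[k:, None] * src
    return np.concatenate([full, cheap], axis=0)
```

With C_out = 16 and a policy ratio of 0.25, this layer costs 4·C_in·T + 12·T multiply-accumulates instead of 16·C_in·T, while the output keeps the shape the downstream layers expect; the same idea applies along the temporal dimension by swapping which axis is subsampled and reconstructed.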

2. RELATED WORK

Efficiency in Video Understanding Models. Video understanding has made significant progress in recent years, mainly due to the adoption of convolutional neural networks, in the form of 2D CNNs (Karpathy et al., 2014; Simonyan & Zisserman, 2014; Chéron et al., 2015; Feichtenhofer et al., 2017; Gkioxari & Malik, 2015; Wang et al., 2016; Zhou et al., 2018a; Lin et al., 2019; Fan et al., 2019) or 3D CNNs (Tran et al., 2015; Carreira & Zisserman, 2017; Hara et al., 2018; Tran et al., 2018). Despite promising results on common benchmarks, there is significant interest in developing more efficient techniques and smaller models with reasonable performance. Previous works have reduced computational complexity by using hybrid 2D-3D architectures (Xie et al., 2018; Zhou et al., 2018c; Zolfaghari et al., 2018), group convolution (Tran et al., 2019), or by selecting salient clips (Korbar et al., 2019). Feichtenhofer et al. (2018) propose a dedicated low-framerate pathway. Expansion of 2D architectures through a stepwise approach over key variables such as temporal duration, frame rate, spatial resolution, and network width is proposed in X3D (Feichtenhofer, 2020).



Figure 1: Our VA-RED² framework dynamically reduces redundancy in two dimensions. Example 1 (left) shows a case where the input video has little movement: the features in the temporal dimension are highly redundant, so our framework fully computes only a subset of features and reconstructs the rest with cheap linear operations. In the second example, we show that our framework can reduce computational complexity by performing a similar operation over channels: only part of the features along the channel dimension are computed, and cheap operations are used to generate the rest.

