VA-RED²: VIDEO ADAPTIVE REDUNDANCY REDUCTION

Abstract

Performing inference with deep learning models on videos remains a challenge due to the large amount of computation required for robust recognition. An inherent property of real-world videos is the high correlation of information across frames, which can translate into redundancy in the temporal or spatial feature maps of a model, or both. The type of redundant features depends on the dynamics and the type of events in the video: static videos have more temporal redundancy, while videos focusing on objects tend to have more channel redundancy. Here we present an input-dependent redundancy reduction framework, termed VA-RED². Specifically, VA-RED² uses an input-dependent policy to decide how many features need to be computed along the temporal and channel dimensions. To preserve the capacity of the original model, after fully computing the necessary features, we reconstruct the remaining redundant features from the computed ones using cheap linear operations. We learn the adaptive policy jointly with the network weights in a differentiable way using a shared-weight mechanism, making it highly efficient. Extensive experiments on multiple video datasets and different visual tasks show that our framework achieves a 20%-40% reduction in computation (FLOPs) compared to state-of-the-art methods without any loss in performance.

1. INTRODUCTION

Large, computationally expensive models based on 2D/3D convolutional neural networks (CNNs) are widely used in video understanding (Tran et al., 2015; Carreira & Zisserman, 2017; Tran et al., 2018). Thus, increasing computational efficiency is highly sought after (Feichtenhofer, 2020; Zhou et al., 2018c; Zolfaghari et al., 2018). However, most efficient approaches focus on architectural changes that maximize network capacity while maintaining a compact model (Zolfaghari et al., 2018; Feichtenhofer, 2020) or on improving the way the network consumes temporal information (Feichtenhofer et al., 2018; Korbar et al., 2019). Despite promising results, it is well known that CNNs perform unnecessary computations at some levels of the network (Han et al., 2015a; Howard et al., 2017; Sandler et al., 2018; Feichtenhofer, 2020; Pan et al., 2018); this is especially true for video models, since the high appearance similarity between consecutive frames results in a large amount of redundancy. In this paper, we aim to dynamically reduce the internal computations of popular video CNN architectures. Our motivation comes from the existence of highly similar feature maps across both the temporal and channel dimensions in video models. Furthermore, this internal redundancy varies with the input: for instance, static videos have more temporal redundancy, whereas videos depicting a single large moving object tend to produce more redundant feature maps across channels. To reduce this varying redundancy across the channel and temporal dimensions, we introduce an input-dependent redundancy reduction framework called VA-RED² (Video Adaptive REDundancy REDuction) for efficient video recognition (see Figure 1 for an illustrative example). Our method is model-agnostic and hence can be applied to any state-of-the-art video recognition network.
The key mechanism that VA-RED² uses to increase efficiency is to replace the full computation of some redundant feature maps with cheap reconstruction operations. Specifically, our framework avoids computing all of the feature maps. Instead, we compute only the non-redundant part of the feature maps and reconstruct the rest from the non-redundant features using cheap linear operations.
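The channel-dimension version of this idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `CheapChannelConv2d` and the fixed `ratio` parameter are hypothetical (in VA-RED² the ratio is chosen per input by a learned policy), and a 1x1 convolution stands in for the generic "cheap linear operation". A fraction of the output channels is computed with a regular convolution, and the remaining channels are reconstructed from them.

```python
# Minimal sketch (assumed names, not the paper's code) of channel-wise
# redundancy reduction: fully compute only `ratio` of the output channels,
# then reconstruct the remaining channels with a cheap linear operation.
import torch
import torch.nn as nn

class CheapChannelConv2d(nn.Module):
    """Compute a fraction of output channels expensively, synthesize the rest."""
    def __init__(self, in_ch, out_ch, kernel_size=3, ratio=0.5):
        super().__init__()
        self.primary_ch = max(1, int(out_ch * ratio))  # fully computed channels
        self.cheap_ch = out_ch - self.primary_ch       # reconstructed channels
        self.primary = nn.Conv2d(in_ch, self.primary_ch, kernel_size,
                                 padding=kernel_size // 2)
        # Cheap 1x1 conv reconstructs the redundant channels from the primary ones.
        self.cheap = nn.Conv2d(self.primary_ch, self.cheap_ch, kernel_size=1)

    def forward(self, x):
        primary = self.primary(x)          # expensive path
        cheap = self.cheap(primary)        # cheap linear reconstruction
        return torch.cat([primary, cheap], dim=1)

layer = CheapChannelConv2d(in_ch=16, out_ch=64, ratio=0.25)
out = layer(torch.randn(2, 16, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

With `ratio=0.25`, only 16 of the 64 output channels incur a full k x k convolution over all input channels; the cost of the remaining 48 is a single 1x1 map over those 16, which is where the FLOP savings come from. The temporal analogue computes features for a subset of frames and reconstructs the rest.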

