MAXIMIZING SPATIO-TEMPORAL ENTROPY OF DEEP 3D CNNS FOR EFFICIENT VIDEO RECOGNITION

Abstract

3D convolutional neural networks (CNNs) have been the prevailing option for video recognition. To capture temporal information, 3D convolutions are computed along the temporal dimension, leading to cubically growing and expensive computations. To reduce the computational cost, previous methods resort to manually designed 3D/2D CNN structures with approximations or to automatic search, which sacrifices modeling ability or makes training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach tailored for 3D CNNs, taking model complexity into account. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor to handle the discrepancy of visual information in the spatial and temporal dimensions, by dynamically leveraging the correlation between feature map size and kernel size depth-wisely. Highly efficient and expressive 3D CNN architectures, i.e., entropy-based 3D CNNs (the E3D family), can then be efficiently searched by maximizing the STEntr-Score under a given computational budget via an evolutionary algorithm, without training the network parameters. Extensive experiments on Something-Something V1&V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at https://github.com/alibaba/lightweight-neural-architecture-search.

1. INTRODUCTION

Video recognition is a fundamental task for video understanding. To capture the visual information in both the temporal and spatial domains of high-quality, large-scale videos, most works have focused on proposing highly expressive models, which, however, incur higher computational costs (Kondratyuk et al., 2021; Zhang et al., 2022; Li et al.). Recent research shows that 3D CNNs achieve excellent performance on large-scale benchmarks (Hara et al., 2018), with unified computations that capture spatio-temporal features jointly. However, the computational cost of standard 3D convolution grows cubically, making it prohibitive for high-resolution, long-duration videos. Previous works propose to improve the efficiency of 3D CNNs via manual 2D decomposition or approximation (Carreira & Zisserman, 2017; Tran et al., 2018; Feichtenhofer, 2020). Other attempts manually design efficient 3D CNNs relying on heuristics or experience (Hara et al., 2018; Feichtenhofer, 2020). Such manually designed 3D or 2D CNN structures cost massive effort and time to strengthen their modeling ability. Neural Architecture Search (NAS) approaches (Kondratyuk et al., 2021; Wang et al., 2020) can automatically generate 3D CNN architectures with higher modeling ability. However, searching for a single 3D architecture requires days on multiple GPUs or TPUs, as training and evaluation of an accuracy indicator are required in the process, making the automatic 3D CNN design process time-consuming and/or hardware-dependent. To tackle the above issues, we study how to automatically generate (or design) efficient and expressive 3D CNNs with limited computation. Recently, training-free techniques have been introduced by some approaches (Chen et al., 2021; Lin et al., 2021; Sun et al., 2022b), in which kernel spectrum analysis or forward inference is adopted to measure the expressiveness of spatial 2D CNNs.
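As a rough illustration (not taken from the paper), the multiply-accumulate cost of a standard 3D convolution layer can be compared with applying a 2D convolution frame by frame; the gap scales directly with the temporal extent of the kernel, which is what makes naive 3D stacks expensive on long, high-resolution clips:

```python
def conv3d_flops(t, h, w, c_in, c_out, k):
    """Multiply-accumulate count for one 3D convolution layer
    (stride 1, 'same' padding): every output voxel needs a
    k*k*k*c_in dot product."""
    return t * h * w * c_out * (k ** 3) * c_in

def conv2d_flops(h, w, c_in, c_out, k):
    """Per-frame 2D convolution cost, for comparison."""
    return h * w * c_out * (k ** 2) * c_in

# Example: 16-frame 56x56 feature map, 64 -> 64 channels, kernel size 3.
f3d = conv3d_flops(16, 56, 56, 64, 64, 3)
f2d = 16 * conv2d_flops(56, 56, 64, 64, 3)  # 2D applied frame by frame
print(f3d / f2d)  # -> 3.0: the extra factor is exactly the kernel depth
```

Here the ratio equals the temporal kernel size; larger temporal kernels or denser temporal sampling widen the gap further, which motivates the efficiency-oriented designs discussed above.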
Inspired by the training-free concept and information theory, we suggest that a deep network can be regarded as an information system, and that measuring the expressiveness of the network is equivalent to analyzing how much information it can capture. According to the Maximum Entropy Principle (Jaynes, 1957), the probability distribution of the system that best represents the current state of knowledge is the one with the highest entropy. However, as discussed in (Xie et al., 2018), the information in the spatial and temporal domains differs in natural video data. The spatial dimension is usually governed by local properties, such as connectivity (Claramunt, 2012), while the temporal dimension usually contains more drastic variations and more complex information. To address this spatio-temporal discrepancy, we conduct a kernel selection experiment and observe that different 3D kernel selections at different stages affect performance differently, and that the focus of 3D CNNs shifts from spatial information to spatio-temporal information as the network depth increases. We thus consider that 3D CNN architecture design should focus on spatio-temporal aggregation depth-wisely. The above analysis motivates us to propose a training-free NAS approach to obtain optimal architectures, i.e., entropy-based 3D CNNs (the E3D family). Concretely, we first formulate a 3D CNN-based architecture as an information system whose expressiveness can be measured by the value of its differential entropy. We then derive the upper bound of the differential entropy using an analytic formulation, named the Spatio-Temporal Entropy Score (STEntr-Score), conditioned on spatio-temporal aggregation by dynamically measuring the correlation between feature map size and kernel size depth-wisely. Finally, an evolutionary algorithm is employed to identify the optimal architecture using the STEntr-Score, without training network parameters during the search.
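The training-free search loop described above can be sketched as follows. Everything here is an illustrative assumption rather than the paper's implementation: `evolve`, `toy_score`, and the mutation rule are hypothetical names, and the toy log-width score merely stands in for the analytic STEntr-Score. The key property it demonstrates is that candidates are ranked by an analytic score and filtered by a computational budget, with no parameter training anywhere in the loop:

```python
import math
import random

def evolve(score_fn, flops_fn, budget, init_arch, mutate, iters=500, pop=32):
    """Training-free evolutionary search sketch: mutate a random member,
    accept the child only if it fits the FLOPs budget, and keep the
    population bounded by evicting the lowest-scoring architecture."""
    population = [init_arch]
    for _ in range(iters):
        parent = random.choice(population)
        child = mutate(parent)
        if flops_fn(child) > budget:
            continue  # reject architectures over the computational budget
        population.append(child)
        if len(population) > pop:
            # survival of the fittest by score; no training involved
            population.remove(min(population, key=score_fn))
    return max(population, key=score_fn)

# Toy usage: an "architecture" is a tuple of stage widths; the score is a
# stand-in for the STEntr-Score and the FLOPs model is deliberately crude.
toy_score = lambda a: sum(math.log(c) for c in a)
toy_flops = lambda a: sum(c * c for c in a)

def toy_mutate(a):
    i = random.randrange(len(a))
    b = list(a)
    b[i] = max(8, b[i] * 2 if random.random() < 0.5 else b[i] // 2)
    return tuple(b)

best = evolve(toy_score, toy_flops, budget=200_000,
              init_arch=(16, 32, 64), mutate=toy_mutate, iters=300)
```

Because the score is analytic, each candidate is evaluated in microseconds on a CPU, which is what makes a search of this form feasible without GPUs.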
In summary, the key contributions of our work are as follows:

• We present a novel training-free neural architecture search approach to design efficient 3D CNN architectures. Instead of relying on forward-inference estimation, we calculate the differential entropy of a 3D CNN with an analytic formulation under the Maximum Entropy Principle.

• We investigate the characteristics of video data in the spatial and temporal domains and the correlation between feature map size and kernel selection, and then propose a spatio-temporal entropy score that estimates spatio-temporal aggregation dynamically, with a spatio-temporal refinement mechanism to handle the information discrepancy.

• Each model in the E3D family can be searched within three hours on a desktop CPU, and the models demonstrate state-of-the-art performance on various video recognition datasets.

2. RELATED WORK

Action recognition. 2D CNNs lack temporal modeling for video sequences, and many approaches (Wang et al., 2016; Lin et al., 2019; Li et al., 2020; Wang et al., 2021a; b; Huang et al., 2022) have focused on designing extended modules for temporal information learning. Meanwhile, 3D CNN-based frameworks have a spatio-temporal modeling capability, which improves model performance for video action recognition (Tran et al., 2015; Carreira & Zisserman, 2017; Feichtenhofer, 2020; Kondratyuk et al., 2021). Some attempts (Feichtenhofer, 2020; Fan et al., 2020; Kondratyuk et al., 2021) focused on designing efficient 3D CNN-based architectures. For example, X3D (Feichtenhofer, 2020) progressively expands a tiny 2D image classification architecture along multiple network axes: space, time, width, and depth. Our work also focuses on designing efficient 3D CNN-based architectures, but in a deterministic manner with entropy-based information criterion analysis.

Maximum Entropy Principle. The Principle of Maximum Entropy is one of the fundamental principles in physics and information theory (Shannon, 1948; Reza, 1994; Kullback, 1997; Brillouin, 2013). Accompanied by the widespread applications of deep learning, many theoretical studies (Saxe et al., 2019; Chan et al., 2021; Yu et al., 2020; Sun et al., 2022b) try to understand the success

