MAXIMIZING SPATIO-TEMPORAL ENTROPY OF DEEP 3D CNNS FOR EFFICIENT VIDEO RECOGNITION

ABSTRACT

3D convolutional neural networks (CNNs) have been the prevailing option for video recognition. To capture temporal information, 3D convolutions are computed along the time dimension, leading to computational costs that grow cubically and quickly become expensive. To reduce this cost, previous methods resort to manually designed 3D/2D CNN structures with approximations, which sacrifice modeling ability, or to automatic search, which makes training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach tailored for 3D CNNs that accounts for model complexity. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor to handle the discrepancy of visual information in the spatial and temporal dimensions, by dynamically leveraging the correlation between feature map size and kernel size depth-wisely. Highly efficient and expressive 3D CNN architectures, i.e., the entropy-based 3D CNN (E3D) family, can then be efficiently searched by maximizing the STEntr-Score under a given computational budget, via an evolutionary algorithm and without training the network parameters. Extensive experiments on Something-Something V1&V2 and Kinetics-400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at https://github.com/alibaba/lightweight-neural-architecture-search.
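The training-free search loop described above can be sketched as follows. This is a minimal illustration only: the `entropy_score` proxy, the `cost` model, and the width-based search space below are hypothetical stand-ins, not the paper's actual STEntr-Score, FLOPs model, or E3D search space.

```python
import math
import random

def entropy_score(widths):
    # Hypothetical expressiveness proxy (NOT the STEntr-Score):
    # wider layers contribute more "entropy".
    return sum(math.log(w) for w in widths)

def cost(widths):
    # Crude stand-in for a FLOPs budget: sum of products of consecutive widths.
    return sum(a * b for a, b in zip(widths, widths[1:]))

def mutate(widths):
    # Perturb one layer's width, keeping it at least 8 channels.
    w = list(widths)
    i = random.randrange(len(w))
    w[i] = max(8, w[i] + random.choice([-8, 8]))
    return w

def evolve(budget, n_layers=6, pop_size=20, iters=500, seed=0):
    # Evolutionary search without any network training: mutate candidates,
    # discard those over budget, keep the highest-scoring population.
    random.seed(seed)
    pop = [[16] * n_layers for _ in range(pop_size)]
    for _ in range(iters):
        cand = mutate(random.choice(pop))
        if cost(cand) <= budget:
            pop.append(cand)
            pop.sort(key=entropy_score, reverse=True)
            pop = pop[:pop_size]
    return pop[0]

best = evolve(budget=20000)
print(best, cost(best), round(entropy_score(best), 2))
```

Because candidates are scored analytically rather than by validation accuracy, each search step costs only a score evaluation, which is what makes the overall search training-free.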

1. INTRODUCTION

Video recognition is a fundamental task in video understanding. To capture visual information in both the temporal and spatial domains of high-quality, large-scale videos, most works have focused on proposing highly expressive models, which, however, incur higher computational costs (Kondratyuk et al., 2021; Zhang et al., 2022; Li et al.). Recent research shows that 3D CNNs achieve excellent performance on large-scale benchmarks (Hara et al., 2018), using unified computations to capture spatio-temporal features jointly. However, the computational cost of standard 3D convolution grows cubically, making it prohibitive for high-resolution, long-duration videos. Previous works improve the efficiency of 3D CNNs via manual 2D decomposition or approximation (Carreira & Zisserman, 2017; Tran et al., 2018; Feichtenhofer, 2020). Efficient 3D CNNs have also been designed manually, relying on heuristics or experience (Hara et al., 2018; Feichtenhofer, 2020). Such manually designed 3D or 2D CNN structures cost massive effort and time to strengthen their modeling ability. Neural Architecture Search (NAS) approaches (Kondratyuk et al., 2021; Wang et al., 2020) can automatically generate 3D CNN architectures with

