CT-NET: CHANNEL TENSORIZATION NETWORK FOR VIDEO CLASSIFICATION

Abstract

3D convolution is powerful for video classification, but it is often computationally expensive; recent studies therefore mainly focus on decomposing it along the spatial-temporal and/or channel dimensions. Unfortunately, most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency. For this reason, we propose a concise and novel Channel Tensorization Network (CT-Net), which treats the channel dimension of the input feature as a multiplication of K sub-dimensions. On one hand, this naturally factorizes convolution in a multi-dimensional way, leading to a light computation burden. On the other hand, it can effectively enhance feature interaction between different channels, and progressively enlarge the 3D receptive field of such interaction to boost classification accuracy. Furthermore, we equip our CT-Module with a Tensor Excitation (TE) mechanism, which learns to exploit spatial, temporal and channel attention in a high-dimensional manner, improving the cooperative power of all the feature dimensions in our CT-Module. Finally, we flexibly adapt ResNet as our CT-Net. Extensive experiments are conducted on several challenging video benchmarks, e.g., Kinetics-400, Something-Something V1 and V2. Our CT-Net outperforms a number of recent SOTA approaches in terms of accuracy and/or efficiency.

1. INTRODUCTION

3D convolution has been widely used to learn spatial-temporal representations for video classification (Tran et al., 2015; Carreira & Zisserman, 2017). However, over-parameterization often makes it computationally expensive and hard to train. To alleviate this difficulty, recent studies mainly focus on decomposing 3D convolution (Tran et al., 2018; 2019). One popular approach is spatial-temporal factorization (Qiu et al., 2017; Tran et al., 2018; Xie et al., 2018), which can reduce overfitting by replacing 3D convolution with a 2D spatial convolution and a 1D temporal convolution. But it still introduces unnecessary computation, since both the spatial and temporal convolutions are performed over all the feature channels. To further decrease this cost, channel separation has recently been developed by operating 3D convolution in a depth-wise manner (Tran et al., 2019). However, it inevitably loses accuracy due to the lack of feature interaction between different channels; as compensation, it has to introduce point-wise convolution to preserve interaction, at extra computational cost. This raises a natural question: how can we construct an effective 3D convolution that achieves a preferable trade-off between efficiency and accuracy for video classification? Our answer is to tensorize the channel dimension of the input feature as a multiplication of K sub-dimensions. By performing spatial/temporal tensor separable convolution along each sub-dimension, we can achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency.
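To make the efficiency argument concrete, the following sketch compares rough parameter counts for a full 3D convolution, an R(2+1)D-style spatial-temporal split, and a channel-tensorized alternative. This is a minimal illustrative cost model of our own (the function names, fixed kernel sizes, and per-group accounting are assumptions for illustration), not the paper's exact FLOPs measurement.

```python
def full_3d_params(c_in, c_out, t=3, h=3, w=3):
    """Full 3D convolution: every output channel sees every input channel."""
    return c_in * c_out * t * h * w

def spatial_temporal_params(c_in, c_out):
    """R(2+1)D-style split: a 1x3x3 spatial conv followed by a 3x1x1 temporal conv,
    each still mixing all channels."""
    return c_in * c_out * (1 * 3 * 3) + c_out * c_out * (3 * 1 * 1)

def channel_tensorized_params(sub_dims):
    """Channel tensorization: C = C1 * ... * CK. Along the k-th sub-dimension,
    a spatial (1x3x3) plus temporal (3x1x1) conv mixes only Ck channels at a
    time, so each of the C / Ck groups costs Ck * Ck * kernel parameters."""
    c = 1
    for d in sub_dims:
        c *= d
    total = 0
    for ck in sub_dims:
        groups = c // ck
        total += groups * ck * ck * (1 * 3 * 3 + 3 * 1 * 1)
    return total

# Example with C = 256 channels:
print(full_3d_params(256, 256))             # 1769472
print(spatial_temporal_params(256, 256))    # 786432
print(channel_tensorized_params([16, 16]))  # K = 2, 16 * 16 = 256 -> 98304
```

Under this toy accounting, tensorizing 256 channels into 16 × 16 cuts the channel-mixing cost well below both the full and the spatial-temporal factorized convolution, while every channel still ends up interacting after all K sub-operations.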

Table 1: Comparison of 3D convolution decomposition approaches, in terms of convolutional efficiency and feature-interaction sufficiency.

Method | 3D Convolution (t × h × w) | Interact Manner | Interact Field*
C3D (Tran et al., 2015) | Full: 3 × 3 × 3 | S, T, C | 3^3
R(2+1)D (Tran et al., 2018) | Full: 1 × 3 × 3; Full: 3 × 1 × 1 | S, C; T, C | 3^3
CSN (Tran et al., 2019) | Full: 1 × 1 × 1; DW: 3 × 3 × 3 | C; S, T | 3^3
Our CT-Net (C = C1 × ... × CK) | C1: C1 × ... × 1 × (1 × 3 × 3 + 3 × 1 × 1); ...; CK: 1 × ... × CK × (1 × 3 × 3 + 3 × 1 × 1) | S, T, C1; ...; S, T, CK | (2K + 1)^3

* Interact Field means the receptive field for feature interaction.

(1) Convolutional Efficiency. To enhance convolutional efficiency, we consider decomposing convolution in a higher dimension with a novel representation of the feature tensor.

(2) Feature-Interaction Sufficiency. Table 1 clearly shows that, for current decomposition approaches (Tran et al., 2018; 2019), feature interaction covers only one or two of the spatial, temporal and channel dimensions at each sub-operation. Such a partial interaction manner reduces classification accuracy. On one hand, it weakens the discriminative power of the video representation, due to the lack of joint learning across all dimensions. On the other hand, it restricts feature interaction to a limited receptive field, ignoring rich context from a larger 3D region. Hence, to boost classification accuracy, each sub-operation should achieve feature interaction on all the dimensions, and the receptive field of such interaction should be progressively enlarged as the number of sub-operations increases.

Based on these desirable principles, we design a novel and concise Channel Tensorization Module (CT-Module). Specifically, we propose to tensorize the channel dimension of the input feature as a multiplication of K sub-dimensions, i.e., C = C1 × C2 × ... × CK. By performing spatial/temporal separable convolution along each sub-dimension, we can effectively achieve convolutional efficiency and feature-interaction sufficiency.
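As a rough sketch of the channel tensorization idea (all shapes, variable names, and the einsum-based channel mixing below are our own illustrative assumptions, not the paper's implementation), one can reshape the channel axis into C = C1 × C2 and mix channels along one sub-dimension at a time:

```python
import numpy as np

C1, C2, T, H, W = 4, 8, 3, 5, 5
x = np.random.randn(C1 * C2, T, H, W)    # input feature with C = 32 channels

x_t = x.reshape(C1, C2, T, H, W)         # channel tensorization: C -> C1 x C2

# Channel mixing along the 1st sub-dimension: a (C1 x C1) weight is shared
# across all C2 groups, so only C1 channels interact within each group.
w1 = np.random.randn(C1, C1)
y = np.einsum('ab,bcthw->acthw', w1, x_t)

# Mixing along the 2nd sub-dimension afterwards: now every output channel has
# (indirectly) received information from all C1 * C2 = C input channels.
w2 = np.random.randn(C2, C2)
z = np.einsum('cd,adthw->acthw', w2, y)

print(z.reshape(C1 * C2, T, H, W).shape)  # (32, 3, 5, 5)
```

The key point the sketch demonstrates is that each sub-operation touches only Ck channels at a time (cheap), yet composing the K sub-operations lets every pair of channels interact (sufficient).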
For better understanding, we use the case of K = 2 as a simple illustration in Figure 1. First, we tensorize the input channel dimension into C = C1 × C2. Naturally, we separate the convolution into distinct ones along each sub-dimension; e.g., for the 1st sub-dimension, we apply our spatial-temporal tensor separable convolution with size C1 × 1 × t × h × w, which allows us to achieve convolutional efficiency on all the spatial, temporal and channel dimensions. After that, we sequentially perform the tensor separable convolution sub-dimension by sub-dimension. As a result, we can progressively achieve feature interaction on all the channels, and enlarge the spatial-temporal receptive field. For example, after operating the 1st tensor separable convolution on the 1st sub-dimension, C1 channels interact, and the 3D receptive field of such interaction is 3 × 3 × 3. By further operating the 2nd tensor separable convolution on the 2nd sub-dimension, all C1 × C2 = C channels interact, and the receptive field of such interaction is enlarged to 5 × 5 × 5 (i.e., (2K + 1)^3 with K = 2).
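The progressive growth described above can be tabulated with a small helper (a hypothetical illustration of our own; the function name and interface are assumptions): after the k-th sub-operation, the channels interacting are the product of the first k sub-dimensions, and the 3D receptive field of that interaction has side 2k + 1.

```python
def interaction_after(sub_dims):
    """Yield (channels interacting, receptive-field side) after each of the
    K sequential tensor separable convolutions, for C = C1 * ... * CK."""
    interacting = 1
    for k, ck in enumerate(sub_dims, start=1):
        interacting *= ck          # interaction spreads across one more sub-dimension
        yield interacting, 2 * k + 1

for step, (channels, side) in enumerate(interaction_after([16, 16]), start=1):
    print(f"after sub-op {step}: {channels} channels interact, "
          f"field {side}x{side}x{side}")
# after sub-op 1: 16 channels interact, field 3x3x3
# after sub-op 2: 256 channels interact, field 5x5x5
```

With K = 2 and C = 16 × 16, this reproduces the numbers in the text: partial interaction (C1 channels, 3 × 3 × 3) after the first sub-operation, and full interaction (all C channels, 5 × 5 × 5) after the second.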



Figure 1: Simple illustration of channel tensorization (K = 2). We tensorize the channel dimension of the input feature as a multiplication of K sub-dimensions. By performing spatial/temporal tensor separable convolution along each sub-dimension, we can achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency. The Introduction gives more explanations.

We identify two design principles for building effective video representation with efficient convolution.

