OPTIMIZATION PLANNING FOR 3D CONVNETS

Abstract

3D Convolutional Neural Networks (3D ConvNets) have been regarded as a powerful class of models for video recognition. Nevertheless, it is not trivial to optimally train 3D ConvNets due to the high complexity and numerous options of the training scheme. The most common hand-tuned process starts by learning 3D ConvNets on short video clips and then learns long-term temporal dependency on lengthy clips, while gradually decaying the learning rate from high to low as training progresses. The fact that such a process comes with several heuristic settings motivates us to seek an optimal "path" to automate the entire training. In this paper, we decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state. The estimation of the knee point on the performance-epoch curve triggers the transition from one state to another. We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., the optimization path. Furthermore, we devise a new 3D ConvNets with a unique dual-head classifier design to improve spatial and temporal discrimination. Extensive experiments conducted on seven public video recognition benchmarks demonstrate the advantages of our proposal. With optimization planning, our 3D ConvNets achieves superior results compared to state-of-the-art video recognition approaches. More remarkably, we obtain top-1 accuracies of 82.5% and 84.3% on the large-scale Kinetics-400 and Kinetics-600 datasets, respectively.

1. INTRODUCTION

The recent advances in 3D Convolutional Neural Networks (3D ConvNets) have successfully pushed the limits and improved the state of the art of video recognition. For instance, an ensemble of LGD-3D networks (Qiu et al., 2019) achieves 17.88% average error in the trimmed video classification task of the ActivityNet Challenge 2019, which is dramatically lower than the 29.3% error attained by the earlier I3D networks (Carreira & Zisserman, 2017). This result indicates the advantage and great potential of 3D ConvNets for improving video recognition performance. Despite this impressive progress, learning effective 3D ConvNets for video recognition remains challenging due to the large variations and complexity of video content. Existing works on 3D ConvNets (Tran et al., 2015; Carreira & Zisserman, 2017; Tran et al., 2018; Wang et al., 2018c; Feichtenhofer et al., 2019; Qiu et al., 2017; 2019) predominantly focus on the design of network architectures but seldom explore how to train 3D ConvNets in a principled way. The difficulty in training 3D ConvNets originates from the high flexibility of the training scheme. Compared to the training of 2D ConvNets (Ge et al., 2019; Lang et al., 2019; Yaida, 2019), the involvement of the temporal dimension in 3D ConvNets raises two new problems: how many frames should be sampled from a video, and how should these frames be sampled. First, the length of the input clip trades off training efficiency against long-range temporal modeling. On one hand, training with short clips (16 frames) (Tran et al., 2015; Qiu et al., 2017) generally leads to fast convergence with large mini-batches, and also alleviates overfitting through the data augmentation brought by sampling short clips.
On the other hand, recent works (Varol et al., 2018; Wang et al., 2018c; Qiu et al., 2019) have demonstrated a better ability to capture long-range dependency when training with long clips (over 100 frames), at the expense of training time. The second issue is the sampling strategy. Uniform sampling (Fan et al., 2019; Jiang et al., 2019; Martínez et al., 2019) offers the network a fast-forward overview of the entire video, while consecutive sampling (Tran et al., 2015; Qiu et al., 2017; 2019; Varol et al., 2018; Wang et al., 2018c) better captures the spatio-temporal relations across frames. Given these complex choices of training scheme, learning powerful 3D ConvNets often requires significant engineering effort from human experts to determine the optimal strategy on each dataset. That motivates us to automate the design of the training strategy for 3D ConvNets. In this paper, we propose an optimization planning mechanism that adaptively seeks the optimal training strategy for 3D ConvNets. To this end, our optimization planning studies three problems: 1) whether to choose consecutive or uniform sampling; 2) when to increase the length of the input clip; 3) when to decrease the learning rate. Specifically, we decompose the training process into several training states. Each state is assigned fixed hyper-parameters, including the sampling strategy, the length of the input clip, and the learning rate. A transition between states represents a change of hyper-parameters during training. The training process is therefore determined by the permutation of states and the number of epochs spent in each state. Here, we build a candidate transition graph to define the valid transitions between states. The search for the best optimization strategy is then equivalent to seeking the optimal path from the initial state to the final state on the graph, which can be solved by a dynamic programming algorithm.
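The path search just described can be sketched as dynamic programming over a directed acyclic graph of training states. The state names, transition graph, and gain values below are illustrative assumptions for exposition, not the actual configuration or measured performance from our experiments:

```python
# Sketch: optimization planning as a best-path search over a candidate
# transition graph of training states. Each state fixes hyper-parameters
# (sampling strategy, clip length); each edge is a valid transition, and
# its "gain" stands in for the performance improvement of training in
# the successor state. All concrete values here are hypothetical.

def best_path(graph, gain, start, goal):
    """Memoized DP over a DAG: returns (best total gain, state path)."""
    memo = {}

    def solve(state):
        if state == goal:
            return 0.0, [goal]
        if state in memo:
            return memo[state]
        best = (float("-inf"), [])
        for nxt in graph.get(state, []):
            g, tail = solve(nxt)
            total = gain[(state, nxt)] + g
            if total > best[0]:
                best = (total, [state] + tail)
        memo[state] = best
        return best

    return solve(start)

# Toy example: states encode (sampling strategy, clip length).
graph = {
    "uniform-16": ["consec-16", "consec-32"],
    "consec-16": ["consec-32"],
    "consec-32": ["consec-128"],
    "consec-128": [],
}
gain = {
    ("uniform-16", "consec-16"): 2.0,
    ("uniform-16", "consec-32"): 1.5,
    ("consec-16", "consec-32"): 1.0,
    ("consec-32", "consec-128"): 0.8,
}
score, path = best_path(graph, gain, "uniform-16", "consec-128")
```

Because the graph is acyclic and each state's best continuation is memoized, the search visits every edge once, so the cost is linear in the number of candidate transitions.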
To determine the best number of epochs for each state in this process, we propose a knee point estimation method that fits the performance-epoch curve. In general, our optimization planning can be viewed as a training-scheme controller and is readily applicable to training other neural networks in stages with multiple hyper-parameters. To the best of our knowledge, our work is the first to address the issue of optimization planning for 3D ConvNets training. The issue also leads to an elegant view of how the order and epochs of different hyper-parameter settings should be planned adaptively. We uniquely formulate the problem as seeking an optimal training path and devise a new 3D ConvNets with a dual-head classifier. Extensive experiments on seven datasets demonstrate the effectiveness of our proposal; with optimization planning, our 3D ConvNets achieves superior results to several state-of-the-art techniques.
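To make the knee-point idea concrete, the sketch below uses a simplified geometric stand-in rather than the parametric curve fitting of our method: it marks the knee of a discrete performance-epoch curve as the measurement farthest from the chord joining the first and last points. The accuracy values are hypothetical:

```python
# Simplified knee-point estimator on a performance-epoch curve.
# This max-distance-to-chord heuristic is an illustrative substitute for
# fitting the curve; the sample accuracies below are made up.

def knee_epoch(accs):
    """accs: accuracy per epoch (index 0 = epoch 1). Returns 1-based knee epoch."""
    n = len(accs)
    x0, y0 = 0.0, accs[0]
    x1, y1 = float(n - 1), accs[-1]
    denom = ((y1 - y0) ** 2 + (x1 - x0) ** 2) ** 0.5
    best_i, best_d = 0, -1.0
    for i, a in enumerate(accs):
        # Perpendicular distance from (i, a) to the chord through the endpoints.
        d = abs((y1 - y0) * i - (x1 - x0) * a + x1 * y0 - y1 * x0) / denom
        if d > best_d:
            best_i, best_d = i, d
    return best_i + 1

# Accuracy saturates after a few epochs; the knee triggers the state transition.
curve = [10, 40, 60, 70, 74, 76, 77, 77.5, 77.8, 78.0]
```

Once the knee is detected, training switches to the next state on the planned path instead of spending further epochs on diminishing returns.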

2. RELATED WORK

The early works using Convolutional Neural Networks for video recognition are mostly extended from 2D ConvNets for image classification (Karpathy et al., 2014; Simonyan & Zisserman, 2014; Feichtenhofer et al., 2016; Wang et al., 2016). These approaches often treat a video as a sequence of frames or optical flow images, and the pixel-level temporal evolution across consecutive frames is seldom explored. To alleviate this issue, the 3D ConvNets in Ji et al. (2013) is devised to directly learn spatio-temporal representation from a short video clip via 3D convolution. Tran et al. design a widely-adopted 3D ConvNets in Tran et al. (2015), namely C3D, consisting of 3D convolutions and 3D poolings optimized on the large-scale Sports1M (Karpathy et al., 2014) dataset. Despite encouraging performance, the training of 3D ConvNets is computationally expensive and the model size grows massively. Later, in Qiu et al. (2017); Tran et al. (2018); Xie et al. (2018), decomposed 3D convolution is proposed to simulate one 3D convolution with one 2D spatial convolution plus one 1D temporal convolution. Recently, more advanced techniques have been presented for 3D ConvNets, including inflating 2D convolutions (Carreira & Zisserman, 2017), non-local pooling (Wang et al., 2018c) and local-and-global diffusion (Qiu et al., 2019). Our work expands the research horizons of 3D ConvNets and focuses on improving 3D ConvNets training by adaptively planning the optimization process. Related works on 2D ConvNets training (Chee & Toulis, 2018; Lang et al., 2019; Yaida, 2019) automate the training strategy by only changing the learning rate adaptively. Our problem is much more challenging, especially when the temporal dimension is additionally considered and involved in the training scheme of 3D ConvNets. To enhance 3D ConvNets training, recent works (Wang et al., 2018c; Qiu et al., 2019) first train 3D ConvNets with short input clips and then fine-tune the network with lengthy clips, which balances training efficiency and long-range temporal modeling. The multigrid method (Wu et al., 2020) further cyclically changes the spatial resolution and temporal duration of input clips for a more efficient optimization of 3D ConvNets. The research in this paper contributes by studying not only training 3D ConvNets with multiple lengths of input clips, but also adaptively scheduling the change of input clip length through optimization planning.
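The parameter savings behind the decomposed 3D convolution mentioned above can be sketched with a simple count: one t x d x d 3D kernel versus a 1 x d x d spatial kernel followed by a t x 1 x 1 temporal kernel. The channel sizes and the intermediate width below are illustrative choices, not values from the cited papers:

```python
# Parameter counts (weights only, no bias) for a full 3D convolution
# versus its spatial + temporal decomposition. The intermediate channel
# width of the decomposed form is set, for illustration, to the output
# width; the cited works choose it differently.

def params_3d(c_in, c_out, t, d):
    # One t x d x d kernel per (input, output) channel pair.
    return c_in * c_out * t * d * d

def params_2plus1d(c_in, c_out, t, d, mid=None):
    mid = c_out if mid is None else mid   # intermediate width (assumption)
    spatial = c_in * mid * d * d          # 1 x d x d spatial convolution
    temporal = mid * c_out * t            # t x 1 x 1 temporal convolution
    return spatial + temporal

full = params_3d(64, 64, t=3, d=3)
decomposed = params_2plus1d(64, 64, t=3, d=3)
```

Under these toy settings the decomposition uses 49,152 weights against 110,592 for the full 3D kernel, which illustrates why the decomposed form eases both the computational cost and the model-size growth noted above.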

