PROBING INTO OVERFITTING FOR VIDEO RECOGNITION

Abstract

Video recognition methods based on 2D networks have thrived in recent years, leveraging advanced image classification techniques. However, overfitting is a severe problem in 2D video recognition models because 1) the scale of video datasets is relatively small compared to image recognition datasets like ImageNet, and 2) the current pipeline treats informative and non-informative (e.g., background) frames equally during optimization, which aggravates overfitting. Motivated by these challenges, we design a video-specific data augmentation approach, named Ghost Motion (GM), to alleviate overfitting. Specifically, GM shifts channels along the temporal dimension so that semantic motion information diffuses into other frames that may originally be irrelevant, producing motion artifacts that describe the appearance change and emphasize the motion-salient frames. In this manner, GM prevents the model from overfitting to non-informative frames and yields better generalization. Comprehensive empirical validation on various architectures and datasets shows that GM improves the generalization of existing methods and is compatible with existing image-level data augmentation approaches, further boosting performance.

1. INTRODUCTION

Video recognition methods have evolved rapidly due to the increasing number of online videos and the success of advanced deep neural networks. Although 3D networks (Feichtenhofer et al., 2019; Carreira & Zisserman, 2017; Tran et al., 2015) provide straightforward solutions for video recognition, 2D-based methods (Wang et al., 2016; Lin et al., 2019; Li et al., 2020; Wang et al., 2021) still attract researchers' interest because of their efficiency in both computation and storage. However, 2D networks suffer from a severe overfitting issue. For instance, on the Something-Something V1 (Goyal et al., 2017) dataset, which exhibits strong temporal dependency, the training and validation accuracy of TSM (Lin et al., 2019) are 81.22% and 45.34%, respectively. Moreover, its Expected Calibration Error (ECE) (Guo et al., 2017) is 25.83%, meaning the model gives overconfident predictions, which has a negative impact when the model is deployed in real scenarios. There are two main reasons for overfitting in video recognition: 1) video recognition benchmarks usually have fewer training samples than image classification datasets (e.g., ImageNet (Deng et al., 2009) with 1.2 million training samples versus Kinetics (Kay et al., 2017) with 240K videos), while spatial-temporal modeling for video is harder than recognizing static images and should require even more samples; 2) 2D-based methods average the logits of all frames to vote for the final decision, which increases the tendency to overfit frames that contain little semantic information (e.g., background scenes). From this view, such frames can be regarded as noise in optimization because they provide no motion information for temporal modeling. To alleviate overfitting, many attempts have been made: Dropout (Srivastava et al., 2014) and Label Smoothing (Szegedy et al., 2016) are widely used in deep networks for their regularization effects.
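As a reference for the ECE figures quoted above, a minimal sketch of the standard equal-width-binning computation from Guo et al. (2017); the bin count and implementation details here are our assumptions, not this paper's:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: weighted average gap between confidence and accuracy over
    equal-width confidence bins. `confidences` holds the max softmax
    probability per sample; `correct` is a 0/1 array of the same length.
    (Illustrative sketch; n_bins=15 is an assumed choice.)"""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |accuracy - avg confidence| in this bin, weighted by bin mass
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```

A model that predicts with 90% confidence but is right only half the time, as in the overfitting regime described above, would score an ECE of 0.4 under this sketch.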
Designing data augmentation methods (Zhang et al., 2017; Yun et al., 2019; DeVries & Taylor, 2017; Cubuk et al., 2020) to relieve overfitting is another line of research. Although some of these methods have shown effectiveness in image-level tasks, directly employing them on video tasks may yield detrimental temporal representations, as they are not specifically designed for video data and some transformations break the motion pattern. In this work, we propose a data augmentation approach, Ghost Motion (GM), which shifts channels along the temporal dimension to propagate motion information to adjacent frames. Specifically, all channels are shifted one step along the temporal axis, leading to misaligned channel orders, and we interpolate between the original video and the misaligned one to form a new video, named the ghost video. This diffuses motion patterns from salient frames into other frames to enhance the overall representative capacity. In this way, we enforce the model to focus more on the informative frames and prevent it from overfitting non-informative ones. Although the shifting operation results in mismatched RGB orders, we surprisingly find that this disorder also improves generalization, as elaborated in the analysis of Sec. 4.5. Ghost Motion effectively enlarges the input space without breaking the motion pattern and offers continuous variance in the input space, which benefits the generalization of video recognition methods. Moreover, we find that scaling the logits with a temperature hyperparameter before Softmax can further mitigate overfitting and reduce the calibration error, especially on challenging datasets such as Something-Something V1&V2. The proposed GM can be plugged into existing methods with a few lines of code and brings negligible computational cost. As shown in Fig. 1 (a), GM results in consistent improvements over various 2D models. In Fig. 1 (b), GM consistently improves the performance of the baseline model with different numbers of sampled frames. In addition, our method is compatible with existing image-level augmentations, which can be applied jointly to relieve overfitting and further boost performance. The main contributions are summarized as follows:

• We propose a video recognition data augmentation method, Ghost Motion (GM), which effectively improves the generalization of current video benchmark models and is compatible with existing image-level data augmentation approaches.

• We find that smoothing the logits can prevent overconfident predictions and further alleviate overfitting on temporal-dependent datasets such as Something-Something.

• We conduct comprehensive experiments to validate the strength of Ghost Motion on various datasets and methods. Extensive ablations with detailed analysis illustrate the motivation and intuition behind our GM strategy.
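Based on the description above, the core GM operation might be sketched as follows. This is our reading, not the authors' released implementation: the one-step roll over the flattened temporal-channel axis (which produces the mismatched RGB orders mentioned above), the wrap-around at the clip boundary, and the default mixing weight are all assumptions.

```python
import numpy as np

def ghost_motion(video, lam=0.5):
    """Ghost Motion sketch. `video` has shape (T, C, H, W).

    Every channel is shifted one step along the temporal axis by rolling
    the flattened (T*C) channel sequence, so each frame inherits one
    channel from its predecessor and its own channel order becomes
    misaligned. The "ghost video" is a linear interpolation between the
    original clip and the shifted one, weighted by `lam` (assumed value).
    """
    t, c, h, w = video.shape
    flat = video.reshape(t * c, h, w)
    shifted = np.roll(flat, 1, axis=0).reshape(t, c, h, w)
    return lam * video + (1.0 - lam) * shifted
```

Because the output is a convex combination of two valid clips, the augmentation offers the continuous variance in the input space noted above, rather than a discrete perturbation.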

2. RELATED WORK

2.1. VIDEO RECOGNITION

Video recognition has benefited greatly from developments in image classification, especially for 2D-based methods, which share the same backbones as image classification models. The main focus of these 2D methods lies in temporal modeling. For instance, TSN (Wang et al., 2016) averages the predictions of all frames to produce the final prediction and lays a foundation for later 2D models, while TSM (Lin et al., 2019) shifts part of the channels among adjacent frames for temporal modeling. Recently, people have paid more attention to multi-scale
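For contrast with Ghost Motion's whole-channel roll, the TSM-style shift mentioned above can be sketched as follows. The 1/8 forward and 1/8 backward shift proportions and the zero-padding at clip boundaries follow Lin et al. (2019); the (T, C, H, W) array layout is our assumption.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """TSM-style temporal shift sketch. `x` has shape (T, C, H, W).

    A fraction (1/fold_div) of the channels is shifted forward one frame,
    another fraction backward one frame, and the rest are left in place.
    Frames at the clip boundary receive zeros for the shifted channels.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]           # remainder unshifted
    return out
```

Unlike Ghost Motion, which is a training-time augmentation applied to the input clip, this shift is an architectural operation inserted inside the network for temporal modeling.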



Figure 1: The performance of Ghost Motion on different methods and datasets. Ghost Motion introduces negligible cost to current competing methods and improves their generalization abilities. Panel (b): Ghost Motion continuously improves the performance with different numbers of sampled frames on ActivityNet.

