PROBING INTO OVERFITTING FOR VIDEO RECOGNITION

Abstract

Video recognition methods based on 2D networks have thrived in recent years, leveraging advanced image classification techniques. However, overfitting is a severe problem in 2D video recognition models because 1) the scale of video datasets is relatively small compared to image recognition datasets such as ImageNet; and 2) the current pipeline treats informative and non-informative (e.g., background) frames equally during optimization, which aggravates overfitting. Motivated by these challenges, we design a video-specific data augmentation approach, named Ghost Motion (GM), to alleviate overfitting. Specifically, GM shifts channels along the temporal dimension so that semantic motion information diffuses into other, originally irrelevant frames, producing motion artifacts that describe appearance change and emphasize motion-salient frames. In this manner, GM prevents the model from overfitting to non-informative frames and yields better generalization. Comprehensive empirical validation on various architectures and datasets shows that GM improves the generalization of existing methods and is compatible with existing image-level data augmentation approaches, further boosting performance.
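The channel-shift operation described above can be illustrated with a minimal sketch. The paper does not give implementation details here, so the shift ratio, the shift direction, and the function name `ghost_motion` are assumptions for illustration; the sketch only shows the general idea of moving a subset of channels one step along the temporal axis so that each frame carries appearance cues from its neighbors.

```python
import numpy as np

def ghost_motion(frames, shift_ratio=0.25):
    """Sketch of a temporal channel shift (assumed parameters).

    frames: array of shape (T, C, H, W) for one video clip.
    shift_ratio: assumed fraction of channels to shift per direction.
    """
    t, c, h, w = frames.shape
    n_shift = int(c * shift_ratio)
    out = frames.copy()
    # Shift the first group of channels forward in time: frame t receives
    # these channels from frame t-1, diffusing motion cues across frames.
    out[1:, :n_shift] = frames[:-1, :n_shift]
    # Shift a second group backward in time (assumed bidirectional shift).
    out[:-1, n_shift:2 * n_shift] = frames[1:, n_shift:2 * n_shift]
    return out
```

The mismatch between a frame's own channels and its shifted-in neighbors is what creates the motion artifacts the paper refers to: static background regions change little under the shift, while moving regions produce visible appearance differences.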

1. INTRODUCTION

Video recognition methods have evolved rapidly due to the increasing number of online videos and the success of advanced deep neural networks. Even though 3D networks (Feichtenhofer et al., 2019; Carreira & Zisserman, 2017; Tran et al., 2015) provide straightforward solutions for video recognition, 2D-based methods (Wang et al., 2016; Lin et al., 2019; Li et al., 2020; Wang et al., 2021) still arouse researchers' interest because of their efficiency in both computation and storage. However, 2D networks still suffer from the overfitting issue. For instance, on the Something-Something V1 (Goyal et al., 2017) dataset, which contains strong temporal dependency, the training and validation accuracy of TSM (Lin et al., 2019) is 81.22% and 45.34%, respectively. Besides, its Expected Calibration Error (ECE) (Guo et al., 2017) is 25.83%, which means the model gives overconfident predictions, bringing a negative impact when the model is deployed in real scenarios.

There are two main reasons for overfitting in video: 1) video recognition benchmarks usually have fewer training samples than image classification datasets (e.g., ImageNet (Deng et al., 2009) with 1.2 million training samples compared to Kinetics (Kay et al., 2017) with 240K videos). Furthermore, spatial-temporal modeling for video is harder than recognizing static images, which should require even more samples. 2) 2D-based methods average the logits of all frames to vote for the final decision, which increases the tendency to overfit frames that contain little semantic information (e.g., background scenes). In this view, such frames can be regarded as noise in optimization because they provide no motion information for temporal modeling.

To alleviate overfitting, many attempts have been made. Dropout (Srivastava et al., 2014) and Label Smoothing (Szegedy et al., 2016) are widely used in deep networks because of their regularization effects. Designing data augmentation methods (Zhang et al., 2017; Yun et al., 2019; DeVries & Taylor, 2017; Cubuk et al., 2020) to relieve overfitting is another line of research. Although some of these methods have shown effectiveness in image-level tasks, directly employing them on video tasks may result in detrimental temporal representations, as they are not specifically designed for video data and some transformations break the motion pattern.
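For readers unfamiliar with the calibration metric cited above, a minimal sketch of ECE follows. The binning scheme (ten equal-width confidence bins) follows the common convention from Guo et al. (2017); the function name and the bin count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch: weighted average gap between confidence and accuracy.

    confidences: per-sample max softmax probability, shape (N,).
    correct: per-sample 0/1 correctness indicator, shape (N,).
    """
    ece = 0.0
    n = len(confidences)
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        # Samples whose confidence falls in the bin (lo, hi].
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = correct[mask].mean()    # empirical accuracy in the bin
        conf = confidences[mask].mean()  # mean confidence in the bin
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

A well-calibrated model has small ECE (confidence matches accuracy in every bin); the 25.83% figure for TSM means its average confidence far exceeds its accuracy, i.e., the overconfidence symptom of overfitting.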

