EXPLORING TEMPORALLY DYNAMIC DATA AUGMENTATION FOR VIDEO RECOGNITION

Abstract

Data augmentation has recently emerged as an essential component of modern training recipes for visual recognition tasks. However, data augmentation for video recognition has rarely been explored despite its effectiveness. The few existing augmentation recipes for video recognition naively extend image augmentation methods by applying the same operations to all video frames. Our main idea is that the magnitude of augmentation operations applied to each frame should change over time to capture the temporal variations of real-world videos. These variations should be generated as diversely as possible while introducing few additional hyperparameters during training. With this motivation, we propose a simple yet effective video data augmentation framework, DynaAugment. The magnitude of augmentation operations on each frame is varied by an effective mechanism, Fourier Sampling, which parameterizes diverse, smooth, and realistic temporal variations. DynaAugment also includes an extended search space suitable for video in automatic data augmentation methods. Our experiments demonstrate that there is additional room for improvement beyond static augmentations across diverse video models. Specifically, we show the effectiveness of DynaAugment on various video datasets and tasks: large-scale video recognition (Kinetics-400 and Something-Something-v2), small-scale video recognition (UCF-101 and HMDB-51), fine-grained video recognition (Diving-48 and FineGym), video action segmentation on Breakfast, video action localization on THUMOS'14, and video object detection on MOT17Det.
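The idea of parameterizing a smooth per-frame magnitude curve with a random Fourier series can be sketched as follows. This is a minimal illustration, not the paper's exact Fourier Sampling procedure: the function name `fourier_sample`, the number of basis terms, the frequency range, and the [0, 1] rescaling are assumptions made for the example.

```python
import numpy as np

def fourier_sample(num_frames, num_basis=3, rng=None):
    """Sample a smooth temporal magnitude curve in [0, 1] as a random
    Fourier series (hypothetical sketch of Fourier Sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, num_frames)
    curve = np.zeros(num_frames)
    for _ in range(num_basis):
        freq = rng.integers(1, 4)            # low frequencies keep the curve smooth
        phase = rng.uniform(0.0, 2 * np.pi)  # random temporal offset
        weight = rng.normal()                # random amplitude per basis term
        curve += weight * np.sin(2 * np.pi * freq * t + phase)
    # Rescale to [0, 1] so the curve can modulate a base augmentation magnitude.
    curve = (curve - curve.min()) / (curve.max() - curve.min() + 1e-8)
    return curve
```

Each call yields a different smooth temporal trajectory, so the augmentation strength varies realistically from frame to frame without per-frame hyperparameters.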

1. INTRODUCTION

Data augmentation is a crucial component of machine learning tasks, as it prevents overfitting caused by a lack of training data and improves task performance without additional inference costs. Many data augmentation methods have been proposed across a broad range of research fields, including image recognition (Cubuk et al., 2019; 2020; Hendrycks et al., 2019; DeVries & Taylor, 2017; Zhang et al., 2018; Yun et al., 2019; LingChen et al., 2020; Müller & Hutter, 2021), image processing (Yoo et al., 2020; Yu et al., 2020), language processing (Sennrich et al., 2016; Wei & Zou, 2019; Wang et al., 2018; Chen et al., 2020), and speech recognition (Park et al., 2019; Meng et al., 2021). In image recognition, individual augmentation algorithms have become essential components of modern training recipes through various combinations (Touvron et al., 2021; Bello et al., 2021; Wightman et al., 2021; Liu et al., 2022). However, data augmentation for video recognition tasks has not been extensively studied beyond the direct adaptation of image data augmentations. An effective data augmentation method must cover the comprehensive characteristics of the data. Videos are characterized by diverse and dynamic temporal variations, such as camera/object movement or photometric (color or brightness) changes. As shown in Fig. 1, such dynamic temporal changes make it difficult to determine a video's category. Therefore, it is important to make video models cope effectively with all possible temporal variations, which would improve their generalization performance. Recent achievements in video recognition research show notable performance improvements by extending image-based augmentation methods to video (Yun et al., 2020; Qian et al., 2021). Following image transformers (Touvron et al., 2021; Liu et al., 2021a), video transformers (Fan et al., 2021;
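The contrast between a static augmentation and a temporally dynamic one can be made concrete with a toy brightness example. This is an illustrative sketch only; the function `apply_dynamic_brightness` and the (T, H, W, C) float-video convention are assumptions, not part of the paper's implementation.

```python
import numpy as np

def apply_dynamic_brightness(video, magnitudes):
    """Scale the brightness of each frame by its own magnitude.
    video: float array of shape (T, H, W, C) with values in [0, 1].
    magnitudes: length-T sequence of per-frame offsets; a static
    augmentation would instead use one scalar for all T frames."""
    assert video.shape[0] == len(magnitudes)
    scales = 1.0 + np.asarray(magnitudes, dtype=float).reshape(-1, 1, 1, 1)
    # Clip to keep pixel values in the valid [0, 1] range after scaling.
    return np.clip(video * scales, 0.0, 1.0)
```

Feeding the per-frame curve from a smooth temporal sampler into `magnitudes` yields brightness that drifts over time, mimicking real-world lighting changes, whereas a single scalar reproduces the static per-clip augmentation used by naive extensions of image methods.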

