TEMPORAL COHERENT TEST-TIME OPTIMIZATION FOR ROBUST VIDEO CLASSIFICATION

Abstract

Deep neural networks are likely to fail when the test data is corrupted in real-world deployment (e.g., by blur or weather). Test-time optimization is an effective way to adapt models to corrupted data during testing, as has been shown in the image domain. However, techniques for improving the corruption robustness of video classification remain scarce. In this work, we propose a Temporal Coherent Test-time Optimization framework (TeCo) that utilizes spatio-temporal information in test-time optimization for robust video classification. To exploit the information in video through self-supervised learning, TeCo minimizes the entropy of the prediction based on the global content of video clips. Meanwhile, it also feeds in local content to regularize temporal coherence at the feature level. TeCo retains the generalization ability of various video classification models and achieves significant improvements in corruption robustness on Mini Kinetics-C and Mini SSV2-C. Furthermore, TeCo sets a new baseline for video classification corruption robustness via test-time optimization.

1. INTRODUCTION

Deep neural networks have achieved tremendous success in many computer vision tasks when the training and test data are independent and identically distributed (i.i.d.). However, a mismatch between them is common when the model is deployed in the real world (Wang et al., 2022b; Li et al., 2020). For example, weather changes like rain and fog, and data pre-processing like saturation adjustment and compression, can corrupt the test data. Many works show that common corruptions arising in nature can significantly degrade the performance of models at test time (Hendrycks & Dietterich, 2019; Yi et al., 2021; Kar et al., 2022; Geirhos et al., 2018; Yu et al., 2022). In video classification, Yi et al. (2021) demonstrate the vulnerability of models to corruptions like noise, blur, and weather variations.

Various techniques have been proposed to improve the robustness of models against common corruptions (a.k.a. corruption robustness). The most popular direction is increasing the diversity of the training data via data augmentation and applying regularization at training time (Zheng et al., 2016; Wang et al., 2021b; Hendrycks et al., 2020; Rusak et al., 2020; Wang et al., 2020). This approach trains one model and evaluates it on all types of corruption. However, the training data is often unavailable because of privacy concerns and policy constraints, making it more challenging to deploy models in the real world. In this work, we focus on the direction of test-time optimization. Under this scheme, the parameters of a model are optimized specifically on one type of corrupted test data. Test-time optimization updates the model to fit the deployment environment at test time, without access to training data. Several test-time optimization techniques have emerged for image-based tasks (Wang et al., 2021a; Schneider et al., 2020; Liang et al., 2020; Sun et al., 2020).
However, our empirical analysis shows that these techniques do not generalize well to video-based corruption robustness tasks. We hypothesize that the gap between image-based and video-based tasks comes from several aspects. Firstly, corruptions in video can change temporal information, which requires the model to be both generalizable and robust (Yi et al., 2021). Hence, improving model robustness against video corruptions such as bit errors and frame rate conversion is more challenging. Secondly, video input data has a different format from image data. For example, video data is much larger than image data. It is impractical to use a batch size similar to that of image-based tasks (e.g., a batch size of 256 in Tent (Wang et al., 2021a)), even though the batch size is an important hyper-parameter in test-time optimization (Wang et al., 2021a; Schneider et al., 2020). Lastly, these techniques ignore the rich information hidden in the temporal dimension.

To consistently improve the robustness of video classification models against corruptions, we propose a temporal coherent test-time optimization framework, TeCo. TeCo is a test-time optimization technique with two self-supervised objectives. We build our method upon a test-time optimization scheme that updates all the parameters in the shallow layers and only the normalization layers in the deep layers. This strategy benefits from both training and test data: it retains part of the model parameters and statistics obtained at training time and updates the unfrozen parameters with test information. In addition, we utilize global and local spatio-temporal information for self-supervision. We use uniform sampling to ensure that the input video data carries global information, and we optimize the model parameters via entropy minimization. By dense sampling, we extract another, local stream that has a smaller time gap between consecutive frames.
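The two sampling schemes and the entropy objective described above, together with the temporal coherence term applied to the local stream, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `stride` and `start` parameters, the centre-of-segment choice for uniform sampling, and the use of a mean absolute (L1) difference between consecutive frame features are all hypothetical choices made for the example.

```python
import math


def uniform_indices(num_frames, clip_len):
    """Global stream: pick clip_len frames spread evenly over the video.

    Assumption: the centre frame of each equal-length segment is used.
    """
    seg = num_frames / clip_len
    return [int(seg * i + seg / 2) for i in range(clip_len)]


def dense_indices(num_frames, clip_len, stride=1, start=0):
    """Local stream: clip_len frames with a small, fixed gap (stride)."""
    last = start + (clip_len - 1) * stride
    if last >= num_frames:
        raise ValueError("clip does not fit inside the video")
    return [start + i * stride for i in range(clip_len)]


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy of the softmax prediction (the quantity minimized)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def temporal_coherence_penalty(features):
    """Mean absolute difference between features of consecutive frames.

    Assumption: an L1 distance is used; features is a list of per-frame
    feature vectors (lists of floats) from the local stream.
    """
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, nxt)) / len(prev)
        for prev, nxt in zip(features, features[1:])
    ]
    return sum(diffs) / len(diffs)
```

In a test-time optimization loop, the entropy would be computed on predictions for the uniformly sampled clip and the coherence penalty on features of the densely sampled clip, with the two terms combined by a weighting factor that is a hyper-parameter not specified in this section.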
Due to the smooth and continuous nature of adjacent frames in video (Li & DiCarlo, 2008; Wood & Wood, 2016), we apply temporal coherence regularization as a self-supervised signal in test-time optimization on the local pathway. As such, our proposed technique enables the model to learn features that are more invariant to corruption. As a result, TeCo achieves promising robustness improvements on two large-scale video corruption robustness datasets, Mini Kinetics-C and Mini SSV2-C. Its performance is superior across various video classification backbones. TeCo increases average accuracy by 6.5% across backbones on Mini Kinetics-C and by 4.1% on Mini SSV2-C, which is better than the baseline methods Tent (1.9% and 0.9%) and SHOT (1.9% and 2.0%). Additionally, we show that TeCo can guarantee smoothness between consecutive frames at the feature level, which indicates the effectiveness of temporal coherence regularization.

We summarize our contributions as follows:

• To the best of our knowledge, we make the first attempt to study test-time optimization techniques for video classification corruption robustness across datasets and model architectures.
• We propose a novel test-time optimization framework, TeCo, for video classification, which utilizes spatio-temporal information in training and test data to improve corruption robustness.
• For video corruption robustness, TeCo outperforms other baseline test-time optimization techniques significantly and consistently on the Mini Kinetics-C and Mini SSV2-C datasets.

It assumes the tested model has no prior knowledge of the corruption arising during test time. The model is trained with clean data while tested on corrupted data. Under such a setting, we are able to estimate the overall robustness of models against corruption. Following this, benchmark studies and techniques across various computer vision tasks have proliferated (Kar et al., 2022; Michaelis et al., 2019; Kamann & Rother, 2020). These studies on corruption robustness bridge the gap between research in well-controlled lab environments and deployment in the field. Recently, studies have emerged on corruption robustness in video classification (Yi et al., 2021; Wu & Kwiatkowska, 2020). In this work, we tap the potential of spatio-temporal information in video data and improve the corruption robustness of video classification models during testing.

