TEMPORAL COHERENT TEST-TIME OPTIMIZATION FOR ROBUST VIDEO CLASSIFICATION

Abstract

Deep neural networks are likely to fail when the test data is corrupted in real-world deployment (e.g., by blur or weather). Test-time optimization, which adapts models to corrupted data during testing, has proven effective in the image domain, but few techniques exist for improving the corruption robustness of video classification. In this work, we propose a Temporal Coherent Test-time Optimization framework (TeCo) that exploits spatio-temporal information during test-time optimization for robust video classification. To exploit video information with self-supervised learning, TeCo minimizes the entropy of predictions based on the global content of video clips, while feeding local content to regularize temporal coherence at the feature level. TeCo retains the generalization ability of various video classification models and achieves significant improvements in corruption robustness on Mini Kinetics-C and Mini SSV2-C. Furthermore, TeCo sets a new baseline for video classification corruption robustness via test-time optimization.

1. INTRODUCTION

Deep neural networks have achieved tremendous success in many computer vision tasks when the training and test data are independently and identically distributed (i.i.d.). However, a mismatch between them is common when models are deployed in the real world (Wang et al., 2022b; Li et al., 2020). For example, weather changes like rain and fog, and data pre-processing steps like saturation adjustment and compression, can corrupt the test data. Many works show that common corruptions arising in nature can significantly degrade the performance of models at test time (Hendrycks & Dietterich, 2019; Yi et al., 2021; Kar et al., 2022; Geirhos et al., 2018; Yu et al., 2022). In video classification, Yi et al. (2021) demonstrate the vulnerability of models to corruptions like noise, blur, and weather variations. Various techniques have been proposed to improve the robustness of models against common corruptions (a.k.a. corruption robustness). The most popular direction increases the diversity of the training data via data augmentation and applies regularization at training time (Zheng et al., 2016; Wang et al., 2021b; Hendrycks et al., 2020; Rusak et al., 2020; Wang et al., 2020); a single model is trained and then evaluated on all types of corruption. However, training data is often unavailable due to privacy concerns and policy constraints, making it more challenging to deploy models in the real world. In this work, we focus on test-time optimization. Under this scheme, the parameters of a model are optimized specifically on one type of corrupted test data, updating the model to fit the deployment environment at test time without access to training data. Several test-time optimization techniques have emerged in image-based tasks (Wang et al., 2021a; Schneider et al., 2020; Liang et al., 2020; Sun et al., 2020).
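To make the test-time optimization scheme concrete, the sketch below minimizes the entropy of a clip-level prediction, in the spirit of entropy-minimization adaptation methods such as TENT (Wang et al., 2021a). The frame logits, the per-class bias parameter, and the learning rate are illustrative assumptions, not this paper's implementation; practical methods typically adapt normalization-layer parameters via backpropagation through the network instead.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    # Shannon entropy of a probability vector (nats).
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

# Hypothetical frame-level logits for one corrupted clip: (frames, classes).
rng = np.random.default_rng(0)
frame_logits = rng.normal(size=(8, 4))

# Test-time parameter: a per-class logit bias (a toy stand-in for the
# normalization parameters that methods like TENT actually adapt).
bias = np.zeros(4)
lr = 0.5

for _ in range(20):
    # Clip-level prediction: average frame logits, then apply the bias.
    p = softmax(frame_logits.mean(axis=0) + bias)
    h = entropy(p)
    # Analytic gradient of the entropy w.r.t. the logits: -p_k (log p_k + H).
    grad = -p * (np.log(p + 1e-12) + h)
    bias -= lr * grad  # gradient descent on prediction entropy

entropy_before = entropy(softmax(frame_logits.mean(axis=0)))
entropy_after = entropy(softmax(frame_logits.mean(axis=0) + bias))
print(entropy_before, entropy_after)
```

The loop sharpens the clip-level prediction: after adaptation, the entropy of the biased prediction is lower than the unadapted one, which is the self-supervised signal these methods exploit when labels are unavailable at test time.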
However, our empirical analysis finds that these techniques do not generalize well to video-based corruption robustness tasks. We hypothesize that the gap between image- and video-based tasks arises from several aspects. Firstly, corruptions in video can change temporal information,

