DELTA: DEGRADATION-FREE FULLY TEST-TIME ADAPTATION

Abstract

Fully test-time adaptation aims at adapting a pre-trained model to the test stream during real-time inference, which is urgently required when the test distribution differs from the training distribution. Several efforts have been devoted to improving adaptation performance. However, we find that two unfavorable defects are concealed in prevalent adaptation methodologies like test-time batch normalization (BN) and self-learning. First, we reveal that the normalization statistics in test-time BN are completely determined by the currently received test samples, resulting in inaccurate estimates. Second, we show that during test-time adaptation, the parameter update is biased towards some dominant classes. Beyond the extensively studied test stream with independent and class-balanced samples, we further observe that these defects can be exacerbated in more complicated test environments, such as (time-)dependent or class-imbalanced data. We observe that previous approaches work well in certain scenarios while showing performance degradation in others due to these defects. In this paper, we provide a plug-in solution called DELTA for Degradation-freE fuLly Test-time Adaptation, which consists of two components: (i) Test-time Batch Renormalization (TBR), introduced to improve the estimated normalization statistics; (ii) Dynamic Online re-weighTing (DOT), designed to address the class bias within optimization. We investigate various test-time adaptation methods on three commonly used datasets with four scenarios, and on a newly introduced real-world dataset. DELTA can help them handle all scenarios simultaneously, leading to state-of-the-art performance.

1. INTRODUCTION

Models suffer from performance decrease when test and training distributions are mismatched (Quinonero-Candela et al., 2008). Numerous studies have been conducted to narrow the performance gap based on a variety of hypotheses/settings. Unsupervised domain adaptation methods (Ganin et al., 2016) necessitate simultaneous access to labeled training data and unlabeled target data, limiting their applications. Source-free domain adaptation approaches (Liang et al., 2020) only need a trained model and do not require the original training data when performing adaptation. Nonetheless, in a more difficult and realistic setting, known as fully test-time adaptation (Wang et al., 2021), the model must perform online adaptation to the test stream during real-time inference. The model is adapted in a single pass over the test stream using a pre-trained model and continuously arriving test data (rather than a prepared target set). Offline iterative training or heavy extra computation beyond normal inference does not meet these requirements. There have been several studies aimed at fully test-time adaptation. Test-time BN (Nado et al., 2020) / BN adapt (Schneider et al., 2020) directly uses the normalization statistics derived from test samples instead of those inherited from the training data, which is found to be beneficial in reducing the performance gap. Entropy-minimization-based methods, such as TENT (Wang et al., 2021), further optimize model parameters during inference. Contrastive learning (Chen et al., 2022), data augmentation (Wang et al., 2022a) and uncertainty-aware optimization (Niu et al., 2022) have been introduced to enhance adaptation performance. Efforts have also been made to address test-time adaptation in more complex test environments, like LAME (Boudiaf et al., 2022). Despite the achieved progress, we find that there are non-negligible defects hidden in the popular methods.
First, we take a closer look at the normalization statistics within inference (Section 3.2). We observe that the statistics used in BN adapt are inaccurate in each batch compared to the true population statistics. Second, we reveal that the prevalent test-time model updating is biased towards some dominant categories (Section 3.3). We notice that the model predictions are extremely imbalanced on out-of-distribution data, which can be exacerbated by self-learning-based adaptation methods. Besides the most common independent and class-balanced test samples considered in existing studies, following Boudiaf et al. (2022), we investigate three other test scenarios as illustrated in Figure 1 (see details in Section 3.1) and find that, when facing more intricate test streams such as dependent samples or class-imbalanced data, the prevalent methods suffer from severe performance degradation, which limits the usefulness of these test-time adaptation strategies. To address the aforementioned issues, we propose two powerful tools. Specifically, to handle the inaccurate normalization statistics, we introduce test-time batch renormalization (TBR) (Section 3.2), which uses test-time moving averaged statistics to rectify the normalized features and accounts for normalization during gradient optimization. By taking advantage of all observed test samples, the calibrated normalization is more accurate. We further propose dynamic online re-weighting (DOT) (Section 3.3) to tackle the biased optimization, which is derived from cost-sensitive learning. To balance adaptation, DOT assigns low/high weights to the frequent/infrequent categories. The weight mapping function is based on a momentum-updated class-frequency vector that takes into account multiple sources of category bias, including the pre-trained model, the test stream, and the adaptation methods (the methods usually do not have an intrinsic bias towards certain classes, but can accentuate existing bias).
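The core of TBR is to rectify the batch-normalized feature using test-time moving averages, in the spirit of batch renormalization (Ioffe, 2017): correction terms r and d, computed from the ratio and gap between batch and moving statistics, are treated as constants during backpropagation. A minimal sketch, where the buffer names, the momentum value, and the single-layer class are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

class TestTimeBatchRenorm:
    """Sketch of test-time batch renormalization (TBR): the batch-normalized
    feature is rectified with r and d derived from test-time moving averages,
    which are themselves updated online as batches arrive."""

    def __init__(self, num_features, momentum=0.05, eps=1e-5):
        self.mu_ema = np.zeros(num_features)   # test-time moving mean
        self.var_ema = np.ones(num_features)   # test-time moving variance
        self.momentum = momentum
        self.eps = eps
        self.gamma = np.ones(num_features)     # learnable affine scale
        self.beta = np.zeros(num_features)     # learnable affine shift

    def __call__(self, x):
        mu_b, var_b = x.mean(axis=0), x.var(axis=0)
        sigma_b = np.sqrt(var_b + self.eps)
        sigma_ema = np.sqrt(self.var_ema + self.eps)
        # r and d rectify the batch-normalized feature toward the moving
        # statistics; in a real framework they would be stop-gradient terms.
        r = sigma_b / sigma_ema
        d = (mu_b - self.mu_ema) / sigma_ema
        x_hat = (x - mu_b) / sigma_b * r + d
        # update the moving averages with the newly observed batch
        self.mu_ema += self.momentum * (mu_b - self.mu_ema)
        self.var_ema += self.momentum * (var_b - self.var_ema)
        return self.gamma * x_hat + self.beta
```

Algebraically, the forward output equals normalization by the moving statistics, `(x - mu_ema) / sigma_ema`, so the estimate aggregates information from all observed test batches rather than the current one alone, while gradients still flow through the batch statistics.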
TBR can be applied directly to common BN-based pre-trained models and does not interfere with the training process (corresponding to the fully test-time adaptation setting), and DOT can be easily combined with other adaptation approaches as well. Table 1 compares our method to others on CIFAR100-C across various scenarios. The existing test-time adaptation methods behave differently across the four scenarios and show performance degradation in some of them, whereas our tools perform well in all four scenarios simultaneously without any prior knowledge of the test data, which is important for real-world applications. Thus, the whole method is named DELTA (Degradation-freE fuLly Test-time Adaptation). The major contributions of our work are as follows. (i) We expose the defects in commonly used test-time adaptation methods, which ultimately harm adaptation performance. (ii) We demonstrate that the defects become even more severe in complex test environments, causing performance degradation. (iii) To achieve degradation-free fully test-time adaptation, we propose DELTA, which comprises two components, TBR and DOT, to improve the normalization statistics estimates and mitigate the bias within optimization. (iv) We evaluate DELTA on three common datasets with four scenarios and a newly introduced real-world dataset, and find that it consistently improves the popular test-time adaptation methods in all scenarios, yielding new state-of-the-art results.
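DOT's mechanics can likewise be sketched: a class-frequency vector is updated with momentum from the model's (pseudo-)labels on each batch, and each sample's loss weight is taken inversely related to the frequency of its predicted class, so dominant classes are down-weighted. The mapping function, momentum value, and per-sample update below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

class DynamicOnlineReweighter:
    """Sketch of dynamic online re-weighting (DOT): maintain a
    momentum-updated class-frequency vector over the test stream and
    assign low/high loss weights to frequent/infrequent classes."""

    def __init__(self, num_classes, momentum=0.95):
        # start from a uniform class-frequency estimate
        self.freq = np.full(num_classes, 1.0 / num_classes)
        self.momentum = momentum

    def weights(self, probs):
        """probs: (batch, num_classes) softmax outputs for the current batch.
        Returns one loss weight per sample, normalized to mean 1."""
        pseudo = probs.argmax(axis=1)
        # momentum update of the class-frequency vector with pseudo-labels
        for c in pseudo:
            onehot = np.zeros_like(self.freq)
            onehot[c] = 1.0
            self.freq = self.momentum * self.freq + (1 - self.momentum) * onehot
        # inverse-frequency mapping: dominant classes get low weight
        w = 1.0 / (self.freq[pseudo] + 1e-8)
        return w / w.mean()
```

Because the frequency vector is seeded before adaptation and updated online, it can absorb bias from the pre-trained model's predictions, from the test stream itself, and from the adaptation dynamics, matching the multiple bias sources discussed above.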

2. RELATED WORK

Unsupervised domain adaptation (UDA). In reality, test distribution is frequently inconsistent with the training distribution, resulting in poor performance. UDA aims to alleviate the phenomenon with the collected unlabeled samples from the target distribution. One popular approach is to align the statistical moments across different distributions (Gretton et al., 2006; Zellinger et al., 2017; Long et al., 2017) . Another line of studies adopts adversarial training to achieve adaptation (Ganin et al., 2016;  



Figure 1: IS+CB / DS+CB: the test stream is independently / dependently sampled from a class-balanced test distribution; IS+CI / DS+CI: independently / dependently drawn from a class-imbalanced test distribution. Each bar represents a sample; each color represents a category.

Table 1: Comparison of fully test-time adaptation methods against the pre-trained model on CIFAR100-C. DELTA achieves improvement in all scenarios.

