DELTA: DEGRADATION-FREE FULLY TEST-TIME ADAPTATION

Abstract

Fully test-time adaptation aims to adapt a pre-trained model to the test stream during real-time inference, which is urgently required when the test distribution differs from the training distribution. Several efforts have been devoted to improving adaptation performance. However, we find that two unfavorable defects are concealed in prevalent adaptation methodologies such as test-time batch normalization (BN) and self-learning. First, we reveal that the normalization statistics in test-time BN are completely determined by the currently received test samples, resulting in inaccurate estimates. Second, we show that during test-time adaptation, the parameter update is biased towards some dominant classes. In addition to the extensively studied test stream with independent and class-balanced samples, we further observe that these defects can be exacerbated in more complicated test environments, such as (time) dependent or class-imbalanced data. We observe that previous approaches work well in certain scenarios but show performance degradation in others due to these defects. In this paper, we provide a plug-in solution called DELTA for Degradation-freE fuLly Test-time Adaptation, which consists of two components: (i) Test-time Batch Renormalization (TBR), introduced to improve the estimated normalization statistics, and (ii) Dynamic Online re-weighTing (DOT), designed to address the class bias within optimization. We investigate various test-time adaptation methods on three commonly used datasets with four scenarios, as well as a newly introduced real-world dataset. DELTA helps them handle all scenarios simultaneously, leading to SOTA performance.

1. INTRODUCTION

Models suffer from performance degradation when the test and training distributions are mismatched (Quinonero-Candela et al., 2008). Numerous studies have been conducted to narrow the performance gap based on a variety of hypotheses/settings. Unsupervised domain adaptation methods (Ganin et al., 2016) necessitate simultaneous access to labeled training data and unlabeled target data, limiting their applications. Source-free domain adaptation approaches (Liang et al., 2020) only need a trained model and do not require the original training data when performing adaptation. Nonetheless, in a more difficult and realistic setting, known as fully test-time adaptation (Wang et al., 2021), the model must perform online adaptation to the test stream during real-time inference. The model is adapted in a single pass over the test stream using a pre-trained model and continuously arriving test data (rather than a prepared target set). Offline iterative training, or heavy extra computation beyond normal inference, does not meet these requirements. There have been several studies aimed at fully test-time adaptation. Test-time BN (Nado et al., 2020) / BN adapt (Schneider et al., 2020) directly uses the normalization statistics derived from test samples instead of those inherited from the training data, which is found to be beneficial in reducing the performance gap. Entropy-minimization-based methods, such as TENT (Wang et al., 2021), further optimize model parameters during inference. Contrastive learning (Chen et al., 2022), data augmentation (Wang et al., 2022a), and uncertainty-aware optimization (Niu et al., 2022) have been introduced to enhance adaptation performance. Efforts have also been made to address test-time adaptation in more complex test environments, like LAME (Boudiaf et al., 2022). Despite the achieved progress, we find that there are non-negligible defects hidden in the popular methods.
First, we take a closer look at the normalization statistics within inference (Section 3.2). We observe that the statistics used in BN adapt are inaccurate in each mini-batch compared to the actual population statistics. Second, we reveal that the prevalent test-time model updating is biased towards some dominant categories (Section 3.3). We notice that the model predictions are extremely imbalanced on out-of-distribution data, which can be exacerbated by self-learning-based adaptation methods. Besides the most common independent and class-balanced test samples considered in existing studies, following Boudiaf et al. (2022), we investigate three other test scenarios as illustrated in Figure 1 (please see details in Section 3.1) and find that, when facing more intricate test streams, like dependent samples or class-imbalanced data, the prevalent methods suffer from severe performance degradation, which limits their usefulness. To address the aforementioned issues, we propose two powerful tools. Specifically, to handle the inaccurate normalization statistics, we introduce test-time batch renormalization (TBR) (Section 3.2), which uses the test-time moving averaged statistics to rectify the normalized features and accounts for the normalization during gradient optimization. By taking advantage of all the observed test samples, the calibrated normalization is more accurate. We further propose dynamic online re-weighting (DOT) (Section 3.3) to tackle the biased optimization, which is derived from cost-sensitive learning. To balance adaptation, DOT assigns low/high weights to the frequent/infrequent categories. The weight mapping function is based on a momentum-updated class-frequency vector that takes into account multiple sources of category bias, including the pre-trained model, the test stream, and the adaptation methods (the methods usually do not have an intrinsic bias towards certain classes, but can accentuate existing bias).
TBR can be applied directly to common BN-based pre-trained models and does not interfere with the training process (corresponding to the fully test-time adaptation setting), and DOT can be easily combined with other adaptation approaches as well. Table 1 compares our method to others on CIFAR100-C across various scenarios. The existing test-time adaptation methods behave differently across the four scenarios and show performance degradation in some of them, whereas our tools perform well in all four scenarios simultaneously without any prior knowledge of the test data, which is important for real-world applications. Thus, the whole method is named DELTA (Degradation-freE fuLly Test-time Adaptation). The major contributions of our work are as follows. (i) We expose the defects in commonly used test-time adaptation methods, which ultimately harm adaptation performance. (ii) We demonstrate that the defects become even more severe in complex test environments, causing performance degradation. (iii) To achieve degradation-free fully test-time adaptation, we propose DELTA, which comprises two components, TBR and DOT, to improve the normalization statistics estimates and mitigate the bias within optimization. (iv) We evaluate DELTA on three common datasets with four scenarios and a newly introduced real-world dataset, and find that it consistently improves the popular test-time adaptation methods in all scenarios, yielding new state-of-the-art results.

2. RELATED WORK

Unsupervised domain adaptation (UDA). In reality, the test distribution is frequently inconsistent with the training distribution, resulting in poor performance. UDA aims to alleviate this phenomenon using collected unlabeled samples from the target distribution. One popular approach is to align the statistical moments across different distributions (Gretton et al., 2006; Zellinger et al., 2017; Long et al., 2017). Another line of studies adopts adversarial training to achieve adaptation (Ganin et al., 2016; Long et al., 2018). UDA has been developed for many tasks including object classification (Saito et al., 2017) / detection (Li et al., 2021) and semantic segmentation (Hoffman et al., 2018).

Source-free domain adaptation (SFDA). SFDA deals with the domain gap given only the trained model and the prepared unlabeled target data. To be more widely applicable, SFDA methods should be built on a common source model trained by a standard pipeline. SHOT (Liang et al., 2020) freezes the source model's classifier and optimizes the feature extractor via entropy minimization, diversity regularization, and pseudo-labeling. SHOT incorporates weight normalization, 1D BN, and label smoothing into its backbones and training, which do not exist in most off-the-shelf trained models, but its other ideas can be reused. USFDA (Kundu et al., 2020) utilizes synthesized samples to achieve compact decision boundaries. NRC (Yang et al., 2021b) encourages label consistency among local target features, with the same network architecture as SHOT. GSFDA (Yang et al., 2021a) further expects the adapted model to perform well not only on target data but also on source data.

Fully test-time adaptation (FTTA). FTTA is a more difficult and realistic setting. As in SFDA, the source training data is unavailable and only the trained model is provided.
Unlike SFDA, FTTA cannot access the entire target dataset; instead, the methods should be capable of online adaptation on the test stream, providing instant predictions for the arriving test samples. BN adapt (Nado et al., 2020; Schneider et al., 2020) replaces the normalization statistics estimated during training with those derived from the test mini-batch. On top of it, TENT (Wang et al., 2021) optimizes the affine parameters in BN through entropy minimization during test. EATA (Niu et al., 2022) and CoTTA (Wang et al., 2022a) study long-term test-time adaptation in continually changing environments. ETA (Niu et al., 2022) excludes unreliable and redundant samples from the optimization. AdaContrast (Chen et al., 2022) resorts to contrastive learning to promote feature learning, along with a pseudo-label refinement mechanism. Both AdaContrast and CoTTA utilize heavy data augmentation during test, which increases inference latency. Besides, AdaContrast modifies the model architecture as in SHOT. Different from them, LAME (Boudiaf et al., 2022) does not rectify the model's parameters but only the model's output probabilities, via the introduced unsupervised objective, Laplacian adjusted maximum-likelihood estimation.

Class-imbalanced learning.

Training with class-imbalanced data has attracted widespread attention (Liu et al., 2019). Cost-sensitive learning (Elkan, 2001) and resampling (Wang et al., 2020) are the classical strategies for handling this problem. Ren et al. (2018) design a meta-learning paradigm to assign weights to samples. Class-balanced loss (Cui et al., 2019) uses the effective number of samples when performing re-weighting. Decoupled training (Kang et al., 2020b) learns the feature extractor and the classifier separately. Menon et al. (2021) propose logit adjustment from a statistical perspective. Other techniques, such as weight balancing (Alshammari et al., 2022; Zhao et al., 2020), contrastive learning (Kang et al., 2020a), and knowledge distillation (He et al., 2021), have also been applied to this problem.

3.1. PROBLEM DEFINITION

Assume that we have the training data $\mathcal{D}_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{N_{\text{train}}} \sim P_{\text{train}}(x, y)$, where $x \in \mathcal{X}$ is the input and $y \in \mathcal{Y} = \{1, 2, \cdots, K\}$ is the target label; $f_{\{\theta_0, a_0\}}$ denotes the model with parameters $\theta_0$ and normalization statistics $a_0$ learned or estimated on $\mathcal{D}_{\text{train}}$. Without loss of generality, we denote the test stream as $\mathcal{D}_{\text{test}} = \{(x_j, y_j)\}_{j=1}^{N_{\text{test}}} \sim P_{\text{test}}(x, y)$, where $\{y_j\}$ are actually unavailable; the subscript $j$ also indicates the sample position within the test stream. When $P_{\text{test}}(x, y) \neq P_{\text{train}}(x, y)$ (the input/output space $\mathcal{X}/\mathcal{Y}$ is consistent between training and test data), $f_{\{\theta_0, a_0\}}$ may perform poorly on $\mathcal{D}_{\text{test}}$. Under the fully test-time adaptation scheme (Wang et al., 2021), during inference step $t \geq 1$, the model $f_{\{\theta_{t-1}, a_{t-1}\}}$ receives a mini-batch of test data $\{x_{m_t+b}\}_{b=1}^{B}$ with batch size $B$ ($m_t$ is the number of test samples observed before inference step $t$), then updates itself to $f_{\{\theta_t, a_t\}}$ based on the current test mini-batch, and outputs the real-time predictions $\{p_{m_t+b}\}_{b=1}^{B}$ ($p \in \mathbb{R}^K$). Finally, the evaluation metric is calculated based on the online predictions from each inference step. Fully test-time adaptation emphasizes performing adaptation entirely during real-time inference, i.e., the training process cannot be interrupted, the training data is no longer available during test, and the adaptation must be accomplished in a single pass over the test stream. The most common hypothesis is that $\mathcal{D}_{\text{test}}$ is independently sampled from $P_{\text{test}}(x, y)$. However, in real environments, this assumption does not always hold, e.g., samples of some classes may appear more frequently in a certain period of time, leading to another hypothesis: the test samples are dependently sampled. Moreover, most studies only consider the scenario with class-balanced test samples, while in the real world, the test stream can be class-imbalanced.
We investigate fully test-time adaptation under the four scenarios below, considering the latent sampling strategy and the test class distribution. For convenience, we denote the scenario where test samples are independently/dependently sampled from a class-balanced test distribution as IS+CB / DS+CB, and the scenario where test samples are independently/dependently sampled from a class-imbalanced test distribution as IS+CI / DS+CI, as shown in Figure 1. Among them, IS+CB is the most common scenario in FTTA studies, while the other three also frequently appear in real-world applications.
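To make the protocol concrete, the single-pass online loop can be sketched as follows (a minimal illustration with placeholder names such as `adapt_step` and a stub "model"; the real forward/update logic is whatever the chosen adaptation method defines):

```python
# Minimal sketch of a fully test-time adaptation (FTTA) loop. The model
# must emit real-time predictions and adapt itself per mini-batch, in a
# single pass over the stream. All names here are illustrative.

def adapt_step(model, batch):
    """Forward the batch, compute an unsupervised loss, update `model` in
    place, and return real-time predictions. Here: a stub that counts steps."""
    model["steps_seen"] += 1
    return [model["steps_seen"]] * len(batch)   # placeholder predictions

def run_stream(model, stream, batch_size):
    """Single pass over the test stream: predict and adapt per mini-batch."""
    predictions = []
    for i in range(0, len(stream), batch_size):
        batch = stream[i:i + batch_size]
        predictions.extend(adapt_step(model, batch))  # online: no second pass
    return predictions

model = {"steps_seen": 0}
preds = run_stream(model, list(range(10)), batch_size=4)
```

The key constraint captured here is that evaluation is computed on the predictions emitted at each step, so a sample never benefits from adaptation performed after it has been predicted.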

3.2. A CLOSER LOOK AT NORMALIZATION STATISTICS

We revisit BN (Ioffe & Szegedy, 2015) briefly. Let $v \in \mathbb{R}^{B \times C \times S \times S'}$ be a mini-batch of features with $C$ channels, height $S$, and width $S'$. BN normalizes $v$ with the normalization statistics $\mu, \sigma \in \mathbb{R}^C$: $v^* = \frac{v - \mu}{\sigma}$, $v^\star = \gamma \cdot v^* + \beta$, where $\gamma, \beta \in \mathbb{R}^C$ are the learnable affine parameters, $\{\gamma, \beta\} \subset \theta$. We mainly focus on the first part $v \to v^*$ (all the discussed normalization methods adopt the affine parameters). In BN, during training, $\mu, \sigma$ are set to the empirical mean $\mu_{\text{batch}}$ and standard deviation $\sigma_{\text{batch}}$ calculated for each channel $c$: $\mu_{\text{batch}}[c] = \frac{1}{BSS'} \sum_{b,s,s'} v[b,c,s,s']$, $\sigma_{\text{batch}}[c] = \sqrt{\frac{1}{BSS'} \sum_{b,s,s'} (v[b,c,s,s'] - \mu_{\text{batch}}[c])^2 + \epsilon}$, where $\epsilon$ is a small value to avoid division by zero. During inference, $\mu, \sigma$ are set to $\mu_{\text{ema}}, \sigma_{\text{ema}}$, the exponential-moving-average (EMA) estimates over the training process ($a_0$ is formed by the EMA statistics of all BN modules). However, when $P_{\text{test}}(x,y) \neq P_{\text{train}}(x,y)$, studies found that replacing $\mu_{\text{ema}}, \sigma_{\text{ema}}$ with the statistics of the test mini-batch, $\hat{\mu}_{\text{batch}}, \hat{\sigma}_{\text{batch}}$, can improve model accuracy (Nado et al., 2020) (for clarity, statistics estimated on test samples are denoted with '$\hat{\ }$'). The method is also marked as "BN adapt" (Schneider et al., 2020). A natural remedy for the fluctuant per-batch statistics is a test-time EMA (TEMA): $\hat{\mu}_{\text{ema}}^{t} = \alpha \cdot \hat{\mu}_{\text{ema}}^{t-1} + (1-\alpha) \cdot \text{sg}(\hat{\mu}_{\text{batch}}^{t})$, $\hat{\sigma}_{\text{ema}}^{t} = \alpha \cdot \hat{\sigma}_{\text{ema}}^{t-1} + (1-\alpha) \cdot \text{sg}(\hat{\sigma}_{\text{batch}}^{t})$, where $\text{sg}(\cdot)$ stands for the stop-gradient operation (e.g., the Tensor.detach() function in PyTorch) and $\alpha$ is a smoothing coefficient. TEMA can consistently improve BN adapt: the normalization statistics in Figure 2 become more stable and accurate, and the test accuracy in Table 2 is improved as well. However, for TENT, which involves parameter updates, TEMA can destroy the trained model as shown in Table 2.
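The test-time EMA update above can be sketched per channel as follows (a NumPy illustration under the assumption of features flattened to shape (B, C); NumPy carries no gradients, so the paper's stop-gradient sg(.) is implicit):

```python
import numpy as np

# Sketch of the test-time EMA (TEMA) statistics update for one mini-batch.
def tema_update(mu_ema, sigma_ema, v, alpha=0.95, eps=1e-5):
    """Blend the current mini-batch statistics into the moving estimates.
    v: (B, C) features; mu_ema, sigma_ema: (C,) running statistics."""
    mu_batch = v.mean(axis=0)                   # per-channel mean
    sigma_batch = np.sqrt(v.var(axis=0) + eps)  # per-channel std
    mu_ema = alpha * mu_ema + (1 - alpha) * mu_batch
    sigma_ema = alpha * sigma_ema + (1 - alpha) * sigma_batch
    return mu_ema, sigma_ema
```

With α close to 1, the estimates change slowly and aggregate information from many mini-batches, which is what stabilizes the statistics compared to using each mini-batch alone.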
As discussed in Ioffe & Szegedy (2015), simply employing the moving averages would neutralize the effects of gradient optimization and normalization, as the gradient descent optimization does not consider the normalization, leading to unlimited growth of model parameters. Thus, we introduce batch renormalization (Ioffe, 2017) into test-time adaptation, leading to TBR, which is formulated as $v^* = \frac{v - \hat{\mu}_{\text{batch}}}{\hat{\sigma}_{\text{batch}}} \cdot r + d$, where $r = \frac{\text{sg}(\hat{\sigma}_{\text{batch}})}{\hat{\sigma}_{\text{ema}}}$ and $d = \frac{\text{sg}(\hat{\mu}_{\text{batch}}) - \hat{\mu}_{\text{ema}}}{\hat{\sigma}_{\text{ema}}}$. We present a detailed algorithm description in Appendix A.2. Different from BN adapt, we use the test-time moving averages to rectify the normalization (through $r$ and $d$). Different from TEMA, TBR is well compatible with gradient-based adaptation methods (e.g., TENT) and can improve them as summarized in Table 2. For BN adapt, TEMA is equal to TBR. Different from the original batch renormalization used in the training phase, TBR is employed in the inference phase and uses the statistics and moving averages derived from test batches. Besides, as the adaptation starts from a trained model $f_{\{\theta_0, a_0\}}$, TBR discards the warm-up and the truncation operation on $r$ and $d$, and thus does not introduce additional hyper-parameters. TBR can be applied directly to a common pre-trained model with BN without requiring the model to be trained with such calibrated normalization. Building on BN adapt, TENT (Wang et al., 2021) further optimizes the affine parameters $\gamma, \beta$ through entropy minimization and shows that test-time parameter optimization can yield better results compared to employing BN adapt alone. We further take a closer look at this procedure.
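A minimal sketch of the TBR normalization (NumPy, with (B, C) features for brevity). Note that in the forward pass the output is numerically identical to normalizing with the moving statistics; what TBR changes is the backward pass, where gradients flow through the batch statistics but not through r and d, a distinction NumPy cannot model:

```python
import numpy as np

# Sketch of test-time batch renormalization (TBR). In a gradient framework,
# r and d would be detached (the sg(.) in the paper), so gradients still
# flow through mu_batch and sigma_batch.
def tbr_normalize(v, mu_ema, sigma_ema, eps=1e-5):
    mu_batch = v.mean(axis=0)
    sigma_batch = np.sqrt(v.var(axis=0) + eps)
    r = sigma_batch / sigma_ema              # scale rectification
    d = (mu_batch - mu_ema) / sigma_ema      # shift rectification
    return (v - mu_batch) / sigma_batch * r + d
```

Algebraically, (v - mu_batch)/sigma_batch * r + d = (v - mu_ema)/sigma_ema, which makes the "rectification toward the moving statistics" explicit.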

3.3. A CLOSER LOOK AT TEST-TIME OPTIMIZATION

Diagnosis II: the test-time optimization is biased towards dominant classes. We evaluate the model on IS+CB and DS+CB gaussian-noise-corrupted test data (Gauss) of CIFAR100-C. We also test the model on the original clean test set of CIFAR100 for comparison. Figure 3 depicts the per-class number of predictions, while Table 3 shows the corresponding standard deviation, range (maximum minus minimum), and accuracy. We draw the following five conclusions.

• Predictions are imbalanced, even for a model trained on class-balanced training data and tested on a class-balanced test set with $P_{\text{test}}(x,y) = P_{\text{train}}(x,y)$: the "clean" curve in Figure 3 (left) with standard deviation 8.3 and range 46. This phenomenon is also studied in Wang et al. (2022b).
• Predictions become more imbalanced when $P_{\text{test}}(x,y) \neq P_{\text{train}}(x,y)$, as shown in Figure 3 (left): the ranges are 46 and 956 on the clean and corrupted test sets respectively.
• BN adapt+TEMA improves accuracy (from 27.0% to 58.0%) and alleviates the prediction imbalance at the same time (the range drops from 956 to 121.6).
• Though accuracy is further improved with TENT+TBR (from 58.0% to 62.2%), the predictions inversely become more imbalanced (the range changes from 121.6 to 269.8). The entropy minimization loss focuses on data with low entropy, while samples of some classes may have relatively lower entropy owing to the trained model; thus TENT aggravates the prediction imbalance.
• On dependent test streams, not only does the model accuracy drop, but the predictions also become more imbalanced (range 269.8 / 469.2 on independent/dependent samples for TENT+TBR), as the model may be absolutely dominated by some classes over a period of time in the DS+CB scenario.

Algorithm 1: Dynamic Online re-weighTing (DOT)
Input: inference step t := 0; test stream samples {x_j}; pre-trained model f_{θ_0, a_0}; class-frequency vector z_0; loss function L; smoothing coefficient λ.
1:  while the test mini-batch {x_{m_t+b}}_{b=1}^{B} arrives do
2:      t = t + 1
3:      {p_{m_t+b}}_{b=1}^{B}, f_{θ_{t-1}, a_t} ← Forward({x_{m_t+b}}_{b=1}^{B}, f_{θ_{t-1}, a_{t-1}})   // output predictions
4:      for b = 1 to B do
5:          k*_{m_t+b} = argmax_{k∈[1,K]} p_{m_t+b}[k]   // predicted label
6:          w_{m_t+b} = 1 / (z_{t-1}[k*_{m_t+b}] + ε)   // assign sample weight
7:      w̄_{m_t+b} = B · w_{m_t+b} / Σ_{b'=1}^{B} w_{m_t+b'}, b = 1, 2, ..., B   // normalize sample weights
8:      l = (1/B) Σ_{b=1}^{B} w̄_{m_t+b} · L(p_{m_t+b})   // combine sample weights with loss
9:      f_{θ_t, a_t} ← Backward & Update(l, f_{θ_{t-1}, a_t})   // update θ
10:     z_t ← λ · z_{t-1} + ((1-λ)/B) · Σ_{b=1}^{B} p_{m_t+b}   // update z

Imbalanced data is harmful during the normal training phase, resulting in biased models and poor overall accuracy (Liu et al., 2019; Menon et al., 2021). Our main motivation is that test-time adaptation methods also involve gradient-based optimization built on the model predictions; however, the predictions are actually imbalanced, particularly for dependent or class-imbalanced streams and the low-entropy-emphasized adaptation methods. Therefore, we argue that the test-time optimization is in fact biased towards some dominant classes, resulting in inferior performance. A vicious circle is formed by skewed optimization and imbalanced predictions. Given these observations, we propose DOT as presented in Algorithm 1. DOT is mainly derived from class-wise re-weighting (Cui et al., 2019). To tackle the dynamically changing and unknown class frequencies, we use a momentum-updated class-frequency vector $z \in \mathbb{R}^K$ instead (Line 10 of Algorithm 1), which is initialized with $z[k] = \frac{1}{K}, k = 1, 2, \cdots, K$. For each inference step, we assign a weight to each test sample based on its pseudo label and the current $z$ (Lines 5-6 of Algorithm 1). Specifically, when $z[k]$ is relatively large, during the subsequent adaptation, DOT will reduce the contributions of the $k$-th class samples (by pseudo label) and emphasize others.
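DOT's per-sample weighting and the momentum update of z can be sketched in isolation as follows (a NumPy illustration, not the paper's implementation; `p` is assumed to hold softmax predictions of shape (B, K), and ε guards against division by zero):

```python
import numpy as np

# Sketch of DOT's sample weighting and class-frequency tracking.
def dot_weights(p, z, eps=1e-8):
    """Weight each sample by the inverse running frequency of its pseudo
    label, then normalize the weights to mean 1 over the batch."""
    pseudo = p.argmax(axis=1)          # pseudo labels k*
    w = 1.0 / (z[pseudo] + eps)        # rare classes -> large weights
    return len(w) * w / w.sum()

def update_z(z, p, lam=0.95):
    """Momentum update of the class-frequency vector from soft predictions."""
    return lam * z + (1 - lam) * p.mean(axis=0)
```

Because z is updated with soft predictions that sum to one, it remains a valid frequency estimate, and samples whose pseudo class already dominates z contribute less to subsequent updates.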
It is worth noting that DOT can alleviate the biased optimization caused by the pre-trained model (e.g., inter-class similarity) and by the test stream (e.g., the class-imbalanced scenario) simultaneously. DOT is a general recipe for tackling the biased optimization; some parts of Algorithm 1 have multiple options, so it can be combined with different existing test-time adaptation techniques. For the "Forward(·)" function (Line 3 of Algorithm 1), the discussed BN adapt and TBR can be incorporated. For the loss function $L(\cdot)$ (Line 8 of Algorithm 1), studies usually employ the entropy minimization loss $L(p_b) = -\sum_{k=1}^{K} p_b[k] \log p_b[k]$, or the cross-entropy loss with pseudo labels $L(p_b) = -\mathbb{I}_{p_b[k_b^*] \geq \tau} \cdot \log p_b[k_b^*]$ (commonly, only samples with high prediction confidence are utilized; $\tau$ is a pre-defined threshold). Similarly, for entropy minimization, Ent-W (Niu et al., 2022) also discards the high-entropy samples and emphasizes the low-entropy ones: $L(p_b) = -\mathbb{I}_{H_b < \tau} \cdot e^{\tau - H_b} \cdot \sum_{k=1}^{K} p_b[k] \log p_b[k]$, where $H_b$ is the entropy of sample $x_b$.
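These objectives can be sketched as follows (NumPy, one probability vector per sample; an illustration of the formulas above rather than any official implementation):

```python
import numpy as np

# Sketches of the three test-time objectives, for a probability vector p
# of shape (K,). tau denotes the respective threshold in each loss.
def entropy_loss(p, eps=1e-12):
    """Entropy minimization objective H(p)."""
    return -np.sum(p * np.log(p + eps))

def pseudo_label_ce(p, tau=0.4, eps=1e-12):
    """Cross-entropy with the pseudo label, kept only for confident samples."""
    k = p.argmax()
    return -np.log(p[k] + eps) if p[k] >= tau else 0.0

def ent_w_loss(p, tau, eps=1e-12):
    """Ent-W: drop high-entropy samples, up-weight low-entropy ones."""
    h = entropy_loss(p, eps)
    return np.exp(tau - h) * h if h < tau else 0.0
```

The exponential factor exp(τ − H) is what makes Ent-W emphasize low-entropy samples, and it is also the mechanism that can amplify class bias, which DOT then counteracts.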

4. EXPERIMENTS

Datasets and models. We conduct experiments on the common datasets CIFAR100-C, ImageNet-C (Hendrycks & Dietterich, 2019), ImageNet-R (Hendrycks et al., 2021), and a newly introduced video (segments) dataset: a subset of YouTube-BoundingBoxes (YTBB-sub) (Real et al., 2017). CIFAR100-C / ImageNet-C contains 15 corruption types, each with 5 severity levels; we use the highest level unless otherwise specified. ImageNet-R contains various styles (e.g., paintings) of ImageNet categories. Following Wang et al. (2022a); Niu et al. (2022), for evaluations on CIFAR100-C, we adopt the trained ResNeXt-29 (Xie et al., 2017) model.

Metrics. Unless otherwise specified, we report the mean accuracy over classes (Acc, %) (Liu et al., 2019); results are averaged over the 15 corruption types for CIFAR100-C and ImageNet-C in the main text; please see the detailed performance on each corruption type in Appendix A.5, A.6.

Implementation. The configurations mainly follow previous work (Wang et al., 2021; 2022a; Niu et al., 2022) for comparison; details are listed in Appendix A.3. Code is available online.

Baselines. We adopt the following SOTA methods as baselines: pseudo label (PL) (Lee et al., 2013), test-time augmentation (TTA) (Ashukha et al., 2020), BN adaptation (BN adapt) (Schneider et al., 2020; Nado et al., 2020), test-time entropy minimization (TENT) (Wang et al., 2021), marginal entropy minimization with one test point (MEMO) (Zhang et al., 2021), efficient test-time adaptation (ETA) (Niu et al., 2022), entropy-based weighting (Ent-W) (Niu et al., 2022), Laplacian adjusted maximum-likelihood estimation (LAME) (Boudiaf et al., 2022), and continual test-time adaptation (CoTTA/CoTTA*: with/without resetting) (Wang et al., 2022a). We combine DELTA with PL, TENT, and Ent-W in this work.

Evaluation in IS+CB scenario. The results on CIFAR100-C are reported in Table 4.
As can be seen, the proposed DELTA consistently improves the previous adaptation approaches PL (gain 0.7%), TENT (gain 0.8%), and Ent-W (gain 0.8%), achieving new state-of-the-art performance. The results also indicate that current test-time adaptation methods indeed suffer from the discussed drawbacks, and the proposed tools help them obtain superior performance. We then evaluate the methods on the more challenging dataset ImageNet-C. Consistent with the results on CIFAR100-C, DELTA remarkably improves the existing methods. As the adaptation batch size (64) is small compared to the number of classes (1,000) on ImageNet-C, the previous methods suffer more severe damage than on CIFAR100-C. Consequently, DELTA achieves greater gains on ImageNet-C: 1.6% over PL, 2.4% over TENT, and 5.6% over Ent-W.

Evaluation in IS+CI and DS+CI scenarios. We construct class-imbalanced test streams with an imbalance factor π (detailed in Appendix A.1) and test models with π ∈ {0.1, 0.05} (similarly, we show the extreme experiments with π = 0.001 in Appendix A.4). Table 6 summarizes the results in IS+CI and DS+CI scenarios, with the following observations: (i) Under the class-imbalanced scenario, the performance degradation is not as severe as under dependent data. This is primarily because the imbalanced test data has relatively little effect on the normalization statistics. DELTA works well on the imbalanced test stream. (ii) The hybrid DS+CI scenario can be more difficult than the individual scenarios. DELTA also boosts the baselines in the hybrid scenario. (iii) Though the low-entropy-emphasized method Ent-W improves TENT in the IS+CB scenario (Table 4), it can be inferior to TENT in dependent or class-imbalanced scenarios (the results on ImageNet-C in Tables 5, 6). The reason is that Ent-W has a side effect, amplifying the class bias, which can neutralize or even overwhelm its benefits. DELTA eliminates Ent-W's side effect while retaining its benefits, so Ent-W+DELTA always significantly outperforms TENT+DELTA.
Evaluation on realistic out-of-distribution datasets ImageNet-R and YTBB-sub. ImageNet-R is inherently class-imbalanced and consists of mixed variants such as cartoon, art, painting, sketch, toy, etc. As shown in Table 8, DELTA also leads to consistent improvement on it. However, unlike ImageNet-C, ImageNet-R was collected independently and contains more hard cases that remain difficult for DELTA to recognize, so the gain is not as large as on ImageNet-C. For YTBB-sub, dependent and class-imbalanced samples are encountered naturally. We see that the classical methods suffer from severe degradation, whereas DELTA helps them achieve good performance.

Evaluation on in-distribution test data. A qualified FTTA method should be "safe" on in-distribution datasets, i.e., when $P_{\text{test}}(x,y) = P_{\text{train}}(x,y)$.

Contribution of each component of DELTA. DELTA consists of two tools: TBR and DOT. In Table 9, we analyze their contributions on the basis of TENT with four scenarios and two datasets. Row #1 gives the results of TENT. Applying either TBR or DOT alone on TENT brings gains in most scenarios and datasets. However, we find that TBR achieves less improvement when the test stream is IS+CB and the batch size is large (e.g., performing adaptation with TBR alone on the IS+CB data of CIFAR100-C with a batch size of 200 does not improve TENT). When the batch size is relatively small (e.g., ImageNet-C, batch size of 64), the benefits of TBR become apparent. More importantly, TBR is extremely effective and necessary for dependent samples.

Comparing DOT with other techniques for class imbalance. On the basis of Ent-W+TBR, Table 10 compares DOT against the following strategies for addressing class imbalance. Diversity-based weighting (Div-W) (Niu et al., 2022) computes the cosine similarity between an arriving test sample's prediction and a moving-average one like z, then only employs the samples with low similarity to update the model.
Although the method was proposed to reduce redundancy, we find it can resist class imbalance too. The method relies on a predefined similarity threshold to determine whether to use a sample. We report the results of Div-W with varying thresholds (shown in parentheses). We observe that the threshold is very sensitive and the optimal value varies greatly across datasets. Logit adjustment (LA) (Menon et al., 2021) shows strong performance when training on imbalanced data. Following Wang et al. (2022b), we can perform LA with the estimated class-frequency vector z in test-time adaptation tasks. However, we find that LA does not show satisfactory results here. We speculate that this is because the estimated class distribution is not accurate under one-pass adaptation with a small batch size, while LA requires a high-quality class distribution estimate. The KL divergence regularizer (KL-div) (Mummadi et al., 2021) augments the loss function to encourage the predictions of test samples to be uniform. However, this is not always reasonable for TTA; e.g., for class-imbalanced test data, forcing the outputs to be uniform conversely hurts the performance. We examine multiple regularization strength options (shown in parentheses) and report the best two. The results show that KL-div is clearly inferior in dependent or class-imbalanced scenarios. We further propose another strategy called Sample-drop. It records the (pseudo) categories of the test samples that have been employed; then Sample-drop directly discards a newly arrived test sample (i.e., does not use the sample to update the model) if its pseudo category belongs to the majority classes among the counts. This simple strategy is valid but inferior to DOT, as it completely drops too many useful samples.

Impacts of α in TBR and λ in DOT.
Similar to most exponential-moving-average-based methods, when the smoothing coefficient α (or λ) is too small, the adaptation may be unstable; when α (or λ) is too large, the adaptation would be slow. Figure 5 provides the ablation studies of α (left) and λ (right) on the DS+CB (ρ = 0.5) samples of CIFAR100-C (from the validation set). We find that TBR and DOT perform reasonably well under a wide range of α and λ.

5. CONCLUSION

In this paper, we expose the defects in test-time adaptation methods which cause suboptimal or even degraded performance, and propose DELTA to mitigate them. First, the normalization statistics used in BN adapt are heavily influenced by the current test mini-batch, which can be one-sided and highly fluctuant. We introduce TBR to improve them using the (approximate) global statistics. Second, the optimization is highly skewed towards dominant classes, making the model more biased. DOT alleviates this problem by re-balancing the contributions of each class in an online manner. The combination of these two powerful tools results in our plug-in method DELTA, which achieves improvement in different scenarios (IS+CB, DS+CB, IS+CI, and DS+CI) at the same time.

A.1 TEST SCENARIOS AND THE YTBB-SUB DATASET

For real-world applications with dependent and class-imbalanced test samples, we consider an automatic video content moderation task (e.g., for a short-video platform), which needs to recognize the categories of interest from the extracted frames. It is exactly a natural DS+CI scenario. We collect 1,686 test videos from YouTube, which are annotated in the YouTube-BoundingBoxes dataset. 49,006 video segments are extracted from these videos and form the test stream in this experiment, named YTBB-sub here. We consider 21 categories. For the trained model, we adopt a ResNet18 trained on the related images from the COCO dataset. Thus, there is a natural difference between the training domain and the test domain. The consecutive video segments form naturally dependent samples (an object usually persists over several frames), as shown in Figure 8. Moreover, the test class distribution is also skewed naturally, as shown in Figure 8.
To simulate dependent test samples, for each class $k$ we sample $q_k \sim \text{Dir}_J(\rho)$, $q_k \in \mathbb{R}^J$, and allocate a $q_{k,j}$ proportion of the $k$-th class samples to piece $j$; the $J$ pieces are then concatenated to form a test stream in our experiments ($J$ is set to 10 for all experiments). $\rho > 0$ is a concentration factor: when $\rho$ is small, samples belonging to the same category concentrate in the test stream. To simulate class-imbalanced test samples, we re-sample data points with an exponential decay in frequencies across the classes. We control the degree of imbalance through an imbalance factor $\pi$, defined as the ratio between the sample sizes of the least frequent class and the most frequent class. For the DS+CI scenario, we construct a class-imbalanced test set first, then the final test samples are dependently sampled from it.

A.2 ALGORITHM OF TBR

Input: a mini-batch of test features v ∈ R^{B×C×S×S'}; affine parameters γ ∈ R^C, β ∈ R^C; current test-time moving mean μ̂_ema ∈ R^C and standard deviation σ̂_ema ∈ R^C; smoothing coefficient α.
μ̂_batch[c] = (1/(BSS')) Σ_{b,s,s'} v[b,c,s,s'], c = 1, 2, ..., C   // get mean (for each channel)
σ̂_batch[c] = sqrt((1/(BSS')) Σ_{b,s,s'} (v[b,c,s,s'] - μ̂_batch[c])² + ε), c = 1, 2, ..., C   // get standard deviation (for each channel)
r = sg(σ̂_batch) / σ̂_ema   // get r
d = (sg(μ̂_batch) - μ̂_ema) / σ̂_ema   // get d
v* = (v - μ̂_batch) / σ̂_batch · r + d   // normalize
v⋆ = γ · v* + β   // scale and shift
μ̂_ema ← α · μ̂_ema + (1 - α) · sg(μ̂_batch)   // update μ̂_ema
σ̂_ema ← α · σ̂_ema + (1 - α) · sg(σ̂_batch)   // update σ̂_ema
Output: v⋆, μ̂_ema, σ̂_ema

A.3 IMPLEMENTATIONS

We use the Adam optimizer with a learning rate of 1e-3 and batch size of 200 for CIFAR100-C; the SGD optimizer with a learning rate of 2.5e-4 and batch size of 64 for ImageNet-C/-R; and the SGD optimizer with a learning rate of 2.5e-4 and batch size of 200 for YTBB-sub. For DELTA, the hyper-parameters α and λ are roughly selected from {0.9, 0.95, 0.99, 0.999} on validation sets, e.g., the extra sets with corruption types outside the 15 types used in the benchmark.
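The stream construction described in Appendix A.1 can be sketched as follows (a NumPy simulation under assumed helper names; `rho`, `pi`, and `J` follow the paper's notation):

```python
import numpy as np

# Sketch of the test-stream construction: Dirichlet-based dependent
# ordering (concentration rho) and exponential class imbalance (factor pi).
rng = np.random.default_rng(0)

def dependent_order(labels, num_classes, rho=0.5, J=10):
    """Allocate each class's samples across J pieces via Dirichlet(rho),
    then concatenate the pieces; small rho concentrates each class."""
    pieces = [[] for _ in range(J)]
    for k in range(num_classes):
        idx = np.flatnonzero(labels == k)
        q = rng.dirichlet([rho] * J)                      # per-class allocation
        cuts = (np.cumsum(q)[:-1] * len(idx)).astype(int)
        for j, part in enumerate(np.split(rng.permutation(idx), cuts)):
            pieces[j].extend(part.tolist())
    return [i for piece in pieces for i in piece]

def imbalanced_sizes(n_max, num_classes, pi=0.1):
    """Exponential decay so least/most frequent class sizes have ratio pi."""
    return [int(n_max * pi ** (k / (num_classes - 1)))
            for k in range(num_classes)]
```

For the DS+CI scenario, one would first subsample each class to the sizes from `imbalanced_sizes` and then reorder the remaining samples with `dependent_order`.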
The smoothing coefficient α in TBR is set to 0.95 for CIFAR100-C and ImageNet-C/-R and 0.999 for YTBB-sub; λ in DOT is set to 0.95 for ImageNet-C/-R and 0.9 for CIFAR100-C / YTBB-sub. We then summarize the implementation details of the compared methods, including BN adapt, PL, TENT, LAME, ETA, Ent-W, and CoTTA (CoTTA*). Unless otherwise specified, the optimizer, learning rate, and batch size are the same as those described in the main paper. For BN adapt, we follow the operation in Nado et al. (2020) and the official code of TENT (https://github.com/DequanWang/tent), i.e., using the test-time normalization statistics completely. Though one can introduce a hyper-parameter to adjust the trade-off between the current statistics and those inherited from the trained model (a_0) (Schneider et al., 2020), we find this strategy does not lead to significant improvement and its effect varies from dataset to dataset. For PL and TENT, besides the normalization statistics, we update the affine parameters in BN modules. The confidence threshold in PL is set to 0.4, which produces acceptable results in most cases. We adopt/modify the official implementation https://github.com/DequanWang/tent to produce the results of TENT/PL. For LAME, we use the k-NN affinity matrix with 5 nearest neighbors following Boudiaf et al. (2022) and the official implementation https://github.com/fiveai/LAME. For ETA, the entropy constant threshold is set to 0.4 × ln K (K is the number of task classes), and the similarity threshold is set to 0.4/0.05 for CIFAR/ImageNet experiments following the authors' suggestion and the official implementation https://github.com/mr-eggplant/EATA. For Ent-W, the entropy constant threshold is set to 0.4 or 0.5 times ln K. For CoTTA, the random augmentations include color jitter, random affine, Gaussian blur, random horizontal flip, and Gaussian noise; 32 augmentations are employed in this method.
The learning rate is set to 0.01 for ImageNet experiments following official implementation https://github.com/qinenergy/cotta. The restoration probability is set to 0.01 for CIFAR experiments and 0.001 for ImageNet experiments. The augmentation threshold is set to 0.72 for CIFAR experiments and 0.1 for ImageNet experiments. The exponential-moving-average factor is set to 0.999 for all experiments. CoTTA optimizes all learnable parameters during adaptation.
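For reference, the TBR update described above (the algorithm in the appendix) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: NumPy has no autograd, so the stop-gradient sg(·) is a no-op here (in a PyTorch implementation it corresponds to `.detach()`), and the function name and signature are our own.

```python
import numpy as np

def tbr_forward(v, gamma, beta, mu_ema, sigma_ema, alpha=0.95, eps=1e-5):
    """Test-time Batch Renormalization (TBR) on a feature map v of shape
    (B, C, S, S'); gamma, beta, mu_ema, sigma_ema all have shape (C,)."""
    mu_b = v.mean(axis=(0, 2, 3))                      # per-channel batch mean
    sigma_b = np.sqrt(v.var(axis=(0, 2, 3)) + eps)     # per-channel batch std
    r = sigma_b / sigma_ema                            # r = sg(sigma_batch) / sigma_ema
    d = (mu_b - mu_ema) / sigma_ema                    # d = (sg(mu_batch) - mu_ema) / sigma_ema
    c = (1, -1, 1, 1)                                  # broadcast over the channel axis
    v_star = (v - mu_b.reshape(c)) / sigma_b.reshape(c) * r.reshape(c) + d.reshape(c)
    out = gamma.reshape(c) * v_star + beta.reshape(c)  # scale and shift
    mu_ema = alpha * mu_ema + (1 - alpha) * mu_b       # update moving mean
    sigma_ema = alpha * sigma_ema + (1 - alpha) * sigma_b  # update moving std
    return out, mu_ema, sigma_ema
```

When μ_ema and σ_ema equal the current batch statistics, r = 1 and d = 0, so TBR reduces to plain test-time BN; the moving statistics can be initialized either from the first test mini-batch ("First") or from the trained model ("Inherit"), as discussed in A.4.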

A.4 ADDITIONAL ANALYSIS

Fully test-time adaptation with small (test) batch size. In the main paper, we report results with the default batch size following previous studies. Here, we study test-time adaptation with a much smaller batch size. A small batch size brings two serious challenges: the normalization statistics can be inaccurate and fluctuate dramatically, and the gradient-based optimization can be noisy. A previous study (Niu et al., 2022) employs a sliding window with L samples in total (including L − B previous samples, assuming L > B and L mod B = 0) to perform adaptation. However, this strategy significantly increases the computational cost: (L/B)× forward and backward passes, e.g., 64× when B = 1, L = 64. We employ another strategy, called "fast-inference and slow-update": when samples arrive, infer them instantly with the current model but do not perform adaptation; the model is updated with the recent L samples every L/B mini-batches. Thus, this strategy only needs 2× forward and 1× backward. Note that both strategies need to cache some recent test samples, which may be slightly at odds with "online adaptation". We evaluate TENT and DELTA on the IS+CB test stream of CIFAR100-C with batch sizes 128, 16, 8, and 1. The results are listed in Table 11. We find that TENT suffers from severe performance degradation when the batch size is small, because TENT always uses the normalization statistics derived from the test mini-batches and is thus still affected by the small batch size during "fast-inference". With the assistance of DELTA, the performance degradation can be significantly alleviated: accuracy drops by only 0.7% (from 69.8% to 69.1%) when B = 1.
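The "fast-inference and slow-update" strategy can be sketched as follows. This is a toy illustration under our own naming (`model_infer` and `model_update` are placeholder callables, not the paper's API): every arriving mini-batch is predicted immediately, and one adaptation step runs on the L cached samples every L/B mini-batches.

```python
from collections import deque
import numpy as np

def fast_infer_slow_update(model_infer, model_update, stream, B, L):
    """Predict each mini-batch instantly; adapt once per L/B mini-batches
    using the most recent L cached samples (~2x forward, 1x backward)."""
    assert L > B and L % B == 0
    cache = deque(maxlen=L)            # rolling buffer of recent test samples
    preds = []
    for i, batch in enumerate(stream, start=1):
        preds.append(model_infer(batch))      # fast inference, no adaptation
        cache.extend(batch)                   # remember samples for later
        if i % (L // B) == 0:                 # slow update: every L/B batches
            model_update(np.stack(list(cache)))
    return preds
```

Compared with the sliding-window scheme, each sample is forwarded at most twice (once for prediction, once inside the periodic update) instead of L/B times.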
The initialization of TBR's normalization statistics. As described in Section 3.2, TBR keeps the moving normalization statistics μ_ema, σ_ema. There are two natural ways to initialize them: using the statistics μ_batch, σ_batch derived from the first test mini-batch ("First"), or using the statistics inherited from the trained model ("Inherit"). In the main paper, we use the "First" initialization strategy. However, it is worth noting that "First" is not reasonable for a very small batch size. We run TENT+DELTA with the two initialization strategies and different batch sizes on the IS+CB test stream of CIFAR100-C. Figure 9 summarizes the results: when the batch size is very small, using the inherited normalization statistics as initialization is better; when the batch size is acceptable (greater than 8 for CIFAR100-C), the "First" strategy is superior.

Performance under different severity levels on CIFAR100-C and ImageNet-C. In the main paper, for CIFAR100-C and ImageNet-C, we report the results with the highest severity level 5 following previous studies. Here, we investigate DELTA on top of TENT with different severity levels on CIFAR100-C (IS+CB scenario). Figure 10 presents the results. We observe that (i) as the corruption level increases, the model accuracy decreases; (ii) DELTA works well under all severity levels.

Performance in extreme cases. We examine the performance of DELTA under more extreme conditions: DS+CB with ρ = 0.01, IS+CI with π = 0.001. Table 12 shows that DELTA can manage these intractable cases.

Influence of random seeds. Fully test-time adaptation is established on a pre-trained model, i.e., it does not need random initialization, and methods like PL, TENT, Ent-W, and our DELTA involve no additional stochastic operations.

Ablation on DOT. We examine the performance of DOT with another way to obtain the sample weights (Lines 5-6 in Algorithm 1).
One can discard Line 5 and modify Line 6 to adopt the original soft probabilities: ω_{m_t+b} = Σ_{k=1}^{K} 1/(z_{t−1}[k] + ϵ) · p_{m_t+b}[k]. We compare the hard-label strategy (Algorithm 1) with the soft one in Table 14 (on the basis of Ent-W+TBR, on ImageNet-C). We find that both strategies work well in all scenarios, demonstrating the effectiveness of the idea behind DOT. The performance of the soft strategy is slightly worse than that of the hard strategy in some scenarios. However, we think it is difficult to say "hard labels are necessarily better than soft labels" or vice versa; for example, both strategies exist in recent semi-supervised methods: hard labels in FixMatch, soft labels in UDA.

Table 16 presents the results for all corruption types under different batch sizes and the two initialization strategies for the normalization statistics in TBR; the averaged results are illustrated in Table 11 and Figure 9, respectively.
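The hard and soft DOT weighting strategies compared in the ablation above can be sketched as follows. This is an illustrative NumPy sketch with our own function name and normalization choice: z is the running class-frequency estimate and probs the predicted probabilities of the current mini-batch.

```python
import numpy as np

def dot_weights(probs, z, eps=1e-8, hard=True):
    """Per-sample weights for Dynamic Online re-weighTing (DOT).
    probs: (B, K) predicted probabilities; z: (K,) class-frequency estimate.
    Hard: weight by the inverse frequency of the argmax (pseudo-label) class.
    Soft: omega_b = sum_k p_b[k] / (z[k] + eps), i.e. probability-weighted."""
    inv = 1.0 / (z + eps)                   # rare classes get large weights
    w = inv[probs.argmax(axis=1)] if hard else probs @ inv
    return w / w.sum() * len(w)             # normalize to mean 1 over the batch
```

Samples predicted as currently dominant classes thus contribute less to the parameter update, while minority-class samples are up-weighted.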



Regarding the training class distribution, in our experiments we primarily use models learned on balanced training data, following the benchmark of previous studies. When P_train(y) is skewed, some techniques are commonly used to bring the model closer to one trained on balanced data; for example, on YTBB-sub (Section 4), the trained model is learned with logit adjustment on class-imbalanced training data.
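For context, logit adjustment in its standard post-hoc form subtracts a scaled log-prior from the logits so that frequent classes are down-weighted (the paper applies the training-time variant on YTBB-sub; this sketch shows the post-hoc form, with an illustrative function name and a known training class prior assumed):

```python
import numpy as np

def logit_adjusted_probs(logits, class_prior, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) from the logits,
    then apply softmax, pushing the classifier toward balanced behavior."""
    adj = logits - tau * np.log(class_prior)       # penalize frequent classes
    adj = adj - adj.max(axis=1, keepdims=True)     # for numerical stability
    e = np.exp(adj)
    return e / e.sum(axis=1, keepdims=True)
```

A model that is indifferent between two classes will, after adjustment, favor the class that was rare during training.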



Figure 1: IS+CB / DS+CB: the test stream which is independently / dependently sampled from a class-balanced test distribution; IS+CI/ DS+CI: independently / dependently drawn from a class-imbalanced test distribution. Each bar represents a sample, each color represents a category.

Figure 3: Per-class number of predictions under combinations of [data, scenario, method].

Treatment II: Dynamic online re-weighting (DOT) can alleviate the biased optimization. Many methods have been developed to deal with class imbalance during the training phase, but they face several challenges in fully test-time adaptation: (i) network architectures are immutable; (ii) because test sample class frequencies are dynamic and agnostic, the common constraint of making the output distribution uniform (Liang et al., 2020) is no longer reasonable; (iii) inference and adaptation must occur in real time as each test mini-batch arrives (only a single pass through the test data, no iterative learning).

model from Hendrycks et al. (2020) as f_{θ0,a0}; for ImageNet-C/-R, we use the trained ResNet-50 model from Torchvision. The models are trained on the corresponding original training data. For YTBB-sub, we use a ResNet-18 trained on the related images of COCO. Details of the tasks, datasets, and examples are provided in Appendix A.1.

Figure 4: Across architecture.

Figure 5: Impacts of α and λ.

Figure 6: Different renditions of class n01694178 (African chameleon) from ImageNet-R.

Figure 7: Different corruption types of class n01694178 (African chameleon) from ImageNet-C.

Figure 8: Characteristics of the YTBB-sub dataset.

Figure 9: Comparison of two TBR initialization strategies on top of TENT+DELTA in IS+CB scenario on CIFAR100-C.

Figure 10: Comparison under different severity levels on CIFAR100-C.


[Table: mean accuracy (%) on ImageNet-C for class-imbalance techniques on top of Ent-W+TBR — Div-W(0.1): 69.3; Div-W(0.2): 69.8; Div-W(0.4): 69.7; LA: 70.0; Sample-drop: 70.2; Ent-W+DELTA: 70.2; per-corruption columns omitted.]

Comparison of fully test-time adaptation methods against the pretrained model on CIFAR100-C. DELTA achieves improvement in all scenarios.


[Table: comparison of BN adapt+TEMA, TENT+TBR, and TENT+TBR+DOT; column labels were not recoverable, with TENT+TBR+DOT achieving the best values in the last column of each block (63.9 ±0.2 and 60.4 ±0.5).]

Acc in IS+CB scenario.

Acc in DS+CB scenario with varying ρ.

[Table: accuracy of BN adapt, ETA, LAME, CoTTA, CoTTA*, PL, and their +DELTA counterparts under varying ρ; several row labels were not recoverable, with the +DELTA rows giving the highest accuracies throughout.]

Mean acc in IS+CI, DS+CI scenarios with different π.

[Table: mean accuracy of BN adapt, ETA, LAME, CoTTA, CoTTA*, PL, and their +DELTA counterparts under different π; several row labels were not recoverable, with the +DELTA rows giving the highest accuracies throughout.]

Results on in-distribution test set of CIFAR100.





Ablation on the effectiveness of each component (on top of TENT) measured in various scenarios: IS+CB, DS+CB (ρ=0.5), IS+CI (π=0.1), DS+CI (ρ=0.5, π=0.05).

Ablation on different techniques for class imbalance (on top of Ent-W+TBR) measured in various scenarios (same as in Table 9).

Results (classification accuracy, %) with different batch sizes on IS+CB test stream of CIFAR100-C.

Performance in extreme cases.

As a result, the adaptation results are always the same on one fixed test stream. However, random seeds can affect the sample order in our experiments. We study the influence of random seeds on Gauss and Shot data (IS+CB scenario) of ImageNet-C with seeds {2020, 2021, 2022, 2023}. The results of TENT and DELTA are summarized in Table 13, from which one can see that the methods are not greatly affected by the sample order within the same scenario. For a fair comparison, all methods are investigated under the same sample order for each specific scenario in our experiments.

Influence of random seeds. Classification accuracies (%) are reported on two kinds of corrupted data (IS+CB) of ImageNet-C under four random seeds (2020, 2021, 2022, and 2023).

Ablation on DOT.

Table 2 compared the usages of different normalization statistics; we further provide the detailed results for all corruption types in Table 15.

Table 17 summarizes the detailed performance on the IS+CB test stream with different severity levels.

Table 18 compares the test-time adaptation methods in the IS+CB scenario; Table 19 for the DS+CB test stream (ρ = 1.0), Table 20 for the DS+CB test stream (ρ = 0.5), and Table 21 for the DS+CB test stream (ρ = 0.1); Tables 22 and 23 for IS+CI data with π = 0.1 and π = 0.05; Table 24 / Table 25 for DS+CI test data with ρ = 0.5 and π = 0.1 / π = 0.05.

Table 26 compares the test-time adaptation methods in the IS+CB scenario, and Table 27 further compares them with different model architectures; Table 28, Table 29, and Table 30 for DS+CB test streams with ρ = 1.0, ρ = 0.5, and ρ = 0.1, respectively; Tables 31 and 32 for IS+CI data with π = 0.1 and π = 0.05; Table 33 / Table 34 for DS+CI test data with ρ = 0.5 and π = 0.1 / π = 0.05. The results in Table 15-Table 34 are obtained with seed 2020.

Comparison of the normalization statistics on IS+CB and DS+CB test streams of CIFAR100-C with B = 128 in terms of classification accuracy (%).

Comparison of different batch sizes and the initialization strategies for TBR's normalization statistics on IS+CB test stream of CIFAR100-C in terms of classification accuracy (%).

Classification accuracy (%) on IS+CB test stream of CIFAR100-C with different severity levels (B = 128).

Classification accuracy (%) on IS+CB test stream of CIFAR100-C.

Classification accuracy (%) on IS+CB test stream of ImageNet-C with different architectures.

[Table: classification accuracy (%) on the IS+CB test stream of ImageNet-C with different architectures; per-corruption columns omitted. Mean accuracy — ResNet50: Source 18.0, TENT 42.7, TENT+DELTA 45.1, Ent-W 44.6, Ent-W+DELTA 49.9; ResNet101: Source 22.5, TENT 46.4, TENT+DELTA 48.7, Ent-W 46.8, Ent-W+DELTA 53.0; ResNet152: Source 22.5, TENT 48.1, TENT+DELTA 50.2, Ent-W 49.5, Ent-W+DELTA 54.8; additional rows truncated.]

Classification accuracy (%) on DS+CB (ρ = 1.0) test stream of ImageNet-C.

[Table: partially recovered rows include an unlabeled method (mean 25.7), TENT+DELTA (mean 43.7), and Ent-W (mean 16.8); remaining rows truncated.]

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China under Grant 62171248, the R&D Program of Shenzhen under Grant JCYJ20220818101012025, the PCNL KEY project (PCL2021A07), and Shenzhen Science and Technology Innovation Commission (Research Center for Computer Network (Shenzhen) Ministry of Education).

