TOWARDS STABLE TEST-TIME ADAPTATION IN DYNAMIC WILD WORLD

Abstract

Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples. However, the online model updating of TTA may be unstable, and this is often a key obstacle preventing existing TTA methods from being deployed in the real world. Specifically, TTA may fail to improve or even harm model performance when test data have: 1) mixed distribution shifts, 2) small batch sizes, and 3) online imbalanced label distribution shifts, all of which are quite common in practice. In this paper, we investigate the reasons for this instability and find that the batch norm layer is a crucial factor hindering TTA stability. Conversely, TTA can perform more stably with batch-agnostic norm layers, i.e., group or layer norm. However, we observe that TTA with group and layer norms does not always succeed and still suffers many failure cases. By digging into these failure cases, we find that certain noisy test samples with large gradients may disturb the model adaptation and result in collapsed trivial solutions, i.e., assigning the same class label to all samples. To address this collapse issue, we propose a sharpness-aware and reliable entropy minimization method, called SAR, which further stabilizes TTA from two aspects: 1) removing partial noisy samples with large gradients, and 2) encouraging the model weights to go to a flat minimum so that the model is robust to the remaining noisy samples. Promising results demonstrate that SAR performs more stably than prior methods and is computationally efficient under the above wild test scenarios. The source code is available at https://github.com/mr-eggplant/SAR.

1. INTRODUCTION

Deep neural networks achieve excellent performance when the training and testing domains follow the same distribution (He et al., 2016; Wang et al., 2018; Choi et al., 2018). However, when domain shifts exist, deep networks often struggle to generalize. Such domain shifts usually occur in real applications, since test data may unavoidably encounter natural variations or corruptions (Hendrycks & Dietterich, 2019; Koh et al., 2021), such as weather changes (e.g., snow, frost, fog), sensor degradation (e.g., Gaussian noise, defocus blur), and many other factors. Unfortunately, deep models can be sensitive to these shifts and suffer severe performance degradation even when the shift is mild (Recht et al., 2018). Nevertheless, deploying a deep model on test domains with distribution shifts remains an urgent demand, and model adaptation is needed in these cases. Recently, numerous test-time adaptation (TTA) methods (Sun et al., 2020; Wang et al., 2021; Iwasawa & Matsuo, 2021; Bartler et al., 2022) have been proposed to conquer such domain shifts by online updating a model on the test data. They fall into two main categories, i.e., Test-Time Training (TTT) (Sun et al., 2020; Liu et al., 2021) and Fully TTA (Wang et al., 2021; Niu et al., 2022a). In this work, we focus on Fully TTA since it is more generally applicable than TTT in two aspects: i) it does not alter training and can adapt arbitrary pre-trained models to the test data without access to the original training data; ii) it may rely on fewer backward passes (one or fewer per test sample) than TTT (see efficiency comparisons of TTT, Tent and EATA in Table 6). TTA has been shown to boost model robustness to domain shifts significantly. However, its excellent performance is often obtained under mild test settings, e.g., adapting with a batch of test samples that share the same distribution shift type and have a randomly shuffled label distribution (see Figure 1 ➀).
In the complex real world, test data may come arbitrarily. As shown in Figure 1 ➁, the test scenario may involve: i) a mixture of multiple distribution shifts, ii) small test batch sizes (even a single sample), and iii) a ground-truth test label distribution Q_t(y) that shifts online and may be imbalanced at each time-step t. In these wild test settings, online updating a model with existing TTA methods may be unstable, i.e., failing to help or even harming the model's robustness. To stabilize wild TTA, one immediate solution is to recover the model weights after each adaptation step on a sample or mini-batch, as in MEMO (Zhang et al., 2022) and episodic Tent (Wang et al., 2021). Meanwhile, DDA (Gao et al., 2022) provides a potentially effective idea for this issue: rather than adapting the model, it transfers test samples to the source training distribution (via a trained diffusion model (Dhariwal & Nichol, 2021)), keeping all model weights frozen during testing. However, these methods cannot cumulatively exploit the knowledge of previous test samples to boost adaptation performance, and thus obtain limited results when there are many test samples. In addition, the diffusion model in DDA is expected to generalize well enough to project any possible target shift back to the source data. This is hard to satisfy in practice, e.g., DDA performs well on noise shifts but is less competitive on blur and weather shifts (see Table 2). Thus, how to stabilize online TTA under wild test settings is still an open question. In this paper, we first point out that the batch norm (BN) layer (Ioffe & Szegedy, 2015) is a key obstacle, since under the above wild scenarios the mean and variance estimates in BN layers will be biased.
In light of this, we further investigate the effects of norm layers in TTA (see Section 4) and find that pre-trained models with batch-agnostic norm layers (i.e., group norm (GN) (Wu & He, 2018) and layer norm (LN) (Ba et al., 2016)) are more beneficial for stable TTA. However, TTA on GN/LN models does not always succeed and still has many failure cases. Specifically, GN/LN models optimized by online entropy minimization (Wang et al., 2021) tend to collapse, i.e., predicting all samples as a single class (see Figure 2), especially when the distribution shift is severe. To address this issue, we propose a sharpness-aware and reliable entropy minimization method (namely SAR). Specifically, we find that some noisy samples producing gradients with large norms indeed harm the adaptation and thus result in model collapse. To avoid this, we filter partial samples with large and noisy gradients out of adaptation according to their entropy. For the remaining samples, we introduce a sharpness-aware learning scheme to ensure that the model weights are optimized toward a flat minimum, thereby being robust to the remaining large and noisy gradients/updates. Main Findings and Contributions. (1) We analyze and empirically verify that batch-agnostic norm layers (i.e., GN and LN) are more beneficial than BN for stable test-time adaptation under wild test settings, i.e., mixed domain shifts, small test batch sizes and online imbalanced label distribution shifts (see Figure 1). (2) We further address the model collapse issue of test-time entropy minimization on GN/LN models by proposing a sharpness-aware and reliable (SAR) optimization scheme, which jointly minimizes the entropy and the sharpness of entropy of reliable test samples. SAR is simple yet effective and stabilizes online test-time adaptation under wild test settings.

2. PRELIMINARIES

Test-time Training (TTT).
Let f_Θ(x) denote a model trained on D_train = {(x_i, y_i)}_{i=1}^N with parameters Θ, where x_i ∈ X_train (the training data space) and y_i ∈ C (the label space). The goal of test-time adaptation (Sun et al., 2020; Wang et al., 2021) is to boost f_Θ(x) on out-of-distribution test samples D_test = {x_j}_{j=1}^M, where x_j ∈ X_test (the testing data space) and X_test ≠ X_train. Sun et al. (2020) first propose the TTT pipeline, in which at the training phase a model is trained on the source data D_train via both a cross-entropy loss L_CE and a self-supervised rotation prediction loss (Gidaris et al., 2018) L_S:

min_{Θ_b, Θ_c, Θ_s} E_{x∈D_train} [L_CE(x; Θ_b, Θ_c) + L_S(x; Θ_b, Θ_s)],   (1)

where Θ_b are the task-shared parameters (shallow layers), and Θ_c and Θ_s are the task-specific parameters (deep layers) for L_CE and L_S, respectively. At the testing phase, given a test sample x, TTT first updates the model with the self-supervised task, Θ'_b ← arg min_{Θ_b} L_S(x; Θ_b, Θ_s), and then uses the updated weights Θ'_b to perform the final prediction via f(x; Θ'_b, Θ_c). Fully Test-time Adaptation (TTA). The TTT pipeline needs to alter the original model training process, which may be infeasible when training data are unavailable due to privacy/storage concerns. To avoid this, Wang et al. (2021) propose fully TTA, which adapts an arbitrary pre-trained model on a given test mini-batch by entropy minimization (Tent):

min_Θ −Σ_c ŷ_c log ŷ_c,   where ŷ_c = f_Θ(c|x)

is the predicted probability of class c. This method is more efficient than TTT, as shown in Table 6.
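To make the Tent objective concrete, the per-sample prediction entropy −Σ_c ŷ_c log ŷ_c can be sketched in a few lines (an illustrative NumPy version, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(logits):
    """Tent's per-sample prediction entropy: -sum_c y_c * log y_c."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# A confident prediction has low entropy; a uniform prediction attains ln C.
confident = entropy(np.array([[10.0, 0.0, 0.0]]))
uniform = entropy(np.array([[1.0, 1.0, 1.0]]))
```

Tent minimizes this quantity over the affine parameters of the normalization layers, taking one gradient step per test mini-batch.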

3. STABLE ADAPTATION BY TEST ENTROPY AND SHARPNESS MINIMIZATION

Test-time Adaptation (TTA) in Dynamic Wild World. Although prior TTA methods have exhibited great potential for out-of-distribution generalization, their success may rely on certain underlying test prerequisites (as illustrated in Figure 1): 1) test samples have the same distribution shift type; 2) adaptation is conducted with a batch of samples each time; and 3) the test label distribution is uniform during the whole online adaptation process. These prerequisites, however, are easily violated in the wild world. In wild scenarios (Figure 1 ➁), prior methods may perform poorly or even fail. In this section, we analyze the underlying reasons why TTA fails under the wild testing scenarios described in Figure 1 from a unified perspective (c.f. Section 3.1) and then propose associated solutions (c.f. Section 3.2).

3.1. WHAT CAUSES UNSTABLE TEST-TIME ADAPTATION?

We first analyze why wild TTA fails by investigating the effects of norm layers in TTA, and then dig into the reasons for the instability of entropy-based methods with batch-agnostic norms, e.g., group norm. Batch Normalization Hinders Stable TTA. In TTA, prior methods often conduct adaptation on pre-trained models with batch normalization (BN) layers (Ioffe & Szegedy, 2015), and most of them are built upon BN statistics adaptation (Schneider et al., 2020; Nado et al., 2020; Khurana et al., 2021; Wang et al., 2021; Niu et al., 2022a; Hu et al., 2021; Zhang et al., 2022). Specifically, for a layer with d-dimensional input x = (x^(1), ..., x^(d)), the batch-normalized output is

y^(k) = γ^(k) x̂^(k) + β^(k),   where x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)]).

Here, γ^(k) and β^(k) are learnable affine parameters. BN adaptation methods calculate the mean E[x^(k)] and variance Var[x^(k)] over (a batch of) test samples. However, all three practical settings in which wild TTA may fail (Figure 1) result in problematic mean and variance estimates. First, BN statistics essentially represent a distribution, and ideally each distribution should have its own statistics. Simply estimating shared BN statistics of multiple distributions from mini-batch test samples unavoidably yields limited performance, as observed in multi-task/domain learning (Wu & Johnson, 2021). Second, the quality of the estimated statistics relies on the batch size, and it is hard to estimate them accurately from very few samples (i.e., small batch sizes). Third, imbalanced label shift also biases the BN statistics toward specific classes in the dataset. Based on the above, we posit that batch-agnostic norm layers, i.e., layers agnostic to how samples are grouped into a batch, are more suitable for TTA, such as group norm (GN) (Wu & He, 2018) and layer norm (LN) (Ba et al., 2016). We devise our method based on GN/LN models in Section 3.2.
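The batch-dependence of BN can be seen in a few lines: with batch statistics, a sample's normalized output changes when the rest of the batch comes from a shifted distribution, while a per-sample (layer-norm-style) normalization is unaffected. A minimal NumPy sketch (illustrative; the affine parameters γ and β are omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize over the batch dimension: statistics are shared across samples.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each sample over its own features: batch-agnostic.
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # a test batch from one distribution
x_mixed = x.copy()
x_mixed[1:] += 5.0             # the rest of the batch now comes from a shifted domain

# Sample 0 is identical in both batches, but its BN output changes...
bn_shift = np.abs(batch_norm(x)[0] - batch_norm(x_mixed)[0]).max()
# ...while its batch-agnostic (LN) output is unchanged.
ln_shift = np.abs(layer_norm(x)[0] - layer_norm(x_mixed)[0]).max()
```

This is exactly the failure mode of mixed distribution shifts: the statistics estimated from the batch no longer describe the distribution of each individual sample.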
To verify the above claim, we empirically investigate the effects of different normalization layers (including BN, GN, and LN) in TTA (including TTT and Tent) in Section 4. From the results, we observe that models equipped with GN and LN are more stable than models with BN when performing online test-time adaptation under the three practical test settings (in Figure 1) and have fewer failure cases. The detailed empirical studies are deferred to Section 4 for coherence of presentation. Online Entropy Minimization Tends to Result in Collapsed Trivial Solutions, i.e., Predicting All Samples as the Same Class. Although TTA performs more stably on GN and LN models, it does not always succeed and still faces several failure cases (as shown in Section 4). For example, entropy minimization (Tent) on GN models (ResNet50-GN) tends to collapse, especially when the distribution shift is severe. In this paper, we aim to stabilize online fully TTA under various practical test settings. To this end, we first analyze the failure reasons, finding that models are often optimized toward collapsed trivial solutions. We illustrate this issue as follows. During the online adaptation process, we record the predicted class and the gradient norm (produced by the entropy loss) of ResNet50-GN on shuffled ImageNet-C with Gaussian noise. Comparing Figures 2(a) and (b), entropy minimization is unstable and may collapse when the distribution shift is severe (i.e., severity level 5). From Figure 2(a), as the adaptation proceeds, the model tends to predict all input samples as the same class, even though these samples have different ground-truth classes; we call this model collapse.
Meanwhile, we notice that, as the model starts to collapse, the ℓ2-norm of the gradients of all trainable parameters suddenly increases and then degrades to almost 0 (as shown in Figure 2(c)), while at severity level 3 the model works well and the gradient norm stays in a stable range all the time. This indicates that some test samples produce large gradients that may hurt the adaptation and lead to model collapse.
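The monitored quantity in Figure 2(c) is simply the ℓ2-norm accumulated over the gradients of all trainable parameters; a minimal sketch (NumPy arrays stand in for the per-parameter gradients):

```python
import numpy as np

def global_grad_norm(grads):
    """l2-norm over all trainable parameters' gradients, as monitored in Fig. 2(c)."""
    return np.sqrt(sum(float((g ** 2).sum()) for g in grads))

# Two hypothetical parameter gradients: the sum of squares is 6*1 + 4*4 = 22.
grads = [np.ones((2, 3)), np.full((4,), 2.0)]
norm = global_grad_norm(grads)  # sqrt(22)
```

A sudden spike of this quantity followed by a drop toward zero is the collapse signature described above.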

3.2. SHARPNESS-AWARE AND RELIABLE TEST-TIME ENTROPY MINIMIZATION

Based on the above analyses, the two most straightforward solutions for avoiding model collapse are filtering out test samples according to their gradients or performing gradient clipping. However, neither is very feasible, since gradient norms for different models and distribution shift types have different scales, making it hard to devise a general rule for setting the filtering or clipping threshold (see Section 5.2 for more analyses). We propose our solutions as follows. Reliable Entropy Minimization. Since directly filtering samples by gradient norm is infeasible, we first investigate the relation between the entropy loss and the gradient norm, and seek to remove samples with large gradients based on their entropy. The entropy depends on the model's output class number C and lies in (0, ln C) for any model and data, so a threshold for filtering samples by entropy is easier to select. As shown in Figure 2(d), selecting samples with small loss values removes part of the samples with large gradients (area@1) from adaptation. Formally, let E(x; Θ) be the entropy of sample x; the selective entropy minimization is defined by:

min_Θ S(x)E(x; Θ),   where S(x) ≜ 𝕀_{E(x;Θ) < E_0}(x).   (2)

Here, Θ denotes the model parameters, 𝕀_{•}(•) is an indicator function, and E_0 is a pre-defined threshold. Note that the above criterion also removes the samples within area@2 in Figure 2(d), whose predictions have low confidence and are thus unreliable (Niu et al., 2022a). Sharpness-aware Entropy Minimization. Through Eqn. (2), we have removed the test samples in areas @1 and @2 of Figure 2(d) from adaptation. Ideally, we would optimize the model only with samples in area@3, since samples in area@4 still have large gradients and may harm the adaptation. However, it is hard to further remove the samples in area@4 via a filtering scheme.
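The selection criterion S(x) in Eqn. (2) is a simple entropy threshold; since entropy lies in (0, ln C), a threshold such as E_0 = 0.4 × ln C (the value used in the experiments) transfers across models. A minimal NumPy sketch (the per-sample entropies below are hypothetical values for illustration):

```python
import numpy as np

C = 1000              # e.g., ImageNet class count
E0 = 0.4 * np.log(C)  # threshold used in the experiments, approximately 2.76

def reliable_mask(entropies, e0=E0):
    """S(x) = 1 iff E(x; Theta) < E0: keep only low-entropy (reliable) samples."""
    return entropies < e0

batch_entropy = np.array([0.5, 2.0, 6.5, 3.1])  # hypothetical per-sample entropies
mask = reliable_mask(batch_entropy)
# Only the samples below E0 contribute to the selective entropy loss.
loss = (batch_entropy * mask).sum() / max(mask.sum(), 1)
```

In practice the masked entropy loss is then back-propagated through the norm layers' affine parameters only.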
Alternatively, we seek to make the model insensitive to the large gradients contributed by samples in area@4. Specifically, we encourage the model to go to a flat area of the entropy loss surface. The reason is that a flat minimum has good generalization ability and is robust to noisy/large gradients: noisy/large updates around a flat minimum do not significantly affect the original model loss, while around a sharp minimum they do. To this end, we jointly minimize the entropy and the sharpness of the entropy loss by:

min_Θ E^SA(x; Θ),   where E^SA(x; Θ) ≜ max_{‖ϵ‖₂ ≤ ρ} E(x; Θ + ϵ).   (3)

Here, the inner optimization seeks a weight perturbation ϵ within a Euclidean ball of radius ρ that maximizes the entropy. The sharpness is quantified by the maximal change of entropy between Θ and Θ + ϵ. This bi-level problem encourages the optimization to find flat minima. To solve problem (3), we follow SAM (Foret et al., 2021) and first approximately solve the inner optimization via a first-order Taylor expansion:

ϵ*(Θ) ≜ arg max_{‖ϵ‖₂ ≤ ρ} E(x; Θ + ϵ) ≈ arg max_{‖ϵ‖₂ ≤ ρ} E(x; Θ) + ϵᵀ∇_Θ E(x; Θ) = arg max_{‖ϵ‖₂ ≤ ρ} ϵᵀ∇_Θ E(x; Θ).

The ε̂(Θ) that solves this approximation is given by the solution to a classical dual norm problem:

ε̂(Θ) = ρ sign(∇_Θ E(x; Θ)) |∇_Θ E(x; Θ)| / ‖∇_Θ E(x; Θ)‖₂.   (4)

Substituting ε̂(Θ) back into Eqn. (3) and differentiating, omitting second-order terms for computational efficiency, the final gradient approximation is:

∇_Θ E^SA(x; Θ) ≈ ∇_Θ E(x; Θ) |_{Θ + ε̂(Θ)}.   (5)

Overall Optimization. In summary, our sharpness-aware and reliable entropy minimization is:

min_{Θ̃} S(x)E^SA(x; Θ̃),   (6)

where S(x) and E^SA(x; Θ) are defined in Eqns. (2) and (3) respectively, and Θ̃ ⊆ Θ denotes the learnable parameters during test-time adaptation. In addition, to handle a few extremely hard cases in which Eqn. (6) may still fail, we further introduce a Model Recovery Scheme.
We record a moving average e_m of the entropy loss values and reset the learnable parameters to their original values once e_m falls below a small threshold e_0, since a model that has collapsed produces very small entropy losses. The additional memory cost is negligible since we only optimize the affine parameters in the norm layers (see Appendix C.2 for more details). We summarize the details of our method in Algorithm 1 in Appendix B.
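Putting the pieces together, one SAR update follows the two-step SAM recipe: compute the perturbation ε̂(Θ) from the entropy gradient (Eqn. (4), which for the ℓ2 ball reduces to ρ·g/‖g‖₂), then descend using the gradient evaluated at the perturbed weights Θ + ε̂(Θ) (Eqn. (5)). The sketch below illustrates this on a toy 3-class linear-softmax model, with finite-difference gradients standing in for autograd (an illustrative approximation, not the paper's implementation):

```python
import numpy as np

def entropy(theta, x):
    """Prediction entropy of a tiny 3-class linear-softmax model."""
    z = theta.reshape(3, 2) @ x
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()
    return -(p * np.log(p + 1e-12)).sum()

def num_grad(f, theta, h=1e-5):
    """Central finite-difference gradient (a stand-in for autograd in this sketch)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = h
        g[i] = (f(theta + d) - f(theta - d)) / (2 * h)
    return g

def sar_step(theta, x, lr=0.1, rho=0.05):
    """One sharpness-aware entropy step: ascend to the worst case, then descend."""
    f = lambda t: entropy(t, x)
    g = num_grad(f, theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # first-order inner maximizer (Eqn. 4)
    g_sa = num_grad(f, theta + eps)              # gradient at perturbed weights (Eqn. 5)
    return theta - lr * g_sa

rng = np.random.default_rng(1)
theta, x = rng.normal(size=6), rng.normal(size=2)
theta_new = sar_step(theta, x)
```

In the actual method this update is applied only to samples passing the reliability filter S(x), and the recovery scheme resets the learnable parameters whenever the moving average of the entropy loss drops below e_0.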

4. EMPIRICAL STUDIES OF NORMALIZATION LAYER EFFECTS IN TTA

This section designs experiments to illustrate how test-time adaptation (TTA) performs on models with different norm layers (including BN, GN and LN) under the wild test settings described in Figure 1. We verify two representative methods introduced in Section 2, i.e., self-supervised TTT (Sun et al., 2020) and unsupervised Tent (a fully TTA method) (Wang et al., 2021). Considering that norm layers are often coupled with mainstream network architectures, we conduct adaptation on ResNet-50-BN (R-50-BN), ResNet-50-GN (R-50-GN) and VitBase-LN (Vit-LN). All adopted model weights are publicly available and obtained from the torchvision or timm repository (Wightman, 2019). Implementation details for the experiments in this section can be found in Appendix C.2. All results are the mean and stdev of 3 runs; except for VitBase-LN, the stdev is too small to display in the figures. (1) Norm Layer Effects in TTA Under Small Test Batch Sizes. We evaluate TTA methods (TTT and Tent) with different batch sizes (BS), selected from {1, 2, 4, 8, 16, 32, 64}. Due to GPU memory limits, we only report results with BS up to 8 or 16 for TTT (the original TTT uses BS 1), since TTT needs to augment each test sample multiple times (set to 20, following Niu et al. (2022a)). From Figure 3, we observe: i) For Tent, compared with R-50-BN, R-50-GN and Vit-LN are less sensitive to small test batch sizes. The adaptation performance of R-50-BN degrades severely when the batch size becomes small (<8), while R-50-GN/Vit-LN show stable performance across various batch sizes (Vit-LN on levels 5&3 and R-50-GN on level 3, in subfigures (b)&(d)). It is worth noting that Tent with R-50-GN and Vit-LN does not always succeed and also has failure cases, such as R-50-GN on level 5 (Tent performs worse than no adapt), which is analyzed in Section 3.1. ii) For TTT, R-50-BN/GN and Vit-LN all perform well under various batch sizes.
However, TTT with Vit-LN is very unstable and has a large variance over different runs, showing that TTT+VitBase is very sensitive to the sample order. Note that TTT performing well with R-50-BN under batch size 1 mainly benefits from TTT applying multiple data augmentations to a single sample to form a mini-batch. (2) Norm Layer Effects in TTA Under Mixed Distribution Shifts. We evaluate TTA methods on models with different norm layers when test data come from multiple shifted domains simultaneously. We compare the 'no adapt', 'avg. adapt' (the average accuracy of adapting to each domain separately) and 'mix adapt' (adapting to the mixed, shifted domains) accuracy on ImageNet-C, which consists of 15 corruption types. A larger accuracy gap between 'mix adapt' and 'avg. adapt' indicates more sensitivity to mixed distribution shifts. From Figure 4, we observe: i) For both Tent and TTT, R-50-GN and Vit-LN perform more stably than R-50-BN under mixed domain shifts. Specifically, the mix adapt accuracy of R-50-BN is consistently poorer than its average adapt accuracy across different severity levels (in all subfigures (a-d)). In contrast, R-50-GN and Vit-LN can achieve comparable mix and average adapt accuracies, e.g., TTT on R-50-GN (levels 5&3) and Tent on Vit-LN (level 3). ii) For R-50-GN and Vit-LN, TTT performs more stably than Tent. To be specific, Tent has 3/4 failure cases (R-50-GN on levels 5&3, Vit-LN on level 5), more than TTT. iii) As in Section 4 (1), TTT on Vit-LN has large variances over multiple runs, showing that TTT+Vit-LN is sensitive to the sample order. (3) Norm Layer Effects in TTA Under Online Imbalanced Label Shifts. As in Figure 1(c), during the online adaptation process, the label distribution Q_t(y) at different time-steps t may differ (online shift) and be imbalanced.
To evaluate this, we first simulate imbalanced label distribution shift by adjusting the order of the input samples (from a test set) as follows. Online Imbalanced Label Distribution Shift Simulation. Assume we have T time-steps in total, where T equals the class number C. We set the probability vector Q_t(y) = [q_1, q_2, ..., q_C], where q_c = q_max if c = t and q_c = q_min ≜ (1 − q_max)/(C − 1) if c ≠ t. Here, q_max/q_min denotes the imbalance ratio. Then, at each t ∈ {1, 2, ..., T=C}, we sample M images from the test set according to Q_t(y). Based on ImageNet-C (Gaussian noise), we construct a new testing set with online imbalanced label distribution shifts, containing 100 (M) × 1,000 (T) images in total. Note that we pre-shuffle the class order in ImageNet-C, since we cannot know which class will come in practice. From Figure 5, we observe: i) For Tent, R-50-GN and Vit-LN are less sensitive than R-50-BN to online imbalanced label distribution shifts (see subfigures (b)&(d)). Specifically, the adaptation accuracy of R-50-BN (levels 5&3) degrades severely as the imbalance ratio increases. In contrast, R-50-GN and Vit-LN have the potential to perform stably under various imbalance ratios (e.g., R-50-GN and Vit-LN on level 3). ii) For TTT, R-50-BN/GN and Vit-LN all perform relatively stably under label shifts, except that TTT+Vit-LN has large variances. Their adaptation accuracy also degrades as the imbalance ratio increases, but not severely. iii) Tent with GN is more sensitive to the extent of the distribution shift than with BN. Specifically, for imbalance ratio 1 (all Q_t(y) uniform) and severity level 5, Tent+R-50-GN fails and performs poorer than no adapt, while Tent+R-50-BN works well. (4) Overall Observations. Based on all the above results, we conclude: i) R-50-GN and Vit-LN are more stable than R-50-BN when performing TTA under wild test settings (see Figure 1). However, they do not always succeed and still suffer from several failure cases.
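The simulation above can be sketched directly from the definition of Q_t(y) (illustrative NumPy code with a small class count C = 5 instead of ImageNet's 1,000):

```python
import numpy as np

def label_distribution(C, q_max, t):
    """Q_t(y): probability q_max on class t and q_min = (1 - q_max)/(C - 1) elsewhere."""
    q = np.full(C, (1.0 - q_max) / (C - 1))
    q[t] = q_max
    return q

C = 5
q = label_distribution(C, q_max=0.8, t=2)  # imbalance ratio q_max/q_min = 0.8/0.05 = 16
rng = np.random.default_rng(0)
labels = rng.choice(C, size=1000, p=q)     # sample M = 1000 labels for time-step t = 2
```

Sweeping t from 1 to C then yields a test stream whose dominant class shifts online, as in Figure 1(c).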
ii) R-50-GN is more suitable for self-supervised TTT than Vit-LN, since TTT+Vit-LN is sensitive to the sample order and has large variances over different runs. iii) Vit-LN is more suitable for unsupervised Tent than R-50-GN, since Tent+R-50-GN easily collapses, especially when the distribution shift is severe.

Table 1: Run time comparison (50,000 ImageNet-C images, single V100 GPU). DDA (Gao et al., 2022): 146,220 secs; TTT (Sun et al., 2020): 3,600 secs; Tent (Wang et al., 2021): 110 secs; EATA (Niu et al., 2022a): 114 secs; SAR (ours): 115 secs.

5. EXPERIMENTS

Dataset and Methods. We conduct experiments on ImageNet-C (Hendrycks & Dietterich, 2019), a large-scale and widely used benchmark for out-of-distribution generalization. It contains images with 15 corruption types in 4 main categories (noise, blur, weather, digital), each with 5 severity levels. We compare our SAR with the following state-of-the-art methods. DDA (Gao et al., 2022) performs input adaptation at test time via a diffusion model. MEMO (Zhang et al., 2022) minimizes the marginal entropy over different augmented copies of a given test sample. Tent (Wang et al., 2021) and EATA (Niu et al., 2022a) are two entropy-based online fully test-time adaptation (TTA) methods. Models and Implementation Details. We conduct experiments on ResNet50-BN/GN and VitBase-LN, obtained from torchvision or timm (Wightman, 2019). For our SAR, we use SGD as the update rule, with a momentum of 0.9, a batch size of 64 (except for the batch size=1 experiments), and a learning rate of 0.00025/0.001 for ResNet/Vit models. The threshold E_0 in Eqn. (2) is set to 0.4 × ln 1000, following EATA (Niu et al., 2022a). ρ in Eqn. (3) is set to the default value 0.05 from Foret et al. (2021). For the trainable parameters of SAR during TTA, following Tent (Wang et al., 2021), we adapt the affine parameters of the group/layer normalization layers in ResNet50-GN/VitBase-LN. More details and the hyper-parameters of the compared methods are given in Appendix C.2.

5.1. ROBUSTNESS TO CORRUPTION UNDER VARIOUS WILD TEST SETTINGS

Results under Online Imbalanced Label Distribution Shifts. As illustrated in Section 4, as the imbalance ratio q_max/q_min increases, TTA degrades more and more severely. Here, we compare methods under the most difficult case: q_max/q_min = ∞, i.e., test samples come in class order. We evaluate all methods under different corruptions via the same sample sequence for fair comparison. From Table 2, our SAR achieves the best average results over the 15 corruption types on both ResNet50-GN and VitBase-LN, suggesting its effectiveness. It is worth noting that Tent works well for many corruption types on VitBase-LN (e.g., defocus and motion blur) and ResNet50-GN (e.g., pixel), while it consistently fails on ResNet50-BN. This further verifies our observation in Section 4 that entropy minimization on LN/GN models has the potential to perform well under online imbalanced label distribution shifts. Meanwhile, Tent still suffers many failure cases, e.g., VitBase-LN on shot noise and snow. For these cases, our SAR works well. Moreover, EATA has fewer failure cases than Tent and achieves higher average accuracy, indicating that its weight regularization can somewhat alleviate the model collapse issue. Nonetheless, the performance of EATA is still inferior to our SAR, e.g., 49.9% vs. 58.0% (ours) on VitBase-LN in average accuracy. Results under Batch Size = 1. From Table 4, our SAR achieves the best results in many cases. It is worth noting that MEMO and DDA are not affected by small batch sizes, mixed domain shifts, or online imbalanced label shifts. They achieve the same stable results under these settings since they reset (or fix) the model parameters after the adaptation of each sample. However, the computational complexity of these two methods is much higher than that of SAR (see Table 1), and they obtain only limited performance gains since they cannot exploit the knowledge from previously seen images.
Although EATA performs better than our SAR on ResNet50-GN, it relies on pre-collecting 2,000 additional in-distribution samples (while we do not). Moreover, our SAR consistently outperforms EATA in the other cases; see the batch size 1 results on VitBase-LN, Tables 2-3, and Tables 8-9 in the Appendix. Table 5: Effects of components in SAR. We report the accuracy (%) on ImageNet-C (level 5) under ONLINE IMBALANCED LABEL SHIFTS (imbalance ratio q_max/q_min = ∞). "reliable" and "sharpness-aware (sa)" denote Eqn. (2) and Eqn. (3), and "recover" denotes the model recovery scheme. From Table 6, for both variants it is hard to set a proper clipping threshold δ, since the gradients for different models and test data have different scales, so the selection of δ is sensitive. We carefully select δ on a specific test set (shot noise, level 5). Then, we select a very small δ to make gradient clipping work, i.e., clipping by value 0.001 and by norm 0.1. Nonetheless, the performance gain over "no adapt" is very marginal, since a small δ limits the learning ability of the model, and in this case the clipped gradients may point in a very different direction from the true gradients. However, a large δ fails to stabilize the adaptation process, and the accuracy degrades after the model collapses (e.g., clipping by value 0.005 and by norm 1.0). In contrast, SAR does not need to tune such a parameter and achieves significant improvements over gradient clipping.
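For reference, the two clipping variants compared here can be sketched as follows: clipping by value can change the gradient's direction, while clipping by norm preserves the direction but, with a small δ, cripples the step size. A minimal NumPy sketch (illustrative; the δ values below are arbitrary, not the ones tuned in the experiments):

```python
import numpy as np

def clip_by_value(g, delta):
    """Element-wise clipping: may change the gradient's direction."""
    return np.clip(g, -delta, delta)

def clip_by_norm(g, delta):
    """Rescale to norm delta if exceeded: preserves direction, shrinks magnitude."""
    n = np.linalg.norm(g)
    return g * (delta / n) if n > delta else g

g = np.array([3.0, -4.0])         # a large, possibly noisy gradient (norm 5)
tiny = clip_by_norm(g, 0.1)       # small delta: stable, but barely updates the model
loose = clip_by_norm(g, 10.0)     # large delta: no clipping, collapse risk remains
by_value = clip_by_value(g, 3.5)  # [3.0, -3.5]: direction now differs from g
```

This illustrates the dilemma described above: a δ small enough to suppress the harmful spikes also suppresses useful learning signal, while a larger δ leaves the collapse dynamics untouched.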

A RELATED WORK

We relate our SAR to existing adaptation methods without and with target data, sharpness-aware optimization, online learning methods, and EATA (Niu et al., 2022a). Adaptation without Target Data. The problem of conquering distribution shifts has been studied in a number of works at training time, including domain generalization (Shankar et al., 2018; Li et al., 2018a; Dou et al., 2019), increasing the training dataset size (Orhan, 2019), and various data augmentation techniques (Lim et al., 2019; Hendrycks et al., 2020; Li et al., 2021; Hendrycks et al., 2021; Yao et al., 2022). Adaptation with Target Data. We divide the discussion of related methods that exploit target data into 1) unsupervised domain adaptation (adapting offline) and 2) test-time adaptation (adapting online). • Unsupervised domain adaptation (UDA). Conventional UDA jointly optimizes on labeled source and unlabeled target data to mitigate distribution shifts, such as devising a domain discriminator to align the source and target domains at the feature level (Pei et al., 2018; Saito et al., 2018; Zhang et al., 2020b;a) or aligning the prototypes of the source and target domains in a contrastive learning manner (Lin et al., 2022). Recently, source-free UDA methods have been proposed to resolve the adaptation problem when source data are absent, such as generative methods that generate source images or prototypes from the model (Li et al., 2020; Kundu et al., 2020; Qiu et al., 2021), and information maximization (Liang et al., 2020). These methods adapt models on a whole test set, where the adaptation is offline and often requires multiple training epochs, and thus they are hard to deploy in online testing scenarios. • Test-time adaptation (TTA). According to whether they alter training, TTA methods fall into two main groups.
i) Test-Time Training (TTT) (Sun et al., 2020) jointly optimizes a source model with both supervised and self-supervised losses, and then conducts self-supervised learning at test time. The self-supervised loss can be rotation prediction (Gidaris et al., 2018) in TTT or contrastive objectives (Chen et al., 2020) in TTT++ (Liu et al., 2021) and MT3 (Bartler et al., 2022), etc. ii) Fully Test-Time Adaptation (Wang et al., 2021; Niu et al., 2022a; Hong et al., 2023) does not alter the training process and can be applied to any pre-trained model, including adapting the statistics in batch normalization layers (Schneider et al., 2020; Hu et al., 2021; Khurana et al., 2021; Lim et al., 2023; Zhao et al., 2023), unsupervised entropy minimization (Wang et al., 2021; Niu et al., 2022a; Zhang et al., 2022), prediction consistency maximization (Zhang et al., 2022; Wang et al., 2022; Chen et al., 2022a), top-k classification boosting (Niu et al., 2022b), etc. Though effective at handling test shifts, prior TTA methods are unstable in the online adaptation process when test data are insufficient (small batch sizes), come from mixed domains, or have imbalanced and online-shifted label distributions (see Figure 1). It is worth noting that methods like MEMO (Zhang et al., 2022) and DDA (Gao et al., 2022) are not affected by the above three scenarios, since MEMO resets the model parameters after each sample adaptation and DDA performs input adaptation via diffusion (with the model weights frozen during testing). However, these methods cannot exploit the knowledge learned from previously seen samples and thus obtain limited performance gains. Moreover, the heavy data augmentation in MEMO and the diffusion in DDA are computationally expensive and inefficient at test time (see Table 6).
In this work, we analyze why online TTA may fail under the above practical test settings and propose associated solutions to make TTA stable under various wild test settings.

Sharpness-aware Minimization (SAM). SAM (Foret et al., 2021) optimizes both a supervised objective (e.g., cross-entropy) and the sharpness of the loss surface, aiming to find a flat minimum that has good generalization ability (Hochreiter & Schmidhuber, 1997). SAM and its variants (Kwon et al., 2021; Zheng et al., 2021; Du et al., 2022; Chen et al., 2022b) have shown outstanding performance on several deep learning benchmarks. In this work, when analyzing the failure modes of test-time entropy minimization, we find that some noisy samples producing gradients with large norms harm the adaptation and lead to model collapse. To alleviate this, we propose to minimize the sharpness of the test-time entropy loss surface so that the online model update is robust to those noisy/large gradients.

Online Learning (OL). OL (Hoi et al., 2021; Chowdhury & Gopalan, 2017; Zhao et al., 2011) learns a model from a sequence of data samples that arrive one at a time, which is common in many real-world applications (e.g., social web recommendation). According to the supervision type, OL can be categorized into three groups: i) supervised methods (Rakhlin et al., 2010; Shalev-Shwartz et al., 2012) obtain supervision at the end of each online learning iteration; ii) semi-supervised methods (Zhang & Hoi, 2019) obtain supervision from only partial samples, e.g., online active learning (Zhao & Hoi, 2013; Zhang et al., 2019) selects informative samples to query the ground-truth label for the model update; iii) unsupervised methods (Bhatnagar et al., 2014) obtain no supervision during the whole online learning process. In this sense, test-time adaptation (TTA) (Sun et al., 2020; Wang et al., 2021) online updates models with only unlabeled test data and thus falls into the third category.
However, unlike unsupervised OL, which mainly aims to learn representations or clusters (Ren et al., 2021), TTA seeks to boost the performance of any pre-trained model on out-of-distribution test samples.

Comparison with EATA (Niu et al., 2022a). Although both EATA and our SAR include a step that removes samples via entropy, the motivations behind this step are different. EATA seeks to improve adaptation efficiency via sample entropy selection. In our SAR, we discover that some noisy gradients with large norms may hurt the adaptation and result in model collapse under wild test settings. To remove these gradients, we exploit an alternative metric (i.e., entropy), which helps to remove partial noisy gradients with large norms. However, this alone is insufficient for achieving stable TTA (see ablation results in Table 5). Thus, we further introduce sharpness-aware optimization and a model recovery scheme. With these three strategies, our SAR performs stably under wild test settings.

Table 6: Characteristics of state-of-the-art methods. We evaluate the efficiency of different methods with ResNet-50 (group norm) on ImageNet-C (Gaussian noise, severity level 5), which consists of 50,000 images in total. The real run time is tested on a single V100 GPU. DDA (Gao et al., 2022) pre-trains an additional diffusion model and then performs input adaptation/diffusion at test time.

Method | #Forward | #Backward | Other computation | GPU time (50,000 images)
MEMO (Zhang et al., 2022) | 50,000×65 | 50,000×64 | AugMix (Hendrycks et al., 2020) | 55,980 seconds
DDA (Gao et al., 2022) | 50,000×2 | 0 | 50,000 diffusion | 146,220 seconds
TTT (Sun et al., 2020) | 50,000×21 | 50,000×20 | rotation augmentation | 3,600 seconds
Tent (Wang et al., 2021) | 50,000 | 50,000 | n/a | 110 seconds
EATA (Niu et al., 2022a) | … | … | … | …
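The entropy-based sample filtering shared by EATA and SAR can be sketched as follows; the function name and `margin_factor` argument are illustrative, with the default E_0 = 0.4 × ln C taken from SAR's protocol described later in this appendix:

```python
import numpy as np

def reliable_mask(entropies, num_classes, margin_factor=0.4):
    # Keep only samples whose prediction entropy is below
    # E_0 = margin_factor * ln(num_classes); high-entropy samples tend
    # to produce large, noisy gradients that destabilize adaptation.
    e0 = margin_factor * np.log(num_classes)
    return entropies < e0

ent = np.array([0.5, 2.5, 6.0])              # per-sample entropies
mask = reliable_mask(ent, num_classes=1000)  # E_0 = 0.4 * ln(1000) ≈ 2.76
```

Only the samples selected by the mask would contribute gradients to the model update.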
For learnable parameters, we only update the affine parameters in normalization layers, following Tent (Wang et al., 2021). However, since the top/deep layers are more sensitive and more important to the original model than shallow layers, as noted in (Mummadi et al., 2021; Choi et al., 2022), we freeze the top layers and update the affine parameters of layer or group normalization in the remaining shallow layers. Specifically, for ResNet50-GN, which has 4 layer groups (layer1, 2, 3, 4), we freeze layer4. For VitBase-LN, we freeze blocks9, blocks10, and blocks11.

TTT (Sun et al., 2020). For fair comparisons, we seek to compare all methods based on the same model weights. However, TTT alters the model training process and requires the model to contain a self-supervised rotation prediction branch for test-time training. Therefore, we modify TTT so that it can be applied to any pre-trained model. Specifically, given a pre-trained model, we add a new (randomly initialized) branch from the end of a middle layer (the 2nd layer group of ResNet50-GN and the 6th blocks group of VitBase-LN) for the rotation prediction task. We first freeze all original parameters of the pre-trained model and train the newly added branch for 10 epochs on the original ImageNet training set. Here, we apply an SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.1/0.005 for ResNet50-GN/VitBase-LN, decreased at epochs 4 and 7 by a factor of 0.1. Then, we take the newly obtained model (with two branches) as the base model to perform test-time training. During the test-time training phase, we use SGD with a learning rate of 0.001 for ResNet50-GN (following TTT) and 0.0001 for VitBase-LN, and the data augmentation size is set to 20 (following Niu et al. (2022a)).

Tent (Wang et al., 2021). We follow all hyper-parameters set in Tent unless they are not provided.
Specifically, we use SGD with a momentum of 0.9, a batch size of 64 (except for the experiments with batch size = 1 and on the effects of small test batch sizes in Section 4), and a learning rate of 0.00025/0.001 for ResNet/Vit models. The learning rate for batch size = 1 is set to (0.00025/32) for ResNet models and (0.001/64) for Vit models. The trainable parameters are all affine parameters of batch normalization layers.

EATA (Niu et al., 2022a). We follow all hyper-parameters set in EATA unless they are not provided. Specifically, the entropy constant E_0 (for reliable sample identification) is set to 0.1 × ln 1000. The ϵ for redundant sample identification is set to 0.05. The trade-off parameter β between the entropy loss and the regularization loss is set to 2,000. The number of pre-collected in-distribution test samples for Fisher importance calculation is 2,000. The update rule is SGD with a momentum of 0.9, a batch size of 64 (except for the experiments with batch size = 1 and on the effects of small test batch sizes in Section 4), and a learning rate of 0.00025/0.001 for ResNet/Vit models. The learning rate for batch size = 1 is set to (0.00025/32) for ResNet models and (0.001/64) for Vit models. The trainable parameters are all affine parameters of batch normalization layers.

MEMO (Zhang et al., 2022). We follow all hyper-parameters set in MEMO. Specifically, we use AugMix (Hendrycks et al., 2020) as the set of data augmentations, with an augmentation size of 64. For Vit models, the optimizer is AdamW (Loshchilov & Hutter, 2018) with a learning rate of 0.00001 and a weight decay of 0.01. For ResNet models, the optimizer is SGD with a learning rate of 0.00025 and no weight decay. The trainable parameters are the entire model.

DDA (Gao et al., 2022). We reproduce DDA according to its official GitHub repository and use the default hyper-parameters.

More Details on Experiments in Section 4: Normalization Layer Effects in TTA.
In Section 4, we investigate the effects of TTT and Tent with models that have different norm layers under {small test batch sizes, mixed distribution shifts, online imbalanced label distribution shifts}. Each experiment considers only one of these three test settings. Specifically, for the experiments on batch size effects (Section 4 (1)), we only tune the batch size; the test set does not contain multiple types of distribution shifts, and its label distribution is always uniform. For the experiments on mixed domain shifts (Section 4 (2)), the test samples come from the mixture of 15 corruption types, while the batch size is 64 for Tent and 1 for TTT, and the label distribution of test data is always uniform. For the experiments on online label shifts (Section 4 (3)), the label distribution of test data is online shifted and imbalanced, while the batch size is 64 for Tent and 1 for TTT, and the test data consist of only one corruption type.

Moreover, it is worth noting that we re-scale the learning rate for entropy minimization (Tent) according to the batch size, since entropy minimization is sensitive to the learning rate and a fixed learning rate often fails to work well. Specifically, the learning rate is re-scaled as (0.00025/32) × BS if BS < 32, else 0.00025 for ResNet models, and (0.001/64) × BS for Vit models. Compared with Tent, the single-sample adaptation method TTT is not very sensitive to the learning rate, and thus we set the same learning rate for various batch sizes. We also provide the results of TTT under different batch sizes with dynamically re-scaled learning rates in Table 7.

We also conduct more ablation experiments under the wild test settings of "batch size (BS) = 1" in Table 13 and "mixed domain shifts" in Table 14. The results are generally consistent with those in Table 5. Both reliable entropy and sharpness-aware optimization work together to stabilize online TTA.
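The batch-size-dependent learning-rate rule above can be written as a small helper (the function name is ours; defaults follow the ResNet protocol):

```python
def rescale_lr(batch_size, base_lr=0.00025, base_bs=32):
    # Linearly scale the entropy-minimization learning rate down for
    # small test batches: (base_lr / base_bs) * BS if BS < base_bs,
    # else base_lr, as in the protocol described above.
    if batch_size < base_bs:
        return base_lr / base_bs * batch_size
    return base_lr
```

For Vit models, the same helper would be called with `base_lr=0.001, base_bs=64`.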
It is worth noting that only for VitBase-LN under BS = 1 is the model recovery scheme activated, improving the average accuracy from 55.7% to 56.4%.

As mentioned in Section 3.1, models online optimized by entropy (Wang et al., 2021; Niu et al., 2022a) under wild test scenarios are prone to collapse, i.e., predicting all samples as a single class independent of the inputs. To alleviate this, prior methods (Liang et al., 2020; Mummadi et al., 2021) exploit diversity regularization to force the output distribution over samples to be uniform. However, this assumption is unreasonable at test time, e.g., when test data are imbalanced during a period (as in Figure 1 (c)), and the strategy also relies on a batch of samples (in contrast to Figure 1 (b)). In this sense, it is infeasible for our problem. In our paper, we instead resolve the collapse issue for test-time adaptation from an optimization perspective to stabilize the online adaptation process.
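Under stated assumptions, one SAR-style update on a single sample might look as follows. This is a simplified NumPy sketch, not the authors' PyTorch implementation: `loss_fn`/`grad_fn` are stand-ins for the sample's entropy value and entropy gradient, and the constants follow the defaults reported in this paper (ρ = 0.05, reset threshold e_0 = 0.2; the step size is illustrative).

```python
import numpy as np

RHO, LR, E0_RESET = 0.05, 0.001, 0.2  # SAM radius, step size, reset threshold

def sar_step(theta, theta_init, grad_fn, loss_fn, e_m, e0_filter):
    # One simplified SAR update on a single test sample: filter the
    # sample by entropy, take a sharpness-aware gradient step, and
    # reset the weights if the entropy moving average drops too low.
    loss = loss_fn(theta)
    if loss >= e0_filter:                      # 1) drop unreliable samples
        return theta, e_m
    g = grad_fn(theta)                         # 2) sharpness-aware step:
    eps = RHO * g / (np.linalg.norm(g) + 1e-12)
    g_sa = grad_fn(theta + eps)                #    gradient at perturbed weights
    theta = theta - LR * g_sa
    e_m = loss if e_m is None else 0.9 * e_m + 0.1 * loss
    if e_m < E0_RESET:                         # 3) collapse detected: recover
        theta = theta_init.copy()
    return theta, e_m

# Toy stand-ins for the entropy value/gradient of one sample.
theta0 = np.array([1.0, 0.0])
loss_fn = lambda t: 0.5 * float(t @ t)
grad_fn = lambda t: t
theta1, e_m = sar_step(theta0, theta0, grad_fn, loss_fn, None, e0_filter=2.76)
```

The key design choice is step 2: the gradient is evaluated at the perturbed weights Θ + ε(Θ), so the update descends toward flat regions of the entropy surface that are robust to the remaining noisy gradients.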

G.2 EFFECTS OF LARGE BATCH SIZES IN BN MODELS UNDER MIXED DOMAIN SHIFTS

In Section 3.1, we mentioned that a standard batch size (e.g., 64 on ImageNet) works well when there is only one type of distribution shift. However, when the test data contain multiple shifts, this batch size fails to yield accurate mean and variance estimates in batch normalization layers. Here, we investigate the effects of very large batch sizes in this setting. From Table 16, the adapted performance increases as the batch size increases, indicating that a larger batch size helps to estimate the statistics more accurately. It is worth noting that the performance on severity level 3 degrades when BS is larger than 1024. This is because we fix the learning rate for the various batch sizes, so BS = 1024 may lead to insufficient model updates. Moreover, although enlarging the batch size boosts performance, the adaptation performance is still inferior to the average accuracy of adapting on each corruption type separately (i.e., average adapt). This further emphasizes the necessity of exploiting models with group or layer norm layers for test-time entropy minimization.

In this subsection, we apply our model recovery scheme to Tent (Wang et al., 2021) and EATA (Niu et al., 2022a). From Table 18, the model recovery indeed helps Tent a lot (e.g., the average accuracy improves from 22.0% to 26.1% on ResNet50-GN), while its performance gain on EATA is somewhat marginal. Compared with Tent+recovery and EATA+recovery, our SAR greatly boosts the adaptation performance.
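A tiny deterministic example of the batch-statistics issue discussed above: when a batch mixes two domains, the pooled mean matches neither domain, so normalizing either domain with pooled statistics leaves a residual shift. The numbers are toy values, not real features.

```python
import numpy as np

# Toy features from two test "domains" with different statistics.
dom_a = np.array([0.0, 1.0, 2.0, 3.0])      # domain A: mean 1.5
dom_b = np.array([10.0, 11.0, 12.0, 13.0])  # domain B: mean 11.5

mixed = np.concatenate([dom_a, dom_b])
pooled_mean = mixed.mean()  # 6.5: matches neither domain

# Normalizing domain A with the pooled statistics leaves a large
# residual offset, unlike normalizing with its own statistics.
resid_pooled = abs(float((dom_a - pooled_mean).mean()))  # 5.0
resid_own = abs(float((dom_a - dom_a.mean()).mean()))    # 0.0
```

A larger batch estimates the mixture statistics more accurately, but it still cannot recover the per-domain statistics, which is why batch-agnostic norm layers help here.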



Download links used in this paper:
• ImageNet-C: https://zenodo.org/record/2235448#.YzQpq-xBxcA
• ResNet50-BN (torchvision): https://download.pytorch.org/models/resnet50-19c8e357.pth
• ResNet50-GN (timm): https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-rsb-weights/resnet50_gn_a1h2-8fe6c4d0.pth
• VitBase-LN (timm): https://storage.googleapis.com/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_224.npz
• ConvNeXt-LN: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224_ema.pth
• TTT: https://github.com/yueatsprograms/ttt_imagenet_release
• Tent: https://github.com/DequanWang/tent
• EATA: https://github.com/mr-eggplant/EATA
• MEMO: https://github.com/zhangmarvin/memo
• DDA: https://github.com/shiyegao/DDA



Figure 1: An illustration of practical/wild test-time adaptation (TTA) scenarios, in which prior online TTA methods may degrade severely. The accuracy of Tent (Wang et al., 2021) is measured on ImageNet-C of level 5 with ResNet50-BN (15 mixed corruptions in (a) and Gaussian in (b-c)).

Figure 2: Failure case analyses (a-c) of online test-time entropy minimization (Wang et al., 2021). (a) and (b) record the model predictions during online adaptation. (c) illustrates how the gradient norm evolves with and without model collapse. (d) investigates the relationship between a sample's entropy and its gradient norm. All experiments are conducted on shuffled ImageNet-C of Gaussian noise with ResNet50 (GN); a larger severity level denotes a more severe distribution shift.

Figure 4: Performance of TTA methods on different models (different norm layers) under the mixture of 15 different corruption types (ImageNet-C). We report mean & stdev. over 3 independent runs.


Comparisons with state-of-the-art methods on ImageNet-C (severity level 5) under ONLINE IMBALANCED LABEL SHIFTS (imbalance ratio = ∞) regarding Accuracy (%). "BN"/"GN"/"LN" is short for Batch/Group/Layer normalization. The bold number indicates the best result.

SAR (ours): Gauss. 46.5±3.0, Shot 43.1±7.4, Impul. 48.9±0.4, Defoc. 55.3±0.1, Glass 54.3±0.2, Motion 58.9±0.1, Zoom 54.8±0.2, Snow 53.6±7.1, Frost 46.2±3.5, Fog 69.7±0.3, Brit. 76.2±0.1, Contr. 66.2±0.3, Elastic 60.9±0.3, Pixel 69.6±0.1, JPEG 66.6±0.1, Avg. 58.0±0.5

Comparisons with state-of-the-art methods on ImageNet-C (severity level 5) with BATCH SIZE = 1 regarding Accuracy (%). "BN"/"GN"/"LN" is short for Batch/Group/Layer normalization.

SAR (ours): Gauss. 23.4±0.3, Shot 26.6±0.4, Impul. 23.9±0.0, Defoc. 18.4±0.1, Glass 15.4±0.3, Motion 28.6±0.3, Zoom 30.4±0.2, Snow 44.9±0.3, Frost 44.7±0.2, Fog 25.7±0.6, Brit. 72.3±0.2, Contr. 44.5±0.1, Elastic 14.8±2.7, Pixel 47.0±0.1, JPEG 56.1±0.0, Avg. 34.5±0.2

Comparisons with state-of-the-art methods on ImageNet-C under MIXTURE OF 15 CORRUPTION TYPES regarding Accuracy (%).


These training-time methods aim to pre-anticipate or simulate the possible shifts of test data, so that the training distribution covers them. However, pre-anticipating all possible test shifts at training time may be infeasible, and these training strategies are often more computationally expensive. Instead of improving generalization ability at training time, we conquer test shifts by directly learning from the test data.

Table 7: Batch size (BS) effects in TTT (Sun et al., 2020) with different models (different norm layers). The learning rate is dynamically re-scaled by 0.001×BS. We report the accuracy (%) on ImageNet-C with Gaussian noise and severity level 5.

D.1 COMPARISONS WITH STATE-OF-THE-ARTS UNDER ONLINE IMBALANCED LABEL SHIFT

We provide more results regarding online imbalanced label distribution shifts (imbalance ratio = ∞) for all compared methods in Table 8. The results are consistent with those of the main paper (severity level 5), and our SAR achieves the best average accuracy over the 15 corruption types. It is worth noting that DDA achieves competitive results under noise corruptions while performing worse on other corruption types. The reason is that the diffusion model used in DDA for input adaptation is trained via noise diffusion, so its ability to handle other corruptions via diffusion is still limited.

Comparisons with state-of-the-art methods on ImageNet-C of severity level 3 under ONLINE IMBALANCED LABEL DISTRIBUTION SHIFTS (imbalance ratio q_max/q_min = ∞) regarding Accuracy (%). "BN"/"GN"/"LN" is short for Batch/Group/Layer normalization.

The evolution of the gradient norm during online test-time adaptation. Results on VitBase, ImageNet-C, shot noise, severity level 5, online imbalanced label shift (imbalance ratio = ∞).



Effects of components in SAR. We report the Accuracy (%) under MIXED DOMAIN SHIFTS, i.e., the mixture of 15 corruption types of ImageNet-C with severity level 5. "reliable" and "sharpness-aware (sa)" denote Eqn. (2) and Eqn. (3), and "recover" denotes the model recovery scheme.

VISUALIZATION OF LOSS SURFACE LEARNED BY SAR

In Figure 7 in the main paper, we visualized the loss surface of Tent and our SAR on VitBase-LN. In this section, we further provide visualizations for ResNet50-GN. For Tent, we select a checkpoint at batch 120 to plot the loss surface, since after batch 120 the model starts to collapse. In this case, the loss (entropy) is hard to decrease and no proper minimum can be found. For our SAR, the model weights for plotting are obtained after adaptation on the whole test set. Comparing Figures 10 (a)-(b), SAR helps to stabilize the entropy minimization process and find proper minima.

Sensitivity of ρ in SAR. We report Accuracy (%) on ImageNet-C (shot noise, severity level 5) under online imbalanced label distribution shifts, where the imbalance ratio is ∞.

Effects of large batch sizes (BS) in Tent (Wang et al., 2021) with ResNet50-BN under MIXTURE OF 15 DIFFERENT CORRUPTION TYPES on ImageNet-C. We report accuracy (%).

G.3 PERFORMANCE OF TENT WITH CONVNEXT-LN

For mainstream neural network models, the normalization layers are often coupled with the network architecture. Specifically, group norm (GN) and batch norm (BN) are often combined with convolutional networks, while layer norm (LN) is more suitable for transformer networks. Therefore, we investigate the effects of layer normalization in TTA in Section 4 through VitBase-LN. Here, we conduct more experiments to compare the performance of online entropy minimization (Wang et al., 2021) on ResNet50-BN and ConvNeXt-LN (Liu et al., 2022). ConvNeXt is a convolutional network equipped with LN. Its authors make significant modifications over ResNet to make this LN-based convolutional network work well, such as modifying the architecture (ResNet block to ConvNeXt block, activation functions, etc.) and applying various training strategies (stochastic depth, random erasing, EMA, etc.). From Table 17, Tent+ConvNeXt-LN performs more stably than Tent+ResNet50-BN, but still suffers several failure cases. These results are consistent with those of ResNet50-BN vs. VitBase-LN.

G.4 EFFECTIVENESS OF MODEL RECOVERY SCHEME WITH TENT AND EATA

Results of Tent on ResNet50-BN and ConvNeXt-LN. We report Accuracy (%) on ImageNet-C under online imbalanced label distribution shifts with an imbalance ratio of ∞. Columns (per model): Gauss., Shot, Impul., Defoc., Glass, Motion, Zoom, Snow, Frost, Fog, Brit., Contr., Elastic, Pixel, JPEG, Avg.

For example, the average accuracy is 26.1% (Tent+recovery) vs. 37.2% (SAR) on ResNet50-GN, suggesting the effectiveness of our proposed SAR.

Results of combining the model recovery scheme with Tent and EATA. We report the Accuracy (%) on ImageNet-C severity level 5 under online imbalanced label distribution shifts with an imbalance ratio of ∞.

ACKNOWLEDGMENTS

This work was partially supported by the Key Realm R&D Program of Guangzhou 202007030007, National Natural Science Foundation of China (NSFC) 62072190, Ministry of Science and Technology Foundation Project 2020AAA0106900, Program for Guangdong Introducing Innovative and Entrepreneurial Teams 2017ZT07X183, and CCF-Tencent Open Fund RAGR20220108.

REPRODUCIBILITY STATEMENT

In this work, we implement all methods (all compared methods and our SAR) with different models (ResNet50-BN, ResNet50-GN, VitBase-LN) on the ImageNet-C/R and VisDA-2021 datasets. Reproducing all the results in our paper depends on the following three aspects:

1. DATASET. The first paragraph of Section 5 and Appendix C.1 provide the details of the adopted datasets and their download URLs.
2. MODELS. All adopted models (with pre-trained weights) for test-time adaptation are publicly available. Specifically, ResNet50-BN is from torchvision, and ResNet50-GN and VitBase-LN are from the timm repository (Wightman, 2019). Appendix C.2 provides their download URLs.
3. PROTOCOLS OF EACH METHOD. The second paragraph of Section 5 and Appendix C.2 provide the implementation details of all compared methods and our SAR. We reproduce all compared methods based on the code from their official GitHub repositories, for which the URLs are provided (in Appendix C.2) following each method's introduction. The source code of SAR has been made publicly available.

B PSEUDO CODE OF SAR

In this appendix, we provide the pseudo-code of our SAR method. From Algorithm 1, for each test sample x_j, we first apply the reliable sample filtering scheme (refer to lines 3-6) to determine whether it will be used to update the model. If x_j is reliable, we optimize the model via the sharpness-aware entropy loss of x_j (refer to lines 7-10). Specifically, we first calculate the optimal weight perturbation ε(Θ) based on the gradient ∇_Θ E(x_j; Θ), and then update the model with the approximate gradient g = ∇_Θ E(x_j; Θ)|_{Θ+ε(Θ)}. Lastly, we exploit a recovery scheme to enable the model to work well even in a few extremely hard cases (refer to lines 11-13). Specifically, when the moving average e_m of the entropy loss is smaller than e_0 (indicating that the model is starting to collapse), we recover the model parameters Θ to their original/initial values.

C.1 MORE DETAILS ON DATASET

…in the face of minor input changes. Now, in order to approximate C, E and these robustness measures, we designed a set of corruptions and perturbations which are frequently encountered in natural images. We will refer to these as "common" corruptions and perturbations. These common corruptions and perturbations are available in the form of IMAGENET-C and IMAGENET-P.

C.2 MORE EXPERIMENTAL PROTOCOLS

All pre-trained models involved in our paper for test-time adaptation are publicly available, including ResNet50-BN obtained from the torchvision library, and ResNet50-GN, VitBase-LN, and ConvNeXt-LN obtained from the timm repository (Wightman, 2019). We summarize the detailed characteristics of all involved methods in Table 6 and introduce their implementation details in the following.

SAR (Ours). We use SGD as the update rule, with a momentum of 0.9, a batch size of 64 (except for the experiments with batch size = 1), and a learning rate of 0.00025/0.001 for ResNet/Vit models. The learning rate for batch size = 1 is set to (0.00025/16) for ResNet models and (0.001/32) for Vit models. The threshold E_0 in Eqn. (2) is set to 0.4 × ln 1000, following EATA (Niu et al., 2022a). ρ in Eqn. (3) is set to the default value 0.05 from Foret et al. (2021). For model recovery, we record the moving average of the entropy loss values with a factor of 0.9 for e_m, and the reset threshold e_0 is set to 0.2.

E ADDITIONAL RESULTS ON IMAGENET-R AND VISDA-2021

We further conduct experiments on ImageNet-R under two wild test settings: online imbalanced label distribution shifts (in Table 10) and batch size = 1 (in Table 11). The overall results are consistent with those on ImageNet-C: 1) ResNet50-GN and VitBase-LN perform more stably than ResNet50-BN; 2) compared with Tent and EATA, SAR achieves the best performance on ResNet50-GN and VitBase-LN.
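The model-recovery bookkeeping in SAR's protocol above (moving-average factor 0.9, reset threshold e_0 = 0.2) might be implemented along these lines; this is a sketch, and the class name is ours:

```python
class EntropyEMA:
    # Tracks the exponential moving average e_m of per-sample entropy
    # (factor 0.9) and flags model recovery when e_m < e_0 = 0.2.
    def __init__(self, factor=0.9, reset_threshold=0.2):
        self.factor = factor
        self.reset_threshold = reset_threshold
        self.e_m = None  # no estimate until the first sample arrives

    def update(self, entropy):
        if self.e_m is None:
            self.e_m = entropy
        else:
            self.e_m = self.factor * self.e_m + (1.0 - self.factor) * entropy
        return self.e_m < self.reset_threshold  # True => restore initial weights

ema = EntropyEMA()
```

When `update` returns True, the adaptation loop would restore the model parameters to their initial values, as in the recovery step of Algorithm 1.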

