VARIATIONAL PSEUDO LABELS FOR META TEST-TIME ADAPTATION

Abstract

Test-time model adaptation has shown great effectiveness in generalizing over domain shifts. One of the most successful tactics for test-time adaptation conducts further optimization on the target data using the predictions of the source-trained model. However, due to domain shifts, the source-trained model's predictions can themselves be largely inaccurate, which results in a model misspecified to the target data and therefore damages its adaptation ability. In this paper, we address test-time adaptation from a probabilistic perspective. We formulate model adaptation as a probabilistic inference problem, which incorporates uncertainty into the source model predictions by modeling pseudo labels as distributions. Based on this probabilistic formalism, we propose variational pseudo labels that exploit the information of neighboring target samples to improve pseudo labels and achieve a model better specified to the target data. Through a meta-learning paradigm, we train our model by simulating domain shifts and the test-time adaptation procedure. In doing so, our model learns the ability to generate more accurate pseudo-label distributions and to adapt to new domains. Experiments on five widely used datasets demonstrate the effectiveness of our proposal.

1. INTRODUCTION

Deep neural networks exhibit generalizability problems and suffer from performance degradation as soon as test data distributions differ from those experienced during training (Geirhos et al., 2018; Recht et al., 2019). To deal with the distribution shift, domain adaptation, e.g., (Saenko et al., 2010; Long et al., 2015; Lu et al., 2020; Li et al., 2021), and domain generalization, e.g., (Muandet et al., 2013; Motiian et al., 2017; Li et al., 2017; 2020), have proven to be effective tactics. However, these two settings either require a large number of (unlabeled) target data during training or do not consider any target information during generalization at all, neither of which is necessarily a valid assumption in realistic scenarios. Test-time adaptation, e.g., (Sun et al., 2020; Varsavsky et al., 2020; Wang et al., 2021), goes beyond these two settings and introduces a new learning paradigm, which trains a model on source data and further optimizes it on the unlabeled target data at test time to adapt to the target domain. One widely applied strategy for test-time adaptation updates model parameters by self-supervision (Liang et al., 2020; Wang et al., 2021; Iwasawa & Matsuo, 2021; Niu et al., 2022). However, due to domain shifts, the source-model predictions on the target samples can be uncertain and inaccurate. As self-supervision-based test-time adaptation is often achieved by optimization with pseudo labels or by entropy minimization based on the source-trained model predictions, the model can become overconfident in some mispredictions. As a result, the adapted model becomes unreliable and misspecified (Wilson & Izmailov, 2020) to the target data. In this paper we make three contributions. First, we address test-time adaptation in a probabilistic framework by formulating it as a variational inference problem. We define pseudo labels as stochastic variables and estimate a distribution over them by variational inference.
By doing so, the uncertainty in the source-trained model's predictions is incorporated into the adaptation to the target data at test time. Second, thanks to the proposed probabilistic formalism, it is natural and convenient to utilize variational distributions to leverage extra information. Exploiting this benefit, we design variational pseudo labels that incorporate the neighboring information of target samples into the inference of the pseudo-label distributions. The resulting variational pseudo labels are more accurate, which enables the source-trained model to be better specified to the target data and is therefore conducive to model adaptation. Third, we adopt a meta-learning paradigm for optimization that simulates test-time adaptation on the source domains. More specifically, the model is iteratively exposed to domain shifts and optimized to learn the ability to adapt to unseen domains. We conduct experiments on five widely used datasets to demonstrate the promise and effectiveness of our method for test-time adaptation.

2. METHODOLOGY

2.1. PRELIMINARY

We are given data from different domains defined on the joint space $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the data space and label space, respectively. The domains are split into several source domains $\mathcal{D}_s = \{(\mathbf{x}_s, \mathbf{y}_s)_i\}_{i=1}^{N_s}$ and target domains $\mathcal{D}_t = \{(\mathbf{x}_t, \mathbf{y}_t)_i\}_{i=1}^{N_t}$. The goal is to train a model on the source domains that generalizes well to the (unseen) target domains. To this end, test-time adaptation methods, e.g., (Wang et al., 2021; Zhang et al., 2021; Niu et al., 2022), have recently been proposed. These methods adapt the source-trained model to target domains by optimization at test time. A common strategy in these methods is to first train the model $\theta$ on source data $\mathcal{D}_s$ by minimizing a supervised loss $\mathcal{L}_{\mathrm{train}}(\theta) = \mathbb{E}_{(\mathbf{x}_s, \mathbf{y}_s)_i \in \mathcal{D}_s}[\mathcal{L}_{\mathrm{CE}}(\mathbf{x}_s, \mathbf{y}_s; \theta)]$, and then at test time adapt the source-trained model $\theta_s$ to the target domain by optimizing certain surrogate losses, e.g., entropy minimization, on the unlabeled test data:

$\mathcal{L}_{\mathrm{test}}(\theta) = \mathbb{E}_{\mathbf{x}_t \in \mathcal{D}_t}[\mathcal{L}_{E}(\mathbf{x}_t; \theta_s)], \quad (1)$

where the entropy is calculated on the source model predictions. However, test samples from the target domain can be largely misclassified by the source model due to the domain shift, resulting in large uncertainty in the predictions. Moreover, entropy minimization tends to update the model with high confidence even for wrong predictions, which causes a misspecified model for the target domain. To solve these problems, in this work we address test-time model adaptation from a probabilistic perspective. We propose a probabilistic inference framework that models the uncertainty of the source-model predictions by defining distributions over pseudo labels. Moreover, under the probabilistic formalism, we propose variational pseudo labels, which enable the model to incorporate the neighboring information in test samples to combat domain shifts.
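As a minimal illustration of the entropy-minimization surrogate in eq. (1), the pure-Python sketch below computes a softmax predictive distribution and its entropy. This is only the scalar quantity; a real implementation (e.g., Tent) would backpropagate it through the network's normalization parameters, which is omitted here.

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy H(p) = -sum_k p_k log p_k; the surrogate loss L_E
    return -sum(p * math.log(p) for p in probs if p > 0)

# Confident predictions have low entropy, uncertain ones high entropy, so
# minimizing H pushes the model toward confident (possibly wrong) outputs.
confident = entropy(softmax([4.0, 0.0]))
uncertain = entropy(softmax([0.2, 0.0]))
```

Minimizing this loss over the unlabeled target batch is exactly what makes the adapted model overconfident on mispredictions, which motivates the probabilistic treatment that follows.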
We adopt a meta-learning paradigm for optimization, which simulates domain shifts and the adaptation procedure. By doing so, the model learns the ability to further adapt itself with pseudo labels to unseen target domains. We provide a graphical illustration highlighting the differences between common test-time adaptation and our proposals in Figure 1.

2.2. PROBABILISTIC TEST-TIME ADAPTATION WITH LATENT PSEUDO LABELS

We first provide a probabilistic formulation of test-time adaptation based on pseudo labels. Given a target sample $\mathbf{x}_t$ and the source-trained model $\theta_s$, we would like to make predictions on the target sample. To this end, we formulate the predictive likelihood as:

$p(\mathbf{y}_t | \mathbf{x}_t, \theta_s) = \int p(\mathbf{y}_t | \mathbf{x}_t, \theta_t) \, p(\theta_t | \mathbf{x}_t, \theta_s) \, d\theta_t \approx p(\mathbf{y}_t | \mathbf{x}_t, \theta_t^*), \quad (2)$

where we use the value $\theta_t^*$ obtained by maximum a posteriori (MAP) estimation to approximate the integral (Finn et al., 2018). Intuitively, the MAP approximation is interpreted as inferring the posterior over $\theta_t$ as $p(\theta_t | \mathbf{x}_t, \theta_s) \approx \delta(\theta_t = \theta_t^*)$, which we obtain by adapting $\theta_s$ on the target data $\mathbf{x}_t$. To model the uncertainty of predictions for more robust test-time adaptation, we treat pseudo labels as stochastic variables in the probabilistic framework, as shown in Figure 1 (b). The pseudo labels are obtained from the source model predictions and follow categorical distributions. We then reformulate eq. (2) as:

$p(\mathbf{y}_t | \mathbf{x}_t, \theta_s) = \int p(\mathbf{y}_t | \mathbf{x}_t, \theta_t) \, p(\theta_t | \hat{\mathbf{y}}_t, \mathbf{x}_t, \theta_s) \, p(\hat{\mathbf{y}}_t | \mathbf{x}_t, \theta_s) \, d\hat{\mathbf{y}}_t \, d\theta_t \approx \mathbb{E}_{p(\hat{\mathbf{y}}_t | \mathbf{x}_t, \theta_s)}[p(\mathbf{y}_t | \mathbf{x}_t, \theta_t^*)], \quad (3)$

where $\theta_t^*$ is the MAP value of $p(\theta_t | \hat{\mathbf{y}}_t, \mathbf{x}_t, \theta_s)$, obtained via gradient descent starting from $\theta_s$ on the data $\mathbf{x}_t$ and the corresponding pseudo labels $\hat{\mathbf{y}}_t$. This formulation allows us to sample different pseudo labels from the categorical distribution $p(\hat{\mathbf{y}}_t | \mathbf{x}_t, \theta_s)$ to adapt the model, which takes into account the uncertainty of the source-trained model's predictions. If the expectation over $p(\hat{\mathbf{y}}_t)$ is instead approximated by the argmax of $p(\hat{\mathbf{y}}_t)$, $\theta_t^*$ is obtained by gradient descent on only a point estimate of the pseudo label. However, due to domain shifts, the argmax of $p(\hat{\mathbf{y}}_t)$ is not guaranteed to be correct. The adaptation is then similar to entropy minimization (eq. 1), where the adapted model can make highly confident but wrong predictions on some target samples.
For example, consider a toy binary classification task where the predicted probability is [0.4, 0.6] and the ground-truth label is [1, 0]. The pseudo label generated by selecting the maximum probability is [0, 1], which is incorrect. Optimization based on such labels gives rise to a model misspecified to the target data, failing to adapt to the target domain. In contrast, our probabilistic formulation allows us to sample pseudo labels from the categorical distribution $p(\hat{\mathbf{y}}_t | \mathbf{x}_t, \theta_s)$, which incorporates the uncertainty of the pseudo label $\hat{\mathbf{y}}_t$ in a principled way. In the example above, a pseudo label sampled from the predicted distribution has a 40% probability of being the ground-truth label, which steers the adaptation of the model in the correct direction. Our formulation therefore achieves better adaptation by recovering accurate pseudo labels from inaccurate prediction distributions.
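The toy example can be made concrete with a short, purely illustrative sketch: for the distribution [0.4, 0.6] with ground truth class 0, the argmax label is always the wrong class 1, whereas a label sampled from the categorical distribution is correct about 40% of the time.

```python
import random

def argmax_label(probs):
    # hard pseudo label: index of the maximum probability
    return max(range(len(probs)), key=lambda k: probs[k])

def sample_label(probs, rng):
    # draw one class index from the categorical distribution `probs`
    r, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k
    return len(probs) - 1

probs = [0.4, 0.6]          # source-model prediction; ground truth is class 0
hard = argmax_label(probs)  # always class 1 -> adaptation in the wrong direction
rng = random.Random(0)
draws = [sample_label(probs, rng) for _ in range(10000)]
frac_correct = draws.count(0) / len(draws)  # close to 0.4
```

Averaging the adaptation over such samples, as eq. (3) prescribes, is what lets some of the gradient signal point in the correct direction even when the point estimate is wrong.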

2.3. VARIATIONAL PSEUDO LABELS

Under the probabilistic formalism, we derive a variational inference of pseudo labels. To train the model's ability to generate better variational pseudo labels and to fully utilize the pseudo-label distributions for better adaptation, we adopt the meta-learning paradigm to simulate domain shifts and the test-time adaptation procedure (Finn et al., 2017; Dou et al., 2019; Xiao et al., 2022). We split the source domains $\mathcal{D}_s$ into meta-source domains $\mathcal{D}_{s'}$ and a meta-target domain $\mathcal{D}_{t'}$ during training. The meta-target domain is selected randomly in each iteration to mimic diverse domain shifts. To simulate the test-time adaptation and evaluation procedure, we maximize the log-likelihood of the meta-target samples after model adaptation on the meta-target data:

$\log p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{s'}) = \log \int p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{t'}) \, p(\theta_{t'} | \hat{\mathbf{y}}_{t'}, \mathbf{x}_{t'}, \theta_{s'}) \, p(\hat{\mathbf{y}}_{t'} | \mathbf{x}_{t'}, \theta_{s'}) \, d\hat{\mathbf{y}}_{t'} \, d\theta_{t'} \approx \log \int p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{t'}^*) \, p(\hat{\mathbf{y}}_{t'} | \mathbf{x}_{t'}, \theta_{s'}) \, d\hat{\mathbf{y}}_{t'} \geq \mathbb{E}_{p(\hat{\mathbf{y}}_{t'} | \mathbf{x}_{t'}, \theta_{s'})}[\log p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{t'}^*)], \quad (4)$

where $p(\hat{\mathbf{y}}_{t'} | \mathbf{x}_{t'}, \theta_{s'})$ denotes the distribution of pseudo labels generated by the meta-source model $\theta_{s'}$ on the meta-target data $\mathbf{x}_{t'}$, and $\theta_{t'}^*$ is the MAP value of $p(\theta_{t'} | \hat{\mathbf{y}}_{t'}, \mathbf{x}_{t'}, \theta_{s'})$, similar to eq. (3), which is learned to mimic the test-time adaptation procedure. Under the meta-learning setting, the actual labels $\mathbf{y}_{t'}$ of the meta-target data are accessible since the source data are fully labeled, as shown in Figure 1 (c). We then simulate the test evaluation procedure and further supervise the adapted model $\theta_{t'}^*$ on its meta-target predictions with the actual labels. The maximization of the log-likelihood of $p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{t'}^*)$ is realised by a cross-entropy loss between the meta-target predictions and the actual meta-target labels. Intuitively, the pseudo-label-adapted model is supervised to achieve good performance on the adapted data.
Thus, it learns the ability to generate better pseudo labels and to achieve better adaptation with these pseudo labels across domain shifts on new, unseen domains.

Variational pseudo labels. Moreover, we propose variational pseudo labels that incorporate information from neighboring target samples to estimate pseudo-label distributions that are more robust against domain shifts. Variational pseudo labels are natural and convenient to deploy under the probabilistic formulation. Assume we have a batch of meta-target data $X_{t'} = \{\mathbf{x}_{t'}^i\}_{i=1}^{M}$. We reformulate eq. (4) as:

$\log p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{s'}, X_{t'}) = \log \int p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{t'}) \, p(\theta_{t'} | \hat{\mathbf{y}}_{t'}, \mathbf{x}_{t'}, \theta_{s'}) \, p(\hat{\mathbf{y}}_{t'}, \mathbf{w}_{t'} | \mathbf{x}_{t'}, \theta_{s'}, X_{t'}) \, d\hat{\mathbf{y}}_{t'} \, d\mathbf{w}_{t'} \, d\theta_{t'} = \log \int p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{t'}^*) \, p(\hat{\mathbf{y}}_{t'}, \mathbf{w}_{t'} | \mathbf{x}_{t'}, \theta_{s'}, X_{t'}) \, d\hat{\mathbf{y}}_{t'} \, d\mathbf{w}_{t'}, \quad (5)$

where $\theta_{t'}^*$ is the MAP value of $p(\theta_{t'} | \hat{\mathbf{y}}_{t'}, \mathbf{x}_{t'}, \theta_{s'})$. We introduce the latent variable $\mathbf{w}_{t'}$ to integrate the information of the neighboring target samples $X_{t'}$, as shown in Figure 1. To approximate the true posterior of the joint distribution $p(\hat{\mathbf{y}}_{t'}, \mathbf{w}_{t'})$, we introduce a variational posterior $q(\hat{\mathbf{y}}_{t'}, \mathbf{w}_{t'} | \mathbf{x}_{t'}, \theta_{s'}, X_{t'}, Y_{t'})$, where $Y_{t'} = \{\mathbf{y}_{t'}^i\}_{i=1}^{M}$ denotes the actual labels of the meta-target data $X_{t'}$. To facilitate the estimation of pseudo labels, we set the prior distribution as:

$p(\hat{\mathbf{y}}_{t'}, \mathbf{w}_{t'} | \mathbf{x}_{t'}, \theta_{s'}, X_{t'}) = p(\hat{\mathbf{y}}_{t'} | \mathbf{w}_{t'}, \mathbf{x}_{t'}) \, p_{\phi}(\mathbf{w}_{t'} | \theta_{s'}, X_{t'}), \quad (6)$

where $p_{\phi}(\mathbf{w}_{t'} | \theta_{s'}, X_{t'})$ is generated from the features of $X_{t'}$ together with their output values under $\theta_{s'}$. Similarly, we define the variational posterior distribution as:

$q(\hat{\mathbf{y}}_{t'}, \mathbf{w}_{t'} | \mathbf{x}_{t'}, \theta_{s'}, X_{t'}, Y_{t'}) = p(\hat{\mathbf{y}}_{t'} | \mathbf{w}_{t'}, \mathbf{x}_{t'}) \, q_{\phi}(\mathbf{w}_{t'} | \theta_{s'}, X_{t'}, Y_{t'}), \quad (7)$

where $q_{\phi}(\mathbf{w}_{t'} | \theta_{s'}, X_{t'}, Y_{t'})$ is obtained from the features of $X_{t'}$ and the actual labels $Y_{t'}$ under $\theta_{s'}$. By introducing eqs.
(6) and (7) into eq. (5), we derive the evidence lower bound (ELBO) of the log-likelihood in eq. (5):

$\log p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{s'}, X_{t'}) \geq \mathbb{E}_{q_{\phi}(\mathbf{w}_{t'})} \mathbb{E}_{p(\hat{\mathbf{y}}_{t'} | \mathbf{w}_{t'}, \mathbf{x}_{t'})}[\log p(\mathbf{y}_{t'} | \mathbf{x}_{t'}, \theta_{t'}^*)] - D_{\mathrm{KL}}[q_{\phi}(\mathbf{w}_{t'} | \theta_{s'}, X_{t'}, Y_{t'}) \,\|\, p_{\phi}(\mathbf{w}_{t'} | \theta_{s'}, X_{t'})]. \quad (8)$

Rather than directly using the meta-source model $\theta_{s'}$, we estimate the pseudo labels $\hat{\mathbf{y}}_{t'}$ from the latent variable $\mathbf{w}_{t'}$, which integrates the features of neighboring target samples. By considering the actual labels $Y_{t'}$, the variational distribution utilizes both the target information and the categorical information of the neighboring samples. Thus, the variational posterior models the distribution of the different categories in the target domain more reliably and produces more accurate pseudo labels to improve model adaptation.
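To make the KL term of the ELBO concrete, the sketch below evaluates $D_{\mathrm{KL}}[q\|p]$ for a toy categorical latent $w$ over three modes. The paper does not specify the parameterization of $q_\phi$ and $p_\phi$ (nor whether $w$ is discrete), so both distributions here are purely illustrative.

```python
import math

def kl_categorical(q, p):
    # D_KL[q || p] = sum_k q_k * log(q_k / p_k); non-negative, 0 iff q == p
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

q_post = [0.7, 0.2, 0.1]   # posterior q_phi(w | ..., Y): also sees the labels
p_prior = [0.4, 0.4, 0.2]  # prior p_phi(w | ...): built from predictions only
kl = kl_categorical(q_post, p_prior)  # penalty for drifting from the prior
```

In the loss of eq. (8) this term keeps the label-informed posterior close to the prediction-only prior, so that at test time, when labels are unavailable, sampling from the prior still behaves like the posterior the model was trained with.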

2.4. META TEST-TIME ADAPTATION: TRAINING AND INFERENCE

To mimic domain shifts during training, we split each iteration into meta-source, meta-adaptation, and meta-target stages, which simulate the training stage on source domains, test-time adaptation, and the test stage on target data, respectively. Under the meta-learning paradigm, the model is iteratively exposed to domain shifts and learns the capability to adapt the meta-source model $\theta_{s'}$ to the meta-target MAP solution $\theta_{t'}^*$ with the variational pseudo labels. The parameters $\phi$ of the variational inference model are jointly optimized to generate better pseudo labels under domain shifts.

Meta-source. We first train the model on the meta-source domains by minimizing the supervised loss:

$\theta_{s'} = \min_{\theta} \mathbb{E}_{(\mathbf{x}_{s'}, \mathbf{y}_{s'}) \in \mathcal{D}_{s'}}[\mathcal{L}_{\mathrm{CE}}(\mathbf{x}_{s'}, \mathbf{y}_{s'}; \theta)], \quad (9)$

where $(\mathbf{x}_{s'}, \mathbf{y}_{s'})$ denotes the input-label pairs of the meta-source domains and $\theta_{s'}$ are the model parameters trained on the meta-source data.

Meta-adaptation. Once the meta-source-trained model $\theta_{s'}$ is obtained, we generate the pseudo labels $p(\hat{\mathbf{y}}_{t'} | \mathbf{w}_{t'}, \mathbf{x}_{t'})$ of the meta-target data with the variational posterior $q_{\phi}(\mathbf{w}_{t'} | \theta_{s'}, X_{t'}, Y_{t'})$. The test-time adaptation procedure is simulated by obtaining $\theta_{t'}^*$:

$\theta_{t'}^* = \theta_{s'} - \lambda_1 \nabla_{\theta} \mathcal{L}_{\mathrm{CE}}(\mathbf{x}_{t'}, \hat{\mathbf{y}}_{t'}; \theta_{s'}), \quad \hat{\mathbf{y}}_{t'} \sim p(\hat{\mathbf{y}}_{t'} | \mathbf{w}_{t'}, \mathbf{x}_{t'}), \quad (10)$

where $\lambda_1$ denotes the learning rate of the meta-adaptation stage.

Meta-target. Since our final goal is good performance on the target data after optimization with pseudo labels, we further mimic test-time inference on the meta-target domain and supervise the meta-target predictions of $\theta_{t'}^*$ by maximizing the log-likelihood, which is equivalent to minimizing:

$\mathcal{L}_{\mathrm{meta}} = \mathbb{E}_{(\mathbf{x}_{t'}, \mathbf{y}_{t'}) \in \mathcal{D}_{t'}}\big[\mathbb{E}_{q_{\phi}(\mathbf{w}_{t'})} \mathbb{E}_{p(\hat{\mathbf{y}}_{t'} | \mathbf{w}_{t'}, \mathbf{x}_{t'})} \mathcal{L}_{\mathrm{CE}}(\mathbf{x}_{t'}, \mathbf{y}_{t'}; \theta_{t'}^*)\big] + D_{\mathrm{KL}}[q_{\phi}(\mathbf{w}_{t'}) \,\|\, p_{\phi}(\mathbf{w}_{t'})], \quad (11)$

where $\mathbf{y}_{t'}$ denotes the ground-truth label of $\mathbf{x}_{t'}$.
The parameters $\theta$ are finally updated by

$\theta = \theta_{s'} - \lambda_2 \nabla_{\theta} \mathcal{L}_{\mathrm{meta}}, \quad (12)$

where $\lambda_2$ denotes the learning rate for the meta-target stage. Note that the loss in eq. (11) is computed on the parameters $\theta_{t'}^*$ obtained by eq. (10), while the optimization in eq. (12) is performed over the meta-source-trained parameters $\theta_{s'}$. Intuitively, the parameters are optimized to learn the ability to handle domain shifts, such that adaptation with variational pseudo labels of data from new domains improves the predictions on those domains. The parameters $\phi$ of the variational inference model are jointly trained with $\theta$. To guarantee that the variational pseudo labels do extract the neighboring information useful for discrimination, we add a cross-entropy loss $\mathcal{L}_{\widehat{\mathrm{ce}}}$ between the variational pseudo labels and the corresponding actual labels during training. Thus, $\phi$ is updated by:

$\phi = \phi - \lambda_3 (\nabla_{\phi} \mathcal{L}_{\widehat{\mathrm{ce}}} + \nabla_{\phi} \mathcal{L}_{\mathrm{meta}}), \quad (13)$

where $\lambda_3$ denotes the learning rate.

Test-time adaptation and prediction. At test time, the model $\theta_s$ trained on the source domains with the above meta-learning strategy is adapted by further optimization using eq. (10). The adapted model is then evaluated on the (unseen) target data $\mathcal{D}_t$. The prediction is formulated as:

$p(\mathbf{y}_t | \mathbf{x}_t, \theta_s, X_t) = \int p(\mathbf{y}_t | \mathbf{x}_t, \theta_t) \, p(\theta_t | \hat{\mathbf{y}}_t, \mathbf{x}_t, \theta_s) \, p(\hat{\mathbf{y}}_t, \mathbf{w}_t | \mathbf{x}_t, \theta_s, X_t) \, d\hat{\mathbf{y}}_t \, d\mathbf{w}_t \, d\theta_t \approx \mathbb{E}_{p_{\phi}(\mathbf{w}_t)} \mathbb{E}_{p(\hat{\mathbf{y}}_t | \mathbf{w}_t, \mathbf{x}_t)}[p(\mathbf{y}_t | \mathbf{x}_t, \theta_t^*)], \quad (14)$

where $\theta_t^*$ is the MAP value of $p(\theta_t | \hat{\mathbf{y}}_t, \mathbf{x}_t, \theta_s)$, and $p_{\phi}(\mathbf{w}_t) = p_{\phi}(\mathbf{w}_t | \theta_s, X_t)$ is generated from the features of $X_t$ together with their outputs or common pseudo labels under $\theta_s$.
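The three stages above can be sketched end to end on a deliberately tiny stand-in for the network: a one-parameter logistic model in pure Python. The pseudo label is sampled from the model's own predictive distribution as a crude proxy for sampling from the variational pseudo-label distribution, and the meta-target update is a first-order simplification (the gradient is evaluated at the adapted parameters but applied to the pre-adaptation parameters). All names and constants are illustrative, not the paper's implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_grad(theta, x, y):
    # gradient of binary cross-entropy wrt theta, for p = sigmoid(theta * x)
    return (sigmoid(theta * x) - y) * x

def meta_iteration(theta, src, tgt, lr1=0.5, lr2=0.5, rng=random):
    xs, ys = src
    xt, yt = tgt
    # Meta-source: one supervised step on meta-source data
    theta_s = theta - lr1 * ce_grad(theta, xs, ys)
    # Meta-adaptation: sample a pseudo label from the model's predictive
    # distribution (stand-in for the variational pseudo-label distribution)
    p1 = sigmoid(theta_s * xt)
    y_pseudo = 1 if rng.random() < p1 else 0
    theta_star = theta_s - lr1 * ce_grad(theta_s, xt, y_pseudo)
    # Meta-target: evaluate the adapted parameters with the true label, and
    # apply the update to the pre-adaptation parameters (first-order proxy)
    return theta_s - lr2 * ce_grad(theta_star, xt, yt)

theta = 0.0
rng = random.Random(0)
for _ in range(100):
    theta = meta_iteration(theta, src=(1.0, 1), tgt=(1.0, 1), rng=rng)
# theta grows positive: the model learns to predict class 1 confidently at x=1
```

Even when a sampled pseudo label is wrong, the meta-target step supervises the adapted parameters with the true label, which is the mechanism by which the model learns to generate pseudo labels that adapt it in the right direction.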

3. RELATED WORK

Test-time adaptation. By combining the advantages of both domain adaptation (Ganin & Lempitsky, 2015; Long et al., 2015; Hoffman et al., 2018; Lu et al., 2020; Tzeng et al., 2017; Shen et al., 2022) and domain generalization (Muandet et al., 2013; Li et al., 2017; 2019; Du et al., 2020; Zhou et al., 2020; 2022), test-time adaptation (Sun et al., 2020; Dubey et al., 2021; Wang et al., 2021; Zhou & Levine, 2021; Chen et al., 2022) and source-free adaptation (Liang et al., 2020; Eastwood et al., 2021) have been proposed to train a model only on source domains while adapting it to unlabeled target data at test time. Several methods update the normalization statistics of the model to handle domain shifts (Schneider et al., 2020; Du et al., 2021; Hu et al., 2021). Sun et al. (2020) proposed to fine-tune the model parameters by a self-supervised loss at test time. Liu et al. (2021) further enhanced the method by introducing a test-time feature alignment strategy. Instead of using an extra self-supervised loss, Wang et al. (2021) proposed fully test-time adaptation by entropy minimization, which has been followed by several recent works (Zhang et al., 2021; Niu et al., 2022; Jang & Chung, 2022). Zhang et al. (2021) minimized the entropy of the marginal output distribution averaged over multiple augmentations of a single target sample. Niu et al. (2022) proposed an efficient test-time adaptation method without forgetting by adapting with low-entropy samples under Fisher regularization. Different from these methods, Iwasawa & Matsuo (2021) adjusted the classifier with pseudo labels of the target data, without fine-tuning the model parameters. In contrast to all of these methods, we propose a probabilistic formulation of test-time adaptation, which models the uncertainty of pseudo labels for better adaptation. We also introduce variational pseudo labels with meta-adaptation to further learn the ability to improve the pseudo labels and adaptation. Meta-learning.
Meta-learning-based methods (Alet et al., 2021; Xiao et al., 2022; Goyal et al., 2022) have been studied for test-time adaptation before. Alet et al. (2021) learned to adapt with a contrastive loss. Goyal et al. (2022) meta-learned the loss functions of test-time adaptation for better adaptation. Xiao et al. (2022) proposed single-sample generalization, which adapts the model to each individual target sample by mimicking domain shifts during training. Our method also learns the adaptation ability under the meta-learning setting. We design meta-adaptation to simulate the test-time adaptation procedure and supervise it based on our probabilistic formulation. We further supervise the meta-adapted model in the meta-target stage to learn variational pseudo-label generation and the adaptation ability. Pseudo label learning. In pseudo label learning, the aim is to take a model's best predictions, e.g., its high-confidence predictions, and use the corresponding samples and predicted labels to retrain the model on a given downstream task. Example tasks include classification (Yalniz et al., 2019; Xie et al., 2020), segmentation (Zou et al., 2020), and object detection (Li et al., 2022). Pham et al. (2021) address the problem of confirmation bias in pseudo labeling and utilize a teacher-student network for image classification. Zou et al. (2019) utilize the softmax output logits as prediction probabilities and train the model to directly maximize them. Wang et al. (2022) utilize confidence to approximate the domain difference and apply augmentations to improve pseudo-label quality below a threshold when updating the student model. Rizve et al. (2021) assume access to some target labels in a semi-supervised setting and rely on prediction uncertainties, leveraged through the labeled target samples, to generate new pseudo labels.
For semi-supervised learning, Miyato et al. (2018) propose a regularization method that uses target data without labels. Shu et al. (2018) construct a new source domain by pseudo labeling the available target data during source training. For domain generalization and offline adaptation, Abdo et al. (2009) utilize pseudo labels from networks extracting image and depth features. For online adaptation, Chen et al. (2022) utilize a contrastive method and use pseudo labels to ensure that same-class negative samples are not used in the contrastive loss optimization. For semi-supervised adaptation, Zhou et al. (2021) utilize pseudo labels with confidence as a criterion. In contrast to existing works, our method uses meta-learning and is probabilistic in nature for test-time adaptation.

4.1. SETTINGS

Five datasets. We demonstrate the effectiveness of our method on image classification under domain generalization settings. We evaluate our method on five widely used datasets in domain generalization. PACS (Li et al., 2017) consists of 7 classes and 4 domains: Photo, Art painting, Cartoon, and Sketch, with 9,991 samples. VLCS (Fang et al., 2013) consists of 5 classes and 4 domains: Pascal, LabelMe, Caltech, and SUN, with 10,729 samples. TerraIncognita (Beery et al., 2018) consists of camera-trap photographs of wild animals, with 4 camera locations as domains. We follow the training and validation split in (Li et al., 2017) and evaluate the model according to the "leave-one-out" protocol (Li et al., 2019; Carlucci et al., 2019). We also evaluate our method on the Rotated MNIST and Fashion-MNIST datasets following Piratla et al. (2020), where images rotated by different angles form different domains. We use the subsets with rotation angles from 15° to 75° in intervals of 15° as five source domains, and images rotated by 0° and 90° as the target domains.

Two adaptation settings. To demonstrate the effectiveness of our method, we evaluate on both offline and online test-time adaptation settings. Offline test-time adaptation assumes unlabeled target data are already available for adaptation. To be close to real-world applications, where it is difficult to access the whole target set, we use different amounts of target data for offline adaptation. We adapt the model on the available target data and evaluate it on the entire target set, without continuously fine-tuning on the entire target set. We also evaluate our method for online test-time adaptation (Iwasawa & Matsuo, 2021). In real-world applications, as aforementioned, instead of accessing the entire target set, we usually obtain unlabeled target data in an online manner. To achieve continuous adaptation and improvement of the model on target data, we increment the target data iteratively and keep adapting and evaluating the model on the online target data.

Implementation details

We use ResNet-18 for all our experiments and ablation studies, and also report accuracies with ResNet-50 for comparison. The backbones are pretrained on ImageNet, the same as in previous methods. During training, we use different learning rates across the model: we set the learning rate for the pretrained ResNet to 5e-5 and the learning rate of the variational module and classifiers to 1e-4 for all datasets. During test-time adaptation, we use a learning rate of 1e-4 for all layers. We intend to release the code upon acceptance of this paper.
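For concreteness, the learning-rate scheme above can be written down as optimizer parameter groups. The module names below are placeholders, since the exact module layout is not given in the text; this is a sketch of the configuration, not the released code.

```python
# Learning rates from the implementation details above; the "module" names
# are placeholders for the actual parameter collections of the model.
train_param_groups = [
    {"module": "resnet_backbone", "lr": 5e-5},     # ImageNet-pretrained ResNet
    {"module": "variational_module", "lr": 1e-4},  # q_phi / p_phi networks
    {"module": "classifier", "lr": 1e-4},
]
test_time_lr = 1e-4  # single learning rate for all layers during adaptation
```

In a PyTorch-style setup these dictionaries would be passed to the optimizer as per-parameter-group options, with the pretrained backbone deliberately given the smaller rate.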

4.2. ABLATION STUDIES

Benefits of our probabilistic test-time adaptation. We first investigate the effectiveness of our probabilistic formulation of test-time adaptation and its meta-learned variational pseudo labels. To demonstrate the benefits of the probabilistic formulation, we conduct test-time adaptation with eq. (3) and compare it with a common test-time adaptation tactic (Wang et al., 2021), as in eq. (1). As shown in Table 1, adaptation both with and without the probabilistic formulation achieves good improvements over the ERM baseline. Our probabilistic test-time adaptation performs better than the common one on most target domains for both the online and offline adaptation settings, which demonstrates the benefit of modeling uncertainty during adaptation at test time. Moreover, we incorporate the distribution of pseudo labels into the probabilistic formulation and further propose the variational pseudo labels with meta-adaptation based on the pseudo-label distributions. As shown in the fourth and last rows of Table 1, our method further improves the performance of both online and offline adaptation, demonstrating the effectiveness of the variational pseudo labels with meta-adaptation. With the probabilistic formulation, it is natural and simple to define the problem as a variational inference problem and solve it under the meta-learning framework. The results further demonstrate the benefits and importance of our probabilistic formulation of test-time adaptation.

Table 2: Ablation of the meta-learning setting. We conduct these experiments on PACS using ResNet-18. In the first setting, we do not use meta-learning in any stage. We observe that our method performs better on all domains comparatively.


Benefits of variational pseudo labels. Based on the probabilistic formulation, we introduce variational pseudo labels to incorporate the information of the neighboring target data for each target sample. To demonstrate their effectiveness, we compare them with normal pseudo labels drawn directly from the prediction distributions of source-trained models. We evaluate the methods in the offline adaptation setting with different amounts of target data. As shown in Figure 2 (left), adaptation with our variational pseudo labels consistently achieves better overall results than with the normal pseudo labels. We also provide the results along adaptation steps in Figure 2 (right). Starting from the same baseline accuracy, the variational pseudo labels achieve faster adaptation than the normal pseudo labels. Adaptation with variational pseudo labels is also less prone to performance saturation, leading to better final accuracy.

Benefits of meta-learning

We also investigate the importance of meta-learning in our method. For this experiment, we do not use meta-learning for source training and test-time adaptation, and only use the generated variational labels for adaptation. We observe that meta-learning indeed helps our formulation: without it, it is difficult for the model to learn the ability to handle domain shifts, and there is a significant decrease in accuracy, as shown in Table 2.

Offline adaptation with less target data. In real applications, it is difficult to access the entire target set at once for adaptation. Therefore, we evaluate our method under the offline test-time adaptation setting with different amounts of target data. As shown in Figure 3, the accuracy increases notably with small numbers of target samples, e.g., 10% and 25%. However, the overall accuracy tends to saturate as the amount of target data for adaptation keeps increasing. This indicates that our method achieves good adaptation under the offline adaptation setting with even small amounts of target data, showing the applicability of the proposed method in practice.

[Excerpt of Table 3, truncated in extraction: (Dou et al., 2019) 81.0, 82.7; ER (Zhao et al., 2020) 81.5, 85.3, 74.4; Tent-BN (Wang et al., 2021) 81.3, 83.7, 61.3, 39.8; SHOT (Liang et al., 2020) 82.4, 84.1, 65.2, 33.5; T3A (Iwasawa & Matsuo, 2021).]

4.3. COMPARISONS

To further demonstrate the effectiveness of our method, we compare it with state-of-the-art test-time adaptation methods and standard domain generalization methods. Table 3 shows the results on PACS, VLCS, and TerraIncognita using ResNet-18. Compared with the other state-of-the-art domain generalization methods and test-time adaptation methods in the online adaptation setting, our method performs better. On datasets such as PACS and TerraIncognita, we outperform state-of-the-art methods significantly. We also report results on the PACS dataset using ResNet-50 as the backbone, where the performance of our method is competitive and better than most of the state-of-the-art methods. We provide the detailed comparison in Appendix 5.

5. CONCLUSION

We propose to cast test-time adaptation as a probabilistic inference problem and model pseudo labels as distributions in the formulation. By modeling the uncertainty in the pseudo-label distributions, the probabilistic formulation mitigates adaptation with inaccurate pseudo labels or predictions, which arise due to domain shifts and lead to misspecified models after adaptation. Based on the probabilistic formulation, we further propose variational pseudo labels under the meta-adaptation paradigm, which exposes the model to domain shifts and learns the ability to adapt with pseudo labels that incorporate target information from the neighboring target samples. Ablation studies and further comparisons show the effectiveness of our method on five common domain generalization datasets.

Table 5: Comparison on rotated MNIST and Fashion-MNIST. The models are evaluated on the test sets of MNIST and Fashion-MNIST with rotation angles of 0° and 90°. Our method performs better than both non-adaptive domain generalization methods (Dou et al., 2019; Piratla et al., 2020) and adaptive methods (Wang et al., 2021; Xiao et al., 2022).

Extra ablations. We also provide extra ablation studies on pseudo-label generation and adaptation with our variational pseudo labels. We directly make predictions using the variational pseudo labels generated by sampling from the pseudo-label distributions at test time. As shown in Table 6, the predictions based on the pseudo-label distributions are better than the ERM baseline, demonstrating that our variational pseudo labels are better than the original pseudo labels. Moreover, after adapting the model parameters with our variational pseudo labels, the performance further improves considerably.

Table 6: Evaluating directly with the pseudo-label distributions. We conduct these experiments on PACS using ResNet-18.
Our method that adapts the models with variational pseudo labels performs better than the prediction by directly sampling from the pseudo label distributions. 
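The sampling-based prediction used in this ablation can be illustrated with a small Monte-Carlo sketch. The Gaussian parameterization over logits and the function name below are hypothetical simplifications, not the paper's exact pseudo-label distribution:

```python
import numpy as np

def predict_from_pseudo_label_dist(mu, sigma, n_samples=32, seed=0):
    """Monte-Carlo prediction from a Gaussian pseudo-label distribution
    over logits (mean mu, scale sigma per class). A sketch: draw logit
    samples, average the softmax probabilities, and take the argmax."""
    rng = np.random.default_rng(seed)
    # logit samples: (n_samples, batch, classes)
    z = mu + sigma * rng.standard_normal((n_samples,) + mu.shape)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    probs = (e / e.sum(axis=-1, keepdims=True)).mean(axis=0)  # averaged softmax
    return probs.argmax(axis=-1), probs
```

With a small sigma the averaged prediction coincides with the argmax of the mean logits; larger sigma flattens the averaged probabilities, which is how the distributional pseudo labels express uncertainty.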

Method                          VLCS      PACS      TerraIncognita
Tent (Wang et al., 2021)        7m 28s    3m 16s    10m 34s
Tent-BN (Wang et al., 2021)     2m 8s     33s       2m 58s
SHOT (Liang et al., 2020)       8m 9s     4m 22s    12m 40s
T3A (Iwasawa & Matsuo, 2021)    2m 9s     33s       2m 59s
TAST (Jang & Chung, 2022)       10m 34s   9m 30s    26m 14s
Our method                      2m 20s    5m 33s    14m 30s

Time cost analyses. We also provide the time cost of our method in both the training (Table 7) and inference stages (Table 8). As we utilize the meta-learning strategy to learn the ability to handle domain shifts during training, the time cost during training is larger than that of the ERM baseline. Moreover, compared with the ERM baseline, our variational pseudo-label learning and meta-learning framework introduce only a few extra parameters. Since the meta-learning strategy only complicates the training process, our method has inference runtime similar to the other test-time adaptation methods, e.g., Tent (Wang et al., 2021), and source-free domain adaptation methods, e.g., SHOT (Liang et al., 2020).



Figure 1: Graphical illustrations of test-time adaptation. (a) The original test-time adaptation algorithm (Wang et al., 2021) obtains an adapted θ_t by entropy minimization on the unlabeled target data x_t with the source-trained model θ_s. (b) Our probabilistic formulation models the uncertainty of pseudo labels p(ŷ_t) for more robust adaptation. (c) Furthermore, we propose meta adaptation with variational pseudo labels to incorporate neighboring target information into pseudo-label generation and train the model under the meta-learning setting. Note that y_t′ are labels of meta-target data, observed only in training. The actual labels y_t of target data are unavailable for test-time adaptation at inference time.
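The entropy-minimization baseline in panel (a) can be sketched in a few lines. This is a minimal numpy stand-in, not the paper's or Tent's actual implementation: a linear classifier whose bias plays the role of the affine parameters Tent adapts, updated by one analytic gradient step on the mean prediction entropy:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tent_step(W, b, x, lr=0.05):
    """One entropy-minimization step on unlabeled target data x,
    updating only the bias b (a stand-in for the BN affine
    parameters adapted by Tent). Returns the updated bias and the
    mean entropy before the update."""
    z = x @ W + b
    p = softmax(z)
    H = -(p * np.log(p + 1e-12)).sum(axis=1)      # per-sample entropy
    # d(mean entropy)/dz_ij = -p_ij * (log p_ij + H_i) / N
    grad_z = -p * (np.log(p + 1e-12) + H[:, None]) / len(x)
    b_new = b - lr * grad_z.sum(axis=0)           # dz/db is the identity
    return b_new, H.mean()
```

Iterating `tent_step` sharpens the model's predictions on the target batch; the paper's point is that when those predictions are wrong under domain shift, this sharpening reinforces the errors.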

Figure 2: Benefits of variational pseudo labels. The experiments are conducted on PACS with ResNet-18 as the backbone. We compare our variational pseudo labels with common pseudo labels for different amounts of adaptation data under the offline setting (left). Our variational pseudo labels achieve consistently better overall accuracy. We also plot accuracy along adaptation steps on art-painting (right). Our method adapts faster and achieves better performance than the common pseudo labels.

Figure 3: Offline adaptation with less target data. The experiments are conducted on PACS using ResNet-18, averaged over five runs. Under the offline-adaptation setting, we observe that the accuracy increases steadily for the individual domains as the amount of test data increases.

TerraIncognita consists of 10 classes and 4 domains: Location 100, Location 38, Location 43 and Location 46, with 24,778 images.

Benefits of our probabilistic test-time adaptation. The experiments are conducted on PACS based on ResNet-18. Our probabilistic formulation achieves better performance than common test-time adaptation in both the online and offline settings. The variational pseudo labels with meta adaptation further improve the overall performance.

                    Photo        Art-painting  Cartoon      Sketch       Mean
W/o meta-learning   94.76 ±1.0   80.7 ±0.5     78.87 ±0.7   68.39 ±0.5   80.68
W/ meta-learning    95.80 ±0.5   84.32 ±0.8    83.44 ±0.4   74.57 ±0.2   84.78 ±0.36
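The meta-learning ablation above contrasts plain adaptation with training that simulates test-time adaptation in an inner loop. One episode of that simulation can be sketched as follows; the linear classifier, the single hard pseudo-label step, and the function name are hypothetical simplifications of the paper's procedure:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def meta_adaptation_episode(W, x_meta_tgt, y_meta_tgt, lr=0.1):
    """One meta test-time adaptation episode (sketch): adapt on the
    meta-target batch as if it were unlabeled, then score the adapted
    model with the meta-target labels, which are observed only during
    training."""
    n_classes = W.shape[1]
    # inner loop: pseudo-label the meta-target batch, take one
    # cross-entropy step against the hard pseudo labels
    p = softmax(x_meta_tgt @ W)
    pseudo = p.argmax(axis=1)
    onehot_pseudo = np.eye(n_classes)[pseudo]
    grad = x_meta_tgt.T @ (p - onehot_pseudo) / len(x_meta_tgt)
    W_adapted = W - lr * grad
    # outer loop: the meta-loss uses the true meta-target labels,
    # so training rewards pseudo labels that adapt the model well
    p_adapted = softmax(x_meta_tgt @ W_adapted)
    onehot_true = np.eye(n_classes)[y_meta_tgt]
    meta_loss = -np.mean(np.sum(onehot_true * np.log(p_adapted + 1e-12), axis=1))
    return W_adapted, meta_loss
```

In the full method the outer loss is backpropagated through the inner adaptation step, so the pseudo-label generator itself is trained to produce labels that lead to a well-specified adapted model.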

Comparisons on common DG datasets. The experiments are conducted on all datasets averaged over five runs. We provide the results of our method for the online setting. Our method also performs better than the state-of-the-art domain generalization methods across all datasets.

Comparisons on PACS. The experiments are conducted on PACS, averaged over five runs. We provide the results of our method under both the online and offline adaptation settings. Our method is better than the other methods in both settings with different backbones. Our method also performs better than the state-of-the-art domain generalization methods.

                                           Photo        Art-painting  Cartoon      Sketch       Mean
Prediction by pseudo-label distributions   95.97 ±1.0   81.23 ±0.5    79.19 ±0.7   73.22 ±0.5   82.40 ±0.55
Our method                                 95.80 ±0.5   84.32 ±0.8    83.44 ±0.4   74.57 ±0.2   84.78 ±0.36

Runtime required for source training on PACS using ResNet-18 as the backbone network. The proposed method has larger time costs during training due to the meta-learning strategy but introduces few extra parameters.

Runtime averaged over datasets using ResNet-18 as the backbone network. The proposed method has similar or even lower time costs at test time compared with the other test-time adaptation methods.

A DETAILED FORMULATION

We start from the objective p(y_t′ | x_t′, θ_s′, X_t′) and provide the detailed generative process of the formulation. We then introduce the pseudo labels ŷ_t′ as a latent variable into eq. (15) and derive the objective accordingly. To obtain better pseudo labels ŷ_t′, we further introduce the latent variable w_t′ into eq. (17), together with a variational posterior over the joint distribution, q(ŷ_t′, w_t′). In the resulting formulation, q(ŷ_t′, w_t′ | x_t′, θ_s′, X_t′, Y_t′) and p(ŷ_t′, w_t′ | x_t′, θ_s′, X_t′) denote the variational posterior and prior distributions, respectively.
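Since the numbered equations are not reproduced in this excerpt, the following is a minimal sketch of the variational lower bound the text implies, assuming a standard evidence-lower-bound derivation with the notation above:

```latex
\log p(y_{t'} \mid x_{t'}, \theta_{s'}, X_{t'})
  = \log \int p(y_{t'} \mid \hat{y}_{t'}, x_{t'}, \theta_{s'})\,
      p(\hat{y}_{t'}, w_{t'} \mid x_{t'}, \theta_{s'}, X_{t'})\,
      \mathrm{d}\hat{y}_{t'}\,\mathrm{d}w_{t'}
\geq \mathbb{E}_{q(\hat{y}_{t'}, w_{t'} \mid x_{t'}, \theta_{s'}, X_{t'}, Y_{t'})}
      \big[\log p(y_{t'} \mid \hat{y}_{t'}, x_{t'}, \theta_{s'})\big]
  - D_{\mathrm{KL}}\big(q(\hat{y}_{t'}, w_{t'} \mid x_{t'}, \theta_{s'}, X_{t'}, Y_{t'})
      \,\|\, p(\hat{y}_{t'}, w_{t'} \mid x_{t'}, \theta_{s'}, X_{t'})\big),
```

where the inequality follows from Jensen's inequality after multiplying and dividing by the variational posterior q inside the integral.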

B IMPLEMENTATION DETAILS

Our training setup follows Iwasawa & Matsuo (2021). We use a batch size of 70 and train our method with the ERM algorithm (Gulrajani & Lopez-Paz, 2020). As stated, our backbones, such as ResNet-18 and ResNet-34, are pretrained on ImageNet, the same as the previous methods. During training, the model with the highest validation accuracy is selected for adaptation on the target domain. We use similar settings for all domain generalization benchmarks reported in the paper. We train all our models on an NVIDIA 1080Ti GPU. In Table 8 we report the runtime required for test-time adaptation. We observe that our method is comparable to methods that are not optimization-free, such as SHOT and PL. In Table 7 we report the time consumed during source training.

C DETAILED EXPERIMENTAL RESULTS

Detailed experimental results. In Table 4 we report our detailed performance and comparisons to existing methods on PACS with both ResNet-18 and ResNet-50 as the backbone. We observe that our method improves accuracy compared to the other methods, especially on the "Art-Painting", "Cartoon" and "Photo" domains. The results show the benefit of using variational pseudo labels for adaptation. We also conduct experiments on rotated MNIST and rotated Fashion-MNIST for comparison, as shown in Table 5. We follow the settings in Piratla et al. (2020) and use ResNet-18 as the backbone. The conclusion is similar to that on PACS: our method achieves better performance than both the non-adaptive domain generalization methods and the adaptation methods.

