VISUAL PROMPT TUNING FOR TEST-TIME DOMAIN ADAPTATION

Anonymous

Abstract

Models should be able to adapt to unseen data during test time to avoid performance drops caused by the distribution shifts that are inevitable in real-world deployment. In this work, we tackle the practical yet challenging test-time adaptation (TTA) problem, where a model adapts to the target domain without accessing the source data. We propose a simple recipe called Data-efficient Prompt Tuning (DePT) with two key ingredients. First, DePT plugs visual prompts into the vision Transformer and tunes only these source-initialized prompts during adaptation. We find that such parameter-efficient finetuning can efficiently adapt the model representation to the target domain without overfitting to the noise in the learning objective. Second, DePT bootstraps the source representation to the target domain via memory-bank-based online pseudo-labeling. A hierarchical self-supervised regularization specially designed for prompts is jointly optimized to alleviate error accumulation during self-training. With far fewer tunable parameters, DePT demonstrates not only state-of-the-art performance on the major adaptation benchmarks VisDA-C, ImageNet-C, and DomainNet-126, but also superior data efficiency: adapting with only 1% or 10% of the target data incurs little performance degradation compared to using 100%. In addition, DePT readily extends to online and multi-source TTA settings.

1. INTRODUCTION

Deep neural networks achieve excellent performance when the testing data (target domain) follow the same distribution as the training data (source domain). However, their generalization on the testing data is not guaranteed when a distribution shift occurs between source and target. Even simple domain shifts, such as common corruptions (Hendrycks & Dietterich, 2019) or appearance variations (Geirhos et al., 2018), can cause a significant performance drop. Solving the domain-shift problem is non-trivial: on one hand, it is impossible to train a single model that covers an infinite number of domains; on the other hand, training an individual model for each domain requires many annotated samples, which incurs significant data collection and labeling costs.

In this paper, we tackle the practical yet challenging test-time domain adaptation (TTA) problem. Compared with conventional unsupervised domain adaptation (UDA) (Long et al., 2015), TTA adapts a source-initialized model with unlabeled target domain data during testing, without access to the source data. TTA avoids the privacy issues of sharing source data and is desirable for real-world applications. We consider both offline and online TTA settings. In offline adaptation, also known as source-free adaptation (Kundu et al., 2020; Liang et al., 2020), the model is first updated with unlabeled target data and then performs inference. In online adaptation, the model updates and performs inference simultaneously on testing data that arrives in a stream.

The key challenges of TTA are twofold. First, how can the source-initialized model be effectively modulated given a noisy unsupervised learning objective? Tent (Wang et al., 2020) optimizes the parameters of the batch normalization layers, which is parameter-efficient but lacks adaptation capacity. On the other hand, SHOT (Liang et al., 2020) tunes the feature encoder, and AdaContrast (Chen et al., 2022) trains the whole model.
Given today's over-parameterized models, these methods are prone to overfitting to the unreliable unsupervised learning objective, especially when the amount of target domain data is limited. We present more evidence in Appendix A to illustrate our motivation for visual prompt tuning. In this work, we propose Data-efficient test-time adaptation via Prompt Tuning (DePT). Our recipe is simple yet effective: two key ingredients address the two challenges above (see Fig. 2). First, a collection of learnable visual prompts (Jia et al., 2022) is trained with labeled source data alongside the vision Transformer (Dosovitskiy et al., 2020). In the adaptation phase, we finetune only the prompts and the classification head while freezing the backbone parameters. The frozen backbone retains the knowledge learned in the source domain, yet tuning the prompts alone can effectively modulate the model for the target domain. Second, given only unlabeled target domain data, DePT bootstraps the source-initialized model via self-training: pseudo labels are first predicted and then refined by soft voting among nearest-neighbor data points in a memory bank. To further alleviate error accumulation during self-training, we design a hierarchical fine-grained self-supervised regularization term on the prompts to encourage better target-domain representation learning. The two objectives are complementary and jointly optimized.

These two ingredients give DePT multiple merits and lead to state-of-the-art performance on major domain adaptation benchmarks: VisDA-C (Peng et al., 2017), ImageNet-C (Hendrycks & Dietterich, 2019), and DomainNet-126 (Peng et al., 2019). Tuning prompts at test time is much more parameter-efficient than full fine-tuning: as illustrated in Fig. 1, with only 0.19% tunable parameters, DePT achieves 1.2% higher accuracy than the previous state of the art, AdaContrast (Chen et al., 2022). Moreover, parameter efficiency brings data efficiency. Compared with AdaContrast, DePT improves significantly in the low-data regime: with only 1% of the unlabeled target data in VisDA-C, DePT achieves 88.0% average accuracy, surpassing AdaContrast by 7.1%. Under the more challenging online TTA setting, DePT attains 85.9% average accuracy on VisDA-C. For robustness to corruptions, DePT achieves the lowest error rate on 14 of the 15 corruption types at severity level 5 on ImageNet-C, with a 34.9% average error rate. DePT also extends flexibly to further DA scenarios such as multi-source (Peng et al., 2019) domain adaptation.
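The prompt-tuning ingredient can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the released code: the class, function, and hyper-parameter names (`VisualPrompts`, `n_prompts`, `dim`) are our own assumptions. The idea is that prompt tokens are concatenated with the patch tokens before the Transformer blocks, and only the prompts and the classification head remain trainable at test time.

```python
import torch
import torch.nn as nn

class VisualPrompts(nn.Module):
    """Shallow visual prompt tokens in the style of VPT (Jia et al., 2022).

    A minimal sketch under illustrative hyper-parameters; the prompt
    tokens are shared across the batch and prepended to the patch
    token sequence before the Transformer blocks.
    """

    def __init__(self, n_prompts=50, dim=768):
        super().__init__()
        self.tokens = nn.Parameter(torch.empty(1, n_prompts, dim))
        nn.init.trunc_normal_(self.tokens, std=0.02)

    def forward(self, patch_tokens):  # patch_tokens: (B, N, D)
        b = patch_tokens.shape[0]
        # prepend the shared prompts to every sequence in the batch
        return torch.cat([self.tokens.expand(b, -1, -1), patch_tokens], dim=1)


def adaptation_parameters(prompts, head, backbone):
    """Freeze the backbone and return only the parameters tuned at test time."""
    for p in backbone.parameters():
        p.requires_grad = False
    return list(prompts.parameters()) + list(head.parameters())
```

Because the backbone receives no gradient updates, the adaptation optimizer sees only the prompt tensor and the head, which is what keeps the tunable-parameter count below one percent of the full model.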

2. RELATED WORK

The unsupervised domain adaptation (UDA) setting is commonly used, where unlabeled data from the testing distribution (target domain) are available during training along with labeled data from the training distribution (source domain). Many studies attempt to solve this problem via style transfer (Hoffman et al., 2018; Tang et al., 2021; Taigman et al., 2016), feature alignment (Long et al., 2015; Sun et al., 2017; Peng et al., 2019), or learning domain-invariant features through adversarial training (Ganin et al., 2016; Tzeng et al., 2017). However, UDA makes the strong assumption that source and target domain data are both accessible during training, which does not always hold, as it is difficult to access source data after the model is deployed. Moreover, these methods usually need to retrain the whole framework whenever a new domain arrives.



Second, given only unlabeled target domain data, what kind of learning objective should be used for optimization? Existing works propose unsupervised objectives, including entropy minimization (Wang et al., 2020), self-supervised auxiliary tasks (Sun et al., 2019b), pseudo-labeling (Lee et al., 2013), or a combination of these (Liang et al., 2020; Chen et al., 2022; Liu et al., 2021b). However, these unsupervised objectives are either not well aligned with the main task (Liu et al., 2021b) or provide noisy supervision (Liang et al., 2020).
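As a concrete illustration of the memory-bank pseudo-labeling objective of the kind used by AdaContrast, the sketch below refines a batch of predictions by soft voting among nearest neighbours in a feature bank. All tensor names and the choice of `k` are illustrative assumptions, not the papers' code.

```python
import torch
import torch.nn.functional as F

def refine_pseudo_labels(feats, bank_feats, bank_probs, k=5):
    """Refine pseudo-labels by soft voting over nearest neighbours.

    A minimal sketch in the spirit of AdaContrast (Chen et al., 2022):
    `bank_feats` / `bank_probs` hold the features and softmax outputs
    of recently seen target samples, kept in a memory bank.
    """
    feats = F.normalize(feats, dim=1)
    bank_feats = F.normalize(bank_feats, dim=1)
    sim = feats @ bank_feats.T                  # cosine similarity, (B, M)
    idx = sim.topk(k, dim=1).indices            # k nearest neighbours per sample
    voted = bank_probs[idx].mean(dim=1)         # average neighbour predictions
    return voted.argmax(dim=1)                  # refined hard pseudo-labels
```

Averaging neighbour predictions smooths out individual mispredictions, which is why such refinement reduces the noise that plain pseudo-labeling would otherwise feed back into self-training.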

Figure 1: Test-time adaptation performance of different methods as a function of the target data ratio on the VisDA dataset. The number in the legend denotes the number of tunable parameters. DePT-G outperforms the previous SOTA, AdaContrast, on 100% target data with only 0.19% tunable parameters; its advantage is more pronounced in low-data settings.

