MULTI-PROMPT ALIGNMENT FOR MULTI-SOURCE UNSUPERVISED DOMAIN ADAPTATION

Abstract

Most existing methods for multi-source unsupervised domain adaptation (UDA) rely on a common feature encoder to extract domain-invariant features. However, learning such an encoder involves updating the parameters of the entire network, which makes the optimization computationally expensive, particularly when coupled with min-max objectives. Inspired by recent advances in prompt learning that adapt high-capacity deep models for downstream tasks in a computationally economical way, we introduce Multi-Prompt Alignment (MPA), a simple yet efficient two-stage framework for multi-source UDA. Given a source and target domain pair, MPA first trains an individual prompt to minimize the domain gap through a contrastive loss, while tuning only a small set of parameters. Then, MPA derives a low-dimensional latent space through an auto-encoding process that maximizes the agreement of multiple learned prompts. The resulting embedding further facilitates generalization to unseen domains, making MPA naturally suitable for test-time adaptation. Extensive experiments show that our method achieves state-of-the-art results on popular benchmark datasets while requiring substantially fewer tunable parameters. To the best of our knowledge, we are the first to apply prompt learning to the multi-source UDA problem, and our method achieves the highest reported average accuracy of 54.1% on DomainNet, the most challenging UDA dataset to date, with only 15.9M parameters trained. More importantly, we demonstrate that the learned embedding space can be easily adapted to novel unseen domains with even fewer tuned parameters.

1. INTRODUCTION

Deep learning has achieved remarkable progress in various computer vision tasks such as image classification (Krizhevsky et al., 2012; He et al., 2016), object detection (Ren et al., 2015; Redmon et al., 2016; Liu et al., 2016) and image segmentation (Long et al., 2015a; Chen et al., 2017). However, these successes rely on high-capacity models trained in a supervised manner using a massive amount of manually labeled data, which is oftentimes expensive and time-consuming to collect. Furthermore, current deep models are brittle to the presence of domain shift (Quinonero-Candela et al., 2008; Torralba & Efros, 2011; Zhang et al., 2013) between training and testing distributions, in the form of different image styles, varied lighting conditions, diverse viewpoints, etc. Unsupervised domain adaptation (UDA) is a popular strategy that mitigates domain discrepancies by transferring knowledge learned from a well-labeled source domain to an unlabeled target domain (Pan & Yang, 2010; Csurka, 2017; Wang & Deng, 2018). While significant advances have been achieved, current approaches focus on the single-source setting, where all the labeled training data share the same distribution. In practice, however, it is more common for the labeled data to be collected from multiple sources that are diverse in distribution. Naturally, one could still tackle this problem by straightforwardly combining all the data into one single source and applying off-the-shelf UDA methods. However, directly applying single-source UDA methods often results in limited performance, as domain shift also exists among the different source domains. The integration of multiple source domains for improved adaptation results on the unlabeled target domain is generally known as multi-source unsupervised domain adaptation. Inspired by the theoretical analysis of Ben-David et al. (2006), learning domain-invariant feature representations has become a prevailing paradigm for multi-source UDA.
One typical approach is to jointly learn a common feature extractor together with domain-specific feature extractors and classifier heads. Various feature distance metrics (Long et al., 2015b; Sun & Saenko, 2016; Kang et al., 2019) or domain adversarial training (Tzeng et al., 2017) can be leveraged to serve as a preliminary alignment between source and target domains, followed by different auxiliary losses carefully designed to further reduce the domain shift. While these methods offer decent results, they require optimizing the entire set of parameters in the network, which poses a significant challenge for optimization even without the widely used min-max objectives (Zhao et al., 2018; Li et al., 2018; Hoffman et al., 2018). This problem is further amplified if we wish to apply more advanced backbones such as Vision Transformers (Dosovitskiy et al., 2021) for improved performance.

In this paper, we introduce a simple yet efficient approach for multi-source UDA that avoids retraining the entire network and the use of complicated min-max loss functions (see Figure 1 for a comparison). In particular, we build upon prompt learning (Lester et al., 2021; Wei et al., 2021), which has been designed to transfer knowledge learned from large pre-trained vision-language models such as CLIP (Radford et al., 2021). In prompt learning, image representations are learned contrastively with a piece of language text, termed a "prompt," that describes the class of the image. While recent studies (Ge et al., 2022; Ben-David et al., 2022) suggest that learnable prompts can be used for UDA, they are restricted to the single-source scenario, and directly generalizing them to the multi-source setting produces limited results. Furthermore, while prompt learning provides an efficient alternative, we argue that the number of parameters to be tuned can be reduced even further.

In light of this, we present a surprisingly simple framework, Multi-Prompt Alignment (MPA), for multi-source UDA. MPA is composed of two stages: one learns an individual prompt by tuning a small set of parameters for each source and target domain pair, and the other mines the relationships among the learned prompts by deriving a shared embedding space. The resulting embedding is expected to be domain-invariant and to generalize to unseen domains. More specifically, in the first stage, given a source domain and a target domain, we use CLIP as our backbone and learn one prompt tailored to that pair. We then align all the learned prompts in a latent space of a lower dimension d_I. This is accomplished by a simple auto-encoder network with a reconstruction loss. Additionally, we incorporate an L1 loss so that the reconstructed prompts agree on the classification of target images. This is beneficial for prompts to handle situations in which the target data lie near the decision boundary.

We conduct extensive experiments on multiple benchmarks, and the results clearly show that our method outperforms state-of-the-art methods in the multi-source setting. In particular, on DomainNet (Peng et al., 2019), the most challenging dataset for multi-source UDA so far, MPA surpasses all state-of-the-art methods. More importantly, as the latent space is optimized with prompts from multiple source domains, it encodes knowledge shared by different domains and can potentially generalize to unseen domains by traversing the space. Consequently, we show how surprisingly easy it is to tune the learned low-dimensional embedding for deployment in unseen domains.

Figure 1: (a) Most conventional multi-source UDA methods use a common feature extractor with domain-specific classifier heads, while we introduce prompt learning to multi-source UDA and omit the repeated need for classifier heads. (b) MPA outperforms all other multi-source UDA methods by a large margin on the DomainNet dataset with roughly one third of the tunable parameters. We also introduce a Latent Space Tuning (LST) strategy that further reduces the trainable parameters from 15.9M (MPA) to 1.47M while remaining capable of achieving high accuracy. See texts for more details.
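To give a concrete picture of the first stage, the sketch below shows how a CLIP-style backbone turns per-class prompt text embeddings into class probabilities for an image: features are L2-normalized, cosine similarities to each class prompt are scaled by a temperature, and a softmax yields the prediction that the training objective acts on while the encoders stay frozen. This is a minimal numpy illustration under our own assumptions; the feature dimension, temperature value, and all function names are ours, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project features onto the unit sphere, as CLIP does before
    # computing cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def prompt_logits(image_feat, class_text_feats, temperature=0.07):
    """Temperature-scaled cosine similarity between one image embedding
    and the text embeddings of K learnable class prompts.

    image_feat:       (d,)   image embedding from the frozen image encoder
    class_text_feats: (K, d) text embeddings of the K class prompts
    """
    img = l2_normalize(image_feat)
    txt = l2_normalize(class_text_feats, axis=-1)
    return (txt @ img) / temperature

def predict_proba(image_feat, class_text_feats):
    # Softmax over the similarity logits gives class probabilities; a
    # cross-entropy / contrastive loss on these drives the prompt updates
    # while both CLIP encoders remain frozen.
    logits = prompt_logits(image_feat, class_text_feats)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy run with random features standing in for CLIP outputs.
rng = np.random.default_rng(0)
probs = predict_proba(rng.normal(size=512), rng.normal(size=(10, 512)))
```

Only the prompt parameters receive gradients in this stage, which is what keeps the number of tuned parameters small relative to fine-tuning the full backbone.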

