USING LANGUAGE TO EXTEND TO UNSEEN DOMAINS

Abstract

It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply verbalizing the training domain (e.g. "photos of birds") as well as domains we want to extend to but do not have data for (e.g. "paintings of birds") can improve robustness. Using a multimodal model with a joint image and language embedding space, our method LADS learns a transformation of the image embeddings from the training domain to each unseen test domain, while preserving task relevant information. Without using any images from the unseen test domain, we show that over the extended domain containing both training and unseen test domains, LADS outperforms standard fine-tuning and ensemble approaches over a suite of four benchmarks targeting domain adaptation and dataset bias. Code is available at https://github.com/lisadunlap/LADS.

1. INTRODUCTION

The ability to extend a model beyond the domain of the training data is central to building robust computer vision models. Methods for dealing with unseen test distributions often require leveraging additional image data, but linguistic knowledge of the anticipated domain shift is much cheaper and easier to obtain. For example, in many settings, the training images are collected in certain conditions (e.g., daylight, clear weather) but our sensors may also experience less common yet easy-to-anticipate conditions (e.g., night, snow, haze, illustrations). Directly collecting or creating data in all possible anticipated settings is often prohibitively expensive. Thus, it is of great interest to linguistically extend to unseen domains: that is, to utilize language to improve performance on an unseen test domain without sacrificing performance on the training domain.

The use of language in domain generalization has generated significant interest with the development of large vision-language models such as CLIP (Radford et al., 2021), Flamingo (Alayrac et al., 2022), and ALIGN (Jia et al., 2021), which allow users to create zero-shot classifiers using only class names. However, while these models have been shown to achieve remarkable cross-domain generalization, their zero-shot classifiers often perform far worse than models trained for a particular downstream task (Radford et al., 2021; Kumar et al., 2022). When training data is available for the downstream task, a common practice is to fine-tune these models on the training data. While this significantly improves in-domain accuracy, it degrades performance on unseen domains.

We show that it is possible to leverage the domain-level knowledge (e.g. sunny environments vs. snowy environments in our example) contained in CLIP or similar models to deal with a variety of domain shifts in a way that requires no data from the new test domain, exploits the labeled training data, and is fast to train.
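To make the zero-shot setup concrete, the sketch below illustrates how a joint image-language embedding space yields a classifier from class names alone: each class name is embedded as text, and an image is assigned to the class whose text embedding is most cosine-similar. The random vectors here are hypothetical stand-ins; in practice the embeddings would come from the image and text encoders of a model such as CLIP.

```python
import numpy as np

def normalize(v):
    # Unit-normalize along the last axis, as is standard before cosine comparison.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical stand-ins for 512-d CLIP embeddings (not real model outputs).
rng = np.random.default_rng(0)
d = 512
class_names = ["stop sign", "yield sign", "speed limit sign"]
# One text embedding per prompt such as "a photo of a {class name}".
text_emb = normalize(rng.normal(size=(len(class_names), d)))
# Simulate an image embedding lying close to class 0 in the joint space.
image_emb = normalize(text_emb[0] + 0.02 * rng.normal(size=d))

# Zero-shot prediction: the class whose text embedding is most
# cosine-similar to the image embedding.
scores = text_emb @ image_emb
pred = class_names[int(np.argmax(scores))]
print(pred)
```

Because both modalities share one embedding space, no image labels are needed to build this classifier; this is the property LADS exploits to describe domains, not just classes, in language.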
Our method only requires users to input text descriptions of the training and unseen test domains (e.g. "a sunny stop sign" and "a snowy stop sign") along with their training data. To achieve language-guided domain generalization, we leverage the broad domain knowledge encoded in CLIP coupled with its shared image-language embedding space to perform latent feature augmentation of the training set. More precisely, the embeddings of these textual descriptions are used to train an augmentation model which learns a transformation on the CLIP image embeddings of the training domain and "places" them in the new domain (see Figure 1). We train this augmentation model with two objectives: (1) translating the image embedding from the training domain to the unseen test domain, while (2) retaining the class-specific information of the original image. Once this transformation is learned, we train a simple linear classifier on the combined augmented and unaugmented image embeddings, resulting in a classifier that outperforms common fine-tuning methods on the extended domain while achieving similar performance on the training domain.

We introduce LADS, a method to extend a model to new domains given only a language description of the distribution shift. Our main contributions are (1) the introduction of the Domain Extension with Language problem, (2) a novel language-guided latent feature augmentation training procedure, and (3) the extension of our method to address spurious correlation biases in the training data. We evaluate LADS on two domain adaptation benchmarks, DomainNet (Peng et al., 2019) and CUB-Paintings (Wang et al., 2020), as well as two benchmarks exhibiting color and contextual bias, Colored MNIST (Arjovsky et al., 2021) and Waterbirds (Sagawa et al., 2019). On the domain adaptation benchmarks, we show that we improve out-of-domain performance by 1-3% while matching in-domain performance of fine-tuned and ensembled models.
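The two training objectives above can be sketched in a few lines. This is a minimal, hypothetical illustration (random vectors in place of real CLIP embeddings, a single linear map in place of the actual augmentation network, and no gradient updates), showing only how the two loss terms are computed: a domain-alignment term pulling the augmented embedding toward the unseen-domain text embedding, and a class-consistency term keeping it close to the original image embedding.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cosine(a, b):
    # Cosine similarity along the last axis.
    return np.sum(normalize(a) * normalize(b), axis=-1)

# Hypothetical stand-ins for 512-d CLIP embeddings.
rng = np.random.default_rng(0)
d = 512
img_emb = normalize(rng.normal(size=(8, d)))   # training-domain image embeddings
text_tgt = normalize(rng.normal(size=d))       # e.g. "a photo of a snowy stop sign"

# Augmentation model: here a single linear map initialized near the identity;
# the actual network and its training loop are omitted.
W = np.eye(d) + 0.01 * rng.normal(size=(d, d))
aug_emb = normalize(img_emb @ W.T)

# (1) Domain-alignment loss: augmented embeddings should move toward the
#     unseen-domain text embedding.
loss_domain = np.mean(1.0 - cosine(aug_emb, text_tgt))

# (2) Class-consistency loss: augmented embeddings should stay close to the
#     original image embeddings, preserving class-relevant content.
loss_class = np.mean(1.0 - cosine(aug_emb, img_emb))

loss = loss_domain + loss_class
print(float(loss_domain), float(loss_class))
```

After training such a network, a linear classifier would be fit on the union of original and augmented embeddings, as described above.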
On the biased benchmarks, we show an almost 2x improvement in out-of-domain performance over fine-tuned models. Across all benchmarks, LADS achieves the highest accuracy on the entire extended test domain containing both training and unseen test domain samples. Finally, we perform an in-depth analysis of the altered image embeddings, the effect of each loss function, and the effect of different vision and language models to understand our framework better.

2. RELATED WORK

Domain Adaptation/Generalization. The challenge of out-of-domain generalization is well studied (Recht et al., 2019; Petryk et al., 2022; Kumar et al., 2022; Santurkar et al., 2021; Hendrycks & Dietterich, 2019), with a large body of work in domain adaptation addressing the problem of adapting a model to perform well on a new target domain. A typical domain adaptation approach collects additional unlabeled data from the target domain (Ganin & Lempitsky, 2015; Saito et al., 2017; Arjovsky et al., 2021; Kim et al., 2018; Tzeng et al., 2015) and learns features such that a classifier cannot distinguish the source domain from the target domain. In the limited-data setting, few-shot domain adaptation (Motiian et al., 2017; Yue et al., 2021) aims to learn from as little as one example in the target domain. Work in domain generalization (Wang &



Figure 1: Consider a model trained to recognize road signs in sunny weather. We aim to extend to a new domain of snowy weather. Our method LADS (Latent Augmentation using Domain descriptionS) leverages a multimodal model's knowledge of the classes and the domain shift verbalized in natural language ("sunny" to "snowy") to train an augmentation network without any samples from the unseen test domain. This network is used to translate multimodal image embeddings from the training domain to the unseen test domain, while retaining class-relevant information. Then, real and augmented embeddings are used jointly to train a classifier.

