USING LANGUAGE TO EXTEND TO UNSEEN DOMAINS

Abstract

It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply verbalizing the training domain (e.g. "photos of birds") as well as domains we want to extend to but do not have data for (e.g. "paintings of birds") can improve robustness. Using a multimodal model with a joint image and language embedding space, our method LADS learns a transformation of the image embeddings from the training domain to each unseen test domain, while preserving task relevant information. Without using any images from the unseen test domain, we show that over the extended domain containing both training and unseen test domains, LADS outperforms standard fine-tuning and ensemble approaches over a suite of four benchmarks targeting domain adaptation and dataset bias. Code is available at https://github.com/lisadunlap/LADS.

1. INTRODUCTION

The ability to extend a model beyond the domain of the training data is central to building robust computer vision models. Methods for dealing with unseen test distributions often require leveraging additional image data, but linguistic knowledge of the anticipated domain shift is much cheaper and easier to obtain. For example, in many settings, the training images are collected in certain



Figure 1: Consider a model trained to recognize road signs in sunny weather. We aim to extend to a new domain of snowy weather. Our method LADS (Latent Augmentation using Domain descriptionS) leverages a multimodal model's knowledge of the classes and the domain shift verbalized in natural language ("sunny" to "snowy") to train an augmentation network without any samples from the unseen test domain. This network is used to translate multimodal image embeddings from the training domain to the unseen test domain, while retaining class-relevant information. Then, real and augmented embeddings are used jointly to train a classifier.
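The latent translation described in the caption can be sketched in its simplest form as a directional shift in a CLIP-style joint embedding space: move an image embedding along the direction from the source-domain text embedding ("sunny") to the target-domain text embedding ("snowy"). This is an illustrative simplification, not the paper's method — LADS trains an augmentation network with class-retention and domain-alignment objectives rather than applying a fixed shift — and the embeddings below are random placeholders standing in for CLIP encoder outputs.

```python
import numpy as np

def normalize(v):
    # Project onto the unit sphere, as CLIP-style embeddings typically are.
    return v / np.linalg.norm(v)

def augment_embedding(img_emb, src_text_emb, tgt_text_emb, alpha=0.5):
    """Shift an image embedding along the source -> target domain direction.

    A fixed linear shift is the simplest stand-in for the learned
    augmentation network in LADS: it nudges the embedding toward the
    target-domain text embedding while keeping most of its content.
    `alpha` controls the strength of the shift (hypothetical parameter).
    """
    direction = normalize(tgt_text_emb - src_text_emb)
    return normalize(img_emb + alpha * direction)

# Placeholder embeddings (stand-ins for CLIP image/text encoder outputs).
rng = np.random.default_rng(0)
img = normalize(rng.standard_normal(512))      # image: a sunny road sign
t_sunny = normalize(rng.standard_normal(512))  # text: "a photo of a sunny road sign"
t_snowy = normalize(rng.standard_normal(512))  # text: "a photo of a snowy road sign"

aug = augment_embedding(img, t_sunny, t_snowy)
# The augmented embedding is more similar to the target-domain text than
# the original image embedding was.
print(float(aug @ t_snowy) > float(img @ t_snowy))
```

In the actual method, this fixed shift is replaced by a trained network, and the classifier is then fit on both the real and the augmented embeddings so it covers the extended (sunny + snowy) domain.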

