LEARNING TO COMPOSE SOFT PROMPTS FOR COMPOSITIONAL ZERO-SHOT LEARNING

Abstract

We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) like CLIP. We develop CSP for compositional zero-shot learning, the task of predicting unseen attribute-object compositions (e.g., old cat and young tiger). VLMs have a flexible text encoder that can represent arbitrary classes as natural language prompts, but they often underperform task-specific architectures on the compositional zero-shot benchmark datasets. CSP treats the attributes and objects that define classes as learnable tokens of vocabulary. During training, the vocabulary is tuned to recognize classes that compose tokens in multiple ways (e.g., old cat and white cat). At test time, we recompose the learned attribute-object vocabulary in new combinations to recognize novel classes. We show that CSP outperforms CLIP on benchmark datasets by an average of 10.9 percentage points on AUC. CSP also outperforms CoOp, a soft prompting method that fine-tunes the prefix context tokens, by an average of 5.8 percentage points on AUC. We perform additional experiments to show that CSP improves generalization to higher-order attribute-attribute-object compositions (e.g., old white cat) and to combinations of pretrained attributes and fine-tuned objects.

1. INTRODUCTION

Compositionality, the ability to create new concepts by combining existing primitive concepts, is a long-standing goal of artificial intelligence (Chomsky, 1956; Fodor & Pylyshyn, 1988; Hupkes et al., 2020; Lake & Baroni, 2018; Marcus, 2003). The practical advantage of compositionality for deep neural networks lies in the ability to build new classifiers by combining existing classifiers. In this work, we consider compositional zero-shot learning, a classification task where the model learns to predict unseen or novel compositions of primitive concepts (Naeem et al., 2021; Nagarajan & Grauman, 2018; Purushwalkam et al., 2019).

Research on compositional zero-shot learning in language and vision focuses on attribute-object compositions such as old tiger and young tiger, where tiger is the object category described by the attributes old and young. Existing methods for compositional zero-shot learning typically map attributes and objects to pretrained word embeddings and use a pretrained image encoder backbone to jointly align the image and the attribute-object text representations to learn compositionality (Li et al., 2020; Mancini et al., 2021a;b; Misra et al., 2017; Naeem et al., 2021; Nagarajan & Grauman, 2018; Purushwalkam et al., 2019; Xu et al., 2021). However, the pretraining of the word embeddings and of the image encoder is disjoint and isolated, i.e., these methods learn to align image and text representations from scratch. These task-specific architectures are also limited in flexibility. For example, to adapt these methods to higher-order compositions with multiple attributes and objects, such as small furry cat or old white tiger, the original architecture needs to be modified. The ability to generalize beyond the original training length is a key test for compositionality (Hupkes et al., 2020).

In contrast, we propose to build on large-scale pretrained vision-language models (VLMs), which are trained on massive amounts of aligned images and text (Jain et al., 2021; Jia et al., 2021; Li et al., 2021; Radford et al., 2021). We focus on CLIP (Radford et al., 2021), a powerful vision-language model pretrained on 400 million image-text pairs. CLIP has two main components: an image encoder and a text encoder that produce vector representations for images and text in a multi-modal embedding space. The text encoder accepts a textual input, or prompt, such as A photo of dog and produces a vector representation for the class dog. Taking the cosine similarity between the image representation and each class prompt representation yields a compatibility score for every class, and we predict the class with the highest score. However, CLIP without any fine-tuning underperforms task-specific architectures, even though it has been pretrained on vastly more data (see Appendix A for details). This finding suggests that there is significant room for improvement from teaching VLMs like CLIP about composing concepts.

To improve VLMs for compositional zero-shot learning, we introduce compositional soft prompting (CSP), a parameter-efficient learning technique that tunes tokens of vocabulary to represent primitive concepts in a composable way. Fine-tuning large pretrained models such as CLIP requires huge amounts of compute and may lead to overfitting (Sung et al., 2021; Mitchell et al., 2022) (see also Section 5). This challenge has motivated several soft prompting techniques in both language and vision (Lester et al., 2021; Qin & Eisner, 2021; Vu et al., 2021; Zhou et al., 2021).
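For concreteness, the following is a minimal sketch of CLIP-style zero-shot inference with natural language prompts, assuming the open-source clip package; the class list and image path are illustrative placeholders rather than part of any benchmark setup.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["dog", "cat", "tiger"]  # hypothetical label set
prompts = [f"a photo of {c}" for c in classes]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity: normalize both sides, then use dot products as class scores.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(0)
prediction = classes[scores.argmax().item()]
```

The class with the highest compatibility score is predicted, so adding a new class only requires writing a new prompt rather than training a new classifier head.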
These works tune a single prompt on a downstream supervised task, often in a few-shot setting. For instance, they typically use prompts such as A photo of [class] and tune the prefix A photo of on the entire dataset. In contrast, CSP is a novel way of soft prompting. We treat the attributes and objects that are composed to define classes as learnable tokens of vocabulary in a prompt of the form A photo of [attribute] [object]. We tune the vocabulary on multiple [attribute] and [object] prompt compositions, and then we recompose the learned tokens into new prompts for zero-shot inference (Figure 1); a minimal sketch of this composition mechanism follows the contribution list below.

Our results show that CSP improves over the zero-shot performance of CLIP. CSP significantly improves over CLIP across three benchmark datasets by an average of 13.7 percentage points in the closed-world setting and 8.0 percentage points in the open-world setting (using the AUC metric). CSP also outperforms CoOp, a soft prompting method that tunes the prefix context, by an average of 7.3 percentage points in the closed-world setting and 4.3 percentage points in the open-world setting on the AUC metric.

In addition to improved benchmark accuracy, CSP has several other advantages when tested on other kinds of zero-shot inference without any changes to training. We show that the learned attribute vocabulary can be decomposed to better classify attributes in isolation, using prompts of the form A photo of [attribute] object. We also show that training CSP with attribute-object compositions improves CLIP's performance on attribute-attribute-object compositions. Finally, we show that CSP improves generalization to compositions of unseen attributes and seen objects, whereas prior work on compositional zero-shot learning typically only evaluates unseen compositions of seen attributes and seen objects.

In summary, our main contributions are:

- We propose compositional soft prompting (CSP), a parameter-efficient learning technique that treats the attributes and objects defining classes as learnable tokens of vocabulary and recomposes them into prompts for novel classes.
- We show that CSP outperforms CLIP by an average of 10.9 percentage points on AUC and CoOp by an average of 5.8 percentage points on AUC across three benchmark datasets.
- We show that CSP generalizes further without changes to training, improving attribute classification in isolation, higher-order attribute-attribute-object compositions, and combinations of pretrained attributes and fine-tuned objects.
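To make the composition mechanism concrete, the following is a minimal sketch of splicing learnable attribute and object token embeddings into a CLIP prompt. It is an illustration under stated assumptions rather than the exact implementation: it assumes the internals of the open-source CLIP text encoder (token_embedding, positional_embedding, transformer, ln_final, text_projection) and that each placeholder word in the template occupies a single token position.

```python
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for simplicity
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the vocabulary is tuned

attributes = ["old", "young", "white"]  # illustrative primitive concepts
objects = ["cat", "tiger"]

# One learnable token embedding per attribute and per object (randomly
# initialized here for brevity; in practice they would be initialized from
# the pretrained embeddings of the corresponding words and then fine-tuned).
embed_dim = model.token_embedding.weight.shape[1]
attr_vocab = nn.Parameter(0.02 * torch.randn(len(attributes), embed_dim, device=device))
obj_vocab = nn.Parameter(0.02 * torch.randn(len(objects), embed_dim, device=device))

# Template with two single-token placeholders for the attribute and object.
template = clip.tokenize("a photo of x x").to(device)

def compose_prompt(attr_idx, obj_idx):
    """Splice learnable attribute/object vectors into the frozen prompt and
    encode the sequence with CLIP's text transformer."""
    with torch.no_grad():
        token_emb = model.token_embedding(template).squeeze(0)  # (77, dim)
    prefix = token_emb[:4]   # [SOT] a photo of  (assumed token positions)
    suffix = token_emb[6:]   # [EOT] and padding
    seq = torch.cat([prefix,
                     attr_vocab[attr_idx:attr_idx + 1],
                     obj_vocab[obj_idx:obj_idx + 1],
                     suffix], dim=0).unsqueeze(0)
    x = seq + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    eot = template.argmax(dim=-1)  # index of the end-of-text token
    return x[0, eot[0]] @ model.text_projection

# During training, prompts for seen attribute-object pairs are scored against
# image features and the vocabulary is tuned; at test time, unseen pairs are
# composed with the same function, e.g., "old tiger" below.
unseen_class_rep = compose_prompt(attr_idx=0, obj_idx=1)
```

At inference, the cosine similarity between the image representation and each composed prompt representation gives the class scores, exactly as in the zero-shot inference sketch above.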



Figure 1: An overview of compositional zero-shot learning with CSP. We fine-tune the vocabulary for attributes and objects on the seen classes. Then we compose novel soft prompts to test on the unseen classes.

Our code is available at https://github.com/BatsResearch/csp.

