VARIATIONAL PROMPT TUNING IMPROVES GENERALIZATION OF VISION-LANGUAGE MODELS

Abstract

Prompt tuning provides an efficient mechanism for adapting large vision-language models to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest of the model. Existing prompt tuning methods are, however, prone to damaging the generalization capabilities of the foundation model, because the learned prompts lack the capacity to cover certain concepts within the language model. To avoid this limitation, we propose a probabilistic model of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving performance considerably in both cases, especially with regard to preserving the generalization capability of the original model. Our method provides the current state of the art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. Implementation code will be released.

1. INTRODUCTION

In a continuous quest for better pre-training strategies, models based on image and language supervision have set impressive milestones, with CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021) and Flamingo (Alayrac et al., 2022) being leading examples. Contrastively trained vision-language models consist of image and text encoders that align semantically related concepts in a joint embedding space. Such models offer impressive zero-shot image classification by using the text encoder to generate classifier weights for arbitrary, newly defined categories without relying on any visual data. In particular, the class name is inserted into a handcrafted prompt template, which is then tokenized and encoded into the shared embedding space to generate the new classifier weights.

Rather than manually defining prompts, Zhou et al. (2022b) and Lester et al. (2021) proposed that prompts can instead be optimized in a data-driven manner through back-propagation by minimizing a cross-entropy loss on the downstream task. However, despite the performance improvement on downstream tasks, prompt learning negatively affects the generalization capability of the vision-language model. While subsequent works have focused on bridging this generalization gap, e.g. Zhou et al. (2022a); Zhu et al. (2022), in practice the generalization power of the foundation model is still significantly degraded.

Our work tackles this same problem: it seeks to improve downstream performance without degrading the generalization capability of the original model. To do so, we propose a data-driven method for directly learning the underlying distribution of the prompt space associated with the target concept. In particular, we frame prompt tuning as a variational inference problem, where a base learned prompt is combined with a residual vector sampled from an instance-specific underlying distribution. This formulation provides two advantages.
First, it explores the prompt space more thoroughly and makes more informative use of the language space, leading to better generalization. Second, it enables us to boost performance by capturing uncertainty information in fine-grained classification problems. The resulting approach is orthogonal to standard prompt learning approaches, being effective when combined with both standard (Zhou et al., 2022b) and conditional (Zhou et al., 2022a) variants. In fact, when combined with the conditional approach, our method maintains the gains on seen classes provided by the conditional method while simultaneously matching or even surpassing the generalization capability on unseen classes of the original vision-language model.

In summary, our contributions in this paper are as follows:

1. We propose a variational framework that captures the general or instance-specific distribution within the prompt space. Since generalization is obtained through transfer from the language space, we obtain better generalization capability.
2. We show that the proposed approach is orthogonal to recent developments and can be successfully combined with both standard and conditional prompt learning variants.
3. We empirically show that our proposed method improves performance and provides better generalization, leading to state-of-the-art accuracy on 24 out of 28 standard benchmarks set forth by prior work, surpassing CoCoOp by 1.6% average Top-1 accuracy.
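As background for the methods discussed above, the prompt-based zero-shot classification mechanism described in this section can be sketched as follows. This is a minimal illustration, not CLIP's actual implementation: the toy `text_encoder`, the embedding dimensionality, and the prompt template are stand-ins of our own, chosen only to show how per-class prompts become normalized classifier weights compared against an image embedding by cosine similarity.

```python
import hashlib
import numpy as np

EMBED_DIM = 64  # toy dimensionality; real CLIP models use e.g. 512

def text_encoder(prompt: str) -> np.ndarray:
    """Stand-in for a frozen text encoder: a deterministic pseudo-random
    embedding derived from the prompt string (illustration only)."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)

def zero_shot_weights(class_names, template="a photo of a {}."):
    """Build classifier weights by encoding one prompt per class name
    and L2-normalizing, as in CLIP-style zero-shot classification."""
    w = np.stack([text_encoder(template.format(c)) for c in class_names])
    return w / np.linalg.norm(w, axis=1, keepdims=True)

weights = zero_shot_weights(["cat", "dog", "car"])  # (3, EMBED_DIM)

# Pretend image embedding: for illustration we reuse the text encoder, so the
# "image" is perfectly aligned with the "cat" prompt.
image_emb = text_encoder("a photo of a cat.")
image_emb /= np.linalg.norm(image_emb)
logits = weights @ image_emb  # cosine similarities against each class weight
```

New classes can be added simply by appending their names to `class_names`, which is what makes the zero-shot setup attractive; prompt tuning replaces the handcrafted template with learned context vectors.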

2. RELATED WORKS

Prompt learning in NLP. Prompt learning was originally proposed within the NLP domain, following the appearance of foundation models such as GPT-3 (Brown et al., 2020). Early prompt learning methods constructed prompts by combining words in the language space such that the model would perform better on downstream evaluation (Shin et al., 2020; Jiang et al., 2020). Subsequent methods, e.g. Li & Liang (2021); Lester et al. (2021), prepend a set of learnable prompts to the input of a frozen model and optimize them through back-propagation, which allows greater flexibility than using existing words, at the cost of producing prompts that do not correspond to an actual phrase. Instead, He et al. (2022) focus on a multi-task scenario and use a HyperNetwork to conditionally generate task-specific and layer-specific prompts that are prepended to the values and keys inside the self-attention layers of the frozen model. Within the NLP domain, prompt learning has also been shown to work better than in-context learning (Liu et al., 2022).

Prompting in vision and language models. Research on prompt learning for vision-language models has been largely inspired by prior work within NLP. Similar to e.g. Li & Liang (2021), CoOp (Zhou et al., 2022b) proposes a prompt learning method that optimizes unified or class-specific prompts in the continuous space through back-propagation. While CoOp obtains good accuracy on downstream tasks, it negatively affects the generalization ability to new, unseen classes. CoCoOp (Zhou et al., 2022a) extends CoOp and partially bridges the generalization gap by generating instance-specific prompt residuals through a conditioning mechanism dependent on the visual data.
ProGrad (Zhu et al., 2022) shares the same goal as CoCoOp of bridging the generalization gap, but instead proposes to match the gradient of the prompt to the general knowledge of the CLIP model, preventing prompt tuning from forgetting the general knowledge learned by the foundation model. Alternative directions include test-time prompt tuning (Shu et al., 2022), where consistency across multiple views is used as the supervisory signal, and unsupervised prompt learning (Huang et al., 2022), where a pseudo-labelling strategy is proposed to obtain the labels needed to drive prompt learning. Perhaps the most similar work to ours is Lu et al. (2022). In this work, the authors use an ensemble of prompts and model their distribution within the language embedding space, with optimization seeking to minimize the negative log-likelihood with respect to the corresponding visual embedding. Unlike ours, their method relies on hand-crafted rules to define the prompt ensemble, thus still depending on the effectiveness of hand-crafted designs. The number of learnable prompts is also pre-defined, potentially offering sub-optimal coverage of an NLP concept. Finally, it is not clear how to apply their strategy within the context of conditional prompt learning. We believe that modelling the input prompt space, rather than relying on a fixed number of templates, is a more powerful and flexible approach. We provide empirical evidence of the superiority of our approach in the experiments.

While beyond our current scope, it is worth noting that prompt learning has been applied to a wider range of problems and scenarios, which highlights its power and flexibility. Among them are important topics such as unsupervised domain adaptation (Ge et al., 2022), multi-label classification (Sun et al., 2022), video classification (Ju et al., 2022), and object detection (Du et al., 2022; Feng et al., 2022).
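To make the contrast with fixed prompt ensembles concrete, the base-prompt-plus-sampled-residual idea from the introduction can be sketched as below. This is a hedged illustration under our own assumptions, not the paper's implementation: the names (`W_mu`, `W_logvar`), the toy dimensions, and the choice of an image-conditioned diagonal Gaussian with a standard-normal KL regularizer are stand-ins for whatever the full variational framework specifies; in practice all parameters would be trained end-to-end against the frozen encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
CTX_LEN, CTX_DIM, IMG_DIM = 4, 16, 32  # toy sizes, chosen only for illustration

# Learnable parameters (random init here; in practice trained by back-propagation).
base_prompt = 0.02 * rng.standard_normal((CTX_LEN, CTX_DIM))
W_mu = 0.02 * rng.standard_normal((IMG_DIM, CTX_DIM))
W_logvar = 0.02 * rng.standard_normal((IMG_DIM, CTX_DIM))

def sample_prompt(image_feat, rng):
    """Reparameterized draw: prompt = base + residual, where the residual
    is sampled from an image-conditioned diagonal Gaussian q(r | x)."""
    mu = image_feat @ W_mu                    # (CTX_DIM,)
    logvar = image_feat @ W_logvar
    eps = rng.standard_normal(CTX_DIM)
    residual = mu + np.exp(0.5 * logvar) * eps
    return base_prompt + residual             # residual broadcast over context tokens

def kl_to_standard_normal(image_feat):
    """KL(q(r | x) || N(0, I)): the usual variational regularizer that keeps
    sampled residuals close to a shared prior."""
    mu = image_feat @ W_mu
    logvar = image_feat @ W_logvar
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

x = rng.standard_normal(IMG_DIM)  # pretend image feature from a frozen encoder
prompt = sample_prompt(x, rng)
kl = kl_to_standard_normal(x)
```

Because each draw yields a different prompt within the support of the learned distribution, the encoder sees a family of prompts for a concept rather than a fixed, pre-defined set of templates, which is the flexibility argued for above.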

