VARIATIONAL PROMPT TUNING IMPROVES GENERALIZATION OF VISION-LANGUAGE MODELS

Abstract

Prompt tuning provides an efficient mechanism to adapt large vision-language models to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest of the model. Existing prompt tuning methods are, however, prone to damaging the generalization capabilities of the foundation models, because the learned prompts lack the capacity to cover certain concepts within the language model. To avoid this limitation, we propose a probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving the performance considerably in both cases, especially with regard to preserving the generalization capability of the original model. Our method provides the current state-of-the-art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. Implementation code will be released.

1. INTRODUCTION

In a continuous quest for better pre-training strategies, models based on image and language supervision have set impressive milestones, with CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021) and Flamingo (Alayrac et al., 2022) being leading examples. Contrastively trained vision-language models consist of image and text encoders that align semantically-related concepts in a joint embedding space. Such models offer impressive zero-shot image classification by using the text encoder to generate classifier weights for arbitrary, newly defined categories without relying on any visual data. In particular, the class name is inserted into a handcrafted prompt template, which is then tokenized and encoded into the shared embedding space to generate new classifier weights. Rather than manually defining prompts, Zhou et al. (2022b) and Lester et al. (2021) proposed that prompts can instead be optimized in a data-driven manner through back-propagation by minimizing a cross-entropy loss on the downstream task. However, despite the performance improvement on downstream tasks, prompt learning negatively affects the generalization capability of the vision-language model. While subsequent works have focused on how to bridge the generalization gap, e.g. Zhou et al. (2022a); Zhu et al. (2022), in practice the generalization power of the foundation model is still significantly degraded. Our work tackles this same problem: it seeks to improve downstream performance without degrading the generalization capability of the original model. To do so, we propose a data-driven method for directly learning the underlying distribution within the prompt space associated with the target concept. In particular, we frame prompt tuning as a variational inference problem, where a base learned prompt is combined with a residual vector sampled from the instance-specific underlying distribution. This formulation provides two advantages.
First, it investigates the prompt space more thoroughly and results in more informative use of the language space, leading to better generalization. Second, it enables us to boost performance by capturing the uncertainty information in fine-grained classification problems. The resulting approach is orthogonal to standard prompt learning approaches, being effective when combined with both standard (Zhou et al., 2022b) and conditional (Zhou et al., 2022a) approaches. In fact, when combined with the conditional approach, our method maintains the gains on the seen classes provided by the conditional method while simultaneously matching or even surpassing the generalization capability on unseen classes of the original vision-language model. In summary, our contributions in this paper are as follows:
1. We propose a variational framework that is capable of capturing the general or instance-specific distribution within the prompt space. Since generalization is obtained through transfer from the language space, we obtain better generalization capability.
2. We show that the proposed approach is orthogonal to recent developments, and can be successfully combined with both standard and conditional prompt learning variants.
3. We empirically show that our proposed method improves performance and provides better generalization, leading to state-of-the-art accuracy in 24 out of 28 standard benchmarks set forth by prior work, surpassing CoCoOp by 1.6% average Top-1 accuracy.

2. RELATED WORKS

Prompt learning in NLP. Prompt learning was originally proposed within the NLP domain, following the appearance of foundation models such as GPT-3 (Brown et al., 2020). Early prompt learning methods constructed prompts by combining words in the language space such that the model would perform better on downstream evaluation (Shin et al., 2020; Jiang et al., 2020). Subsequent methods, e.g. Li & Liang (2021); Lester et al. (2021), prepend a set of learnable prompts to the input of a frozen model and optimize them through back-propagation, which allows more flexibility than using existing words, at the cost of producing prompts that do not correspond to an actual phrase. Instead, He et al. (2022) focus on a multi-task scenario and use a HyperNetwork to conditionally generate task-specific and layer-specific prompts that are prepended to the values and keys inside the self-attention layers of the frozen model. Within the NLP domain, prompt learning has also been shown to work better than in-context learning (Liu et al., 2022).

Prompting in Vision and Language models. Research on prompt learning for vision-language models has been largely inspired by prior work within NLP. Similar to e.g. Li & Liang (2021), CoOp (Zhou et al., 2022b) proposes a prompt learning method that optimizes unified or class-specific prompts in the continuous space through back-propagation. While CoOp obtains good accuracy on downstream tasks, it negatively affects the generalization ability to new unseen classes. CoCoOp (Zhou et al., 2022a) extends CoOp and partially bridges the generalization gap by generating instance-specific prompt residuals through a conditioning mechanism dependent on the visual data.
ProGrad (Zhu et al., 2022) shares the same goal as CoCoOp of bridging the generalization gap, but instead proposes to match the gradient of the prompt to the general knowledge of the CLIP model to prevent prompt tuning from forgetting the general knowledge learned from the foundation model. Alternative directions consist of test-time prompt tuning (Shu et al., 2022), where consistency across multiple views is used as the supervisory signal, and unsupervised prompt learning (Huang et al., 2022), where a pseudo-labelling strategy is proposed instead to obtain the labels needed to drive the prompt learning. Perhaps the most similar work to ours is Lu et al. (2022). In this work, the authors use an ensemble of prompts and model their distribution within the language embedding space, with optimization seeking to minimize the negative log-likelihood with respect to the corresponding visual embedding. Unlike ours, their method relies on hand-crafted rules to define the prompt ensemble, thus still relying on the effectiveness of hand-crafted designs. The number of learnable prompts is also pre-defined, potentially offering sub-optimal coverage of an NLP concept. Finally, it is not clear how to apply their strategy within the context of conditional prompt learning. We believe that modelling the input prompt space rather than relying on a fixed number of templates is a more powerful and flexible approach. We provide empirical evidence of the superiority of our approach in the experiments. While beyond our current scope, it is worth noting that prompt learning has been applied to a wider range of problems and scenarios, which highlights its power and flexibility. Among them are important topics such as unsupervised domain adaptation (Ge et al., 2022), multi-label classification (Sun et al., 2022), video classification (Ju et al., 2022), object detection (Du et al., 2022; Feng et al., 2022) or pixel-level labelling (Rao et al., 2022). Finally, prompt learning as a means to adapt pre-trained models has also been applied to purely vision models (Jia et al., 2022; Sandler et al., 2022), providing similar performance to fine-tuning the whole model but with great parameter efficiency.

Figure 1: Overview of variational prompt tuning. For each input image $x$, we use image features $f(x)$ to infer the mean $\mu(x)$ and standard deviation $\Sigma(x)$ of the residual distribution using the metanet $\pi_\phi$. The prompts used to generate the classifier weights are constructed by summing the learnable prompts $p$ and residual samples from the residual distribution. The obtained prompts are fed through a text encoder $g(t)$, and the classifier weights are estimated. Finally, the cosine similarity scores are computed between the image features $f(x)$ and the classifier weights.

3.1. BACKGROUND

Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021) consists of an image encoder $f(x)$ and a text encoder $g(t)$, each producing a $d$-dimensional ($L_2$-normalized) embedding from an arbitrary image $x \in \mathbb{R}^{3 \times H \times W}$ and word embeddings $t \in \mathbb{R}^{L \times e}$, with $L$ representing the text length and $e$ the embedding dimension. Both encoders are trained together using a contrastive loss on a large-scale dataset composed of paired images and captions. Once trained, CLIP can be used for zero-shot $C$-class image classification by generating each of the $C$ classifier weights $w_c$ as the $d$-dimensional text encoding $g(t_c)$. Here $t_c$ results from adding the class-specific word embedding $e_c$ to a pre-defined prompt $p \in \mathbb{R}^{(L-1) \times e}$, i.e., $w_c = g(t_c)$ with $t_c = \{p, e_c\}$. The prompt $p$ is manually crafted to capture the semantic meaning of the downstream task, e.g., $t_c$ = "An image of a {class}". The probability of image $x$ being classified as $y \in \{1, \dots, C\}$ is thus defined as

$$p(y|x) = \frac{e^{f(x)^T w_y}}{\sum_{c=1}^{C} e^{f(x)^T w_c}}.$$

Context Optimization (CoOp) (Zhou et al., 2022b) provides a learned alternative to manually defining prompts. CoOp learns a fixed prompt from a few annotated samples. The prompt is designed as a learnable embedding matrix $p \in \mathbb{R}^{L \times e}$ which is updated by back-propagating the classification error through the frozen CLIP model. Specifically, for a set of $N$ annotated meta-training samples $\{x_i, y_i\}_{i=1}^{N}$, the prompt $p$ is obtained by minimizing the cross-entropy loss, as:

$$p^* = \arg\min_{p} \mathbb{E}_{x_i, y_i}\left[-\log p(y_i | x_i, p)\right].$$

Note that this approach, while resembling common meta-learning approaches, can still be deployed in a zero-shot scenario, provided that for new classes the classification weights are given by the text encoder. Although this approach generalizes to new tasks with few training iterations, learning a fixed prompt is sensitive to domain shifts between the annotated samples and the test set.
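The zero-shot classification rule above can be illustrated with a minimal NumPy sketch. The classifier weights and the image embedding are toy, hand-picked unit vectors standing in for $g(t_c)$ and $f(x)$; the names `l2_normalize` and `class_probabilities` are hypothetical helpers, not part of CLIP.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    """Scale vectors to unit L2 norm along `axis`."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def class_probabilities(image_feat, class_weights):
    """Softmax over the inner products f(x)^T w_c.

    image_feat:    (d,) unit-norm image embedding, stand-in for f(x)
    class_weights: (C, d) unit-norm text embeddings, stand-in for w_c = g(t_c)
    """
    logits = class_weights @ image_feat   # cosine similarities
    logits = logits - logits.max()        # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

d, C = 8, 3                               # toy sizes (assumed)
weights = np.eye(C, d)                    # three orthonormal toy class weights
# Toy image embedding that points mostly along the second class weight.
image = l2_normalize(np.array([0.1, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]))

probs = class_probabilities(image, weights)
prediction = int(np.argmax(probs))        # index of the most likely class
```

Because both embeddings have unit norm, the inner product equals the cosine similarity, so the predicted class is the one whose text embedding is best aligned with the image embedding.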
Conditional Prompt Learning (CoCoOp) (Zhou et al., 2022a) attempts to overcome domain shifts by learning an instance-specific continuous prompt that is conditioned on the input image. To ease the training of a conditional prompt generator, CoCoOp defines each conditional token in a residual way, with a task-specific, learnable set of tokens $p$ and a residual vector that is conditioned on the input image. Assuming $p$ to be composed of $L$ learnable tokens $p = [p_1, p_2, \dots, p_L]$, the residual vector $r(x) = \pi_\phi(f(x)) \in \mathbb{R}^{e}$ is produced by a small neural network $\pi_\phi$ that takes the image features $f(x)$ as input. The new prompt is then computed as $p(x) = [p_1 + r(x), p_2 + r(x), \dots, p_L + r(x)]$. The training now comprises learning the task-specific prompt $p$ and the parameters $\phi$ of the neural network $\pi_\phi$. Defining the context-specific text embedding $t_c(x) = \{p(x), e_c\}$, and $p(y|x)$ as:

$$p(y|x) = \frac{e^{f(x)^T g(t_y(x))}}{\sum_{c=1}^{C} e^{f(x)^T g(t_c(x))}},$$

the learning is formulated as:

$$p^*, \phi^* = \arg\min_{p, \phi} \mathbb{E}_{x_i, y_i}\left[-\log p(y_i | x_i, p, \phi)\right].$$

While CoCoOp achieves state-of-the-art results in a large variety of downstream tasks, it is still prone to the domain shift problem, considering that $\pi_\phi$ provides a deterministic residual vector from the image features $f(x)$, which are expected to be domain-specific.

Prompt Distribution Learning (ProDA) (Lu et al., 2022) is work concurrent to CoCoOp that focuses on learning a distribution of prompts that generalizes to a broader set of tasks. ProDA proposes to learn a collection of prompts $P = \{p_k\}_{k=1}^{K}$ that can subsequently be used to generate an a posteriori distribution of the classifier weights for each of the target classes. For a given mini-batch of $K$ sampled prompts $p_k \sim P$, the classifier weights $w_c$ are sampled from the posterior distribution $\mathcal{N}(\mu_{w_{1:C}}, \Sigma_{w_{1:C}})$, with mean $\mu_{w_{1:C}}$ and covariance $\Sigma_{w_{1:C}}$ computed from the collection $\{w_{k,c} = g(t_{k,c})\}_{c=1:C, k=1:K}$, with $t_{k,c} = \{p_k, e_c\}$.
The objective is now formulated as:

$$P^* = \arg\min_{P} \mathbb{E}_{x_i, y_i}\left[-\log \mathbb{E}_{w_l \sim \mathcal{N}(\mu_{w_{1:C}}, \Sigma_{w_{1:C}})}\, p(y_i | x_i, w_l)\right]. \tag{4}$$

Computing $\mathbb{E}_{w_l}[p(y_i | x_i, w_l)]$ is intractable, so an upper bound to Eq. 4 is derived. During inference, the classifier weights are set to those given by the predictive mean $w_c = \mu_{w_{1:C}}$, computed across the collection of learned prompts $P$. While ProDA shows promising results compared to CoOp, it is unclear how to combine it with the conditional prompt learning framework of CoCoOp.
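The residual construction used by CoCoOp, $p(x) = [p_1 + r(x), \dots, p_L + r(x)]$ followed by $t_c(x) = \{p(x), e_c\}$, can be sketched as follows. This is an illustrative sketch only: the single random matrix `W` is a hypothetical stand-in for the network $\pi_\phi$, and all sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, e, d = 4, 6, 8                      # toy prompt length, token dim, feature dim

prompt = rng.normal(size=(L, e))       # learnable tokens p_1 .. p_L
W = rng.normal(size=(e, d)) * 0.1      # hypothetical one-layer stand-in for pi_phi

def residual(image_feat):
    """r(x) = pi_phi(f(x)): map image features to a single e-dim vector."""
    return W @ image_feat

def conditional_prompt(prompt, image_feat):
    """p(x) = [p_1 + r(x), ..., p_L + r(x)]; the same residual shifts every token."""
    return prompt + residual(image_feat)[None, :]

def build_text_input(prompt_x, class_embedding):
    """t_c(x) = {p(x), e_c}: append the class token embedding to the prompt."""
    return np.concatenate([prompt_x, class_embedding[None, :]], axis=0)

image_feat = rng.normal(size=d)        # stand-in for f(x)
class_emb = rng.normal(size=e)         # stand-in for e_c
p_x = conditional_prompt(prompt, image_feat)
t_c = build_text_input(p_x, class_emb)
```

Since `residual` is a deterministic function of the image features, two images with the same features always receive the same prompt, which is the property the text identifies as a limitation under domain shift.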

3.2. VARIATIONAL PROMPT TUNING

In this paper, we propose to model the input prompt space in a probabilistic manner, as an a priori, instance-specific distribution. In particular, we define a distribution $p_\gamma$ over the prompts $p$ that is instance-specific, i.e. $p \sim p_\gamma(x)$. To this end, we assume that $p$ can be split into a fixed set of prompts $p_i$ and an instance-specific residual vector $r$ that acts as a latent variable over $p$. The instance-specific prompt is then defined as:

$$p_\gamma(x) = [p_1 + r_\gamma, p_2 + r_\gamma, \cdots, p_L + r_\gamma], \quad r_\gamma \sim p_\gamma(x), \tag{5}$$

where $p_\gamma(x)$ refers to the true posterior distribution over $r$ conditioned on the observed features $x$. Denoting the class-specific input as $t_{c,\gamma}(x) = \{p_\gamma(x), e_c\}$, the marginal likelihood $p(y|x)$ is:

$$p(y|x) = \int_\gamma \frac{e^{f(x)^T g(t_{c,\gamma}(x))}}{\sum_{c'} e^{f(x)^T g(t_{c',\gamma}(x))}}\, p(p_\gamma(x))\, d\gamma. \tag{6}$$

Solving Eq. 3 with the marginal likelihood defined as in Eq. 6 is intractable, as it requires computing $p_\gamma(r|x)\,p_\gamma(x)$. Instead, we resort to deriving a lower bound, by introducing a variational posterior distribution $\pi_\phi(x)$ from which the residual $r_\gamma$ can be sampled. The variational bound is defined as:

$$\log p(y|x) \geq \mathbb{E}_{\pi_\phi(r|x)}\left[\log p(y|x, r)\right] - D_{KL}\left(\pi_\phi(r|x) \,\|\, p_\gamma(r)\right), \tag{7}$$

with $p(y|x, r) \propto e^{f(x)^T g(t_{c,\gamma}(x))}$, where the dependency on $r$ comes through the definition of $t_{c,\gamma}$. The variational posterior distribution $\pi_\phi$ plays a role akin to the metanet in CoCoOp. We thus refer to it as the metanet in the following to align terminology. Following standard variational optimization practices (Kingma & Welling, 2014; Gordon et al., 2019), we define $\pi_\phi$ as a Gaussian distribution conditioned on the input image features $x$, as $r(x) \sim \mathcal{N}(\mu(x), \Sigma(x))$, with $\mu$ and $\Sigma$ parameterized by two linear layers placed on top of the metanet $\pi_\phi$ (see Figure 1). The prior $p_\gamma(r)$ is defined as $\mathcal{N}(0, I)$, and we make use of the reparameterization trick to generate Monte-Carlo samples from $\pi_\phi$ to maximize the right side of Eq. 7. The optimization of Eq. 7 comprises learning the prompt embeddings $\{p_i\}_{i=1}^{L}$ as well as the parameters of the metanet $\pi_\phi$ and the linear layers parameterizing $\mu$ and $\Sigma$. Note that this adds little complexity, as it only requires learning $p$ and $\pi_\phi$, given that $\mu$ and $\Sigma$ are defined as two linear layers on top of $\pi_\phi$.

Inference. At test time, $K$ residuals are sampled from the conditional distribution $\pi_\phi(x)$ and used to generate $K$ different prompts per class, $p_k = [p_1 + r_k, p_2 + r_k, \cdots, p_L + r_k]$. Each prompt is prepended to the class-specific embedding to generate a series of $K$ separate classifier weights $w_{k,c}$. We then compute $p(y=c | x) = \frac{1}{K}\sum_{k=1}^{K} p(y=c | x, w_{k,c})$ and select $\hat{c} = \arg\max_c p(y=c | x)$ as the predicted class. It is worth noting that because the posterior distribution is generated by the text encoder, it is not expected that, for $K \to \infty$, $\frac{1}{K}\sum_k p(y=c | x, w_{k,c}) \to p(y=c | x, g(\{\mu(x), e_c\}))$, meaning that sampling at inference time remains relevant. We study the dependency on the number of samples in the ablations.

Notably, our framework can also be used in the unconditional setting of CoOp, by simply removing the dependency of the latent distribution on the input image. In this scenario, we keep a fixed set of prompt embeddings and learn a global latent distribution $p_\gamma$ over the residual vectors $r$, as $r \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are parameterized by two learnable vectors. In this case, $p_\gamma$ is a general distribution learned during training with no dependency on the input sample $x$. We show that CoOp with our proposed approach improves generalization on new classes.
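The sampling and inference procedure described above can be sketched in NumPy under several stated assumptions: $\Sigma(x)$ is taken to be diagonal and predicted as a log-variance (a common choice, not confirmed by the text), the matrices `W_mu` and `W_logvar` are hypothetical stand-ins for the metanet heads, and `text_encoder` is a toy mean-pooling substitute for the frozen CLIP text encoder $g(t)$.

```python
import numpy as np

rng = np.random.default_rng(0)
L, e, d, C, K = 4, 6, 6, 3, 5              # toy sizes (assumed); d == e for the toy encoder

prompt = rng.normal(size=(L, e))           # learnable tokens p_1 .. p_L
class_embs = rng.normal(size=(C, e))       # class token embeddings e_c
W_mu = rng.normal(size=(e, d)) * 0.1       # hypothetical linear head for mu(x)
W_logvar = rng.normal(size=(e, d)) * 0.1   # hypothetical linear head for log-variance

def text_encoder(tokens):
    """Toy stand-in for g(t): mean-pool the tokens, then L2-normalize."""
    pooled = tokens.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def posterior_params(image_feat):
    """Mean and diagonal log-variance of pi_phi(r | x)."""
    return W_mu @ image_feat, W_logvar @ image_feat

def sample_residual(mu, logvar, rng):
    """Reparameterization trick: r = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def predict(image_feat, rng, num_samples=K):
    """Average the class probabilities over `num_samples` sampled prompts."""
    mu, logvar = posterior_params(image_feat)
    avg = np.zeros(C)
    for _ in range(num_samples):
        r = sample_residual(mu, logvar, rng)
        # Classifier weights w_{k,c} = g({p + r_k, e_c}) for every class c.
        weights = np.stack([
            text_encoder(np.concatenate([prompt + r, ec[None, :]]))
            for ec in class_embs
        ])
        logits = weights @ image_feat
        exp = np.exp(logits - logits.max())
        avg += exp / exp.sum()
    return avg / num_samples

image_feat = rng.normal(size=d)
image_feat = image_feat / np.linalg.norm(image_feat)   # stand-in for f(x)
probs = predict(image_feat, rng)
mu, logvar = posterior_params(image_feat)
kl = kl_to_standard_normal(mu, logvar)
```

During training, the negative of the bound in Eq. 7 would be estimated as the cross-entropy of the sampled predictions plus `kl`; the sketch only shows the forward computation and omits gradients.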

4.1. EXPERIMENTAL SETUP

We follow the exact experimental setup of CoCoOp (Zhou et al., 2022a), currently the state-of-the-art for prompt tuning of vision-language models. We describe the setup in the following.

Three tasks and fifteen datasets. We evaluate variational prompt tuning for three different tasks: base-to-new generalization, cross-dataset transfer, and cross-domain generalization. For the base-to-new generalization and cross-dataset transfer tasks, we rely on the same 11 image recognition datasets as Zhou et al. (2022b; a). These include generic image classification datasets (ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004)), fine-grained classification datasets (OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014) and FGVCAircraft (Maji et al., 2013)), scene recognition (SUN397 (Xiao et al., 2010)), action recognition (UCF101 (Soomro et al., 2012)), texture classification (DTD (Cimpoi et al., 2014)), and satellite imagery recognition (EuroSAT (Helber et al., 2019)). For the cross-domain generalization task, we train our model on ImageNet and report on ImageNetV2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a).

Evaluation metrics. We report average accuracy and harmonic mean H = 2 × (base × new)/(base + new) (Xian et al., 2017) for the base-to-new generalization tasks. For cross-dataset transfer learning and domain adaptation, we provide average accuracy results.

Baselines. We compare against zero-shot CLIP (Radford et al., 2021), CoOp (Zhou et al., 2022b), CoCoOp (Zhou et al., 2022a), and ProDA (Lu et al., 2022). For zero-shot CLIP, CoOp and CoCoOp, all results are taken from Zhou et al. (2022a), and we reproduce all results for ProDA.

Implementation details.
Our variational prompt tuning contains three sub-networks: an image encoder $f(x)$, a text encoder $g(t)$, and a metanet $\pi_\phi$. The image encoder $f(x)$ and text encoder $g(t)$ are a ViT-B/16 (Dosovitskiy et al., 2021) and a transformer (Vaswani et al., 2017), respectively, which are initialized with CLIP's pre-trained weights and kept frozen during training, as in Zhou et al. (2022b; a). The metanet $\pi_\phi$ consists of two linear layers followed by an ELU activation function as the trunk, with two linear heads on top to estimate the $\mu$ and $\Sigma$ of the residual distribution. For each task and dataset, we optimize the number of samples $K$ and the number of epochs. Other hyper-parameters, as well as the training pipeline in terms of few-shot task definitions, are identical to Zhou et al. (2022b; a) (see Tables 6 and 7 in the appendix). Implementation code will be released.

Figure 2: Relative enhancement of variational prompt tuning over CoOp and CoCoOp in terms of harmonic mean over 11 datasets for 3 distinct random seeds. Panel (b): $\Delta(H) = H(\text{CoCoOp+VPT}) - H(\text{CoCoOp})$. Variational prompt tuning improves the harmonic mean on all datasets other than OxfordPets for CoOp and Flowers102 for CoCoOp.
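As a small worked example of the harmonic-mean metric and the per-dataset difference $\Delta(H)$ plotted in Figure 2, consider the sketch below. The accuracy values are hypothetical placeholders chosen for easy arithmetic, not results from the paper, and the function names are illustrative.

```python
def harmonic_mean(base, new):
    """H = 2 * base * new / (base + new); returns 0.0 if both accuracies are 0."""
    if base + new == 0:
        return 0.0
    return 2.0 * base * new / (base + new)

def delta_h(base_vpt, new_vpt, base_ref, new_ref):
    """Harmonic-mean gain of a VPT-augmented model over its reference baseline."""
    return harmonic_mean(base_vpt, new_vpt) - harmonic_mean(base_ref, new_ref)

# Hypothetical accuracies in percent (NOT taken from the paper's tables).
h_example = harmonic_mean(80.0, 60.0)          # 2 * 4800 / 140
gain = delta_h(78.0, 70.0, 80.0, 60.0)         # positive if VPT helps overall
```

The harmonic mean is dominated by the smaller of the two accuracies, which is why a method that trades a little base accuracy for a larger gain on new classes can still increase H.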

4.2. BASE-TO-NEW GENERALIZATION

Setup. We report the few-shot generalization of our method on 11 datasets for three different random seeds. Each dataset is divided into two disjoint subsets: base classes and new classes. We train our method on base classes and evaluate it on both base and new classes. For a fair comparison, we follow Zhou et al. (2022b; a) in terms of dataset split and number of shots (see Table 6 in the appendix).

Results. From the results in Table 1, the lack of generalization capability of the CoOp approach is evidenced by the considerable discrepancy between the accuracy on base and new classes. This is expected, as CoOp only observes a small number of training samples to adapt the CLIP model to downstream tasks, resulting in overfitting. Adding variational prompt tuning to CoOp reduces overfitting to base classes, improving the performance on new classes by 11.54% and the harmonic mean by 1.7% across all datasets, at the expense of a drop in base class accuracy of 10.7%. Figure 2(a) depicts the relative improvement of our proposed strategy over CoOp in terms of the harmonic mean, where we observe an improvement in 10 out of 11 datasets. The limited generalization capability of CoOp is mitigated by CoCoOp through instance-conditional prompts, which improve the accuracy on new classes from 63.22% to 72.23%. Nonetheless, augmenting CoCoOp with variational prompts still improves performance on the new classes and the harmonic mean by 3.25% and 1.6%, respectively, with a small decrease of 0.37% in base accuracy. Note, however, that training for longer increases the performance on base classes and lowers it on new classes. Our setting optimizes the harmonic mean, but a simple tweak to the training schedule on EuroSAT is enough to surpass CoCoOp on base class accuracy; see Table 8 in Appendix A.2.2. Figure 2(b) shows the per-dataset relative harmonic mean improvement. We observe an improvement in 10 out of 11 datasets.
Note that the drop in the average base accuracy for CoCoOp+VPT is negligible compared with that of CoOp+VPT. Moreover, our best-performing model, CoCoOp+VPT, performs better than ProDA (Lu et al., 2022) in terms of new class accuracy and harmonic mean, by 2.64% and 0.78% respectively. We also considered ensembling soft prompts learned from different initializations, but this does not lead to an improvement (see Appendix A.2.3). It is also worth noting that CoOp+VPT and CoCoOp+VPT exceed CLIP zero-shot performance in base accuracy, new accuracy and harmonic mean.

4.3. CROSS-DATASET TRANSFER LEARNING

Setup. For cross-dataset transfer learning, the model is trained on a source dataset (ImageNet) and then assessed on 10 distinct target datasets. This experiment tries to determine how effectively our method generalizes beyond the scope of a single dataset.

Results. As reported in Table 2, CoOp+VPT shows a drop in performance on ImageNet of 1.76% while improving the average accuracy on the target datasets by 1.63%. Moreover, our proposed method leads to an increase on 9 out of 10 target datasets, with a small drop in accuracy of 0.03% on Caltech101. Note that on target datasets such as FGVCAircraft and DTD, our proposed method achieves an improvement of more than 3%. Similarly to CoOp, augmenting CoCoOp with our method still leads to an overall performance improvement of 0.16%, with gains on 7 out of 10 target datasets, showing its effectiveness for cross-dataset transfer learning. In addition, unlike CoCoOp, which performs better on ImageNet-like datasets such as Caltech101 and OxfordPets, our proposed method exhibits improvement on dissimilar datasets (e.g. FGVCAircraft, DTD, and EuroSAT), demonstrating its capacity to capture the unique characteristics of each dataset.

Table 1: Base-to-new generalization comparison between the state-of-the-art and variational prompt tuning. We average our accuracy over three random seeds. Our proposed model is trained on a few-shot training set (base) and then evaluated on held-out classes (new). As shown, CoOp and CoCoOp overfit on base classes and do not provide good generalization on new classes. However, our model provides better generalization performance on new classes as well as a better harmonic mean.

4.4. CROSS-DOMAIN GENERALIZATION

Setup. Lastly, we examine variational prompt tuning through the lens of distribution shift and robustness. We train our proposed model on the source dataset (ImageNet) for three different random seeds, and assess it on ImageNetV2, ImageNet-Sketch, ImageNet-A, and ImageNet-R. Prior work such as CoOp (Zhou et al., 2022b) and CoCoOp (Zhou et al., 2022a) demonstrates empirically that learning a soft prompt improves the model's resilience against distribution shift and adversarial attack. Following their experiments, we are also interested in determining whether treating prompts in a variational manner maintains or improves the performance.

Results. As reported in Table 3, our method enhances the accuracy of CoOp on ImageNet-Sketch, ImageNet-A, and ImageNet-R by 0.88%, 0.70%, and 2.19%, while degrading the performance on ImageNet and ImageNetV2 by 1.78% and 1.03%, respectively. For CoCoOp, adding VPT also lowers the performance on the source dataset, as with CoOp, but consistently improves the accuracy on all target datasets, which highlights the effectiveness of our proposed method.

4.5. ABLATIONS

Effectiveness of the posterior distribution $\pi_\phi$. We first ablate the effectiveness of the variational posterior distribution. To do this, we consider sampling one residual vector from the uniform distribution $U(0, 1)$, the normal distribution $\mathcal{N}(0, I)$, the normal distribution $\mathcal{N}(\mu(x), 0)$, and the normal distribution $\mathcal{N}(\mu(x), \Sigma(x))$, and report the new class accuracy for CoCoOp+VPT for one random seed in Table 4. Except for the EuroSAT dataset, a sample from $\mathcal{N}(\mu(x), 0)$ obtains the best performance among the single-sample alternatives, showing that the mean $\mu(x)$ of the normal distribution is the most effective single sample. In addition, we find that drawing one sample from $\mathcal{N}(\mu(x), \Sigma(x))$ yields superior results compared to drawing one sample from the uniform distribution $U(0, 1)$ or the normal distribution $\mathcal{N}(0, I)$, further demonstrating the efficacy of our proposed method in capturing the underlying distribution of the prompt space. We also ablate increasing the number of samples from $\mathcal{N}(\mu(x), \Sigma(x))$ to understand the informativeness of the learned variational distribution. Enlarging the number of samples further improves the model performance, indicating that the samples capture the prompt space appropriately.

Prompt initialization. We investigate the effect of the prompt initialization on the new class accuracy for CoCoOp+VPT for one random seed. We consider two variants. In the first one, we initialize the context tokens randomly using a normal distribution, whereas in the second one we initialize the context tokens with "An image of a {class}". Table 5 summarizes this ablation. Comparing the two variants demonstrates that an appropriately initialized prompt consistently outperforms a randomly initialized prompt, highlighting the need for further study of the prompt space. We leave this open as a future research direction.

Number of Monte-Carlo samples.
When approximating the log-likelihood of input data, the number of Monte Carlo samples is an important hyperparameter. Generally, a larger number of samples should lead to a better approximation and better classification accuracy. We ablate this hyperparameter on new accuracy for CoCoOp+VPT by varying the number of Monte Carlo samples at inference time. We show results for a varying number of samples in Figure 3 for DTD, Flowers102, EuroSAT, FGVCAircraft, and UCF101. Increasing the number of Monte Carlo samples from 1 to 10 consistently improves the new accuracy, after which the accuracy saturates. Hence, we recommend evaluating variational prompt tuning with a larger number of Monte Carlo samples for better model accuracy. To further aid the interpretation and understanding of the learned prompts, we provide a further ablation on the variational distribution of our best-performing model, CoCoOp+VPT, in Appendix A.2.4 and A.2.5.
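The four single-sample choices compared in the posterior ablation above can be written down compactly. The following is an illustrative sketch, assuming a diagonal $\Sigma(x)$ represented by a per-dimension standard deviation; the function name, the strategy labels and the numeric values of `mu` and `sigma` are hypothetical.

```python
import numpy as np

def sample_residual(kind, mu, sigma, rng):
    """Draw one residual vector under one of the four ablated choices."""
    if kind == "uniform":             # U(0, 1), ignores the image
        return rng.uniform(0.0, 1.0, size=mu.shape)
    if kind == "standard_normal":     # N(0, I), i.e. a sample from the prior
        return rng.normal(size=mu.shape)
    if kind == "mean_only":           # N(mu(x), 0): deterministic posterior mean
        return mu.copy()
    if kind == "posterior":           # N(mu(x), Sigma(x)), diagonal Sigma assumed
        return mu + sigma * rng.normal(size=mu.shape)
    raise ValueError(f"unknown sampling strategy: {kind}")

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2, 0.1])       # hypothetical predicted mean mu(x)
sigma = np.array([0.1, 0.1, 0.1])     # hypothetical predicted std. deviation
samples = {
    kind: sample_residual(kind, mu, sigma, rng)
    for kind in ("uniform", "standard_normal", "mean_only", "posterior")
}
```

Only the last two strategies depend on the input image through `mu` and `sigma`, which matches the observation that they outperform the image-agnostic alternatives.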

5. CONCLUSION

In this paper, we introduce variational prompt tuning, a probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. By doing so, we are able to generate adaptive context tokens for each data point. This formulation leads to better generalization capabilities in terms of the new accuracy and harmonic mean for downstream tasks. We show that it can be seamlessly integrated into both standard and conditional prompt learning frameworks, considerably improving the performance in both cases. We conduct extensive experiments on 15 datasets and demonstrate the benefits of a variational formulation in learning data-driven prompts. Our method provides the current state-of-the-art for prompt learning, and constitutes, to the best of our knowledge, the first method for CLIP adaptation that fully maintains the generalization capability to new classes of the original model. Finally, the prompt distribution can be obtained using other means of density estimation, e.g. normalizing flows, energy-based models or diffusion models, and we leave them open for future research.

A.2 MORE ABLATIONS

A.2.1 VISION ENCODER ALTERNATIVES. All previous experiments use ViT-B/16 as the vision encoder's backbone, following (Zhou et al., 2022b; a; Lu et al., 2022). For completeness, in Figures 4, 5, and 6, we replace this vision encoder with a Resnet50 and a Resnet100 and examine the impact on base accuracy, new accuracy and harmonic mean for one random seed. The visual transformer outperforms the Resnet alternatives across all 10 datasets for new accuracy and harmonic mean, and in 9 out of 10 datasets for base accuracy. Hence, we suggest training and evaluating variational prompt tuning with a visual transformer for better model performance.

A.2.2 BASE AND NEW CLASSES ACCURACY TRADE-OFF

In this section, we ablate whether there is a trade-off between base and new classes accuracy. To do so, we train our proposed method on the EuroSAT dataset for 60 and 80 epochs, and compare them in Table 8. As shown, training on EuroSAT for 20 more epochs raises the performance on base classes by +5.37. This alone improves the average performance on base classes across the 11 datasets by +0.49, resulting in 80.58 (now +0.11 over CoCoOp). This change slightly affects the average harmonic mean, reducing it by 0.17% (now +1.43% better than CoCoOp). Consequently, we indeed observe a trade-off between performance on base and on new classes. Training for longer will increase the performance on base classes and lower it on new classes.

Variational prompt tuning can be considered an efficient ensemble approach, as it generates several samples from the prompt space and uses them in conjunction. Nonetheless, we compared CoCoOp+VPT against a modified CoCoOp that naively uses an ensemble of prompts. We implement this by initializing several learnable prompts per class using M hand-crafted templates provided in the CLIP codebase and fine-tuning them together. The final text feature representation is computed as a simple average of the ensemble of prompts for each class, as in Section 3.1.4 of (Radford et al., 2021). For a fair comparison, the number of ensemble members, M, equals the number of Monte Carlo samples K per dataset (see Table 7). In Table 9, when comparing CoCoOp and CoCoOp + Ens, its ensemble counterpart, we can see that adding an ensemble of prompts does not necessarily lead to an improvement on all datasets. For instance, on the DTD and UCF101 datasets, CoCoOp performs better, while on the Flowers102, EuroSAT, and FGVCAircraft datasets, CoCoOp + Ens has a higher harmonic mean.
Moreover, when comparing CoCoOp + Ens against CoCoOp+VPT , we can see that our method performs better in terms of new classes accuracy and harmonic mean in all datasets other than Flowers102, with an average increase of 1.94% in harmonic mean, whereas the drop in Flower102 is only 0.66%.
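The ensemble baseline described above averages the text features of several prompts per class. The following is a minimal sketch of that averaging step on toy vectors. The function names are hypothetical, and normalizing each feature before and after averaging is an assumption made for this illustration, not a detail confirmed by the text.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def ensemble_text_feature(features):
    """Average the M per-template text features of one class, then re-normalize."""
    normed = [l2_normalize(f) for f in features]
    dim = len(normed[0])
    mean = [sum(f[d] for f in normed) / len(normed) for d in range(dim)]
    return l2_normalize(mean)


def classify(image_feature, class_weights):
    """Return the index of the class weight with the highest cosine similarity."""
    img = l2_normalize(image_feature)
    scores = [sum(a * b for a, b in zip(img, w)) for w in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D features: M = 2 templates for each of two hypothetical classes.
class_a = ensemble_text_feature([[1.0, 0.0], [0.8, 0.2]])
class_b = ensemble_text_feature([[0.0, 1.0], [0.2, 0.8]])
predicted = classify([0.9, 0.1], [class_a, class_b])
```

In this baseline the M prompts are fixed in number and learned jointly, whereas variational prompt tuning draws its K prompts from a learned distribution.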

A.2.4 FACTOR OF VARIATION ANALYSIS

Here, we analyze whether prompts that are sampled at the distributional modes correlate with features that characterize known factors of variation (e.g., sub-domains) in a dataset. We perform this experiment on our best-performing model, CoCoOp+VPT. First, we randomly select three classes $c_1$, $c_2$, and $c_3$ from the Flowers102 dataset (Columbine, Passion Flower, and Cyclamen, for reference). Then, for each class, we compute image features by forwarding the respective input images $x$ through the CLIP image encoder. Next, we apply K-means to group the image features around $Q = 5$ cluster centroids. Each cluster centroid is assumed to capture some factor of visual variation within the class. Afterwards, we treat the cluster assignments as pseudo-labels of those images. Note that this is done separately for each class. Additionally, for an input image $x$, we generate $K = 21$ different prompts. These prompts are fed into CLIP's text encoder, which generates $K$ different weights $w_1, w_2, \dots, w_K$ per image instance. We now have a set of $K$ weights for $Q$ pseudo-labels across three classes, which are visualized in each row of Figure 7. In each row, the left side shows how prompt samples are distributed across the five different clusters, while the right side shows the top-3 most representative samples of the individual clusters (i.e., the images whose feature representations are closest to the centroids). From this figure, it can be seen that there is a high intersection between the distributions of all sub-classes, representing shared knowledge. However, some particular visual differences are expressed in regions where the distributions do not intersect. Consequently, we believe that there is a correlation between the modes of the variational distribution and visual particularities of subsets of the same class.

Figure 7: Factor of variation analysis. For three different classes, the contours intersect in a region that represents shared knowledge related to their corresponding class, while diverging slightly to express a specific factor of variation. Consequently, we believe that there is a correlation between the modes of the variational distribution and characteristics that adequately represent all known drivers of variation in a dataset (see the text for more details).
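The pseudo-labeling step of this analysis can be sketched as follows. This is a plain Lloyd's K-means on toy 2-D points standing in for CLIP image features. The function names, the toy data, and the use of squared Euclidean distance are assumptions for illustration and do not reproduce the authors' implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, q, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, q)]
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: the nearest centroid becomes the pseudo-label.
        assignments = [
            min(range(q), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(q):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def most_representative(points, assignments, centroids, cluster, top=3):
    """Indices of the `top` members of `cluster` closest to its centroid."""
    members = [i for i, a in enumerate(assignments) if a == cluster]
    members.sort(key=lambda i: squared_distance(points[i], centroids[cluster]))
    return members[:top]


# Toy stand-in for the image features of one class (hypothetical values).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, assignments = kmeans(points, q=2)
closest = most_representative(points, assignments, centroids, assignments[0], top=1)
```

In the setting described above, `points` would hold the image features of a single class and `q` would be 5, with the procedure repeated independently for each of the three classes.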

A.2.5 INTERPRETATION OF VARIATIONAL DISTRIBUTION

We now provide an intuition of how prompt samples are distributed within the variational distribution. We perform two experiments on our best-performing model, CoCoOp+VPT.

Experiment 1: For an input image $x$ and its corresponding target $y$, $K$ residuals are sampled from the conditional distribution $\pi_\phi(x)$ and used to generate $K$ different prompts per class, $\mathbf{p}_k = [p_1 + r_k, p_2 + r_k, \dots, p_L + r_k]$. We also construct a new prompt based on the mean of the variational distribution $\mu(x)$, named the mean prompt, $\mathbf{p}_{\mu(x)} = [p_1 + \mu(x), p_2 + \mu(x), \dots, p_L + \mu(x)]$. The $K$ prompts are sorted by the distance of their residuals to the mean of the variational distribution $\mu(x)$, so that for any $i$ and $j$ with $i \le j$, $d(\mu(x), r_i) \le d(\mu(x), r_j)$. The $K$ prompts and the mean prompt are prepended to the class-specific embedding to generate a series of $K + 1$ separate classifier weights $w_{\mu(x),c}, w_{1,c}, w_{2,c}, \dots, w_{K,c}$. For each class $y$, we compute the cosine similarity between the image encoding $f(x)$ and each of $w_{\mu(x),y}, w_{1,y}, w_{2,y}, \dots, w_{K,y}$, and average across all samples per class. In Figure 8, we can see that the mean prompt $\mathbf{p}_{\mu(x)}$ yields the classifier weight most similar to the image encoding and that, as we move further away from the mean prompt, the cosine similarity decreases.

Experiment 2: Given the weights $w_{\mu(x),y}, w_{1,y}, w_{2,y}, \dots, w_{K,y}$ from Experiment 1, we define a new weight as $w^{J}_{y} = w_{\mu(x),y} + \sum_{j=1}^{J} w_{j,y}$, where $w^{J}_{y}$ is the cumulative sum of the weights corresponding to prompts $\mathbf{p}_1$ to $\mathbf{p}_J$ and the mean prompt $\mathbf{p}_{\mu(x)}$. The cosine similarities between the image encoding $f(x)$ and all $\{w^{j}_{y}\}_{j=1}^{K}$ are computed and visualized in Figure 8. As shown, summing the classifier weights increases the cosine similarity, with the maximum similarity obtained when all samples are combined.
From these two experiments, we believe that prompt samples are well-distributed inside the prompt space such that the prompt distribution provides adequate coverage of the underlying distribution for downstream tasks. 
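The bookkeeping behind the two experiments can be sketched as follows, assuming the classifier weights have already been produced by the text encoder. The function names and toy vectors are hypothetical, and the Euclidean choice for the distance d is an assumption, since the text does not specify it.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def order_by_distance_to_mean(mu, residuals):
    """Experiment 1: indices sorted so that d(mu, r_i) <= d(mu, r_j) for i <= j."""
    return sorted(range(len(residuals)), key=lambda k: euclidean(mu, residuals[k]))


def cumulative_similarities(image_feature, w_mean, sorted_weights):
    """Experiment 2: cos(f(x), w_mean + sum_{j<=J} w_j) for J = 1..K."""
    running = list(w_mean)
    sims = []
    for w in sorted_weights:
        running = [r + x for r, x in zip(running, w)]
        sims.append(cosine(image_feature, running))
    return sims


# Toy values chosen by hand; they do not reproduce the trend in Figure 8.
order = order_by_distance_to_mean([0.0, 0.0], [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
sims = cumulative_similarities([1.0, 0.0], [1.0, 1.0], [[1.0, 0.0], [1.0, 0.0]])
```

Whether the cumulative similarity rises, as reported for Figure 8, depends on the actual sampled weights; the toy vectors here only exercise the computation.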

[Figure 8 plot: cosine similarity per prompt sample; class label: Sea or Lake.]



Footnote: In CLIP, the word embedding is learned together with the text encoder. A tokenizer is used to convert the text into one-hot vectors, or tokens, that can be directly mapped into the word embeddings. For the sake of clarity, we use the terms words and word embeddings interchangeably.

Footnote: https://github.com/openai/CLIP



Figure 4: Ablation of different vision encoder backbones with respect to base accuracy. A more over-parameterized model leads to better performance across all datasets except EuroSAT.

Figure 8: Interpretation of the variational distribution for the EuroSAT dataset. Experiment 1: As shown, the mean prompt p µ(x) yields the classifier weight most similar to the image embedding, and as we move further away from the mean prompt, the cosine similarity score decreases. Experiment 2: As shown, summing the classifier weights increases the cosine similarity, with the maximum similarity obtained when all classifier weights are combined.

Cross-dataset transfer learning comparison between the state-of-the-art and our variational prompt tuning in terms of average accuracy on three different random seeds. Following Zhou et al. (2022a), our proposed model is trained on a source dataset and evaluated on target datasets. As shown, variational prompt tuning performs better than the other baselines on 16 out of 20 datasets, although it loses performance on the source dataset.

Cross-domain generalization comparison between the state-of-the-art and variational prompt tuning in terms of average accuracy on three different random seeds. Our proposed model is trained on a source dataset and evaluated on target classes. Variational prompt tuning outperforms alternative baselines on the target datasets while losing performance on the source dataset.

Effectiveness of the posterior distribution. The informative posterior distribution N (µ(x), Σ(x)) outperforms the two uninformative distributions U(0, 1) and N (0, I) by a large margin for all datasets. Increasing the number of samples further improves results.

Prompt initialization. Initializing the context tokens with an appropriate prompt "An image of a {class}" improves the performance compared to random tokens.

Base and new classes accuracy trade-off for the EuroSAT dataset. As reported, we indeed observe a trade-off between performance on base and on new classes. Training for longer increases the performance on base classes and lowers it on new classes.

Comparison between CoCoOp, CoCoOp + Ensembling, and CoCoOp+VPT. As shown, variational prompt tuning performs better than other alternatives on 4 out of 5 datasets in terms of new accuracy and harmonic mean.


ETHICS STATEMENT

Our method has the potential to affect applications that often need rapid adaptation, such as medical imaging, astronomical imaging, and autonomous driving. Because of this, there may be negative societal consequences associated with the adoption of our technology. Examples include a lack of fairness in models trained on insufficient data, issues of regulatory compliance, patient privacy in medical imaging, and more general biases encoded into large vision-language models such as CLIP.

REPRODUCIBILITY STATEMENT

Details on benchmarks, metrics, and the implementation of variational prompt tuning in terms of architecture and training are contained in Section 4. We also list all hyperparameters in Tables 6 and 7. Lastly, we will release the source code and the scripts to reproduce the results and evaluate the performance of the model at: https://github.com/<redacted>/<redacted>.

A APPENDIX

A.1 HYPERPARAMETERS

In this section, we provide in Tables 6 and 7 the detailed hyperparameter settings that are used to generate the results in the main paper for each dataset. There are two sets of hyperparameters. The first set is shared between the two variants of variational prompt tuning, CoOp+VPT and CoCoOp+VPT (see Table 6). The second set corresponds to dataset-specific parameters that are optimized per dataset (see Table 7).

