PLOT: PROMPT LEARNING WITH OPTIMAL TRANSPORT FOR VISION-LANGUAGE MODELS

Abstract

With the increasing attention paid to large vision-language models such as CLIP, much effort has been dedicated to building efficient prompts. Unlike conventional methods that learn only a single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of a category, such as its intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model the images and the categories as visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts via the Sinkhorn algorithm, while in the outer loop, we learn the prompts with this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task, and the improvements demonstrate the superiority of our method.

1. INTRODUCTION

In the past few years, large-scale vision-language pre-trained (VLP) models, such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and BLIP (Li et al., 2022), have achieved remarkable success in open-world visual concept learning. These methods bring new light but also pose a new question: how can the knowledge from pretraining be efficiently adapted to downstream tasks, given that these models are typically of massive size and thus infeasible for ordinary users to re-train?

Figure 1: One category can be complementarily described from different views (an example of "Brambling"): a bird that lives in winter woods; a bird with a dark fan-tail; a bird with orange and black texture; a bird with a black crown and eyes.

One of the conventional paradigms of utilizing pretrained knowledge is "pre-training, fine-tuning", which fixes the architecture of the pre-trained neural network and tunes its parameters using task-specific objective functions. Beyond fine-tuning the parameters, many recent methods (Zhou et al., 2021b; 2022) introduce the concept of prompt learning from the field of NLP to the vision domain and achieve striking performance gains for few-shot visual classification. They fix the model parameters and instead learn suitable prompts by turning a template sentence into a set of learnable vectors. These prompts are then learned by minimizing the distance between the visual features and the prompt-based language features. Despite significant improvements over manual prompts, learning only a single sentence is intuitively insufficient to represent a class. One class can be described by many intrinsic characteristics and even extrinsic context relations. Thus, for one object, we may have multiple prompt candidates that focus on different attributes. As shown in Figure 1, we can describe the class "Brambling" from different views, such as the color of the wing, the color of the crown and eyes, the shape and color of the tail, and even information about its living environment. This motivates us to learn multiple prompts to comprehensively represent the class and thus facilitate classification. The most natural solution is to directly learn multiple prompts by respectively matching each prompt with the visual features. However, this is the same as matching the mean of the prompt features with the visual features.
This solution is problematic since all prompts are encouraged to move closer to one single point and thus tend to learn the same characteristics, which contradicts our purpose of learning comprehensive prompts. To address this, we tried adding constraints that push the prompts away from each other, but found that this still fails to learn representative and comprehensive prompts. Such approaches treat the visual representation as one single point, and this unified view of visual features ignores the fact that different prompts may only focus on one or a subset of characteristics. To address this problem, in this paper, we propose Prompt Learning with Optimal Transport (PLOT), which applies optimal transport (OT) to align the local visual features with multiple textual prompts. Optimal transport computes the distance between two distributions represented as sets of samples. In our prompt learning framework, we formulate the local visual features and the multiple prompts as samplings of two discrete distributions and use OT to encourage fine-grained cross-modal matching. Specifically, to obtain local visual features with different semantic clues, we extract all feature maps as the visual representation instead of the single global representation. Fortunately, we can easily obtain the visual feature maps from the visual encoder of CLIP by using all outputs of the multi-head self-attention layer (Rao et al., 2021). The problem then comes down to how to calculate the distance between two feature sets. We solve this by introducing optimal transport theory (Villani, 2009) and formulating the feature sets as discrete probability distributions where each feature has an equal probability value. Furthermore, to reduce the computational cost and avoid extra model parameters, we learn the prompts with a two-stage optimization strategy.
In the first stage, the inner loop, we fix both visual and text features and optimize the optimal transport problem with the fast Sinkhorn distance algorithm (Cuturi, 2013). Then, in the outer loop, we fix all parameters of the optimal transport and back-propagate the gradient to learn prompts with different characteristics. Compared with conventional distances (such as the Euclidean distance between mean features), optimal transport can align different visual features to each local prompt, which is more robust to visual misalignment and tolerates feature shifts well (Rubner et al., 2000). This is because OT learns an adaptive transport plan to align features, which achieves fine-grained matching across the two modalities. We conduct experiments on 11 datasets following the standard setting of CLIP (Radford et al., 2021) and CoOp (Zhou et al., 2021b) to evaluate our method. These experiments span the visual classification of generic objects, scenes, actions, fine-grained categories, and so on. The significant improvements demonstrate that PLOT can effectively learn representative and comprehensive prompts.

2. RELATED WORK

Optimal Transport Optimal transport (Monge, 1781) was initially introduced to solve the problem of reducing the cost of moving several items simultaneously. Recently, OT theory has drawn wide attention in the machine learning and computer vision communities as a tool for comparing distributions given in the form of feature sets (Peyre & Cuturi, 2019). Owing to its strength in distribution matching, OT has been applied in many theoretical and applied tasks, including generative models (Arjovsky et al., 2017; Salimans et al., 2018; Zhao et al., 2021a), structural matching (Chen et al., 2019; Xu et al., 2020; Zhao et al., 2021b; Xu et al., 2019) (e.g. sequence matching (Chen et al., 2019), graph matching (Xu et al., 2019), and image matching (Zhang et al., 2020; Liu et al., 2021a; Zhao et al., 2021b)), and other distribution-based tasks (such as clustering (Laclau et al., 2017), distribution estimation (Boissard et al., 2015), and causal discovery (Tu et al., 2022)). In this paper, we use OT to align the features of the vision and language modalities by learning an adaptive transport plan (Rubner et al., 2000).

Vision-Language Pre-trained Models Vision-Language Pre-trained (VLP) models aim to explore the semantic correspondence between the vision and language modalities through large-scale pretraining. Recently, VLP models have achieved exciting performance improvements in few-shot visual recognition (Radford et al., 2021; Gao et al., 2021; Zhou et al., 2021b; 2022; Zhang et al., 2021b), which shows great potential for promoting open-world visual understanding with the help of language. In terms of objectives, VLP methods can be divided into reconstruction (Li et al., 2019; Hong et al., 2021; Dou et al., 2021; Kim et al., 2021), contrastive matching (Radford et al., 2021; Jia et al., 2021; Jain et al., 2021), or a combination of the two (Li et al., 2021; Wang et al., 2021b; Kamath et al., 2021).
Besides, recent progress in the field of VLP also benefits greatly from large-scale paired datasets. For example, CLIP (Radford et al., 2021) uses 400 million image-text pairs for contrastive learning. Beyond recognition, these VLP models also show great potential for other downstream applications, such as dense prediction (Rao et al., 2021; Zhou et al., 2021a), image generation (Nichol et al., 2021; Ramesh et al., 2022; Patashnik et al., 2021), and action understanding (Wang et al., 2021a; Tevet et al., 2022).

Prompt Learning Prompt learning was introduced in the field of NLP to efficiently adapt large language models to downstream tasks. Different from the conventional "pre-training, fine-tuning" paradigm, which initializes with the pre-trained model and tunes the parameters of the network using downstream task-specific objective functions, prompt learning applies a textual prompt to reformulate the downstream task as the original pretraining task (Liu et al., 2021b; Petroni et al., 2019). With prompts, the domain gap between the pretraining task and the downstream application is reduced, so the pretrained knowledge can be adapted to downstream tasks more easily. The concept of prompt learning (Petroni et al., 2019; Radford et al., 2019; Poerner et al., 2019) began with the success of the GPT (Radford et al., 2019) series. Early prompt learning methods (such as Petroni et al. (Petroni et al., 2019) and Pörner et al. (Poerner et al., 2019)) manually create templates based on human prior knowledge. Furthermore, mining-based methods (Jiang et al., 2020) and gradient-based methods (Shin et al., 2020) have been proposed to automatically search for appropriate templates. Beyond searching in the discrete space, some methods (Li & Liang, 2021; Tsimpoukelli et al., 2021; Liu et al., 2021c) remove the constraint that prompts be "words" and instead learn prompts in a continuous embedding space.
Recently, CoOp (Zhou et al., 2021b) and its extended version (Zhou et al., 2022) introduced prompt learning into open-world visual understanding to adapt the knowledge from large-scale vision-language pretrained models, achieving great performance improvements on few-shot visual recognition. Compared with CoOp, our PLOT method further improves prompt learning by introducing the optimal transport distance to learn multiple local prompts, achieving fine-grained vision-language matching. PDL (Lu et al., 2022) is also motivated by learning more diverse prompts; it assumes a parametric distribution over prompts and fits its parameters during training. Different from it, PLOT learns multiple prompts without assuming a parametric distribution.

3. APPROACH

In this section, we first revisit the baseline CoOp (Section 3.1) and review the preliminaries of optimal transport (Section 3.2), and then introduce our PLOT (Section 3.3) to show how we learn multiple comprehensive prompts.

3.1. A REVISIT OF COOP

CoOp (Zhou et al., 2021b) is one of the pioneering methods that learn prompts to use vision-language pretrained knowledge (such as CLIP (Radford et al., 2021)) for downstream open-world visual recognition. Different from CLIP, which manually designs the prompt templates, CoOp sets a part of the context words in the template as continuous learnable parameters that can be learned from few-shot data. The classification weights are then given by the distance between the learned prompt and the visual feature. Specifically, given an image x, a visual feature f = f(x) is obtained by the visual encoder f of CLIP. The textual prompt for class k is formulated as t_k = {ω_1, ω_2, . . . , ω_L, c_k}, where c_k is the word embedding of the class name and {ω_l}_{l=1}^L are learnable vectors, each with the same dimension as the original word embeddings, with L the length of the context. With prompt t_k as input, the text encoder g outputs the textual feature g_k = g(t_k). The final prediction probability is computed by the matching score:

p(y = k|x) = exp(sim(f, g_k)/τ) / Σ_{k'=1}^{K} exp(sim(f, g_{k'})/τ),

where sim(·, ·) denotes a metric function such as cosine similarity and τ stands for the temperature of the softmax. We can then optimize the parameters {ω_l}_{l=1}^L with the cross-entropy loss between the prediction and the labeled target.
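As an illustration, the matching score above reduces to a few lines of code. The following is a minimal NumPy sketch (the function name, feature dimension, and temperature value are our choices for illustration, not values from the paper):

```python
import numpy as np

def coop_prediction(f, G, tau=0.01):
    """CoOp-style class probabilities from a single global visual feature.

    f   : (C,) global visual feature from the image encoder.
    G   : (K, C) prompt-based text features g_k = g(t_k), one row per class.
    tau : softmax temperature.
    """
    # Cosine similarity sim(f, g_k) between the image and each class prompt.
    f = f / np.linalg.norm(f)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    logits = (G @ f) / tau
    logits -= logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()            # p(y = k | x)

rng = np.random.default_rng(0)
probs = coop_prediction(rng.normal(size=512), rng.normal(size=(10, 512)))
```

In CoOp itself, only the context vectors ω_l inside each t_k are trainable; the image feature f and the text encoder g stay frozen, and the gradient of the cross-entropy loss flows back through g into the context vectors.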

3.2. OPTIMAL TRANSPORT

Optimal transport (OT) distance is a widely used metric for comparing distributions. Here, we only focus on the discrete situation, which is more relevant to our framework. Assuming we have two sets of points (features), the discrete distributions are formulated as:

U = Σ_{m=1}^{M} u_m δ_{f_m} and V = Σ_{n=1}^{N} v_n δ_{g_n},

where u and v are discrete probability vectors that sum to 1, and δ_f is a Dirac delta function placed at support point f in the embedding space. The total distance is then written as:

<T, C> = Σ_{m=1}^{M} Σ_{n=1}^{N} T_{m,n} C_{m,n}.

We call C the cost matrix, in which each entry denotes the cost between f_m and g_n, such as C_{m,n} = 1 - sim(f_m, g_n). T is called the transport plan, which is learned to minimize the total distance. The optimization problem of optimal transport is formulated as:

d_OT(u, v|C) = minimize_T <T, C>
subject to T 1_N = u, T^⊤ 1_M = v, T ∈ R_+^{M×N}.

As directly optimizing the above objective is always time-consuming, we apply the Sinkhorn distance (Cuturi, 2013), which uses an entropic constraint for fast optimization. The optimization problem with a Lagrange multiplier of the entropy constraint is:

d_{OT,λ}(u, v|C) = minimize_T <T, C> - λh(T)
subject to T 1_N = u, T^⊤ 1_M = v, T ∈ R_+^{M×N},

where h(·) is entropy and λ ≥ 0 is a hyper-parameter. We then have a fast optimization solution within a few iterations:

T* = diag(u^(t)) exp(-C/λ) diag(v^(t)),

where t denotes the iteration; in each iteration, u^(t) = u / (exp(-C/λ) v^(t-1)) and v^(t) = v / (exp(-C/λ)^⊤ u^(t)), with the initialization v^(0) = 1.
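The iteration above amounts to a handful of matrix-vector products. Below is a minimal NumPy sketch of the Sinkhorn update with uniform marginals (the function name and the convergence tolerance are our choices):

```python
import numpy as np

def sinkhorn(C, lam=0.1, n_iters=100, tol=1e-2):
    """Entropy-regularized OT (Cuturi, 2013) between two uniform distributions.

    C : (M, N) cost matrix. Returns the transport plan T* and <T*, C>.
    """
    M, N = C.shape
    u_marg = np.full(M, 1.0 / M)          # marginal u (uniform over M points)
    v_marg = np.full(N, 1.0 / N)          # marginal v (uniform over N points)
    K = np.exp(-C / lam)                  # elementwise kernel exp(-C / lambda)
    v = np.ones(N)                        # initialization v^(0) = 1
    for _ in range(n_iters):
        u = u_marg / (K @ v)              # u^(t) = u / (exp(-C/lam) v^(t-1))
        v_new = v_marg / (K.T @ u)        # v^(t) = v / (exp(-C/lam)^T u^(t))
        delta = np.abs(v_new - v).sum() / N
        v = v_new
        if delta < tol:                   # early stop once v stabilizes
            break
    T = u[:, None] * K * v[None, :]       # T* = diag(u) exp(-C/lam) diag(v)
    return T, float((T * C).sum())

rng = np.random.default_rng(0)
T, dist = sinkhorn(rng.random((4, 3)))
```

After the final v-update, the column sums of T match v exactly, while the row sums match u up to the stopping tolerance; running more iterations tightens both.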

3.3. PROMPT LEARNING WITH OPTIMAL TRANSPORT

In this subsection, we introduce the details of our PLOT, which learns multiple prompts to describe different characteristics of a category by minimizing the OT distance. Specifically, as shown in Figure 2, given an image x, we first feed it to the visual encoder branch of CLIP. Apart from the global visual feature f, we can also obtain a set of local features {f_m}_{m=1}^M. The visual encoder has a multi-head attention pooling layer whose input is the combination of the global feature and a set of local features (the feature map), and whose output is a tensor of shape R^{(H×W+1)×C}, where H and W are the height and width of the feature map and C is the feature dimension. Therefore, we can obtain M = H × W local features and a global feature. At the same time, for class k, we initialize N local prompts {t_{k,n}}_{n=1}^N with learnable vectors {ω_n}_{n=1}^N, where each prompt has the same form as in CoOp. With both visual and textual encoders, we obtain local visual features F = {f_m}_{m=1}^M ∈ R^{M×C} and prompt features G_k = {g_n}_{n=1}^N ∈ R^{N×C}. In the inner loop, we learn the transport plan T with these fixed support sets F and G_k by minimizing the following OT distance to push G_k towards F:

d_OT(k) = d_OT(u, v | 1 - F^⊤ G_k),

where C = 1 - F^⊤ G_k denotes that we use the cosine distance between F and G_k as the cost matrix. We can then obtain the solution of the transport plan T* as in Eq. 6 and the final OT distance d_OT(k). Given the OT distance between G_k and F, we reformulate the prediction probability as:

p_OT(y = k|x) = exp((1 - d_OT(k))/τ) / Σ_{k'=1}^{K} exp((1 - d_OT(k'))/τ).

In the outer loop, we fix the transport plan T* and optimize {ω_{l,n}}_{l=1,n=1}^{L,N} with the cross-entropy loss:

L_CE = - (1/|X|) Σ_{x∈X} Σ_{k=1}^{K} y_{x,k} log p_OT(y = k|x),

where y_x is a one-hot label vector. The detailed algorithm can be found in Appendix A1.
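Putting the pieces together, the inner-loop OT distance and the reformulated prediction probability can be sketched in NumPy as follows (the shapes and temperature are illustrative assumptions; a real implementation would run inside the CLIP encoders and back-propagate through these operations to the prompt vectors):

```python
import numpy as np

def ot_distance(F, G, lam=0.1, n_iters=100):
    """Sinkhorn OT distance between local visual features and one class's prompts.

    F : (M, C) local visual features (e.g. M = 7 * 7 feature-map positions).
    G : (N, C) prompt features for one class.
    """
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    cost = 1.0 - F @ G.T                      # cosine-distance cost matrix (M, N)
    M, N = cost.shape
    u_marg, v_marg = np.full(M, 1.0 / M), np.full(N, 1.0 / N)
    K = np.exp(-cost / lam)
    v = np.ones(N)
    for _ in range(n_iters):                  # fixed number of Sinkhorn updates
        u = u_marg / (K @ v)
        v = v_marg / (K.T @ u)
    T = u[:, None] * K * v[None, :]           # transport plan T*
    return float((T * cost).sum())            # d_OT(k) = <T*, C>

def plot_prediction(F, prompt_feats, tau=0.01):
    """p_OT(y = k | x): softmax over 1 - d_OT(k) across classes."""
    d = np.array([ot_distance(F, G_k) for G_k in prompt_feats])
    logits = (1.0 - d) / tau
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(1)
F = rng.normal(size=(49, 64))                               # M = 49 local features
prompt_feats = [rng.normal(size=(4, 64)) for _ in range(5)]  # N = 4 prompts, K = 5 classes
probs_ot = plot_prediction(F, prompt_feats)
```

Since every operation here is a differentiable matrix product, an autograd framework can back-propagate the cross-entropy loss through the Sinkhorn iterations into the prompt features.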
Though the optimization strategy of the optimal transport and the prompts is two-stage, the whole training flow is end-to-end. This is because the transport plan is computed with a small number of matrix multiplications as a forward module. The gradients of these matrix multiplications are taped for back-propagation, which makes the whole system fully differentiable (including the iterative algorithm) and easy to implement with an autograd library such as PyTorch. In the experiments, we found this optimization strategy natural and relatively easy to implement.

4. EXPERIMENTS

Extensive experiments are conducted to evaluate our method, including a comparison with CoOp, ablation studies, parameter analysis, extensibility analysis, computing-cost analysis, and visualization.

4.1. DATASETS

We followed the experimental settings of CoOp (Zhou et al., 2021b) for the few-shot learning evaluation. The experiments are conducted on 11 visual recognition datasets, including Caltech101 (Fei-Fei et al., 2004), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), FGVCAircraft (Maji et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), ImageNet (Deng et al., 2009), OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), SUN397 (Xiao et al., 2010), and UCF101 (Soomro et al., 2012). These datasets span the visual classification of generic objects, scenes, actions, fine-grained categories, and so on, constituting a comprehensive evaluation of our method. All experiments adopted the few-shot evaluation protocol used in CLIP (Radford et al., 2021) and CoOp (Zhou et al., 2021b), where we respectively choose 1, 2, 4, 8, and 16 shots for model training and use the original test set for evaluation. Besides, we also evaluated the robustness of our method under domain shift. Following CoOp, we used ImageNet as the source domain and evaluated our method on ImageNet-based robustness evaluation datasets, including ImageNetV2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2019), and ImageNet-R (Hendrycks et al., 2020). A detailed introduction of each dataset can be found in the appendix.

4.2. IMPLEMENTATION DETAILS

We chose CoOp (Zhou et al., 2021b) as our main competitor to evaluate our method. Compared with CoOp, which only learns a global prompt for each class, our PLOT method learns multiple local prompts and applies the OT distance for fine-grained alignment. Besides, we also reported the results of other widely-used adaptation methods, such as CoCoOp (Zhou et al., 2022) and the CLIP linear probe. Please note that we evaluate CoCoOp in the same setting for a fair comparison (the base-to-new setting can be found in the appendix). The original CoOp method has different versions with different class token positions and parameter initialization strategies. For easy comparison, we directly chose one of them as our baseline, with the "end" token position, "random" initialization, 16 context tokens, and the RN50 backbone. More implementation details can be found in Section A2.

4.3. COMPARISON WITH COOP

In this subsection, we compare our PLOT with the baseline CoOp on the few-shot recognition and domain generalization tasks.

Few-Shot Learning

We summarized the experimental results in Figure 3, where the red line denotes our PLOT method, the blue line denotes CoOp, the purple line denotes CoCoOp (Zhou et al., 2022), and the green line denotes the CLIP linear probe. As the settings of CoCoOp and CoOp are different, we re-ran the CoCoOp method in the setting of CoOp. We observed that all prompt learning methods outperform the linear-probe method by a large margin.

Domain generalization

Robustness also plays a critical role in model applications, since the real-world environment may have a large domain shift from the training data. Therefore, we conducted a robustness evaluation to investigate the transferability of models learned by PLOT. Table 1 summarizes the results of our PLOT method and CoOp on four ImageNet-based robustness evaluation datasets. For both methods, we trained the models on ImageNet with 16 shots per class. For PLOT, we set the number of prompts to N = 4. We observe that PLOT outperforms CoOp consistently on both the source and target domains. These experimental results demonstrate that the performance improvement from learning multiple prompts does not rely on overfitting to a single domain.

4.4. ABLATION STUDIES AND MORE ANALYSIS

In this subsection, we conduct ablation studies to investigate the effectiveness of different components and answer the following questions.

Q: Can we directly learn multiple prompts by matching the prompt ensemble with the global visual feature? A: No. As shown in Table 2, we report the performance of directly matching the prompt ensemble with the global visual feature (denoted "G") on three datasets: Caltech101, DTD, and Food101. The improvement of this method over CoOp is limited and far below that of PLOT. This may be because the "G" method is incentivized to learn indistinguishable prompts, which contradicts our purpose of learning multiple comprehensive prompts.

Q: Can ensemble methods that encourage the variety of prompts work well? A: Not really. As shown in Table 2, we further apply two methods to encourage prompt variety and then use the ensemble to match the global feature. In method "V", we add the distances between every two prompts to the objective as a regularization term. In method "E", we replace the random initializations with different predefined initializations, such as "a photo of a", "this is a photo", "this is a", and "one picture of a". However, "G+V" does not achieve consistent improvement over "G". Despite the clear improvement brought by "G+E", our PLOT shows consistent superiority over "G+E", which further demonstrates the effectiveness of the OT distance.

Q: Does the improvement mainly come from using all feature maps? A: No. In PLOT, we use all feature maps of the visual encoder branch, where each feature is a local embedding at one spatial position. However, the improvement of PLOT does not come merely from using all feature maps. On the contrary, directly using the feature map to replace the global feature causes a large performance drop.
For example, on all three datasets, directly using the feature map ("M" or "M+V") gives around a 20% 1-shot accuracy drop compared with using the global visual feature. This is not surprising, since the original CLIP model is trained by matching the global visual feature and the language feature. Without the OT method, the distance between the feature map and multiple textual prompts degenerates to the mean distance over each feature-prompt pair.

Q: How many prompts are needed? A: 4 prompts are enough. One important hyper-parameter in PLOT is the number of prompts. To analyze its effect, we conducted experiments on three datasets with 1, 2, 4, and 8 prompts. The results are summarized in the white part of Table 3. We observe that the performance clearly increases when raising the number of prompts from 1 to 4. For example, PLOT (N=4) respectively obtains 1.36%, 2.64%, and 1.68% 1-shot accuracy improvements over PLOT (N=1) on the three datasets. When we further increase the number of prompts, the improvement is not consistent. To balance improvement and cost, we set N = 4 as the default configuration of our PLOT model. In the experiments, we tuned this hyper-parameter on the Caltech101 dataset and applied it to the other datasets.

Q: Can PLOT benefit adapter-based methods? A: Yes. Adapter-based methods (Gao et al., 2021; Zhang et al., 2021a) are another research direction for the efficient adaptation of pre-trained vision-language models. Different from prompt learning, which fixes the model parameters and tunes the language prompt, adapter-based methods allow fine-tuning a part of the network or adding an extra model for training. Recently, adapter-based methods have also achieved good performance on few-shot visual recognition. Therefore, we would like to explore whether and how our PLOT approach can benefit them.
We apply Tip-Adapter-F (Zhang et al., 2021a) as our baseline method, which learns a Linear(d, N_cls × K_shots) model to describe one image by its similarity to all training samples, where d is the dimension of the visual feature, N_cls is the number of categories (e.g. 1,000 in ImageNet), and K_shots is the number of shots. The final similarity then consists of the original distance between the visual feature and the prompt ensemble, plus the new distance calculated from the learned feature and the one-hot label vectors (whose dimension is (N_cls × K_shots, N_cls)). Please find the details in Tip-Adapter-F (Zhang et al., 2021a). To introduce PLOT into this framework, we first used the feature map to replace the global feature and then learned multiple linear models. As a result, with different local features and different linear models, we can obtain an M × N distance matrix and apply the Sinkhorn algorithm (Cuturi, 2013) to calculate the OT distance. Furthermore, we can apply the learned prompts alongside the ensemble prompts to refine the final similarity. Table 4 summarizes the few-shot recognition results of the original Tip-Adapter-F method and our adapter-based PLOT methods on all 11 datasets.

Q: Can PLOT benefit zero-shot learning? A: No. The detailed analysis and discussion can be found in the appendix.

Q: What is the extra computation cost of PLOT over the CoOp baseline? A: Around 10% inference speed and 5% training time. Please see the detailed analysis in the appendix.
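To make the adapter variant's matching step concrete, the sketch below assumes the scores of the N linear models for the M local features have already been collected into a per-class (K, M, N) affinity array; both this array layout and the similarity-to-cost transform are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def sinkhorn_plan(C, lam=0.1, n_iters=100):
    """Uniform-marginal Sinkhorn transport plan for an (M, N) cost matrix."""
    M, N = C.shape
    K = np.exp(-C / lam)
    u_marg, v_marg = np.full(M, 1.0 / M), np.full(N, 1.0 / N)
    v = np.ones(N)
    for _ in range(n_iters):
        u = u_marg / (K @ v)
        v = v_marg / (K.T @ u)
    return u[:, None] * K * v[None, :]

def adapter_ot_similarity(affinity):
    """Convert per-class affinities into OT-based class similarities.

    affinity : (K, M, N) array; entry (k, m, n) is a score in [0, 1] that the
    n-th linear model assigns to class k given the m-th local visual feature
    (a hypothetical layout, not the paper's notation).
    """
    sims = np.empty(affinity.shape[0])
    for k in range(affinity.shape[0]):
        C = 1.0 - affinity[k]                  # turn similarities into costs
        T = sinkhorn_plan(C)
        sims[k] = 1.0 - float((T * C).sum())   # higher = better match
    return sims

# Toy check: class 2 has perfect affinity everywhere, so it should win.
affinity = np.zeros((3, 4, 2))
affinity[2] = 1.0
sims = adapter_ot_similarity(affinity)
```

The point of the sketch is only the shape of the computation: each class gets its own M × N cost matrix, and the Sinkhorn plan turns the M × N scores into a single OT-based similarity per class.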

4.5. VISUALIZATION

In this subsection, we provide some visualization examples of the transport plans T related to different prompts (N=4) in Figure 4. A detailed analysis of these examples and further visualization results, including the interpretation of the learned prompts, a t-SNE visualization of the prompts, and visualizations of failure cases, can be found in Section A3.

5. CONCLUSION

In this paper, we present a method, named PLOT, to learn multiple comprehensive prompts that describe diverse characteristics of one category. To avoid convergence to one point, we propose to apply optimal transport to achieve fine-grained alignment between the vision and language domains. We apply a two-stage optimization strategy, where the inner loop fixes the prompts and learns the transport plan to calculate the cross-modality distance, and the outer loop uses this distance to optimize the prompt learner. We build our method on the base of CoOp and achieve significant improvement on the few-shot recognition task across various datasets, which demonstrates the advantage of learning multiple prompts instead of a single one.

A1 ALGORITHM

In our framework, we treat the visual features f_m and prompt features g_n equally, and thus u = 1_{M×1}/M and v = 1_{N×1}/N. The two-stage optimization of PLOT is summarized as follows:

1: for each training iteration do
2: Obtain the local visual features F and the prompt features G_k
3: Calculate the cost matrix C_k = 1 - F^⊤ G_k ∈ R^{M×N} of each class
4: Calculate the OT distance with an inner loop: initialize v^(0) = 1, δ = 0.01, and Δ_v = ∞
5: for t_in = 1, 2, . . . , T_in do
6: Update u^(t_in) = u / (exp(-C_k/λ) v^(t_in-1))
7: Update v^(t_in) = v / (exp(-C_k/λ)^⊤ u^(t_in))
8: Update Δ_v = |v^(t_in) - v^(t_in-1)| / N
9: if Δ_v < δ then break
10: end for
11: Obtain the optimal transport plan T*_k = diag(u^(t)) exp(-C_k/λ) diag(v^(t))
12: Calculate the OT distance d_OT(k) = <T*_k, C_k>
13: Calculate the classification probability p_OT(y = k|x) with the OT distance
14: Update the parameters of the prompts {ω_n}_{n=1}^N with the cross-entropy loss L_CE
15: end for
16: return {ω_n}_{n=1}^N

A2.1 DATASETS

The detailed statistics of each dataset are summarized in Table A1, including the number of classes, the sizes of the training and testing sets, and the original tasks.

A2.2 IMPLEMENTATION DETAILS

The original CoOp method has different versions with different class token positions and parameter initialization strategies. As the performance gap among the different versions is limited, we directly chose one of them as our baseline, where the token position is "end", the parameter initialization strategy is "random", and the length of learnable context tokens is set to 16. Following the widely used setting in (Zhou et al., 2021b; 2022; Gao et al., 2021; Zhang et al., 2021a), we also chose RN50 (He et al., 2016) as the backbone network of the visual branch. All the code of our method is based on CoOp, which adopts the SGD optimizer with a 0.002 initial learning rate, a cosine annealing schedule, and a warmup trick with a 1e-5 learning rate. We also followed the strategy of training more epochs for more shots. For small datasets such as FGVCAircraft, OxfordFlowers, and StanfordCars, the batch size is set to 32, while for larger datasets such as ImageNet and SUN397, the batch size is set to 128. We apply N = 4 prompts for each category and use M = 7 × 7 due to the feature-map size. We set the hyper-parameter of the Sinkhorn distance algorithm (Cuturi, 2013) to λ = 0.1 for all the datasets. We set the maximum number of inner-loop iterations to 100 and early-stop the iteration when the average absolute update Δ_v < 0.01. We initialize all values in the vectors v and u as 1/N and 1/M, respectively. All models are implemented in PyTorch (Paszke et al., 2019) 1.7.1 and trained on 4 NVIDIA A100 GPUs. We repeated each experiment three times with different seeds and report the average.
Dataset                                    Classes   Train     Test     Task
DTD (Cimpoi et al., 2014)                  47        2,820     1,692    Texture recognition
EuroSAT (Helber et al., 2019)              10        13,500    8,100    Satellite image recognition
FGVCAircraft (Maji et al., 2013)           100       3,334     3,333    Fine-grained aircraft recognition
Flowers102 (Nilsback & Zisserman, 2008)    102       4,093     2,463    Fine-grained flower recognition
Food101 (Bossard et al., 2014)             101       50,500    30,300   Fine-grained food recognition
ImageNet (Deng et al., 2009)               1,000     1.28M     50,000   Object recognition
OxfordPets (Parkhi et al., 2012)           37        2,944     3,669    Fine-grained pet recognition
StanfordCars (Krause et al., 2013)         196       6,509     8,041    Fine-grained car recognition
SUN397 (Xiao et al., 2010)

Here, we provide detailed performance results on all 11 few-shot recognition datasets in Table A2, where we use gray for our method and white for CoOp. To highlight, we respectively use dark cyan and light cyan to represent the performance of PLOT and CoOp on the average over all 11 datasets. We repeat all experiments 3 times and report the mean and standard deviation in the table.

A2.4 ABLATION STUDIES DETAILS

In this section, we provide more details about the different variants in Table 2. We compare PLOT with the other 6 baseline methods, briefly described below:

- CoOp: the baseline method that learns a single prompt and matches it with the global visual feature. We apply the officially released code to reproduce this method.
- "G": In this paper, we propose to explore whether we can learn multiple prompts for a more comprehensive textual representation and fine-grained visual-textual alignment. "G" denotes building multiple prompts (similar to our PLOT) and learning them by matching each of them with the single global visual feature.
- "G+V": Matching all prompts to a single visual feature reduces the diversity of the learned prompts. To improve the variety of learned prompts, "G+V" further adds an objective function that increases the distance between every two prompts.
- "G+E": "G+E" is another method to increase the variety of prompts, via separate initializations. It applies predefined, distinct initializations in place of random initialization, such as "a photo of a", "this is a photo", "this is a", and "one picture of a".
- "M": One key difference between PLOT and CoOp is utilizing the feature map for more fine-grained information. To evaluate whether our improvement mainly comes from using the feature map, we design a method "M", which removes the OT distance of PLOT and matches local visual features with multiple textual prompts by the average distance over all visual-textual pairs.
- "M+V": Similar to "G+V", we add an objective function that increases the distance between every two prompts to the method "M", to increase the variety of prompts.
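The "V" constraint is only described qualitatively above. One plausible minimal instantiation, sketched here for illustration under the assumption of a pairwise cosine-similarity penalty (the exact objective used in our experiments may differ), is:

```python
import numpy as np

def prompt_diversity_penalty(G):
    """Mean pairwise cosine similarity among N prompt features G of shape (N, C).

    Minimizing this term alongside the matching loss (the "V" variants)
    pushes every two prompts apart in the embedding space. Assumed form,
    not necessarily the exact objective from the paper.
    """
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)   # L2-normalize rows
    sim = Gn @ Gn.T                                     # (N, N) cosine similarities
    n = G.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]              # drop self-similarities
    return off_diag.mean()

G = np.random.default_rng(1).normal(size=(4, 1024))     # toy prompt features
penalty = prompt_diversity_penalty(G)                   # added to the loss with some weight
```

A weighted sum of this penalty and the matching loss would then be minimized in the outer loop.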

A2.5 BASE-TO-NEW RESULTS

To investigate whether our method generalizes to other prompt-learning baselines, we apply PLOT to CoCoOp (Zhou et al., 2022) by learning multiple textual prompts (e.g. N = 4) instead of the single prompt in CoCoOp. We name this variant CoPLOT. Specifically, we learn multiple prompts and use the same meta-network for all of them. Then we apply optimal transport to calculate the distance between the multiple prompts and the local visual features. We evaluate both CoCoOp and CoPLOT in the "base-to-new" setting and implement them with the same RN50 backbone. The results on the 11 datasets with 16 shots are provided in Table A3. We observe that CoPLOT achieves improvement on most datasets and on average, which demonstrates that our method can be applied to different prompt-learning-based methods. For example, on average, CoPLOT achieves almost 3% improvement on the "new" classes without reducing "base" performance. This suggests that the two methods are complementary: CoCoOp proposes a conditional formulation that uses each image feature as a context condition to refine the single prompt, while PLOT aims to learn multiple prompts.

A2.6 ZERO-SHOT SETTING ANALYSIS

PLOT cannot bring benefits in the zero-shot setting. Below we provide some experimental details and the corresponding analysis. CLIP shows that manually designed prompts can still achieve good performance: with 7 prompts obtained by prompt engineering on the ImageNet dataset, the prompt ensemble achieves 60.38% top-1 accuracy. In this section, we replace the cosine distance between the global visual feature and the prompt ensemble with the OT distance between the feature map and all 7 prompts. However, without any learning, the OT distance only obtains 58.78% accuracy. It is a limitation of PLOT that it still needs few-shot data for optimization and thus cannot be directly applied in the zero-shot setting. We argue there are two reasons why the OT distance does not work without learning: 1) prompt engineering selects prompts based on the global feature and cosine distance, not the OT distance with the feature map; 2) all the selected prompts are close to the global feature and lack complementarity.

A3.2 VISUALIZATION OF FAILURE CASES

To better understand the method and discover the reason for the failure cases, we visualize the attention maps of some failures. As shown in Figure A1, we show two failure examples of the class "2000 AM General Hummer" in the StanfordCars dataset. During training, we set the number of prompts to 4, but in these visualization results, we find that some of the learned prompts remarkably coincide with each other. These prompts can be roughly divided into two groups: foreground and background. For example, in both images, prompts 2 (top right) and 3 (bottom left) focus on the foreground car, while the others focus on the background. This demonstrates that not all classes have multiple complementary attributes, which motivates us to learn a dynamic number of local prompts in the future to reduce the computational load.

A3.3 INTERPRETATION OF TEXT PROMPTS

The learned prompts are difficult for humans to understand since their parameters are optimized in a continuous space (Zhou et al., 2021b). CoOp proposes to visualize a learned prompt by the word nearest to it in the embedding space. Following this practice, we show the nearest words of our learned prompts in Table A5. Similar to CoOp, most words cannot be directly interpreted by human logic. However, we can still find relations between the learned prompts and the corresponding optimal transport plans. As shown in Figure 4 in the main paper, the optimal transport plan for Prompt 1 always focuses on the "head", such as the head of the "brambling", the head of the "rooster", and even the head of the "aircraft carrier". This is because the word "head" appears in Prompt 1. Similarly, Prompt 4 prefers the white parts of images, such as the white environment in the image of the "brambling" and the snow in the image of the "dog sled". This demonstrates that the learned prompts focus on different characteristics of the categories.
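The nearest-word lookup follows a simple recipe, sketched below with a toy 4-word embedding table; the vocabulary, embeddings, and function names are illustrative, not CLIP's actual tokenizer data.

```python
import numpy as np

def nearest_words(prompt_vecs, vocab_embed, vocab_words):
    """Map each learned context vector to its nearest vocabulary word
    by Euclidean distance in the token-embedding space (the CoOp recipe).

    prompt_vecs: (L, C) learned context vectors; vocab_embed: (V, C)
    token-embedding table; vocab_words: list of V strings.
    """
    out = []
    for vec in prompt_vecs:
        d = np.linalg.norm(vocab_embed - vec, axis=1)   # distance to every token
        out.append(vocab_words[int(d.argmin())])
    return out

vocab_words = ["photo", "head", "white", "bird"]
vocab_embed = np.eye(4, dtype=float)                     # toy embedding table
prompt_vecs = np.array([[0.9, 0.1, 0.0, 0.0],
                        [0.0, 1.1, 0.0, 0.1]])
print(nearest_words(prompt_vecs, vocab_embed, vocab_words))  # → ['photo', 'head']
```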

A3.4 T-SNE OF PROMPTS

To better understand the learned prompts, we provide a t-SNE (Van der Maaten & Hinton, 2008) visualization of the learned textual prompts. Specifically, we randomly select 10 classes from ImageNet and generate the textual embeddings with our learned prompts, obtaining 4 × 10 embeddings of dimension d = 1024. We then apply t-SNE to reduce the dimension and visualize the embeddings. As shown in Figure A2, the textual embeddings of the same class with different prompts are clustered well. Besides, despite being well clustered, the textual embeddings also retain intra-class diversity.
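The visualization pipeline above can be sketched as follows, using scikit-learn's `TSNE` on toy stand-in embeddings (the real embeddings come from the CLIP text encoder; the cluster construction here is synthetic, for illustration only):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_class, n_prompt, dim = 10, 4, 1024        # 10 classes x 4 prompts, d = 1024

# Toy stand-in for the textual embeddings: one cluster center per class,
# with small per-prompt offsets around it.
centers = rng.normal(size=(n_class, dim))
emb = (np.repeat(centers, n_prompt, axis=0)
       + 0.05 * rng.normal(size=(n_class * n_prompt, dim)))

# Reduce the 40 embeddings to 2-D for plotting.
emb2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
labels = np.repeat(np.arange(n_class), n_prompt)   # class id per embedding
# `emb2d` can now be scatter-plotted, colored by `labels`, as in Figure A2.
```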



Figure 2: The framework: PLOT first describes each category with multiple prompts and obtains a set of prompt features by text encoder. The image is also encoded as a set of local features. Then the optimal transport is used as the metric between prompts and visual features.

Figure 3: The few-shot learning results on 11 datasets. We compare our PLOT with CoOp, CoCoOp, and the Linear Probe method and observe the consistent and significant performance improvement on most datasets. (The average accuracy on all datasets is shown on the left top.)

Figure 4: Visualization. We provide the heatmaps of transport plan T related to each prompt on 4 categories in ImageNet. Different transport plans focus on different attributes of the object.

Algorithm A1: The training process of Prompt Learning with Optimal Transport
Input: training few-shot image data X = {x}; pretrained CLIP image and text encoders f and g; number of prompts N; entropy parameter λ; maximum numbers of iterations in the inner and outer loops T_in, T_out.
Output: the parameters of the prompts {ω_n}_{n=1}^{N}
1: Initialize {ω_n}_{n=1}^{N}
2: for t_out = 1, 2, ..., T_out in the outer loop do
3:   Obtain a visual feature set F ∈ R^{M×C} with the visual encoder f(x);
4:   Generate the prompt feature set G_k ∈ R^{N×C} of each class with the textual encoder {g(t_k^n)};

with the initialization v^(0) = 1. The detailed algorithms of the training and testing processes are shown in Algorithms A1 and A2.

Algorithm A2: The inference process of Prompt Learning with Optimal Transport
Input: testing image data X = {x}; number of prompts N; number of classes K; learned prompts {t_k^n}_{k=1,n=1}^{K,N}; a frozen pretrained CLIP model with image encoder f and text encoder g.
Output: the classification of each image
1: for x in X do
2:   Obtain a visual feature set F ∈ R^{M×C} with the visual encoder f(x);
3:   Generate the prompt feature set G_k ∈ R^{N×C} of each class with the textual encoder {g(t_k^n)};
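Putting the pieces together, the inference process can be sketched end-to-end in NumPy with toy dimensions. This is a minimal sketch, assuming a cost of one minus cosine similarity and a fixed-iteration Sinkhorn solver in place of the early-stopped inner loop; all shapes and names are illustrative.

```python
import numpy as np

def sinkhorn(C, u, v, lam=0.1, n_iter=100):
    """Fixed-iteration Sinkhorn solver for the entropic OT plan."""
    K = np.exp(-C / lam)
    b = np.ones_like(v)
    for _ in range(n_iter):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]

def classify(F, prompts):
    """Algorithm A2 sketch: pick the class whose prompt set has the
    smallest OT distance to the visual feature set F of shape (M, C).

    `prompts` is a list of (N, C) prompt-feature arrays, one per class.
    """
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    M = F.shape[0]
    dists = []
    for G in prompts:
        Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
        C = 1.0 - Fn @ Gn.T                          # cost: 1 - cosine similarity
        N = G.shape[0]
        T = sinkhorn(C, np.full(M, 1 / M), np.full(N, 1 / N))
        dists.append((T * C).sum())                  # OT distance <T, C>
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
G1 = rng.normal(size=(4, 32))                        # prompt features of class 1
F = np.tile(G1, (13, 1))[:49] + 0.01 * rng.normal(size=(49, 32))  # image matching class 1
prompts = [rng.normal(size=(4, 32)), G1, rng.normal(size=(4, 32))]
print(classify(F, prompts))                          # → 1
```

Because the class-1 prompts lie close to the image's local features, their OT distance is near zero while the random classes incur a much larger transport cost.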

The datasets used in the experiments follow CoOp (Zhou et al., 2021b) and include 11 datasets for few-shot visual recognition and 4 ImageNet-based datasets for generalization (robustness) evaluation. The details of each dataset are shown in the dataset statistics table.

Besides, PLOT further outperforms CoOp and CoCoOp on most of the datasets. Taking the average accuracy (at the top left) as an example,

Ablation studies on few-shot recognition. PLOT: our model with N = 4. CoOp: the baseline method. G: matching the global visual feature with multiple textual prompts. V: adding a constraint that increases the variance of prompts. E: using different initializations as an ensemble. M: using the visual feature map instead of the global visual feature. More details of the different variants can be found in Section A2.4 in the appendix.

| Dataset | Method | 1 shot | 2 shots | 4 shots | 8 shots | 16 shots |
|---|---|---|---|---|---|---|
| … | … | … ± 1.75 | 71.57 ± 1.59 | 77.18 ± 2.16 | 81.77 ± 0.47 | 86.21 ± 0.20 |
| … | M+V | 66.11 ± 8.29 | 71.45 ± 3.98 | 79.30 ± 3.96 | 86.96 ± 0.78 | 89.80 ± 0.17 |
| DTD | PLOT | 46.55 ± 2.62 | 51.24 ± 1.95 | 56.03 ± 0.43 | 61.70 ± 0.35 | 65.60 ± 0.82 |
| DTD | CoOp | 43.62 ± 1.96 | 45.35 ± 0.31 | 53.94 ± 1.37 | 59.69 ± 0.13 | 62.51 ± 0.25 |
| DTD | G | 45.12 ± 1.69 | 48.39 ± 2.08 | 54.75 ± 0.48 | 60.15 ± 0.70 | 63.59 ± 0.76 |
| DTD | G+V | 45.90 ± 2.00 | 48.50 ± 0.99 | 53.96 ± 0.48 | 59.69 ± 1.01 | 63.51 ± 0.66 |
| DTD | G+E | 46.39 ± 1.00 | 49.31 ± 0.56 | 52.99 ± 0.60 | 60.44 ± 1.64 | 63.97 ± 0.48 |
| DTD | M | 13.18 ± 4.57 | 12.25 ± 3.86 | 13.00 ± 4.73 | 20.76 ± 5.42 | 26.99 ± 1.98 |
| DTD | M+V | 12.61 ± 5.93 | 15.11 ± 1.81 | 20.35 ± 1.33 | 44.13 ± 2.39 | 56.85 ± 0.54 |
| FOOD101 | PLOT | 77.74 ± 0.47 | 77.70 ± 0.02 | 77.21 ± 0.43 | 75.31 ± 0.30 | 77.09 ± 0.18 |
| FOOD101 | CoOp | 74.25 ± 1.52 | 72.61 ± 1.33 | 73.49 ± 2.03 | 71.58 ± 0.79 | 74.48 ± 0.15 |
| FOOD101 | G | 74.63 ± 0.11 | 70.15 ± 0.49 | 70.41 ± 0.46 | 70.72 ± 0.98 | 73.68 ± 0.46 |
| FOOD101 | G+V | 74.83 ± 0.31 | 70.09 ± 0.85 | 70.86 ± 0.22 | 70.80 ± 0.68 | 73.93 ± 0.35 |
| FOOD101 | G+E | 75.77 ± 0.62 | 73.54 ± 0.88 | 75.82 ± 0.44 | 72.40 ± 0.50 | 75.52 ± 0.33 |
| FOOD101 | M | 52.02 ± 4.86 | 46.12 ± 1.46 | 46.86 ± 1.39 | 53.43 ± 0.88 | 61.28 ± 0.23 |
| FOOD101 | M+V | 46.52 ± 1.15 | 45.95 ± 2.66 | 53.57 ± 0.83 | 62.95 ± 0.37 | 67.63 ± 1.11 |

Parameter analysis for the number of prompts

Comparisons on robustness to domain shift.

The few-shot accuracies of Tip-adapter-F and our adapter-based PLOT on 11 datasets.

The detailed statistics of datasets used in experiments.

The few-shot visual recognition accuracy on 11 datasets.

The nearest words for 16 context vectors of all N = 4 prompts learned by PLOT . N/A means non-Latin characters.

A1 METHOD DETAILS

Optimal transport (Monge, 1781) was initially introduced to find a transportation plan that moves several items simultaneously at minimal cost, such as moving a pile of sand to fill a set of holes. Recently, it has been widely used for comparing distributions. Mathematically, given two probability density functions U and V over spaces \mathcal{X} and \mathcal{Y}, the OT (Wasserstein) distance (Thorpe, 2019) can be defined as

d_{\mathrm{OT}}(U, V) = \min_{\gamma \in \Gamma(U, V)} \int_{\mathcal{X} \times \mathcal{Y}} C(x, y) \, \mathrm{d}\gamma(x, y),

where C(x, y) is the cost between two points in the space \mathcal{X} \times \mathcal{Y}, and \Gamma(U, V) denotes the set of transport plans \gamma(x, y) whose marginals are U and V. We can regard the two probability density functions U and V as piles and holes, and C as the cost of moving a unit of sand.

In our problem of learning multiple prompts, we formulate the sets of visual features and prompt features as two discrete distributions,

U = \sum_{m=1}^{M} u_m \delta_{f_m}, \quad V = \sum_{n=1}^{N} v_n \delta_{g_n},

where u and v are discrete probability vectors that sum to 1, and \delta_f is a Dirac delta function placed at support point f in the embedding space. Given two support points f_m and g_n, the cost function is written as

C(f_m, g_n) = 1 - \mathrm{sim}(f_m, g_n),

where \mathrm{sim}(\cdot, \cdot) denotes cosine similarity. For simplicity, in this discrete situation, C \in \mathbb{R}^{M \times N} is a cost matrix in which each entry denotes the cost between f_m and g_n.

Table A3: Comparison of CoCoOp (Zhou et al., 2022) and CoPLOT (ours) in the base-to-new generalization setting. All methods are implemented with the RN50 backbone and evaluated with 16 shots. We report the performance on the base classes, the new classes, and their mean. We show that PLOT can be applied to CoCoOp and achieve improvement.
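The discrete formulation above can be sketched numerically as follows (toy dimensions, illustrative names only; the features here are random stand-ins for the encoder outputs):

```python
import numpy as np

# Discrete OT setup: M local visual features f_m and N prompt features g_n,
# both L2-normalized, so the cost C_mn = 1 - <f_m, g_n> is one minus cosine
# similarity.
rng = np.random.default_rng(0)
M, N, C_dim = 49, 4, 16
F = rng.normal(size=(M, C_dim))
G = rng.normal(size=(N, C_dim))
F /= np.linalg.norm(F, axis=1, keepdims=True)
G /= np.linalg.norm(G, axis=1, keepdims=True)

C = 1.0 - F @ G.T                  # (M, N) cost matrix
u = np.full(M, 1.0 / M)            # discrete marginal over visual features
v = np.full(N, 1.0 / N)            # discrete marginal over prompts

# Any plan T in Gamma(u, v) has row sums u and column sums v; the
# (un-regularized) OT distance is min_T <T, C>. The independent plan
# u v^T is always feasible, so its cost upper-bounds the OT distance.
T_indep = np.outer(u, v)
cost_indep = (T_indep * C).sum()   # equals the mean pairwise cost
```

Since the cost of the independent plan is exactly the average pairwise cost, matching every prompt to every feature uniformly can never beat the OT solution; the Sinkhorn algorithm finds a lower-cost plan with the same marginals.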

A3 VISUALIZATION A3.1 MORE ANALYSIS ON VISUALIZATION

In this section, we provide some visualization examples of the transport plans T related to different prompts (N = 4). We translate each transport plan into a colorful heatmap, resize it to the original image size, and overlay it on the raw image. As shown in Figure 4, we provide the heatmaps of 4 categories in ImageNet. We observe that different transport plans highlight different regions of the image, which demonstrates that the learned prompts are complementary. For the class "Brambling", the prompts respectively focus on the head, the tail, the wing, and the environment. For "Dog Sled", the prompts relate to the dogs, the sled, some ties, and the snow environment.
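The heatmap construction can be sketched as follows, with nearest-neighbor upsampling as a crude stand-in for the smoother resizing used in the actual figures; the grid and image sizes follow the 7 × 7 feature map, and everything else is illustrative.

```python
import numpy as np

def plan_to_heatmap(T_col, side=7, img_hw=(224, 224)):
    """Turn one prompt's transport-plan column (M = side*side values)
    into a full-resolution heatmap by nearest-neighbor upsampling.
    """
    h, w = img_hw
    grid = T_col.reshape(side, side)
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)  # scale to [0, 1]
    block = (h // side + 1, w // side + 1)          # upsampling factor per cell
    return np.kron(grid, np.ones(block))[:h, :w]    # expand each cell, then crop

T_col = np.random.default_rng(0).random(49)         # toy plan column for one prompt
heat = plan_to_heatmap(T_col)
# `heat` can be alpha-blended with the raw image to reproduce Figure 4-style maps.
```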

