PLOT: PROMPT LEARNING WITH OPTIMAL TRANSPORT FOR VISION-LANGUAGE MODELS

Abstract

With the increasing attention to large vision-language models such as CLIP, significant effort has been dedicated to building efficient prompts. Unlike conventional methods that learn only one single prompt, we propose to learn multiple comprehensive prompts that describe diverse characteristics of a category, such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes all prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model the images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts via the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task, and the consistent improvements demonstrate the superiority of our method.

1. INTRODUCTION

In the past few years, large-scale vision-language pre-trained (VLP) models, such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and BLIP (Li et al., 2022), have achieved remarkable success in open-world visual concept learning. These methods have brought new light but also pose a new question: how to efficiently adapt the knowledge from pre-training to downstream tasks, since these models are typically of massive size and are infeasible for ordinary users to re-train.

Figure 1: The motivation that one category can be complementarily described from different views (an example of "Brambling"): a bird that lives in winter woods; a bird with a dark fan-shaped tail; a bird with orange and black texture; a bird with a black crown and eyes.

One of the conventional paradigms for utilizing pre-trained knowledge is "pre-training, fine-tuning", which fixes the architecture of the pre-trained neural network and tunes its parameters using task-specific objective functions. Beyond fine-tuning the parameters, many recent methods (Zhou et al., 2021b; 2022) introduce the concept of prompt learning from the field of NLP to the vision domain and achieve striking performance gains for few-shot visual classification. They fix the model parameters and instead learn suitable prompts by turning a template sentence into a set of learnable vectors. These prompts are then learned by minimizing the distance between the visual features and the prompt-based language features.

Despite significant improvements over manual prompts, learning only a single sentence is intuitively insufficient to represent a class. One class can be described by many intrinsic characteristics and even extrinsic context relations. Thus, for one object, we may have multiple prompt candidates that focus on different attributes. As shown in Figure 1, we can describe the class "Brambling" from different views, such as the color of the wing, the color of the crown and eyes, the shape and color of the tail, and even its living environment. This motivates us to learn multiple prompts to comprehensively represent the class and thus facilitate classification. The most natural solution is to directly learn multiple prompts by matching each prompt with the visual features separately. However, this is equivalent to matching the mean of the prompt features with the visual features.
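The equivalence noted above (matching each prompt separately versus matching the mean prompt) can be checked directly for an inner-product similarity, since the average of per-prompt scores equals the score against the averaged prompt feature. A minimal NumPy sketch; the feature dimension and number of prompts are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(512)        # a global visual feature (dim illustrative)
G = rng.standard_normal((4, 512))   # 4 candidate prompt features

# Averaging the per-prompt inner-product scores ...
avg_score = np.mean([f @ g for g in G])
# ... equals scoring against the mean prompt feature.
mean_score = f @ G.mean(axis=0)
print(np.isclose(avg_score, mean_score))   # True
```

With cosine similarity the equivalence is only approximate (normalization is non-linear), but the degeneracy is the same: every prompt is pulled toward the same target.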
This solution is problematic, since all prompts are encouraged to move closer to one single point and thus tend to learn the same characteristics, which contradicts our purpose of learning comprehensive prompts. To address this, we tried adding constraints that push the prompts away from each other, but found that this still fails to learn representative and comprehensive prompts. The reason is that it treats the visual representation as one single point, and such a unified view of visual features ignores the fact that different prompts may only focus on one or a subset of characteristics.

To address this problem, in this paper we propose Prompt Learning with Optimal Transport (PLOT), which applies optimal transport (OT) to align local visual features and multiple textual prompts. Optimal transport can calculate the distance between two distributions represented by sets of samples. In our prompt learning framework, we formulate the local visual features and the multiple prompts as samples of two discrete distributions and use OT to encourage fine-grained cross-modal matching. Specifically, to obtain local visual features with different semantic clues, we extract all feature maps as the visual representation instead of a single global representation. Fortunately, we can easily obtain the visual feature maps from the visual encoder of CLIP by using all outputs of the multi-head self-attention layer (Rao et al., 2021). The problem then comes down to calculating the distance between two feature sets. We solve it by introducing optimal transport theory (Villani, 2009) and formulating each feature set as a discrete probability distribution in which every feature has an equal probability value. Furthermore, to reduce the computational cost and avoid extra model parameters, we learn the prompts with a two-stage optimization strategy.
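The discrete formulation above fixes the inputs of the transport problem: a cost matrix between every local visual feature and every prompt, plus uniform marginals over the two feature sets. A minimal sketch assuming cosine distance as the cost; the sizes (a 7x7 feature map, 4 prompts, dimension 512) are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def ot_cost_matrix(visual_feats, prompt_feats):
    """Cost matrix (cosine distance) between M local visual features and
    N prompt features, plus uniform marginals for the two feature sets."""
    F = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    G = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    C = 1.0 - F @ G.T                            # (M, N): low cost = similar
    u = np.full(C.shape[0], 1.0 / C.shape[0])    # equal weight per local feature
    v = np.full(C.shape[1], 1.0 / C.shape[1])    # equal weight per prompt
    return C, u, v

rng = np.random.default_rng(0)
C, u, v = ot_cost_matrix(rng.standard_normal((49, 512)),   # 7x7 feature map
                         rng.standard_normal((4, 512)))    # 4 prompts
print(C.shape)   # (49, 4)
```

Treating each feature as an equal-mass point of a discrete distribution is what lets OT match prompts to subsets of local features rather than to a single pooled vector.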
In the first stage (the inner loop), we fix both the visual and text features and optimize the optimal transport problem with the fast Sinkhorn algorithm (Cuturi, 2013). Then, in the outer loop, we fix all parameters of the optimal transport plan and back-propagate the gradient to learn prompts with different characteristics. Compared with conventional distances (such as the Euclidean distance between mean features), optimal transport can align different visual features to each local prompt, which is more robust to visual misalignment and tolerates feature shifts well (Rubner et al., 2000). This is because OT learns an adaptive transport plan to align features, which achieves fine-grained matching across the two modalities.

We conduct experiments on 11 datasets, following the standard setting of CLIP (Radford et al., 2021) and CoOp (Zhou et al., 2021b), to evaluate our method. These experiments span the visual classification of generic objects, scenes, actions, fine-grained categories, and so on. The significant improvements demonstrate that PLOT can effectively learn representative and comprehensive prompts.
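The inner-loop solver described above can be sketched with plain Sinkhorn iterations (Cuturi, 2013): alternately rescale the rows and columns of the Gibbs kernel until the transport plan matches the two marginals. This is a minimal NumPy sketch, not the authors' implementation; the regularization weight and iteration count are illustrative:

```python
import numpy as np

def sinkhorn(C, u, v, eps=0.1, n_iters=100):
    """Entropy-regularized OT: rescale rows and columns of the Gibbs
    kernel K = exp(-C / eps) until the plan T has marginals u and v."""
    K = np.exp(-C / eps)
    b = np.ones_like(v)
    for _ in range(n_iters):
        a = u / (K @ b)                   # match the row marginal u
        b = v / (K.T @ a)                 # match the column marginal v
    T = a[:, None] * K * b[None, :]       # transport plan
    return T, (T * C).sum()               # plan and OT distance

rng = np.random.default_rng(0)
C = rng.random((49, 4))                   # toy cost: 49 local features vs. 4 prompts
u, v = np.full(49, 1 / 49), np.full(4, 1 / 4)
T, dist = sinkhorn(C, u, v)
print(T.sum())                            # close to 1: T couples u and v
```

In the outer loop, the resulting distance would serve as the matching score whose gradient flows back into the prompt features while the plan T is treated as fixed, matching the two-stage strategy described in this section.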

2. RELATED WORK

Optimal Transport Optimal transport (OT) (Monge, 1781) was initially introduced to solve the problem of reducing the cost of moving several items simultaneously. Recently, OT theory has drawn wide attention in the machine learning and computer vision communities because it compares distributions that are readily available as feature sets (Peyre & Cuturi, 2019). Owing to this distribution-matching property, OT has been applied to many theoretical and applied tasks, including generative models (Arjovsky et al., 2017; Salimans et al., 2018; Zhao et al., 2021a), structural matching (Chen et al., 2019; Xu et al., 2020; Zhao et al., 2021b; Xu et al., 2019) (e.g., sequence matching (Chen et al., 2019), graph matching (Xu et al., 2019), and image matching (Zhang et al., 2020; Liu et al., 2021a; Zhao et al., 2021b)), and other distribution-based tasks such as clustering (Laclau et al., 2017), distribution estimation (Boissard et al., 2015), and causal discovery (Tu et al., 2022). In this paper, we use OT to align the features of the vision and language modalities by learning an adaptive transport plan (Rubner et al., 2000).

Vision-Language Pre-trained Models Vision-Language Pre-trained (VLP) models aim to explore the semantic correspondence between the vision and language modalities through large-scale pre-training. Recently, VLP models have achieved exciting performance improvements in few-shot visual recognition (Radford et al., 2021; Gao et al., 2021; Zhou et al., 2021b; 2022; Zhang et al., 2021b), which shows their great potential to promote open-world visual understanding with the help of language. In terms of objectives, VLP methods can be divided into reconstruction (Li et al.,

