VARIATIONAL PROMPT TUNING IMPROVES GENERALIZATION OF VISION-LANGUAGE MODELS

Abstract

Prompt tuning provides an efficient mechanism to adapt large vision-language models to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest of the model. Existing prompt tuning methods are, however, prone to damaging the generalization capabilities of the foundation models, because the learned prompts lack the capacity to cover certain concepts within the language model. To avoid this limitation, we propose a probabilistic model of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving performance considerably in both cases, especially with regard to preserving the generalization capability of the original model. Our method sets the current state of the art for prompt learning, surpassing CoCoOp by 1.6% average top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. Implementation code will be released.

1. INTRODUCTION

In a continuous quest for better pre-training strategies, models based on image and language supervision have set impressive milestones, with CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and Flamingo (Alayrac et al., 2022) being leading examples. Contrastively trained vision-language models consist of image and text encoders that align semantically related concepts in a joint embedding space. Such models offer impressive zero-shot image classification by using the text encoder to generate classifier weights for arbitrary, newly defined categories without relying on any visual data. In particular, the class name is inserted into a handcrafted prompt template and then tokenized and encoded into the shared embedding space to generate new classifier weights. Rather than manually defining prompts, Zhou et al. (2022b) and Lester et al. (2021) proposed that prompts can instead be optimized in a data-driven manner through back-propagation, by minimizing a cross-entropy loss on the downstream task. However, despite the performance improvement on downstream tasks, prompt learning negatively affects the generalization capability of the vision-language model. While subsequent works have focused on how to bridge this generalization gap, e.g., Zhou et al. (2022a); Zhu et al. (2022), in practice the generalization power of the foundation model remains significantly degraded.

Our work tackles the same problem: it seeks to improve downstream performance without degrading the generalization capability of the original model. To do so, we propose a data-driven method for directly learning the underlying distribution within the prompt space associated with the target concept. In particular, we frame prompt tuning as a variational inference problem, where a base learned prompt is combined with a residual vector sampled from an instance-specific underlying distribution. This formulation provides two advantages.
First, it explores the prompt space more thoroughly and makes more informative use of the language space, leading to better generalization. Second, it enables us to boost performance by capturing uncertainty information in fine-grained classification problems. The resulting approach is orthogonal
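To make the prompt-tuning mechanism concrete, the following is a minimal NumPy sketch of the data-driven optimization described in the introduction: learnable context vectors are prepended to a frozen class-name embedding, a frozen "text encoder" (here just a fixed random linear map, a toy stand-in) turns the prompt into classifier weights, and only the context vectors are updated by minimizing a cross-entropy loss. All dimensions and names are illustrative, not CLIP's actual architecture, and finite differences stand in for autograd.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the frozen encoders: fixed random linear maps.
embed_dim, ctx_len, n_cls = 8, 4, 3
W_text = rng.normal(size=(ctx_len * embed_dim + embed_dim, embed_dim))  # frozen

class_embeds = rng.normal(size=(n_cls, embed_dim))       # frozen class-name embeddings
ctx = rng.normal(scale=0.02, size=(ctx_len, embed_dim))  # the ONLY learnable parameters

def classifier_weights(ctx):
    """Encode [context vectors; class embedding] into one classifier weight per class."""
    prompts = np.concatenate(
        [np.tile(ctx.reshape(-1), (n_cls, 1)), class_embeds], axis=1)
    w = prompts @ W_text
    return w / np.linalg.norm(w, axis=1, keepdims=True)  # unit-norm rows

image_feat = rng.normal(size=(embed_dim,))  # a frozen image-encoder output
image_feat /= np.linalg.norm(image_feat)
label = 1

def loss_and_grad(ctx, eps=1e-4):
    """Cross-entropy on cosine logits; gradient w.r.t. ctx by finite differences
    (a real implementation would back-propagate through the frozen encoder)."""
    def loss(c):
        logits = classifier_weights(c) @ image_feat
        logits = logits - logits.max()
        return -logits[label] + np.log(np.exp(logits).sum())
    g = np.zeros_like(ctx)
    for idx in np.ndindex(ctx.shape):
        d = np.zeros_like(ctx); d[idx] = eps
        g[idx] = (loss(ctx + d) - loss(ctx - d)) / (2 * eps)
    return loss(ctx), g

l0, _ = loss_and_grad(ctx)
for _ in range(50):        # minimize cross-entropy on the downstream task
    _, g = loss_and_grad(ctx)
    ctx -= 0.5 * g         # only the context vectors are updated
l1, _ = loss_and_grad(ctx)
print(l0, l1)              # loss decreases while both encoders stay frozen
```

The key point is the parameter budget: everything except `ctx` is frozen, which is what makes prompt tuning efficient, and also why a poorly placed `ctx` can hurt the frozen model's generalization.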
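The base-plus-residual formulation above can likewise be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the posterior parameterization (two hypothetical linear maps `W_mu`, `W_logvar` conditioned on the image feature), the standard-normal prior, and all shapes are assumptions; it shows only the reparameterized sampling of an instance-specific residual added to a shared base prompt, plus the KL regularizer a variational objective would include.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # illustrative prompt dimensionality

base_prompt = rng.normal(size=(dim,))  # shared, learned base prompt

def residual_posterior(image_feat, W_mu, W_logvar):
    """Instance-conditioned Gaussian over residuals, q(z|x) = N(mu, diag(exp(logvar)))."""
    return W_mu @ image_feat, W_logvar @ image_feat

def sample_prompt(image_feat, W_mu, W_logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable."""
    mu, logvar = residual_posterior(image_feat, W_mu, W_logvar)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    return base_prompt + z, mu, logvar   # stochastic prompt within the concept's support

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)): the regularizer in a variational objective."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Hypothetical posterior parameters and a frozen image feature.
W_mu = rng.normal(scale=0.1, size=(dim, dim))
W_logvar = rng.normal(scale=0.1, size=(dim, dim))
image_feat = rng.normal(size=(dim,))

prompt, mu, logvar = sample_prompt(image_feat, W_mu, W_logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
print(prompt.shape, kl)
```

Drawing multiple samples per image yields a distribution over prompts rather than a single point, which is what lets the method cover a concept's neighborhood in the language space and expose uncertainty for fine-grained classes.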

