RETHINKING THE VALUE OF PROMPT LEARNING FOR VISION-LANGUAGE MODELS

Anonymous authors
Paper under double-blind review

Abstract

Large-scale vision-language pre-training such as CLIP has demonstrated great success in open-set visual concept learning, enabling zero-shot transfer to downstream tasks through prompting. To automate prompt engineering, prompt learning has been proposed to learn the optimal task-relevant prompts automatically. In this paper, we make several surprising observations that contradict common beliefs about prompts. We observe that even random prompts can achieve fairly good performance for zero-shot recognition. We also find that prompt learning gives comparable or worse performance than directly fine-tuning the linear classifier. Moreover, prompt learning is no more than parameter-efficient learning, and is a trade-off between optimality and generalization. Our results highlight the need to rethink existing prompt learning and to conduct more careful baseline evaluations in future research on prompt learning methods for vision-language models.

1. INTRODUCTION

Building a state-of-the-art visual recognition system is one of the core tasks in computer vision. Current state-of-the-art visual recognition systems are almost all based on Deep Neural Networks (DNNs), which can be roughly divided into two parts: a non-linear feature extractor and a linear classifier. In traditional visual recognition, where the number of classes is fixed and the labels are discrete, the standard practice is to assign each category a weight vector, which is optimized to maximize classification accuracy. Taking ResNet for ImageNet classification as an example, the weight vectors for the 1000 classes form the weight matrix W ∈ R^{1000×2048} of the linear classifier (the last fully-connected layer of ResNet), where 2048 is the dimension of the features from the feature extractor. This learning paradigm can only learn closed-set visual concepts related to the pre-defined categories, and cannot generalize to new classes beyond them.

In contrast to supervised learning over a fixed, closed set of labels, visual concept learning supervised by text has shown great potential. The main inspiration is that language is a high-level abstraction of how humans understand the world; it thus contains rich information and naturally generalizes well. A representative work is CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021), which learns joint representations of vision and language using contrastive learning on large-scale image-text data. Thanks to the rich information and generality of natural language, CLIP can learn diverse, task-agnostic visual-textual representations that generalize to many downstream tasks even under the zero-shot setting.
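The closed-set setup described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation; the feature extractor is replaced by random features, and the 1000 × 2048 shapes follow the ResNet-for-ImageNet example.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, feat_dim = 1000, 2048  # ImageNet classes, ResNet-50 feature dim
# One fixed weight vector per pre-defined category (the last FC layer).
W = rng.standard_normal((num_classes, feat_dim)) * 0.01

def classify(features):
    """Closed-set prediction: score each fixed category, take the argmax."""
    logits = features @ W.T  # (batch, num_classes)
    return logits.argmax(axis=-1)

# Stand-in for features produced by the (non-linear) feature extractor.
x = rng.standard_normal((4, feat_dim))
preds = classify(x)
print(preds.shape)  # (4,)
```

Because each column of W is tied to one pre-defined class, the classifier has no mechanism for scoring a class it was never trained on, which is exactly the limitation the text-supervised approach addresses.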
This is done by using the names of all classes of a downstream task as the text for textual feature extraction, and classifying based on the alignment score between the visual features and the textual features of each class. However, using bare class names as the text is deficient due to the lack of context. To this end, the authors of Radford et al. (2021) resort to the technique of prompt tuning (Liu et al., 2021a). Here the "prompt" is a cloze template that specifies the context of the task at hand. They find that the template "a photo of a {CLASS}." is a good prompt for image classification. With elaborate prompt engineering and ensembling, much higher zero-shot performance can be achieved. Prompt engineering has shown greater transferability than the contextless baseline of using class names alone. The drawback is that handcrafted prompt tuning requires prior knowledge about the downstream task. Moreover, as pointed out in Zhou et al. (2022b), the performance is very sensitive to slight changes in the wording of the prompt template, so prompt tuning is a non-trivial task. To solve this problem, the authors of Zhou et al. (2022b) bring the concept of prompt learning from natural language processing (NLP) and propose Context Optimization (CoOp) to automate prompt engineering in vision-language models. More recent works (Ju et al., 2021; Yao et al., 2021; Zhou et al., 2022a) continue this line of research. The core idea of these prompt learning approaches is to treat the embeddings of the words in a prompt as a set of learnable vectors, which are learned through back-propagation w.r.t. the downstream task loss. Prompts can encode context information about the target task expressed in natural language, so they can generalize well and show promising results even in the zero-shot setting. Prompt learning, which automatically optimizes the prompts in the same word embedding space as natural language, is believed to have two advantages.
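The zero-shot classification scheme above can be sketched as follows. The encoders here are stand-ins (random, unit-normalized features); a real implementation would obtain them from CLIP's frozen image and text encoders (e.g. the `clip` or `open_clip` packages). Only the scoring logic, alignment by cosine similarity between image features and per-class prompt features, is the point of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

classes = ["cat", "dog", "car"]
# CLIP's handcrafted template, filled in with each class name.
prompts = [f"a photo of a {c}." for c in classes]

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the frozen encoders' outputs (512-d joint embedding space).
text_feats = l2_normalize(rng.standard_normal((len(prompts), 512)))
image_feats = l2_normalize(rng.standard_normal((4, 512)))

# Alignment score = cosine similarity; each image is assigned the class
# whose prompt embedding it aligns with best.
logits = image_feats @ text_feats.T  # (4, 3)
preds = logits.argmax(axis=-1)
print([classes[i] for i in preds])
```

Note that swapping in a different class list only changes the prompt strings, not any learned weights, which is what makes the scheme zero-shot.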
First, it is believed that prompt learning converges faster and requires fewer training examples than fine-tuning, because only the context vectors are updated while the pre-trained parameters of both the text encoder and the image encoder remain fixed. Moreover, during gradient computation, the pre-trained knowledge encoded in the text encoder can be propagated back through the network into the context vectors. Therefore, prompt learning is commonly believed to be superior to linear probing, partial fine-tuning, or even full fine-tuning. Second, it is believed that the learned prompts are robust and generalize well: since the optimization is conducted in the NLP embedding space, the learned prompts are expected to generalize in the same way as natural language. In this paper, we test these two beliefs by evaluating the prompt tuning/learning performance of CLIP on various downstream tasks. We start by examining the influence of the text encoder on the prompts through handcrafted and random prompts, and show that the text encoder can indeed provide some regularization on the prompts. To our surprise, we find that even random prompts can still achieve fairly good performance for zero-shot recognition. Then, we compare prompt learning and fine-tuning for closed-set recognition, and observe that prompt learning gives comparable or worse performance than directly fine-tuning the weights of the linear classifier. Last, we examine the generalization ability of the learned prompts, and reveal that prompt learning is no more than parameter-efficient learning, and is a trade-off between optimality and generalization.
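The "parameter-efficient learning" view can be made concrete by counting what each approach actually trains. The sketch below follows the CoOp setup, where the only trainable parameters are a handful of shared context-token embeddings prepended to each class name before the frozen text encoder; the dimensions (16 context tokens, 512-d embeddings) match CoOp's defaults, but the arrays here are random placeholders rather than real CLIP embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, embed_dim, n_classes = 16, 512, 10

# CoOp-style prompt learning: trainable shared context vectors ...
ctx = rng.standard_normal((n_ctx, embed_dim)) * 0.02
# ... prepended to frozen class-name token embeddings.
class_tokens = rng.standard_normal((n_classes, embed_dim))

def build_prompts(ctx, class_tokens):
    """Form the token sequence per class that the frozen text encoder
    would consume: [ctx_1, ..., ctx_M, class_name]."""
    return np.stack([np.concatenate([ctx, t[None]], axis=0)
                     for t in class_tokens])  # (n_classes, n_ctx + 1, d)

prompts = build_prompts(ctx, class_tokens)

# Trainable parameter counts, prompt learning vs. a linear probe on the
# same 512-d features:
print(ctx.size)               # 16 * 512 = 8192
print(n_classes * embed_dim)  # 10 * 512 = 5120
```

Both parameter counts are tiny compared with the frozen backbone, which is why the comparison in this paper is between two parameter-efficient heads: one optimized indirectly through the text encoder, the other optimized directly as classifier weights.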

2. RELATED WORKS

Prompt learning was originally proposed to transfer knowledge from pre-trained language models to downstream tasks, and has demonstrated great performance in the NLP domain (Devlin et al., 2018; Brown et al., 2020). A typical example of prompt learning is the "fill-in-the-blank" cloze template (Petroni et al., 2019), which transforms the downstream task into a format familiar to the pre-trained model. Instead of manually designing prompt templates, later studies focus on automated prompt learning, which can be categorized into discrete prompts and continuous prompts (Liu et al., 2021a). Discrete prompts are searched in a discrete space, e.g. natural language phrases, and most works generate them by gradient-based search (Wallace et al., 2019), prompt mining (Jiang et al., 2020), or prompt generation (Gao et al., 2020), etc. Instead of limiting the prompt to the human-interpretable natural language domain, continuous prompts in the embedding space of the model have been proposed. Representative methods for continuous prompt learning include prefix tuning (Li & Liang, 2021), tuning initialized with discrete prompts (Zhong et al., 2021), and hard-soft prompt hybrid tuning (Liu et al., 2021b). Motivated by the strong performance of prompt learning in NLP, researchers have recently begun to apply it to vision-language models. CLIP (Radford et al., 2021) uses a manually designed prompt on the text encoder, which enables zero-shot image classification with a vision-language model. To avoid human effort in prompt design, CoOp (Zhou et al., 2022b) proposes a continuous prompt learning method with two implementations that can be applied to different recognition tasks. Yet CoOp (Zhou et al., 2022b) seems to overfit the base classes during training, resulting in inferior performance on unseen classes even within the same dataset. To cure this problem, CoCoOp (Zhou et al., 2022a) proposes to generate an input-conditional vector for each image with a lightweight neural network, which boosts performance on new classes. Although CoOp and CoCoOp achieve promising improvements, they require supervised data from the target datasets, which may restrict model scalability. On the contrary, Huang et al. (2022) propose an unsupervised prompt learning (UPL) method, which improves the transfer performance of CLIP-like vision-language models.

