RETHINKING THE VALUE OF PROMPT LEARNING FOR VISION-LANGUAGE MODELS

Anonymous authors
Paper under double-blind review

Abstract

Large-scale vision-language pre-training such as CLIP has demonstrated great success in open-set visual concept learning, enabling zero-shot transfer to downstream tasks through prompting. To automate prompt engineering, prompt learning has been proposed to automatically learn optimal task-relevant prompts. In this paper, we make some surprising observations that contradict common beliefs about prompts. We observe that even random prompts can achieve reasonably good zero-shot recognition performance. We also find that prompt learning performs comparably to, or worse than, directly fine-tuning a linear classifier. Moreover, prompt learning is no more than a form of parameter-efficient learning, and represents a trade-off between optimality and generalization. Our results highlight the need to rethink existing prompt learning and to evaluate baselines more carefully in future research on prompt learning methods for vision-language models.

1. INTRODUCTION

Building a state-of-the-art visual recognition system is one of the core tasks in computer vision. Current state-of-the-art visual recognition systems are almost all based on Deep Neural Networks (DNNs), which can be roughly divided into two parts: a non-linear feature extractor and a linear classifier. In traditional visual recognition, where the number of classes is fixed and the labels are discrete, the standard practice is to assign each category a weight vector, which is optimized to maximize classification accuracy. Taking ResNet-50 for ImageNet classification as an example, the weight vectors for the 1000 classes form the weight matrix W ∈ R^{1000×2048} of the linear classifier (the last fully-connected layer of the ResNet), where 2048 is the dimension of the features produced by the feature extractor. This learning paradigm can only learn closed-set visual concepts tied to the pre-defined categories, and cannot generalize to new classes beyond this closed set.

In contrast to supervised learning over a fixed, closed set of labels, visual concept learning supervised by text has shown great potential. The main inspiration is that language is a high-level abstraction of how humans understand the world; it therefore carries rich information and naturally generalizes well. A representative work is CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021), which learns joint representations of vision and language using contrastive learning on large-scale image and text data. Thanks to the rich information and generality of natural language, the CLIP model learns diverse and task-agnostic visual-textual representations that generalize to many downstream tasks even in the zero-shot setting. This is done by using the names of all classes of a downstream task as the text for textual feature extraction, and classifying an image according to the alignment score between its visual features and the textual features of each class.

However, using bare class names as the text is deficient due to the lack of context. To this end, the authors of Radford et al. (2021) resort to prompting (Liu et al., 2021a). Here a "prompt" is a cloze template that specifies the context of the task at hand. They find that the template "a photo of a {CLASS}." is a good prompt for image classification, and that elaborate prompt engineering and prompt ensembling yield much higher zero-shot performance. Prompt engineering has shown greater transferability than the context-free baseline of using class names alone. The drawback is that handcrafted prompt engineering requires prior knowledge about the downstream task. Moreover, as pointed out in Zhou et al. (2022b), the performance is very sensitive to the wording of the prompt.
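As a concrete illustration of the closed-set paradigm described above, the following is a minimal PyTorch sketch of a feature extractor followed by a linear classifier whose weight matrix holds one vector per pre-defined category. It assumes torchvision's ResNet-50; the batch size and dummy input are illustrative placeholders.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()                    # non-linear feature extractor (weights omitted for brevity)
feature_dim = backbone.fc.in_features    # 2048 for ResNet-50
backbone.fc = nn.Identity()              # drop the original classification head

num_classes = 1000
classifier = nn.Linear(feature_dim, num_classes)  # weight matrix W in R^{1000x2048}

images = torch.randn(8, 3, 224, 224)     # dummy batch of images
with torch.no_grad():
    features = backbone(images)          # (8, 2048) feature vectors
logits = classifier(features)            # (8, 1000) scores over the closed-set categories
```

Because the classifier has exactly one weight vector per pre-defined category, adding a new class requires changing and retraining the head; this is the limitation that text-supervised models sidestep.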
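To make the zero-shot mechanism and the role of the prompt template concrete, below is a minimal sketch of CLIP zero-shot classification, assuming OpenAI's open-source clip package (github.com/openai/CLIP); the class names and image path are illustrative placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car"]  # placeholder downstream classes
# The handcrafted prompt supplies the context that bare class names lack.
prompts = [f"a photo of a {c}." for c in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Classify by the alignment (cosine similarity) between the visual
    # features and the textual features of each class.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T

pred = logits.argmax(dim=-1).item()
print(class_names[pred])
```

Prompt ensembling, as used by Radford et al. (2021), averages the normalized text embeddings of each class over many such templates before computing the alignment scores.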

