PROMPT TUNING WITH PROMPT-ALIGNED GRADIENT FOR VISION-LANGUAGE MODELS

Abstract

Thanks to large pre-trained vision-language models (VLMs) like CLIP (Radford et al., 2021), we can craft a zero-shot classifier by discrete prompt design, e.g., the confidence score of an image being "[CLASS]" can be obtained from the VLM-provided similarity between the image and the prompt sentence "a photo of a [CLASS]". Furthermore, prompting shows great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the soft prompts with a few samples. However, we identify a common failure: improper fine-tuning, or tuning with extremely few-shot samples, may even under-perform the zero-shot prediction. Existing methods still address this problem with traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompting. In this paper, we present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned by VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned with (or non-conflicting to) the general knowledge, which is represented as the optimization direction offered by the pre-defined prompt predictions. Extensive experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Code and theoretical proofs are provided in the Appendix.
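The core update rule described above (keeping only the gradient component that does not conflict with the general-knowledge direction) can be illustrated with a minimal sketch. This is a simplified, hypothetical implementation based solely on the description in the abstract: `g_task` stands for the gradient of the downstream few-shot loss and `g_general` for the gradient of a loss toward the zero-shot (hand-crafted prompt) predictions; both names and the projection details are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def prograd_update(g_task, g_general, lr=0.01):
    """Sketch of a prompt-aligned gradient step.

    If the task gradient agrees with the general-knowledge direction
    (non-negative inner product), use it as-is; otherwise project out
    the component that conflicts with the general direction.
    """
    if np.dot(g_task, g_general) >= 0:
        # Aligned: an ordinary gradient-descent step.
        return -lr * g_task
    # Conflicting: remove the component of g_task that opposes g_general.
    proj = np.dot(g_task, g_general) / np.dot(g_general, g_general) * g_general
    return -lr * (g_task - proj)
```

After projection, the update is orthogonal to the general-knowledge direction rather than opposed to it, so the step can fit the few-shot task without moving against the zero-shot knowledge.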

1. INTRODUCTION

After seeing and reading countless image-text pairs, large and deep vision-language models (VLMs) (Radford et al., 2021; Jia et al., 2021) can memorize the general knowledge (a.k.a. encyclopedic knowledge) about which visual patterns correspond to which textual sequences and vice versa. Thanks to the powerful language modeling of VLMs, we can establish a communication channel in human-readable natural language, i.e., a prompt (Liu et al., 2021a; Yao et al., 2021; Jin et al., 2022), to query the general knowledge. Prompting bridges the interface gap between the pre-training and downstream tasks (e.g., regression vs. classification) without the need for additional fine-tuning. For example, we can craft a concrete prompt-"a photo of a [CLASS]"-to achieve zero-shot image classification: using the popular vision-language model CLIP (Radford et al., 2021), we feed the image to the vision end and the prompt sentence to the language end, then obtain a vision-language similarity as the confidence score of classifying the image as "[CLASS]". In practice, prompt-based zero-shot image classification is not accurate because the hand-crafted prompt may not be the most machine-favorable (e.g., "this is a picture of" could be more grammatically prevalent in VLM training data), or not specific to the downstream domain (e.g., "a photo of a person doing" is better for action recognition) (Radford et al., 2021). Recently, prompt tuning or prefix tuning (Lester et al., 2021; Liu et al., 2021b; Zhou et al., 2021; 2022) has been proposed to replace the hand-crafted prompt with a set of tunable word embedding vectors, which do not have to be translatable back to human-readable words. Yet, prompt tuning is still as tricky as conventional fine-tuning: as training continues, the generalization ability may decrease and even under-perform the zero-shot baseline.
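The zero-shot scoring scheme described above can be sketched in a few lines. This is a self-contained illustration of the similarity-then-softmax scoring that CLIP-style models use, with dummy unit-normalized embeddings standing in for the actual image and per-class prompt encoders; the function name and temperature value are our assumptions for illustration.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """Class probabilities from cosine similarity between an image
    embedding and one text embedding per class prompt
    (e.g., "a photo of a [CLASS]" for each class)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Cosine similarities, scaled by a temperature as in CLIP-style models.
    logits = img @ txt.T / temperature
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```

The predicted class is simply the prompt with the highest probability; no gradient update is involved, which is what makes the classifier "zero-shot".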
As shown in Figure 1 (a&b), the prompt tuning method CoOp (Zhou et al., 2021) achieves its best results via early stopping, and its accuracy drops by up to 4% when training continues. Moreover, Figure 1 (c&d) shows that CoOp under-performs zero-shot CLIP without augmentation or enough samples from downstream tasks. To the best of our knowledge, existing methods still rely on the

