PROMPT TUNING WITH PROMPT-ALIGNED GRADIENT FOR VISION-LANGUAGE MODELS

Abstract

Thanks to large pre-trained vision-language models (VLMs) like CLIP (Radford et al., 2021), we can craft a zero-shot classifier by discrete prompt design, e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM-provided similarity between the image and the prompt sentence "a photo of a [CLASS]". Furthermore, prompting shows great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the soft prompts with a few samples. However, we find a common failure: improper fine-tuning, or learning with extremely few samples, may even under-perform the zero-shot prediction. Existing methods still address this problem with traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompting. In this paper, we present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned with (or non-conflicting with) the general knowledge, which is represented as the optimization direction offered by the pre-defined prompt predictions. Extensive experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Code and theoretical proofs are in the Appendix.

1. INTRODUCTION

After seeing and reading countless image-text pairs, large and deep vision-language models (VLMs) (Radford et al., 2021; Jia et al., 2021) can memorize the general knowledge (a.k.a. encyclopedic knowledge) about which visual patterns correspond to which textual sequences and vice versa. Thanks to the powerful language modeling of VLMs, we can establish a communication channel in human-readable natural language, i.e., a prompt (Liu et al., 2021a; Yao et al., 2021; Jin et al., 2022), to query the general knowledge. Prompting bridges the interface gap between the pre-trained and downstream tasks (e.g., regression vs. classification) without the need for additional fine-tuning adaptation. For example, we can craft a concrete prompt, "a photo of a [CLASS]", to achieve zero-shot image classification: using the popular vision-language model CLIP (Radford et al., 2021), we input the image to the vision end and the prompt sentence to the language end, then obtain a vision-language similarity as the confidence score of classifying the image as "[CLASS]". In practice, prompt-based zero-shot image classification is not accurate because the hand-crafted prompt may not be the most machine-favorable (e.g., "this is a picture of" could be more grammatically prevalent in VLM training data), or not specific to the downstream domain (e.g., "a photo of a person doing" is better for action recognition) (Radford et al., 2021). Recently, prompt tuning or prefix tuning (Lester et al., 2021; Liu et al., 2021b; Zhou et al., 2021; 2022) has been proposed to replace the hand-crafted prompt with a set of tunable word embedding vectors, which do not have to be translatable back to human-readable words. Yet, prompt tuning is still as tricky as conventional fine-tuning: as training continues, the generalization ability may decrease and even under-perform the zero-shot baseline.
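The zero-shot classification described above reduces to a cosine-similarity comparison between an image feature and one text feature per class prompt. A minimal sketch, assuming the image and prompt embeddings have already been extracted by a CLIP-like model (the function and argument names here are illustrative, not CLIP's actual API):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """Score an image against one prompt embedding per class.

    image_emb: (d,) image feature from the vision encoder.
    text_embs: (C, d) features of prompts like "a photo of a [CLASS]".
    temperature: logit scale (CLIP uses a learned scale near 100).
    Returns a (C,) array of softmax confidence scores.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * text_embs @ image_emb
    # Numerically stable softmax over classes.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

The class with the highest score is the zero-shot prediction; no parameter of the VLM is updated in this regime.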
As shown in Figure 1 (a&b), the prompt tuning method CoOp (Zhou et al., 2021) achieves its best results via early stopping, and its accuracy drops heavily, by up to 4%, when training continues. Besides, Figure 1 (c&d) shows that CoOp under-performs zero-shot CLIP without augmentation or enough samples from the downstream task. To the best of our knowledge, existing methods still rely on traditional anti-overfitting techniques such as early stopping and data augmentation.

To this end, we present a novel prompt tuning method called Prompt-aligned Gradient (ProGrad) to overcome the improperly biased tuning of CLIP. The principle of ProGrad is to regularize each tuning step so that it does not conflict with the general knowledge offered by the original prompt, e.g., the zero-shot CLIP predictions. Specifically, we measure the general knowledge direction G_g using the gradient of the Kullback-Leibler (KL) divergence between the predictions of the zero-shot prompted CLIP and the few-shot fine-tuned model, which we name the general direction. Similarly, we compute the domain-specific knowledge direction G_d using the gradient of the cross-entropy between the ground truth and the few-shot fine-tuned model, dubbed the domain-specific direction. We decompose the domain-specific direction G_d into: 1) a vector G_⊥ orthogonal to the general direction, which denotes the non-conflicting domain-specific knowledge; and 2) a vector G_∥ parallel to the general direction, which relates to the general knowledge. Note that the first component does NOT override the general direction, as any two orthogonal vectors can serve as two non-conflicting basis vectors. The second component must point in one of two directions: 1) the same direction as the general direction, which indicates that the update is aligned with the general knowledge; or 2) the opposite of the general direction, indicating a conflicting update that should be discarded to avoid forgetting. Overall, in each iteration, ProGrad only updates the parameters in the prompt-aligned direction, i.e., the direction that has an acute angle to the general direction.
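The per-step update rule above can be sketched in a few lines, assuming the two gradients G_d and G_g have already been computed as flat vectors (a minimal sketch of the decomposition, not the authors' full implementation):

```python
import numpy as np

def prograd_update(G_d, G_g):
    """Prompt-aligned gradient, following the decomposition described above.

    G_d: domain-specific gradient (cross-entropy w.r.t. the ground truth).
    G_g: general direction (gradient of the KL divergence to the
         zero-shot CLIP predictions).
    If G_d does not conflict with G_g (non-negative inner product), it is
    used as-is; otherwise the component of G_d anti-parallel to G_g is
    discarded, keeping only the orthogonal component G_perp.
    """
    inner = np.dot(G_d, G_g)
    if inner >= 0:
        # Acute angle: the update is already aligned with general knowledge.
        return G_d
    # Obtuse angle: subtract the conflicting parallel component G_par,
    # leaving the non-conflicting orthogonal part.
    return G_d - (inner / np.dot(G_g, G_g)) * G_g
```

In the conflicting case the returned vector is orthogonal to G_g by construction, so a gradient step along it does not move against the general direction.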



Figure 1: Comparison of zero-shot CLIP, CoOp, and our ProGrad on the Stanford Cars and OxfordPets datasets. (a)&(b): Given 1-shot training samples, CoOp's performance severely drops and under-performs zero-shot CLIP by large margins as training continues. (c)&(d): CoOp may fail to improve over CLIP without data augmentation or plenty of samples.

