PROMPT TUNING WITH PROMPT-ALIGNED GRADIENT FOR VISION-LANGUAGE MODELS

Abstract

Thanks to large pre-trained vision-language models (VLMs) like CLIP (Radford et al., 2021), we can craft a zero-shot classifier by discrete prompt design, e.g., the confidence score of an image being "[CLASS]" can be obtained from the VLM-provided similarity between the image and the prompt sentence "a photo of a [CLASS]". Furthermore, prompting shows great potential for fast adaptation of VLMs to downstream tasks if we fine-tune soft prompts with a few samples. However, we find a common failure: improper fine-tuning or learning with extremely few samples may even under-perform zero-shot prediction. Existing methods still address this problem with traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompting. In this paper, we present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned by VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned with (or non-conflicting with) the general knowledge, which is represented by the optimization direction offered by the pre-defined prompt predictions. Extensive experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Code and theoretical proofs are in Appendix.

1. INTRODUCTION

After seeing and reading countless image-text association pairs, large and deep vision-language models (VLMs) (Radford et al., 2021; Jia et al., 2021) can memorize general knowledge (a.k.a. encyclopedic knowledge) about which visual patterns correspond to which textual sequences and vice versa. Thanks to the powerful language modeling of VLMs, we can establish a communication channel in human-readable natural language, i.e., a prompt (Liu et al., 2021a; Yao et al., 2021; Jin et al., 2022), to query the general knowledge. Prompting bridges the interface gap between the pre-trained and downstream tasks (e.g., regression vs. classification) without additional fine-tuning. For example, we can craft a concrete prompt, "a photo of a [CLASS]", to achieve zero-shot image classification: using the popular vision-language model CLIP (Radford et al., 2021), we input the image to the vision end and the prompt sentence to the language end, then obtain a vision-language similarity as the confidence score of classifying the image as "[CLASS]". In practice, prompt-based zero-shot image classification is not accurate because the hand-crafted prompt may not be the most machine-favorable (e.g., "this is a picture of" could be grammatically more prevalent in VLM training) or not specific to the downstream domain (e.g., "a photo of a person doing" is better for action recognition) (Radford et al., 2021). Recently, prompt tuning, or prefix tuning (Lester et al., 2021; Liu et al., 2021b; Zhou et al., 2021; 2022), has been proposed to replace the hand-crafted prompt with a set of tunable word embedding vectors, which do not have to be translatable back to human-readable words. Yet, prompt tuning is still as tricky as conventional fine-tuning: as training continues, the generalization ability may decrease and even under-perform the zero-shot baseline.
As shown in Figure 1 (a&b), the prompt tuning method CoOp (Zhou et al., 2021) achieves its best results via early stopping, and its accuracy drops by up to 4% when training continues. Besides, Figure 1 (c&d) show that CoOp underperforms zero-shot CLIP without augmentation or enough samples from the downstream tasks. To the best of our knowledge, existing methods still rely on conventional anti-overfitting techniques such as early stopping and data augmentation (Zhou et al., 2021; 2022; Gao et al., 2021b; Qin & Joty, 2022a), which lack a principled solution to the nature of improper prompt tuning. Furthermore, Grad-CAM visualizations indicate that the fine-tuned prompt misleads the VLM into forgetting the general knowledge that classification should at least focus on the foreground object rather than the background. Comparing CoOp (Figure 2 (b)) with zero-shot CLIP (Figure 2 (c)), we find that the CoOp model distracts its attention to the background, while CLIP mainly focuses on the foreground object. These results demonstrate the over-fitting risk of existing prompt tuning strategies, especially when the number of training samples is extremely limited (e.g., 1 or 2). To this end, we present a novel prompt tuning method called Prompt-aligned Gradient (ProGrad) to overcome the improperly biased tuning for CLIP. The principle of ProGrad is to regularize each tuning step not to conflict with the general knowledge offered by the original prompt, e.g., the zero-shot CLIP predictions. Specifically, we measure the general knowledge direction G_g using the gradient of the Kullback-Leibler (KL) divergence between the predictions of the zero-shot prompted CLIP and the few-shot fine-tuned model, which we name the general direction. Similarly, we compute the domain-specific knowledge direction G_d using the gradient of the cross-entropy between the ground-truth and the few-shot fine-tuned model, dubbed the domain-specific direction.
We decompose the domain-specific direction G_d into: 1) a vector G_⊥ orthogonal to the general direction, which denotes the non-conflicting domain-specific knowledge; and 2) a vector G_∥ parallel to the general direction, which denotes the general knowledge. Note that the first component does NOT override the general direction, as any two orthogonal vectors can be transformed into two non-conflicting basis vectors. The second component must point in one of two directions: 1) the same as the general direction, indicating that the update is aligned with the general knowledge, or 2) the opposite of the general direction, indicating a conflicting update that should be discarded to avoid forgetting. Overall, in each iteration, ProGrad only updates the parameters in the prompt-aligned direction that forms an acute angle with the general direction. Compared to CoOp and CLIP, both G_g and G_⊥ (Figure 2 (d&e)) help regularize the model to focus on the foreground, and our ProGrad (Figure 2 (f)) can further improve the visual response. Following CLIP, CoOp, and CoCoOp (Zhou et al., 2022), we evaluate ProGrad under the few-shot learning, domain generalization, base-to-new generalization, and cross-dataset transfer settings over 11 image classification benchmark datasets, covering generic object classification, fine-grained image recognition, and action classification. In summary, ProGrad achieves: 1) clear improvement over CoOp on all 11 datasets; 2) clear improvement in the harmonic mean of base-class and new-class accuracies on all 11 datasets compared to CoOp and CoCoOp; and 3) clear improvement on both the source and target datasets of domain generalization.

2. RELATED WORK

Fine-tuning for VLMs. Fine-tuning adapts VLMs to various downstream tasks, e.g., visual question answering (Kim et al., 2021; Tan & Bansal, 2019), visual grounding (Yao et al., 2021), image retrieval (Lu et al., 2019), semantic segmentation (Rao et al., 2021), and image classification (Zhou et al., 2021; 2022). We focus on the image classification task. The conventional "pre-train then fine-tune" paradigm, which plugs an additional classifier on top of the visual backbone and trains it on downstream data, has been widely adopted, e.g., Linear Probe (Radford et al., 2021). CLIP-Adapter (Gao et al., 2021a) and Tip-Adapter (Zhang et al., 2021) add vision and language feature adapters to boost conventional fine-tuning results. Recently, the NLP community presented a novel fine-tuning paradigm named "prompt-based learning", which is formulated as a "fill-in-the-blank" cloze test and fine-tunes the prompt to maximize the ground-truth token (Lester et al., 2021; Liu et al., 2021b). In the CV community, CoOp (Zhou et al., 2021) uses continuous prompt optimization from downstream data instead of hand-crafted design. CoCoOp (Zhou et al., 2022) further extends CoOp by learning an image-conditional prompt rather than a static one to improve generalization to unseen classes. ProDA (Lu et al., 2022) adapts VLMs to downstream classification tasks by learning a prompt distribution over the output embedding space. VPT (Derakhshani et al., 2022) introduces variational prompt tuning by combining a base learned prompt with a residual vector sampled from an instance-specific underlying distribution. Our proposed ProGrad follows the line of prompt-based learning and improves both few-shot classification performance and generalization ability by aligning the gradient with the general direction, without modifying the model structure or tuning the pre-trained model parameters.

Knowledge Transfer.
Forgetting mitigation by knowledge distillation or memory replay is widely deployed in incremental learning (Liu et al., 2020; Rebuffi et al., 2017; Qin & Joty, 2022b; Riemer et al., 2018; Hu et al., 2021). However, prompt-based fine-tuning is fundamentally different from incremental learning: the former assumes that VLMs have already captured all the knowledge needed in downstream tasks and the goal is to compose a domain-specific query, whereas the latter assumes that the knowledge is not yet sufficient. In addition, incremental learning requires old data from memory storage, while our prompt-based learning method has no access to the pre-training data. For example, OGD (Farajtabar et al., 2020) projects the gradients of new classes onto the direction orthogonal to the gradients of previous tasks. However, since we have no access to the pre-training process, storing the gradients of old tasks as OGD requires is impossible for prompt tuning. Moreover, OGD alters the gradients of downstream tasks even in non-conflicting scenarios, which potentially results in sub-optimal downstream performance. Other related fields that leverage gradient matching to transfer knowledge are domain generalization (Shi et al., 2022; Rame et al., 2021) and multi-task learning (Sener & Koltun, 2018; Yu et al., 2020). However, their methods are not directly applicable to prompt tuning, whose transfer direction is only from general to downstream. In Appendix, we show how their methods fail in several ablative studies.

3. METHODOLOGY

In this section, we introduce the preliminary concepts of hand-crafted prompt-based zero-shot inference and prompt-based learning, and present our proposed Prompt-aligned Gradient solution, which aligns the domain-specific knowledge with the general knowledge for few-shot generalization.

[Figure 3: Illustration of ProGrad. (a) When G_d and G_g form an acute angle, G_prograd = G_d. (b) When they conflict, G_d is projected onto the direction orthogonal to G_g to obtain G_prograd. (c) Pipeline: learnable context vectors [V]_1, [V]_2, ..., [V]_M are combined with each class token and fed to the fixed text encoder to produce text features w_1, ..., w_K; the CE loss is computed against the label y and the KL loss against the zero-shot prediction p_zs. Both encoders are fixed; only the context is tunable.]

3.1. PRELIMINARIES

Contrastive language-image pre-training (CLIP) (Radford et al., 2021) adopts a contrastive pre-training paradigm on tremendous pairs of images and natural language descriptions. For contrastive learning, associated image-sentence pairs are taken as positive samples, while non-associated pairs are regarded as negative samples. The contrastive objective maximizes the similarity of positive pairs while minimizing the similarity of negative pairs. Zero-shot transfer inference adapts the pre-trained CLIP model to downstream tasks without fine-tuning the model. Taking image classification as an example, zero-shot transfer is enabled by formulating the classification task as an image-text matching problem, where the text is obtained by extending the "[CLASS]" name with a template like "a photo of a [CLASS].". CLIP (Radford et al., 2021) finds that such a simple template narrows the distribution gap to the pre-training text inputs. The image-class matching score is measured by the cosine similarity $\langle w_i, f \rangle$ between the image feature $f$ and the class-extended text feature $w_i$ of the $i$-th class. The image feature $f$ for image $x$ is extracted by the image encoder, while the text feature $w_i$ for the $i$-th class is obtained by feeding the prompt description into the text encoder. The probability of the $i$-th class is

$$p_{zs}(w_i|x) = \frac{\exp(\langle w_i, f \rangle / \tau)}{\sum_{j=1}^{K} \exp(\langle w_j, f \rangle / \tau)},$$  (1)

where $K$ denotes the number of classes and $\tau$ is a temperature learned by CLIP. Prompt-based learning further strengthens the transfer ability of the CLIP model and avoids prompt engineering by automatically learning the prompt from a few samples of the downstream task. Different from zero-shot transfer, which uses a fixed hand-crafted prompt, CoOp (Zhou et al., 2021) constructs and fine-tunes a set of $M$ continuous context vectors $\mathbf{v} = \{v_1, v_2, \ldots, v_M\}$ as the tunable prompt.
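The zero-shot inference above can be sketched as a softmax over temperature-scaled cosine similarities. This is an illustrative stand-alone version; the feature shapes and the default $\tau = 0.01$ are assumptions for the sketch, and in CLIP the features would come from the frozen image and text encoders.

```python
import math

def zero_shot_probs(image_feat, text_feats, tau=0.01):
    """Sketch of zero-shot CLIP inference: p_zs(w_i|x) is the softmax
    over cosine similarities <w_i, f> / tau.  `image_feat` plays the
    role of f, `text_feats` holds one class-extended text embedding
    w_i per class (both plain lists of floats here)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    f_norm = norm(image_feat)
    # Cosine similarity between the image feature and each class feature.
    sims = [sum(wi * xi for wi, xi in zip(w, image_feat)) / (norm(w) * f_norm)
            for w in text_feats]
    m = max(s / tau for s in sims)              # subtract max for stability
    exps = [math.exp(s / tau - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]                # probabilities over K classes
```

With a unit image feature matching the first class feature exactly, almost all probability mass lands on that class because of the small temperature.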
Specifically, the prompt $t_i = \{v_1, v_2, \ldots, v_M, c_i\}$ combines the learnable context vectors $\mathbf{v}$ with the class token embedding $c_i$, and is fed to the text encoder $g(\cdot)$. CoOp optimizes the static context vectors $\mathbf{v}$ by minimizing the negative log-likelihood of the ground-truth class:

$$\mathcal{L}_{ce}(\mathbf{v}) = -\sum_i y_i \log p(t_i|x), \quad p(t_i|x) = \frac{\exp(\langle g(t_i), f \rangle / \tau)}{\sum_{j=1}^{K} \exp(\langle g(t_j), f \rangle / \tau)},$$  (2)

where $y$ denotes the one-hot ground-truth annotation and $K$ denotes the number of classes.
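The prompt assembly and CoOp's cross-entropy objective can be sketched as below. This is a toy version under assumed shapes: in CoOp the context and class embeddings are word-embedding vectors fed to the frozen text encoder $g$, whereas here they are plain lists.

```python
import math

def coop_prompt(context, class_embed):
    """t_i = {v_1, ..., v_M, c_i}: the shared learnable context vectors
    followed by the class token embedding (list-of-vectors sketch)."""
    return context + [class_embed]

def ce_loss(probs, label):
    """L_ce(v) = -sum_i y_i log p(t_i|x) with one-hot y, i.e. the
    negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label] + 1e-12)
```

For a uniform two-class prediction the loss equals ln 2, the usual cross-entropy baseline.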

3.2. PROMPT-ALIGNED GRADIENT

As introduced in Section 1, CoOp faces the challenge that its transfer performance drops when the number of annotations is very limited (e.g., one per class), even underperforming zero-shot transfer. CoOp also relies heavily on anti-overfitting techniques such as early stopping and data augmentation. To overcome this over-fitting challenge, we propose an effective and efficient fine-tuning paradigm, ProGrad, that aligns the few-shot downstream knowledge with the large-scale general knowledge. Motivated by the success of knowledge distillation (Phuong & Lampert, 2019; Hinton et al., 2015) in knowledge transfer, we leverage the zero-shot CLIP predictions as the general knowledge and compare the fine-tuned predictions with the general knowledge to regularize the gradient direction. Specifically, we obtain the domain-specific direction by computing the cross-entropy $\mathcal{L}_{ce}(\mathbf{v})$ between the model prediction $p(t_i|x)$ and the ground-truth $y$ according to Eq. (2), and the general knowledge direction from the Kullback-Leibler (KL) divergence between $p(t_i|x)$ and the zero-shot CLIP prediction $p_{zs}(w_i|x)$:

$$\mathcal{L}_{kl}(\mathbf{v}) = -\sum_i p_{zs}(w_i|x) \log \frac{p(t_i|x)}{p_{zs}(w_i|x)}.$$  (3)

We denote the gradients of $\mathcal{L}_{kl}(\mathbf{v})$ and $\mathcal{L}_{ce}(\mathbf{v})$ as $G_g = \nabla_{\mathbf{v}} \mathcal{L}_{kl}(\mathbf{v})$ and $G_d = \nabla_{\mathbf{v}} \mathcal{L}_{ce}(\mathbf{v})$, respectively. The relation between $G_g$ and $G_d$ is two-fold. (1) Their angle is smaller than 90° (Figure 3 (a)), which indicates that the optimization direction of the few-shot downstream knowledge does not conflict with the general knowledge. In this case, we safely set the update direction $G_{prograd}$ to $G_d$. (2) Their angle is larger than 90° (Figure 3 (b)), which indicates that the few-shot downstream knowledge conflicts with the general knowledge. In other words, optimizing the context vectors following $G_d$ would lead to forgetting of the pre-trained general knowledge.
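The KL loss whose gradient gives the general direction can be written compactly on probability vectors. A minimal sketch, operating directly on predicted and zero-shot probability lists rather than on encoder outputs:

```python
import math

def kl_general_loss(p, p_zs):
    """L_kl(v) = -sum_i p_zs_i * log(p_i / p_zs_i), i.e. KL(p_zs || p):
    the loss whose gradient G_g defines the general knowledge direction.
    `p` is the fine-tuned prediction, `p_zs` the zero-shot CLIP one."""
    return -sum(q * math.log((pi + 1e-12) / q)
                for q, pi in zip(p, p_zs) if False) or \
           -sum(q * math.log((pi + 1e-12) / q)
                for q, pi in zip(p_zs, p) if q > 0)
```

The loss is zero when the fine-tuned prediction matches the zero-shot one and positive otherwise, as expected of a KL divergence.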
In this case, we project $G_d$ onto the direction orthogonal to $G_g$ to optimize the model for classification without increasing the KL loss. Our ProGrad strategy is mathematically formulated as:

$$G_{prograd} = \begin{cases} G_d, & \text{if } G_d \cdot G_g \geq 0, \\ G_d - \lambda \cdot \frac{G_d \cdot G_g}{\|G_g\|^2} G_g, & \text{otherwise.} \end{cases}$$  (4)

Figure 3 (c) illustrates the pipeline of our ProGrad. Instead of updating the context vectors using $G_d$ as in CoOp (Zhou et al., 2021), we optimize them using $G_{prograd}$, which prevents the gradient direction from overfitting to the few-shot downstream samples. We further introduce $\lambda$ in Eq. (4) to generalize the formulation, which flexibly controls the strength of the general knowledge guidance. In particular, $\lambda = 1$ projects $G_d$ onto the direction orthogonal to $G_g$ (Figure 3 (b)), while $\lambda = 0$ makes ProGrad degenerate to CoOp, i.e., CoOp is a special case of our strategy. We include a detailed analysis of $\lambda$ in Appendix.

Generalization Error Analysis. We further theoretically analyze the generalization error of our ProGrad. Here we provide a proof sketch and include the detailed justification in Appendix. ProGrad keeps the optimal value of $\mathcal{L}_{kl}$ on the pre-trained domain while optimizing the empirical risk on the downstream domain. The model $\hat{f}_{prograd}$ learned by such an update rule can be viewed as optimizing the empirical risk on both the pre-trained and downstream domains (Yu et al., 2020):

$$\hat{f}_{prograd} = \arg\min_{f \in \mathcal{F}} \hat{R}_{(d+p)}(f) = \arg\min_{f \in \mathcal{F}} \hat{R}_d(f) + \hat{R}_p(f),$$

where $\mathcal{F}$ is the function class, and $R(\cdot)$ and $\hat{R}(\cdot)$ denote the expected risk and empirical risk. We bound the generalization error of ProGrad by virtue of Rademacher complexity (Bartlett & Mendelson, 2002) and Theorem 6.2 of Zhang et al. (2012). The detailed proof is in Appendix.

Theorem 1. Let $X_1^{N_d} = \{x_n^{(d)}\}_{n=1}^{N_d}$ and $X_1^{N_p} = \{x_n^{(p)}\}_{n=1}^{N_p}$ be two sets of i.i.d. samples drawn from the downstream domain $\mathcal{D}_d$ and the pre-trained domain $\mathcal{D}_p$.
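The case split of Eq. (4) is a few lines of code on gradient vectors. The sketch below treats the gradients as plain vectors for illustration; in training, this rule would replace the gradient of the context vectors before the SGD step.

```python
def prograd_update(g_d, g_g, lam=1.0):
    """Eq. (4): keep the domain-specific gradient g_d when it does not
    conflict with the general direction g_g (dot product >= 0); otherwise
    subtract the lam-scaled projection of g_d onto g_g.  With lam = 1 the
    update becomes orthogonal to g_g (no first-order increase of the KL
    loss); lam = 0 recovers plain CoOp."""
    dot = sum(a * b for a, b in zip(g_d, g_g))
    if dot >= 0:
        return list(g_d)
    gg = sum(b * b for b in g_g) + 1e-12   # ||g_g||^2, eps for safety
    return [a - lam * dot / gg * b for a, b in zip(g_d, g_g)]
```

In the conflicting case with lam = 1, the returned vector has (numerically) zero dot product with g_g, so the step no longer opposes the general knowledge.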
Then for any $\epsilon > 0$, with probability at least $1-\epsilon$,

$$R_d(\hat{f}_{prograd}) \leq \hat{R}_{(d+p)}(\hat{f}_{prograd}) + \frac{1}{2}\gamma_{\mathcal{F}}(D, P) + \mathfrak{R}_p(\mathcal{F}) + \mathfrak{R}_d(\mathcal{F}) + \frac{3}{2}\sqrt{\frac{\ln(4/\epsilon)}{2N_d}} + \frac{3}{2}\sqrt{\frac{\ln(4/\epsilon)}{2N_p}} + \frac{1}{2}\sqrt{\frac{\ln(4/\epsilon)}{2}\Big(\frac{1}{N_d} + \frac{1}{N_p}\Big)},$$

where $\gamma_{\mathcal{F}}(D, P)$ is the integral probability metric (Müller, 1997) that measures the difference between the distributions of the pre-trained and downstream domains, and $\mathfrak{R}_d(\mathcal{F})$ and $\mathfrak{R}_p(\mathcal{F})$ are the Rademacher complexities of $\mathcal{F}$. Note that the Rademacher complexity bound is inversely proportional to the number of training samples. Theorem 1 shows that the generalization error $R_d(\hat{f}_{prograd})$ is bounded by the empirical training risk $\hat{R}_{(d+p)}(\hat{f}_{prograd})$, the gap between the two domains $\gamma_{\mathcal{F}}(D, P)$, and the estimation error. The empirical training risk can be minimized to an arbitrarily small value when using deep models with high capacity. The estimation error terms related to $N_p$ asymptotically tend to 0 as the sample size $N_p$ tends to infinity. Thanks to the large number of pre-training samples $N_p$, we can approximate the generalization error bound as

$$R_d(\hat{f}_{prograd}) \leq \frac{1}{2}\gamma_{\mathcal{F}}(D, P) + \mathfrak{R}_d(\mathcal{F}) + \frac{3}{2}\sqrt{\frac{\ln(4/\epsilon)}{2N_d}} + \frac{1}{2}\sqrt{\frac{\ln(4/\epsilon)}{2N_d}}.$$

Similarly, the generalization error for the CoOp model $\hat{f}_{coop}$ satisfies

$$R_d(\hat{f}_{coop}) \leq 2\mathfrak{R}_d(\mathcal{F}) + 3\sqrt{\frac{\ln(4/\epsilon)}{2N_d}} + \sqrt{\frac{\ln(4/\epsilon)}{2N_d}}.$$

Under the assumption that the gap $\gamma_{\mathcal{F}}(D, P)$ between the pre-trained and downstream domains is small, the estimation error bound of $R_d(\hat{f}_{coop})$ is at least twice that of $R_d(\hat{f}_{prograd})$. Considering that $N_d$ is typically very small in the few-shot setting, our ProGrad model $\hat{f}_{prograd}$ achieves a much lower error bound than a conventionally fine-tuned model like CoOp.

4. EXPERIMENTS

4.1. DATASETS AND IMPLEMENTATION DETAILS

We follow CLIP, CoOp, and CoCoOp to validate the effectiveness of ProGrad in four settings: (1) few-shot classification, (2) domain generalization, (3) base-to-new generalization, and (4) cross-dataset transfer. We report and discuss the results of cross-dataset transfer in Appendix D.1.

Datasets. For the evaluations of few-shot learning and base-to-new generalization, we follow CLIP and CoOp in using 11 image classification datasets: ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004) for generic object classification; OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013) for fine-grained image recognition; EuroSAT (Helber et al., 2019) for satellite image classification; UCF101 (Soomro et al., 2012) for action classification; DTD (Cimpoi et al., 2014) for texture classification; and SUN397 (Xiao et al., 2010) for scene recognition. For domain generalization, we use ImageNet as the source dataset and ImageNetV2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a) as the target datasets.

Training Details. For few-shot learning, following CoOp and CLIP, all models are trained with {1, 2, 4, 8, 16} shots and then evaluated on the full test split. For domain generalization and base-to-new generalization, we evaluate 4-shot performance, which examines robustness under the low-shot condition. All results of learning-based models are averaged over three random seeds. Unless otherwise stated, we follow CoOp in using ResNet-50 (He et al., 2016) as the backbone of the image encoder. Following CoOp and CoCoOp, the context length M is set to 16 for few-shot classification and M = 4 for base-to-new generalization, domain generalization, and cross-dataset transfer.
We follow the same training epochs, training schedule, and data augmentation settings as CoOp. λ is set to 1 by default, except that λ is set to 0.8 for 16 shots. Please refer to Appendix for more details.

Baselines. We compare ProGrad with four baselines: (1) zero-shot CLIP, (2) linear probe CLIP, (3) CoOp, and (4) CoCoOp. Although our method can also beat other fine-tuning methods such as CLIP-Adapter (Gao et al., 2021a), we mainly focus on comparing with prompt-based learning methods. The results of other fine-tuning methods are in Appendix.

4.2. FEW-SHOT CLASSIFICATION

Setup. We compare with two zero-shot CLIP models: CLIP and CLIP++ denote using a single prompt and prompt ensembling, respectively (please refer to Appendix Section C for the templates of the single and ensemble prompts). ProGrad and ProGrad++ denote implementing ProGrad with a single prompt and with prompt ensembling as the general knowledge, respectively. Note that we only use the hand-crafted prompt ensembling to generate G_g, which provides a more accurate general direction; we still optimize a single prompt with 16 learnable tokens for ProGrad++.
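Prompt ensembling as used by CLIP++ amounts to averaging normalized text features over the hand-crafted templates. A sketch under assumed interfaces: `encode_text` (prompt string to feature vector) is a stand-in for the frozen CLIP text encoder.

```python
import math

def ensemble_text_features(encode_text, templates, classname):
    """CLIP++-style prompt ensembling sketch: encode the class name under
    every hand-crafted template, L2-normalize each feature, average them,
    and re-normalize to obtain one class feature w_i."""
    feats = []
    for t in templates:
        v = encode_text(t.format(classname))
        n = math.sqrt(sum(x * x for x in v))
        feats.append([x / n for x in v])
    mean = [sum(col) / len(feats) for col in zip(*feats)]
    n = math.sqrt(sum(x * x for x in mean))
    return [x / n for x in mean]
```

The resulting ensemble feature is unit-normalized, so it plugs directly into the cosine-similarity scoring used for zero-shot inference.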

4.3. DOMAIN GENERALIZATION

This setting evaluates the generalization ability of models on a target domain different from the source domain. Conventional fine-tuning on limited data from a specific domain may mislead the model to learn spurious correlations or in-distribution patterns, resulting in a biased model that under-performs in unseen domains. In contrast, zero-shot CLIP does not exploit such spurious correlations or patterns, since it is not fine-tuned on that distribution. Since our ProGrad uses the general knowledge from the pre-trained domain to regularize fine-tuning on a specific distribution, it is expected to be robust to the distribution shift. As shown in Table 2, despite the exposure to the source dataset, ProGrad improves over the baselines on both the source and target datasets.

4.4. BASE-TO-NEW GENERALIZATION

We follow (Zhou et al., 2022) to evaluate the generalization performance from seen classes to unseen classes. All classes are equally divided into two groups, i.e., base classes and new classes; all methods are trained only on the base classes and tested on both base and new classes. The harmonic mean of base-class and new-class accuracies is reported to evaluate the trade-off. Compared to CoOp and CoCoOp, ProGrad also generalizes well to the new classes. From Table 3, we observe that ProGrad achieves the best average performance in terms of all metrics. In contrast, CoOp and CoCoOp fail on new classes: their performance is consistently worse than zero-shot CLIP. These results highlight that ProGrad generalizes better to both seen and unseen classes. Please refer to Appendix for the results on all 11 datasets.

4.5. FURTHER ANALYSIS

Failure cases. We further analyze the failure cases where ProGrad predicts incorrectly but CoOp gives the right prediction. Specifically, we count the percentage of these failure cases that the zero-shot CLIP model also mis-classifies in Figure 5. We find that a high proportion of the failure cases are also mis-classified by the zero-shot CLIP model (red bar in Figure 5). This observation indicates that a general direction G_g generated from imprecise zero-shot general knowledge is detrimental to model generalization. As the number of samples increases, the downstream knowledge represented by G_d becomes more accurate and unbiased; as expected, we observe that the red bar becomes larger.

Conflict of knowledge. ProGrad requires the updated gradient direction to be at an acute angle to the general knowledge gradient direction. We explore how this constraint helps defuse the conflict between domain-specific and general knowledge by visualizing the angle between their representative gradients (G_d and G_g) during training. As depicted in Figure 6, for training without G_prograd, the angle between G_d and G_g converges to 90 degrees, consistent with the fact that "all high-dimensional random vectors are almost always orthogonal to each other" (Cai et al., 2013). Intuitively, without any constraint, the optimization direction G_d is independent of the general direction, so the average angle is around 90 degrees (i.e., orthogonal). In contrast, training with G_prograd leads the angle to finally converge to an obtuse angle. The reason is that G_prograd steers the model to learn the downstream knowledge aligned with the general knowledge, leaving the downstream knowledge that is incompatible with the general knowledge insufficiently learned. As training stabilizes, G_d keeps pointing toward this conflicting knowledge, which is reflected in an obtuse angle to G_g. Thanks to ProGrad, we discard such conflicting knowledge to avoid forgetting.
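The angle tracked in Figure 6 is the standard angle between the two gradient vectors. A small helper, again treating gradients as plain vectors for illustration:

```python
import math

def grad_angle_deg(g_d, g_g):
    """Angle in degrees between G_d and G_g.  About 90 degrees is
    expected for independent high-dimensional gradients; an obtuse
    angle signals knowledge that conflicts with the general direction."""
    dot = sum(a * b for a, b in zip(g_d, g_g))
    na = math.sqrt(sum(a * a for a in g_d))
    nb = math.sqrt(sum(b * b for b in g_g))
    cos = max(-1.0, min(1.0, dot / (na * nb + 1e-12)))  # clamp for acos
    return math.degrees(math.acos(cos))
```

Logging this value per iteration (for G_d versus G_g) reproduces the kind of curve discussed above.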
Comparison with conventional knowledge distillation. Since ProGrad uses the gradient direction of a knowledge distillation loss as regularization, one may wonder whether ProGrad is simply conventional knowledge distillation. We answer this question by investigating whether simple knowledge distillation (i.e., L_total = L_ce + α · L_kd) can achieve performance similar to ProGrad. We repeat the few-shot experiments on the 11 datasets with a variety of α and report the average results in Table 4. Overall, ProGrad outperforms KD across the few-shot settings. Although KD with small α ≤ 1 improves over CoOp in the low-shot regime (e.g., 1, 2, and 4 shots), its performance drops when the number of shots is large (see 8 and 16 shots). These results indicate that ProGrad works differently from KD and is more robust to the number of training samples.

Applying ProGrad to the conventional fine-tuning paradigm. We are also interested in whether ProGrad can be applied to the conventional "pre-train then fine-tune" paradigm. Specifically, we plug an additional cosine classifier on top of the visual backbone and compare the performance in the few-shot experiments. Table 5 shows that conventional fine-tuning can also benefit from ProGrad. The implementation details and per-dataset results are provided in Appendix. We also analyze the upper-bound performance of ProGrad and the effect of the hyper-parameter λ; the results and discussion are presented in Appendix D.3 and Appendix D.2.
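The KD baseline of Table 4 simply sums the two losses instead of resolving gradient conflicts. A sketch on probability vectors (not the actual training code; the KL term here stands in for L_kd computed against the zero-shot CLIP prediction):

```python
import math

def kd_total_loss(p, label, p_zs, alpha=1.0):
    """KD baseline: L_total = L_ce + alpha * L_kd.  The distillation term
    pulls the prediction toward the zero-shot CLIP output, but unlike
    ProGrad the two loss gradients are always summed, with no test for
    whether they conflict."""
    ce = -math.log(p[label] + 1e-12)
    kd = -sum(q * math.log((pi + 1e-12) / q) for q, pi in zip(p_zs, p) if q > 0)
    return ce + alpha * kd
```

When the fine-tuned prediction already matches the zero-shot one, the KD term vanishes and the total loss reduces to the cross-entropy alone; the design-level difference from ProGrad is that conflicting gradient components are never discarded.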

5. CONCLUSION

In this paper, we pointed out the over-fitting issue of existing prompt tuning methods for few-shot generalization, which rely heavily on early stopping and data augmentation and may otherwise even underperform zero-shot inference. We proposed a prompt tuning method, ProGrad, that regularizes each tuning step not to conflict with the general knowledge offered by the hand-crafted prompt. Experiments on few-shot classification, base-to-new generalization, and domain generalization over 11 datasets demonstrate the effectiveness and efficiency of ProGrad. In the future, we will explore how to apply ProGrad to other tasks such as object detection and segmentation.

ETHICS STATEMENT

No human subjects were involved during the research and developments of this work. All of our experiments were conducted on the standard benchmarks in the lab-based, controlled environment. Thus, due to the abstract nature of this work, it has minimal concerns regarding issues such as discrimination/bias/fairness, privacy, etc.

REPRODUCIBILITY STATEMENT

In this paper, we conduct each experiment three times and report the mean and standard deviation (confidence interval at 95%) to alleviate the randomness of the starting seed. In Appendix Section D.4, we provide the full details of our experimental settings.

The appendix is organized as follows:
• Section A provides the generalization error analysis for ProGrad.
• Section B provides additional training details.
• Section C lists the adopted hand-crafted prompts.
• Section D gives additional experiment results, including the analysis of the effect of the hyper-parameter λ (Section D.3); the comparison with other fine-tuning methods (e.g., CLIP-Adapter (Gao et al., 2021a), the gradient matching method (Yu et al., 2020), knowledge distillation (Hinton et al., 2015), and a conventional fine-tuning method with a cosine classifier) with confidence intervals at 95% on the 11 few-shot classification datasets (Section D.4); and additional detailed results for each dataset of the base-to-new generalization experiments (Section D.5).

A JUSTIFICATION FROM GENERALIZATION ERROR

We further analyze the generalization error bound of our ProGrad. We define the expected risk $R(\cdot)$ and empirical risk $\hat{R}(\cdot)$ of a classifier $f$ on domain $\mathcal{D}$ as

$$R(f) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}[\ell(f(X), Y)], \quad \hat{R}(f) = \frac{1}{N}\sum_{i=1}^{N} \ell(f(X_i), Y_i).$$

For the implementation of ProGrad, we initialize the model $\hat{f}_{prograd}$ with the pre-trained model $\hat{f}_p$ and regularize each training step not to increase the KL divergence between the predictions of $\hat{f}_{prograd}$ and $\hat{f}_p$. In this way, $\hat{f}_{prograd}$ keeps the optimal value of $\mathcal{L}_{kl}$ on the pre-trained domain while optimizing the empirical risk on the downstream domain. The model learned by our ProGrad can thus be viewed as optimizing the empirical risk on both domains:

$$\hat{f}_{prograd} = \arg\min_{f \in \mathcal{F}} \hat{R}_{(d+p)}(f) = \arg\min_{f \in \mathcal{F}} \hat{R}_d(f) + \hat{R}_p(f).$$

Based on Theorem 4.1 of (Yang et al., 2021), assume the neural network has $L$ layers with parameter matrices $W_1, \ldots, W_L$ whose Frobenius norms are at most $M_1, \ldots, M_L$, and that the activation functions are 1-Lipschitz continuous, positive-homogeneous, and applied element-wise. The output of the network is a softmax over $c$ classes. Let $\mathcal{F}$ be a function class with range $[a, b]$, and let the distribution be such that $\|x\| \leq B$. Let $X_1^{N_d} = \{x_n^{(d)}\}_{n=1}^{N_d}$ and $X_1^{N_p} = \{x_n^{(p)}\}_{n=1}^{N_p}$ be two sets of i.i.d. samples drawn from the downstream domain $\mathcal{D}_d$ and the pre-trained domain $\mathcal{D}_p$. Then for any $\epsilon > 0$, with probability at least $1-\epsilon$,

$$R_d(\hat{f}_{prograd}) \leq \hat{R}_{(d+p)}(\hat{f}_{prograd}) + \frac{1}{2}\gamma_{\mathcal{F}}(D, P) + \frac{cB\big(\sqrt{2\log(2)L}+1\big)\prod_{j=1}^{L} M_j}{\sqrt{N_p}} + \frac{cB\big(\sqrt{2\log(2)L}+1\big)\prod_{j=1}^{L} M_j}{\sqrt{N_d}} + \frac{3}{2}(b-a)\sqrt{\frac{\ln(4/\epsilon)}{2N_d}} + \frac{3}{2}(b-a)\sqrt{\frac{\ln(4/\epsilon)}{2N_p}} + \frac{1}{2}(b-a)\sqrt{\frac{\ln(4/\epsilon)}{2}\Big(\frac{1}{N_d}+\frac{1}{N_p}\Big)},$$  (13)

where $\gamma_{\mathcal{F}}(D, P)$ is the integral probability metric (Müller, 1997) that measures the difference between the distributions of the pre-trained domain and the downstream domain.
Eq. (13) shows that the generalization error $R_d(\hat{f}_{prograd})$ is bounded by the empirical training risk $\hat{R}_{(d+p)}(\hat{f}_{prograd})$, the two-domain gap $\gamma_{\mathcal{F}}(D, P)$, and the estimation error, which is inversely proportional to the numbers of training samples, i.e., $N_d$ and $N_p$. The empirical training risk can be minimized to an arbitrarily small value, and the estimation error terms related to $N_p$ asymptotically tend to 0 as the sample size $N_p$ tends to infinity. Thanks to the large number of pre-training samples $N_p$, we can approximate the generalization error bound for the model learned by ProGrad as

$$R_d(\hat{f}_{prograd}) \leq \frac{1}{2}\gamma_{\mathcal{F}}(D, P) + \frac{cB\big(\sqrt{2\log(2)L}+1\big)\prod_{j=1}^{L} M_j}{\sqrt{N_d}} + \frac{3}{2}(b-a)\sqrt{\frac{\ln(4/\epsilon)}{2N_d}} + \frac{1}{2}(b-a)\sqrt{\frac{\ln(4/\epsilon)}{2N_d}}.$$  (14)

Similarly, the generalization error for $\hat{f}_{coop}$ satisfies

$$R_d(\hat{f}_{coop}) \leq \frac{2cB\big(\sqrt{2\log(2)L}+1\big)\prod_{j=1}^{L} M_j}{\sqrt{N_d}} + 3(b-a)\sqrt{\frac{\ln(4/\epsilon)}{2N_d}} + (b-a)\sqrt{\frac{\ln(4/\epsilon)}{2N_d}}.$$  (15)

If the gap between the pre-trained domain $\mathcal{D}_p$ and the downstream domain $\mathcal{D}_d$ is very small, $\gamma_{\mathcal{F}}(D, P)$ tends to 0. Under this assumption, the estimation error bound of $R_d(\hat{f}_{coop})$ is at least twice that of $R_d(\hat{f}_{prograd})$. Considering that $N_d$ is typically very small in the few-shot setting, our ProGrad model $\hat{f}_{prograd}$ enjoys a much lower error bound than the conventionally fine-tuned model $\hat{f}_{coop}$.

B ADDITIONAL IMPLEMENTATION DETAILS

For the ProGrad implementation, we first initialize the learnable context vectors v with the word embeddings of the zero-shot hand-crafted prompt. Concretely, if the context length M is 16 and the hand-crafted prompt is "a photo of a", which has only 4 tokens, we initialize the first 12 context vectors with zeros and the last 4 context vectors with the word embeddings of "a photo of a". We follow the training settings of CoOp (Zhou et al., 2021): all prompt-based models are trained by SGD with an initial learning rate of 0.002, decayed by the cosine annealing rule. During the first epoch, we use the warm-up trick, fixing the learning rate to 1 × 10⁻⁵ to alleviate gradient explosion. The number of training epochs is set to 50 for all shots on ImageNet. For the remaining 10 datasets, it is set to 50 for 1 shot, 100 for 2/4 shots, and 200 for 8/16 shots. We train all prompt-based models with a batch size of 32, except for CoCoOp: as described in (Zhou et al., 2022), CoCoOp consumes a significant amount of GPU memory if the batch size is set larger than one, so we set its batch size to 1, following the original setting. Our experiments are conducted on one 2080Ti GPU for all datasets except ImageNet, where we train the models on one A100 GPU.

The prompt ensembling templates from CLIP for ImageNet (Table 7) are:

"a bad photo of a {}." "a photo of many {}." "a sculpture of a {}." "a photo of the hard to see {}." "a low resolution photo of the {}." "a rendering of a {}." "graffiti of a {}." "a bad photo of the {}." "a cropped photo of the {}." "a tattoo of a {}." "the embroidered {}." "a photo of a hard to see {}." "a bright photo of a {}." "a photo of a clean {}." "a photo of a dirty {}." "a dark photo of the {}." "a drawing of a {}." "a photo of my {}." "the plastic {}." "a photo of the cool {}." "a close-up photo of a {}." "a black and white photo of the {}." "a painting of the {}." "a painting of a {}." "a pixelated photo of the {}." "a sculpture of the {}." "a bright photo of the {}."
"a cropped photo of a {}." "a plastic {}." "a photo of the dirty {}." "a jpeg corrupted photo of a {}." "a blurry photo of the {}." "a photo of the {}." "a good photo of the {}." "a rendering of the {}." "a {} in a video game.' "a photo of one {}." "a doodle of a {}." "a close-up photo of the {}." "a photo of a {}." "the origami {}." "the {} in a video game.' "a sketch of a {}." "a doodle of the {}." "a origami {}." "a low resolution photo of a {}." "the toy {}." "a rendition of the {}." "a photo of the clean {}." "a photo of a large {}." "a rendition of a {}." "a photo of a nice {}." "a photo of a weird {}." "a blurry photo of a {}." "a cartoon {}." "art of a {}." "a sketch of the {}." "a embroidered {}." "a pixelated photo of a {}." "itap of the {}." "a jpeg corrupted photo of the {}." "a good photo of a {}." "a plushie {}." "a photo of the nice {}." "a photo of the small {}." "a photo of the weird {}." "the cartoon {}." "art of the {}." "a drawing of the {}." "a photo of the large {}." "a black and white photo of a {}." "the plushie {}." "a dark photo of a {}." "itap of a {}." "graffiti of the {}." "a toy {}." "itap of my {}." "a photo of a cool {}." "a photo of a small {}." "a tattoo of the {}."

C HAND-CRAFTED PROMPTS

The hand-crafted prompts for the 11 datasets as well as the ImageNet variants are listed in Table 6. We select the ensemble prompts from CLIP (Radford et al., 2021); examples for ImageNet are shown in Table 7.
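For reference, CLIP-style prompt ensembling builds each class's classifier weight by averaging the normalized text features of all templates and re-normalizing. The sketch below illustrates the recipe with a hypothetical `encode_text` standing in for CLIP's text encoder:

```python
import numpy as np

def encode_text(sentence, dim=512):
    # Stand-in for CLIP's text encoder: a deterministic pseudo-embedding
    # derived from the sentence (illustration only).
    seed = abs(hash(sentence)) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(dim)

def ensemble_classifier(classnames, templates, dim=512):
    """Average the normalized text features of every template per class,
    then re-normalize -- the standard CLIP prompt-ensembling recipe."""
    weights = []
    for name in classnames:
        feats = np.stack([encode_text(t.format(name), dim) for t in templates])
        feats /= np.linalg.norm(feats, axis=1, keepdims=True)
        mean = feats.mean(axis=0)
        weights.append(mean / np.linalg.norm(mean))
    return np.stack(weights)   # shape: (num_classes, dim)

W = ensemble_classifier(["dog", "cat"],
                        ["a photo of a {}.", "a sketch of a {}."])
assert W.shape == (2, 512)
assert np.allclose(np.linalg.norm(W, axis=1), 1.0)
```

The resulting unit-norm rows serve as the zero-shot classifier weights; image features are then scored by cosine similarity against them.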

D ADDITIONAL EXPERIMENTS

D.1 CROSS-DATASET TRANSFER

All models are trained on ImageNet as the source dataset and evaluated on the remaining 10 target datasets. The goal of this setting is to demonstrate the potential to transfer beyond a single dataset. The results are presented in Table 8. As shown, our ProGrad not only achieves the highest performance on the source dataset but also outperforms the other baselines on 9 out of 10 target datasets.

D.2 UPPER BOUND OF PROGRAD

As the regularized general gradient direction G_g is the key to improving the results, we are interested in the upper bound of performance when using an oracle direction G_g^full instead of the one offered by the hand-crafted prompt. To do so, we first optimize a prompt with the plain cross-entropy loss on the full dataset to create G_g^full, and then use this gradient to implement ProGrad. The results are shown in Table 9. They indicate that a more accurate regularization direction G_g^full elicits a stronger ProGrad model.

D.3 EFFECT OF HYPER-PARAMETER

We further analyze the effect of the hyper-parameter λ described in Eq. (4) of the main paper. Results are shown in Table 10. As discussed in Section 3.2 of the main paper, a smaller λ weakens the general knowledge regularization, which results in inferior performance under the low-shot setting for most datasets. However, for DTD in Table 10, using a smaller λ = 0.9 to reduce the general knowledge regularization improves the 16-shot results. One possible reason is that the texture images of DTD have a large gap from the CLIP pre-training images collected from the Internet; stronger regularization from pre-trained knowledge might be detrimental to fine-tuning performance when downstream data is sufficient.

D.4 ADDITIONAL FEW-SHOT CLASSIFICATION RESULTS

In this section, we further provide the detailed few-shot classification results of other learning-based fine-tuning methods, with confidence intervals at 95%, in Table 11 and Table 12. Cosine. As described in Section 4.5 of the main paper, we plug an additional cosine classifier on top of the visual backbone and train it on the downstream dataset. CoOp learns the context prompt from data rather than by hand-crafted design. CLIP-Adapter learns an additional feature adapter to boost conventional fine-tuning results. Cosine + ProGrad applies ProGrad to the training of the cosine classifier. CoOp + l2 prompt reg. We further investigate whether simply using the l2 distance between the learned prompt vector v and the word embedding vector v_zs of the hand-crafted prompt as a regularizer can improve few-shot performance, i.e., L_total(v) = L_ce(v) + α∥v − v_zs∥², where we select α = 0.01. CoOp + GM applies the gradient matching method (Yu et al., 2020) to CoOp, i.e., we not only project G_d onto the direction perpendicular to G_g but also project G_g onto the direction perpendicular to G_d, and use the two projected gradients to fine-tune the model alternately. CoOp + KD. As described in Section 4.5 of the main paper, we apply a knowledge distillation loss to CoOp, i.e., L_total = L_ce + L_kl. CoOp + ProGrad applies ProGrad to CoOp. For all prompt-based methods, we set the context length M to 16 except for CoOp + l2 prompt reg: its prompt length must equal the hand-crafted prompt length to compute the l2 norm, e.g., M has to be 4 if the hand-crafted prompt is "a photo of a". According to the average results in Table 11, our CoOp + ProGrad still achieves the best average performance. Comparing the results of 1) Cosine and Cosine + ProGrad and 2) CoOp and CoOp + ProGrad demonstrates that both the conventional "pre-train then fine-tune" paradigm and the prompt tuning paradigm can benefit from our ProGrad.
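The CoOp + KD objective L_total = L_ce + L_kl can be made concrete with plain logits, as in the minimal NumPy sketch below; the function names are illustrative, not the paper's code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_total_loss(student_logits, zeroshot_logits, labels):
    """L_total = L_ce + L_kl: cross-entropy on downstream labels plus the
    KL divergence toward the hand-crafted-prompt (zero-shot) predictions."""
    p_s = softmax(student_logits)
    p_zs = softmax(zeroshot_logits)
    n = len(labels)
    l_ce = -np.log(p_s[np.arange(n), labels]).mean()
    l_kl = (p_zs * (np.log(p_zs) - np.log(p_s))).sum(axis=1).mean()
    return l_ce + l_kl

logits = np.array([[2.0, 0.5, -1.0]])
# When the student matches the zero-shot model, the KL term vanishes and
# only the cross-entropy on the downstream label remains.
assert np.isclose(kd_total_loss(logits, logits, np.array([0])),
                  -np.log(softmax(logits))[0, 0])
```

Unlike ProGrad, this loss averages the two objectives at every step, so gradients from conflicting downstream evidence are never filtered out, which is the weakness discussed below.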
The gap between CoOp and CoOp + l2 prompt reg demonstrates that directly regularizing the learned prompt to stay close to the hand-crafted prompt brings limited improvement. Digging into CoOp + KD and CoOp + GM, we find performance improvements from introducing the general knowledge. However, both still under-perform our CoOp + ProGrad. This is because 1) CoOp + KD learns the average knowledge of the two domains, which still allows the fine-tuned model to learn downstream knowledge that conflicts with the general knowledge; and 2) CoOp + GM additionally requires the fine-tuned model to discard the general knowledge that is not aligned with the downstream knowledge; as the downstream data is limited, the inaccurate estimation of G_d leads the model to focus on biased general knowledge.



The pre-trained dataset includes samples from diverse classes. Here, we only consider the pre-trained data belonging to the classes of the downstream task.



Figure 1: Comparison of Zero-shot CLIP, CoOp, and our ProGrad on Stanford Cars and OxfordPets datasets. (a)&(b): Given 1 shot training sample, CoOp's performance severely drops and under-performs zero-shot CLIP by large margins when the training continues. (c)&(d): CoOp may fail to improve CLIP without data augmentation or plenty of samples.

Figure 3: (a) If G d is aligned with G g , we set G prograd as G d . (b) If G d conflicts with G g (i.e., their angle is larger than 90 • ), we set G prograd as the projection of G d on the vertical direction of G g . (c) Training pipeline of our ProGrad. Only the context vectors are learnable.
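The update rule in Figure 3 can be sketched directly. The snippet below is a minimal NumPy illustration; λ is assumed to scale the subtracted component as in Eq. (4) of the main paper (λ = 1 recovers the full projection), and the variable names are illustrative:

```python
import numpy as np

def prograd_update(G_d, G_g, lam=1.0):
    """Figure 3's rule: keep G_d when it aligns with G_g; otherwise remove
    a lambda-weighted amount of its component along G_g.
    lam=1 projects G_d fully onto the plane orthogonal to G_g."""
    dot = np.dot(G_d, G_g)
    if dot >= 0:                                   # angle <= 90 deg: no conflict
        return G_d
    return G_d - lam * dot / np.dot(G_g, G_g) * G_g

# Aligned case: the domain-specific gradient passes through unchanged.
assert np.allclose(prograd_update(np.array([1.0, 1.0]),
                                  np.array([1.0, 0.0])), [1.0, 1.0])

# Conflicting case: the component opposing G_g is removed, so the
# resulting update no longer fights the general knowledge direction.
g = prograd_update(np.array([-1.0, 1.0]), np.array([1.0, 0.0]))
assert np.allclose(g, [0.0, 1.0])
assert np.dot(g, np.array([1.0, 0.0])) >= 0
```

In a real prompt-tuning loop, G_d and G_g would be the flattened gradients of the cross-entropy and KL losses with respect to the context vectors, and the projected gradient would be written back before the optimizer step.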

Figure 4: Accuracy (%) of few-shot learning on 11 datasets. The context length M is set to 16.

Figure 5: Distribution of samples that are mis-classified by ProGrad but correctly classified by CoOp.

ProGrad clearly outperforms Zero-shot CLIP, CoOp and CoCoOp on all target datasets as well as the source dataset with ResNet-based and Transformer-based visual backbones.

Figure 6: The angles between G d and G g during training on Caltech101 and StanfordCars.

where ℓ(f(X), Y) denotes the cross-entropy loss and N is the number of training samples. We are interested in the downstream domain D_d and the pre-trained domain D_p, respectively.¹ Let F be a function class; the conventional fine-tuned model f̂_coop is trained on D_d by $\hat{f}_{coop} = \arg\min_{f\in\mathcal{F}} \hat{R}_d(f)$, while the zero-shot CLIP model f̂_p is considered to be trained on D_p by $\hat{f}_p = \arg\min_{f\in\mathcal{F}} \hat{R}_p(f)$.

These results demonstrate the anti-overfitting ability of our ProGrad when training samples are extremely limited. Furthermore, prompt ensembling can further unlock the potential of ProGrad. From Table 1, with more accurate general knowledge offered by prompt ensembling, CLIP++ improves zero-shot CLIP from 58.77% to 59.38%, and ProGrad++ increases the accuracy of ProGrad from 74.28% to 75.03% at 16 shots.

Evaluation on robustness to distribution shift with different visual backbones.

Averaged accuracy (%) over 11 datasets for base-to-new generalization.

Comparison with knowledge distillation. Average accuracy (%) over 11 datasets.

Applying ProGrad to cosine classifier.

Hand-crafted Prompts.

Prompt Ensembling Examples for ImageNet.

Comparison of prompt learning methods in the cross-dataset transfer setting. Prompts are learned from 4-shot ImageNet.
ProGrad 62.17 88.30 86.43 55.61 62.69 76.76 15.76 60.16 39.48 24.87 58.70 57.36

Upper-bound results (%) of 1, 2, 4, 8, and 16 shots of few-shot learning.

Accuracy (%) of 1, 2, 4, 8, and 16 shots training with different λ on DTD and OxfordPets.

further presents the results for base-to-new generalization on each of the 11 datasets.

Accuracy (%) with confidence interval at 95% of few-shot learning on 11 datasets (Part I). The context length M is set to 16 for prompt-based methods. * indicates results copied from (Gao et al., 2021a).

Accuracy (%) with confidence interval at 95% of few-shot learning on 11 datasets (Part II). The context length M is set to 16 for prompt-based methods. * indicates results copied from (Gao et al., 2021a).

Accuracy (%) for the base-to-new generalization evaluation. The context length M is 4 for prompt-based methods, which are learned from the base classes with 4 shots. H: harmonic mean.

