PROTOTYPICAL CALIBRATION FOR FEW-SHOT LEARNING OF LANGUAGE MODELS

Abstract

In-context learning of GPT-like models has been recognized as fragile across different hand-crafted templates and demonstration permutations. In this work, we propose prototypical calibration to adaptively learn a more robust decision boundary for zero- and few-shot classification, instead of greedy decoding. Concretely, our method first adopts a Gaussian mixture distribution to estimate the prototypical clusters for all categories. Then we assign each cluster to the corresponding label by solving a weighted bipartite matching problem. Given an example, its prediction is calibrated by the likelihood of the prototypical clusters. Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance.

1. INTRODUCTION

Large-scale language models (LMs) have shown strong generalization ability on a wide range of downstream tasks (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Lewis et al., 2019; Brown et al., 2020; Dong et al., 2019; Bao et al., 2020). For a long time, fine-tuning has been the standard strategy for transferring this extensive knowledge to downstream tasks. However, fine-tuning such large LMs suffers from the over-parameterization issue under few-shot settings. Brown et al. (2020) propose the concept of in-context learning with GPT, which enables LMs to quickly adapt to a new task by conditioning on hand-crafted prompts, as shown in Figure 1. The prompts consist of task-specific templates and several input-label pairs (demonstrations). In-context learning is surprising in that GPT can perform various tasks without any parameter updates.

It has been noticed that the predictions of GPT conditioned on prompts tend to be biased toward specific answers and can be highly volatile across different templates, demonstrations, and their permutations (Lu et al., 2021; Jiang et al., 2020). Zhao et al. (2021) propose to calibrate the model prediction with content-free outputs to mitigate this problem. Rubin et al. (2021) and Lu et al. (2021) focus on training-example retrieval and optimal ordering selection, respectively, to produce more performant prompts than random sampling. However, these works do not explain why in-context learning performance is fragile across different scenarios.

In this paper, we analyze the intrinsic reason for the instability of few-shot learning with GPT. We observe significant distinctions among the prediction distributions of GPT under different prompts. As shown in Figure 2, the conventional decision boundary of GPT (i.e., naively taking the output with the largest probability as the predicted label) often fails to discriminate the predictions.
We argue that the predictions can be more discriminative when provided with a calibrated decision boundary. Specifically, we refer to the model outputs of examples sharing the same ground-truth category as a prototypical cluster, and adopt a Gaussian Mixture Model (GMM) to estimate the distributions of all categories' clusters. The decision boundaries of the prototypical clusters are adaptively learned, which we call prototypical calibration (PROCA). The prototypical clusters are then assigned to the corresponding labels through weighted bipartite matching. We also propose to improve the estimation according to the cluster-label assignment. Finally, the predictions of test examples become more precise owing to the calibrated decision boundary (as shown in Figure 2). Experimental results show that we achieve on average a 13% absolute improvement for different sizes of GPT models across nine text classification datasets. We demonstrate that PROCA is effective across various templates and different demonstration permutations.

To summarize, our key contributions are as follows:

• We find that the decision boundary plays a critical role in few-shot evaluation. Moreover, performant decision boundaries are inconsistent across language models and prompts.

• We propose prototypical calibration to adaptively learn a better decision boundary for few-shot classification with language models.

• Experiments show that PROCA achieves a 13% absolute improvement over the conventional approach on a wide range of text classification tasks.

2. DECISION BOUNDARY OF FEW-SHOT LEARNING WITH GPT

A decision boundary refers to an explicit prediction criterion in the output space for a given classification problem. As shown in Figure 2, the two dashed lines represent two different decision boundaries, which classify examples into negative and positive categories. In this section, we explore the effect of the decision boundary on few-shot learning. We demonstrate that optimal decision boundaries are inconsistent under different LMs and prompts.

Figure 3: Few-shot performance of GPT-2-Large (0.8B) and GPT-J (6B) using different decision boundaries. P1, P2, P3, and P4 represent different prompts. The red rectangle indicates the performance under the conventional decision boundary (P_positive = 0.5 for the example task), i.e., naively using the outputs with larger probabilities as the predicted labels. It is observed that the decision boundary plays a critical role in few-shot evaluation.

Decision boundary greatly influences the few-shot performance. We evaluate the performance of different models and prompts using different decision boundaries. Results are shown in Figure 3. The red rectangle indicates the conventional decision boundary used by GPT, which naively decodes the label with the larger prediction probability. We observe that shifting the decision boundary can cause wild fluctuations in few-shot accuracy, from near state-of-the-art to random guessing. For each prompt, there is an exclusive region where the decision boundary is relatively robust. The model exhibits poor performance when the decision boundary is far from the robust region.

Performant decision boundaries are not transferable across LMs or prompts. Figure 3 demonstrates that all prompts exhibit strong performance if the decision boundary lies in the robust region. However, different prompts and models lead to different degrees of deviation between the optimal decision boundary and the conventional one.
This suggests that performant decision boundaries are inconsistent across models and prompts. Based on the above analysis, we argue that all prompts can achieve better performance when the decision boundary is calibrated into the robust region.
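As a toy illustration of this effect, the following sketch simulates a binary task where the model is biased toward the positive label. All numbers are synthetic and only illustrate how moving the boundary away from the conventional P_positive = 0.5 toward the robust region can change accuracy; it is not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic P_positive scores: the simulated model is biased toward the
# positive label, so even negative examples often score above 0.5.
p_neg = rng.beta(5, 2, size=5000)  # ground-truth negative examples
p_pos = rng.beta(8, 1, size=5000)  # ground-truth positive examples

def accuracy(boundary):
    # Classify as positive when P_positive >= boundary.
    return 0.5 * ((p_neg < boundary).mean() + (p_pos >= boundary).mean())

# For this biased model, the conventional boundary (0.5) underperforms
# a boundary shifted into the robust region.
print(round(accuracy(0.5), 3), round(accuracy(0.85), 3))
```

Sweeping `accuracy` over boundaries in (0, 1) reproduces the qualitative shape of Figure 3: a plateau of strong performance around the shifted boundary, and degradation as the boundary moves away from it.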

3. PROTOTYPICAL CALIBRATION

We have illustrated that the conventional decision boundary generally deviates from the robust region, which renders in-context learning fragile. In this section, we present prototypical calibration (PROCA) to adaptively learn a better decision boundary.

3.1. PROTOTYPICAL CLUSTER ESTIMATION

Considering an N-way few-shot classification task, let X denote the N-dimensional model outputs. For examples whose ground truth is the n-th category, the model outputs compose a prototypical cluster. For instance, the red and blue areas in Figure 2 refer to two prototypical clusters, respectively. We assume that each prototypical cluster follows a Gaussian distribution:

P_G(X | μ_n, Σ_n) = (1 / ((2π)^{N/2} |Σ_n|^{1/2})) exp( -(1/2) (X - μ_n)^T Σ_n^{-1} (X - μ_n) ),

where μ_n and Σ_n are the mean vector and covariance matrix of the distribution, respectively. Next, we estimate N prototypical clusters for N categories with a Gaussian mixture model (GMM):

P_GMM(X) = Σ_{n=1}^{N} α_n P_G(X | μ_n, Σ_n),

where α_n is the mixing coefficient of the n-th distribution. In our work, we formulate the model prediction x = [x_1, x_2, ..., x_N] as follows:

x_n = log( exp(o_n) / Σ_{i=1}^{N} exp(o_i) ),

where o_n and o_i are the logits predicted by GPT, corresponding to label token n and label token i, respectively. Intuitively, x_n represents the log probability of the n-th category. After clarifying the GMM definition under few-shot learning, we utilize a small-scale unlabeled in-domain dataset, called the estimate set (D_esti), to estimate the parameters {α_n, μ_n, Σ_n}_{n=1}^{N} with the Expectation-Maximization (EM) algorithm (Moon, 1996). Notice that the estimate set does not contain any human annotation. Specifically, EM is an iterative method that finds the optimal estimation of the GMM's parameters by maximizing the likelihood Π_{x ∈ D_esti} P_GMM(x).
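The estimation step can be sketched with off-the-shelf tools. Here `logits` is a hypothetical stand-in for the LM's label-token logits on the estimate set (the real method would collect these from GPT), and the EM settings mirror those reported in Section 4.1.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
N = 2  # number of categories

# Synthetic label-token logits for 200 unlabeled estimate examples,
# loosely forming one cluster per ground-truth category.
logits = np.concatenate([
    rng.normal(loc=[2.0, 0.0], size=(100, N)),  # examples inclined to label 0
    rng.normal(loc=[0.0, 2.0], size=(100, N)),  # examples inclined to label 1
])

# x_n = log softmax over label tokens, the log probability of category n.
x = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Fit an N-component GMM with EM; k-means initialization and the
# convergence settings follow Section 4.1.
gmm = GaussianMixture(
    n_components=N, covariance_type="full",
    init_params="kmeans", max_iter=100, tol=1e-3, random_state=0,
)
gmm.fit(x)
print(gmm.means_.shape)  # (N, N): one N-dimensional mean per cluster
```

Each row of `gmm.means_` is an estimated μ_n, and `gmm.covariances_` holds the corresponding Σ_n; the following subsections consume exactly these quantities.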

3.2. CLUSTER-LABEL ASSIGNMENT

Then we assign the estimated prototypical clusters to the target labels. Concretely, for an estimation e = {(α_n, μ_n, Σ_n)}_{n=1}^{N}, μ_{n,l} is the l-th element of μ_n and indicates how much the n-th cluster of e belongs to label l. Therefore, we propose a cluster-label assignment score CLA(·), which represents the overall belongingness of a cluster-label assignment. Let the tuple k = (k_1, k_2, ..., k_N) denote a cluster-label assignment, where k is a permutation of {1, 2, ..., N}; it means that the n-th cluster is assigned to the label k_n. The assignment score CLA(·) is defined as:

CLA(e, k) = Σ_{n=1}^{N} μ_{n, k_n}.

Finding the best assignment is then a weighted bipartite matching problem between N clusters and N labels. The optimal assignment is obtained by maximizing CLA(e, k):

k*(e) = argmax_{k ∈ K} CLA(e, k),

where K denotes the set of all assignment permutations. In the worst case, this process requires N! attempts to find the optimal assignment, which is time-consuming when N is large, so we adopt the Kuhn-Munkres algorithm (Kuhn, 1955) to accelerate it.
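This matching can be solved in polynomial time with scipy's implementation of the Kuhn-Munkres (Hungarian) algorithm. The `means` matrix below is a made-up 2x2 example of cluster means over log probabilities, not values from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical (N, N) matrix of cluster means:
# entry [n, l] is mu_{n,l}, how much cluster n "belongs" to label l.
means = np.array([
    [-0.2, -1.6],  # cluster 0 leans toward label 0
    [-1.9, -0.1],  # cluster 1 leans toward label 1
])

# Maximize CLA(e, k) = sum_n mu_{n, k_n} over all permutations k.
rows, cols = linear_sum_assignment(means, maximize=True)
assignment = {int(n): int(k) for n, k in zip(rows, cols)}
print(assignment)  # {0: 0, 1: 1}
```

`linear_sum_assignment` runs in O(N^3), avoiding the N! enumeration of permutations.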

3.3. ESTIMATION SELECTION BASED ON CLUSTER-LABEL ASSIGNMENT

The EM algorithm is empirically sensitive to the initialization of the GMM parameters, so we repeat the estimation multiple times with different random seeds. We then define a metric to evaluate how good these estimations are and select the best one. Since CLA(e, k*) reflects the overall label belongingness under the optimal assignment of an estimation e, it can be used to evaluate estimations. Formally, we select the estimation e* according to the assignment score of k* as follows:

e* = argmax_{e ∈ E} CLA(e, k*(e)),

where E is the set of estimations obtained with different initializations of the GMM parameters.
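A minimal sketch of this selection loop, assuming synthetic log-probability predictions `x` on the estimate set (all names and data here are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D log-probability predictions forming two clusters.
x = np.concatenate([
    rng.normal(loc=[-0.2, -1.8], scale=0.3, size=(100, 2)),
    rng.normal(loc=[-1.8, -0.2], scale=0.3, size=(100, 2)),
])

def cla_score(gmm):
    # CLA(e, k*(e)) under the optimal cluster-label assignment.
    rows, cols = linear_sum_assignment(gmm.means_, maximize=True)
    return gmm.means_[rows, cols].sum()

# Repeat the estimation with different random initializations (the set E)
# and keep e* = argmax_e CLA(e, k*(e)).
candidates = [
    GaussianMixture(n_components=2, init_params="random",
                    max_iter=100, tol=1e-3, random_state=seed).fit(x)
    for seed in range(10)
]
best = max(candidates, key=cla_score)
```

Section 4.4 argues this score is a better selection criterion than maximum likelihood, since likelihood alone never checks that clusters line up with their inclined labels.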

3.4. INFERENCE

After selecting the desired estimation e*, we use the GMM to make predictions instead of the conventional approach used in GPT (Brown et al., 2020). Due to the class-distribution discrepancy between the estimate set and the test set, we discard the mixing coefficient α_n of each sub-distribution during inference. For a test example with LM prediction x, the example is assigned to the most likely cluster:

ñ = argmax_{n = 1, ..., N} P_G(x | μ*_n, Σ*_n).

Finally, the predicted label is k*_ñ(e*), where the cluster-label assignment k*(e*) is obtained according to Equation (5).
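Putting the pieces together, the inference step can be sketched end-to-end on synthetic data. The fitted GMM, the assignment, and the test points below are all illustrative stand-ins; note the mixing weights `gmm.weights_` are deliberately never used.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic log-probability predictions on the estimate set,
# forming two well-separated prototypical clusters.
x_esti = np.concatenate([
    rng.normal(loc=[-0.2, -1.8], scale=0.2, size=(50, 2)),  # ~ label 0
    rng.normal(loc=[-1.8, -0.2], scale=0.2, size=(50, 2)),  # ~ label 1
])
gmm = GaussianMixture(n_components=2, random_state=0).fit(x_esti)

# Cluster-label assignment k* (Section 3.2).
rows, cols = linear_sum_assignment(gmm.means_, maximize=True)
label_of_cluster = {int(n): int(k) for n, k in zip(rows, cols)}

def predict_label(x_test):
    # n~ = argmax_n P_G(x | mu_n, Sigma_n); alpha_n weights are discarded.
    densities = [
        multivariate_normal.logpdf(x_test, mean=gmm.means_[n],
                                   cov=gmm.covariances_[n])
        for n in range(2)
    ]
    return label_of_cluster[int(np.argmax(densities))]

print(predict_label(np.array([-0.1, -2.0])))  # prints 0
```

Dropping α_n makes the decision depend only on cluster shape, which is what gives PROCA its robustness to class imbalance in the estimate set (Section 4.4).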

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

We evaluate five models from the GPT family, including GPT-2-Large (Radford et al., 2019) with 0.8B parameters, GPT-2-XL (Radford et al., 2019) with 1.5B parameters, GPT-neo (Black et al., 2021) with 2.7B parameters, GPT-J (Wang & Komatsuzaki, 2021) with 6B parameters, and Bloom (BigScience, 2022) with 176B parameters. The estimate set can be constructed either by generating examples with LMs (Lu et al., 2021; Wang et al., 2021; Meng et al., 2022; Ye et al., 2022) or by sampling a light subset of training examples without gold labels. For simplicity, we choose the latter to construct the estimate set, and we compare the two options in Section 4.5. Moreover, the estimate set size is proportional to the number of classes of the task. For more details, please refer to Table 7 in the Appendix. We use the k-means algorithm to initialize the GMM parameters to accelerate convergence. The maximum number of iterations and the convergence threshold for each EM process are set to 100 and 1e-3, respectively. Moreover, we repeat the estimation multiple times with different random initializations to avoid getting stuck in local optima. Multiple repetitions bring little additional time consumption compared to GPT inference, so we simply set the number of repetitions to 100 for all tasks.

4.2. EVALUATION PROTOCOL

We evaluate the proposed method on nine widely-used text-classification datasets, including SST-2 (Socher et al., 2013), SST-5 (Socher et al., 2013), Subj (Pang & Lee, 2004), MR (Pang & Lee, 2005), AP (Zhang et al., 2015), DBPedia (Zhang et al., 2015), AGNews (Zhang et al., 2015), RTE (Dagan et al., 2005), and TREC (Voorhees & Tice, 2000). SST-2, SST-5, MR, and AP are sentiment classification tasks. RTE is a textual entailment recognition task, and TREC is a text retrieval question classification task. Subj and AGNews are subjectivity and topic classification tasks, respectively, and DBPedia is an ontology classification task. We use the full validation set for evaluation except for AGNews, DBPedia, and AP, for which we randomly sample 2000 test examples. We compare PROCA with the conventional approach used by GPT (Brown et al., 2020) and contextual calibration (Zhao et al., 2021). Experiments are conducted under 0-shot, 1-shot, 4-shot, and 8-shot scenarios. We fix the template format for each dataset (details of the templates are shown in Table 6) and use randomly sampled training examples as demonstrations. We compute the average accuracy on the validation set over five random seeds for each setting, except for Bloom, for which we use 2 seeds. We conduct the evaluation on 8 Tesla A100 GPUs for Bloom and on Tesla V100 GPUs for the other models.

4.3. MAIN RESULTS

We report the mean and standard deviation of accuracy across five different random seeds for GPT-2-XL, GPT-J, and Bloom in Table 1. The results of GPT-2-Large and GPT-neo are shown in Table 9 of the Appendix. From Table 1 and Table 9, we observe that PROCA achieves, on average, a 13% absolute improvement over the conventional approach and a 6% absolute improvement over contextual calibration. In some cases, the absolute improvement reaches up to 40% and 20%, respectively, e.g., GPT-J 0-shot on DBPedia and GPT-2-XL 8-shot on AGNews. The results show that PROCA maintains high effectiveness across different model sizes and few-shot scenarios, indicating its strong generalization ability. Moreover, compared to the conventional approach, PROCA achieves considerable improvements with lower variance across different prompts in most cases, which suggests that PROCA can effectively calibrate the decision boundary for various prompts (as illustrated in Figure 2). The low variance of PROCA's zero-shot performance also reflects that our estimation strategy is reliable and insensitive to the choice of estimate set. We observe that the performance gain on Bloom is smaller than that on relatively small models, which suggests that huge LMs suffer less from the decision-boundary deviation problem. In addition, PROCA seems invalid for GPT-2-XL on RTE. We identify the reason: the entailment recognition task is too challenging for relatively small models like GPT-2-XL, and the output of the LM on such challenging tasks is not discriminative (the same holds for GPT-2-Large, as shown in Table 9 in the Appendix).

4.4. EFFECTIVENESS ANALYSIS

We conduct more experiments to verify the effectiveness of PROCA. Unless otherwise specified, the experimental results are the average accuracy of GPT-2-XL conditioned on 5 different 4-shot prompts.

PROCA is consistently effective across different templates. We conduct experiments across nine different prompt templates and label spaces (details of the templates are shown in Table 8 of the Appendix). The performance comparison among the three approaches on SST-2 is shown in Figure 4. We observe that contextual calibration retains high variance although it improves the average accuracy. In contrast, our prototypical calibration brings a large improvement with low variance, which indicates that PROCA is effective across various prompt templates.

PROCA is robust under demonstration perturbations. Previous works (Zhao et al., 2021; Lu et al., 2021) have noticed that the order of training examples has significant effects on few-shot performance. In this part, we evaluate prototypical calibration conditioned on nine 8-shot prompts with different class proportions for SST-2, and show the accuracy of twelve randomly sampled orderings for each proportion in Figure 6. We find that contextual calibration can improve the performance in most cases but is still sensitive to the orderings. In contrast, PROCA is significantly superior to the others and keeps an extremely low variance across different permutations, indicating its insensitivity to class proportion and permutation. It is also shown that although class-balanced prompts tend to have higher performance, there are some exceptions (e.g., the prompt with all negative samples is the most performant one for both the conventional approach and contextual calibration). We think it is GPT-2-XL's intrinsic bias toward the positive class that leads to these counter-intuitive results.

PROCA is robust to class imbalance. Because the labels of the estimate examples are unavailable, PROCA may suffer from class imbalance. We construct nine estimate sets with different imbalance levels for SST-2 and Subj by controlling the proportion of positive examples in the sampled set. Then we evaluate PROCA and contextual calibration on them. The experimental results in Figure 5 show that the estimate set's class imbalance level affects the performance of PROCA to some extent, and a class-balanced estimate set leads to higher accuracy. As described in Section 3.1, standard GMM estimates the weight of each cluster, which reflects the proportion of different classes in the estimate set. Owing to our weight-discarding operation during inference, class imbalance has a much smaller negative impact on PROCA, which surpasses contextual calibration on both SST-2 and Subj even with an extremely class-imbalanced estimate set. Besides, the absolute improvement for the class-balanced estimate set reaches 20% and 15%, respectively.


4.5. ABLATION STUDIES

Comparison between different estimate set construction methods. There are two ways to construct the estimate set. One is using a light set of unlabeled examples from the training set, which is simple and convenient. The other is utilizing the generation ability of LMs to construct an unlabeled dataset (Lu et al., 2021; Wang et al., 2021; Meng et al., 2022; Ye et al., 2022). The 0-shot and 1-shot results are shown in Table 5 of the Appendix. We observe that PROCA greatly outperforms the original LM whether the unlabeled data is generated by the LM or randomly sampled from the training set. It also shows that PROCA-t performs slightly better than PROCA-g; we speculate that this is due to the lower quality of the unlabeled data generated by the LM.

A relatively small-scale estimate set is sufficient for PROCA. In Figure 7, we evaluate PROCA with ten different estimate set sizes across five datasets. We report the average accuracy over five randomly sampled estimate sets, conditioning on the same 4-shot prompt in each setting. We observe that increasing the scale of the estimate set within a certain small range can greatly improve the classification accuracy and reduce the variance. However, a larger estimate set hardly brings further improvement, which indicates that a small estimate set is enough for PROCA to be near-optimal. PROCA also achieves acceptable performance with just a handful of estimate examples on SST-2, MR, and Subj, surpassing both the conventional approach and contextual calibration.

Estimation selection according to the assignment score is useful to PROCA. The standard estimation of GMM aims to maximize the likelihood of all observations and selects the parameters with the maximum likelihood among multiple repetitions.
We argue that the estimation with maximum likelihood is not consistently beneficial to PROCA, especially in multi-class tasks, because there is no supervision forcing the predictions to be assigned to their inclined classes during the estimation procedure. We therefore propose to select the estimation according to the assignment score (as described in Section 3.3). We compare the two strategies for GPT-2-XL and GPT-J in Table 3. The results indicate that our selection strategy achieves more stable improvements on AGNews and DBPedia regardless of model size.

Table 3: Performance of PROCA with different strategies of estimation selection (maximum likelihood, and assignment score as in Equation (4)) for GPT-2-XL and GPT-J on AGNews and DBPedia.

5. RELATED WORK

Instability of Few-shot Learning with Language Models. It has been recognized that the few-shot performance of language models is unstable across different in-context scenarios. Language models are prone to predict specific labels due to intrinsic bias or demonstration permutations (Zhao et al., 2021; Lu et al., 2021). Lu et al. (2021) demonstrate LMs' sensitivity to the order of few-shot demonstrations and introduce an entropy-based metric to select the most performant prompts. Zhao et al. (2021) attribute the instability to three biases of prompts, including majority bias, recency bias, and common token bias, and propose a contextual calibration approach. However, the selected content-free test inputs cannot precisely reflect the bias of models, leading to over-correction or under-correction. In contrast, we adaptively derive the classification criterion from the overall prediction distribution of the text inputs, and calibrate the bias introduced by models and prompts.

Improving and Understanding In-context Learning with Language Models. Due to this instability, prior efforts propose various methods to improve in-context learning performance. Holtzman et al. (2021) explore the surface form competition problem in zero-shot models and propose domain conditional pointwise mutual information to reweigh the answer scores. Min et al. (2021) introduce a noisy channel approach that computes the conditional probability of the input given the label, which yields improvements with lower variance and higher worst-case accuracy. Moreover, Liu et al. (2021) focus on prompt engineering to construct more semantically similar demonstrations. To the best of our knowledge, we are the first to study the intrinsic reason for the instability of in-context learning from the perspective of the decision boundary and to propose prototypical calibration to improve it.

Another line of work aims to understand how in-context learning works by casting it as implicit Bayesian inference (Xie et al., 2022), analyzing corpora sources and statistics (Shin et al., 2022; Razeghi et al., 2022), and studying which aspects of demonstrations contribute most to downstream performance (Min et al., 2022).

6. CONCLUSION AND LIMITATION

According to our analysis, the decision boundary is of critical importance to the performance of few-shot demonstrations, and the conventional decision boundary leads to the fragility of prompting LMs. We propose prototypical calibration to adaptively learn a more robust decision boundary. Experiments show that the calibrated decision boundary is effective across various prompt templates, class proportions, and permutations. We achieve on average a 13% absolute improvement across different sizes of pretrained language models on nine popular text classification tasks.

A limitation of our method is that it is not applicable to tasks whose label space is open-ended, since a fixed label space is necessary for estimating prototypical clusters. Furthermore, our method is designed for in-context learning on individual downstream tasks; it does not calibrate the inherent bias of language models, such as gender and occupation bias. For future work, we would like to extend our method to tasks with open-ended answer spaces, such as generative question answering and text summarization.

A. EXPERIMENTAL DETAILS AND ADDITIONAL RESULTS

To validate the effectiveness of PROCA for bidirectional language models, we conduct the evaluation on RoBERTa-Large (Liu et al., 2019) with 355M parameters across the nine text classification tasks. The templates for RoBERTa are constructed by replacing the label symbols in the templates in Table 6 with [MASK] symbols. We compare the 0- and 8-shot average performance of RoBERTa-Large and PROCA over 5 random seeds, and the detailed results are shown in Table 9. We can see that PROCA greatly outperforms RoBERTa-Large on all tasks except for 8-shot on SST-2. Therefore, we suggest that PROCA is also effective for bidirectional language models.

Domain conditional PMI (Holtzman et al., 2021) and noisy channel models (Min et al., 2021) also have strong performance on language model prompting in zero- and few-shot text classification.
To compare against these approaches, we conduct 0-shot and 8-shot evaluation on GPT-2-Large for channel models and PROCA. We use simpler prompt templates for a fair comparison; the details are shown in Table 10. The 0-shot accuracies of PMI DC reported by Holtzman et al. (2021) are also presented for further comparison. As shown in Table 11, PROCA outperforms channel models in most cases and is competitive with PMI DC in the 0-shot scenario. PMI DC also calibrates the output distribution based on domain premises, and different domain premises may lead to different calibration results. In our opinion, channel models transform the decision boundary from the label space to the input space, which is more robust than conventional language model in-context learning; however, the decision boundary deviation problem may still exist in channel models. In comparison, our method directly calibrates the decision boundary by estimating the prototypical clusters and therefore achieves higher performance.



Footnotes:
https://www.github.com/tonyzhaozh/few-shot-learning
The results of noisy channel models are replicated based on the released code: https://github.com/shmsw25/Channel-LM-Prompting



Figure 1: Example of few-shot learning with GPT.

Figure 2: Left and Middle: Prediction distribution of GPT-2-XL under two different prompts for SST-2. Two distributions colored by blue and red represent model predictions for negative and positive ground-truth examples respectively. P positive denotes the prediction probability of positive label. The orange dashed line represents the decision boundary commonly used by GPT (i.e., P positive = 0.5 for binary classification). The green dashed line represents the decision boundary of our prototypical calibration (PROCA). Right: Performance comparison of GPT-2-XL and PROCA under the two prompts, which indicates that PROCA is effective because the calibrated decision boundary is more discriminative for classification.

Figure 4: Performance comparison across nine different templates.

Figure 5: The impact of classimbalanced estimate set on PROCA's performance.

Figure 6: Performance comparison under different label proportions and permutations of demonstrations. Each box indicates the accuracy of twelve randomly sampled permutations.

For the generation method, we follow Lu et al. (2021) to generate diverse estimate examples based on various permutations of demonstrations. Specifically, we only use two labeled examples per category as demonstrations for generation, except for DBPedia (one labeled example per category), and none of these labeled examples are involved in the evaluation. The 4-shot and 8-shot experimental results are shown in Table 2.

Figure 7: Performance of PROCA across different estimate set sizes.

Table 1: Performance comparison among the conventional approach (GPT; Brown et al. 2020), contextual calibration (ConCa; Zhao et al. 2021), and prototypical calibration (PROCA; ours). We report the mean and standard deviation of accuracy across 5 different prompts on the validation set, except for Bloom, for which we use only 2 random seeds to reduce the computational cost. We also show the average performance across the nine datasets. The results of ConCa are replicated based on the released code. The standard deviation of 0-shot accuracy for PROCA is caused by the difference of estimate sets over 5 random seeds. PROCA generally outperforms GPT and ConCa.

Table 2: 4- and 8-shot performance comparison of different estimate set construction methods for GPT-neo across nine text classification tasks. PROCA-g and PROCA-t represent PROCA based on the unlabeled estimate set generated by the LM and randomly sampled from the training set, respectively.


Average performance of the conventional approach, contextual calibration (ConCa) and PROCA for GPT-2-Large and GPT-neo across nine text classification tasks.

Average performance of the conventional approach and PROCA for RoBERTa-Large across nine text classification tasks.

Templates used in Appendix D.

Table 11: Performance comparison among domain conditional PMI, the noisy channel method, and PROCA when using GPT-2-Large on text classification tasks.

D. PERFORMANCE COMPARISON WITH DOMAIN CONDITIONAL PMI SCORING FUNCTION AND NOISY CHANNEL METHOD


We evaluate the performance of PROCA based on three commonly used model output formalizations, including logits, probability, and log-probability, on SST-2, AGNews, and AP for GPT-2-XL. The experimental results are plotted in Figure 8. We find that PROCA with log-probability consistently outperforms PROCA with probability, especially on AGNews. We identify that the log operation makes the predictions fit a Gaussian distribution better and become more separable. Although PROCA with logits seems to have superior performance when few-shot demonstrations are provided, it degrades severely in the zero-shot setting. This suggests that output normalization over the label space is necessary for PROCA, because the output is more likely biased toward tokens outside the label space when no prompts are provided, which is consistent with the conclusion of Min et al. (2022).

Published as a conference paper at ICLR 2023.

