PROTOTYPICAL CALIBRATION FOR FEW-SHOT LEARNING OF LANGUAGE MODELS

Abstract

In-context learning with GPT-like models has been recognized as fragile across different hand-crafted templates and demonstration permutations. In this work, we propose prototypical calibration to adaptively learn a more robust decision boundary for zero- and few-shot classification, instead of greedy decoding. Concretely, our method first adopts a Gaussian mixture distribution to estimate the prototypical clusters for all categories. Then we assign each cluster to its corresponding label by solving a weighted bipartite matching problem. Given an example, its prediction is calibrated by the likelihoods of the prototypical clusters. Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance. The code will be released at https://github.com/zhixhan/ProCa.

1. INTRODUCTION

Large-scale language models (LMs) have shown strong generalization ability on a wide range of downstream tasks (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Lewis et al., 2019; Brown et al., 2020; Dong et al., 2019; Bao et al., 2020). Fine-tuning has long been the standard strategy for transferring this extensive knowledge to downstream tasks. However, fine-tuning such large LMs suffers from over-parameterization under few-shot settings. Brown et al. (2020) propose the concept of in-context learning with GPT, which enables LMs to quickly adapt to a new task by conditioning on hand-crafted prompts, as shown in Figure 1. The prompts consist of task-specific templates and several input-label pairs (demonstrations). In-context learning is surprising in that GPT can perform various tasks without any parameter updates. However, it has been noticed that the predictions of GPT conditioned on prompts tend to be biased toward specific answers and can be highly volatile across different templates, demonstrations, and their permutations (Lu et al., 2021; Jiang et al., 2020). Zhao et al. (2021) propose to calibrate the model prediction with a content-free output to mitigate this problem. Rubin et al. (2021) and Lu et al. (2021) focus on training-example retrieval and optimal ordering selection, respectively, to produce more performant prompts than random sampling. However, these works do not explain why in-context learning is fragile across different scenarios.

In this paper, we analyze the intrinsic reason for the instability of few-shot learning with GPT. We observe significant differences among the prediction distributions of GPT under different prompts. As shown in Figure 2, the conventional decision boundary of GPT (i.e., naively taking the output with the largest probability as the predicted label) often fails to discriminate the predictions.
We argue that the predictions can be more discriminative when provided with a calibrated decision boundary. Specifically, we refer to the model outputs of examples whose ground truths belong to the same category as prototypical clusters, and adopt a Gaussian Mixture Model (GMM) to estimate their distributions for all categories. The decision boundaries between the prototypical clusters are adaptively learned; we call this prototypical calibration (PROCA). The prototypical clusters are then assigned to their corresponding labels through weighted bipartite matching. We also propose to refine the estimation according to the cluster-label assignment. Finally, the predictions of test examples become more accurate owing to the calibrated decision boundary (as shown in Figure 2). Experimental results show that we achieve on average a 13% absolute improvement for different sizes of GPT models across nine text classification datasets. We demonstrate that PROCA is effective across various templates and different demonstration permutations. To summarize, our key contributions are as follows:

• We find that the decision boundary plays a critical role in few-shot evaluation. Moreover, performant decision boundaries are inconsistent across language models and prompts.

• We propose prototypical calibration to adaptively learn a better decision boundary for few-shot classification with language models.

• Experiments show that PROCA achieves a 13% absolute improvement over the conventional approach on a wide range of text classification tasks.
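The pipeline above (fit a GMM over the LM's label-probability outputs, then match clusters to labels via weighted bipartite matching) can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the affinity used as the matching cost (cluster responsibilities accumulated over the LM's argmax predictions) is one plausible choice, and `prototypical_calibration` and its arguments are hypothetical names.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.optimize import linear_sum_assignment


def prototypical_calibration(probs_unlabeled, probs_test, n_classes, seed=0):
    """Sketch of PROCA-style calibration.

    probs_*: (n, n_classes) arrays of LM label probabilities per example.
    Returns calibrated label predictions for probs_test.
    """
    # 1) Estimate one Gaussian component per prototypical cluster.
    gmm = GaussianMixture(n_components=n_classes, random_state=seed)
    gmm.fit(probs_unlabeled)

    # 2) Assign clusters to labels via weighted bipartite matching.
    #    cost[c, k] = -(total responsibility of cluster c on examples the
    #    LM argmax-assigns to label k), a proxy for cluster-label affinity.
    resp = gmm.predict_proba(probs_unlabeled)   # (n, n_classes)
    lm_argmax = probs_unlabeled.argmax(axis=1)  # (n,)
    cost = np.zeros((n_classes, n_classes))
    for k in range(n_classes):
        cost[:, k] = -resp[lm_argmax == k].sum(axis=0)
    rows, cols = linear_sum_assignment(cost)    # Hungarian algorithm
    cluster_to_label = dict(zip(rows, cols))

    # 3) Calibrated prediction: most likely cluster, mapped to its label.
    clusters = gmm.predict(probs_test)
    return np.array([cluster_to_label[c] for c in clusters])
```

Note that the calibrated boundary is determined by where the Gaussian components' likelihoods cross, rather than by a fixed probability threshold.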

2. DECISION BOUNDARY OF FEW-SHOT LEARNING WITH GPT

A decision boundary refers to an explicit prediction criterion in the output space for a given classification problem. As shown in Figure 2 , two dashed lines represent two different decision boundaries, which classify examples into negative and positive categories. In this section, we explore the effect of the decision boundary on few-shot learning. We demonstrate that optimal decision boundaries are inconsistent under different LMs and prompts.
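The effect of moving the decision boundary can be seen with a toy example. The probabilities and labels below are hypothetical, chosen only to mimic the bias illustrated in Figure 2: the LM systematically inflates the positive probability, so the fixed boundary at 0.5 misclassifies the negatives, while a shifted boundary separates the two classes.

```python
# Hypothetical LM outputs for six examples, biased toward "positive".
p_positive = [0.80, 0.75, 0.62, 0.58, 0.54, 0.52]
truth      = [1,    1,    1,    0,    0,    0]


def classify(probs, boundary):
    """Predict 1 (positive) when P(positive) exceeds the boundary."""
    return [int(p > boundary) for p in probs]


def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


conventional = classify(p_positive, 0.5)   # fixed boundary: all positive
calibrated   = classify(p_positive, 0.6)   # boundary shifted toward the bias

print(accuracy(conventional, truth))  # 0.5
print(accuracy(calibrated, truth))    # 1.0
```

The calibrated threshold here is picked by hand for illustration; PROCA instead learns the boundary from the estimated prototypical clusters.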



* Contribution during internship at Microsoft Research.



Figure 1: Example of few-shot learning with GPT.

Figure 2: Left and Middle: Prediction distributions of GPT-2-XL under two different prompts for SST-2. The two distributions, colored blue and red, represent model predictions for negative and positive ground-truth examples, respectively. P_positive denotes the predicted probability of the positive label. The orange dashed line represents the decision boundary commonly used by GPT (i.e., P_positive = 0.5 for binary classification). The green dashed line represents the decision boundary of our prototypical calibration (PROCA). Right: Performance comparison of GPT-2-XL and PROCA under the two prompts, which indicates that PROCA is effective because the calibrated decision boundary is more discriminative for classification.

