OVERTHINKING THE TRUTH: UNDERSTANDING HOW LANGUAGE MODELS PROCESS FALSE DEMONSTRATIONS

Abstract

Through few-shot learning or chain-of-thought prompting, modern language models can detect and imitate complex patterns in their prompt. This behavior allows language models to complete challenging tasks without fine-tuning, but can be at odds with completion quality: if the context is inaccurate or harmful, then the model may reproduce these defects in its completions. In this work, we show that this harmful context-following appears late in a model's computation. In particular, given an inaccurate context, models perform better after zeroing out later layers. More concretely, at early layers models have similar performance given either accurate or inaccurate few-shot prompts, but a gap appears at later layers (e.g. layers 13-14 for GPT-J). This gap appears at a consistent depth across datasets, and coincides with the appearance of "induction heads" that attend to previous answers in the prompt. We restore the performance for inaccurate contexts by ablating a small subset of these heads, reducing the gap by 23.2% on average across 14 datasets. Our results suggest that studying early stages of computation could be a promising strategy to prevent misleading outputs, and that understanding and editing internal mechanisms can help correct unwanted model behavior.

1. INTRODUCTION

A key behavior of modern language models is context-following: neural networks like GPT-3 are able to infer and imitate the patterns in their prompt. At its best, this allows language models to perform well on benchmarks without the need for fine-tuning (Brown et al., 2020; Rae et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022; Srivastava et al., 2022). This has led researchers to study how properties of the context affect few-shot performance (Min et al., 2022b; Kim et al., 2022; Xie et al., 2021; Zhao et al., 2021), and what internal mechanisms underlie context-following (Olsson et al., 2022).

However, context-following can also lead to incorrect, toxic or unsafe model outputs (Rong, 2021). For example, if an inexperienced programmer prompts Codex (Chen et al., 2021) with poorly written or vulnerable code, the model is likely to produce poorly written or vulnerable code completions. Similarly, in this work we study few-shot learning for classification tasks: prompting the model with inaccurate demonstrations reduces model accuracy (Figure 1, left), because the model learns to reproduce the false demonstrations. We thus ask: Can we attribute this "false context-following" behavior to specific model components, and can we mitigate it by intervening on these components?

We show that, perhaps surprisingly, false context-following in text classification is primarily a property of late stages of computation. In particular, stopping the model early, by zeroing out the later layers (Nostalgebraist, 2020), actually improves performance (Figure 1, center). Moreover, true and false contexts yield similar accuracy until some "critical layer" at which they sharply diverge. This demonstrates that even with false demonstrations, the model often "knows" the correct answer (it can be easily decoded from the latent states) but later replaces it with an incorrect answer that is more likely given the context.

To identify the underlying mechanism for false context-following, we turn to Olsson et al. (2022), who identify "induction heads" that attend to and reproduce previous patterns in the input. Motivated by this, we searched for heads that consistently attend to previous examples that have the same (true) answer as the current prompt. We found many such heads, primarily concentrated in later layers of the model (after the critical layer). By removing 10 of these heads, we are able to reduce the accuracy gap between accurate and inaccurate prompts by an average of 23.2% over 14 datasets, with negligible effects on the performance given true prefixes (Figure 1, right).

Figure 1: Left: Given a prompt of inaccurate demonstrations, language models are more likely to output incorrect labels. Center: When demonstrations are incorrect, zeroing out the later layers increases the classification accuracy, here on SST-2. Right: We identify 10 attention heads and remove them from the model: this reduces the effect of incorrect demonstrations by 36.7% on SST-2, averaged over 15 prompt formats, without decreasing the accuracy given correct demonstrations.

Our findings show how analyzing and editing model internals can help practitioners understand and mitigate model failures. Indeed, one intuition for why early-exiting succeeds is that the attention heads we identified cannot in general occur at the earliest layers. This is because these heads must recognize which inputs belong to the same class, which likely requires multiple layers of processing. Thus, early exiting might be a generally promising strategy to detect dishonest behavior in models.

2. PRELIMINARIES: FEW-SHOT LEARNING WITH FALSE DEMONSTRATIONS

We begin by introducing the setting we study: few-shot learning for classification, given demonstrations with correct or incorrect labels. Incorrect demonstrations consistently reduce classification performance, which is the phenomenon that we aim to study and mitigate in this work.

Few-shot learning. We consider autoregressive transformer language models, which produce a conditional probability distribution $p(t_{n+1} \mid t_1, \ldots, t_n)$ over the next token $t_{n+1}$ given previous tokens. We focus on the few-shot learning setting (Brown et al., 2020) for classification tasks: we sample $k$ demonstrations (input-label pairs) from the task dataset, denoted $(x_1, y_1), \ldots, (x_k, y_k)$. To query the model on a new input $x$, we use the predictive distribution $p(y \mid x_1, y_1, \ldots, x_k, y_k, x)$.

Datasets and models. We consider fourteen text classification datasets: SST-2 (Socher et al., 2013), Poem Sentiment (Sheng & Uthus, 2020), Financial Phrasebank (Malo et al., 2014), Ethos (Mollas et al., 2020), TweetEval-Hate (Barbieri et al., 2020), TweetEval-Atheism (Barbieri et al., 2020), TweetEval-Feminist (Barbieri et al., 2020), Medical Questions Pairs (McCreery et al., 2020), MRPC (Wang et al., 2019), SICK (Marelli et al., 2014), RTE (Wang et al., 2019), AGNews (Zhang et al., 2015), TREC (Voorhees & Tice, 2000), and DBpedia (Zhang et al., 2015). We used the same prompt formats as in Min et al. (2022b) and Zhao et al. (2021) (Tables 2 and 3). For SST-2 we use the first of the 15 prompt formats in Zhao et al. (Table 5). We evaluated 3 autoregressive language models: GPT-J (Wang & Komatsuzaki, 2021), GPT2-XL (Radford et al., 2019), and GPT-NeoX-20B (Black et al., 2022).

Evaluation metrics. Given our focus on classification tasks, we are interested in how often the model assigns higher probability to the true label than to all other labels. However, model predictions can be very unstable with respect to small prompt perturbations (Gao et al., 2021).
To mitigate this variability, we measure the calibrated classification accuracy (Zhao et al., 2021). Concretely, for a 2-class classification task, we measure how often the correct label has a higher probability than its median probability over the dataset. Assuming the dataset is balanced (which is true for us), this step has been shown to improve performance and reduce variability across prompts. Calibration for multi-class tasks follows a similar procedure, detailed in Appendix A.1.
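To make the calibration step concrete, the following minimal sketch computes the calibrated accuracy for a 2-class task exactly as described above: a prediction counts as correct when the true label's probability exceeds that label's median probability over the dataset. The function name, the dictionary-based input format, and the toy probabilities are assumptions made for illustration; they are not taken from the original experiments.

```python
from statistics import median


def calibrated_accuracy(label_probs, true_labels):
    """Calibrated accuracy with per-label median thresholds.

    label_probs: list of dicts mapping each label to the model's
                 probability for that label on one example (hypothetical format).
    true_labels: list of the true label for each example.
    """
    labels = list(label_probs[0].keys())
    # Threshold for each label: its median probability over the dataset.
    thresholds = {lab: median(p[lab] for p in label_probs) for lab in labels}
    # An example is correct if the true label beats its own threshold.
    hits = sum(p[y] > thresholds[y] for p, y in zip(label_probs, true_labels))
    return hits / len(true_labels)


# Toy example: a model that is biased towards "Positive" on every input.
probs = [
    {"Positive": 0.9, "Negative": 0.1},
    {"Positive": 0.8, "Negative": 0.2},
    {"Positive": 0.7, "Negative": 0.3},
    {"Positive": 0.6, "Negative": 0.4},
]
truth = ["Positive", "Positive", "Negative", "Negative"]
print(calibrated_accuracy(probs, truth))
```

In this toy example an uncalibrated argmax would always answer "Positive" and score 50%, whereas the median threshold removes the constant bias and recovers all four labels.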

2.1. COMPARING TRUE AND FALSE DEMONSTRATIONS

We first confirm that the models we study exhibit false context-following behavior. To do so, we compare the performance of models when the demonstration labels are all correct, i.e. $y_i = \mathrm{class}(x_i)$, and when they are all incorrect, i.e. $y_i = \sigma(\mathrm{class}(x_i))$, for a cyclic permutation $\sigma$ over the set of labels (Figure 1, left). In particular, inputs from the same class are always assigned the same (possibly false) label within each prompt. For each model and dataset, we sample 1000 sequences each containing $k$ demonstrations and evaluate the model's calibrated accuracy. We sample different demonstrations $(x_i, y_i)$ and label permutations $\sigma$ for every sequence, and vary $k$ from 0 to 40 (from 0 to 20 for GPT2-XL, due to its smaller context size).

Figure 2 (left) shows the difference between GPT-J's calibrated accuracy given accurate and inaccurate prompts as the number of demonstrations increases. As expected, false demonstrations lead to worse performance, and the gap tends to increase with $k$ for most datasets. These results are in agreement with Min et al. (2022b), who found that incorrect demonstrations decreased GPT-J's performance on classification tasks (see Figure 4 in Min et al.).

Models could lose accuracy by copying the incorrect label, or by becoming confused and choosing random labels. To confirm it is the former, we also measure which labels the model chooses for multi-class tasks. Specifically, we measure the permuted score: how often the model chooses the permuted label $\sigma(\mathrm{class}(x))$ over the other labels. For each dataset, a random classifier would have a permuted score of $1/\#\mathrm{labels}$. To make the results comparable across datasets, we divide the permuted scores by this random baseline. Figure 2 (right) shows these reweighted permuted scores for GPT-J on the 9 multi-class datasets in our collection, as well as their average over the datasets.
The permuted score increases steadily with the number of demonstrations and reaches twice its baseline value after 40 demonstrations.
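The construction of false demonstrations and the reweighted permuted score can be sketched as follows. This is an illustrative sketch only: the helper names, the fixed shift of the cyclic permutation, and the example predictions are assumptions, and the prompt template follows the "[object]: [label]" form of the toy Unnatural dataset rather than the per-dataset formats in Tables 2 and 3.

```python
def cyclic_permutation(labels, shift=1):
    """Map each label to the label `shift` positions later, cyclically."""
    n = len(labels)
    return {lab: labels[(i + shift) % n] for i, lab in enumerate(labels)}


def build_prompt(demos, query, sigma=None, template="{x}: {y}"):
    """Build a few-shot prompt from (input, true_class) pairs.

    If sigma is given, every demonstration label is replaced by
    sigma(true_class), so all demonstrations are incorrect.
    """
    lines = [
        template.format(x=x, y=sigma[c] if sigma else c) for x, c in demos
    ]
    # The query is left unlabeled for the model to complete.
    lines.append(template.format(x=query, y="").rstrip())
    return "\n".join(lines)


def normalized_permuted_score(predictions, true_classes, sigma):
    """How often the model outputs sigma(class(x)), divided by the
    random baseline 1 / #labels."""
    hits = sum(pred == sigma[c] for pred, c in zip(predictions, true_classes))
    return (hits / len(predictions)) * len(sigma)


labels = ["plant/vegetable", "sport", "animal"]
sigma = cyclic_permutation(labels, shift=1)
demos = [("tomato", "plant/vegetable"), ("garlic", "plant/vegetable")]
print(build_prompt(demos, "beet", sigma))

# Hypothetical model outputs for four queries, for illustration only.
preds = ["sport", "sport", "animal", "plant/vegetable"]
classes = ["plant/vegetable", "plant/vegetable", "sport", "sport"]
print(normalized_permuted_score(preds, classes, sigma))
```

A normalized score of 1 corresponds to choosing the permuted label at chance level; values above 1 indicate that the model is copying the permuted label rather than making random errors.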

3. THE LOGIT LENS: ZEROING OUT LATER LAYERS IMPROVES ACCURACY

In this section, we decode model predictions directly from intermediate layers. This allows us to evaluate the model's performance midway through processing the inputs. On false prefixes, we find that the model performs better midway through processing, and investigate this phenomenon in detail.

Intermediate layer predictions: the logit lens. Given an autoregressive transformer language model, we will decode a probability distribution over the next token from each intermediate layer, using the "logit lens" method (Nostalgebraist, 2020). Intuitively, these intermediate distributions represent model predictions after $\ell \in \{1, \ldots, L\}$ layers of processing. In more detail, let $h_\ell^{(i)} \in \mathbb{R}^d$ denote the hidden state of token $t_i$ at layer $\ell$, i.e. the sum of everything up to layer $\ell$ in the residual stream. For a sequence of tokens $t_1, \ldots, t_n \in V$, the logits of the predictive distribution $p(t_{n+1} \mid t_1, \ldots, t_n)$ are given by

$[\mathrm{logit}_1, \ldots, \mathrm{logit}_{|V|}] = W_U \cdot \mathrm{LayerNorm}_L(h_L^{(n)})$,

where $\mathrm{LayerNorm}_L$ is the pre-unembedding layer normalization, and $W_U \in \mathbb{R}^{|V| \times d}$ is the unembedding matrix. The logit lens applies the same unembedding operation to the earlier hidden states $h_\ell^{(i)}$, yielding an intermediate layer distribution $p_\ell(t_{n+1} \mid t_1, \ldots, t_n)$:

$[\mathrm{logit}^\ell_1, \ldots, \mathrm{logit}^\ell_{|V|}] = W_U \cdot \mathrm{LayerNorm}_L(h_\ell^{(n)})$.

This provides a measurement of what predictions the model represents at layer $\ell$, without the need to train a new decoding matrix. It can therefore be interpreted as a form of early exiting (Panda et al., 2015; Teerapittayanon et al., 2017; Figurnov et al., 2017).

Early exiting improves classification performance.
We measure the calibrated accuracies of the intermediate layer distributions $p_\ell$ for the three models and fourteen datasets from Section 2, using context lengths of 40 demonstrations (20 demonstrations for GPT2-XL). We also measure the layerwise accuracies for two toy datasets: "SST-2-A/B", a modification of SST-2 (Socher et al., 2013), and "Unnatural", which extends a task in Rong (2021; section 4). In "SST-2-A/B", we replace the labels (e.g. 'Positive' and 'Negative') with letters 'A' and 'B'. In "Unnatural", demonstrations are of the form "[object]: [label]" and the labels are "plant/vegetable", "sport", and "animal".

Figure 4 displays results for GPT-J, with corresponding plots for GPT2-XL and GPT-NeoX in Figures 9 and 10 in the Appendix. For GPT-J with correct demonstrations, accuracy tends to increase with layer depth, and starts to stagnate or grow more slowly around layer 15. The accuracy for incorrect demonstrations follows a similar trend at the early layers, but then diverges and decreases at the later layers.

For incorrect demonstrations, decoding from earlier layers performs better than decoding from the final layer. For GPT-J, using $p_{16}$ (the first 16 layers) achieves a better accuracy than the full model for 12 out of the 14 datasets, by an average of 8.6 percentage points. This gain on false prefixes comes with a comparatively small cost for true prefixes: 1.6 percentage points (see Table ??). Similarly, for GPT2-XL and GPT-NeoX, the intermediate predictions $p_{30}$ and $p_{27}$ outperform using the full model for 11 out of 14 datasets; however, the magnitude of the effect is smaller: an average of 3 percentage points for GPT2-XL and 0.7 percentage points for GPT-NeoX.

Figure 4: GPT-J early-exit classification accuracies across 16 tasks, given correct and incorrect demonstrations. Plots are grouped by task type: sentiment analysis (a-c), hate speech detection (d-g), paraphrase detection (h-i), natural language inference (j-k), topic classification (l-n), and toy tasks (o-p). Given incorrect demonstrations, zeroing out all transformer blocks after layer 16 outperforms running the entire model, across 14 out of the 16 tasks.

For the toy "Unnatural" dataset, these effects are particularly pronounced (see Figure 4p). At layer 16, the accuracy of GPT-J for incorrect demonstrations reaches 0.91, which is 92% of the final layer accuracy given an accurate prompt. In contrast, at the final layer, GPT-J's accuracy given false demonstrations reaches its lowest value, 0.07.

True and false prefixes sharply diverge at "critical layers". For each model, the accuracies for correct and incorrect demonstrations diverge at the same layers across all datasets. For example, for GPT-J, the accuracy gap between accurate and inaccurate prompts first exceeds 50% (and likewise 45% and 55%) of its final layer value between layers 13 and 14 for 12 out of the 14 datasets. We obtain similar results for GPT-NeoX with layers 10 to 12 and for GPT2-XL with layers 21 to 23.

In summary, zeroing out later layers leads to better classification accuracies given incorrect demonstrations, and the accuracy gap between correct and incorrect demonstrations emerges at a consistent set of layers across datasets.

Figure 5: Examples of attention patterns on incorrect demonstrations from the "Unnatural" dataset, for heads that are label-attending but not class-sensitive (Left), heads that are class-sensitive but not label-attending (Center), and heads that are both label-attending and class-sensitive (Right).
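The logit lens decoding used throughout this section can be illustrated with a small, dependency-free sketch. It applies the final layer normalization and the unembedding matrix to the last-position hidden state at every layer and reports the arg-max token per layer. All names, the 3-dimensional residual stream, the 2-token vocabulary, and the numeric values are hypothetical toy choices; a real implementation would read the hidden states and weights from the model with a tensor library.

```python
import math


def layer_norm(h, gain, bias, eps=1e-5):
    """Standard layer normalization of a single hidden-state vector."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(h, gain, bias)]


def unembed(W_U, h):
    """Multiply the |V| x d unembedding matrix by a d-dimensional vector."""
    return [sum(w * v for w, v in zip(row, h)) for row in W_U]


def logit_lens(hidden_states, W_U, gain, bias):
    """Intermediate logits W_U . LayerNorm_L(h_l) for every layer l.

    hidden_states[l] is the residual stream at the last token position
    after layer l + 1; the final entry gives the full model's logits.
    """
    return [unembed(W_U, layer_norm(h, gain, bias)) for h in hidden_states]


def lens_predictions(hidden_states, W_U, gain, bias):
    """Arg-max token index decoded from each layer."""
    all_logits = logit_lens(hidden_states, W_U, gain, bias)
    return [max(range(len(l)), key=l.__getitem__) for l in all_logits]


# Toy setup: token 0 plays the role of the correct label, token 1 the
# permuted label. The middle layer favors token 0, the final layer token 1.
W_U = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
gain, bias = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
hidden_states = [[2.0, 0.0, -2.0], [-1.0, 0.0, 1.0]]
print(lens_predictions(hidden_states, W_U, gain, bias))
```

The toy values are chosen so that the intermediate layer decodes to the correct token while the final layer does not, mirroring the qualitative pattern reported for false demonstrations.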

4. ZOOMING INTO ATTENTION HEADS

We found that for all datasets, the gap between true and false demonstrations appears in a small set of transformer blocks. We would like to know whether some specific attention heads are responsible for this behavior.

Olsson et al. (2022) introduce induction heads: attention heads that attend to previous occurrences of the present token, and increase the probability of the outputs that follow them. Inspired by this work, we investigate the hypothesis that a small number of "induction heads" play a key role in false context-following, by attending to the labels in previous similar demonstrations and making the model more likely to output them.

For example, in Figure 5, we know that the model assigns a high probability to the mistaken label "sport". According to the hypothesis, this is because of heads that attend to the previous occurrences of "sport" in this context, and increase the probability of that token. The previous occurrences of "sport" share two properties: (1) they are labels in the previous demonstrations, and (2) they follow inputs with the same class as "beet": "tomato" and "garlic".

Therefore, we look for heads that satisfy two conditions when they attend to inaccurate prompts. First, they should be label-attending, i.e. concentrate their attention on labels in the previous demonstrations. Second, they should be class-sensitive, meaning they should attend specifically to those labels that follow inputs in the same class as the latest input. We call heads that are both label-attending and class-sensitive given incorrect demonstrations false prefix-matching heads.

We define a score to identify false prefix-matching heads. For a sequence of demonstrations $(x_i, y_i)$ and a final input $x$, the prefix-matching score ($\mathrm{PMS}_h$) of a head $h$ is:

$\mathrm{PMS}_h = \sum_{i=1}^{n} \mathrm{Att}_h(x, y_i) \cdot \mathbb{1}_{\mathrm{class}(x) = \mathrm{class}(x_i)} - \frac{1}{\#\mathrm{labels} - 1} \sum_{i=1}^{n} \mathrm{Att}_h(x, y_i) \cdot \mathbb{1}_{\mathrm{class}(x) \neq \mathrm{class}(x_i)}$.

Prefix-matching heads should have a high PMS.
We therefore plot the distribution of prefix-matching scores across layers for these heads. Figure 6 shows these results for the "Unnatural" dataset. For each model, the scores remain low at the early layers, but then increase around the "critical layers" that we identified in the previous section. This lends correlational support to our hypothesis that false prefix-matching heads cause false context-following behavior.

Ablating false prefix-matching heads. However, we are interested in causal evidence. Therefore, we check whether removing false prefix-matching heads reduces false context-following. We select the 10 heads from GPT-J with the highest prefix-matching scores given incorrect demonstrations on the "Unnatural" dataset. We ablate these heads by setting their keys, queries and values to zero. We then evaluate the resulting lesioned model on all 14 datasets, and compare its layerwise performance to the original model's. As a control baseline, we also perform the same analysis for 10 heads selected at random.

The ablations considerably increase the accuracy given false demonstrations: they reduce the gap in accuracy between accurate and inaccurate prompts by an average of 23.2% for $k = 40$ and 39.3% for $k = 10$ (see Table 1). In contrast, ablating random heads reduces the gap by 3.6% for $k = 40$ and 13.4% for $k = 10$. While they greatly improve the accuracy given a false prefix, our ablations have a comparatively small effect on the accuracy given correct demonstrations: ablating the false prefix-matching heads decreases the accuracy given true demonstrations by 2.2% for $k = 40$ and by 1.3% for $k = 10$. These results show that the false prefix-matching heads cause a large fraction of the false context-following behavior.

Analysing the outputs of false prefix-matching heads. We identified false prefix-matching heads based only on their attention patterns.
However, our postulated mechanism also depends on the heads' outputs: they must increase the probability of the labels they attend to. We therefore study the outputs of these heads to understand how they affect the residual stream. We apply the logit lens to each head individually, by applying layer normalization followed by the unembedding matrix to its outputs. This tells us how much the head increases or decreases the intermediate logits of each token. For every head, we measure the difference between the logit increases of the permuted and correct labels on the Unnatural dataset, following the methodology of Wang et al. (2022). Our 10 false prefix-matching heads have an average score of 1.2, which shows that they increase the logits of the permuted label more than those of the correct label. In contrast, when sampling 100 sets of 10 random heads, we find an average score of 0.25, with a standard deviation of 0.4. Therefore, false prefix-matching heads directly increase the probability of the permuted labels relative to the correct labels more than random heads do.
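As a concrete illustration of the prefix-matching score defined in this section, the sketch below evaluates it for a single head from the attention weights that the final input places on each demonstration label, and then ranks heads by score. The function names, the input format, and the attention values are hypothetical; extracting the attention weights from a real model is outside the scope of this sketch.

```python
def prefix_matching_score(attn_to_labels, demo_classes, query_class, num_labels):
    """PMS for one head on one prompt.

    attn_to_labels[i]: attention from the final input x to label y_i.
    demo_classes[i]:   true class of demonstration input x_i.
    """
    same = sum(a for a, c in zip(attn_to_labels, demo_classes)
               if c == query_class)
    other = sum(a for a, c in zip(attn_to_labels, demo_classes)
                if c != query_class)
    return same - other / (num_labels - 1)


def top_heads(scores, k):
    """Return the k (layer, head) pairs with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


classes = ["plant/vegetable", "plant/vegetable", "sport", "animal"]

# A class-sensitive head: most attention goes to labels of same-class inputs.
sensitive = prefix_matching_score([0.4, 0.3, 0.1, 0.05], classes,
                                  "plant/vegetable", num_labels=3)

# A head that attends to all labels equally, on a class-balanced prompt.
uniform = prefix_matching_score([0.2, 0.2, 0.2],
                                ["plant/vegetable", "sport", "animal"],
                                "plant/vegetable", num_labels=3)

# Hypothetical (layer, head) identifiers, for illustration only.
scores = {(17, 3): sensitive, (2, 5): uniform}
print(sensitive, uniform, top_heads(scores, k=1))
```

One way to read the $1/(\#\mathrm{labels} - 1)$ factor, under the assumption of a class-balanced prompt, is that a head spreading its attention evenly over all labels receives a score of zero, so only class-sensitive attention is rewarded.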

5. DISCUSSION AND RELATED WORK

In this paper, we showed how stopping language models early by zeroing out their later layers improves classification performance given inaccurate contexts, without requiring any additional training. We also identified attention heads that contribute to the effect of misleading prompts, and showed that ablating these heads mitigates this effect.

Table 1: Ablating false prefix-matching heads recovers a large fraction of the accuracy gap between true and false prefixes, without hurting performance given true prefixes. We show the percentage reduction of the accuracy gap and percentage change in true prefix performance when ablating the 10 false prefix-matching heads chosen using the Unnatural dataset ("top") or 10 random heads ("random"). We bold gap reductions when they are greater for our heads than for the random heads.


Related work. Our work is closely related to Min et al. (2022b) and Kim et al. (2022), who examine the role of false demonstrations on model accuracy. Min et al. (2022b, Figure 4) find that for classification by a pre-trained model (GPT-J), the ground truth of demonstrations has a large effect on the accuracy. However, they do not find such an effect with a meta-tuned model (Min et al., 2022a). Therefore, meta-tuning could serve as a "negative control" to test our hypothesis that false prefix-matching heads cause false context-following: an interesting future direction would be to check whether meta-tuning reduces the number of false prefix-matching heads.

The literature on early-exiting and overthinking (Kaya et al., 2018; Panda et al., 2015; Teerapittayanon et al., 2017; Figurnov et al., 2017; Hou et al., 2020; Liu et al., 2020; Xin et al., 2020; Zhou et al., 2020; Zhu, 2021; Schuster et al., 2022) also highlights how decoding from intermediate layers can save compute and sometimes produce better results. One major difference is that most of these methods rely either on modifying the training process to allow for early-exit, or on training additional probes to decode intermediate states. In contrast, the logit lens does not require any extra training to decode answers from internal representations.

How does the logit lens compare to probing? Our work, especially Section 3, relies heavily on the "logit lens" (Nostalgebraist, 2020). We find it useful to think of this method in comparison to probing. If a layer has a high probing accuracy, this means that the correct answer can be decoded from the hidden states. However, this is often a low bar to clear, especially when the classification task is easy and the hidden states are high-dimensional (Hewitt & Liang, 2019). In contrast, if a layer has a high logit lens accuracy, this shows that it encodes correct answers along a direction in the residual stream that the model subsequently decodes from, which is much more informative. On the other hand, a low logit lens performance at a layer does not imply that the correct answers cannot be decoded from that layer. One intermediate between probing and zeroing out later layers is the "tuned lens" (Ostrovsky et al., 2022): instead of training probes on each classification task or directly using the final layer's decoding matrix, for each layer the authors train a new square adapter matrix between the residual stream and the unembedding matrix on a language modelling dataset such as the Pile (Gao et al., 2020). It would be interesting to run our experiments with this alternative decoding method.

Future work. While we find consistent results across 14 datasets, our experiments are restricted to a specific setting: text classification with a large number of incorrect few-shot examples. We also studied variations on our setting, with qualitatively similar results (Figure 14): when the demonstration labels are selected at random rather than according to a permutation, and when only half of the demonstration labels are incorrect. In the future, researchers could use the logit lens to study diverse real-world failures of large models, such as "prompt injection" (Branch et al., 2022) or vulnerable code completions (Pearce et al., 2022). In both of these cases, the model outputs inaccurate or harmful completions even though it is capable of producing correct ones given a better prompt.

In addition, our head ablations do not recover the entirety of the accuracy gap between accurate and inaccurate prompts. This could be because we did not identify some of the model components that cause false context-following. However, there is another possibility: if an attention head's outputs are on average far from zero, zeroing out that head takes the intermediate states off-distribution, which can decrease overall model performance. Thus, one promising future direction would be to replace head outputs by their value on different inputs, as in Meng et al. (2022).

Relatedly, while we identified induction heads that increase the probability of the wrong answers, we still do not have a full mechanistic understanding of false context-following behavior. For example, we do not know which heads in the earlier layers compose to form these induction heads. Future work could build on our methodology to reverse-engineer circuits (Cammarata et al., 2020) in GPT models that implement false context-following.

In this work we showed how studying the early stages of computation can mitigate the effects of misleading prompts. We hope this will spur further work on auditing network internals to detect dishonest behavior in models.

A APPENDIX

A.1 CALIBRATION

For $k$-way tasks, we measure how often the correct label has a higher probability than the $(k-1)/k$ quantile of its probability over the dataset. In Figure 12, we show the uncalibrated logit lens accuracies of GPT-J over the 16 datasets, and confirm that they are similar to the accuracies with calibration, albeit a bit noisier.

Figure 8: The probability of the label "True" for 30 random test inputs in MRPC (x-axis: P(True), from 0 to 1). The "True" class is marked with green dots and the "False" class is marked with red dots. As observed in Zhao et al. (2021), the model can be biased towards one of the labels.

A.2 LOGIT LENS RESULTS FOR THE OTHER MODELS

We plot the logit lens results for GPT2-XL and GPT-NeoX in Figure 9 and Figure 10.

Given incorrect demonstrations, prompt formats 1, 2, 3, 4, 5, 7, 8, 9, 10, and 13 experience an increase in performance before experiencing a decline. Prompt formats 6, 12, 14, and 15, on the other hand, do not exhibit this effect. Prompt format 11 produces poor performance, given both correct and incorrect demonstrations. See Table 5 for prompt format details.

Figure 14: Average GPT-J layerwise accuracies for the original model, the null ablation, and our ablation in variants of our setup: when half of the 40 demonstrations are true and half have permuted labels (a), when each demonstration's label is chosen at random among the incorrect labels (b), and when each demonstration's label is chosen at random among all the labels (c). We find qualitatively similar results in these different settings.

A.4 LOGIT LENS RESULTS FOR GPT-J WITHOUT CALIBRATION

Table 3: The prompts used for paraphrase detection, natural language inference, and topic classification. The prompts for MedQ-Pairs, MRPC, SICK, and RTE are taken from Min et al. (2022b), and the prompts for AGNews, TREC, and DBPedia are taken from Zhao et al. (2021). We show one training example per task for illustration.



Figure 2: Left: The difference in accuracy between accurate and inaccurate prompts increases with the number of demonstrations. Right: As the number of false demonstrations increases, the model chooses the permuted label $\sigma(\mathrm{class}(x))$ more often than the other labels, rather than making random errors.

Figure 3: Average performance across 14 tasks for GPT2-XL, GPT-J, and GPT-NeoX. y-axis (left): Calibrated accuracy given correct and incorrect demonstrations, denoted by full lines. y-axis (right): Percentage of tasks where zeroing out all succeeding transformer blocks is superior to full model evaluation, denoted by dashed lines. Early-exiting is effective given false demonstrations, and perhaps more surprisingly, also effective given correct demonstrations.

Figure 6: Sum of prefix-matching scores given true and false demonstrations, for GPT2-XL (a), GPT-J (b), and GPT-NeoX (c) on the toy Unnatural dataset. The prefix-matching scores increase around the layers where the accuracy gap (averaged over tasks) between true and false demonstrations emerges.

Figure 7: Ablating false prefix-matching heads increases accuracy across multiple layers. (a), (b): Average accuracy at each layer before and after ablating false prefix-matching or random heads, given correct and incorrect demonstrations. (c), (d): Accuracy at each layer for incorrect demonstrations on AGNews and Unnatural, after ablating the $k$ most class-sensitive heads, for $k \in \{5, 15, 35, 75\}$.

Figure 9: GPT2-XL early-exit classification accuracies across 16 datasets, given correct and incorrect demonstrations. Given incorrect demonstrations, zeroing out all transformer blocks after layer 30 outperforms running the entire model on 13 out of 16 datasets. In all datasets, running the entire model is never superior to the maximum performance over the preceding layers.

Figure 12: GPT-J early-exit uncalibrated classification accuracies across 16 tasks, given correct and incorrect demonstrations. The lack of calibration makes the results noisier, especially at early layers, but early-exit still generally outperforms running the full model.

A.7 PROMPT FORMATS USED FOR ALL DATASETS

Table 2: The prompts used for sentiment analysis and hate speech detection. The prompt used for SST-2 is taken from Zhao et al. (2021), and the prompts used for Poem-Sentiment, Financial-Phrasebank, Ethos, TweetEval-Hate, TweetEval-Atheism, and TweetEval-Feminist are taken from Min et al. (2022b). We show one training example per task for illustration.

