kNN PROMPTING: BEYOND-CONTEXT LEARNING WITH CALIBRATION-FREE NEAREST NEIGHBOR INFERENCE

Abstract

In-Context Learning (ICL), which formulates target tasks as prompt completion conditioned on in-context demonstrations, has become the prevailing utilization of LLMs. In this paper, we first disclose an actual predicament for this typical usage: it cannot scale up with training data due to the context length restriction. Besides, existing works have shown that ICL also suffers from various biases and requires delicate calibration treatment. To address both challenges, we advocate a simple and effective solution, kNN Prompting, which first queries the LLM with training data for distributed representations, then predicts test instances by simply referring to their nearest neighbors. We conduct comprehensive experiments to demonstrate its two-fold superiority: 1) Calibration-Free: kNN Prompting does not directly align the LLM output distribution with the task-specific label space, but instead leverages such distributions to align test and training instances. It significantly outperforms state-of-the-art calibration-based methods under the comparable few-shot scenario. 2) Beyond-Context: kNN Prompting can further scale up effectively with as many training data as are available, continually bringing substantial improvements. The scaling trend holds across 10 orders of magnitude ranging from 2 shots to 1024 shots as well as different LLM scales ranging from 0.8B to 30B. It successfully bridges data scaling into model scaling, and brings new potential to the gradient-free paradigm of LLM deployment. Code is publicly available.

1. INTRODUCTION

Large language models (LLMs), when scaled up to billions of parameters, have demonstrated remarkable capabilities in a wide range of NLP tasks (Radford et al., 2019; Brown et al., 2020). However, such models are prohibitively expensive to train with most research- or consumer-level devices, though some of them are already publicly available (Zhang et al., 2022). As a result, an emerging paradigm is that LLMs are hosted in a remote data center and accessed by end users or applications via simple API requests. The typical usage of LLMs under such a paradigm is In-Context Learning, where the LLM reads and completes a prompt sequence just as it was pretrained to do on massive text corpora. The prompt is constructed by concatenating several training examples and a test instance, and the prediction is obtained by mapping the LLM word continuations back to the label space. It is widely investigated and acknowledged that modern neural networks generally perform better with increased training data. Specifically, there exists a power law between expected model performance and available data scale (Hestness et al., 2017; Rosenfeld et al., 2020). For ICL, it is also empirically observed that performance continually improves as more training examples are prepended into the prompt (Brown et al., 2020). However, such improvements are quickly halted by the context length restriction, as language models are designed and trained to process sequences only within a fixed length, typically 1024 or 2048 tokens. In order to utilize more training data, several works select the most relevant examples to compose the prompt before querying the LLM (Liu et al., 2022b; Rubin et al., 2022), but still only the in-context examples actually participate in LLM inference while most training data are discarded beforehand, providing marginal data scaling benefits.
Besides, their reliance on an external retriever incurs further complications. As a consequence, this situation poses a serious challenge for many practical scenarios where more than a few training examples are available. Another vulnerability of ICL is the severe bias in the output distribution of LLMs, which results in considerable performance degradation (Holtzman et al., 2021) and instability (Lu et al., 2022), as shown in existing works. Accordingly, many have proposed various ways to calibrate the output distribution (Zhao et al., 2021; Jiang et al., 2021; Min et al., 2022a). For example, Zhao et al. (2021) measure such bias by probing the LLM with an "NA" example and recording the corresponding prior. However, as LLMs are pretrained on general-domain natural language, their capability to complete a fabricated prompt is essentially not aligned with the downstream task-specific label space. As a consequence, such calibration-based methods can only alleviate the bias to a limited extent. In this paper, we advocate a simple and effective solution, kNN Prompting, to address both challenges. Specifically, we split the training data into a demonstration set and an anchor set. We append each anchor example to the prompt and query the LLM; then, instead of aligning word continuations with labels, we collect the language modeling probability as a distributed representation and cache it in a local datastore. At inference time, for each test instance, we similarly obtain its representation and match it against the maintained datastore to make predictions. In general, the proposed framework enables both calibration-free optimization, because it avoids forced input-label alignment, and beyond-context learning, because the anchor set allows utilization of unlimited training data.
We conduct comprehensive experiments on 10 established text classification tasks to demonstrate the significant superiority of kNN Prompting across various scenarios and against competitive baselines: 1) Under the few-shot scenario, where training data is very limited and fits in the context, kNN Prompting outperforms state-of-the-art calibration-based methods by a considerable margin (up to +7.07). 2) Under low-resource or fully supervised scenarios, where the training data cannot fit in the context, kNN Prompting further exhibits its major advantage. It can effectively scale up with as many training data as are available across 10 orders of magnitude (2 shots to 1024 shots, see Figure 1 for illustration) as well as different LLM scales (0.8B to 30B). Specifically, with only 32 shots of training data, it improves ICL by as much as +13.58 in average score, and achieves absolute improvements of up to +18.84 under the fully supervised setting. We also provide a formal explanation of the intrinsic mechanism of its effectiveness, as well as detailed analyses regarding its robustness and design choices. With these appealing properties, kNN Prompting is a promising solution that bridges the benefits of data scaling into model scaling and takes the gradient-free paradigm of LLM deployment one step further.

2. BACKGROUND: IN-CONTEXT LEARNING

In this section, we formulate the task and recap the ICL baseline. Assume a target task with training set T = {(x_i, y_i)} and categorical label space Y. At inference time, the model is asked to predict y_test given a test instance x_test. We denote an LLM parameterized by θ that is pretrained with a standard language modeling objective. At deployment, it likewise predicts a probability distribution p(w_t | w_<t, θ) for the next token at the t-th position conditioned on the previous context w_<t. In-context learning first formulates training examples {(x_i, y_i)} in the format of input-label mappings via intuitive templates (see Appendix F for illustration), and concatenates them into a natural language sequence along with the test instance to construct a prompt:

P = π(x_1, y_1) ⊕ π(x_2, y_2) ⊕ ... ⊕ π(x_|T|, y_|T|) ⊕ π(x_test, *)    (1)

where π denotes the template-based transformation and ⊕ denotes concatenation. Note that π implies a verbalization process that maps the label space Y to corresponding tokens V picked from the LM vocabulary. When queried with the prompt P, the LLM will try to mimic the prepended training examples in the context and predict a probability distribution p(v | P, θ) for the next token v. We then map it back to the label space Y as the prediction:

ŷ_test = arg max_{y ∈ Y} p(v | P, θ), where y is verbalized as v via π    (2)

Figure 2 provides a pilot study showing that when the prompt P includes more demonstrations, performance consistently improves, in accord with the power law of data scaling investigated in many existing studies (Hestness et al., 2017; Rosenfeld et al., 2020). However, this trend is then halted by the context length restriction. We provide more comprehensive statistics in Table 1 and Appendix A.

Figure 2: ICL improves with the number of training examples but is limited by the context length restriction.
In conclusion, this situation poses an actual challenge in many scenarios where one would collect training examples beyond a few shots, into the dozens, and expect improved performance, but the power law fails.
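The ICL pipeline of Equations 1 and 2 can be sketched in a few lines. This is a minimal, illustrative sketch: the template, the verbalizer, and the stub LM below are stand-ins for a real LLM API, not the paper's code.

```python
def template(x, y=None):
    # pi: template-based transformation of an (input, label) pair into text.
    label = y if y is not None else ""
    return f"Review: {x}\nSentiment: {label}"

def build_prompt(demos, x_test):
    # Equation 1: concatenate templated demonstrations with the test instance.
    return "\n\n".join([template(x, y) for x, y in demos] + [template(x_test)])

def icl_predict(next_token_probs, demos, x_test, verbalizer):
    # Equation 2: choose the label whose verbalized token is most probable.
    probs = next_token_probs(build_prompt(demos, x_test))
    return max(verbalizer, key=lambda y: probs.get(verbalizer[y], 0.0))

# Stub LM standing in for a real API; it happens to favor the token "positive".
stub_lm = lambda prompt: {"positive": 0.6, "negative": 0.3}
pred = icl_predict(stub_lm, [("a great movie", "positive")], "loved it",
                   {"positive": "positive", "negative": "negative"})
```

The verbalizer dict makes explicit the mapping from label space Y to tokens V that π implies.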

3. kNN PROMPTING

In this section, we introduce the kNN Prompting framework. Given a training set T, we split it into two parts with respective usages: T = D ∪ A, i.e., a demonstration set D = {(x_i^d, y_i^d)} and an anchor set A = {(x_i^a, y_i^a)}. kNN Prompting consists of two stages, namely meta test and formal test; the overall framework is illustrated in Figure 3.

Meta Test

We build a datastore that caches all anchor examples in A for later inference-time usage. For each x_i^a, we construct a prompt P_i whose prefix is built from the demonstration set in the same way as Equation 1:

P_i = π(x_1^d, y_1^d) ⊕ π(x_2^d, y_2^d) ⊕ ... ⊕ π(x_|D|^d, y_|D|^d) ⊕ π(x_i^a, *)    (3)

By querying the LLM with P_i, we obtain the distribution p(v | P_i, θ). Instead of mapping it back to the label space Y, we cache the entire language modeling distribution as the key representation:

k_i = p(v | P_i, θ)    (4)

Accordingly, the label y_i^a is the value. The entire datastore thus consists of pairs {(k_i, y_i^a)}; we denote the set of keys as K.

Formal Test

At inference time, for each test instance x_test, we construct the prompt in the same way as Equation 1 and obtain the distribution p_test = p(v | P_test, θ). We then match this distribution against the cached keys K in the datastore, where the standard KL divergence is used to measure distance:

D_KL(p_test || k_i) = Σ_v p(v | P_test, θ) log [ p(v | P_test, θ) / p(v | P_i, θ) ]    (5)

The prediction is calculated by aggregating the k nearest neighbors:

ŷ_pred = arg max_{y ∈ Y} Σ_{i ∈ NN_k(p_test, K)} 1(y_i^a = y)    (6)

where NN_k(*, K) denotes the set of k nearest neighbors in K.
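The two stages can be sketched as follows. This is a minimal sketch under assumptions: `lm_probs` is a hypothetical interface standing in for one LLM query that returns the next-token distribution, and the toy stub below replaces a real model.

```python
import math
from collections import Counter

def kl_divergence(p, q, eps=1e-12):
    # Equation 5: D_KL(p || q) between two next-token distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def build_datastore(lm_probs, demos, anchors):
    # Meta test: one LLM query per anchor; cache (key distribution, label).
    return [(lm_probs(demos, x_a), y_a) for x_a, y_a in anchors]

def knn_predict(lm_probs, demos, datastore, x_test, k=3):
    # Formal test (Equation 6): majority vote among the k nearest anchors.
    p_test = lm_probs(demos, x_test)
    nearest = sorted(datastore, key=lambda kv: kl_divergence(p_test, kv[0]))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Toy stub: positive-looking inputs get a [0.7, 0.3] next-token distribution.
stub = lambda demos, x: [0.7, 0.3] if x in {"good", "great", "nice"} else [0.2, 0.8]
store = build_datastore(stub, [],
                        [("good", 1), ("great", 1), ("bad", 0), ("awful", 0)])
pred = knn_predict(stub, [], store, "nice", k=3)
```

Note that the anchors are queried only once during the meta test; the formal test needs one query per test instance plus a cheap nearest-neighbor search.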

4.1. SETUP

We use 10 established text classification datasets: SST2 (Socher et al., 2013), SUBJ (Pang & Lee, 2004), MPQA (Wiebe et al., 2005), AGNews (Zhang et al., 2015), CB (De Marneffe et al., 2019), CR (Hu & Liu, 2004), DBPedia (Zhang et al., 2015), MR (Pang & Lee, 2005), RTE (Dagan et al., 2005) and TREC (Voorhees & Tice, 2000). For each dataset, we devise an intuitive prompt template (Appendix F); other relevant statistics are listed in Appendix A. We investigate a wide range of LLM scales, including GPT2 (0.8B and 1.5B) (Radford et al., 2019) and the OPT (Zhang et al., 2022) series (3B-30B). GPT2 XL is used for most analyses unless explicitly indicated. We uniformly set the number of neighbors k to 3. There are no other hyper-parameters, as the entire framework is training-free. We run with 5 different random seeds for Figure 5 and Figure 6, and 10 seeds for all other results. Mean and standard deviation are reported.

4.2. DATA UTILITY

In this paper, we refer to data utility as whether and how performance scales with increased training data. It can be formally depicted by the power law (Hestness et al., 2017):

ε(m) ∝ α m^β    (7)

where m is the training data size, α is a constant scaling factor, and β is the exponent reflecting data utility. kNN Prompting significantly outperforms competitive baselines under strictly comparable settings (m ≤ 8): specifically, +3.56 for 4-shot and +7.07 for 8-shot.
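As a minimal sketch, the exponent of Equation 7 can be recovered from measured scores by ordinary least squares in log-log space; the function name and the synthetic data below are ours, not the paper's.

```python
import math

def fit_power_law(ms, errs):
    # Fit eps(m) = alpha * m**beta (Equation 7) via least squares in log-log space.
    xs = [math.log(m) for m in ms]
    ys = [math.log(e) for e in errs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = math.exp(mean_y - beta * mean_x)
    return alpha, beta

# Synthetic check: an error curve following an exact power law is recovered.
shots = [2 ** i for i in range(1, 11)]          # 2 ... 1024 shots
errors = [0.5 * m ** -0.3 for m in shots]       # alpha = 0.5, beta = -0.3
alpha, beta = fit_power_law(shots, errors)
```

A larger |β| (for error curves, more negative) corresponds to better data utility, which is how the scaling comparisons in this section can be quantified.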

Superiority of Whole LM Distribution

Calibration-based methods (Zhao et al., 2021) as well as standard ICL only access label words instead of the whole LM distribution, which is inferior in two aspects: 1) loss of information: the LLM always generates a distribution over all words in the entire vocabulary, and the probabilities of non-label words also reflect its understanding from certain perspectives; 2) multiple label words competing with each other: there exist various choices of label words but no oracle rule to select one, and the alternative choices potentially compete with the selected label words, distorting the label space distribution. This is also referred to as surface form competition (Holtzman et al., 2021). In Table 2 we verify this benefit by masking out non-label words (referred to as Partial).

4.2.2. DATA UTILITY BEYOND THE CONTEXT

We then investigate a major advantage of kNN Prompting: scaling up to more training examples than would otherwise fit in the context. We increase m to 128 and compare with: 1) ICL Ensemble, an intuitive way to scale ICL up that has been adopted in previous works (Jiang et al., 2020); 2) finetuning of a standard PLM such as BERT or GPT Large, which can produce meaningful results with this amount of data. For all methods, we append at most M_T examples to the prompt P. For ICL Ensemble, we split T into multiple non-overlapping demonstration sets T = D_1 ∪ D_2 ∪ ... ∪ D_N to construct different prompts, and ensemble their predictions. Results in Table 3 show that kNN Prompting continues to improve and outperforms ICL and its ensemble by +16.96 and +16.08 respectively (0.8B model, average score). Besides, the ensemble baseline is also very inefficient: assuming M_T = 8, we need to query 128/8 = 16 times for every test instance, and this grows linearly as we use more training data, eventually becoming prohibitively inefficient and unable to scale either. kNN Prompting also outperforms FT when the adopted LLM scales above 6B (82.73). By contrast, ICL Ensemble would require the LLM to scale above 30B (82.45), and standard ICL even larger.

Comparison to Demonstration Selection

A line of related works tries to utilize available training data by first retrieving the most relevant examples from the entire training set, then selectively composing the prompt before querying the LLM (Liu et al., 2022b; Rubin et al., 2022). We reproduce such methods following Liu et al. (2022b): we employ state-of-the-art general-purpose sentence encoders to represent test and training instances and compute their cosine similarity; the most similar M_T examples are selected to construct the prompt P.
These retrieval models include BM25 (Trotman et al., 2014), Sentence-BERT (Reimers & Gurevych, 2019), SimCSE (Gao et al., 2021) and Trans-Encoder (Liu et al., 2022a). Figure 6 shows that although such methods indeed exhibit marginal scaling benefits, they are nowhere near competitive against kNN Prompting. Full results are listed in Appendix C.3.
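A minimal sketch of such similarity-based selection, with a hypothetical `encode` lookup standing in for the sentence encoders above; the vectors and example texts are illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two encoding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_demonstrations(encode, train_set, x_test, m_t):
    # Keep the m_t training examples whose encodings are most similar to x_test.
    q = encode(x_test)
    return sorted(train_set, key=lambda ex: cosine(encode(ex[0]), q),
                  reverse=True)[:m_t]

# Toy 2-d "encoder" standing in for Sentence-BERT / SimCSE etc.
toy_vecs = {"the film was fun": [0.9, 0.1], "stocks fell today": [0.1, 0.9],
            "a joyful comedy": [0.8, 0.2], "i loved this movie": [0.85, 0.15]}
encode = lambda s: toy_vecs[s]
train = [("the film was fun", "positive"), ("stocks fell today", "negative"),
         ("a joyful comedy", "positive")]
picked = select_demonstrations(encode, train, "i loved this movie", m_t=2)
```

The selected examples would then fill the prompt of Equation 1; note that, unlike the anchor set of kNN Prompting, all non-selected training examples are simply discarded.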

4.3.2. SPLIT STRATEGY BETWEEN DEMONSTRATIONS AND ANCHORS

For the fully supervised scenario where m ≫ M_T, we can simply set |D| to its maximum M_T. Otherwise, we need to deliberate the trade-off between the demonstration set and the anchor set. In Figure 13 we search through all possible combinations given that |A| + |D| ≤ M_T. The left part of the heatmap, i.e., |A| ∈ {1, 2}, generally performs worse. This means a larger |A| contributes more to performance, while the choice of |D| is relatively more robust. This conclusion also corresponds to the few-shot results: the anchor set yields better data utility than context concatenation. More datasets are provided in Appendix C.5.

4.3.3. QUALITATIVE ANALYSES AND REASONS OF EFFECTIVENESS

We first formally organize the explanation according to Figure 9:

• The output language modeling (LM) distribution of the LLM is essentially not well aligned with the task-specific label space, resulting in the inferior performance (83.5 test accuracy) of default ICL.

• If we similarly performed inference on the anchor set, we would expect approximately 83.5 anchor accuracy under an i.i.d. data assumption. However, we already know each anchor's gold label, which actually gives 100 anchor accuracy.

• The LM distribution is inferior for making direct predictions, but superior for matching examples, because it entails distributional, delicate and comprehensive representations generated by the LLM. kNN Prompting leverages such representations (83.5 accuracy) only for matching, and refers to the gold labels (100 accuracy) for predicting, thus successfully transferring part of the knowledge originating from anchor labels to test instances.

In the visualization, the representations generally exhibit a partially clustered pattern. We can identify a proportion of examples that get entangled with different categories and crowd together (Cases A, C, D); these confusing cases are likely to cause erroneous predictions in ICL and correspond to the underperforming 83.5 accuracy mentioned above. Specifically, Case A is an abstract about a novelist and should belong to the category artist, but it is easily confused with the category book under ICL because the context does mention books. By contrast, kNN Prompting predicts correctly by referring to similar anchors that are also about novelists and their books (as listed in the table). Some of those anchors are themselves incorrectly predicted as book, but this no longer matters because kNN Prompting only uses the distribution for nearest neighbor search and refers to the gold labels for prediction.
Besides, since we can clearly see how each prediction is made, i.e., which anchor examples are referred to, the proposed method also offers explainability as a further advantage.

5. RELATED WORKS

Large language models, first scaled up to hundreds of billions of parameters by Brown et al. (2020) and followed by several others (Rae et al., 2021; Zhang et al., 2022; Chowdhery et al., 2022; Cohen et al., 2022; Smith et al., 2022), have become the most prominent direction in NLP. Although these models exhibit surprisingly powerful and even emergent capabilities in a wide range of NLP tasks (Wei et al., 2022b; Hendrycks et al., 2020; Srivastava et al., 2022), they are prohibitively expensive for most researchers or users to train or even host. In-Context Learning, which adapts LLMs to various tasks while requiring no training, has therefore become the typical usage, as popularized by Brown et al. (2020). Similar ideas of formulating target tasks as natural language sequences can also be found in earlier works (Trinh & Le, 2018; Raffel et al., 2020). To better exploit LLMs in various scenarios, developing augmented methods for ICL has become a crucial problem (Dong et al., 2022). kNN is a classical machine learning algorithm (Fix & Hodges, 1989) well known for its simplicity, and it has inspired a wide range of applications (Papernot & McDaniel, 2018; Orhan, 2018). In the field of NLP, Kaiser et al. (2017) construct a differentiable memory module for nearest neighbor search, which improves generalization to rare events. Similar ideas have also been explored for generation tasks (Guu et al., 2018), such as dialog generation (Weston et al., 2018). Khandelwal et al. (2020) consider an unsupervised corpus as a datastore, retrieving from it and interpolating with the current-step language modeling probability. The retrieved corpus can also serve as references for knowledge-intensive tasks, but the retriever then needs explicit training for such a purpose (Lewis et al., 2020; Borgeaud et al., 2022; Izacard et al., 2022).
Different from these works, kNN Prompting is suitably situated in the gradient-free paradigm of LLM deployment, which avoids calibration treatment and effectively bridges data scaling into model scaling.

6. DISCUSSION

Under the existing ICL paradigm, it is often impossible to enjoy both the capability of LLMs and the data utility of finetuning, i.e., model scaling and data scaling. kNN Prompting offers an effective solution that promises both. Nevertheless, we assume its data utility should still be inferior to specialized finetuning of LLMs given sufficient computational resources in an ideal setting. We believe that further approaching this upper bound is an important and promising direction and expect it to attract more interest in future works. A potential concern for retrieval-based models is their efficiency, especially when a corpus-level datastore is utilized.

ETHICAL CONSIDERATIONS

This work is built upon the ICL paradigm and involves querying LLMs for responses. These models might generate content with potential ethical risks regarding fairness and bias (Bommasani et al., 2021; Blodgett et al., 2020), depending on the specific downstream task. Although the scope of this paper remains on how to better exploit LLMs for task performance, it is worth further discussion to combine the proposed framework with well-established methods that can measure (Nadeem et al., 2021) and mitigate (Nadeem et al., 2021; Gupta et al., 2022) such ethical risks.

A DATASET STATISTICS: MAX SHOT IN CONTEXT

We provide detailed statistics about the maximum number of shots that fit in context in Table 5, i.e., M_T for each task, corresponding to Table 1 in the main manuscript.

B MORE ANALYSES B.1 DISTANCE MEASUREMENT

We investigate euclidean distance as an alternative distance measure, which has also been explored in Khandelwal et al. (2020). We take the contextual representation h from the LLM and denote the distance as D_L2(h_test, h_i). Table 6 shows that both measures are effective, but D_KL (Equation 5) based on the LM distribution p is superior. In fact, p is a projection of h through the word embedding matrix; we believe this projection exploits the well-structured word embeddings of the LLM to provide more disentangled representations, which thus better serve as the anchor space.
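A toy sketch of the two measurements, assuming a random matrix in place of the LLM's output word-embedding matrix; all shapes and names here are illustrative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def l2_distance(h1, h2):
    # D_L2: euclidean distance directly between contextual hidden states h.
    return float(np.linalg.norm(h1 - h2))

def kl_distance(h1, h2, emb):
    # D_KL: project h through the word-embedding matrix to get the LM
    # distribution p, then compare distributions (Equation 5).
    p, q = softmax(emb @ h1), softmax(emb @ h2)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))   # toy vocabulary of 50 words, hidden size 8
h_a, h_b = rng.normal(size=8), rng.normal(size=8)
```

Both distances operate on the same hidden states; the difference is only whether the embedding projection is applied before measuring.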

Conclusion

We further explore the robustness of kNN Prompting regarding its reliance on prior knowledge of the target distribution. We show that among the investigated baselines, ICL and ContextualCalibration (Zhao et al., 2021) are greatly impacted by prior knowledge of the test distribution, while kNN Prompting and NoisyChannel (Min et al., 2022a) are much more robust.

Experimental Setting

We investigate various combinations of prior distributions by controlling the imbalance ratios λ_train and λ_test. For every setting, we include 5 binary classification tasks (SST2, MPQA, SUBJ, MR, CR) and run with 10 random seeds; we report the average score of these results. Specifically, λ_train = 0.125 means one category (positive) accounts for 12.5% of the entire train set, and λ_train = 0.5 corresponds to the balanced setting. We investigate both λ_train < 0.5 and λ_test < 0.5, which results in three different settings. ContextualCalibration explicitly includes a prior distribution; in such imbalanced scenarios, one can either use the default assumption (pos : neg = 1 : 1, as designed in the original paper) or the observed assumption from the train set (pos : neg = λ_train / (1 - λ_train)). We refer to these as Balanced Prior and Trainset Prior respectively. Note that the test distribution is inaccessible, so we cannot use it. The other methods (ICL, NoisyChannel and kNN Prompting) do not technically incorporate any prior knowledge, so they are not concerned with this investigation dimension.

Analyses

The results are reported in Table 7. For ICL, the LLM naturally suffers from biases learned in the pretraining stage and is thus vulnerable to any shifted prior distribution. Its performance greatly degrades in Setting B (-22.06) and Setting C (-22.17). ContextualCalibration, by design, necessarily requires a prior distribution to rectify the LLM-predicted label word probabilities. If the prior knowledge does not match the (inaccessible) test distribution, its performance will be greatly degraded.
Specifically, if the Balanced Prior is consistent with the test distribution, the method performs well (Setting A, -1.77); otherwise, performance degrades (Setting B, -20.24 and Setting C, -12.65). Similarly, if the Trainset Prior is consistent with the test distribution, the method performs well (Setting C, +8.65); otherwise, performance degrades (Setting B, -20.24 and Setting A, -18.92). For NoisyChannel, performance is rather robust (-2.60/-0.63/-7.87 respectively in Settings A, B, C). By reformulating ICL to compute the conditional probability of the input given the output, it is indeed an effective way to calibrate the task prediction. The proposed kNN Prompting technically does not incorporate any prior knowledge of the train or test distribution: neither the construction of the datastore nor the retrieve-then-predict procedure varies with different prior distributions. The proposed method robustly adapts to all imbalanced settings, including imbalanced trainset, testset, and both, with essentially no performance degradation (-1.3/-2.6/+0.05 respectively in Settings A, B, C, where -1.3/-2.6 can be considered ordinary fluctuation).

Table 7: Reliance on prior knowledge. All reported results are averaged across 5 datasets, and we further report average performance across all imbalance ratios. λ_train/test denotes the subsampled ratio. MaxDrop measures the performance degradation compared to the ordinary balanced setting (λ_* = 0.5), where the best is bolded and no drop is highlighted in pink.
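The subsampling that produces a given imbalance ratio λ for a binary task can be sketched as follows; the helper name and interface are ours, not the paper's code.

```python
import random

def subsample_imbalanced(pos, neg, ratio, n, seed=0):
    # Build a train set of n examples in which positives account for `ratio`
    # (e.g. ratio=0.125 -> 12.5% positive; ratio=0.5 is the balanced setting).
    rng = random.Random(seed)
    n_pos = round(ratio * n)
    return rng.sample(pos, n_pos) + rng.sample(neg, n - n_pos)

pos_pool = [(f"pos sentence {i}", 1) for i in range(64)]
neg_pool = [(f"neg sentence {i}", 0) for i in range(64)]
train_set = subsample_imbalanced(pos_pool, neg_pool, ratio=0.125, n=32)
```

Applying this to the train set, the test set, or both yields the three settings compared in Table 7.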

B.3 EMPIRICAL CHOICE OF k

In Figure 10 we search over different choices of k, and find that it is generally a rather robust choice within the wide range [3, |A|/2 - 1].

B.4 ROBUSTNESS UNDER IMBALANCED SCENARIO

We further test the robustness of kNN Prompting under the imbalanced label scenario. Taking binary classification such as SST2 as an example, we simulate imbalance by controlling one category's proportion of the entire training set to {0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625}, where 0.5 corresponds to the ordinary balanced scenario. We keep the test set intact, which results in a challenging out-of-distribution (OOD) setting. Results in Figure 11 reveal a vulnerability of the proposed method: under the imbalanced setting, kNN Prompting is overwhelmed by the large quantity of anchors from the majority class, where it is simply far easier to find close neighbors. To address this degradation under the challenging imbalanced scenario, we propose a simple normalization trick: we average the anchor representations to produce one centered anchor for each class. The resulting anchor is more representative and also avoids the quantity distraction. This adaptation works surprisingly well, with no loss of performance even under the ordinary balanced setting.
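The normalization trick can be sketched as below. For brevity this sketch uses squared euclidean distance in place of Equation 5's KL measure, and the toy datastore is illustrative.

```python
def class_centroids(datastore):
    # Average anchor key distributions per class: one centered anchor per label,
    # so the majority class can no longer dominate by sheer quantity.
    sums, counts = {}, {}
    for vec, label in datastore:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {y: [v / counts[y] for v in sums[y]] for y in sums}

def nearest_centroid_predict(p_test, centroids):
    # The k-NN vote is replaced by a 1-nearest-centroid decision.
    sq = lambda p, c: sum((a - b) ** 2 for a, b in zip(p, c))
    return min(centroids, key=lambda y: sq(p_test, centroids[y]))

# Five majority-class anchors vs. one minority anchor: the centroid decision
# is unaffected by the 5:1 quantity imbalance.
store = [([0.2, 0.8], 0)] * 5 + [([0.8, 0.2], 1)]
pred = nearest_centroid_predict([0.7, 0.3], class_centroids(store))
```

With plain 3-NN on the same store, the majority class would win every vote; the centroid variant recovers the minority-class prediction.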

C COMPLETE RESULTS

C.1 DATA SCALING CURVE UNDER FEW-SHOT SCENARIO

We provide the scaling curve for each specific dataset in Figure 12, corresponding to Table 2 and Figure 4 in the main manuscript. We provide results on more datasets regarding the investigation of the split strategy in Figure 13, corresponding to Section 4.3.2.

D COMPARISON TO FINETUNING UNDER FULLY SUPERVISED SCENARIO

We compare kNN Prompting to standard PLM finetuning at a more extensive data scale. For the finetuning baselines in both Table 3 and Figure 14, we set hyper-parameters following previous works (Schick & Schütze, 2021): learning rate 1e-5, batch size 16, and training steps 125, 250 or 500 for m ∈ {32, 64}, {128, 256}, {512, 1024} respectively. For CB, AGNews and RTE the batch size is adjusted to 8, and for DBPedia to 4, to avoid OOM. We observe that at the same model scale, kNN Prompting is superior to finetuning under the low-resource setting but inferior under the fully supervised setting. This indicates its data utility is still smaller than that of finetuning. However, the main advantage of kNN Prompting comes with the LLM, which significantly outperforms the finetuning baseline without any gradient-based optimization. We also discuss this in Section 6.
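The hyper-parameter schedule described above can be collected into a small helper; the function is ours, but the values are those reported in this appendix.

```python
def finetune_config(m, dataset):
    # Finetuning hyper-parameters as reported for the baselines in Appendix D:
    # lr 1e-5; 125/250/500 steps for m in {32,64}/{128,256}/{512,1024};
    # batch size 16, reduced to 8 (CB, AGNews, RTE) or 4 (DBPedia) to avoid OOM.
    if m in (32, 64):
        steps = 125
    elif m in (128, 256):
        steps = 250
    else:  # 512 or 1024
        steps = 500
    batch_size = {"CB": 8, "AGNews": 8, "RTE": 8, "DBPedia": 4}.get(dataset, 16)
    return {"learning_rate": 1e-5, "batch_size": batch_size, "train_steps": steps}
```

This makes the m-dependent step count explicit: training is lengthened roughly in proportion to the available data.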

E MORE CASE STUDY

We provide more case studies, on SST2, TREC, AGNews and MR respectively (Figures 15, 17, 16 and 18).

F PROMPT TEMPLATE

See Table 11 for the templates used (adopted from Lu et al. (2022)). They are intuitively designed, and the proposed method should be robust to the choice of templates.



Footnotes:
- https://github.com/tonyzhaozh/few-shot-learning
- https://github.com/shmsw25/Channel-LM-Prompting; as there exist slight differences in prompt templates, we report their best template out of four.
- Note that Liu et al. (2022b) also employ a RoBERTa model finetuned on SST2, but this overlaps with our selected benchmark and does not generalize to tasks beyond sentiment classification.
- Calculated using Table 8 statistics with the 32-shot setting and the 0.8B model.



Figure 1: kNN Prompting brings substantial improvements over standard ICL, and can continually scale up beyond the context with as many data as are available. Conducted with GPT2 XL.

Figure 3: The overall framework of kNN Prompting

Figure 4: Data scaling under few shot scenario. Compared with calibration-based methods.

Figure 5: Data scaling under the fully supervised scenario. Conducted across various LLM scales.

4.2.3 CONTINUALLY SCALING UP TO THOUSANDS OF TRAINING DATA

We now fully scale data up to thousands of training examples and provide extensive results across model scales to observe the overall scaling performance. Figure 5 shows that kNN Prompting can continually generalize across the tested range to provide effective data utility, re-enabling the power law under the gradient-free paradigm of LLM deployment. The full results can be found in Appendix C.2. With only 32 shots of training data, kNN Prompting improves ICL by as much as +13.58 in average score (0.8B), and achieves absolute improvements of up to +18.84 under the fully supervised setting. With the largest model, OPT 30B, it achieves a best performance of 86.02.

Figure 6: cf. Demonstration Selection.

Figure 7: Robustness.

Figure 8: Split strategy.

Figure 9: t-SNE (van der Maaten & Hinton, 2008) for anchors and test cases. Cases are randomly selected given that kNN Prompting outperforms ICL. DBPedia is an ontology classification task.

Figure 10: Empirical choices of k. Conducted on SST2, SUBJ and MPQA respectively (left to right).

Figure 11: Robustness under imbalanced scenario. Left: few-shot scenario (32). Right: fullysupervised scenario (1024).

Figure 12: Scaling curve under the few-shot setting for each specific dataset. The baselines are strictly comparable under m ≤ 8; some baselines might be truncated when m ≥ 16.

Figure 13: Split of demonstration and anchor set, |A| + |D| ≤ M_T.

C.5 SPLIT STRATEGY OF DEMONSTRATION AND ANCHOR SET

Figure 14: Comparison to finetuning baseline.

Maximum number of training shots (per class) allowed by 1024 tokens of context, calculated using the GPT2 tokenizer. Inside the parentheses is the Truncation Probability (TP, i.e., whether or not an example is truncated, restricted below 5%). We set M_T for each task T accordingly in our experiments for simplicity.

[Table residue: per-dataset results; recoverable average scores are 62.77 (ICL) and 63.65 (ICL Ensemble) for the smaller model, and 82.45 (ICL Ensemble) and 85.38 (kNN Prompting) for the larger one.]



[Figure 9 table: example anchor abstracts about novelists and their books, e.g., Derek Landy, Adrian Dawson, Michelle Harrison.]

Xie et al. (2022) provide theoretical explanations that formalize ICL as Bayesian inference. Dai et al. (2022) reveal that ICL can be seen as implicit finetuning, where the LLM produces meta-gradients from in-context demonstrations to adapt its behavior. Wei et al. (2022a) and Sanh et al. (2022) propose instruction tuning, which further pretrains the LLM on a collection of downstream tasks in a shared prompting format. Min et al. (2022b) and Chen et al. (2022b) introduce meta-learning to better adapt LMs to ICL. Wei et al. (2022c) and Kojima et al. (2022) propose to augment the demonstrations with human-aided reasoning steps or hints, which surprisingly improves performance on arithmetic and other reasoning tasks. Closely related to this work, Liu et al. (2022b) and Rubin et al. (2022) propose to compose the prompt P by selecting the most similar training examples. Zhao et al. (2021) and Min et al. (2022a) propose to calibrate ICL predictions by either probing the bias or reversing the conditional prediction formulation.

kNN Prompting is free of such concerns as it considers training data, which is of manageable scale. Under the few-shot scenario, kNN Prompting even reduces computational costs at deployment time compared to standard ICL. It works well with one-shot demonstrations, as experimented in Section 4.2.1, while the anchor examples are queried only once and cached locally. By contrast, existing methods only perform better when all examples are prepended in a single prompt. This advantage is rather important, as the prompt is queried repeatedly in practical usage of an LLM service, and a longer prompt results in linearly higher monetary costs if charged by token count, or super-linearly higher computational costs if measured in FLOPS.
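The cost argument above can be illustrated with back-of-the-envelope token arithmetic. In the sketch below, all example counts, per-example token lengths, and function names are hypothetical assumptions for illustration, not measurements from the paper:

```python
# Illustrative token-cost comparison between standard ICL and kNN Prompting.
# All numbers (60 tokens per example, 1000 test queries) are assumptions.

def icl_cost(n_test, n_demos, demo_tokens=60, test_tokens=60):
    """Standard ICL: every test query carries all demonstrations in its prompt."""
    return n_test * (n_demos * demo_tokens + test_tokens)

def knn_prompting_cost(n_test, n_train, demo_tokens=60, test_tokens=60):
    """kNN Prompting: each training (anchor) example is queried once with a
    one-shot prompt and cached; each test query also uses a one-shot prompt."""
    one_shot = demo_tokens + test_tokens
    return n_train * one_shot + n_test * one_shot

# With 32 training examples and 1000 test queries, ICL pays for the long
# prompt on every request, while kNN Prompting amortizes the anchor queries.
print(icl_cost(1000, 32))            # 1980000 tokens
print(knn_prompting_cost(1000, 32))  # 123840 tokens
```

The gap widens linearly with the number of test queries, since the per-query prompt length for kNN Prompting stays constant regardless of how much training data is used.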

Table 5 lists M_T for each task, corresponding to Table 1 in the main manuscript.

Dataset statistics and the maximum shots (per class) that a context of 1024 or 2048 tokens allows. We provide the maximum shots under the 5% Truncation Probability (TP) restriction, as well as the actual M_T used in this paper, which is chosen from {1, 2, 4, 8, 16, 32} for simplicity.
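One way to estimate the maximum shots under such a truncation-probability budget is a simple Monte-Carlo calculation. The sketch below is an illustrative assumption: the synthetic token lengths, the reserved test-instance budget, and the helper names are ours, whereas the paper computes lengths with the GPT2 tokenizer on the actual datasets:

```python
import random

def truncation_prob(token_lens, shots_per_class, n_classes, max_ctx=1024,
                    test_budget=128, trials=2000, seed=0):
    """Monte-Carlo estimate of the probability that a prompt built from
    shots_per_class * n_classes randomly drawn training examples (plus a
    reserved budget for the test instance) exceeds the context window.
    token_lens holds per-example token counts (synthetic here)."""
    rng = random.Random(seed)
    m = shots_per_class * n_classes
    hits = 0
    for _ in range(trials):
        total = sum(rng.choice(token_lens) for _ in range(m)) + test_budget
        if total > max_ctx:
            hits += 1
    return hits / trials

def max_shots(token_lens, n_classes, tp_limit=0.05, **kw):
    """Largest shots-per-class whose estimated truncation probability
    stays below the limit (5% in the paper's setting)."""
    shots = 1
    while truncation_prob(token_lens, shots + 1, n_classes, **kw) < tp_limit:
        shots += 1
    return shots
```

For instance, with hypothetical examples of 40 to 60 tokens and two classes, a 1024-token context supports roughly 8 shots per class before the 5% limit is exceeded.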

Comparison of alternative distance measures.

Full results for data scaling, corresponding to Figure 5. Some ICL results are reused (see footnote 7).

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China under Grants 62222212, 62232006, and 61876223, and by the Science Fund for Creative Research Groups under Grant 62121002.

C.2 DATA SCALING RESULTS UNDER FULLY SUPERVISED SCENARIO

We provide comprehensive results of kNN Prompting across data scales and LLM scales in Table 8, corresponding to Figure 5 in the main manuscript.

C.3 FULL RESULTS OF COMPARISON TO PROMPTCOMPOSE

We provide the full results of the comparison between kNN Prompting and PromptCompose in Table 9, corresponding to Figure 6 in the main manuscript. The employed sentence encoders can be found at https://huggingface.co/models (see footnotes 8-10).

7 In Table 8, there are a few cases where kNN Prompting gets identical results with the ICL baseline. This happens when M_T = 32, which leaves |A| = 0 as we have invariably set |D| = M_T. To avoid exploiting exceptional split strategies for such a specific case, we simply re-use the ICL baseline results. In general, this only occurs on MPQA (when m = 32) and SST2 (when m = 32 and LLM > 2.7B, with a 2048-token context), and should have little impact on the overall results and conclusion. A similar situation happens in Table 9.
8 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
9 https://huggingface.co/cambridgeltl/trans-encoder-bi-simcse-bert-base
10 https://huggingface.co/princeton-nlp/unsup-simcse-bert-base-uncased

Table 11: Templates for ICL. These are minimum cases with only one demonstration example for illustration. [Template examples garbled during extraction: an RTE instance in Premise/Hypothesis/Prediction format (with labels true/false) and a TREC instance in Question/Type format (with label space description, entity, expression, human, location, number).]

