kNN PROMPTING: BEYOND-CONTEXT LEARNING WITH CALIBRATION-FREE NEAREST NEIGHBOR INFERENCE

Abstract

In-Context Learning (ICL), which formulates target tasks as prompt completion conditioned on in-context demonstrations, has become the prevailing way of utilizing LLMs. In this paper, we first disclose a practical predicament of this typical usage: it cannot scale up with training data due to the context length restriction. Besides, existing works have shown that ICL suffers from various biases and requires delicate calibration treatment. To address both challenges, we advocate a simple and effective solution, kNN Prompting, which first queries the LLM with training data for distributed representations, then predicts test instances by simply referring to their nearest neighbors. We conduct comprehensive experiments to demonstrate its two-fold superiority: 1) Calibration-Free: kNN Prompting does not directly align the LLM output distribution with the task-specific label space; instead, it leverages such distributions to align test and training instances. It significantly outperforms state-of-the-art calibration-based methods under comparable few-shot scenarios. 2) Beyond-Context: kNN Prompting can further scale up effectively with as much training data as is available, continually bringing substantial improvements. The scaling trend holds across ten data scales ranging from 2 shots to 1024 shots, as well as across LLM scales ranging from 0.8B to 30B parameters. It successfully bridges data scaling into model scaling, and brings new potential to the gradient-free paradigm of LLM deployment. Code is publicly available.

1. INTRODUCTION

Maximum Context Length. Large language models (LLMs), when scaled up to billions of parameters, have demonstrated remarkable capabilities across a wide range of NLP tasks (Radford et al., 2019; Brown et al., 2020). However, such models are prohibitively expensive to train on most research- or consumer-level devices, even though some of them are publicly available (Zhang et al., 2022). As a result, an emerging paradigm is to host LLMs in a remote data center and let end users or applications access them via simple API requests. The typical usage of an LLM under this paradigm is In-Context Learning, where the LLM reads and completes a prompt sequence just as it was pretrained to do on massive text corpora. The prompt is constructed by concatenating several training examples with a test instance, and the prediction is obtained by mapping the LLM's word continuations back to the label space.

It is widely investigated and acknowledged that modern neural networks generally perform better with more training data; specifically, there exists a power law between expected model performance and available data scale (Hestness et al., 2017; Rosenfeld et al., 2020). For ICL, it has likewise been observed empirically that performance continually improves as more training examples are prepended to the prompt (Brown et al., 2020). However, such improvements are quickly capped by the context length restriction, as language models are designed and trained to process sequences only within a fixed length, typically 1024 or 2048 tokens. In order to utilize more training data, several works select the most relevant examples to compose the prompt before querying the LLM (Liu et al., 2022b; Rubin et al., 2022), but still only the in-context examples actually participate in LLM inference while most training data are discarded beforehand, providing only marginal data scaling benefits.
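The prompt-construction step described above can be sketched in a few lines. The template and label verbalizer below are illustrative assumptions for a binary sentiment task, not the paper's exact format (its actual templates appear in Appendix F):

```python
# Illustrative sketch of ICL prompt construction for a sentiment task.
# The "Review:/Sentiment:" template and the verbalizer mapping labels to
# words are hypothetical, chosen only to make the mechanism concrete.

def format_example(text, label=None):
    """Render one example; the label slot is left empty for the test instance."""
    verbalizer = {0: "negative", 1: "positive"}
    suffix = verbalizer[label] if label is not None else ""
    return f"Review: {text}\nSentiment: {suffix}"

def build_prompt(demonstrations, test_text):
    """Concatenate in-context demonstrations followed by the test instance."""
    parts = [format_example(x, y) for x, y in demonstrations]
    parts.append(format_example(test_text))  # label left for the LLM to complete
    return "\n\n".join(parts)

demos = [("A wonderful film.", 1), ("Dull and lifeless.", 0)]
prompt = build_prompt(demos, "An instant classic.")
```

The LLM then completes the final empty label slot, and the predicted word ("positive"/"negative") is mapped back to the label space via the inverse of the verbalizer. Note that every demonstration consumes context tokens, which is exactly why this scheme cannot absorb more than a handful of training examples.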
Besides, their reliance on an external retriever incurs further complications. Consequently, this situation poses a serious challenge for many practical scenarios where more than a few training examples are available. Another vulnerability of ICL is the severe bias in the output distribution of LLMs, which results in considerable performance degradation (Holtzman et al., 2021) and instability (Lu et al., 2022), as shown in existing works. Accordingly, various methods have been proposed to calibrate the output distribution (Zhao et al., 2021; Jiang et al., 2021; Min et al., 2022a). For example, Zhao et al. (2021) measure the bias by probing the LLM with a content-free "NA" example and recording the resulting prior. However, since LLMs are pretrained on general-domain natural language, their ability to complete a fabricated prompt is essentially not aligned with the downstream task-specific label space. As a consequence, such calibration-based methods can only alleviate the bias to a limited extent.

In this paper, we advocate a simple and effective solution, kNN Prompting, to address both challenges. Specifically, we partition the training data into a demonstration set and an anchor set. We append each anchor example to the prompt and query the LLM; then, instead of aligning word continuations with labels, we collect the language modeling probability distribution as a distributed representation and cache it in a local datastore. At inference time, for each test instance, we similarly obtain its representation and match it against the maintained datastore to make predictions. In general, the proposed framework enables both calibration-free optimization, because it avoids forced input-label alignment, and beyond-context learning, because the anchor set allows utilization of unlimited training data.
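The datastore construction and nearest-neighbor inference described above can be sketched as follows. Here `query_llm` is a stand-in for an API call that returns the LLM's next-token probability distribution for a prompt, and KL divergence is one natural (assumed) choice of distance between two such distributions; neither name nor choice is claimed to match the paper's exact implementation:

```python
import math

# Sketch of kNN Prompting inference: cache one output distribution per
# anchor example, then classify a test instance by majority vote among
# its k nearest cached distributions.

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two distributions given as equal-length sequences."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def build_datastore(anchor_set, query_llm):
    """Query the LLM once per anchor example and cache (distribution, label)."""
    return [(query_llm(x), y) for x, y in anchor_set]

def knn_predict(test_x, datastore, query_llm, k=3):
    """Predict by majority vote among the k nearest cached distributions."""
    q = query_llm(test_x)
    neighbors = sorted(datastore, key=lambda entry: kl_divergence(q, entry[0]))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)
```

Because each anchor example is queried independently and only its cached distribution is kept, the datastore can grow with arbitrarily many training examples without ever exceeding the LLM's context window, and no alignment between vocabulary tokens and task labels is required.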
We conduct comprehensive experiments on 10 established text classification tasks to demonstrate the significant superiority of kNN Prompting across various scenarios and against competitive baselines: 1) Under the few-shot scenario, where training data is very limited and fits in the context, kNN Prompting outperforms state-of-the-art calibration-based methods by a considerable margin (up to +7.07). 2) Under low-resource or fully supervised scenarios, where the training data cannot fit in the context, kNN Prompting exhibits its major advantage: it scales effectively with as much training data as is available, across ten data scales (2 shots to 1024 shots; see Figure 1 for illustration) as well as across LLM scales (0.8B to 30B). Specifically, with only 32 shots of training data, it improves over ICL by up to +13.58 in average score, and achieves absolute improvements of up to +18.84 under the fully supervised setting. We also provide a formal explanation of the intrinsic mechanism behind its effectiveness, as well as detailed analyses of its robustness and design choices. With these appealing properties, kNN Prompting is in general a promising solution that bridges the benefits of data scaling into model scaling and takes the gradient-free paradigm of LLM deployment one step further.

2. BACKGROUND: IN-CONTEXT LEARNING

In this section, we formulate the task and recap the ICL baseline. Assume a target task with training set $\mathcal{T} = \{(x_i, y_i)\}$ and categorical label space $\mathcal{Y}$. At inference time, the model is asked to predict $y_{\text{test}}$ given a test instance $x_{\text{test}}$. We denote by $\theta$ an LLM pretrained with the standard language modeling objective. At deployment, it likewise predicts a probability distribution $p(w_t \mid w_{<t}, \theta)$ for the next token at the $t$-th position, conditioned on the previous context $w_{<t}$. In-context learning first formats training examples $\{(x_i, y_i)\}$ as input-label mappings via intuitive templates (see Appendix F for illustration), and concatenates them into a natural

Figure 1: kNN Prompting brings substantial improvements over standard ICL, and can continually scale up beyond the context with as many data as are available. Conducted with GPT XL.

