THRUST: ADAPTIVELY PROPELS LARGE LANGUAGE MODELS WITH EXTERNAL KNOWLEDGE

Abstract

Large-scale pre-trained language models (PTLMs) have achieved great success in various natural language processing (NLP) tasks. Much evidence shows that PTLMs already encode rich knowledge themselves, but knowledge stored in PTLMs can be opaque and static, making external knowledge retrieval necessary. However, there are two major challenges when using external knowledge. First, knowledge indexing and retrieval over large-scale knowledge bases are time-consuming. Second, the retrieved knowledge can be noisy and sometimes misleading. Motivated by the observation that external knowledge is not always required by PTLMs, we investigate an effective and efficient way to apply knowledge only when the knowledge is essential. Specifically, we propose instance-level adaptive propulsion of external knowledge (IAPEK), where we score each instance on whether the PTLMs need the support of external knowledge. To achieve this goal, we design a novel metric, Thrust, which leverages distribution estimation over seen/training instances. Extensive experiments demonstrate that we can achieve significantly higher cost-efficiency through Thrust compared to the naive usage of external knowledge on 88% of the evaluated tasks, with a 26% average performance improvement. Such findings further shed light on the real-world practice of knowledge-enhanced LMs with a limited budget for knowledge seeking due to computational latency or costs.¹

1. INTRODUCTION

Knowledge plays an important role in solving natural language processing (NLP) tasks, where encyclopedic or commonsense knowledge is commonly required to answer questions from various tasks (Yin et al., 2022). In recent years, the rapid advance of pre-trained language models (PTLMs) has brought great improvements on various tasks (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020). Evidence also shows that PTLMs contain rich encyclopedic (Petroni et al., 2019) or commonsense (Kocijan et al., 2019) knowledge themselves. However, such implicit knowledge embedded in the model's hidden states can be opaque, static, and inefficient to utilize (Khattab et al., 2022). These issues motivate the common practice of seeking external knowledge (Xu et al., 2021; Verga et al., 2021; Paranjape et al., 2022) in NLP. A typical line of work focuses on retrieval-based methods, where knowledge is retrieved by a standalone retriever from external knowledge bases and then used to augment the inference models (i.e., readers) such as PTLMs (Karpukhin et al., 2020; Gao & Callan, 2021; Khattab & Zaharia, 2020).

However, there are several limitations to the usage of external knowledge: (i) performance on the downstream tasks is not commonly revealed. Metrics of the common benchmarks (e.g., MS-MARCO (Nguyen et al., 2016), BEIR (Thakur et al., 2021)) measure the quality of retrieval (e.g., Recall@50, nDCG@10). Although retrieving the relevant content may positively relate to the downstream performance, not reporting the downstream performance, especially for out-of-domain tasks, limits the exploration of how to utilize external knowledge in practice; (ii) the external knowledge can be noisy or unnecessary. On the retriever side, though current retrievers achieve strong performance on various tasks, noise can still exist.
For instance, ColBERT v2 (Santhanam et al., 2022) achieved 68.9 Success@5 on Natural Questions (Kwiatkowski et al., 2019), which suggests that gold documents do not appear in the top 5 retrieved documents for 31.1% of the queries. Intuitively, a solution to the noise and inefficiency issues is to seek external knowledge only when it is necessary.

In this work, we capture this intuition by proposing Instance-level Adaptive Propulsion of External Knowledge (IAPEK) to reduce the effect of noise in external knowledge and improve the cost-efficiency of knowledge augmentation. In detail, for each instance of a given task, we compute a confidence score measuring how likely the instance can be solved directly by a given model, and reject the use of external knowledge when the score is high. We design a simple and lightweight metric, Thrust, to serve such a purpose by leveraging an estimation of the instance distribution from the perspective of the target model. To comprehensively understand the effectiveness of Thrust, we first create a large-scale benchmark examining the downstream performance of the task-plus-knowledge paradigm with (i) tasks with different formats and types (e.g., multiple-choice classification (MC classification) and open-domain question answering (open-domain QA)); and (ii) knowledge with different formats and from different resources (e.g., knowledge graphs, Wikipedia paragraphs, and human annotations). Next, with models that can utilize external knowledge, we evaluate the effectiveness of Thrust by showing that it can boost the performance of various tasks under various settings, such as injecting external knowledge into different portions of the test instances. Extensive experiments show that Thrust can improve the cost-efficiency of seeking and using external knowledge in 88% of cases, with a 26% average performance improvement, by identifying the instances that most require knowledge.
We also observe that, with Thrust, models can achieve higher performance than when external knowledge is injected for all instances, benefiting from both the performance and efficiency perspectives.
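To make the idea of distribution-based scoring concrete, the following is a minimal, hypothetical sketch: it scores a query embedding by how strongly it is "pulled" toward clusters of seen training instances, with larger clusters and smaller distances yielding a higher score (i.e., the model likely handles the instance without external knowledge). The function name, the inverse-square weighting, and the centroid representation are illustrative assumptions, not the paper's exact Thrust formula.

```python
import numpy as np

def thrust_like_score(query_vec, cluster_centroids, cluster_sizes):
    """Illustrative distribution-based confidence score (not the exact Thrust metric).

    query_vec:         1-D embedding of the test instance.
    cluster_centroids: list of 1-D centroid vectors of seen/training clusters.
    cluster_sizes:     number of training instances in each cluster.

    A high score means the query lies close to dense regions of seen
    instances, suggesting external knowledge is less necessary.
    """
    score = 0.0
    for centroid, size in zip(cluster_centroids, cluster_sizes):
        dist = np.linalg.norm(query_vec - np.asarray(centroid))
        # Inverse-square weighting: nearby, large clusters dominate.
        score += size / (dist ** 2 + 1e-8)
    return score
```

Under this sketch, a query embedded near a large training cluster receives a much higher score than one far from all clusters, and the score can then be thresholded to decide whether to retrieve.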

2.1. INSTANCE-LEVEL ADAPTIVE PROPULSION OF KNOWLEDGE

We first define IAPEK as follows: for each query q^(i) in a given test set D = {q^(1), q^(2), . . .}, let f(q) denote the scoring function for the necessity of external knowledge; we extract the corresponding scores S = {f(q^(1)), f(q^(2)), . . .}. With S, we re-rank the test set into D′ = {q′^(1), q′^(2), . . .}. Given any threshold t ∈ R, we sample a subset D_k = {q_k^(1), q_k^(2), . . .} as that with the highest knowledge need, where f(q_k) > t for each q_k ∈ D_k. Empirically, we can set t as a particular percentile of S, e.g., the top 25% of S. Next, for each instance in D_k, we seek for external knowledge pieces and
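The selection step above can be sketched in a few lines, assuming scores f(q) are already computed (higher = more need for external knowledge) and t is taken as a percentile of S; the function name and percentile default are illustrative choices, not part of the paper's specification.

```python
import numpy as np

def select_knowledge_subset(scores, percentile=75):
    """Split a test set into D_k (needs external knowledge) and the rest.

    scores:     iterable of f(q) values over the test set D.
    percentile: t is set to this percentile of S; percentile=75
                routes the top 25% of instances to retrieval.

    Returns index lists: (instances in D_k, instances answered directly).
    """
    scores = np.asarray(scores, dtype=float)
    t = np.percentile(scores, percentile)  # empirical threshold t
    needs_knowledge = [i for i, s in enumerate(scores) if s > t]
    direct = [i for i, s in enumerate(scores) if s <= t]
    return needs_knowledge, direct
```

Only instances in the first list would then be sent to the retriever, so retrieval cost scales with the chosen percentile rather than with the full test-set size.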



¹ The code and data will be released upon acceptance.



Figure 1: The predictions from OPT (175B version) with/without external knowledge retrieved by DPR (Karpukhin et al., 2020) from Wikipedia paragraphs. Although the top retrieved paragraphs are relevant, since the internal knowledge is already sufficient, the external knowledge can either be misleading (potentially due to the effect of mispriming (Kassner & Schütze, 2020)) or less useful.

