THRUST: ADAPTIVELY PROPELS LARGE LANGUAGE MODELS WITH EXTERNAL KNOWLEDGE

Abstract

Large-scale pre-trained language models (PTLMs) have achieved great success on various natural language processing (NLP) tasks. Much evidence shows that PTLMs already encode rich knowledge themselves, but the knowledge stored in PTLMs can be opaque and static, making external knowledge retrieval necessary. However, using external knowledge poses two major challenges. First, indexing and retrieving knowledge over large-scale knowledge bases is time-consuming. Second, the retrieved knowledge can be noisy and sometimes misleading. Motivated by the observation that PTLMs do not always require external knowledge, we investigate an effective and efficient way to apply knowledge only when it is essential. Specifically, we propose instance-level adaptive propulsion of external knowledge (IAPEK), where we score each instance on whether the PTLM needs the support of external knowledge. To achieve this goal, we design a novel metric, Thrust, which leverages distribution estimation over seen/training instances. Extensive experiments demonstrate that Thrust achieves significantly higher cost-efficiency than naive usage of external knowledge on 88% of the evaluated tasks, with a 26% average performance improvement. These findings further shed light on the real-world practice of knowledge-enhanced LMs under a limited budget for knowledge seeking due to computation latency or cost.¹

1. INTRODUCTION

Knowledge plays an important role in solving natural language processing (NLP) tasks, where encyclopedic or commonsense knowledge is commonly required to answer questions from various tasks (Yin et al., 2022). In recent years, the emergent advance of pre-trained language models (PTLMs) has demonstrated great improvement on various tasks (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020). Evidence also shows that PTLMs themselves contain rich encyclopedic (Petroni et al., 2019) or commonsense (Kocijan et al., 2019) knowledge. However, such implicit knowledge embedded in the model's hidden states can be opaque, static, and inefficient to utilize (Khattab et al., 2022). These issues motivate the common practice of seeking external knowledge (Xu et al., 2021; Verga et al., 2021; Paranjape et al., 2022) in NLP. A typical line of work focuses on retrieval-based methods, where knowledge is retrieved by a standalone retriever from external knowledge bases and then used to augment the inference models (i.e., readers) such as PTLMs (Karpukhin et al., 2020; Gao & Callan, 2021; Khattab & Zaharia, 2020). However, the usage of external knowledge has several limitations: (i) performance on the downstream tasks is not commonly revealed. Metrics of the common benchmarks (e.g., MS-MARCO (Nguyen et al., 2016), BEIR (Thakur et al., 2021)) measure the quality of retrieval (e.g., Recall@50, nDCG@10). Although retrieving relevant content may positively correlate with downstream performance, not reporting downstream performance, especially on out-of-domain tasks, limits the exploration of how to utilize external knowledge in practice; (ii) the external knowledge can be noisy or unnecessary. On the retriever side, though current retrievers achieve great performance on various tasks, noise still exists.
For instance, ColBERT v2 (Santhanam et al., 2022) achieved 68.9 Success@5 on Natural Questions (Kwiatkowski et al., 2019), which suggests that gold documents do not appear in the top five retrieved documents for 31.1% of the queries.
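The gating idea behind IAPEK can be sketched as follows. This is an illustrative toy implementation, not the paper's actual Thrust metric: the score function here (inverse distance of a query embedding to centroids of seen/training instances) and all names (`thrust_like_score`, `answer`, `threshold`) are our own assumptions, standing in for whatever per-instance confidence estimate and retrieval pipeline one actually uses.

```python
import numpy as np


def thrust_like_score(query_vec, class_centroids):
    """Hypothetical per-instance confidence score (NOT the paper's exact
    Thrust metric): higher when the query embedding lies close to some
    cluster of seen training instances, suggesting the model may be able
    to answer without external knowledge."""
    dists = [np.linalg.norm(query_vec - c) for c in class_centroids]
    return 1.0 / (min(dists) + 1e-8)


def answer(query_vec, centroids, model_answer, retrieve_and_answer, threshold=1.0):
    """Instance-level adaptive propulsion: invoke the costly retriever only
    when the confidence score falls below a threshold; otherwise let the
    model answer on its own."""
    if thrust_like_score(query_vec, centroids) >= threshold:
        return model_answer(query_vec)        # model alone, no retrieval cost
    return retrieve_and_answer(query_vec)     # augment with external knowledge
```

As a usage sketch, a query embedding near a training-instance centroid scores high and skips retrieval, while a far-away (out-of-distribution) query falls below the threshold and triggers the retriever; tuning `threshold` trades retrieval cost against accuracy.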



¹ The code and data will be released upon acceptance.

