INTERNET-AUGMENTED LANGUAGE MODELS THROUGH FEW-SHOT PROMPTING FOR OPEN-DOMAIN QUESTION ANSWERING

Abstract

In this work, we aim to capitalize on the unique few-shot capabilities of large-scale language models (LLMs) to overcome some of their challenges with respect to grounding to factual and up-to-date information. Motivated by semi-parametric language models (LMs), which ground their decisions in external retrieved evidence, we use few-shot prompting to learn to condition LMs on information returned from the web using Google Search, a broad and constantly updated knowledge source. Our approach does not involve fine-tuning or learning additional parameters, thus making it applicable to virtually any LM and therefore offering a strong baseline. Indeed, we find that LMs conditioned on the web surpass the performance of closed-book models of similar, or even larger, size in open-domain question answering. Finally, we find that increasing the inference-time compute of models, achieved by using multiple retrieved evidences to generate multiple answers followed by a reranking stage that uses scores generated by the same LMs, leads to better performance and alleviates the lower performance of smaller few-shot LMs. All in all, our findings suggest that it might be beneficial to slow down the race towards the biggest model and instead shift attention towards finding more effective ways to use models, including, but not limited to, better prompting or increasing inference-time compute.

1. INTRODUCTION

Undoubtedly, large-scale language models (LLMs) present a breakthrough for language research, particularly for their state-of-the-art language modeling results (Radford et al., 2019; Rae et al., 2021) and impressive generative capabilities. Above all, increasing scale has made few-shot learning a defining new paradigm for language models (LMs). Due to the versatility of prompting, these models can now be quickly adapted using only a handful of examples to perform tasks ranging from question answering and numeric reasoning to creative writing (Brown et al., 2020). All these considerations place few-shot LLMs in an excellent position to be used as building blocks for open-ended and "in the wild" user interactions. Despite these successes, few-shot LLMs still lack a key ingredient: they are susceptible to hallucinations (Maynez et al., 2020) caused by incorrect retrieval of knowledge stored in their weights or by the model having incomplete or outdated knowledge. Since we expect factuality to play an important role in many user interactions, it is imperative to find ways to keep LLMs up-to-date and grounded to factual and new information as it becomes available. As the current trend sees the size of these models continually grow, mitigating these issues should rely on flexible and robust approaches that can be easily transferred to different domains and tasks. Here, we aim to capitalize on the unique benefits offered by pre-trained LLMs and propose to overcome some of their limitations by drawing ideas from semi-parametric models (Khandelwal et al., 2019; Guu et al., 2020; Lewis et al., 2020; Izacard & Grave, 2020) that ground their decisions in external retrieved evidence to reduce hallucinations and improve factuality (Shuster et al., 2021). Specifically, we use the Internet as a source of up-to-date knowledge, and rely on the powerful few-shot capabilities of these LLMs to learn how to use it effectively for answering questions.
Taking open-domain question answering as a task where factual correctness is vital, we design a system that, given a question, uses a retrieval model to retrieve relevant documents from the Internet. Then, using few-shot learning, we prompt the model to answer the question by conditioning on the retrieved documents, without the need to fine-tune or learn extra parameters. As a retrieval system we use a search engine, in particular Google Search, allowing us to treat the whole web as a knowledge source. While Wikipedia has been the dominant knowledge source driving progress on a multitude of tasks, given the current progress and the quest towards more complex interactions, there has never been a better time to widen its scope and embrace both the opportunities of working with the whole web, such as considering a wider range of topics and views, and the many challenges, such as working with noisier, potentially uncurated and unsafe text in the wild. Indeed, there is momentum building up in breaking away from Wikipedia-only research (Komeili et al., 2021; Nakano et al., 2021; Piktus et al., 2021; Thoppilan et al., 2022). To test the effectiveness of equipping LLMs with Internet search on open-domain question answering, we use a mix of single-hop and multi-hop, language generation and classification tasks. We find that our biggest LLMs benefit from conditioning on the web through few-shot prompting. For the language generation tasks, we see a relative performance increase of 15%-30% over the commonly used closed-book few-shot approach, while also making up, performance-wise, for LLMs of smaller size. Surprisingly, we find that our method achieves gains, albeit smaller, even on complex multi-hop questions, despite the fact that these questions suffer from higher retrieval errors.
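The retrieve-then-prompt pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: the Evidence/Question/Answer template and the demonstration examples are assumptions made for clarity, and the retrieval call is abstracted away.

```python
# Sketch of building an "open-book" few-shot prompt that conditions the LM on
# retrieved evidence. The template below (Evidence / Question / Answer) is an
# illustrative assumption; the retrieval step (e.g., a search-engine call) is
# abstracted into the `evidence` argument.

def build_open_book_prompt(few_shot, evidence, question):
    """few_shot: list of (evidence, question, answer) triples used as k-shot
    demonstrations. Returns a prompt the LM completes after 'Answer:'."""
    parts = []
    for ev, q, a in few_shot:
        parts.append(f"Evidence: {ev}\nQuestion: {q}\nAnswer: {a}")
    # Append the target question with its retrieved evidence; the LM is then
    # asked to generate the text following the final "Answer:" field.
    parts.append(f"Evidence: {evidence}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical one-shot demonstration.
demo = [("The Eiffel Tower is located in Paris.",
         "Where is the Eiffel Tower?", "Paris")]
prompt = build_open_book_prompt(demo,
                                "Mount Everest is 8,849 m tall.",
                                "How tall is Mount Everest?")
print(prompt)
```

No parameters are trained or fine-tuned: the conditioning happens entirely through the prompt, which is what makes the approach applicable to any pre-trained LM.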
While the mainstream view perhaps places scaling models' parameters as the primary way to increase their few-shot performance, our results add to the stream of work that instead emphasizes better use of the models' powerful prompting abilities (Rubin et al., 2021; Liu et al., 2021a). As such, our approach presents a lightweight method applicable to virtually any pre-trained LM without the need for fine-tuning or adding extra learnable parameters. Finally, increasing the inference-time compute of models by sampling multiple answers and reranking them using scores computed from the same LLMs not only adds further performance gains, but also alleviates the generally lower performance of smaller few-shot LMs, partly closing their performance gap with larger models. All in all, our findings hint at the possibility of slowing down the race towards the biggest model and instead shifting attention to more targeted and effective use of models' few-shot capabilities in combination with increasing inference-time compute, a generally more scalable approach.
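The sample-and-rerank step can be illustrated as follows. This is a sketch under stated assumptions: `lm_log_prob` stands in for whatever answer-scoring call the LM provides (the paper scores candidates with the same LM), and the mock scorer used here for demonstration simply rewards lexical overlap with the evidence, which is not the paper's scoring function.

```python
# Sketch of inference-time reranking: generate one candidate answer per
# retrieved evidence paragraph, then rerank all candidates with a score from
# the same LM. `lm_log_prob` is a placeholder for the model's scoring call.
import math

def rerank(candidates, lm_log_prob):
    """candidates: list of (answer, evidence) pairs.
    Returns answers sorted from highest to lowest LM score."""
    scored = [(lm_log_prob(ans, ev), ans) for ans, ev in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [ans for _, ans in scored]

# Mock scorer (assumption, for demonstration only): reward candidates whose
# tokens overlap with the evidence, as a crude proxy for an LM likelihood.
def mock_score(answer, evidence):
    ev_toks = {w.strip(".,").lower() for w in evidence.split()}
    ans_toks = {w.strip(".,").lower() for w in answer.split()}
    return math.log1p(len(ans_toks & ev_toks))

cands = [("Paris", "The Eiffel Tower is located in Paris."),
         ("London", "The Eiffel Tower is located in Paris.")]
best = rerank(cands, mock_score)[0]
print(best)  # → Paris
```

Because reranking only spends more compute at inference time, it can be dialed up or down per query, which is what makes it attractive for partly closing the gap between smaller and larger few-shot models.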

2. RELATED WORK

Semi-parametric language models have recently been gaining momentum (Khandelwal et al., 2019; Guu et al., 2020; Yogatama et al., 2021; Borgeaud et al., 2021), extending monolithic parametric models with information from a knowledge source. This facilitates overcoming distribution shift (e.g., domain or temporal) in a flexible way by simply updating the external knowledge. When applied to question answering tasks (Lewis et al., 2020; Izacard & Grave, 2020; Sachan et al., 2021), they surpass the performance of parametric-only models: they are able to efficiently handle an increasing number of retrieved passages and ground their predictions in additional information, thus reducing hallucinations and improving factuality. However, to be faithful to their input, these models need to be trained (or fine-tuned) to attend to the additional input. In contrast, our work pushes the limits of few-shot prompting as a way to learn to condition on external evidence with no additional training required, thus making it applicable to virtually any pre-trained LM.

Web as knowledge source. Open-domain question answering has traditionally studied carefully constructed benchmarks, where answerability of questions from Wikipedia has been confirmed through annotations. Recently a new trend has emerged: using the whole web as a knowledge source to support more varied and rich interactions. Augenstein et al. (2019) and Fan et al. (2020) make use of web data through commercial search engines as part of building more diverse datasets for fact-checking. On the other hand, Piktus et al. (2021) find that considering the web as a retrieval source brings material gains to knowledge-intensive tasks, despite any difficulties with building a search index from an order of magnitude more (noisy) data than Wikipedia.
To avoid similar challenges with building and maintaining a search index, recent work that aims at improving factuality in user interactions adopts commercial search engines as a building block for their systems (Komeili et al., 2021; Nakano et al., 2021; Thoppilan et al., 2022; Menick et al., 2022). Similar to us, Nakano et al. (2021) analyze benefits of increasing compute at inference time. However, unlike us, they either target open-ended dialogue interactions (Komeili et al., 2021; Thoppilan et al., 2022) or focus on

