INTERNET-AUGMENTED LANGUAGE MODELS THROUGH FEW-SHOT PROMPTING FOR OPEN-DOMAIN QUESTION ANSWERING

Abstract

In this work, we aim to capitalize on the unique few-shot capabilities of large-scale language models (LLMs) to overcome some of their challenges with respect to grounding to factual and up-to-date information. Motivated by semi-parametric language models (LMs), which ground their decisions in external retrieved evidence, we use few-shot prompting to learn to condition LMs on information returned from the web using Google Search, a broad and constantly updated knowledge source. Our approach involves no fine-tuning or learning of additional parameters, making it applicable to any LM and therefore offering a strong baseline. Indeed, we find that LMs conditioned on the web surpass the performance of closed-book models of similar, or even larger, size in open-domain question answering. Finally, we find that increasing the inference-time compute of models, achieved by using multiple retrieved evidences to generate multiple candidate answers followed by a reranking stage that uses scores generated by the same LMs, leads to better performance and alleviates the lower performance of smaller few-shot LMs. All in all, our findings suggest that it might be beneficial to slow down the race towards the biggest model and instead shift attention towards finding more effective ways to use models, including, but not limited to, better prompting and increased inference-time compute.

1. INTRODUCTION

Undoubtedly, large-scale language models (LLMs) present a breakthrough for language research, particularly for their state-of-the-art language modeling results (Radford et al., 2019; Rae et al., 2021) and impressive generative capabilities. Above all, increasing scale has made few-shot learning a defining new paradigm for language models (LMs). Due to the versatility of prompting, these models can now be quickly adapted using only a handful of examples to perform tasks ranging from question answering and numeric reasoning to creative writing (Brown et al., 2020). All these considerations place few-shot LLMs in an excellent position to be used as building blocks for open-ended and "in the wild" user interactions.

Despite these successes, few-shot LLMs still lack a key ingredient: they are susceptible to hallucinations (Maynez et al., 2020) caused by incorrect retrieval of knowledge stored in their weights, or by the model having incomplete or outdated knowledge. As we expect factuality to play an important role in many user interactions, it is imperative to find ways to keep LLMs up-to-date and grounded to factual and new information as it becomes available. As the current trend sees the size of these models continually grow, mitigating these issues should rely on flexible and robust approaches that can be easily transferred to different domains and tasks.

Here, we aim to capitalize on the unique benefits offered by pre-trained LLMs and propose to overcome some of their limitations by drawing ideas from semi-parametric models (Khandelwal et al., 2019; Guu et al., 2020; Lewis et al., 2020; Izacard & Grave, 2020) that ground their decisions in external retrieved evidence to reduce hallucinations and improve factuality (Shuster et al., 2021). Specifically, we use the Internet as a source of up-to-date knowledge, and rely on the powerful few-shot capabilities of these LLMs to learn how to use it effectively for answering questions.
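The overall recipe, conditioning a few-shot prompt on retrieved evidence and reranking candidate answers by a model-assigned score, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt template, the function names, and the `score_fn` interface are all assumptions, and a real system would call an actual LM for both generation and scoring.

```python
# Illustrative sketch of evidence-conditioned few-shot prompting with
# LM-score reranking. All names and the prompt format are hypothetical.

def build_evidence_prompt(few_shot_examples, evidence, question):
    """Format k-shot (evidence, question, answer) examples, then append the
    target evidence and question, leaving the answer for the LM to complete."""
    parts = [
        f"Evidence: {ev}\nQuestion: {q}\nAnswer: {a}"
        for ev, q, a in few_shot_examples
    ]
    parts.append(f"Evidence: {evidence}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

def rerank(candidate_answers, score_fn):
    """Return the candidate with the highest score, e.g. the LM's estimate
    of p(answer | evidence, question). score_fn is supplied by the caller."""
    return max(candidate_answers, key=score_fn)

if __name__ == "__main__":
    shots = [("Paris is the capital of France.",
              "What is the capital of France?", "Paris")]
    prompt = build_evidence_prompt(
        shots,
        "Canberra became the capital of Australia in 1927.",
        "What is the capital of Australia?",
    )
    print(prompt)
    # A stand-in scorer; a real system would use LM log-probabilities.
    best = rerank(["Sydney", "Canberra"], lambda a: len(a))
    print(best)
```

In the full approach, one such prompt is built per retrieved evidence, each prompt yields one or more candidate answers, and the reranking step selects among all candidates.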
Taking open-domain question answering as a task where factual correctness is vital, we design a system that, given a question, uses a retrieval model to retrieve relevant documents from the Internet. Then,

