RECITATION-AUGMENTED LANGUAGE MODELS

Abstract

We propose a new paradigm to help Large Language Models (LLMs) generate more accurate factual knowledge without retrieving from an external corpus, called RECITation-augmented gEneration (RECITE). Unlike retrieval-augmented language models, which retrieve relevant documents before generating the outputs, RECITE first recites one or several relevant passages from the LLM's own memory via sampling, and then produces the final answers. We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks. Specifically, we show that by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance on various closed-book question answering (CBQA) tasks. In experiments, we verify the effectiveness of RECITE on four pre-trained models (PaLM, UL2, OPT, and Codex) and three CBQA tasks (Natural Questions, TriviaQA, and HotpotQA).

1. INTRODUCTION

Large language models (LLMs) have achieved impressive in-context few-shot performance on knowledge-intensive NLP tasks (Brown et al., 2020; Rae et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022). For example, in open-domain question answering (Chen et al., 2017), prompted with only a few examples of question-answer pairs, LLMs are able to answer arbitrary factoid questions (Joshi et al., 2017; Yang et al., 2018; Kwiatkowski et al., 2019). Recent research (Guu et al., 2020; Lewis et al., 2020; Izacard et al., 2022) shows that retrieval augmentation can further improve LLMs' performance on knowledge-intensive tasks by conditioning the LLMs on relevant passages retrieved from an external corpus.

This paper proposes a new paradigm, RECITation-augmented gEneration (RECITE), that helps LLMs generate more accurate factual knowledge without retrieving from an external corpus: we tackle knowledge-intensive NLP tasks by first reciting relevant information and then generating the outputs. This two-step paradigm decomposes the original knowledge-intensive task into two sub-tasks, knowledge recitation and task execution, where the former can be regarded as a form of intermediate knowledge-retrieval step (from the model weights), while the latter is the execution step that produces the final outputs. The motivation for introducing an additional knowledge-recitation step comes from our observation that while few-shot prompting can help LLMs execute specific NLP tasks, these tasks usually do not take a form similar to the original causal language modeling pre-training objective, which hinders LLMs from effectively reciting knowledge from their memory (Carlini et al., 2021). Consider a student taking a closed-book exam that contains knowledge-intensive questions, for example, "what is the tenth decimal of π?".
They typically cannot directly answer this question because in the studying stage (in analogy to the language-modeling pre-training stage for LLMs), it is highly unlikely that they would have read "the tenth decimal of π is 5". However, a sentence like "the first N digits of π are 3.14159 26535..." may well exist in the textbook and can be recited by the student. The student can therefore answer the question in a recite-and-answer scheme: "The first 10 digits of π are 3.14159 26535. So the answer is 5". Here, the knowledge-recitation step serves as an intermediate step that mimics the language-modeling pre-training task, and thus better helps the LLM generate factual knowledge.

We verify the effectiveness of recitation-augmented generation on few-shot Closed-Book Question Answering (CBQA) tasks (referred to as recite-and-answer in the CBQA context), as illustrated in Figure 1. CBQA is an attractive open-domain QA task in that a fully parameterized LM can generate answers directly, without an external corpus or separate retrieval models (Roberts et al., 2020). We show that the proposed recite-and-answer scheme is an effective method for CBQA and is compatible with other techniques for boosting the few-shot performance of LLMs. We also show that, in addition to improving the few-shot in-context learning performance of RECITE-enhanced LLMs, fine-tuning the pre-trained LLMs on synthetically generated question-passage pairs can further improve recitation quality and lead to better downstream QA accuracy. Experiments on four large language models (PaLM (Chowdhery et al., 2022), UL2, OPT, and Codex) and three CBQA tasks (Natural Questions, TriviaQA, and HotpotQA) verify the effectiveness of RECITE.

2. RELATED WORK

In the closed-book setting, the QA model is not allowed to access any external knowledge, and needs to store all the required knowledge in its parameters. It has recently been observed that large-scale pre-trained language models (Devlin et al., 2019; Radford et al.; Yang et al., 2019b) can internalize a sort of implicit "knowledge base" after pre-training (Petroni et al., 2019; Jiang et al., 2020; Talmor et al., 2020). Roberts et al. (2020) show that after fine-tuning on open-book question-answer pairs, T5 (Raffel et al., 2020) can answer a large portion of knowledge-intensive questions; this is similar to taking a closed-book exam. However, Lewis et al. (2021) found that the high performance is mainly due to memorization of training-set questions, and Wang et al. (2021) found that it is still challenging for relatively small-scale pre-trained language models such as RoBERTa (Liu et al., 2019) to answer closed-book questions.
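As a concrete illustration, the two-step recite-and-answer scheme described in the introduction can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: `lm_generate` is a hypothetical stand-in for sampling from an LLM, the prompt templates are illustrative, and the majority vote over multiple sampled recitation-answer paths mirrors the "one or several relevant passages via sampling" idea from the abstract.

```python
from collections import Counter


def recite_and_answer(question, lm_generate, num_samples=5):
    """Two-step recite-and-answer sketch: (1) recite a relevant passage
    from the model's own memory, (2) answer conditioned on the recitation.
    Multiple sampled paths are aggregated by majority vote."""
    answers = []
    for _ in range(num_samples):
        # Step 1: knowledge recitation (intermediate "retrieval" from weights).
        recitation = lm_generate(
            f"Recite a passage relevant to the question: {question}")
        # Step 2: task execution, conditioned on the recited passage.
        answer = lm_generate(
            f"Passage: {recitation}\nQuestion: {question}\nAnswer:")
        answers.append(answer.strip())
    # Return the most common answer across sampled paths.
    return Counter(answers).most_common(1)[0][0]


def toy_lm(prompt):
    """Toy deterministic stand-in for an LLM, for illustration only."""
    if prompt.startswith("Recite"):
        return "The first 10 digits of pi are 3.14159 26535."
    return " 5"


print(recite_and_answer("What is the tenth decimal of pi?", toy_lm))  # → 5
```

With a real LLM, `lm_generate` would sample with non-zero temperature, so the recitations (and hence the answer paths) would differ across the `num_samples` draws, making the majority vote meaningful.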



* Work done during internship at Google.



Figure 1: Illustration of evaluating (few-shot) open-domain question answering with (closed-book) direct generation (Chowdhery et al., 2022), (open-book) retrieval-augmented generation (Izacard et al., 2022), and (closed-book) recitation-augmented generation (ours).

Open-domain question answering (Prager et al., 2007) refers to the task of generating answers for arbitrary context-free questions. In the open-book setting, it is typically assumed that the QA model can find the answer in an external corpus, e.g., Wikipedia (Chen et al., 2017; Izacard & Grave, 2021) or web pages (Lazaridou et al., 2022). This is analogous to taking an open-book exam where students can search over an external knowledge corpus. The standard pipeline (Chen et al., 2017; Izacard & Grave, 2021) usually consists of a learnable or non-learnable document retriever module and a learnable neural network-based reader module.


