FID-LIGHT: EFFICIENT AND EFFECTIVE RETRIEVAL-AUGMENTED TEXT GENERATION

Abstract

Retrieval-augmented generation models offer many benefits over standalone language models: besides a textual answer to a given query, they provide provenance items retrieved from an updateable knowledge base. However, they are also more complex systems and need to handle long inputs. In this work, we introduce FiD-Light to substantially increase the efficiency of the state-of-the-art retrieval-augmented FiD model, while maintaining the same level of effectiveness. Our FiD-Light model constrains the information flow from the encoder (which encodes passages separately) to the decoder (which consumes the concatenated encoded representations). Furthermore, we adapt FiD-Light with re-ranking capabilities through textual source pointers to improve the precision of the top-ranked provenance items. Our experiments on a diverse set of seven knowledge-intensive tasks (KILT) show that FiD-Light consistently improves the Pareto frontier between query latency and effectiveness. FiD-Light with source pointing sets substantial new state-of-the-art results on six KILT tasks for combined text generation and provenance retrieval evaluation, while maintaining reasonable efficiency.

1. INTRODUCTION

Enabling machine learning models to access information contained in parametric or non-parametric storage (i.e., retrieval-enhanced machine learning) can lead to efficiency and/or effectiveness improvements in a wide range of learning tasks (Zamani et al., 2022). For example, retrieval-augmented generation (Lewis et al., 2020), which is the focus of this paper, has manifold benefits over closed-loop language modelling in knowledge-intensive tasks: answers can be grounded in (multiple) specific pieces of information, which enables clear attribution (Dehghani et al., 2019; Rashkin et al., 2021; Lamm et al., 2021); the knowledge base can easily be managed, updated, and swapped (Izacard et al., 2022); the decomposition into retrieval and generation modules offers clear efficiency-effectiveness tradeoff controls; and the data structure of combined retrieval and text generation enables many insightful failure analyses. However, with these benefits also come downsides, such as higher system complexity with higher training and inference cost. Our goal is therefore to reduce these costs as much as possible, while retaining effectiveness, to make the benefits more widely available.

The most effective approach for knowledge-intensive tasks, such as those contained in the KILT benchmark (Petroni et al., 2021), is the Fusion-in-Decoder (FiD) model proposed by Izacard & Grave (2020). The FiD model uses an external retriever, such as a dense retrieval model, to gather candidate passages, which are encoded together with the query by a T5 encoder (Raffel et al., 2020); the encoded vectors are concatenated and fed through a T5 decoder to produce a single output string. FiD can synthesize answers from multiple different sources, which leads to state-of-the-art results in many tasks, from open-domain QA to fact verification (Hofstätter et al., 2022; Izacard et al., 2022).
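The information flow of the FiD architecture described above can be sketched in a few lines. This is a toy illustration only: the `encode` function below is a random stand-in for a real T5 encoder (its output shapes, not its values, are what matter), and all dimensions are assumptions for the sketch.

```python
import numpy as np

def encode(query, passage, d_model=8, max_len=16):
    # Toy stand-in for a T5 encoder: returns one d_model-dim vector per
    # input token of the concatenated (query, passage) pair.
    tokens = (query + " " + passage).split()[:max_len]
    rng = np.random.default_rng(abs(hash(" ".join(tokens))) % 2**32)
    return rng.standard_normal((len(tokens), d_model))

def fuse(query, passages):
    # FiD: each (query, passage) pair is encoded independently ...
    encoded = [encode(query, p) for p in passages]
    # ... then all encoded token vectors are concatenated along the
    # sequence axis, so the decoder can cross-attend over every passage
    # at once while generating a single output string.
    return np.concatenate(encoded, axis=0)

fused = fuse("who wrote hamlet",
             ["Shakespeare wrote Hamlet.", "Hamlet is a tragedy."])
```

Note that the fused sequence length grows linearly with the number of retrieved passages, which is the efficiency pressure FiD-Light targets.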
While undoubtedly the leading architecture in terms of effectiveness for knowledge-intensive generation tasks, the FiD model is resource intensive. In state-of-the-art configurations, concatenating all encoded tokens before decoding often leads to sequences longer than ten thousand vectors; coupled with auto-regressive decoding, this results in high inference latency. In Figure 1 we plot the average latency of a single query, measured on a single TPUv4, for the encoder and decoder modules.
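A back-of-the-envelope calculation makes the scaling pressure concrete. The passage count and passage length below are illustrative assumptions, not the paper's exact configuration, but they show how easily the fused sequence exceeds ten thousand vectors, and why every auto-regressive decoding step is expensive: each step cross-attends over the entire fused sequence.

```python
# Illustrative (assumed) configuration for a FiD-style setup.
n_passages = 40            # retrieved candidate passages
tokens_per_passage = 256   # query + passage tokens fed to the encoder
output_tokens = 32         # length of the generated answer

# Length of the concatenated encoder output the decoder attends over.
fused_length = n_passages * tokens_per_passage  # encoded vectors

# Cross-attention score computations over the whole generation:
# every one of the output_tokens decoding steps attends to all
# fused_length encoder vectors.
cross_attention_scores = output_tokens * fused_length
```

Halving either the number of passages or the number of encoded vectors kept per passage halves the decoder's cross-attention work, which is the lever FiD-Light's constrained encoder-to-decoder information flow exploits.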

