FID-LIGHT: EFFICIENT AND EFFECTIVE RETRIEVAL-AUGMENTED TEXT GENERATION

Abstract

Retrieval-augmented generation models offer many benefits over standalone language models: besides a textual answer to a given query, they provide provenance items retrieved from an updateable knowledge base. However, they are also more complex systems and need to handle long inputs. In this work, we introduce FiD-Light to substantially increase the efficiency of the state-of-the-art retrieval-augmented FiD model, while maintaining the same level of effectiveness. Our FiD-Light model constrains the information flow from the encoder (which encodes passages separately) to the decoder (which consumes concatenated encoded representations). Furthermore, we adapt FiD-Light with re-ranking capabilities through textual source pointers, to improve the precision of the top-ranked provenance items. Our experiments on a diverse set of seven knowledge-intensive tasks (KILT) show that FiD-Light consistently improves the Pareto frontier between query latency and effectiveness. FiD-Light with source pointing sets substantial new state-of-the-art results on six KILT tasks for combined text generation and provenance retrieval evaluation, while maintaining reasonable efficiency.

1. INTRODUCTION

Enabling machine learning models to access information contained in parametric or non-parametric storage (i.e., retrieval-enhanced machine learning) can lead to efficiency and/or effectiveness improvements in a wide range of learning tasks (Zamani et al., 2022). For example, retrieval-augmented generation (Lewis et al., 2020), which is the focus of this paper, has manifold benefits over closed-loop language modelling in knowledge-intensive tasks: answers can be grounded in (multiple) specific pieces of information, which enables clear attribution (Dehghani et al., 2019; Rashkin et al., 2021; Lamm et al., 2021); the knowledge base can easily be managed, updated, and swapped (Izacard et al., 2022); the decomposition into retrieval and generation modules offers clear efficiency-effectiveness tradeoff controls; and the data structure of combined retrieval and text generation enables many insightful failure analyses. However, with these benefits also come downsides, such as a higher system complexity with higher training and inference cost. Therefore, our goal is to reduce costs as much as possible, while retaining effectiveness, to make these benefits more widely available.

The most effective approach for knowledge-intensive tasks, such as those contained in the KILT benchmark (Petroni et al., 2021), is the Fusion-in-Decoder (FiD) model proposed by Izacard & Grave (2020). The FiD model uses an external retriever, such as a dense retrieval model, to gather candidate passages, which are encoded together with the query by a T5-encoder (Raffel et al., 2020); the encoded vectors are concatenated and fed through a T5-decoder to produce a single output string. FiD can synthesize answers from multiple different sources, which leads to state-of-the-art results in many tasks from open-domain QA to fact verification (Hofstätter et al., 2022; Izacard et al., 2022).
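The Fusion-in-Decoder flow described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the random-embedding `encode` function stands in for a full T5 encoder, and all names and dimensions here are illustrative assumptions. What it shows is the key structural point: each (query + passage) pair is encoded independently, and the decoder then cross-attends over the concatenation of all encoded token vectors.

```python
import numpy as np

def encode(query_ids, passage_ids, d_model=8):
    # Toy stand-in for a T5 encoder: one output vector per input token.
    # (A real encoder is a transformer; here we just look up embeddings.)
    seq = query_ids + passage_ids
    rng = np.random.default_rng(0)
    table = rng.standard_normal((1000, d_model))
    return table[np.array(seq)]  # shape: (len(seq), d_model)

def fid_decoder_input(query_ids, passages):
    # FiD: encode every (query + passage) pair independently, then
    # concatenate ALL encoded token vectors for the decoder to attend over.
    encoded = [encode(query_ids, p) for p in passages]
    return np.concatenate(encoded, axis=0)

query = [1, 2, 3]
passages = [[10] * 100, [11] * 100, [12] * 100]  # three 100-token passages
dec_in = fid_decoder_input(query, passages)
# The decoder cross-attends over len(passages) * (|q| + |p|) vectors:
assert dec_in.shape == (3 * (3 + 100), 8)
```

With realistic settings (e.g., 40 passages of a few hundred tokens each) this concatenated sequence easily exceeds ten thousand vectors, which is exactly the cost the paper targets.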
While undoubtedly the leading architecture, in terms of effectiveness, for knowledge-intensive generation tasks, the FiD model is resource intensive. In state-of-the-art configurations, concatenating all encoded tokens before decoding often leads to sequences longer than 10 thousand vectors; coupled with auto-regressive decoding, this results in high inference latency. In Figure 1 we plot the average latency of a single query for the encoder and decoder modules of FiD, measured on a single TPUv4.[1] The first observation is the overpowering 93% of time spent on decoding in FiD. A common and straightforward approach to reduce the latency of FiD is to reduce the number of input passages, e.g., to only 10 passages. While this approach naturally reduces the overall latency, the decoding still requires 10 times as long as the encoding (see Figure 1). Crucially, this approach also reduces the model's effectiveness substantially, as we show later in this work (see §4.3).

To overcome the inefficiencies of the decoding, we propose FiD-Light, a simple yet effective adaptation of the FiD model. In FiD, the connection between the encoder and decoder has a large capacity for information. The retrieval community, in contrast, has shown that in applications such as dense retrieval with dot-product scoring, encoded information may be compressed to a fraction of the original input length, including representing passages in a single vector (Hofstätter et al., 2021) or multiple vectors (Chen et al., 2020). Following in these footsteps, we propose to compress the number of vectors per encoded passage to a fraction of the input vectors before they are accessed by the decoder. Using this approach, FiD-Light is able to ingest a large number of passages with strongly reduced latency, as illustrated in Figure 1.
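The per-passage compression idea can be made concrete with a short sketch. This is an assumed simplification (keeping only the first k encoded vectors of each passage), not necessarily the exact reduction mechanism used in FiD-Light; the array sizes below are illustrative. The point is that the decoder's cross-attention input shrinks from roughly (number of passages × passage length) vectors to (number of passages × k) vectors.

```python
import numpy as np

def compress_per_passage(encoded_passages, k):
    """FiD-Light-style funnel (a sketch, not the authors' exact code):
    keep only the first k encoded vectors of each passage before the
    per-passage encodings are concatenated for the decoder."""
    return np.concatenate([enc[:k] for enc in encoded_passages], axis=0)

# 40 passages, each encoded to 256 token vectors of width 8:
encoded = [np.zeros((256, 8)) for _ in range(40)]

full = np.concatenate(encoded, axis=0)       # FiD-style input: 10240 vectors
light = compress_per_passage(encoded, k=8)   # compressed input:   320 vectors
assert full.shape[0] == 10240 and light.shape[0] == 320
```

Because decoding cost grows with the length of the cross-attended sequence, this kind of reduction is what lets FiD-Light keep all 40 passages while bringing decoding time down to roughly the encoding time.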
Here we still use 40 passages, showing the same encoding time as FiD, but substantially faster decoding (now on par with the encoding time), for a total latency lower than FiD with 10 passages.

The knowledge-intensive tasks we aim to solve ideally require a system to produce both a generated output text and a ranked list of provenance items from the knowledge base. However, FiD is limited to producing output text only. Falling back to returning the original candidate ranking is usually sub-optimal, with low precision. To incorporate re-ranking capabilities into FiD-Light, we adapt a passage-marker workflow proposed by Lakhotia et al. (2021) as part of FiD-Ex. They marked the input passages with textual indices and trained the model to output the relevant indices in the output text. We find that using these textual indices or source pointers directly as output, as Lakhotia et al. (2021) proposed, is brittle and prone to distribution shifts in the number of expected relevant passages between training and evaluation (see §4.2). Therefore, our FiD-Light SP approach re-ranks the selected passages to the top of the ranked list, without discarding the rest of the retrieved list, for higher robustness and improved results.

We conduct experiments on seven tasks of the KILT benchmark composed by Petroni et al. (2021), spanning open-domain QA, slot filling, fact verification, and dialogue tasks. We study the following research questions to demonstrate the efficacy of our proposed FiD-Light SP model:

RQ1 What impact does training the retrieval module have on FiD-Light SP downstream results?

The quality of the final result is strongly bound by the recall quality of the retriever module. While many complex end-to-end training procedures have been proposed (Singh et al., 2021; Izacard et al., 2022), we focus on simple, yet effective, directly supervised dense retrieval training.
We show that a simple retrieval training comfortably outperforms a zero-shot retrieval baseline from Hofstätter et al. (2022), and the resulting FiD-Light SP downstream results take a major step towards a realistic oracle-retriever ceiling.

RQ2 How robust is our source pointing and re-ranking workflow applied to FiD and FiD-Light?

We use the available passage relevance information for each task in the KILT benchmark to train our source pointer output via text markers. We train the FiD(-Light) generator to output the indices for
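The source-pointer re-ranking step described above can be sketched as follows. The `[i]` marker format and all names here are assumptions for illustration (the exact textual marker format is not specified in this section): passages whose index the generator emits are moved to the top of the ranking in emitted order, while the remaining candidates keep their original retrieval order rather than being discarded.

```python
import re

def rerank_with_source_pointers(ranked_ids, generated_text):
    """Sketch of an FiD-Light SP-style re-ranking step (marker format
    '[i]' is an assumption). Pointed-to passages are promoted to the
    top; the rest of the retrieved list is kept, not thrown away."""
    pointed = [int(m) for m in re.findall(r"\[(\d+)\]", generated_text)]
    selected = [ranked_ids[i] for i in pointed if i < len(ranked_ids)]
    rest = [pid for pid in ranked_ids if pid not in selected]
    return selected + rest

candidates = ["pA", "pB", "pC", "pD"]
out = rerank_with_source_pointers(candidates, "[2] [0] some answer text")
assert out == ["pC", "pA", "pB", "pD"]
```

Keeping the tail of the retrieved list is the robustness point the paper makes: even if the generator points at too few (or the wrong) passages, the ranking degrades gracefully to the retriever's order instead of returning an incomplete list.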



[1] All our measurements in this work are conducted on TPUv4s; however, we confirmed that using V100 GPUs we observe a similar ratio of time spent in the encoder vs. the decoder of FiD and FiD-Light.



Figure 1: Average inference latency for a query of FiD & FiD-Light (T5-Base on a single TPUv4).

