DISTILLING KNOWLEDGE FROM READER TO RETRIEVER FOR QUESTION ANSWERING

Abstract

Information retrieval is an important component of many natural language processing systems, such as open-domain question answering. While traditional methods were based on hand-crafted features, continuous representations based on neural networks have recently obtained competitive results. A challenge of using such methods is obtaining supervised data to train the retriever model, in the form of pairs of queries and support documents. In this paper, we propose a technique, inspired by knowledge distillation, to learn retriever models for downstream tasks without annotated pairs of queries and documents. Our approach leverages the attention scores of a reader model, which solves the task based on the retrieved documents, to obtain synthetic labels for the retriever. We evaluate our method on question answering, obtaining state-of-the-art results.

1. INTRODUCTION

Information retrieval is an important component of many natural language processing tasks, such as question answering (Voorhees et al., 1999) or fact checking (Thorne et al., 2018). For example, many real-world question answering systems start by retrieving a set of support documents from a large source of knowledge such as Wikipedia. Then, a finer-grained model processes these documents to extract the answer. Traditionally, information retrieval systems were based on hand-crafted sparse representations of text documents, such as TF-IDF or BM25 (Jones, 1972; Robertson et al., 1995). Recently, methods based on dense vectors and machine learning have shown promising results (Karpukhin et al., 2020; Khattab et al., 2020). Deep neural networks based on pre-training, such as BERT (Devlin et al., 2019), have been used to encode documents into fixed-size representations. These representations are then queried using approximate nearest neighbors (Johnson et al., 2019). These techniques have led to improved performance on various question answering tasks.

A challenge of applying machine learning to information retrieval is obtaining training data for the retriever. To train such models, one needs pairs of queries and the corresponding lists of documents containing the information needed to answer them. Unfortunately, hand-labeling data to that end is time-consuming, and many datasets and applications lack such annotations. An alternative approach is to resort to heuristics, or weakly supervised learning, for example by considering all documents containing the answer as positive examples. However, these approaches suffer from the following limitations. First, frequent answers or entities might lead to false positive examples. As an example, consider the question "where was Ada Lovelace born?". The sentence "Ada Lovelace died in 1852 in London" would be considered a positive example, because it contains the answer "London".
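The false-positive problem described above is easy to reproduce. The following sketch implements the weak-supervision heuristic (any passage containing the answer string counts as a positive retrieval example); the passage texts and function name are invented for illustration and are not part of the paper's pipeline.

```python
def weak_labels(question, answer, passages):
    """Label a passage positive if it contains the answer string."""
    return [(p, answer.lower() in p.lower()) for p in passages]

passages = [
    "Ada Lovelace was born in 1815 in London.",   # true positive
    "Ada Lovelace died in 1852 in London.",       # false positive: wrong fact
    "Lovelace worked with Charles Babbage.",      # true negative
]

labels = weak_labels("where was Ada Lovelace born?", "London", passages)
for passage, positive in labels:
    print(positive, passage)
```

The second passage is marked positive even though it answers a different question, which is exactly the kind of noisy label that motivates the attention-based supervision proposed in this paper.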
A second limitation is that for some tasks, such as fact checking or long form question answering, such heuristics might not be applicable directly.

In this paper, we propose a procedure to learn retriever systems without strong supervision in the form of pairs of queries and documents. Following previous work (Chen et al., 2017), our approach uses two models: the first retrieves documents from a large source of knowledge (the retriever); the second processes the support documents to solve the task (the reader). Our method is inspired by knowledge distillation (Hinton et al., 2015), and uses the reader model to obtain synthetic labels to train the retriever model. More precisely, we use a sequence-to-sequence model as the reader, and use its attention activations over the input documents as synthetic labels to train the retriever. In other words, we assume that attention activations are a good proxy for the relevance of documents. We then train the retriever to reproduce the ranking of documents induced by this metric.

We make the following contributions:
• First, we show that attention scores from a sequence-to-sequence reader model are a good measure of document relevance (Sec. 3.2);
• Second, inspired by knowledge distillation, we propose to iteratively train the retriever from these activations, and compare different loss functions (Sec. 3.4);
• Finally, we evaluate our method on three question-answering benchmarks, obtaining state-of-the-art results (Sec. 4).

Our code is available at: github.com/facebookresearch/FiD.

2. RELATED WORK

We briefly review information retrieval based on machine learning. We refer the reader to Manning et al. (2008) and Mitra et al. (2018) for a more exhaustive introduction to the subject.

Vector space models. In traditional information retrieval systems, documents and queries are represented as sparse vectors, each dimension corresponding to a different term. Different schemes have been considered to weigh the terms, the most well known being based on inverse document frequency, or term specificity (Jones, 1972). This technique was later extended, leading to the BM25 weighting scheme which is still widely used today (Robertson et al., 1995). A limitation of sparse representations is that the terms of the query need to match the terms of the returned documents. To overcome this, Deerwester et al. (1990) proposed to use latent semantic analysis for indexing, leading to low-dimensional dense representations of documents.

Neural information retrieval. Following the success of deep learning on other natural language processing tasks, neural networks were applied to information retrieval. Huang et al. (2013) proposed a deep bag-of-words model, where queries and documents were embedded independently, a technique known as the bi-encoder. Documents were then ranked by their cosine similarity with the query, and the model was trained on clickthrough data from a search engine. This technique was later extended using convolutional neural networks (Shen et al., 2014) and recurrent neural networks (Palangi et al., 2016). A limitation of embedding documents and queries independently is that it does not capture fine-grained interactions between them. This led Nogueira & Cho (2019) and Yang et al. (2019) to use a BERT model to jointly embed documents and the query, a technique known as the cross-encoder.

End-to-end retrieval.
Most of the methods described in the previous paragraph were used to rerank a small number of documents, usually returned by a traditional IR system. In the context of ad-hoc document retrieval, Gillick et al. (2018) showed that bi-encoder models could be competitive with traditional IR systems. For open-domain question answering, Karpukhin et al. (2020) introduced dense passage retrieval (DPR), which uses dense embeddings and nearest neighbors search. More precisely, question and passage embeddings are obtained using a BERT-based bi-encoder model, which is trained on a small dataset of question and passage pairs. Then, the full knowledge source (Wikipedia) is encoded using this model, and passages are retrieved by computing the k-nearest neighbors of the embedding of the question. Jointly embedding the query and documents makes cross-encoder models intractable for large databases. To address this limitation, Humeau et al. (2019) introduced the poly-encoder architecture, in which each document is represented by multiple vectors instead of one. Similarly, Khattab et al. (2020) proposed a scoring function where each term of the query and documents is represented by a single vector. To make the method tractable, their system retrieves documents with an approximate score, which are then re-ranked with the exact one. Finally, Luan et al. (2020) conducted a theoretical and empirical study of sparse, dense and cross-attention information retrieval systems.

Unsupervised learning. Closest to our work, there is a growing body of work trying to learn information retrieval systems from unsupervised data. Lee et al. (2019) introduced the inverse cloze task
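The bi-encoder pattern that recurs throughout this section can be sketched in a few lines: queries and documents are embedded independently, documents are encoded once into an index, and retrieval reduces to scoring the query embedding against that index by inner product. The toy bag-of-words "encoder" below is a stand-in for a learned model such as BERT; the vocabulary and documents are invented for illustration.

```python
def embed(text, vocab):
    """Toy bag-of-words encoder: one dimension per vocabulary term."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["ada", "lovelace", "born", "london", "babbage"]
documents = [
    "ada lovelace born london",
    "charles babbage engine",
]

# Documents are encoded once, offline; at query time only the query
# is embedded and scored against the precomputed index.
index = [embed(d, vocab) for d in documents]
query_vec = embed("where was ada lovelace born", vocab)
scores = [dot(query_vec, d) for d in index]
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])
```

In systems like DPR, the exhaustive scoring loop is replaced by approximate k-nearest-neighbor search over millions of passage embeddings, but the query/document factorization is the same.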

