DISTILLING KNOWLEDGE FROM READER TO RETRIEVER FOR QUESTION ANSWERING

Abstract

The task of information retrieval is an important component of many natural language processing systems, such as open domain question answering. While traditional methods were based on hand-crafted features, continuous representations based on neural networks have recently obtained competitive results. A challenge of using such methods is obtaining supervised data to train the retriever model, in the form of pairs of queries and support documents. In this paper, we propose a technique to learn retriever models for downstream tasks, inspired by knowledge distillation, which does not require annotated pairs of queries and documents. Our approach leverages the attention scores of a reader model, used to solve the task based on retrieved documents, to obtain synthetic labels for the retriever. We evaluate our method on question answering, obtaining state-of-the-art results.

1. INTRODUCTION

Information retrieval is an important component of many natural language processing tasks, such as question answering (Voorhees et al., 1999) or fact checking (Thorne et al., 2018). For example, many real world question answering systems start by retrieving a set of support documents from a large source of knowledge such as Wikipedia. Then, a finer-grained model processes these documents to extract the answer. Traditionally, information retrieval systems were based on hand-crafted sparse representations of text documents, such as TF-IDF or BM25 (Jones, 1972; Robertson et al., 1995). Recently, methods based on dense vectors and machine learning have shown promising results (Karpukhin et al., 2020; Khattab et al., 2020). Deep neural networks based on pre-training, such as BERT (Devlin et al., 2019), have been used to encode documents into fixed-size representations. These representations are then queried using approximate nearest neighbors (Johnson et al., 2019). These techniques have led to improved performance on various question answering tasks.

A challenge of applying machine learning to information retrieval is obtaining training data for the retriever. To train such models, one needs pairs of queries and the corresponding lists of documents that contain the information relevant to the queries. Unfortunately, hand-labeling data to that end is time consuming, and many datasets and applications lack such annotations. An alternative is to resort to heuristics, or weakly supervised learning, for example by considering that all documents containing the answer are positive examples. However, these approaches suffer from the following limitations. First, frequent answers or entities might lead to false positive examples. As an example, consider the question "where was Ada Lovelace born?". The sentence "Ada Lovelace died in 1852 in London" would be considered a positive example, because it contains the answer "London".
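The false-positive problem with the answer-containment heuristic can be made concrete with a small sketch. The function name and setup below are illustrative, not part of the paper; the snippet only demonstrates how string matching labels both a genuine support passage and an irrelevant one as positives.

```python
# Sketch of the answer-containment heuristic for weak supervision.
# Any passage containing the answer string is labeled positive.

def weak_label(answer: str, passages: list[str]) -> list[int]:
    """Label 1 for passages containing the answer string, else 0."""
    return [1 if answer.lower() in p.lower() else 0 for p in passages]

passages = [
    "Ada Lovelace was born in London in 1815.",  # actually answers the question
    "Ada Lovelace died in 1852 in London.",      # contains "London" but does
                                                 # not answer where she was born
]
labels = weak_label("London", passages)
print(labels)  # [1, 1] -- the heuristic cannot tell the two passages apart
```

Both passages receive the same label, even though only the first supports the answer to the question, which is exactly the failure mode described above.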
A second limitation is that for some tasks, such as fact checking or long form question answering, such heuristics might not be applicable directly.

In this paper, we propose a procedure to learn retriever systems without strong supervision in the form of pairs of queries and documents. Following previous work (Chen et al., 2017), our approach uses two models: the first retrieves documents from a large source of knowledge (the retriever); the second processes the support documents to solve the task (the reader). Our method is inspired by knowledge distillation (Hinton et al., 2015), and uses the reader model to obtain synthetic labels to train the retriever model. More precisely, we use a sequence-to-sequence model as the reader, and use its attention activations over the input documents as synthetic labels to train the retriever. In other words, we assume that attention activations are a good proxy for the relevance of each document.
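One plausible instantiation of this distillation signal, sketched below under stated assumptions: the reader's attention mass on each retrieved passage is turned into a target distribution, and the retriever is trained to match it by minimizing a KL divergence between the two distributions. How the attention activations are aggregated per passage, and the exact form of the loss, are assumptions here rather than details taken from the text above.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(retriever_scores: list[float],
                      attention_mass: list[float]) -> float:
    """KL(target || retriever), one score per retrieved passage.

    The target distribution comes from the reader's aggregated attention
    over each passage; the retriever's distribution comes from its
    query-passage relevance scores.
    """
    p = softmax(attention_mass)     # synthetic target from the reader
    q = softmax(retriever_scores)   # retriever's current distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy scores for 3 retrieved passages: the loss is zero only when the
# retriever's distribution already matches the reader's attention.
loss = distillation_loss([2.0, 0.5, -1.0], [1.5, 1.0, -2.0])
```

Minimizing this loss pushes the retriever to rank passages the way the reader attends to them, without ever needing gold query-document pairs.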

