PROMPTAGATOR: FEW-SHOT DENSE RETRIEVAL FROM 8 EXAMPLES

Abstract

Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other retrieval tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval problems, each targeting different search intents, queries, and search domains. In this paper, we propose to study Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To address this setting, we introduce Prompt-based Query Generation for Retrieval (PROMPTAGATOR): for each task, we feed the few-shot examples to a large language model (LLM) and prompt it to behave as a task-specific query generator. Using this generator, we can synthetically produce a large number of relevant queries for any document, yielding abundant data for training task-specific retrievers, with no reliance on traditional resources such as Natural Questions (Kwiatkowski et al., 2019) or MS MARCO (Nguyen et al., 2016). Surprisingly, PROMPTAGATOR with only 8 annotated examples enables efficient dual encoder retrievers to outperform computationally more expensive models trained on MS MARCO, such as ColBERT v2 (Santhanam et al., 2022), by more than 1.2 points nDCG@10 on average across 11 retrieval sets. Further training standard-size rerankers on the same generated data yields another 5.0 points of nDCG@10 improvement. Our studies show that synthetic query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.
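The few-shot prompting step described in the abstract can be sketched as follows. This is a minimal illustration of constructing a task-specific query-generation prompt from a handful of (document, query) pairs; the field labels, example pairs, and the downstream LLM call are all illustrative assumptions, not the paper's exact prompt format.

```python
def build_fewshot_prompt(examples, new_doc,
                         doc_label="Passage", query_label="Query"):
    """Concatenate k (document, query) pairs into a task-specific
    few-shot prompt, ending with the new document so that an LLM
    completes it with a relevant query for that document."""
    parts = [f"{doc_label}: {doc}\n{query_label}: {query}"
             for doc, query in examples]
    # Leave the final query slot empty for the LLM to fill in.
    parts.append(f"{doc_label}: {new_doc}\n{query_label}:")
    return "\n\n".join(parts)

# Hypothetical annotated examples for some retrieval task
# (the paper uses up to 8 such pairs per task).
examples = [
    ("Paris is the capital and largest city of France.",
     "capital of France"),
    ("The Nile is the longest river in Africa.",
     "longest river in Africa"),
]
prompt = build_fewshot_prompt(
    examples, "Mount Everest is Earth's highest mountain.")
# `prompt` would then be sent to an LLM; the generated completion
# serves as a synthetic training query for the new document.
```

Repeating this over every document in a corpus yields the synthetic (query, document) training pairs used to train the task-specific retriever.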

1. INTRODUCTION

Significant progress has been made on neural retrieval models such as dual encoders, which can search over large collections containing millions to billions of passages (Yih et al., 2011; Lee et al., 2019; Karpukhin et al., 2020). However, Thakur et al. (2021) recently proposed the BEIR heterogeneous retrieval benchmark and showed that it is still difficult for neural retrievers to perform well on a wide variety of retrieval tasks that lack dedicated training data. To address this problem, many previous approaches focus on transferring knowledge from high-resource question answering (QA) datasets such as MS MARCO (Nguyen et al., 2016), and propose architectures with good inductive biases, such as models that allow fine-grained token-level interaction (e.g., ColBERT (Khattab & Zaharia, 2020; Santhanam et al., 2022) and SPLADE (Formal et al., 2021)), which often come with higher inference cost. Data augmentation via synthetic query generation has also been explored (Ma et al., 2021; Shakeri et al., 2020), but these question generators are trained on high-resource QA datasets and often fail to generalize to new retrieval tasks.

We argue that models built on one or two QA datasets cannot be expected to perform well across all retrieval tasks. First, different retrieval tasks have very different search intents; in other words, different definitions of "relevance". For example, consider Figure 1(a): both Dbpedia-Entity (Hasibi et al., 2017) and FEVER (Thorne et al., 2018) retrieve documents from Wikipedia. However, Dbpedia-Entity asks for entities that are mentioned in the query, while FEVER asks for evidence that either supports or refutes a given statement. Which

