PROMPTAGATOR: FEW-SHOT DENSE RETRIEVAL FROM 8 EXAMPLES

Abstract

Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other retrieval tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval problems, each targeting different search intents, queries, and search domains. In this paper, we propose to work on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To address this, we introduce Prompt-based Query Generation for Retrieval (PROMPTAGATOR): for each task, we feed the few-shot examples to a large language model (LLM) and prompt it to behave as a task-specific query generator. Using this generator, we can synthetically create a large number of relevant queries for any document, yielding abundant data for training task-specific retrievers, with no reliance on traditional resources such as Natural Questions (Kwiatkowski et al., 2019) or MS MARCO (Nguyen et al., 2016). Surprisingly, PROMPTAGATOR, using only 8 annotated examples, enables efficient dual-encoder retrievers to outperform computationally more expensive models trained on MS MARCO, such as ColBERT v2 (Santhanam et al., 2022), by more than 1.2 points nDCG@10 on average across 11 retrieval sets. Training standard-size rerankers on the same generated data yields a further 5.0-point nDCG@10 improvement. Our studies show that synthetic query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.

1. INTRODUCTION

Significant progress has been made on neural retrieval models such as dual encoders, which can search over large collections containing millions to billions of passages (Yih et al., 2011; Lee et al., 2019; Karpukhin et al., 2020). However, Thakur et al. (2021) recently proposed the BEIR heterogeneous retrieval benchmark and showed that it is still difficult for neural retrievers to perform well on a wide variety of retrieval tasks that lack dedicated training data. To address this problem, many previous approaches focus on transferring knowledge from high-resource question answering (QA) datasets such as MS MARCO (Nguyen et al., 2016), and propose architectures with good inductive biases, such as models that allow fine-grained token-level interaction (e.g., ColBERT (Khattab & Zaharia, 2020; Santhanam et al., 2022) and SPLADE (Formal et al., 2021)), which often come with higher inference cost. Data augmentation via synthetic query generation has previously been explored (Ma et al., 2021; Shakeri et al., 2020), but these question generators are learned from high-resource QA datasets and often cannot generalize well to new retrieval tasks.

We argue that models based on one or two QA datasets cannot be expected to perform well across all retrieval tasks. First, different retrieval tasks have very different search intents; in other words, different definitions of "relevance". For example, consider Figure 1(a): both Dbpedia-Entity (Hasibi et al., 2017) and FEVER (Thorne et al., 2018) retrieve documents from Wikipedia. However, Dbpedia-Entity asks for entities that are mentioned in the query, while FEVER asks for evidence that either supports or refutes a given statement.
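To make the architectural contrast above concrete, the following sketch (not from the paper) shows the two scoring schemes side by side, using random vectors in place of real encoder outputs: a dual encoder scores a query-document pair with a single dot product, while a ColBERT-style late-interaction model keeps one vector per token and sums, over query tokens, the maximum similarity against all document tokens.

```python
# Illustrative sketch only: random vectors stand in for learned embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Dual encoder: one vector per query and per document;
# relevance is a single dot product, so documents can be pre-indexed.
q_vec = rng.normal(size=dim)          # query embedding
d_vec = rng.normal(size=dim)          # document embedding
dual_score = float(q_vec @ d_vec)

# Late interaction (ColBERT-style "MaxSim"): one vector per token;
# for each query token, take the max similarity over document tokens, then sum.
q_toks = rng.normal(size=(3, dim))    # 3 query-token embeddings
d_toks = rng.normal(size=(5, dim))    # 5 document-token embeddings
sim = q_toks @ d_toks.T               # (3, 5) token-to-token similarities
late_score = float(sim.max(axis=1).sum())

print(dual_score, late_score)
```

The token-level scheme must store and compare many vectors per document, which is the higher inference cost the paragraph refers to; the dual encoder's single-vector score is what makes it efficient at the scale of millions to billions of passages.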
Motivated by these observations, we advocate the setting of Few-shot Retrieval for diverse retrieval tasks (§2), where each task comes with a short description and a few annotated examples that clearly illustrate the search intent. To address this challenge, we propose Prompt-based Query Generation for Retrieval (PROMPTAGATOR) (§3): for each new retrieval task, we feed the few-shot examples to a large language model (LLM) such as FLAN[1] (Wei et al., 2022a) and prompt it to perform doc-to-query generation. Importantly, the few-shot examples ensure that we capture the specific search intent of that task. Using this query generator, we can synthetically generate a large number of relevant queries for any document, yielding abundant data for training any retriever, including highly efficient dual-encoder models. We find that our few-shot LLM query generator can produce good queries without any fine-tuning (§3.1). In fact, as shown in Figure 1(b), our synthetically generated data is strong enough to completely forego annotated query-document pairs from traditional high-resource datasets such as Natural Questions (Kwiatkowski et al., 2019) or MS MARCO (Nguyen et al., 2016).

While PROMPTAGATOR is not the first application of LLMs to retrieval, prior work did not explore task-specific few-shot adaptation and often came with high inference cost. Neelakantan et al. (2022) propose to use GPT-3 (Brown et al., 2020) in dual encoders; however, their embedding dimension is 12k, which makes the search index footprint and inference cost prohibitively high for many applications. Sachan et al. (2022) and Bonifacio et al. (2022) prompt LLMs for question generation, but did not explore using task-specific few-shot prompts for rapid task adaptation.[2] They also focus primarily on models that rerank top retrievals from an existing retriever, rather than directly adapting the underlying retriever, which must efficiently search over millions or billions of documents.

To summarize, the contributions of this paper are as follows:

• We highlight previously overlooked differences across retrieval tasks (e.g., search intent and query distribution), and propose a few-shot retrieval evaluation for the BEIR dataset.



[1] FLAN is an LLM that is not trained on any document retrieval or document-to-query generation tasks.
[2] InPars (Bonifacio et al., 2022) used the same few-shot prompt, constructed from MS MARCO, to generate reranker data for multiple tasks, so no task-specific prompt is used.



Figure 1: Few-shot retrieval with PROMPTAGATOR. Left (a): Retrieval tasks from BEIR differ in query distribution, retrieval corpus, and search intent. Middle (b): Most prior work follows the supervised setting, which trains a model on a large QA retrieval dataset and transfers it to other retrieval tasks. Right (c): Few-shot PROMPTAGATOR performance: average nDCG@10 on 11 datasets from BEIR for our PROMPTAGATOR models and previous MS MARCO-supervised models (e.g., SPLADE v2).

Hence, which documents count as relevant to a query can be very different from one task to another, even when the tasks share the same domain. Second, different tasks have distinct query distributions even when their search intents are similar. For example, queries in HotpotQA (Yang et al., 2018) are long compositional questions, while queries in FiQA (Maia et al., 2018) are short financial questions.

• We propose PROMPTAGATOR, a simple recipe for few-shot retrieval that prompts an LLM to generate synthetic task-specific training data. For the first time, we can train fully neural retrievers and rerankers solely based on a few supervised examples.

• Our results show that, surprisingly, PROMPTAGATOR with two to eight examples produces significantly better retrievers than recent models trained on MS MARCO or NQ, which have over 500k human-annotated examples (Figure 1(c)) and use more expensive architectures: PROMPTAGATOR outperforms ColBERT v2 and SPLADE v2 on the 11 retrieval tasks we tested, while reranking boosts results by another 5 points on the standard retrieval evaluation metric (nDCG@10).
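The doc-to-query generation step underlying this recipe can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's exact prompt: `call_llm` stands in for any LLM sampling API, and the template, task description, and example texts are placeholders the reader would replace with the target task's few-shot examples.

```python
# Hypothetical sketch of few-shot doc-to-query generation for retrieval training.
# The prompt template and examples below are illustrative, not from the paper.

FEW_SHOT_EXAMPLES = [  # a handful of (document, query) pairs annotated for the task
    ("The Eiffel Tower was completed in 1889.", "when was the eiffel tower built"),
    ("Water boils at 100 degrees Celsius at sea level.", "boiling point of water"),
]

def build_prompt(task_description, examples, new_document):
    """Assemble a few-shot prompt that asks the LLM to write a query for new_document."""
    parts = [task_description, ""]
    for doc, query in examples:
        parts.append(f"Document: {doc}")
        parts.append(f"Query: {query}")
        parts.append("")
    parts.append(f"Document: {new_document}")
    parts.append("Query:")
    return "\n".join(parts)

def generate_queries(call_llm, task_description, examples, corpus, n_per_doc=8):
    """Yield synthetic (query, document) pairs for training a task-specific retriever."""
    for doc in corpus:
        prompt = build_prompt(task_description, examples, doc)
        for _ in range(n_per_doc):
            # Each sampled query is paired with the document that produced it.
            yield call_llm(prompt), doc
```

Because the few-shot examples come from the target task, the sampled queries inherit that task's search intent and query distribution; the resulting pairs can then train an ordinary dual encoder or reranker with no MS MARCO or NQ supervision.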

