AUGMENTING ZERO-SHOT DENSE RETRIEVERS WITH PLUG-IN MIXTURE-OF-MEMORIES

Anonymous authors

Abstract

In this paper we improve the zero-shot generalization ability of language models via Mixture-of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora ("external memories"), with the option to "plug in" new memory at inference time. We develop a joint learning mechanism that trains the augmentation component with latent labels derived from the end retrieval task, paired with hard negatives from the memory mixture. We instantiate the model in a zero-shot dense retrieval setting by augmenting a strong T5-based retriever with MoMA. Our model, MoMA-DR, obtains strong zero-shot retrieval accuracy on the eighteen tasks included in the standard BEIR benchmark. It outperforms other dense retrieval models of similar scale and achieves accuracy comparable to systems that seek generalization from larger encoder models or vector indices. Our analysis illustrates the necessity of augmenting with the mixture-of-memory for robust generalization, the benefits of joint learning, and how MoMA-DR utilizes the plug-in memory at inference time without changing its parameters. We plan to open-source our code.

1. INTRODUCTION

Scaling up language models (with more parameters, compute, and annotation data) improves model generalization on downstream applications (Raffel et al., 2019; Brown et al., 2020; Smith et al., 2022), but with diminishing returns: linear improvements on downstream metrics often require exponentially more parameters and computing cost (Kaplan et al., 2020; Hoffmann et al., 2022). Scaling pretrained language models in this way is thus economically unsustainable (Strubell et al., 2020; Bender et al., 2021; Zhang et al., 2022). Retrieval-augmented language models provide a promising alternative. They allow language models to efficiently access vast resources from an external corpus (Guu et al., 2020; Borgeaud et al., 2022) that serves as a kind of "memory" they can refer to when making predictions, alleviating the need to memorize as much information in their own network parameters (Roberts et al., 2020). This open-book approach helps language models generalize better on token prediction and machine translation (Khandelwal et al., 2019; Borgeaud et al., 2022), as well as on tasks that already involve a first-stage retrieval component, e.g., OpenQA (Borgeaud et al., 2022; Izacard et al., 2022).

In this paper we improve the zero-shot generalization ability of language models using Mixture-of-Memory Augmentation (MoMA), a new retrieval augmentation mechanism. Instead of a single corpus, MoMA retrieves documents from a "mixture" of multiple external corpora. This mechanism also allows removing and/or "plugging in" new corpora at inference time, when more information about the target task is revealed, or as an additional way for users to control the model. Guiding a retrieval model to leverage multiple corpora is not trivial; we jointly train the augmentation component and the dense retriever using supervised relevance signals and self-mined hard negatives.
We instantiate MoMA with a T5 encoder-decoder model (Ni et al., 2022) and apply it to the dense retrieval task (Karpukhin et al., 2020). Our resulting retrieval system, MoMA-DR, uses a set of augmenting documents from the mixture-of-memories to enhance its query representation with important context; the retriever then uses the enhanced query representation to retrieve a final candidate set. At inference time, we plug the target task's corpus into the memory mixture to introduce in-domain context, without updating any parameters. We evaluate MoMA-DR on zero-shot dense retrieval (ZeroDR) (Thakur et al., 2021b), an important real-world application. Our experiments on the eighteen retrieval tasks included in BEIR (Thakur et al., 2021b), the standard ZeroDR benchmark, demonstrate the improved zero-shot ability of MoMA-DR. It outperforms the baseline T5 retriever without the MoMA augmentation component, as well as recent state-of-the-art dense retrieval systems of the same scale, by large margins. It also achieves performance comparable to ZeroDR systems that scale their model parameters, training data, and/or number of vector representations beyond those in this study. Our analysis reveals that large and diverse corpora in the memory lead to the best performance; using only a single corpus during training does not improve performance on unseen target tasks. Joint learning is also important for MoMA-DR to utilize the diverse information in the mixture. Our analysis and case studies illustrate how MoMA-DR leverages the plug-in memory at testing time to enrich its query representations with in-domain information that was not available during training.
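The inference pipeline above can be sketched in a few lines. This is only a toy illustration under simplifying assumptions, not the paper's implementation: `embed` is a stand-in hashing encoder rather than the T5 encoder, and the fusion step simply sums vectors, whereas MoMA-DR consumes the augmenting documents inside the encoder-decoder. All function and variable names here are ours.

```python
import numpy as np

def embed(text, dim=64):
    """Toy deterministic bag-of-words hashing embedding (stand-in for a learned encoder)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def top_k(query_vec, docs, k):
    """Return the k documents most similar to query_vec by inner product."""
    scores = [float(query_vec @ embed(d)) for d in docs]
    order = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in order]

def moma_retrieve(query, memories, target_corpus, k_per_memory=2, k_final=3):
    """Sketch of MoMA-DR inference:
    1) retrieve augmenting documents from each memory in the mixture
       (the target corpus can be plugged in as one more memory),
    2) enhance the query representation with them,
    3) retrieve the final candidates from the target corpus."""
    q_vec = embed(query)
    augmenting = []
    for memory in memories:  # mixture of corpora, including any plug-in memory
        augmenting += top_k(q_vec, memory, k_per_memory)
    # Enhance the query: here, a naive sum of query and augmenting-document vectors.
    enhanced = q_vec + sum(embed(d) for d in augmenting)
    enhanced /= np.linalg.norm(enhanced)
    return top_k(enhanced, target_corpus, k_final)
```

Note how the plug-in property falls out of the design: adding a new corpus to `memories` changes no model parameters, only what the query representation is enriched with.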

2. RELATED WORK

Recent research has explored two common ways to construct the external memory in retrieval-augmented language models. The first is to use a token vocabulary and retrieve similar tokens for language models to copy from when predicting the next token (Khandelwal et al., 2019; Zhong et al., 2022). The second is to use a document corpus, often the pretraining corpus or a task-specific one, and retrieve related documents (text sequences) from the corpus as additional input (Guu et al., 2020; Borgeaud et al., 2022). Document-based memories align well with language systems that already involve a first-stage retrieval component, such as knowledge-intensive tasks (Petroni et al., 2020) and OpenQA (Chen et al., 2017). This work falls into the latter category. Learning to retrieve useful documents to augment the language model is challenging, since human annotations on the usefulness of augmentation documents are costly and seldom available. The most straightforward approach is to use representations from raw pretrained language models to find documents similar to the task input, i.e., unsupervised dense retrieval (Guu et al., 2020; Borgeaud et al., 2022). Adapting dense retrieval models trained for relevance matching is another common choice (Izacard & Grave, 2020b; Lewis et al., 2020; Yu et al., 2021). A more formal solution is to jointly learn the augmentation component end-to-end using supervision from the final task, for example, treating the augmentation as latent variables and applying EM (Zhao et al., 2021), or distilling the augmentation component from feedback of the final model (Izacard & Grave, 2020a). In parallel work, Izacard et al. (2022) found the most effective approach to be attention distillation (ADist), which trains the augmentation component using soft labels derived from the end model's attention on augmentation documents.
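The attention-distillation idea can be illustrated with a minimal sketch, assuming a simple formulation: the retriever's score distribution over augmentation documents is pulled toward soft labels given by the end model's attention mass on those documents via a KL-divergence loss. The function names are ours, and real ADist aggregates cross-attention across layers, heads, and tokens rather than taking a single score per document.

```python
import numpy as np

def softmax(x, temp=1.0):
    """Numerically stable softmax over a vector of scores."""
    z = np.asarray(x, dtype=float) / temp
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def adist_loss(retriever_scores, attention_scores):
    """KL(soft attention labels || retriever distribution): the retriever is
    trained so that its relevance distribution over augmentation documents
    matches the end model's attention over the same documents."""
    p = softmax(attention_scores)  # soft labels from the end model's attention
    q = softmax(retriever_scores)  # retriever's relevance distribution
    eps = 1e-12                    # guard against log(0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

When the retriever already ranks documents exactly as the end model attends to them, the loss is zero; any disagreement yields a positive gradient signal for the retriever.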
Recent dense retrieval systems achieve strong empirical performance in supervised settings (Lee et al., 2019; Karpukhin et al., 2020; Xiong et al., 2020). Unfortunately, dense retrieval models trained on a resource-rich source task, e.g., web search, do not perform as well when zero-shot transferred to other domains (Thakur et al., 2021a). This is concerning since many important real-world scenarios do not have the luxury of web-corpus training signals and must rely on near zero-shot transfer, especially the medical and enterprise search domains (Kim, 2022). Xin et al. (2021) analyzed the challenge of shifting between training and testing domains, and leveraged domain-invariant learning to mitigate the gap. Another common approach is to first construct domain-specific weak supervision for each task, and then use it to train the dense retriever (Thakur et al., 2021a; Wang et al., 2022). Additionally, continued pretraining of the language model also improves its generalization ability in ZeroDR (Izacard et al., 2021; Gao & Callan, 2022).

Other work seeks better generalization in ZeroDR from additional resources, for example, combining with sparse retrieval to introduce exact-match signals (Formal et al., 2021), using multiple vectors per document for term-level matching (Khattab & Zaharia, 2020b), or scaling up the retrieval model using large-scale pretrained language models (Ni et al., 2021; Neelakantan et al., 2022).

3. METHOD

In this section we first describe our Mixture-of-Memory Augmentation. We then discuss how it is jointly learned with the end system and how it enables plug-in memory at inference time.

