AUGMENTING ZERO-SHOT DENSE RETRIEVERS WITH PLUG-IN MIXTURE-OF-MEMORIES

Anonymous authors

Abstract

In this paper we improve the zero-shot generalization ability of language models via Mixture-Of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora ("external memories"), with the option to "plug in" new memory at inference time. We develop a joint learning mechanism that trains the augmentation component with latent labels derived from the end retrieval task, paired with hard negatives from the memory mixture. We instantiate the model in a zero-shot dense retrieval setting by augmenting a strong T5-based retriever with MoMA. Our model, MoMA-DR, obtains strong zero-shot retrieval accuracy on the eighteen tasks included in the standard BEIR benchmark. It outperforms other dense retrieval models of similar scale and achieves comparable accuracy with systems that seek generalization from increased scales in encoder models or vector indices. Our analysis illustrates the necessity of augmenting with mixture-of-memory for robust generalization, the benefits of joint learning, and how MoMA-DR utilizes the plug-in memory at inference time without changing its parameters. We plan to open source our code.

1. INTRODUCTION

Scaling up language models, with more parameters, compute, and annotation data, improves model generalization on downstream applications (Raffel et al., 2019; Brown et al., 2020; Smith et al., 2022), but with diminishing returns: linear improvements on downstream metrics often require exponentially more parameters and computing cost (Kaplan et al., 2020; Hoffmann et al., 2022). Hence, scaling pretrained language models in this way is economically unsustainable (Strubell et al., 2020; Bender et al., 2021; Zhang et al., 2022).

Retrieval-augmented language models provide a promising alternative. They allow language models to efficiently access vast resources from an external corpus (Guu et al., 2020; Borgeaud et al., 2022) that serves as a kind of "memory" they can refer to when making predictions, alleviating the need to memorize as much information in their own network parameters (Roberts et al., 2020). This open-book approach helps language models generalize better on token prediction tasks and machine translation (Khandelwal et al., 2019; Borgeaud et al., 2022), as well as on tasks that already involve a first-stage retrieval component, e.g., OpenQA (Borgeaud et al., 2022; Izacard et al., 2022).

In this paper we improve the zero-shot generalization ability of language models using "mixture-of-memory" (MoMA), a new retrieval augmentation mechanism. Instead of a single corpus, MoMA retrieves documents from a mixture of multiple external corpora. This mechanism also allows removing and/or "plugging in" new corpora at inference time, when more information about the target task is revealed, or as an additional way for users to control the model. It is not trivial to guide a retrieval model to leverage multiple corpora; we need to jointly train the augmentation component and the dense retriever using supervised relevance signals and self-mined hard negatives.
We instantiate MoMA with a T5 encoder-decoder model (Ni et al., 2022) and apply it to the dense retrieval task (Karpukhin et al., 2020). Our resulting retrieval system, MoMA-DR, uses a set of augmenting documents from the mixture-of-memories to enhance its representation of the query with important context; the retriever then uses the enhanced query representation to retrieve a final candidate set. At inference time, we plug the target task's corpus into the memory mixture to introduce in-domain context information, without updating any parameters.

We measure MoMA-DR on zero-shot dense retrieval (ZeroDR) (Thakur et al., 2021b), an important real-world application. Our experiments on the eighteen retrieval tasks included in BEIR (Thakur et al., 2021b), the standard ZeroDR benchmark, demonstrate the improved zero-shot ability of MoMA-DR. It outperforms the baseline T5 retriever without the MoMA augmentation component, as well as recent state-of-the-art dense retrieval systems of the same scale, by large margins. It also achieves comparable performance to ZeroDR systems that scale their model parameters, training data, and/or number of vector representations beyond those in this study. Our analysis reveals that large and diverse corpora in the memory lead to the best performance; using only a single corpus during training does not improve performance on unseen target tasks. Joint learning is also important for MoMA-DR to utilize the diverse information from the mixture. Our analysis and case studies illustrate how MoMA-DR leverages the plug-in memory at testing time to enrich its query representations with in-domain information that was not available during training.

2. RELATED WORK

Recent research has explored two common ways to construct the external memory in retrieval-augmented language models. The first is to use a token vocabulary and retrieve similar tokens for language models to copy from when predicting the next token (Khandelwal et al., 2019; Zhong et al., 2022). The second is to use a document corpus, often the pretraining corpus or a task-specific one, and retrieve related documents (text sequences) from it as additional input (Guu et al., 2020; Borgeaud et al., 2022). Document-based memories align well with language systems that already involve a first-stage retrieval component, like knowledge-intensive tasks (Petroni et al., 2020) and OpenQA (Chen et al., 2017). This work falls into the latter category.

Learning to retrieve useful documents to augment the language model is challenging, since human annotations on the usefulness of augmentation documents are costly and seldom available. The most straightforward approach is to use representations from raw pretrained language models to find documents similar to the task input, i.e., unsupervised dense retrieval (Guu et al., 2020; Borgeaud et al., 2022). Adapting dense retrieval models trained for relevance matching is another common choice (Izacard & Grave, 2020b; Lewis et al., 2020; Yu et al., 2021). A more formal solution is to jointly learn the augmentation component end-to-end using supervision from the final task, for example, by treating the augmentation as latent variables and applying EM (Zhao et al., 2021), or by distilling the augmentation component from the feedback of the final model (Izacard & Grave, 2020a). In parallel work, Izacard et al. (2022) found the most effective method to be attention distillation (ADist), which trains the augmentation component using soft labels derived from the end model's attention on augmentation documents.
Recent dense retrieval systems achieve strong empirical performance in supervised settings (Lee et al., 2019; Karpukhin et al., 2020; Xiong et al., 2020). Unfortunately, dense retrieval models trained on a resource-rich source task, e.g., web search, do not perform as well when zero-shot transferred to other domains (Thakur et al., 2021a). This is concerning since many important real-world scenarios do not have the luxury of web corpus training signals and must rely on near zero-shot transfer, especially the medical and enterprise search domains (Kim, 2022). Xin et al. (2021) analyzed the challenge of shifting between training and testing domains, and leveraged domain-invariant learning to mitigate the gap. Another common approach is to first construct domain-specific weak supervision for each task, and then use it to train the dense retriever (Thakur et al., 2021a; Wang et al., 2022). Additionally, continued pretraining of the language model also improves its generalization ability in ZeroDR (Izacard et al., 2021; Gao & Callan, 2022). Many works seek better ZeroDR generalization from other resources, for example, combining with sparse retrieval to introduce exact-match signals (Formal et al., 2021), using multiple vectors per document for term-level matching (Khattab & Zaharia, 2020b), or scaling up the retrieval model using large-scale pretrained language models (Ni et al., 2021; Neelakantan et al., 2022).

3. METHOD

In this section we first describe our Mixture-of-Memory Augmentation. Then we discuss how it is jointly learned with the end system and enables plug-in memory at inference time.

3.1. MIXTURE-OF-MEMORY AUGMENTATION

Before going into the details of MoMA, we first recap some preliminaries in ZeroDR.

Preliminaries. The dense retrieval (DR) task aims to find relevant documents d from a corpus C for a given query q by representing both in a shared embedding space. Specifically, the retrieval score in DR is often calculated as:

f(q, d) = \mathbf{q} \cdot \mathbf{d}; \quad \mathbf{q} = g(q); \; \mathbf{d} = g(d). \quad (1)

It uses the dot product as the scoring function to match the embeddings \mathbf{q} and \mathbf{d}, which is known to support efficient approximate nearest neighbor search (ANN) (Johnson et al., 2019). A pretrained language model is often the encoder of choice for g(). We use the ST5-EncDec variant of Sentence-T5 (Ni et al., 2022):

g(x) = \text{Dec}(\text{Enc}(x)), \quad (2)

which feeds the text sequence (prepended with a special [CLS] token) into the encoder of T5, Enc(), and uses the output representation of the [CLS] token from the decoder, Dec(), as the text representation. This naturally leverages the attention from decoder to encoder at all Transformer layers (Raffel et al., 2019) as a fine-grained information gathering mechanism.

The training of dense retrieval systems often applies a standard ranking loss, pairing the relevant documents d^+ \in D^+ of each query q with hard negatives d^- \in D^-:

L = \sum_{q} \sum_{d^+ \in D^+} \sum_{d^- \in D^-} l(f(q, d^+), f(q, d^-)); \quad D^- \sim \text{ANN}^{f(q, \cdot)}_{C} \setminus D^+. \quad (3)

Eqn. 3 uses ANCE hard negatives, i.e., the top-retrieved documents from C using the retriever itself (Xiong et al., 2020). The loss function l() can be any standard ranking loss such as cross entropy. A ZeroDR model is trained on queries q^s and documents d^s \in C^s from a source task, often web search, and tested on target tasks with q^t and C^t; supervision signals are only available in the source.

Mixture-of-Memory Augmentation. The key idea of (document-based) retrieval-augmented language models is to enrich the representation g(q) with additional contextual input for the model, i.e., augmentation documents d^a retrieved from an external memory M.
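As a concrete (toy) illustration of the dual-encoder setup in Eqns. 1-3, the sketch below scores a query against a small corpus with a dot product over normalized embeddings. The `toy_encode` function is an invented stand-in for the actual Sentence-T5 encoder, and the brute-force scan stands in for a real ANN index:

```python
import numpy as np

# Toy stand-in for the encoder g(): in the paper this is the ST5-EncDec
# Sentence-T5 model; here we use bag-of-character counts so the example
# is self-contained and deterministic.
def toy_encode(texts, dim=8):
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for ch in t.lower():
            vecs[i, ord(ch) % dim] += 1.0
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

corpus = ["anemia treatment", "football scores", "iron deficiency"]
d = toy_encode(corpus)             # document embeddings, built offline
q = toy_encode(["anemia"])[0]      # query embedding q = g(q)

# f(q, d) = q . d (Eqn. 1): dot-product scores against the whole corpus;
# in practice this brute-force scan is replaced by an ANN index.
scores = d @ q
top = np.argsort(-scores)
print([corpus[i] for i in top[:2]])
```

With a real encoder, the training loop of Eqn. 3 would then sample hard negatives from the top of this ranking, excluding labeled positives.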
Instead of using a single document corpus, MoMA uses multiple corpora to provide richer and more diverse external resources for augmentation. For example, M can be composed of the source corpus C^s, a general encyclopedia, a domain-specific knowledge graph, etc. We then retrieve the augmentation documents D^a:

D^a = \text{ANN}^{f^a(q, \cdot)}_{M}; \quad M = \{C_1, ..., C_M\}. \quad (4)

This augmentation component uses another dense retriever f^a() (also a Sentence-T5 model), with parameters distinct from those in g(). Note that instead of retrieving D^a separately from M different ANN memory sources and merging the results, Eqn. 4 combines them into one ANN index. This requires the augmentation component f^a() to be flexible enough to handle the various corpora in the mixture.

Using the encoder-decoder architecture for g() in Eqn. 2 enables a simple extension that incorporates the augmentation documents via the fusion-in-decoder (FiD) mechanism (Izacard & Grave, 2020b):

g^{\text{MoMA}}(q) = \text{Dec}(\text{Enc}(q), \text{Enc}(d^a_1), ..., \text{Enc}(d^a_K)); \quad D^a = \{d^a_1, ..., d^a_K\}. \quad (5)

It feeds the K augmentation documents separately into the T5 encoder of g(). Then it fuses the encoded documents together with Enc(q) using one decoder that attends to all the encoded vectors, as illustrated in Figure 1. The FiD approach in Eqn. 5 is a nice balance of efficiency and capacity when modeling multiple text sequences (Izacard & Grave, 2020b; Izacard & Grave, 2020a; Izacard et al., 2022): it is more efficient than concatenating all text pieces together, while remaining expressive enough to model the nuances of many sequences.

When instantiating MoMA in the dense retrieval setting, we focus on augmenting the query representation \mathbf{q}, as queries are often short, ambiguous, and benefit more from additional contextual information (Lavrenko & Croft, 2017; Yu et al., 2021). This leads to the following definition of MoMA-DR:

f^{\text{MoMA}}(q, d) = \mathbf{q}^a \cdot \mathbf{d}; \quad \mathbf{q}^a = g^{\text{MoMA}}(q); \; \mathbf{d} = g(d), \quad (6)

using the construction of g^{\text{MoMA}}() in Eqn. 5 upon the augmentation documents defined in Eqn. 4. The end retriever is trained with the standard ranking loss:

L^{\text{MoMA}} = \sum_{q^s} \sum_{d^+ \in D^{s+}} \sum_{d^- \in D^{s-}} l(f^{\text{MoMA}}(q^s, d^+), f^{\text{MoMA}}(q^s, d^-)); \quad (7)

D^{s-} \sim \text{ANN}^{f^{\text{MoMA}}(q^s, \cdot)}_{C^s} \setminus D^{s+}. \quad (8)

The training signals come from the source task, including q^s, its relevant documents D^{s+}, and ANCE hard negatives D^{s-} retrieved from the source corpus C^s.

Augmentation Learning. Training f^a() is challenging, as it is hard to label whether an augmentation document is useful. Propagating gradients from the final loss to f^a() is also prohibitive, since the retrieval operation in Eqn. 4 is discrete. Fortunately, recent research found that the attention scores from the FiD decoder to each encoded input (Eqn. 5) are good approximations of the usefulness of augmentation documents (Izacard & Grave, 2020a):

\text{FidAtt}(d^a_i) = \sum_{\text{layers}} \sum_{\text{positions}} \sum_{\text{heads}} \text{Attention}_{\text{Dec} \to \text{Enc}}(d^a_i). \quad (9)

It sums the attentions from g^{\text{MoMA}}()'s decoder at the [CLS] position over all layers, input positions, and attention heads. Ideally, higher FidAtt() is assigned to the d^a_i that provide useful contextual information. Previously, FidAtt scores were often used directly as soft labels for the augmentation model (Izacard & Grave, 2020a; Izacard et al., 2022). Doing so with memory mixtures is risky, as the signal is sparse and overfits to memory sources that appear earlier in training, which are the only ones available for the decoder to attend to. To improve learning robustness, we introduce ANCE-style hard negative mining to train the augmentation component as well. First, we formulate the positive set of augmentation documents as:

D^{a+} = D^{s+} \cup \text{Top-N}_{\text{FidAtt}(d^a_i)}(D^a), \quad (10)

which combines the relevant documents D^{s+} with the augmenting documents that received the N highest attention scores from g^{\text{MoMA}}(). We then pair them with hard negatives to formulate the training of f^a():

L^a = \sum_{q^s} \sum_{d^+ \in D^{a+}} \sum_{d^- \in D^{a-}} l(f^a(q^s, d^+), f^a(q^s, d^-)); \quad (11)

D^{a-} \sim \text{ANN}^{f^a(q^s, \cdot)}_{M} \setminus D^{a+}. \quad (12)
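A minimal sketch of the FidAtt aggregation (Eqn. 9) and positive-set construction (Eqn. 10), using a randomly generated attention tensor in place of the real decoder-to-encoder attentions; all names and shapes here are illustrative, not from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

K, layers, heads, positions = 6, 4, 8, 32
# Hypothetical decoder->encoder attention weights at the decoder's [CLS]
# position: one [layers, heads, positions] slab per augmentation document.
att = rng.random((K, layers, heads, positions))

# FidAtt (Eqn. 9): sum over all layers, heads, and input positions,
# giving one usefulness score per augmentation document.
fid_att = att.sum(axis=(1, 2, 3))

# Positive set (Eqn. 10): labeled relevant documents union the Top-N
# augmentation documents by FidAtt score (the paper uses N=5; we use
# N=2 for this toy example).
N = 2
relevant = {0}                          # ids of labeled D^{s+} documents
top_n = set(np.argsort(-fid_att)[:N].tolist())
positives = relevant | top_n
print(sorted(positives))
```

The resulting `positives` set then plays the role of D^{a+} in Eqn. 11, paired against ANN-mined hard negatives from the memory mixture.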
Notice that the negatives for f^a() have comprehensive coverage from the multiple corpora.

Iterative Training. The learning of f^{MoMA}() and f^a() is an iterative process that fits naturally into the training procedure of dense retrieval with hard negatives. We follow the standard iterations in ANCE and construct the t-th training episode of MoMA-DR as follows:
1. Construct hard negatives D^{s-} via Eqn. 8 using the weights f^{MoMA}_{t-1}() from the last episode;
2. Retrieve augmentation documents D^a via Eqn. 4 using the weights f^a_{t-1}() from the last episode;
3. Train f^{MoMA}_t() as in Eqn. 7;
4. Formulate new positive augmentation documents D^{a+} using the updated attention scores from f^{MoMA}_t(), and mine negative augmentation documents D^{a-} using f^a_{t-1}();
5. Train f^a_t() following Eqn. 11.
Both f^{MoMA}_0() and f^a_0() can be initialized with a BM25 warmed-up T5 retriever. Steps 1 and 3 are inherited from standard dense retrieval training; the rest are introduced by MoMA. The additional training computation mainly resides in updating the index for the memory mixture, a standard cost in retrieval-augmented language models (Guu et al., 2020; Izacard et al., 2022).

Zero-Shot Retrieval with Plug-in Memories. To perform zero-shot retrieval on unseen tasks, MoMA-DR first retrieves augmentation documents for the target query q^t using f^a() over M, and then retrieves target documents d^t ∈ C^t with the augmented model f^{MoMA}(), without changing any model parameters. MoMA allows f^a() to attend over the target corpus as well if it is plugged in: M = M ∪ C^t, which conveys in-domain information. The augmenting corpora can also be engineered manually by users to inject their preferences or domain knowledge, i.e., "memory engineering". In this work we focus on swapping out the source corpus for the target corpus; we leave other explorations for future work.
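The single mixed ANN index of Eqn. 4 and the test-time plug-in step can be sketched as follows. `MixtureMemory` and `toy_embed` are hypothetical stand-ins (a real system would use the trained f^a() and a FAISS-style index), but the key property, that plugging in a corpus only extends the index and touches no model weights, carries over:

```python
import hashlib
import numpy as np

DIM = 16

def toy_embed(text):
    # Deterministic stand-in for the augmentation retriever f^a():
    # hash bytes reinterpreted as a unit vector (no real semantics).
    h = hashlib.sha256(text.encode()).digest()
    v = np.frombuffer(h, dtype=np.uint8)[:DIM].astype(float)
    return v / np.linalg.norm(v)

class MixtureMemory:
    """One flat ANN index over a mixture of corpora (Eqn. 4)."""
    def __init__(self):
        self.docs, self.vecs, self.sources = [], [], []

    def plug_in(self, name, corpus):
        # Adding a corpus requires no retraining: we only extend the index.
        for doc in corpus:
            self.docs.append(doc)
            self.vecs.append(toy_embed(doc))
            self.sources.append(name)

    def retrieve(self, query, k=3):
        scores = np.stack(self.vecs) @ toy_embed(query)
        top = np.argsort(-scores)[:k]
        return [(self.sources[i], self.docs[i]) for i in top]

memory = MixtureMemory()
memory.plug_in("marco", ["passage about hotels", "passage about ratings"])
memory.plug_in("wiki", ["encyclopedia entry on anemia"])
# At test time, swap in the target corpus without touching model weights:
memory.plug_in("target", ["in-domain biomedical abstract"])
print(memory.retrieve("anemia treatment", k=2))
```

Swapping MARCO out for a target corpus, as done at evaluation time in the paper, would amount to rebuilding the index with a different set of `plug_in` calls.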

4. EXPERIMENTAL METHODOLOGIES

Datasets. We choose the MS MARCO passage dataset (Bajaj et al., 2016) as the source domain dataset, whereas the target domains are the 18 datasets in the BEIR benchmark (Thakur et al., 2021a), which include biomedical, scientific, and financial texts. More details can be found in Appendix A.1. The evaluation metric, NDCG@10, is the same as in the BEIR benchmark; it measures the Normalized Discounted Cumulative Gain (Wang et al., 2013) of the top 10 predictions, with higher values indicating better performance.

Augmenting Corpora. During training, the mixture-of-memory is composed of the source training corpus (MARCO), Wikipedia, and a medical knowledge graph. We use the Wikipedia chunks preprocessed by Karpukhin et al. (2020) without further processing. The medical knowledge graph is extracted from the Medical Subject Headings (MeSH), an open-source database for indexing and cataloging biomedical and health-related information. Since it is hierarchical in structure, we linearize it by concatenating spans with text information. During testing, we directly replace MARCO with the corresponding document set from BEIR. Each task from BEIR is augmented independently. More dataset and preprocessing details can be found in Appendix A.1.

Baselines. We compare MoMA-DR with standard sparse and dense retrieval models on BEIR. We also compare MoMA-DR with advanced approaches specifically designed for zero-shot generalization. They involve techniques that are not directly comparable with this paper, including pretraining on extra data, in-domain continued pretraining, and generating target pairs using another pretrained generative model. Besides, some baselines use larger-scale language models as their backbone. We list the details of the baselines in Appendix A.2.

Implementation Details. For MoMA-DR, we use the same architecture as T5-base (Raffel et al., 2019): a 12-layer Transformer with 768 hidden size. Following Xiong et al. (2020), both the augmentation component and the end retriever are first trained using BM25 negatives for 10 epochs. After warming up, we jointly train the two components for three episodes, each including three training epochs. After three joint episodes, the end retriever reaches its best performance on MS MARCO, so we select this checkpoint for evaluation. The ratio between positive and hard negative pairs is 1:7 for both models. The main hyperparameters in MoMA-DR are the total number of augmentation documents K and the attention threshold N in Eqn. 10. We set K=10 and N=5 without any parameter tuning. More details on hyperparameters and experimental settings can be found in Appendix A.3.
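For reference, NDCG@10 can be computed as in the sketch below. This uses the exponential-gain formulation, and `ndcg_at_k` is our own helper rather than the evaluation code used in the paper (BEIR's official numbers come from trec_eval-style tooling):

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with graded relevance (exponential gain, log2 discount)."""
    dcg = sum(
        (2 ** relevance.get(doc_id, 0) - 1) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ranking gets NDCG@10 of 1.0.
print(ndcg_at_k(["d1", "d2", "d3"], {"d1": 2, "d2": 1}))  # -> 1.0
```

Any ranking that demotes the graded-relevant documents scores strictly below 1.0, which is what makes the metric sensitive to the ordering of the top 10 results.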

5. EVALUATION RESULTS

Our experiments evaluate the zero-shot accuracy of MoMA-DR, its performance with different memory sources, the influence of memory mixture learning, and the benefits of plug-in memory.

5.1. ZERO-SHOT RETRIEVAL ACCURACY

The retrieval accuracy of MoMA-DR and the baselines is listed in Table 1. Besides baselines of similar parameter count, we also include larger models (GTR-large) and models using multiple vectors per document (ColBERT). MoMA-DR shows stronger zero-shot accuracy than previous state-of-the-art methods that do continued contrastive pretraining (coCondenser), generate pseudo labels (GenQ), or consume additional training signals in both the continued pretraining and finetuning phases (GTR-base). MoMA-DR also achieves nearly comparable zero-shot accuracy with larger models like GTR-large and with ColBERT, which scales up the number of vectors per document (one per token). This confirms that retrieval augmentation provides another path to improve language models' generalization ability besides scaling up. MoMA-DR also outperforms its direct baseline, T5-ANCE, which MoMA-DR uses as a subroutine for retrieval augmentation, on all but one retrieval task, showing the robustly improved generalization ability from the plug-in mixture-of-memory.

5.2. PERFORMANCE WITH DIFFERENT MEMORIES

Table 2 evaluates how MoMA-DR behaves under different combinations of external memories. Unsurprisingly, using a single out-of-domain memory for retrieval augmentation does not help; for example, even though MARCO is the source-domain corpus, solely grounding on it reduces zero-shot accuracy. MeSH as the sole augmenting corpus also lowers performance, even on some medical retrieval tasks such as BioASQ. Interestingly, when we expand the memory to include MARCO, Wiki, and MeSH, but keep the target corpus excluded (w/o Target), MoMA-DR exhibits better accuracy than the no-memory T5-ANCE. Our conclusion is that more memory sources achieve better generalization, especially when no target-domain information is available. In the Full setting, the 3-memory mixture of MARCO, Wiki, and MeSH is jointly learned with the final task at training time; at test time, MARCO is swapped out for the target corpus. Full improves zero-shot accuracy over both the w/o Target setting (where the target corpus is excluded at test time) and the w/o Learning setting (where the augmentation component is not learned). As expected, plugging in the target corpus at test time is the most valuable source of generalization power. It is also the most realistic setting, as access to the target corpus may only become available at testing time.

5.3. EFFECT OF MEMORY MIXTURE LEARNING

To study the effect of our joint learning mechanism on the memory mixture, we compare it with the recent state-of-the-art Attention Distillation (ADist), first used in Izacard & Grave (2020a) and recently updated in the parallel work of Izacard et al. (2022). It jointly trains the augmentation model using attention scores from the end language model as pseudo-labels. We also enrich ADist with relevance labels from MARCO for more direct supervision, which was shown to be effective in distilling a dense retriever from a stronger cross-encoder ranking model (Hofstätter et al., 2021). The performance of these joint learning methods is listed in Table 3. We pick six BEIR tasks whose domains are closely related to the augmentation corpora: TREC-COVID, BioASQ, and NFCorpus are medical search tasks closely related to MeSH; NQ, HotpotQA, and FEVER are all Wikipedia-based. The results show that ADist, either standalone or enriched with MARCO labels, does not improve final accuracy compared to using a supervised dense retriever as the augmentation component without joint learning. The main difference is that the supervised retriever has been trained effectively with hard negative sampling (Xiong et al., 2020); joint learning with soft labels but without hard negatives degrades augmentation accuracy. Hence, MoMA-DR is a simple technique that learns the end-task signals via attention scores together with hard negatives, improving quality over a supervised retriever alone. To further illustrate the joint training process, we track the attention scores of documents from different memory sources, as well as their ratio in the augmentation set, in Figure 2. We also split MARCO documents by whether they are labeled as Relevant (Rel) for the corresponding query. Firstly, MoMA-DR learns to increasingly attend to, and retrieve, relevant documents from the memory mixture throughout training.
In Figure 2a, more attention is paid to MARCO Relevant documents than to any other type in the memory. Although MARCO Relevant documents make up only a small percentage of the augmenting set in Figure 2c, a query-level analysis confirms that the percentage of queries with at least one relevant document in the augmenting set increases from 46% in Epi-0 to 62% in Epi-2. This apparent discrepancy is explained by the fact that MARCO has only one relevant label per query on average, leaving plenty of room for other types of documents in the augmenting set. Secondly, the amount of attention MoMA-DR pays to a given type of document is positively correlated with its representation in the augmenting set. This confirms that joint learning effectively conveys the feedback signals from the end model to the augmentation component. For instance, in Figure 2a, MoMA-DR pays a high level of attention to MARCO Other documents, a signal reflected in the composition of its augmentation set in Figure 2c. Even though MARCO Other documents were not labeled relevant for the query, they can still prove valuable as augmenting documents: they may contain partial information that helps query understanding (Lavrenko & Croft, 2017), or they were simply not annotated in MARCO's sparse labels (Bajaj et al., 2016). In comparison, this correlation is weak in ADist, as the model includes about 60% of augmenting documents from MeSH, far greater than the fraction of medical queries in MARCO. Figure 4 demonstrates how the augmentation model and the end retriever interact. We measure MRR on MARCO for both components during training. Firstly, the augmentation component improves on the source domain even though it is not directly optimized with relevance labels.
Secondly, the end retriever monotonically benefits from the information collected by the augmentation component in each iteration, indicating that the two components mutually enhance each other in the joint learning process, and that the high percentage of MARCO Other documents still ultimately benefits the end retriever.

Examples of queries and their augmenting documents (from Table 4):
- Training. [MARCO] Why is Hotel Transylvania 2 rated PG? Augmented by: [MARCO] It is rated PG for some scary images, action and rude humor. [Wiki] Another review aggregate calculated an average score of 47 out of 100, indicating "mixed or average reviews".
- Zero-Shot Testing. [HotpotQA] Were Scott Derrickson and Ed Wood of the same nationality? Augmented by: [Wiki] Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. [HotpotQA] Edward Davis Wood Jr. (October 10, 1924 to December 10, 1978) was an American filmmaker, actor, writer, producer, and director.
- Zero-Shot Testing. [BioASQ] Is AND-1/Ctf4 essential for proliferation? Augmented by: [BioASQ] AND-1/Ctf4 bridges the CMG helicase and DNA polymerase alpha, facilitating replication. [Wiki] FADD has no effect on the proliferation of B cells induced by stimulation of the B cell receptor.

5.4. GENERALIZATION OF PLUG-IN MEMORY

In the previous section, we observed how MoMA-DR learns to attend to, and retrieve, informative documents from the memories on which it was trained. In this section, we examine MoMA-DR's zero-shot behavior on new corpora plugged in at test time (keeping Wiki and MeSH as before). Figure 3 compares documents from the plugged-in target corpus versus the remaining memory mixture in terms of membership in the augmenting set (Doc Ratio) and attention. Again, on all tasks, MoMA-DR heavily attends to, and successfully retrieves, in-domain documents, even though those documents were only just plugged in. This confirms that the augmentation model achieves the zero-shot ability to capture relevant information from unseen corpora. In the medical domain, the model pays more attention to MeSH documents, especially on the TREC-COVID task, since MeSH includes high-quality, up-to-date information related to COVID-19. Wikipedia documents receive more attention on Wiki-centric tasks like FEVER, as expected. Some tasks may need a small amount of precise information from Wikipedia to answer detailed questions, e.g., HotpotQA. Similar to the training process, there is a non-trivial correspondence between the attention score of a memory and its membership in the augmentation set.

5.5. CASE STUDIES

In Table 4 we show examples of how the augmenting documents chosen by MoMA-DR provide valuable contextual information for the query. The first example is a training query from MARCO, where the augmenting documents help disambiguate the query word "rating". In the second, documents from the official Wiki dump and HotpotQA's Wiki corpus describe the two entities in HotpotQA's comparison question, illustrating how MoMA-DR provides more comprehensive augmentation by incorporating information from different sources. The last query shows the benefit of the in-domain plug-in corpus, as it brings in very specific information about the query entity (AND-1/Ctf4) that is hard to find elsewhere.

6. CONCLUSION

In this paper we propose a new mixture-of-memory mechanism that allows language models to leverage information from multiple disparate corpora (memories) simultaneously. This mechanism can also incorporate new corpora "plugged in" at test time, which improves dense retrieval models' generalization to unseen corpora in a zero-shot manner. MoMA-DR achieves strong zero-shot accuracy on the eighteen retrieval tasks included in the BEIR benchmark, showing that retrieval augmentation with a plug-in mixture-of-memories is another way to improve the zero-shot ability of language models. Our analysis demonstrates that the most valuable memory mixtures combine multiple sources with in-domain information, and that our joint learning mechanism can utilize such diverse information. We hope our findings inspire future research in retrieval-augmented language models toward generalization ability with better efficiency.

A APPENDIX

A.1 EVALUATION DATASETS AND AUGMENTING CORPORA

The target domain datasets are collected from the BEIR benchmark (Thakur et al., 2021a) and cover the following domains:

• Open-domain Question Answering (QA): HotpotQA (Yang et al., 2018), NQ (Kwiatkowski et al., 2019), and FiQA (Maia et al., 2018).
• Bio-Medical Information Retrieval: TREC-COVID (Voorhees et al., 2021), NFCorpus (Boteva et al., 2016), and BioASQ (Tsatsaronis et al., 2015).
• Argument Retrieval: Webis-Touché2020 (Bondarenko et al., 2020) and ArguAna (Wachsmuth et al., 2018).
• News Retrieval: TREC-NEWS (Soboroff et al., 2018) and Robust04 (Voorhees et al., 2004).
• Tweet Retrieval: Signal-1m (Suarez et al., 2018).
• Duplicate Question Retrieval: Quora (Thakur et al., 2021a) and CQADupStack (Hoogeveen et al., 2015).
• Entity Retrieval: DBPedia (Hasibi et al., 2017).
• Citation Prediction: SCIDOCS (Cohan et al., 2020).
• Fact Checking: SciFact (Wadden et al., 2020), FEVER (Thorne et al., 2018), and Climate-FEVER (Diggelmann et al., 2020).

We list the statistics of the BEIR benchmark in Table 5.

Augmenting Corpora. We first introduce more details on how we preprocessed the Medical Subject Headings (MeSH) database. We select text information from the Qualifier Record Set and Descriptor Record Set. Each set contains multiple <Concept> elements, each composed of three sub-elements: <ConceptName>, <ScopeNote>, and <TermList>. Among the sub-elements, <ScopeNote> is the major source of textual information; it is usually a short description of a medical term or phenomenon. We treat each <ScopeNote> as a document entry and concatenate it with the corresponding <ConceptName>. We list the statistics of the augmenting corpora in Table 6.
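The <ScopeNote>/<ConceptName> concatenation described above can be sketched with Python's xml.etree; note that the XML snippet below is a simplified, hypothetical layout built only from the element names mentioned here, not the exact MeSH schema:

```python
import xml.etree.ElementTree as ET

# Simplified MeSH-like record; the real MeSH export wraps Concepts in
# DescriptorRecord/QualifierRecord structures with more metadata.
mesh_xml = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <Concept>
      <ConceptName><String>Anemia</String></ConceptName>
      <ScopeNote>A reduction in the number of circulating erythrocytes.</ScopeNote>
    </Concept>
  </DescriptorRecord>
</DescriptorRecordSet>
"""

def linearize(xml_text):
    # One document entry per Concept: ConceptName concatenated with ScopeNote.
    docs = []
    root = ET.fromstring(xml_text)
    for concept in root.iter("Concept"):
        name = concept.findtext("ConceptName/String", default="").strip()
        note = concept.findtext("ScopeNote", default="").strip()
        if note:
            docs.append(f"{name}: {note}")
    return docs

print(linearize(mesh_xml))
# -> ['Anemia: A reduction in the number of circulating erythrocytes.']
```

Each linearized string then becomes one document entry in the MeSH memory corpus.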



Footnotes:
1. https://huggingface.co/datasets/wiki_dpr
2. https://www.ncbi.nlm.nih.gov/mesh/
3. https://github.com/beir-cellar/beir
4. We separate these baselines from dense retrieval since they usually rely on Seq2seq models to generate pseudo query-document pairs, and they train a model for each dataset independently instead of using a single model for all datasets.
5. Unfortunately, this corpus has not been released by the authors.



Figure 1: Illustration of the Mixture-of-Memory Augmentation.

Figure 2: Grounding component breakdown for different distillation methods in each learning iteration. We display the normalized document and attention score ratios of documents from different augmentation sources.

Figure 4: MRR on MS-MARCO of the augmentation component and end retriever during training.

Table 1: NDCG@10 on the BEIR benchmark. The best result for each task is marked bold; the second best is underlined. An * denotes an unfair comparison, as NQ is used in training for GTR. †: GenQ generates pseudo labels to train an independent model for each task. ‡: larger models.

NDCG@10 of MoMA-DR under different memory compositions: no memory, single memory, and a mixture of memories. w/o Learning uses the end retriever to select augmenting documents, without a trained augmentation component. w/o Target excludes the target corpus from the memory.

Zero-shot performance of different distillation methods. We observe consistent trends on all BEIR datasets; here we present results on 6 representative datasets from the Wikipedia and medical domains.

MoMA-DR retrieves augmenting documents during training (MS MARCO) and testing (BEIR).

Statistics of datasets in the BEIR benchmark. The table is taken from the original BEIR benchmark paper (Thakur et al., 2021a).

Evaluation Datasets The target-domain datasets used in our experiments are collected from the BEIR benchmark.

REPRODUCIBILITY STATEMENT

To enhance reproducibility, we provide an overall introduction to our experimental setting in Section 4. Beyond that, we present more details in the Appendix: Appendix A.1 includes statistics on the evaluation datasets and augmenting corpora; Appendix A.2 introduces and describes the implementation of all baselines in the paper; Appendix A.3 lists the configuration of our experimental setting and the complete choice of hyperparameters. We plan to submit our code after the discussion forums are opened; we will post a comment containing a link to our anonymous repository directly to the reviewers and ACs so that our code is only internally visible. If this work is accepted, we will release all code, augmentation data, and model checkpoints, along with analysis scripts.

A.2 BASELINES

We use the baselines from the current BEIR leaderboard (Thakur et al., 2021a) and recent papers. These baselines can be divided into four groups: dense retrieval, dense retrieval with generated queries, lexical retrieval, and late interaction.

Dense Retrieval For dense retrieval, the baselines are dual-tower models like ours. We consider DPR (Karpukhin et al., 2020), ANCE (Xiong et al., 2020), T5-ANCE, coCondenser (Gao & Callan, 2022), and the recently proposed GTR (Ni et al., 2021) in different size configurations. GTRbase leverages the same T5-base model as MoMA-DR, while GTRlarge is based on T5-large and is not directly comparable to our method, as it triples the parameter count.

Dense Retrieval with Generated Queries GenQ first fine-tunes a T5-base (Raffel et al., 2019) model on MS MARCO for 2 epochs, then generates 5 queries for each passage as additional training data for the target domain, which is used to continue fine-tuning the TAS-B (Hofstätter et al., 2021) model.

Lexical Retrieval Lexical retrieval scores token matching between two high-dimensional sparse vectors of token weights. BM25 (Robertson et al., 2009) is the most commonly used lexical retrieval function. We use the BM25 results reported in Thakur et al. (2021a) for comparison.
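To make the token-matching score concrete, here is a minimal BM25 sketch. The parameters `k1` and `b` are commonly used defaults, not values taken from any of the compared systems.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, doc_freq, num_docs, avg_doc_len,
               k1=0.9, b=0.4):
    """Score one tokenized document against a tokenized query.

    doc_freq maps each term to the number of corpus documents containing it;
    avg_doc_len is the mean document length over the corpus.
    """
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        if term not in tf:
            continue  # non-matching terms contribute nothing
        idf = math.log((num_docs - doc_freq[term] + 0.5)
                       / (doc_freq[term] + 0.5) + 1)
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_doc_len))
        score += idf * norm
    return score
```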

Late Interaction

We also consider a late interaction baseline, namely ColBERT (Khattab & Zaharia, 2020a). The model computes multiple contextualized embeddings, one per token of the query and document, and then uses a maximum similarity function to retrieve relevant documents. This type of matching requires significantly more disk space for its indexes and has higher latency.
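The maximum-similarity (MaxSim) idea can be sketched with toy embeddings: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. This NumPy sketch omits what real ColBERT does around it (contextualized BERT embeddings and an approximate-nearest-neighbor index).

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """Late-interaction score.

    query_emb: (num_query_tokens, dim) array of query token embeddings.
    doc_emb:   (num_doc_tokens, dim) array of document token embeddings.
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best document token per query token
```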

A.3 DETAILED EXPERIMENTAL SETTINGS AND HYPERPARAMETERS

Our implementation uses PyTorch (Paszke et al., 2019) with Hugging Face Transformers (Wolf et al., 2020). We optimize the model using AdamW (Loshchilov & Hutter, 2019) with a peak learning rate of 5e-6, a weight decay of 0.01, and linear learning rate decay. The global batch size is set to 256. The maximum lengths of query and passage are set to 32 and 128 tokens, respectively. We summarize all hyperparameter settings in Table 7. The model is trained on 8 Nvidia A100 80GB GPUs with FP16 mixed-precision training. The total running time is 6.6 hours for three episodes of augmentation component training and 6.3 hours for end retriever training; we detail the training time of each episode in Table 8.

When evaluating on the BEIR benchmark, we follow the setting in GTR (Ni et al., 2021), which uses sequences of 64 tokens for questions and 512 for documents in all datasets except TREC-NEWS, Robust04, and ArguAna. In particular, we set the document length to 768 for TREC-NEWS and Robust04, and both the question and document lengths to 128 for ArguAna. These length settings are in accordance with the average query and document lengths in these datasets.
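The evaluation-time length settings above can be encoded as a small lookup. The dataset-name strings below are illustrative identifiers, not necessarily the exact keys used in our code.

```python
def eval_lengths(dataset):
    """Return (max_query_tokens, max_doc_tokens) for a BEIR dataset,
    following the per-dataset settings described above."""
    if dataset in ("trec-news", "robust04"):
        return 64, 768   # longer documents for the two news datasets
    if dataset == "arguana":
        return 128, 128  # short, symmetric lengths for ArguAna
    return 64, 512       # GTR default for all other datasets
```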

