CONTEXTUALIZED GENERATIVE RETRIEVAL

Abstract

The text retrieval task is mainly performed in two ways: the bi-encoder approach and the generative approach. The bi-encoder approach maps the document and query embeddings to a common vector space and performs a nearest neighbor search. It stably shows high performance and efficiency across different domains but has an embedding space bottleneck, as the query and the document interact only in L2 or inner product space. The generative retrieval model retrieves by generating a target sequence and overcomes the embedding space bottleneck by interacting in the parametric space. However, it fails to retrieve information it has not seen during training, as it depends solely on the information encoded in its own model parameters. To leverage the advantages of both approaches, we propose the Contextualized Generative Retrieval model, which uses contextualized embeddings (output embeddings of a language model encoder) as vocab embeddings at the decoding step of generative retrieval. The model uses information encoded in both the non-parametric space of contextualized token embeddings and the parametric space of the generative retrieval model. Our approach of generative retrieval with contextualized vocab embeddings outperforms generative retrieval with only vanilla vocab embeddings in the document retrieval task, with an average of 6% higher R-precision on KILT (NQ, TQA) and 18% (25%) higher Hits@1 (@10) on NQ-320k, suggesting the benefits of using contextualized embeddings in generative retrieval models.

1. INTRODUCTION

Text retrieval is often formulated as finding the most relevant items from a large corpus given an input query. The bi-encoder approach, which uses an encoder to map the documents and the query to a common vector space and performs a nearest neighbor search, has been a common practice in text retrieval tasks (Karpukhin et al., 2020; Wu et al., 2020; Ni et al., 2021). Despite its high performance and popularity, it has an embedding space bottleneck (Luan et al., 2021; Lee et al., 2022; Cao et al., 2021). The performance decreases as the document length increases due to the limited expressiveness of fixed-size document embeddings. It also misses the fine-grained interaction between the query and the document, as they interact only in L2 or inner product space. Finally, the bi-encoder approach requires large storage space to save all document embeddings.

A recently proposed alternative to the bi-encoder approach is the generative retrieval model (Cao et al., 2021; Tay et al., 2022; Bevilacqua et al., 2022; Lee et al., 2022), which retrieves the most relevant sequence by generating the item token-by-token, where the item is the identifier of the target sequence or the sequence itself (e.g., title, passage, document ID). These models show high performance with a low storage footprint by overcoming the embedding space bottleneck: they interact in the parametric space of the language model rather than just in the inner product space. However, as existing generative retrieval models rely solely on the information encoded in their own parameters, they cannot retrieve the correct target sequence if they have not seen such information during the training process.
To this end, we propose the contextualized generative retrieval model (CGR), a retrieval model that overcomes the aforementioned limitations of existing generative retrieval models by leveraging contextualized vocab embeddings (output embeddings of a language model encoder) to make use of non-parametric information from the context surrounding the vocab tokens. It uses not only the parametric space of the model, as in generative retrieval models, but also the non-parametric space of contextualized target embeddings (external memory), as in bi-encoder models. As shown in Figure 1, the model has two submodules: (1) an EMBedding model (EMB), which is an encoder model that outputs contextualized embeddings, and (2) a RETrieval model (RET), which is an encoder-decoder model that retrieves a target sequence when given an input query. The model first constructs the contextualized embedding matrix with the output embeddings of EMB and uses the matrix as the decoder vocab embeddings when training RET. By utilizing the contextualized embedding matrix rather than the vanilla embedding matrix while generating a target sequence, RET uses both the information encoded in its own parameters, as existing generative retrieval models do, and the information encoded in the contextualized embeddings. Also, as RET uses the contextualized embeddings during both training and inference, RET is optimized to utilize the information encoded in the contextualized embeddings.

We show the importance of using the external memory (non-parametric space) of contextualized target embeddings in generative retrieval models by comparing the performance of CGR and GENRE (Cao et al., 2021), a generative retrieval model which only operates on the parametric space. CGR shows an average improvement of 6% on Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017) in KILT (Petroni et al., 2021) and 18% (25%) higher Hits@1 (@10) on NQ-320k.
We also compare the results with different baselines for a comprehensive understanding of the model performance. The main contributions of our paper are as follows:
• We present Contextualized Generative Retrieval (CGR), a generative retrieval model which uses the contextualized embedding matrix while generating a target sequence. It shows an average of 6% and 18% (25%) higher performance in KILT (NQ, TQA) R-precision and NQ-320k Hits@1 (@10), respectively, compared to GENRE in the same setting.
• We show that using contrastive learning as intermediate training further increases the performance of the contextualized generative retrieval model by a large margin.
• We perform extensive ablation studies and analysis over several variants of contextualized generative retrieval models for a comprehensive understanding of how to use contextualized embeddings and why using contextualized embeddings is better than using vanilla vocab embeddings.

2. RELATED WORK

Generative Retrieval. Existing generative retrieval models retrieve relevant items by generating either the identifiers or the entire sequences of the items. Cao et al. (2021) propose GENRE (Generative ENtity REtrieval), which retrieves a document by generating its title with constrained beam search. Tay et al. (2022) propose DSI (Differentiable Search Index), which assigns a unique ID to each item in the corpus and trains the model to encode all information of the document and the ID in the model parameters. During the inference step, DSI generates the ID of the most relevant document. Wang et al. (2022) propose NCI (Neural Corpus Indexer), which also retrieves by generating the document ID as in DSI, but improves performance with query generation and a prefix-aware weight-adaptive decoder. Bevilacqua et al. (2022) propose SEAL (Search Engines with Autoregressive LMs), which can retrieve any span from any position in the corpus by using a compressed full-text substring index (FM-Index). In this work, we propose Contextualized Generative Retrieval, which generates the target sequence by utilizing the contextualized embedding matrix rather than the vanilla vocab embedding matrix used in the aforementioned generative retrieval models. Therefore, the model utilizes both the parametric space of generative retrieval and the non-parametric space of contextualized token embeddings. To the best of our knowledge, we are the first to utilize contextualized token embeddings in generative retrieval models.

[Figure 1 labels. Contextualized embedding space: Cape(1), Cape(2), Town(1), South(1), Africa(1), of(1), Climate(1). Vanilla embedding space: Cape, Town, Climate, of, Africa, South.]

Figure 1: The left side shows the model architecture of CGR Base, and the right side shows the difference between the contextualized embedding space and the vanilla embedding space, which are constructed with the contextualized embedding matrix and the vanilla embedding matrix, respectively. The contextualized embedding matrix is constructed from the output embeddings of EMB and is used as the decoder vocab embeddings in RET; whereas existing generative retrieval utilizes the vanilla embedding matrix when generating the target sequence, RET utilizes the contextualized embedding matrix. In the document retrieval task with a title as the target sequence, the title of the document and its corresponding content from the document are given as input to EMB, and we only save the output embeddings of the title.

Retrieval Models with Contextualized Token Embedding. ME-BERT (Luan et al., 2021), ColBERT (Khattab & Zaharia, 2020) and COIL (Gao et al., 2021a) are retrieval models which retrieve the target sequence by utilizing multiple contextualized token embeddings. They have shown high performance by leveraging the benefits of the cross-encoder architecture in a bi-encoder architecture. CGR also utilizes contextualized token embeddings, but differs from the three models in that while they interact in the inner product space, CGR has the benefit of interacting in the parametric space.

Semi-Parametric Models. KNN-LM (Khandelwal et al., 2020), RAG (Lewis et al., 2020) and RETRO (Borgeaud et al., 2022) are semi-parametric models which use both the parametric space of the model and the non-parametric space. KNN-LM improves the LM performance by generating the next token during the inference step by interpolating between the nearest neighbor distribution (distance in the contextualized embedding space) and the model vocab distribution.
RAG and RETRO are semi-parametric models that first retrieve relevant texts with the retriever in a non-parametric manner and generate the output based on the retrieved texts. CGR also utilizes both the parametric and non-parametric space as the three models do. However, it differs from KNN-LM in that it is trainable and from RAG and RETRO in that CGR uses the non-parametric space for the decoder vocab embeddings.

3. CONTEXTUALIZED GENERATIVE RETRIEVAL

Generative retrieval is the task of retrieving the most relevant retrieval target (e.g., title, passage, document identifier) by generating the retrieval target token-by-token when given an input query. The training objective of the generative retrieval model is to maximize $P((t_1, \dots, t_n) \mid q) \propto \prod_{i=1}^{n} P(t_i \mid q, t_{<i})$, where each $t_i$ denotes a token of the retrieval target. Such an approach has shown high performance while using a low storage footprint (Cao et al., 2021; Tay et al., 2022; Bevilacqua et al., 2022; Lee et al., 2022). However, it has the limitation that the model depends solely on the information encoded in its own parameters. Thus, the model fails to retrieve the correct target sequence if it has not seen the information during the training process. To overcome this limitation, we propose the Contextualized Generative Retrieval model (CGR), a generative retrieval model which uses not only the parametric space of the model but also the non-parametric space (external memory) of contextualized token embeddings to leverage the benefits of the bi-encoder model and combine the advantages of the two models. CGR (Figure 1) utilizes the contextualized embeddings (output embeddings of the language model encoder) rather than the vanilla vocab embeddings while generating the retrieval target. Therefore, the model does not depend only on the information encoded in its own parameters, but can also take advantage of non-parametric information during generation, as CGR encodes the document content into contextualized vocab embeddings. Also, by allowing a single token to have multiple token embeddings, the model can learn about the different meanings of the token and the different contexts in which it is used.
Therefore, the embedding space constructed with multiple contextualized token embeddings (contextualized embedding space) becomes more expressive and fine-grained than the embedding space constructed with model vocab embeddings (vanilla embedding space). CGR differs from existing generative retrieval models in which token embedding matrix is used for the decoder model. Generative retrieval models such as GENRE (Cao et al., 2021) utilize the pre-trained language model architecture as-is: both the encoder and the decoder model share the same vanilla vocab embedding matrix of shape V × D, where V is the vocab size and D is the vocab embedding dimension. CGR, in contrast, uses different vocab embedding matrices for the encoder and decoder model: the encoder model uses the vanilla vocab embedding matrix (V × D) as in existing generative retrieval models, but the decoder model uses the contextualized embedding matrix of shape C × D, where C is the number of contextualized embeddings. C is larger than V, as a token is matched with multiple contextualized embeddings in most cases (e.g., in Figure 1, the same token "Cape" has two different contextualized embeddings, which we name Cape(1) and Cape(2) to differentiate the two), but C can be reduced with practical tactics (Section 4.4). CGR is composed of two submodules: (1) an EMBedding model (EMB), which is an encoder model that outputs meaningful contextualized embeddings, and (2) a RETrieval model (RET), which is an encoder-decoder model that retrieves a target sequence by generating the sequence while utilizing the information encoded in the contextualized embedding matrix. For example, when the retrieval target sequence is the title of a document, we pass the concatenation of the title and its corresponding document content as the input to EMB and save the pairs of input tokens and their output embeddings.
By passing not only the title but also its corresponding document content to the model, the output embeddings contain the information of both the title and the document content. In practice, for efficiency, we only sample a few embeddings that are deemed to be the most informative and representative of the target sequence, or simply the first few embeddings. The extracted contextualized embeddings are then used to form the contextualized embedding matrix, which serves as the decoder vocab embeddings of RET. As the vocab embedding matrices of the encoder and the decoder have different shapes, we assign different token IDs for the decoder; a unique token is paired with multiple token IDs, where each ID indicates a different contextualized embedding in the decoder.
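The decoder-side bookkeeping described above can be illustrated with a small sketch. It is not the authors' implementation: the function names `build_decoder_vocab` and `decoder_scores`, the toy two-dimensional embeddings, and the assumption that the decoder scores candidates by inner product with its vocab embeddings are all illustrative.

```python
from collections import defaultdict


def build_decoder_vocab(token_embedding_pairs):
    """Assign one decoder ID per contextualized embedding (hypothetical helper).

    token_embedding_pairs: list of (token_text, embedding) pairs, e.g. title
    tokens paired with their EMB output vectors.
    Returns the C x D matrix plus the ID <-> token text mappings.
    """
    matrix = []                       # C x D contextualized embedding matrix
    id_to_token = []                  # decoder ID -> token text
    token_to_ids = defaultdict(list)  # token text -> all of its decoder IDs
    for token, embedding in token_embedding_pairs:
        decoder_id = len(matrix)
        matrix.append(list(embedding))
        id_to_token.append(token)
        token_to_ids[token].append(decoder_id)
    return matrix, id_to_token, dict(token_to_ids)


def decoder_scores(hidden_state, matrix):
    """Inner product of one decoder hidden state with every row (C scores).

    Assumption: output scores come from an inner product with the decoder
    vocab embeddings; the paper only states that the matrix is used as the
    decoder vocab embeddings.
    """
    return [sum(h * e for h, e in zip(hidden_state, row)) for row in matrix]


# Toy data: the token "Cape" appears in two contexts, so it gets two IDs.
pairs = [
    ("Cape", [1.0, 0.0]),  # plays the role of Cape(1)
    ("Town", [0.0, 1.0]),
    ("Cape", [0.5, 0.5]),  # plays the role of Cape(2)
]
matrix, id_to_token, token_to_ids = build_decoder_vocab(pairs)
scores = decoder_scores([2.0, 0.0], matrix)
best_id = max(range(len(scores)), key=scores.__getitem__)
```

Here C = 3 although only two distinct token texts exist, which mirrors why C is typically larger than V.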

4. MODEL DETAILS

In Section 3, we show the overall architecture of CGR that can be applied to general tasks. In this section, we present details of how we design CGR for practical usage in document retrieval tasks with the document title as the retrieval target. The ideal design of CGR is to use the encoder of RET as EMB for every gradient update during the training step, to ensure high coherency between the contextualized embeddings and RET. However, such a method incurs a high computational cost, as the contextualized embedding matrix must be reconstructed at every step. Therefore, we present practical models: the base architecture of contextualized generative retrieval (CGR Base), and two improvements, CGR Async and CGR Contra. We also show how we reduce the number of contextualized embeddings. We add the figure of each model and more details in Appendix A.

4.1. CGR BASE

Base CGR (CGR Base ) in Figure 1 

4.3. CGR CONTRA

Bi-encoder retrieval models with contrastive loss have shown high performance, as the model learns and constructs a well-structured global embedding space and regularizes the space to be uniform (Ni et al., 2021; Gao et al., 2021b; Gao & Callan, 2022; Izacard et al., 2022). CGR with Contrastive Learning (CGR Contra) is designed to leverage such benefits of contrastive learning in a contextualized generative retrieval model; the model is first trained with contrastive loss and then on the generative retrieval objective, i.e., retrieving the most relevant sequence by generating the sequence token-by-token.

Step 1. Token-level Contrastive Learning. We first train RET with token-level contrastive loss to allow the model to learn the overall search space of token embeddings in the target corpus. Given a training dataset of pairs $\{(q, t)\}$ where $q$ is the query text and $t$ is the retrieval target (title of the document to retrieve) composed of multiple tokens $t_i$ ($1 \le i \le k$, where $k$ is the length of the target), we split the dataset into $k$ separate pairs $\{(q, t_i)\}$, where $t_i$ is a token of $t$, to construct the query-token training dataset. With the query-token dataset, we train the first output token embedding from the decoder of RET to be close to all token embeddings in $T^+$ when given query $q$ as an input to RET. $T^+ = \{t^+_1, \dots, t^+_k\}$ ($k = |T^+|$) is the set of positive token embeddings (tokens that make up one title), and $T^- = \{t^-_1, \dots, t^-_{|T^-|}\}$ is the set of negative token embeddings, which are all other token embeddings in the contextualized embedding matrix. The objective is to minimize the contrastive loss to make the query text embedding $q$ closer to all positive token embeddings in $T^+$:

$$L(q, t^+_1, \dots, t^+_{|T^+|}, t^-_1, \dots, t^-_{|T^-|}) = -\log \frac{\sum_{t^+ \in T^+} e^{\langle q, t^+ \rangle}}{\sum_{t^+ \in T^+} e^{\langle q, t^+ \rangle} + \sum_{t^- \in T^-} e^{\langle q, t^- \rangle}} \tag{2}$$

where $\langle \cdot, \cdot \rangle$ is the inner product between the two embeddings.
As we have the whole set of contextualized embeddings of the corpus (contextualized embedding matrix), it is possible to consider all other embeddings as negatives. See Section C.2 for more details and analysis of different contrastive losses and model architectures.

Step 2. Generative Retrieval. We further train RET from Step 1 with a generative retrieval objective, which is the same as in CGR Base. To be specific, we use the encoder of RET from Step 1 as EMB of Step 2 to extract the contextualized embeddings, and use RET from Step 1 as the initial parameters of RET in Step 2.
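As a minimal sketch of the token-level contrastive objective of Step 1 (Equation 2), the loss for a single query can be computed as below. The function names and the toy two-dimensional vectors are illustrative assumptions; a real implementation would operate on batched tensors inside the training framework.

```python
import math


def inner(u, v):
    """Inner product <u, v> between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def multi_positive_contrastive_loss(q, positives, negatives):
    """Equation 2: -log( sum_pos / (sum_pos + sum_neg) ) for one query.

    q: query embedding (first decoder output embedding of RET)
    positives: embeddings in T+ (all tokens of the target title)
    negatives: embeddings in T- (all other rows of the embedding matrix)
    """
    pos = [inner(q, t) for t in positives]
    neg = [inner(q, t) for t in negatives]
    # -log(A / (A + B)) = log(A + B) - log(A)
    return logsumexp(pos + neg) - logsumexp(pos)


# Toy example (hypothetical values): two positive tokens, one negative.
q = [1.0, 0.0]
positives = [[1.0, 0.0], [1.0, 0.0]]
negatives = [[0.0, 1.0]]
loss = multi_positive_contrastive_loss(q, positives, negatives)
```

Working in log space avoids overflow when inner products are large, which matters once the negative set is the entire contextualized embedding matrix.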

4.4. CLUSTERING

To construct the contextualized embedding matrix to be used as the vocab embedding matrix of the decoder of RET, we first extract all contextualized embeddings of each target token with EMB. As saving all the embeddings requires a large storage footprint, we reduce the number of embeddings by clustering and saving only the representative embeddings of each cluster. To be specific, we perform k-means clustering over the contextualized embeddings of the same token (which might have different surrounding contexts) and keep only the k centroid embeddings as the decoder vocab embeddings of the token. We keep k = 5 for all experiments. When k = 5, it requires only 0.3% of the storage footprint needed to save all contextualized token embeddings. Also, it requires 0.34GB more storage than the vanilla vocab embeddings (k = 1), which is marginal compared to the storage footprint of the model parameters (3GB). See Appendix C.3 for examples, and for how k affects the performance and the storage footprint.
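The per-token clustering step can be sketched with a plain Lloyd's k-means in pure Python. This is an illustration only: the initialization (random sampling with a fixed seed), the fixed iteration count, and the names `kmeans` and `compress_embeddings` are assumptions, and the toy example uses k = 2 rather than the paper's k = 5.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Return up to k centroids for one token's contextualized embeddings."""
    if len(vectors) <= k:
        # Fewer embeddings than clusters: keep every embedding as-is.
        return [list(v) for v in vectors]
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda c: squared_distance(v, centroids[c]))
            clusters[nearest].append(v)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [mean(members) if members else centroids[i]
                     for i, members in enumerate(clusters)]
    return centroids


def compress_embeddings(embeddings_by_token, k=5):
    """Cluster each token's embeddings separately and keep the centroids."""
    return {token: kmeans(vectors, k)
            for token, vectors in embeddings_by_token.items()}


# Toy data (hypothetical): "Cape" occurs in two clearly distinct contexts.
embeddings_by_token = {
    "Cape": [[0.0, 0.0], [0.0, 0.2], [10.0, 10.0], [10.0, 10.2]],
    "Town": [[1.0, 1.0]],
}
compressed = compress_embeddings(embeddings_by_token, k=2)
```

Clustering is done per token text, so each token contributes at most k rows to the decoder vocab embedding matrix regardless of how often it occurs in the corpus.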

5. EXPERIMENTS

In Section 5.1, we describe the baselines, datasets, and basic setup used in our experiment. In Section 5.2, we show both qualitative and quantitative effectiveness of using contextualized embeddings by comparing CGR with baseline models. We also compare the performance and characteristics among the variants of CGR.

5.1. SETUP

We compare CGR with six baselines widely used in the document retrieval task: BM25, GENRE, DSI, NCI, SEAL, and DPR. BM25 (Robertson & Zaragoza, 2009) is a term-matching model relying on an efficient algorithm. DPR (Karpukhin et al., 2020) is a bi-encoder retrieval model which retrieves the most relevant document by mapping a query and documents to a common vector space and performing a nearest neighbor search. See Section 2 for descriptions of GENRE, DSI, NCI, and SEAL. We compare all models using a document retrieval task, where the input is a query and the output is a sequence related to relevant Wikipedia documents (e.g., title, document ID). See Appendix B for more details.

KILT (NQ, TQA)

We train CGR with two datasets, Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (TQA) (Joshi et al., 2017), from the KILT dataset (Petroni et al., 2021), a benchmark for knowledge-intensive language tasks with eleven different datasets spanning five different tasks (fact checking, question answering, entity linking, dialogue, and slot filling). It gathers data in different formats into a common format, and the corresponding datasets share the same snapshot of Wikipedia as the corpus. DPR and GENRE are trained with all nine datasets in KILT. Contextualized embeddings of the target retrieval sequence (title of the page) are the output embeddings from EMB when the title and its corresponding document content are given as the input. We evaluate all results with R-precision, a metric widely used to evaluate retrieval performance in KILT. It is calculated as $r / R$, where $R$ is the number of Wikipedia documents in each provenance set, and $r$ is the number of related documents among the top-$R$ retrieved documents.
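For concreteness, R-precision as defined above can be computed for a single query as in the following sketch. The function name and the toy document IDs are illustrative; this is not the official KILT evaluation script.

```python
def r_precision(ranked_ids, provenance):
    """R-precision = r / R for one query.

    ranked_ids: retrieved document IDs, best first.
    provenance: IDs of the gold Wikipedia documents for the query.
    """
    relevant = set(provenance)
    R = len(relevant)
    if R == 0:
        raise ValueError("provenance set must not be empty")
    top_r = ranked_ids[:R]
    r = sum(1 for doc_id in top_r if doc_id in relevant)
    return r / R


# Toy example: two gold documents, one of them in the top-2 predictions.
score = r_precision(["a", "b", "c"], {"a", "c"})
```

When the provenance set contains a single document, R-precision reduces to checking whether the top-ranked prediction is correct.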

NQ-320k

To compare with Tay et al. (2022), we experiment on NQ-320k, a restricted setting of the official NQ dataset; it uses about 4% of the Wikipedia corpus as the corpus set. We construct the contextualized embedding matrix with the title of the document and its corresponding content as the input to EMB. The results are evaluated using Hits@N (N = {1, 10}), which shows the proportion of correct documents ranked in the top N predictions.

Table 1 shows that CGR Contra outperforms GENRE* by 6%, which demonstrates the effectiveness of contextualized embeddings. For both cases where the model is trained over a single dataset and over NQ and TQA together (NQ+TQA), all CGR variants show higher performance than GENRE*. All CGR variants trained jointly on NQ and TQA show higher performance than those trained on a single dataset (NQ or TQA). Such results suggest that CGR tends to improve when trained with more datasets. Note that due to limited available resources, we did not train CGR with the full KILT dataset (ALL KILT) as in GENRE, DPR, or SEAL. However, CGR Contra trained on less than 5% of the training data of the full KILT dataset shows higher or comparable performance to those models. See Appendix C.7 for results on the KILT dev set.

We report the results on NQ-320k in Table 2 to compare CGR with DSI and NCI, which only experiment on NQ-320k. We compare the results of CGR Base, CGR Contra, and the baselines (BM25, DSI, NCI, GENRE*). CGR Contra shows the highest performance when trained on the same number of datasets: 18% and 25% higher performance than GENRE* in Hits@1 and Hits@10, respectively. We also compare the result of CGR Base with DSI as a direct baseline in Appendix C.8.

Under review as a conference paper at ICLR 2023

In Table 1, we can see that CGR Contra shows consistently higher performance than CGR Base and CGR Async. We hypothesize two factors for such improvements.
First, as the model is trained with contrastive learning before training on the generative retrieval task, CGR Contra can leverage the benefits of contrastive learning, where it learns and constructs a well-structured overall embedding space and regularizes the space to be uniform (Ni et al., 2021; Gao et al., 2021a;b; Izacard et al., 2022). We check the quality of the embedding space with $L_{\text{uniformity}}$ proposed in Wang & Isola (2020), where the value represents how uniform the embedding space is and lower is better. CGR Contra (-19.7) shows a lower value than CGR Base (-18.2). Second, as EMB is initialized with the encoder of RET, there is high coherency between EMB and RET. The importance of having high coherency between EMB and RET can also be seen from the performance gain from CGR Base to CGR Async; CGR Base uses the initial EMB without any replacement, but CGR Async replaces EMB with the encoder of RET every N epochs and shows higher performance as N decreases, i.e., as the update becomes more frequent. More details are in Appendix C.4.
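The uniformity measure referenced above can be sketched as follows. It follows the definition in Wang & Isola (2020), the log of the average Gaussian potential over pairs of embeddings; the temperature t = 2 is that paper's default, and whether and how the embeddings were normalized for the values reported here is not stated, so both are assumptions of this sketch.

```python
import math
from itertools import combinations


def uniformity(embeddings, t=2.0):
    """log of the mean of exp(-t * ||x - y||^2) over all distinct pairs.

    Lower values indicate a more uniformly spread embedding space.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    total = 0.0
    for x, y in pairs:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
        total += math.exp(-t * sq_dist)
    return math.log(total / len(pairs))


# Toy comparison (hypothetical points on the unit circle).
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
clumped = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread_value = uniformity(spread)
clumped_value = uniformity(clumped)
```

Points that collapse onto each other give a value of 0, while well-separated points drive the value negative, which matches the "lower is better" reading used in the comparison of CGR Contra and CGR Base.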

Importance of Having Contextualized Embeddings with Document Content

In Table 3 , we compare the results between CGR Base -title-only, a model trained with contextualized embeddings extracted with only the title as the input to EMB, and CGR Base , a model trained with contextualized embeddings extracted with both the title and corresponding document content as input to EMB. CGR Base -title-only can be considered as an intermediate model between CGR Base and GENRE* as it uses the non-parametric space but is constructed with limited information (only with the title, without the entire document content). CGR Base shows the highest performance, GENRE* shows the lowest performance, and the performance of CGR Base -title-only is in-between the two models, suggesting that there is a correlation between the performance and how much contextual information is in the non-parametric space. GENRE*, which uses vanilla vocab embedding as the target embedding, has to depend solely on the information encoded in its own parameters (the parametric space of the generative retrieval model). On the other hand, CGR Base and CGR Base -title-only can depend on not only the parametric space of the generative retrieval model as GENRE* does but also the non-parametric space of corpus information embedded in the contextualized target embedding. By utilizing the contextualized target embedding, the model can know in which context the token is used and discern documents with different contexts. Although both CGR Base and CGR Base -title-only utilize contextualized target embeddings, the contextualized target embedding of CGR Base -title-only contains constrained information compared to that of CGR Base . Therefore, CGR Base -title-only fails on cases where the document content is necessary to retrieve the target sequence successfully. Table 5 shows examples where there is no direct relationship between the query and the target sequence such as lexical overlap or semantic similarity. 
It is difficult for the model to predict the target without the help of the document content, which indicates what information is in the document or what relationship exists between the query and the target sequence. We can see from Table 5 that CGR Base retrieves successfully, as such information is embedded in the contextualized target embeddings, whereas CGR Base -title-only fails, as its embeddings do not contain the information. See Appendix C.5 for more details about the contextualized embeddings of CGR Base.

Lexical overlap between query and retrieval target. To see whether the main performance improvement of CGR over GENRE* comes from CGR leveraging the information contained in the contextualized token embeddings, we check the performance of CGR Base, CGR Base -title-only, and GENRE* on queries that need document content to find the answer. We first run TF-IDF over all the queries of the NQ dev set in KILT and divide the queries into two sets: low-overlap and high-overlap. Low-overlap is the set of queries with a TF-IDF score lower than average, and high-overlap is the rest of the queries. For queries in the high-overlap set, all three models show high performance, as it is easy to infer the correct retrieval target from the query alone even if the model does not know the document content. On the other hand, while all models show relatively lower performance for queries in the low-overlap set, as context information is required to infer the relationship between the query and the retrieval target, CGR Base shows the strongest performance. CGR Base shows about 7% higher performance on the low-overlap set and 5% higher performance on the high-overlap set than GENRE* by leveraging the contextualized information encoded in the token embeddings (Table 9).
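The split into low-overlap and high-overlap queries can be sketched as below, assuming a TF-IDF overlap score has already been computed for every query. How that score is computed is not specified in the text, so the scores, the query IDs, and the function name `split_by_overlap` are hypothetical.

```python
def split_by_overlap(scores):
    """Split queries at the average score.

    scores: dict mapping query ID -> precomputed TF-IDF overlap score.
    Returns (low_overlap, high_overlap): queries scoring below the average,
    and the rest.
    """
    if not scores:
        raise ValueError("no queries to split")
    average = sum(scores.values()) / len(scores)
    low = [q for q, s in scores.items() if s < average]
    high = [q for q, s in scores.items() if s >= average]
    return low, high


# Hypothetical scores for three queries; the average is 0.5.
low_overlap, high_overlap = split_by_overlap({"q1": 0.2, "q2": 0.4, "q3": 0.9})
```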

6. CONCLUSION

In this paper, we propose Contextualized Generative Retrieval (CGR), a generative retrieval model that utilizes contextualized embeddings (output embeddings of the language model encoder) rather than vanilla vocab embeddings while generating the target sequence. This way, the model does not rely only on the information encoded in its own model parameters but also on the information encoded in the contextualized embeddings. Experimental results show that CGR achieves significantly higher performance than vanilla generative retrieval, demonstrating the effectiveness of utilizing such non-parametric external memory during decoding. We also perform extensive ablation studies and analysis on several variants of contextualized generative retrieval models to better understand how they work.

A MODEL DETAILS

See Figure 2a , Figure 1 , Figure 2b , and Figure 2c for figures of generative retrieval model, CGR Base , CGR Async , and CGR Contra , respectively.

A.1 INFERENCE STEP

We perform a constrained beam search with a prefix tree (Cao et al., 2021; Lee et al., 2022) during the inference step to ensure that all generated sequences are in the corpus. The prefix tree is constructed with the tokenization result of the corpus, and we perform a constrained beam search by masking out the tokens that do not create a sub-string of the text in the corpus. We find the next tokens from the top-k of the unmasked ones. While the token ID was used as the node of the prefix tree in previous works, since each token was mapped to a unique token ID, we construct a prefix tree with the text of the token as the node, because CGR contains multiple token IDs for a single token. Therefore, rather than unmasking only a single token ID, we unmask all token IDs that correspond to the text in order to unmask a token. We keep the beam size at 10 for all experiments following Cao et al. (2021).

We train all models using a pre-trained T5-large checkpoint from Wolf et al. (2020) as the initial checkpoint. GENRE* and CGR are trained with the same hyperparameter setting for a fair comparison. We run experiments on eight 32GB V100 GPUs or two 48GB A6000 GPUs. We train using Adafactor with a learning rate of 1e-4, a linear warm-up for the first 10% of training followed by linear decay, and a batch size of 512, for a maximum of 150 epochs.
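The masking step of the constrained beam search in Appendix A.1 can be sketched as follows. Only the text-keyed prefix tree and the unmasking of all token IDs that share a token text are shown; the beam search itself is omitted. The nested-dictionary tree, the `EOS` marker string, and the toy token-to-ID mapping are assumptions made for illustration.

```python
EOS = "</s>"  # hypothetical end-of-sequence token text


def build_prefix_tree(tokenized_titles):
    """Build a trie whose nodes are keyed by token text, not token ID."""
    root = {}
    for tokens in tokenized_titles:
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[EOS] = {}  # mark the end of a valid title
    return root


def allowed_token_ids(root, prefix_tokens, token_to_ids):
    """Return every decoder ID whose token text may follow the prefix.

    All IDs mapped to an allowed token text are unmasked, because one token
    text corresponds to several contextualized embeddings.
    """
    node = root
    for token in prefix_tokens:
        node = node.get(token)
        if node is None:
            return []  # prefix is not in the corpus: mask everything
    ids = []
    for token in node:
        ids.extend(token_to_ids.get(token, []))
    return sorted(ids)


# Toy corpus of two titles; "Cape" has two contextualized embeddings.
tree = build_prefix_tree([["Cape", "Town"], ["Cape", "Verde"]])
token_to_ids = {"Cape": [0, 1], "Town": [2], "Verde": [3], EOS: [4]}
first_step = allowed_token_ids(tree, [], token_to_ids)
second_step = allowed_token_ids(tree, ["Cape"], token_to_ids)
```

Keying the tree by token text keeps it the same size as in a vanilla generative retriever, while the ID expansion happens only when the mask is built.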

B.2 BM25 & DPR

Unlike Maillard et al. (2021), which performs document retrieval by training the model on passage-level tasks and considers retrieval successful when it retrieves the document that contains the passage, we train the model directly on the document retrieval task so that the setting (dataset) matches that of CGR. We consider the first five paragraphs as the content of the document and train the model so that the query embedding gets close to the document embedding rather than the paragraph embedding. The number of items in the corpus for the document retrieval task is the same as the number of pages in the KILT dataset. For BM25, the corpus is the same as in DPR, where each item in the corpus is the first five paragraphs of an individual document in the KILT corpus.

C.1 UPDATE FREQUENCY IN CGR ASYNC -TITLE-ONLY

We analyze how the performance changes according to how often EMB is replaced by the encoder of RET (replacement every N epochs) with CGR Async -title-only. When comparing the performance with N = {10, 20, 50}, CGR Async -title-only shows the highest performance at N = 10, and the performance tends to deteriorate as N becomes larger. Also, all CGR Async -title-only variants show higher performance than CGR Base -title-only (CGR-title-only without any replacement). The results show that although the model requires a higher computation cost and a longer training time as N gets smaller, it is important to have high coherency between the contextualized embeddings (output embeddings of EMB) and RET through frequent replacement.
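The trade-off between update frequency and computation can be made concrete with a schematic training loop. This is only a sketch of the schedule: `build_matrix` and `train_one_epoch` are hypothetical stand-ins for re-encoding the corpus with the current RET encoder and for one epoch of generative retrieval training, and refreshing at the end of every N-th epoch (except the last) is an assumption.

```python
def train_async(num_epochs, refresh_every, build_matrix, train_one_epoch):
    """Train RET while rebuilding the embedding matrix every N epochs.

    Returns the number of times the matrix was rebuilt after the initial
    build, which is the extra cost paid for higher coherency.
    """
    matrix = build_matrix()  # initial matrix from the initial EMB
    refreshes = 0
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(matrix)
        # Replace EMB with the current encoder of RET every N epochs.
        if epoch % refresh_every == 0 and epoch < num_epochs:
            matrix = build_matrix()
            refreshes += 1
    return refreshes


# Stub callables so the schedule can be run without a model.
def fake_build_matrix():
    return [[0.0]]


def fake_train_one_epoch(matrix):
    return None


refreshes_n10 = train_async(150, 10, fake_build_matrix, fake_train_one_epoch)
refreshes_n50 = train_async(150, 50, fake_build_matrix, fake_train_one_epoch)
```

Under these assumptions, N = 10 rebuilds the matrix seven times as often as N = 50 over 150 epochs, which is the extra cost the text weighs against the gain in coherency.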

C.2 DIFFERENT CONTRASTIVE LOSS IN CGR CONTRA

We experiment with three different types of contrastive loss when training CGR Contra. In this section, we show the losses and how the results differ by method. Given a training dataset of pairs $\{(q, t)\}$ where $q$ is the query text and $t$ is the retrieval target (title of the document to retrieve) composed of multiple tokens $t_i$ ($1 \le i \le k$, where $k$ is the length of the target), we split all tokens into $k$ separate pairs $\{(q, t_i)\}$ to construct the query-token training dataset. The three losses differ in what the model considers as the negative set and the positive set.

Loss 1: Neg: In-Batch Negatives / Pos: Single Token Embedding. With the query-token dataset, we train RET's first output token representation from the decoder to be close to each $t^+ \in \{t_1, \dots, t_k\}$ (the embedding of any token in the retrieval target $t$) given the query $q$ as an input to RET. The objective is to minimize the contrastive loss to make the query text embedding $q$ closer to the positive token embedding $t^+$:

$$L(q, t^+, t^-_1, \dots, t^-_{|T^-|}) = -\log \frac{e^{\langle q, t^+ \rangle}}{e^{\langle q, t^+ \rangle} + \sum_{t^- \in T^-} e^{\langle q, t^- \rangle}} \tag{3}$$

where $\langle \cdot, \cdot \rangle$ is the inner product between the two embeddings, and $T^- = \{t^-_1, \dots, t^-_{|T^-|}\}$ is the set of negative token embeddings, which are the other token embeddings in the training batch that are not paired with $q$ (in-batch negatives (Karpukhin et al., 2020)).

Loss 2: Neg: Contextualized Embedding Matrix / Pos: Single Token Embedding. This loss differs from Loss 1 in that it considers all embeddings in the contextualized embedding matrix except the single positive embedding as negatives, rather than using in-batch negatives, which consider only a subset of the contextualized embedding matrix as negatives. The equation is the same as Equation 3, but the elements of $T^-$ are all other token embeddings in the contextualized embedding matrix.
Loss 3: Neg: Contextualized Embedding Matrix / Pos: Multiple Token Embedding This loss differs from Loss 2 in that it treats all token embeddings in the title as positive embeddings; for each query q, there is more than one positive contextualized token embedding. With the query-token dataset, where $T^+ = \{t^+_1, \cdots, t^+_k\}$ is the set of positive token embeddings, we train RET's first output token representation from the decoder to be close to all token embeddings in $T^+$ given the query q as an input to RET. The objective is to minimize the contrastive loss that brings the query text embedding q closer to all positive token embeddings in $T^+$:

$$L(q, t^+_1, \cdots, t^+_{|T^+|}, t^-_1, \cdots, t^-_{|T^-|}) = -\log \frac{\sum_{t^+ \in T^+} e^{\langle q, t^+ \rangle}}{\sum_{t^+ \in T^+} e^{\langle q, t^+ \rangle} + \sum_{t^- \in T^-} e^{\langle q, t^- \rangle}} \tag{4}$$

where $\langle \cdot, \cdot \rangle$ is the inner product between two embeddings, and $T^- = \{t^-_1, \cdots, t^-_{|T^-|}\}$ is the set of negative token embeddings, which are all other token embeddings in the contextualized embedding matrix.

Results Table 6 shows the performance of CGR Contra with different contrastive losses, according to what is considered the positive pair and the negative pair. Multiple Token Emb treats all token embeddings in the same target sequence as positive pairs, whereas Single Token Emb treats each token embedding separately, so only one of the title token embeddings is a positive pair. In-Batch Negatives treats all embeddings in a batch except for the positive embedding as negative pairs, and Contextualized Embedding Matrix treats all embeddings in the contextualized embedding matrix (a matrix constructed with the contextualized token embeddings) except for the positive embeddings as negative pairs. The model trained on the contrastive loss with multiple token embeddings as positive pairs and all other embeddings in the contextualized embedding matrix as negative pairs (Loss3) shows the highest performance.
The model trained with the same negatives but with a single token embedding as the positive (Loss2) shows the lowest performance. The model with a single token embedding as the positive and in-batch negatives as negative pairs (Loss1) falls in between. As in Xiong et al. (2021), models with Loss2 and Loss3 have the benefit of looking at the global embedding space by taking negatives from the contextualized embedding matrix, unlike Loss1, which only considers embeddings in the same batch as negatives (in-batch negatives). However, Loss2 shows lower performance than Loss1 because, when a single token embedding is the positive pair, the rest of the token embeddings in the same title are treated as negative pairs. As the token embeddings in the same title are matched with the same query, such training sends conflicting signals to the model and leads to poor performance. Thus, when considering a single token embedding as the positive pair (Loss1 or Loss2), it is better to use only the embeddings in the same batch as negatives (in-batch negatives) rather than all token embeddings (Contextualized Embedding Matrix), as a batch is unlikely to contain two different token embeddings of the same title.
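The three losses can be compared side by side in a few lines of code. The following is a minimal NumPy sketch, not the training code used for CGR Contra: the function name `contrastive_loss`, the toy embedding matrix, and the index sets are illustrative assumptions. It evaluates Equations 3 and 4 with one shared routine, where only the positive and negative index sets change between Loss 1, Loss 2, and Loss 3.

```python
import numpy as np

def contrastive_loss(q, emb, pos_idx, neg_idx):
    """-log( sum_pos e^<q,t+> / (sum_pos e^<q,t+> + sum_neg e^<q,t-> ) ).

    q:       query representation, shape (d,)
    emb:     contextualized embedding matrix, shape (n, d)
    pos_idx: row indices of positive token embeddings (one row gives Eq. 3)
    neg_idx: row indices of negative token embeddings
    """
    pos_idx = np.asarray(pos_idx, dtype=int)
    neg_idx = np.asarray(neg_idx, dtype=int)
    scores = emb @ q                                  # inner products <q, t>
    shift = scores[np.concatenate([pos_idx, neg_idx])].max()  # numerical stability
    pos = np.exp(scores[pos_idx] - shift).sum()
    neg = np.exp(scores[neg_idx] - shift).sum()
    return float(-np.log(pos / (pos + neg)))

# Toy setup (hypothetical): 6 token embeddings, rows 0 and 1 form the target title,
# and the current batch happens to contain rows 0, 3, and 4.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
q = rng.normal(size=4)

loss1 = contrastive_loss(q, emb, [0], [3, 4])            # in-batch negatives
loss2 = contrastive_loss(q, emb, [0], [1, 2, 3, 4, 5])   # whole matrix as negatives
loss3 = contrastive_loss(q, emb, [0, 1], [2, 3, 4, 5])   # all title tokens positive
print(loss1, loss2, loss3)
```

In this toy setup, Loss 2 places row 1 (another token of the same title) in the negative set, while Loss 3 moves it to the positive set. Because Loss 2 and Loss 3 share the same denominator, Loss 3 is never larger than Loss 2 for the same query.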

C.3 CLUSTERING

Example of Clustering When a token "the" appears in the corpus 100 times, the encoder model first extracts 100 different contextualized embeddings of "the". Then, we perform k-means clustering on the 100 contextualized embeddings to group them into at most k clusters and save all centroid embeddings. We keep only the k centroid embeddings as the decoder vocab embeddings of the token "the" and assign each contextualized embedding a new decoder token ID according to the cluster it belongs to. By repeating the process over all tokens, each token ends up with at most k contextualized embeddings. As there are multiple contextualized token embeddings for a single token, we replace the ground-truth target token IDs with the newly constructed decoder token IDs to specify which contextualized token embedding each ground-truth target token refers to.
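The per-token clustering step can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: a small Lloyd-style k-means written in NumPy stands in for whichever k-means implementation is actually used, and the names `kmeans`, `cluster_token`, and `build_decoder_vocab` are hypothetical. Tokens with fewer than k extracted embeddings keep their own embeddings without clustering, as stated in the paper's footnote.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd-style k-means; returns (centroids, cluster index per row)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for c in range(k):
            members = x[assign == c]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def cluster_token(embs, k):
    """Reduce the contextualized embeddings of one token to at most k vectors."""
    if len(embs) < k:                     # too few embeddings: keep them as they are
        return embs.copy(), np.arange(len(embs))
    return kmeans(embs, k)

def build_decoder_vocab(token_embs, k):
    """Stack per-token centroids and map every occurrence to a new decoder token ID."""
    vocab, new_ids, next_id = [], {}, 0
    for token, embs in token_embs.items():
        centroids, assign = cluster_token(embs, k)
        vocab.append(centroids)
        new_ids[token] = next_id + assign  # new decoder ID of each occurrence
        next_id += len(centroids)
    return np.concatenate(vocab), new_ids

# Toy corpus (hypothetical): "the" occurs 100 times, "lincoln" only 3 times.
rng = np.random.default_rng(0)
token_embs = {"the": rng.normal(size=(100, 8)), "lincoln": rng.normal(size=(3, 8))}
vocab, new_ids = build_decoder_vocab(token_embs, k=5)
print(vocab.shape, new_ids["lincoln"])
```

With k = 5, the toy vocabulary holds five centroid rows for "the" and three unclustered rows for "lincoln"; `new_ids` gives the replacement target IDs for each occurrence.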

Performance by Number of Clusters and Storage Footprint

As saving all contextualized token embeddings to use as the vocab embedding matrix requires a large storage footprint (≈ 148GB), we reduce the number of token embeddings by clustering and saving only the k centroid embeddings for each token (Section 4.4). Figure 5 shows the effect of the maximum number of clusters for each token (k) on the performance. Models with k = 5 (a maximum of five different contextualized token embeddings for each token) show the highest performance, and a k smaller or larger than five decreases the performance. We hypothesize that the performance of models with k < 5 degrades because the number of embeddings is too small to capture all the different contextual meanings of a token, so the embeddings become closer to the vanilla token embedding. In contrast, the performance of models with k > 5 decreases because the search space of each generation step is too large and the parametric space of the model becomes too fine-grained, which might distract the model. When k = 5, the number of embeddings is 980 times smaller than when using all the contextualized embeddings of the KILT corpus as the vocab embeddings and 3.7 times larger than when using the vanilla vocab embeddings of T5 (k = 1). Therefore, when k = 5, saving all the vocab embeddings needs 0.47GB of storage, whereas the vanilla vocab embeddings (k = 1) need 0.13GB. The increase in the storage footprint of the vocab embeddings (0.34GB) is marginal compared to the storage footprint of the model parameters (3GB).

C.4 CHARACTERISTICS OF CONTEXTUALIZED EMBEDDINGS IN VARIANTS OF CGR

We compare the contextualized token embeddings of CGR Base , CGR Async , and CGR Contra (for CGR Contra we analyze the EMB of step 2, and for CGR Async the EMB after the last replacement). For 1000 cluster embeddings, we check the rate at which the top-5 embeddings most similar to each embedding belong to the same token. CGR Base shows the lowest rate of 50%. CGR Async and CGR Contra show a similarly high rate of 70%. The rate tends to increase as N increases in CGR Async . Such results suggest that, since embeddings of the same token share a similar lexical meaning, it is better for them to stay relatively close to one another. However, as the performance increases when a single token is matched to multiple token embeddings up to k = 5 (Appendix C.3), it is also important for the embeddings to carry slightly different meanings depending on the surrounding context. When checking which occurrences in the corpus are bound to the same cluster, all three models tend to depend on the position of the token in the text and the meaning of the surrounding tokens. See Appendix C.5 for more details.
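The same-token rate described above can be computed as follows. This is a minimal sketch under assumptions: the function name `same_token_rate` is hypothetical, and similarity is taken to be the inner product, which the text does not specify. For every embedding it looks up the five most similar other embeddings and measures how many of them belong to the same token.

```python
import numpy as np

def same_token_rate(emb, token_ids, top_k=5):
    """Average fraction of the top_k nearest embeddings that share the query's token.

    emb:       cluster embeddings, shape (n, d)
    token_ids: original token ID of each embedding, shape (n,)
    """
    sim = emb @ emb.T                       # inner-product similarity (assumption)
    np.fill_diagonal(sim, -np.inf)          # an embedding is not its own neighbor
    nearest = np.argsort(-sim, axis=1)[:, :top_k]
    same = token_ids[nearest] == token_ids[:, None]
    return float(same.mean())

# Toy data (hypothetical): two tokens with six embeddings each, placed along
# two orthogonal directions with a little noise.
rng = np.random.default_rng(0)
token_ids = np.array([0] * 6 + [1] * 6)
emb = np.eye(2, 8)[token_ids] + 0.01 * rng.normal(size=(12, 8))
rate = same_token_rate(emb, token_ids)
print(rate)
```

A rate close to 1 means that embeddings of the same token sit next to each other, while a lower rate means that neighbors are determined more by context than by token identity.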

C.5 CLUSTERING OVER TOTAL EMBEDDINGS

To understand the spatial properties of the contextualized embeddings, we conducted a qualitative analysis by performing k-means clustering over all contextualized token embeddings of CGR Base (EMB is the encoder of T5-large). Specifically, we clustered 36 million token embeddings, obtained from EMB, into 117,508 clusters using the FAISS k-means library (Johnson et al., 2021); the number of clusters is the same as the number of tokens in the contextualized embedding matrix, and hence the same as the number of clusters used in Section 4.4. First, we randomly sampled 100 tokens, and for each token, we calculated the portion of its contextualized embeddings that belong to the top 10% of the clusters containing the most embeddings of that token. On average, 67.6% of the embeddings of a token fall in the top 10% of the clusters that contain the token, with a standard deviation of 22.7. This indicates that most of the embeddings of a token are concentrated in a few spatial regions, while the rest are spread over many different areas.

To get a deeper insight into the spatial properties of the embeddings, we picked two tokens, "Lincoln" and "Squad", and visualized some of the clusters that contain them (Table 7). For each cluster, the tokens belonging to the cluster and their corresponding document names are shown. In Table 7, at most 20 documents are shown for each token, and only 4 tokens are shown in clusters 3 and 4 for simplicity. The first and second examples show the case where a cluster is composed of only a single token, as mentioned above. Interestingly, all of the corresponding documents of the first cluster are related to Lincolnshire, a county of England. Similarly, the tokens in the second cluster are related to documents about sports (usually football) squads. On the other hand, the third and fourth examples show the other case, where a cluster contains only a few of the tokens that we are interested in.
The members of the third cluster are related to middle names, and a few embeddings of the token "Lincoln" are contained in this cluster since there are some Wikipedia documents about people whose middle name is Lincoln. Likewise, the fourth cluster consists of embeddings related to the names of music albums (usually hip-hop and rock), some of which are produced by the group named "Blazin' squad", for example. These examples show how expressive the contextualized embeddings can be compared to the vanilla token embeddings; in this case, it is hard to expect that such varied context-dependent information about a token can be sufficiently encoded into a single token embedding. In summary, the results show that the contextualized embeddings corresponding to the same token are mapped to many different regions of the embedding space, depending on their context. This implies that the contextualized embeddings successfully acquire the contextual information of the corresponding documents, highlighting the effectiveness of utilizing contextualized embeddings for generative retrieval.
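The concentration statistic used in this analysis (the share of a token's embeddings that fall in the top 10% of its clusters) can be sketched as below. The function name `top_cluster_share` is hypothetical, and rounding the number of top clusters up to at least one is an assumption, since the text does not say how the 10% cutoff is rounded.

```python
import math
from collections import Counter

def top_cluster_share(cluster_ids, fraction=0.1):
    """Share of a token's embeddings that lie in its most populated clusters.

    cluster_ids: cluster index of every contextualized embedding of one token
    fraction:    portion of the token's clusters counted as "top" (10% in the text)
    """
    counts = sorted(Counter(cluster_ids).values(), reverse=True)
    n_top = max(1, math.ceil(fraction * len(counts)))   # rounding is an assumption
    return sum(counts[:n_top]) / len(cluster_ids)

# Toy token (hypothetical): 100 occurrences spread over 10 clusters,
# 91 of them in a single cluster and one in each of the other nine.
concentrated = [0] * 91 + list(range(1, 10))
uniform = [c for c in range(10) for _ in range(10)]
print(top_cluster_share(concentrated), top_cluster_share(uniform))
```

A token whose occurrences mostly share one usage yields a share near 1, whereas a token spread evenly over its clusters yields a share near the chosen fraction.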

C.6 LEXICAL OVERLAP BETWEEN QUERY AND ANSWER

CGR shows especially strong performance on queries in the low-overlap set: queries that in most cases need context information unless the model saw the information during the training step (Section 5.2). We check four sets:
1. GENRE+/CGR+: queries that both CGR and GENRE* successfully retrieved
2. GENRE+/CGR-: queries that GENRE* successfully retrieved and CGR failed on
3. GENRE-/CGR+: queries that GENRE* failed on and CGR successfully retrieved
4. GENRE-/CGR-: queries that both GENRE* and CGR failed on

C.7 KILT DEV RESULTS

Table 10 shows the results of five models (BM25, GENRE, DPR, SEAL, and CGR) on NQ and TQA of the KILT dev set. Note that the DPR and BM25 models in the single setup (where the models are trained with a single dataset, NQ only or TQA only) are different from those in Table 1 (Section B.2). The results on the KILT dev datasets (Table 10) show trends similar to those on the KILT test datasets (Table 1). CGR shows an average of 7% higher performance compared to its direct baseline model, GENRE*. CGR Contra shows the highest performance among the three variants of CGR. Compared with other retrieval models, CGR shows the highest performance in the single setup and shows performance comparable to models trained with all KILT datasets, although CGR is trained with only 3% of the training dataset.

C.8 CONTEXTUALIZED EMBEDDINGS WITH DOCUMENT ID AS RETRIEVAL TARGET

To show that CGR is not restricted to the title but generalizes to various retrieval targets, we experiment with CGR (CGR Base -Naive) when the retrieval target is a random document ID (Naively Structured String Identifiers in DSI (Tay et al., 2022)) in NQ-320k. In this case, the direct baseline model for CGR Base -Naive is DSI-Naive (DSI with Naively Structured String Identifiers as the document ID), not GENRE*. From Table 11, we can see that CGR Base -Naive shows more than four times higher performance in Hits@1 compared to DSI-Naive. Also, CGR Base -Naive shows higher performance than DSI-Semantic, where the document ID of DSI-Semantic is built from the hierarchical semantic meaning of the document content. The results show the importance of using the contextualized embeddings as the decoder vocab embeddings: rather than building the document ID from the document content, using a random document ID whose token embeddings carry the document content shows higher performance. Moreover, from the results of CGR Base -Naive and CGR Base -Title, we can see that the retrieval target affects the performance. We assume the difference comes from whether the retrieval target is natural language text or not; CGR can leverage the benefits of the pre-training step more when the target sequence is natural language (CGR Base -Title). However, the difference between CGR Base -Naive and CGR Base -Title is marginal compared to the performance difference between DSI and GENRE, both of which use vanilla model embeddings, which suggests that CGR generalizes to various retrieval targets. We leave target sequences other than the title and the document ID as future work. When compared to the performance of DSI T5-XXL with Semantic String Docid (the best performing model in Tay et al. (2022)) (40.4), CGR Base (58.7) shows about 1.5× higher performance in NQ-320k Hits@1. Note that the DSI T5-XXL model has 14 times more parameters than CGR Base .

D LIMITATIONS & FUTURE WORKS

As CGR uses k-means clustering to reduce the number of contextualized embeddings, the performance may change depending on how the contextualized embeddings are clustered. Generative retrieval models show low performance on unseen cases due to their dependency on the parametric space of the model (Lee et al., 2022); the model is likely to retrieve sequences it has seen during the training step better, as the information is saved in the parameters. Therefore, as CGR also leverages the benefit of the parametric space, we leave the direction of adapting the model to a new corpus as future work.



We will make our code publicly available.
In this paper, contextualized embeddings refer to the output of a language model encoder, which can incorporate information from the nearby context.
When the number of extracted contextualized embeddings of a token is smaller than k, we do not perform k-means clustering but use its own contextualized embeddings. Also, we use a single non-contextualized embedding for special tokens such as the EOS token or PAD token.
Due to limited resources, we did not train CGR with the full KILT datasets as in GENRE, which used 128 V100 GPUs with 32GB of memory for about 33 hours.
For a fair comparison with CGR, we train GENRE*, which is GENRE trained with the same resources, the same pre-trained model (T5), hyperparameters, and dataset as CGR.
The corpus set is the union of train/dev/test target sequences.
As the exact splits, document IDs, and preprocessing code used by Tay et al. (2022) are not released, we tried to replicate the setting as closely as possible when constructing the NQ-320k dataset to train CGRBase.
GENRE uses 128 V100 GPUs with 32GB of memory for about 33 hours.
e.g., Q: where was the world economic forum held this year / Target Document: World Economic Forum
e.g., Q: During which season does cape town receive rainfall / Target Document: Climate of South Africa
Among the queries for which all three models successfully retrieved the right retrieval target, 61% are in the high-overlap set. Also, among the queries on which all three models failed, 74% are in the low-overlap set. Such rates show that queries in the low-overlap set are relatively difficult.
We analyze the EMB of step2 in CGRContra and the last replaced EMB for CGRAsync.
The number of clusters is the same as the number of tokens in the contextualized embedding matrix, hence the same as the number of clusters we used in Section 4.4.
Single setup is a setup where the models are trained with a single dataset: NQ only or TQA only.
Note that the document IDs of CGRBase-Naive and DSI-Naive are not exactly the same, as the document IDs of DSI are not released. However, the document IDs of CGRBase-Naive and DSI-Naive are alike in that both are constructed randomly. We plan to update the numbers when the official repo is open.



Figure 3: The red bar indicates the high-rate and the blue bar indicates the low-rate. The rates are measured on the NQ dev set in KILT. Details about the high-rate and low-rate are in Appendix C.6.

CGR Base is the most basic contextualized generative retrieval model among the ones we propose. It uses the pre-trained T5 encoder as EMB and the T5 encoder-decoder as RET. EMB is frozen during the training step, and only RET is trainable.

4.2 CGR ASYNC

Asynchronous CGR (CGR Async ) is a model where EMB is asynchronously replaced by the encoder of RET every N epochs. When the model parameters of EMB are replaced, we construct a new contextualized embedding matrix with the replaced EMB and resume the training. As the decoder vocab embeddings of RET are updated every N epochs, EMB and RET of CGR Async are more coherent with each other than those of CGR Base . We keep N = 20 for all experiments. See Appendix C.1 for details on how N affects the performance.

R-precision(%) for document retrieval task on NQ and TQA test dataset in KILT. Results except CGR are from the KILT leaderboard. The column of the table is divided by how many training datasets are used. Numbers in the bracket are the rate of the number of training datasets over the number of training datasets when using all KILT datasets. Results with * in GENRE are from GENRE*. Underlined model is direct baseline of CGRBase. Results with † in BM25 and DPR are trained in same setting as CGR (Appendix B.2). Best in bold.

Hits@1, Hits@10 in NQ-320k. Results of BM25 & DSI are from Tay et al. (2022) and NCI & NCI- are from Wang et al. (2022). NCI- is NCI without query generation to match the number of training datasets with other models. All models are based on T5-large. Underlined model is direct baseline of CGRBase. Best in bold.

R-precision(%) for the document retrieval task on NQ and TQA test dataset in KILT. We compare the results of GENRE*, CGRBase-title-only, and CGRBase, where the models are trained with NQ+TQA (Section 5.2). The results show the importance of extracting contextualized embeddings with not only the title but also the corresponding document content.

R-precision(%) for the test sets of document retrieval tasks in KILT. Both GENRE* and CGRBase are trained with NQ + TQA; other datasets are not seen during the training time. Best in Bold.

GENRE* vs. CGR Base in Zero-Shot Setting Table 4 shows that CGR Base is stronger than GENRE* in the zero-shot setting where the models are trained on NQ and TQA and are evaluated on the other 9 datasets in KILT that are not used during the training step. CGR Base shows an average of 3% improvement from GENRE*. CGR Base shows high performance on information unseen during the training step as it does not solely rely on the information encoded in the parametric space (information that the model sees during the training step) but also on the non-parametric space of the contextualized embeddings.

Differences among CGR Base , CGR Async , and CGR Contra In Table

Top-3 prediction results of CGRBase, CGRBase-title-only, and GENRE* on NQ dev set in KILT. Highlights on the correct target sequence.

R-precision(%) for the document retrieval task on NQ and TQA test dataset in KILT. See Appendix C.2 for details about how the loss term differs. The loss term is used while training CGRContra in contrastive learning (step 1 of training CGRContra).

Top-3 prediction result of CGRBase, and GENRE*

R-precision(%) for the document retrieval task on NQ dev dataset in KILT. See details about Low and High

For both figures, GENRE-/CGR+ shows a higher number in low-rate, which indicates that CGR tends to successfully predict queries in the low-overlap set compared to GENRE*. Also, for both figures, GENRE+/CGR+ shows a high number in high-rate and GENRE-/CGR- shows a high number in low-rate, which indicates that queries in the high-overlap set tend to be easy questions for both GENRE* and CGR whereas queries in the low-overlap set are difficult for both models. Also, Table 8 shows samples of the top-5 prediction results of CGR Base and GENRE* where CGR Base successfully retrieved the correct item and GENRE* failed. Such queries tend to be in the low-overlap set. Such results suggest that CGR (CGR Base ) is robust on queries in the low-overlap set compared to GENRE*.

R-precision(%) for document retrieval task on NQ and TQA dev dataset in KILT. Results of BM25 and DPR are from Maillard et al. (2021), results of SEAL are provided by the authors of Bevilacqua et al. (2022), and results of GENRE are from the released pre-trained models. SEALs is SEAL (LM+FM) and SEALi is SEAL (LM+FM, intersective) in Bevilacqua et al. (2022). Underlined model is the direct baseline of CGR. Best of each section (same training dataset) in bold.

Hits@1 and Hits@10 in NQ-320k. Results of BM25, DSI-Naive, and DSI-Semantic are from Tay et al. (2022). CGRBase-Naive is CGRBase with document ID as a retrieval target, and CGRBase-Title is CGRBase with a title of the document as a retrieval target. Underlined model is direct baseline of CGRBase-Naive. Best from document ID as a target sequence in bold.


Cape Town: Cape Town is one of South Africa's three capital cities (…)

(c) Model architecture of CGRContra step1 (contrastive learning). The first output token embedding from the decoder of RET is trained to be close to all contextualized token embeddings of the retrieval target (positive pairs) but far from other contextualized token embeddings (negative pairs). Note that the model architecture of CGRContra step2 (generative retrieval task) is the same as CGRBase (Figure 1).

Example documents for the token "Lincoln" (Table 7): Moulton, Lincolnshire / Belton, North Lincolnshire / Walcott, Lincolnshire / Wrangle, Lincolnshire / Swineshead, Lincolnshire / Leverton, Lincolnshire / Kirton, Lincolnshire / Benington, Lincolnshire / Bicker, Lincolnshire / Dyke, Lincolnshire / Hilldyke, Lincolnshire / Waltham, Lincolnshire / Reepham, Lincolnshire / Bradley, Lincolnshire / Allington, Lincolnshire / Donington, Lincolnshire ettleton, Lincolnshire / Panton, Lincolnshire / Beckingham, Lincolnshire / Bigby, Lincolnshire / ...

Note that CGR is CGR Base in this section. Figure 3 and Figure 4 show the low-rate (blue) and high-rate (red). The low-rate of each case is calculated as $\frac{|Q \cap L|}{|Q|}$, where Q is the set of queries in the case and L is the set of queries in the low-overlap set. The high-rate of each case is calculated as $\frac{|Q \cap H|}{|Q|}$, where H is the set of queries in the high-overlap set.
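The low-rate and high-rate of the four cases can be computed with a few set operations. The following is a minimal sketch; the function names `split_cases` and `overlap_rates`, the query IDs, and the toy overlap sets are hypothetical, and the low-overlap and high-overlap sets are taken as given inputs.

```python
def split_cases(genre_correct, cgr_correct):
    """Group query IDs by whether GENRE* and CGR retrieved the right target."""
    cases = {"GENRE+/CGR+": set(), "GENRE+/CGR-": set(),
             "GENRE-/CGR+": set(), "GENRE-/CGR-": set()}
    for qid, genre_ok in genre_correct.items():
        cgr_ok = cgr_correct[qid]
        key = f"GENRE{'+' if genre_ok else '-'}/CGR{'+' if cgr_ok else '-'}"
        cases[key].add(qid)
    return cases

def overlap_rates(case_queries, low_overlap, high_overlap):
    """Return (low-rate, high-rate) = (|Q ∩ L| / |Q|, |Q ∩ H| / |Q|) for one case."""
    q = set(case_queries)
    if not q:
        raise ValueError("the case contains no queries")
    return len(q & set(low_overlap)) / len(q), len(q & set(high_overlap)) / len(q)

# Toy evaluation (hypothetical): four queries, per-model correctness flags,
# and a given split into low-overlap and high-overlap queries.
genre_correct = {"q1": True, "q2": False, "q3": False, "q4": True}
cgr_correct = {"q1": True, "q2": True, "q3": False, "q4": False}
low_overlap, high_overlap = {"q2", "q3"}, {"q1", "q4"}

cases = split_cases(genre_correct, cgr_correct)
rates = {name: overlap_rates(qs, low_overlap, high_overlap) for name, qs in cases.items()}
print(rates)
```

In this toy example the only query in GENRE-/CGR+ is a low-overlap query, so that case has a low-rate of 1 and a high-rate of 0.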

