EMBEDDISTILL: A GEOMETRIC KNOWLEDGE DISTILLATION FOR INFORMATION RETRIEVAL

Abstract

Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval. In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the information retrieval literature, which simply rely on the teacher's scalar scores over the training data, on two fronts: providing stronger signals about local geometry via embedding matching and attaining better coverage of data manifold globally via query generation. Embedding matching provides a stronger signal to align the representations of the teacher and student models. At the same time, query generation explores the data manifold to reduce the discrepancies between the student and teacher where the training data is sparse. Our distillation approach is theoretically justified and applies to both dual encoder (DE) and cross-encoder (CE) models. Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual pooling-based scorer for the CE model that facilitates a more distillation-friendly embedding geometry, especially for DE student models.

1. INTRODUCTION

Neural models for information retrieval (IR) are increasingly used to capture the true ranking in various applications, including web search (Mitra & Craswell, 2018), recommendation (Zhang et al., 2019), and question-answering (QA; Chen et al., 2017). Notably, the recent success of Transformers-based (Vaswani et al., 2017) pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) on a wide range of natural language understanding tasks has prompted their utilization in IR to capture query-document relevance (see, e.g., Dai & Callan, 2019b; MacAvaney et al., 2019a; Nogueira & Cho, 2019; Lee et al., 2019; Karpukhin et al., 2020, and references therein). A typical IR system comprises two stages: (1) a retriever first selects a small subset of potentially relevant candidate documents (out of a large collection) for a given query; and (2) a re-ranker then identifies a precise ranking among the candidates provided by the retriever. Dual-encoder (DE) models are the de facto architecture for retrievers (Lee et al., 2019; Karpukhin et al., 2020). Such models independently embed queries and documents into a common space, and capture their relevance by simple operations on these embeddings such as the inner product. This enables offline creation of a document index and supports fast retrieval during inference via efficient maximum inner product search (MIPS) implementations (Johnson et al., 2021; Guo et al., 2020), with query embedding generation primarily dictating the inference latency. Cross-encoder (CE) models, on the other hand, are preferred as re-rankers, owing to their excellent performance (Nogueira & Cho, 2019; Dai & Callan, 2019a; Yilmaz et al., 2019). A CE model jointly encodes a query-document pair while enabling early interaction among query and document text. Employing a CE model for retrieval is often infeasible, as it would require processing a given query with every document in the collection at inference time.
In fact, even in the re-ranking stage, the inference cost of CE models is high enough (Khattab & Zaharia, 2020) to warrant exploration of efficient alternatives (Hofstätter et al., 2020; Khattab & Zaharia, 2020; Menon et al., 2022) . Knowledge distillation (Bucilǎ et al., 2006; Hinton et al., 2015) provides a general strategy to address the prohibitive inference cost associated with high-quality large neural models. In the IR literature, most existing distillation methods only rely on the teacher's query-document relevance scores (see, e.g., Chen et al., 2021; Lu et al., 2020; Hofstätter et al., 2020; Ren et al., 2021; Santhanam et al., 2021) or their proxies (Izacard & Grave, 2021) . However, given that neural IR models are inherently embedding-based, it is natural to ask: is it useful to go beyond matching of the teacher and student models' scores, and directly aim to align their embedding spaces? With this in mind, we propose a novel distillation method for IR models that utilizes an embedding matching task to train the student. The proposed method supports cross-architecture distillation and improves upon existing distillation methods for both retriever and re-ranker models. When distilling a large DE model into a smaller DE model, for a given query (document), one can minimize the distance between the query (document) embeddings of the teacher and student after compatible projection layers to account for any dimensionality mismatch. In contrast, defining an embedding matching task for distilling a CE model into a DE model is not as straightforward. For Transformers-based CE models, it is common to use the final embedding of a special token, e.g., [CLS] in BERT (Devlin et al., 2019) , to compute query-document relevance (Nogueira & Cho, 2019) . However, as we note in Sec. 4.2, this token embedding does not capture semantic similarity between the query and document. 
To make CE models more amenable to distillation via embedding matching, we propose a modified CE scoring approach that uses a novel dual-pooling strategy: it separately pools the final query and document token embeddings, and then computes the inner product between the pooled embeddings as the relevance score.

Our key contributions toward improving IR models via distillation are:

• We propose a novel distillation approach for neural IR models, namely EmbedDistill, that goes beyond score matching and aligns the embedding spaces of the teacher and student models.
• We consider a novel DE to DE distillation setup to showcase the effectiveness of our embedding matching approach (Sec. 4.1). Specifically, we consider a student DE model with an asymmetric configuration, consisting of a small query encoder and a frozen document encoder inherited from the teacher. Such a configuration significantly reduces query embedding generation latency during inference, while leveraging the teacher's high-quality document index.
• We show that our proposed distillation approach can leverage synthetic data to improve the student by further aligning the embedding spaces of the teacher and student (Sec. 4.3).
• We theoretically justify both the embedding matching and query generation components of our proposed method (Sec. 5). Further, we provide a comprehensive empirical evaluation of the method (Sec. 6) on two standard IR benchmarks: Natural Questions (Kwiatkowski et al., 2019a) and MSMARCO (Nguyen et al., 2016).
• Finally, we demonstrate the utility of embedding matching for CE to DE distillation on MSMARCO by utilizing a novel pooling strategy for CE models, namely dual pooling (Sec. 4.2), which might be of independent interest.

2. RELATED WORK

Here, we review some existing Transformers-based IR models, and discuss prior work on distillation and data augmentation for such models. We also cover prior efforts on aligning representations during distillation for non-IR settings. Unlike our problem setting where the DE student is factorized, these works mainly consider distilling a single large Transformer into a smaller one.

Transformers-based architectures for IR. Besides the DE and CE models described in Section 1, intermediate configurations (MacAvaney et al., 2020; Khattab & Zaharia, 2020; Nie et al., 2020; Luan et al., 2021) have been proposed. Such models first independently encode the query and document, and then apply a more complex late interaction between the two. Interestingly, Nogueira et al. (2020) explore a generative encoder-decoder style model for re-ranking, where a T5 (Raffel et al., 2020) model takes a query-document pair as input and its score for certain target tokens (e.g., True/False) defines the relevance score for the pair. In this paper, we focus on basic DE and CE models to showcase the benefits of our proposed geometric distillation approach. Exploring embedding matching for the aforementioned architectures is an interesting avenue for future work.

Distillation for IR. Traditional distillation techniques have been widely applied in the IR literature, often to distill a teacher CE model to a student DE model (Chen et al., 2021; Li et al., 2020). Recently, distillation from a DE model (with complex late interaction) to another DE model (with inner-product scoring) has also been considered (Lin et al., 2021; Hofstätter et al., 2021). [...] (2020) conduct an extensive study of knowledge distillation across a wide range of model architectures. Most existing distillation schemes for IR rely only on teacher scores; by contrast, we propose a geometric approach that also utilizes the teacher embeddings.
Many recent efforts (Qu et al., 2021; Ren et al., 2021; Santhanam et al., 2021) show that iterative multi-stage (self-)distillation improves upon single-stage distillation. These approaches use a model from the previous stage to obtain labels (Santhanam et al., 2021) as well as to mine harder negatives (Xiong et al., 2021). We focus only on single-stage training in this paper. Multi-stage procedures are complementary to our work, as one can employ our proposed embedding-matching approach in various stages of such a procedure.

Distillation with representation alignments. Outside of the IR context, a few prior works proposed to utilize alignment between hidden layers during distillation (Romero et al., 2014; Sanh et al., 2019; Jiao et al., 2020; Aguilar et al., 2020; Zhang & Ma, 2020). Chen et al. (2022) utilize representation alignment to re-use the teacher's classification layer. Our work differs from these as it needs to address multiple challenges presented by an IR setting: 1) cross-architecture distillation such as CE to DE distillation; 2) partial representation alignment of query or document representations, as opposed to aligning representations for the entire input, i.e., a query-document pair; 3) catering the representation alignment approach to novel IR setups such as the asymmetric DE configuration; and 4) dealing with negative sampling due to a large number of classes (documents). To the best of our knowledge, our work is the first in the IR literature that goes beyond simply matching scores (or their proxies).

Semi-supervised learning for IR. Data augmentation or semi-supervised learning has been previously used to ensure data efficiency in IR (see, e.g., Zhao et al., 2021; MacAvaney et al., 2019b). More interestingly, data augmentation via large pre-trained models has enabled performance improvements as well.
Doc2query (Nogueira et al., 2019b; a) performs document expansion by generating queries that are relevant to the document and appending those queries to the document. Query expansion has also been considered, e.g., for document re-ranking (Zheng et al., 2020) . Notably, generating synthetic (query, passage, answer) triples from a text corpus to augment existing training data for QA systems also leads to significant gains (Alberti et al., 2019; Oguz et al., 2021) . Furthermore, even zero-shot approaches, where no labeled query-document pairs are used, can also perform competitively to supervised methods. Such methods train a DE model by relying on inverse cloze task (Lee et al., 2019; Izacard et al., 2021) , synthetic query-document pairs given a target text corpus (Ma et al., 2021) , or relevance scores from large pretrained models (Sachan et al., 2022) . Unlike these works, we utilize query-generation capability to ensure tighter alignment between the embedding spaces of the teacher and student.

3. BACKGROUND

Let $\mathcal{Q}$ and $\mathcal{D}$ denote the query and document spaces, respectively. An IR model is equivalent to a scorer $s : \mathcal{Q} \times \mathcal{D} \to \mathbb{R}$, i.e., it assigns a (relevance) score $s(q, d)$ to a query-document pair $(q, d) \in \mathcal{Q} \times \mathcal{D}$. Ideally, we want to learn an IR model or scorer that is faithful to the true query-document relevance, i.e., $s(q, d) > s(q, d')$ iff the document $d$ is more relevant to the query $q$ than the document $d'$. We assume access to $n$ labeled training examples of the form $S_n = \{(q_i, d_i, y_i)\}_{i \in [n]}$. Here, $d_i = (d_{i,1}, \ldots, d_{i,L}) \in \mathcal{D}^L$, $\forall i \in [n]$, denotes a list of $L$ documents and $y_i = (y_{i,1}, \ldots, y_{i,L}) \in \{0, 1\}^L$ denotes the corresponding labels such that $y_{i,j} = 1$ iff the document $d_{i,j}$ is relevant to the query $q_i$. Given $S_n$, we learn an IR model by minimizing

$$R(s; S_n) = \frac{1}{n} \sum_{i \in [n]} \ell\big(s_{q_i, d_i}, y_i\big), \tag{1}$$

where $s_{q_i, d_i} := \big(s(q_i, d_{i,1}), \ldots, s(q_i, d_{i,L})\big)$ and $\ell(s_{q_i, d_i}, y_i)$ denotes the loss $s$ incurs on $(q_i, d_i, y_i)$. Due to space constraints, we present a few common choices for the loss function $\ell$ in Appendix A. While this learning framework is general enough to work with any IR model as the scorer, we next formally describe two families of IR models that are prevalent in the recent literature.

3.1. TRANSFORMER-BASED IR MODELS: CROSS-ENCODERS AND DUAL-ENCODERS

Let query $q = (q_1, \ldots, q_{m_1})$ and document $d = (d_1, \ldots, d_{m_2})$ consist of $m_1$ and $m_2$ tokens, respectively. We now discuss how Transformers-based CE and DE models process a given $(q, d)$ pair.

Cross-encoder model. Let $o = [q; d]$ denote the sequence of length $m_1 + m_2$ obtained by concatenating $q$ and $d$. Furthermore, let $\tilde{o}$ be the sequence obtained by adding special tokens such as [CLS] and [SEP] to $o$ at appropriate locations. Now, given an encoder-only Transformer model Enc, the relevance score for the query-document pair $(q, d)$ is defined as

$$s(q, d) = \big\langle w, \mathrm{pool}\big(\mathrm{Enc}(\tilde{o})\big) \big\rangle = \big\langle w, \mathrm{emb}_{q,d} \big\rangle, \tag{2}$$

where $w$ is a $d$-dimensional classification vector, and $\mathrm{pool}(\cdot)$ denotes a pooling operation that transforms $\mathrm{Enc}(\tilde{o})$, the contextualized token embeddings produced by Enc, into a joint embedding vector $\mathrm{emb}_{q,d}$. [CLS]-pooling is a common strategy that simply outputs the embedding of the [CLS] token.

Dual-encoder model. Let $\tilde{q}$ and $\tilde{d}$ be the sequences obtained by adding appropriate special tokens to $q$ and $d$, respectively. A DE model comprises two (encoder-only) Transformers $\mathrm{Enc}_Q$ and $\mathrm{Enc}_D$, which we call the query and document encoders, respectively. Let $\mathrm{emb}_q = \mathrm{pool}\big(\mathrm{Enc}_Q(\tilde{q})\big)$ and $\mathrm{emb}_d = \mathrm{pool}\big(\mathrm{Enc}_D(\tilde{d})\big)$ denote the query and document embeddings, respectively. Now, one can define $s(q, d) = \langle \mathrm{emb}_q, \mathrm{emb}_d \rangle$ to be the relevance score assigned to $(q, d)$ by the DE model.
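As a minimal sketch of the DE scoring rule above, the following Python snippet embeds documents offline, embeds the query at inference time, and ranks documents by inner product. The names `toy_encoder`, `build_index`, and `retrieve` are hypothetical; the bag-of-words encoder merely stands in for a Transformer followed by pooling, and the exhaustive scan stands in for an efficient MIPS implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[str], Vector]


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v>, the DE relevance score between two embeddings."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def build_index(doc_encoder: Encoder, docs: Dict[str, str]) -> Dict[str, Vector]:
    """Offline step: embed every document once with the document encoder."""
    return {doc_id: doc_encoder(text) for doc_id, text in docs.items()}


def retrieve(query_encoder: Encoder, index: Dict[str, Vector],
             query: str, k: int) -> List[Tuple[str, float]]:
    """Online step: embed the query, then rank documents by inner product.

    A brute-force scan is used here; a real system would use a MIPS library.
    """
    emb_q = query_encoder(query)
    scored = [(doc_id, inner_product(emb_q, emb_d))
              for doc_id, emb_d in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_encoder(text: str) -> Vector:
    """Hypothetical stand-in encoder: word counts over a tiny fixed vocabulary."""
    vocab = ("cat", "dog", "fish")
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]
```

Because documents are embedded independently of the query, only `retrieve` runs at inference time, which is why the query encoder dominates latency in this setup.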

3.2. SCORE-BASED DISTILLATION FOR IR MODELS

Most distillation schemes for IR (e.g., Chen et al., 2021; Lu et al., 2020; Hofstätter et al., 2020; Ren et al., 2021; Santhanam et al., 2021) rely on teacher relevance scores (cf. Fig. 1). In particular, given a training set $S_n$ and a teacher with scorer $s^t$, one learns a student with scorer $s^s$ by minimizing

$$R(s^s, s^t; S_n) = \frac{1}{n} \sum_{i \in [n]} \ell_d\big(s^s_{q_i, d_i}, s^t_{q_i, d_i}\big), \tag{3}$$

where $\ell_d$ captures the discrepancy between $s^s$ and $s^t$. See Appendix A for common choices for $\ell_d$.
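To make the score-based objective concrete, here is a minimal Python sketch that instantiates the discrepancy with a softmax-based cross-entropy between teacher and student score lists (one of the choices listed in Appendix A) and averages it over examples. The function names are hypothetical, and the inputs are assumed to be precomputed score lists of equal length per query.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def softmax_distillation_loss(student_scores: Sequence[float],
                              teacher_scores: Sequence[float]) -> float:
    """Cross-entropy between softmax(teacher scores) and softmax(student scores)."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("student and teacher must score the same documents")
    log_p_student = log_softmax(student_scores)
    p_teacher = [math.exp(lp) for lp in log_softmax(teacher_scores)]
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


def distillation_risk(student_lists: Sequence[Sequence[float]],
                      teacher_lists: Sequence[Sequence[float]]) -> float:
    """Average per-example distillation loss over a training set."""
    if not student_lists or len(student_lists) != len(teacher_lists):
        raise ValueError("need one non-empty score list per example for both models")
    losses = [softmax_distillation_loss(s, t)
              for s, t in zip(student_lists, teacher_lists)]
    return sum(losses) / len(losses)
```

Note that this loss only sees scalar scores: two students with very different embedding geometries can attain the same value as long as their score lists agree.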

4. EMBEDDING-MATCHING BASED DISTILLATION

Since modern neural IR models are inherently embedding-based, we propose to explicitly align the embedding spaces of the teacher and student via a novel distillation method, namely EmbedDistill. Note that our proposal goes beyond existing distillation methods in the IR literature that only utilize the teacher scores. Next, we describe EmbedDistill for two prevalent settings: (1) distilling a large DE model to a smaller DE model; and (2) distilling a CE model to a DE model.

4.1. DE TO DE DISTILLATION

Given a $(q, d)$ pair, let $\mathrm{emb}^t_q$ and $\mathrm{emb}^t_d$ be the query and document embeddings produced by the query encoder $\mathrm{Enc}^t_Q$ and document encoder $\mathrm{Enc}^t_D$ of the teacher DE model, respectively. Similarly, let $\mathrm{emb}^s_q$ and $\mathrm{emb}^s_d$ denote the query and document embeddings produced by a student DE model with $(\mathrm{Enc}^s_Q, \mathrm{Enc}^s_D)$ as its query and document encoders. Now, EmbedDistill optimizes the following embedding alignment loss in addition to the score-matching loss from Sec. 3.2:

$$R_{\mathrm{Emb}}(t, s; S_n) = \frac{1}{n} \sum_{(q,d) \in S_n} \Big( \big\| \mathrm{emb}^t_q - \mathrm{proj}\big(\mathrm{emb}^s_q\big) \big\| + \big\| \mathrm{emb}^t_d - \mathrm{proj}\big(\mathrm{emb}^s_d\big) \big\| \Big), \tag{4}$$

where proj is an optional trainable layer that is required if the teacher and student produce embeddings of different sizes. Alternatively, one can employ other variants of EmbedDistill; e.g., aligning only the query embeddings takes the following form (cf. Fig. 2):

$$R_{\mathrm{Emb},Q}(t, s; S_n) = \frac{1}{n} \sum_{q \in S_n} \big\| \mathrm{emb}^t_q - \mathrm{proj}\big(\mathrm{emb}^s_q\big) \big\|. \tag{5}$$

Asymmetric DE. We also propose a novel student DE configuration where the student employs the teacher's document encoder (i.e., $\mathrm{Enc}^s_D = \mathrm{Enc}^t_D$) and trains only its query encoder, which is much smaller than the teacher's query encoder. For such a setting, it is natural to employ only the embedding matching loss in Eq. 5, as the document embeddings are aligned by design (cf. Fig. 2).
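The query-only embedding matching loss can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the norm is taken to be Euclidean, `proj` is modeled as a plain matrix (a list of rows mapping the student dimension to the teacher dimension), and all function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def l2_distance(u: Vector, v: Vector) -> float:
    """Euclidean distance ||u - v||_2 (the choice of norm is an assumption)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def project(proj: Matrix, v: Vector) -> List[float]:
    """Linear projection proj @ v, mapping a student embedding to teacher size."""
    return [sum(w * x for w, x in zip(row, v)) for row in proj]


def embedding_matching_loss(teacher_embs: Sequence[Vector],
                            student_embs: Sequence[Vector],
                            proj: Optional[Matrix] = None) -> float:
    """Mean distance between teacher embeddings and (projected) student embeddings."""
    if not teacher_embs or len(teacher_embs) != len(student_embs):
        raise ValueError("need one teacher and one student embedding per input")
    total = 0.0
    for emb_t, emb_s in zip(teacher_embs, student_embs):
        mapped = project(proj, emb_s) if proj is not None else emb_s
        total += l2_distance(emb_t, mapped)
    return total / len(teacher_embs)
```

In the asymmetric configuration only query embeddings are passed to this function, since the student's document embeddings coincide with the teacher's by construction.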

[Figure: diagram with the labels "Special tokens", "Query & doc tokens", "Student Query Encoder", and "Teacher Doc Encoder".]

4.2. CE TO DE DISTILLATION

As a naïve solution, for a given $(q, d)$ pair, one can simply match a joint transformation of the student's query embedding $\mathrm{emb}^s_q$ and document embedding $\mathrm{emb}^s_d$ to the teacher's joint embedding $\mathrm{emb}^t_{q,d}$. However, we observed that including such an embedding matching task often leads to severe over-fitting, and results in a poorly generalizable student. Since $s^t(q, d) = \langle w, \mathrm{emb}^t_{q,d} \rangle$, during CE model training, the joint embeddings $\mathrm{emb}^t_{q,d}$ for relevant and irrelevant $(q, d)$ pairs are encouraged to be aligned with $w$ and $-w$, respectively. This produces degenerate embeddings that do not capture semantic query-to-query or document-to-document relationships. We notice that even the final query and document token embeddings lose such semantic structure. Thus, a teacher CE model with $s^t(q, d) = \langle w, \mathrm{emb}^t_{q,d} \rangle$ does not add value for distillation beyond score-matching; in fact, including naïve embedding matching hurts. Next, we propose a modified CE model training strategy that facilitates EmbedDistill.

CE models with dual pooling. We propose dual pooling to produce two embeddings $\mathrm{emb}^t_{q \leftarrow (q,d)}$ and $\mathrm{emb}^t_{d \leftarrow (q,d)}$ from a CE model that serve as the proxy query and document embeddings, respectively. Accordingly, we define the relevance score as $s^t(q, d) = \langle \mathrm{emb}^t_{q \leftarrow (q,d)}, \mathrm{emb}^t_{d \leftarrow (q,d)} \rangle$. We explore two variants of dual pooling: (1) special token-based pooling that pools from [CLS] and [SEP]; and (2) segment-based weighted mean pooling that separately performs weighted averaging on the query and document segments of the final token embeddings. See Appendix B for details. In addition to employing dual pooling, we also utilize a reconstruction loss during the CE training, which measures the likelihood of predicting each token of the original input from the final token embeddings.
This loss encourages reconstruction of query and document tokens based on the final token embeddings and prevents the degeneration of the token embeddings during training on the IR task. Given proxy embeddings from the teacher CE, we can perform EmbedDistill with the loss defined in Eq. 4 or Eq. 5 (cf. Fig. 3).
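The two dual pooling variants can be sketched as follows, operating on a list of final token embeddings from a CE model. Several details are assumptions made for illustration: the document proxy is read from a caller-supplied [SEP] position (the text does not say which [SEP] is used), the weighted mean uses a weight of one over the square root of the segment length (one reading of the scheme in Appendix B), and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def inner_product(u: Vector, v: Vector) -> float:
    """Dual-pooling relevance score: inner product of the two proxy embeddings."""
    return sum(a * b for a, b in zip(u, v))


def special_token_dual_pool(token_embs: Sequence[Vector],
                            cls_index: int, sep_index: int) -> Tuple[Vector, Vector]:
    """Variant 1: [CLS] embedding as query proxy, a [SEP] embedding as document proxy."""
    return token_embs[cls_index], token_embs[sep_index]


def weighted_mean_pool(segment_embs: Sequence[Vector]) -> List[float]:
    """Sum the token embeddings, each weighted by 1 / sqrt(segment length)."""
    if not segment_embs:
        raise ValueError("segment must contain at least one token")
    weight = 1.0 / math.sqrt(len(segment_embs))
    dim = len(segment_embs[0])
    return [weight * sum(tok[j] for tok in segment_embs) for j in range(dim)]


def segment_dual_pool(token_embs: Sequence[Vector],
                      query_span: Tuple[int, int],
                      doc_span: Tuple[int, int]) -> Tuple[List[float], List[float]]:
    """Variant 2: pool query and document token segments separately.

    Spans are half-open index ranges [start, end) into token_embs.
    """
    q_start, q_end = query_span
    d_start, d_end = doc_span
    return (weighted_mean_pool(token_embs[q_start:q_end]),
            weighted_mean_pool(token_embs[d_start:d_end]))
```

Either variant yields a pair of vectors that can be fed to the same embedding matching losses used for DE teachers, which is the point of the construction.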

4.3. TASK-SPECIFIC ONLINE DATA GENERATION

Data augmentation as a general technique has been previously considered in the IR literature (see, e.g., Nogueira et al., 2019b; Oguz et al., 2021; Izacard et al., 2021; Ma et al., 2021), especially in data-limited, out-of-domain, or zero-shot settings. Since EmbedDistill aims to align the embedding spaces of the teacher and student, the ability to generate similar queries or documents can naturally help enforce such an alignment globally on the task-specific manifold. Given a set of unlabeled task-specific query and document pairs $U_m$, we can further add the embedding-alignment loss $R_{\mathrm{Emb}}(t, s; U_m)$ to our training objective (cf. Eq. 4). Interestingly, for the DE to DE distillation setting, our approach can even benefit from a large collection of task-specific queries $\mathcal{Q}'$ or documents $\mathcal{D}'$: here, we can independently add the additional embedding matching losses $R_{\mathrm{Emb},Q}(t, s; \mathcal{Q}')$ and $R_{\mathrm{Emb},D}(t, s; \mathcal{D}')$ that focus on queries and documents, respectively.
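A rough sketch of how generated queries could enter the objective is given below. It assumes a hypothetical `generate` callable that returns synthetic queries for an input query, Euclidean distance as the matching norm, and a plain sum of the matching terms on original and generated queries; the relative weighting of the two terms is not specified in the text above and is an assumption here. No labels are needed for the generated queries, because the teacher's embeddings act as the targets.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Encoder = Callable[[str], Vector]


def l2_distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two embeddings of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def query_matching_risk(teacher_encode: Encoder, student_encode: Encoder,
                        queries: Sequence[str]) -> float:
    """Mean teacher-student query embedding distance over a set of queries."""
    if not queries:
        return 0.0
    total = sum(l2_distance(teacher_encode(q), student_encode(q)) for q in queries)
    return total / len(queries)


def augmented_query_matching_risk(teacher_encode: Encoder, student_encode: Encoder,
                                  queries: Sequence[str],
                                  generate: Callable[[str], List[str]],
                                  per_query: int = 2) -> float:
    """Matching risk on the original queries plus the risk on generated queries.

    `generate` is a hypothetical query generator; `per_query` caps how many
    generated queries are kept for each original query.
    """
    generated = [g for q in queries for g in generate(q)[:per_query]]
    return (query_matching_risk(teacher_encode, student_encode, queries)
            + query_matching_risk(teacher_encode, student_encode, generated))
```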

5. IMPROVEMENTS IN THE GENERALIZATION OF STUDENT

We motivated EmbedDistill, as well as the asymmetric DE configuration in which the student DE model inherits the teacher's document encoder, by their potential to ensure a better alignment between the teacher and student embedding spaces. In this section, we provide a theoretical justification for both of these proposals by showing that they indeed result in better generalization performance for the student and reduce the gap between the teacher and the student. We break our analysis into two steps: 1) First, we show that, using EmbedDistill and inheriting the document encoder from the teacher, the student's empirical risk (as measured by the distillation objective) gets closer to the teacher's population risk; and 2) Second, we argue that both of these techniques have favorable implications on the distillation loss of the student via uniform deviation bounds. The following result studies the gap between the student's expected empirical (distillation) risk and the teacher's population risk (see Appendix C.1 for a formal statement and proof). For simplicity, we focus on $L = 1$ (cf. Sec. 3). The result can be extended to $L > 1$ with more complex notation.

Proposition 1 (Expected risk bound (informal)). Let the label $y \in \{0, 1\}$ indicate the relevance of the query-document pair $(q, d)$. Suppose that the score-based distillation loss $\ell_d$ in Eq. 3 is based on the binary cross-entropy loss (Eq. 11 in Appendix A). Let the one-hot (label-dependent) loss $\ell$ in Eq. 1 be the binary cross-entropy loss (Eq. 9 in Appendix A). Assume that all encoders have the same output dimension. Under regularity conditions, we have

$$\mathbb{E}\big[R(s^s, s^t; S_n)\big] - \mathbb{E}\big[\ell(s^t_{q,d}, y)\big] = O\Big( \mathbb{E}_d\big[\|\mathrm{emb}^s_d - \mathrm{emb}^t_d\|_2\big] + \mathbb{E}_q\big[\|\mathrm{emb}^s_q - \mathrm{emb}^t_q\|_2\big] + \mathbb{E}_{(q,d,y)}\big[\big|\mathrm{sigmoid}(\langle \mathrm{emb}^t_q, \mathrm{emb}^t_d \rangle) - y\big|\big] \Big).$$

Proposition 1 can be viewed as bounding the bias of the student with respect to the teacher, as realized by the distillation. It shows that the bias can be upper bounded by three terms:
1) the expected difference between the document embeddings of the student and the teacher, 2) the expected difference between the query embeddings, and 3) the teacher's error in modeling the true label probability. Observe that the student's bias is smaller when the embeddings of the student match those of the teacher. In particular, when the student inherits the document encoder from the teacher (as in Fig. 2), the error in the first [...]

$$\sup_{s^s \in \mathcal{F} \times \mathcal{G}} \Big| \frac{1}{n} \sum_{i \in [n]} \ell_d\big(s^s_{q_i,d_i}, s^t_{q_i,d_i}\big) - \mathbb{E}\, \ell_d\big(s^s_{q,d}, s^t_{q,d}\big) \Big| \le \mathbb{E}_{S_n} \frac{48 K L_{\ell_d}}{\sqrt{n}} \int_0^\infty \sqrt{\log N(u, \mathcal{F})\, N(u, \mathcal{G})}\; du; \tag{6}$$

$$\sup_{s^s \in \mathcal{F} \times \{g^*\}} \Big| \frac{1}{n} \sum_{i \in [n]} \ell_d\big(s^s_{q_i,d_i}, s^t_{q_i,d_i}\big) - \mathbb{E}\, \ell_d\big(s^s_{q,d}, s^t_{q,d}\big) \Big| \le \mathbb{E}_{S_n} \frac{48 K L_{\ell_d}}{\sqrt{n}} \int_0^\infty \sqrt{\log N(u, \mathcal{F})}\; du. \tag{7}$$

Here, $g^*$ is a fixed document encoder and $N(u, \cdot)$ denotes the $u$-covering number of a function class. Note that Eq. 6 and Eq. 7 correspond to the uniform deviation when we train without and with a frozen document encoder, respectively. The bound in Eq. 7 is no larger than that in Eq. 6 (due to the absence of the $N(u, \mathcal{G})$ term, which is always at least 1), which alludes to the desirable impact of employing a frozen document encoder (in terms of the deviation between train and test performance). When we further employ EmbedDistill (e.g., with the loss in Eq. 5), it regularizes the function class of query encoders, effectively reducing it to $\mathcal{F}'$ with $|\mathcal{F}'| \le |\mathcal{F}|$. This has a further desirable implication for reducing the uniform deviation bounds, as $N(u, \mathcal{F}') \le N(u, \mathcal{F})$.
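The comparison between Eq. 6 and Eq. 7 rests on a short monotonicity argument, spelled out here as a sketch. It assumes only that covering numbers are at least 1 and that the integrand is a nondecreasing function $\varphi$ of the log-covering term; $\varphi$ is a placeholder introduced for this illustration (e.g., the identity or a square root).

```latex
\begin{align*}
\log\big(N(u,\mathcal{F})\,N(u,\mathcal{G})\big)
  &= \log N(u,\mathcal{F}) + \log N(u,\mathcal{G}) \\
  &\ge \log N(u,\mathcal{F})
  && \text{since } N(u,\mathcal{G}) \ge 1, \\
\int_0^\infty \varphi\Big(\log\big(N(u,\mathcal{F})\,N(u,\mathcal{G})\big)\Big)\,du
  &\ge \int_0^\infty \varphi\big(\log N(u,\mathcal{F})\big)\,du
  && \text{for nondecreasing } \varphi .
\end{align*}
```

The same reasoning applies when the query encoder class shrinks: a smaller covering number pointwise in $u$ gives a smaller integral, so restricting the query encoder class can only tighten the bound.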

6. EXPERIMENTS

We now conduct a comprehensive evaluation of the proposed distillation approach. Specifically, we highlight the utility of the proposed approach for both DE to DE and CE to DE distillation. We also showcase the benefits of combining our distillation approach with query generation methods.

6.1. EXPERIMENTAL SETUP

Benchmarks and evaluation metrics. We focus on two popular IR benchmarks, Natural Questions (NQ; Kwiatkowski et al., 2019b) and MSMARCO (Nguyen et al., 2016), which focus on finding the most relevant passage/document given a question and a search query, respectively. NQ provides both standard test and dev sets, whereas MSMARCO has only a standard dev set publicly available. In what follows, we use the terms query (document) and question (passage) interchangeably. For NQ, we use the standard full recall (strict) as well as the relaxed recall metric (Karpukhin et al., 2020) to evaluate the retrieval performance. For MSMARCO, we focus on the standard metrics Mean Reciprocal Rank (MRR)@10 and normalized Discounted Cumulative Gain (nDCG)@10. See Appendix D for a detailed discussion of these evaluation metrics.

Student DE model training. We train student DE models using a combination of (i) the one-hot loss (cf. Eq. 8 in Appendix A) on the training data; (ii) the distillation loss (cf. Eq. 10 in Appendix A); and (iii) the embedding matching loss in Eq. 5. We use [CLS]-pooling for all student encoders. Unlike DPR (Karpukhin et al., 2020), we do not use hard negatives from BM25 or other models, which greatly simplifies our distillation procedure.

Results and discussion. To understand the impact of the various proposed configurations and losses, we train models by sequentially adding components and evaluate on the NQ and MSMARCO dev sets, as shown in Tables 1 and 2, respectively. (See Table 5 in Appendix G.1 for performance on NQ in terms of the relaxed recall.) We begin by training a symmetric DE without distillation. As expected, moving to distillation brings considerable gains. Next, we swap the student document encoder with (non-trainable) document embeddings from the teacher, which leads to a sizable jump in performance. We can then introduce EmbedDistill with Eq. 5 for aligning query representations between the student and teacher.
The two losses are combined with a weight of 1.0 and 5.0 for NQ and MSMARCO, respectively. This improves performance significantly, e.g., it provides increases of ~5 and ~9 points in recall@5 on NQ with students based on DistilRoBERTa and RoBERTa-mini, respectively. We further explore the utility of EmbedDistill in aligning the teacher and student embedding spaces in Appendix H.1. On top of the two losses (standard distillation and embedding matching), we also employ $R_{\mathrm{Emb},Q}(t, s; \mathcal{Q}')$ from Sec. 4.3 on 2 additional questions (per input question) generated from BART for further gain. We also try a variant where we eliminate the standard distillation loss and only employ the embedding matching loss in Eq. 5 along with inheriting the document embeddings from the teacher. This configuration without the standard distillation loss leads to excellent performance (with query generation again providing additional gains). It is worth highlighting that DE models trained with the proposed methods (e.g., asymmetric DE with embedding matching and generation) achieve 99% of the teacher's performance on both the NQ and MSMARCO tasks with a query encoder that is half the size of the teacher's. Furthermore, even with a query encoder 1/8th the size, our proposal can achieve 91-97% of the performance. This is particularly useful for latency-critical applications, with minimal impact on the final performance. Finally, we take our best student model for NQ based on the dev set, i.e., the one trained using only the embedding matching loss with inherited document embeddings and using data augmentation from query generation, and evaluate it on the NQ test set (cf. Table 3). We compare with various prior works and note that they worked with considerably bigger models in terms of depth (12 or 24 layers) and width (up to 1024 dims), while our approach obtains competitive performance with fewer than 50% of the parameters. Note that, even with 6 layers, our student model is very close (99%) to its teacher.

Table 3 (partial).

Method                          #Layers  R@20  R@100
DPR (Karpukhin et al., 2020)    12       78.4  85.4
DPR + PAQ (Oguz et al., 2021)   12       84.0  89.2
DPR + PAQ (Oguz et al., 2021)   24       84.7  89.2
ANCE (Xiong et al., 2021)       12       81.9  87.5
RocketQA (Qu et al., 2021)      12       82.7  88.5
MSS-DPR (Sachan et al., 2021)   12       84.0  89.2
MSS-DPR (Sachan et al., [...]
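For readers unfamiliar with the MSMARCO metrics named in the setup above, the following Python sketch computes the reciprocal rank and nDCG at a cutoff for a single query. It uses one common DCG formulation (linear gain divided by the base-2 logarithm of rank plus one), which is an assumption here; the precise definitions used in the paper are deferred to its Appendix D. Function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Set


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str],
                         k: int = 10) -> float:
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gains and a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float],
              k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def mean_over_queries(values: Sequence[float]) -> float:
    """Average a per-query metric (e.g., to obtain MRR@10 from reciprocal ranks)."""
    return sum(values) / len(values) if values else 0.0
```

Averaging `reciprocal_rank_at_k` over all dev queries with `mean_over_queries` gives MRR@10, and likewise for nDCG@10.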

6.3. CE TO DE DISTILLATION

We considered the following distillation variants: standard score-based distillation from the [CLS]-pooled teacher, and distillation from our novel Dual-pooled CE teacher (with and without the embedding matching loss). For each variant, we initialize the encoders of the student DE model with two RoBERTa-base models and train for 500K steps on the training triples. We performed the naïve joint embedding matching for the [CLS]-pooled teacher (cf. Sec. 4.2) and employed the query embedding matching (cf. Eq. 5) for the Dual-pooled CE teacher. In either case, the embedding-matching loss is added on top of the standard cross-entropy loss with a weight of 1.0 (when used).

Results and discussion. Table 4 evaluates the effectiveness of dual pooling and embedding matching for CE to DE distillation. As described in Sec. 4.2, the traditional [CLS]-pooled teacher did not provide any useful embedding for embedding matching (see Appendix H.2 for further analysis of the resulting embedding space). However, with the Dual-pooled teacher, embedding matching does boost the student's performance.

7. CONCLUSION

We propose EmbedDistill, a novel distillation method for IR that goes beyond simple score matching. We specialize it to distill a DE model into another DE model by (a) reusing the teacher's document encoder in the student and (b) aligning the query embeddings of the teacher and student. This simple approach delivers immediate quality and computational gains in practical deployments, and we demonstrate them on the MSMARCO and NQ benchmarks. We show that a query generation technique further improves the performance of the distilled student. We generalize the proposed approach to distill a CE model to a DE model and show the benefits on MSMARCO. Finally, our theoretical analysis alludes to the favorable implications of both embedding matching and inheriting the document encoder in the DE to DE distillation setting. A more comprehensive and systematic analysis of embedding matching-based distillation for IR is an exciting avenue for future research.

A LOSS FUNCTIONS

Here, we state various (per-example) loss functions that most commonly define training objectives for IR models. Typically, one-hot training with the original labels is performed using the softmax-based cross-entropy loss:

$$\ell\big(s_{q_i,d_i}, y_i\big) = -\sum_{j \in [L]} y_{i,j} \cdot \log \frac{\exp(s(q_i, d_{i,j}))}{\sum_{j' \in [L]} \exp(s(q_i, d_{i,j'}))}. \tag{8}$$

Alternatively, it is also common to employ a one-vs-all loss function based on the binary cross-entropy loss:

$$\ell\big(s_{q_i,d_i}, y_i\big) = -\sum_{j \in [L]} \Big[ y_{i,j} \cdot \log \frac{1}{1 + \exp(-s(q_i, d_{i,j}))} + (1 - y_{i,j}) \cdot \log \frac{1}{1 + \exp(s(q_i, d_{i,j}))} \Big]. \tag{9}$$

As for distillation, one can define a distillation objective based on the softmax-based cross-entropy loss as:

$$\ell_d\big(s^s_{q_i,d_i}, s^t_{q_i,d_i}\big) = -\sum_{j \in [L]} \frac{\exp(s^t_{i,j})}{\sum_{j' \in [L]} \exp(s^t_{i,j'})} \cdot \log \frac{\exp(s^s_{i,j})}{\sum_{j' \in [L]} \exp(s^s_{i,j'})}, \tag{10}$$

where $s^t_{i,j} := s^t(q_i, d_{i,j})$ and $s^s_{i,j} := s^s(q_i, d_{i,j})$ denote the teacher and student scores, respectively. On the other hand, the distillation objective with the binary cross-entropy takes the form:

$$\ell_d\big(s^s_{q_i,d_i}, s^t_{q_i,d_i}\big) = -\sum_{j \in [L]} \Big[ \frac{1}{1 + \exp(-s^t_{i,j})} \cdot \log \frac{1}{1 + \exp(-s^s_{i,j})} + \frac{1}{1 + \exp(s^t_{i,j})} \cdot \log \frac{1}{1 + \exp(s^s_{i,j})} \Big]. \tag{11}$$

Finally, distillation based on the mean squared error (MSE) loss (also known as logit matching) employs the following loss function:

$$\ell_d\big(s^s_{q_i,d_i}, s^t_{q_i,d_i}\big) = \sum_{j \in [L]} \big( s^t(q_i, d_{i,j}) - s^s(q_i, d_{i,j}) \big)^2. \tag{12}$$
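The binary cross-entropy and MSE variants above translate directly into code. The sketch below is illustrative only: it rewrites each logarithm of a sigmoid as a softplus for numerical stability, and the function names are hypothetical rather than taken from any released implementation.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_one_hot_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """One-vs-all binary cross-entropy on labels.

    Uses -log sigmoid(s) = softplus(-s) and -log(1 - sigmoid(s)) = softplus(s).
    """
    return sum(y * softplus(-s) + (1 - y) * softplus(s)
               for s, y in zip(scores, labels))


def bce_distillation_loss(student_scores: Sequence[float],
                          teacher_scores: Sequence[float]) -> float:
    """Binary cross-entropy with the teacher's sigmoid outputs as soft labels."""
    return sum(sigmoid(t) * softplus(-s) + sigmoid(-t) * softplus(s)
               for s, t in zip(student_scores, teacher_scores))


def mse_distillation_loss(student_scores: Sequence[float],
                          teacher_scores: Sequence[float]) -> float:
    """Logit matching: sum of squared score differences."""
    return sum((t - s) ** 2 for s, t in zip(student_scores, teacher_scores))
```

Replacing the hard label with the teacher's sigmoid output is the only difference between the one-hot and distillation forms of the binary cross-entropy.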

B DUAL POOLING DETAILS

In this work, we focus on two kinds of dual pooling strategies:

• Special tokens-based dual pooling. Let $\mathrm{pool}_{\mathrm{CLS}}$ and $\mathrm{pool}_{\mathrm{SEP}}$ denote the pooling operations that return the embeddings of the [CLS] and [SEP] tokens, respectively. We define

$$\mathrm{emb}^t_{q \leftarrow (q,d)} = \mathrm{pool}_{\mathrm{CLS}}\big(\mathrm{Enc}^t(\tilde{o})\big), \qquad \mathrm{emb}^t_{d \leftarrow (q,d)} = \mathrm{pool}_{\mathrm{SEP}}\big(\mathrm{Enc}^t(\tilde{o})\big),$$

where $\tilde{o}$ denotes the input token sequence to the Transformers-based encoder, which consists of {query, document, special} tokens.
• Segment-based weighted-mean dual pooling. Let $\mathrm{Enc}^t(\tilde{o})|_Q$ and $\mathrm{Enc}^t(\tilde{o})|_D$ denote the final query token embeddings and document token embeddings produced by the encoder, respectively. We define the proxy query and document embeddings

$$\mathrm{emb}^t_{q \leftarrow (q,d)} = \mathrm{mean}_{\mathrm{wt}}\big(\mathrm{Enc}^t(\tilde{o})|_Q\big), \qquad \mathrm{emb}^t_{d \leftarrow (q,d)} = \mathrm{mean}_{\mathrm{wt}}\big(\mathrm{Enc}^t(\tilde{o})|_D\big),$$

where $\mathrm{mean}_{\mathrm{wt}}(\cdot)$ denotes the weighted mean operation. We employ the specific weighting scheme where each token receives a weight equal to the inverse of the square root of the token-sequence length.

C DEFERRED DETAILS AND PROOFS FROM SECTION 5

In this section, we present more precise statements and proofs of Propositions 1 and 2 (stated informally in Section 5 of the main text) along with the necessary background. First, for ease of exposition, we define new notation that will facilitate the theoretical analysis in this section.

Notation. Denote the query and document encoders as $f : \mathcal{Q} \to \mathbb{R}^k$ and $g : \mathcal{D} \to \mathbb{R}^k$ for the student, and $F : \mathcal{Q} \to \mathbb{R}^k$, $G : \mathcal{D} \to \mathbb{R}^k$ for the teacher (in the dual-encoder setting). With $q$ denoting a query and $d$ denoting a document, $f(q)$ and $g(d)$ then denote the query and document embeddings produced by the student. We define $F(q)$ and $G(d)$ similarly for the embeddings produced by the teacher. Note that, as per the notation in the main text, we have $(f, g) = (\mathrm{Enc}^s_Q, \mathrm{Enc}^s_D)$ and $(F, G) = (\mathrm{Enc}^t_Q, \mathrm{Enc}^t_D)$. Similarly, we have $(\mathrm{emb}^s_q, \mathrm{emb}^s_d) = (f(q), g(d))$ and $(\mathrm{emb}^t_q, \mathrm{emb}^t_d) = (F(q), G(d))$.

C.1 BOUND ON THE EXPECTED RISK

Proposition 3 (Formal statement of Proposition 1). Given an example $(q, d, y) \in \mathcal{Q} \times \mathcal{D} \times \{0, 1\}$, let $s^{f,g}(q, d) = f(q)^\top g(d)$ be the score assigned to the $(q, d)$ pair by a dual-encoder model with $f$ and $g$ as query and document encoders, respectively. Let $\ell$ and $\ell_d$ be the binary cross-entropy loss (cf. Eq. 9 with $L = 1$) and the distillation-specific loss based on it (cf. Eq. 11 with $L = 1$), respectively. In particular,

$$\ell\big(s^{F,G}(q, d), y\big) := -y \log \sigma\big(F(q)^\top G(d)\big) - (1 - y) \log\big[1 - \sigma\big(F(q)^\top G(d)\big)\big],$$

$$\ell_d\big(s^{f,g}(q, d), s^{F,G}(q, d)\big) := -\sigma\big(F(q)^\top G(d)\big) \cdot \log \sigma\big(f(q)^\top g(d)\big) - \big[1 - \sigma\big(F(q)^\top G(d)\big)\big] \cdot \log\big[1 - \sigma\big(f(q)^\top g(d)\big)\big],$$

where $\sigma$ is the sigmoid function. Assume that

1. All encoders $f$, $g$, $F$, and $G$ have the same output dimension $k \ge 1$.
2. There exist $K_Q, K_D \in (0, \infty)$ such that $\sup_{q \in \mathcal{Q}} \max\{\|f(q)\|_2, \|F(q)\|_2\} \le K_Q$ and $\sup_{d \in \mathcal{D}} \max\{\|g(d)\|_2, \|G(d)\|_2\} \le K_D$.

Given a sample $\{(q_i, d_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P(q, d, y)$, we have

$$\mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} \ell_d\big(s^{f,g}(q_i, d_i), s^{F,G}(q_i, d_i)\big)\Big] - \mathbb{E}\big[\ell\big(s^{F,G}(q, d), y\big)\big] \le 2K_Q\, \mathbb{E}\big[\|g(d) - G(d)\|_2\big] + 2K_D\, \mathbb{E}\big[\|f(q) - F(q)\|_2\big] + K_Q K_D\, \mathbb{E}_{(q,d,y)}\big[\big|\sigma\big(F(q)^\top G(d)\big) - y\big|\big].$$

Proof of Proposition 3. We first note that the distillation loss can be rewritten as

$$\ell_d\big(s^{f,g}(q, d), s^{F,G}(q, d)\big) = \big(1 - \sigma(F(q)^\top G(d))\big)\, f(q)^\top g(d) + \gamma\big(-f(q)^\top g(d)\big),$$

where $\gamma(v) := \log[1 + e^v]$ is the softplus function. Similarly, the one-hot (label-dependent) loss can be rewritten as

$$\ell\big(s^{F,G}(q, d), y\big) = (1 - y)\, F(q)^\top G(d) + \gamma\big(-F(q)^\top G(d)\big).$$

As a shorthand, define

$$\hat{R}(f, g) := \frac{1}{n} \sum_{i=1}^{n} \ell_d\big(s^{f,g}(q_i, d_i), s^{F,G}(q_i, d_i)\big), \qquad R(F, G) := \mathbb{E}_{(q,d,y) \sim P(q,d,y)}\big[\ell\big(s^{F,G}(q, d), y\big)\big],$$

as the empirical risk based on the distillation loss and the population risk based on the label-dependent loss, respectively. With this notation, the quantity to upper bound can be rewritten as

$$\mathbb{E}\big[\hat{R}(f, g) - R(F, G)\big] = \mathbb{E}\Big[\underbrace{\hat{R}(f, g) - \hat{R}(f, G)}_{=:\, \square_1} + \underbrace{\hat{R}(f, G) - \hat{R}(F, G)}_{=:\, \square_2} + \underbrace{\hat{R}(F, G) - R(F, G)}_{=:\, \square_3}\Big]. \tag{15}$$
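We start by bounding $\mathbb{E}[\square_1]$ as
$$\begin{aligned}
\mathbb{E}[\square_1] &= \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} \ell_d\big(s_{f,g}(q_i, d_i), s_{F,G}(q_i, d_i)\big) - \frac{1}{n} \sum_{i=1}^{n} \ell_d\big(s_{f,G}(q_i, d_i), s_{F,G}(q_i, d_i)\big)\Big] \\
&= \mathbb{E}\big[\ell_d\big(s_{f,g}(q, d), s_{F,G}(q, d)\big) - \ell_d\big(s_{f,G}(q, d), s_{F,G}(q, d)\big)\big] \\
&= \mathbb{E}\big[\big(1 - \sigma(F(q)^\top G(d))\big)\, f(q)^\top g(d) + \gamma\big(-f(q)^\top g(d)\big)\big] - \mathbb{E}\big[\big(1 - \sigma(F(q)^\top G(d))\big)\, f(q)^\top G(d) + \gamma\big(-f(q)^\top G(d)\big)\big] \\
&= \mathbb{E}\big[f(q)^\top \big(g(d) - G(d)\big)\big(1 - \sigma(F(q)^\top G(d))\big) + \gamma\big(-f(q)^\top g(d)\big) - \gamma\big(-f(q)^\top G(d)\big)\big] \\
&\overset{(a)}{\le} \mathbb{E}\big[f(q)^\top \big(g(d) - G(d)\big)\big(1 - \sigma(F(q)^\top G(d))\big) + \big|f(q)^\top g(d) - f(q)^\top G(d)\big|\big] \\
&\overset{(b)}{\le} \mathbb{E}\big[\|f(q)\|_2\, \|g(d) - G(d)\|_2\, \big(1 - \sigma(F(q)^\top G(d))\big) + \|f(q)\|_2\, \|g(d) - G(d)\|_2\big] \\
&\le K_Q\, \mathbb{E}\big[\|g(d) - G(d)\|_2\, \big(2 - \sigma(F(q)^\top G(d))\big)\big] \\
&\le 2K_Q\, \mathbb{E}\big[\|g(d) - G(d)\|_2\big],
\end{aligned} \tag{16}$$
where at (a) we use the fact that $\gamma$ is a Lipschitz continuous function with Lipschitz constant 1, and at (b) we use the Cauchy-Schwarz inequality. Similarly for $\mathbb{E}[\square_2]$, we proceed as
$$\begin{aligned}
\mathbb{E}[\square_2] &= \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} \ell_d\big(s_{f,G}(q_i, d_i), s_{F,G}(q_i, d_i)\big) - \frac{1}{n} \sum_{i=1}^{n} \ell_d\big(s_{F,G}(q_i, d_i), s_{F,G}(q_i, d_i)\big)\Big] \\
&= \mathbb{E}\big[\big(1 - \sigma(F(q)^\top G(d))\big)\, f(q)^\top G(d) + \gamma\big(-f(q)^\top G(d)\big)\big] - \mathbb{E}\big[\big(1 - \sigma(F(q)^\top G(d))\big)\, F(q)^\top G(d) + \gamma\big(-F(q)^\top G(d)\big)\big] \\
&= \mathbb{E}\big[G(d)^\top \big(f(q) - F(q)\big)\big(1 - \sigma(F(q)^\top G(d))\big) + \gamma\big(-f(q)^\top G(d)\big) - \gamma\big(-F(q)^\top G(d)\big)\big] \\
&\le \mathbb{E}\big[\|G(d)\|_2\, \|f(q) - F(q)\|_2 + \big|f(q)^\top G(d) - F(q)^\top G(d)\big|\big] \\
&\le 2K_D\, \mathbb{E}\big[\|f(q) - F(q)\|_2\big].
\end{aligned} \tag{17}$$
$\mathbb{E}[\square_3]$ can be bounded as
$$\begin{aligned}
\mathbb{E}[\square_3] &= \mathbb{E}\big[\widehat{R}(F, G) - R(F, G)\big] \\
&= \mathbb{E}_{(q,d,y)}\big[\ell_d\big(s_{F,G}(q, d), s_{F,G}(q, d)\big) - \ell\big(s_{F,G}(q, d), y\big)\big] \\
&= \mathbb{E}_{(q,d,y)}\big[\big(1 - \sigma(F(q)^\top G(d))\big)\, F(q)^\top G(d) + \gamma\big(-F(q)^\top G(d)\big)\big] - \mathbb{E}_{(q,d,y)}\big[(1 - y)\, F(q)^\top G(d) + \gamma\big(-F(q)^\top G(d)\big)\big] \\
&= \mathbb{E}_{(q,d,y)}\big[\big(\big(1 - \sigma(F(q)^\top G(d))\big) - (1 - y)\big)\, F(q)^\top G(d)\big] \\
&\le K_Q K_D\, \mathbb{E}_{(q,d,y)}\big[\big|\sigma(F(q)^\top G(d)) - y\big|\big].
\end{aligned} \tag{18}$$
Combining (15), (16), (17), and (18) gives the result.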
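As a sanity check on the proof above, the following minimal Python sketch numerically compares the cross-entropy form of the distillation loss with its softplus rewriting, and then checks the pointwise inequality behind the first term of the bound on randomly drawn vectors. The random vectors are only hypothetical stand-ins for encoder outputs, and the norm bound is assumed to be $K_Q = K_D = 1$; nothing here reproduces the authors' code.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softplus(v):
    # gamma(v) = log(1 + e^v), written in a numerically stable form.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_norm(a):
    return math.sqrt(dot(a, a))


def distill_loss(s_student, s_teacher):
    # Cross-entropy between the teacher and student sigmoid probabilities.
    p_t = sigmoid(s_teacher)
    p_s = sigmoid(s_student)
    return -p_t * math.log(p_s) - (1.0 - p_t) * math.log(1.0 - p_s)


def distill_loss_softplus(s_student, s_teacher):
    # Rewritten form: (1 - sigma(teacher)) * student + gamma(-student).
    return (1.0 - sigmoid(s_teacher)) * s_student + softplus(-s_student)


def sample_in_unit_ball(k, rng):
    # Hypothetical stand-in for an encoder output with norm at most 1.
    v = [rng.gauss(0.0, 1.0) for _ in range(k)]
    scale = rng.random() / l2_norm(v)
    return [scale * x for x in v]


def max_violation_first_term(trials=1000, k=8, seed=0):
    # Largest observed value of
    #   [loss(s_{f,g}) - loss(s_{f,G})] - 2 * K_Q * ||g - G||_2   (K_Q = 1).
    # A non-positive result means the inequality held on every draw.
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        f, g, F, G = (sample_in_unit_ball(k, rng) for _ in range(4))
        teacher = dot(F, G)
        gap = distill_loss(dot(f, g), teacher) - distill_loss(dot(f, G), teacher)
        bound = 2.0 * l2_norm([a - b for a, b in zip(g, G)])
        worst = max(worst, gap - bound)
    return worst
```

Such a check only probes the inequality on sampled points; it does not replace the proof.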

C.2 UNIFORM DEVIATION BOUND

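Let $\mathcal{F}$ denote the class of functions that map queries in $\mathcal{Q}$ to their embeddings in $\mathbb{R}^k$ via the query encoder. Define $\mathcal{G}$ analogously for the document encoder, which consists of functions that map documents in $\mathcal{D}$ to their embeddings in $\mathbb{R}^k$. To simplify exposition, we assume that each training example consists of a single relevant or irrelevant document for each query, i.e., L = 1 in Section 3. Let
$$\mathcal{FG} = \{(q, d) \mapsto f(q)^\top g(d) \mid f \in \mathcal{F},\, g \in \mathcal{G}\}.$$
Given $S_n = \{(q_i, d_i, y_i) : i \in [n]\}$, let $N(\epsilon, \mathcal{H})$ denote the $\epsilon$-covering number of a function class $\mathcal{H}$ with respect to the $L_2(P_n)$ norm, where
$$\|h\|^2_{L_2(P_n)} := \|h\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \|h(q_i, d_i)\|_2^2.$$
Depending on the context, the functions in $\mathcal{H}$ may map to $\mathbb{R}$ or $\mathbb{R}^d$.

To bound the first term, using the Cauchy-Schwarz inequality, we can write
$$\frac{1}{n} \sum_{i=1}^{n} \big(f(q_i)^\top g(d_i) - f(q_i)^\top \tilde{g}(d_i)\big)^2 \le \sup_{q \in \mathcal{Q}} \|f(q)\|_2^2 \cdot \frac{1}{n} \sum_{i=1}^{n} \|(g - \tilde{g})(d_i)\|_2^2.$$
Therefore
$$\|f^\top g - f^\top \tilde{g}\|_n \le K_Q \|g - \tilde{g}\|_n \le K_Q\, \epsilon.$$
Similarly,
$$\|f^\top \tilde{g} - \tilde{f}^\top \tilde{g}\|_n \le K_D \|f - \tilde{f}\|_n \le K_D\, \epsilon.$$
Plugging these into Eq. 21, we get
$$\|f^\top g - \tilde{f}^\top \tilde{g}\|_n \le (K_Q + K_D)\, \epsilon.$$
This completes the proof.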
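The covering argument above can be illustrated numerically. The sketch below assumes a second pair of embedding functions (called `f_t` and `g_t` here, a notation chosen only for this example) that is close to the first pair in the empirical norm, and checks that the product scores differ by at most $K_Q \|g - \tilde{g}\|_n + K_D \|f - \tilde{f}\|_n$. All vectors are random stand-ins and the constants are arbitrary.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_in_ball(k, radius, rng):
    # Random vector with Euclidean norm at most `radius`.
    v = [rng.gauss(0.0, 1.0) for _ in range(k)]
    scale = radius * rng.random() / math.sqrt(dot(v, v))
    return [scale * x for x in v]


def perturb(v, eps, rng):
    # Small coordinate-wise perturbation, standing in for a nearby cover element.
    return [x + rng.uniform(-eps, eps) for x in v]


def empirical_norm(rows):
    # ||h||_n = sqrt((1/n) * sum_i ||h_i||_2^2); each row is a list of floats.
    return math.sqrt(sum(dot(r, r) for r in rows) / len(rows))


def diff_rows(xs, ys):
    return [[a - b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def check_product_bound(n=200, k=8, K_Q=2.0, K_D=3.0, eps=0.05, seed=0):
    rng = random.Random(seed)
    f = [sample_in_ball(k, K_Q, rng) for _ in range(n)]    # ||f(q_i)|| <= K_Q
    g_t = [sample_in_ball(k, K_D, rng) for _ in range(n)]  # ||g_t(d_i)|| <= K_D
    f_t = [perturb(v, eps, rng) for v in f]
    g = [perturb(v, eps, rng) for v in g_t]
    # Left side: empirical norm of the score difference f.g - f_t.g_t.
    lhs = empirical_norm(
        [[dot(a, b) - dot(c, d)] for a, b, c, d in zip(f, g, f_t, g_t)]
    )
    # Right side: K_Q * ||g - g_t||_n + K_D * ||f - f_t||_n.
    rhs = (K_Q * empirical_norm(diff_rows(g, g_t))
           + K_D * empirical_norm(diff_rows(f, f_t)))
    return lhs, rhs
```

The two norm bounds are applied exactly where the argument needs them: $K_Q$ bounds the first function of the pair being covered, and $K_D$ bounds the second function of the covering pair.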

D EVALUATION METRIC DETAILS

For NQ, we evaluate models with the full (strict) recall metric, meaning that the model is required to find a golden passage from the whole set of candidates (21M). Specifically, for k ≥ 1, recall@k or R@k denotes the percentage of questions for which the associated golden passage is among the k passages that receive the highest relevance scores from the model. In addition, we present results for the relaxed recall metric considered by Karpukhin et al. (2020), where R@k denotes the percentage of questions for which the corresponding answer string is present in at least one of the k passages with the highest model (relevance) scores. For MSMARCO, we follow the standard evaluation metrics Mean Reciprocal Rank (MRR)@10 and normalized Discounted Cumulative Gain (nDCG)@10, which are computed with respect to the 1000 candidate passages generated by BM25 for each query. We report 100 × MRR@10 and 100 × nDCG@10, as per the convention followed in prior work.
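The metrics described above can be sketched for a single query as follows. This is a minimal illustration, not the evaluation script used in the paper: the function names and the passage identifiers are hypothetical, and the nDCG variant shown uses linear gains with a log2 discount, which is one common convention (it coincides with the exponential-gain variant for binary labels).

```python
import math


def recall_at_k(ranked_ids, gold_ids, k):
    # 1.0 if any gold passage is among the top-k ranked passages, else 0.0.
    return 1.0 if any(pid in gold_ids for pid in ranked_ids[:k]) else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    # Reciprocal rank of the first relevant passage within the top k.
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gains, k=10):
    # gains maps passage id -> graded relevance; missing ids count as 0.
    dcg = sum(
        gains.get(pid, 0.0) / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def report(per_query_values):
    # Mean over queries, scaled by 100 as in the reported numbers.
    return 100.0 * sum(per_query_values) / len(per_query_values)
```

For the relaxed recall variant, `gold_ids` would be replaced by the set of passages that contain the answer string; the ranking logic stays the same.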

E QUERY GENERATION DETAILS

We introduced query generation to encourage geometric matching in local regions, which can aid in transferring more knowledge in confusing neighborhoods. As expected, this further improves the distillation effectiveness on top of embedding matching. To focus on the local regions, we generate queries from the observed examples by adding local perturbations in the data manifold (embedding space). Specifically, we employ an off-the-shelf encoder-decoder model (BART). First, we embed an observed query from the corresponding dataset. Second, we add a small perturbation to the query embedding. Finally, we decode the perturbed embedding to generate a new query in the input space. Formally, the generated query $x'$ given an original query $x$ takes the form $x' = \mathrm{Dec}(\mathrm{Enc}(x) + \epsilon)$, where $\mathrm{Enc}(\cdot)$ and $\mathrm{Dec}(\cdot)$ correspond to the encoder and the decoder of the off-the-shelf model, respectively, and $\epsilon$ is isotropic Gaussian noise. Furthermore, we also randomly mask the original query tokens with a small probability. We generate two new queries from each observed query and use them as additional data points during our distillation procedure. As a comparison, we tried adding the same number of randomly sampled queries instead of the ones generated via the method described above. That did not show any benefit, which supports the use of our query generation method.
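The generation procedure can be sketched as below. This is a schematic under several assumptions: `encode` and `decode` are hypothetical callables standing in for the encoder and decoder of the off-the-shelf model (no real BART API is called), masking is assumed to be applied to the tokens before encoding, and the noise scale `sigma` and masking probability `mask_prob` are illustrative values not taken from the paper.

```python
import random


def mask_tokens(tokens, mask_prob, rng, mask_token="<mask>"):
    # Replace each token by the mask token independently with prob. mask_prob.
    return [mask_token if rng.random() < mask_prob else t for t in tokens]


def add_isotropic_noise(hidden, sigma, rng):
    # Add independent N(0, sigma^2) noise to every coordinate of every vector.
    return [[x + rng.gauss(0.0, sigma) for x in vec] for vec in hidden]


def generate_queries(query, encode, decode, num_new=2,
                     sigma=0.1, mask_prob=0.1, seed=0):
    # Implements x' = Dec(Enc(x) + eps) for num_new independent draws of eps.
    rng = random.Random(seed)
    tokens = query.split()
    new_queries = []
    for _ in range(num_new):
        masked = mask_tokens(tokens, mask_prob, rng)
        hidden = encode(masked)
        new_queries.append(decode(add_isotropic_noise(hidden, sigma, rng)))
    return new_queries


# Toy stand-ins so the sketch runs end to end; a real setup would plug in
# the encoder and decoder of a pre-trained sequence-to-sequence model.
def toy_encode(tokens):
    return [[float(len(t))] for t in tokens]


def toy_decode(hidden):
    return " ".join(str(round(vec[0])) for vec in hidden)
```

With `num_new=2`, the sketch mirrors the setting of two generated queries per observed query described above.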

F ADDITIONAL TRAINING DETAILS

Training for teacher models. For the teacher DE model on NQ, we initialize its question and document encoders with two pre-trained RoBERTa-base models (12 layers). Following Oguz et al. (2021), the model is further pre-trained on PAQ (Lewis et al., 2021) for 800K steps, and then fine-tuned on the NQ train set with the help of in-batch negatives (Karpukhin et al., 2020) for 40K steps. As for the teacher DE model on MSMARCO, it is known that directly training a DE model on the MSMARCO training set leads to poor performance (Menon et al., 2022). Thus, we first train a [CLS]-pooled CE model on triples in the MSMARCO training set using the cross-entropy loss. We subsequently use the same triples to distill the resulting CE model to a DE model that has two pre-trained RoBERTa-base models as its two encoders. We utilize the cross-entropy based distillation loss in Eq. 10.

Optimization. For all of our experiments, we use the Adam optimizer with weight decay, a short warm-up period, and a linear decay schedule. We use initial learning rates of $10^{-5}$ and $2.8 \times 10^{-5}$ for experiments on NQ and MSMARCO, respectively. We use a batch size of 128.
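A warm-up followed by linear decay can be written as a simple function of the training step. The sketch below is only illustrative: the warm-up length and total number of steps are hypothetical values (the text does not state them), and the stated initial learning rate is assumed to be the peak reached at the end of warm-up.

```python
def linear_warmup_decay_lr(step, total_steps, warmup_steps, peak_lr):
    # Linear ramp from 0 to peak_lr over warmup_steps, then linear decay to 0.
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical configuration: 1K warm-up steps out of 40K total steps,
# with the NQ learning rate from the text as the peak value.
NQ_PEAK_LR = 1e-5
schedule = [
    linear_warmup_decay_lr(s, total_steps=40_000, warmup_steps=1_000,
                           peak_lr=NQ_PEAK_LR)
    for s in (0, 500, 1_000, 20_500, 40_000)
]
```

In practice such a function would be passed to the learning-rate scheduler of the training framework in use.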

G ADDITIONAL EXPERIMENT RESULTS

G.1 ADDITIONAL EXPERIMENT RESULTS ON NQ

See Table 5 for the performance of various DE models on NQ, as measured by the relaxed recall metric. All distilled students used the same teacher (110M parameter RoBERTa-base models as both encoders), whose relaxed recall performance is Recall@5 = 82.5, Recall@20 = 92.6, and Recall@100 = 97.1.

G.2 ADDITIONAL EXPERIMENT RESULTS ON MSMARCO

See Table 6 for CE to DE distillation results on MSMARCO, as measured by the nDCG@10 metric.

Traditional score matching-based distillation might not result in transfer of relative geometry from teacher to student. To assess this, we look at the discrepancy between the teacher and student query embeddings for all $(q, q')$ pairs: $\|\mathrm{emb}^t_q - \mathrm{emb}^t_{q'}\| - \|\mathrm{emb}^s_q - \mathrm{emb}^s_{q'}\|$. Note that the analysis is based on NQ, and we focus on the teacher and student DE models based on RoBERTa-base



It is common to employ dual-encoder models where query and document encoders are shared.

We focus on the DE to DE distillation setup, as CE to CE distillation is a special case of the former, with the classification vector w (cf. Eq. 2) being the trivial second encoder.

It is common to employ temperature scaling with the softmax operation. We do not explicitly show the temperature parameter for ease of exposition.



Figure 1: Illustration of score-based distillation for IR (cf. Section 3.2). Fig. 1a describes distillation from a teacher [CLS]-pooled CE model to a student DE model. Fig. 1b depicts distillation from a teacher DE model to a student DE model. Both distillation setups employ symmetric DE configurations where the query and document encoders of the student model are of the same size.

Figure 3: Illustration of CE to DE distillation using EmbedDistill, with CE model employing dual pooling.

). As for distilling across different model architectures, Lu et al. (2020); Izacard & Grave (2021) consider distillation from a teacher CE model to a student DE model. Hofstätter et al. (

$\mathrm{Enc}^s_D$) denote the student DE model's query and document encoders. When distilling from a CE to a DE model, defining an effective embedding matching task is not as straightforward as in Sec. 4.1: since CE models jointly encode query-document pairs, it is not obvious how to extract individual query and document embeddings.

Full recall performance of various student DE models on NQ dev set, including symmetric DE student model (82M or 16M transformer for both encoders), and asymmetric DE student model (82M or 16M transformer as query encoder and document embeddings inherited from the teacher). All distilled students used the same teacher (110M parameter RoBERTa-base models as both encoders), with the full Recall@5 = 64.6, Recall@20 = 81.7, and Recall@100 = 91.5.

term vanishes. These observations also justify EmbedDistill, which employs either Eq. 4 or Eq. 5 (when the student inherits the teacher's document encoder). Now, we analyze the deviation of a student model from its training empirical risk at test time, which is typically captured by uniform deviation bounds based on quantities like Rademacher complexity. Again, we restrict ourselves to the binary cross-entropy loss with L = 1 for simplicity. Due to space constraints, we present an informal statement of the result (see Appendix C.2 for a more precise statement and proof).

Performance of EmbedDistill for DE to DE distillation on NQ test set. Note that the prior works mentioned in the table rely on techniques such as negative mining and multi-stage training. In contrast, we explore the orthogonal direction of embedding matching, which improves single-stage distillation and can be combined with the aforementioned techniques.

Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, pp. 113-122, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380379. doi: 10.1145/3404835.3462891. URL https:

Relaxed recall performance of various student DE models on NQ dev set, including symmetric DE student model (82M or 16M transformer for both encoders), and asymmetric DE student model (82M or 16M transformer as query encoder and document embeddings inherited from the teacher)

Performance of CE to DE distillation on MSMARCO, as measured by the nDCG@10 metric. As for the teacher CE models, we consider two kinds of CE models based on two different pooling mechanisms.


Proposition 4. Let $s^t$ be the scorer of a teacher model and $\ell_d$ be a distillation loss function which is $L_d$-Lipschitz in its first argument. Let the embedding functions in $\mathcal{F}$ and $\mathcal{G}$ output vectors with $\ell_2$ norms at most $K$. Define the uniform deviation [...] $\ell_d\big(f(q_i)^\top g(d_i), s^t_{q_i,d_i}\big) - \mathbb{E}_{q,d}\, \ell_d\big(f(q)^\top g(d), s^t_{q,d}\big)$. For any $g^* \in \mathcal{G}$, we have [...]

Proof of Proposition 4. We first symmetrize the excess risk to get the Rademacher complexity, then bound the Rademacher complexity with Dudley's entropy integral. For a training set $S_n$, the empirical Rademacher complexity of a class of functions $\mathcal{H}$ that maps [...] where $\{\varepsilon_i\}$ denote i.i.d. Rademacher random variables taking values in $\{+1, -1\}$ with equal probability. By symmetrization (Bousquet et al., 2004) and the fact that [...] Then, Dudley's entropy integral (see, e.g., Ledoux & Talagrand, 1991) gives [...] From Lemma 1 with [...] Putting these together, [...] Following the same steps with $\mathcal{G}$ replaced by $\{g^*\}$, we get [...] By changing variables in Eq. 19 and Eq. 20, we get the stated bounds.

For [...] Decomposing using the triangle inequality,

and DistilRoBERTa, respectively. As evident from Fig. 4, the embedding matching loss significantly reduces this discrepancy.
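The teacher-student discrepancy discussed in connection with Fig. 4 can be computed with a few lines of code. The sketch below assumes the discrepancy for a pair of queries is the difference between the teacher's and the student's pairwise Euclidean distances; the helper names and the choice of the mean absolute value as a summary statistic are illustrative, not taken from the paper.

```python
import math
from itertools import combinations


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_discrepancies(teacher_embs, student_embs):
    # For every pair of queries (i, j), teacher distance minus student distance.
    # teacher_embs[i] and student_embs[i] must embed the same query; the two
    # models may use different embedding dimensions.
    if len(teacher_embs) != len(student_embs):
        raise ValueError("teacher and student must embed the same queries")
    return [
        l2_distance(teacher_embs[i], teacher_embs[j])
        - l2_distance(student_embs[i], student_embs[j])
        for i, j in combinations(range(len(teacher_embs)), 2)
    ]


def mean_abs_discrepancy(teacher_embs, student_embs):
    # One possible scalar summary of how well relative geometry is preserved.
    gaps = pairwise_discrepancies(teacher_embs, student_embs)
    return sum(abs(g) for g in gaps) / len(gaps) if gaps else 0.0
```

A value near zero for every pair indicates that the student preserves the teacher's relative geometry among queries, which is what embedding matching is intended to encourage.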

H.2 CE TO DE DISTILLATION

We qualitatively look at the embeddings from the CE model in Fig. 5. The embedding $\mathrm{emb}^t_{q,d}$ from the [CLS]-pooled CE model does not capture semantic similarity between query and document, as it is solely trained to classify whether the query-document pair is relevant or not. In contrast, the (proxy) query embeddings $\mathrm{emb}^t_{q \leftarrow (q,d)}$ from our dual-pooled CE model with reconstruction loss do not degenerate, and they group the same query together whether it is conditioned on a positive or a negative document. Furthermore, other related queries are closer than unrelated queries. Such an informative embedding space would aid distillation to a DE model via embedding matching.

