EMBEDDISTILL: A GEOMETRIC KNOWLEDGE DISTILLATION FOR INFORMATION RETRIEVAL

Abstract

Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval. In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the information retrieval literature, which simply rely on the teacher's scalar scores over the training data, on two fronts: providing stronger signals about local geometry via embedding matching, and attaining better coverage of the data manifold globally via query generation. Embedding matching provides a stronger signal to align the representations of the teacher and student models, while query generation explores the data manifold to reduce the discrepancies between the student and teacher where the training data is sparse. Our distillation approach is theoretically justified and applies to both dual encoder (DE) and cross-encoder (CE) models. Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual pooling-based scorer for the CE model that facilitates a more distillation-friendly embedding geometry, especially for DE student models.

1. INTRODUCTION

Neural models for information retrieval (IR) are increasingly used to capture the true ranking in various applications, including web search (Mitra & Craswell, 2018), recommendation (Zhang et al., 2019), and question-answering (QA; Chen et al., 2017). Notably, the recent success of Transformers-based (Vaswani et al., 2017) pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) on a wide range of natural language understanding tasks has prompted their utilization in IR to capture query-document relevance (see, e.g., Dai & Callan, 2019b; MacAvaney et al., 2019a; Nogueira & Cho, 2019; Lee et al., 2019; Karpukhin et al., 2020, and references therein).

A typical IR system comprises two stages: (1) a retriever first selects a small subset of potentially relevant candidate documents (out of a large collection) for a given query; and (2) a re-ranker then identifies a precise ranking among the candidates provided by the retriever. Dual-encoder (DE) models are the de facto architecture for retrievers (Lee et al., 2019; Karpukhin et al., 2020). Such models independently embed queries and documents into a common space, and capture their relevance by simple operations on these embeddings, such as the inner product. This enables offline creation of a document index and supports fast retrieval during inference via efficient maximum inner product search (MIPS) implementations (Johnson et al., 2021; Guo et al., 2020), with query embedding generation primarily dictating the inference latency. Cross-encoder (CE) models, on the other hand, are preferred as re-rankers owing to their excellent performance (Nogueira & Cho, 2019; Dai & Callan, 2019a; Yilmaz et al., 2019). A CE model jointly encodes a query-document pair, enabling early interaction between query and document text. Employing a CE model for retrieval is often infeasible, as it would require processing a given query with every document in the collection at inference time.
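The DE retrieval pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: fixed random linear maps stand in for the Transformer query and document towers, and brute-force scoring replaces an approximate MIPS library; all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, embed_dim = 16, 8

# Stand-in encoders: fixed linear maps in place of Transformer towers.
W_query = rng.normal(size=(feat_dim, embed_dim))
W_doc = rng.normal(size=(feat_dim, embed_dim))

def encode_query(x: np.ndarray) -> np.ndarray:
    """Embed a query into the shared space."""
    return x @ W_query

def encode_doc(x: np.ndarray) -> np.ndarray:
    """Embed documents into the shared space."""
    return x @ W_doc

# Offline: embed the whole collection once to build the document index.
docs = rng.normal(size=(100, feat_dim))
doc_index = encode_doc(docs)                 # shape (100, embed_dim)

# Online: embed the query, then score by inner product (brute-force MIPS).
query = rng.normal(size=(feat_dim,))
q_emb = encode_query(query)
scores = doc_index @ q_emb                   # one relevance score per document
top_k = np.argsort(-scores)[:5]              # indices of the 5 best documents
```

Because the index is built offline, only `encode_query` runs at query time, which is why query-encoder latency dominates inference cost in this setup.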
In fact, even in the re-ranking stage, the inference cost of CE models is high enough (Khattab & Zaharia, 2020) to warrant exploration of efficient alternatives (Hofstätter et al., 2020; Khattab & Zaharia, 2020; Menon et al., 2022). Knowledge distillation (Bucilǎ et al., 2006; Hinton et al., 2015) provides a general strategy to address the prohibitive inference cost associated with high-quality large neural models. In the IR literature, most existing distillation methods rely only on the teacher's query-document relevance scores (see, e.g., Chen et al., 2021; Lu et al., 2020; Hofstätter et al., 2020; Ren et al., 2021; Santhanam et al., 2021) or their proxies (Izacard & Grave, 2021). However, given that neural IR models are inherently embedding-based, it is natural to ask: is it useful to go beyond matching the teacher and student models' scores, and directly aim to align their embedding spaces?

With this in mind, we propose a novel distillation method for IR models that utilizes an embedding matching task to train the student. The proposed method supports cross-architecture distillation and improves upon existing distillation methods for both retriever and re-ranker models. When distilling a large DE model into a smaller DE model, for a given query (document), one can minimize the distance between the query (document) embeddings of the teacher and student after compatible projection layers that account for any dimensionality mismatch. In contrast, defining an embedding matching task for distilling a CE model into a DE model is not as straightforward. For Transformers-based CE models, it is common to use the final embedding of a special token, e.g., [CLS] in BERT (Devlin et al., 2019), to compute query-document relevance (Nogueira & Cho, 2019). However, as we note in Sec. 4.2, this token embedding does not capture semantic similarity between the query and document.
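The DE-to-DE embedding matching idea can be sketched as a simple loss term. This is a hedged illustration under stated assumptions: the embeddings are random placeholders, and the linear projection bridging the dimensionality mismatch is random here, whereas during distillation it would be learned jointly with the student; the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
teacher_dim, student_dim, batch = 12, 6, 4

# Hypothetical teacher and student query embeddings for one training batch.
teacher_emb = rng.normal(size=(batch, teacher_dim))
student_emb = rng.normal(size=(batch, student_dim))

# Projection mapping student embeddings into the teacher's space; random
# here, but optimized jointly with the student in actual training.
proj = rng.normal(size=(student_dim, teacher_dim))

def embed_match_loss(student: np.ndarray, teacher: np.ndarray,
                     projection: np.ndarray) -> float:
    """Mean squared distance between projected student and teacher embeddings."""
    diff = student @ projection - teacher
    return float(np.mean(np.sum(diff ** 2, axis=1)))

loss = embed_match_loss(student_emb, teacher_emb, proj)
```

In practice such a term would be combined with the usual score-matching distillation objective via a tunable weight, so the student matches both the teacher's scores and its embedding geometry.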
To make CE models more amenable to distillation via embedding matching, we propose a modified CE scoring approach that utilizes a novel dual-pooling strategy: it separately pools the final query and document token embeddings, and then computes the inner product between the pooled embeddings as the relevance score.

Our key contributions toward improving IR models via distillation are:

• We propose a novel distillation approach for neural IR models, namely EmbedDistill, that goes beyond score matching and aligns the embedding spaces of the teacher and student models.

• We consider a novel DE to DE distillation setup to showcase the effectiveness of our embedding matching approach (Sec. 4.1). Specifically, we consider a student DE model with an asymmetric configuration, consisting of a small query encoder and a frozen document encoder inherited from the teacher. Such a configuration significantly reduces query embedding generation latency during inference, while leveraging the teacher's high-quality document index.

• We show that our proposed distillation approach can leverage synthetic data to improve the student by further aligning the embedding spaces of the teacher and student (Sec. 4.3).

• We theoretically justify both the embedding matching and query generation components of our proposed method (Sec. 5). Further, we provide a comprehensive empirical evaluation of the method (Sec. 6) on two standard IR benchmarks: Natural Questions (Kwiatkowski et al., 2019a) and MSMARCO (Nguyen et al., 2016).

• Finally, we demonstrate the utility of embedding matching for CE to DE distillation on MSMARCO by utilizing a novel pooling strategy for CE models, namely dual pooling (Sec. 4.2), which might be of independent interest.
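The dual-pooling scorer introduced above can be sketched as follows, assuming mean pooling over the query and document token spans of the cross-encoder's final layer. The token embeddings and the span boundary are synthetic placeholders, and the exact pooling operator is a design choice; Sec. 4.2 defines the actual scorer.

```python
import numpy as np

rng = np.random.default_rng(2)
embed_dim, n_query_tokens, n_doc_tokens = 8, 5, 20

# Hypothetical final-layer token embeddings from a cross-encoder applied to
# the concatenated [query tokens; document tokens] input sequence.
token_embs = rng.normal(size=(n_query_tokens + n_doc_tokens, embed_dim))

def dual_pool_score(tokens: np.ndarray, n_query: int) -> float:
    """Pool the query and document token spans separately, then take the
    inner product of the pooled embeddings as the relevance score."""
    q_pooled = tokens[:n_query].mean(axis=0)
    d_pooled = tokens[n_query:].mean(axis=0)
    return float(q_pooled @ d_pooled)

score = dual_pool_score(token_embs, n_query_tokens)
```

Unlike a [CLS]-based scorer, this exposes separate query- and document-side embeddings inside the CE model, giving a DE student natural targets for embedding matching.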

2. RELATED WORK

Here, we review existing Transformers-based IR models, and discuss prior work on distillation and data augmentation for such models. We also cover prior efforts on aligning representations during distillation in non-IR settings. Unlike our problem setting, where the DE student is factorized, these works mainly consider distilling a single large Transformer into a smaller one.

Transformers-based architectures for IR. Besides the DE and CE models described in Section 1, intermediate configurations (MacAvaney et al., 2020; Khattab & Zaharia, 2020; Nie et al., 2020; Luan et al., 2021) have been proposed. Such models first independently encode the query and document, and then apply a more complex late interaction between the two. Interestingly, Nogueira et al. (2020) explore a generative encoder-decoder-style model for re-ranking, where a T5 (Raffel et al., 2020) model takes a query-document pair as input and its score for certain target tokens (e.g., True/False) defines the relevance score for the pair. In this paper, we focus on basic DE and CE models to showcase the benefits of our proposed geometric distillation approach. Exploring embedding matching for the aforementioned architectures is an interesting avenue for future work.

Distillation for IR. Traditional distillation techniques have been widely applied in the IR literature, often to distill a teacher CE model to a student DE model (Chen et al., 2021; Li et al., 2020). Recently, distillation from a DE model (with complex late interaction) to another DE model (with

