MULTI-VECTOR RETRIEVAL AS SPARSE ALIGNMENT

Abstract

Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose ALIGNER, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g., 'dog' vs. 'puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g., 'kind' from 'what kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other sparsification methods. In a zero-shot setting, ALIGNER scores 51.1 nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (≤ 8) further improves performance by up to 15.7 points nDCG@10 on argument retrieval tasks. The unary saliences of ALIGNER help us keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments, and that its performance improves significantly when initialized from larger language models.

1. INTRODUCTION

Neural information retrieval (IR) has become a promising research direction for improving traditional IR systems. The most commonly adopted approach, called the dual encoder, operates by representing every query and document as a single dense vector. Given sufficient annotations, dual encoders directly learn task-driven similarity between vectors, and often surpass traditional IR systems on complex tasks such as question answering (Lee et al., 2019; Karpukhin et al., 2020; Ni et al., 2021). However, these models can struggle to generalize to out-of-domain datasets (Thakur et al., 2021) and/or entity-centric questions (Sciavolino et al., 2021) due to the limited representational capacity of single vectors. As a remedy, multi-vector retrieval models (Khattab & Zaharia, 2020; Luan et al., 2021; Gao et al., 2021) instead use multiple vectors, typically the contextualized token vectors, to represent the text. These models greatly improve model expressiveness, and exhibit much stronger performance and robustness compared to their single-vector counterparts.

Existing multi-vector retrieval models such as ColBERT (Khattab & Zaharia, 2020) compute query-document similarity by selecting the highest-scoring document token for each query token and aggregating the scores. This sum-of-max method has two major limitations. First, restricting the selection to a single document token can be highly sub-optimal for some retrieval tasks. As we will show in our experiments, the retrieval performance can be improved by more than 16 points nDCG@10 by relaxing this constraint. Second, the method also leads to a large search index and expensive computation. Specifically, the retrieval and storage cost scales linearly with the query and document length, making multi-vector retrieval models an inferior choice for efficiency-demanding applications. We directly tackle these challenges to build faster and more accurate models.
The representation learning problem of multi-vector retrieval can be formulated as optimizing token-level alignment. Specifically, we use a sparse alignment matrix to aggregate token-level similarities, where each element indicates the alignment of a pair of tokens. From this point of view, we are able to formulate different retrieval models in a unified manner (Figure 1) and discern the drawbacks of existing models. Based on our formulation, we propose ALIGNER, a novel multi-vector retrieval model that consists of pairwise alignment and unary salience. Pairwise alignments form the basis of ALIGNER, where pairs of query and document tokens are sparsely aligned based on their contextual representations. We find that changing the sparsity of alignment can significantly impact performance on retrieval tasks. For instance, factoid questions often favor a small number of alignments since they focus on a small part of a document, whereas queries for other tasks (e.g., argument retrieval and fact checking) require a larger number of alignments for a broader understanding of a document. Our findings also support the claim of Dai et al. (2022b) that retrieval tasks with different intents should be modeled differently. ALIGNER also learns unary saliences, which decide whether each token ever needs to be aligned with any other token for retrieval. This corresponds to masking an entire row or column of the alignment matrix, rather than individual token alignments. To sparsify entire rows or columns, we introduce an algorithm that produces sparse token salience and is end-to-end differentiable, based on a novel formulation of entropy-regularized linear programming. Sparsified unary saliences allow us to prune a large number of document and query token representations, making multi-vector retrieval a more efficient and affordable solution.
We evaluate ALIGNER on the BEIR benchmark (Thakur et al., 2021), which covers a diverse set of retrieval tasks in multiple domains. In a zero-shot setting, we show that simply scaling our model achieves state-of-the-art performance, outperforming prior neural retrievers without contrastive pre-training, model-based hard negative mining, or distillation. By adapting the pairwise alignments with a few examples from the target task (similar to the setup of Dai et al., 2022b), ALIGNER can be further improved by up to 15.7 points nDCG@10 on argument retrieval tasks. Meanwhile, pruning with our unary saliences can remove 50% of query tokens for better run-time efficiency and 80% of document tokens for a smaller storage footprint, with less than a 1-point decrease in nDCG@10. The pairwise alignments and unary saliences are also highly interpretable, and often serve as concise rationales for retrieval.

2. MULTI-VECTOR RETRIEVAL AS SPARSE ALIGNMENT

Given a query Q and a collection of N documents C = {D^(1), ..., D^(N)}, a key problem in retrieval is how to represent these textual inputs in order to facilitate efficient search. One approach is lexical retrieval, which uses sparse bag-of-words representations of the text; the other is dense retrieval, which this work focuses on. Dense retrieval models learn a parameterized function that encodes the query and documents into a query representation q and document representations {d^(1), ..., d^(N)}, respectively. Typically, each representation is a single d-dimensional vector. For retrieval, the similarity function is often defined as sim(Q, D^(i)) = q^⊤ d^(i), and documents with high similarity scores to the query are retrieved.

2.1. MULTI-VECTOR RETRIEVAL

Instead of representing each query and document as a single fixed-length vector, multi-vector retrieval represents them with multiple token vectors, mainly to overcome the limited capacity of fixed-length representations. Specifically, a query Q = {q_1, ..., q_n} and a document D = {d_1, ..., d_m} are encoded into sets of vectors {q_1, ..., q_n} and {d_1, ..., d_m}. The similarity function between a query and a document is re-defined for multi-vector retrieval. For instance, ColBERT (Khattab & Zaharia, 2020) designs the similarity function as follows:

sim(Q, D) = ∑_{i=1}^{n} max_{j=1..m} q_i^⊤ d_j.

For retrieval, instead of indexing N document vectors, multi-vector retrieval pre-computes N × m document token vectors, where m is the average length of documents. Then, it retrieves K document token vectors for each query token vector with Maximum Inner-Product Search (MIPS), resulting in n × K candidate document tokens. The retrieved tokens are used to trace back the original documents (Lee et al., 2021a), often followed by a final refinement stage that scores the similarity sim(Q, D) with all token representations of each document and the query (Khattab & Zaharia, 2020). We adopt the same practice as ColBERT in our experiments.
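As a concrete illustration, the sum-of-max score can be computed from the token similarity matrix in a few lines of numpy. This is a minimal sketch assuming pre-computed (e.g., L2-normalized) token vectors, not ColBERT's actual implementation:

```python
import numpy as np

def sum_of_max(Q, D):
    """ColBERT-style similarity: each query token picks its best-matching
    document token, and the per-token maxima are summed.

    Q: (n, dim) query token vectors; D: (m, dim) document token vectors.
    """
    S = Q @ D.T                  # (n, m) token-level similarity matrix
    return S.max(axis=1).sum()   # row-wise max, summed over query tokens
```

Note that with n query tokens, the score aggregates exactly n entries of S, which is the sparsity pattern discussed in the next subsection.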

2.2. SPARSE ALIGNMENT FORMULATION

A key design question for retrieval models is defining the similarity function in a manner that balances model expressiveness and inference cost. To facilitate our discussion, we formalize the similarities used in previous methods into a class of sparse alignment functions. The formulation also leads to a principled extension over existing work, which we describe in §3. We begin by defining a similarity matrix S ∈ R^{n×m} computed from all pairs of query and document tokens, where S_{i,j} = q_i^⊤ d_j. Then, we use an alignment matrix A ∈ [0, 1]^{n×m} to compute the similarity between Q and D as follows:

sim(Q, D) = (1/Z) ∑_{i=1}^{n} ∑_{j=1}^{m} S_{i,j} A_{i,j}    (1)

where Z = ∑_{i,j} A_{i,j} is a normalization term. The alignment matrix A can be directly derived from S or computed as a function of Q and D. We constrain the alignment matrix A to be sparsely activated: ||A||_0 ≤ σ, where ||·||_0 is the number of non-zero elements in a matrix. Sparse activation assumes that only a few query-document token matches are critical for retrieval, inspired by traditional retrieval methods. Indeed, most existing dense retrieval models already enforce sparse alignment with their own heuristics. Figure 1 illustrates how different models can be described under our formulation:

• Dense passage retriever (DPR; Karpukhin et al., 2020) uses a single [CLS] vector to represent each query and document. This is equivalent to setting A_{1,1} = 1 and 0 otherwise, resulting in ||A||_0 = 1.

• ME-BERT (Luan et al., 2021) uses the first k document token vectors as multi-vector representations of documents but a single vector for the query. The similarity function is max_{j=1..k} q_1^⊤ d_j, which is equivalent to setting A_{1,j} = 1 when S_{1,j} is the maximum among S_{1,1}, ..., S_{1,k}, and 0 otherwise. The alignment sparsity is ||A||_0 = 1.

• ColBERT uses the sum-of-max similarity function ∑_{i=1}^{n} max_{j=1..m} S_{i,j}, which is equivalent to setting the alignment matrix to select the maximum element from each row of S, i.e., A_{i,j} = 1 when S_{i,j} is the maximum within S_{i,:}. Here ||A||_0 = n.

• COIL (Gao et al., 2021), similar to ColBERT, also selects the maximum element from each row of S, but requires a lexical exact match for a selected pair, i.e., A_{i,j} = 1 when S_{i,j} is the maximum within {S_{i,j} | q_i = d_j}. Here ||A||_0 ≤ n.

The choice of similarity and sparsity can have a large impact on model capacity and efficiency. For instance, ColBERT is more expressive and robust than DPR (Thakur et al., 2021), but its retrieval and storage costs are much higher. Our work seeks to further advance expressiveness while retaining strong efficiency. We describe our method in the next section.

(Figure: ALIGNER factorizes the alignment matrix A into a pairwise alignment Ã and unary saliences u^q ⊗ u^d, applied elementwise to the token similarity matrix S_{i,j} = q_i^⊤ d_j.)
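The unified view above can be sketched directly: a helper builds the alignment matrix for a given strategy, and Eq. 1 aggregates the similarity matrix under it. This is an illustrative sketch, not the models' actual implementations:

```python
import numpy as np

def alignment_sim(S, A):
    """sim(Q, D) = (1/Z) * sum_ij S[i,j] * A[i,j], with Z = sum_ij A[i,j]."""
    return (S * A).sum() / A.sum()

def dpr_alignment(S):
    """DPR: only the two [CLS] vectors are aligned (||A||_0 = 1)."""
    A = np.zeros_like(S)
    A[0, 0] = 1.0
    return A

def colbert_alignment(S):
    """ColBERT: select the maximum element of each row of S (||A||_0 = n)."""
    A = np.zeros_like(S)
    A[np.arange(S.shape[0]), S.argmax(axis=1)] = 1.0
    return A
```

Under colbert_alignment, alignment_sim equals ColBERT's sum-of-max score up to the constant normalizer Z = n.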

3. ALIGNER

In this section, we present ALIGNER, which builds upon the sparse alignment formulation. ALIGNER factorizes the alignment matrix into pairwise alignment and unary salience:

A = Ã ⊙ (u^q ⊗ u^d)    (2)

where ⊙ is the Hadamard (elementwise) product and ⊗ is the outer product of two vectors. The pairwise alignment Ã ∈ R^{n×m} determines which pairs of query and document tokens should be aligned, with sparsity constraints tailored to downstream tasks (§3.1). The unary saliences u^q ∈ R^n and u^d ∈ R^m are sparse token weights deciding whether a token ever needs to be aligned (§3.2). The factorization is introduced based on two critical hypotheses. First, the optimal sparsity of alignment can be task-dependent. Instead of imposing a top-1 constraint as in ColBERT, activating more than one alignment per query token can enhance retrieval performance for certain tasks. In our analyses, for instance, we observe that factoid questions concerning only a specific part of a document require a small number of alignments, while some other queries (such as fact checking) require more alignments for a broader understanding of the document. We explore different search spaces of the pairwise alignment matrix Ã in order to achieve better retrieval performance for each downstream task. Second, alignment is only needed for very few tokens. For example, we analyzed the 2,000 most retrieved documents in our preliminary study and found that only 12.8% of document tokens are retrieved by at least one query. Intuitively, uninformative tokens do not need to be aligned or stored, which corresponds to sparse activation over an entire row or column of A. ALIGNER directly learns the row and column sparsity as unary salience, and utilizes it to enhance retrieval efficiency.
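The factorization in Eq. 2 can be sketched as follows; note how a zero salience entry masks an entire row or column of the final alignment matrix:

```python
import numpy as np

def factorize_alignment(A_pair, u_q, u_d):
    """A = Ã ⊙ (u^q ⊗ u^d): pairwise alignment gated by unary saliences.

    A_pair: (n, m) pairwise alignment Ã; u_q: (n,), u_d: (m,) saliences.
    """
    return A_pair * np.outer(u_q, u_d)
```
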

3.1. ADAPTING PAIRWISE ALIGNMENT

Queries and documents can have varied distributions. For example, a query can be a single entity, a natural question, or a few sentences, and a document can range from a short paragraph to a long article. The search intent also changes from task to task (Dai et al., 2022b). These differences can lead to different optimal alignment strategies. We explore the following sparse alignment variants that go beyond the top-1 strategy commonly adopted in existing work:

• Top-k. Each query token is aligned with the k document tokens with the highest similarity scores. Precisely, Ã_{i,j} = 1 when the j-th token is within the top-k of row S_{i,:}. When k = 1, this is equivalent to ColBERT.

• Top-p. This strategy is similar to top-k, but instead of aligning each query token with exactly k tokens, it makes the number of alignments proportional to the document length, i.e., each query token aligns with max(⌈p · m⌉, 1) tokens, where m is the document length and p ∈ [0, 1] is the alignment ratio.

Despite their simplicity, these variants can enhance retrieval accuracy significantly on tasks such as argument retrieval. More importantly, while it is possible to train separate models for different alignment variants, we are interested in fast test-time adaptation using a single shared model, since many important retrieval tasks lack sufficient training data (Thakur et al., 2021). Specifically, we first train ALIGNER using a fixed alignment strategy such as top-1 in a source domain, and adapt the alignment strategy to each target task without changing the model parameters. We use the following few-shot alignment adaptation method. Given a corpus {D^(1), ..., D^(N)} and a few relevance-annotated query-document pairs from the target task {(Q_1, D_1^+), ..., (Q_K, D_K^+)}, we first retrieve candidate documents with the learned token representations, and then decide the pairwise alignment strategy based on the ranking performance on the annotated data.
This adaptation can be performed efficiently because the alignment only affects the computation of the similarity score (Eq. 1) in the refinement stage. In practice, for some tasks, we are able to find a well-suited alignment strategy and improve retrieval performance with as few as 8 annotated examples.
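The two alignment variants can be sketched in numpy as below. The ceiling in top-p is one plausible reading of the rounding lost in the text; treat the helpers as illustrative, not as the exact training-time implementation:

```python
import math
import numpy as np

def topk_alignment(S, k):
    """Top-k: set A[i,j] = 1 for the k highest-scoring document tokens
    of each query token (row of S)."""
    n, m = S.shape
    k = min(k, m)
    A = np.zeros_like(S)
    idx = np.argpartition(-S, k - 1, axis=1)[:, :k]  # k largest per row
    np.put_along_axis(A, idx, 1.0, axis=1)
    return A

def topp_alignment(S, p):
    """Top-p: align each query token with max(ceil(p * m), 1) document
    tokens, so the alignment count scales with document length m."""
    m = S.shape[1]
    return topk_alignment(S, max(math.ceil(p * m), 1))
```

Few-shot adaptation then amounts to scoring candidates under each mask (k ∈ {1, 2, 4, ...} or p ∈ {0.5%, ...}) and keeping the strategy with the best ranking metric on the annotated pairs.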

3.2. LEARNING UNARY SALIENCE

ALIGNER predicts token saliences from their token representations. For brevity, we only present the formulation for document salience; query salience is defined analogously. Specifically, the salience of the i-th document token is defined as:

u_i^d = λ_i^d · f(W^d d_i + b^d)    (3)

where W^d and b^d are learnable parameters and f is a non-linear activation function; we use ReLU so that salience is always non-negative. λ^d = {λ_i^d} are gating variables that control the overall sparsity of u^d, which we elaborate on next. For the document salience to be meaningful, we enforce salience sparsity as an inductive bias. ALIGNER jointly optimizes the sparse salience with the other parts of the model. Since tokens with zero salience do not contribute to computing similarity, the model is encouraged to identify more important tokens in order to retain good retrieval performance. Note that during training we do not have any explicit annotation of which tokens are important. Instead, u^d (and similarly u^q) is directly optimized to minimize the training loss, under the sparsity constraint ||λ^d||_0 = ⌈α^d · m⌉, where α^d is a constant sparsity ratio and m is the document length. A key question is how we can optimize the unary salience component under this controlled sparsity. We leverage a novel technique called entropy-regularized linear programming to enable end-to-end optimization. Specifically, let k = ⌈α^d · m⌉ denote the desired sparsity, let s_i = f(W^d d_i + b^d) denote the token score before the sparse gate λ_i^d is applied, and let s, λ^d ∈ R^m be the vectors {s_i} and {λ_i^d}, respectively. λ^d is computed by solving the following optimization problem:

max_λ s^⊤ λ + ε H(λ)    s.t.  1^⊤ λ = k,  λ_i ∈ [0, 1], ∀i = 1, ..., m    (4)

where H(·) is the elementwise entropy function and ε > 0 is a small constant. The optimization can be seen as a relaxed top-k operation.
Without the entropy term εH(·), this becomes an instance of linear programming whose solution λ^d is a binary mask indicating the top-k values of s, i.e., λ_i^d = 1 if and only if s_i is one of the top-k values in s. This top-k optimization is smoothed by adding the small entropy term εH(·) and by relaxing λ_i from exact binary values to [0, 1]. Given small ε, this still produces a sparse solution λ^d and can be solved using simple vector operations. Specifically, let a ∈ R and b_i ∈ R for i = 1, ..., m be auxiliary variables initialized to zero. We iteratively update these variables using the following equations:

a = ε ln(k) - ε ln ∑_i exp((s_i + b_i) / ε),    b_i = min(-s_i - a, 0).    (5)

In practice, it is sufficient to run only a few iterations, and the final solution is given by λ_i = exp((s_i + b_i + a) / ε). These vector operations are differentiable, so λ can be trained end-to-end with the other parts of our model. The full derivation of this iterative algorithm is given in Appendix A.1.

Pruning Multi-vector Retrieval. Despite good retrieval performance, multi-vector retrieval models are notorious for their expensive token-level retrieval (Santhanam et al., 2022). This prevents multi-vector retrievers from being widely adopted in practice. With the learned unary salience, we can prune redundant tokens for multi-vector retrieval. Pruning document tokens reduces the number of vectors in the search index, and pruning query tokens reduces the number of searches. Compared to the sparse salience method using L1-norm regularization (Hofstätter et al., 2022), our proposed method offers direct control over the sparsity. In our experiments, we prune query and document tokens using two pruning ratios, β^q and β^d, respectively. For each document, we obtain the token salience using Eq. (3) and only store the top β^d percent of tokens in the index. Similarly, we select the top β^q percent of query tokens to perform maximum inner-product search.
Note that we vary these two ratios to control retrieval efficiency, and these ratios can be smaller than the sparsity ratios α^q and α^d that we use as constraints at training time. In the refinement stage, we still use the full model with all token vectors for scoring. With token embedding caching, the computation cost of refinement is relatively small compared to that of retrieval.
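The iterative updates of Eq. 5 can be sketched in numpy as follows. The values of ε and the iteration count are illustrative (the paper trains with a much smaller ε), and a log-sum-exp is used for numerical stability:

```python
import numpy as np

def relaxed_topk(s, k, eps=0.1, n_iter=100):
    """Entropy-regularized relaxed top-k (Eqs. 4-5).

    Returns gates lam in [0, 1] with sum(lam) = k; as eps -> 0 the gates
    approach a hard top-k mask over the scores s.
    """
    def logsumexp(x):
        m = x.max()
        return m + np.log(np.exp(x - m).sum())

    b = np.zeros_like(s)
    for _ in range(n_iter):
        # a-update: enforce the budget constraint 1^T lam = k
        a = eps * np.log(k) - eps * logsumexp((s + b) / eps)
        # b-update: enforce lam_i <= 1
        b = np.minimum(-s - a, 0.0)
    # final a-update so the budget holds exactly for the returned gates
    a = eps * np.log(k) - eps * logsumexp((s + b) / eps)
    return np.exp((s + b + a) / eps)
```

During training the gates multiply the ReLU salience scores of Eq. 3; at retrieval time, pruning simply keeps the top β^d (or β^q) percent of tokens ranked by salience.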

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

ALIGNER uses a shared transformer encoder initialized from T5 version 1.1 (Raffel et al., 2020). We project token embeddings to 128 dimensions and apply L2 normalization. Following GTR (Ni et al., 2021), we finetune ALIGNER on MS MARCO with hard negatives released by RocketQA (Qu et al., 2021). The models are trained with a batch size of 256 for 25k steps, using a query sequence length of 64 and a document sequence length of 256. We train ALIGNER with top-1 pairwise alignment. For retrieval, we pre-compute the token encodings of all documents in the corpus, and use ScaNN (Guo et al., 2020) to index and perform maximum inner-product search (MIPS). We retrieve 4,000 nearest neighbors for each query token, and return the top-1,000 documents after the refinement stage. We evaluate ALIGNER on the BEIR benchmark (Thakur et al., 2021) and compare with the state-of-the-art retrieval models shown in Table 1. Note that ALIGNER does not rely on contrastive model pre-training (Izacard et al., 2022; Ni et al., 2021), model-based hard negative mining (Santhanam et al., 2021), or distillation (Santhanam et al., 2021). We intentionally choose this simple recipe to focus on studying the impact of pairwise alignment and unary salience. For few-shot alignment adaptation of ALIGNER (§3.1), we split the test data into multiple folds such that each fold contains 8 examples. We then find the best alignment strategy that maximizes nDCG@10 on each fold, with k ∈ {1, 2, 4, 6, 8} for top-k and p ∈ {0.5%, 1%, 1.5%, 2%} for top-p. Based on the best alignment strategy from each fold, we measure the retrieval performance on the remaining test examples. We report the average (± std.) of these test scores, where the number of test scores equals the number of folds. This average indicates the expected performance of using a few examples to choose the best alignment strategy.
ALIGNER xxl achieves the strongest results, showing that multi-vector retrieval models benefit from large pretrained language models. It also outperforms GTR xxl on 9 of the 13 BEIR datasets and advances the retriever-only state-of-the-art (ColBERT v2) by 1.2 points nDCG@10 on average. Figure 3 shows that our multi-vector retrieval model scales better than the single-vector dual encoder GTR.

4.2. RETRIEVAL ACCURACY

Alignment Adaptation. In the rightmost column of Table 2, we show the effect of adapting pairwise alignment with ALIGNER on the BEIR benchmark. With only 8 examples for finding the proper alignment sparsity, its expected performance reaches 52.6 nDCG@10 on average. Alignment-adapted ALIGNER also benefits from scaling up, and consistently outperforms its non-adapted counterparts, as shown in Figure 3. The gains are further explained in Table 3, where we show each task's performance under various alignment strategies. Although ALIGNER is trained with top-1 alignment, top-1 is not always the best strategy at inference time. Specifically, for ArguAna, we observe a 16-point improvement by adjusting the number of alignments proportionally to the document length with p = 1.5%. In general, keeping the sparsity low enough is preferable, which supports our hypothesis that pairwise alignments should be sparse. Figure 4 compares ALIGNER variants trained with other pairwise alignment strategies, including top-k (k = 1, 2, 4, 8) and top-p (p = 0.5%, ..., 4%). We evaluate their performance with the training-time alignment strategy (default) and the optimal alignment strategy selected per dataset (oracle). Among these variants, top-1 and top-0.5% work best, and increasing k or p hurts performance. Alignment adaptation improves all models, showing that different retrieval tasks require different pairwise alignments. Figure 5 shows the effectiveness of few-shot alignment adaptation: it improves the average score and reduces the variance.

(Figure 6: ALIGNER with unary salience on MS MARCO; β^q and β^d are the ratios used to prune query and document tokens, respectively. Figure 7: ALIGNER with unary salience on BEIR; we set the query pruning ratio β^q = 50% and vary the document pruning ratio β^d, omitting datasets with small corpora, and also report ALIGNER with alignment adaptation (AA).)
However, when the default alignment is already optimal (top-1 is optimal for QA tasks), few-shot alignment adaptation can hurt performance due to the variance of our few-shot method. Nevertheless, ALIGNER outperforms Promptagator (Dai et al., 2022b), another few-shot retrieval baseline, on 6 of 11 datasets.

4.3. RETRIEVAL EFFICIENCY

Next we show how ALIGNER's unary salience improves retrieval efficiency. We train ALIGNER base with salience sparsity ratios (α^q = 50%, α^d = 40%) and ε = 0.002 based on empirical performance. At retrieval time, we prune query and document tokens with ratios β^q and β^d (§3.2). Figure 6 shows ALIGNER's performance on MS MARCO with various pruning ratios. When pruned at the same ratios as training (β^q = 50%, β^d = 40%), the model performs similarly to the full model (38.1 vs. 38.8 MRR@10). We can further prune tokens by adjusting β^d and β^q. The model achieves 37.3 MRR@10 with β^d = 10%, i.e., it remains accurate with only 10% of the original index size. Decreasing the query pruning ratio β^q to 30% does not sacrifice much performance, although decreasing β^q to 10% leads to worse performance. Figure 6 also compares ALIGNER's entropy-regularized linear program (Eq. 4) with alternative methods. With just a ReLU gate and no sparsity constraints ('ReLU' in Figure 6), the model retains good performance when β^d = 40%, but degrades for smaller β^d. Removing the entropy regularization in Eq. 4 amounts to simply selecting the hard top-k tokens with the highest predicted salience ('Hard' in Figure 6); this hard top-k solution performs worse for all β^d. Another way to sparsify salience is to add an L1-norm regularization ('L1' in Figure 6). With a proper coefficient, it performs comparably to our method and slightly better when β^d = 5%. Note that our method has the advantage of explicitly controlling the sparsity ratios, without tuning the coefficient of the L1 term. ALIGNER's salience estimation also generalizes to other retrieval datasets. As shown in Figure 7, pruning with β^d = 20% and β^q = 30% causes minimal performance decrease for a majority of BEIR datasets, which leads to an 80% reduction in the search index size and up to a 94% reduction in computation cost.
We even observe a performance increase for Touché-2020, as the model retrieves only salient tokens after pruning. Besides, we show that alignment adaptation can be combined with pruning, resulting in an effective yet efficient retrieval model.

Table 4: Examples of the pairwise alignment and unary salience learned by ALIGNER. The three most salient query tokens and their top-1 pairwise-aligned document tokens are indicated with the same color. We highlight the top 50% of query tokens and 20% of document tokens according to their salience.

Query: what happens when stop drinking alcohol
Doc.: alcohol. symptoms of alcohol withdrawal may begin from 4 to 12 hours after you cut down or stop drinking, or as long as several days after the last drink, and can last a few days. they can range from mild to life-threatening. 1 mild withdrawal symptoms may include: 2 intense worry. 3 nausea or vomiting. 4 shakiness. 5 sweating.

Query: where is the heart in the human body
Doc.: heart the heart is a muscular organ in most animals, which pumps blood through the blood vessels of the circulatory system. [1] blood provides the body with oxygen and nutrients, as well as assists in the removal of metabolic wastes. [2] in humans, the heart is located between the lungs, in the middle compartment of the chest. [3]

4.4. INTERPRETABILITY

Table 4 shows examples of the pairwise alignment and unary salience learned by ALIGNER. The model aligns query tokens to contextually similar document tokens, not necessarily identical ones. The salience estimates are also consistent with human intuition: important noun phrases and verbs receive higher salience. We show more examples of alignments in Appendix A.3. In general, we observe that question answering tasks usually require fewer alignments, while tasks that require a broad understanding of the document favor a larger number of alignments.

5. RELATED WORK

Recent research on information retrieval often improves retrieval accuracy with contrastive pre-training (Ni et al., 2021; Izacard et al., 2022; Oguz et al., 2022), model-based hard negative mining (Xiong et al., 2020; Lu et al., 2021; Qu et al., 2021), and knowledge distillation (Santhanam et al., 2021; Zhang et al., 2022; Reddi et al., 2021). Retrieval efficiency is improved via quantization (Santhanam et al., 2021) or lower-dimensional vectors (Hofstätter et al., 2022). Term importance and salience have a long history in information retrieval, from term frequency (tf) and inverse document frequency (idf) to recent BERT-based importance measures (Dai & Callan, 2020; Zhao et al., 2021; Formal et al., 2021b;a). These works mostly focus on sparse lexical retrieval and learn term weights for sparse bag-of-words representations. Our work is most related to ColBERTer (Hofstätter et al., 2022), which proposed to fuse single-vector retrieval and multi-vector refinement. While ColBERTer prunes multi-vector word embeddings for refinement and tests on in-domain retrieval tasks, we propose to prune multi-vector embeddings for retrieval and mainly study generalization to out-of-domain retrieval tasks. Zhou & Devlin (2021) proposed a multi-vector attention model for reranking; we have a similar formulation but focus on retrieval. Recently, Promptagator (Dai et al., 2022b) pointed out the importance of using a few annotated examples to adapt to a new retrieval task. Promptagator achieves few-shot task adaptation via query generation (Ma et al., 2021; Lee et al., 2021b; Dai et al., 2022a) using large language models (Sanh et al., 2022; Brown et al., 2020; Wei et al., 2022), which has a high inference cost. ALIGNER is more versatile and can be quickly adapted to a new task via few-shot alignment adaptation.

6. CONCLUSION

In this paper, we introduce ALIGNER, a novel sparse alignment method for multi-vector document retrieval. We first formulate different retrieval models with token-level sparse alignments and propose ALIGNER to tackle the limitations of existing models. Specifically, ALIGNER uses pairwise alignments and unary saliences, which allow us to adapt to different tasks and to prune unimportant tokens, respectively. As a result, we achieve strong performance on both zero-shot and few-shot document retrieval tasks while drastically reducing the run-time and storage costs of multi-vector retrieval. With its interpretable alignments and better performance with large language models, we envision that our multi-vector retrieval model can serve as a strong standalone retriever in the future.

A APPENDIX A.1 DERIVATION OF THE ITERATIVE UPDATES

We present the derivation of Eq.5 for solving optimization problem (4) in Section 3.2. The maximization problem (4) can be written as an equivalent minimization problem: max λ s λ + εH(λ) ⇐⇒ min λ -s λ -εH(λ) ⇐⇒ min λ -s λ -εH(λ) -ε1 λ (6) s.t. 1 λ = k, λ i ∈ [0, 1], i = 1, . . . , m. Note the term ε1 λ will be a constant ε × k, but we include it in the minimization object to make our derivation simpler later. Now, let a ∈ R and b ∈ R m be the Lagrangian variables corresponding to the linear constraints 1 λ = k and λ i ≤ 1 ∀i . 8 The minimization problem is equivalent to its Lagrangian expression: min λ ∈R m max a∈R,b≤0 -s λ -εH(λ) -ε1 λ + a(k -1 λ) + b (1 -λ) The objective function ( 6) is strongly convex and the solution space of λ is a convex set. As a result, strong duality holds and we can instead solve the dual problem that exchanges the min and max operators in ( 7) max a∈R,b≤0 min λ ∈R m -s λ -εH(λ) -ε1 λ + a(k -1 λ) + b (1 -λ) The optimal solution (a, b, λ) must have the Karush-Kuhn-Tucker (KKT) conditions hold (Kuhn & Tucker, 2014) , namely ∂ -s λ -εH(λ) + ε1 λ + a(k -1 λ) + b (1 -λ) ∂ λ = 0 ⇐⇒ λ = exp s + a + b ε ⇐⇒ λ i = exp s i + a + b i ε ∀i = 1, . . . , m Substituting λ using the above equation in (8), the dual problem now has a simple form: max a∈R,b≤0 k • a + 1 b -1 exp s + a + b ε We can solve this problem using coordinate descent (Wright, 2015) by successively maximizing the function with either a or b fixed. This leads to the iterative updates (Eq.5) described in Section 3.2. a = ε ln(k) -ε ln ∑ i exp s i + b i ε b i = min(-s i -a , 0) Discussion In short, we solve the dual problem of optimization (4) by performing coordinate decent of the dual variables a and b. That is, we find the optimal a that maximizes the dual objective given a fixed b, and vice versa. This iterative algorithm is also closely related to the Sinkhorn algorithm of Optimal Transport (OT). 
In fact, the Sinkhorn algorithm solves the entropy-regularized version of Optimal Transport (Cuturi, 2013). However, our work concerns a different optimization instance. While OT solves a transportation problem whose solution space is defined by marginal constraints over the rows and columns of a transportation matrix, our optimization problem is constrained by a total budget ($\sum_i \lambda_i = k$) and upper bounds ($\lambda_i \le 1\ \forall i$). This leads to different iterative updates.
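As an illustration, the iterative updates above can be sketched in a few lines of NumPy. This is a minimal sketch with hypothetical names, not the paper's implementation (which would run in batched, differentiable form inside the model):

```python
import numpy as np

def entropic_topk(s, k, eps=0.1, n_iter=100):
    """Solve max_l s@l + eps*H(l) s.t. sum(l) == k, 0 <= l_i <= 1,
    by coordinate descent on the dual variables a and b (Eq. 5)."""
    s = np.asarray(s, dtype=np.float64)
    b = np.zeros_like(s)
    for _ in range(n_iter):
        # a-update: a' = eps*ln(k) - eps*ln sum_i exp((s_i + b_i)/eps),
        # computed with a stabilized log-sum-exp.
        z = (s + b) / eps
        a = eps * np.log(k) - eps * (z.max() + np.log(np.exp(z - z.max()).sum()))
        # b-update: b'_i = min(-s_i - a', 0), which enforces lambda_i <= 1.
        b = np.minimum(-s - a, 0.0)
    # Recover the primal solution from the KKT condition.
    return np.exp((s + a + b) / eps)
```

At convergence the output sums to k with every entry in [0, 1], and as eps approaches 0 it approaches a hard top-k indicator over the scores.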

A.2 DIFFERENTIABLE ALIGNMENT WITH SPARSITY CONSTRAINTS

Besides the Top-k and Top-p alignments in §3.1, we also explore a differentiable pairwise alignment with sparsity constraints (DA). Both Top-k and Top-p perform a hard selection of alignments, i.e., $\tilde{A}_{i,j}$ is either 1 or 0. We relax this by introducing soft sparsity constraints. Similar to our formulation for unary salience (§3.2), we determine the alignment $\tilde{A}$ by the following optimization problem:

$$\max_{A}\; \langle S, A \rangle + \varepsilon H(A) \quad \text{s.t.}\quad \sum_j A_{i,j} = k,\ i = 1,\dots,n; \qquad A_{i,j} \in [0,1],\ i = 1,\dots,n,\ j = 1,\dots,m \quad (9)$$

where $H(\cdot)$ is the elementwise entropy function and $\varepsilon > 0$ is a small constant. We constrain the sum of each row of $\tilde{A}$ to equal $k$. When $\varepsilon = 0$, the solution of Eq. 9 is the same as Top-k. When $\varepsilon > 0$, the entropy term makes the optimization problem strongly concave, so it can be solved by the same algorithm as in Appendix A.1. The solution is differentiable and can thus be trained end-to-end in our model.
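Concretely, since the constraints in Eq. 9 are independent across rows, the relaxed alignment can be obtained by running the Appendix A.1 solver on each row of the similarity matrix. A minimal NumPy sketch with hypothetical function names (the actual model would use batched, autodiff-friendly ops):

```python
import numpy as np

def soft_topk_row(s, k, eps=0.1, n_iter=100):
    """Entropy-regularized top-k for one row of similarity scores,
    using the coordinate-descent updates from Appendix A.1."""
    b = np.zeros_like(s)
    for _ in range(n_iter):
        z = (s + b) / eps
        a = eps * np.log(k) - eps * (z.max() + np.log(np.exp(z - z.max()).sum()))
        b = np.minimum(-s - a, 0.0)
    return np.exp((s + a + b) / eps)

def differentiable_alignment(S, k, eps=0.1):
    """Relaxed alignment A~ for Eq. 9: every row sums to k and all
    entries lie in [0, 1]; eps -> 0 recovers hard Top-k."""
    S = np.asarray(S, dtype=np.float64)
    return np.stack([soft_topk_row(row, k, eps) for row in S])
```

Row-wise independence is what keeps this cheap: each query token's alignment distribution is solved separately over the document tokens.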

Query vs. gold document examples:

Quora
  Query: what is the best birthday gift for a friend?
  Gold document: what(3) is(2) a(4) good(1) birthday gift for a friend?

MS MARCO (dev)
  Query: when would you use a fathom measurement
  Gold document: a fathom(1) is a unit of length in the imperial and the u.s. customary systems equal to 6 feet(3) (1.8288 m(4)), used especially for measuring the depth of water. there are two yards (6 feet) in an imperial fathom(2).

Touché-2020
  Query: should animals be used for scientific or commercial testing?
  Gold document: animal testing should not be allowed. ...[truncated]... albeit the nonprecocious mistakes of scientists(2). also...

We notice that the top-1 alignment quality is generally good across all three tasks. However, larger values of k result in spurious irrelevant alignments for Quora and MS MARCO, while remaining fairly useful for Touché-2020.



Footnotes:
- We will release our model checkpoints to encourage future research.
- The analysis was performed on MS MARCO (Nguyen et al., 2016) using our implementation of ColBERT.
- We have also explored a differentiable alignment with sparsity constraints (Appendix A.2), but alignment adaptation is still necessary to achieve good performance on target tasks.
- $H(\lambda) = \sum_{i=1}^{m} -\lambda_i \log \lambda_i$. $\lambda$ is not a probability distribution; $H(\lambda)$ is an extension of the entropy function that applies to any positive vector $\lambda$.
- We have trained models with other top-k and top-p pairwise alignments, but the MS MARCO training data favors top-1 alignment (see Appendix A.4 for details).
- Unlike ColBERT, ALIGNER does not use pad token embeddings for retrieval. Hence, retrieving 4,000 neighbors per query token results in a similar number of retrieved candidates to ColBERT, assuming brute-force search.
- $\lambda_i \ge 0\ \forall i$ is already implied by the entropy term $H(\lambda)$ in the objective.



Figure 1: (a) We formulate multi-vector retrieval as token-level sparse alignment; (b) earlier models are covered by our formulation as special cases using different alignments.

Figure 2: ALIGNER factorizes the alignment matrix into pairwise alignments and unary saliences. Pairwise alignment focuses on the alignment of individual token pairs. Unary saliences are determined by per-token salience features.

Figure 7: ALIGNER with unary salience on BEIR. We set the query pruning ratio β_q = 50% and vary the document pruning ratio β_d. We omit datasets with small corpora. We also report the performance of ALIGNER with alignment adaptation (AA).

...[truncated]... skeptic of the scientist(3) in question's abilities ...[truncated]... continuous use if animals for clinical(4) and basic research." ...[truncated]... majority of the scientific(1) community thinks on this issue, ...

Table 5: Examples of pairwise alignment with top-k values up to 4 for the Quora, MS MARCO, and Touché-2020 datasets. Query tokens being aligned are shown in blue, and the corresponding aligned document tokens are shown in red. The superscript (k) on a document token indicates the k-th alignment.

Comparison of different retrieval models.

shows the document retrieval performance of ALIGNER on both MS MARCO and the BEIR benchmark. For this experiment, we do not prune any query or document tokens with unary saliences; we show their effects in §4.3 instead. ALIGNER base outperforms state-of-the-art sparse and dense retrievers on MS MARCO. It performs slightly worse than ColBERTv2, given that we do not use distillation or model-based hard negatives to optimize in-domain performance.

Results on MS MARCO (top; MRR@10) and the BEIR benchmark (bottom; nDCG@10). Best zero-shot scores are denoted in boldface.



shows examples of top-k pairwise alignments of a query token (highlighted in blue) to the corresponding document tokens for several different tasks. For question answering (e.g., MS MARCO) and duplicate question retrieval (Quora), fewer alignments seem preferable, and as k increases, we start to see spurious alignments to unrelated document tokens. For argument retrieval tasks such as Touché-2020, on the other hand, larger values of k tend to provide useful, semantically relevant alignments (e.g., 'scientific' vs. 'clinical'). These qualitative examples provide intuitive insights into why different alignment strategies help different tasks, and why alignment adaptation is necessary.
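For illustration, the inference-time top-k scoring underlying these examples can be sketched for a single query-document pair. This is a simplified sketch with hypothetical names; in particular, averaging within each top-k set is an assumption of this sketch, not necessarily the paper's exact aggregation. With k = 1 it reduces to ColBERT's sum-of-max:

```python
import numpy as np

def topk_alignment_score(q_vecs, d_vecs, k=1):
    """Score one (query, document) pair: align each query token to its
    top-k most similar document tokens and aggregate the similarities."""
    # Token-level similarity matrix: one row per query token.
    S = np.asarray(q_vecs) @ np.asarray(d_vecs).T
    # Keep the k highest-scoring document tokens per query token.
    topk = np.sort(S, axis=1)[:, -k:]
    # Average within each top-k set, then sum over query tokens.
    return topk.mean(axis=1).sum()
```

Raising k lets each query token draw evidence from several document tokens, which matches the behavior observed on argument retrieval tasks above.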

Retrieval performance on MS MARCO. The top half shows baselines from previous work. The bottom half shows different ALIGNER models. DA: differentiable alignment (see Appendix A.2).

A.5 FULL RESULT TABLES ON BEIR

Tables 7 to 10 present complete results of ALIGNER's performance on the BEIR datasets initialized from T5 base, large, XL, and XXL checkpoints. We set k = 1 during training and show results across different inference-time alignment strategies (both top-k and top-p). As expected, model accuracy improves as we scale to larger models. Moreover, we observe similar benefits of alignment adaptation across all model sizes.

nDCG@10 on the BEIR benchmark with different k and p in ALIGNER base .

nDCG@10 on the BEIR benchmark with different k and p in ALIGNER large .

nDCG@10 on the BEIR benchmark with different k and p in ALIGNER xl .

nDCG@10 on the BEIR benchmark with different k and p in ALIGNER xxl .


