MULTI-VECTOR RETRIEVAL AS SPARSE ALIGNMENT

Abstract

Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose ALIGNER, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. 'dog' vs. 'puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. 'kind' from 'what kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods for achieving sparsity. In a zero-shot setting, ALIGNER scores 51.1 points nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (≤ 8) further improves performance by up to 15.7 points nDCG@10 for argument retrieval tasks. The unary saliences of ALIGNER help us keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.

1. INTRODUCTION

Neural information retrieval (IR) has become a promising research direction for improving traditional IR systems. The most commonly adopted approach, the dual encoder, represents every query and document as a single dense vector. Given sufficient annotations, dual encoders directly learn task-driven similarity between vectors, and often surpass traditional IR systems on complex tasks such as question answering (Lee et al., 2019; Karpukhin et al., 2020; Ni et al., 2021). However, these models can struggle to generalize to out-of-domain datasets (Thakur et al., 2021) and/or entity-centric questions (Sciavolino et al., 2021) due to the limited representational capacity of single vectors. As a remedy, multi-vector retrieval models (Khattab & Zaharia, 2020; Luan et al., 2021; Gao et al., 2021) instead use multiple vectors, typically the contextualized token vectors, to represent the text. These models largely improve model expressiveness, and exhibit much stronger performance and robustness than their single-vector counterparts.

Existing multi-vector retrieval models such as ColBERT (Khattab & Zaharia, 2020) compute query-document similarity by selecting the highest-scoring document token for each query token and aggregating the scores. This sum-of-max method has two major limitations. First, restricting the selection to a single document token can be highly sub-optimal for some retrieval tasks. As we will show in our experiments, the retrieval performance can be improved by more than 16 points nDCG@10 by relaxing this constraint. Second, the method leads to a large search index and expensive computation: the retrieval and storage cost scales linearly with the query and document length, making multi-vector retrieval models an inferior choice for efficiency-demanding applications. We directly tackle these challenges to build faster and more accurate models.
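The sum-of-max scoring used by ColBERT can be sketched in a few lines of numpy (toy token vectors, not real encoder outputs):

```python
import numpy as np

def sum_of_max(Q, D):
    """ColBERT-style sum-of-max scoring.
    Q: (n, d) query token vectors; D: (m, d) document token vectors.
    For each query token, take its best-matching document token,
    then sum these maxima over all query tokens."""
    S = Q @ D.T               # (n, m) token-level similarity matrix
    return S.max(axis=1).sum()

# Toy example: 2 query tokens, 3 document tokens, d = 2
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
D = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
score = sum_of_max(Q, D)      # 0.9 + 0.8
```

Note that the index must store every document token vector, which is where the linear storage cost comes from.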
The representation learning problem of multi-vector retrieval can be formulated as optimizing token-level alignment. Specifically, we use a sparse alignment matrix to aggregate token-level similarities:

sim(Q, D) = (1/Z) ∑_{i,j} (S ⊙ A)_{i,j},

where S is the token-level similarity matrix, A is the sparse alignment matrix whose elements indicate the alignment of a pair of tokens, ⊙ denotes element-wise multiplication, and Z is a normalizer. From this point of view, we are able to formulate different retrieval models in a unified manner (Figure 1) and discern the drawbacks of existing models.

Based on our formulation, we propose ALIGNER, a novel multi-vector retrieval model that consists of pairwise alignment and unary salience. Pairwise alignments form the basis of ALIGNER, where pairs of query and document tokens are sparsely aligned based on their contextual representations. We find that changing the sparsity of alignment can significantly impact performance on retrieval tasks. For instance, factoid questions often favor a small number of alignments since they focus on a small part of a document, whereas queries for other tasks (e.g., argument retrieval and fact checking) require a larger number of alignments for a broader understanding of a document. Our findings also support the claim of Dai et al. (2022b) that retrieval tasks with different intents should be modeled differently.

ALIGNER also learns unary saliences, which decide whether each token ever needs to be aligned with any other token for retrieval. This corresponds to masking an entire row or column of the alignment matrix, rather than individual token alignments. To sparsify entire rows or columns, we introduce an algorithm that produces sparse token saliences and is end-to-end differentiable, based on a novel formulation of entropy-regularized linear programming. Sparsified unary saliences allow us to prune a large number of document and query token representations, making multi-vector retrieval a more efficient and affordable solution.
We evaluate ALIGNER on the BEIR benchmark (Thakur et al., 2021), which covers a diverse set of retrieval tasks in multiple domains.¹ In a zero-shot setting, we show that simply scaling our model achieves state-of-the-art performance, outperforming prior neural retrievers without contrastive pre-training, model-based hard negative mining, or distillation. By adapting the pairwise alignments with a few examples from the target task, similar to the setup of Dai et al. (2022b), ALIGNER can be further improved by up to 15.7 points nDCG@10 on argument retrieval tasks. Meanwhile, pruning with our unary saliences can remove 50% of query tokens for better run-time efficiency and 80% of document tokens for a smaller storage footprint, with less than a 1-point decrease in nDCG@10. The pairwise alignments and unary saliences are also highly interpretable, often serving as concise rationales for retrieval.
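As a rough sketch of salience-based pruning, the snippet below keeps only the top 20% of document token vectors by unary salience before indexing. The salience scores here are made up for illustration; in ALIGNER they are learned:

```python
import numpy as np

def prune_by_salience(token_vecs, saliences, keep_ratio=0.2):
    """Keep the top keep_ratio fraction of token vectors by salience,
    preserving the original token order; the rest are dropped from
    the search index."""
    k = max(1, int(len(saliences) * keep_ratio))
    keep = np.sort(np.argsort(-saliences)[:k])   # indices of kept tokens
    return token_vecs[keep]

# 10 toy document-token vectors with hypothetical salience scores
token_vecs = np.arange(10, dtype=float).reshape(10, 1)
saliences = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.4, 0.5, 0.6, 0.7])
pruned = prune_by_salience(token_vecs, saliences, keep_ratio=0.2)
```

With keep_ratio=0.2, only 2 of the 10 token vectors survive, which is the source of the 80% storage reduction cited above.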

2. MULTI-VECTOR RETRIEVAL AS SPARSE ALIGNMENT

Given a query Q and a collection of N documents C = {D^(1), . . . , D^(N)}, a key problem in retrieval is how to represent these textual inputs in order to facilitate efficient search. To this end, one approach is lexical retrieval using sparse bag-of-words representations of the text; the other approach is dense retrieval, which this work focuses on. Dense retrieval models learn a parameterized function that encodes the query and documents into a query representation q and document representations {d^(1), . . . , d^(N)}, respectively. Typically, each representation is a single d-dimensional vector. For retrieval, the similarity function is often defined as sim(Q, D^(i)) = q^⊤ d^(i), and documents having high similarity scores to the query are retrieved.
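The single-vector setup above can be sketched as follows (the vectors are illustrative placeholders, not real encoder outputs):

```python
import numpy as np

# Toy single-vector dense retrieval: each text is one d-dimensional
# vector, and documents are ranked by dot-product similarity.
q = np.array([0.2, 0.9, 0.1])             # query vector (d = 3)
docs = np.array([[0.1, 0.2, 0.9],         # d^(1)
                 [0.3, 0.8, 0.1],         # d^(2)
                 [0.9, 0.1, 0.3]])        # d^(3)

scores = docs @ q                         # sim(Q, D^(i)) = q^T d^(i)
ranking = np.argsort(-scores)             # document indices, best first
```

Because each document is a single vector, this search reduces to one matrix-vector product; multi-vector models trade this simplicity for expressiveness.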



¹ We will release our model checkpoints to encourage future research.



Figure 1: (a) We formulate multi-vector retrieval as token-level sparse alignment; (b) earlier models can be covered by our formulation as using different alignments.

