MULTI-VECTOR RETRIEVAL AS SPARSE ALIGNMENT

Abstract

Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose ALIGNER, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. 'dog' vs. 'puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. 'kind' from 'what kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods for achieving sparsity. In a zero-shot setting, ALIGNER scores 51.1 points nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (≤ 8) further improves performance by up to 15.7 points nDCG@10 for argument retrieval tasks. The unary saliences of ALIGNER help us keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and that its performance improves significantly when it is initialized from larger language models.

1. INTRODUCTION

Neural information retrieval (IR) has become a promising research direction for improving traditional IR systems. The most commonly adopted approach, the dual encoder, represents every query and document as a single dense vector. Given sufficient annotations, dual encoders directly learn task-driven similarity between vectors, and often surpass traditional IR systems on complex tasks such as question answering (Lee et al., 2019; Karpukhin et al., 2020; Ni et al., 2021). However, these models can struggle to generalize to out-of-domain datasets (Thakur et al., 2021) and/or entity-centric questions (Sciavolino et al., 2021) due to the limited representational capacity of single vectors. As a remedy, multi-vector retrieval models (Khattab & Zaharia, 2020; Luan et al., 2021; Gao et al., 2021) instead use multiple vectors, typically the contextualized token vectors, to represent the text. These models are considerably more expressive, and exhibit much stronger performance and robustness than their single-vector counterparts.

Existing multi-vector retrieval models such as ColBERT (Khattab & Zaharia, 2020) compute query-document similarity by selecting the highest-scoring document token for each query token and aggregating the scores. This sum-of-max method has two major limitations. First, restricting the selection to a single document token can be highly sub-optimal for some retrieval tasks. As we will show in our experiments, the retrieval performance can be improved by more than 16 points nDCG@10 by relaxing this constraint. Second, the method also leads to a large search index and expensive computation. Specifically, the retrieval and storage cost scales linearly with the query and document length, making multi-vector retrieval models an inferior choice for efficiency-demanding applications. We directly tackle these challenges to build faster and more accurate models.
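To make the sum-of-max scoring concrete, the following is a minimal sketch in plain Python. It assumes token vectors are given as lists of floats and that token similarity is a dot product. The function names `sum_of_max` and `sum_of_top_k` are hypothetical and chosen for illustration only. The top-k variant is one possible way to let a query token align with more than one document token; it is shown as an assumption about what relaxing the constraint could look like, not as the exact formulation used by any particular model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two token vectors (assumed metric)."""
    return sum(a * b for a, b in zip(u, v))


def sum_of_max(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum-of-max: each query token keeps only its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def sum_of_top_k(query_vecs: List[Vector], doc_vecs: List[Vector], k: int) -> float:
    """Hypothetical relaxation: each query token keeps its k best document tokens.

    With k = 1 this reduces to sum_of_max.
    """
    total = 0.0
    for q in query_vecs:
        scores = sorted((dot(q, d) for d in doc_vecs), reverse=True)
        total += sum(scores[:k])
    return total


# Toy example: 2 query tokens and 3 document tokens in a 2-dimensional space.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

score_max = sum_of_max(query, document)          # 1.0 + 2.0
score_top2 = sum_of_top_k(query, document, k=2)  # (1.0 + 0.5) + (2.0 + 0.5)
```

The sketch also shows where the cost comes from: every query token is compared against every stored document token, so both the index size and the scoring work grow with the number of token vectors kept per document.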

