LEXMAE: LEXICON-BOTTLENECKED PRETRAINING FOR LARGE-SCALE RETRIEVAL

Abstract

In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although this paradigm deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval: the former prefers certain, low-entropy words, whereas the latter favors pivot, high-entropy words. This gap becomes the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge it, we propose a brand-new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we insert a lexicon-bottlenecked module between a normal language-modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS MARCO, it achieves 42.6% MRR@10 with 45.8 QPS on the passage dataset and 44.4% MRR@100 with 134.8 QPS on the document dataset, on a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.

1. INTRODUCTION

Large-scale retrieval, also known as first-stage retrieval (Cai et al., 2021), aims to fetch the top query-relevant documents from a huge collection. In addition to its indispensable role in dialogue systems (Zhao et al., 2020), question answering (Karpukhin et al., 2020), search engines, etc., it has also been surging in recent cutting-edge topics, e.g., retrieval-augmented generation (Lewis et al., 2020) and retrieval-augmented language modeling (Guu et al., 2020). As a collection contains millions to billions of documents, efficiency is the most fundamental prerequisite for large-scale retrieval. To this end, query-agnostic document representations (i.e., indexing the collection independently of any query) and lightweight relevance metrics (e.g., cosine similarity, dot product) have become common practice, usually achieved by a two-tower structure (Reimers & Gurevych, 2019), a.k.a. a bi-encoder or dual-encoder, in the representation-learning literature. Besides the prevalent 'dense-vector retrieval' paradigm, which encodes both queries and documents in the same low-dimensional, real-valued latent semantic space (Karpukhin et al., 2020), another paradigm, 'lexicon-weighting retrieval', leverages weighted sparse representations in vocabulary space (Formal et al., 2021a; Shen et al., 2022). It learns to represent queries and documents with a few lexicons from the vocabulary, each assigned an importance weight, sharing a high-level inspiration with BM25 but differing in that the (compressed and expanded) lexicons and their importance weights are learned end-to-end.
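The lexicon-weighting scoring described above can be sketched as follows. This is a minimal illustration, not the paper's method: the term weights here are hand-picked for the example, whereas SPLADE/LexMAE-style models learn them end-to-end, and real systems store such sparse vectors in an inverted index rather than Python dicts. Note how the document representation can be expanded with a lexicon ("automobile") absent from its surface text, mitigating lexicon mismatch.

```python
def lex_score(query_weights, doc_weights):
    """Dot product over the shared vocabulary space; only lexicons
    present in both sparse representations contribute to the score."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())


# Sparse representations: a few vocabulary terms, each with an
# importance weight (hand-picked here purely for illustration).
query = {"car": 1.8, "engine": 0.9}
doc_a = {"car": 1.2, "automobile": 0.7, "engine": 1.1}  # relevant; expanded term
doc_b = {"bank": 1.5, "river": 0.8}                     # no lexicon overlap

# Rank documents by their lexicon-weighted relevance to the query.
scores = sorted([("doc_a", lex_score(query, doc_a)),
                 ("doc_b", lex_score(query, doc_b))],
                key=lambda x: x[1], reverse=True)
print(scores)
```

Because the representations are sparse, this dot product touches only the few overlapping lexicons, which is what makes inverted-index retrieval at this scale fast on CPU.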
Although learning representations in such a high-dimensional vocabulary space seems intractable with limited human-annotated query-document pairs, recent pre-trained language modeling (PLM), especially masked language modeling (MLM), facilitates transferring context-aware lexicon-coordinate knowledge into lexicon-weighting retrieval by fine-tuning the PLM on the annotated pairs (Formal et al., 2021b;a; Shen et al., 2022). Here, coordinate terms (rich in synonyms and related concepts) are highly relevant to relevance-centric tasks and mitigate the lexicon-mismatch problem (Cai et al., 2021), leading to superior retrieval quality.

