LEXMAE: LEXICON-BOTTLENECKED PRETRAINING FOR LARGE-SCALE RETRIEVAL

Abstract

In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval: the former prefers certain (low-entropy) words, whereas the latter favors pivot (high-entropy) words. This gap has become the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge it, we propose a brand-new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we place a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS-Marco, it achieves 42.6% MRR@10 at 45.8 QPS on the passage dataset and 44.4% MRR@100 at 134.8 QPS on the document dataset, on a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark of 12 datasets.

1. INTRODUCTION

Large-scale retrieval, also known as first-stage retrieval (Cai et al., 2021), aims to fetch the top query-relevant documents from a huge collection. In addition to its indispensable role in dialogue systems (Zhao et al., 2020), question answering (Karpukhin et al., 2020), search engines, etc., it has also become central to recent cutting-edge topics, e.g., retrieval-augmented generation (Lewis et al., 2020) and retrieval-augmented language modeling (Guu et al., 2020). As a collection contains millions to billions of documents, efficiency is the most fundamental prerequisite for large-scale retrieval. To this end, query-agnostic document representations (i.e., indexing the collection independently) and lightweight relevance metrics (e.g., cosine similarity, dot-product) have become the common practices to meet this prerequisite, usually realized by a two-tower structure (Reimers & Gurevych, 2019), a.k.a. bi-encoder or dual-encoder, in the representation learning literature. Besides the prevalent 'dense-vector retrieval' paradigm, which encodes both queries and documents in the same low-dimensional, real-valued latent semantic space (Karpukhin et al., 2020), another retrieval paradigm, 'lexicon-weighting retrieval', leverages weighted sparse representations in vocabulary space (Formal et al., 2021a; Shen et al., 2022). It learns to represent queries and documents with a few weighted lexicons from the vocabulary, sharing a high-level inspiration with BM25 but differing in that the (compressed and expanded) lexicons and their importance weights are learned in an end-to-end manner.
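The lexicon-weighting scoring described above can be sketched as follows. This is a minimal illustration with a hypothetical five-word vocabulary and hand-picked weights, not the paper's learned model: queries and documents become sparse weight vectors over the vocabulary, and relevance is simply their dot product, which is why an inverted index can serve such retrievers efficiently.

```python
import numpy as np

# Hypothetical tiny vocabulary (a real model uses a PLM's full vocabulary).
VOCAB = {"neural": 0, "retrieval": 1, "the": 2, "search": 3, "model": 4}

def sparse_rep(term_weights):
    """Build a sparse vocabulary-space vector from {term: weight} pairs."""
    v = np.zeros(len(VOCAB))
    for term, w in term_weights.items():
        v[VOCAB[term]] = w
    return v

# A learned model would emphasize pivot terms and may expand to related
# lexicons (e.g., adding "search" for a document about retrieval).
query = sparse_rep({"neural": 1.2, "retrieval": 1.8})
doc_a = sparse_rep({"retrieval": 1.5, "search": 0.9, "the": 0.1})
doc_b = sparse_rep({"model": 1.1, "the": 0.2})

score_a = float(query @ doc_a)  # overlaps with the query on "retrieval"
score_b = float(query @ doc_b)  # no lexicon overlap with the query
assert score_a > score_b
```

Because only a few vocabulary entries are non-zero, documents sharing no weighted lexicon with the query contribute nothing, which is what makes inverted-index lookup applicable.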
Although learning representations in such a high-dimensional vocabulary space seems intractable with limited human-annotated query-document pairs, pre-trained language models (PLMs), especially those trained with masked language modeling (MLM), facilitate transferring context-aware lexicon-coordinate knowledge into lexicon-weighting retrieval by fine-tuning the PLM on the annotated pairs (Formal et al., 2021b;a; Shen et al., 2022). Here, coordinate terms (rich in synonyms and related concepts) are highly relevant to relevance-centric tasks and mitigate the lexicon-mismatch problem (Cai et al., 2021), leading to superior retrieval quality. Due to the pretraining-finetuning consistency of sharing the same output vocabulary space, lexicon-based retrieval methods can fully leverage a PLM, including its masked language modeling (MLM) head, leading to better search quality (e.g., ∼1.0% MRR@10 improvement over dense-vector methods when fine-tuning the same PLM initialization (Formal et al., 2021a; Hofstätter et al., 2020)). Meanwhile, thanks to their high-dimensional, sparsity-controllable representations (Yang et al., 2021; Lassance & Clinchant, 2022), these methods usually enjoy higher retrieval efficiency than dense-vector ones (e.g., 10× faster at identical performance in our experiments). Nonetheless, there remains a subtle yet crucial gap between the pre-training language modeling objective and the downstream lexicon-weighting objective. That is, MLM (Devlin et al., 2019) aims to recover a word given its context, so it tends to assign high scores to certain (i.e., low-entropy) words, but such words are most likely articles, prepositions, etc., or parts of collocations and common phrases. Language modeling therefore conflicts with lexicon-weighting representation for relevance purposes, which focuses more on the high-entropy words (e.g., subjects, predicates, objects, modifiers) that are essential to the semantics of a query or document.
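The low- vs. high-entropy distinction can be made concrete with a toy computation. The probabilities below are hypothetical, not actual MLM outputs: a masked function word such as "the" is nearly deterministic given its context (low Shannon entropy), while a masked content word spreads probability over many plausible candidates (high entropy).

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log 0 is defined as 0
    return float(-(p * np.log2(p)).sum())

# "[MASK] cat sat on ..." -> the article is easy to predict (low entropy).
p_function_word = [0.95, 0.03, 0.01, 0.01]
# "the [MASK] sat on ..." -> many content words fit (high entropy).
p_content_word = [0.30, 0.25, 0.20, 0.15, 0.10]

assert entropy(p_function_word) < entropy(p_content_word)
```

MLM training rewards confident predictions for the easy (low-entropy) slots, whereas a relevance-oriented representation should weight the uncertain, semantics-bearing slots; this is exactly the mismatch the paragraph above describes.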
This gap explains two observations from our experiments when fine-tuning under the lexicon-weighting retrieval paradigm (Formal et al., 2021a): a moderate PLM (i.e., DistilBERT) can outperform a relatively larger one (i.e., BERT-base), and a well-trained PLM (e.g., RoBERTa) may even fail to converge. To mitigate the gap, in this work we propose a brand-new pre-training framework, dubbed lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations as transferable knowledge for large-scale lexicon-weighting retrieval. Basically, LexMAE pre-trains a language modeling encoder to produce document-specific lexicon-importance distributions over the whole vocabulary, reflecting each lexicon's contribution to document reconstruction. Motivated by recent dense bottleneck-enhanced pre-training (Gao & Callan, 2022; Liu & Shao, 2022; Wang et al., 2022), we propose to learn the lexicon-importance distributions in an unsupervised fashion by constructing continuous bag-of-words (CBoW) bottlenecks upon them. The LexMAE pre-training architecture thus consists of three components: i) a language modeling encoder (as in most other PLMs, e.g., BERT, RoBERTa), ii) a lexicon-bottlenecked module, and iii) a weakened masking-style decoder. Specifically, a mask-corrupted document from the collection is passed into the language modeling encoder to produce token-level LM logits in the vocabulary space. Besides an MLM objective for generic representation learning, max-pooling followed by a normalization function is applied to the LM logits to derive a lexicon-importance distribution. To learn such a distribution without supervision, the lexicon-bottlenecked module uses it as the weights to produce a CBoW dense bottleneck, while the weakened masking-style decoder is asked to reconstruct the aggressively masked document from the bottleneck.
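The bottleneck construction just described can be sketched in a few lines of numpy. This is a shape-level illustration only: the tensor sizes, random logits, and the choice of softmax as the normalization function are assumptions for the sketch, and the real model derives the logits from a transformer encoder and trains a decoder (omitted here) on the reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab_size, hidden = 6, 50, 8  # toy sizes, assumed for the sketch

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# i) The encoder produces token-level LM logits over the vocabulary
#    for a mask-corrupted document (random stand-ins here).
lm_logits = rng.normal(size=(seq_len, vocab_size))

# ii) Max-pool over the sequence dimension, then normalize, yielding one
#     lexicon-importance distribution for the whole document.
importance = softmax(lm_logits.max(axis=0))        # shape: (vocab_size,)

# iii) The continuous bag-of-words (CBoW) bottleneck is the
#      importance-weighted sum of word embeddings; the weakened,
#      aggressively-masked decoder must reconstruct the document
#      from this single dense vector.
word_embeddings = rng.normal(size=(vocab_size, hidden))
cbow_bottleneck = importance @ word_embeddings     # shape: (hidden,)

assert np.isclose(importance.sum(), 1.0)
assert cbow_bottleneck.shape == (hidden,)
```

Because the decoder can only see the document through `cbow_bottleneck`, gradient pressure flows back into `importance`, pushing weight toward the lexicons most useful for reconstruction.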
Given the shallow decoder and its aggressive masking, the decoder in LexMAE can only recover the masked tokens on the basis of the CBoW bottleneck, so the LexMAE encoder learns to assign higher importance scores to the essential vocabulary lexicons of the masked document and lower ones to trivial lexicons. This closely aligns with the target of the lexicon-weighting retrieval paradigm and boosts its performance. After pre-training LexMAE on large-scale collections, we fine-tune its language modeling encoder into a lexicon-weighting retriever, improving the previous state-of-the-art performance by 1.5% MRR@10 with a ∼13× speed-up on the ad-hoc passage retrieval benchmark. Meanwhile, LexMAE also delivers new state-of-the-art results (44.4% MRR@100) on the ad-hoc document retrieval benchmark. Lastly, LexMAE shows strong zero-shot transfer capability and achieves state-of-the-art performance on the BEIR benchmark of 12 datasets, e.g., Natural Questions, HotpotQA, and FEVER.¹

2. RELATED WORK

PLM-based Dense-vector Retrieval. Recently, pre-trained language models (PLMs), e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2021b), have proven generic and effective when transferred to a broad spectrum of downstream tasks via fine-tuning. When transferring PLMs to large-scale retrieval, the ubiquitous paradigm is 'dense-vector retrieval' (Xiong et al., 2021): encoding both queries and documents in the same low-dimensional semantic space and then calculating query-document relevance scores on the basis of spatial distance. However, dense-vector retrieval methods suffer from the objective gap between lexicon-recovering



¹We released our code and models at https://github.com/taoshen58/LexMAE.

