LEXMAE: LEXICON-BOTTLENECKED PRETRAINING FOR LARGE-SCALE RETRIEVAL

Abstract

In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval: the former prefers certain (low-entropy) words, whereas the latter favors pivot (high-entropy) words. This gap has become the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge it, we propose a brand-new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we place a lexicon-bottlenecked module between a normal language modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS-Marco, it achieves 42.6% MRR@10 with 45.8 QPS on the passage dataset and 44.4% MRR@100 with 134.8 QPS on the document dataset, on a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.

1. INTRODUCTION

Large-scale retrieval, also known as first-stage retrieval (Cai et al., 2021), aims to fetch the top query-relevant documents from a huge collection. In addition to its indispensable roles in dialogue systems (Zhao et al., 2020), question answering (Karpukhin et al., 2020), search engines, etc., it has also been surging in recent cutting-edge topics, e.g., retrieval-augmented generation (Lewis et al., 2020) and retrieval-augmented language modeling (Guu et al., 2020). As there are millions to billions of documents in a collection, efficiency is the most fundamental prerequisite for large-scale retrieval. To this end, query-agnostic document representations (i.e., indexing the collection independently) and lightweight relevance metrics (e.g., cosine similarity, dot-product) have become the common practices to meet this prerequisite, usually achieved by a two-tower structure (Reimers & Gurevych, 2019), a.k.a. a bi-encoder or dual-encoder, in the representation learning literature. Besides the prevalent 'dense-vector retrieval' paradigm that encodes both queries and documents in the same low-dimensional, real-valued latent semantic space (Karpukhin et al., 2020), another retrieval paradigm, 'lexicon-weighting retrieval', leverages weighted sparse representations in vocabulary space (Formal et al., 2021a; Shen et al., 2022). It learns to use a few lexicons from the vocabulary and assign them weights to represent queries and documents, sharing a high-level inspiration with BM25 but differing in that the lexicons are dynamic (with compression and expansion) and their importance weights are learned in an end-to-end manner.
Although learning representations in such a high-dimensional vocabulary space seems intractable with limited human-annotated query-document pairs, the recent surge of pre-trained language models (PLMs), especially those based on masked language modeling (MLM), facilitates transferring context-aware lexicon-coordinate knowledge into lexicon-weighting retrieval by fine-tuning the PLM on the annotated pairs (Formal et al., 2021b;a; Shen et al., 2022). Here, coordinate terms (full of synonyms and concepts) are highly related to relevance-centric tasks and mitigate the lexicon mismatch problem (Cai et al., 2021), leading to superior retrieval quality. Due to the pretraining-finetuning consistency of sharing the same output vocabulary space, lexicon-based retrieval methods can fully leverage a PLM, including its masked language modeling (MLM) head, leading to better search quality (e.g., ~1.0% MRR@10 improvement over dense-vector methods when fine-tuning the same PLM initialization (Formal et al., 2021a; Hofstätter et al., 2020)). Meanwhile, attributed to the high-dimensional sparse-controllable representations (Yang et al., 2021; Lassance & Clinchant, 2022), these methods usually enjoy higher retrieval efficiency than dense-vector ones (e.g., 10x faster with identical performance in our experiments). Nonetheless, there still exists a subtle yet crucial gap between the pre-training language modeling objective and the downstream lexicon-weighting objective. That is, MLM (Devlin et al., 2019) aims to recover a word given its contexts, so it tends to assign high scores to certain (i.e., low-entropy) words, but these words are most likely to be articles, prepositions, etc., or to belong to collocations or common phrases. Therefore, language modeling is in conflict with lexicon-weighting representation for relevance purposes, where the latter focuses more on the high-entropy words (e.g., subject, predicate, object, modifiers) that are essential to the semantics of a query or document.
These observations explain why, in our experiments under the lexicon-weighting retrieval paradigm (Formal et al., 2021a), a moderate PLM (i.e., DistilBERT) can even outperform a relatively large one (i.e., BERT-base), and why a well-trained PLM (e.g., RoBERTa) cannot even converge. To mitigate this gap, in this work we propose a brand-new pre-training framework, dubbed the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations as transferable knowledge for large-scale lexicon-weighting retrieval. Basically, LexMAE pre-trains a language modeling encoder to produce document-specific lexicon-importance distributions over the whole vocabulary, reflecting each lexicon's contribution to document reconstruction. Motivated by recent dense bottleneck-enhanced pre-training (Gao & Callan, 2022; Liu & Shao, 2022; Wang et al., 2022), we propose to learn the lexicon-importance distributions in an unsupervised fashion by constructing continuous bag-of-words (CBoW) bottlenecks upon the distributions. Thereby, the LexMAE pre-training architecture consists of three components: i) a language modeling encoder (as in most other PLMs, e.g., BERT, RoBERTa), ii) a lexicon-bottlenecked module, and iii) a weakened masking-style decoder. Specifically, a mask-corrupted document from the collection is passed into the language modeling encoder to produce token-level LM logits in the vocabulary space. Besides an MLM objective for generic representation learning, a max-pooling followed by a normalization function is applied to the LM logits to derive a lexicon-importance distribution. To learn such a distribution unsupervisedly, the lexicon-bottlenecked module uses it as the weights to produce a CBoW dense bottleneck, while the weakened masking-style decoder is asked to reconstruct the aggressively masked document from the bottleneck.
Considering the shallow decoder and its aggressive masking, the decoder in LexMAE is prone to recover the masked tokens on the basis of the CBoW bottleneck, and thus the LexMAE encoder assigns higher importance scores to essential vocabulary lexicons of the masked document and lower scores to trivial ones. This closely aligns with the target of the lexicon-weighting retrieval paradigm and boosts its performance. After pre-training LexMAE on large-scale collections, we fine-tune its language modeling encoder into a lexicon-weighting retriever, improving the previous state-of-the-art performance by 1.5% MRR@10 with a ~13x speed-up on the ad-hoc passage retrieval benchmark. Meanwhile, LexMAE also delivers new state-of-the-art results (44.4% MRR@100) on the ad-hoc document retrieval benchmark. Lastly, LexMAE shows great zero-shot transfer capability and achieves state-of-the-art performance on the BEIR benchmark with 12 datasets, e.g., Natural Questions, HotpotQA, and FEVER.

2. RELATED WORK

PLM-based Dense-vector Retrieval. Recently, pre-trained language models (PLMs), e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2021b), have proven generic and effective when transferred to a broad spectrum of downstream tasks via fine-tuning. When transferring PLMs to large-scale retrieval, a ubiquitous paradigm is 'dense-vector retrieval' (Xiong et al., 2021): encoding both queries and documents in the same low-dimensional semantic space and then calculating query-document relevance scores on the basis of spatial distance. However, dense-vector retrieval methods suffer from the objective gap between lexicon-recovering language model pre-training and document-compressing dense-vector fine-tuning. Although natural remedies have been dedicated to this gap, by constructing pseudo query-document pairs (Lee et al., 2019; Chang et al., 2020; Gao & Callan, 2022; Zhou et al., 2022a) or/and enhancing the bottleneck dense representation (Lu et al., 2021; Gao & Callan, 2021; 2022; Wang et al., 2022; Liu & Shao, 2022), these methods are still limited by their intrinsic representing manner: dense vectors lead to a large index size and high retrieval latency, and applying speed-up algorithms such as product quantization (Zhan et al., 2022) results in dramatic performance drops (e.g., -3% ~ 4% by Xiao et al. (2022)).

Lexicon-weighting Retrieval. In contrast to the almost unlearnable BM25, lexicon-weighting retrieval methods, which operate on lexicon weights produced by a neural model, are proposed to exploit language models for term-based retrieval (Nogueira et al., 2019b;a; Formal et al., 2021b;a; 2022). According to the type of language model, there are two lines of work. Based on causal language models (CLMs) (Radford et al.; Raffel et al., 2020), Nogueira et al. (2019a) use the co-occurrence between a document and a query for lexicon-based sparse representation expansion.
Meanwhile, based on masked language models (MLMs) (Devlin et al., 2019; Liu et al., 2019), Formal et al. (2021b) couple the original words with top coordinate terms (full of synonyms and concepts) from the pre-trained MLM head. However, these works directly fine-tune the pre-trained language models, regardless of the objective mismatch between general language modeling and relevance-oriented lexicon weighting.

3. OVERVIEW OF LEXMAE PRETRAINING

As illustrated in Figure 1, our lexicon-bottlenecked masked autoencoder (LexMAE) contains one encoder and one decoder with masked inputs, in line with the masked autoencoder (MAE) family (He et al., 2021a; Liu & Shao, 2022), while being equipped with a lexicon-bottlenecked module for document-specific lexicon-importance learning. Given a piece of free-form text, x, from a large-scale collection, D, we aim to pre-train a language modeling encoder, θ^(enc), that represents x with weighted lexicons in the vocabulary space, i.e., a ∈ [0, 1]^|V|, where V denotes the whole vocabulary. Here, each a_i = P(w = w_i | x; θ^(enc)) with w_i ∈ V denotes the importance of the lexicon w_i to the whole text x. To learn the distribution a for x in an unsupervised fashion, an additional decoder, θ^(dec), is asked to reconstruct x based on a.

3.1. LANGUAGE MODELING ENCODER

Identical to most previous language modeling encoders, e.g., BERT (Devlin et al., 2019), the language modeling encoder, θ^(enc), in LexMAE is composed of three parts: a word embedding module mapping the discrete tokens of x to dense vectors, a multi-layer Transformer (Vaswani et al., 2017) for deep contextualization, and a language modeling head mapping back to the vocabulary space R^|V|. First, following the common practice of pre-training the encoder unsupervisedly, a masked language modeling (MLM) objective is employed to pre-train θ^(enc). Formally, given a piece of text x ∈ D, a certain percentage (α%) of the tokens in x are masked to obtain \bar{x}, in which 80% are replaced with a special token [MASK], 10% are replaced with a random token in V, and the remainder are kept unchanged (Devlin et al., 2019). Then, the masked \bar{x} is fed into the language modeling encoder, θ^(enc), i.e.,

S^{(enc)} = \text{Transformer-LM}(\bar{x}; \theta^{(enc)}) \in \mathbb{R}^{|V| \times n},   (1)

where S^(enc) denotes the LM logits. Lastly, the MLM objective is to minimize the following loss:

\mathcal{L}^{(elm)} = -\sum_{x \in D} \sum_{j \in M^{(enc)}} \log P(w_j = x_j \mid \bar{x}; \theta^{(enc)}), \quad \text{where } P(w_j) := \mathrm{softmax}(S^{(enc)}_{:,j}),   (2)

where M^(enc) denotes the set of masked indices of the tokens in \bar{x}, w_j denotes the discrete variable over V at the j-th position of \bar{x}, and x_j is its original token (i.e., the gold label of the MLM objective).
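The encoder-side MLM loss of Eq. (2) can be sketched as follows; this is a minimal NumPy illustration with random logits standing in for the Transformer, and all variable names are ours rather than from the released code:

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along a given axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mlm_loss(S_enc, gold_ids, masked_idx):
    """Average NLL of Eq. (2): S_enc is the |V| x n LM logit matrix,
    gold_ids holds the original token ids of x, and masked_idx is the
    set of masked positions M^(enc)."""
    P = softmax(S_enc, axis=0)  # P(w_j) per position j
    nll = [-np.log(P[gold_ids[j], j]) for j in masked_idx]
    return float(np.mean(nll))

rng = np.random.default_rng(0)
V, n = 30522, 6  # BERT-sized vocabulary, toy sequence length
S = rng.normal(size=(V, n))
gold = rng.integers(0, V, size=n)
loss = mlm_loss(S, gold, masked_idx=[1, 3])  # ~30% of tokens masked in practice
```

With random logits over a 30k vocabulary, the per-position loss sits near log |V| ≈ 10.3, which matches the intuition that an untrained model is maximally uncertain.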

3.2. LEXICON-BOTTLENECKED MODULE

Given the token-level logits from Eq. (1) defined over V, we calculate a lexicon-importance distribution by

a := P(w \mid x; \theta^{(enc)}) = \mathrm{Normalize}(\text{Max-Pool}(S^{(enc)})) \in [0, 1]^{|V|},   (3)

where Max-Pool(·) pools along the sequence axis, which has been proven more effective than mean-pooling for lexicon representation (Formal et al., 2021a), and Normalize(·) is a normalization function (s.t. \sum_i a_i = 1), for which we simply take softmax(·) in our main experiments. P(w|x; θ^(enc)) is the lexicon-importance distribution over V, indicating which lexicons in V are relatively important to x. The main obstacle to learning the lexicon-importance distribution P(w|x; θ^(enc)) is that we do not have any general-purpose supervised signals. Inspired by recent bottleneck-enhanced dense representation learning (Gao & Callan, 2022; Liu & Shao, 2022; Wang et al., 2022), we propose to leverage the lexicon-importance distribution as a clue for reconstructing x. As such, our language modeling encoder becomes prone to focus more on the pivot or essential tokens/words in x. However, it is intractable to directly regard the high-dimensional distribution vector a ∈ [0, 1]^|V| as a bottleneck since i) the distribution over the whole V has enough capacity to hold most semantics of x (Yang et al., 2018), making the bottleneck less effective, and ii) the high-dimensional vector can hardly be fed into a decoder for representation learning and text reconstruction. Therefore, we further propose to construct a continuous bag-of-words (CBoW) bottleneck following the lexicon-importance distribution P(w|x; θ^(enc)) derived from Eq. (3). That is,

b := \mathbb{E}_{w_i \sim P(w|x;\theta^{(enc)})}[e^{(w_i)}] = W^{(we)} a.   (4)

Here, W^{(we)} = [e^{(w_1)}, e^{(w_2)}, \ldots] \in \mathbb{R}^{d \times |V|} denotes the learnable word embedding matrix in the parameters θ^(enc) of the language modeling encoder, where d denotes the embedding size and e^{(w_i)} \in \mathbb{R}^d is the word embedding of the lexicon w_i.
Thereby, b ∈ R^d stands for the dense-vector CBoW bottleneck, upon which a decoder (detailed in the next sub-section) is asked to reconstruct the original x. Remark. As mentioned in the Introduction, there exists a conflict between the MLM and lexicon-importance objectives, yet we still apply an MLM objective to our encoder. This is because i) the MLM objective serves as a regularization term ensuring that the original tokens in x receive relatively high scores in contrast to their coordinate terms, and ii) the token-level noise introduced by the MLM task has been proven effective in robust learning.
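Eqs. (3)-(4) amount to a max-pool, a softmax, and an importance-weighted average of word embeddings. A minimal NumPy sketch (toy sizes; in the actual model the gradient through W^(we) is cut at this point, which plain NumPy does not model):

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def cbow_bottleneck(S_enc, W_we):
    """Eqs. (3)-(4): max-pool the |V| x n token-level logits over the
    sequence axis, normalize into a lexicon-importance distribution a,
    then take the importance-weighted average of word embeddings as the
    dense CBoW bottleneck b."""
    a = softmax(S_enc.max(axis=1))  # [|V|], non-negative, sums to 1
    b = W_we @ a                    # [d], continuous bag-of-words vector
    return a, b

rng = np.random.default_rng(0)
V, n, d = 1000, 8, 16
S = rng.normal(size=(V, n))   # stand-in for the encoder's LM logits
W = rng.normal(size=(d, V))   # stand-in for W^(we)
a, b = cbow_bottleneck(S, W)
```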

3.3. WEAKENED MASKING-STYLE DECODER

Lastly, to instruct the bottleneck representation b and consequently learn the lexicon-importance distribution P(w|x; θ^(enc)), we leverage a decoder to reconstruct x from b. In line with recent bottleneck-enhanced neural structures (Gao & Callan, 2022; Wang et al., 2022), we employ a weakened masking-style decoder parameterized by θ^(dec), which pushes the decoder to rely heavily on the bottleneck representation. It is noteworthy that the 'weakening' is two-fold: i) an aggressive masking strategy and ii) shallow Transformer layers (say, two layers). In particular, given the masked input at the encoder side, \bar{x}, we first apply an extra β% masking operation, resulting in \tilde{x}. That is, the decoder is required to recover all the masked tokens that are also absent at the encoder side, which prompts the encoder to compress rich contextual information into the bottleneck. Then, we prefix \tilde{x} with the bottleneck representation b, i.e., replacing the special token [CLS] with the bottleneck. Therefore, our weakened masking-style decoding with a Transformer-based language modeling decoder can be formulated as

S^{(dec)} = \text{Transformer-LM}(b, \tilde{x}; \theta^{(dec)}) \in \mathbb{R}^{|V| \times n}.   (5)

Lastly, similar to the MLM at the encoder side, the loss function is defined as

\mathcal{L}^{(dlm)} = -\sum_{x \in D} \sum_{j \in M^{(dec)}} \log P(w_j = x_j \mid \tilde{x}; \theta^{(dec)}), \quad \text{where } P(w_j) := \mathrm{softmax}(S^{(dec)}_{:,j}),   (6)

where M^(dec) denotes the set of masked indices of the tokens in the decoder's input, \tilde{x}.
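The decoder-side input construction, i.e., the extra β% 'inclusive' masking on top of the encoder's input (the bottleneck b is then prepended in place of [CLS]), can be sketched as follows. The token ids and the choice of masking only special-token-free positions are illustrative, not from the released code:

```python
import numpy as np

MASK_ID = 103  # BERT's [MASK] id; illustrative constant

def weaken_input(x_bar, enc_masked, beta_extra, rng):
    """'Inclusive' weakened masking: starting from the encoder's masked
    input x_bar, mask an extra beta_extra fraction of the still-visible
    tokens, so every token hidden from the encoder stays hidden from the
    decoder. Returns the decoder input x_tilde."""
    x_tilde = list(x_bar)
    visible = [j for j in range(len(x_tilde)) if j not in enc_masked]
    n_extra = int(round(beta_extra * len(x_tilde)))
    for j in rng.choice(visible, size=min(n_extra, len(visible)), replace=False):
        x_tilde[j] = MASK_ID
    return x_tilde

rng = np.random.default_rng(0)
x_bar = [101, 2054, 103, 2003, 103, 1996, 3437, 102]  # toy ids, 2/8 masked
x_tilde = weaken_input(x_bar, enc_masked={2, 4}, beta_extra=0.25, rng=rng)
extra = sum(1 for u, v in zip(x_bar, x_tilde) if u != v)
```

Here 25% extra masking on an 8-token input hides two more tokens, while the two encoder-side masks remain masked for the decoder.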

3.4. PRE-TRAINING OBJECTIVE & FINE-TUNING FOR LEXICON-WEIGHTING RETRIEVER

The final loss of pre-training LexMAE is the sum of the losses defined in Eq. (2) and Eq. (6), i.e.,

\mathcal{L}^{(lm)} = \mathcal{L}^{(elm)} + \mathcal{L}^{(dlm)}.   (7)

Meanwhile, we tie all word embedding matrices in our LexMAE pre-training architecture, including the word embedding modules and language modeling heads of both the encoder & decoder, as well as W^{(we)} in Eq. (4). It is noteworthy that we cut off the gradient back-propagation to W^{(we)} in Eq. (4) so that training focuses only on the lexicon-importance distribution P(w|x; θ^(enc)) rather than on W^{(we)}.

Task Definition of Downstream Large-scale Retrieval. Given a collection containing a number of documents, i.e., D = \{d_i\}_{i=1}^{|D|}, and a query q, a retriever aims to fetch a list of text pieces D_q containing all relevant ones. Generally, this is based on a relevance score between q and every document d_i in a Siamese manner, i.e., <Enc(q), Enc(d_i)>, where Enc is an arbitrary representation model (e.g., a neural encoder) and <·,·> denotes a lightweight relevance metric (e.g., dot-product). To transfer LexMAE to large-scale retrieval, we discard its decoder and fine-tune only the language modeling encoder as the lexicon-weighting retriever. Basically, to leverage a language modeling encoder for lexicon-weighting representations, we follow (Formal et al., 2021a) and represent a piece of text, x, in the high-dimensional vocabulary space by

v^{(x)} = \log(1 + \text{Max-Pool}(\max(\text{Transformer-LM}(x; \theta^{(enc)}), 0))) \in \mathbb{R}_{\ge 0}^{|V|},   (8)

where max(·, 0) ensures all values are greater than or equal to zero for the upcoming sparsity requirements, and the saturation function log(1 + Max-Pool(·)) prevents some terms from dominating. In contrast to a classification task, retrieval tasks are formulated as contrastive learning problems. That is, only a limited number of positive documents, d^+, are provided for a query q, so we need to sample a set of negative documents, N^{(q)} = \{d^{(q)-}, \ldots\}, from D for each q.
We will dive into various negative sampling strategies for N^{(q)} in §A. Note that, when no confusion arises, we omit the superscript (q), which indicates 'query-specific', for clarity. Following Shen et al. (2022), we first derive a likelihood distribution over the positive \{d^+\} and negative N documents, i.e.,

p := P(d \mid q, \{d^+\} \cup N; \theta^{(enc)}) = \frac{\exp(v^{(q)\top} v^{(d)})}{\sum_{d' \in \{d^+\} \cup N} \exp(v^{(q)\top} v^{(d')})}, \quad \forall d \in \{d^+\} \cup N,   (9)

where v^{(\cdot)} \in \mathbb{R}_{\ge 0}^{|V|}, derived from Eq. (8), denotes the lexicon-weighting representation of a query q or a document d. Then, the loss function of the contrastive learning towards this retrieval task is defined as

\mathcal{L}^{(r)} = \sum_q -\log P(d = d^+ \mid q, \{d^+\} \cup N; \theta^{(enc)}) + \lambda\, \text{FLOPS} = \sum_q -\log p_{[d=d^+]} + \lambda\, \text{FLOPS},   (10)

where FLOPS(·) denotes a regularization term for representation sparsity (Paria et al., 2020), first introduced to this setting by Formal et al. (2021b), and λ denotes its loss weight. Note that, to train a competitive retriever, we adapt the fine-tuning pipeline of (Wang et al., 2022), which consists of three stages (please refer to §A & §B for our training pipeline and inference details).

Top-K Sparsifying. Attributed to its inherent flexibility, we can adjust the sparsity of the lexicon-weighting representations of the documents to achieve a targeted efficacy-efficiency trade-off. Here, 'sparsity' denotes how many lexicons in the vocabulary are used to represent each document. Previous methods either tune the sparse regularization strength (Formal et al., 2021a; 2022) (e.g., λ in Eq. (10)) or introduce other sparsity hyperparameters (Yang et al., 2021; Lassance & Clinchant, 2022) (e.g., the number of activated lexicons), which causes heavy fine-tuning overheads. Hence, we present a simple but effective sparsifying method, which is only applied during document embedding in the inference phase and thus requires almost zero extra overhead.
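Eqs. (8)-(10) can be sketched end-to-end in NumPy; random logits stand in for the fine-tuned encoder, and the FLOPS regularizer is omitted for brevity:

```python
import numpy as np

def lex_repr(S_enc):
    """Eq. (8): v(x) = log(1 + max-pool(ReLU(LM logits))), a non-negative,
    sparsity-friendly vector over the vocabulary."""
    return np.log1p(np.maximum(S_enc, 0.0).max(axis=1))

def contrastive_loss(v_q, v_pos, v_negs):
    """Eqs. (9)-(10) without the FLOPS term: softmax over dot-product
    relevance scores of the positive and the sampled negatives."""
    scores = np.array([v_q @ v_pos] + [v_q @ v_n for v_n in v_negs])
    scores -= scores.max()  # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return -float(np.log(p[0]))  # positive document sits at index 0

rng = np.random.default_rng(0)
V, n = 500, 12
v_q = lex_repr(rng.normal(size=(V, n)))
v_d = lex_repr(rng.normal(size=(V, n)))
negs = [lex_repr(rng.normal(size=(V, n))) for _ in range(3)]
loss = contrastive_loss(v_q, v_d, negs)
```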
It only keeps the top-K weighted lexicons in the representation v^{(d)} \in \mathbb{R}_{\ge 0}^{|V|} from Eq. (8), while removing the others by assigning zero weights (see §D for details). We dive into empirical efficacy-efficiency analyses in §4.2.

4. EXPERIMENT

Benchmark Datasets. Following Formal et al. (2021a), we first employ the widely-used passage retrieval dataset, MS-Marco (Nguyen et al., 2016), and only leverage its official queries. Besides its Dev set, we also evaluate on the TREC Deep Learning 2019 set (Craswell et al., 2020) and the TREC Deep Learning 2020 set (Craswell et al., 2021). Besides, we evaluate the zero-shot transferability of our model on the BEIR benchmark (Thakur et al., 2021). We employ twelve datasets covering semantic relatedness and relevance-based retrieval tasks (i.e., TREC-COVID, NFCorpus, Natural Questions, HotpotQA, FiQA, ArguAna, Touché-2020, DBPedia, Scidocs, FEVER, Climate-FEVER, and SciFact) from the BEIR benchmark, as they are widely used across most previous retrieval works. Lastly, to check whether our LexMAE is also compatible with long-context retrieval, we conduct document retrieval evaluations on MS-Marco Doc Dev. Note that, unless otherwise specified in the remaining analysis sections, numbers are reported on the MS-Marco passage Dev set.

[Table 1: Results on MS-Marco passage Dev, TREC Deep Learning 2019 (DL'19), and TREC Deep Learning 2020 (DL'20). M@10 and nDCG denote MRR@10 and nDCG@10, respectively. 'coCon' denotes coCondenser, which continually pre-trains BERT in an unsupervised manner; the subscript of a pre-trained model denotes its scale (e.g., 'base' equals 110M parameters).]

Evaluation Metrics. We report MRR@10 (M@10) and Recall@1/50/100/1K for MS-Marco Dev (passage), and report nDCG@10 for both TREC Deep Learning 2019 (passage) and TREC Deep Learning 2020 (passage). Moreover, nDCG@10 is reported on the BEIR benchmark, while MRR@100 and Recall@100 are reported for MS-Marco Doc. Regarding the R@N metric, we found there are two ways of calculating it, and we strictly follow the official evaluation (please refer to §C).
Setups. We pre-train on the MS-Marco collection (Nguyen et al., 2016), where most hyperparameters are identical to (Wang et al., 2022): the encoder is initialized from BERT-base (Devlin et al., 2019) whereas the other modules are randomly initialized, the batch size is 2048, the max length is 144, the learning rate is 3e-4, the number of training steps is 80k, the masking percentage (α%) of the encoder is 30%, and that of the decoder ((α+β)%) is 50%. Meanwhile, the random seed is always 42, and the pre-training is completed on 8x A100 GPUs within 14h. Please refer to §A.2 for our fine-tuning setups.

4.1. MAIN EVALUATION

MS-Marco Dev (Passage Retrieval). First, we compare our fine-tuned LexMAE with a wide range of baselines and competitors on large-scale retrieval in Table 1. Our method substantially outperforms the previous best retriever, SimLM, by a very large margin (+1.5% MRR@10) and achieves new state-of-the-art performance. As the two stand for different retrieval paradigms and thus different bottleneck constructions, such a large performance margin verifies the superiority of lexicon-weighting retrieval when a proper initialization is given. Meanwhile, LexMAE is dramatically superior (+5.1% MRR@10) to its baseline (Formal et al., 2022), Co-Self-Distil, with the same neural model scale but a different model initialization (coCondenser (Gao & Callan, 2022) vs. LexMAE). This verifies that our lexicon-bottlenecked pre-training is more effective than dense-bottlenecked pre-training for lexicon-weighting retrieval.

[Table 3: MS-Marco Doc results (M@100 / R@100): BERT 38.9/87.7; ICT (Lee et al., 2019) 39.6/88.2; B-PROP (Ma et al., 2021) 39.5/88.3; SEED (Lu et al., 2021) 39.6/90.2; COSTA (Ma et al., 2022).]

MS-Marco Doc. Lastly, we evaluate document retrieval on MS-Marco Doc in Table 3: we pre-train LexMAE on the document collection and follow the fine-tuning pipeline of (Ma et al., 2022) (w/o distillation), where our setting is FirstP with 384 tokens.

4.2. EFFICIENCY ANALYSIS AND COMPARISON

Here, we show efficacy-efficiency correlations after applying our top-K sparsifying (§3.4). The key metrics of a retrieval system are its efficiency in terms of retrieval latency (queries per second, QPS), index size for inverted indexing, and representation size per document (for non-inverted indexing). On the one hand, as shown in Figure 2, our LexMAE achieves the best efficacy-efficiency trade-off among all dense-vector, quantized-dense, and lexicon-based methods. Compared to the previous state-of-the-art retriever, SimLM, we improve retrieval effectiveness by 1.5% MRR@10 with a 14.1x acceleration. With top-K sparsifying, LexMAE can match SimLM's performance at 100+ QPS. In addition, LexMAE shows a much better trade-off than the recent best PQ-IVF dense-vector retriever, RepCONC. Surprisingly, even when only 4 tokens are kept for each passage, the performance of LexMAE (24.0% MRR@10) is still better than BM25 retrieval (18.5%). On the other hand, as listed in Table 4, we also compare different retrieval paradigms in terms of their storage requirements. Note that each activated (non-zero) term in a lexicon-weighted sparse vector needs 3 bytes (2 bytes for indexing and 1 byte for its quantized weight). Compared to dense-vector methods, lexicon-based methods, including our LexMAE, inherently require less storage in terms of both the index size of the collection and the representation bytes per document.

Figure 2: Retrievers w/ their MS-Marco Dev MRR@10 and QPS, including dense-vector methods (i.e., SimLM, AR2), quantized-dense methods (i.e., RepCONC (Zhan et al., 2022), ADORE-IVF (Zhan et al., 2021)), and lexicon-based methods (i.e., SPLADEv2 (Formal et al., 2021a), BT-SPLADE (Lassance & Clinchant, 2022), DocT5query (Nogueira et al., 2019a), BM25, and ours). 1: An ensemble of 4 SPLADE models.
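As a back-of-the-envelope check of the 3-bytes-per-activated-term accounting above, a sketch; the ~8.8M passage count is the commonly cited MS-Marco collection size and the 128 average active terms per passage is our illustrative assumption, not a figure from the paper:

```python
def sparse_index_bytes(n_docs, avg_active_terms):
    """Index-size estimate for lexicon-weighted sparse vectors: each
    activated term costs ~3 bytes (2 for the sub-word id, 1 for the
    quantized weight), per the accounting above."""
    return n_docs * avg_active_terms * 3

# ~8.8M MS-Marco passages; 128 active terms/doc is an illustrative guess
index_gb = sparse_index_bytes(8_841_823, 128) / 1e9
```

Under these assumptions the whole passage index fits in roughly 3.4 GB, far below a 768-dim float32 dense index (~27 GB for the same collection).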

[Table 5: Hybrid retrieval on MS-Marco Dev (M@10 / R@1): LexMAE-pipeline 43.1/28.8; LexMAE-ensemble 43.1/28.8; UnifieR-uni (Shen et al., 2022) 40.7/26.9; Ensemble of SPLADE (4 models) 40.0/-; COIL-full (Gao et al., 2021a) 35.5/-; CLEAR (Gao et al., 2021b) 33.8/-.]

[Table 6: Performance at different stages (see §A for details) of the fine-tuning pipeline on MS-Marco Dev, reporting M@10 and R@1k for the BM25-Negatives, Hard-Negatives, and Reranker-Distilled stages.]

Meanwhile, compared to BM25, which builds its index at the word level, the learnable lexicon-weighting methods, based on the smaller vocabulary of sub-words, are more memory-friendly.

4.3. FURTHER ANALYSIS

Analysis of Dense-Lexicon Complement. As verified by Shen et al. (2022), lexicon-weighting retrieval is complementary to dense-vector retrieval, and a simple linear combination of the two can achieve excellent performance. As shown in Table 5, we conduct an experiment to complement our LexMAE with our re-implementation of the state-of-the-art dense-vector retrieval method, SimLM (Wang et al., 2022). Specifically, we leverage two strategies: i) ensemble: a combination is applied to the retrieval scores of both paradigms, incurring significant overheads due to two rounds of large-scale retrieval; and ii) pipeline: a retrieval pipeline avoids the second round by using our lexicon-weighting retrieval to fetch the top-K documents from the collection and then applying our dense-vector retrieval merely to the constrained candidates for dense scores. It is shown that we improve the previous state-of-the-art hybrid retrieval method by 2.4% MRR@10 on MS-Marco Dev.

Multi-stage Retrieval Performance. As shown in Table 6, we exhibit more details about the retrieval performance of different pre-training methods over the multiple fine-tuning stages (see §A). Our LexMAE consistently achieves the best or competitive results at all three stages. We conduct extensive experiments to check our model choices and their ablations from multiple aspects in Table 7. Note that our LexMAE uses the softmax-norm CBoW bottleneck, shares the LM logits of the encoder and bottleneck, and employs the inclusive masking strategy with 30% for the encoder and 50% for the decoder.
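The pipeline strategy for dense-lexicon complement described above can be sketched as follows; the toy matrices and the unweighted score sum are our assumptions (in practice the combination weight would be tuned):

```python
import numpy as np

def pipeline_retrieve(q_lex, q_dense, D_lex, D_dense, k=100):
    """'Pipeline' hybrid retrieval: lexicon-weighting scores select the
    top-k candidates over the full collection; dense scores are computed
    only for those k candidates, and the summed score re-orders them.
    Only one full-collection search is needed."""
    lex = D_lex @ q_lex                # sparse relevance, all documents
    cand = np.argsort(-lex)[:k]        # constrained candidate set
    dense = D_dense[cand] @ q_dense    # dense relevance, k documents only
    return cand[np.argsort(-(lex[cand] + dense))]

rng = np.random.default_rng(0)
n_docs, V, d = 1000, 64, 8
D_lex = np.maximum(rng.normal(size=(n_docs, V)), 0)  # non-negative, sparse-ish
D_dense = rng.normal(size=(n_docs, d))
ranked = pipeline_retrieve(np.maximum(rng.normal(size=V), 0),
                           rng.normal(size=d), D_lex, D_dense, k=10)
```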

4.4. MODEL CHOICES & ABLATION STUDY

First, we try three other bottlenecks in Eq. (5): a saturated norm for CBoW (as detailed in 'Zero-shot Retrieval'), a dense bottleneck using [CLS], and no bottleneck by cutting the bottleneck off. As shown in Figure 4, their training loss curves show that our CBoW bottlenecks do help the decoding compared to 'no-bottleneck', but are inferior to the [CLS] contextual dense vector. However, attributed to pretraining-finetuning consistency, CBoW bottlenecks are better for lexicon-weighting retrieval. As for the two different lexicon CBoW bottlenecks, we show their fine-tuning dev curves in Figure 5: 'sat-norm' shows great performance in early fine-tuning stages due to the same lexicon-representing manner, whereas 'softmax-norm' shows better later fine-tuning results due to its generalization. Then, we make some subtle architecture changes to LexMAE: i) enabling gradient back-propagation to the word embedding matrix leads to a learning short-cut and thus worse fine-tuning results; ii) both sharing the LM heads of our encoder and decoder (Eq. (1) and Eq. (5)) and adding an extra LM head specially for the bottleneck LM logits (Eq. (3)) result in a minor drop; and iii) replacing BERT with DistilBERT (Sanh et al., 2019) as our initialization still outperforms a bunch of competitors. Lastly, masking strategies other than our 'inclusive' strategy in §3.3 have only minor effects on downstream fine-tuning. The masking proportions of the encoder and decoder can affect LexMAE's capability, and the negative effect becomes unnoticeable when the proportions are large. In summary, pre-training LexMAE is very stable against various changes and consistently delivers great results. Please refer to §E for more experimental comparisons.

5. CONCLUSION

In this work, we propose to improve lexicon-weighting retrieval by pre-training a lexicon-bottlenecked masked autoencoder (LexMAE), which alleviates the objective mismatch between masked language modeling and relevance-oriented lexicon importance. After pre-training LexMAE on large-scale collections, we first observe great zero-shot performance. Then, after fine-tuning LexMAE on the large-scale retrieval benchmark, we obtain state-of-the-art retrieval quality with very high efficiency, and we also deliver state-of-the-art zero-shot transfer performance on the BEIR benchmark. Further detailed analyses of the efficacy-efficiency trade-off in terms of retrieval latency and storage memory also verify the superiority of our fine-tuned LexMAE.

A. MULTI-STAGE RETRIEVER FINE-TUNING

A.1. FINE-TUNING PIPELINE

To train a state-of-the-art lexicon-weighting retriever, we adapt the fine-tuning pipeline of a recent dense-vector retrieval method (Wang et al., 2022), as illustrated in Figure 6. The major difference is the source of the reranker used for knowledge distillation into the retriever: in contrast to (Wang et al., 2022), which trains a retriever-specific reranker on the fly at high computational overhead, we propose to leverage an off-the-shelf reranker by (Zhou et al., 2022b).

Stage 1: BM25 Negatives. In the first stage, we sample negatives for each query q from the top-K_1 document candidates returned by the BM25 retrieval system, denoted as N^{(bm25)}. The contrastive learning loss of stage 1 of our retriever fine-tuning is written as

\mathcal{L}^{(r1)} = \sum_q -\log P(d = d^+ \mid q, \{d^+\} \cup N^{(bm25)}; \theta^{(enc)}) + \lambda_1\, \text{FLOPS}.   (11)

Stage 2: Hard Negatives. Then, we sample hard negatives N^{(hn1)} for each query q from the top-K_2 candidates based on the relevance scores of the retriever obtained in stage 1, and the training loss of stage 2 is defined as

\mathcal{L}^{(r2)} = \sum_q -\log P(d = d^+ \mid q, \{d^+\} \cup N^{(hn1)}; \theta^{(enc)}) + \lambda_2\, \text{FLOPS}.   (12)

Stage 3: Reranker-Distilled. Lastly, we further sample hard negatives N^{(hn2)} for each query q from the top-K_3 candidates of the stage-2 retriever. Besides the contrastive learning objective, we also distill a well-trained reranker into our stage-3 retriever, which is written as

\mathcal{L}^{(r3)} = \sum_q \text{KL-Div}\big(P(d \mid q, \{d^+\} \cup N^{(hn2)}; \theta^{(enc)}) \,\|\, P(d \mid q, \{d^+\} \cup N^{(hn2)}; \theta^{(rk)})\big) - \gamma \log P(d = d^+ \mid q, \{d^+\} \cup N^{(hn2)}; \theta^{(enc)}) + \lambda_3\, \text{FLOPS}.   (13)

Here, θ^(rk) parameterizes an expensive but effective cross-encoder as the reranker for knowledge distillation, and the KL divergence, KL-Div(·||·), is used as the distillation loss with θ^(rk) frozen.

Figure 6: An illustration of the fine-tuning pipeline of our retrievers. Here, the fine-tuned reranker is directly adopted from (Zhou et al., 2022b) to avoid expensive reranker training.
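The stage-3 objective of Eq. (13) combines a KL distillation term with a gamma-weighted contrastive term. A minimal NumPy sketch (the FLOPS regularizer is omitted; score values are illustrative):

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a score vector."""
    e = np.exp(s - s.max())
    return e / e.sum()

def stage3_loss(retriever_scores, reranker_scores, gamma=0.2):
    """Eq. (13) without the FLOPS term: KL(P_retriever || P_reranker)
    over {d+} ∪ N (index 0 is the positive document) plus a
    gamma-weighted contrastive NLL; the reranker is treated as frozen."""
    p = softmax(np.asarray(retriever_scores, dtype=float))
    r = softmax(np.asarray(reranker_scores, dtype=float))
    kl = float(np.sum(p * np.log(p / r)))
    return kl + gamma * (-float(np.log(p[0])))

scores = np.array([2.0, 0.5, 0.1, -1.0])
agree = stage3_loss(scores, scores)  # KL term vanishes when the two agree
```

Setting gamma=0.2 follows the fine-tuning setups reported in §A.2.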

A.2 FINE-TUNING SETUPS

We share several hyperparameters across all three stages: the learning rate is set to 2 × 10⁻⁵ following Shen et al. (2022), the number of training epochs is set to 3, the model is always initialized from our LexMAE, the max document length is set to 144, and the max query length is set to 32. Following Wang et al. (2022), the γ in Eq. (13) is set to 0.2. In contrast to Wang et al. (2022), who use 4 GPUs for fine-tuning, we limit all fine-tuning experiments to one A100 GPU. The batch size (w.r.t. the number of queries) is set to 24 with 1 positive and 15 negative documents in fine-tuning stages 1 and 2, and to 16 with 1 positive and 23 negative documents in stage 3 (increasing the number of negatives while fitting one GPU's memory by reducing the batch size). Another important hyperparameter is the depth of negative sampling (i.e., how many top candidates are kept as negatives), namely K in Eqs. (11-13). Following Wang et al. (2022) and Gao & Callan (2022), we keep 1000 candidates for BM25 negatives and 200 for the other negatives, i.e., K1 = 1000, K2 = 200, K3 = 200. The only hyperparameter we tuned is the loss weight λ in Eqs. (11-13), i.e., λ1 ∈ {0.001, 0.002, 0.004} (corresponding to BM25 negatives) and λ2/3 ∈ {0.004, 0.008, 0.016} (corresponding to hard negatives). Empirically, we found λ1 = 0.002, λ2 = 0.008, λ3 = 0.008 to achieve a good performance-efficiency trade-off. Lastly, we fix the random seed to 42 in all our experiments without any tuning.
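The λ-weighted sparsity term tuned above can be illustrated with the FLOPS regularizer commonly used by lexicon-weighting retrievers such as SPLADE (Formal et al., 2021a); the precise form used in this paper is defined in the main text, so treat this as an assumed instantiation for illustration:

```python
import torch

def flops_regularizer(reps):
    """Sketch of a FLOPS-style sparsity regularizer: the squared average
    absolute activation per vocabulary dimension, summed over the vocab.
    `reps` is a (batch, |V|) matrix of lexicon weights; concentrating mass
    on few dimensions yields a larger penalty than spreading it out."""
    return (reps.abs().mean(dim=0) ** 2).sum()
```

Because the penalty is quadratic in the per-dimension mean, it pushes the model toward representations whose activated dimensions vary across documents, which is what keeps posting lists short in the inverted index.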

B LEXICON-WEIGHTING INFERENCE FOR LARGE-SCALE RETRIEVAL

In the inference phase of large-scale retrieval, dense-vector and lexicon-weighting retrieval methods differ in several respects. As in Eq. (10), we use the dot-product between real-valued sparse lexicon-weighting representations as the relevance metric, where 'real-valued' is a prerequisite for gradient back-propagation and end-to-end learning. However, it is inefficient, and often infeasible, to index real-valued sparse representations with open-source term-based retrieval systems, e.g., LUCENE and Anserini (Yang et al., 2017). Following Formal et al. (2021a), we adopt 'quantization' and a 'term-based system' to complete our retrieval procedure. That is, to transfer a high-dimensional sparse vector back to its corresponding lexicons and their virtual frequencies, the lexicons are first obtained by keeping the non-zero elements of the vector, and each virtual frequency is then derived by a straightforward quantization (i.e., ⌊100 × v⌋). In summary, the overall procedure of large-scale retrieval based on a fine-tuned LexMAE is: i) generate the high-dimensional sparse vector for each document and transfer it to lexicons and frequencies; ii) build a term-based inverted index over all documents in the collection via Anserini (Yang et al., 2017); iii) given a test query, generate its lexicons and frequencies in the same way; and iv) query the built index to obtain the top document candidates.
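The quantization step in i) and iii) above can be sketched as follows; the scale of 100 follows the ⌊100 × v⌋ rule in the text, while the function name and the `vocab` argument are ours for illustration:

```python
import math

def to_virtual_frequencies(sparse_vec, vocab, scale=100):
    """Sketch: convert a real-valued sparse lexicon vector into
    (term, virtual frequency) pairs for a term-based inverted index.
    Non-zero entries are kept and each weight v becomes floor(scale * v)."""
    freqs = {}
    for idx, weight in enumerate(sparse_vec):
        if weight > 0:
            tf = math.floor(scale * weight)
            if tf > 0:  # weights below 1/scale vanish after quantization
                freqs[vocab[idx]] = tf
    return freqs
```

The resulting term-to-frequency mapping can be fed to Anserini as a pseudo-document, so the neural weights are served by an ordinary term-based system.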



We released our codes and models at https://github.com/taoshen58/LexMAE.



Figure 1: An illustration of lexicon-bottlenecked masked autoencoder (Lex-MAE) pre-training architecture.

Figure 3: Zero-shot retrieval results (nDCG@10) on MS-Marco Dev.

Zero-shot Retrieval. To verify whether our LexMAE pre-training can learn a lexicon-importance distribution, we conduct zero-shot retrieval on MS-Marco. Specifically, instead of the softmax normalization function in Eq. (3), we use a saturation-function-based L1 normalization (i.e., L1-Norm(log(1 + ReLU(·)))) and keep the other parts unchanged. Without fine-tuning, we apply the pre-trained LexMAE to the MS-Marco retrieval task, using log(1 + ReLU(·)) as the sparse representation. As shown in Figure 3, our lexicon-importance embedding by LexMAE (110M parameters) beats BM25 on large-scale retrieval and is competitive with a very large model, SGPT-CE (Muennighoff, 2022), which has 6.1B parameters.
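The saturation-based weighting used for this zero-shot setting can be sketched as below, assuming a batch of vocabulary-sized logits; the function name is ours:

```python
import torch

def saturated_lexicon_weights(logits):
    """Sketch of the zero-shot weighting L1-Norm(log(1 + ReLU(x))):
    ReLU keeps only positively activated lexicons, log1p saturates large
    activations, and L1 normalization turns the result into an
    importance distribution over the vocabulary."""
    sat = torch.log1p(torch.relu(logits))
    return sat / sat.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```

Unlike softmax, this mapping leaves non-activated lexicons at exactly zero, so the representation stays sparse and index-friendly.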

Figure 4: MLM pre-training losses (99% moving average has been applied) of LexMAE's encoder and decoder for various bottlenecks.

Passage retrieval results on MS-Marco Dev, TREC Deep Learning 2019 (DL'19), and TREC Deep Learning 2020 (DL'20).



Zero-shot transfer performance (nDCG@10) on the BEIR benchmark. 'BEST ON' and 'AVERAGE' do not take the in-domain result into account. 'ColBERT' refers to its v2 version (Santhanam et al., 2021).

As shown in Table 1, we also evaluate our LexMAE on both TREC Deep Learning 2019 (DL'19) and TREC Deep Learning 2020 (DL'20). We observe that LexMAE consistently achieves new state-of-the-art performance on both datasets.

Document Retrieval on Marco Doc Dev.

Ensemble & hybrid retrievers.



C EXPLANATION OF DIFFERENT RECALL METRICS

Regarding the R@N metric, we found that there are two ways of calculating it, and we strictly follow the official evaluation at https://github.com/usnistgov/trec_eval and https://github.com/castorini/anserini, which is defined as

R@N = (1/|Q|) Σ_{q∈Q} |D⁺ ∩ D| / |D⁺|,

where there may be multiple positive documents D⁺ for each query, Q denotes the test queries, and D denotes the top-N document candidates returned by a retrieval system. We also call this metric all-positive-macro Recall@N. On the other hand, the recall calculation following DPR (Karpukhin et al., 2020) is defined as

R@N = (1/|Q|) Σ_{q∈Q} 1[D⁺ ∩ D ≠ ∅],

which we call one-positive-enough Recall@N. The official (all-positive-macro) Recall@N is therefore usually lower than the DPR-style (one-positive-enough) Recall@N, and the smaller N is, the more obvious the gap. We thus report the unofficial one-positive-enough Recall@N separately in Table 8 for more precise comparisons. We observe that our LexMAE is still the best retriever among its competitors.
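The two metrics can be sketched as follows (function names are ours); a one-query example makes the gap concrete: with positives {d1, d9} and top-2 results [d1, d2], the macro variant scores 1/2 while the one-positive-enough variant scores 1.

```python
def recall_all_positive_macro(retrieved, positives, n):
    """Official trec_eval-style Recall@N: per query, the fraction of that
    query's positives found in the top-N, averaged over queries."""
    total = 0.0
    for q, pos in positives.items():
        top_n = set(retrieved[q][:n])
        total += len(top_n & pos) / len(pos)
    return total / len(positives)

def recall_one_positive_enough(retrieved, positives, n):
    """DPR-style Recall@N: a query counts as a hit if *any* of its
    positives appears in the top-N; the hit rate is averaged over queries."""
    hits = sum(1 for q, pos in positives.items()
               if set(retrieved[q][:n]) & pos)
    return hits / len(positives)
```

Since a per-query hit indicator always upper-bounds the per-query positive fraction, the macro variant can never exceed the one-positive-enough variant.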

D SPARSIFYING LEXICON REPRESENTATIONS

Compared to dense-vector retrieval methods (Zhan et al., 2022; Xiao et al., 2022) that rely on product quantization (PQ) and inverted files (IVF) and thus compromise their performance (by 3% ∼ 4%) for memory & time efficiency, the lexicon-weighting method with high-dimensional sparse representations is intrinsically efficient for large-scale retrieval, as demonstrated in §B: it is fully compatible with traditional term-based retrieval systems, e.g., BM25, and only manipulates the term frequency and document frequency via the neural language modeling encoder.

To probe LexMAE's efficacy-efficiency trade-off, we need to adjust the sparsity of the documents' lexicon representations. In general, 'sparsity' here denotes how many lexicons in the vocabulary are used to represent each document. Since the hyperparameter λ in Eq. (10) denotes the strength of sparse regularization during fine-tuning and controls the efficacy-efficiency trade-off of the retriever, it is straightforward to tune λ for the purpose of sparsifying. However, this requires fine-tuning the retriever multiple times with different λ to obtain adequate data points, leading to huge computation overheads. Worse, there is no deterministic correlation between λ and the resulting sparsity, which makes sparsifying uncontrollable and increases the number of trials. To make the sparsifying procedure more controllable, Yang et al. (2021) and Lassance & Clinchant (2022) propose to sparsify the lexicon-weighting representations by controlling fine-tuning hyperparameters, e.g., constraining the number of activated lexicons, which however still incurs extra fine-tuning effort.

Therefore, in this work we present a simple but effective and controllable sparsifying method, which operates only when embedding documents in the inference phase and requires almost zero extra overhead.
Specifically, it keeps only the top-K weighted lexicons in the sparse lexicon representation v^(d) ∈ R^|V| from Eq. (8), while removing the others by assigning them zero weights, which can be formally written as

v^(d)_K = v^(d) ⊙ 1_{top-K},

where ⊙ denotes the element-wise product and 1_{top-K} ∈ {0, 1}^|V| is a mask whose entry equals 1 only if the corresponding value of v^(d) is among its top-K, i.e., no less than t, the K-th largest value in v^(d). If K is larger than the number of activated lexicons (i.e., those with non-zero weight in v^(d)), applying this sparsifying method makes no change. All the sparsified lexicon representations v^(d)_K with different K values are derived from the same original representation v^(d), so both the fine-tuning and the embedding procedures are invoked only once, saving substantial computing resources. Lastly, the sparsified lexicon representations v^(d)_K are used to build the inverted index for large-scale retrieval.

Table 8: Retrieval results on MS-Marco Dev with one-positive-enough recall. Note that the official Recall@50 of our fine-tuned LexMAE is 88.9%.
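The inference-time sparsification above can be sketched with a hypothetical helper (note that ties at the threshold value t may keep slightly more than K entries, a detail the formal definition shares):

```python
import torch

def topk_sparsify(v, k):
    """Sketch of Section D's sparsification: keep the K largest weights
    of a lexicon vector v and zero out the rest. When K covers the whole
    vocabulary (or exceeds the activated entries), v is unchanged."""
    if k >= v.size(-1):
        return v.clone()
    threshold = torch.topk(v, k).values[..., -1:]  # K-th largest value t
    mask = (v >= threshold).to(v.dtype)            # 1 only in the top-K
    return v * mask
```

Because every v^(d)_K is derived from the same stored v^(d), sweeping K for efficacy-efficiency curves requires re-running only this masking step, not the encoder.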

Method | M@10 | R@50 | R@1K
RocketQA (Qu et al., 2021) | 37.0 | 85.5 | 97.9
PAIR (Ren et al., 2021a) | 37.9 | 86.4 | 98.2
RocketQAv2 (Ren et al., 2021b) | 38.8 | 86.2 | 98.1
AR2 (Zhang et al., 2022) | 39.5 | 87.8 | 98.6
UnifieR_lexicon (Shen et al., 2022) | | |

Furthermore, our LexMAE retriever outperforms many state-of-the-art retrieve & rerank pipeline methods, as shown in Table 9. It is noteworthy that these rerankers (a.k.a. cross-attention models or cross-encoders) are very costly, as they must be applied to every query-document text concatenation instead of to query-agnostic representations from a bi-encoder.

Comparisons with Different Pre-training Objectives. As listed in Table 10, we compare our pre-training objective with a batch of other objectives by fine-tuning each pre-trained model on MS-Marco with BM25 negatives. Our LexMAE improves the previous best by 1.3% MRR@10 on MS-Marco Dev.

F BACKGROUND: DENSE-VECTOR AND LEXICON-WEIGHTING RETRIEVAL

With the recent surge of pre-trained language models (PLMs) built by self-supervised learning, e.g., GPT (Brown et al., 2020), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020), deep representation learning has entered a new era of more expressive text representations. As a task that relies heavily on text representation learning, large-scale retrieval directly benefits from these PLMs by leveraging them as neural encoders and fine-tuning them on downstream datasets. Recent works built upon PLMs thus learn encoders for large-scale retrieval, which can be coarsely grouped into two paradigms according to their encoding spaces, i.e., dense-vector and lexicon-weighting retrieval.

Dense-vector Encoding Methods. The most straightforward way to leverage Transformer-based PLMs is to directly represent a document/query as a fixed-length, real-valued, low-dimensional dense vector u ∈ R^e. Following common practice, the dense vector is derived either from the contextual embedding of the special token '[CLS]' or from mean pooling over the sequence of word-level contextual embeddings. It is noteworthy that e is the embedding size and is usually small (e.g., 768 for base-size PLMs), and that 'fixed-length' is not limited to one vector per collection entry but may cover multi-vector representations (Humeau et al., 2020; Khattab & Zaharia, 2020). Lastly, the relevance score between a document and a query is calculated by a very lightweight metric, e.g., dot-product or cosine similarity (Khattab & Zaharia, 2020; Xiong et al., 2021; Zhan et al., 2021; Gao & Callan, 2022).
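The pooling-then-score pattern described above can be sketched as follows; the function name and tensor shapes are our assumptions, not a specific library's API:

```python
import torch

def dense_relevance(query_hidden, doc_hidden, pooling="cls"):
    """Sketch of dense-vector relevance: pool (batch, seq_len, e) token
    embeddings into one vector per text, via the '[CLS]' position or mean
    pooling, then score query-document pairs with a dot product."""
    def pool(h):
        return h[:, 0] if pooling == "cls" else h.mean(dim=1)
    q, d = pool(query_hidden), pool(doc_hidden)
    return (q * d).sum(dim=-1)  # one relevance score per pair
```

Because documents are pooled independently of any query, their vectors can be precomputed and indexed offline, which is exactly the query-agnostic property large-scale retrieval requires.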
Although PLM-based dense-vector retrieval methods enjoy off-the-shelf dense embeddings and easy-to-calculate relevance metrics, they are limited by their intrinsic representation manner: i) real-valued vectors lead to a large index size and high retrieval latency, and ii) high-level vector representations lose the key relevance feature of lexicon overlap.

