MASTER: MULTI-TASK PRE-TRAINED BOTTLENECKED MASKED AUTOENCODERS ARE BETTER DENSE RETRIEVERS

Abstract

Dense retrieval aims to map queries and passages into a low-dimensional vector space for efficient similarity measurement, and has shown promising effectiveness in various large-scale retrieval tasks. Since most existing methods adopt pre-trained Transformers (e.g., BERT) for parameter initialization, recent work has proposed new pre-training tasks that compress the useful semantic information of passages into dense vectors, achieving remarkable performance. However, it remains challenging for any single pre-training task to effectively capture the rich semantic information of passages, and the relations among them, in the dense vectors. In this work, we propose MASTER, a multi-task pre-trained model that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passages recovering, related passages recovering, and PLMs outputs recovering. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks capture the semantic information of passages and the relationships among them within the pre-training corpus. The third type captures knowledge beyond the corpus from external PLMs (e.g., GPT-2). Extensive experiments on several large-scale passage retrieval datasets show that our approach outperforms previous state-of-the-art dense retrieval methods.

1. INTRODUCTION

Recent years have witnessed the great success of dense retrieval methods (Karpukhin et al., 2020; Qu et al., 2021; Xiong et al., 2021) in industrial applications, e.g., web search (Brickley et al., 2019; Qiu et al., 2022) and question answering (Karpukhin et al., 2020; Izacard & Grave, 2021). These methods typically encode queries and passages into low-dimensional dense vectors and measure semantic relevance by the vector similarity between them. In real-world applications, the dense vectors of large numbers of passages are pre-computed, and approximate nearest neighbor (ANN) search techniques (Johnson et al., 2021) are then used for efficient retrieval. To generate high-quality dense vectors, pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019) have been widely adopted as the backbone of the query and passage encoders. However, general PLMs (e.g., BERT (Devlin et al., 2019)) may not be the best choice for dense retrieval, as their native dense representations (usually the [CLS] embedding) are not purposely designed to aggregate the information of the input text. To address this, recent studies (Gao & Callan, 2021a; Lu et al., 2021; Sachan et al., 2021) adopt pre-training techniques to endow the [CLS] embedding with the capacity to compress the semantic information of the input text. They either rely on an autoencoding task that uses the [CLS] embedding to recover corrupted text (e.g., masked or replaced tokens) (Liu & Shao, 2022; Wang et al., 2022; Wu et al., 2022), or leverage a contrastive learning objective to capture the relations among passages (e.g., co-occurrence) (Ram et al., 2022; Sachan et al., 2021), outperforming general PLMs on this task.
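As an illustration of this retrieval paradigm, the following minimal sketch scores pre-computed passage vectors against a query vector by dot product and returns the top-k passages. The random vectors and the exhaustive scan are stand-ins only: a real system would obtain the vectors from a PLM encoder (e.g., the [CLS] embedding) and replace the scan with an ANN index.

```python
import numpy as np

# Precomputed dense vectors: in practice these come from a PLM encoder;
# random vectors here simply keep the sketch runnable.
rng = np.random.default_rng(0)
passage_vecs = rng.normal(size=(1000, 128))                  # corpus of 1000 passages
query_vec = passage_vecs[42] + 0.01 * rng.normal(size=128)   # query close to passage 42

def retrieve(query, passages, k=5):
    """Score every passage by dot product and return the top-k indices.
    Real systems replace this exhaustive scan with an ANN index (e.g., FAISS)."""
    scores = passages @ query          # one relevance score per passage
    return np.argsort(-scores)[:k]    # indices of the k highest-scoring passages

top = retrieve(query_vec, passage_vecs)
```

Because relevance is a simple inner product, the expensive passage encoding happens once offline, and query-time cost reduces to a nearest-neighbor lookup.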
Despite this success, neither the autoencoding nor the contrastive learning pre-training task alone is sufficient to fully compress the useful characteristics for retrieval into the dense embedding, as each mainly relies on a specific kind of information or relation from the corpus. From this point of view, a multi-task pre-training framework that jointly learns various supervised signals from different tasks is promising. However, due to the divergences in input formats and learning objectives among different tasks, an arbitrary integration of these tasks is inappropriate and may even cause detrimental gradient interference, leading to performance degradation (Kendall et al., 2018; Yu et al., 2020). To address this problem, we consider integrating multiple pre-training tasks in a unified format that reduces the divergence of the training objectives. Since most NLP tasks can be reformulated in a text-to-text format (Xie et al., 2022; Raffel et al., 2020), we can likewise recast the available pre-training tasks into such a format. Recently, the bottlenecked masked autoencoder (BMAE) (Liu & Shao, 2022; Wang et al., 2022; Wu et al., 2022) has been proposed to pre-train dense retrievers. It typically adopts an encoder-decoder architecture, consisting of a deep encoder that generates the dense vector of the input text and a shallow decoder that relies on this dense vector to recover an aggressively masked text. In this way, an information bottleneck is constructed: the deep encoder must force the dense vector to preserve as much useful information as possible for recovering the text in the shallow decoder. Inspired by this, we unify multiple pre-training tasks into the BMAE format, i.e., taking a text as the input of the encoder and recovering the text itself, or a related text, in the decoder.
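The asymmetry at the heart of the bottleneck can be illustrated by the masking step alone. The sketch below (plain Python; the 30%/70% masking rates are illustrative, not the paper's exact settings) produces a mildly masked encoder input and an aggressively masked decoder input; because the decoder sees so little of the original text, it must lean on the encoder's single dense vector to recover it.

```python
import random

def mask_tokens(tokens, mask_rate, seed=0):
    """Replace a fraction of tokens with [MASK]. BMAE-style pre-training
    masks the decoder input far more aggressively than the encoder input,
    which is what turns the encoder's dense vector into a bottleneck."""
    rng = random.Random(seed)
    return [tok if rng.random() >= mask_rate else "[MASK]" for tok in tokens]

passage = "dense retrieval maps queries and passages into low dimensional vectors".split()
enc_input = mask_tokens(passage, mask_rate=0.3, seed=1)  # mild masking for the encoder
dec_input = mask_tokens(passage, mask_rate=0.7, seed=2)  # aggressive masking for the decoder
```

The recovery loss is computed only on the decoder side, so gradients that improve reconstruction must flow through the dense vector, enriching it.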
For example, to capture passage relations (e.g., co-occurrence), we can feed a passage to the encoder and leverage its dense vector to recover an aggressively masked related passage in the decoder. Such a unified format addresses the central issue of multi-task pre-training, namely the divergences in input formats and learning objectives among tasks, while capturing the variety of semantics and relations in different tasks to pre-train effective dense vectors. Based on the above motivation, we propose MASTER, a multi-task pre-trained bottlenecked masked autoencoder that adopts a multi-decoder architecture to integrate diverse pre-training tasks in the BMAE format. For each pre-training task, we devise a task-specific decoder, and all decoders rely on the output dense vector of the shared deep encoder to guide their decoding. In this way, we construct multiple information bottlenecks that force the deep encoder to generate more informative dense vectors, yielding compressed high-quality representations. To learn sufficiently useful semantics and relations, we devise three types of pre-training tasks, corrupted passages recovering, related passages recovering, and PLMs outputs recovering, comprising five tasks in total. The first two types focus on compressing the semantic information of passages and modeling the relationships among passages within the corpus. The third type forces the dense vector of the input passage to recover the output text of external generative PLMs such as GPT-2 (Radford et al., 2019), capturing semantic information and relations beyond the corpus to further enhance the dense vectors.
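A minimal structural sketch of this multi-decoder setup is given below. The numpy stand-ins (a pooled embedding table for the deep encoder, a fixed linear readout with squared error for each shallow decoder, and toy token ids) are purely illustrative; the point is that every task-specific loss is conditioned on the same dense vector, so all gradients flow back through one shared bottleneck.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
EMB = rng.normal(size=(1000, DIM))     # toy token-embedding table

def shared_encoder(token_ids):
    """Deep-encoder stand-in: pools token embeddings into ONE dense vector,
    the representation bottleneck shared by every decoder."""
    return EMB[token_ids].mean(axis=0)

def decoder_loss(dense_vec, target_ids, seed):
    """Shallow task-specific decoder stand-in: a fixed linear readout from
    the dense vector, scored against the target tokens' embeddings.
    Real decoders are shallow Transformers trained with masked LM losses."""
    w = np.random.default_rng(seed).normal(size=(DIM, DIM)) / np.sqrt(DIM)
    pred = dense_vec @ w
    return float(np.mean((EMB[target_ids] - pred) ** 2))

passage = [3, 14, 159, 26]             # toy token ids of the input passage
related = [53, 58, 97, 9]              # toy ids of a co-occurring passage
plm_out = [31, 41, 59]                 # toy ids of a generated continuation

z = shared_encoder(passage)
# One decoder per task type; every loss depends on the same bottleneck z.
total = (decoder_loss(z, passage, seed=1)    # corrupted passages recovering
         + decoder_loss(z, related, seed=2)  # related passages recovering
         + decoder_loss(z, plm_out, seed=3)) # PLMs outputs recovering
```

Summing the per-decoder losses keeps the tasks in a single optimization objective while the shared encoder absorbs whatever information each decoder needs.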
To verify the effectiveness of our approach, we conduct extensive experiments on several text retrieval datasets, e.g., MS-MARCO Passage Ranking (Nguyen et al., 2016), the TREC Deep Learning Track (Craswell et al., 2020; 2021), Natural Questions (Kwiatkowski et al., 2019), and the BEIR zero-shot retrieval benchmark (Thakur et al., 2021). Experimental results show that our approach achieves new state-of-the-art performance in dense retrieval. We will make the code and model checkpoints publicly available.

2. RELATED WORK

Dense Retrieval. Recent years have witnessed remarkable progress in dense retrieval (Karpukhin et al., 2020; Zhan et al., 2020; Hong et al., 2022; Ram et al., 2022). Different from traditional sparse retrieval methods (e.g., BM25 (Robertson et al., 2009)), dense retrieval approaches typically map queries and documents into low-dimensional dense vectors via a dual-encoder architecture and then use vector similarity metrics (e.g., dot product and cosine similarity) as relevance scores. This paradigm is supported by efficient approximate nearest neighbor (ANN) search engines, e.g., FAISS (Johnson et al., 2021). To effectively train dense retrieval models, existing work typically leverages pre-trained Transformers (Liu et al., 2019; Devlin et al., 2019) to initialize the dual encoders and then samples high-quality negatives when fine-tuning the encoders with con-

