MASTER: MULTI-TASK PRE-TRAINED BOTTLENECKED MASKED AUTOENCODERS ARE BETTER DENSE RETRIEVERS

Abstract

Dense retrieval aims to map queries and passages into a low-dimensional vector space for efficient similarity measurement, and has shown promising effectiveness in various large-scale retrieval tasks. Since most existing methods adopt pre-trained Transformers (e.g., BERT) for parameter initialization, some work focuses on proposing new pre-training tasks that compress the useful semantic information of passages into dense vectors, achieving remarkable performance. However, it remains challenging to effectively capture the rich semantic information of passages and the relations among them in dense vectors via any single pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under a bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passage recovering, related passage recovering, and PLM outputs recovering. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks focus on capturing the semantic information of passages and the relationships among them within the pre-training corpus. The third can capture knowledge beyond the corpus from external PLMs (e.g., GPT-2). Extensive experiments on several large-scale passage retrieval datasets show that our approach outperforms previous state-of-the-art dense retrieval methods.

1. INTRODUCTION

Recent years have witnessed the great success of dense retrieval methods (Karpukhin et al., 2020; Qu et al., 2021; Xiong et al., 2021) in industrial applications, e.g., web search (Brickley et al., 2019; Qiu et al., 2022) and question answering (Karpukhin et al., 2020; Izacard & Grave, 2021). These methods typically encode queries and passages into low-dimensional dense vectors and measure their semantic relevance by the vector similarity between them. In real-world applications, the dense vectors of large collections of passages are pre-computed, and approximate nearest neighbor (ANN) search techniques (Johnson et al., 2021) can then be incorporated for efficient retrieval. To generate high-quality dense vectors, pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019) have been widely adopted as the backbone of the query and passage encoders. However, general PLMs (e.g., BERT (Devlin et al., 2019)) may not be the best choice for dense retrieval, as their native dense representations (usually the [CLS] embedding) are not explicitly designed to aggregate the information of the input text. To address this, recent studies (Gao & Callan, 2021a; Lu et al., 2021; Sachan et al., 2021) adopt pre-training techniques to endow the [CLS] embedding with the capacity to compress the semantic information of the input text. They either rely on an autoencoding task that uses the [CLS] embedding to recover corrupted text (e.g., masked or replaced tokens) (Liu & Shao, 2022; Wang et al., 2022; Wu et al., 2022), or leverage a contrastive learning objective to capture the relations among passages (e.g., co-occurrence) (Ram et al., 2022; Sachan et al., 2021), outperforming general PLMs in this task.
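The retrieval pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: random unit vectors stand in for the [CLS] embeddings an encoder would produce, and an exhaustive dot-product scan stands in for the ANN index used at scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pre-computed passage vectors; in practice these would be [CLS]
# embeddings from a pre-trained encoder (random stand-ins here).
passage_vecs = normalize(rng.normal(size=(10_000, 768)))

# A query is encoded into the same vector space at search time.
query_vec = normalize(rng.normal(size=(768,)))

# Semantic relevance is measured by vector similarity (inner product).
scores = passage_vecs @ query_vec

# Top-k retrieval; a production system would replace this brute-force
# scan with an ANN index (e.g., FAISS) for efficiency.
k = 5
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]  # indices, best first
```

The key property exploited here is that, once passage vectors are pre-computed, scoring a new query costs only one matrix-vector product, which ANN indexes further approximate in sub-linear time.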

