MODELING SEQUENTIAL SENTENCE RELATION TO IMPROVE CROSS-LINGUAL DENSE RETRIEVAL

Abstract

Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite their successes, these are general-purpose PLMs, and multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, a property that holds universally across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks a sentence vector and predicts it via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach.
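The masked sentence prediction task described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`masked_sentence_step`, `info_nce_loss`), the toy document encoder, and the zero-vector stand-in for a learned [MASK] embedding are all assumptions, and the hierarchical contrastive loss is simplified here to a flat InfoNCE objective over one positive and sampled negatives.

```python
import numpy as np

def info_nce_loss(pred, positive, negatives, temperature=0.1):
    """Contrastive loss: pull the prediction toward the masked
    sentence's true vector and away from sampled negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    logits = np.array([cos(pred, positive)] + [cos(pred, n) for n in negatives])
    logits = logits / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0

def masked_sentence_step(sent_vecs, mask_idx, doc_encoder, negatives):
    """One MSM training step (sketch): replace one sentence vector with a
    mask embedding, run the document encoder over the sequence, and
    predict the masked sentence contrastively."""
    masked = sent_vecs.copy()
    masked[mask_idx] = 0.0                      # stand-in for a learned [MASK] embedding
    ctx = doc_encoder(masked)                   # contextualized sentence vectors
    return info_nce_loss(ctx[mask_idx], sent_vecs[mask_idx], negatives)

# Toy usage: 4 sentence vectors of dim 8, 5 sampled negatives, and a
# placeholder "document encoder" that mixes each vector with the document mean.
rng = np.random.default_rng(0)
sents = rng.normal(size=(4, 8))
negs = rng.normal(size=(5, 8))
doc_enc = lambda s: 0.5 * s + 0.5 * s.mean(axis=0)
loss = masked_sentence_step(sents, mask_idx=2, doc_encoder=doc_enc, negatives=negs)
```

In the actual model, the sentence and document encoders would be Transformers trained jointly, with the document encoder shared across languages so that the sequential sentence structure learned from one language transfers to others.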

1. INTRODUCTION

Cross-lingual retrieval (which here also covers multi-lingual retrieval) is becoming increasingly important as new texts in different languages are generated every day and people query and search for relevant documents across languages (Zhang et al., 2021b; Asai et al., 2021a). It is a fundamental and challenging task and plays an essential part in real-world search engines, for example Google and Bing, which serve hundreds of countries across diverse languages. It is also a vital component in solving many cross-lingual downstream problems, such as open-domain question answering (Asai et al., 2021a) or fact checking (Huang et al., 2022).

With the rapid development of deep neural models, cross-lingual retrieval has progressed from translation-based methods (Nie, 2010) and cross-lingual word embeddings (Sun & Duh, 2020) to dense retrieval built on top of multilingual pre-trained models (Devlin et al., 2019; Conneau et al., 2019). Dense retrieval models usually adopt pre-trained models to encode queries and passages into low-dimensional vectors, so their performance relies on the representation quality of the pre-trained models, and for multilingual retrieval it also calls for cross-lingual capabilities. Models like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2019), pre-trained with the masked language model task on large multilingual corpora, have been applied widely in cross-lingual retrieval (Asai et al., 2021a;b; Shi et al., 2021) and have achieved promising performance improvements. However, they are general pre-trained models and not tailored for dense retrieval.

Beyond the direct application of general-purpose models, some pre-training methods have been tailored for monolingual retrieval. Lee et al. (2019) and Gao & Callan (2021) propose to perform contrastive learning with synthetic query-document pairs to pre-train the retriever. They generate pseudo pairs either by selecting a sentence and its context or by cropping two sentences from a document.
Although these approaches show improvements, they have only been applied to monolingual retrieval, and the pairs generated by hand-crafted

