MODELING SEQUENTIAL SENTENCE RELATION TO IMPROVE CROSS-LINGUAL DENSE RETRIEVAL

Abstract

Recently, multi-lingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite these successes, they are general-purpose PLMs, and multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, a property that is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks and predicts a sentence vector via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach.

1. INTRODUCTION

Cross-lingual retrieval (which also includes multi-lingual retrieval) is becoming increasingly important as new texts in different languages are generated every day and people query and search for relevant documents across languages (Zhang et al., 2021b; Asai et al., 2021a). This fundamental and challenging task plays an essential part in real-world search engines such as Google and Bing, which serve hundreds of countries across diverse languages. It is also a vital component in many cross-lingual downstream problems, such as open-domain question answering (Asai et al., 2021a) and fact checking (Huang et al., 2022). With the rapid development of deep neural models, cross-lingual retrieval has progressed from translation-based methods (Nie, 2010) and cross-lingual word embeddings (Sun & Duh, 2020) to dense retrieval built on top of multi-lingual pre-trained models (Devlin et al., 2019; Conneau et al., 2019). Dense retrieval models usually adopt pre-trained models to encode queries and passages into low-dimensional vectors, so their performance relies on the representation quality of the pre-trained models; multilingual retrieval additionally calls for cross-lingual capabilities. Models like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2019), pre-trained with the masked language model task on large multilingual corpora, have been applied widely in cross-lingual retrieval (Asai et al., 2021a;b; Shi et al., 2021) and achieved promising performance improvements. However, they are general pre-trained models and not tailored for dense retrieval. Beyond direct application, there are some pre-training methods tailored for monolingual retrieval. Lee et al. (2019) and Gao & Callan (2021) propose to perform contrastive learning with synthetic query-document pairs to pre-train the retriever; they generate pseudo pairs either by selecting a sentence and its context or by cropping two sentences from a document.
Although these approaches show improvements, they have only been applied to monolingual retrieval, and the pairs generated by hand-crafted rules may be low-quality and noisy. In addition, learning universal sentence representations across languages is more challenging and crucial than in the monolingual setting, so better multilingual pre-training for retrieval needs to be explored. In this paper, we propose a multilingual PLM that leverages the sequential sentence relation across languages to improve cross-lingual retrieval. We start from the observation that parallel documents should each contain approximately the same sentence-level information. Specifically, the sentences in parallel documents appear in approximately the same order, while the words in parallel sentences usually do not. This means the sequential relation at the sentence level is similar and universal across languages. This idea has been adopted for document alignment (Thompson & Koehn, 2020; Resnik, 1998), which incorporates the order information of sentences. Motivated by it, we propose a novel masked sentence model (MSM) to learn this universal relation and facilitate isomorphic sentence embeddings for cross-lingual retrieval. It consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors in a document. The document encoder is shared across all languages and can learn the sequential sentence relation that is universal across languages. To train MSM, we adopt a sentence-level masked prediction task, which masks a selected sentence vector and predicts it using the output of the document encoder. Distinct from MLM, which predicts tokens from a pre-built vocabulary, we propose a hierarchical contrastive loss with sampled negatives for sentence-level prediction. We conduct comprehensive experiments on four cross-lingual dense retrieval tasks: Mr. TyDi, XOR Retrieve, Mewsli-X, and LAReQA.
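The masked sentence prediction objective can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the document encoder is a shared Transformer over sentence vectors, which we stand in for here with a simple context mean, and the function names, temperature, and negative-sampling scheme are illustrative assumptions rather than the published hierarchical loss.

```python
import math
import random

def info_nce(query, positive, negatives, temperature=0.1):
    """Contrastive loss: -log p(positive) under a softmax over the
    positive and the sampled negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(query, positive) / temperature]
    scores += [dot(query, n) / temperature for n in negatives]
    # numerically stable log-sum-exp over [positive, negatives]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]

def masked_sentence_loss(sent_vecs, mask_idx, negatives, temperature=0.1):
    """Predict the masked sentence vector from its document context.

    The real document encoder is a shared Transformer applied to the
    sequence of sentence vectors; the mean of the unmasked context
    vectors below is only a stand-in for its output at the masked slot.
    """
    context = [v for i, v in enumerate(sent_vecs) if i != mask_idx]
    dim = len(sent_vecs[0])
    pred = [sum(v[d] for v in context) / len(context) for d in range(dim)]
    return info_nce(pred, sent_vecs[mask_idx], negatives, temperature)
```

In the sketch, the masked sentence's own vector is the positive and vectors sampled from other documents serve as negatives, mirroring how the sentence-level prediction replaces MLM's fixed vocabulary with a contrastive comparison in embedding space.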
Experimental results show that our approach achieves state-of-the-art retrieval performance compared to other advanced models, which validates the effectiveness of MSM in cross-lingual retrieval. Our in-depth analysis demonstrates that cross-lingual transfer ability emerges because MSM learns the universal sentence relation across languages, which is beneficial for cross-lingual retrieval. Furthermore, we perform ablations to motivate our design choices and show that MSM works better than other counterparts.

2. RELATED WORK

Multi-lingual Pre-trained Models. Recently, multilingual pre-trained models (Lample & Conneau, 2019; Conneau et al., 2019; Huang et al., 2019) have enabled great success on a variety of multilingual tasks (Liang et al., 2020; Hu et al., 2020). Multilingual BERT (Devlin et al., 2019) is a transformer model pre-trained on Wikipedia using the multi-lingual masked language model (MMLM) task. XLM-R (Conneau et al., 2019) further extends the corpus with an order of magnitude more web data, still using MMLM. XLM (Lample & Conneau, 2019) proposes the translation language model (TLM) task to achieve cross-lingual token alignment. Unicoder (Huang et al., 2019) presents several pre-training tasks upon parallel corpora, and ERNIE-M (Ouyang et al., 2021) learns semantic alignment by leveraging back translation. XLM-K (Jiang et al., 2022) leverages multi-lingual knowledge bases to improve cross-lingual performance on knowledge-related tasks. InfoXLM (Chi et al., 2021) and HiCTL (Wei et al., 2020) encourage bilingual alignment via an InfoNCE-based contrastive loss. These models usually focus on cross-lingual alignment leveraging bilingual data, which does not fit cross-lingual retrieval, where semantic relevance between query and passage is required. There is little exploration of pre-training tailored for cross-lingual retrieval, which is exactly what our model addresses.

Cross-lingual Retrieval. Cross-lingual (including multi-lingual) retrieval is becoming increasingly important in the community and impacts our lives through real-world applications. In the past, multilingual retrieval relied on community-wide datasets from TREC, CLEF, and NTCIR, such as the CLEF 2000-2003 collections (Ferro & Silvello, 2015). These usually comprise a small number of queries (at most a few dozen) with relevance judgments, intended only for evaluation, which is insufficient for dense retrieval.
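For reference, the InfoNCE-based bilingual alignment mentioned above can be sketched as an in-batch contrastive loss over translation pairs. The function name, temperature, and the use of in-batch negatives are illustrative assumptions; the actual objectives in the cited models differ in details such as encoder setup and negative construction.

```python
import math

def in_batch_info_nce(src_vecs, tgt_vecs, temperature=0.05):
    """Average InfoNCE loss where src_vecs[i] and tgt_vecs[i] are
    translations of each other, and every other tgt_vecs[j] in the
    batch serves as a negative for src_vecs[i]."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total = 0.0
    for i, query in enumerate(src_vecs):
        scores = [dot(query, t) / temperature for t in tgt_vecs]
        # numerically stable log-sum-exp; loss is -log p(aligned pair)
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(src_vecs)
```

The log-sum-exp form keeps the softmax numerically stable when the temperature-scaled similarities are large.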
Recently, larger-scale cross-lingual retrieval datasets (Zhang et al., 2021b; Ruder et al., 2021) have been proposed to promote cross-lingual retrieval research, such as Mr. TyDi (Asai et al., 2021a) in the open-QA domain and Mewsli-X (Ruder et al., 2021) for news entity retrieval. Techniques in cross-lingual retrieval have progressed from translation-based methods (Nie, 2010; Shi et al., 2021) to cross-lingual word embeddings learned by neural models (Sun & Duh, 2020), and now to dense retrieval built on top of multi-lingual pre-trained models (Devlin et al., 2019; Conneau et al., 2019). Asai et al. (2021a;b) equip the bi-encoder retriever with mBERT, which plays an essential part in the open-QA system, and Zhang et al. (2022b) explore

