MODELING SEQUENTIAL SENTENCE RELATION TO IMPROVE CROSS-LINGUAL DENSE RETRIEVAL

Abstract

Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite their successes, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that the sentences in parallel documents appear in approximately the same order, which is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks and predicts a sentence vector via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach.

1. INTRODUCTION

Cross-lingual retrieval (including multilingual retrieval) is becoming increasingly important as new texts in different languages are generated every day and people query and search for relevant documents in different languages (Zhang et al., 2021b; Asai et al., 2021a). It is a fundamental and challenging task that plays an essential part in real-world search engines such as Google and Bing, which serve hundreds of countries across diverse languages. It is also a vital component of many cross-lingual downstream problems, such as open-domain question answering (Asai et al., 2021a) and fact checking (Huang et al., 2022). With the rapid development of deep neural models, cross-lingual retrieval has progressed from translation-based methods (Nie, 2010) to cross-lingual word embeddings (Sun & Duh, 2020), and now to dense retrieval built on top of multilingual pre-trained models (Devlin et al., 2019; Conneau et al., 2019). Dense retrieval models usually adopt pre-trained models to encode queries and passages into low-dimensional vectors, so their performance relies on the representation quality of the pre-trained models, and multilingual retrieval additionally calls for cross-lingual capabilities. Models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2019), pre-trained with the masked language model (MLM) task on large multilingual corpora, have been widely applied in cross-lingual retrieval (Asai et al., 2021a; b; Shi et al., 2021) and have achieved promising performance improvements. However, they are general-purpose pre-trained models and are not tailored for dense retrieval. Beyond such direct application, there are some pre-training methods tailored for monolingual retrieval. Lee et al. (2019) and Gao & Callan (2021) propose to perform contrastive learning with synthetic query-document pairs to pre-train the retriever. They generate pseudo pairs either by selecting a sentence and its context or by cropping two sentences from a document.
Although they show improvements, these approaches have only been applied to monolingual retrieval, and the pairs generated by hand-crafted rules may be noisy and of low quality. In addition, learning universal sentence representations across languages is more challenging and more crucial than in the monolingual case, so better multilingual pre-training for retrieval needs to be explored. In this paper, we propose a multilingual PLM that leverages the sequential sentence relation across languages to improve cross-lingual retrieval. We start from the observation that parallel documents should each contain approximately the same sentence-level information. Specifically, the sentences in parallel documents are approximately in the same order, while the words in parallel sentences usually are not. This means that the sequential relation at the sentence level is similar and universal across languages. This idea has been adopted for document alignment (Thompson & Koehn, 2020; Resnik, 1998), which incorporates the order information of sentences. Motivated by it, we propose a novel Masked Sentence Model (MSM) to learn this universal relation and encourage isomorphic sentence embeddings for cross-lingual retrieval. It consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors in a document. The document encoder is shared across all languages and can learn the sequential sentence relation that is universal across languages. To train MSM, we adopt a sentence-level masked prediction task, which masks a selected sentence vector and predicts it using the output of the document encoder. Unlike MLM, which predicts tokens from a pre-built vocabulary, we propose a hierarchical contrastive loss with sampled negatives for sentence-level prediction. We conduct comprehensive experiments on 4 cross-lingual dense retrieval tasks: Mr. TyDi, XOR Retrieve, Mewsli-X and LAReQA.
Experimental results show that our approach achieves state-of-the-art retrieval performance compared to other advanced models, which validates the effectiveness of MSM in cross-lingual retrieval. Our in-depth analysis demonstrates that the cross-lingual transfer ability emerges because MSM can learn the universal sentence relation across languages, which is beneficial for cross-lingual retrieval. Furthermore, we perform ablations to motivate our design choices and show that MSM works better than its counterparts.

2. RELATED WORK

Multi-lingual Pre-trained Models. Multilingual pre-trained models (Lample & Conneau, 2019; Conneau et al., 2019; Huang et al., 2019) have recently enabled great success in different multilingual tasks (Liang et al., 2020; Hu et al., 2020). Multilingual BERT (Devlin et al., 2019) is a transformer model pre-trained on Wikipedia using the multi-lingual masked language model (MMLM) task. XLM-R (Conneau et al., 2019) further extends the corpus to an order of magnitude more web data with MMLM. XLM (Lample & Conneau, 2019) proposes the translation language model (TLM) task to achieve cross-lingual token alignment. Unicoder (Huang et al., 2019) presents several pre-training tasks on parallel corpora, and ERNIE-M (Ouyang et al., 2021) learns semantic alignment by leveraging back translation. XLM-K (Jiang et al., 2022) leverages a multi-lingual knowledge base to improve cross-lingual performance on knowledge-related tasks. InfoXLM (Chi et al., 2021) and HiCTL (Wei et al., 2020) encourage bilingual alignment via an InfoNCE-based contrastive loss. These models usually focus on cross-lingual alignment using bilingual data, which does not fit cross-lingual retrieval, where what matters is the semantic relevance between query and passage. There has been little exploration of how to improve pre-training tailored for cross-lingual retrieval, which is exactly what our model addresses.

Cross-lingual Retrieval. Cross-lingual (including multi-lingual) retrieval is becoming increasingly important in the community and affects real-world applications. In the past, multilingual retrieval relied on community-wide datasets from TREC, CLEF, and NTCIR, such as the CLEF 2000-2003 collection (Ferro & Silvello, 2015). These datasets usually comprise a small number of queries (at most a few dozen) with relevance judgments and are intended only for evaluation, which makes them insufficient for training dense retrievers.
Recently, more large-scale cross-lingual retrieval datasets (Zhang et al., 2021b; Ruder et al., 2021) have been proposed to promote cross-lingual retrieval research, such as Mr. TyDi (Zhang et al., 2021b), proposed in the open-QA domain, and Mewsli-X (Ruder et al., 2021) for news entity retrieval. Techniques in the cross-lingual retrieval field have progressed from translation-based methods (Nie, 2010; Shi et al., 2021) to cross-lingual word embeddings learned by neural models (Sun & Duh, 2020), and now to dense retrieval built on top of multi-lingual pre-trained models (Devlin et al., 2019; Conneau et al., 2019). Asai et al. (2021a; b) equip the bi-encoder retriever with mBERT, which plays an essential part in the open-QA system, and Zhang et al. (2022b) study the impact of data and model. However, most existing work focuses on fine-tuning for a specific task, while ours targets pre-training and is evaluated on diverse benchmarks. There also exist some similarity-specialized multi-lingual models (Litschko et al., 2021), trained with supervision from parallel or labeled data. LASER (Artetxe & Schwenk, 2019) trains a seq2seq model on large-scale parallel data, and LaBSE (Feng et al., 2022) encourages bilingual alignment via a contrastive loss. m-USE (Yang et al., 2019) is trained with mined QA pairs, translation pairs and the SNLI corpus. Others utilize distillation (Reimers & Gurevych, 2020; Li et al., 2021), adapters (Pfeiffer et al., 2020; Litschko et al., 2022), or siamese learning (Zhang et al., 2021c). Compared to them, MSM is unsupervised and requires no parallel data, which is simpler and effective (Artetxe et al., 2020b), and it can also be further trained with these bilingual tasks.

Figure 1: The general framework of the masked sentence model (MSM), which has a hierarchical model architecture comprising the sentence encoder and the document encoder. The masked sentence prediction task predicts the masked sentence vector p_t, given the original vector h_t as the positive anchor, via a hierarchical contrastive loss.

Dense Retrieval. Dense retrieval (Karpukhin et al., 2020; Lee et al., 2019; Xiong et al., 2020; Zhang et al., 2022a) (usually monolingual here) typically uses a bi-encoder model to encode queries and passages into low-dimensional representations. Several directions have recently been explored in pre-training tailored for dense retrieval: utilizing the hyperlinks between Wikipedia pages (Ma et al., 2021; Zhou et al., 2022), synthesizing query-passage datasets for pre-training (Oguz et al., 2021; Reddy et al., 2021), and auto-encoder-based models that force better representations (Lu et al., 2021; Ma et al., 2022).
Among them, a popular direction leverages the correlation of intra-document text pairs for pre-training. Lee et al. (2019) and Chang et al. (2020) propose the Inverse Cloze Task (ICT), which treats a sentence as a pseudo-query and the concatenated context as the pseudo-passage for contrastive pre-training. Another way is to crop two sentence spans from a document as a positive pair (we call this CROP for short) (Giorgi et al., 2021; Izacard et al., 2021a), including Wu et al. (2022a) and Iter et al. (2020), who use two sentences, and Gao & Callan (2021), who adopt two non-overlapping spans. The methods most relevant to ours are ICT and CROP, which generate two views of a document for contrastive learning. However, the correlation within such a pseudo pair is coarse-grained and not even guaranteed. In contrast, our approach uses the sequence of sentences in a document and models the universal sentence relation across languages via an explicit document encoder, resulting in better representations for cross-lingual retrieval.

3.1. HIERARCHICAL MODEL ARCHITECTURE

In this section, we first present the hierarchical model architecture. As illustrated in Figure 1, our Masked Sentence Model (MSM) has a hierarchical architecture that contains the Sentence Encoder and the Document Encoder. The document encoder is applied to the sentence vectors generated by the sentence encoder from the sequence of sentences in a document.

Sentence Encoder. Consider a document containing a sequence of sentences D = (S_1, S_2, ..., S_N), in which S_i denotes a sentence in the document and each sentence contains a list of words. As shown in Figure 1, the sentence encoder extracts the sentence representations for the sequence in a document, and the document encoder models the sentence relation and predicts the masked sentences. We adopt a transformer-based encoder as our sentence encoder. As usual, each sentence is passed through the embedding layer and the transformer layers, and we take the last hidden state of the CLS token as the sentence representation. Note that all the sentence encoders share their parameters, and we obtain N sentence vectors D_H = (h_1, h_2, ..., h_N). In this task, we encode the complete sentences without any token mask to obtain thorough sentence representations.

Document Encoder. The sentence-level vectors are then passed through the document encoder, which has a similar transformer-based architecture. Since the sentences in a document have a sequential order, the document encoder uses sentence position embeddings, and it has no token type embedding. After adding the sentence position embeddings, we encode the vectors through the document encoder layers to obtain document-level context-aware embeddings D_P = (p_1, p_2, ..., p_N). To train our model, we apply a sentence-level mask to the sentence vectors for the masked sentence prediction task. Specifically, given the original sentence vectors D_H = (h_1, h_2, ..., h_N), we replace the selected sentence vector h_t with the [mask] token and keep the others unchanged.
The original sentence vector h_t serves as the positive anchor for the output vector p_t of the document encoder at the [mask] position. For a document that contains N sentences, we mask each sentence in turn while keeping the others unchanged, which yields N pairs of p_t and h_t. Obtaining as many samples as possible from each document in this way makes training efficient. Since the input to the document encoder is short (the number of sentences in a document is limited) and the document encoder is shallow, our approach adds little computation. Some other models also adopt a hierarchical transformer-based architecture (Santra et al., 2021). For example, HiBERT (Zhang et al., 2019) uses multi-level transformers for document summarization, but it applies the mask to words and uses a decoder for autoregressive pre-training. Poolingformer (Zhang et al., 2021a) proposes a two-level pooling attention schema for long documents but cannot be applied to retrieval. These models mainly adopt token-level MLM and target document understanding, while ours focuses on masked sentence prediction and is directed at cross-lingual retrieval.
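The "mask each sentence in turn" step can be illustrated with a minimal sketch. It assumes sentence vectors are plain Python lists and that the [mask] token is represented by a fixed vector; the function name `mask_each_sentence` and the toy values are hypothetical and are not taken from the MSM implementation.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mask_each_sentence(
    sentence_vectors: Sequence[Vector],
    mask_vector: Vector,
) -> List[Tuple[int, List[Vector]]]:
    """Return N masked copies of a document, one per masked position t.

    Each copy is what a document encoder would receive as input; the
    untouched original vector h_t remains available as the positive anchor.
    """
    masked_inputs = []
    for t in range(len(sentence_vectors)):
        # Copy every vector so the original h_1..h_N are never modified.
        masked = [list(v) for v in sentence_vectors]
        # Replace only position t with the (assumed) [mask] vector.
        masked[t] = list(mask_vector)
        masked_inputs.append((t, masked))
    return masked_inputs


# Toy document with N = 3 sentences and 2-dimensional sentence vectors.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [0.0, 0.0]
masked_inputs = mask_each_sentence(doc, mask)
```

In this sketch a document of N sentences produces N encoder inputs, which matches the N pairs of p_t and h_t described above; a batched tensor implementation would build the same inputs without the explicit loop.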

3.2. MASKED SENTENCE PREDICTION TASK

To model the sentence relation, we propose a masked sentence prediction task that aligns the masked sentence vector p_t with the corresponding original h_t via a hierarchical contrastive loss. Unlike the masked language model, which can directly compute a cross-entropy loss between masked tokens and a pre-built vocabulary, our model lacks a sentence-level vocabulary. We therefore propose a novel hierarchical contrastive loss on sentence vectors. Contrastive learning has been shown to be effective in sentence representation learning (Karpukhin et al., 2020; Gao et al., 2021), and our model modifies the typical InfoNCE loss (Oord et al., 2018) into a hierarchical contrastive loss for the masked sentence prediction task. As shown in Figure 1, for the masked sentence vector p_t, the positive anchor is the original h_t, and we collect two categories of negatives: (a) Cross-Doc Negatives are sentence vectors from different documents, i.e. h_k^C, which can be seen as the usual random negatives. (b) Intra-Doc Negatives are sentence vectors from the same document generated by the sentence encoder, i.e. h_j^I with j ≠ t. The masked sentence vector p_t and these vectors are then passed through the projection layer, and the output vectors enter the hierarchical contrastive loss:

$$\mathcal{L}_{msm}\left(p_t, \{h_t^I, h_1^I, \ldots, h_{|I|}^I, h_1^C, \ldots, h_{|C|}^C\}\right) = -\log \frac{e^{\mathrm{sim}(p_t, h_t^I)}}{e^{\mathrm{sim}(p_t, h_t^I)} + \sum_{j=1, j \neq t}^{|I|} e^{\mathrm{sim}(p_t, h_j^I) - \mu\alpha} + \sum_{k=1}^{|C|} e^{\mathrm{sim}(p_t, h_k^C)}} \tag{1}$$

In previous studies (Gao & Callan, 2021), two sampled views or sentences of the same document are often treated as a positive pair to leverage their correlation. However, this limits the representation capability because it encourages alignment between the two views, much like a coarse-grained topic model (Yan et al., 2013). In contrast, we treat them as Intra-Doc Negatives, which helps the model distinguish sentences from the same document and learn fine-grained representations.
The intra-doc samples usually have a closer semantic relation than the cross-doc ones, and directly treating them as negatives could hurt the uniformity of the embedding space. To prevent this negative impact, we subtract a dynamic bias from their similarity scores. As seen in Eq. 1, the dynamic bias is -µα, in which µ is a hyper-parameter and α is computed as:

$$\alpha = \left( \frac{\sum_{j=1, j \neq t}^{|I|} \mathrm{sim}(p_t, h_j^I)}{|I| - 1} - \frac{\sum_{k=1}^{|C|} \mathrm{sim}(p_t, h_k^C)}{|C|} \right).\mathrm{detach}() \tag{2}$$

It represents the gap between the average similarity score of the Intra-Doc Negatives and that of the Cross-Doc Negatives. Subtracting the dynamic bias lowers the high similarity of intra-doc negatives toward the level of cross-doc negatives, which can also be seen as an interpolation that generates soft samples. Note that we only use the value of α and do not pass gradients through it, so we apply the detach function after the computation. Our experimental results in Sec. 5.5 validate that the hierarchical contrastive loss is beneficial for representation learning in our model.

Considering the expensive cost of pre-training from scratch, we initialize the sentence encoder with pre-trained XLM-R weights and train only the document encoder from scratch. To prevent gradients back-propagated from the randomly initialized document encoder from damaging the sentence encoder weights, we adopt the MLM task to impose a semantic constraint. Our total loss therefore consists of a token-level MLM loss and a sentence-level contrastive loss:

$$\mathcal{L} = \mathcal{L}_{msm} + \mathcal{L}_{mlm} \tag{3}$$

After pre-training, we discard the document encoder and keep the sentence encoder for fine-tuning. In fact, the document encoder in MSM acts as a bottleneck (Li et al., 2020): the sentence encoder compresses the sentence semantics into sentence vectors, and the document encoder uses this limited information to predict the masked sentence vector, thus enforcing an information bottleneck on the sentence encoder for better representations.
This also coincides with recent works that use a similar bottleneck theory to obtain better text encoders (Lu et al., 2021; Liu & Shao, 2022). Note that the sentence encoder has the same architecture as XLM-R, which ensures a fair comparison.
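As an illustration of Eq. 1 and the dynamic bias, the following is a minimal sketch of the loss for a single masked position, written in plain Python over precomputed similarity scores. The function name and arguments are hypothetical; the sketch assumes at least one intra-doc and one cross-doc negative, omits the projection layer, and does not reproduce the paper's batched implementation. In an autograd framework, α would additionally be detached from the computation graph.

```python
import math
from typing import Sequence


def hierarchical_contrastive_loss(
    sim_pos: float,
    sims_intra: Sequence[float],
    sims_cross: Sequence[float],
    mu: float,
) -> float:
    """Loss for one masked position t.

    sim_pos:    sim(p_t, h_t), the score of the positive anchor.
    sims_intra: sim(p_t, h_j^I) for j != t (so its length is |I| - 1).
    sims_cross: sim(p_t, h_k^C) for the cross-doc negatives.
    mu:         hyper-parameter scaling the dynamic bias.
    """
    # alpha: gap between the mean intra-doc and mean cross-doc similarity.
    # It is used as a constant (no gradient would flow through it).
    alpha = sum(sims_intra) / len(sims_intra) - sum(sims_cross) / len(sims_cross)
    bias = mu * alpha

    denominator = (
        math.exp(sim_pos)
        + sum(math.exp(s - bias) for s in sims_intra)  # biased intra-doc terms
        + sum(math.exp(s) for s in sims_cross)         # plain cross-doc terms
    )
    # -log(exp(sim_pos) / denominator), rewritten as a difference of logs.
    return math.log(denominator) - sim_pos


# Toy scores: intra-doc negatives are more similar than cross-doc ones.
loss_with_bias = hierarchical_contrastive_loss(2.0, [1.5, 1.0], [0.0, -0.5], mu=0.5)
loss_without_bias = hierarchical_contrastive_loss(2.0, [1.5, 1.0], [0.0, -0.5], mu=0.0)
```

With µ = 0 the intra-doc negatives are penalized at their full similarity; a positive µ shrinks their contribution to the denominator whenever they are, on average, more similar to p_t than the cross-doc negatives. A production version would typically use a log-sum-exp formulation for numerical stability.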

4.1. EVALUATION DATASETS

We evaluate our model against its counterparts on 4 popular datasets: Mr. TyDi is for query-passage retrieval, XOR Retrieve is cross-lingual retrieval for open-domain QA, and Mewsli-X and LAReQA are for language-agnostic retrieval. Mr. TyDi (Zhang et al., 2021b) aims to evaluate cross-lingual passage retrieval with dense representations. Given a question in language L, the model should retrieve relevant texts in language L that can answer the query. XOR Retrieve (Asai et al., 2021a) is proposed for multilingual open-domain QA, and we take its sub-task XOR-Retrieve: given a question in L (e.g., Korean), the task is to retrieve English passages that can answer the query. Mewsli-X is built on top of Mewsli (Botha et al., 2020), and we follow the setting of XTREME-R (Ruder et al., 2021), in which it consists of 15K mentions in 11 languages. LAReQA (Roy et al., 2020) is a retrieval task in which each query has target answers in multiple languages, and models must retrieve all correct answers regardless of language. See Appendix A.1 for more details.

4.2. IMPLEMENTATION DETAILS

For the pre-training stage, we adopt a transformer-based sentence encoder initialized from the XLM-R weights and a 2-layer transformer document encoder trained from scratch. We use a learning rate of 4e-5 and the Adam optimizer with a linear warm-up. Following Wenzek et al. (2019), we collect a clean version of Common Crawl (CC) comprising a 2,500GB multi-lingual corpus covering 108 languages, using the same pre-processing method as XLM-R (Conneau et al., 2019). Note that we only train on CC without any bilingual parallel data, in an unsupervised manner. To limit memory consumption during training, we limit the length of each sentence to 64 words (longer parts are truncated) and split documents with more than 32 sentences into smaller ones, each containing at most 32 sentences. The remaining settings mainly follow those of the original XLM-R in FairSeq. We conduct pre-training on 8 A100 GPUs for about 200k steps. For the fine-tuning stage, we mainly follow the hyper-parameters of the original papers for the Mr. TyDi and XOR Retrieve tasks, respectively. For the Mewsli-X and LAReQA tasks, we mainly follow the settings of XTREME-R using its open-source codebase. Note that we did not tune the hyper-parameters and mainly adopted the original settings with the same pipeline for a fair comparison. See Appendix A.1 for more details of the fine-tuning hyper-parameters.
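The length limits described above can be sketched as a small pre-processing helper. This is only an illustration: it assumes whitespace-separated words (the actual pipeline may count tokens differently), and the function name `preprocess_document` and its defaults are hypothetical.

```python
from typing import List


def preprocess_document(
    sentences: List[str],
    max_words: int = 64,
    max_sentences: int = 32,
) -> List[List[str]]:
    """Truncate long sentences and split long documents into chunks."""
    # Keep at most `max_words` whitespace-separated words per sentence.
    truncated = [" ".join(s.split()[:max_words]) for s in sentences]
    # Split into consecutive chunks of at most `max_sentences` sentences,
    # preserving the original sentence order inside each chunk.
    return [
        truncated[i : i + max_sentences]
        for i in range(0, len(truncated), max_sentences)
    ]


# A toy document: one 100-word sentence followed by 69 short sentences.
document = ["w " * 100] + ["short sentence"] * 69
chunks = preprocess_document(document)
```

Each resulting chunk is then treated as one document for the masked sentence prediction task, so sentence order is preserved within a chunk but not modeled across chunk boundaries.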

5.1. EVALUATION SETTINGS

Cross-lingual Zero-shot Transfer. This setting is the most widely adopted for evaluating multilingual scenarios with English as the source language, as many tasks only have labeled training data available in English. Concretely, the models are fine-tuned on English labeled data and then evaluated on the test data in the target languages. It also simplifies evaluation, as models only need to be trained once and can be evaluated on all other languages. For the Mr. TyDi dataset, the original paper adopts the Natural Questions (NQ) data (Kwiatkowski et al., 2019) for fine-tuning, while Zhang et al. (2022b) later suggest fine-tuning on MS MARCO for better results, so we fine-tune on MS MARCO when comparing with the best-reported results and on NQ otherwise. For XOR Retrieve, we fine-tune on the NQ dataset as in the original paper (Asai et al., 2021a). For Mewsli-X and LAReQA, we follow the settings in XTREME-R, where Mewsli-X is fine-tuned on a predefined set of English-only mention-entity pairs and LAReQA on the English QA pairs from the SQuAD v1.1 train set.

Multi-lingual Fine-tune. For the tasks where multi-lingual training data is available, we additionally compare the performance when jointly training on the combined training data of all languages. Following the setting of Mr. TyDi and XOR Retrieve, we pre-fine-tune models as in Cross-lingual Zero-shot Transfer and then fine-tune on multi-lingual data if available. For Mewsli-X and LAReQA, no multi-lingual labeled data is available.

5.2. MAIN RESULTS

In this section, we evaluate our model on diverse cross-lingual and multi-lingual retrieval tasks and compare it with other strong pre-training baselines. Multilingual BERT (Devlin et al., 2019) is pre-trained on multilingual Wikipedia using MLM, and XLM-R (Conneau et al., 2019) extends the corpus to an order of magnitude more web data covering 100 languages. We report the main results in improvements on all the tasks, which demonstrates the effectiveness of our proposed MSM. (2) Under the setting of cross-lingual zero-shot transfer, MSM outperforms the other models by a large margin. Compared to the strong XLM-R baseline, MSM improves MRR@100 by 7% on Mr. TyDi, R@2k by 4.3% on XOR Retrieve, mAP@20 by 1.8% on Mewsli-X, and mAP@20 by 4.2% on LAReQA. (3) Under the setting of multi-lingual fine-tuning, the performance can be further improved by fine-tuning on multi-lingual training data, and MSM achieves the best results. However, multi-lingual data is often unavailable (as for Mewsli-X and LAReQA), especially for low-resource languages, and in this case MSM achieves larger gains thanks to its stronger cross-lingual ability.

results and most languages. LaBSE is slightly better on Recall@100 because it extends to a 500k vocabulary, which is twice the size of ours, and it utilizes parallel data. Interestingly, although it is fine-tuned only on English data, MSM also achieves larger gains in other languages. For example, MSM improves Recall@100 by more than 5% on AR, FI, RU, and SW compared to XLM-R, which clearly shows that MSM leads to better cross-lingual transfer than the other baselines.
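For readers unfamiliar with the MRR@100 metric reported above, the following sketch computes MRR@k under its standard definition: the reciprocal rank of the first relevant passage within the top k, averaged over queries, with 0 for queries that have no relevant passage in the top k. The function name and the toy data are hypothetical; the official evaluation scripts of each benchmark should be used to reproduce reported numbers.

```python
from typing import List, Sequence, Set


def mrr_at_k(
    rankings: Sequence[List[str]],
    relevant: Sequence[Set[str]],
    k: int = 100,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in relevant_ids:
                # Only the first relevant hit counts for this query.
                total += 1.0 / rank
                break
    return total / len(rankings)


# Two toy queries: the first finds a relevant passage at rank 2,
# the second retrieves nothing relevant.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]
score = mrr_at_k(rankings, relevant, k=100)
```

Here the first query contributes 1/2 and the second contributes 0, so the average over the two queries is 0.25.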

5.3. RESULTS ACROSS LANGUAGES

In Table 3, we mainly compare MSM with unsupervised pre-trained models. We reproduced two strong baselines proposed for monolingual retrieval, i.e. ICT and CROP, by extending them to multilingual corpora. We follow the original setting (Lee et al., 2019) for ICT and follow Wu et al. (2022a) for CROP. The results indicate that learning from two views of the same document (i.e. ICT and CROP) can achieve competitive results. Yet compared to them, MSM achieves larger gains, especially in low-resource languages, which indicates that modeling the sequential sentence relation across languages is indeed beneficial for cross-lingual transfer. More detailed results across different languages on the other tasks, together with more analysis on multilinguality, can be found in Appendix A.2.

5.4. ANALYSIS OF CROSS-LINGUAL ABILITY

To investigate why MSM benefits cross-lingual retrieval and how the cross-lingual ability emerges, we design some analytical experiments. Recall that in the zero-shot transfer setting, the pre-trained model is first fine-tuned on English data and then evaluated on all languages, which relies on the cross-lingual transfer ability from EN to the others. In this experiment, we therefore use separate document encoders for EN and for the other languages to break the sentence relation shared between them, and observe how this impacts the retrieval performance. In Table 4, we report the results of different settings: Share All is the original MSM where the document encoder is shared across all languages, Sep Doc uses two separate document encoders for EN and the others, and Sep Doc + Head separates both the document encoder and the projection head. The results clearly show that if EN and the other languages do not share the same document encoder, the cross-lingual transfer ability drops rapidly. Previous works on multi-lingual BERTology (Conneau et al., 2020; Artetxe et al., 2020a; Rogers et al., 2020) found that a shared model can learn universal word embeddings across languages. Similarly, our findings indicate that the shared document encoder benefits universal sentence embeddings. This experiment further demonstrates that the sequential sentence relation is universal across languages and that modeling this relation is helpful for cross-lingual retrieval.

5.5. ABLATION STUDY

In this section, we conduct an ablation study on several components of our model. For computational efficiency and a fair comparison, we fine-tune all the pre-trained models on NQ data and evaluate them on the target data.

Ablation of Loss Function. We first study the effectiveness of the hierarchical contrastive loss proposed in Eq. 1. As shown in Table 5, Cross-doc means using only cross-doc negatives without the intra-doc negatives. It results in poor performance because the information in intra-doc samples is not utilized. The w/o bias setting leads to a significant decrease because it treats intra-doc sentences as plain negatives, which harms the representation space as stated in Sec. 3.2. We can vary the hyper-parameter µ to tune the impact of intra-doc negatives, and the best results are obtained when µ is set to an appropriate value, which indicates that our loss contributes to better representation learning.

Impact of Contrastive Negative Size. We analyze how the number of negatives influences performance by varying it from 256 to 4096. As shown in Table 6, the performance increases as the negative size becomes larger, with diminishing gains once the batch is sufficiently large. Interestingly, the model performance does not improve when the batch size is increased further, which has also been observed in other work on contrastive learning (Cao et al., 2022; Chen et al., 2021). In our case, this may be because when the total number of negatives becomes large, the impact of the intra-document negatives is diminished, which hurts performance. In addition, performance can be harmed by training instability when the batch size is too large (Chen et al., 2021).

Impact of Projector Setting. Existing work has shown that the projection layers (Dong et al., 2022; Cao et al., 2022) between the representation and the contrastive loss affect the quality of the learned representations. We explore the impact of the projection layer under different settings in Table 7.
Referring to Figure 1, Shared PL means the two projection layers share the same parameters, and Asymmetric PL means they do not. The results show that an Asymmetric PL performs better than the alternatives and that removing the projection layers badly degrades the performance. One possible reason is that the projection layer can bridge the semantic gap between the representations of different samples in the embedding space.

Impact of Document Encoder Layers. We explore the impact of the number of document encoder layers in Table 8 and find that a two-layer document encoder achieves the best results. When the document encoder has only one layer, its capacity is not enough to model the sequential sentence relation, resulting in inefficient utilization of information. When the number of layers increases further, the masked sentence prediction task may depend more on the document encoder's ability, which degrades the sentence representations; this is also in line with the findings of Wu et al. (2022b) and Lu et al. (2021).

6. CONCLUSION

In this paper, we propose a novel masked sentence model that leverages the sequential sentence relation for pre-training to improve cross-lingual retrieval. It contains a two-level encoder in which the document encoder is applied to the sentence vectors generated by the sentence encoder from a sequence of sentences. We then propose a masked sentence prediction task to train the model, which masks and predicts the selected sentence vector via a hierarchical contrastive loss with sampled negatives. Through comprehensive experiments on 4 cross-lingual retrieval benchmark datasets, we demonstrate that MSM significantly outperforms existing advanced pre-training methods. Our further analysis and detailed ablation study clearly show the effectiveness and stronger cross-lingual retrieval capabilities of our approach.

@100 43.8 41.2 29.3 33.2 45.4 27.0 33.4 32.2 35.3 44.5 49.7 37.7
XLMR Recall@100 77.4 81.5 68.5 72.2 81.8 61.2 66.5 66.4 63.9 75.5 85.4 72.8
MRR@100 51.6 53.0 31.6 39.4 50.5 32.0 36.8 37.2 43.4 62.6 53.5

(Tab.9) it improves +8.1 MRR@100 for SW and +11.8 for BN compared to XLM-R, and for XOR Retrieve (Tab.10) +8.8 R@5k for TE. Although data in low-resource languages is limited and the model suffers from the curse of multilinguality, MSM leads to better transfer to these languages, benefiting from modeling the sequential sentence relation across languages.

(2) The target languages closer to the pivot language (i.e. EN in our experiments) usually perform better and achieve larger improvements. On the Mewsli-X task (Tab.11), MSM improves MAP by +5.3 for DE but only by +1.3 for UK, because German (DE) is more similar to English in both script and language family (Ruder et al., 2021). Similar observations hold for LAReQA (Tab.12), where MSM performs better and improves more on DE, EL and ES, and does worse on ZH and TH.

(3) Multi-lingual data leads to better cross-lingual retrieval performance.
As can be seen in Tab.9 and Tab.10, the performance can be further improved after fine-tuning on multi-lingual training data. This indicates that the target languages can benefit from the other languages' data and shows that fine-tuning on multi-lingual data is worthwhile whenever it is available. It is worth mentioning that in this setting MSM also achieves larger gains, which demonstrates its better cross-lingual transfer ability.

A.3 COMPARISON WITH SEVERAL EXISTING METHODS

Table 13 shows the comparison of MSM and several existing retrieval approaches. DistilmBERT (Reimers & Gurevych, 2020) distills knowledge from m-USE (Yang et al., 2019), which is trained on labeled pair data, into mBERT. LaBSE (Feng et al., 2022) and InfoXLM (Chi et al., 2021) encourage bilingual alignment via a translation ranking loss, and are also trained with the MLM and TLM tasks (Lample & Conneau, 2019). InfoXLM adopts momentum contrast, and LaBSE proposes additive margin softmax for contrastive learning. They all use additional parallel corpora, while our MSM only needs multi-lingual data without relying on any parallel or labeled data. Among unsupervised pre-trained models, mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2019) are general-purpose multilingual text encoders trained with MLM. XLMR-Long (Sagen, 2021) (or XLMR Longformer) is an XLMR model that has been extended to allow sequence lengths of up to 4096 tokens. mContriever (Izacard et al., 2021b) and CCP (Wu et al., 2022a) similarly mine positive pairs by cropping two spans from a document. The former uses random cropping, while the latter uses two sentences. However, the quality of the cropped pairs is not guaranteed. In contrast, MSM utilizes a sequence of sentences in a document and models the universal sentence relation across languages via an explicit document encoder, which results in better cross-lingual retrieval capability.

Motivated by the recent progress of giant models, we also increase the model capacity. To limit the computational cost, the large model is initialized with XLMR-large, and all other settings are kept the same as for the base model. In Tab.14, we report the zero-shot transfer results on Mr. TyDi after fine-tuning on NQ data. The results show that as the model capacity increases, the performance on the downstream task improves consistently. We also report several strong large-size pre-trained models, including InfoXLM and CCP, which are both initialized with XLM-R Large. Compared to them, MSM-Large achieves better MRR@100 and comparable Recall@100, which further demonstrates the effectiveness of MSM at different model capacities.

A.5 COMPARISON TO MT-BASED CROSS-LINGUAL RETRIEVAL

In this section, we compare the model performance to MT-based (machine translation based) cross-lingual retrieval. As shown in Table 15, we provide four MT-based baselines borrowed from Asai et al. (2021a), all of which first translate the query into the document language using a translation system and then perform monolingual retrieval. For translation systems, GMT means Google's online machine translation service and White-MT means a white-box translation model based on autoregressive transformers. For the monolingual retrieval model, PATH means Path Retriever (Asai et al., 2019), a graph-based recurrent retrieval approach, and DPR (Karpukhin et al., 2020) is a typical bi-encoder based retriever. Both of them are trained on the human-translated queries with the annotated gold paragraph data of XOR Retrieve. The results in Table 15 show that MSM outperforms the White-box MT-based methods by a large margin, which demonstrates the effectiveness of MSM. Moreover, upgrading the MT system to GMT achieves even better results, owing to the superiority of industrial MT systems (large parallel data, models, pipelines, etc.) (Asai et al., 2021a). These observations indicate that the performance of MT-based methods depends heavily on the quality of the MT system. However, GMT is a black-box system, so it is difficult to analyze. In addition, the MT-based method relies on a two-stage pipeline that first translates and then retrieves, which may lead to cumulative errors. In contrast, our MSM is a universal pre-trained model and can be easily applied in bi-encoder retrievers, which are easier to deploy and diagnose.
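The structural difference between the two approaches can be illustrated with a minimal sketch. Everything in it is a toy assumption rather than any system evaluated here: the bag-of-words encoders, the three-word vocabulary, the deliberately faulty `noisy_translate` function, and the two passages are all hypothetical stand-ins. It only shows how a translation error in the first stage propagates into the second stage, whereas a single-stage bi-encoder scores the original query directly.

```python
from typing import Callable, List, Sequence

VOCAB = ["capital", "france", "bank"]  # toy vocabulary (assumption)


def encode_english(text: str) -> List[float]:
    """Toy monolingual encoder: bag-of-words counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


# Toy stand-in for a multilingual encoder: maps French words into the
# shared vocabulary before counting (hypothetical, for illustration only).
SHARED = {"capitale": "capital", "france": "france", "banque": "bank"}


def encode_multilingual(text: str) -> List[float]:
    words = [SHARED.get(w, w) for w in text.lower().split()]
    return [float(words.count(w)) for w in VOCAB]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float], passage_vecs: List[List[float]], k: int) -> List[int]:
    """Return indices of the k passages with the highest dot-product score."""
    order = sorted(range(len(passage_vecs)),
                   key=lambda i: dot(query_vec, passage_vecs[i]),
                   reverse=True)
    return order[:k]


def mt_pipeline_retrieve(query: str, translate: Callable[[str], str],
                         passage_vecs: List[List[float]], k: int) -> List[int]:
    """Two-stage pipeline: translate the query, then retrieve monolingually."""
    translated = translate(query)  # stage 1: any error here is passed on
    return top_k(encode_english(translated), passage_vecs, k)  # stage 2


def bi_encoder_retrieve(query: str, passage_vecs: List[List[float]], k: int) -> List[int]:
    """Single-stage retrieval: encode the original query directly."""
    return top_k(encode_multilingual(query), passage_vecs, k)


def noisy_translate(query: str) -> str:
    """Hypothetical faulty translator: renders 'capitale' as 'bank'."""
    faulty = {"capitale": "bank", "france": "france"}
    return " ".join(faulty.get(w, w) for w in query.lower().split())


passages = ["capital france", "bank france"]
passage_vecs = [encode_english(p) for p in passages]

mt_result = mt_pipeline_retrieve("capitale france", noisy_translate, passage_vecs, k=1)
direct_result = bi_encoder_retrieve("capitale france", passage_vecs, k=1)
```

In this toy setup the mistranslated query ranks the wrong passage first, while the direct encoder ranks the intended one first. Real systems differ in every component, so the sketch should be read only as an illustration of where errors can accumulate.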

A.6 MONOLINGUAL EXPERIMENT

We adapt our MSM pre-training method to the monolingual setting by restricting the training corpus and the model to English only. We initialize the sentence encoder with the ERNIE-2.0-Base model and keep all other settings the same as in our multi-lingual experiments. In Table 16, we report the performance on the Natural Questions (Kwiatkowski et al., 2019) test set after fine-tuning. It shows that our proposed unsupervised model achieves better performance than advanced baselines including





Retrieval performance comparison on four benchmark datasets. The best performing models are marked bold and the results unavailable are left blank. * means the results are borrowed from published papers: † from Zhang et al. (2022b), ‡ from Asai et al. (2021a), § from Ruder et al. (2021), while others are evaluated using the same pipeline by us for a fair comparison.

Performance comparison on Mr. TyDi across languages in the cross-lingual zero-shot transfer setting, where all the models are fine-tuned on MS MARCO data. "-" means the model does not support the BN and TE languages, in which case the average is computed over the supported languages.





Performance comparison of zero-shot cross-lingual retrieval when using separate document encoders (and projection heads) for English and for other languages. OTHS means the average over all languages except EN.

Comparison of different settings for the hierarchical contrastive loss.

Impact of the number of contrastive negatives. Best are marked bold.

Impact of the projector settings. Best are marked bold.

Comparison of different document encoder layers. Best are marked bold.

Mr. TyDi results across languages. We report MRR@100 and Recall@100 on the test sets of Mr. TyDi in the two settings.

XOR Retrieve results across languages. We report the R@2kt and R@5kt metrics on the test sets in the two settings.

Mewsli-X results across different input languages. We report the mean average precision@20 (mAP@20) results.

LAReQA results across different question languages. We report the mean average precision@20 (mAP@20) results.

Comparison with existing approaches. For the training objective, BI means the bi-lingual pair alignment task, and CROP means contrastive learning with cropped spans. For the corpora, CC means CommonCrawl, mWiki means multi-lingual data from Wikipedia, and Bi-lingual may include MultiUN, OPUS, WikiMatrix, etc. (Chi et al., 2021), depending on the model.

Cross-lingual zero-shot transfer retrieval performance for different size models. We report MRR@100 and Recall@100 on the test sets of Mr. TyDi after finetuning on NQ dataset.

A APPENDIX

A.1 DETAILS OF DATASETS AND HYPERPARAMETERS

CC-108. Since Common Crawl (Wenzek et al., 2019) is not a public dataset, we need to preprocess it ourselves. We follow the processing method adopted by XLMR (Conneau et al., 2019) to pre-process CommonCrawl and retain 108 languages, which are the union of the languages that XLMR and mBERT support.

Mr. TyDi (Zhang et al., 2021b). It aims to evaluate cross-lingual passage retrieval with dense representations. Mr. TyDi is a multi-lingual benchmark dataset for mono-lingual query passage retrieval in eleven typologically distinct languages. Given a question in language L, the model should retrieve relevant texts in language L that can respond to the query. As the original paper suggests, we take MRR@100 and Recall@100 for evaluation.

We mainly follow the open-source codebase DPR (Karpukhin et al., 2020) with minor modifications for multi-lingual models. When fine-tuning on MS MARCO in the zero-shot setting, we use the AdamW optimizer with a learning rate of 2e-5, and the model is trained for up to 3 epochs with a mini-batch size of 64. When fine-tuning on the NQ dataset, the model is trained for up to 40 epochs with a mini-batch size of 128. When further fine-tuning on Mr. TyDi's in-language data, we use 40 epochs, a mini-batch size of 128, and a learning rate of 1e-5. All of these settings mainly follow what the original papers (Zhang et al., 2021b; 2022b) suggest. All these experiments are conducted on 8 NVIDIA Tesla A100 GPUs.

XOR Retrieve (Asai et al., 2021a). XOR QA is proposed for multilingual open-domain QA, and we take its sub-task XOR-Retrieve for our evaluation: given a question in L (e.g., Korean), the task is to retrieve English passages that can answer the query. Following Asai et al. (2021a), we calculate the recall as the fraction of the questions for which the minimal answer is included in the top n tokens selected, and take R@2kt and R@5kt (kilo-tokens) as the metrics.

Similar to the previous settings, we use the AdamW optimizer with a learning rate of 2e-5. The model is trained for up to 40 epochs with a mini-batch size of 128 when fine-tuning on the NQ dataset. When further tuning on XOR's data, the hyper-parameters remain the same. We evaluate all the compared pre-trained models in the same pipelines on 8 NVIDIA Tesla A100 GPUs.

Mewsli-X (Ruder et al., 2021). Mewsli-X is built on top of Mewsli (Botha et al., 2020). We follow the setting of XTREME-R (Ruder et al., 2021), which builds Mewsli-X consisting of 15K mentions in 11 languages: given a mention in context, the task is to retrieve the correct target entity description from a candidate pool ranging over 1M candidates across 50 languages.

We mainly follow the setting of XTREME-R (Hu et al., 2020). It adopts a learning rate of 2e-5, and the model is trained for up to 2 epochs with a batch size of 16. As the original paper suggests, all the evaluations are conducted on an NVIDIA Tesla V100 GPU for a fair comparison.

LAReQA (Roy et al., 2020). Language Agnostic Retrieval Question Answering is a retrieval task in which each query has target answers in multiple languages, and models are required to retrieve all correct answers from the candidate pool regardless of language. Following Ruder et al. (2021), we use the LAReQA XQuAD-R dataset, which consists of 13,090 questions, each of which has 11 target answers (in 11 different languages) within 13,014 candidate answer sentences.

It also follows the setting proposed by XTREME-R (Hu et al., 2020). It adopts a learning rate of 2e-5, and the model is trained for up to 3 epochs with a batch size of 4. All the evaluations are conducted on an NVIDIA Tesla V100 GPU.
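For concreteness, the retrieval metrics named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in our experiments: the function names and data layout are hypothetical, `recall_at_k` uses the averaged fraction-of-relevant-passages definition (some toolkits use a hit-based variant instead), and `recall_at_kt` approximates tokens by whitespace splitting and answer matching by plain substring search.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, passages in ranked.items():
        gold = relevant.get(qid, set())
        for rank, pid in enumerate(passages[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(ranked) if ranked else 0.0


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 100) -> float:
    """Fraction of relevant passages found in the top k, averaged over queries."""
    scores = []
    for qid, passages in ranked.items():
        gold = relevant.get(qid, set())
        if not gold:
            continue  # queries without judgments are skipped (assumption)
        scores.append(len(gold.intersection(passages[:k])) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


def recall_at_kt(ranked_texts: Dict[str, List[str]],
                 answers: Dict[str, List[str]],
                 budget: int = 2000) -> float:
    """Fraction of questions whose answer occurs in the first `budget` tokens.

    Ranked passages are concatenated in order and truncated to `budget`
    whitespace tokens (an approximation of the tokenization actually used).
    """
    hits = 0
    for qid, texts in ranked_texts.items():
        tokens: List[str] = []
        for text in texts:
            tokens.extend(text.split())
            if len(tokens) >= budget:
                break
        window = " ".join(tokens[:budget])
        if any(ans in window for ans in answers.get(qid, [])):
            hits += 1
    return hits / len(ranked_texts) if ranked_texts else 0.0


# Tiny usage example with made-up identifiers.
ranked = {"q1": ["p3", "p1", "p2"], "q2": ["p9", "p8"]}
relevant = {"q1": {"p1"}, "q2": {"p7"}}
mrr = mrr_at_k(ranked, relevant, k=100)
rec = recall_at_k(ranked, relevant, k=100)
```

In the usage example, the only relevant passage for `q1` appears at rank 2 and `q2` has no relevant passage retrieved, so the reciprocal ranks are 1/2 and 0, and the per-query recalls are 1 and 0.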

A.2 DETAILED RESULTS ACROSS LANGUAGES

We show the detailed results for each task across different languages corresponding to Table 1. Specifically, the results of Mr. TyDi are shown in Table 9, XOR Retrieve in Table 10, Mewsli-X in Table 11, and LAReQA in Table 12.

From the detailed results across languages, we make several findings on multilinguality: (1) MSM achieves larger gains in low-resource languages. For example, in the zero-shot setting of Mr. TyDi

Published as a conference paper at ICLR 2023

BERT and ERNIE2.0 (Sun et al., 2020). It further shows that our MSM is a general pre-training method tailored for dense retrieval in both monolingual and multi-lingual domains.

