UNIVERSAL SENTENCE REPRESENTATIONS LEARNING WITH CONDITIONAL MASKED LANGUAGE MODEL

Abstract

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval (Conneau & Kiela, 2018), even outperforming models learned using (semi-)supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin. We explore the same-language bias of the learned representations, and propose a principal component based approach to remove the language-identifying information from the representation while still retaining sentence semantics.

1. INTRODUCTION

Sentence embeddings map sentences into a vector space. The vectors capture rich semantic information that can be used to measure semantic textual similarity (STS) between sentences or to train classifiers for a broad range of downstream tasks (Conneau et al., 2017; Subramanian et al., 2018; Logeswaran & Lee, 2018b; Cer et al., 2018; Reimers & Gurevych, 2019; Yang et al., 2019a;d; Giorgi et al., 2020). State-of-the-art models are usually trained on supervised tasks such as natural language inference (Conneau et al., 2017), or with semi-structured data like question-answer pairs (Cer et al., 2018) and translation pairs (Subramanian et al., 2018; Yang et al., 2019a). However, labeled and semi-structured data are difficult and expensive to obtain, making it hard to cover many domains and languages. In contrast, recent efforts to improve language models include masked language model (MLM) pre-training on large scale unlabeled corpora (Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019). While internal MLM representations are helpful when fine-tuning on downstream tasks, they do not directly produce good sentence representations without further supervised (Reimers & Gurevych, 2019) or semi-structured (Feng et al., 2020) finetuning. In this paper, we explore an unsupervised approach, called Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations from large scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on sentence level representations produced from adjacent sentences. The model therefore needs to learn effective sentence representations in order to perform MLM well. Since CMLM is fully unsupervised, it can be easily extended to new languages. We explore CMLM for both English and multilingual sentence embeddings covering 100+ languages.
Our English CMLM model achieves state-of-the-art performance on SentEval (Conneau & Kiela, 2018), even outperforming models learned using (semi-)supervised signals. Moreover, models trained on the English Amazon review data using our multilingual vectors exhibit strong multilingual transfer performance on translations of the Amazon review evaluation data into French, German and Japanese, outperforming existing multilingual sentence embedding models by more than 5% for non-English languages and by more than 2% on the original English data. We further extend the multilingual CMLM to co-training with a parallel text (bitext) retrieval task and to finetuning with cross-lingual natural language inference (NLI) data, inspired by the success of prior work on multitask sentence representation learning (Subramanian et al., 2018; Yang et al., 2019a) and NLI learning (Conneau et al., 2017; Reimers & Gurevych, 2019). We achieve performance 1.4% better than the previous state-of-the-art multilingual sentence representation model (see details in Section 4.2). Language agnostic representations require semantically similar cross-lingual pairs to be closer in representation space than unrelated same-language pairs (Roy et al., 2020). While we find that our original sentence embeddings do have a bias toward same-language sentences, we discover that removing the first few principal components of the embeddings eliminates this self language bias. The rest of the paper is organized as follows. Section 2 describes the architecture for CMLM unsupervised learning. In Section 3 we present CMLM trained on English data and evaluation results on SentEval. In Section 4 we apply CMLM to learn multilingual sentence representations, and explore multitask training strategies for effectively combining CMLM, bitext retrieval and cross-lingual NLI finetuning. In Section 5, we investigate self language bias in multilingual representations and how to eliminate it.
The contributions of this paper can be summarized as follows: (1) A novel pre-training technique, CMLM, for unsupervised sentence representation learning on unlabeled corpora (either monolingual or multilingual). (2) An effective multitask training framework, which combines the unsupervised learning task CMLM with supervised bitext retrieval and cross-lingual NLI finetuning. (3) An evaluation benchmark for multilingual sentence representations. (4) A simple and effective algebraic method to remove same-language bias in multilingual representations.

2. CONDITIONAL MASKED LANGUAGE MODELING

We introduce Conditional Masked Language Modeling (CMLM) as a novel architecture for combining next sentence prediction with MLM training. By "conditional," we mean that the MLM task for one sentence depends on the encoded sentence level representation of the adjacent sentence. This builds on prior work on next sentence prediction, which has been widely used for learning sentence level representations (Kiros et al., 2015; Logeswaran & Lee, 2018b; Cer et al., 2018; Yang et al., 2019a) but has thus far produced poor quality sentence embeddings within BERT based models (Reimers & Gurevych, 2019). While existing MLMs like BERT include next sentence prediction tasks, they do so without any inductive bias to encode the meaning of a sentence within a single embedding vector. We introduce a strong inductive bias for learning sentence embeddings by structuring the task as follows. Given a pair of ordered sentences, the first sentence is fed to an encoder that produces a sentence level embedding. The embedding is then provided to an encoder that conditions on the sentence embedding in order to better perform MLM prediction over the second sentence. This is notably similar to Skip-Thought (Kiros et al., 2015), but replaces the generation of the complete second sentence with the MLM denoising objective. It is also similar to T5's MLM-inspired unsupervised encoder-decoder objective (Raffel et al., 2019), with the second encoder acting as a sort of decoder given the representation produced for the first sentence. Our method critically differs from T5's in that a sentence embedding bottleneck is used to pass information between the two model components, and in that the task involves denoising a second sentence conditioned on the first rather than denoising a single text stream. Fig. 1 illustrates the architecture of our model.
The first sentence $s_1$ is tokenized and input to a transformer encoder, and a sentence vector $v \in \mathbb{R}^d$ is computed from the sequence outputs by average pooling (footnote 1). The sentence vector $v$ is then projected into $N$ spaces, with one of the projections being the identity mapping, i.e. $v_p = P(v) \in \mathbb{R}^{d \times N}$. Here we use a three-layer MLP as the projection $P(\cdot)$. The second sentence $s_2$ is then masked following the procedure described in the original BERT paper, including random replacement and the use of unchanged tokens. The second encoder shares the same weights with the encoder used to embed $s_1$ (footnote 2). Tokens in the masked $s_2$ are converted into word vectors and concatenated with $v_p$. The concatenated representations are provided to the transformer encoder to predict the masked tokens in $s_2$. At inference time, we keep the first encoding module and discard the subsequent MLM prediction. In Section 5.2, we explore different configurations of CMLM, including the number of projection spaces and how the projected vectors are connected to the embeddings of the second sentence.
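The data flow described above (pooling, projection, concatenation) can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation: the function names are hypothetical, each non-identity projection is a single linear map rather than the three-layer MLP used in the paper, and placing the projected vectors in front of the token embeddings along the sequence axis is an assumption made for illustration.

```python
def mean_pool(token_vectors):
    """Average a list of d-dimensional token vectors into one sentence vector."""
    n = len(token_vectors)
    d = len(token_vectors[0])
    return [sum(tok[k] for tok in token_vectors) / n for k in range(d)]


def project(v, matrices):
    """Map v to N vectors: an identity copy plus one linear map per matrix.

    Assumption: single d x d linear maps stand in for the paper's MLP.
    """
    projections = [list(v)]  # the identity mapping is one of the N projections
    for w in matrices:       # each w is a d x d matrix given as a list of rows
        projections.append([sum(row[k] * v[k] for k in range(len(v))) for row in w])
    return projections


def build_conditional_input(s1_outputs, s2_masked_embeddings, matrices):
    """Sequence fed to the second encoder pass: N projections, then s2 tokens."""
    v = mean_pool(s1_outputs)       # sentence vector of the first sentence
    v_p = project(v, matrices)      # N vectors of dimension d
    return v_p + s2_masked_embeddings  # assumed: concatenation along the sequence


# Toy example with d = 2, two tokens in s1, three tokens in s2, and N = 2.
s1_outputs = [[1.0, 2.0], [3.0, 4.0]]
s2_embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
sequence = build_conditional_input(s1_outputs, s2_embeddings, [swap])
print(len(sequence))  # N + len(s2) = 5 positions
```

Only the pooled vector of the first sentence reaches the second pass, which is the bottleneck that forces the encoder to pack sentence meaning into a single embedding.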

3. LEARNING ENGLISH SENTENCE REPRESENTATIONS WITH CMLM

For training English sentence encoders with CMLM, we use three Common Crawl dumps. The data are filtered by a classifier which detects whether a sentence belongs to the main content of the web page or not. We use WordPiece tokenization, and the vocabulary is the same as that of the public English uncased BERT. In order to enable the model to learn bidirectional information, for two consecutive sequences $s_1$ and $s_2$, we swap their order 50% of the time. This order-swapping process echoes the prediction of preceding and succeeding sentences in Skip-Thought (Kiros et al., 2015). The lengths of $s_1$ and $s_2$ are set to 256 tokens (the maximum length). The number of masked tokens in $s_2$ is 80 (31.3%), moderately higher than in classical BERT. This change in the ratio of masked tokens makes the task more challenging, since in CMLM language modeling has access to extra information from adjacent sentences. We train with a batch size of 2048 for 1 million steps. The optimizer is LAMB with a learning rate of $10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, warm-up in the first 10,000 steps and linear decay afterwards. We explore two transformer configurations, base and large, the same as in the original BERT paper. The number of projections $N$ is set to 15 after experimenting with multiple choices.
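The construction of one training pair can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function name and the toy vocabulary are hypothetical, and the 80%/10%/10% split between `[MASK]`, random replacement and unchanged tokens is the standard BERT recipe, assumed here because the text refers to the original BERT procedure.

```python
import random

MASK = "[MASK]"


def make_cmlm_example(s1, s2, vocab, num_masked=80, rng=random):
    """Return (conditioning sentence, masked sentence, {position: original token})."""
    # Swap the two sentences half of the time so that the model conditions on
    # both preceding and succeeding context.
    if rng.random() < 0.5:
        s1, s2 = s2, s1
    positions = rng.sample(range(len(s2)), min(num_masked, len(s2)))
    masked = list(s2)
    targets = {}
    for pos in positions:
        targets[pos] = s2[pos]
        r = rng.random()
        if r < 0.8:
            masked[pos] = MASK               # assumed BERT ratio: 80% [MASK]
        elif r < 0.9:
            masked[pos] = rng.choice(vocab)  # 10% random replacement
        # remaining 10%: token left unchanged but still predicted
    return s1, masked, targets


# Toy usage with two 256-token "sentences".
rng = random.Random(0)
first = ["a"] * 256
second = ["b"] * 256
cond, masked, targets = make_cmlm_example(first, second, ["x", "y"], rng=rng)
print(len(targets) / len(masked))  # 80 / 256 = 0.3125
```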

3.1. EVALUATION

We evaluate the sentence representations on the following tasks: (1) Classification: MR (movie reviews, Pang & Lee (2005)), binary SST (sentiment analysis, Socher et al. (2013)), TREC (question type, Voorhees & Tice (2000)), CR (product reviews, Hu & Liu (2004)), SUBJ (subjectivity/objectivity, Pang & Lee (2004)). (2) Entailment: SNLI (Bowman et al., 2015) and the SICK dataset for entailment (SICK-E, Marelli et al. (2014)). The evaluation is done using SentEval (Conneau & Kiela, 2018), a widely used evaluation toolkit for sentence embeddings. The downstream classifier is logistic regression. For each task, the encoder and embeddings are fixed and only the downstream neural structures are trained. The baseline sentence embedding models include SkipThought (Kiros et al., 2015), InferSent (Conneau et al., 2017), USE (Cer et al., 2018), QuickThought (Logeswaran & Lee, 2018a), English BERT using standard pre-trained models from the TensorFlow Hub website (Devlin et al., 2019), and SBERT (Reimers & Gurevych, 2019). To evaluate the possible improvements coming from training data and processes, we train standard BERT models (English BERT base/large (CC)) on the same Common Crawl corpora that CMLM is trained on. Similarly, we also train QuickThought, a competitive unsupervised sentence representation learning model, on the same Common Crawl corpora (denoted as "QuickThought (CC)"). To further address the possible advantage of using a Transformer encoder, we use a Transformer encoder as the sentence encoder in QuickThought (CC). The representations for BERT are computed by average pooling of the sequence outputs (we also explore options including the [CLS] vector and max pooling; the results are available in the appendix).

3.2. RESULTS

Evaluation results are presented in Table 1. CMLM outperforms existing models overall, besting MLM (both English BERT and English BERT (CC)) in both the base and large configurations. The closest competing model is SBERT, which uses supervised NLI data rather than a purely unsupervised approach. Interestingly, CMLM outperforms SBERT on the SICK-E NLI task.

Table 1 columns: Model, MR, CR, SUBJ, MPQA, SST, TREC, MRPC, SICK-E, SICK-R, Avg.

4. LEARNING MULTILINGUAL SENTENCE REPRESENTATIONS WITH CMLM

As a fully unsupervised method, CMLM can be conveniently extended to multilingual modeling even for less well resourced languages. Learning good multilingual sentence representations is more challenging than learning monolingual ones, especially when attempting to capture the semantic alignment between different languages. As CMLM does not explicitly address cross-lingual alignment, we explore several modeling approaches besides CMLM: (1) Co-training CMLM with a bitext retrieval task; (2) Fine-tuning with cross-lingual NLI data.

4.1. MULTILINGUAL CMLM

We follow the same configuration used to learn English sentence representations with CMLM, but extend the training data to include more languages. Results below will show that CMLM again exhibits competitive performance as a general technique to learn from large scale unlabeled corpora.

4.2. MULTITASK TRAINING WITH CMLM AND BITEXT RETRIEVAL

Besides the monolingual pretraining data, we collect a dataset of bilingual translation pairs $\{(s_i, t_i)\}$ using a bitext mining system (Feng et al., 2020). The source sentences $\{s_i\}$ are in English and the target sentences $\{t_i\}$ cover over 100 languages. We build a retrieval task with the parallel translation data: identifying the corresponding translation of the input sentence from candidates in the same batch. Concretely, incorporating Additive Margin Softmax (Yang et al., 2019b), we compute the bitext retrieval loss $L^s_{br}$ for the source sentences as:

$$L^s_{br} = -\frac{1}{B} \sum_{i=1}^{B} \frac{e^{\phi(s_i, t_i) - m}}{e^{\phi(s_i, t_i) - m} + \sum_{j=1, j \neq i}^{B} e^{\phi(s_i, t_j)}} \tag{1}$$

Above, $\phi(s_i, t_j)$ denotes the inner product of the sentence vectors of $s_i$ and $t_j$ (embedded by the transformer encoder); $m$ and $B$ denote the additive margin and the batch size, respectively. Note that the way sentence embeddings are generated is the same as in CMLM. We can compute the bitext retrieval loss for the target sentences, $L^t_{br}$, by normalizing over source sentences, rather than target sentences, in the denominator (footnote 3). The final bitext retrieval loss $L_{br}$ is given as $L_{br} = L^s_{br} + L^t_{br}$.

There are several ways to combine the monolingual CMLM task and bitext retrieval (BR). We explore the following multistage and multitask pretraining strategies:
S1. CMLM → BR: Train with CMLM in the first stage and then train on BR;
S2. CMLM+BR: Train with both CMLM and BR from the start;
S3. CMLM → CMLM+BR: Train with only CMLM in the first stage and then with both tasks.
When training with both CMLM and BR, the optimization loss is a weighted sum of the language modeling loss and the retrieval loss $L_{br}$, i.e. $L = L_{CMLM} + \alpha L_{br}$. We empirically find that $\alpha = 0.2$ works well. As shown in Table 3, S3 is found to be the most effective. Unless otherwise denoted, our models trained with CMLM and BR follow S3. We also discover that, given a pre-trained transformer encoder, e.g. mBERT, we can improve the quality of sentence representations by finetuning the transformer encoder with CMLM and BR. As shown in Table 2 and Table 3, the improvements from "mBERT" to "f-mBERT" (finetuned mBERT) are significant.
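The bidirectional in-batch retrieval loss can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the loss is written in the usual cross-entropy form (negative log of the softmax ratio), which is an assumption since Eq. (1) as printed shows the ratio without a logarithm. The default margin of 0.3 follows the value reported in Section 4.4.

```python
import math


def dot(u, v):
    """Inner product phi(u, v) of two sentence vectors."""
    return sum(a * b for a, b in zip(u, v))


def directional_loss(queries, candidates, margin):
    """Additive-margin softmax loss from each query to its in-batch candidates.

    queries[i] and candidates[i] are assumed to be a translation pair.
    """
    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, c) for c in candidates]
        scores[i] -= margin  # the margin is applied to the true pair only
        top = max(scores)    # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[i] - log_denominator  # log-probability of the true pair
    return -total / len(queries)


def bitext_retrieval_loss(source_vecs, target_vecs, margin=0.3):
    """L_br = L^s_br + L^t_br: source-to-target plus target-to-source."""
    return (directional_loss(source_vecs, target_vecs, margin)
            + directional_loss(target_vecs, source_vecs, margin))


# Aligned pairs score a much lower loss than shuffled pairs.
src = [[5.0, 0.0], [0.0, 5.0]]
print(bitext_retrieval_loss(src, [[5.0, 0.0], [0.0, 5.0]]))
print(bitext_retrieval_loss(src, [[0.0, 5.0], [5.0, 0.0]]))
```

The multitask objective described above would then be `cmlm_loss + 0.2 * bitext_retrieval_loss(...)`, where `cmlm_loss` is a placeholder name for the masked-token prediction loss.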

4.3. FINETUNING WITH CROSS LINGUAL NATURAL LANGUAGE INFERENCE

Finetuning with NLI data has proved to be an effective method to improve the quality of embeddings for English models. We extend this to the multilingual domain. Given a premise sentence $u$ and a hypothesis sentence $v$, we train a 3-way classifier on the concatenation $[u, v, |u - v|, u * v]$. Weights of the transformer encoder are also updated in the finetuning process. Unlike previous work that also uses multilingual NLI data (Yang et al., 2019a), the premise $u$ and hypothesis $v$ here are in different languages. The cross-lingual NLI data are generated by translating the Multi-Genre NLI Corpus (Williams et al., 2018) into 14 languages using the Google Translate API.
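As a small illustration, the classifier input can be built from the two sentence vectors as follows. The function name is hypothetical, and reading the third and fourth terms as the element-wise absolute difference and the element-wise product (the common InferSent-style features) is an assumption.

```python
def nli_features(u, v):
    """Concatenate [u, v, |u - v|, u * v] into one 4d-dimensional feature vector."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]  # element-wise |u - v|
    product = [a * b for a, b in zip(u, v)]        # element-wise u * v
    return list(u) + list(v) + abs_diff + product


# Premise and hypothesis embeddings with d = 2 give an 8-dimensional input.
features = nli_features([1.0, -2.0], [3.0, 0.5])
print(len(features))
```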

4.4. CONFIGURATIONS

Monolingual training data for CMLM are generated from 3 versions of Common Crawl data in 113 languages. The data cleaning and filtering are the same as for the English-only data. A new cased vocabulary is built from all data sources using the WordPiece vocabulary generation library from TensorFlow Text. The language smoothing exponent of the vocabulary generation tool is set to 0.3, as the distribution of data size across languages is imbalanced. The final vocabulary size is 501,153. The number of projections $N$ is set to 15, the batch size $B$ is 2048 and the positive margin $m$ is 0.3. For CMLM-only pretraining, the number of steps is 2 million. In multitask learning, for S1 and S3, the first stage has 1.5 million steps and the second stage has 1 million steps; for S2, the number of training steps is 2 million. The transformer encoder uses the BERT base configuration. The initial learning rate and optimizer are the same as for the English models.

4.5.1. XEVAL: MULTILINGUAL BENCHMARKS FOR SENTENCE REPRESENTATIONS EVALUATION

Evaluations in previous multilingual literature focused on the cross-lingual transfer learning ability from English to other languages. However, this evaluation protocol, which treats English as the "anchor," does not assess the quality of non-English sentence representations on an equal footing with English ones. To address this issue, we prepare a new benchmark for multilingual sentence vectors, XEVAL, by translating SentEval (English) into 14 other languages with an industrial translation API. Results of models trained with monolingual data are shown in Table 2. Baseline models include mBERT (Devlin et al., 2019) and XLM-R (Ruder et al., 2019). The multilingual CMLM model trained on monolingual data outperforms all baselines in 12 out of 15 languages and in average performance. Results of models trained with cross-lingual data are presented in Table 3. Baseline models for comparison include LASER (Artetxe & Schwenk (2019), trained with parallel data) and multilingual USE (Yang et al. (2019a), trained with cross-lingual NLI). Our model (S3) outperforms LASER in 13 out of 15 languages. Notably, finetuning with NLI in the cross-lingual way produces a significant improvement (S3 + NLI vs. S3), and it also outperforms mUSE by significant margins (footnote 4). Multitask learning with CMLM and BR can also be used to increase the performance of pretrained encoders, e.g. mBERT. mBERT trained with CMLM and BR (f-mBERT) shows a significant improvement over mBERT.

4.5.2. AMAZON REVIEWS

We also conduct a zero-shot transfer learning evaluation on the Amazon reviews dataset (Prettenhofer & Stein, 2010). Following Chidambaram et al. (2019), the original dataset is converted to a classification benchmark by treating reviews with strictly more than 3 stars as positive and all others as negative. We split the 6000 English reviews in the original training set into 90% for training and 10% for development. The two-way classifier, built on the concatenation $[u, v, |u - v|, u * v]$ (following previous work, e.g. Reimers & Gurevych (2019)), is trained on the English training set and then evaluated on the English, French, German and Japanese test sets (each has 6000 examples). Note that the same multilingual encoder and classifier are used for all the evaluations. We also experiment with freezing versus updating the encoder weights during training. As presented in Table 4, CMLM alone already outperforms the baseline models. Training with BR and cross-lingual NLI finetuning further boosts the performance.

4.6. TATOEBA: SEMANTIC SEARCH

To directly assess the ability of our models to capture semantics, we test on the Tatoeba dataset proposed in Artetxe & Schwenk (2019). The Tatoeba dataset includes up to 1,000 English-aligned sentence pairs for each evaluated language. The task is to find the nearest neighbor of each query sentence in the other language by cosine similarity. The experiments are conducted on the same set of 36 languages as in the XTREME benchmark (Hu et al., 2020), and the evaluation metric is retrieval accuracy.

Our model CMLM+BR outperforms all baseline models in 30 out of 36 languages and has the highest average performance. One interesting observation is that finetuning with NLI actually undermines the model performance on semantic search, in contrast with the significant improvements from CMLM+BR to CMLM+BR+NLI on XEVAL (Table 3). We speculate that this is because, unlike semantic search, NLI inference is not a linear process, so finetuning with cross-lingual NLI is not expected to help linear retrieval by nearest neighbor search.

LASER 23.0 35.9 18.6 88.9 96.9 91.5 96.3 95.2 94.4 57.5 69.4 79.7 95.4 50.6 97.5 81.9 96.8 95.5 84.4
CMLM+BR 83.4 94.9 88.6 92.4 98.9 94.5 97.3 95.3 94.9 87.0 91.2 97.9 96.6 95.3 98.6 94.4 97.5 95.6 94.7
CMLM+BR+NLI 66.9 88.1 80.3 85.6 94.9 90.7 93.2 92.3 91.7 76.7 88.6 92.8 94.7 82.0 94.3 84.7 94.3 93.1 88.8

Table 5: Tatoeba results (retrieval accuracy) for each language. Our model CMLM+BR achieves the best results on 30 out of 36 languages.
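The retrieval evaluation used for Tatoeba in this section (nearest neighbor by cosine similarity, scored by accuracy) can be sketched as follows. The function names are hypothetical, the sketch assumes that `queries[i]` and `candidates[i]` form an aligned pair, and all vectors are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose nearest candidate is the aligned one."""
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        if best == i:
            hits += 1
    return hits / len(queries)


# Toy embeddings: each query is closest to the candidate at the same index.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[2.0, 0.1], [0.1, 3.0], [1.0, 0.9]]
print(retrieval_accuracy(queries, candidates))
```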

5.1. LANGUAGE AGNOSTIC PROPERTIES

Language agnosticism has been a property of great interest for multilingual representations. However, there has not been a qualitative measurement or rigid definition for this property. Here we propose that "language agnostic" refers to the property that sentence representations are neutral with respect to their language information. For example, two sentences with similar semantics should be close in embedding space whether they are in the same language or not. Another case is that, given one query sentence in language $l_1$ and two candidate sentences with identical meanings (different from the query sentence) in languages $l_1$ and $l_2$, the $l_1$ query sentence should not be biased towards the $l_1$ candidate sentence. To capture this intuition, we convert the PAWS-X dataset (Yang et al., 2019c) into a retrieval task to measure the language agnostic property. Specifically, the PAWS-X dataset consists of English sentences and their translations in six other languages (x-axis labels in Fig. 2). Given a query, we inspect the language distribution of the retrieved sentences (ranked by cosine similarity). In Fig. 2, the query sentences are in German, French and Chinese, one language per row. Representations of mBERT (the first column) have a strong self language bias, i.e. sentences in the language matching the query are dominant. In contrast, the bias is much weaker in our model trained with CMLM and BR (the third column), probably due to the cross-lingual retrieval pretraining. We discover that removing the first principal component of each monolingual space from the sentence representations effectively eliminates the self language bias. As shown in the second and fourth columns of Fig. 2, with principal component removal (PCR) the language distribution is much more uniform. We further explore PCR by experimenting on the Tatoeba dataset. Table 6 shows the retrieval accuracy of the multilingual models with and without PCR. PCR increases the overall retrieval performance for both models.
This suggests that the first principal component in each monolingual space primarily encodes language identification information. We also visualize the sentence representations of the Tatoeba dataset in Fig. 3. Our model (the first row) shows both weak and strong semantic alignment (Roy et al., 2020). Representations are close to others with similar semantics regardless of their languages (strong alignment), especially for French and Russian, where the representations form several distinct clusters. Representations from the same language also tend to cluster (weak alignment). Representations from mBERT, by contrast, generally exhibit weak alignment.
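Principal component removal can be sketched in plain Python as below. This is a minimal illustration under stated assumptions rather than the authors' procedure: the function names are hypothetical, the top direction is found by power iteration, and whether the embeddings are mean-centred before extracting the component is not stated in the text, so it is exposed as a `center` flag that defaults to the uncentred variant.

```python
import math
import random


def top_direction(vectors, center=False, iterations=100, seed=0):
    """Dominant direction of a set of vectors, found by power iteration."""
    n, d = len(vectors), len(vectors[0])
    if center:  # assumption: centring is optional, see the lead-in
        mean = [sum(v[k] for v in vectors) / n for k in range(d)]
        vectors = [[v[k] - mean[k] for k in range(d)] for v in vectors]
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(a * a for a in direction))
    direction = [a / norm for a in direction]
    for _ in range(iterations):
        # One step: direction <- X^T (X direction), then renormalise.
        coeffs = [sum(a * b for a, b in zip(x, direction)) for x in vectors]
        new = [sum(c * x[k] for c, x in zip(coeffs, vectors)) for k in range(d)]
        norm = math.sqrt(sum(a * a for a in new))
        if norm == 0.0:
            break
        direction = [a / norm for a in new]
    return direction


def remove_top_component(vectors, center=False):
    """Subtract from every vector its projection onto the top direction."""
    direction = top_direction(vectors, center=center)
    cleaned = []
    for v in vectors:
        coeff = sum(a * b for a, b in zip(v, direction))
        cleaned.append([a - coeff * b for a, b in zip(v, direction)])
    return cleaned


def principal_component_removal(embeddings_by_language, center=False):
    """Apply the removal separately within each monolingual space."""
    return {lang: remove_top_component(vecs, center=center)
            for lang, vecs in embeddings_by_language.items()}


# Toy space where the first axis dominates and plays the role of language identity.
toy = {"xx": [[10.0, 1.0, 0.0], [10.0, -1.0, 0.0], [-10.0, 0.0, 1.0], [-10.0, 0.0, -1.0]]}
print(principal_component_removal(toy)["xx"][0])
```

In the toy data the dominant axis is removed while the two remaining coordinates, standing in for semantic content, are left untouched.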

5.2. ABLATION STUDY

In this section, we explore different configurations of CMLM, including the number of projection spaces $N$ and the CMLM architecture. As shown in Table 7, projecting the sentence vector into $N = 15$ spaces produces the highest overall performance. We also try a modification to the CMLM architecture: besides being concatenated with the token embeddings of $s_2$ before input to the transformer encoder, the projected vectors are also concatenated with the sequence outputs of $s_2$ for the masked token prediction. This version of the architecture is denoted as "skip", and model performance actually becomes worse. Note that the projected vectors can also be used to produce the sentence representation $v_s$. For example, one way is to use the average of the projected vectors, i.e. $v_s = \frac{1}{N} \sum_i v_p^{(i)}$, where $v_p^{(i)}$ is the $i$-th projection. This version is denoted as "proj" in Table 7. Sentence representations produced in this way still yield competitive performance, which further confirms the effectiveness of the projection.

Table 7 columns: Model, MR, CR, SUBJ, MPQA, SST, TREC, MRPC, SICK-E, SICK-R, Avg.

6. CONCLUSION

We present a novel sentence representation learning method, Conditional Masked Language Modeling (CMLM), for training on large scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, we discover that co-training CMLM with bitext retrieval and cross-lingual NLI finetuning achieves state-of-the-art performance. We also discover that multilingual representations exhibit a same-language bias, and that principal component removal (PCR) can eliminate the bias by separating language identity information from semantics.



Footnote 1: One can equivalently choose other pooling methods, such as max pooling, or use the vector output corresponding to a special token position such as the [CLS] token.
Footnote 2: A dual encoder that shares encoder weights across different inputs is also referred to as a "siamese encoder".
Footnote 3: That is, by swapping the i and j subscripts in the last term of the denominator.
Footnote 4: Note that mUSE only supports 16 languages; the best CMLM model is still significantly better when considering only the mUSE-supported languages (underlining in Table 2 indicates the languages not supported by mUSE).



Figure 1: The architecture of Conditional Masked Language Modeling (CMLM).

Figure 3: Visualizations of sentence embeddings in Tatoeba dataset in 2D. The target languages are all English and the source languages are French, German, Russian and Spanish from left to right columns. The first and second rows are our model and mBERT respectively.

Table 2: Performance (accuracy) of multilingual models trained with monolingual data on XEVAL. Highest numbers are highlighted in bold.

Table 3: Performance (accuracy) of models trained with cross-lingual data on XEVAL. mUSE only supports 16 languages; underlining indicates that the language is not supported by mUSE. We test multiple strategies for multitask pretraining: [S1]: CMLM → BR; [S2]: CMLM+BR; [S3]: CMLM → CMLM+BR. [f-mBERT] denotes finetuning mBERT with CMLM and BR.

Table 4: Classification accuracy on the Amazon Reviews dataset. The experiments examine the zero-shot cross-lingual ability of multilingual models. We explore both freezing and updating the weights of the multilingual encoder during training on English data.

Table 6: Average retrieval accuracy on 36 languages of multilingual representation models with and without principal component removal (PCR) on the Tatoeba dataset.

Table 7: Ablation study of CMLM designs, including the number of projection spaces, architecture and sentence representations. The experiments are conducted on SentEval.

