UNIVERSAL SENTENCE REPRESENTATION LEARNING WITH CONDITIONAL MASKED LANGUAGE MODEL

Abstract

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval (Conneau & Kiela, 2018), even outperforming models learned using (semi-)supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin. We explore the self-language bias of the learned representations, and propose a principal-component-based approach to remove the language-identifying information from the representation while still retaining sentence semantics.

1. INTRODUCTION

Sentence embeddings map sentences into a vector space. The vectors capture rich semantic information that can be used to measure semantic textual similarity (STS) between sentences or to train classifiers for a broad range of downstream tasks (Conneau et al., 2017; Subramanian et al., 2018; Logeswaran & Lee, 2018b; Cer et al., 2018; Reimers & Gurevych, 2019; Yang et al., 2019a;d; Giorgi et al., 2020). State-of-the-art models are usually trained on supervised tasks such as natural language inference (Conneau et al., 2017), or with semi-structured data like question-answer pairs (Cer et al., 2018) and translation pairs (Subramanian et al., 2018; Yang et al., 2019a). However, labeled and semi-structured data are difficult and expensive to obtain, making it hard to cover many domains and languages. Conversely, recent efforts to improve language models include the development of masked language model (MLM) pre-training on large-scale unlabeled corpora (Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019). While internal MLM model representations are helpful when fine-tuning on downstream tasks, they do not directly produce good sentence representations without further supervised (Reimers & Gurevych, 2019) or semi-structured (Feng et al., 2020) finetuning. In this paper, we explore an unsupervised approach, called Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations from large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on sentence-level representations produced from adjacent sentences. The model therefore needs to learn effective sentence representations in order to perform the MLM task well. Since CMLM is fully unsupervised, it can be easily extended to new languages. We explore CMLM for both English sentence embeddings and multilingual sentence embeddings covering 100+ languages.
Our English CMLM model achieves state-of-the-art performance on SentEval (Conneau & Kiela, 2018), even outperforming models learned using (semi-)supervised signals. Moreover, models trained on the English Amazon review data using our multilingual vectors exhibit strong multilingual transfer performance on translations of the Amazon review evaluation data to French, German and Japanese, outperforming existing multilingual sentence embedding models by > 5% for non-English languages and by > 2% on the original English data. We further extend the multilingual CMLM to co-training with a parallel text (bitext) retrieval task and finetuning with cross-lingual natural language inference (NLI) data, inspired by the success of prior work on multitask sentence representation learning (Subramanian et al., 2018; Yang et al., 2019a) and NLI learning (Conneau et al., 2017; Reimers & Gurevych, 2019). We achieve performance 1.4% better than the previous state-of-the-art multilingual sentence representation model (see details in section 4.2). Language agnostic representations require semantically similar cross-lingual pairs to be closer in representation space than unrelated same-language pairs (Roy et al., 2020). While we find our original sentence embeddings do have a bias for same-language sentences, we discover that removing the first few principal components of the embeddings eliminates this self-language bias.

The rest of the paper is organized as follows. Section 2 describes the architecture for CMLM unsupervised learning. In Section 3 we present CMLM trained on English data and evaluation results on SentEval. In Section 4 we apply CMLM to learn multilingual sentence representations, and explore multitask training strategies for effectively combining CMLM, bitext retrieval and cross-lingual NLI finetuning. In Section 5, we investigate self-language bias in multilingual representations and how to eliminate it.
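The principal-component removal mentioned above can be sketched as follows. This is a minimal illustration of the general idea, not the paper's exact procedure: the top principal directions of the (centered) embedding matrix are assumed to carry language-identifying variance, so projecting them out leaves the semantic content. The function name and the choice of `k` are hypothetical.

```python
import numpy as np

def remove_principal_components(embeddings, k=1):
    """Project out the top-k principal components of a (n, dim)
    matrix of sentence embeddings."""
    # Center the embeddings so the principal directions reflect
    # variance rather than the mean offset.
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions (unit vectors).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    top = Vt[:k]                     # (k, dim) directions to remove
    # Subtract each embedding's projection onto the top-k subspace.
    return X - X @ top.T @ top
```

After this transformation the embeddings have zero variance along the removed directions, so any language-identity signal concentrated there is gone, while distances in the remaining subspace are unchanged.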
The contributions of this paper can be summarized as follows: (1) A novel pre-training technique, CMLM, for unsupervised sentence representation learning on unlabeled corpora (either monolingual or multilingual). (2) An effective multitask training framework that combines the unsupervised CMLM task with supervised bitext retrieval and cross-lingual NLI finetuning. (3) An evaluation benchmark for multilingual sentence representations. (4) A simple and effective algebraic method to remove self-language bias from multilingual representations.

2. CONDITIONAL MASKED LANGUAGE MODELING

We introduce Conditional Masked Language Modeling (CMLM) as a novel architecture for combining next sentence prediction with MLM training. By "conditional," we mean the MLM task for one sentence depends on the encoded sentence-level representation of the adjacent sentence. This builds on prior work on next sentence prediction, which has been widely used for learning sentence-level representations (Kiros et al., 2015; Logeswaran & Lee, 2018b; Cer et al., 2018; Yang et al., 2019a), but has thus far produced poor quality sentence embeddings within BERT based models (Reimers & Gurevych, 2019). While existing MLMs like BERT include next sentence prediction tasks, they do so without any inductive bias to encode the meaning of a sentence within a single embedding vector. We introduce a strong inductive bias for learning sentence embeddings by structuring the task as follows. Given a pair of ordered sentences, the first sentence is fed to an encoder that produces a sentence-level embedding. The embedding is then provided to a second encoder that conditions on it in order to better perform MLM prediction over the second sentence. This is notably similar to Skip-Thought (Kiros et al., 2015), but replaces the generation of the complete second sentence with the MLM denoising objective. It is also similar to T5's MLM-inspired unsupervised encoder-decoder objective (Raffel et al., 2019), with the second encoder acting as a sort of decoder given the representation produced for the first sentence. Our method critically differs from T5's in that a sentence embedding bottleneck is used to pass information between the two model components.
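The data flow above can be sketched with toy numpy stand-ins for the two encoders. This is a shape-level illustration only, not the paper's transformer implementation: the first "encoder" is a mean-pool over token embeddings, the conditioning is done by concatenating the sentence-1 vector to every position of the corrupted sentence 2, and all parameter matrices, sizes, and names (`VOCAB`, `DIM`, `MASK_ID`, `cmlm_logits`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM, MASK_ID = 100, 16, 0     # toy vocabulary size, hidden size, [MASK] id

# Random matrices standing in for learned parameters.
tok_emb = rng.normal(size=(VOCAB, DIM))       # shared token embedding table
out_proj = rng.normal(size=(2 * DIM, VOCAB))  # MLM head over [token; sentence-vec]

def encode_sentence(token_ids):
    """Stand-in for the first encoder: pool token embeddings into a
    single sentence-level vector (the CMLM bottleneck)."""
    return tok_emb[np.array(token_ids)].mean(axis=0)   # (DIM,)

def cmlm_logits(sent1_ids, sent2_ids, mask_positions):
    """Predict the masked tokens of sentence 2, conditioned on the
    sentence embedding of sentence 1."""
    sent_vec = encode_sentence(sent1_ids)              # bottleneck vector
    ids = np.array(sent2_ids)
    ids[np.array(mask_positions)] = MASK_ID            # corrupt sentence 2
    hidden = tok_emb[ids]                              # (len2, DIM)
    # Conditioning: every position of the second encoder sees the
    # sentence-1 vector (here via concatenation).
    cond = np.concatenate(
        [hidden, np.broadcast_to(sent_vec, hidden.shape)], axis=-1)
    return cond @ out_proj                             # (len2, VOCAB) logits
```

The key point the sketch makes concrete is the bottleneck: the only path from sentence 1 to the MLM predictions is the single vector `sent_vec`, so the first encoder is pressured to pack the sentence's meaning into that one embedding.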



Figure 1: The architecture of Conditional Masked Language Modeling (CMLM).

