UNIVERSAL SENTENCE REPRESENTATIONS LEARNING WITH CONDITIONAL MASKED LANGUAGE MODEL

Abstract

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval (Conneau & Kiela, 2018), even outperforming models learned using (semi-)supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin. We explore the same-language bias of the learned representations, and propose a principal-component-based approach to remove the language-identifying information from the representation while still retaining sentence semantics.

1. INTRODUCTION

Sentence embeddings map sentences into a vector space. The vectors capture rich semantic information that can be used to measure semantic textual similarity (STS) between sentences or to train classifiers for a broad range of downstream tasks (Conneau et al., 2017; Subramanian et al., 2018; Logeswaran & Lee, 2018b; Cer et al., 2018; Reimers & Gurevych, 2019; Yang et al., 2019a;d; Giorgi et al., 2020). State-of-the-art models are usually trained on supervised tasks such as natural language inference (Conneau et al., 2017), or with semi-structured data like question-answer pairs (Cer et al., 2018) and translation pairs (Subramanian et al., 2018; Yang et al., 2019a). However, labeled and semi-structured data are difficult and expensive to obtain, making it hard to cover many domains and languages.

Conversely, recent efforts to improve language models include the development of masked language model (MLM) pre-training on large-scale unlabeled corpora (Devlin et al., 2019; Lan et al., 2020; Liu et al., 2019). While internal MLM representations are helpful when fine-tuning on downstream tasks, they do not directly produce good sentence representations without further supervised (Reimers & Gurevych, 2019) or semi-structured (Feng et al., 2020) fine-tuning.

In this paper, we explore an unsupervised approach, called Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations from large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on sentence-level representations produced by adjacent sentences. The model therefore needs to learn effective sentence representations in order to perform the MLM task well. Since CMLM is fully unsupervised, it can be easily extended to new languages; we explore CMLM for both English and multilingual sentence embeddings covering 100+ languages.

Our English CMLM model achieves state-of-the-art performance on SentEval (Conneau & Kiela, 2018), even outperforming models learned using (semi-)supervised signals. Moreover, classifiers trained on the English Amazon review data using our multilingual vectors exhibit strong multilingual transfer performance on translations of the Amazon review evaluation data into French, German, and Japanese, outperforming existing multilingual sentence embedding models by > 5% for the non-English languages and by > 2% on the original English data. We further extend the multilingual CMLM to co-training with a parallel-text (bitext) retrieval task and fine-tuning with cross-lingual natural language inference (NLI) data, inspired by the success of prior work on multi-task sentence representation learning (Subramanian et al., 2018; Yang et al., 2019a).
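To make the conditioning idea concrete, below is a minimal PyTorch sketch of a CMLM-style model: one encoder produces a fixed-size vector for a sentence, and that vector conditions a masked language model that must recover the masked tokens of the adjacent sentence. This is an illustrative sketch under stated assumptions, not the paper's exact architecture; the module names, dimensions, and the choice of adding a projected sentence vector to every token embedding are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class ConditionalMLM(nn.Module):
    """Sketch: predict masked tokens of sentence s2 conditioned on a vector
    encoding the adjacent sentence s1 (hypothetical configuration)."""

    def __init__(self, vocab_size=30000, hidden=256, sent_dim=128,
                 nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # Encoder that produces the sentence vector for s1 (mean-pooled here).
        self.sent_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead, batch_first=True), nlayers)
        self.to_sent_vec = nn.Linear(hidden, sent_dim)
        # MLM encoder over the masked s2; the projected s1 vector is added to
        # every token embedding so predictions are conditioned on s1.
        self.cond_proj = nn.Linear(sent_dim, hidden)
        self.mlm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead, batch_first=True), nlayers)
        self.mlm_head = nn.Linear(hidden, vocab_size)

    def forward(self, s1_ids, s2_masked_ids):
        # Sentence vector for s1 (positional embeddings omitted for brevity).
        h1 = self.sent_encoder(self.embed(s1_ids))
        sent_vec = self.to_sent_vec(h1.mean(dim=1))        # (B, sent_dim)
        # Condition the MLM on s1's vector and predict s2's masked tokens.
        cond = self.cond_proj(sent_vec).unsqueeze(1)       # (B, 1, hidden)
        h2 = self.mlm_encoder(self.embed(s2_masked_ids) + cond)
        return self.mlm_head(h2), sent_vec                 # token logits, s1 vector


# Usage: standard MLM cross-entropy computed on the masked positions of s2.
model = ConditionalMLM()
s1 = torch.randint(0, 30000, (2, 16))   # adjacent sentence (context)
s2 = torch.randint(0, 30000, (2, 16))   # sentence containing masked token ids
logits, sent_vec = model(s1, s2)        # logits: (2, 16, 30000)
```

After training, only the sentence encoder is needed: the mean-pooled (or otherwise pooled) vector serves as the sentence embedding, since the MLM objective forces it to carry enough semantics to help reconstruct the neighboring sentence.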

