TOWARDS MULTI-SENSE CROSS-LINGUAL ALIGNMENT OF CONTEXTUAL EMBEDDINGS

Abstract

Cross-lingual word embeddings (CLWE) have proven useful in many cross-lingual tasks. However, most existing approaches to learning CLWE, including those based on contextual embeddings, are sense agnostic. In this work, we propose a novel framework to align contextual embeddings at the sense level by leveraging cross-lingual signal from bilingual dictionaries only. We operationalize our framework by first proposing a novel sense-aware cross entropy loss to model word senses explicitly. Monolingual ELMo and BERT models pretrained with our sense-aware cross entropy loss demonstrate significant performance improvements on word sense disambiguation tasks. We then propose a sense alignment objective on top of the sense-aware cross entropy loss for cross-lingual model pretraining, and pretrain cross-lingual models for several language pairs (English to German/Spanish/Japanese/Chinese). Compared with the best baseline results, our cross-lingual models achieve 0.52%, 2.09%, and 1.29% average performance improvements on zero-shot cross-lingual NER, sentiment classification, and XNLI tasks, respectively. We will release our code.

1. INTRODUCTION

Cross-lingual word embeddings (CLWE) provide a shared representation space for knowledge transfer between languages, yielding state-of-the-art performance in many cross-lingual natural language processing (NLP) tasks. Most previous work has focused on aligning static embeddings. To exploit the richer information captured by pretrained language models, more recent approaches attempt to extend these methods to align contextual representations. Aligning the dynamic and complex contextual spaces poses significant challenges, so most existing approaches only perform coarse-grained alignment. Schuster et al. (2019) compute the average of a word's contextual embeddings as a static anchor, and then learn to align the anchors using a bilingual dictionary. Aldarmaki & Diab (2019) instead use parallel sentences: they compute sentence representations by averaging contextual word embeddings, then learn a projection matrix to align the sentence representations, and find that the learned projection also works well for word-level NLP tasks. In addition, unsupervised multilingual language models (Devlin et al., 2018; Artetxe & Schwenk, 2019; Conneau et al., 2019; Liu et al., 2020) pretrained on multilingual corpora have demonstrated strong cross-lingual transfer performance, and Cao et al. (2020) and Wang et al. (2020) show that such models can be further aligned with parallel sentences. Although contextual word embeddings are intended to provide different representations of the same word in distinct contexts, Schuster et al. (2019) find that the contextual embeddings of different senses of a word are much closer to each other than those of different words. This contributes to the anisomorphic embedding distributions of different languages and causes problems for cross-lingual alignment.
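To make the anchor-plus-projection recipe concrete, the following is a minimal sketch, not the cited authors' implementation: anchors are means of a word's contextual embeddings, and the alignment is solved as an orthogonal Procrustes problem over dictionary pairs. The function names and the specific choice of an orthogonal solution are our own assumptions for illustration.

```python
import numpy as np

def word_anchor(contextual_embeddings):
    # Anchor = mean of one word's contextual embeddings
    # across all of its occurrences, shape (dim,).
    return np.mean(contextual_embeddings, axis=0)

def procrustes_align(src_anchors, tgt_anchors):
    # src_anchors, tgt_anchors: (n_pairs, dim) arrays of anchors
    # for word pairs taken from a bilingual dictionary.
    # Solve min_W ||W A - B||_F subject to W orthogonal, where
    # A, B stack the source/target anchors column-wise; the
    # closed-form solution is W = U V^T with U S V^T = SVD(B A^T).
    u, _, vt = np.linalg.svd(tgt_anchors.T @ src_anchors)
    return u @ vt
```

Applying the returned matrix as `W @ src_anchor` maps a source-language anchor into the target space; the orthogonality constraint preserves distances within the source space.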
For example, it is difficult to align the English word bank with its Japanese translations 銀行 and 岸, which correspond to two of its senses, since the contextual embeddings of the different senses of bank are close to each other while those of 銀行 and 岸 are far apart. Recently, Zhang et al. (2019) propose two solutions for handling multi-sense words: 1) remove multi-sense words and align the remaining anchors as in Schuster et al. (2019); 2) generate a cluster-level average anchor for the contextual embeddings of each multi-sense word and then learn a projection matrix in an unsupervised way with MUSE (Conneau et al., 2017). Neither solution makes good use of the bilingual dictionaries,
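The cluster-level anchor idea can be sketched as follows; this is our own illustrative toy, not the implementation of Zhang et al. (2019): the contextual embeddings of one multi-sense word are clustered, and each centroid serves as one sense-level anchor. The tiny k-means below and the assumed known number of senses are simplifications for illustration.

```python
import numpy as np

def sense_anchors(embs, n_senses, n_iter=50, seed=0):
    # embs: (n_occurrences, dim) contextual embeddings of ONE
    # multi-sense word collected from a monolingual corpus.
    # Returns (n_senses, dim) centroids (one anchor per sense)
    # and the sense label assigned to each occurrence.
    rng = np.random.default_rng(seed)
    centroids = embs[rng.choice(len(embs), n_senses, replace=False)]
    for _ in range(n_iter):
        # assign each occurrence to its nearest centroid
        dists = np.linalg.norm(embs[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned occurrences
        for k in range(n_senses):
            if np.any(labels == k):
                centroids[k] = embs[labels == k].mean(axis=0)
    return centroids, labels
```

Each resulting centroid can then be treated like an ordinary anchor for projection learning, so the two senses of bank would map to separate points rather than one averaged anchor.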

