TOWARDS MULTI-SENSE CROSS-LINGUAL ALIGNMENT OF CONTEXTUAL EMBEDDINGS

Abstract

Cross-lingual word embeddings (CLWE) have proven useful in many cross-lingual tasks. However, most existing approaches to learning CLWE, including those based on contextual embeddings, are sense-agnostic. In this work, we propose a novel framework to align contextual embeddings at the sense level by leveraging cross-lingual signal from bilingual dictionaries only. We operationalize our framework by first proposing a novel sense-aware cross entropy loss to model word senses explicitly. Monolingual ELMo and BERT models pretrained with our sense-aware cross entropy loss demonstrate significant performance improvements on word sense disambiguation tasks. We then propose a sense alignment objective on top of the sense-aware cross entropy loss for cross-lingual model pretraining, and pretrain cross-lingual models for several language pairs (English to German/Spanish/Japanese/Chinese). Compared with the best baseline results, our cross-lingual models achieve 0.52%, 2.09% and 1.29% average performance improvements on zero-shot cross-lingual NER, sentiment classification and XNLI tasks, respectively. We will release our code.

1. INTRODUCTION

Cross-lingual word embeddings (CLWE) provide a shared representation space for knowledge transfer between languages, yielding state-of-the-art performance in many cross-lingual natural language processing (NLP) tasks. Most previous work has focused on aligning static embeddings. To utilize the richer information captured by pre-trained language models, more recent approaches attempt to extend these methods to align contextual representations. Aligning the dynamic and complex contextual spaces poses significant challenges, so most existing approaches only perform coarse-grained alignment. Schuster et al. (2019) compute the average of the contextual embeddings of each word as an anchor, and then learn to align the static anchors using a bilingual dictionary. Aldarmaki & Diab (2019) instead use parallel sentences: they compute sentence representations by averaging contextual word embeddings, and then learn a projection matrix to align the sentence representations. They find that the learned projection matrix also works well for word-level NLP tasks. In addition, unsupervised multilingual language models (Devlin et al., 2018; Artetxe & Schwenk, 2019; Conneau et al., 2019; Liu et al., 2020) pretrained on multilingual corpora have demonstrated strong cross-lingual transfer performance, and Cao et al. (2020) and Wang et al. (2020) show that such models can be further aligned with parallel sentences. Though contextual word embeddings are intended to provide different representations of the same word in distinct contexts, Schuster et al. (2019) find that the contextual embeddings of different senses of one word are much closer to each other than those of different words. This contributes to the anisomorphic embedding distributions of different languages and causes problems for cross-lingual alignment.
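The anchor-based alignment described above can be sketched in a few lines. The sketch below is illustrative, not the authors' implementation: it averages each word's contextual embeddings into a static anchor and then solves an orthogonal Procrustes problem over dictionary pairs, a standard choice for supervised projection-based alignment; all function names and the toy data are our own.

```python
import numpy as np

def word_anchors(contextual_embeds):
    """Average the contextual embeddings of a word's occurrences into
    one static 'anchor' vector per word (as in anchor-based alignment)."""
    return {w: np.mean(vecs, axis=0) for w, vecs in contextual_embeds.items()}

def procrustes_align(src_anchors, tgt_anchors, dictionary):
    """Find the orthogonal map W minimizing ||W x_src - y_tgt|| over the
    bilingual-dictionary pairs (orthogonal Procrustes, solved via SVD)."""
    X = np.stack([src_anchors[s] for s, t in dictionary])  # (n, d) source
    Y = np.stack([tgt_anchors[t] for s, t in dictionary])  # (n, d) target
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt  # (d, d); applies to a source vector as W @ x

# Toy usage with random "contextual embeddings" (4-dimensional).
rng = np.random.default_rng(0)
src = {"bank": rng.normal(size=(5, 4)), "river": rng.normal(size=(3, 4))}
tgt = {"banque": rng.normal(size=(4, 4)), "riviere": rng.normal(size=(2, 4))}
W = procrustes_align(word_anchors(src), word_anchors(tgt),
                     [("bank", "banque"), ("river", "riviere")])
assert np.allclose(W @ W.T, np.eye(4), atol=1e-6)  # W is orthogonal
```

Because W is constrained to be orthogonal, it can only rotate the source space; this is exactly why such projection-based methods struggle when the two embedding distributions are not isomorphic, as discussed below.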
For example, it is difficult to align the English word bank with its Japanese translations 銀行 and 岸, which correspond to its two different senses, since the contextual embeddings of the different senses of bank are close to each other while those of 銀行 and 岸 are far apart. Recently, Zhang et al. (2019) propose two solutions to handle multi-sense words: 1) remove multi-sense words and then align anchors in the same way as Schuster et al. (2019); 2) generate cluster-level average anchors for the contextual embeddings of multi-sense words and then learn a projection matrix in an unsupervised way with MUSE (Conneau et al., 2017). They do not make good use of bilingual dictionaries, which are usually easy to obtain, even in low-resource scenarios. Moreover, their projection-based approach still cannot handle the anisomorphic embedding distribution problem. In this work, we propose a novel sense-aware cross entropy loss to model multiple word senses explicitly, and then leverage a sense-level translation task on top of it for cross-lingual model pretraining. The proposed sense-level translation task enables our models to provide more isomorphic and better aligned cross-lingual embeddings. We only use the cross-lingual signal from bilingual dictionaries for supervision. Our pretrained models demonstrate consistent performance improvements on zero-shot cross-lingual NER, sentiment classification and XNLI tasks. Though pretrained on less data, our model achieves the state-of-the-art result on the zero-shot cross-lingual German NER task. To the best of our knowledge, we are the first to perform sense-level contextual embedding alignment with only bilingual dictionaries.

2. BACKGROUND: PREDICTION TASKS OF LANGUAGE MODELS

Next token prediction and masked token prediction are two common tasks in neural language model pretraining. We take two well-known language models, ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), as examples to illustrate these two tasks (architectures are shown in Appendix A).

Next token prediction

ELMo uses the next token prediction task in a bidirectional language model. Given a sequence of N tokens (t_1, t_2, ..., t_N), it first prepares a context-independent representation for each token, either with a convolutional neural network over the characters or by word embedding lookup (a.k.a. input embeddings). These representations are then fed into L layers of LSTMs to generate the contextual representations h_{i,j} for token t_i at layer j. The model assigns a learnable output embedding w to each token in the vocabulary, with the same dimension as h_{i,L}. Then, the forward language model predicts the token at position k with:

p(t_k | t_1, t_2, ..., t_{k-1}) = softmax(h_{k-1,L}^T w_{k'}) = exp(h_{k-1,L}^T w_{k'}) / Σ_{i=1}^{V} exp(h_{k-1,L}^T w_i)    (1)

where k' is the index of token t_k in the vocabulary, V is the size of the vocabulary, and (w_1, ..., w_V) are the output embeddings of the tokens in the vocabulary. The backward language model is similar to the forward one, except that tokens are predicted in the reverse order. Since the forward and backward language models are very similar, we will only describe our proposed approach in the context of the forward language model in the subsequent sections.
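The softmax over output embeddings in Eq. 1 can be sketched as follows. This is a minimal illustrative computation, not the paper's code: it scores a top-layer hidden state against every output embedding and normalizes over the vocabulary; the shapes and names are our own.

```python
import numpy as np

def next_token_probs(h_prev, output_embeddings):
    """p(t_k | t_1..t_{k-1}): softmax over the vocabulary of the dot
    product between the top-layer state h_{k-1,L} and each output
    embedding w_i (Eq. 1)."""
    logits = output_embeddings @ h_prev   # (V,) one score per vocab token
    logits -= logits.max()                # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy example: vocabulary of 10 tokens, hidden size 8.
rng = np.random.default_rng(1)
V, d = 10, 8
W = rng.normal(size=(V, d))   # output embeddings (w_1, ..., w_V)
h = rng.normal(size=d)        # h_{k-1,L}
p = next_token_probs(h, W)
assert np.isclose(p.sum(), 1.0)  # a valid distribution over the vocabulary
```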

Masked token prediction

The Masked Language Model (MLM) in BERT is a typical example of masked token prediction. Given a sequence (t_1, t_2, ..., t_N), this approach randomly masks a certain percentage (15%) of the tokens to generate a masked sequence (m_1, m_2, ..., m_N), where m_k = [mask] if the token at position k is masked, and m_k = t_k otherwise. BERT first prepares the context-independent representations (x_1, x_2, ..., x_N) of the masked sequence via token embeddings. These are then fed into L layers of Transformer encoders (Vaswani et al., 2017) to generate "bidirectional" contextual token representations. The final-layer representations are then used to predict the masked token at position k as follows:

p(m_k = t_k | m_1, ..., m_N) = softmax(h_{k,L}^T w_{k'}) = exp(h_{k,L}^T w_{k'}) / Σ_{i=1}^{V} exp(h_{k,L}^T w_i)    (2)

where k', V, h and w are defined similarly as in Eq. 1. Unlike ELMo, BERT ties the input and output embeddings.
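The masking step described above can be sketched as follows. This is an illustrative simplification (real BERT preprocessing also sometimes keeps or randomly replaces a selected token rather than always inserting [mask]); the function name and toy sentence are our own.

```python
import numpy as np

MASK = "[mask]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build (m_1, ..., m_N) by replacing each position with [mask]
    with probability mask_prob; return the masked sequence and a map
    from masked positions to their original tokens (the MLM targets)."""
    rng = np.random.default_rng(seed)
    masked, targets = [], {}
    for k, t in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[k] = t   # model must predict t_k at this position
        else:
            masked.append(t)  # m_k = t_k
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=0)
assert len(masked) == len(tokens)
assert all(masked[k] == MASK and tokens[k] == v for k, v in targets.items())
```

The prediction head then scores each masked position's final-layer representation against the (tied) token embeddings, exactly as in the next-token softmax of Eq. 1.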

3. PROPOSED FRAMEWORK

We first describe our proposed sense-aware cross entropy loss, which models multiple word senses explicitly during language model pretraining. We then present our joint training approach with a sense alignment objective for cross-lingual mapping of contextual word embeddings. The proposed framework can be applied to most recent neural language models, such as ELMo, BERT and their variants. See Table 1 for a summary of the main notations used in this paper.

