ON LEARNING UNIVERSAL REPRESENTATIONS ACROSS LANGUAGES

Abstract

Recent studies have demonstrated the overwhelming advantage of cross-lingual pre-trained models (PTMs), such as multilingual BERT and XLM, on cross-lingual NLP tasks. However, existing approaches essentially capture the co-occurrence among tokens through the masked language model (MLM) objective with token-level cross entropy. In this work, we extend these approaches to learn sentence-level representations and show their effectiveness on cross-lingual understanding and generation. Specifically, we propose a Hierarchical Contrastive Learning (HICTL) method to (1) learn universal representations for parallel sentences distributed in one or multiple languages and (2) distinguish the semantically related words from a shared cross-lingual vocabulary for each sentence. We conduct evaluations on two challenging cross-lingual tasks, XTREME and machine translation. Experimental results show that HICTL outperforms the state-of-the-art XLM-R by an absolute gain of 4.2% accuracy on the XTREME benchmark and achieves substantial improvements over strong baselines on both high-resource and low-resource English→X translation tasks.

1. INTRODUCTION

Pre-trained models (PTMs) like ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) have shown remarkable success in transferring knowledge learned from large-scale unlabeled data to downstream NLP tasks, such as text classification (Socher et al., 2013) and natural language inference (Bowman et al., 2015; Williams et al., 2018), with limited or no training data. To extend this pretraining-finetuning paradigm to multiple languages, endeavors such as multilingual BERT (Devlin et al., 2019) and XLM (Conneau & Lample, 2019) have been made to learn cross-lingual representations. More recently, Conneau et al. (2020) presented XLM-R to study the effects of training unsupervised cross-lingual representations at a huge scale and demonstrated promising progress on cross-lingual tasks. However, all of these studies only perform masked language modeling (MLM) with token-level (i.e., subword) cross entropy, which limits PTMs to capturing the co-occurrence among tokens, so that they fail to understand the whole sentence. This leads to two major shortcomings of current cross-lingual PTMs: the acquisition of sentence-level representations and the semantic alignment among parallel sentences in different languages. Considering the former, Devlin et al. (2019) introduced the next sentence prediction (NSP) task to distinguish whether two input sentences are continuous segments from the training corpus. However, this simple binary classification task is not enough to model sentence-level representations (Joshi et al., 2020; Yang et al., 2019; Liu et al., 2019; Lan et al., 2020; Conneau et al., 2020). For the latter, Huang et al. (2019) defined the cross-lingual paraphrase classification task, which concatenates two sentences from different languages as input and classifies whether they have the same meaning. This task learns patterns of sentence pairs well but fails to distinguish the exact meaning of each sentence.
In response to these problems, we propose to strengthen PTMs by learning universal representations among semantically equivalent sentences distributed in different languages. We introduce a novel Hierarchical Contrastive Learning (HICTL) framework to learn language-invariant sentence representations via self-supervised non-parametric instance discrimination. Specifically, we use a BERT-style model to encode two sentences separately, and the representation of the first token (e.g., [CLS] in BERT) is treated as the sentence representation. Then, we conduct instance-wise comparison at both the sentence level and the word level, which are complementary to each other. At the sentence level, we maximize the similarity between two parallel sentences while minimizing that among non-parallel ones. At the word level, we maintain a bag-of-words for each sentence pair, in which each word is considered a positive sample while the remaining words in the vocabulary are negative ones. To reduce the space of negative samples, we conduct negative sampling for word-level contrastive learning. With the HICTL framework, PTMs are encouraged to learn language-agnostic representations, thereby bridging the semantic discrepancy among cross-lingual sentences. HICTL is built on XLM-R (Conneau et al., 2020), and experiments are performed on several challenging cross-lingual tasks: language understanding tasks (e.g., XNLI, XQuAD, and MLQA) in the XTREME (Hu et al., 2020) benchmark, and machine translation in the IWSLT and WMT benchmarks. Extensive empirical evidence demonstrates that our approach achieves consistent improvements over baselines on various tasks of both cross-lingual language understanding and generation. In more detail, HICTL obtains absolute gains of 4.2% accuracy (up to 6.0% on zero-shot sentence retrieval tasks, e.g., BUCC and Tatoeba) on XTREME over XLM-R.
For machine translation, our HICTL achieves substantial improvements over baselines on both low-resource (IWSLT English→X) and high-resource (WMT English→X) translation tasks.

2. RELATED WORK

Pre-trained Language Models. Recently, substantial work has shown that pre-trained models (PTMs) (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019) trained on large corpora are beneficial for downstream NLP tasks. The application scheme is to fine-tune the pre-trained model using the limited labeled data of specific target tasks. For cross-lingual pre-training, both Devlin et al. (2019) and Conneau & Lample (2019) trained a Transformer-based model on multilingual Wikipedia, which covers various languages, while XLM-R (Conneau et al., 2020) studied the effects of training unsupervised cross-lingual representations on a very large scale. For sequence-to-sequence pre-training, UniLM (Dong et al., 2019) fine-tuned BERT with an ensemble of masks, employing a shared Transformer network and specific self-attention masks to control what context the prediction conditions on. Song et al. (2019) extended BERT-style models by jointly training the encoder-decoder framework. XLNet (Yang et al., 2019) was trained by predicting masked tokens auto-regressively in a permuted order, which allows predictions to condition on both left and right context. Raffel et al. (2019) unified every NLP problem as a text-to-text problem and pre-trained a denoising sequence-to-sequence model at scale. Concurrently, BART (Lewis et al., 2020) pre-trained a denoising sequence-to-sequence model, in which spans are masked from the input but the complete output is auto-regressively predicted.

Previous works have explored using pre-trained models to improve text generation, such as pre-training both the encoder and decoder on several languages (Song et al., 2019; Conneau & Lample, 2019; Raffel et al., 2019) or using pre-trained models to initialize encoders (Edunov et al., 2019; Zhang et al., 2019a; Guo et al., 2020). Zhu et al. (2020) and Weng et al. (2020) proposed a BERT-fused NMT model, in which the representations from BERT are treated as context and fed into all layers of both the encoder and decoder. Zhong et al. (2020) formulated the extractive summarization task as a semantic text matching problem and proposed a Siamese-BERT architecture, which leverages the pre-trained BERT in a Siamese network structure, to compute the similarity between the source document and the candidate summary. Our approach also belongs to contextual pre-training, so it can be applied to various downstream NLU and NLG tasks.

Contrastive Learning. Contrastive learning (CTL) (Saunshi et al., 2019) aims at maximizing the similarity between the encoded query $q$ and its matched key $k^+$ while keeping randomly sampled negative keys $k^-$ away from it. The contrastive loss takes the form

$$\mathcal{L}_{ctl} = -\log \frac{\exp(s(q, k^+))}{\exp(s(q, k^+)) + \sum_i \exp(s(q, k_i^-))}, \tag{1}$$

where the score function $s(q, k)$ is essentially implemented as the cosine similarity $\frac{q^\top k}{\|q\| \cdot \|k\|}$. $q$ and $k$ are often encoded by a learnable neural encoder, such as BERT (Devlin et al., 2019) or ResNet (He et al., 2016). $k^+$ and $k^-$ are typically called positive and negative samples. In addition to the form illustrated in Eq. (1), contrastive losses can also be based on other forms, such as margin-based losses (Hadsell et al., 2006) and variants of NCE losses (Mnih & Kavukcuoglu, 2013). Contrastive learning is at the core of several recent works on unsupervised or self-supervised learning, from computer vision (Wu et al., 2018; Oord et al., 2018; Ye et al., 2019; He et al., 2019; Chen et al., 2020; Tian et al., 2020) to natural language processing (Mikolov et al., 2013; Mnih & Kavukcuoglu, 2013; Devlin et al., 2019; Clark et al., 2020b; Feng et al., 2020; Chi et al., 2020). Kong et al. (2020) improved language representation learning by maximizing the mutual information between a masked sentence representation and local n-gram spans. Clark et al. (2020b) utilized a discriminator to predict whether a token is replaced by a generator given its surrounding context. Iter et al. (2020) proposed to pre-train language models with contrastive sentence objectives that predict the surrounding sentences given an anchor sentence. In this paper, we propose HICTL to encourage parallel cross-lingual sentences to have identical semantic representations and to distinguish whether a word is contained in them, which can naturally improve the capability of PTMs for cross-lingual understanding and generation.
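To make the contrastive loss of Eq. (1) concrete, the following is a minimal sketch in plain Python. It assumes toy two-dimensional vectors in place of encoder outputs, and the function names `cosine` and `contrastive_loss` are illustrative choices, not names from any released implementation.

```python
import math

def cosine(q, k):
    """Score function s(q, k): cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / (norm_q * norm_k)

def contrastive_loss(q, k_pos, k_negs):
    """Eq. (1): negative log of the positive key's share of the total score mass."""
    pos = math.exp(cosine(q, k_pos))
    neg = sum(math.exp(cosine(q, k)) for k in k_negs)
    return -math.log(pos / (pos + neg))

# Toy vectors (hypothetical values, not real sentence embeddings).
q = [1.0, 0.0]
aligned = contrastive_loss(q, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = contrastive_loss(q, [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
print(aligned < misaligned)
```

The loss is small when the positive key is the one most similar to the query and grows when a negative key is closer to the query than the positive one.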

3.1. HIERARCHICAL CONTRASTIVE LEARNING

We propose hierarchical contrastive learning (HICTL), a novel contrastive learning framework that unifies cross-lingual sentences as well as related words. HICTL can learn from both non-parallel and parallel multilingual data, and the overall architecture of HICTL is illustrated in Figure 1. We represent a training batch of the original sentences as $x = \{x_1, x_2, ..., x_n\}$ and its aligned counterpart as $y = \{y_1, y_2, ..., y_n\}$, where $n$ is the batch size. For each pair $\langle x_i, y_i \rangle$, $y_i$ is either the translation of $x_i$ in the other language when using parallel data, or a perturbation obtained by reordering the tokens of $x_i$ when only monolingual data is available. $x_{\setminus i}$ denotes a modified version of $x$ in which the $i$-th instance is removed.

Sentence-Level CTL. As illustrated in Figure 1a, we apply XLM-R as the encoder to map sentences into hidden representations. The first token of every sequence is always a special token (e.g., [CLS]), and the final hidden state corresponding to this token is used as the aggregate sentence representation for pre-training, that is, $r_x = f \circ g(\mathcal{M}(x))$, where $\mathcal{M}(\cdot)$ denotes the XLM-R encoder, $g(\cdot)$ is the aggregate function, $f(\cdot)$ is a linear projection, and $\circ$ denotes the composition of operations.
To obtain universal representations among semantically equivalent sentences, we encourage $r_{x_i}$ (the query, denoted as $q$) to be as similar as possible to $r_{y_i}$ (the positive sample, denoted as $k^+$) but dissimilar to all other instances in a training batch (i.e., $y_{\setminus i} \cup x_{\setminus i}$, considered as a series of negative samples, denoted as $\{k_1^-, k_2^-, ..., k_{2n-2}^-\}$). Formally, the sentence-level contrastive loss for $x_i$ is defined as

$$\mathcal{L}_{sctl}(x_i) = -\log \frac{\exp \circ s(q, k^+)}{\exp \circ s(q, k^+) + \sum_{j=1}^{|y_{\setminus i} \cup x_{\setminus i}|} \exp \circ s(q, k_j^-)}. \tag{2}$$

Symmetrically, we also expect $r_{y_i}$ (the query, denoted as $\tilde{q}$) to be as similar as possible to $r_{x_i}$ (the positive sample, denoted as $\tilde{k}^+$) but dissimilar to all other instances in the same training batch, thus

$$\mathcal{L}_{sctl}(y_i) = -\log \frac{\exp \circ s(\tilde{q}, \tilde{k}^+)}{\exp \circ s(\tilde{q}, \tilde{k}^+) + \sum_{j=1}^{|y_{\setminus i} \cup x_{\setminus i}|} \exp \circ s(\tilde{q}, \tilde{k}_j^-)}. \tag{3}$$

The sentence-level contrastive loss over the training batch can be formulated as

$$\mathcal{L}_S = \frac{1}{2n} \sum_{i=1}^{n} \left\{ \mathcal{L}_{sctl}(x_i) + \mathcal{L}_{sctl}(y_i) \right\}. \tag{4}$$

For sentence-level contrastive learning, we treat the other instances contained in the training batch as negative samples for the current instance. However, such randomly selected negative samples are often uninformative, which makes it hard to learn to distinguish very similar but non-equivalent samples. To address this issue, we employ smoothed linear interpolation (Bowman et al., 2016; Zheng et al., 2019) between sentences in the embedding space to alleviate the lack of informative samples for pre-training, as shown in Figure 2. Consider a training batch $\{\langle x_i, y_i \rangle\}_{i=1}^{n}$, where $n$ is the batch size. Having obtained the embeddings of a triplet, namely an anchor $q$, a positive $k^+$ and a negative $k^-$ (supposing $q$, $k^+$ and $k^-$ are the representations of sentences $x_i$, $y_i$ and $y_i^- \in x_{\setminus i} \cup y_{\setminus i}$, respectively), we construct a harder negative sample $\hat{k}^-$ to replace $k_j^-$:

$$\hat{k}^- = \begin{cases} q + \lambda (k^- - q), \ \lambda \in \left( \frac{d^+}{d^-}, 1 \right] & \text{if } d^- > d^+; \\ k^- & \text{if } d^- \le d^+, \end{cases}$$

where $d^+ = \|k^+ - q\|_2$ and $d^- = \|k^- - q\|_2$.
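The batch-level objective of Eqs. (2) to (4) and the interpolation rule just defined can be sketched as follows. This is a minimal illustration under stated assumptions: sentence representations are plain Python lists rather than encoder outputs, the interpolation coefficient `lam` is simply passed in by the caller, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Score function s(u, v): cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sctl(query, positive, negatives):
    """Eq. (2) / Eq. (3): contrastive loss for a single query."""
    pos = math.exp(cosine(query, positive))
    neg = sum(math.exp(cosine(query, k)) for k in negatives)
    return -math.log(pos / (pos + neg))

def sentence_level_loss(rx, ry):
    """Eq. (4): both directions, averaged over a batch of n pairs."""
    n = len(rx)
    total = 0.0
    for i in range(n):
        # Negatives are all other instances in the batch (2n - 2 of them).
        negatives = [r for j, r in enumerate(rx) if j != i]
        negatives += [r for j, r in enumerate(ry) if j != i]
        total += sctl(rx[i], ry[i], negatives)
        total += sctl(ry[i], rx[i], negatives)
    return total / (2 * n)

def harder_negative(q, k_pos, k_neg, lam):
    """Interpolate k_neg toward q when it lies farther from q than k_pos."""
    d_pos, d_neg = l2(k_pos, q), l2(k_neg, q)
    if d_neg <= d_pos:
        return list(k_neg)  # already at least as hard as the positive
    if not (d_pos / d_neg < lam <= 1.0):
        raise ValueError("lam must lie in (d+/d-, 1]")
    return [qi + lam * (ki - qi) for qi, ki in zip(q, k_neg)]

# Toy batch of two pairs (hypothetical values).
rx = [[1.0, 0.0], [0.0, 1.0]]
ry = [[0.9, 0.1], [0.1, 0.9]]
aligned = sentence_level_loss(rx, ry)
shuffled = sentence_level_loss(rx, ry[::-1])
print(aligned < shuffled)
hard = harder_negative([0.0, 0.0], [1.0, 0.0], [4.0, 0.0], lam=0.5)
print(hard)
```

In this sketch the synthetic negative always stays outside the radius $d^+$ around the query, because `lam` is constrained to be larger than $d^+/d^-$.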
For the first condition, the hardness of $\hat{k}^-$ increases as $\lambda$ becomes smaller. To this end, we intuitively set $\lambda$ as

$$\lambda = \left( \frac{d^+}{d^-} \right)^{\zeta \cdot p^+_{avg}}, \quad \zeta \in (0, 1),$$

where $p^+_{avg} = \frac{1}{100} \sum_{\ell \in [-100, -1]} e^{-\mathcal{L}_S^{(\ell)}}$ is the average positive-sample probability, computed from the exponentiated negative loss, over the last 100 training batches, and $\mathcal{L}_S$, formulated in Eq. (4), is the sentence-level contrastive loss of one training batch. During pre-training, when the model distinguishes positive samples easily, the negative samples are no longer informative. In this case $p^+_{avg}$ increases and $\frac{d^+}{d^-}$ decreases, so $\lambda$ decreases and harder negative samples are adaptively synthesized in the following training steps, and vice versa. As hard negative samples usually result in significant changes of the model parameters, we introduce the slack coefficient $\zeta$ to prevent the model from being trained in the wrong direction when it accidentally switches from random negative samples to very hard ones. In practice, we empirically set $\zeta = 0.9$.

Word-Level CTL. Intuitively, predicting the related words in other languages for each sentence can bridge the representations of words in different languages. As shown in Figure 1b, we concatenate the sentence pair $\langle x_i, y_i \rangle$ as $x_i \circ y_i$: [CLS] $x_i$ [SEP] $y_i$ [SEP], and denote its bag-of-words as $\mathcal{B}$. For word-level contrastive learning, the final state of the first token is treated as the query ($q$), each word $w_t \in \mathcal{B}$ is considered a positive sample, and all the other words ($\mathcal{V} \setminus \mathcal{B}$, i.e., the words in $\mathcal{V}$ that are not in $\mathcal{B}$, where $\mathcal{V}$ is the overall vocabulary of all languages) are negative samples. As the vocabulary is usually large, we propose to use only a subset $\mathcal{S} \subset \mathcal{V} \setminus \mathcal{B}$ sampled according to the normalized similarities between $q$ and the embeddings of the words. As a result, the subset $\mathcal{S}$ naturally contains the hard negative samples, which are beneficial for learning high-quality representations (Ye et al., 2019). Specifically, the word-level contrastive loss for $\langle x_i, y_i \rangle$ is defined as

$$\mathcal{L}_{wctl}(x_i, y_i) = -\frac{1}{|\mathcal{B}|} \sum_{t=1}^{|\mathcal{B}|} \log \frac{\exp \circ s(q, e(w_t))}{\exp \circ s(q, e(w_t)) + \sum_{w_j \in \mathcal{S}} \exp \circ s(q, e(w_j))},$$

where $e(\cdot)$ is the embedding lookup function and $|\mathcal{B}|$ is the number of unique words in the concatenated sequence $x_i \circ y_i$. The overall word-level contrastive loss can be formulated as

$$\mathcal{L}_W = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{wctl}(x_i, y_i).$$

Multi-Task Pre-training. Both MLM and the translation language model (TLM) are combined with HICTL by default, as prior work (Conneau & Lample, 2019) has verified their effectiveness in XLM. In summary, the model is optimized by minimizing the entire training loss

$$\mathcal{L} = \mathcal{L}_{LM} + \mathcal{L}_S + \mathcal{L}_W,$$

where $\mathcal{L}_{LM}$ is implemented as either the TLM when using parallel data or the MLM when only monolingual data is available, to recover the original words at masked positions given the contexts.
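The word-level objective can be sketched in the same style. The sketch below assumes that "normalized similarities" means a softmax over cosine similarities and that negatives are drawn without replacement; both are assumptions of this example rather than details stated above. The vocabulary, embeddings, and function names are toy, hypothetical values.

```python
import math
import random

def cosine(u, v):
    """Score function s(u, v): cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sample_negatives(q, embeddings, bag, m, rng):
    """Draw up to m words outside the bag, favoring words similar to q.

    Assumption: sampling weights are exp(cosine), i.e. a softmax over
    similarities, and words are drawn without replacement.
    """
    candidates = [w for w in embeddings if w not in bag]
    weights = [math.exp(cosine(q, embeddings[w])) for w in candidates]
    chosen = []
    for _ in range(min(m, len(candidates))):
        idx = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(candidates.pop(idx))
        weights.pop(idx)
    return chosen

def word_level_loss(q, embeddings, bag, negatives):
    """Word-level contrastive loss for one concatenated sentence pair."""
    neg = sum(math.exp(cosine(q, embeddings[w])) for w in negatives)
    total = 0.0
    for w in bag:
        pos = math.exp(cosine(q, embeddings[w]))
        total += math.log(pos / (pos + neg))
    return -total / len(bag)

# Toy shared vocabulary with 2-d embeddings (hypothetical values).
embeddings = {
    "cat": [1.0, 0.0], "chat": [0.9, 0.2], "dog": [0.7, 0.7],
    "run": [0.0, 1.0], "sky": [-1.0, 0.0],
}
bag = ["cat", "chat"]   # unique words of the sentence pair
q = [1.0, 0.1]          # stands in for the [CLS] state
negatives = sample_negatives(q, embeddings, bag, 2, random.Random(0))
loss = word_level_loss(q, embeddings, bag, negatives)
print(negatives, round(loss, 4))
```

Because the sampling weights grow with similarity to the query, words close to the sentence representation but absent from the pair are more likely to be chosen, which is what makes the sampled subset contain hard negatives.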

3.2. CROSS-LINGUAL FINE-TUNING

Language Understanding. The representations produced by HICTL can be used in several ways for language understanding tasks, whether they involve a single text or text pairs. Concretely, (i) the [CLS] representation of a single sentence in sentiment analysis, or of a sentence pair in paraphrasing and entailment, is fed into an extra output layer for classification; (ii) the pre-trained encoder can be used to assign POS tags to each word or to locate and classify all the named entities in a sentence for structured prediction; and (iii) it can be used to extract answer spans for question answering.

Figure 3: Fine-tuning on NMT task. (Diagram: a pre-trained encoder and a randomly initialized decoder.)

Language Generation. We also explore using HICTL to improve machine translation. In previous work, Conneau & Lample (2019) have shown that pre-trained encoders can provide a better initialization of both supervised and unsupervised NMT systems. Liu et al. (2020b) have shown that NMT models can be improved by incorporating pre-trained sequence-to-sequence models on various language pairs, except in the highest-resource settings. As illustrated in Figure 3, we use the model pre-trained by HICTL as the encoder and add a new set of decoder parameters that are learned from scratch. To prevent pre-trained weights from being washed out by supervised training, we train the encoder-decoder model in two steps. In the first step, we freeze the pre-trained encoder and only update the decoder. In the second step, we train all parameters for a relatively small number of iterations. In both steps, we compute the similarities between the [CLS] representation of the encoder and all target words in advance. We then aggregate them with the logits before the softmax of each decoder step through an element-wise additive operation. The encoder-decoder model is optimized by maximizing the log-likelihood of bitext at both steps.
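The additive aggregation of [CLS]-to-word similarities with the decoder logits can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure and an unscaled sum; neither detail is specified above, and the three-word target vocabulary and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity (assumed here as the similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def word_similarities(cls_vec, target_embeddings):
    """Computed once per source sentence, before decoding starts."""
    return [cosine(cls_vec, e) for e in target_embeddings]

def fused_distribution(decoder_logits, sims):
    """Element-wise sum of logits and similarities, then softmax."""
    return softmax([l + s for l, s in zip(decoder_logits, sims)])

# Toy target vocabulary of three words (hypothetical embeddings).
target_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sims = word_similarities([1.0, 0.2], target_embeddings)
probs = fused_distribution([0.0, 0.0, 0.0], sims)
print(probs.index(max(probs)))
```

With uninformative decoder logits, the fused distribution favors the target word whose embedding is closest to the sentence representation, so the similarity term acts as a sentence-level prior over the target vocabulary at every step.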

4. EXPERIMENTS

We consider two evaluation benchmarks: nine cross-lingual language understanding tasks in the XTREME benchmark and machine translation tasks (IWSLT'14 English↔German, IWSLT'14 English→Spanish, WMT'16 Romanian→English, IWSLT'17 English→{French, Chinese} and WMT'14 English→{German, French}). In this section, we describe the data and training details, and provide detailed evaluation results.

4.1. DATA AND MODEL

During pre-training, we follow Conneau et al. (2020) to build a Common-Crawl corpus using the CCNet (Wenzek et al., 2019) tool (footnote 1) for monolingual texts. Table 7 (see Appendix A) reports the language codes and data size in our work. For parallel data, we use the same (English-to-X) MT dataset as Conneau & Lample (2019), which is collected from MultiUN (Eisele & Yu, 2010) for French, Spanish, Arabic and Chinese, the IIT Bombay corpus (Kunchukuttan et al., 2018a) for Hindi, OpenSubtitles 2018 for Turkish, Vietnamese and Thai, the EUbookshop corpus for German, Greek and Bulgarian, Tanzil for both Urdu and Swahili, and GlobalVoices for Swahili. Table 8 (see Appendix A) shows the statistics of the parallel data. We adopt the Transformer encoder (Vaswani et al., 2017) as the backbone, with 12 layers and 768 hidden units for HICTL Base, and 24 layers and 1024 hidden units for HICTL. We initialize the parameters of HICTL with XLM-R (Conneau et al., 2020). Hyperparameters for pre-training and fine-tuning are shown in Table 9 (see Appendix B). We run the pre-training experiments on 8 V100 GPUs with a batch size of 1024. The number of negative samples for word-level contrastive learning is m = 512.

4.2. EXPERIMENTAL EVALUATION

Cross-lingual Language Understanding (XTREME). There are nine tasks in XTREME that can be grouped into four categories: (i) Sentence classification consists of Cross-lingual Natural Language Inference (XNLI) (Conneau et al., 2018) and Cross-lingual Paraphrase Adversaries from Word Scrambling (PAWS-X) (Zhang et al., 2019b). (ii) Structured prediction includes POS tagging and NER. We use POS tagging data from the Universal Dependencies v2.5 (Nivre et al., 2018) treebanks, where each word is assigned one of 17 universal POS tags. For NER, we use the Wikiann dataset (Pan et al., 2017). (iii) Question answering includes three tasks: Cross-lingual Question Answering (XQuAD) (Artetxe et al., 2019), Multilingual Question Answering (MLQA) (Lewis et al., 2019), and the gold passage version of the Typologically Diverse Question Answering dataset (TyDiQA-GoldP) (Clark et al., 2020a). (iv) Sentence retrieval includes two tasks, BUCC (Zweigenbaum et al., 2017) and Tatoeba (Artetxe & Schwenk, 2019), which aim to extract parallel sentences between the English corpus and target languages. As XTREME provides no training data, we directly evaluate pre-trained models on test sets.

Table 1 provides detailed results on the four categories in XTREME. First, compared to the state-of-the-art XLM-R baseline, HICTL further achieves significant gains of 1.43% and 2.80% on average on the nine tasks with the cross-lingual zero-shot transfer and translate-train-all settings, respectively. Second, mining hard negative samples via smoothed linear interpolation plays an important role in contrastive learning, significantly improving accuracy by 1.6 points on average. Third, HICTL with hardness-aware augmentation delivers large improvements on zero-shot sentence retrieval tasks (scores 5.8 and 6.0 points higher on BUCC and Tatoeba, respectively). Following Hu et al. (2020), we directly evaluate pre-trained models on test sets without any of the extra labeled data or fine-tuning techniques used in Fang et al. (2020) and Luo et al. (2020). These results demonstrate the capacity of HICTL for learning cross-lingual representations. We also compare our best model with two existing models, FILTER (Fang et al., 2020) and VECO (Luo et al., 2020), in Table 2. The results demonstrate that HICTL achieves the best performance on most tasks with less monolingual data.

Ablation experiments are presented in Table 3. Compared with the full model, we can draw several conclusions: (1) removing the sentence-level CTL objective hurts performance consistently and significantly, (2) removing the word-level CTL objective causes the smallest drop compared to the others, and (3) the parallel (MT) data has a large impact on zero-shot multilingual sentence retrieval tasks.

Machine Translation

The main idea of HICTL is to summarize cross-lingual parallel sentences into a shared representation that we term a semantic embedding, using which semantically related words can be distinguished from others. Thus it is natural to apply this global embedding to text generation. We fine-tune the pre-trained HICTL with the base setting on machine translation tasks in both low-resource and high-resource settings. For the low-resource scenario, we choose IWSLT'14 English↔German (En↔De) (footnote 2), IWSLT'14 English→Spanish (En→Es), WMT'16 Romanian→English (Ro→En), IWSLT'17 English→French (En→Fr) and English→Chinese (En→Zh) translation (footnote 3). There are 160k, 183k, 236k, 235k and 0.6M bilingual sentence pairs for the En↔De, En→Es, En→Fr, En→Zh and Ro→En tasks, respectively. For the rich-resource scenario, we work on WMT'14 En→{De, Fr}, where the corpus sizes are 4.5M and 36M, respectively. We concatenate newstest 2012 and newstest 2013 as the validation set and use newstest 2014 as the test set. During fine-tuning, we use the pre-trained model to initialize the encoder and introduce a randomly initialized decoder. We use a shallower decoder with 4 identical layers to reduce the computation overhead. In the first fine-tuning step, we concatenate the datasets of all language pairs in either the low-resource or the high-resource setting and optimize only the decoder until convergence (footnote 4). Then, in the second step, we tune the whole encoder-decoder model using a per-language corpus. The initial learning rate is 2e-5, and the inverse square root learning rate scheduler (Vaswani et al., 2017) is adopted. For WMT'14 En→De, we use beam search with width 4 and length penalty 0.6 for inference. For other tasks, we use width 5 and a length penalty of 1.0. For fair comparison with previous work, we use multi-bleu.perl to evaluate the IWSLT'14 En↔De and WMT tasks, and sacreBLEU for the remaining tasks. Results on high-resource and low-resource tasks are reported in Table 4 and Table 5, respectively.
We implemented the standard Transformer (with the base and big settings for IWSLT and WMT tasks, respectively) as the baseline. The proposed HICTL improves the BLEU scores of the eight tasks by 3.34, 2.95, 3.24, 3.45, 2.8, 6.37, 4.4, and 3.4. In addition, our approach also outperforms the BERT-fused model (Yang et al., 2020), a method that treats BERT as an extra context and fuses the representations extracted from BERT with each encoder and decoder layer. Note that we achieve new state-of-the-art results on IWSLT'14 En→De and IWSLT'17 En→{Fr, Zh} translations. These improvements show that mapping different languages into a universal representation space is beneficial for both low-resource and high-resource translations.

We also evaluate our model on tasks where no bi-text is available for the target language pair. Following mBART (Liu et al., 2020b), we adopt the setting of language transfer. That is, no bi-text for the target pair is available, but there is bi-text for translating from some other language into the target language. For example, supposing there is no parallel data for the target language pair Italian→English (It→En), we can transfer knowledge learned from Czech→English (Cs→En, a high-resource language pair) to It→En. We consider X→En translation, covering Indic languages (Ne, Hi, Si, Gu) and European languages (Ro, It, Cs, Nl). For European languages, we fine-tune on Cs→En translation, where the parallel data is from WMT'19 and contains 11M sentence pairs. We test on {Cs, Ro, It, Nl}→En, for which the test sets are from previous WMT (Cs, Ro) or IWSLT (It, Nl) competitions. For Indic languages, we fine-tune on Hi→En translation (1.56M sentence pairs from IITB (Kunchukuttan et al., 2018b)) and test on {Ne, Hi, Si, Gu}→En translations. Results are shown in Table 6. We consistently obtain reasonable transfer scores on low-resource pairs across the different fine-tuned models, whereas, in our experience, randomly initialized models without pre-training always achieve near 0 BLEU. The underlying reason is that multilingual pre-training produces universal representations across languages, so that once the model learns to translate one language, it learns to translate all languages with similar representations. Moreover, Gu→En translation fails; we conjecture that this is because we only use 0.3GB of monolingual data for pre-training, which makes it difficult to learn informative representations for Gujarati.

5. CONCLUSION

We have demonstrated that pre-trained language models (PTMs) trained to learn commonsense knowledge from large-scale unlabeled data highly benefit from hierarchical contrastive learning (HICTL), in terms of both cross-lingual understanding and generation. Learning universal representations at both the word level and the sentence level bridges the semantic discrepancy across languages. As a result, HICTL sets a new level of performance among cross-lingual PTMs, improving on the state of the art by a large margin.

Table 7 reports the language codes and data size in our work. For parallel data, we use the same (English-to-X) MT dataset as Conneau & Lample (2019), which is collected from MultiUN (Eisele & Yu, 2010) for French, Spanish, Arabic and Chinese, the IIT Bombay corpus (Kunchukuttan et al., 2018a) for Hindi, OpenSubtitles 2018 for Turkish, Vietnamese and Thai, the EUbookshop corpus for German, Greek and Bulgarian, Tanzil for both Urdu and Swahili, and GlobalVoices for Swahili. Table 8 shows the statistics of the parallel data.

B HYPERPARAMETERS FOR PRE-TRAINING AND FINE-TUNING

Table 9 presents the hyperparameters for pre-training HICTL. We use the same vocabulary and sentence-piece model as XLM-R (Conneau et al., 2020). During fine-tuning on XTREME, we search the learning rate over {5e-6, 1e-5, 1.5e-5, 2e-5, 2.5e-5, 3e-5} and the batch size over {16, 32} for BASE-size models. We select the best LARGE-size model by searching the learning rate over {3e-6, 5e-6, 1e-5} and the batch size over {32, 64}.

Note that the t-SNE visualization of HICTL still shows some noise; we attribute it to the lack of hard negative examples for sentence-level contrastive learning and leave this to future work.



1. https://github.com/facebookresearch/cc_net
2. We split 7k sentence pairs from the training dataset for validation and concatenate dev2010, dev2012, tst2010, tst2011, tst2012 as the test set.
3. https://wit3.fbk.eu/mt.php?release=2017-01-ted-test
4. Zhao et al. (2020) conducted a theoretical investigation on learning universal representations for the task of multilingual MT, while we directly use a shared encoder and decoder across languages for simplicity.



Figure 2: Illustration of constructing hard negative samples (HNS). A circle (with radius $d^+ = \|k^+ - q\|_2$) in the embedding space represents a manifold within which sentences are semantically equivalent. We can generate a coherent sample (i.e., $\hat{k}^-$) that interpolates between the known pair $q$ and $k^-$. The difficulty of the synthetic negative $\hat{k}^-$ can be controlled adaptively during training. The green curly brace indicates the walking range of hard negative samples; the closer a sample is to the circle, the harder it is.

Figure 4: Visualizations (t-SNE projection) of sentence embeddings output by HICTL (left) and XLM-R (right). We collect 10 sets of samples from WMT'14-19, each of which contains 100 parallel sentences distributed in 5 languages (i.e., English, French, German, Russian, and Spanish). Each set is identified by a color, and different languages are marked by different shapes. We can see that a set of sentences with the same meaning is clustered more densely for HICTL than for XLM-R, which reveals the strong capability of HICTL on learning universal representations across different languages.

Figure 1: Illustration of Hierarchical Contrastive Learning (HICTL). n is the batch size, and m denotes the number of negative samples for word-level contrastive learning. B and V indicate the bag-of-words of the instance $\langle x_i, y_i \rangle$ and the overall vocabulary of all languages, respectively.

Table 1: Overall results on the XTREME benchmark. Results of mBERT (Devlin et al., 2019), XLM (Conneau & Lample, 2019) and XLM-R (Conneau et al., 2020) are from XTREME (Hu et al., 2020). Results of ‡ are from our in-house replication. HNS is short for "Hard Negative Samples". In the translate-train-all setting, models are trained on English training data and its translated data on the target language.

Table 2: Comparison with existing methods on XTREME tasks.

Table 3: Ablation study on XTREME tasks.

Table 4: BLEU scores [%] on high-resource tasks. Results with † and ‡ are from VECO (Luo et al., 2020) and our in-house implementation, respectively. In our implementation, we use XLM-R and the best version of HICTL (pre-trained with CCNet-100 and hard negative samples) to initialize the encoder, respectively.

Table 5: BLEU scores [%] on low-resource tasks. Results with ‡ are from our in-house implementation. We provide additional experimental results (following the experiments in Zhu et al. (2020)) on the IWSLT'14 English→Spanish (En→Es) task. HICTL Base represents the BASE-sized model that is pre-trained on CCNet-100 with hard negative samples.

BLEU scores [%] on Zero-shot MT via Language Transfer. We bold the highest transferring score for each language family. and fuses the representations extracted from BERT with each encoder and decoder layer. Note we achieve new state-of-the-art results on IWSLT'14 En→De, IWSLT'17 En→{Fr, Zh} translations. These improvements show that mapping different languages into a universal representation space is beneficial for both low-resource and high-resource translations.

Table 7: The statistics of the CCNet corpus used for pre-training.

Table 8: Parallel data used for pre-training.

PAWS-X accuracy scores for each language.

POS results (Accuracy) for each language.

NER results (F1) for each language. 82.3 55.2 84.7 79.0 81.2 80.1 81.6 79.8 61.4 61.9 82.8 80.5 60.4 74.6 79.8 54.8 83.5 24.9 66.1 HICTL, CCNet-100 + MT 88.6 80.9 55.4 85.6 81.8 82.0 82.5 80.8 81.2 62.5 64.2 81.2 83.0 60.3 77.3 84.4 55.8 83.7 26.0 65.0 +HARD NEGATIVE SAMPLES 88.9 82.0 56.6 83.7 83.4 82.8 84.8 83.0 83.8 65.4 65.4 82.0 82.6 60.5 74.7 81.5 58.1 84.7 27.9 65.9 .7 62.2 69.4 68.8 57.9 55.6 87.9 84.2 71.9 74.4 61.6 59.2 2.2 74.2 79.5 58.1 83.0 35.2 33.0 HICTL, CCNet-100 + MT 72.8 57.6 64.6 70.4 71.5 61.1 59.0 87.7 85.1 70.3 74.3 60.6 57.9 5.6 77.5 79.0 59.8 83.7 37.7 36.9 +HARD NEGATIVE SAMPLES 76.8 60.9 65.0 71.4 72.5 59.0 56.3 85.9 84.5 71.4 75.6 62.9 58.8 3.9 77.7 80.4 59.1 83.6 37.7 37.2

Tatoeba results (Accuracy) for each language .5 72.2 45.4 89.5 61.3 77.6 51.7 38.6 71.7 72.8 76.9 66.3 73.1 65.1 77.5 68.5  63.1 HICTL, Wiki-15 + MT 61.5 51.4 76.1 47.9 92.1 63.4 80.5 55.9 37.8 74.6 76.7 78.0 68.4 74.5 68.8 80.4 70.2 63.9 HICTL, CCNet-100 + MT 63.0 50.9 76.8 47.0 94.6 68.8 80.9 59.3 41.5 77.3 78.2 80.3 70.2 77.9 72.1 81.3 73.7 66.2 +HARD NEGATIVE SAMPLES 68.9 57.7 83.2 55.4 98.2 74.5 88.5 62.4 47.7 80.2 82.9 85.5 79.1 85.0 76.8 90.3 80.8 72.7 .3 51.2 63.1 66.2 59.0 81.0 84.4 76.9 19.8 28.3 37.8 28.9 36.7 68.9 26.6 77.9 69.8 HICTL, Wiki-15 + MT 18.7 55.8 51.0 65.5 67.3 61.2 82.9 84.4 78.3 22.2 28.6 41.4 33.5 41.6 71.2 26.7 80.2 73.6 HICTL, CCNet-100 + MT 19.6 57.3 54.6 68.0 71.8 62.0 88.1 88.9 77.7 26.1 32.9 39.5 32.9 43.2 71.2 27.8 79.9 74.7 +HARD NEGATIVE SAMPLES 27.2 63.0 61.5 72.6 75.3 67.8 92.8 92.8 85.4 32.0 36.7 47.8 41.5 49.8 77.0 34.3 84.3 81.3

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for the helpful comments. We also thank Jing Yu for the instructive suggestions. This work is supported by the National Key R&D Program of China under Grant No.2017YFB0803301 and No. 2018YFB1403202.

Published as a conference paper at ICLR 2021

Table 10: Results on Cross-lingual Natural Language Inference (XNLI) for each language. We report the accuracy on each of the XNLI languages and the average accuracy of our HICTL as well as five baselines: BiLSTM (Conneau et al., 2018), mBERT (Devlin et al., 2019), XLM (Conneau & Lample, 2019), Unicoder (Huang et al., 2019) and XLM-R (Conneau et al., 2020). Results of ‡ are from our in-house replication.

D VISUALIZATION OF SENTENCE EMBEDDINGS

We collect 10 sets of samples from WMT'14-19, each of which contains 100 parallel sentences distributed in 5 languages. As the t-SNE visualization in Figure 4 shows, a set of sentences with the same meaning is clustered more densely for HICTL than for XLM-R, which reveals the strong capability of HICTL on learning universal representations across different languages.

