VECO: VARIABLE ENCODER-DECODER PRE-TRAINING FOR CROSS-LINGUAL UNDERSTANDING AND GENERATION

Abstract

Recent studies on learning multilingual representations have achieved significant performance gains across a wide range of downstream cross-lingual tasks. They train either an encoder-only Transformer mainly for understanding tasks, or an encoder-decoder Transformer specifically for generation tasks, ignoring the correlation between the two kinds of tasks and frameworks. In contrast, this paper presents a variable encoder-decoder (VECO) pre-training approach to unify the two mainstreams in both model architecture and pre-training tasks. VECO splits the standard Transformer block into several sub-modules trained with both inner-sequence and cross-sequence masked language modeling, and correspondingly reorganizes certain sub-modules for understanding and generation tasks during inference. Such a workflow not only ensures that only the most streamlined parameters necessary for the two kinds of tasks are trained, but also enables the tasks to boost each other via sharing common sub-modules. As a result, VECO delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For generation tasks, VECO also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1∼2 BLEU.

1. INTRODUCTION

Driven by the striking success of pre-trained language models (Devlin et al., 2019), cross-lingual pre-training (Lample & Conneau, 2019; Liu et al., 2020b) has recently attracted increasing attention. It provides cross-lingual contextualized representations for inputs in different languages, which significantly advances performance on both natural language understanding (NLU) and generation (NLG) tasks. There are two mainstream architectures in the current cross-lingual pre-training literature: encoder-only and encoder-decoder. The former, like XLM (Lample & Conneau, 2019), focuses on conducting masked language modeling (MLM) with a single Transformer (Vaswani et al., 2017) encoder. This paradigm is naturally compatible with various NLU tasks, but tends to yield limited gains on cross-lingual generation tasks (e.g., machine translation) due to the lack of effective decoder initialization. In contrast, the latter, like mBART (Liu et al., 2020b), pre-trains the encoder-decoder Transformer via denoising auto-encoding tasks to provide complete initialization for downstream generation tasks. However, when applied in NLU scenarios, it usually requires more computation and memory to match the performance of encoder-only models.

In light of the above pros and cons, this work presents Variable Encoder-deCOder (VECO) pre-training, which aims to provide pre-trained initialization for both the encoder-only and the encoder-decoder Transformer with the most streamlined parameters. We observe that Transformer encoder and decoder blocks share two common modules, SelfAttention and FFN (feed-forward network), the main difference being that the decoder introduces an extra CrossAttention module (attention from the decoder over the encoder outputs). Inspired by the lottery ticket hypothesis (Frankle & Carbin, 2018), we split the standard Transformer block into three independent modules, {SelfAttention, CrossAttention, FFN}, to be collaboratively trained via two specific MLM tasks. After that, we rebuild the desired complete architecture for NLU or NLG with different combinations of these modules during fine-tuning.foot_0

Figure 1: The overview of VECO. During pre-training, we feed two masked segments x̂ and ŷ into different modules to perform inner-sequence masked language modeling (IS-MLM) and cross-sequence masked language modeling (CS-MLM). More specifically, the masked segment x̂ can only attend to its own context via self-attention to recover the original tokens x̄ (IS-MLM), while the masked segment ŷ can attend to its preceding tokens via self-attention and to the context x̂ via cross-attention to predict the original tokens ȳ (CS-MLM). For downstream NLU tasks, we throw out the cross-attention module and fine-tune only the self-attention and FFN modules, which act as an encoder. For NLG tasks, we keep all modules to initialize the corresponding encoder and decoder.

Specifically, to equip the model with language understanding ability during pre-training, SelfAttention and FFN are assembled into a standard Transformer encoder for conducting inner-sequence masked language modeling (IS-MLM). In terms of generation, SelfAttention, CrossAttention, and FFN act together as the decoder of a standard sequence-to-sequence model, trained by the elaborately designed cross-sequence masked language modeling (CS-MLM) task. When applied to downstream fine-tuning, the SelfAttention and FFN modules constitute the Transformer encoder for contextual modeling in NLU or NLG, or cooperate with the additional CrossAttention module to provide effective initialization of the Transformer decoder. With this workflow, VECO can be applied to both NLU and NLG tasks with the most streamlined parameters, which significantly reduces computational overhead and memory costs.
Moreover, IS-MLM is specifically designed for understanding individual sequences, while both understanding and generation tasks benefit from CS-MLM. Through such parameter sharing, VECO enables the SelfAttention and FFN modules to be jointly trained by the two MLMs, which boosts both NLU and NLG performance. We validate VECO on a variety of representative cross-lingual NLU and NLG benchmarks. For cross-lingual understanding, we conduct experiments on the XTREME benchmark, which consists of 9 cross-lingual tasks covering text classification, sequence labeling, question answering, and sentence retrieval. VECO ranked first on the XTREME leaderboardfoot_1 at the submission deadline and obtains new state-of-the-art results on most of the tasks. For cross-lingual generation, we validate VECO on the widely used WMT14 English-German and English-French machine translation benchmarks. VECO obtains 44.4 and 31.5 BLEU, consistently outperforming existing cross-lingual pre-training approaches and state-of-the-art Transformer variants by around 1∼2 BLEU.

2.1. BACKBONE NETWORK

The backbone network of VECO is composed of a stack of N identical layers. Each layer has three modules: a required self-attention module, an optional cross-attention module, and a required feed-forward module. Both the self-attention and cross-attention modules are based on multi-head attention (Vaswani et al., 2017):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W^Q, K W^K, V W^V)    (1)

where W^O, W^Q, W^K, and W^V are parameter matrices, and Attention(a, b, c) denotes the attention operation with a as query, b as key, and c as value. We refer readers to Vaswani et al. (2017) for more details. The main difference between the self-attention and cross-attention modules is that Q = K = V holds in the self-attention module, while only K = V holds in the cross-attention module. We formalize the two modules as:

SelfAttention(x; θ_s) = AddNorm(MultiHead(Q = x, K = x, V = x))
CrossAttention(x, y; θ_c) = AddNorm(MultiHead(Q = y, K = x, V = x))

where θ_s and θ_c are the corresponding parameters, and AddNorm denotes a residual connection (He et al., 2016) followed by post layer normalization (Ba et al., 2016). After that, a fully connected feed-forward network is applied to each position independently:

FFN(x; θ_f) = AddNorm(W_2 GeLU(W_1 x))

where θ_f = {W_1, W_2} are parameter matrices.
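As a concrete illustration, the three modules above can be sketched in numpy as follows. This is a simplified single-head, post-layer-norm sketch; all names (veco_layer, the theta dicts of plain matrices) are illustrative assumptions, not the released implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def add_norm(x, sub, eps=1e-5):  # residual connection + post layer normalization
    h = x + sub
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def attention(q, k, v):  # Attention(a, b, c): a = query, b = key, c = value
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def self_attention(x, th):      # Q = K = V = x
    return add_norm(x, attention(x @ th["Wq"], x @ th["Wk"], x @ th["Wv"]))

def cross_attention(x, y, th):  # Q = y, K = V = x; residual on the query stream
    return add_norm(y, attention(y @ th["Wq"], x @ th["Wk"], x @ th["Wv"]))

def ffn(x, th):
    gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))
    return add_norm(x, gelu(x @ th["W1"]) @ th["W2"])

def veco_layer(x, th_s, th_f, th_c=None, enc_out=None):
    """One backbone layer: required self-attention and FFN, optional cross-attention."""
    h = self_attention(x, th_s)
    if th_c is not None and enc_out is not None:
        h = cross_attention(enc_out, h, th_c)
    return ffn(h, th_f)
```

The single `if` branch is the point of the design: the same layer acts as an encoder block when the optional cross-attention is skipped, and as a decoder block when it is used.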

2.2. PRE-TRAINING OBJECTIVES

In cross-lingual pre-training scenarios, we can utilize both the monolingual and bilingual data widely used in previous works (Lample & Conneau, 2019; Chi et al., 2020b; Yang et al., 2020). We denote both two adjacent segments from the monolingual corpus and a pair of parallel sentences from the bilingual corpus as (x, y). We first adopt the same masking strategy as BERT (Devlin et al., 2019) to construct the masked input (x̂, ŷ). Then, the backbone Transformer takes this input to perform inner-sequence and cross-sequence masked language modeling, which enables the model to be optimized jointly for cross-lingual language understanding and cross-sequence generation.
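The masked input construction can be sketched as below. The 15% rate and 80/10/10 replacement split are BERT's defaults and the token ids are made up; these are assumptions for illustration, not VECO's exact configuration:

```python
import random

MASK_ID, VOCAB_SIZE, IGNORE = 4, 250_000, -100  # illustrative ids; 250K matches the XLM-R vocabulary size

def mask_tokens(tokens, rate=0.15, rng=None):
    """Return (masked tokens x̂, targets x̄); IGNORE marks positions without loss."""
    rng = rng or random.Random(0)
    masked, targets = list(tokens), [IGNORE] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = tok                           # this position is predicted
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID                    # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token
    return masked, targets
```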

IS-MLM: Inner-Sequence Masked Language Modeling

To equip the model with language understanding ability, we perform masked language modeling of a single sequence on the self-attention and FFN modules, skipping the cross-attention modules. As shown in Figure 1, the purple lines show the forward process of the IS-MLM task in each layer. Specifically, the embeddings of the masked input sequence x̂ are fed into the self-attention and FFN modules of each layer to obtain a contextual representation X^(i):

H^(i) = SelfAttention(X^(i-1); θ_s)
X^(i) = FFN(H^(i); θ_f)

The contextual representation X^(N) of the last layer is used to recover the masked tokens x̄. Thus, the training loss of inner-sequence masked language modeling can be formalized as:

L_IS-MLM(x) = -log P(x̄ | x̂; θ_s, θ_f)    (5)

CS-MLM: Cross-Sequence Masked Language Modeling

To fully train the cross-attention module, which plays a primary role in the semantic mapping between sentences in cross-lingual generation tasks (e.g., machine translation), we simulate a decoder by reusing the self-attention and FFN modules in cooperation with the cross-attention module. As shown in Figure 1, the green lines depict the forward process of the CS-MLM task in each layer. Specifically, we first extract the contextual representation of ŷ via the SelfAttention modulefoot_2; then the cross-attention module is employed to model an interactive representation of (x̂, ŷ):

S^(i) = SelfAttention(Y^(i-1); θ_s)
Z^(i) = CrossAttention(S^(i), X^(N); θ_c)    (6)
Y^(i) = FFN(Z^(i); θ_f)

Finally, Y^(N), which considers both the context of the semantically related sequence x̂ and the preceding segment ŷ_{<t}, is used to predict the masked tokens ȳ:

L_CS-MLM(x, y) = -log P(ȳ | x̂, ŷ_{<t}; θ_s, θ_c, θ_f)    (7)

Note that when optimizing the CS-MLM objective, we detach X^(N) in Eq. (6) from the computation graph (i.e., we stop gradients from back-propagating from the "virtual" decoder to the "virtual" encoder), so that the two objectives are optimized in isolation.
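A minimal numpy sketch of the CS-MLM forward pass (single layer, single head, one shared illustrative weight matrix rather than the real θ's; in an autograd framework, X^(N) would additionally be detached so that no gradient flows back into the "virtual" encoder):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:  # triangular mask: position t attends only to positions <= t
        scores = scores + np.triu(np.full(scores.shape, -1e9), k=1)
    return softmax(scores) @ v

def cs_mlm_layer(y_hat, x_enc, W):
    """S = SelfAttention(Ŷ) with a causal mask, Z = CrossAttention(S, X^(N)), Y = FFN(Z)."""
    s = attention(y_hat @ W, y_hat @ W, y_hat @ W, causal=True)
    z = attention(s @ W, x_enc @ W, x_enc @ W)  # queries from the ŷ side, keys/values from x̂
    return np.maximum(0.0, z @ W)               # ReLU stand-in for the FFN
```

The causal mask reproduces the footnote's constraint: changing later ŷ tokens leaves the outputs at earlier positions untouched.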
This also speeds up and stabilizes the training of the "virtual" decoder, since very deep encoder-decoders are typically hard to train. To conclude, the total MLM loss for a training instance (x, y), obtained by exchanging x̂ and ŷ in Eq. (5) and Eq. (7)foot_3, can be formalized as:

L_MLM(x, y) = L_IS-MLM(x) + L_IS-MLM(y) + L_CS-MLM(x, y) + L_CS-MLM(y, x)
            = -log P(x̄ | x̂) - log P(ȳ | ŷ) - log P(ȳ | x̂, ŷ_{<t}) - log P(x̄ | ŷ, x̂_{<t})

Several monolingual pre-training models, such as MASS (Song et al., 2019), BART (Lewis et al., 2019), and PALM (Bi et al., 2020), also present similar unidirectional language modeling tasks. Beyond the fact that VECO focuses on the multilingual scenario, there are major differences in both model architecture and task design: 1) In terms of model architecture, the core difference is that we share the self-attention and FFN modules between the encoder and decoder. We find that such parameter sharing can enhance the semantic mapping between different languages not only at the embedding level (e.g., a shared BPE vocabulary), but also at the module level. Moreover, it also acts as a form of regularization that stabilizes training and helps generalization (Xia et al., 2019; Lan et al., 2019). 2) In terms of task design, we differ in several ways. The proposed IS-MLM task forces the model to bidirectionally comprehend the source input (good for NLU), which mBART lacks. Meanwhile, CS-MLM predicts masked words rather than generating the next word (as adopted by MASS and mBART), thus keeping in line with IS-MLM toward a more consistent optimization direction (predicting masked words) on the shared parameters. In total, the core contribution of this work is the careful cooperation of parameter sharing and the pre-training strategy, enabling VECO to flexibly initialize any downstream framework.
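The symmetric combination of the two objectives amounts to the following trivial sketch, where the `is_mlm` and `cs_mlm` callables stand in for the losses of Eq. (5) and Eq. (7):

```python
def total_mlm_loss(x, y, is_mlm, cs_mlm):
    """L_MLM(x, y) = L_IS(x) + L_IS(y) + L_CS(x, y) + L_CS(y, x)."""
    return is_mlm(x) + is_mlm(y) + cs_mlm(x, y) + cs_mlm(y, x)
```

The point is that every segment serves both as an IS-MLM input and, in both directions, as a CS-MLM source or target, so the shared modules see all four prediction problems.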

2.3. FINE-TUNING ON DOWNSTREAM NLU AND NLG TASKS

When fine-tuning on various downstream tasks, one advantage of VECO is its flexibility in initializing both the encoder-only Transformer and the encoder-decoder Transformer.

VECO for Cross-lingual Natural Language Understanding Since the mainstream framework for NLU is an encoder-only Transformer, we keep only the self-attention and FFN modules and throw out the cross-attention module in each layer (Figure 1(b)). Note that the cross-attention modules occupy less than 20% of the total parameters, which is smaller than the discarded generator of ELECTRA (Clark et al., 2020b).

VECO for Cross-lingual Natural Language Generation Since the most popular backbone for cross-lingual generation tasks like machine translation is the standard Transformer encoder-decoder model, we reorganize the VECO modules to act in that manner. As shown in Figure 1(c), the self-attention and FFN modules constitute the Transformer encoder for contextual modeling, and the three modules work together to act as a decoder for both inner-sequence and cross-sequence contextual modeling. Due to the training difficulty and inference speed of deep networks, we can choose a subset (e.g., 6 layers) of all layers to assemble the decoder. A more in-depth analysis can be found in Table 4 and Section 5.2.
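This "variable" rebuilding can be sketched as follows; the module containers are plain dicts for illustration, not the actual checkpoint format:

```python
def build_for_nlu(layers):
    """Encoder-only: keep SelfAttention + FFN, drop CrossAttention in every layer."""
    return [{"self": l["self"], "ffn": l["ffn"]} for l in layers]

def build_for_nlg(layers, n_dec=6):
    """Encoder-decoder: full-depth encoder; the last n_dec layers (all three modules) form the decoder."""
    return build_for_nlu(layers), layers[-n_dec:]
```

Both builders draw from the same pre-trained layer list, which is exactly why the shared self-attention and FFN parameters benefit both task families.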

3. PRE-TRAINING SETUP

Model Configuration We pre-train a 24-layer model with 1024 embedding/hidden size and 4096 feed-forward size (∼ 662M parameters). We do not use language embeddings to allow our model to better deal with downstream tasks of unseen languages. We adopt the same 250K vocabulary that is also used by XLM-R (Conneau et al., 2019) and mBART (Liu et al., 2020b) .

Pre-Training Datasets

For the monolingual training corpus, we reconstruct the CommonCrawl Corpus used in XLM-R (Conneau et al., 2019). We extract 1.36TB of data in 50 languages, containing 6.5G sentences and 0.4G documents. Like XLM, we up/down-sample the monolingual text of each language with a smoothing parameter α = 0.5. For bilingual data, we collect from the OPUS websitefoot_4, following previous works (Lample & Conneau, 2019; Chi et al., 2020b). There are 6.4G parallel sentences, covering 879 language pairs across 50 languages. More details about the languages and statistics of our training corpus can be found in Appendix A.

Optimization Settings For each training iteration, we alternately sample a batch of adjacent segments from the monolingual corpus and a batch of parallel sentences from the bilingual datasets to construct a masked input pair (x̂, ŷ). We first perform IS-MLM for both x̂ and ŷ. Then, we reuse their last-layer hidden states to perform CS-MLM. When the inputs are parallel bilingual sentences, we additionally adopt the translation masked language modeling (TLM) objective proposed in XLM (Lample & Conneau, 2019). Thus, the overall training objective is the sum of three MLM objectives. During training, the model parameters except for cross-attention are initialized from XLM-R. We first freeze the XLM-R parameters and update only the cross-attention parameters for faster convergence; then we jointly train the whole model. We pre-train our model with mixed-precision training using 64 Nvidia Tesla V100 32GB GPUs. More hyperparameters can be found in Table 7.

The two zero-shot sentence retrieval tasks are BUCC 2018 (Zweigenbaum et al., 2018) and Tatoeba (Artetxe & Schwenk, 2019). Tasks in the first three XTREME categories (sentence-pair classification, structured prediction, and question answering) only provide golden training data in English and dev/test sets in the target languages; for the two sentence retrieval tasks, no training data is provided at all. We refer the reader to Hu et al. (2020) for additional details about the datasets.
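The up/down-sampling with smoothing parameter α = 0.5 can be sketched as below; the rule p_i ∝ q_i^α follows XLM, and the sentence counts are made up for illustration:

```python
def smoothed_sampling_probs(sent_counts, alpha=0.5):
    """Language i with empirical share q_i is sampled with probability ∝ q_i ** alpha."""
    total = sum(sent_counts.values())
    weights = {lang: (c / total) ** alpha for lang, c in sent_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}
```

With α < 1 the distribution is flattened: low-resource languages are sampled more often than their raw share, while the relative ordering of languages is preserved.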

4. EXPERIMENTS ON LANGUAGE UNDERSTANDING

Fine-tuning Setting Following previous works (Conneau et al., 2019; Hu et al., 2020), we consider two typical fine-tuning settings: (1) Cross-lingual Transfer, which fine-tunes the pre-trained model using only English golden data and directly performs inference on the test data of different target languages; (2) Translate-Train-All, which first machine-translates the English golden data into the remaining target languages and fine-tunes a single multilingual model on the concatenation of all the data. We use the machine-translated data released by XTREME, except for the two sequence-labeling tasks (POS, NER), since golden token labels in the target languages are unavailable. For a fair comparison with the strong baseline XLM-R (Conneau et al., 2019) under the translate-train-all setting, we also report results of XLM-R using the same fine-tuning hyperparameters as VECO.

4.2. EXPERIMENTAL RESULTS

The detailed test results of the 9 tasks on the XTREME benchmark are shown in Table 1. They demonstrate that the proposed VECO outperforms previous cross-lingual models on most of the datasets. Compared to XLM-R, it scores 5.0 and 2.8 points higher on average under the cross-lingual transfer and translate-train-all settings, respectively. It is worth noting that VECO delivers a large improvement on the zero-shot sentence retrieval tasks (BUCC, Tatoeba). This phenomenon reflects that our model has a strong cross-lingual modeling ability and can thus better mine parallel sentences in a multilingual corpus. The reasons are two-fold. The first is the introduction of more bilingual data during pre-training, which is a direct and effective way to enhance the cross-lingual ability of the model. The second is the mutual improvement between the two pre-training tasks and the superiority of the model design.

Table 2: XNLI accuracy scores for each language under the cross-lingual transfer setting. Both XLM and VECO are small-sized models trained from scratch on the same monolingual (Mono.) and bilingual (Bili.) corpus using the same hyperparameters. For this set of experiments, we only use a subset of the full training corpus and report results for the languages that appear in it.

To analyze whether the bilingual data plays the leading role in improving performance, we conduct a more controlled set of experiments. We train small-sized XLM and VECO models from scratch using a subset of the full training data and hyperparameters (see Appendix A for details). Table 2 shows the results on XNLI, the most widely used cross-lingual evaluation dataset on the XTREME leaderboard. We observe that, when using the monolingual corpus only, VECO outperforms XLM by 0.8 points. This suggests that our model benefits from the adjacent sentences used by the CS-MLM task, which equips it with a stronger ability of contextual modeling.
Moreover, when trained on both the monolingual and bilingual corpora, VECO achieves a larger improvement over XLM. This reveals that VECO makes better use of the bilingual corpus than XLM, which only optimizes translation language modeling (TLM) on it.

5.1. EXPERIMENTAL SETUP

Datasets We choose the machine translation (MT) task, a typical cross-lingual generation scenario. To illustrate the generality of our approach and allow a fair comparison with the most recent state-of-the-art Transformer work (Liu et al., 2020a), we conduct experiments on the WMT14 English-German (En-De) and English-French (En-Fr) benchmarks. We report tokenized BLEU for comparison with previous Transformer variants, and de-tokenized SacreBLEUfoot_5 to avoid the influence of different tokenization and normalization between models (Post, 2018).

Fine-tuning Setting In theory, VECO can initialize a Transformer model with a 24-layer encoder and a 24-layer decoder. However, since VECO follows the post-layer-norm design of the vanilla Transformer (Vaswani et al., 2017), like XLM-R, we find it hard to train such a deep post-layer-norm Transformer variant without careful hyperparameter search. This phenomenon is consistent with the findings of recent research on deep Transformers (Liu et al., 2020a; Bachlechner et al., 2020). Since these works also show that deeper encoders are more worthwhile than deeper decoders, the main results of VECO and XLM-R are based on a standard 6-layer decoder with the full 24-layer encoder. The batch size is 64k and 256k for En-De and En-Fr, respectively. The total number of training updates is set to 100k. The learning rate is 1e-4/2e-4, with linear warmup over the first 16k steps followed by linear decay. We run the En-De and En-Fr MT experiments on 16 and 32 V100 GPUs, respectively. We average the last 10 checkpoints and use beam search with a beam size of 5.

Baselines We consider two types of Transformer baselines: randomly initialized models and models initialized from cross-lingual pre-training. For random initialization, we take the original Transformer-big and the state-of-the-art Deep Transformer (Liu et al., 2020a) into consideration. Besides, we also reproduce a Transformer baseline that adopts the same fine-tuning hyperparameters as VECO but with random initialization. For cross-lingual encoder-decoder models, we include mBART, which shows impressive results on MT.
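The fine-tuning schedule above (linear warmup over 16k steps, then linear decay to zero at 100k total updates) can be sketched as:

```python
def lr_at(step, peak=1e-4, warmup=16_000, total=100_000):
    """Linear warmup to `peak` over `warmup` steps, then linear decay to zero at `total` updates."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```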
We also conduct the WMT experiments for XLM-R, following exactly the same fine-tuning settings as VECO. Note that the layer settings are not identical across models, due to the distinct characteristics of the pre-trained models: (1) For pre-trained encoder-decoder models, the pre-training depth determines the depth of the MT model; thus mBART uses the same configuration as in pre-training (i.e., a 12-layer encoder and a 12-layer decoder). (2) For pre-trained encoder-only models, only the depth of the MT encoder is determined, while the decoder depth can take any value; thus the XLM-R-initialized model is fixed to a 24-layer encoder during fine-tuning. To minimize the depth gap between the XLM-R- and mBART-initialized MT models, we choose the common 6-layer decoder in Table 3 and a 3-layer decoder in Table 4. For a fair comparison with XLM-R, VECO adopts the same layer settings. We also reproduce a same-sized, randomly initialized Transformer-xBig model. In conclusion, we tried our best to use the same layer settings among models wherever possible (Transformer-xBig, XLM-R, and VECO).

Table 4 also compares strategies for assembling the decoder from the 24-layer pre-trained VECO model, including the straightforward strategy of selecting the last n layers; we consider n = {3, 6}. This last-n strategy, adopted throughout this paper, exhibits better performance, possibly because the last several layers play a more important role in making predictions over the whole vocabulary. Moreover, there is a 0.2∼0.3 BLEU improvement when increasing the decoder layers from 3 to 6.foot_6 Regardless of the initialization method, the VECO-initialized model gains a consistent 1∼2 BLEU improvement over the randomly initialized model.
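The "last n layers" selection used in the paper can be contrasted with an evenly strided alternative; the strided variant here is an assumption for illustration, not necessarily the exact compared baseline:

```python
def last_n(layers, n):
    """Initialize an n-layer decoder from the last n pre-trained layers (the paper's strategy)."""
    return layers[-n:]

def every_kth(layers, n):
    """Illustrative alternative: pick n evenly spaced layers across the stack."""
    k = len(layers) // n
    return layers[k - 1 :: k][:n]
```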

5.2. EXPERIMENTAL RESULTS

Moreover, Table 4 (right) displays the SacreBLEU scores of same-sized (24-layer encoder, 6-layer decoder) models during training. We find that the VECO-initialized model surpasses 28 SacreBLEU after only 10 epochs, already better than the final score of the randomly initialized model at 35 epochs. This reveals that VECO provides a fairly good initialization for the machine translation model, which converges quickly and further boosts the results. To investigate whether the improvement on MT mainly comes from 1) the use of a parallel corpus during pre-training or 2) the superiority of the designed model and pre-training tasks, we conduct a more controlled experiment. We first train an out-of-domain Transformer-xBig model on the whole En-De parallel data (∼68M) used in VECO pre-training, and then continue training the model on the in-domain WMT14 En-De training dataset. Results are shown in Table 4 (left), marked with *. Under this fully fair comparison, VECO still maintains a lead of 1.1 BLEU. This directly confirms that the improvement on MT is not only due to the use of bilingual data; more importantly, reasonable pre-training tasks and model design ensure better use of the bilingual and large-scale unlabeled multilingual corpora.

6. RELATED WORK

Encoder-only Cross-lingual Pre-training mBERT (Devlin et al., 2019) is the first work to pre-train a Transformer encoder over multiple languages. Several extensions use the same encoder-only backbone as mBERT, with the main differences being the introduction of more training corpora and pre-training tasks. XLM (Lample & Conneau, 2019) utilizes both monolingual and bilingual corpora to perform masked language modeling. XLM-R (Conneau et al., 2019) builds on RoBERTa (Liu et al., 2019) using larger monolingual training data. Unicoder (Huang et al., 2019), ALM (Yang et al., 2020), and InfoXLM (Chi et al., 2020b) propose new pre-training tasks to better utilize the bilingual data. These works deliver impressive performance on cross-lingual understanding tasks, while only marginal improvements have been gained on cross-lingual generation tasks like machine translation, especially for high-resource languages.

Encoder-Decoder Cross-lingual Pre-training BART (Lewis et al., 2019) pre-trains a denoising auto-encoder with a standard Transformer-based encoder-decoder architecture, and mBART (Liu et al., 2020b) extends it to the multilingual setting, demonstrating significant gains in low/medium-resource machine translation but a decrease for high-resource languages. Chi et al. (2020a) first train an encoder via MLM and then freeze it to train only the decoder via two generative tasks. A similar approach is proposed in Liang et al. (2020), which extends Unicoder (Huang et al., 2019) to generation tasks, with the main difference lying in the joint training of the encoder and decoder. All these cross-lingual models emphasize training a dedicated model for NLG; as a result, they do not improve, or even hurt, the NLU capabilities of the encoder, or they require more computation and memory to match the performance of encoder-only models when using comparable training resources.
For example, when BART is fine-tuned on NLU tasks, the same input is fed into both the encoder and decoder, and the final output of the decoder is used. BART thus costs more memory than the encoder-only RoBERTa model due to the extra cross-attention modules (roughly 10%∼20% more parameters), yet still does not perform better than RoBERTa.

Unified Language Representation for NLU and NLG BART (Lewis et al., 2019) and UNILM (Dong et al., 2019) also endeavor to build a unified model for NLU and NLG tasks, with the core idea of letting the model see both unidirectional and bidirectional context. UNILM pre-trains a Transformer encoder with an ensemble of attention masks, while BART pre-trains a Transformer encoder-decoder model with arbitrary noising functions. Our work differs from them in several ways. Firstly, they only address the monolingual (English) domain, while we tackle the more challenging multilingual domain. Secondly, since multiple languages cannot be completely mapped to the same space at every layer via self-attention alone, as in UNILM, the cross-attention module is important for modeling the cross-lingual mapping; we therefore forgo pre-training a purely encoder-only model in the multilingual scenario. Thirdly, previous work (Xia et al., 2019) has shown that sharing the encoder and decoder of the Transformer can strengthen the semantic correlation between languages and regularize a high-capacity encoder-decoder for machine translation. For these reasons, VECO varies during pre-training, training specific modules via a bidirectional task (IS-MLM) and a unidirectional task (CS-MLM).

7. CONCLUSION

We present VECO, a variable cross-lingual pre-training model, targeted at initializing both the NLU-preferred encoder-only Transformer and the NLG-specialized encoder-decoder Transformer. We analyze the three core modules of the standard Transformer and propose two masked language modeling tasks to train reasonable combinations of them. The two tasks jointly optimize for inner-sequence understanding and cross-sequence generation, enabling them to boost each other via strong regularization from module sharing. As a result, VECO achieves consistent improvements on various language understanding and generation tasks compared to existing cross-lingual encoder-only and encoder-decoder approaches, opening up new ways of thinking about pre-trained backbone architectures.

A PRE-TRAINING DETAILS

For monolingual data, following XLM-R (Conneau et al., 2019), we build a clean CommonCrawl corpus using the open-source tool CCNet (Wenzek et al., 2019). Table 5 reports the language codes and statistics of the datasets. There are 1.36TB of monolingual data in 50 languages before up/down-sampling. We collect the bilingual corpus in 50 languages from the OPUS websitefoot_7, including MultiUN, UNPC, Bombay, EU-bookshop, OpenSubtitles2018, Tanzil, GlobalVoices, ParaCrawl, MultiParaCrawl, DGT, Tilde, Europarl, Wikipedia, ECB, TED2013, News-Commentary, Ubuntu, Books, UN, infopankki-v1, EUconst, and Bianet. In total, there is 1TB of bilingual training corpus before preprocessing, covering 879 language pairs. Table 6 lists the statistics for each language pair. We then apply subword tokenization directly on the raw text data using the SentencePiece model (Kudo & Richardson, 2018) without any additional preprocessing. We use the whole corpus to train VECO, and a subset (∼1/4) covering 33 languages to train XLM-Small and VECO-Small. The full set of pre-training hyperparameters for the small-sized and large-sized (default) VECO models is listed in Table 7. Comparisons between VECO and other cross-lingual models are shown in Table 8.

B NLU FINE-TUNING DETAILS

We consider two typical fine-tuning settings:

• Cross-lingual Transfer: For all tasks except the sentence retrieval tasks (BUCC and Tatoeba), we select the model with the best average result over all languages on the dev sets, searching the learning rate over [1e-5, 2e-5, 3e-5], training epochs over [3, 5, 10], and batch size over [16, 32, 64]. For the sentence retrieval tasks, which involve no fine-tuning on parallel sentences at all, we use the average of the word embeddings in a middle layer as the sentence representation.

• Translate-Train-All: We find that the larger training datasets benefit from a smaller learning rate. Therefore, for translate-train-all, we select the model on the dev sets by searching the learning rate over [3e-6, 5e-6, 1e-5]. For XNLI, PAWS-X, XQuAD, and MLQA, we fine-tune our model directly on the translation data provided by the official XTREME repofoot_8. Following other participants (FILTER and Anonymous1) on the XTREME leaderboard, we use several methods to further improve the scores. For TyDiQA, we start fine-tuning from the best XQuAD translate-train-all model. For the sentence retrieval tasks, BUCC and Tatoeba, we use the averaged representation in the middle layer of the best XNLI model. Note that we only use these tricks under the translate-train-all setting, to keep the comparison on the XTREME leaderboard as fair as possible. Even so, we still use less fine-tuning data than the concurrent works FILTER and Anonymous1: FILTER, a model-agnostic fine-tuning method, uses more translation data on the POS and NER tasks, and Anonymous1 uses labeled data outside of XTREME.

Table 18: Tatoeba results (Accuracy) for each language
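The zero-shot retrieval described above can be sketched in numpy as follows; the hidden states here are random stand-ins, and picking layer 12 of 24 as the "middle layer" is an assumption for illustration:

```python
import numpy as np

def sentence_vector(layer_states, layer=12):
    """layer_states: list over layers of (seq_len, hidden) arrays; mean-pool one middle layer."""
    return layer_states[layer].mean(axis=0)

def retrieve(src, tgt):
    """Cosine-similarity nearest neighbour: index of the best target for each source vector."""
    s = src / np.linalg.norm(src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return (s @ t.T).argmax(axis=1)
```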



Footnotes:
foot_0: Thus the word "variable" means that the backbone Transformer varies during pre-training and fine-tuning.
foot_1: https://sites.research.google/xtreme
foot_2: A triangular attention mask matrix is used so that each masked token only attends to its preceding tokens.
foot_3: Flipping the monolingual sentence order can create "harder" example pairs, thus pushing the model toward a stronger ability of language modeling and understanding.
foot_4: http://opus.nlpl.eu/
foot_5: fr}+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.4.9
foot_6: However, we observe that only marginal improvement can be gained when further increasing the decoder layers to 12 or 24 in our preliminary experiments, which is also in line with the findings in Liu et al. (2020a).
foot_7: http://opus.nlpl.eu/
foot_8: https://github.com/google-research/xtreme



4.1. EXPERIMENTAL SETUP

Downstream Tasks We conduct NLU evaluations on XTREME (Hu et al., 2020), a representative massively multilingual multi-task benchmark. It consists of various NLU tasks over 40 languages, which can be classified into four categories: (1) sentence-pair classification: XNLI (Conneau et al., 2018) and PAWS-X (Yang et al., 2019); (2) structured prediction: UD-POS (Nivre et al., 2018) and Wikiann NER (Pan et al., 2017); (3) question answering: XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2020), and TyDiQA (Clark et al., 2020a); (4) sentence retrieval: BUCC 2018 (Zweigenbaum et al., 2018) and Tatoeba (Artetxe & Schwenk, 2019).

Table 1: XTREME results on each dataset (as of Oct 02, 2020). Averaged results on the four categories can be found at the leaderboard: https://sites.research.google/xtreme. "†" and "‡" indicate results from Hu et al. (2020) and Fang et al. (2020), respectively. "*" indicates results obtained by our implementation. The detailed results for each language are in Appendix C.

Table 3: Results on WMT14 En-Fr and WMT14 En-De machine translation.



Table 4: BLEU scores (left) and learning curves (right) of different initialization methods.

Jian Yang, Shuming Ma, Dongdong Zhang, Shuangzhi Wu, Zhoujun Li, and Ming Zhou. Alternating language modeling for cross-lingual pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. URL https://aaai.org/ojs/index.php/AAAI/article/view/6480/6336.

Table 5: The statistics of the monolingual pre-training corpus.

Table 6: The statistics of the bilingual (parallel) pre-training corpus.

POS results (Accuracy) for each language.

NER results (F1) for each language.

