VECO: VARIABLE ENCODER-DECODER PRE-TRAINING FOR CROSS-LINGUAL UNDERSTANDING AND GENERATION

Abstract

Recent studies on learning multilingual representations have achieved significant performance gains across a wide range of downstream cross-lingual tasks. They train either an encoder-only Transformer mainly for understanding tasks, or an encoder-decoder Transformer specifically for generation tasks, ignoring the correlation between the two kinds of tasks and frameworks. In contrast, this paper presents a variable encoder-decoder (VECO) pre-training approach to unify the two mainstreams in both model architectures and pre-training tasks. VECO splits the standard Transformer block into several sub-modules trained with both inner-sequence and cross-sequence masked language modeling, and correspondingly reorganizes certain sub-modules for understanding and generation tasks during inference. Such a workflow not only ensures that only the most streamlined parameters necessary for the two kinds of tasks are trained, but also enables them to boost each other via sharing common sub-modules. As a result, VECO delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark covering text classification, sequence labeling, question answering, and sentence retrieval. For generation tasks, VECO also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1∼2 BLEU.

1. INTRODUCTION

Driven by the striking success of pre-trained language models (Devlin et al., 2019), recent cross-lingual pre-training (Lample & Conneau, 2019; Liu et al., 2020b) has attracted increasing attention. It provides cross-lingual contextualized representations for inputs in different languages, which significantly advances performance on both natural language understanding (NLU) and generation (NLG) tasks.

There are two mainstream architectures in the current cross-lingual pre-training literature: encoder-only and encoder-decoder. The former, like XLM (Lample & Conneau, 2019), conducts masked language modeling (MLM) with a single Transformer (Vaswani et al., 2017) encoder. This paradigm is naturally compatible with various NLU tasks, but tends to yield limited gains on cross-lingual generation tasks (e.g., machine translation) due to the lack of effective decoder initialization. In contrast, the latter, like mBART (Liu et al., 2020b), pre-trains the encoder-decoder Transformer via denoising auto-encoding tasks to provide complete initialization for downstream generation tasks. However, when applied in NLU scenarios, it usually requires more computation and memory to match the performance of encoder-only models.

In light of the above pros and cons, this work presents Variable Encoder-deCOder (VECO) pre-training, which aims to provide pre-trained initialization for both the encoder-only and the encoder-decoder Transformer with the most streamlined parameters. We observe that Transformer encoder and decoder blocks share two common modules, SelfAttention and FFN (feed-forward network), the main difference being that the decoder introduces an extra CrossAttention module (attention from the decoder over the encoder outputs).
Inspired by the lottery ticket hypothesis (Frankle & Carbin, 2018), we split the standard Transformer block into three independent modules {SelfAttention, CrossAttention, FFN} to be collaboratively trained via two specific MLM tasks. After that, we rebuild the desired complete architecture, applicable to NLU or NLG, from different combinations of these modules during fine-tuning.foot_0 Specifically, to equip the model with the ability of language understanding during pre-training, SelfAttention and FFN are assembled into a standard Transformer encoder for conducting inner-sequence masked language modeling (IS-MLM). In terms of generation, SelfAttention, CrossAttention, and FFN act together as the decoder in the standard sequence-to-sequence model, and are trained by the elaborately designed cross-sequence masked language modeling (CS-MLM) task. When applied to downstream fine-tuning, the SelfAttention and FFN modules constitute the Transformer encoder for contextual modeling in NLU or NLG, or cooperate with the additional CrossAttention module to provide effective initialization of the Transformer decoder. With this workflow, VECO can be applied to both NLU and NLG tasks with the most streamlined parameters, which significantly reduces computational overhead and memory costs.
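As a minimal structural sketch of this recombination (the class and function names here are illustrative placeholders, not from the released code; a real implementation would wrap actual Transformer sub-layers):

```python
# Illustrative sketch of VECO's variable encoder-decoder assembly.
# The three module classes are placeholders for the shared pre-trained
# sub-modules of a Transformer block.

class SelfAttention: ...
class CrossAttention: ...
class FFN: ...

def build_layer(mode):
    """Assemble one Transformer layer from the shared modules.

    mode="encoder": SelfAttention + FFN (used directly for NLU
                    fine-tuning, and as the encoder of the NLG model).
    mode="decoder": SelfAttention + CrossAttention + FFN (used to
                    initialize the decoder for NLG fine-tuning).
    """
    layer = [SelfAttention()]
    if mode == "decoder":
        # CrossAttention attends from decoder positions to encoder outputs.
        layer.append(CrossAttention())
    layer.append(FFN())
    return layer

def build_model(mode, num_layers=24):
    """Stack num_layers identical layers for the chosen architecture."""
    return [build_layer(mode) for _ in range(num_layers)]
```

The point of the sketch is that SelfAttention and FFN instances are shared between the two assemblies, so only the decoder-specific CrossAttention is dropped when moving to an encoder-only NLU model.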
Moreover, while IS-MLM is specifically designed for understanding individual sequences, both understanding and generation tasks can benefit from CS-MLM. Through such parameter sharing, VECO enables the SelfAttention and FFN modules to be jointly trained by the two MLM tasks, which boosts both NLU and NLG performance.

We validate VECO on a variety of representative cross-lingual NLU and NLG benchmarks. For cross-lingual understanding tasks, we conduct experiments on the XTREME benchmark, which consists of 9 cross-lingual tasks covering text classification, sequence labeling, question answering, and sentence retrieval. VECO ranks first on the XTREME leaderboardfoot_1 as of the submission deadline and obtains new state-of-the-art results on most of the tasks. For cross-lingual generation tasks, we validate VECO on the widely used WMT14 English-German and English-French machine translation benchmarks. VECO obtains 44.4 and 31.5 BLEU scores, consistently outperforming existing cross-lingual pre-training approaches and state-of-the-art Transformer variants by around 1∼2 BLEU.

2.1. BACKBONE NETWORK

The backbone network of VECO is composed of a stack of N identical layers. Each layer has three modules, consisting of a required self-attention module, an optional cross-attention module, and a required feed-forward linear module. Both self-attention and cross-attention modules are based on



foot_0: Thus the word variable means that the backbone Transformer varies during pre-training and fine-tuning.
foot_1: https://sites.research.google/xtreme



Figure 1: The overview of VECO. During pre-training, we feed two masked segments x̂ and ŷ into different modules to perform inner-sequence masked language modeling (IS-MLM) and cross-sequence masked language modeling (CS-MLM). More specifically, the masked segment x̂ can only attend to its own context via self-attention to recover the original tokens x (IS-MLM), while the masked segment ŷ can attend to its preceding tokens via self-attention and the context x via cross-attention to predict the original tokens y (CS-MLM). For downstream NLU tasks, we discard the cross-attention module and fine-tune only the self-attention and FFN modules, acting as an encoder. For NLG tasks, we keep all modules to initialize the corresponding encoder and decoder.
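The two pre-training tasks in the figure differ only in which positions a masked token is allowed to attend to. A hedged sketch of the allowed-attention patterns as boolean masks (index conventions and function names are illustrative):

```python
def is_mlm_mask(n):
    """IS-MLM self-attention: every position in segment x may attend
    bidirectionally to all positions of the same segment."""
    return [[True] * n for _ in range(n)]

def cs_mlm_self_mask(n):
    """CS-MLM self-attention: position i in segment y attends only to
    positions j <= i (a causal mask, as in a decoder)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cs_mlm_cross_mask(n_y, n_x):
    """CS-MLM cross-attention: every position of segment y attends to
    the whole context segment x."""
    return [[True] * n_x for _ in range(n_y)]
```

Entry `[i][j]` being `True` means query position `i` may attend to key position `j`; in practice such masks would be applied additively (as -inf on disallowed logits) inside the attention modules.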

