VECO: VARIABLE ENCODER-DECODER PRE-TRAINING FOR CROSS-LINGUAL UNDERSTANDING AND GENERATION

Abstract

Recent studies on learning multilingual representations have achieved significant performance gains across a wide range of downstream cross-lingual tasks. They train either an encoder-only Transformer, mainly for understanding tasks, or an encoder-decoder Transformer, specifically for generation tasks, ignoring the correlation between the two kinds of tasks and frameworks. In contrast, this paper presents a variable encoder-decoder (VECO) pre-training approach that unifies the two mainstreams in both model architecture and pre-training tasks. VECO splits the standard Transformer block into several sub-modules trained with both inner-sequence and cross-sequence masked language modeling, and correspondingly reassembles certain sub-modules for understanding and generation tasks during inference. Such a workflow not only ensures that only the most streamlined parameters necessary for each kind of task are trained, but also lets the two kinds of tasks boost each other by sharing common sub-modules. As a result, VECO delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For generation tasks, VECO also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets, with gains of 1∼2 BLEU.

1. INTRODUCTION

Driven by the striking success of pre-trained language models (Devlin et al., 2019), cross-lingual pre-training (Lample & Conneau, 2019; Liu et al., 2020b) has recently attracted increasing attention. It provides cross-lingual contextualized representations for inputs in different languages, which significantly advances performance on both natural language understanding (NLU) and generation (NLG) tasks.

There are two mainstream architectures in the current cross-lingual pre-training literature: encoder-only and encoder-decoder. The former, like XLM (Lample & Conneau, 2019), conducts masked language modeling (MLM) with a single Transformer (Vaswani et al., 2017) encoder. This paradigm is naturally compatible with various NLU tasks, but tends to yield limited gains on cross-lingual generation tasks (e.g., machine translation) due to the lack of an effective decoder initialization. In contrast, the latter, like mBART (Liu et al., 2020b), pre-trains the full encoder-decoder Transformer via denoising auto-encoding tasks to provide a complete initialization for downstream generation tasks. However, when applied in NLU scenarios, it usually requires more computation and memory to match the performance of encoder-only models.

In light of the above pros and cons, this work presents Variable Encoder-deCOder (VECO) pre-training, which aims to provide pre-trained initialization for both the encoder-only and the encoder-decoder Transformer with the most streamlined parameters. We observe that Transformer encoder and decoder blocks share two common modules, Self-Attention and FFN (feed-forward network), the main difference being that the decoder introduces an extra Cross-Attention module (attention from the decoder over the encoder outputs). Inspired by the lottery ticket hypothesis (Frankle & Carbin, 2018), we split the standard Transformer block into three independent modules
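To make the reassembly idea concrete, the following is a minimal sketch (not the authors' code) of a "variable" block that holds the three sub-modules and selects which of them to use depending on whether it serves as an encoder layer (understanding) or a decoder layer (generation). The class and mode names are hypothetical, and the sub-modules are represented by placeholder strings rather than real attention layers:

```python
class VariableBlock:
    """Illustrative container for the three shared sub-modules of a
    variable Transformer block, reassembled per task (hypothetical API)."""

    def __init__(self, self_attn, cross_attn, ffn):
        self.self_attn = self_attn    # shared by encoder and decoder roles
        self.cross_attn = cross_attn  # used only in the decoder role
        self.ffn = ffn                # shared by encoder and decoder roles

    def sub_modules(self, mode):
        # Understanding (encoder-only inference): Self-Attention + FFN.
        if mode == "encoder":
            return [self.self_attn, self.ffn]
        # Generation (decoder side): Self-Attention + Cross-Attention + FFN.
        if mode == "decoder":
            return [self.self_attn, self.cross_attn, self.ffn]
        raise ValueError(f"unknown mode: {mode}")


block = VariableBlock("SelfAttention", "CrossAttention", "FFN")
assert block.sub_modules("encoder") == ["SelfAttention", "FFN"]
assert block.sub_modules("decoder") == ["SelfAttention", "CrossAttention", "FFN"]
```

Because the Self-Attention and FFN parameters are shared between the two assemblies, pre-training updates from understanding-style and generation-style objectives flow through the same weights, which is the mechanism by which the two kinds of tasks can boost each other.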

