CROSS-MODEL BACK-TRANSLATED DISTILLATION FOR UNSUPERVISED MACHINE TRANSLATION

Abstract

Recent unsupervised machine translation (UMT) systems usually employ three main principles: initialization, language modeling and iterative back-translation, though they may apply them differently. Crucially, iterative back-translation and denoising auto-encoding for language modeling provide data diversity to train the UMT systems. However, the gains from these diversification processes have seemed to plateau. We introduce a novel component to the standard UMT framework called Cross-model Back-translated Distillation (CBD), which aims to induce another level of data diversification that the existing principles lack. CBD is applicable to all previous UMT approaches. In our experiments, it boosts the performance of the standard UMT methods by 1.5-2.0 BLEU. In particular, on the WMT'14 English-French, WMT'16 German-English and English-Romanian tasks, CBD outperforms the cross-lingual masked language model (XLM) by 2.3, 2.2 and 1.6 BLEU, respectively. It also yields 1.5-3.3 BLEU improvements on the IWSLT English-French and English-German tasks. Through extensive experimental analyses, we show that CBD is effective because it embraces data diversity while other similar variants do not.¹

1. INTRODUCTION

Machine translation (MT) is a core task in natural language processing that involves both language understanding and generation. Recent neural approaches (Vaswani et al., 2017; Wu et al., 2019) have advanced the state of the art with near human-level performance (Hassan et al., 2018). However, they continue to rely heavily on large parallel corpora. As a result, the search for unsupervised alternatives that use only monolingual data has been active. While Ravi & Knight (2011) and Klementiev et al. (2012) proposed various unsupervised techniques for statistical MT (SMT), Lample et al. (2018a;c) established a general framework for modern unsupervised MT (UMT) that works for both SMT and neural MT (NMT) models. The framework rests on three main principles: model initialization, language modeling and iterative back-translation. Model initialization bootstraps the model with a knowledge prior such as word-level transfer (Lample et al., 2018b). Language modeling, which takes the form of denoising auto-encoding (DAE) in NMT (Lample et al., 2018c), trains the model to generate plausible sentences in a language. Meanwhile, iterative back-translation (IBT) facilitates cross-lingual translation training by generating noisy source sentences for original target sentences. Recent approaches differ in how they apply each of these three principles. For instance, Lample et al. (2018a) use an unsupervised word-translation model (Lample et al., 2018b) for model initialization, while Conneau & Lample (2019) use a pretrained cross-lingual masked language model (XLM).

In this paper, we focus on a different aspect of the UMT framework, namely its data diversification² process. Viewed from this angle, the DAE and IBT steps of the UMT framework also perform a form of data diversification to train the model. Specifically, the noise model in the DAE process generates new, but noised, versions of the input data, which are used to train the model with a reconstruction objective. Likewise, the IBT step uses the same UMT model to create synthetic parallel pairs (with the source side being synthetic), which are then used to train the model. Since the NMT model is updated with DAE and IBT simultaneously, it generates fresh translations at each back-translation step. Overall, thanks to DAE and IBT, the model gets better at translating

¹ Anonymized code: https://tinyurl.com/y2ru8res.
² By diversification, we mean sentence-level variations (not expanding to other topics or genres).
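To make the DAE noise model concrete, the word-level corruptions typically used in this framework are word dropout and a slight local shuffle of the remaining tokens. The sketch below is illustrative only; the function name and default parameters are our own choices, not the exact implementation used in prior UMT systems.

```python
import random

def add_noise(tokens, drop_prob=0.1, k=3, seed=None):
    """Corrupt a token sequence for denoising auto-encoding (sketch).

    Two standard word-level corruptions:
      1. word dropout: remove each token with probability drop_prob;
      2. local shuffle: each surviving token may move by at most k
         positions (sort by original index plus uniform noise in [0, k]).
    """
    rng = random.Random(seed)
    # Word dropout, but never drop every token.
    kept = [t for t in tokens if rng.random() >= drop_prob] or tokens[:1]
    # Local shuffle via noisy sort keys.
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]
```

The model is then trained to reconstruct the original sequence from `add_noise(tokens)`, which is what provides the sentence-level data diversity discussed above.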

