CROSS-MODEL BACK-TRANSLATED DISTILLATION FOR UNSUPERVISED MACHINE TRANSLATION

Abstract

Recent unsupervised machine translation (UMT) systems usually employ three main principles: initialization, language modeling and iterative back-translation, though they may apply them differently. Crucially, iterative back-translation and denoising auto-encoding for language modeling provide data diversity to train the UMT systems. However, the gains from these diversification processes appear to have plateaued. We introduce a novel component to the standard UMT framework called Cross-model Back-translated Distillation (CBD), which aims to induce another level of data diversification that the existing principles lack. CBD is applicable to all previous UMT approaches. In our experiments, it boosts the performance of standard UMT methods by 1.5-2.0 BLEU. In particular, on WMT'14 English-French, WMT'16 German-English and English-Romanian, CBD outperforms the cross-lingual masked language model (XLM) by 2.3, 2.2 and 1.6 BLEU, respectively. It also yields 1.5-3.3 BLEU improvements on the IWSLT English-French and English-German tasks. Through extensive experimental analyses, we show that CBD is effective because it embraces data diversity while other similar variants do not.1

1. INTRODUCTION

Machine translation (MT) is a core task in natural language processing that involves both language understanding and generation. Recent neural approaches (Vaswani et al., 2017; Wu et al., 2019) have advanced the state of the art with near human-level performance (Hassan et al., 2018). However, they continue to rely heavily on large parallel corpora. As a result, the search for unsupervised alternatives that use only monolingual data has been active. While Ravi & Knight (2011) and Klementiev et al. (2012) proposed various unsupervised techniques for statistical MT (SMT), Lample et al. (2018a;c) established a general framework for modern unsupervised MT (UMT) that works for both SMT and neural MT (NMT) models. The framework has three main principles: model initialization, language modeling and iterative back-translation. Model initialization bootstraps the model with a knowledge prior such as word-level transfer (Lample et al., 2018b). Language modeling, which takes the form of denoising auto-encoding (DAE) in NMT (Lample et al., 2018c), trains the model to generate plausible sentences in a language. Meanwhile, iterative back-translation (IBT) facilitates cross-lingual translation training by generating noisy source sentences for original target sentences. Recent approaches differ in how they apply each of these three principles.

In this paper, we focus on a different aspect of the UMT framework, namely its data diversification2 process. From this perspective, the DAE and IBT steps of the UMT framework also perform a form of data diversification to train the model. Specifically, the noise model in the DAE process generates new, but noised, versions of the input data, which are used to train the model with a reconstruction objective. Likewise, the IBT step uses the same UMT model to create synthetic parallel pairs (with the source side being synthetic), which are then used to train the model.
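To make the DAE diversification concrete, the sketch below applies word dropout followed by local shuffling, in the spirit of the noise models used in neural UMT systems (Lample et al., 2018c). The function name and the hyperparameter values (`p_drop`, window `k`) are illustrative assumptions, not taken from any particular implementation:

```python
import random

def noise(tokens, p_drop=0.1, k=3, rng=None):
    """Noise a token sequence for denoising auto-encoding:
    randomly drop words, then shuffle words within a local window.
    Hyperparameter values here are illustrative."""
    rng = rng or random.Random(0)
    # Word dropout: remove each token with probability p_drop
    # (keep at least one token so the sequence is never empty).
    kept = [tok for tok in tokens if rng.random() >= p_drop] or tokens[:1]
    # Local shuffle: sort tokens by position plus uniform jitter in [0, k),
    # so each token moves at most a few positions from where it started.
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return [tok for _, tok in sorted(zip(keys, kept), key=lambda x: x[0])]
```

Training the model to reconstruct the clean sequence from such noised inputs yields the language-modeling effect described above.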
Since the NMT model is updated with DAE and IBT simultaneously, it generates fresh translations at each back-translation step. Overall, thanks to DAE and IBT, the model gets better at translating by iteratively training on newly created and diversified data whose quality also improves over time. This argument also applies to statistical UMT, except that it lacks the DAE step (Lample et al., 2018c). However, we conjecture that these diversification methods may have reached their limit, as performance does not improve further the longer we train the UMT models.

In this work, we introduce a fourth principle to the standard UMT framework: Cross-model Back-translated Distillation, or CBD (§3), with the aim of inducing another level of diversification that the existing UMT principles lack. CBD initially trains two UMT agents (models) using existing approaches. In the first level, one of the two agents translates the monolingual data from one language s to another language t. In the second level, the generated data are back-translated from t to s by the other agent. In the final step, the synthetic parallel data created by the first and second levels are used to distill a supervised MT model. CBD is applicable to any existing UMT method and is more efficient than ensembling approaches (Freitag et al., 2017) (§5.3).

In the experiments (§4), we evaluate CBD on the WMT'14 English-French, WMT'16 English-German and WMT'16 English-Romanian unsupervised translation tasks. CBD shows consistent improvements of 1.0-2.0 BLEU over the baselines on these tasks. It also significantly boosts performance on the IWSLT'14 English-German and IWSLT'13 English-French tasks. In our analysis, we explain with experiments why other similar variants (§5.1) and other alternatives from the literature (§5.4) do not work well, and why cross-model back-translation is crucial for our method.
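The two levels of CBD can be sketched schematically as follows. Here `agent1` and `agent2` stand for the two independently trained UMT models, `translate` is a hypothetical interface, and pairing the level-2 output with the level-1 output is one plausible reading of "synthetic parallel data created by the first and second levels"; this is a sketch of the procedure, not the paper's actual code:

```python
def cbd_synthesize(agent1, agent2, mono_s, direction=("s", "t")):
    """Create synthetic parallel data via cross-model back-translation.

    agent1, agent2: bidirectional UMT models exposing
                    .translate(sentences, src=..., tgt=...)
    mono_s:         monolingual sentences in language s
    Returns (source, target) pairs used to distill a supervised model.
    """
    s, t = direction
    # Level 1: the first agent translates monolingual s-data into language t.
    synthetic_t = agent1.translate(mono_s, src=s, tgt=t)
    # Level 2: the second agent back-translates the synthetic t-data into s.
    synthetic_s = agent2.translate(synthetic_t, src=t, tgt=s)
    # Distillation pairs: level-2 sources paired with level-1 targets.
    return list(zip(synthetic_s, synthetic_t))
```

A supervised model would then be trained on the returned pairs (in both directions, by repeating the procedure with `direction=("t", "s")` on target-side monolingual data).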
We further demonstrate that CBD enhances the baselines by achieving greater diversity as measured by back-translation BLEU ( §5.2).

2. BACKGROUND

Ravi & Knight (2011) were among the first to propose a UMT system, framing the problem as a decipherment task that treats non-English text as a cipher for English. Nonetheless, the method is limited and may not be applicable to the current well-established NMT systems (Luong et al., 2015; Vaswani et al., 2017; Wu et al., 2019). Lample et al. (2018a) set the foundation for modern UMT. They propose to maintain two encoder-decoder networks simultaneously, one for the source and one for the target language, and to train them via denoising auto-encoding, iterative back-translation and adversarial training. In their follow-up work, Lample et al. (2018c) formulate a common UMT framework for both PBSMT and NMT with three basic principles that can be customized:

• Initialization: A non-random cross- or multi-lingual initialization that provides a knowledge prior to bootstrap the UMT model. For instance, Lample et al. (2018a) and Artetxe et al. (2019) use the unsupervised word-translation model MUSE (Lample et al., 2018b) as initialization to promote word-to-word cross-lingual transfer, Lample et al. (2018c) use a shared, jointly trained sub-word vocabulary (Sennrich et al., 2016b), and Conneau & Lample (2019) use a pretrained cross-lingual masked language model (XLM) to initialize the unsupervised NMT model.

• Language modeling: Training a language model on monolingual data helps the UMT model generate fluent text. The neural UMT approaches (Lample et al., 2018a;c; Conneau & Lample, 2019) use denoising auto-encoder training to achieve language modeling effects in the neural model, while the PBSMT variant of Lample et al. (2018c) uses smoothed n-gram language models built with KenLM (Heafield, 2011).

• Iterative back-translation: Back-translation (Sennrich et al., 2016a) bridges the source and target languages by using a backward model that translates data from target to source. The (source and target) monolingual data are translated back and forth iteratively to progressively improve the UMT model in both directions.

During training, the initialization step is conducted once, while the denoising and back-translation steps are typically executed in an alternating manner.3 It is worth noting that, depending on the implementation, the parameters of the backward and forward components may be separate (Lample et al., 2018a) or shared (Lample et al., 2018c; Conneau & Lample, 2019). A parameter-shared cross-lingual NMT model can translate in either direction with a single network, while a UMT system with parameter-separate models has to maintain two models. Either way, we consider a standard UMT system to be bidirectional, i.e., capable of translating from either the source or the target language.

Footnotes:
1 Anonymized code: https://tinyurl.com/y2ru8res
2 By diversification, we mean sentence-level variations (not expanding to other topics or genres).
3 The KenLM language model in PBSMT (Lample et al., 2018c) was kept fixed during the training process.
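The alternating denoising and back-translation training of section 2 can be summarized in a schematic training step. Here `model`, `mono`, and `noise` are hypothetical stand-ins for a bidirectional UMT model, per-language batch iterators, and a DAE noise function; this is a sketch of the general framework, not any specific system's code:

```python
def train_step(model, mono, noise):
    """One alternating UMT update over languages "s" and "t".

    model: bidirectional UMT model exposing .translate() and .update()
    mono:  dict mapping each language to an iterator of monolingual batches
    noise: noise function used for denoising auto-encoding
    """
    # Language modeling via DAE: reconstruct each batch from a noised copy.
    for lang in ("s", "t"):
        batch = next(mono[lang])
        model.update(src_batch=noise(batch), tgt_batch=batch, src=lang, tgt=lang)
    # Iterative back-translation: the backward direction generates synthetic
    # sources on the fly, which supervise the forward direction src -> tgt.
    for src, tgt in (("s", "t"), ("t", "s")):
        batch_tgt = next(mono[tgt])
        synthetic_src = model.translate(batch_tgt, src=tgt, tgt=src)
        model.update(src_batch=synthetic_src, tgt_batch=batch_tgt,
                     src=src, tgt=tgt)
```

The initialization principle is applied once before this loop begins; repeating `train_step` then improves both translation directions jointly, since the backward model producing the synthetic sources is itself being updated.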

