DEMYSTIFYING LEARNING OF UNSUPERVISED NEURAL MACHINE TRANSLATION

Abstract

Unsupervised Neural Machine Translation (UNMT) has received great attention in recent years. Though tremendous empirical improvements have been achieved, theory-oriented investigation is still lacking, and thus some fundamental questions, such as why a certain training protocol works or fails under particular circumstances, have not yet been well understood. This paper attempts to provide theoretical insights into these questions. Specifically, following the methodology of comparative study, we leverage two perspectives, i) marginal likelihood maximization and ii) mutual information from information theory, to understand the different learning effects of the standard training protocol and its variants. Our detailed analyses reveal several critical conditions for the successful training of UNMT.

1. INTRODUCTION

Unsupervised Neural Machine Translation (UNMT) has grown from its infancy (Artetxe et al., 2018; Lample et al., 2018a) to close-to-supervised performance on some translation scenarios (Lample & Conneau, 2019; Song et al., 2019). Early UNMT works (Artetxe et al., 2017; Lample et al., 2018a; Yang et al., 2018) adopt complex training strategies, including model initialization, synthetic parallel data for warming up the model, an adversarial loss for making the encoder universal, different weight-sharing mechanisms, etc. Lample et al. (2018b) then simplified all of these and established a two-component framework, involving an initialization strategy followed by iterative training on two tasks, i.e. denoising auto-encoding with the DAE loss and online back-translation with the BT loss. Later works mainly focus on developing better initialization strategies (Lample & Conneau, 2019; Ren et al., 2019; Song et al., 2019; Liu et al., 2020). Although this standard training protocol obtains impressive performance, it is unclear why it can be successful. Kim et al. (2020) and Marchisio et al. (2020) treat the standard training as a black box and empirically analyze its success or failure under different data settings (i.e. text domains and language pairs). Unfortunately, due to the lack of theoretical guidelines, some fundamental questions remain unanswered: what does standard training try to minimize under the general unsupervised training paradigm (Ghahramani, 2004), and when can a certain training protocol work for training UNMT? In this paper, we attempt to open the black box of UNMT training and understand its theoretical essence from two angles: i) a marginal likelihood maximization view; and ii) an information-theoretic view obtained by ablating the standard training protocol against other variants. Our contributions are as follows.

A. By drawing an analogy between the standard training protocol and marginal likelihood, or Evidence Lower BOund (ELBO), optimization, we visualize the learning curves of the two terms in the ELBO objective and find that optimizing the ELBO alone is not sufficient for training a successful UNMT model, indicating that the specific regularization design, i.e. the DAE loss, matters greatly.

B. Leveraging information theory, we present a formal definition of what it means to successfully train an UNMT model, and then derive a sufficient condition and a necessary condition for successfully training UNMT in principle. We validate both conditions through empirical experiments and find that they indeed explain why the standard training protocol works while others degenerate to learning sub-optimal tasks.

C. Based on the explanations for those failed protocols, we run further experiments to settle the roles played by DAE and BT. First, BT is the main task while DAE is a critical auxiliary. We then clarify that DAE plays a more important role than merely learning word order (the common belief in almost all previous works): it also preserves the mutual information between the encoder input and encoder output, which is necessary for successful training. Furthermore, DAE functions as a behavior regularizer for decoding with online BT and prevents BT from yielding degenerate data.

2. UNDERSTANDING UNMT FROM TWO PERSPECTIVES

In this section, we first introduce background on the standard training protocol proposed in Lample et al. (2018b), which is adopted by almost all later works. We then introduce the basic concepts of the two perspectives on which we rely for analyzing the learning of different training protocol variants. Due to the space limit, please refer to Appendix A.1 for a literature review of recent advances.

2.1. STANDARD TRAINING PROTOCOL

The standard training protocol involves a standard initialization strategy and a standard iterative training procedure, both built upon a specific design of encoder-decoder parameterization.

Parameterization and initialization The UNMT model adopts a shared embedding matrix for a shared vocabulary with joint BPE (Sennrich et al., 2016), and the two languages share the same encoder and decoder, with only a language embedding distinguishing inputs from different languages. As a result, unconstrained decoding might generate tokens from the same language as the input. Standard initialization means using fastText (Bojanowski et al., 2017) to initialize the embedding matrix, denoted as JointEmb. XLM (Lample & Conneau, 2019) uses a pretrained encoder to initialize both the encoder and decoder of the UNMT model. We also consider random initialization for completeness.

Iterative training strategy The iterative training strategy optimizes two critical losses by turns, i.e. the DAE loss and the BT loss, defined in Eq. 1 and Eq. 2, where s and t denote the two languages. The DAE loss is constructed by sampling a monolingual sentence x (or y), constructing its noisy version C(x) (or C(y)), and minimizing the reconstruction error (RecErr):

$$\mathcal{L}_{dae} = -\log p_{s\to s}(x \mid C(x)) - \log p_{t\to t}(y \mid C(y)), \quad (1)$$

The BT loss is constructed by sampling a monolingual sentence x (or y), constructing its corresponding translation via the current model M(x) (or M(y)) through back-translation, and minimizing the RecErr:

$$\mathcal{L}_{bt} = \mathbb{E}_{\hat{y}\sim M(x)}\left[-\log p_{t\to s}(x \mid \hat{y})\right] + \mathbb{E}_{\hat{x}\sim M(y)}\left[-\log p_{s\to t}(y \mid \hat{x})\right], \quad (2)$$

The online BT process involved in the iterative training strategy can be seen as Co-Training (Blum & Mitchell, 1998), where two models (with shared weights) built on the two views (source/target sentence) generate pseudo labels for the other view (pseudo translations) to train the corresponding dual model. We summarize the whole standard training protocol in Algorithm 1 in Appendix A.2.
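The alternation of the two losses can be sketched in a few lines of Python. This is a minimal toy sketch, not the authors' implementation: the model interface (`token_prob`, `translate`) and the noise function's parameters are illustrative assumptions, and the noise model (word dropout plus a limited local shuffle) follows the usual DAE recipe in this line of work.

```python
import math
import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3, rng=None):
    """C(x): word dropout plus a limited local shuffle (hypothetical parameters)."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > drop_prob] or list(tokens)
    # each kept token may move at most shuffle_k positions from its origin
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

def rec_err(model, src, tgt, direction):
    """RecErr: -log p_direction(tgt | src), summed over target positions."""
    return -sum(math.log(model.token_prob(direction, src, tgt, i))
                for i in range(len(tgt)))

def dae_loss(model, x, y):
    # Eq. 1: reconstruct each sentence from its own noisy version
    return (rec_err(model, add_noise(x), x, "s->s") +
            rec_err(model, add_noise(y), y, "t->t"))

def bt_loss(model, x, y):
    # Eq. 2: translate with the current model, then reconstruct the original
    y_hat = model.translate(x, "s->t")
    x_hat = model.translate(y, "t->s")
    return (rec_err(model, y_hat, x, "t->s") +
            rec_err(model, x_hat, y, "s->t"))
```

In the iterative strategy, a training step would minimize `dae_loss` and `bt_loss` by turns on fresh monolingual batches; note that in `bt_loss` the pseudo-source (`y_hat`, `x_hat`) is produced by decoding and treated as fixed, so gradients flow only through the reconstruction direction.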
Constrained decoding Beyond these basics, we further introduce the concept of constrained decoding, in which the model is constrained to decode tokens only in the target language, regardless of the shared embedding parameterization. This gives us a simple definition of cross-lingual RecErr beyond the naive RecErr in Eq. 2. Details of the algorithm and the definition are given in Appendix A.3.
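One common way to realize such a constraint, shown here only as an illustrative sketch (the paper's actual algorithm is in Appendix A.3), is to mask the logits of all vocabulary entries outside the target language before the softmax at each decoding step. The per-token language table `token_lang` is an assumption: with joint BPE many subwords are shared between languages, so deciding token-language membership is itself a design choice.

```python
import math

NEG_INF = float("-inf")

def constrain(logits, token_lang, target_lang):
    """Mask (in log space) every vocabulary entry not in the target language."""
    return [l if token_lang[i] == target_lang else NEG_INF
            for i, l in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax that assigns zero mass to masked entries."""
    m = max(l for l in logits if l != NEG_INF)
    exps = [math.exp(l - m) if l != NEG_INF else 0.0 for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With this mask in place, the decoder can never emit a source-language token, which is exactly what makes the cross-lingual RecErr well defined even under the shared-vocabulary parameterization.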

2.2. A MARGINAL MAXIMIZATION VIEW

The standard training of the UNMT model relies solely on the monolingual corpora D_s and D_t, which resembles the generative modeling setting where only unlabeled data is available (Ghahramani, 2004). Here we view standard UNMT training as implicitly maximizing the marginal likelihood of the monolingual data. Due to the duality of translation (He et al., 2016), a target sentence plays not only the role of the label but also of the input in the reverse translation direction. So, in essence, standard UNMT training can be seen as maximizing the marginal log likelihood of D_s and D_t simultaneously. However, since the marginal involves an intractable summation over one view (target/source), a lower bound is often optimized via Monte Carlo approximation (Kingma & Welling, 2014). In the following derivation of the ELBO (Kingma & Welling, 2019), q_φ(y|x) is the posterior distribution of y when y is taken as the latent variable; here we only derive the bound for x ∈ D_s. A detailed analogy between the standard UNMT objective and the ELBO objective is presented in Table 1: both objectives have the same reconstruction error term but different regularization terms; for the ELBO, the model is optimized to stay close to the language model via the KL loss.
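For x ∈ D_s with the target sentence y as the latent variable, the bound follows from Jensen's inequality in the standard way. We write $p_t(y)$ for the target-side language model acting as the prior, which matches the KL-to-language-model regularization described above; the exact notation of the paper's derivation may differ.

```latex
\log p_\theta(x)
  = \log \sum_{y} p_\theta(x, y)
  = \log \mathbb{E}_{q_\phi(y \mid x)}
      \left[ \frac{p_{t\to s}(x \mid y)\, p_t(y)}{q_\phi(y \mid x)} \right]
  \ge \underbrace{\mathbb{E}_{q_\phi(y \mid x)}
      \left[ \log p_{t\to s}(x \mid y) \right]}_{\text{reconstruction term}}
  \;-\; \underbrace{\mathrm{KL}\!\left( q_\phi(y \mid x) \,\|\, p_t(y) \right)}_{\text{regularization term}}
```

The reconstruction term mirrors the BT reconstruction error in Eq. 2 (with $q_\phi(y \mid x)$ playing the role of the translation model $M$), while the KL term is the regularizer that has no direct counterpart in the standard UNMT objective.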

