DEMYSTIFYING LEARNING OF UNSUPERVISED NEURAL MACHINE TRANSLATION

Abstract

Unsupervised Neural Machine Translation (UNMT) has received great attention in recent years. Although tremendous empirical improvements have been achieved, theory-oriented investigation is still lacking, and thus some fundamental questions, such as why a certain training protocol works or fails under particular circumstances, are not yet well understood. This paper attempts to provide theoretical insights into these questions. Specifically, following a comparative-study methodology, we leverage two perspectives, i) marginal likelihood maximization and ii) mutual information from information theory, to understand the different learning effects of the standard training protocol and its variants. Our detailed analyses reveal several critical conditions for the successful training of UNMT.

1. INTRODUCTION

Unsupervised Neural Machine Translation (UNMT) has grown from its infancy (Artetxe et al., 2018; Lample et al., 2018a) to close-to-supervised performance in some translation scenarios (Lample & Conneau, 2019; Song et al., 2019). Early UNMT works (Artetxe et al., 2017; Lample et al., 2018a; Yang et al., 2018) adopt complex training strategies, including model initialization, synthetic parallel data for warming up the model, an adversarial loss for making the encoder universal, different weight-sharing mechanisms, etc. Lample et al. (2018b) then simplified all of these into a two-component framework: an initialization strategy followed by iterative training on two tasks, i.e., denoising auto-encoding with the DAE loss and online back-translation with the BT loss. Subsequent works mainly focus on developing better initialization strategies (Lample & Conneau, 2019; Ren et al., 2019; Song et al., 2019; Liu et al., 2020). Despite this impressive performance, it is unclear why the standard training protocol can succeed.

Kim et al. (2020) and Marchisio et al. (2020) treat standard training as a black box and empirically analyze its success or failure under different data settings (i.e., text domains and language pairs). Unfortunately, due to the lack of theoretical guidelines, some fundamental questions remain open: what does standard training try to minimize under the general unsupervised training paradigm (Ghahramani, 2004), and when can a certain training protocol work for UNMT? In this paper, we attempt to open the black-box training of UNMT and understand its theoretical essence from two angles: i) a marginal likelihood maximization view; and ii) an information-theoretic view, by ablating the standard training protocol against other variants. Our contributions are as follows.

A. By drawing an analogy between the standard training protocol and marginal likelihood, i.e., Evidence Lower BOund (ELBO), optimization, we visualize the learning curves of the two terms in the ELBO objective and find that optimizing the ELBO is not sufficient for training a successful UNMT model, indicating that the specific regularization design, i.e., the DAE loss, matters considerably.

B. By leveraging information theory, we give a formal definition of what it means to successfully train a UNMT model, and then derive a sufficient condition and a necessary condition for successful UNMT training in principle. We validate both conditions through empirical experiments and find that they indeed explain why the standard training protocol works while other variants degenerate into learning sub-optimal tasks.

C. Building on the explanations of those failed protocols, we conduct further experiments to settle the roles played by DAE and BT. First, BT is the main task while DAE is a critical auxiliary. We then show that DAE does more than teach word order, which is accepted as common knowledge in almost all previous works: it also preserves the mutual information between the encoder input and output.
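To make the two-component framework concrete, the following is a minimal, illustrative sketch of one iteration of DAE plus online BT training. All names here (`add_noise`, `ToyModel`, `unmt_step`) are ours, not from any UNMT codebase, and the model and loss are toy stand-ins for a real encoder-decoder and its negative log-likelihood.

```python
import random

def add_noise(sentence, drop_prob=0.1, shuffle_window=3, rng=None):
    """C(x) for the DAE loss: random word drop plus a local shuffle."""
    rng = rng or random.Random(0)
    kept = [w for w in sentence if rng.random() > drop_prob]
    # local shuffle: perturb each position by at most `shuffle_window`,
    # so tokens move only a bounded distance from their original slot
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept), key=lambda t: t[0])]

class ToyModel:
    """Stand-in exposing the two interfaces the protocol needs; a real
    system would use a shared encoder-decoder with language embeddings."""
    def translate(self, sent, to_lang):
        # placeholder "translation": word-for-word copy of the sentence
        return list(sent)
    def nll(self, source, target, lang):
        # toy surrogate for -log p(target | source): the fraction of
        # target tokens absent from the (noised or synthetic) source
        missing = sum(1 for w in target if w not in source)
        return missing / max(len(target), 1)

def unmt_step(model, x_src, y_tgt, rng):
    """One iteration: L = L_DAE(src) + L_DAE(tgt) + L_BT(src) + L_BT(tgt)."""
    # DAE: reconstruct each monolingual sentence from its noised version
    l_dae = (model.nll(add_noise(x_src, rng=rng), x_src, lang="src")
             + model.nll(add_noise(y_tgt, rng=rng), y_tgt, lang="tgt"))
    # online BT: translate with the current model (a frozen forward pass),
    # then train the reverse direction on the synthetic pair
    y_hat = model.translate(x_src, to_lang="tgt")
    x_hat = model.translate(y_tgt, to_lang="src")
    l_bt = (model.nll(y_hat, x_src, lang="src")
            + model.nll(x_hat, y_tgt, lang="tgt"))
    return l_dae + l_bt
```

In a real system both losses would be backpropagated through a shared network each step; the sketch only shows how the four loss terms are assembled from monolingual data alone.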

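The analogy in contribution A can be made concrete with the standard variational bound; the notation below is ours, and the paper's exact decomposition may differ. Treating the reverse translator $q_\phi(y \mid x)$ as an inference network and the forward translator $p_\theta(x \mid y)$ as the generative model, the marginal likelihood of a monolingual sentence $x$ is bounded as

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(y \mid x)}\big[\log p_\theta(x \mid y)\big] \;-\; \mathrm{KL}\big(q_\phi(y \mid x)\,\|\,p(y)\big),
$$

where the first term mirrors the BT reconstruction loss and the KL term pulls generated translations toward a target-language prior $p(y)$; the two learning curves referred to in A are those of these two terms.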

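The information-theoretic quantities invoked in contributions B and C are the standard ones (again in our notation; the paper's formal success criterion is stated later). For encoder input $X$ and encoder output $Z$,

$$
I(X;Z) \;=\; H(X) - H(X \mid Z) \;=\; \mathbb{E}_{p(x,z)}\!\left[\log \frac{p(x,z)}{p(x)\,p(z)}\right],
$$

so the claim in C amounts to saying that DAE keeps $I(X;Z)$ high, i.e., it prevents the encoder from discarding information about its input, beyond merely teaching word order.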