

Abstract

Back-translation is an effective strategy to improve the performance of Neural Machine Translation (NMT) by generating pseudo-parallel data. However, several recent works have found that better translation quality of the pseudo-parallel data does not necessarily lead to a better final translation model, while lower-quality but more diverse data often yields stronger results. In this paper, we propose a new way to generate pseudo-parallel data for back-translation that directly optimizes the final model's performance. Specifically, we propose a meta-learning framework in which the back-translation model learns to match the forward-translation model's gradients on the development data with those on the pseudo-parallel data. In our evaluations on both the standard WMT En-De'14 and WMT En-Fr'14 datasets, as well as in a multilingual translation setting, our method leads to significant improvements over strong baselines.

1. INTRODUCTION

While Neural Machine Translation (NMT) delivers state-of-the-art performance across many translation tasks, this performance is usually contingent on the existence of large amounts of training data (Sutskever et al., 2014; Vaswani et al., 2017). Since large parallel training datasets are often unavailable for many languages and domains, various methods have been developed to leverage abundant monolingual corpora (Gulcehre et al., 2015; Cheng et al., 2016; Sennrich et al., 2016; Xia et al., 2016; Hoang et al., 2018; Song et al., 2019; He et al., 2020). Among such methods, one particularly popular approach is back-translation (BT; Sennrich et al. (2016)).

In BT, in order to train a source-to-target translation model, i.e., the forward model, one first trains a target-to-source translation model, i.e., the backward model. This backward model is then employed to translate monolingual data from the target language into the source language, resulting in a pseudo-parallel corpus. The pseudo-parallel corpus is then combined with the real parallel corpus to train the final forward translation model.

While the resulting forward model from BT typically enjoys a significant boost in translation quality, we identify two weaknesses inherent to BT. First, while the backward model provides a natural way to utilize monolingual data in the target language, the backward model itself is still trained only on the parallel corpus. This means that the backward model's quality is as limited as that of a forward model trained in the vanilla setting. Hoang et al. (2018) proposed iterative BT to avoid this weakness, but the technique requires multiple rounds of retraining models in both directions, which is slow and expensive. Second, it is not well understood how the pseudo-parallel data translated by the backward model affects the forward model's performance. For example, Edunov et al. (2018) observed that pseudo-parallel data generated by sampling or by beam search with added noise trains better forward models, even though these generation methods typically yield lower BLEU scores than standard beam search. While Edunov et al. (2018) attributed this observation to the diversity of the generated pseudo-parallel data, diversity alone is clearly insufficient: some degree of quality is necessary as well. In summary, while BT is an important technique, training a good backward model for BT is either hard, or slow and expensive, and even with a good backward model, there is no single recipe for how to use it to train a good forward model.

In this paper, we propose a novel technique that alleviates both aforementioned weaknesses of BT. Unlike vanilla BT, which keeps the trained backward model fixed and merely uses it to generate pseudo-parallel data for training the forward model, we continue to update the backward model throughout the forward model's training. Specifically, we update the backward model to improve the forward model's performance on a held-out set of ground-truth parallel data. We provide an illustrative example of our method in Fig. 1, where we highlight how the forward model's held-out performance depends on the pseudo-parallel data sampled from the backward model. This dependency allows us to mathematically derive an end-to-end update rule for continuing to train the backward model throughout the forward model's training. As our derivation technique is similar to meta-learning (Schmidhuber, 1992; Finn et al., 2017), we name our method Meta Back-Translation (MetaBT).

In theory, MetaBT effectively resolves both aforementioned weaknesses of vanilla BT. First, the backward model continues its training based on its own generated pseudo-parallel data, and hence is no longer limited by the available parallel data.
Furthermore, MetaBT pre-trains only one backward model and then trains a single pair of forward and backward models jointly, eschewing the expense of the multiple iterations of iterative BT (Hoang et al., 2018). Second, since MetaBT updates its backward model in an end-to-end manner based on the forward model's performance on a held-out set, MetaBT no longer needs to explicitly understand the effect of its generated pseudo-parallel data on the forward model's quality.

Our empirical experiments verify the theoretical advantages of MetaBT with definitive improvements over strong BT baselines in various settings. In particular, on the classical WMT En-De 2014 benchmark, MetaBT gains +1.66 BLEU over sampling-based BT. Additionally, we discover that MetaBT allows us to extend the initial parallel training set of the backward model with parallel data from slightly different languages. Since MetaBT continues to refine the backward model, the negative effect of the language discrepancy is gradually abated throughout the forward model's training, yielding gains of up to +1.20 BLEU on low-resource translation tasks.
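The vanilla BT pipeline described above can be sketched in a few lines. In this toy illustration a "model" is just a word-level lookup table; all names (back_translate, mono_de, etc.) are illustrative stand-ins, not the paper's implementation:

```python
# Toy sketch of the vanilla BT pipeline. A "model" here is a word-level
# lookup table; every name below is illustrative, not from the paper.

def back_translate(backward_model, mono_target):
    """Map target-language monolingual sentences to (pseudo source, target) pairs."""
    pseudo_parallel = []
    for tgt in mono_target:
        # The backward model "translates" word by word; unknown words pass through.
        src = " ".join(backward_model.get(w, w) for w in tgt.split())
        pseudo_parallel.append((src, tgt))
    return pseudo_parallel

# A tiny De->En backward "model" and De monolingual data.
backward_model = {"hallo": "hello", "welt": "world"}
mono_de = ["hallo welt", "welt hallo"]

pseudo_parallel = back_translate(backward_model, mono_de)
real_parallel = [("good morning", "guten morgen")]

# Vanilla BT: the forward model trains on real + pseudo-parallel data.
training_set = real_parallel + pseudo_parallel
```

In vanilla BT the lookup table (the backward model) is frozen after this step; MetaBT's departure is to keep updating it while the forward model trains.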

2. A PROBABILISTIC PERSPECTIVE OF BACK-TRANSLATION

To facilitate the discussion of MetaBT, we introduce a probabilistic framework to interpret BT. Our framework helps analyze the advantages and disadvantages of several methods for generating pseudo-parallel data, such as sampling, beam search, and beam search with noise (Sennrich et al., 2016; Edunov et al., 2018). Analyzing these generation methods within our framework also motivates MetaBT and further allows us to mathematically derive MetaBT's update rules in § 3.

Our Probabilistic Framework. We treat a language S as a probability distribution over all possible sequences of tokens. Formally, we denote by P_S(x) the distribution of a random variable x, each instance x of which is a sequence of tokens. To translate from a source language S into a target language T, we learn the conditional distribution P_{S,T}(y|x) for sentences from the languages S and T with a parameterized probabilistic model P(y|x; θ). Ideally, we learn θ by minimizing the objective J(θ) defined in Eq. 1.

Motivating BT. In BT, since it is not feasible to draw exact samples y ∼ P_T(y) and x ∼ P_{S,T}(x|y), we rely on two approximations. First, instead of sampling y ∼ P_T(y), we collect a corpus D_T of



Figure 1: An example training step of meta back-translation to train a forward model translating English (En) into German (De). The step consists of two phases, illustrated from left to right in the figure. Phase 1: a backward model translates a De sentence taken from a monolingual corpus into a pseudo En sentence, and the forward model updates its parameters by back-propagating the canonical training loss on the pair (pseudo En, mono De). Phase 2: the updated forward model computes a cross-entropy loss on a pair of ground truth sentences (real En, real De). As annotated with the red path in the figure, this cross-entropy loss depends on the backward model, and hence can be back-propagated to update the backward model. Best viewed in color.
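The red-path dependency in Fig. 1 can be made concrete with a scalar caricature: treat the backward model as a single parameter phi that emits a continuous pseudo source x_hat = phi * y (a stand-in for sampling a sentence), take one inner SGD step on the forward parameter theta, and differentiate the resulting dev loss with respect to phi. This is a minimal sketch under those simplifying assumptions, not the paper's actual update rule:

```python
# Scalar caricature of the two-phase step in Fig. 1. The backward "model" is a
# single parameter phi mapping a mono target y to a pseudo source x_hat = phi * y;
# the forward "model" is a single parameter theta with squared-error loss.
# Everything here is illustrative, chosen so the chain rule is easy to follow.
ALPHA = 0.1  # inner learning rate for the forward model

def meta_step(theta, phi, y_mono, x_dev, y_dev):
    # Phase 1: generate a pseudo source and take one forward-model SGD step
    # on the (pseudo source, mono target) pair.
    x_hat = phi * y_mono
    theta_new = theta - ALPHA * 2 * (theta * x_hat - y_mono) * x_hat
    # Phase 2: loss of the *updated* forward model on a real dev pair.
    return (theta_new * x_dev - y_dev) ** 2

def meta_grad(theta, phi, y_mono, x_dev, y_dev):
    # Analytic d(dev loss)/d(phi), following the red path in Fig. 1:
    # dev loss -> theta_new -> x_hat -> phi.
    x_hat = phi * y_mono
    theta_new = theta - ALPHA * 2 * (theta * x_hat - y_mono) * x_hat
    d_loss_d_theta_new = 2 * (theta_new * x_dev - y_dev) * x_dev
    d_theta_new_d_xhat = -ALPHA * 2 * (2 * theta * x_hat - y_mono)
    d_xhat_d_phi = y_mono
    return d_loss_d_theta_new * d_theta_new_d_xhat * d_xhat_d_phi
```

A finite-difference check on meta_step confirms the analytic meta-gradient; stepping phi against this gradient lowers the updated forward model's dev loss, which is exactly the end-to-end signal MetaBT back-propagates into the backward model.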

J(θ) = E_{x,y ∼ P_{S,T}(x,y)}[ℓ(x, y; θ)], where ℓ(x, y; θ) = −log P(y|x; θ)    (1)

Since P_{S,T}(x, y) = P_{S,T}(y) P_{S,T}(x|y) = P_T(y) P_{S,T}(x|y), we can refactor J(θ) from Eq. 1 as:

J(θ) = E_{y ∼ P_T(y)} E_{x ∼ P_{S,T}(x|y)}[ℓ(x, y; θ)]
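The refactoring above is simply the chain rule of probability, P_{S,T}(x, y) = P_T(y) P_{S,T}(x|y): marginalize over y, then condition x on y. A tiny discrete example, with a made-up joint table and loss values, confirms that the joint expectation and the nested expectation coincide:

```python
# Check the refactoring of Eq. 1 on a tiny discrete "language": two source
# sentences {"a", "b"} and two target sentences {"u", "v"}. The joint
# probabilities and loss values below are made up purely for illustration.
joint = {  # P_{S,T}(x, y)
    ("a", "u"): 0.3, ("a", "v"): 0.2,
    ("b", "u"): 0.1, ("b", "v"): 0.4,
}
ell = {    # stand-in for the loss ell(x, y; theta) = -log P(y|x; theta)
    ("a", "u"): 1.0, ("a", "v"): 2.0,
    ("b", "u"): 3.0, ("b", "v"): 0.5,
}

# Left-hand side: expectation under the joint, E_{x,y ~ P_{S,T}}[ell].
lhs = sum(p * ell[pair] for pair, p in joint.items())

# Right-hand side: marginalize to P_T(y), then condition x on y.
p_t = {}
for (x, y), p in joint.items():
    p_t[y] = p_t.get(y, 0.0) + p
rhs = sum(
    p_t[y] * sum(joint[(x, y)] / p_t[y] * ell[(x, y)] for x in ("a", "b"))
    for y in ("u", "v")
)
```

The two sums agree term by term, which is what licenses BT's strategy of first picking target sentences y and then generating sources x conditioned on them.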



