

Abstract

Back-translation is an effective strategy to improve the performance of Neural Machine Translation (NMT) by generating pseudo-parallel data. However, several recent works have found that better translation quality of the pseudo-parallel data does not necessarily lead to a better final translation model, while lower-quality but diverse data often yields stronger results instead. In this paper we propose a new way to generate pseudo-parallel data for back-translation that directly optimizes the final model performance. Specifically, we propose a meta-learning framework where the back-translation model learns to match the forward-translation model's gradients on the development data with those on the pseudo-parallel data. In our evaluations on both the standard WMT En-De'14 and WMT En-Fr'14 datasets, as well as in a multilingual translation setting, our method leads to significant improvements over strong baselines.

1. INTRODUCTION

While Neural Machine Translation (NMT) delivers state-of-the-art performance across many translation tasks, this performance is usually contingent on the availability of large amounts of training data (Sutskever et al., 2014; Vaswani et al., 2017). Since large parallel training datasets are often unavailable for many languages and domains, various methods have been developed to leverage abundant monolingual corpora (Gulcehre et al., 2015; Cheng et al., 2016; Sennrich et al., 2016; Xia et al., 2016; Hoang et al., 2018; Song et al., 2019; He et al., 2020). Among such methods, one particularly popular approach is back-translation (BT; Sennrich et al., 2016). In BT, in order to train a source-to-target translation model, i.e., the forward model, one first trains a target-to-source translation model, i.e., the backward model. This backward model is then employed to translate monolingual data from the target language into the source language, resulting in a pseudo-parallel corpus. This pseudo-parallel corpus is then combined with the real parallel corpus to train the final forward translation model.

While the resulting forward model from BT typically enjoys a significant boost in translation quality, we identify two inherent weaknesses of BT. First, while the backward model provides a natural way to utilize monolingual data in the target language, the backward model itself is still trained only on the parallel corpus. This means that the backward model's quality is as limited as that of a forward model trained in the vanilla setting. Hoang et al. (2018) proposed iterative BT to avoid this weakness, but this technique requires multiple rounds of retraining models in both directions, which is slow and expensive. Second, it is not well understood how the pseudo-parallel data translated by the backward model affects the forward model's performance. For example, Edunov et al. (2018) observed that pseudo-parallel data generated by sampling or by beam search with noise from the backward model trains better forward models, even though these generation methods typically result in lower BLEU scores than standard beam search. While Edunov et al. (2018) attributed this observation to the diversity of the generated pseudo-parallel data, diversity alone is clearly insufficient: some degree of quality is necessary as well.

In summary, while BT is an important technique, training a good backward model for BT is either hard or slow and expensive, and even with a good backward model, there is no single recipe for how to use it to train a good forward model.

In this paper, we propose a novel technique to alleviate both aforementioned weaknesses of BT. Unlike vanilla BT, which keeps the trained backward model fixed and merely uses it to generate pseudo-


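To make the back-translation pipeline described above, and the gradient-matching signal summarized in the abstract, more concrete, the following is a minimal PyTorch-style sketch. All names here (backward_model.translate, loss_fn, the batch objects) are hypothetical placeholders introduced for illustration, not the paper's actual implementation; in particular, the full meta-learning update that propagates the agreement signal back into the backward model is not shown.

import torch
import torch.nn.functional as F

def back_translate(backward_model, target_monolingual, strategy="sampling"):
    # Translate target-language monolingual sentences into the source language,
    # yielding a pseudo-parallel corpus of (pseudo_source, target) pairs.
    # Sampling or noisy beam search tends to produce more diverse pseudo-sources
    # than plain beam search (Edunov et al., 2018).
    return [(backward_model.translate(tgt, strategy=strategy), tgt)
            for tgt in target_monolingual]

def gradient_agreement(forward_model, loss_fn, pseudo_batch, dev_batch):
    # Cosine similarity between the forward model's gradient on a pseudo-parallel
    # batch and its gradient on a development batch. The meta-learning objective
    # described in the abstract trains the backward model to increase this agreement.
    params = [p for p in forward_model.parameters() if p.requires_grad]
    g_pseudo = torch.autograd.grad(loss_fn(forward_model, pseudo_batch), params)
    g_dev = torch.autograd.grad(loss_fn(forward_model, dev_batch), params)
    flatten = lambda grads: torch.cat([g.reshape(-1) for g in grads])
    return F.cosine_similarity(flatten(g_pseudo), flatten(g_dev), dim=0)

In vanilla BT, only back_translate is used and the resulting pairs are simply mixed with the real parallel data; the agreement score above illustrates the kind of feedback the proposed meta-learning framework uses to adapt the backward model instead of keeping it fixed.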