SELF-SUPERVISED AND SUPERVISED JOINT TRAINING FOR RESOURCE-RICH MACHINE TRANSLATION

Abstract

Self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve notable gains on resource-rich NMT. In this paper, we propose a joint training approach, F²-XEnDec, which combines self-supervised and supervised learning to optimize NMT models. To exploit complementary self-supervised signals for supervised learning, NMT models are trained on examples that are interbred from monolingual and parallel sentences through a new process called crossover encoder-decoder. Experiments on two resource-rich translation benchmarks, WMT'14 English-German and WMT'14 English-French, demonstrate that our approach achieves substantial improvements over several strong baseline methods and obtains a new state of the art of 46 BLEU on English-French when incorporating back translation. Results also show that our approach improves model robustness to input perturbations, particularly code-switching noise, which commonly appears on social media.

1. INTRODUCTION

Self-supervised pre-training of text representations (Peters et al., 2018; Radford et al., 2018) has achieved tremendous success in natural language processing applications. Inspired by BERT (Devlin et al., 2019), recent works attempt to leverage sequence-to-sequence model pre-training for Neural Machine Translation (NMT). Generally, these methods comprise two stages: pre-training and finetuning. During the pre-training stage, a proxy task, e.g., the Cloze task (Devlin et al., 2019), is used to learn model parameters on abundant unlabeled monolingual data. In the second stage, the full or partial model is finetuned on a downstream translation task with labeled parallel sentences. When the amount of labeled data is limited, studies have demonstrated the benefit of pre-training for low-resource translation tasks (Lewis et al., 2019; Song et al., 2019).

In many NMT applications, however, we are confronted with resource-rich languages characterized by millions of labeled parallel sentences. For these resource-rich tasks, pre-training representations rarely endows the NMT model with superior quality and, even worse, can sometimes undermine the model's performance if improperly utilized (Zhu et al., 2020). This is partly due to catastrophic forgetting (French, 1999), where prolonged finetuning on large corpora causes new learning to overwrite the knowledge acquired during pre-training. Several mitigation methods have been proposed for resource-rich machine translation (Edunov et al., 2019; Yang et al., 2019; Zhu et al., 2020), such as freezing the pre-trained representations during the finetuning stage.

In this paper, we study resource-rich machine translation from a different perspective: joint training. In contrast to the conventional two-stage approaches, we train NMT models in a single stage using a self-supervised objective (on unlabeled monolingual sentences) in addition to the supervised objective (on labeled parallel sentences).
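The single-stage setup amounts to minimizing the sum of the two objectives in every update. The following is a minimal sketch with stand-in scalar "losses"; the function names, the quadratic stand-ins, and the weight `lam` are illustrative assumptions, not the paper's actual model:

```python
# Minimal sketch of single-stage joint training: every update minimizes
# L_S + lam * L_self, instead of pre-training on the self-supervised
# task first and finetuning on the supervised task afterwards.

def supervised_loss(theta, parallel_batch):
    # stand-in for L_S(theta) on labeled (x, y) pairs
    return sum((theta - y) ** 2 for _, y in parallel_batch) / len(parallel_batch)

def self_supervised_loss(theta, mono_batch):
    # stand-in for the self-supervised term on unlabeled sentences x
    return sum((theta - x) ** 2 for x in mono_batch) / len(mono_batch)

def joint_loss(theta, parallel_batch, mono_batch, lam=1.0):
    return supervised_loss(theta, parallel_batch) + lam * self_supervised_loss(theta, mono_batch)

def joint_step(theta, parallel_batch, mono_batch, lr=0.1, lam=1.0, eps=1e-6):
    # one gradient step on the joint objective (finite-difference gradient)
    g = (joint_loss(theta + eps, parallel_batch, mono_batch, lam)
         - joint_loss(theta - eps, parallel_batch, mono_batch, lam)) / (2 * eps)
    return theta - lr * g
```

In a real NMT system both terms would be cross-entropy losses computed by the same Transformer on different batches; the sketch only shows the structure of the single-stage update.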
The biggest challenge for this single-stage training paradigm is that self-supervised learning is less useful in joint training: it provides a much weaker learning signal that is easily dominated by the signal obtained through supervised learning. As a result, plausible approaches such as simply combining the self-supervised and supervised learning objectives perform little better than the supervised objective alone. To this end, we introduce an approach that exploits complementary self-supervised learning signals to facilitate supervised learning in a joint training framework. Inspired by chromosomal crossovers (Rieger et al., 2012), we propose a new task called crossover encoder-decoder (XEnDec), which takes two training examples as inputs (called parents), shuffles their source sentences, and produces a "virtual" sentence (called the offspring) through a mixture decoder model. The key to our approach is to "interbreed" monolingual (unlabeled) and parallel (labeled) sentences through a second filial generation of the crossover encoder-decoder, which we call F²-XEnDec, and to train NMT models on the F² offspring. As the F² offspring exhibits combinations of traits that differ from those found in either parent, it yields a meaningful objective for learning NMT models from both labeled and unlabeled sentences in a joint training framework.

To the best of our knowledge, the proposed F²-XEnDec is among the first joint training approaches to substantially improve resource-rich machine translation. Closest to our work are the two-stage approaches of Zhu et al. (2020) and Yang et al. (2019), who designed special finetuning objectives. Compared to their approaches, our focus lies on a different challenge: making self-supervised learning complementary to joint training of supervised NMT models on large labeled parallel corpora. Our experimental results substantiate the competitiveness of the proposed joint training approach.
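As a toy illustration of the crossover idea only (the full XEnDec additionally shuffles source positions and combines the parents' targets through a mixture decoder), the sketch below mixes the source sentences of two parents position-by-position with a random binary mask; the function name and masking scheme are assumptions made for illustration:

```python
import random

def crossover_source(parent_a, parent_b, seed=0):
    """Mix two source sentences with a random binary mask.

    Positions where the mask is 1 take the token from parent_a,
    otherwise from parent_b; sequences are truncated to the shorter
    length. The result is a "virtual" offspring source sentence.
    """
    rng = random.Random(seed)
    length = min(len(parent_a), len(parent_b))
    mask = [rng.randint(0, 1) for _ in range(length)]
    offspring = [a if m == 1 else b
                 for a, b, m in zip(parent_a, parent_b, mask)]
    return offspring, mask

# Interbreeding a parallel source with a monolingual sentence:
src_parallel = ["the", "cat", "sat", "down"]
src_mono = ["a", "dog", "ran", "fast"]
offspring, mask = crossover_source(src_parallel, src_mono)
```

Training on such offspring exposes the model to mixed-origin inputs, which is also consistent with the robustness to code-switching noise reported later.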
Furthermore, our results suggest that the approach improves the robustness of NMT models (Belinkov & Bisk, 2018; Cheng et al., 2019). Contemporary NMT systems often lack robustness and suffer dramatic performance drops when exposed to input perturbations, even when these perturbations are not strong enough to alter the meaning of the input sentence. Our improvement in robustness is interesting, as none of the two-stage training approaches has reported this behavior. We empirically validate our approach on the WMT'14 English-German and WMT'14 English-French translation benchmarks, yielding improvements of 2.13 and 1.78 BLEU points, respectively, over a vanilla Transformer baseline. We also achieve a new state of the art of 46 BLEU on the WMT'14 English-French translation task when further incorporating the back-translation technique into our approach. In summary, our contributions are as follows:

1. We propose a crossover encoder-decoder (XEnDec) that generates "virtual" examples from pairs of training examples. We discuss its relation to the standard self-supervised learning objective, which can be recovered by XEnDec.

2. We combine self-supervised and supervised losses in a joint training framework using our proposed F²-XEnDec and show that self-supervised learning is complementary to supervised learning for resource-rich NMT.

3. Our approach achieves significant improvements on resource-rich translation tasks and exhibits higher robustness against input perturbations, particularly code-switching noise.

2.1. NEURAL MACHINE TRANSLATION

Under the encoder-decoder paradigm (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), the conditional probability P(y|x; θ) of a target-language sentence y = y_1, ..., y_J given a source-language sentence x = x_1, ..., x_I is modeled as follows. The encoder maps the source sentence x onto a sequence of I word embeddings e(x) = e(x_1), ..., e(x_I). The word embeddings are then encoded into their corresponding continuous hidden representations. The decoder acts as a conditional language model that reads the embeddings of a shifted copy of y along with the aggregated contextual representations c. For clarity, we denote the input and output of the decoder as z and y, i.e., z = ⟨s⟩, y_1, ..., y_{J-1}, where ⟨s⟩ is a start symbol. Conditioned on the aggregated contextual representation c_j and its partial target input z_{≤j}, the decoder generates y as

P(y|x; θ) = ∏_{j=1}^{J} P(y_j | z_{≤j}, c_j; θ).

The aggregated contextual representation c is often calculated by summarizing the sentence x with an attention mechanism (Bahdanau et al., 2015). A byproduct of the attention computation is a noisy alignment matrix A ∈ R^{J×I} which roughly captures the translation correspondence between target and source words (Garg et al., 2019). Generally, NMT optimizes the model parameters θ by minimizing the empirical risk over a parallel training set S:

L_S(θ) = E_{(x,y)∈S}[ℓ(f(x, y; θ), h(y))],   (1)

where ℓ is the loss (e.g., cross-entropy) between the model prediction f(x, y; θ) and the labels h(y).
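The factorized decoder probability above can be made concrete with a small sketch, using toy per-step probabilities in place of P(y_j | z_{≤j}, c_j; θ); the function names are illustrative, not from the paper:

```python
import math

def shifted_input(y, start="<s>"):
    """Build the decoder input z = <s>, y_1, ..., y_{J-1} from target y."""
    return [start] + y[:-1]

def sequence_log_prob(step_probs):
    """log P(y|x): the sum over j of log P(y_j | z_<=j, c_j),
    i.e. the log of the product in the factorization above."""
    return sum(math.log(p) for p in step_probs)

# Example: a 3-token target. At step j the decoder would condition on
# the prefix shifted_input(y)[: j + 1] and the attention context c_j.
y = ["ich", "bin", "hier"]
z = shifted_input(y)  # ["<s>", "ich", "bin"]
log_p = sequence_log_prob([0.5, 0.5, 0.5])  # log(0.125)
```

Minimizing the cross-entropy loss in Eq. (1) corresponds to maximizing this log-probability over the training set, with teacher forcing supplying the shifted copy z during training.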

