PSEUDO-LABEL TRAINING AND MODEL INERTIA IN NEURAL MACHINE TRANSLATION

Abstract

Like many other machine learning applications, neural machine translation (NMT) benefits from over-parameterized deep neural models. However, these models have been observed to be brittle: NMT model predictions are sensitive to small input changes and can show significant variation across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation (or self-training) and sequence-level knowledge distillation. While the effect of PLT on quality is well-documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations, a set of properties we call model inertia. We study inertia effects under different training settings and identify distribution simplification as a mechanism behind the observed results.

1. INTRODUCTION

Self-training (Fralick, 1967; Amini et al., 2022) is a popular semi-supervised technique used to boost the performance of neural machine translation (NMT) models. In self-training for NMT, also known as forward-translation, an initial model is used to translate monolingual data; this data is then concatenated with the original training data in a subsequent training step (Zhang & Zong, 2016; Marie et al., 2020; Edunov et al., 2020; Wang et al., 2021). Self-training is believed to be effective because it induces input smoothness and, through the addition of unlabeled data, leads to better learning of decision boundaries (Chapelle et al., 2006; He et al., 2020; Wei et al., 2021). It has also been observed to effectively diversify the training distribution (Wang et al., 2021; Nguyen et al., 2020). A closely related technique is knowledge distillation (Hinton et al., 2015; Gou et al., 2021), particularly sequence-level knowledge distillation (SKD), which uses hard targets in training and reduces to pseudo-labeled data augmentation (Kim & Rush, 2016). In NMT, knowledge distillation is effective through knowledge transfer from ensembles or larger-capacity models and as a data augmentation method (Freitag et al., 2017; Gordon & Duh, 2019; Tan et al., 2019; Currey et al., 2020). In non-autoregressive translation, Zhou et al. (2020) explored the effect of SKD on training data complexity and showed that simpler training data from distillation is crucial for the performance of non-autoregressive MT models. This paper examines the component that is common to these techniques, the introduction of pseudo-labeled training (PLT) data. We focus on the more common autoregressive NMT formulation and show that in addition to the known quality gains, PLT has a large impact on model brittleness in that it increases smoothness as well as stability across model re-training.
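The forward-translation recipe described above can be sketched in a few lines. This is an illustrative outline only: `train_model` and the model's `translate` method are hypothetical stand-ins for a real NMT toolkit's training and beam-search decoding APIs.

```python
def pseudo_label_training(parallel_data, monolingual_src, train_model):
    """Forward-translation (self-training) sketch for NMT.

    parallel_data: list of (source, target) sentence pairs
    monolingual_src: list of untranslated source sentences
    train_model: callable that trains a model on (source, target) pairs
    """
    # Step 1: train an initial model on the original parallel data.
    base_model = train_model(parallel_data)

    # Step 2: pseudo-label the monolingual data by translating it
    # with the initial model (forward-translation).
    pseudo_pairs = [(src, base_model.translate(src)) for src in monolingual_src]

    # Step 3: concatenate pseudo-labeled pairs with the original data
    # and train the final model on the combined corpus.
    return train_model(parallel_data + pseudo_pairs)
```

In SKD the same loop applies, except the pseudo-labels come from a (typically larger) teacher model rather than a model trained on the same parallel data.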
Our main contributions are:

• We focus on a set of stability properties in NMT models, which we unify under the umbrella term inertia, and show that PLT increases model inertia. We further show that both the quality gains and the improved inertia are not properties of any one specific technique such as self-training or knowledge distillation, but are common to the use of pseudo-labeled data in training.

This paper examines pseudo-label training in NMT and its effect on stability to both input variations and incremental model updates, which we group under the term inertia. Earlier work on pseudo-label training in MT focused on measuring quality alone and did not shed light on stability-related properties (Wang et al., 2021; He et al., 2020; Wei et al., 2021; Yuan et al., 2020). In terms of stability to input variations, or smoothness, our findings are related to the work of Papernot et al. (2015), who introduce defensive distillation and show that (self-)distillation increases smoothness when tested on digit and object recognition tasks. They show that the effect is one of reducing the amplitude of the network gradients. Unlike our work, they do not test pseudo-label training but soft-target distillation, where a student is trained using the prediction probabilities of a teacher.



In this work, we hypothesize that PLT techniques are able to increase model inertia based on their distribution simplification properties. Earlier works have explored the distribution simplification property of PLT methods in terms of model performance. In non-autoregressive NMT, Zhou et al. (2020) and Xu et al. (2021) explored the effect of SKD on training data complexity and its correlation with model performance. As in previous work, they hypothesized that SKD alleviates the multiple modes problem, i.e., the existence of multiple alternative translations (Gu et al., 2018). Similarly to Zhou et al. (2020), we measure training data complexity when adding pseudo-labeled data and use the entropy of a conditional word-level alignment as a complexity metric.

3. TRAINING WITH PSEUDO-LABELS IN NMT

Neural machine translation (NMT). We use the autoregressive formulation of NMT, where given parallel data containing source and target sequences, a model $\theta$ is learned using the following objective:

$$\mathcal{L}(\theta) = \sum_{j=1}^{J} \sum_{k=1}^{|V|} \mathbb{1}\{y_j = k\} \log p(y_j = k \mid y_{<j}, x, \theta), \quad (1)$$

where $x = [x_1, ..., x_I]$ and $y = [y_1, ..., y_J]$ are the source/target sequences respectively, $I$ and $J$ are the source/target lengths, and $|V|$ is the size of the vocabulary. Unless otherwise stated, we use beam search with a fixed number of hypotheses in order to generate a translation from this model.
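To make Eq. (1) concrete, the following minimal sketch computes the objective for a single sentence pair. The indicator $\mathbb{1}\{y_j = k\}$ zeroes out every vocabulary entry except the reference token, so the double sum collapses to the log-probability of the reference token at each position (the function name is ours):

```python
import math

def sequence_log_likelihood(probs, target_ids):
    """Per-sentence objective of Eq. (1).

    probs: list over target positions j of per-token distributions,
           where probs[j][k] = p(y_j = k | y_<j, x, theta)
    target_ids: the reference token ids [y_1, ..., y_J]

    The indicator 1{y_j = k} selects only the reference token's
    probability at each position, so the sum over the vocabulary
    reduces to a single term per position.
    """
    return sum(math.log(probs[j][y]) for j, y in enumerate(target_ids))
```

In practice this quantity is maximized (equivalently, its negation is minimized as a cross-entropy loss) over mini-batches by a toolkit's training loop.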

Based on our findings, we recommend incorporating PLT into NMT training whenever inertia (e.g., stability to input perturbations and across incremental model updates) is important, as it increases inertia without sacrificing quality.

2. RELATED WORK

Neural network models are known to be sensitive to input variations, i.e., to lack smoothness. This can make them brittle or open to adversarial attacks, a property observed across many application domains (Goodfellow et al., 2014; Szegedy et al., 2014; Jia & Liang, 2017). Neural machine translation models are similarly prone to robustness issues and can be affected by both synthetic and natural noise, leading to lower translation quality (Belinkov & Bisk, 2018; Li et al., 2019; Niu et al., 2020; Fadaee & Monz, 2020). In MT, earlier works have found noisy data augmentation (Belinkov & Bisk, 2018) and subword regularization (Kudo, 2018; Provilkov et al., 2020) to be among the simplest yet most effective methods for addressing instability to input perturbations.

In addition to smoothness, neural models are known to be sensitive to the various sources of randomness in training, such as initialization or dropout (Bengio, 2012; Reimers & Gurevych, 2017; Madhyastha & Jain, 2019). This instability negatively impacts end-users in the form of spurious differences in outputs between model updates, or more acutely, as quality regressions on specific data points, also known as negative flips (Shen et al., 2020; Xie et al., 2021; Yan et al., 2021). In NLP, Cai et al. (2022) focus on a set of structured prediction tasks and show that when random initialization changes, up to 30% of all errors can be regression errors, and that improved accuracy does not always mean reduced regressions.
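Synthetic noise of the kind studied in the robustness literature above is often generated with simple character-level perturbations. The sketch below, loosely in the spirit of the character-swap noise of Belinkov & Bisk (2018), is an assumed illustration rather than their exact procedure:

```python
import random

def swap_perturb(sentence, rate=0.1, seed=0):
    """Swap adjacent characters within words with probability `rate`.

    A simple synthetic-noise perturbation for probing the smoothness
    of an NMT model: translate both the clean and perturbed inputs
    and compare the outputs.
    """
    rng = random.Random(seed)  # seeded for reproducible perturbations
    words = []
    for word in sentence.split():
        chars = list(word)
        i = 0
        while i < len(chars) - 1:
            if rng.random() < rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 2  # skip ahead so a character is swapped at most once
            else:
                i += 1
        words.append("".join(chars))
    return " ".join(words)
```

Since swaps only permute characters inside a word, the perturbed sentence keeps the same words' character multisets, which makes the noise easy to control and verify.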
While negative flips are more difficult to measure in MT, as multiple translations can be valid, the lack of consistency across re-training is a known problem: in our experiments, ∼80% of the translations change due to a different model random initialization alone. Despite this, to the best of our knowledge, minimizing regressions or improving stability across incremental model updates or re-trainings has not yet been addressed in MT.
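A consistency figure like the ∼80% above amounts to the fraction of test sentences whose translations differ between two trained models. A minimal way to compute it (the function name is ours):

```python
def translation_change_rate(outputs_a, outputs_b):
    """Fraction of test sentences translated differently by two models,
    e.g., two re-trainings of the same system with different random seeds.
    Exact string mismatch is used; no credit is given for valid paraphrases,
    which is why this is a measure of consistency rather than quality."""
    assert len(outputs_a) == len(outputs_b), "test sets must align"
    changed = sum(a != b for a, b in zip(outputs_a, outputs_b))
    return changed / len(outputs_a)
```

Because exact match is strict, this metric deliberately counts benign paraphrase differences as changes; that is the point when the concern is output stability across model updates.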

