PSEUDO-LABEL TRAINING AND MODEL INERTIA IN NEURAL MACHINE TRANSLATION

Abstract

Like many other machine learning applications, neural machine translation (NMT) benefits from over-parameterized deep neural models. However, these models have been observed to be brittle: NMT model predictions are sensitive to small input changes and can show significant variation across re-training or incremental model updates. This work studies a frequently used method in NMT, pseudo-label training (PLT), which is common to the related techniques of forward-translation (or self-training) and sequence-level knowledge distillation. While the effect of PLT on quality is well-documented, we highlight a lesser-known effect: PLT can enhance a model's stability to model updates and input perturbations, a set of properties we call model inertia. We study inertia effects under different training settings and we identify distribution simplification as a mechanism behind the observed results.

1. INTRODUCTION

Self-training (Fralick, 1967; Amini et al., 2022) is a popular semi-supervised technique used to boost the performance of neural machine translation (NMT) models. In self-training for NMT, also known as forward-translation, an initial model is used to translate monolingual data; this data is then concatenated with the original training data in a subsequent training step (Zhang & Zong, 2016; Marie et al., 2020; Edunov et al., 2020; Wang et al., 2021). Self-training is believed to be effective through inducing input smoothness and leading to better learning of decision boundaries from the addition of unlabeled data (Chapelle et al., 2006; He et al., 2020; Wei et al., 2021). It has also been observed to effectively diversify the training distribution (Wang et al., 2021; Nguyen et al., 2020). A closely related technique is knowledge distillation (Hinton et al., 2015; Gou et al., 2021), particularly sequence-level knowledge distillation (SKD), which uses hard targets in training and reduces to pseudo-labeled data augmentation (Kim & Rush, 2016). In NMT, knowledge distillation is effective through knowledge transfer from ensembles or larger-capacity models and as a data augmentation method (Freitag et al., 2017; Gordon & Duh, 2019; Tan et al., 2019; Currey et al., 2020). In non-autoregressive translation, Zhou et al. (2020) explored the effect of SKD on training data complexity and showed that simpler training data from distillation is crucial for the performance of non-autoregressive MT models.

This paper examines the component that is common to these techniques, the introduction of pseudo-labeled training (PLT) data. We focus on the more common autoregressive NMT formulation and show that, in addition to the known quality gains, PLT has a large impact on model brittleness in that it increases smoothness as well as stability across model re-training.
Our main contributions are:
• We focus on a set of stability properties in NMT models, which we unify under the umbrella term inertia, and show that PLT increases model inertia. We further show that both the quality gains and the improved inertia are not properties of any one specific technique such as self-training or knowledge distillation, but are common to the use of pseudo-labeled data in training.
• We investigate the hypothesis that the observed properties correlate with a training data simplification mechanism, similarly to the observations made in Zhou et al. (2020). We compare with other popular semi-supervised techniques to investigate if the model quality and inertia properties hold when distribution simplification effects are not present.
• Based on our findings, we recommend incorporating PLT into NMT training whenever inertia (e.g., stability to input perturbations and across incremental model updates) is important, as it increases inertia without sacrificing quality.

2. RELATED WORK

Neural network models are known to be sensitive to input variations, i.e., lacking in smoothness. This can make them brittle or open to adversarial attacks, a property observed across many application domains (Goodfellow et al., 2014; Szegedy et al., 2014; Jia & Liang, 2017). Neural machine translation models are similarly prone to robustness issues and can be affected by both synthetic and natural noise, leading to lower translation quality (Belinkov & Bisk, 2018; Li et al., 2019; Niu et al., 2020; Fadaee & Monz, 2020). In MT, earlier works have found noisy data augmentation (Belinkov & Bisk, 2018) and subword regularization (Kudo, 2018; Provilkov et al., 2020) to be among the simplest yet most effective methods for addressing instability to input perturbations. In addition to smoothness, neural models are known to be sensitive to the various sources of randomness in training, such as initialization or dropout (Bengio, 2012; Reimers & Gurevych, 2017; Madhyastha & Jain, 2019). This instability negatively impacts end-users in the form of spurious differences in outputs between model updates, or more acutely, as quality regressions on specific data points, also known as negative flips (Shen et al., 2020; Xie et al., 2021; Yan et al., 2021). In NLP, Cai et al. (2022) focus on a set of structured prediction tasks and show that when random initialization changes, up to 30% of all errors can be regression errors, and that improved accuracy does not always mean reduced regressions. While negative flips are more difficult to measure in MT as multiple translations can be valid, the lack of consistency across re-training is a known problem: in our experiments ∼80% of the translations change due to different model random initialization alone. Despite this, to the best of our knowledge, minimizing regressions or improving stability across incremental model updates or re-trainings has not yet been addressed in MT.
This paper examines pseudo-label training in NMT and its effect on stability to both input variations and incremental model updates, which we group under the term inertia. Earlier work on pseudo-label training in MT focused on measuring quality alone and did not shed light on stability-related properties (Wang et al., 2021; He et al., 2020; Wei et al., 2021; Yuan et al., 2020). In terms of stability to input variations, or smoothness, our findings are related to the work of Papernot et al. (2015), where the authors introduce defensive distillation and show that (self-)distillation increases smoothness when tested on digit and object recognition tasks. They show that the effect is one of reducing the amplitude of the network gradients. Unlike our work, they do not test pseudo-label training, but soft-target distillation, where a student is trained using the prediction probabilities of a teacher. Finally, we hypothesize that PLT techniques are able to increase model inertia based on their distribution simplification properties. Earlier works have explored the distribution simplification property of PLT methods in terms of model performance. In non-autoregressive NMT, Zhou et al. (2020) and Xu et al. (2021) explored the effect of SKD on training data complexity and its correlation with model performance. As in previous work, they hypothesized that SKD alleviates the multiple modes problem, i.e., the existence of multiple alternative translations (Gu et al., 2018). Similarly to Zhou et al. (2020), we measure training data complexity when adding pseudo-labeled data and use the entropy of a conditional word-level alignment as a complexity metric.

3. TRAINING WITH PSEUDO-LABELS IN NMT

Neural machine translation (NMT). We use the autoregressive formulation of NMT, where given parallel data containing source and target sequences, a model $\theta$ is learned using the following objective: $\mathcal{L} = -\sum_{j=1}^{J} \sum_{k=1}^{|V|} \mathbb{1}\{y_j = k\} \log p(y_j = k \mid y_{<j}, x, \theta)$, where $x = [x_1, \ldots, x_I]$ and $y = [y_1, \ldots, y_J]$ are the source/target sequences respectively, $I$ and $J$ are the source/target lengths, and $|V|$ is the size of the vocabulary. Unless otherwise stated, we use beam search with a fixed number of hypotheses in order to generate a translation from this model.
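To make the objective concrete, here is a minimal sketch in plain Python (a toy three-word vocabulary and a hypothetical `nmt_loss` helper, not the paper's training code): the indicator $\mathbb{1}\{y_j = k\}$ simply picks out the log-probability of the reference token at each position.

```python
import math

def nmt_loss(log_probs, target):
    """Negative log-likelihood of a target sequence y.

    log_probs: list over positions j of dicts token -> log p(y_j = k | y_<j, x)
    target:    list of J reference tokens
    Implements L = -sum_j sum_k 1{y_j = k} * log p(y_j = k | y_<j, x, theta):
    the indicator selects the reference token's log-probability per position."""
    return -sum(step[tok] for step, tok in zip(log_probs, target))

# Toy example: a 2-token target over a 3-word vocabulary.
log_probs = [
    {"das": math.log(0.7), "ein": math.log(0.2), "Haus": math.log(0.1)},
    {"das": math.log(0.1), "ein": math.log(0.1), "Haus": math.log(0.8)},
]
loss = nmt_loss(log_probs, ["das", "Haus"])  # = -log 0.7 - log 0.8
```

In a real system the per-position distributions come from a softmax over the decoder output and the loss is averaged over a mini-batch; the reduction over the vocabulary is identical.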

Pseudo-label training (PLT)

In this paper, we introduce the term pseudo-label training (PLT) to refer to the general technique of adding pseudo-labeled data during training, where the labels are obtained using a previously trained NMT model. Specifically, we consider two-step PLT. In a first stage we estimate a teacher model $\theta^*$ trained with a supervised loss on samples drawn from $p$, the empirical distribution of the original training data: $\mathcal{L} = -\mathbb{E}_{x \sim p(x)} \mathbb{E}_{y \sim p(y \mid x)} \log p_{\theta^*}(y \mid x)$. In a second step we estimate the final student model $\theta$, combining the supervised loss with a PL (pseudo-label) loss, $\mathcal{L} + \mathcal{L}_{PL}$, where: $\mathcal{L}_{PL} = -\mathbb{E}_{x \sim p_{PL}(x)} \log p_{\theta}(y' \mid x)$. In this case the targets $y'$ are given by the teacher distribution $p_{\theta^*}$ and the samples are drawn from a second distribution, $p_{PL}$, which varies in the experiments below.

Related techniques. As discussed earlier, PLT is a common feature of several widely used techniques in NMT such as self-training (a.k.a. forward-translation) and sequence-level knowledge distillation. This paper opts for the term pseudo-label training (PLT) in order to avoid confusion with additional assumptions made by these techniques. Specifically:
• PLT does not necessarily imply semi-supervision, as self-training does.
• PLT is more specific than KD in that it is restricted to hard labels (as opposed to training on soft targets as in Hinton et al. 2015), but more generic as it does not assume model compression.
Another technique for introducing synthetic data is back-translation (BT), where target segments are translated into source segments (Sennrich et al., 2016a; Hoang et al., 2018; Edunov et al., 2020). PLT does not include BT since the latter does not introduce synthetic targets or labels. Lastly, note that self-training is closely related to entropy minimization (Grandvalet & Bengio, 2004), a semi-supervised technique that encourages high-confidence predictions on unlabeled data.
When reducing this objective to its mode, it becomes identical to $\mathcal{L}_{PL}$ above, as also observed in He et al. (2020).
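The two-step recipe above amounts to simple data manipulation; the following sketch uses hypothetical helper names, with the teacher reduced to a stand-in callable rather than an actual NMT decoder.

```python
def pseudo_label(teacher, sources):
    """Step 1: label source segments with a trained teacher model theta*.

    `teacher` is any callable mapping a source string to a translation
    (a stand-in for beam-search decoding with the teacher)."""
    return [(x, teacher(x)) for x in sources]

def plt_training_data(parallel_data, teacher, pl_sources):
    """Step 2: the student trains on original pairs plus pseudo-labeled pairs,
    i.e. on the combined loss L + L_PL (here: plain concatenation, equal mix)."""
    return parallel_data + pseudo_label(teacher, pl_sources)

# Toy usage with a hypothetical dictionary "teacher".
toy_teacher = {"a cat": "eine Katze", "a dog": "ein Hund"}.get
original = [("a house", "ein Haus")]
combined = plt_training_data(original, toy_teacher, ["a cat", "a dog"])
# combined now holds one gold pair and two pseudo-labeled pairs.
```

Whether `pl_sources` comes from the training data itself (as in sequence-level KD) or from unseen monolingual data (as in self-training) is exactly the dimension varied in the experiments below.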

4. MODEL INERTIA

This section introduces a set of desired stability-related MT properties that we group under the term inertia. All our metrics are closed-box (based on user-observed model behaviour alone) and we investigate two types of model inertia: (1) robustness to input perturbations (or smoothness) and (2) stability across incremental model updates.

4.1. INPUT SMOOTHNESS

Robustness to input variations is important in MT models, which have been shown to be negatively affected by misspellings and other small variations in input (Belinkov & Bisk, 2018). Niu et al. (2020) introduced metrics that contrast translations of noisy input with those of their clean counterparts in order to disentangle robustness from generic quality changes. We evaluate model robustness and consistency to input changes following the definitions introduced in Niu et al. (2020): Robustness measures degradation in translation quality when small variations are present in the input, while Consistency is a reference-free metric for changes in translation output alone. Specifically:

Consistency = $H(\text{BLEU}(Y', Y), \text{BLEU}(Y, Y'))$
Robustness = $\text{BLEURT}(Y', Y_{\text{ref}}) - \text{BLEURT}(Y, Y_{\text{ref}})$

where $Y_{\text{ref}}$ stands for reference translations, $Y, Y'$ are translations of the clean/noisy versions of the test set (e.g., one with introduced misspellings), and $H(\cdot, \cdot)$ stands for the harmonic mean. In this paper, we expand these definitions to consider robustness not only to synthetic misspellings, but also to natural grammatical errors.
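The two metrics can be sketched as follows. This is a minimal sketch: `bleu` and `bleurt` stand in for sacreBLEU and BLEURT scorers, and the toy sentence-matching score used at the end is purely illustrative.

```python
def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def consistency(bleu, noisy_translations, clean_translations):
    """Consistency = H(BLEU(Y', Y), BLEU(Y, Y')): a reference-free,
    symmetric measure of how much the output changes under input noise.
    `bleu(hyps, refs)` is any corpus-level BLEU (e.g. sacreBLEU)."""
    return harmonic_mean(
        bleu(noisy_translations, clean_translations),
        bleu(clean_translations, noisy_translations),
    )

def robustness(bleurt, noisy_translations, clean_translations, references):
    """Robustness = metric(Y', Y_ref) - metric(Y, Y_ref): quality change
    attributable to the input perturbation (negative = degradation)."""
    return bleurt(noisy_translations, references) - bleurt(clean_translations, references)

# Toy check with a hypothetical exact-match score in [0, 100].
toy_score = lambda hyp, ref: 100.0 * sum(h == r for h, r in zip(hyp, ref)) / len(hyp)
clean = ["the cat sat", "a dog ran"]
noisy = ["the cat sat", "a dog runs"]
c = consistency(toy_score, noisy, clean)
```

The harmonic mean makes Consistency symmetric even when the underlying metric (like BLEU) is not symmetric in hypothesis and reference.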

4.2. STABILITY TO MODEL UPDATES

Unlike smoothness metrics, stability metrics are functions of two models: an original one (e.g., one that is deployed and available to users) and an update of this model which implements an incremental change. We denote a model update as a pair $(\theta, D, A)_i$, $(\theta, D, A)_{i+1}$, where $\theta$ are the model parameters obtained when training using data $D$ and algorithm $A$. While many incremental updates are possible, in this work we keep the model size and architecture intact and vary the random parameter initialization in training, following Xie et al. (2021) and Cai et al. (2022). We define stability as a measure of similarity between model outputs, irrespective of quality changes, while regressions (negative flips) measure output changes that result in lower quality on a given input segment (Cai et al., 2022; Xie et al., 2021; Shen et al., 2020; Yan et al., 2021).

STABILITY: Stability is measured as string similarity between the different outputs $Y_i$, $Y_{i+1}$. We use a symmetric BLEU-based metric, the harmonic mean between $\text{BLEU}(Y_i, Y_{i+1})$ and $\text{BLEU}(Y_{i+1}, Y_i)$, where $Y_i$ and $Y_{i+1}$ are translations obtained with models $\theta_i$ and $\theta_{i+1}$, respectively.

NFR: Similarly to earlier works (Cai et al., 2022; Xie et al., 2021; Yan et al., 2021), we measure regressions as the Negative Flip Rate: the number of sentences for which the translation degrades between model updates over the total number of segments. We consider degradations in terms of both overall quality and a targeted translation error category. Unlike other tasks, NMT lacks a reliable automatic segment-level quality metric (Kocmi et al., 2021; Mathur et al., 2020); we use human evaluations for this reason. Having an additional targeted error category allows us to measure segment-level regression automatically. In this work, we adopt gender translation accuracy as the targeted error category.

NFI: Following Cai et al. (2022), we also measure regressions in terms of Negative Flip Impact.
NFI is defined as the proportion of negative flips to the total number of errors made by the new model. Note that in NMT, "error" is less well-defined for overall quality since it is not a categorical concept; this is not the case for targeted translation error categories.

5. EXPERIMENTS

We perform experiments across 6 language pairs (LPs): English (en)↔German (de), Russian (ru), and Japanese (ja). We adapt the Transformer-base architecture (Vaswani et al., 2017) to 20 encoder layers and 2 decoder layers (denoted 20:2) as recommended by Domhan et al. (2020), with SSRU decoder layers for faster decoding (Kim et al., 2019). The deep-encoder-shallow-decoder configuration is widely used (Miceli Barone et al., 2017; Kim et al., 2019; Kasai et al., 2021), and the 20:2 model was found by Domhan et al. (2020) to yield comparable quality to the 6:6 and 10:10 models while significantly decreasing latency. Unless otherwise noted, we use beam decoding with a beam size of 5 (further details in Appendix A). Experiments are carried out with the WMT21 dataset (Akhbardeh et al., 2021). For en↔de we use 286M parallel segments, for en↔ja we use 17.2M parallel segments, and for en↔ru we use 34M parallel segments. For development, we use WMT newstest datasets from earlier years (see Appendix B for more details on datasets used). We evaluate quality using BLEU (Papineni et al., 2002) and BLEURT (Sellam et al., 2020) on the WMT21 newstest sets (Akhbardeh et al., 2021). We use only source-original test sets in order to avoid misestimating model performance due to translationese input (Marie et al., 2020). We train PLT-augmented models using a mix of the original training data and pseudo-labeled data in a joint training setting following Zhang & Zong (2016) and Gordon & Duh (2019). Based on recommendations by He et al. (2020), we use dropout for all models, set to 0.1. We do not tune the trade-off between the two losses $\mathcal{L}$ and $\mathcal{L}_{PL}$ (we use an equal amount of original and PLT data) or the number of incremental applications of the PLT augmentation.

SRC: Can yo put cites on those?
BASELINE: Können Sie Zitate darauf setzen?
PLT(TRAIN): Kannst du diese zitieren?

Table 1: Example translations from BASELINE and PLT(TRAIN) on the synthetic misspellings and GMEG test sets. In the first example (synthetic misspelling), the baseline invents the word Guven as a translation of the original misspelled word, guven (given). PLT translates the second example (English learner error) as Can you cite these? using the informal register, while the BASELINE translates it literally as Can you put citations on these? (formal register).

5.1. QUALITY AND INERTIA USING PSEUDO-LABELED DATA

This section evaluates PLT for both generic model quality and for inertia. Unless otherwise noted, student models share the same architecture as the teacher and are trained using the same parallel data with the addition of pseudo-labeled data. PLT can be implemented by sampling and labeling data from different source distributions $p_{PL}$: the original training data (as in KD) or unseen monolingual data (i.e., semi-supervised). This section tests both: to that end, teacher models are trained on half of the available parallel data, while the other half is reserved as a source of unlabeled monolingual data. Specifically, we compare:
• BASELINE: Model trained on half the available data without any data augmentation.
• PLT(TRAIN): Data used in PLT augmentation is sampled from the training data.
• PLT(UL): Data used in PLT augmentation is sampled from unused parallel data.
• ALLDATA: Finally, to account for the differences in training data size, we also compare against a model trained on all available parallel data without any PLT.

5.1.1. INPUT SMOOTHNESS

For each of these models we compute newstest quality (BLEU score) as well as model smoothness (robustness and consistency). We measure robustness and consistency as defined in Section 4 with the following sources of input variations:
• Synthetic misspellings: We introduce misspellings as proposed by Niu et al. (2020) into the newstest set. Each word is misspelled with probability 0.1, and the strategy is randomly chosen from single-character deletion, insertion, and substitution (Karpukhin et al., 2019).
• GMEG: The GMEG corpus (Napoles et al., 2019) contains data with natural errors made by English language learners (grammatical misuse, misspellings, etc.). We compute consistency using the noisy input and a reference correction made by a professional annotator. We report the average consistency over the four provided reference corrections.
Example translations and results are shown in Tables 1 and 2, respectively. Across all LPs, translation quality improves when pseudo-labeled data is used in training, irrespective of the source of the data added. However, sampling from unseen data does not bring additional improvements over using seen data for PLT. Similarly, using all parallel data vs. only half is not beneficial across the board, suggesting limitations of the training data w.r.t. the test domain. PLT shows significantly higher model consistency on both synthetic misspellings and the GMEG test sets. Unlike Niu et al. (2020), however, we find that robustness scores (translation quality changes relative to input changes) are not as well correlated with consistency scores, suggesting that while translations are more stable under noisy conditions they may not necessarily be better. In the context of semi-supervised learning, it has been hypothesized that self-training has the effect of making models smoother through the addition of new data (He et al., 2020; Wei et al., 2021). Our results suggest that this is not necessarily the case, as smoothness results are similar irrespective of the use of new unlabeled (monolingual) data (i.e., PLT(TRAIN) and PLT(UL) have similar smoothness).

Table 3 caption (fragment): … is the percent of outputs that stay identical across the two models. For Distillation (Distil.), the second model is trained to mimic the first model.
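The synthetic-misspelling scheme described above can be sketched as follows; `misspell` is a hypothetical helper implementing the per-word deletion/insertion/substitution noise, not the authors' code.

```python
import random
import string

def misspell(sentence, prob=0.1, rng=random):
    """Perturb each word with probability `prob` using a random
    single-character deletion, insertion, or substitution."""
    out = []
    for word in sentence.split():
        if len(word) > 1 and rng.random() < prob:
            i = rng.randrange(len(word))
            op = rng.choice(["delete", "insert", "substitute"])
            c = rng.choice(string.ascii_lowercase)
            if op == "delete":
                word = word[:i] + word[i + 1:]
            elif op == "insert":
                word = word[:i] + c + word[i:]
            else:  # substitute
                word = word[:i] + c + word[i + 1:]
        out.append(word)
    return " ".join(out)

# Seeded usage so the perturbation is reproducible.
rng = random.Random(0)
noisy = misspell("thousands of people are given a drug", prob=0.5, rng=rng)
```

Note the operations never touch word boundaries, so the noisy sentence has the same number of tokens as the clean one, which keeps consistency comparisons well defined.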

5.1.2. STABILITY TO MODEL UPDATES

Next we investigate stability properties with respect to model updates when PLT is used in training. We fix the source of the pseudo-labeled data to be the training data (i.e., we consider only PLT(TRAIN)) and compare translation changes when re-training a model. Recall, a model update consists of a pair $(\theta, D, A)_1$, $(\theta, D, A)_2$, where $\theta$ are the model parameters obtained when training using data $D$ and algorithm $A$. In these experiments, we keep the network architecture identical and hold $A_1 = A_2$, modulo the random seed used in initialization. We contrast several settings:
• BASELINE: Models are trained and re-trained with half the original data ($D_1 = D_2$), and no pseudo-labeled data is used. As above, we also evaluate the case where all of the original data is used (ALLDATA). We vary the random seed, leading to $\theta_1 \neq \theta_2$.
• PLT-δ(STUDENT): This tests the hypothesis that using PLT leads to more stable models that behave similarly when varying minor training conditions. We consider an identical setup as the baseline ($D_1 = D_2$), except that the data is augmented to contain PLT data.
• DISTILLATION: This setting is a standard distillation approach for minimizing regressions, where $\theta_2$ is trained to explicitly mimic $\theta_1$'s predictions (Yan et al., 2021; Cai et al., 2022).
Stability and regression metrics are averaged over $(\theta_1, \theta_2)$ and $(\theta_2, \theta_1)$ scores since random initialization changes are not directional model updates. Tables 3 and 4 show stability and regression metrics respectively (regression results are discussed in the next section). First, we observe that a striking number of translations change when changing random initialization: only 15% of outputs remain identical for en→de, and 8% and 2% remain identical for the lower-resource en→ru and en→ja pairs respectively. Doubling the amount of training data (ALLDATA) improves stability, but not by a large margin.
Across all LPs tested, PLT improves stability relative to the baseline models and nearly doubles the percentage of segments translated identically. Interestingly, PLT also improves stability relative to the system trained on all available parallel data, once again indicating that inertia effects do not simply stem from more data. This result is particularly surprising for the PLT-δ(TEACHER) setting: unlike the baseline, the two models compared are trained on different data on the target side, yet their outputs are more similar to each other than the baseline outputs are to each other. This suggests that the high translation variability of the original data (a.k.a. multiple modes in Gu et al., 2018) is an issue with auto-regressive MT as well, and that pseudo-labeled data alleviates it even when created with different models. Finally, we also find that distillation, where a new model is explicitly trained to mimic the previous model, increases stability between teacher and student, confirming earlier observations on text classification (Cai et al., 2022). However, this improvement is modest in our experiments.

5.1.3. NEGATIVE FLIPS

Next, we assess PLT in terms of negative flips (NFs) as described in Section 4. We evaluate regressions in terms of overall quality (human evaluations on the WMT21 newstest set) and on a targeted error category (gender translation accuracy). For human evaluations, we used two professional annotators who assigned scores on a scale of 1 to 6 with 0.2 increments, where 6 indicates a perfect translation. A NF is defined as both annotators agreeing that there is a degradation. Since quality is evaluated on a scale, and not as a binary score, the concept of NFI is ambiguous; we therefore compute negative flip rate (NFR) alone. We evaluate on en→de,ja,ru due to availability of annotators. For gender translation accuracy, which aggregates categorical measurements, we evaluate both NFR and NFI. We use the WinoMT benchmark (Stanovsky et al., 2019), a gender accuracy benchmark with a reliable segment-level metric suitable for automatic measurements of negative flips. The dataset consists of English source segments containing a profession whose gender is ambiguous at the lexical level but disambiguated in the sentential context, along with an automatic morphology-based classifier that evaluates the gender accuracy when translated. We evaluate on the two of our language pairs that are covered by WinoMT, en→de and en→ru. Results are shown in Table 4. First, we observe that, like other NLP tasks, regressions are also an issue for NMT: on WMT21, 15%-20% of the test set samples are judged as having degraded in quality according to both human annotators. NFR is lower for the gender translation accuracy task; however, NFs still amount to 10%-15% of the total number of errors, as measured by NFI. Mirroring stability results, both PLT models have significantly fewer segment-level regressions than baseline models. For quality, this is most pronounced for en→de,ru (∼50%-100% relative NFR reduction). In contrast, the effect of distillation is not consistent across the language pairs or the two test sets.

Published as a conference paper at ICLR 2023

Table 5: PLT models using teachers of varying quality (averages over the three language pairs in each direction). We find that teacher quality correlates with the quality of PLT; however, weaker teachers can still improve student quality. In terms of inertia properties, these are preserved regardless of teacher quality. Note: X→en averages for robustness and consistency to misspellings include only de,ru→en.

5.2. TEACHER QUALITY

In previous sections, we found that quality and model inertia improved when using PLT regardless of the source of the data. In this section, we examine another dimension which distinguishes different flavors of PLT, namely teacher quality. Stronger teachers (teachers with larger capacity than the student) are more common in KD applications, whereas identical teacher and student models are the norm in self-training/forward-translation. Specifically, we vary the base 20:2 teacher architecture by decreasing the number of decoder layers to 1 (weaker teacher) and increasing it to 4 (stronger teacher). We keep the student architecture identical at 20:2 layers and fix the source of pseudo-labeled data to the training set (referred to as PLT(TRAIN) in earlier sections). Interestingly, we find that teacher quality does not play a large role in model stability (Table 5). There are small improvements in stability and robustness when stronger teachers are used, but gains are in a similar range for all teacher models considered, even for weak teachers. Stronger teachers, however, are responsible for better-performing student models. Most surprisingly, we found quality improvements over the baseline even when the teacher is of worse quality than the baseline model. This corroborates other work suggesting that the mechanism behind PLT is not simply that of compressing better-performing (sets of) models (Furlanello et al., 2018; Hahn & Choi, 2019; Yuan et al., 2020).

6. DISTRIBUTION SIMPLIFICATION

The previous section showed that PLT increases both quality and model inertia under different monolingual data and teacher quality settings. We hypothesize that the increased inertia observed is correlated with a distribution simplification mechanism: PLT leads to simpler training data, resulting in models that are less brittle. We test this by comparing PLT with other techniques used to improve quality and smoothness, but that may not have a distribution simplification effect. Below, we fix the source of pseudo-labeled data to the training data and test:
• BT: Back-translation, a commonly used semi-supervised method that adds parallel data obtained through translation of target data with a reverse-direction MT model.
• BPE-DROPOUT: A regularization method that has been shown to improve robustness to noise (Provilkov et al., 2020). We used a dropout rate of 0.1 as recommended by the authors.
• PLT(SAMPLE): A variant of PLT where we vary the decoding strategy and perform sampling decoding, which leads to more complex data and weaker student models (Zhou et al., 2020). Specifically, we sampled the top-8 hypotheses.
For each setting, we (1) compute an alignment model on the training data using fast_align (Dyer et al., 2013), (2) use it to align a sample of the training corpus, and (3) compute the entropy of the aligned data, leading to: $C(d) = -\frac{1}{|V_x|} \sum_{x \in V_x} \mathbb{E}_{y \mid x_{\text{align}}} \log p(y \mid x_{\text{align}})$, where $y$ is the sequence of training data tokens and $x_{\text{align}}$ the sequence of source-side tokens that the $y$ tokens are aligned to. Lower entropy indicates that the data is explained by a simpler word-to-word translation model that uses similar word translations irrespective of context. Results are shown in Table 6. First, we observe that the complexity scores confirm the results reported by Zhou et al. (2020), with smaller-scale differences due to the fact that we mix both original data and pseudo-labeled data. BPE-DROPOUT performs best on smoothness w.r.t.
synthetic noise: it outperforms all methods by a large margin on robustness, and by a smaller margin on consistency. This is not the case on data with natural noise (GMEG), where the increased consistency effect is smaller w.r.t. the BASELINE model. On other metrics, BPE-DROPOUT has no effect on quality (BLEURT) and a minor negative effect on stability across re-training. BPE-DROPOUT is the only method that lowers stability, and also the only method that increases the complexity of the data compared to the baseline. BT shows a data simplification effect, mirrored by increased stability when re-training. However, BT has a detrimental effect on robustness and consistency metrics. These results indicate that while back-translation and forward-translation are typically seen as very similar methods, they have different properties. PLT(SAMPLE) performs very similarly to PLT(TRAIN): when compared with PLT(TRAIN), it leads to slightly more complex data, and slightly worse quality and inertia scores. PLT(TRAIN) shows the lowest complexity scores and the highest stability. While stability and complexity correlate, not all methods that simplify the data improve smoothness; conversely, smoothness to synthetic noise can be improved significantly with complementary methods such as BPE-DROPOUT. We corroborate Niu et al. (2020) and find that synthetic and natural noise are different in nature and not all methods are equally effective on both types of noise.
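Given word-level alignment links (e.g. produced by fast_align), the complexity computation can be sketched with an empirical conditional distribution. This is a simplification: the toy version below aggregates target-token entropy per source type, whereas the full metric is computed over aligned token sequences in the corpus sample.

```python
import math
from collections import Counter, defaultdict

def complexity(aligned_pairs):
    """Conditional word-level alignment entropy, a sketch of C(d):
    average over source types x of H(y | x) under the empirical
    distribution of target tokens aligned to x.

    aligned_pairs: iterable of (source_token, target_token) alignment links."""
    by_src = defaultdict(Counter)
    for x, y in aligned_pairs:
        by_src[x][y] += 1
    total = 0.0
    for counts in by_src.values():
        n = sum(counts.values())
        # H(y | x) = -sum_y p(y|x) log p(y|x) over the aligned targets of x
        total += -sum((c / n) * math.log(c / n) for c in counts.values())
    return total / len(by_src)

# Toy data: "bank" translates two ways (higher entropy), "cat" only one.
pairs = [("bank", "Bank"), ("bank", "Ufer"), ("cat", "Katze"), ("cat", "Katze")]
c = complexity(pairs)  # = (H(1/2, 1/2) + H(1)) / 2 = ln(2) / 2
```

Pseudo-labeled data reduces this quantity because the teacher tends to reuse the same word translation across contexts, collapsing the multiple-modes distribution toward a single dominant mode.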

7. CONCLUSION

This paper investigates pseudo-label training, a technique common to a number of methods for boosting NMT performance. We show that in addition to well-studied gains in generic translation quality, pseudo-label training induces several desirable stability-related properties, which we group under the term inertia. Empirically, these improvements are not tied to the use of unlabeled data (as in self-training) or the use of stronger teacher models (as in knowledge distillation) but are a consequence of the use of pseudo-labeled data itself. When compared with other methods designed to improve robustness in NMT, we observed that the effect on stability over re-training occurs only for those methods that simplify the training data. Based on these findings, we recommend using PLT with unlabeled data (à la self-training) when developing NMT models where inertia is important, both for its benefits to model inertia and for its use in addressing potential language coverage bias (Wang et al., 2021). In future work, we plan to investigate the interplay between PLT and different formulations of NMT (auto- vs. non-autoregressive MT) as well as potential negative side effects such as bias amplification (Renduchintala et al., 2021). Finally, developing automatic metrics to detect negative flips in NMT is an important task that has yet to be examined extensively and can help guide PLT techniques.

Table 7: We trained our models on a subset of datasets from the WMT21 news task. Specifically, we used Paracrawl v9 (Bañón et al., 2020), WikiMatrix (Schwenk et al., 2021), WikiTitles (Bojar et al., 2018), news commentary, the UN v1.0 dataset (Ziemski et al., 2016), JParaCrawl (Morishita et al., 2020), and the Japanese-English subtitles dataset (Pryzant et al., 2018).

Table 8: We used the WMT news test datasets from previous years as our development set.

LP | Years | # parallel
en↔de | 2017-2020 | 9k
en↔ja | 2020 | 2k
en↔ru | 2017-2020 | 9k



Footnotes:
• Specifically, using sacreBLEU (Post, 2018) with signature nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0, except for en→ja where we use the ja-mecab tokenizer.
• It is not possible to compute robustness scores for GMEG as this set does not contain reference translations.
• The noisy datasets (synthetic misspellings and GMEG) do not cover the ja→en translation direction, so this direction is not included in Table 2.
• https://github.com/alvations/sacremoses



SRC: Thousands of people aree guven a drug and thousands of others are given a placebo..
BASELINE: Tausende von Menschen erhalten Guven ein Medikament und Tausende von anderen erhalten ein Placebo.
PLT(TRAIN): Tausende von Menschen erhalten eine Droge und Tausende von anderen erhalten ein Placebo.

Table 2: Training data sizes and performance scores for PLT/Baseline models. Quality is measured with BLEU and BLEURT on the WMT21 newstest set. Smoothness is measured as robustness and consistency to synthetic (Misspellings) and natural (GMEG) noise. GMEG scores are computed as the average over four reference corrections. Robustness measures changes in translation quality w.r.t. input variations, while consistency measures translation changes alone.



Table 4: Negative flip rate (NFR) and negative flip impact (NFI) on WMT21 (assessed by human annotators) and WinoMT (using the automatic gender translation accuracy metric).

… stay constant. Note however that this is not a direct comparison to the baseline and PLT methods: the models do not vary in random seed alone, but also in the contents of the training data ($D_1 \neq D_2$).
• DISTILLATION: In this setting $D_2$ is obtained from $D_1$ using pseudo-labeled data obtained with model $\theta_1$. The training data $D_1$ is re-translated and merged with the original $D_1$ to create $D_2$.

Table 6: Quality and model inertia with PLT versus other methods (averages over the three language pairs in each direction). Stability to model updates is computed w.r.t. random seed variation in student models. X→en averages for robustness and consistency to misspellings involve de,ru→en.

ACKNOWLEDGEMENTS

We thank Yi Zhang and Elman Mansimov for discussions on negative flips and Miguel Ballesteros, Surafel Lakew, Cuong Hoang, and anonymous reviewers for their comments and suggestions.

C TRAINING CURVES

Here, we compare pseudo-label training with back-translation (BT). We find that pseudo-label training regularizes the models by controlling overfitting. BT also regularizes the model, but it does not simplify the distribution to the extent PLT does, implying that controlling overfitting is not a main factor for stability. Comparisons with other methods (i.e., BPE-dropout and PLT(sample)) show similar trends.

Figure 1: Comparisons of PLT(train) validation (solid lines) and training curves (dashed lines) against back-translation and baseline models. We find that in comparison, PLT is able to control overfitting on the training data. Back-translation also regularizes the model, but it does not simplify the distribution to the extent PLT does, implying that controlling overfitting is not a main factor for stability.

D BERTSCORE

We provide quality scores using BERTScore (Zhang et al., 2020). In terms of generic quality, PLT provides improvements in quality consistent with earlier results using the BLEU and BLEURT metrics (see Tables 2 and 5). We also computed robustness metrics using BERTScore:

