UNDERSTANDING AND IMPROVING LEXICAL CHOICE IN NON-AUTOREGRESSIVE TRANSLATION

Abstract

Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models, as it reduces the complexity of the raw data with an autoregressive teacher model. In this study, we empirically show that, as a side effect of this training, the lexical choice errors on low-frequency words are propagated from the teacher model to the NAT model. To alleviate this problem, we propose to expose the raw data to NAT models to restore the useful information about low-frequency words, which is missing from the distilled data. To this end, we introduce an extra Kullback-Leibler divergence term derived by comparing the lexical choice of the NAT model with that embedded in the raw data. Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing the lexical choice errors on low-frequency words. Encouragingly, our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively.

1. INTRODUCTION

When translating a word, translation models need to spend a substantial amount of their capacity on disambiguating its sense in the source language and choosing a lexeme in the target language that adequately expresses its meaning (Choi et al., 2017; Tamchyna, 2017). However, neural machine translation (NMT) suffers from a severe lexical choice problem, since it frequently mistranslates low-frequency words (Koehn & Knowles, 2017; Nguyen & Chiang, 2018; Gu et al., 2020). Non-autoregressive translation (NAT) models are commonly trained on data distilled from an autoregressive translation (AT) teacher rather than on the raw data (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Hao et al., 2021). Recent studies have revealed that knowledge distillation (KD) reduces the modes (i.e., multiple lexical choices for a source word) in the raw data by re-weighting the training examples (Furlanello et al., 2018; Tang et al., 2020), which lowers the intrinsic uncertainty (Ott et al., 2018) and the learning difficulty for NAT (Zhou et al., 2020; Ren et al., 2020). However, the side effects of KD have not been fully studied. In this work, we investigate this problem from the perspective of lexical choice, which is at the core of machine translation.

We argue that the lexical choice errors of the AT teacher can be propagated to the NAT model via the distilled training data. To verify this hypothesis, we qualitatively compare the raw and distilled training corpora. Table 1 lists all samples whose source sentences contain the place name "纽马基特" (Newmarket). In the raw corpus ("RAW-TGT"), this low-frequency word occurs three times in total and always corresponds to the correct translation "Newmarket". However, in the KD corpus ("KD-TGT"), the word is incorrectly translated into the person name "Newmargot" (Margot Robbie is an Australian actress), the organization name "Newmarquette" (Marquette is a university in Wisconsin), or even the invalid token "Newmarquite". Motivated by this finding, we explore NAT from the lexical choice perspective.
We first validate our hypothesis by analyzing the lexical choice behaviors of NAT models (§3). Concretely, we propose a new metric, AoLC (accuracy of lexical choice), to evaluate the lexical translation accuracy of a given NAT model. Experimental results across different language pairs show that NAT models trained on distilled data have higher global lexical translation accuracy (AoLC↑), which results in better sequence generation. However, fine-grained analyses reveal that although KD improves the accuracy on high-frequency tokens, it harms performance on low-frequency ones (low-freq. AoLC↓), and this issue becomes more severe as the teacher model improves. We conclude that the lexical choice of low-frequency tokens is a typical kind of information lost during knowledge distillation from the AT model.

To rejuvenate this lost information in the raw data, we propose to expose the raw data to the training of NAT models, which equips them with the ability to learn the lost knowledge by themselves. Specifically, we propose two bilingual lexical-level data-dependent priors (Word Alignment Distribution and Self-Distilled Distribution) extracted from the raw data, which are integrated into NAT training via a Kullback-Leibler divergence term. Both approaches expose the lexical knowledge in the raw data to NAT models, making them learn to restore the useful information of low-frequency words. We validate our approach on several datasets widely used in previous studies (i.e., WMT14 En-De, WMT16 Ro-En, WMT17 Zh-En, and WAT17 Ja-En) and on multiple model architectures (i.e., MaskPredict (Ghazvininejad et al., 2019) and Levenshtein Transformer (Gu et al., 2019)). Experimental results show that the proposed method consistently improves translation performance over standard NAT models across languages and advanced NAT architectures.
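As a rough illustration of how a lexical-accuracy metric of this kind can be computed (the exact definition of AoLC is given in §3; the lexicon format and function name below are our own illustrative assumptions, not the paper's), one can score each source word's model translation against a probabilistic translation lexicon extracted from word alignments on the raw data:

```python
# Illustrative sketch only -- NOT the paper's exact AoLC definition.
# We score the model's lexical choices against an alignment-derived lexicon
# mapping each source word to its set of acceptable target translations.
def lexical_choice_accuracy(src_words, hyp_words, lexicon):
    """Fraction of lexicon-covered source words whose aligned hypothesis
    word is an acceptable translation (hypothetical helper)."""
    scored, correct = 0, 0
    for s, h in zip(src_words, hyp_words):
        if s not in lexicon:          # skip words without a lexicon entry
            continue
        scored += 1
        if h in lexicon[s]:           # any lexicon translation counts as correct
            correct += 1
    return correct / scored if scored else 0.0

# Toy lexicon: the low-frequency place name maps only to "Newmarket".
lexicon = {"纽马基特": {"Newmarket"}, "跑道": {"going", "track"}}
acc = lexical_choice_accuracy(
    ["纽马基特", "跑道"], ["Newmargot", "going"], lexicon)
# "Newmargot" is a lexical choice error; "going" is acceptable.
```

On this toy pair, one of the two lexical choices is correct, mirroring the kind of low-frequency error shown in Table 1.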
The improvements come from the better lexical translation accuracy of NAT models, low-frequency tokens in particular (AoLC↑), which leads to fewer mistranslations and low-frequency word prediction errors. The main contributions of this work are:

• Our study reveals the side effect of NAT models' knowledge distillation on low-frequency lexicons, which makes standard NAT training on the distilled data sub-optimal.

• We demonstrate the necessity of letting NAT models learn to distill lexical choices from the raw data by themselves.

• We propose a simple yet effective approach to accomplish this goal, which is robustly applicable to several model architectures and language pairs.
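The extra training signal described above can be sketched as a KL-divergence term between a raw-data lexical prior and the model's predicted token distribution. The sketch below uses plain categorical distributions over a toy three-word vocabulary; the variable names, the interpolation weight, and the placeholder loss value are our own assumptions for illustration:

```python
import math

def kl_divergence(prior, model_probs, eps=1e-12):
    """KL(prior || model) between two categorical distributions given as
    probability lists over a shared vocabulary."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(prior, model_probs) if p > 0)

# Toy vocabulary: ["Newmarket", "Newmargot", "Newmarquette"].
# Prior from raw-data word alignments puts its mass on the correct word;
# a NAT model trained only on distilled data prefers the teacher's error.
prior     = [1.0, 0.0, 0.0]
nat_probs = [0.2, 0.7, 0.1]

ce_loss = 2.0   # placeholder for the standard NAT cross-entropy loss
alpha   = 0.5   # hypothetical interpolation weight for the KL term
total_loss = ce_loss + alpha * kl_divergence(prior, nat_probs)
```

The KL term is large exactly when the model's lexical choice disagrees with the raw-data prior, so minimizing the combined loss pulls probability mass back toward the raw data's (correct) low-frequency translation.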

2. PRELIMINARIES

2.1 NON-AUTOREGRESSIVE TRANSLATION

The idea of NAT was pioneered by Gu et al. (2018), and it enables the inference process to run in parallel. Different from AT models, which generate each target word conditioned on the previously generated ones, NAT models break the autoregressive factorization and produce target words in parallel. Given a source sentence x, the probability of generating its target sentence y with length T is calculated as:



Code is available at: https://github.com/alphadl/LCNAT



SRC      今天 纽马 基特 的 跑道 湿软 。
RAW-TGT  The going at Newmarket is soft ...
KD-TGT   Today, Newmargot's runway is soft ...

SRC      纽马 基特 赛马 总是 吸引 ...
RAW-TGT  The Newmarket stakes is always ...
KD-TGT   The Newmarquette races always ...

SRC      在 纽马 基特 3 时 45 分 那场 中 , 我 ...
RAW-TGT  I've ... in the 3.45 at Newmarket.
KD-TGT   I ... at 3:45 a.m. in Newmarquite.

p(y|x) = p_L(T|x; \theta) \prod_{t=1}^{T} p(y_t|x; \theta)    (1)
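A minimal numeric sketch of Eq. (1): once a target length T is predicted, the sentence probability is simply the length probability times a product of independent per-position token probabilities (all numbers below are made up for illustration):

```python
import math

def nat_sentence_logprob(length_logprob, token_probs):
    """log p(y|x) = log p_L(T|x; theta) + sum_t log p(y_t|x; theta),
    following the factorization in Eq. (1)."""
    return length_logprob + sum(math.log(p) for p in token_probs)

# Toy example: predicted length T = 3 with independent token probabilities.
# Unlike an AT model, no token's probability conditions on the others,
# so all positions can be computed in parallel.
lp = nat_sentence_logprob(math.log(0.8), [0.9, 0.5, 0.7])
# p(y|x) = 0.8 * 0.9 * 0.5 * 0.7 = 0.252
```

The conditional independence across positions is precisely what allows parallel decoding, and also why NAT relies so heavily on the (distilled) training data for consistent lexical choices.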

Table 1: All samples that contain the source word "纽马基特" in the raw and distilled training corpora, which differ on the target side (RAW-TGT vs. KD-TGT).

