UNDERSTANDING AND IMPROVING LEXICAL CHOICE IN NON-AUTOREGRESSIVE TRANSLATION

Abstract

Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models: it reduces the complexity of the raw data with an autoregressive teacher model. In this study, we empirically show that, as a side effect of this training, the lexical choice errors on low-frequency words are propagated from the teacher model to the NAT model. To alleviate this problem, we propose to expose the raw data to NAT models so as to restore the useful information about low-frequency words that is missed in the distilled data. To this end, we introduce an extra Kullback-Leibler divergence term derived by comparing the lexical choices of the NAT model with those embedded in the raw data. Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing the lexical choice errors on low-frequency words. Encouragingly, our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively.
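To make the extra KL term described above concrete, the sketch below shows one possible form of such a loss. It is a minimal illustration under stated assumptions, not the paper's exact implementation: the names raw_data_kl_loss, prior (a word-level lexical-choice distribution over the target vocabulary estimated from the raw data, e.g. via word alignment), low_freq_mask, and lambda_kl are hypothetical and introduced here only for illustration.

    import torch
    import torch.nn.functional as F

    def raw_data_kl_loss(logits, prior, low_freq_mask, eps=1e-9):
        # logits: NAT decoder outputs, shape (batch, tgt_len, vocab)
        # prior: lexical-choice distribution from the raw data, same shape as logits
        # low_freq_mask: 1.0 at positions aligned to low-frequency source words, else 0.0
        log_p_model = F.log_softmax(logits, dim=-1)
        # KL(prior || model), summed over the vocabulary at each target position
        kl = (prior * (torch.log(prior + eps) - log_p_model)).sum(dim=-1)
        kl = kl * low_freq_mask  # restrict the extra signal to low-frequency words
        return kl.sum() / low_freq_mask.sum().clamp(min=1.0)

    # total_loss = ce_on_distilled_data + lambda_kl * raw_data_kl_loss(logits, prior, low_freq_mask)

Under these assumptions, the standard cross-entropy on the distilled data is left untouched; the KL term only nudges the per-position output distribution toward the raw-data lexical choices for low-frequency source words.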

1. INTRODUCTION

When translating a word, translation models need to spend a substantial amount of their capacity on disambiguating its sense in the source language and choosing a lexeme in the target language that adequately expresses its meaning (Choi et al., 2017; Tamchyna, 2017). However, neural machine translation (NMT) suffers from a severe lexical choice problem, as it often mistranslates low-frequency words (Koehn & Knowles, 2017; Nguyen & Chiang, 2018; Gu et al., 2020). Non-autoregressive translation (NAT) models, in particular, are typically trained not on the raw data but on data distilled by an autoregressive translation (AT) teacher (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Hao et al., 2021). Recent studies have revealed that knowledge distillation (KD) reduces the modes (i.e., multiple lexical choices for a source word) in the raw data by re-weighting the training examples (Furlanello et al., 2018; Tang et al., 2020), which lowers the intrinsic uncertainty (Ott et al., 2018) and the learning difficulty for NAT (Zhou et al., 2020; Ren et al., 2020). However, the side effect of KD has not been fully studied. In this work, we investigate one such side effect: the lexical choice errors on low-frequency words that the AT teacher propagates to the NAT model, as illustrated by the examples in the table below (the distillation step that produces the KD targets is sketched after the table).



SRC:      今天 纽马 基特 的 跑道 湿软 。
RAW-TGT:  The going at Newmarket is soft ...
KD-TGT:   Today, Newmargot's runway is soft ...

SRC:      纽马 基特 赛马 总是 吸引 ...
RAW-TGT:  The Newmarket stakes is always ...
KD-TGT:   The Newmarquette races always ...

SRC:      在 纽马 基特 3 时 45 分 那场 中 , 我 ...
RAW-TGT:  I've ... in the 3.45 at Newmarket.
KD-TGT:   I ... at 3:45 a.m. in Newmarquite.

Table 1: All samples that contain the source word "纽马基特" (Newmarket) in the raw and distilled training corpora; they differ on the target side (RAW-TGT vs. KD-TGT).
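The KD-TGT sides shown above are produced by sequence-level knowledge distillation: the raw source sentences are re-decoded with a trained AT teacher, and the teacher's outputs replace the original references. The sketch below illustrates this step under simple assumptions; at_teacher.translate is a hypothetical decoding helper, not the API of any particular toolkit.

    def build_distilled_corpus(at_teacher, src_sentences, beam_size=5):
        # Replace each raw target (RAW-TGT) with the teacher's best hypothesis (KD-TGT).
        distilled_pairs = []
        for src in src_sentences:
            kd_tgt = at_teacher.translate(src, beam=beam_size)  # hypothetical teacher API
            distilled_pairs.append((src, kd_tgt))
        return distilled_pairs

The NAT student is then trained on the (SRC, KD-TGT) pairs, which contain fewer modes (alternative lexical choices per source word) than the (SRC, RAW-TGT) pairs, but may also inherit the teacher's errors on low-frequency words such as "纽马基特".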

