HYBRID-REGRESSIVE NEURAL MACHINE TRANSLATION

Abstract

Although non-autoregressive translation models based on iterative refinement have achieved performance comparable to their autoregressive counterparts with faster decoding, we empirically find that such iterative decoding makes the acceleration rely heavily on a small batch size (e.g., 1) and the computing device (e.g., GPU). By designing synthetic experiments, we show that the number of iterations can be significantly reduced when a good (partial) target context is provided. Inspired by this, we propose a two-stage translation prototype, Hybrid-Regressive Translation (HRT). HRT first generates a discontinuous sequence by autoregression (e.g., making a prediction every k tokens, k > 1). Then, with the help of this partially deterministic target context, HRT fills in all the previously skipped tokens in a single non-autoregressive iteration. Experimental results on WMT'16 En↔Ro and WMT'14 En↔De show that our model outperforms both state-of-the-art non-autoregressive models with multiple iterations and the original autoregressive models. Moreover, compared with autoregressive models, HRT achieves a consistent 1.5x speedup regardless of batch size and device.
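The two-stage decoding procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `autoregressive_step` and `fill_step` are hypothetical stand-ins for the model's chunk-wise autoregressive predictor and its one-shot non-autoregressive filler.

```python
# Minimal sketch of two-stage Hybrid-Regressive Translation (HRT) decoding.
# `autoregressive_step` and `fill_step` are hypothetical stand-ins for
# real model calls; only the control flow reflects the prototype.

def hrt_decode(autoregressive_step, fill_step, length, k=2):
    # Stage 1: discontinuous autoregression over positions 0, k, 2k, ...
    # Each prediction conditions on the previously generated skeleton tokens.
    skeleton = {}
    prefix = []
    for pos in range(0, length, k):
        tok = autoregressive_step(prefix, pos)
        skeleton[pos] = tok
        prefix.append(tok)

    # Stage 2: fill all skipped positions in one non-autoregressive pass,
    # conditioned on the partially deterministic target context (skeleton).
    output = [None] * length
    for pos, tok in skeleton.items():
        output[pos] = tok
    missing = [i for i in range(length) if output[i] is None]
    for i, tok in zip(missing, fill_step(output, missing)):
        output[i] = tok
    return output
```

With k > 1, Stage 1 runs only about 1/k as many sequential steps as full autoregression, while Stage 2 adds a single parallel pass; this is the source of the device- and batch-size-independent speedup.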

1. INTRODUCTION

Although autoregressive translation (AT) has become the de facto standard for neural machine translation (Bahdanau et al., 2015), its nature of generating target sentences sequentially (e.g., from left to right) makes it challenging to respond quickly in a production environment. One straightforward solution is non-autoregressive translation (NAT) (Gu et al., 2017), which predicts the entire target sequence in one shot. However, such one-pass NAT models lack dependencies between target words and still struggle to produce fluent translations, despite many dedicated efforts (Ma et al., 2019; Guo et al., 2019a; Wang et al., 2019b; Shao et al., 2019; Sun et al., 2019). Recent studies show that extending one-pass NAT to multi-pass NAT, known as iterative refinement (IR-NAT), can break this performance bottleneck (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Guo et al., 2020; Kasai et al., 2020a). Unlike one-pass NAT, which outputs its prediction immediately, IR-NAT takes the translation hypothesis from the previous iteration as a reference and repeatedly polishes the translation until reaching a predefined iteration count I or until no changes appear in the translation. Compared with AT, IR-NAT with I=10 runs 2-5 times faster with considerable translation accuracy, as reported by Guo et al. (2020). However, we highlight that the fast decoding of IR-NAT heavily relies on a small batch size and a GPU, which is rarely mentioned in prior studies. Without loss of generality, we take Mask-Predict (MP)
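The IR-NAT decoding loop described above can be sketched as follows, in the Mask-Predict style of re-masking low-confidence tokens each round. This is a simplified illustration under stated assumptions: `predict` and `confidence` are hypothetical stand-ins for the model's parallel decoder and its per-token probability scores, and the linearly decaying mask ratio follows the common Mask-Predict schedule.

```python
# Minimal sketch of IR-NAT decoding with iterative refinement
# (Mask-Predict style). `predict` and `confidence` are hypothetical
# stand-ins for real model calls.

def iterative_refine(predict, confidence, length, iterations=10):
    tokens = ["<mask>"] * length            # start fully masked
    for t in range(iterations):
        tokens = predict(tokens)            # re-predict all positions in parallel
        if t == iterations - 1:
            break
        # Re-mask the least confident tokens; the number of masked
        # positions decays linearly over the remaining iterations.
        n_mask = int(length * (iterations - 1 - t) / iterations)
        if n_mask == 0:
            break
        scores = confidence(tokens)
        worst = sorted(range(length), key=lambda i: scores[i])[:n_mask]
        for i in worst:
            tokens[i] = "<mask>"
    return tokens
```

Note that every iteration is a full forward pass over the batch; with large batches or on a CPU, these I passes dominate the runtime, which is why the speedup over AT erodes outside the small-batch GPU setting.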



Unfortunately, such a decoding setting is not common in practice. NMT systems deployed on GPUs tend to use larger batches to increase translation throughput, while a batch size of 1 is used more frequently in offline systems running on CPUs, e.g., smartphones.



Figure 1: Relative speedup ratio (α) of MP compared with AT on GPU (solid) and CPU (dashed). A positive (negative) α denotes that MP runs |α| times faster (slower) than AT.

