HYBRID-REGRESSIVE NEURAL MACHINE TRANSLATION

Abstract

Although non-autoregressive translation models based on iterative refinement have achieved performance comparable to their autoregressive counterparts with faster decoding, we empirically find that such aggressive iteration makes the acceleration rely heavily on a small batch size (e.g., 1) and the computing device (e.g., GPU). By designing synthetic experiments, we show that the number of iterations can be significantly reduced when a good (partial) target context is provided. Inspired by this, we propose a two-stage translation prototype, Hybrid-Regressive Translation (HRT). HRT first autoregressively generates a discontinuous sequence (e.g., making a prediction every k tokens, k > 1). Then, with the help of this partially deterministic target context, HRT fills in all the previously skipped tokens with one iteration in a non-autoregressive way. Experimental results on WMT'16 En↔Ro and WMT'14 En↔De show that our model outperforms state-of-the-art non-autoregressive models with multiple iterations, as well as the original autoregressive models. Moreover, compared with autoregressive models, HRT is steadily accelerated by 1.5 times regardless of batch size and device.

1. INTRODUCTION

Although autoregressive translation (AT) has become the de facto standard for neural machine translation (Bahdanau et al., 2015), its nature of generating target sentences sequentially (e.g., from left to right) makes it challenging to respond quickly in a production environment. One straightforward solution is non-autoregressive translation (NAT) (Gu et al., 2017), which predicts the entire target sequence in one shot. However, such one-pass NAT models lack dependencies between target words and still struggle to produce fluent translations, despite many efforts (Ma et al., 2019; Guo et al., 2019a; Wang et al., 2019b; Shao et al., 2019; Sun et al., 2019). Recent studies show that extending one-pass NAT to multi-pass NAT, so-called iterative refinement (IR-NAT), can break this performance bottleneck (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Guo et al., 2020; Kasai et al., 2020a). Unlike one-pass NAT, which outputs its prediction immediately, IR-NAT takes the translation hypothesis from the previous iteration as a reference and repeatedly polishes the new translation until reaching a predefined iteration count I or until no changes appear in the translation. Compared with AT, IR-NAT with I=10 runs 2-5 times faster with considerable translation accuracy, as reported by Guo et al. (2020). However, we highlight that the fast decoding of IR-NAT relies heavily on a small batch size and a GPU, which is rarely mentioned in prior studies [1]. Without loss of generality, we take Mask-Predict (MP) (Ghazvininejad et al., 2019), a typical IR-NAT paradigm based on the conditional masked language model, as an example. Figure 1 illustrates that once the batch size exceeds 8, MP (I=10) already runs slower than AT, and the situation is even worse on CPU. Further analysis shows that increasing the batch size degrades the efficiency of parallel computing in NAT models [2].
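To make the iterative refinement loop concrete, here is a minimal toy sketch of Mask-Predict style decoding. This is not the authors' implementation: the `predict` stub stands in for a real conditional masked language model (which would also condition on the source sentence), and its vocabulary, confidence values, and the linear re-masking schedule are illustrative assumptions.

```python
MASK = "<mask>"

def predict(tokens):
    """Stub CMLM: returns a (token, confidence) pair for every position.
    A real model would condition on the source sentence as well."""
    vocab = {0: "the", 1: "cat", 2: "sat", 3: "down"}
    return [(vocab[i % 4], 1.0 - 0.1 * (i % 4)) for i in range(len(tokens))]

def mask_predict(length, iterations):
    """Mask-Predict style decoding: start fully masked, then repeatedly
    fill masked slots and re-mask the least confident predictions."""
    tokens = [MASK] * length
    for it in range(iterations):
        preds = predict(tokens)
        # Fill every currently masked position with the model's prediction.
        for i, tok in enumerate(tokens):
            if tok == MASK:
                tokens[i] = preds[i][0]
        # Linearly decay the number of tokens re-masked per iteration.
        n_mask = int(length * (iterations - it - 1) / iterations)
        if n_mask == 0:
            break
        # Re-mask the n_mask lowest-confidence positions for the next pass.
        order = sorted(range(length), key=lambda i: preds[i][1])
        for i in order[:n_mask]:
            tokens[i] = MASK
    return tokens
```

The point of the sketch is the loop structure: each extra iteration is another full decoder pass, which is exactly the cost that grows painful at large batch sizes.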
To tackle this problem, we first design a synthetic experiment to understand the relationship between the target context and the number of iterations. We mask some proportion of the tokens in a translation generated by a pretrained AT model and feed the result as the decoder input of a pretrained MP model. Surprisingly, we find that even when 70% of the AT hypothesis is masked, the remaining target context can help MP (I=1) compete with the standard MP (I=10) (Figure 2). This result confirms that decoding with multiple iterations in NAT is unnecessary when a good (partial) reference hypothesis is provided. Inspired by this, we propose a two-stage translation prototype, Hybrid-Regressive Translation (HRT).
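The masking step of the synthetic experiment can be sketched as follows. The uniform-random choice of positions and the `<mask>` symbol are assumptions for illustration; the paper only states that some proportion of the AT hypothesis is masked.

```python
import random

def mask_hypothesis(hyp, ratio, seed=0):
    """Mask `ratio` of the tokens in an AT hypothesis before feeding it
    to the MP decoder. Positions are chosen uniformly at random here
    (an assumption; other selection schemes are possible)."""
    rng = random.Random(seed)
    n_mask = int(len(hyp) * ratio)
    positions = set(rng.sample(range(len(hyp)), n_mask))
    return ["<mask>" if i in positions else tok
            for i, tok in enumerate(hyp)]
```

For example, `mask_hypothesis(tokens, 0.7)` keeps only 30% of the AT tokens as context, the setting under which MP (I=1) was observed to match MP (I=10).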

2. BACKGROUND

Given a source sentence x = {x_1, x_2, ..., x_M} and a target sentence y = {y_1, y_2, ..., y_N}, there are several ways to model P(y|x):

Autoregressive translation (AT) is the dominant approach in NMT, which decomposes P(y|x) by the chain rule:

P(y|x) = \prod_{t=1}^{N} P(y_t | x, y_{<t})    (1)

where y_{<t} denotes the generated prefix translation before time step t. However, the existence of y_{<t} means the model must wait for y_{t-1} to be produced before predicting y_t, which rules out parallel computation along the time dimension.

Non-autoregressive translation (NAT) was first proposed by Gu et al. (2017), allowing the model to generate all target tokens simultaneously. NAT replaces y_{<t} with a target-independent input z and rewrites Eq. 1 as:

P(y|x) = P(N|x) \prod_{t=1}^{N} P(y_t | x, z)    (2)

Gu et al. (2017) monotonically copy the source embeddings as z according to a fertility model. Subsequent work developed more advanced methods to enhance z, such as adversarial source embedding (Guo et al., 2019a), reordered source sentences (Ran et al., 2019), and latent variables (Ma et al., 2019; Shu et al., 2019), but a large performance gap between AT and NAT remains.

Iterative refinement based non-autoregressive translation (IR-NAT) extends traditional one-pass NAT by introducing a multi-pass decoding mechanism (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Guo et al., 2020; Kasai et al., 2020a). IR-NAT applies a conversion function

[2] An early experiment shows that when the batch size increases from 1 to 32, the latency of AT is reduced by 22 times, while that of MP (I=10) is reduced by only four times. Latency is measured as the average time to translate a sentence on a fixed test set. See Appendix A for details.

[3] Thanks to the proposed training algorithm, a single HRT model can support both hybrid-regressive decoding and autoregressive decoding at inference.
Here, the AT model refers to the autoregressive teacher model that generates the distillation data.
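The contrast between the AT and NAT factorizations above can be sketched as code: AT needs one decoder call per target token, each conditioned on the prefix y_<t, while NAT makes a single parallel call after predicting the length. `step_fn` and `parallel_fn` are hypothetical stand-ins for the respective models, not real APIs.

```python
def at_decode(step_fn, max_len, eos="</s>"):
    """Autoregressive decoding: one decoder call per target token,
    each conditioned on the prefix y_<t (Eq. 1)."""
    prefix = []
    for _ in range(max_len):
        y_t = step_fn(prefix)  # P(y_t | x, y_<t); x is implicit in step_fn
        prefix.append(y_t)
        if y_t == eos:
            break
    return prefix

def nat_decode(parallel_fn, length):
    """Non-autoregressive decoding: after length prediction P(N|x),
    a single call produces all tokens, conditioned only on the
    target-independent input z (Eq. 2)."""
    return parallel_fn(length)  # one shot, no y_<t dependency
```

The sequential loop in `at_decode` is the latency bottleneck that NAT removes, at the cost of dropping the y_<t dependency.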

[1] Unfortunately, such a decoding setting is not common in practice. NMT systems deployed on GPUs tend to use larger batches to increase translation throughput, while a batch size of 1 is used more frequently in offline systems running on CPUs, e.g., smartphones.

Figure 1: Relative speedup ratio (α) of MP compared with AT on GPU (solid) and CPU (dashed). A positive (negative) α denotes running |α| times faster (slower) than AT.

Hybrid-Regressive Translation (HRT) works in two stages. After encoding, HRT first uses an autoregressive decoder (called Skip-AT) to produce a discontinuous translation hypothesis. Concretely, at decoding step i, the Skip-AT decoder directly predicts the (i+k)-th token y_{i+k} without generating y_{i+1}, ..., y_{i+k-1}, where k is a hyperparameter and k > 1. Then, a non-autoregressive decoder like MP (called Skip-MP) predicts the previously skipped tokens in one iteration, according to the deterministic context provided by Skip-AT. Since Skip-AT and Skip-MP share the same model parameters, HRT does not significantly increase the number of parameters. To train HRT effectively and efficiently, we further propose joint training guided by curriculum learning, together with mixed distillation. Experimental results on WMT En↔Ro and En↔De show that HRT is far superior to existing IR-NATs and achieves comparable or even better accuracy than the original AT [3], with a consistent 50% decoding speedup across varying batch sizes and devices (GPU, CPU).
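The two-stage procedure can be sketched minimally as follows. Here `step_fn` and `fill_fn` are hypothetical stand-ins for the shared HRT model in its Skip-AT and Skip-MP roles; real decoding would also handle length termination and condition both stages on the source sentence.

```python
MASK = "<mask>"

def skip_at(step_fn, n_chunks, k):
    """Stage 1: autoregressively predict only every k-th token,
    i.e. y_k, y_2k, ..., skipping the k-1 tokens in between."""
    anchors = []
    for _ in range(n_chunks):
        anchors.append(step_fn(anchors))  # conditioned on previous anchors
    return anchors

def skip_mp(fill_fn, anchors, k):
    """Stage 2: lay the anchors out every k positions, then fill the
    skipped slots with one non-autoregressive (Mask-Predict) pass."""
    draft = []
    for a in anchors:
        draft.extend([MASK] * (k - 1) + [a])  # k-1 masks before each anchor
    filled = fill_fn(draft)                   # single parallel iteration
    return [filled[i] if tok == MASK else tok
            for i, tok in enumerate(draft)]
```

Note that stage 1 runs only N/k sequential steps instead of N, and stage 2 costs a single parallel pass, which is where the batch-size-independent speedup comes from.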

