DEEP ENCODER, SHALLOW DECODER: REEVALUATING NON-AUTOREGRESSIVE MACHINE TRANSLATION

Abstract

Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work, we reexamine this tradeoff and argue that autoregressive baselines can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed. We show that the speed disadvantage for autoregressive baselines compared to non-autoregressive methods has been overestimated in three aspects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation. Our results establish a new protocol for future research toward fast, accurate machine translation. Our code is available at https://github.com/jungokasai/deep-shallow.

1. INTRODUCTION

Fast, accurate machine translation is a fundamental goal with a wide range of applications both in research and production. State-of-the-art neural machine translation systems generate translations autoregressively, where words are predicted one-by-one conditioned on all previous words (Kalchbrenner & Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017). This sequential property limits parallelization, since multiple tokens in each sentence cannot be generated in parallel. A flurry of recent work has developed ways to (partially) parallelize the decoder with non-autoregressive machine translation (NAR; Gu et al., 2018), thereby speeding up inference. NAR tends to suffer in translation quality because parallel decoding assumes conditional independence between the output tokens and prevents the model from properly capturing the highly multimodal distribution of target translations (Gu et al., 2018). Recent work has proposed methods to mitigate this multimodality issue, including iterative refinement (e.g., Lee et al., 2018; Ghazvininejad et al., 2019) and modeling with latent variables (e.g., Ma et al., 2019; Shu et al., 2020). These approaches modify the decoder transformer to find a balance between decoding parallelism and translation quality.

In this work, we adopt a different speed-quality tradeoff. Recent work by Kim et al. (2019) on autoregressive machine translation (AR) suggests that better speed-quality tradeoffs can be achieved with different depths in the encoder and the decoder. Here, we make a formal argument in favor of deep encoder, shallow decoder configurations and empirically demonstrate better speed-quality tradeoffs for AR baselines. We provide extensive speed-quality comparisons between iterative NAR models and AR models with varying numbers of encoder and decoder layers.
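The contrast between the two decoding regimes can be sketched with a toy example. The names below (`next_token`, `token_at`, and friends) are illustrative stubs invented for this sketch, not part of any real system; they only make the control flow concrete.

```python
# Toy contrast between autoregressive (AR) and non-autoregressive (NAR)
# generation. The stub predictors are hypothetical; a real decoder would
# be a neural network.

def next_token(source, prefix):
    # AR: the prediction may condition on all previously generated tokens.
    return f"y{len(prefix)}"

def generate_ar(source, length):
    prefix = []
    for _ in range(length):  # `length` sequential, dependent decoder calls
        prefix.append(next_token(source, prefix))
    return prefix

def token_at(source, position):
    # NAR: each position depends only on the source, which is the
    # conditional-independence assumption discussed in the text.
    return f"y{position}"

def generate_nar(source, length):
    # Every position can be computed in parallel on a GPU.
    return [token_at(source, i) for i in range(length)]
```

Both stubs emit the same tokens here, but `generate_ar` needs `length` dependent steps while `generate_nar` needs a single parallel step; the quality cost of that independence assumption is the multimodality issue discussed above.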
In particular, we use two types of speed measures for translation and discuss their relation to computational complexity. The two measures reflect two different application scenarios: feeding one sentence at a time, and feeding as many words as possible into GPU memory. The first scenario simulates, for example, instantaneous machine translation that translates text (or even speech) input from users. This is where current NAR models shine: we can make full use of parallelism across decoding positions on a GPU. For this reason, much prior work on NAR measures speed only with this metric (e.g., Gu et al., 2018; 2019b; Kasai et al., 2020; Li et al., 2020). The second scenario targets a situation where we want to translate a large amount of text as quickly as possible. In this case, we find that AR models run faster than NAR models by a large margin: computation at each time step is large enough to exploit the parallelism of a GPU, which cancels out the benefit of parallel NAR decoding. Further, AR models can cache all previous hidden states (Ott et al., 2019) and thus compute each step in time linear in the sequence length. In contrast, NAR models require a fresh run of quadratic self- and cross-attention in every decoding iteration.

Interestingly, using a deep encoder and a shallow decoder in NAR models fails to retain the translation accuracy of the standard configuration with 6 layers each ( §4.1). This suggests that departing from AR decoding requires more capacity in the decoder; the deep-shallow strategy is effective specifically for AR models. In particular, our analysis demonstrates that an NAR decoder requires more layers to learn target word ordering ( §5).

In summary, our contributions are the following:
• We challenge three conventional assumptions in NAR evaluation: suboptimal layer allocation, lack of knowledge distillation for AR baselines, and insufficiently general speed measures.
• We provide a complexity analysis and identify a layer allocation strategy that leads to better speed-quality tradeoffs, namely a deep-shallow configuration.
• We perform extensive analyses and head-to-head comparisons of AR and strong NAR models on seven standard translation directions. We demonstrate that the accuracy gap between the two model families is much wider than previously thought, and that NAR models cannot capture target word order well without sufficiently deep decoders.
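As a rough illustration of the complexity contrast above (a sketch under simplifying assumptions, not code from the paper), the snippet below counts query-key attention pairs for an AR decoder with cached states versus an iterative NAR decoder that recomputes full self-attention at every refinement iteration:

```python
def ar_attention_pairs(n):
    # With cached hidden states, step t computes attention for one new query
    # over t keys, so the total query-key pairs over the sequence is 1 + ... + n.
    return sum(range(1, n + 1))

def nar_attention_pairs(n, iterations):
    # Each refinement iteration recomputes full n x n self-attention.
    return iterations * n * n

# Example: a 20-token target and 10 NAR refinement iterations.
print(ar_attention_pairs(20))       # 210
print(nar_attention_pairs(20, 10))  # 4000
```

Both totals are quadratic in n, but the AR cost is spread over n cheap steps that batch well in the large-batch setting, while an iterative NAR model pays the full n × n cost once per iteration.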

2. REEVALUATING NON-AUTOREGRESSIVE MACHINE TRANSLATION

In this section, we critically examine the evaluation practices and assumptions widely held in the non-autoregressive neural machine translation (NAR) literature (e.g., Gu et al., 2018; Ghazvininejad et al., 2019; Kasai et al., 2020). In particular, we focus on three aspects: speed measurement ( §2.1), layer allocation ( §2.2), and knowledge distillation ( §2.3).

2.1. SPEED MEASURES

One major benefit of NAR models over AR ones is their ability to generate text in parallel. Prior work has measured speed solely in the setting of translating one sentence at a time, where full parallelization is trivial on a single GPU. However, we argue that this speed measure is unrealistic in some scenarios: GPU memory is finite, and the GPU is underused in this setting. To address this issue, we use two metrics to measure inference speed:

• S_1 measures speed when translating one sentence at a time. This is the standard metric in prior work and aligns with applications such as instantaneous machine translation, which translates text input from users immediately.

• S_max measures speed when translating in mini-batches as large as the hardware allows. This corresponds to scenarios where one wants to translate a large amount of text given in advance; for instance, such large-batch translation is offered in the Google Cloud service.¹

For all models, both metrics measure wall-clock time from when the model weights are loaded until the last sentence is translated. We report speedups relative to an AR baseline with a 6-layer encoder and a 6-layer decoder, following prior work (Gu et al., 2018; Li et al., 2020; Kasai et al., 2020).
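The two measurement protocols can be sketched as follows. Here `translate` is a hypothetical stand-in for a real translation model, and the batching logic is deliberately simplified; in the actual protocol, timing also includes loading the model weights.

```python
import time

def translate(batch):
    # Hypothetical stand-in for a real model: one output per input sentence.
    return [s.upper() for s in batch]

def measure(sentences, batch_size):
    """Wall-clock time from the first batch until the last translation."""
    start = time.perf_counter()
    outputs = []
    for i in range(0, len(sentences), batch_size):
        outputs.extend(translate(sentences[i:i + batch_size]))
    return outputs, time.perf_counter() - start

sentences = ["a b c", "d e", "f g h i"]
_, t_s1 = measure(sentences, batch_size=1)                 # S_1: one sentence at a time
_, t_smax = measure(sentences, batch_size=len(sentences))  # S_max: largest batch that fits
```

Speedups under either metric are then reported as the ratio of a baseline's wall-clock time to the model's, with the 6-6 AR transformer as the baseline.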



¹ https://cloud.google.com/translate/docs/advanced/batch-translation.

