SPECULATIVE DECODING: LOSSLESS SPEEDUP OF AUTOREGRESSIVE TRANSLATION

Abstract

Unlike some previous work that accelerates autoregressive translation (AT) at the cost of quality, we propose Speculative Decoding (SpecDec), a novel decoding paradigm inspired by speculative execution in computer architecture, which combines the respective advantages of AT and non-autoregressive translation (NAT) for lossless speedup of translation. At each decoding step, SpecDec first speculatively drafts (i.e., decodes) the next k tokens with an NAT model and then verifies them with an AT model; only the drafted tokens that pass verification are accepted as decoded tokens, guaranteeing that the translation result is exactly the same as AT's. The collaboration of NAT drafting and AT verification enables parallel computation and thus a much higher decoding speed without quality loss. We conduct experiments on 4 standard WMT translation benchmarks and confirm that vanilla SpecDec yields exactly the same results as AT greedy decoding with an around 3× speedup, and that its variant (SpecDec++), with an advanced verification strategy, not only outperforms AT greedy decoding but also further improves the decoding speed, resulting in an around 5× speedup over AT. Moreover, SpecDec can be easily generalized to speed up other seq2seq tasks like abstractive summarization, and benefits more from stronger computing devices, demonstrating its potential to become a de facto decoding standard for efficient and lossless seq2seq generation. We will release all our code and checkpoints to facilitate reproducing our results.

1. INTRODUCTION

Since the Transformer (Vaswani et al., 2017) prevailed in Natural Language Processing (NLP), autoregressive decoding has become the de facto standard for neural machine translation (NMT) as well as other generation tasks, because it is easy to train and reliably generates high-quality results. Despite these advantages, autoregressive translation (AT) has been widely criticized for its poor inference efficiency, motivating non-autoregressive translation (NAT). Unlike AT, which sequentially decodes only one token at each iteration so that the next token prediction can condition on the previous decoding results, NAT decodes tokens in parallel without depending on the surface form of previous tokens, largely improving inference efficiency. Recent research in NAT mainly focuses on improving its translation quality to bridge the performance gap between NAT and AT (Gu et al., 2018; Qian et al., 2021; Geng et al., 2021; Savinov et al., 2021). Until now, however, NAT's performance remains less reliable than AT's, as NAT's task is harder given its unawareness of the conditional dependence between translated tokens. Given AT's reliable generation results and NAT's high efficiency, we propose an approach named Speculative Decoding (SpecDec), inspired by speculative execution [1], to combine their advantages and accelerate translation without quality loss compared with AT. SpecDec decomposes a decoding iteration into two substeps: draft and verify. At each iteration, SpecDec first speculatively drafts (i.e., decodes) a fixed number of tokens [2] in parallel through NAT; then, the drafted tokens are verified by an AT model in an autoregressive manner to determine how many of them match AT's (top-1) results and thus can be accepted as translation results, as Figure 1 shows. In contrast to conventional AT, which decodes at a low speed, AT verification is highly efficient because it runs in parallel; more importantly, it guarantees that SpecDec's translation is identical to AT's, resulting in a desirable balance between translation speed and quality, as shown in Figure 2. In addition to vanilla SpecDec, whose translation is required (strictly, by the top-1 matching criterion in AT verification) to be identical to greedy decoding of AT, we propose SpecDec++, an advanced variant of SpecDec that slightly relaxes this rigid requirement during AT verification. SpecDec++ not only yields translations beyond greedy decoding, but also prevents good drafted tokens from being discarded merely because they differ from greedy decoding results, leading to a higher inference speedup.

[Figure 1 image: a worked example translating the source sentence "What are the basic physical laws of the Universe ?" into German, with drafted tokens accepted (✓) or rejected (✗)]

Figure 1: Speculative Decoding, where a decoding iteration involves two substeps: draft and verify. In the Draft substep, an NAT model speculatively drafts (i.e., decodes) a block (block size k = 5 in this example) of tokens in parallel, conditioning on the source sentence and previously decoded tokens. In the Verify substep, the drafted tokens are verified in parallel: the bifurcation is detected as the first position where the drafted token does not match the top-1 result verified by the AT model. The drafted tokens after the bifurcation position are all discarded, guaranteeing that SpecDec's translation is exactly the same as greedy decoding of AT.
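The draft-then-verify loop can be sketched in a few lines of Python. The helper names `draft_nat` and `at_top1` below are hypothetical stand-ins for the NAT drafter and the AT verifier (not the paper's actual interfaces); the sketch only illustrates the acceptance logic at the bifurcation position.

```python
def verify_draft(drafted, at_top1_tokens):
    """Accept drafted tokens up to the first bifurcation (the first position
    where the draft disagrees with the AT model's top-1 token), then take the
    AT token at that position and discard the rest of the block."""
    accepted = []
    for d, v in zip(drafted, at_top1_tokens):
        if d == v:
            accepted.append(d)      # matches AT's top-1: keep the drafted token
        else:
            accepted.append(v)      # bifurcation: fall back to AT's token
            break
    return accepted

def spec_decode(src, draft_nat, at_top1, k=5, eos=2, max_len=128):
    y = []                                   # decoded prefix so far
    while len(y) < max_len and (not y or y[-1] != eos):
        drafted = draft_nat(src, y, k)       # Draft: NAT decodes k tokens in parallel
        verified = at_top1(src, y, drafted)  # Verify: AT top-1 at every drafted position in one parallel pass
        y.extend(verify_draft(drafted, verified))
    return y
```

Note how the loop degenerates gracefully: with a useless drafter, exactly one AT token is accepted per iteration, recovering ordinary greedy decoding; with a perfect drafter, all k tokens are accepted per iteration.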
Experiments on four standard WMT benchmarks show that SpecDec yields exactly the same translations as greedy decoding of AT with a 3× speedup, and that its variant SpecDec++ outperforms greedy decoding with an even higher (∼5×) speedup. Moreover, the SpecDec paradigm can be easily generalized to other seq2seq tasks like abstractive summarization and benefits more from stronger computing devices. Its lossless quality and promising speedup results demonstrate its great potential to evolve into a de facto decoding standard for efficient seq2seq generation in the future.
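One plausible way to relax the top-1 matching criterion is sketched below purely for illustration; the actual SpecDec++ criterion may differ. The idea: accept a drafted token whenever it appears among the AT model's top-β candidates at that position, rather than requiring an exact top-1 match.

```python
def verify_relaxed(drafted, at_candidates):
    """at_candidates[i]: the AT model's top-beta tokens at position i,
    ordered by probability. A drafted token is accepted if it appears
    anywhere in that candidate list; on the first rejection we fall back
    to the AT model's most probable token and discard the rest."""
    accepted = []
    for d, cands in zip(drafted, at_candidates):
        if d in cands:
            accepted.append(d)          # good enough: within AT's top-beta
        else:
            accepted.append(cands[0])   # rejected: take AT's top-1 instead
            break
    return accepted
```

Compared with strict top-1 verification, such a relaxation accepts longer draft prefixes per iteration (hence higher speedup) at the price of no longer matching greedy decoding exactly.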

2. BACKGROUND

2.1 AUTOREGRESSIVE TRANSLATION

Given a source sentence x = (x_1, x_2, . . . , x_n) and the target sentence y = (y_1, y_2, . . . , y_m), an autoregressive translation (AT) model is trained to model the conditional probability of the target via the chain rule, as shown in Eq. (1), where y_<i denotes the target tokens before the i-th position. An AT model is trained with the teacher-forcing strategy, which uses gold target tokens as the previously decoded tokens; training is therefore efficient, as the probability P(y_i | y_<i, x) at every position can be computed in parallel.
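As a toy illustration of this objective (not the paper's code), the snippet below evaluates the teacher-forced log-likelihood of Eq. (1). Because every position conditions on the gold prefix y_<i rather than on model predictions, all m terms of the sum are independent given the data and could be evaluated in one parallel pass. `toy_model` is a hypothetical stand-in for P(y_i | y_<i, x) that returns a uniform distribution over a 3-token vocabulary.

```python
import math

def toy_model(x, prefix):
    # hypothetical stand-in for P(y_i | y_<i, x): uniform over vocab {0, 1, 2}
    return {0: 1/3, 1: 1/3, 2: 1/3}

def at_log_likelihood(x, y, model):
    """Eq. (1): sum over i of log P(y_i | y_<i, x). Teacher forcing feeds the
    gold prefix y[:i] at every position, so each term can be computed without
    waiting for the model's own earlier predictions."""
    return sum(math.log(model(x, y[:i])[y[i]]) for i in range(len(y)))
```

For this uniform toy model, a length-3 target scores 3·log(1/3) ≈ −3.296, matching the closed form m·log(1/|V|).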



[1] Speculative execution is an optimization technique used in computer architecture where a system performs some task in advance to avoid the delay that would be incurred by doing the task only after it is known to be required (https://wikipedia.org/wiki/Speculative_execution).
[2] We use "a block of drafted tokens" to refer to them in the rest of this paper.



Figure 2: Translation quality and efficiency of models on WMT14 EN-DE. The speedup baseline (1.0×) is the Transformer-base (Vaswani et al., 2017) with beam search. All models above except "AT" are trained with KD (knowledge distillation) by a Transformer-big teacher.

$$\mathcal{L}_{\mathrm{AT}} = \log P(y \mid x; \theta_{\mathrm{AT}}) = \sum_{i=1}^{m} \log P(y_i \mid y_{<i}, x; \theta_{\mathrm{AT}}) \qquad (1)$$

