SPECULATIVE DECODING: LOSSLESS SPEEDUP OF AUTOREGRESSIVE TRANSLATION

Abstract

Different from some previous work that accelerates autoregressive translation (AT) at the sacrifice of quality, we propose Speculative Decoding (SpecDec), a novel decoding paradigm inspired by speculative execution in computer architecture, which combines the respective advantages of AT and non-autoregressive translation (NAT) for lossless speedup of translation. At each decoding step, SpecDec first speculatively drafts (i.e., decodes) the next k tokens with an NAT model and then verifies them with an AT model; only the drafted tokens that pass verification are accepted as decoded tokens, which guarantees that the translation result is exactly the same as that of AT. The collaboration of NAT drafting and AT verification leads to a much higher decoding speed without quality loss, owing to the parallel computing enabled by speculative decoding. We conduct experiments on 4 standard WMT translation benchmarks and confirm that the vanilla SpecDec yields exactly the same results as AT greedy decoding with a speedup of around 3×, and that its variant (SpecDec++), with an advanced verification strategy, not only outperforms AT greedy decoding but also further improves the decoding speed, resulting in a speedup of around 5× over AT. Moreover, SpecDec can be easily generalized to speed up other seq2seq tasks such as Abstractive Summarization, and it benefits more from stronger computing devices, demonstrating its potential to become a de facto decoding standard for efficient and lossless seq2seq generation. We will release all our code and checkpoints to facilitate reproducing our results.

1. INTRODUCTION

Since the Transformer (Vaswani et al., 2017) came to prevail in Natural Language Processing (NLP), autoregressive decoding has become the de facto standard for neural machine translation (NMT) as well as other generation tasks, because it is easy to train and reliably generates high-quality results. Despite its advantages, autoregressive translation (AT) has been widely blamed for its poor inference efficiency, motivating non-autoregressive translation (NAT). Unlike AT, which sequentially decodes only one token at each iteration so that the next token prediction can condition on the previous decoding results, NAT decodes tokens in parallel without depending on the surface form of previous tokens, largely improving the inference efficiency.

Recent research in NAT mainly focuses on improving its translation quality to bridge the performance gap between NAT and AT (Gu et al., 2018; Qian et al., 2021; Geng et al., 2021; Savinov et al., 2021). Until now, however, NAT's performance is still less reliable than AT's, as NAT is more difficult than AT given its unawareness of the conditional dependence of translated tokens.

Given AT's reliable generation results and NAT's high efficiency, we propose an approach named Speculative Decoding (SpecDec) to combine their advantages, inspired by speculative execution [1], to accelerate translation without quality loss compared with AT. SpecDec decomposes a decoding iteration into two substeps, draft and verify. At each iteration, SpecDec first speculatively drafts (i.e., decodes) a fixed number of tokens [2] in parallel through NAT; then, the drafted tokens are verified by an AT model in an autoregressive manner to determine how many of them match AT's (top-1) results and thus can be accepted as translation results, as Figure 1 shows. In contrast to conventional AT, which decodes at a low speed, AT verification is highly efficient because it is performed in parallel; more importantly, it helps guarantee that SpecDec's translation is identical to AT's, resulting in a desirable balance between translation speed and quality, as shown in Figure 2.

Figure 1: Speculative Decoding, where a decoding iteration involves two substeps: draft and verify. In the Draft substep, an NAT model speculatively drafts (i.e., decodes) a block (block size k = 5 for this example) of tokens in parallel, conditioning on the source sentence and previously decoded tokens (i.e., the tokens in the rectangle boxes). In the Verify substep, drafted tokens are verified in parallel: bifurcation is detected as the first position where the drafted token does not match the top-1 result verified by an AT model. The drafted tokens after the bifurcation position are all discarded, guaranteeing that SpecDec's translation is exactly the same as greedy decoding of AT.

Figure 2: [...] (Vaswani et al., 2017) with beam search. All models above except "AT" are trained with KD by a Transformer-big teacher.

In addition to the vanilla SpecDec, whose translation is required (strictly, by the top-1 matching criterion in AT verification) to be identical to greedy decoding of AT, we propose SpecDec++, an advanced variant of SpecDec that slightly relaxes the rigid requirement during AT verification. SpecDec++ not only yields translations beyond greedy decoding, but also prevents good drafted tokens from being discarded just because they differ from greedy decoding results, leading to a higher inference speedup.
The experiments on four standard WMT benchmarks show that SpecDec can yield exactly the same translations as greedy decoding of AT with a 3× speedup and that its variant SpecDec++ can outperform greedy decoding with an even higher (∼ 5×) speedup. Moreover, the SpecDec paradigm can be easily generalized to other seq2seq tasks such as Abstractive Summarization and benefits more from stronger computing devices. Its lossless quality and promising speedup results demonstrate its great potential to evolve into a de facto decoding standard for efficient seq2seq generation in the future.

2. BACKGROUND

2.1. AUTOREGRESSIVE TRANSLATION

Given a source sentence $x = (x_1, x_2, \ldots, x_n)$ and the target sentence $y = (y_1, y_2, \ldots, y_m)$, an autoregressive translation (AT) model is trained with the target distribution of conditional probabilities based on the chain rule:

$$\mathcal{L}_{AT} = \log P(y \mid x; \theta_{AT}) = \sum_{i=1}^{m} \log P(y_i \mid y_{<i}, x; \theta_{AT}) \tag{1}$$

where $y_{<i}$ denotes the target tokens before the $i$-th position. As Eq (1) shows, an AT model is trained via the teacher-forcing strategy, which uses target tokens as previously decoded tokens and is efficient because the probability $P(y_i \mid y_{<i}, x)$ at each iteration can be calculated in parallel. During inference, an AT model sequentially predicts output tokens given the preceding decoded tokens:

$$\hat{y} = \arg\max_{y^*} \sum_{j=1}^{m'} \log P(y^*_j \mid y^*_{<j}, x; \theta_{AT}) \tag{2}$$

where $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{m'})$ is the predicted output sentence. Although AT offers desirable translation quality, its sequential decoding scheme with limited parallelism largely reduces its decoding speed and is its main efficiency bottleneck.
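The sequential nature of AT inference can be made concrete with a minimal sketch. It assumes a hypothetical callable `next_log_probs(source, prefix)` that returns log-probabilities for candidate next tokens; this interface and the toy model are illustrative and are not part of the paper's implementation. Greedy decoding is used here as the usual step-by-step approximation of the arg max in Eq (2).

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface (an assumption for illustration): given the source
# tokens and the decoded prefix, return log-probabilities of next tokens.
NextLogProbs = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def greedy_decode(source: Sequence[str], next_log_probs: NextLogProbs,
                  max_len: int = 50, eos: str = "[EOS]") -> List[str]:
    """Decode one token per iteration, each conditioned on all previous ones."""
    prefix: List[str] = []
    for _ in range(max_len):
        scores = next_log_probs(source, prefix)
        token = max(scores, key=scores.get)  # top-1 token at this position
        prefix.append(token)
        if token == eos:
            break
    return prefix


def toy_at_model(source: Sequence[str], prefix: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in for an AT model: prefers copying the source, then [EOS]."""
    target = list(source) + ["[EOS]"]
    gold = target[min(len(prefix), len(target) - 1)]
    return {gold: -0.1, "[UNK]": -3.0}


print(greedy_decode(["a", "b"], toy_at_model))
```

Each loop iteration needs the previous iteration's output, so the number of model calls grows with the output length; this is the bottleneck the rest of the paper targets.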

2.2. NON-AUTOREGRESSIVE TRANSLATION

To accelerate inference, non-autoregressive translation (NAT) (Gu et al., 2018) removes sequential dependence between target tokens with a conditional independence assumption:

$$\mathcal{L}_{NAT} = \sum_{i=1}^{m} \log P(y_i \mid x; \theta_{NAT})$$

In contrast to AT, which will not start predicting $y_j$ until $y_{<j}$ are completely decoded, NAT decodes [3] the output sentence in parallel, which is much more efficient than AT:

$$\tilde{y} = \arg\max_{y^*} \sum_{j=1}^{m'} \log P(y^*_j \mid x; \theta_{NAT})$$

On the other hand, the conditional independence assumption makes it hard to train an NAT model well, leading to degradation in translation quality despite the improvement in decoding speed.

3. SPECULATIVE DECODING

Given that AT translates better whereas NAT runs faster, we propose Speculative Decoding (SpecDec) to combine their respective advantages, inspired by speculative execution, to achieve lossless acceleration for seq2seq generation. Specifically, SpecDec decomposes every decoding iteration into two substeps, draft and verify:

Draft At each iteration, SpecDec first utilizes an NAT model to simultaneously decode a block of drafted tokens (denoted as [MASK] in its decoder input in Figure 1) speculatively, conditioning on preceding translated tokens. Formally, given the source sentence $x = (x_1, x_2, \ldots, x_n)$ and the previously translated tokens $y_{\le j} = (y_1, y_2, \ldots, y_j)$, SpecDec decodes the next k (drafted) tokens as a block in parallel:

$$\tilde{y}_{j+1 \cdots j+k} = \arg\max_{y_{j+1 \cdots j+k}} \sum_{i=1}^{k} \log P(y_{j+i} \mid y_{\le j}, x; \theta_{NAT})$$

Verify Then, the drafted tokens $\tilde{y}_{j+1 \cdots j+k}$ are verified with an AT model in the autoregressive manner, which is performed in parallel. We find the bifurcation position c by comparing the drafted tokens with the autoregressive decoding results conditioning on the draft, as Figure 1 shows:

$$\begin{aligned} c &= \arg\max_{i} \frac{\mathbb{1}(\tilde{y}_{j+i} \neq \hat{y}_{j+i})}{i}, \quad 1 \le i \le k \\ \hat{y}_{j+i} &= \arg\max_{\hat{y}_{j+i}} \log P(\hat{y}_{j+i} \mid y_{\le j}, \tilde{y}_{j+1 \cdots j+i-1}, x; \theta_{AT}) \end{aligned} \tag{5}$$

where $\mathbb{1}(\cdot)$ is the indicator function and $\hat{y}_{j+i}$ is the top-1 result verified by the AT model conditioning on the previously translated tokens $y_{\le j}$ and the drafted tokens $\tilde{y}_{j+1 \cdots j+i-1}$. We only accept the verified tokens up to and including the bifurcation position as translated tokens, which ensures that SpecDec yields the same results as greedy decoding of AT:

$$y_{j+1 \cdots j+c} = \hat{y}_{j+1 \cdots j+c} = (\tilde{y}_{j+1 \cdots j+c-1}, \hat{y}_{j+c})$$

We iterate decoding with the above substeps until the termination condition is met, i.e., the [EOS] token is decoded or the sentence reaches the maximal length. As illustrated, SpecDec is highly efficient because both draft and verify are performed in parallel.
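The draft-and-verify loop can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: `draft_block` and `at_top1` are hypothetical callables standing in for the NAT drafter and the AT verifier's top-1 prediction, the toy models below are invented for the example, and verification is written as a plain loop, whereas the paper computes all k verifier predictions in one parallel pass.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "[EOS]"

# Hypothetical interfaces (assumptions for illustration only).
DraftBlock = Callable[[Sequence[str], Sequence[str], int], List[str]]
AtTop1 = Callable[[Sequence[str], Sequence[str]], str]


def verify(source: Sequence[str], prefix: Sequence[str],
           draft: Sequence[str], at_top1: AtTop1) -> List[str]:
    """Keep drafted tokens before the bifurcation, plus AT's token at it."""
    accepted: List[str] = []
    for drafted in draft:
        # AT's top-1 token given the prefix and the draft tokens accepted so far.
        top1 = at_top1(source, list(prefix) + accepted)
        accepted.append(top1)
        if top1 != drafted or top1 == EOS:  # bifurcation or end of sentence
            break
    return accepted


def spec_decode(source: Sequence[str], draft_block: DraftBlock, at_top1: AtTop1,
                k: int = 5, max_len: int = 50) -> Tuple[List[str], int]:
    """Alternate draft and verify until [EOS] or the length limit."""
    output: List[str] = []
    iterations = 0
    while len(output) < max_len and (not output or output[-1] != EOS):
        draft = draft_block(source, output, k)
        output.extend(verify(source, output, draft, at_top1))
        iterations += 1
    return output[:max_len], iterations


# Toy models: the "AT verifier" always follows REFERENCE, and the "NAT
# drafter" copies it but gets every third position wrong.
REFERENCE = ["Was", "sind", "die", "grundlegenden", "physikalischen", "Gesetze", EOS]


def toy_at_top1(source: Sequence[str], prefix: Sequence[str]) -> str:
    return REFERENCE[min(len(prefix), len(REFERENCE) - 1)]


def toy_draft_block(source: Sequence[str], prefix: Sequence[str], k: int) -> List[str]:
    block = []
    for pos in range(len(prefix), len(prefix) + k):
        token = REFERENCE[min(pos, len(REFERENCE) - 1)]
        block.append("[UNK]" if pos % 3 == 2 else token)
    return block


result, iterations = spec_decode(["What", "are", "the", "laws", "?"],
                                 toy_draft_block, toy_at_top1, k=5)
print(result, iterations)
```

Because every accepted token is the verifier's own top-1 choice given the tokens before it, the output equals what token-by-token greedy decoding with the same verifier would produce; only the number of iterations changes.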
Figure 3: Illustration of SpecDec++ (β = 3 in this example). [...] As a result, SpecDec++ allows more drafted tokens to be accepted even if they are slightly different from the top-1 result of the AT verifier, leading to a higher inference speedup.

3.1. NAT DRAFTER

As demonstrated above, an NAT model is the key to speculative decoding, since it can efficiently generate a block of drafted tokens in parallel. Our NAT drafter differs from other NAT models in two aspects. First, we only require the NAT drafter to decode a block (i.e., a fixed length) of tokens in each decoding iteration, instead of the whole sequence. Second, as illustrated in Figure 1, since we decode from left to right, the NAT drafter is required to decode tokens conditioning on the previously decoded tokens. Formally, given the source sentence $x = (x_1, \cdots, x_n)$ and the randomly sampled prefix $y_{\le p}$ ($0 \le p < m$) of the target sentence $y = (y_1, \cdots, y_m)$, the model is trained to predict the next k tokens, as shown in Figure 1:

$$\mathcal{L}_{NAT} = \sum_{i=p+1}^{p+k} \log P(y_i \mid y_{\le p}, x; \theta_{NAT})$$

In addition, we leverage the glancing strategy following Qian et al. (2021), which exploits curriculum learning during training to get better performance. As in previous NAT work, we apply sequence-level knowledge distillation (Seq-KD) (Kim & Rush, 2016) with an autoregressive Transformer teacher model to train our NAT drafter.
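To make the training objective more tangible, the sketch below builds one training example for the drafter: it samples a prefix length p, feeds the prefix followed by k [MASK] slots, and uses the next k target tokens as labels. This is a sketch under assumptions rather than the paper's data pipeline: the function name, the [BOS] handling, and the [PAD] labels used when fewer than k target tokens remain are all hypothetical choices, and the glancing strategy is omitted.

```python
import random
from typing import List, Sequence, Tuple


def make_drafter_example(target: Sequence[str], k: int, rng: random.Random,
                         mask: str = "[MASK]", pad: str = "[PAD]",
                         bos: str = "[BOS]") -> Tuple[int, List[str], List[str]]:
    """Build (p, decoder_input, labels) for one randomly sampled prefix."""
    m = len(target)
    p = rng.randrange(0, m)  # prefix length, 0 <= p < m
    prefix = list(target[:p])
    # Decoder input: the known prefix followed by k slots to fill in parallel.
    decoder_input = [bos] + prefix + [mask] * k
    # Labels: the next k target tokens. Padding past the end of the target is
    # an assumption of this sketch, not a detail stated in the paper.
    labels = list(target[p:p + k])
    labels += [pad] * (k - len(labels))
    return p, decoder_input, labels


rng = random.Random(0)
target = ["Was", "sind", "die", "grundlegenden", "Gesetze", "[EOS]"]
p, decoder_input, labels = make_drafter_example(target, k=3, rng=rng)
print(p, decoder_input, labels)
```

Sampling p uniformly exposes the drafter to prefixes of every length, which matches how it is used at inference time, where the accepted prefix grows by a variable number of tokens per iteration.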

3.2. AT VERIFIER

We use the conventional Transformer (see Section 2.1) as our AT verifier, which is the key to guaranteeing translation quality. Because we want as many of the NAT model's drafted tokens as possible to be accepted by the AT verifier for a higher speedup, we also apply Seq-KD to the AT verifier with a teacher shared with the NAT drafter, which not only makes the NAT drafter and AT verifier behave similarly, but also improves the AT verifier's translation quality (Furlanello et al., 2018).

4. SPECDEC++

As shown in Figure 1 and discussed in Section 3, the vanilla SpecDec only accepts the drafted tokens that match the top-1 result of the AT verifier, which guarantees that SpecDec's translation is identical to greedy decoding of AT. However, the top-1 results are not necessarily better than the drafted tokens. As a result, the strict verification criterion (i.e., top-1 matching) causes many good drafted tokens to be discarded just because they differ from the top-1 result of the AT verifier, which limits the speedup of SpecDec. To overcome this limitation, we propose a variant of SpecDec named SpecDec++, which is illustrated in Figure 3. Instead of the rigid top-1 matching requirement shown in Eq (5), SpecDec++ relaxes the criterion to trust NAT's draft more, by only requiring the drafted tokens to fall in the top-β candidates with a tolerable (log-likelihood) score gap τ (away from the top-1 result):

$$\hat{y}_{j+i} = \begin{cases} \tilde{y}_{j+i}, & \text{if the SpecDec++ requirement is met} \\ \text{same as Eq (5)}, & \text{otherwise} \end{cases}$$

As discussed above, the SpecDec++ requirement is met if Eq (6) and Eq (7) are both true:

$$\log P(\tilde{y}_{j+i} \mid y_{\le j}, x; \theta_{NAT}) \ge \log P(\hat{y}^{(\beta)}_{j+i} \mid y_{\le j}, \tilde{y}_{j+1 \cdots j+i-1}, x; \theta_{AT}) \tag{6}$$

$$\log P(\hat{y}^{(1)}_{j+i} \mid y_{\le j}, \tilde{y}_{j+1 \cdots j+i-1}, x; \theta_{AT}) - \log P(\tilde{y}_{j+i} \mid y_{\le j}, x; \theta_{NAT}) \le \tau \tag{7}$$

where $\log P(\hat{y}^{(\beta)}_{j+i} \mid \cdot)$ is the log-likelihood score of the top-β ranked result by the AT verifier. The advanced verification criterion, with the hyperparameters top-β and tolerance τ, not only allows more drafted tokens to be accepted for a higher speedup but also enables SpecDec++ to yield translations beyond greedy decoding.
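The relaxed acceptance test for a single position can be sketched as follows. The function name and signature are hypothetical; the caller supplies the drafted token's log-likelihood score (`draft_log_prob`) and the AT verifier's log-probabilities over candidates at that position (`at_log_probs`), so the sketch only encodes the two inequalities of Eq (6) and Eq (7) and takes no position on how those scores are produced.

```python
import math
from typing import Sequence


def meets_relaxed_criterion(draft_log_prob: float,
                            at_log_probs: Sequence[float],
                            beta: int, tau: float) -> bool:
    """Check the top-beta condition (Eq 6) and the score-gap condition (Eq 7)."""
    ranked = sorted(at_log_probs, reverse=True)
    top1 = ranked[0]
    # Score of the beta-th ranked candidate (clamped to the candidate count).
    top_beta = ranked[min(beta, len(ranked)) - 1]
    in_top_beta = draft_log_prob >= top_beta   # Eq (6)
    within_gap = top1 - draft_log_prob <= tau  # Eq (7)
    return in_top_beta and within_gap


# Toy verifier distribution over four candidates at one position.
at_scores = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]

# A draft token scored like the second-best candidate passes with the
# high-quality setting reported in the paper (beta = 3, tau = 1.0).
print(meets_relaxed_criterion(math.log(0.3), at_scores, beta=3, tau=1.0))
```

With β = 1 the first condition can only hold for a token scored at least as high as the verifier's top-1 candidate, which recovers the strict matching of the vanilla SpecDec; larger β and τ trade fidelity to greedy decoding for more accepted tokens per iteration.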

Datasets and Evaluation

We mainly evaluate our approach on the most recognized machine translation benchmark: WMT14 English-German translation, which contains 4.5M translation pairs for training. Following prior work (Ott et al., 2018), we adopt newstest-13 as our validation set for finding the best hyperparameters and model checkpoints, and test on newstest-14. We use 32K Byte Pair Encoding (BPE) (Sennrich et al., 2016) subwords [4] as the joint source-target dictionary. Following prior work in NAT, we report BLEU (Papineni et al., 2002) to facilitate translation quality comparison. For inference efficiency, we report both the average number of decoding iterations and the speedup over beam search. Specifically, we test the inference speed by running the model with one sentence at a time (batch=1) [5] using the fairseq implementation [6] on 1 Nvidia P100 GPU. In addition to WMT14 EN-DE, we also test SpecDec on the WMT14 DE-EN and WMT16 EN-RO/RO-EN benchmarks, as in previous NAT work.

Model Configuration

We mainly conduct experiments on the most commonly used base-size Transformer (Vaswani et al., 2017) architecture. The Transformer-base [7] has a 6-layer encoder and a 6-layer decoder. Its embedding dimension, FFN dimension, and number of heads are 512, 2,048, and 8, respectively. We use this model architecture for both the drafter (NAT) and the verifier (AT). We apply sequence-level knowledge distillation, as discussed in Sections 3.1 and 3.2, to both the drafter and the verifier using a shared teacher. Following recent iterative NAT work (Saharia et al., 2020; Savinov et al., 2021), we use Transformer-big as the teacher for WMT14 EN-DE/DE-EN [8] and Transformer-base for WMT16 EN-RO/RO-EN; all teachers are trained on the raw training set and generate the distilled training set with beam search (b = 5). We include model training details in Appendix A.

5.2. MAIN RESULTS

The translation quality and speedup results on WMT14 EN-DE are presented in Table 1. Unlike previous NAT approaches that are inferior to AT with Seq-KD (i.e., our AT verifier), SpecDec introduces a speedup of around 3× with exactly the same translation quality as (autoregressive) greedy decoding by our AT verifier, truly achieving lossless acceleration. SpecDec++ further improves the results by relaxing the strict top-1 matching criterion: slightly relaxing it (i.e., SpecDec++ high-quality) allows us to achieve better translation than greedy decoding, even approaching the beam search result, with a higher speedup (3.0× → 3.6×), and relaxing it a little more aggressively (i.e., SpecDec++ high-efficiency) further accelerates inference (3.6× → 4.5×) owing to the acceptance of more tokens, despite a marginal loss of translation quality.

Table 1: [table body not recovered; surviving column headers: Models, Iter.]

Looking into the results, our NAT drafter's translation quality is better than that of the majority of early fully NAT work but inferior to most iterative NAT approaches. Compared with NAT models that include complicated mechanisms such as length prediction, length beam, reranking, and CTC, which slow down the efficiency per iteration [9], our NAT drafter is simple and straightforward. As a result, its decoding efficiency per iteration is higher, even comparable to fully NAT despite taking 1.6 decoding iterations on average. The acceptable translation quality and high efficiency of our NAT drafter significantly help accelerate autoregressive decoding, playing an instrumental role in the lossless acceleration of SpecDec.

Results for other language pairs are similar to those for WMT14 EN-DE. We include details in Table 8 in Appendix B.1.

5.3.1. HYPERPARAMETER

Block Size k We conduct experiments with various block sizes on the development set and show the results in Table 2. As the block size k increases, the mean number of accepted tokens, which highly correlates with speedup and the number of decoding iterations, first increases and reaches a peak when k = 25. Further increasing k has an adverse effect, because it becomes very hard for the model to learn to translate too many tokens simultaneously given the limited model capacity, leading to a drop in both efficiency and quality.

Top-β and Tolerance τ in SpecDec++ We study the effects of the hyperparameters in SpecDec++, top-β and tolerance τ, and show the results on the development set in [...] less strict but also improves the translation quality over greedy decoding. However, the translation quality decreases if the constraints are over-relaxed: the BLEU score degrades from the peak of 27.02 to 26.64 when decoding with top-5 selection (i.e., β = 5) and τ = 5.0. Based on the results on the development set, we conservatively select β = 3, τ = 1.0 for the high-quality SpecDec++, and use β = 5, τ = 3.0 for the high-efficiency SpecDec++ to pursue a higher speedup without substantial loss of translation quality for WMT14 EN-DE, as in Table 1.

5.3.2. MODEL SIZE

In addition to models of the base-size configuration, we also study larger models to test the effectiveness of SpecDec. Here we use Transformer-big (Vaswani et al., 2017) as the model architecture for both the NAT drafter and the AT verifier in SpecDec/SpecDec++ [10], and compare it with the conventional Transformer-big baseline as well as Blockwise Decoding (Stern et al., 2018), a state-of-the-art efficient Transformer-big variant that introduces k - 1 additional heads on top of the Transformer decoder to generate the next k tokens as a block and then verifies them, which works in a similar way to ours. According to Table 4, SpecDec/SpecDec++ substantially speeds up the AT baselines and outperforms Blockwise Decoding with both better results and a higher speedup. Blockwise Decoding's performance drops significantly when k increases to 10, due to its underinvestment in speculation: it uses only lightweight heads to generate the next few tokens in parallel. In contrast, SpecDec/SpecDec++, using an independent NAT drafter, is much more capable of generating drafted tokens that can be accepted, which results in a significantly higher speedup despite introducing more parameters (see Appendix C for a detailed discussion of memory cost issues).

Moreover, we observe that big-size models can use a larger block size (k = 30) than base-size models (k = 25), since larger capacity equips the model with a stronger ability to learn to decode more tokens well in parallel. To better demonstrate this point, we conduct a comparative study of drafters of different sizes in SpecDec-base, given the same block size (k = 30). According to Table 5, the big-size NAT drafter largely outperforms its base-size counterpart, and the drafter with lightweight heads (as used in Stern et al. (2018)) performs worst. This demonstrates that a stronger drafter generates drafted tokens more reliably (i.e., on average more drafted tokens are accepted by the AT verifier), resulting in fewer decoding iterations, and indicates that SpecDec may be further improved if equipped with a more powerful (not necessarily larger) NAT drafter.

5.3.3. OTHER SEQ2SEQ TASKS

We test SpecDec's effectiveness on one of the most representative seq2seq tasks: Abstractive Summarization. We employ the distilled training data of CNN Daily Mail (Hermann et al., 2015) from BART (Lewis et al., 2020) to train the NAT drafter and the AT verifier, whose model architectures are both BART-base, and test on the CNN Daily Mail test split following previous work. According to Table 6, the vanilla SpecDec consistently achieves exactly the same result as the AT verifier (b = 1); to the best of our knowledge, this is the first work that achieves such a 3× lossless speedup for Abstractive Summarization. SpecDec++ further accelerates inference but does not show the quality improvement observed in the NMT experiments, because of the larger performance gap between the NAT drafter and the AT verifier on the abstractive summarization benchmark.

5.4. DISCUSSION

Extensive experiments on multiple tasks show that SpecDec can significantly speed up seq2seq generation without quality loss. The state-of-the-art lossless speedup is attributed to the substantially improved computational parallelism, which allows better utilization of (hardware) computing resources. We believe SpecDec is promising and can benefit even more from evolving processor hardware that will become increasingly powerful and better at parallel computing (shown in Appendix E).

As a preliminary study, SpecDec is far from perfect and has much headroom for improvement. First, according to the experimental results above, SpecDec's translation quality mainly depends on the AT verifier, and its efficiency relies on the NAT drafter (whose capability determines how many drafted tokens can be accepted). We believe more powerful NAT/AT models (than the simple and naive ones used in this paper) will help SpecDec achieve better results. Moreover, SpecDec's potential can be further exploited by optimizing its implementation in computing and memory access. For example, according to Table 18, which shows the time cost by module in SpecDec++, our naive implementation spends approximately 16% of the overall time (sequentially) encoding the input for AT and NAT. This part can be optimized by performing AT and NAT encoding in parallel, because they are independent, or by sharing (or partially sharing) AT's encoder with NAT, which we leave for future exploration. Also, the NAT decoder costs more than the AT decoder because it employs bi-directional attention and cannot save the computation for the already decoded tokens as AT does, which we believe can be improved in the future with a better non-autoregressive decoding mechanism designed for SpecDec.

6. RELATED WORK

Non-autoregressive Decoding To accelerate autoregressive translation (AT), Gu et al. (2018) first proposed Non-Autoregressive Translation (NAT), which decodes the output sentence in one single iteration despite translation quality loss. Recent work has mainly focused on improving the quality while maintaining competitive speedups, including applying various training objectives (Wang et al., 2019; Wei et al., 2019; Shu et al., 2020; Shao et al., 2020; Guo et al., 2020; Liu et al., 2021; Ding et al., 2021b; Zeng et al., 2022; Shao et al., 2022), modeling dependencies between target tokens (Kaiser et al., 2018; Sun et al., 2019; Ghazvininejad et al., 2019; Liu et al., 2020; Bao et al., 2021; Zhan et al., 2022; Zhu et al., 2022), and refining the translation outputs with multi-pass iterations (Chan et al., 2019; Stern et al., 2019; Sun & Yang, 2020; Ghazvininejad et al., 2020b; Ding et al., 2021a; Norouzi et al., 2022). However, due to the inherent conditional independence assumption, NAT models' translation quality is generally less reliable than AT's.

Semi-autoregressive Decoding There are also some attempts to combine autoregressive (AR) and non-autoregressive (NAR) decoding: Wang et al. (2018) proposed to utilize NAR decoding locally while keeping the AR property globally; in contrast, Ran et al. (2020) and Kong et al. (2020) introduced a local-AR model that retains the NAR property globally. Similar ideas have also been proposed for Grammatical Error Correction (GEC): Chen et al. (2020) proposed to use a sequence tagger to identify the spans of grammatical errors and then use AR decoding to correct them; Aggressive Decoding (Sun et al., 2021) is the first work that introduces speculative decoding into GEC. It assumes that the input is the sentence to be generated in the future (i.e., there are no grammatical errors in the input), and then verifies the whole sentence in parallel through greedy decoding of AT. However, Aggressive Decoding works only for tasks where the input and output are highly similar, which limits its application. The most similar work to ours is Blockwise Decoding (Stern et al., 2018), which proposed to additionally insert k - 1 NAR heads on top of the Transformer decoder to generate k positions in parallel and use the original AR head to verify these outputs. However, its underinvestment in the NAR modeling seriously limits its performance, resulting in a much lower efficiency than our approach.

7. CONCLUSION AND FUTURE WORK

We propose a novel Speculative Decoding (SpecDec) paradigm as well as its variant (SpecDec++) by combining the respective advantages of AT and NAT. Contrary to the stereotype that more models (parameters) tend to slow down inference, SpecDec's introduction of an additional NAT model substantially speeds up AT without quality loss, achieving a state-of-the-art lossless acceleration result owing to the higher computational parallelism that the idea of speculative execution introduces to better utilize computing resources. Besides machine translation, SpecDec can be easily generalized to other seq2seq tasks such as Abstractive Summarization and benefits from stronger computing devices (discussed in Appendix E). It demonstrates a novel yet promising perspective for efficient seq2seq generation, orthogonal to the efforts to advance state-of-the-art NAT and AT models, which can further benefit SpecDec. Despite the state-of-the-art results, SpecDec still has much headroom for improvement, as we discuss in Section 5.4. We hope that our preliminary study will draw more attention to improving this promising decoding paradigm, which has the potential to evolve into a de facto standard for efficient and lossless seq2seq generation in the near future.

B.2 SPEEDUP DISTRIBUTION

To further understand the acceleration effects of SpecDec, we present the per-sentence speedup distribution on the WMT14 EN-DE test set (which has 3,003 sentences in total) in Figure 4, showing that most sentences are translated with a 3× to 6× speedup compared to the beam search baseline, while some rare cases even achieve over 10× speedup.

Figure 4: [plot not recovered] Per-sentence speedup distribution on the WMT14 EN-DE test set.

B.3 WORD REPETITIONS

With the conditional independence assumption, NAT models show a serious weakness in modeling highly multimodal distributions. The token repetition ratio is often utilized as a proxy to measure this multi-modality problem, which represents the degree of the text inconsistency. However, the role of our AT verifier guarantees that this problem does not exist in SpecDec. As shown in Table 9 , the token repetition ratio of SpecDec/SpecDec++ is similar to that of our AT baseline, which is significantly lower than most relevant NAT models.
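As a rough illustration of the metric, the sketch below counts tokens that repeat the token immediately before them and divides by the total number of tokens. This exact definition is an assumption for illustration; the paper does not spell out its formula in this section, and other NAT papers may normalize differently.

```python
from typing import Iterable, Sequence


def repetition_ratio(sentences: Iterable[Sequence[str]]) -> float:
    """Fraction of tokens identical to the token immediately before them."""
    repeated = 0
    total = 0
    for tokens in sentences:
        total += len(tokens)
        # Compare each token with its left neighbour within the same sentence.
        repeated += sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeated / total if total else 0.0


outputs = [
    ["Was", "sind", "sind", "die", "Gesetze"],  # one repeated token
    ["Das", "ist", "gut"],                      # no repetition
]
print(repetition_ratio(outputs))
```

Fully parallel NAT decoding tends to emit such adjacent duplicates when neighbouring positions commit to the same word from different plausible translations, which is why the ratio serves as a proxy for the multi-modality problem.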

B.4 TEACHER MODEL(S)

We study the effects of the teacher [11] on SpecDec by comparing the results of a single teacher with those of a teacher ensemble of 3 Transformer-big models in Table 10. Compared with a single teacher model, the teacher ensemble improves the NAT drafter, the AT verifier, and the end-to-end SpecDec/SpecDec++ results, indicating that our approach can benefit from a better teacher.

[...] (Post, 2018) and COMET [13] (Rei et al., 2020) scores in order to provide a reference for future research. SpecDec/SpecDec++ also achieves lossless speedup when evaluated with sacreBLEU and COMET. Schmidt et al. (2022) pointed out that inconsistencies in the use of tokenized BLEU lead to deviations of up to 1.8 BLEU points. Therefore, we recommend that future research use sacreBLEU when comparing with our work.

[Table with columns BLEU, SacreBLEU, COMET not recovered.]

[...] 15 show the comparisons of peak GPU memory footprint [14] (MB) between SpecDec and AT (during inference) in the above two scenarios (i.e., MT and summarization). The results are consistent with our analysis above: the majority of the additional memory cost (i.e., ∆Memory) is for storing the NAT drafter's weights, and the additional memory cost is unlikely to increase significantly as the batch size or sequence length increases. Our experiments above pre-loaded both the NAT drafter and the AT verifier. In fact, it is also possible to load the static weights of the AT verifier and NAT drafter lazily, concurrently with GPU computation, to save memory, as the two models run alternately. However, this is usually unnecessary in practice, because for a seq2seq model deployed on modern GPUs for online service, it is latency rather than memory that is the performance bottleneck. See the next section for more discussion.

C.2 MEMORY IS RARELY THE BOTTLENECK

To understand the performance bottleneck of online deployed seq2seq models, we test the latency and memory cost of T5-large [15] (around 770M parameters) with fp16 on 1 Nvidia A40 GPU running greedy decoding on the machine translation and abstractive summarization tasks, and show results in [...]. For MT, T5-large's latency is over 1 second, which is too long to be acceptable because most MT engines in practice require the latency to be less than 100ms. However, its memory cost is less than 2GB, far below the A40 GPU's memory capacity (i.e., 48GB [16]). For abstractive summarization, even if the batch size increases to 32, its memory cost is still less than 50% of 1 A40 GPU's capacity, but its latency is already close to 5 seconds, which is too long for an online service in practice.

To sum up, latency is the bottleneck of seq2seq models for online deployment in most cases. Therefore, we do not think the additional memory cost of SpecDec will undermine its practical value; instead, we think a significant lossless acceleration even at the cost of memory (i.e., a time-memory trade-off) is much more meaningful than acceleration at the cost of quality, and it should be the path that deserves more attention given the large memory headroom on modern GPUs.

D PROFILING

We show the inference time cost by modules in SpecDec++ in Table 18 . The current naive implementation costs a total of approximately 16% of overall time to (sequentially) encode the input for the AT and NAT model, which can be obviously optimized. Also, the NAT decoder costs more than the AT decoder because of the multi-round computation for previously decoded tokens. 

E SPECDEC ON VARIOUS COMPUTING DEVICES

We also test [17] the inference efficiency of SpecDec with the batch implementation [18] on various GPUs, as shown in Figure 5. More powerful devices (e.g., V100/A100) clearly benefit SpecDec/SpecDec++ more (i.e., yield a higher speedup). Therefore, we believe SpecDec's speedup ratio in the future can be much higher than the number we report in this paper, because it can benefit more from evolving processor hardware that will become increasingly powerful and better at parallel computing (e.g., Nvidia H100 [19] with fp8 support).

F DISCUSSIONS OF BEAM SEARCH

For possible concerns that SpecDec may not be compatible with beam search, we make three points here:
1. As Kim & Rush (2016) mentioned, knowledge distillation (KD) largely narrows the performance gap between beam search and greedy decoding. In practice, greedy decoding can be comparable to beam search after KD.
2. In practical online deployment, KD is used almost by default to enhance the results of student models, and greedy decoding is much more common than beam search because it is more cost-effective: it not only runs faster than beam search but also achieves decent performance with a student model trained through KD (as addressed in Point 1).



Speculative execution is an optimization technique used in computer architecture where a system performs some task in advance to avoid delays that would have to be incurred by doing the task after it is known that it is required (https://wikipedia.org/wiki/Speculative_execution).
We use "a block of drafted tokens" to denote them in the following parts of this paper.
We use y to denote NAT's translations, while we use ŷ to denote AT decoded/verified translation results.
We use the same BPE tokenization and vocabulary as Ghazvininejad et al. (2019).
We report performances with various batch sizes in Appendix E.
https://github.com/pytorch/fairseq
Beam search (b = 5) is our speed baseline (1.0×).
We also include results of a Transformer with a deep encoder and a shallow decoder in Appendix B.5.
As a reference, the teacher models of Saharia et al. (2020) achieve 29.5 and 32.2 BLEU in WMT14 EN-DE and DE-EN respectively, which are stronger than our teachers achieving 29.3 and 32.1 respectively.
For example, RewriteNAT (Geng et al., 2021) uses length beam size 5, slowing down its efficiency.
The hyperparameters (e.g., block size k, Top-β, tolerance τ) in SpecDec (big) are re-tuned on the development set, which may be different from those in the base-size models.
We will release all our teachers to facilitate reproducing our results.
https://github.com/mjpost/sacrebleu
Obtained with wmt20-comet-da from version 1.1.0.
Tested with the command torch.cuda.max_memory_allocated().
In practice, T5-large is rarely deployed for online service because it is too large and expensive to serve.
It can also easily scale to 96GB with NVIDIA NVLink connection of multiple GPUs.
We test with the batch size up to 32 because as the batch size increases, the inference latency per batch will become higher. Therefore, it is impractical to use a large batch size during inference, as Rajbhandari et al. (2022) points out.
For simplifying implementation, we do not use incremental states in SpecDec to save the computation for previously decoded tokens as conventional AT, which means that its result is probably much underestimated. But even so, SpecDec/SpecDec++ still shows very promising speedup.
https://www.nvidia.com/en-us/data-center/h100/



Figure 2: Translation quality and efficiency of models on WMT14 EN-DE. The speedup baseline (1.0×) is the Transformer-base (Vaswani et al., 2017) with beam search. All models above except "AT" are trained with KD by a Transformer-big teacher.

Figure 3: Illustration of SpecDec++. Compared to the vanilla SpecDec strictly requiring the drafted tokens to match the top-1 result of the AT verifier, SpecDec++ slightly relaxes the criterion to trust NAT's draft more, by only requiring the drafted tokens to fall in the top-β candidates of the AT verifier with a tolerable log-likelihood gap (not shown in this Figure; see Eq (7)). As a result, SpecDec++ allows more drafted tokens to be accepted even if they are slightly different from the top-1 result of the AT verifier, leading to a higher inference speedup.
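The two acceptance rules contrasted in the Figure 3 caption can be sketched for a single position as follows. This is a minimal illustration under assumptions, not the paper's implementation: Eq. (7) is not reproduced in this section, so the log-likelihood gap is assumed here to be the verifier's top-1 log-probability minus the drafted token's log-probability, compared against τ. The function names and the toy distribution are hypothetical.

```python
import math


def accepts_vanilla(drafted_token, verifier_logprobs):
    """Vanilla SpecDec: accept only if the draft equals the verifier's top-1."""
    top1 = max(verifier_logprobs, key=verifier_logprobs.get)
    return drafted_token == top1


def accepts_relaxed(drafted_token, verifier_logprobs, beta, tau):
    """SpecDec++-style check (assumed form): the draft must be among the
    verifier's top-beta candidates AND its log-likelihood gap to the
    top-1 candidate must not exceed tau."""
    ranked = sorted(verifier_logprobs, key=verifier_logprobs.get, reverse=True)
    if drafted_token not in ranked[:beta]:
        return False
    gap = verifier_logprobs[ranked[0]] - verifier_logprobs[drafted_token]
    return gap <= tau


# Toy verifier distribution at one position (probabilities are made up).
logprobs = {
    "Gut@@": math.log(0.50),
    "der": math.log(0.30),
    "die": math.log(0.15),
    "das": math.log(0.05),
}

print(accepts_vanilla("der", logprobs))                   # False: not top-1
print(accepts_relaxed("der", logprobs, beta=3, tau=1.0))  # True: rank 2, gap ~0.51
print(accepts_relaxed("das", logprobs, beta=3, tau=3.0))  # False: outside top-3
```

Both conditions are needed: top-β alone would accept a low-probability candidate when the distribution is peaked, while the gap alone would ignore its rank.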

Figure 4: Single sentence speedup distribution by SpecDec++ (k = 25, high-efficiency).

Figure 5: Inference latency of SpecDec-base (k = 25, top-5, τ = 3.0) on P100 (fp32), V100 (fp16) and A100 (fp16). The results are obtained on WMT14 EN-DE. The speedup baseline (1×) is AT (b = 5) when batch = 32.



Moderately increasing τ and β not only leads to an increase of mean accepted tokens since AT verification becomes



Results on the development set (newstest-13) with different hyperparameters in SpecDec++ (k = 25). Each cell lists the mean accepted tokens and BLEU score. The BLEU score of greedy decoding of the AT verifier is 26.62.

Results of SpecDec of the big-size model configuration on WMT14 EN-DE and the comparison to the state-of-the-art Blockwise Decoding (Stern et al., 2018).

Results of SpecDec-base (k = 30) with various sizes of drafters on WMT14 EN-DE.

Results of SpecDec-base model on CNN-DM for Abstractive Summarization.

Token repetition ratio on WMT14 EN-DE. SpecDec-base is tested with hyper-parameters k = 25, top-3, τ = 1.0. CMLM is tested with our implementation with the length beam set to 3.

Performance comparison between a single teacher model and teacher ensemble (3 teacher models) on WMT14 EN-DE. We report the results of both base-size and big-size SpecDec. Transformer-base/big with beam search are the speed baselines of SpecDec-base/big respectively.

B.5 RESULTS FOR TRANSFORMER-12-2

Kasai et al. (2021) point out that the Transformer with a deep encoder and a shallow decoder can achieve comparable translation quality with remarkable speedups. In Table 11, we present the results of SpecDec-base with the configuration of 12 encoder layers and 2 decoder layers. The results indicate that even in the deep-shallow configuration, SpecDec/SpecDec++ can further speed up translation without quality loss.

Results of SpecDec with the deep-shallow model configuration on WMT14 EN-DE. Both the AT verifier and the NAT drafter have 12 encoder layers and 2 decoder layers. E: encoder depth; D: decoder depth. * denotes the results of AT baselines (b = 5) implemented by Ghazvininejad et al. (2019). † indicates results obtained by Transformer-big.

SacreBLEU and COMET scores on WMT14 EN-DE.

BLEU and SacreBLEU (denoted by †) scores on WMT14 EN-DE and WMT16 EN-RO benchmarks.

Compared with AT, the additional memory cost of SpecDec comes from the last two parts. While the static NAT drafter's weights account for the majority of the additional memory cost, the additional cost for storing intermediate variables is negligible because the NAT drafter and AT verifier decode alternately during inference. Compared with AT, SpecDec's additional intermediate variables/results include:

• The NAT drafter's last encoder layer's representation, which is not freed until decoding finishes. Its size is B · S · d, where B is the batch size, S is the sequence length and d is the model dimension. This part is negligible: for example, when B = 32, S = 128, d = 512, its memory cost is only 8MB (fp32) / 4MB (fp16).

For a long sequence (e.g., paragraph/document-level inputs/outputs in summarization tasks), the largest intermediate variable becomes the tensor for storing self-attention computation, whose size increases quadratically with S. This variable accounts for the largest memory cost for storing intermediate results in both AT and SpecDec. Therefore, in this case, this part does not introduce additional memory cost compared with AT.
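The arithmetic behind the 8MB / 4MB figures, and the quadratic growth of the self-attention tensor, can be checked with a few lines. This is a back-of-the-envelope sketch: the helper names are hypothetical, and the number of attention heads used for the attention-score tensor is an illustrative assumption that the text does not specify.

```python
MIB = 1024 ** 2  # bytes per mebibyte


def tensor_bytes(num_values, bytes_per_value):
    """Memory in bytes for a dense tensor with num_values entries."""
    return num_values * bytes_per_value


def encoder_output_values(batch_size, seq_len, d_model):
    """Entries in the cached encoder representation: B * S * d."""
    return batch_size * seq_len * d_model


def attention_score_values(batch_size, num_heads, seq_len):
    """Entries in one layer's self-attention score tensor: B * H * S * S.
    num_heads (H) is an illustrative assumption, not given in the text."""
    return batch_size * num_heads * seq_len * seq_len


cache = encoder_output_values(batch_size=32, seq_len=128, d_model=512)
print(tensor_bytes(cache, 4) / MIB)  # fp32 -> 8.0
print(tensor_bytes(cache, 2) / MIB)  # fp16 -> 4.0

# Quadratic growth: doubling S quadruples the attention score tensor.
ratio = attention_score_values(1, 8, 2048) / attention_score_values(1, 8, 1024)
print(ratio)  # 4.0
```

The cached encoder output grows only linearly in S, which is why it stays small next to the attention tensor for long inputs.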

Peak GPU memory utilization on WMT14 EN-DE translation dataset. The results are obtained with fp32 on a single Nvidia P100 GPU. The hyperparameters of SpecDec++ are k = 25, top-5, τ = 3.0.

Peak GPU memory utilization on CNN-DM with batch size 1 with fp32 on a single Nvidia P100 GPU.


Table 16 and 17. Latency and peak GPU memory utilization of T5-Large on WMT14 EN-DE.

Latency and peak GPU memory utilization of T5-Large on CNN-DM.

Profiling of SpecDec++ (base-size, k = 25, top-5, τ = 3.0) on the WMT14 EN-DE test set.

APPENDIX A HYPERPARAMETERS

Hyper-parameters of training SpecDec are listed in Table 7. Following Vaswani et al. (2017) and Ott et al. (2018), we also average model parameters from the last 10 checkpoints.

Example 2: vanilla SpecDec
SOURCE   Yesterday , Gut@@ acht 's Mayor gave a clear answer to this question .
D        Gestern hat der Bürger@@ meister von Gut@@ acht eine klare
V        Gestern hat Gut@@ Bürger@@ meister von Gut@@ acht eine klare Antwort
D        Gestern hat Gut@@ acht@@ ts Bürger@@ meister eine klare Antwort auf diese Frage
V        Gestern hat Gut@@ ach@@ s Bürger@@ meister eine klare Antwort auf diese Frage gegeben
D        Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben
V        Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben .
D        Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben . [EOS]
V        Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben . [EOS]
RESULTS  Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben .

Example 2: SpecDec++
SOURCE   Yesterday , Gut@@ acht 's Mayor gave a clear answer to this question .
D        Gestern hat der Bürger@@ meister von Gut@@ acht eine klare
V        Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort
D        Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort auf diese Frage gegeben . [EOS]
V        Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort auf diese Frage gegeben . [EOS]
RESULTS  Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort auf diese Frage gegeben .

Table 19: Examples from the WMT14 English-German translation task. At each iteration, D and V are the outputs of the drafter and the verifier, respectively. Tokens within red blocks are the bifurcation positions. The verification pieces after the bifurcation are annotated as strikethrough. The highlighted parts are translations of previous iterations. Tokens in blue blocks are top-β candidates which meet the SpecDec++ requirement. The hyperparameters are k = 10, top-3, τ = 1.0. '@@' is the BPE token, e.g., Gut@@ acht → Gutacht. The output pieces after the [EOS] token are omitted in the table.
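The first iteration of the vanilla SpecDec example in Table 19 can be traced with a short sketch of the verification step. This is a simplified illustration based on our reading of the table, not the released implementation: the function name `verify_block` is hypothetical, and `verifier_top1[i]` is assumed to be the AT verifier's top-1 prediction at position i given the drafted tokens before it. The token lists are copied from the first D and V rows of the vanilla example.

```python
def verify_block(drafted, verifier_top1):
    """Accept drafted tokens up to the first mismatch (the bifurcation),
    then take the verifier's own token at that position.

    Returns (accepted_tokens, bifurcation_index); the index is None when
    every drafted token matches the verifier's top-1 prediction.
    """
    for i, (draft_tok, verify_tok) in enumerate(zip(drafted, verifier_top1)):
        if draft_tok != verify_tok:
            return drafted[:i] + [verify_tok], i
    # Full match: keep the draft plus the verifier's next-token prediction,
    # if one is available.
    extra = verifier_top1[len(drafted):len(drafted) + 1]
    return drafted + extra, None


drafted = ["Gestern", "hat", "der", "Bürger@@", "meister",
           "von", "Gut@@", "acht", "eine", "klare"]
verifier_top1 = ["Gestern", "hat", "Gut@@", "Bürger@@", "meister",
                 "von", "Gut@@", "acht", "eine", "klare", "Antwort"]

accepted, bifurcation = verify_block(drafted, verifier_top1)
print(accepted)     # ['Gestern', 'hat', 'Gut@@']
print(bifurcation)  # 2
```

The three accepted tokens match the prefix of the second D row in the table, after which the drafter proposes the next block of k = 10 tokens.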

