SPECULATIVE DECODING: LOSSLESS SPEEDUP OF AUTOREGRESSIVE TRANSLATION

Abstract

Unlike previous work that accelerates autoregressive translation (AT) at the cost of quality, we propose Speculative Decoding (SpecDec), a novel decoding paradigm inspired by speculative execution in computer architecture that combines the respective advantages of AT and non-autoregressive translation (NAT) for lossless translation speedup. At each decoding step, SpecDec first speculatively drafts (i.e., decodes) the next k tokens with an NAT model and then verifies them with an AT model, accepting only the drafted tokens that pass verification; this guarantees that its translation result is exactly the same as AT's. The collaboration of NAT drafting and AT verification enables a much higher decoding speed without quality loss, thanks to the parallel computation that speculative decoding makes possible. We conduct experiments on four standard WMT translation benchmarks and confirm that vanilla SpecDec yields exactly the same results as AT greedy decoding with around a 3× speedup, and that its variant (SpecDec++) with an advanced verification strategy not only outperforms AT greedy decoding but also further improves the decoding speed, resulting in around a 5× speedup over AT. Moreover, SpecDec can be easily generalized to speed up other seq2seq tasks such as abstractive summarization, and benefits more from stronger computing devices, demonstrating its potential to become a de facto decoding standard for efficient and lossless seq2seq generation. We will release all our code and checkpoints to facilitate reproduction of our results.

1. INTRODUCTION

Since the Transformer (Vaswani et al., 2017) came to prevail in Natural Language Processing (NLP), autoregressive decoding has become the de facto standard for neural machine translation (NMT) as well as other generation tasks, because it is easy to train and reliably generates high-quality results. Despite these advantages, autoregressive translation (AT) has been widely criticized for its poor inference efficiency, motivating non-autoregressive translation (NAT). Unlike AT, which sequentially decodes only one token at each iteration so that the next-token prediction can condition on the previous decoding results, NAT decodes tokens in parallel without depending on the surface form of previous tokens, largely improving inference efficiency. Recent research in NAT has mainly focused on improving its translation quality to bridge the performance gap with AT (Gu et al., 2018; Qian et al., 2021; Geng et al., 2021; Savinov et al., 2021). To date, however, NAT remains less reliable than AT, since NAT faces a harder problem given its unawareness of the conditional dependence among translated tokens.

Given AT's reliable generation results and NAT's high efficiency, we propose an approach named Speculative Decoding (SpecDec) that combines their advantages, inspired by speculative execution¹, to accelerate translation without quality loss compared with AT. SpecDec decomposes a decoding iteration into two substeps: draft and verify. At each iteration, SpecDec first speculatively drafts (i.e., decodes) a fixed number of tokens² in parallel through NAT; then, the drafted tokens are verified



¹ Speculative execution is an optimization technique used in computer architecture where a system performs some task in advance to avoid delays that would be incurred by doing the task only after it is known to be required (https://wikipedia.org/wiki/Speculative_execution).

² We use "a block of drafted tokens" to denote them in the remainder of this paper.
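The draft-then-verify loop introduced above can be sketched as follows. This is a minimal toy simulation, not the paper's implementation: `draft_fn` and `verify_fn` are hypothetical stand-ins for the NAT drafter and the AT verifier, and the integer "models" exist only to make the acceptance logic concrete.

```python
def speculative_decode(draft_fn, verify_fn, prefix, k, max_len):
    """Minimal sketch of the SpecDec draft-then-verify loop.

    draft_fn(ctx, k) plays the NAT drafter (proposes k tokens at once);
    verify_fn(ctx) plays the AT model's greedy next-token choice.
    Both are toy stand-ins, not the paper's actual models.
    """
    tokens = list(prefix)
    while len(tokens) < max_len:
        # Draft substep: speculatively propose the next k tokens in parallel.
        drafted = draft_fn(tokens, k)
        # Verify substep: in the real system the AT model checks all k
        # positions in a single parallel forward pass; here we emulate its
        # greedy choice position by position.
        for tok in drafted:
            at_tok = verify_fn(tokens)
            if tok == at_tok:
                tokens.append(tok)     # drafted token passes verification
            else:
                tokens.append(at_tok)  # mismatch: keep the AT token instead
                break                  # and discard the rest of the draft
            if len(tokens) >= max_len:
                break
    return tokens[:max_len]


# Toy AT "model": greedily continues the sequence with last token + 1.
def verify_fn(ctx):
    return ctx[-1] + 1

# Toy NAT drafter: its first two guesses per block are right, later ones
# wrong, so roughly three tokens are accepted per iteration.
def draft_fn(ctx, k):
    return [ctx[-1] + 1 + i + (0 if i < 2 else 7) for i in range(k)]

out = speculative_decode(draft_fn, verify_fn, [0], k=4, max_len=10)
print(out)  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because every accepted token equals the verifier's own greedy choice in context, the output is identical to pure AT greedy decoding regardless of how good the drafter is; a better drafter only increases how many tokens are accepted per iteration, i.e., the speedup.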

