NON-ITERATIVE PARALLEL TEXT GENERATION VIA GLANCING TRANSFORMER

Abstract

Although non-autoregressive models with one-iteration generation achieve remarkable inference speed-up, they still fall behind their autoregressive counterparts in prediction accuracy. The most accurate non-autoregressive models currently rely on multiple decoding iterations, which largely sacrifices their inference speed advantage. Inspired by the way word dependencies are learned in autoregressive and iterative-decoding models, we propose Glancing Transformer (GLAT) with a glancing language model (GLM), which learns to capture word dependencies gradually. Experiments on three benchmarks demonstrate that our approach significantly improves the accuracy of non-autoregressive models without multiple decoding iterations. In particular, GLAT achieves state-of-the-art results among non-iterative models and even outperforms top iterative counterparts on several benchmarks.

1. INTRODUCTION

Non-autoregressive transformer (NAT) has attracted wide attention in neural machine translation (Gu et al., 2018), as it generates the words of a sentence in parallel rather than sequentially. To enable parallel decoding, NAT imposes a conditional independence assumption among the words of the output sentence, which leads to significantly faster inference (roughly a dozen times speed-up) than the autoregressive Transformer (Vaswani et al., 2017). However, NAT still falls behind the autoregressive Transformer (AT) in output quality, e.g., BLEU (Papineni et al., 2002) for machine translation. We attribute this gap to the imposed conditional independence assumption, which prevents NAT models from explicitly learning the word dependencies in the output sentence. Note that such word dependency is crucial, and the AT model learns it explicitly through its left-to-right autoregressive language model (see Figure 1a and Figure 1b). Recently, Ghazvininejad et al. (2019) and Gu et al. (2019) propose to employ the Masked Language Model (MLM, Devlin et al., 2019) in NAT, which models word dependencies in an iterative fashion (see Figure 1c) and therefore yields quite competitive results compared to AT. Specifically, such iterative models randomly mask words in the reference during training and predict the masked words conditioned on the unmasked ones. In this manner, iterative models are trained to explicitly capture the dependencies between masked and unmasked words. However, these iterative approaches still give poor results with a single decoding iteration and have to perform multiple iterations during inference, iteratively refining the outputs of the previous iteration. This iterative process is quite time-consuming, which partly sacrifices the speed merit of NAT. How to abandon the iterative process while enjoying the benefits of explicitly modeling word dependencies in NAT remains an open problem.
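The uniform masking strategy used by these iterative models can be sketched as follows. This is a minimal illustrative sketch, not the original Mask-Predict code; the function name and the `[MASK]` string are placeholders chosen for clarity.

```python
import random

MASK = "[MASK]"

def uniform_mask(target_tokens, rng=random):
    """Uniform masking as used for iterative-model training (sketch):
    sample the number of masked tokens uniformly from 1..T, then pick
    that many positions at random. The masked positions become the
    prediction targets; the remaining tokens stay observed."""
    T = len(target_tokens)
    n_mask = rng.randint(1, T)                # uniform over 1..T
    positions = rng.sample(range(T), n_mask)  # which tokens to hide
    masked = list(target_tokens)
    for p in positions:
        masked[p] = MASK
    return masked, sorted(positions)
```

Because the masking ratio is drawn from the same fixed distribution throughout training, it cannot adapt to how well the model currently fits a given reference, which motivates the adaptive strategy proposed below.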
In this paper, we argue that the major culprit behind masked language models having to be used together with iterative inference is the sampling strategy for masking words in MLM. In particular, MLM employs a fixed uniform strategy for masking words randomly during training, which prevents the model from effectively learning word dependencies for one-iteration generation. For example, at the beginning of training, the NAT model is poorly tuned and we should mask fewer words: if we mask too many words, it is difficult for the NAT model to correctly predict the masked words. Conversely, if we mask too few words in the final phase of training, the resulting NAT model is rarely trained to predict whole sentences and can only predict sentence fragments. In such a case, to accurately generate a whole sentence at inference time, the NAT model has to generate the sentence fragments iteratively. Hence, the sampling strategy is crucial for the training of NAT.

To address the above issues, we propose a simple yet effective approach called Glancing Transformer (GLAT), which is equipped with the proposed Glancing Language Model (GLM) for non-iterative parallel text generation, achieving significant improvements upon strong baselines. Intuitively, GLM adopts an adaptive glancing sampling strategy, which glances at some fragments of the reference if the reference is too difficult to fit during NAT training. Correspondingly, when the model is well tuned, it adaptively reduces the percentage of glancing sampling, ensuring that the resulting model learns to generate the whole sentence in a one-iteration fashion.

Specifically, our proposed GLM differs from MLM in two aspects. First, GLM adopts an adaptive glancing sampling strategy, which enables GLAT to generate sentences in a single iteration through gradual training instead of iterative inference (see Figure 1d). In spirit, GLM is quite similar to curriculum learning (Bengio et al., 2009): it first learns to generate some fragments and gradually moves on to learning whole sentences (from easy to hard). To achieve adaptive glancing sampling, GLM performs decoding twice during training. The first decoding pass is the same as in the vanilla NAT, and its prediction accuracy indicates whether the current reference is "difficult" to fit. In the second pass, GLM obtains words of the reference via glancing sampling according to the first pass, and learns to predict the remaining words that are not sampled. Note that only the second decoding pass updates the model parameters.
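The two-pass procedure above can be sketched as follows. This is a simplified illustration of the strategy described in the text, not the authors' implementation; the function name, the `ratio` parameter, and the use of Hamming distance as the difficulty measure are stated here as assumptions for the sketch.

```python
import random

def glancing_sample(reference, first_pass_pred, ratio=0.5, rng=random):
    """Adaptive glancing sampling (illustrative sketch). The number of
    reference words revealed to the decoder is proportional to the
    Hamming distance between the first-pass prediction and the
    reference: the worse the first pass, the more words are glanced."""
    # Hamming distance: number of mismatched positions
    distance = sum(p != r for p, r in zip(first_pass_pred, reference))
    n_glance = int(ratio * distance)
    glanced = set(rng.sample(range(len(reference)), n_glance))
    # Revealed positions take words from the reference; the remaining
    # positions are the prediction targets of the second decoding pass.
    decoder_inputs = [reference[i] if i in glanced else None
                      for i in range(len(reference))]
    targets = [i for i in range(len(reference)) if i not in glanced]
    return decoder_inputs, targets
```

Note how the schedule adapts automatically: early in training the first pass is poor, the distance is large, and many words are glanced; as the model improves, the distance shrinks toward zero and the second pass approaches full one-iteration generation.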
Second, instead of using the [MASK] token, GLM directly uses representations from the encoder at the corresponding positions, which is more natural and enhances the interaction between sampled words and signals from the encoder. Experimental results show that GLAT obtains significant improvements (about 5 BLEU) on standard benchmarks compared to the vanilla NAT, without losing the inference speed-up. GLAT achieves competitive results against iterative approaches like Mask-Predict (Ghazvininejad et al., 2019), even outperforming the Mask-Predict model on WMT14 DE-EN and WMT16 RO-EN. Compared to the strong AT baseline, GLAT closes the performance gap to within 1 BLEU point while keeping a 7.9× speed-up. Empirically, we even find that GLAT outperforms AT on WMT14 DE-EN when the reference length is less than 20. We speculate this is because GLM can capture bidirectional context for generation while its left-to-right counterpart is only unidirectional, which indicates the potential of parallel generation approaches like GLAT.

2. TEXT GENERATION VIA CONDITIONAL LANGUAGE MODELING

In this section, we compare the language models used by different text generation approaches. Formally, consider a sequence-to-sequence model (Cho et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017) that predicts $Y = \{y_1, y_2, ..., y_T\}$ given the input sentence $X = \{x_1, x_2, ..., x_N\}$. In the AT model, the training objective is to maximize the log-likelihood under the autoregressive decomposition:

$\mathcal{L}_{\text{AT}} = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, X; \theta),$

where the word $y_t$ is conditioned on the target prefix $y_{<t} = \{[\text{BOS}], y_1, ..., y_{t-1}\}$ and the source input $X$. AT models sentences from left to right, therefore word dependencies are learned in a
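The autoregressive decomposition above says the sequence log-likelihood is just the sum of per-step token log-probabilities under teacher forcing. A toy numeric sketch (function name and inputs are illustrative, not from the paper):

```python
import math

def sequence_log_likelihood(step_probs):
    """L_AT as a plain sum of token log-likelihoods.
    step_probs[t] is the model probability p(y_t | y_<t, X; theta)
    assigned to the reference token at step t under teacher forcing."""
    return sum(math.log(p) for p in step_probs)
```

For instance, a two-token reference where the model assigns probability 0.5 to each token gives log(0.5) + log(0.5) = log(0.25), the log of the product of the step probabilities, as the chain rule requires.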



Figure 1: Different language modeling approaches of different text generation models.

