A TEXT GAN FOR LANGUAGE GENERATION WITH NON-AUTOREGRESSIVE GENERATOR

Anonymous authors
Paper under double-blind review

Abstract

Despite the great success of Generative Adversarial Networks (GANs) in generating high-quality images, GANs for text generation still face two major challenges. First, most text GANs are unstable in training, mainly due to ineffective optimization of the generator, and they rely heavily on maximum likelihood pretraining. Second, most text GANs adopt autoregressive generators without latent variables, which largely limits their ability to learn latent representations for natural language text. In this paper, we propose a novel text GAN, named NAGAN, which incorporates a non-autoregressive generator with latent variables. The non-autoregressive generator can be effectively trained with gradient-based methods and is free from pretraining. The latent variables facilitate representation learning for text generation applications. Experiments show that our model is competitive with existing text GANs in unconditional text generation, and it outperforms existing methods on sentence manipulation in latent space and unsupervised text decipherment.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have achieved great success in generating continuous data, such as images with high resolution and fidelity (Brock et al., 2019). Unsurprisingly, GANs are also widely studied for text generation, but the adaptation is by no means trivial. The mainstream text GANs (Yu et al., 2017b; Guo et al., 2018) apply a different framework tailored for discrete sequence data, but several research problems remain unsolved.

One problem lies in ineffective optimization. Most text GANs resort to gradient-free RL (reinforcement learning) algorithms, mainly due to the nature of discrete text data. However, since RL methods abandon the gradient information, they suffer from unstable training processes (Ke et al., 2019). Though some works (Chen et al., 2018) have explored the feasibility of gradient-based methods, the optimization is still ineffective. As a result, most text GANs rely heavily on MLE pretraining, and some even report worse performance after GAN training (Caccia et al., 2020).

Another problem can be attributed to the generative model. Most text GANs adopt an autoregressive generator, which defines an explicit likelihood without any latent variable. Latent variables have empowered image GANs with various applications, such as unsupervised style transfer (Taigman et al., 2017) and image editing (Brock et al., 2017). However, most text GANs merely generate sentences from the learned distribution with autoregressive decoding, and thus are hardly applicable to text style transfer or controlled text generation, which may require latent representations.

We therefore challenge the conventional design of existing text GANs and argue that incorporating a non-autoregressive generator can benefit from both efficient gradient-based optimization methods and the use of latent variables.
Our proposed model, named Non-Autoregressive GAN (NAGAN), consists of a non-autoregressive generator and a regularized discriminator. The non-autoregressive generator naturally translates latent variables to tokens in parallel, and gradient-based optimization on our feed-forward structure is significantly more effective than the same method on an autoregressive generator. The discriminator is regularized by Max Gradient Penalty (Zhou et al., 2019), which is another key to effective optimization.

Our contributions are summarized as follows:

• We propose the non-autoregressive GAN, which characterizes itself by employing a non-autoregressive generator and latent variables, and by training efficiently from scratch with gradient-based methods. To our knowledge, NAGAN is the first text GAN which learns latent representations from scratch.

• We conduct experiments on synthetic and real data, and show that NAGAN without MLE pretraining is competitive in unconditional text generation compared with pretrained text GANs.

• By taking advantage of the latent variables and the non-autoregressive generator, our model can be applied to sentence manipulation in latent space and unsupervised text decipherment, where NAGAN significantly outperforms previous methods.

Table 1: Differences between various GANs. Autoregressive: using autoregressive generators. Explicit: using explicit generative models. Latent: equipped with latent variables. Pretrain: pretraining required or not. The mainstream TextGANs include Yu et al. (2017b); Guo et al. (2018); Shi et al. (2018); Che et al. (2017); Lin et al. (2017); Fedus et al. (2018).

Model                                        | Data  | Autoregressive | Explicit | Latent | Optimization   | Pretrain
Vanilla GAN (Goodfellow et al., 2014)        | Image | No             | No       | Yes    | Gradient-based | No
SOTA ImageGANs (Brock et al., 2019)          | Image | No             | No       | Yes    | Gradient-based | No
Mainstream TextGANs [*]                      | Text  | Yes            | Yes      | No     | RL             | Yes
ScratchGAN (de Masson d'Autume et al., 2019) | Text  | Yes            | Yes      | No     | RL             | No
RelGAN (Nie et al., 2019)                    | Text  | Yes            | Yes      | No     | Gradient-based | Yes
FMGAN (Chen et al., 2018)                    | Text  | Yes            | Yes      | Yes    | Gradient-based | Yes
NAGAN (Ours)                                 | Text  | No             | No       | Yes    | Gradient-based | No

2. MOTIVATION OF NON-AUTOREGRESSIVE GENERATION IN TEXT GANS

Non-autoregressive (NAR) generators have been widely used in image GANs (Goodfellow et al., 2014) and have achieved great success in generating high-quality images (Brock et al., 2019). For text generation, however, text GANs apply a different framework where autoregressive (AR) generators are used. In this section, we discuss the differences between image and text GANs, and show why we need non-autoregressive generators in text generation.

2.1. GENERATIVE MODELS

Image and text GANs differ substantially in their generative models. An image GAN is an implicit generative model, where a sample x is generated in two steps:

z ∼ p(z), x = G(z), (1)

where z is a latent variable, and p(z) is the prior distribution. G is a deterministic function from the latent space to the data space, usually parameterized by a NAR generator, where every pixel of x is generated simultaneously. A text GAN is usually an explicit generative model. A discrete sequence sample x = [x_1, x_2, …, x_L] is sampled by a stochastic process from the distribution P_G(x), where

P_G(x) = ∏_{i=1}^{L} P_G(x_i | x_1, …, x_{i−1}). (2)

Eq (2) shows that G is an autoregressive model, where tokens are sampled sequentially conditioned on previously generated prefixes. Most text GANs do not adopt latent variables, partially because Eq (2) is convenient for MLE pretraining. There are also some explorations (Chen et al., 2018) on equipping text GANs with latent variables, but elaborate VAE-like pretraining is required. However, directly injecting latent variables into the explicit generative model of text GANs can be problematic. For instance, when defining P_G(x) = E_{z∼p(z)} ∏_{i=1}^{L} P_G(x_i | x_{<i}, z), we may face two problems: (1) Solely optimizing P_G does not make the model learn latent representations. Without further constraints¹, the model can degenerate and simply ignore z, becoming almost a vanilla language model, even if the generation distribution P_G perfectly fits the real distribution. (2) The representations in AR text GANs can hardly be applied to downstream tasks. In image GANs, x is fully determined by the sampled z, so we can control the generated images by manipulating the latent variable. Moreover, the mapping from the latent space to the data space is continuous and differentiable, thereby facilitating applications such as image editing (Brock et al., 2017).
In text GANs, translating z to x is still a stochastic process even when the latent variable is fixed, indicating that the generated sentences may not be entirely controllable. The non-differentiable generator can also be an obstacle for downstream applications. For the above reasons, we explore the feasibility of adopting an implicit generative model with a NAR generator for text GANs. The NAR generator can be conveniently applied to various applications and will not degenerate into a language model (discussed further in Appendix A.1.1).

2.2. OPTIMIZATIONS

In image GANs, the generator G_θ and the discriminator D_φ play a minimax game:

min_θ max_φ V(D_φ, G_θ) = E_{x∼p_data(x)} [log D_φ(x)] + E_{z∼p(z)} [log(1 − D_φ(G_θ(z)))]. (3)

Since G_θ and D_φ are fully differentiable, they can be optimized alternately by gradient-based methods. In text GANs, there is usually no latent variable, and the minimax game becomes

min_θ max_φ V(D_φ, G_θ) = E_{x∼P_data(x)} [log D_φ(x)] + E_{x∼P_{G_θ}(x)} [log(1 − D_φ(x))]. (4)

The discriminator D_φ can be trained by gradient-based methods, but the output of G_θ is discrete, so the gradient cannot be passed from D_φ back to G_θ. Instead, the generator G_θ can be optimized by the gradient-free REINFORCE algorithm (Williams, 1992):

∇_θ E_{x∼P_{G_θ}(x)} [R(x)] = E_{x∼P_{G_θ}(x)} [R(x) ∇_θ log P_{G_θ}(x)], (5)

where R(x) is a reward function measuring the fitness of a generated sequence. If R(x) = log(1 − D(x)), it recovers the second term of Eq (4). Existing text GANs also devise various rewards tailored for text data (Guo et al., 2018; Shi et al., 2018; Lin et al., 2017; Fedus et al., 2018; Xu et al., 2018). Gradient-free RL algorithms naturally suffer from high variance. In text GANs, the huge action space and the changing reward function² exacerbate the problem, so most text GANs rely heavily on MLE pretraining. Some text GANs improve the training algorithms from the perspective of RL (Guo et al., 2018; Shi et al., 2018), but recent empirical studies show that RL training does not always improve the performance over pretraining (Caccia et al., 2020; Semeniuta et al., 2018). ScratchGAN (de Masson d'Autume et al., 2019) devises improved RL techniques, for the first time freeing text GANs from MLE pretraining, at the cost of a much larger batch size (~10x) and computation. Some works (Kusner & Hernández-Lobato, 2016; Chen et al., 2018) explore gradient-based methods and adopt continuous relaxations or gradient estimators for discrete text data.
However, these models still rely on pretraining since their optimization is also ineffective. Different from existing works that focus on improving RL or the approximation methods, we find that one reason for the ineffective optimization may lie in the generator architecture. To obtain gradients, an autoregressive generator trained with gradient-based methods has to apply the gradient estimator recurrently, because the generator reads discrete tokens as inputs. In other words, the gradient flowing from the last token is approximated multiple times before it reaches the start of the sequence. In contrast, a non-autoregressive generator has a feed-forward structure, and only one approximation is required, at the last layer of the generator.
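To make the REINFORCE estimator above concrete, here is a minimal sketch of a hypothetical toy setup (not the implementation of any cited model): a two-token vocabulary where each token is drawn independently with P(token = 1) = sigmoid(θ), and the expectation E_x[R(x) ∇_θ log P(x)] is computed exactly by enumerating all sequences.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def reinforce_grad(theta, reward, length=2):
    """Exact E_x[R(x) * d/dtheta log P(x)] by enumerating all 2^length sequences."""
    p = sigmoid(theta)
    grad = 0.0
    for bits in range(2 ** length):
        seq = [(bits >> i) & 1 for i in range(length)]
        prob, dlogp = 1.0, 0.0
        for tok in seq:
            prob *= p if tok == 1 else (1.0 - p)
            dlogp += tok - p  # d/dtheta log P(tok) under the sigmoid policy
        grad += prob * reward(seq) * dlogp
    return grad

# Reward: 1 if the sequence contains token 1, else 0.
g = reinforce_grad(0.0, lambda s: float(1 in s))  # positive: pushes p upward
```

In practice the expectation is approximated by sampling rather than enumeration, which is exactly where the high variance discussed above comes from; enumeration is only feasible for toy action spaces.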

2.3. NON-AUTOREGRESSIVE TEXT GENERATION

Non-autoregressive text generation was first proposed for machine translation to generate the whole sequence in parallel with low latency (Gu et al., 2018; Ma et al., 2019). Most non-autoregressive generators are trained with the cross-entropy loss, which assumes the tokens are conditionally independent of each other given the input. As a result, non-autoregressive generators can only be applied to tasks where the outputs can be fully determined by the inputs (e.g., machine translation), and they fail to capture complex distributions, known as the multimodality problem (Gu et al., 2018). Most non-autoregressive models require knowledge distillation to reduce the dataset complexity; otherwise they experience a serious performance drop (Zhou et al., 2020). We assume that a key problem behind this phenomenon is the unreasonable independence assumption, so we replace the cross-entropy loss with the GAN objective. The GAN objective considers the whole sequence and punishes implausible sentences, and it does not rely on the independence assumption. In this paper, we explore the feasibility of generating diverse sentences with a non-autoregressive generator.

Figure 1: (A) The generator translates latent variables Z = [z_1 z_2 … z_L] to a sentence O = [o_1 o_2 … o_L], where each o_i is a one-hot vector. The gradient of the non-differentiable operation is estimated by the straight-through estimator. (B) The discriminator produces a score D(O) for the sentence O. The gradient from the discriminator can be passed back to the generator.

3. NON-AUTOREGRESSIVE GAN

As shown in Figure 1, our proposed non-autoregressive GAN (NAGAN) consists of a non-autoregressive generator with latent variables and a regularized discriminator. The framework is very similar to that of image GANs but differs substantially from mainstream text GANs in two aspects: (1) we use an implicit non-autoregressive generative model equipped with latent variables; (2) we use gradient-based optimization with Max Gradient Penalty to stabilize the training process.

3.1. GENERATOR

Our implicit generative model is defined by a sampling process. First, we sample the latent variable from the prior distribution. Instead of a single latent vector, we sample a sequence of continuous latent variables Z = [z_1 z_2 … z_L], where L is the desired length of the target sequence³ and each z_t is sampled from N(0, I) independently for 1 ≤ t ≤ L. This idea is inspired by Ma et al. (2019), which provides a good way for the non-autoregressive generator to leverage the latent variables. Then, the non-autoregressive generator G converts Z to a sequence O = [o_1 o_2 … o_L], which is the one-hot representation of the generated sentence. We use one-hot representations here for the convenience of obtaining gradients. The process can be formulated as follows:

Z ∼ P(Z), O = G(Z). (6)

G is implemented by a Transformer network (Gu et al., 2018; Vaswani et al., 2017):

[h_1 h_2 … h_L] = Transformer([z_1 z_2 … z_L]), (7)
s_t = MLP(h_t) ∈ R^V, o_t = onehot(argmax_v s_{t,v}), (8)

where V is the size of the vocabulary, s_{t,v} is the v-th dimension of s_t, and o_t ∈ {0, 1}^V.

Straight-Through Estimator. The argmax operation is non-differentiable, so we cannot pass the gradient from o_t back to the parameters of G. Previous work (Kusner & Hernández-Lobato, 2016) has introduced the straight-through estimator (Bengio et al., 2013) to solve this problem. Given a scalar temperature τ with τ → 0, we have ∂o_t/∂s_t ≈ ∂softmax(s_t/τ)/∂s_t. Note that the straight-through estimator is only used for obtaining gradients; the forward pass remains unchanged as in Eq (8).
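The straight-through trick above can be sketched as follows (a minimal stand-alone sketch in plain Python, with the backward pass written out by hand rather than via an autodiff framework): the forward pass emits the hard one-hot vector of Eq (8), while the backward pass uses the Jacobian of softmax(s/τ) as a surrogate.

```python
import math

def softmax(s, tau=1.0):
    m = max(s)
    e = [math.exp((v - m) / tau) for v in s]
    z = sum(e)
    return [v / z for v in e]

def st_onehot_forward(s):
    """Forward pass: hard one-hot of argmax(s), as in Eq (8)."""
    k = max(range(len(s)), key=lambda i: s[i])
    return [1.0 if i == k else 0.0 for i in range(len(s))]

def st_onehot_backward(s, grad_o, tau=1.0):
    """Backward pass: pretend the forward was softmax(s / tau).

    For y = softmax(s/tau), the vector-Jacobian product is
    (J^T g)_j = y_j * (g_j - <g, y>) / tau.
    """
    y = softmax(s, tau)
    dot = sum(gi * yi for gi, yi in zip(grad_o, y))
    return [yi * (gi - dot) / tau for yi, gi in zip(y, grad_o)]

s = [2.0, 0.5, -1.0]
o = st_onehot_forward(s)                              # hard, non-differentiable
g = st_onehot_backward(s, [1.0, 0.0, 0.0], tau=1.0)   # smooth surrogate gradient
```

Note that the surrogate gradient sums to zero across the vocabulary dimension, as any softmax Jacobian-vector product does, so the estimator only redistributes probability mass among tokens.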

3.2. DISCRIMINATOR

Similar to vanilla GANs, the discriminator tries to distinguish real sentences from generated ones as a binary classification. Upon receiving a generated sentence O = [o_1 o_2 … o_L], the discriminator produces a score D(O) for the sentence. We simply choose the Transformer for our discriminator; some other architectures are discussed in Section 4.2.

[r_1 r_2 … r_L] = Transformer([o_1 o_2 … o_L]), (9)
D(O) = MLP(Max_Pooling([r_1 r_2 … r_L])). (10)

Max Gradient Penalty. The gradient vanishing problem (Arjovsky et al., 2017) is a critical factor causing unstable training in image GANs. When an unregularized discriminator is perfectly trained, the gradient becomes zero and cannot provide signals for the generator's optimization. We borrow a common training technique named Max Gradient Penalty from image GANs (Zhou et al., 2019). The technique restricts the Lipschitz constant of the discriminator, which is proven effective in gradient-based GAN optimization. Max Gradient Penalty can be formulated as follows:

L_GP = max_{O′} ‖∂D(O′)/∂O′‖_2, (11)

where O′ are sampled uniformly along the straight lines between pairs of points sampled from the real data and the generated data (Gulrajani et al., 2017).
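A minimal numerical sketch of the penalty follows (assumptions: a toy linear discriminator and finite-difference gradients standing in for automatic differentiation; MaxGP takes the maximum gradient norm over the sampled interpolates rather than penalizing each one toward 1):

```python
import random

def interpolate(real, fake, eps):
    """A point O' on the line between a real and a generated sample."""
    return [eps * r + (1 - eps) * f for r, f in zip(real, fake)]

def grad_norm(D, x, h=1e-5):
    """||dD/dx||_2 via central finite differences (stand-in for autograd)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((D(xp) - D(xm)) / (2 * h))
    return sum(v * v for v in g) ** 0.5

def max_gradient_penalty(D, reals, fakes, n_samples=8, seed=0):
    """MaxGP: the maximum gradient norm over sampled interpolates O'."""
    rng = random.Random(seed)
    norms = []
    for _ in range(n_samples):
        r, f = rng.choice(reals), rng.choice(fakes)
        x = interpolate(r, f, rng.random())
        norms.append(grad_norm(D, x))
    return max(norms)

D = lambda x: sum(x)  # toy linear discriminator; its true gradient is (1, 1)
gp = max_gradient_penalty(D, [[1.0, 0.0]], [[0.0, 1.0]])
```

For the linear toy discriminator the gradient norm is the same everywhere, so the maximum over interpolates equals that constant; with a real Transformer discriminator the maximum is what bounds the Lipschitz constant.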

3.3. TRAINING

The discriminator and the generator are optimized alternately, where the discriminator tries to distinguish the real and generated sentences, and the generator is updated to achieve a higher score D(G(Z)). We utilize the GAN objective with the non-saturating loss (Goodfellow et al., 2014):

L_D = −E_{x∼P_data} [log D(onehot(x))] − E_{Z∼P(Z)} [log(1 − D(G(Z)))], (12)
L_G = −E_{Z∼P(Z)} [log D(G(Z))]. (13)

We further apply Max Gradient Penalty to the objective, and the loss of the discriminator becomes L_D + λL_GP, where λ is a hyper-parameter balancing the regularization term. This objective can be regarded as a special case of Lipschitz GANs (Zhou et al., 2019). With the straight-through estimator, both the generator and the discriminator are differentiable, so L_D and L_G can be optimized alternately by gradient-based methods. The training steps are consistent with vanilla GANs but essentially different from previous text GANs: we do not use MLE pretraining or sampling techniques devised for RL.

Predicting the Length. Non-autoregressive generators require a sentence length L before generating in parallel. Existing non-autoregressive generators (Gu et al., 2018) usually leverage a classifier P(L|C) to predict L conditioned on the input C (e.g., the source sentence in machine translation). In unconditional text generation, C is not provided, so we directly estimate P(L) by counting the lengths of the training samples. During training and inference, L is sampled from the same P(L).
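The empirical length distribution P(L) described above can be estimated and sampled as follows (a minimal sketch; the toy list of lengths stands in for the lengths of the training sentences):

```python
import random
from collections import Counter

def fit_length_dist(corpus_lengths):
    """Estimate P(L) by counting sentence lengths in the training data."""
    counts = Counter(corpus_lengths)
    total = sum(counts.values())
    lengths = sorted(counts)
    probs = [counts[L] / total for L in lengths]
    return lengths, probs

def sample_length(lengths, probs, rng):
    """Draw L ~ P(L); used identically during training and inference."""
    return rng.choices(lengths, weights=probs, k=1)[0]

lengths, probs = fit_length_dist([3, 3, 4, 5, 5, 5])   # toy corpus lengths
L = sample_length(lengths, probs, random.Random(0))
```

Using the same P(L) at training and inference keeps the generated length distribution matched to the data, so the length "latent variable" never needs a learned predictor in the unconditional setting.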

3.4. APPLICATIONS

Manipulating Sentences in Latent Space. To show the utility of the latent variables, we introduce a sentence editing task: given a sentence (with latent representation Z_s) which contains a source word, the task aims to obtain a new latent representation Z_t that can be translated to a sentence containing a target word. Some cases are shown in Table 6. A previous method (Zhao et al., 2018) modifies the latent variable with an offset vector, obtained by subtracting the mean latent variable of the source word from that of the target word. This method ignores the context around the edited word and thus leads to a low success rate. More details are described in Appendix A.5. NAGAN can use the gradient information to make specific modifications for different contexts. If we want to modify the t-th word x_t in the source sentence to a desired word x′, we can apply gradient descent on the latent variable to maximize the x′-th dimension of o_t in Eq (8). One update step can be formulated as Z_s := Z_s + β ∂o_{t,x′}/∂Z_s, where β is a small weight. The process is repeated until success or until a pre-specified iteration number is exceeded. Note that both methods are only applicable to models with latent variables, so most text GANs cannot be applied to this task.

Unsupervised Decipherment. Our model can be applied to unsupervised decipherment (Yang et al., 2018) within the framework of CycleGAN (Zhu et al., 2017). The task provides two unparallel corpora: the plaintext X and the ciphertext Y, where we aim to learn a mapping F_{Y→X} to decrypt the ciphertext. To adapt NAGAN to this task, we introduce an encoder E to encode the data to a shared latent space and define F_{Y→X} := G(E(y, c_Y), c_X), where y ∈ Y, and c_Y and c_X are labels indicating the type of the source and target text. Similarly, F_{X→Y} := G(E(x, c_X), c_Y). Following Dai et al. (2019), we utilize three losses: the cycle loss, the adversarial loss, and the reconstruction loss.
We first define the three losses on X; the losses on Y can be obtained similarly. The cycle loss is defined as L_cyc,X = E_{x∼P_X} [d(x, F_{Y→X}(F_{X→Y}(x)))], where d is a distance function. The adversarial loss matches the real samples and the generated samples: L_adv,X = E_{x∼P_X} [D(x, c_X)] − E_{y∼P_Y} [D(F_{Y→X}(y), c_X)]. The reconstruction loss helps the unsupervised encoder-decoder training: L_rec,X = E_{x∼P_X} [d(x, G(E(x, c_X), c_X))]. The final objective sums the three losses on both X and Y, i.e.,

min_G max_D [L_cyc,X + L_cyc,Y + αL_adv,X + αL_adv,Y + βL_rec,X + βL_rec,Y],

where α and β are hyperparameters balancing the losses. In this task, NAGAN learns shared latent representations for the two types of text, and the alignment between the two text spaces is strengthened by the cycle loss. We adapt NAGAN to this task with the non-autoregressive generator unchanged, where we further introduce an encoder and utilize a cross-entropy based discriminator (de Masson d'Autume et al., 2019). More details are described in Appendix A.6.1.
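The gradient-based editing step from the sentence manipulation task above (Z_s := Z_s + β ∂o_{t,x′}/∂Z_s) can be sketched with a hypothetical linear "generator" for a single position over a three-word vocabulary, maximizing the softmax probability as the straight-through surrogate of o_{t,x′} and using the analytic softmax gradient in place of backpropagation through a Transformer:

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def generator_logits(z, W):
    """Toy linear 'generator' for one sentence position: s = W z."""
    return [sum(wi * zi for wi, zi in zip(row, z)) for row in W]

def edit_latent(z, W, target, beta=0.3, max_iter=500):
    """Gradient ascent on z to maximize the target word's softmax probability."""
    z = list(z)
    for _ in range(max_iter):
        p = softmax(generator_logits(z, W))
        if max(range(len(p)), key=lambda i: p[i]) == target:
            return z, True  # argmax now decodes the target word
        # Analytic gradient: d p_t / d z_j = p_t * (W[t][j] - sum_i p_i W[i][j])
        for j in range(len(z)):
            avg = sum(p[i] * W[i][j] for i in range(len(W)))
            z[j] += beta * p[target] * (W[target][j] - avg)
    return z, False

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]      # hypothetical 3-word vocabulary
z1, ok = edit_latent([1.0, -1.0], W, target=1)  # source latent decodes word 0
```

The stopping criterion mirrors the paper's: iterate until the argmax decodes the target word or an iteration budget is exhausted.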

4. EXPERIMENTS

Experiment Settings. We test NAGAN on both synthetic and real data. The synthetic data (vocab size = 500, length = 20) is generated by an oracle Hidden Markov Model with fixed parameters. We do not use an LSTM as the oracle model, because it may bias the evaluation toward LSTM generators. We extract text from the COCO image caption dataset (vocab size = 4,839, max length = 32) (Chen et al., 2015) and the SNLI dataset (vocab size = 42,981, max length = 40) (Bowman et al., 2015) as the real data, which are adopted by Guo et al. (2018); Semeniuta et al. (2018).

Dropout in NAGAN. We utilize dropout (Srivastava et al., 2014) in our generator, which introduces noise for generating diverse samples and stabilizes GAN training (Isola et al., 2017; Sønderby et al., 2017). Note that dropout in GANs should also be applied at inference, which ensures that the generated samples come from the same distribution as during GAN training. However, we find that varying the dropout rate at inference in fact balances the quality and the diversity of generated sentences, as will be shown in Section 4.1.

Evaluation Metrics. For the synthetic data, we calculate the oracle negative log-likelihood (Oracle NLL). For the real data, we adopt the language model score (Caccia et al., 2020), n-gram based metrics (Shi et al., 2018), and the Fréchet embedding distance (FED) (de Masson d'Autume et al., 2019). Among these metrics, LM score and BLEU_F measure quality, BLEU_B measures diversity, and BLEU_HA and FED measure overall performance. The detailed definitions are given in Appendix A.4.1.

4.1. RESULTS ON UNCONDITIONAL TEXT GENERATION

We first test NAGAN on the synthetic data. The chosen baselines include a GRU model trained by MLE (Graves, 2013), two mainstream text GANs (Yu et al., 2017b; Guo et al., 2018), and ScratchGAN (de Masson d'Autume et al., 2019). As shown in Figure 2, NAGAN obtains lower Oracle NLL than the baselines, particularly the other models without pretraining (solid lines). LeakGAN without pretraining is not shown because its score is too large (Oracle NLL = 7.28). On the real data, we test NAGAN against the MLE-trained GRU (Graves, 2013), the Transformer (Vaswani et al., 2017), and the SOTA text GANs. The results are shown in Table 2. Our best model outperforms the pretrained baselines in LM Score and BLEU_F. With respect to diversity, MLE-trained models are superior to all text GANs except RelGAN in terms of BLEU_HA on the SNLI dataset. The results suggest that GAN training in these pretrained baselines, particularly with RL optimization (SeqGAN, LeakGAN, IRL), harms the diversity and even the overall performance, as also reported previously (Caccia et al., 2020; Semeniuta et al., 2018). We test our models with different dropout rates at inference (with the training dropout rate unchanged at 0.25), and the results show that a smaller dropout rate leads to higher quality. However, it also leads to less diversity (BLEU_B) and widens the gap between the distributions of the real and generated sentences (FED). This suggests that varying the dropout rate at inference is a method to balance quality and diversity, where the larger noise introduced by dropout encourages more diverse generated samples. We provide explanations in Appendix A.1.3.

4.2. WHY CAN NON-AUTOREGRESSIVE GAN WORK?

Comparing Autoregressive (AR) and Non-autoregressive (NAR) Generators. In Section 2, we mentioned that the AR generator can be an obstacle to effective optimization. To verify this conjecture, we replace NAGAN's NAR generator with an AR generator, while keeping the other parts unchanged. The non-differentiability issue must be tackled before optimizing the AR generator with gradient-based methods, so we adopt two methods proposed in previous text GANs. The first is the Soft-Embedding approximation (Chen et al., 2018): the generator predicts a word distribution at each position and obtains a soft word embedding weighted by the distribution, where the soft embedding is treated as the next input. The second is the Gumbel-Softmax approximation (Kusner & Hernández-Lobato, 2016; Nie et al., 2019): in each step, a token is sampled from the word distribution with the reparameterization trick, where the straight-through estimator is also used for gradient estimation. We implement the AR generator using a Transformer with the same number of parameters as NAGAN. We also equip the generator with a latent variable, which is concatenated with the input at every step. Detailed implementations are described in Appendix A.4.4. We first compare NAGAN with these two methods on synthetic data of different lengths. As shown in Figure 3, NAGAN dominates the AR models on all lengths, and the AR models degrade rapidly as the length increases. We then evaluate these models on the SNLI dataset (max length = 40), where NAGAN remarkably outperforms the AR methods, as shown in Table 3. The results support our claim that the feed-forward architecture benefits gradient-based optimization and makes GAN training from scratch possible. The AR generator suffers from ineffective optimization because of the non-differentiability issue at every step's input, especially with long sequences.

Investigating Discriminator Architectures and Max Gradient Penalty (MaxGP).
To investigate the effects of the discriminator architecture and the MaxGP technique, we use Transformer, CNN, and RNN discriminators, and evaluate them with and without MaxGP on the COCO dataset. Detailed architectures are described in Appendix A.4.4. As shown in Table 4, MaxGP boosts the performance of all architectures, and even the weakest RNN discriminator realizes effective optimization in the non-pretrained setting. Although not critical, the discriminator architecture does affect the results, suggesting that there is room for further improvement in discriminator design.
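For reference, the sampling step of the Gumbel-Softmax approximation used by the second AR baseline in the comparison above can be sketched as follows (a minimal sketch: Gumbel noise −log(−log u) with u ∼ Uniform(0, 1) is added to the logits, and a temperature τ softens the argmax):

```python
import math
import random

def gumbel_softmax_sample(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    g = [-math.log(-math.log(rng.random())) for _ in logits]
    y = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(y)
    e = [math.exp(v - m) for v in y]
    z = sum(e)
    return [v / z for v in e]

y = gumbel_softmax_sample([1.0, 0.5, -0.5], tau=0.5, rng=random.Random(0))
```

As τ → 0, the sample approaches a one-hot vector; the straight-through variant additionally discretizes the forward pass while keeping this soft sample for gradients, which is the recurrent approximation the AR baseline must apply at every step.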

4.3. APPLICATION I: MANIPULATING SENTENCES IN LATENT SPACE

In this task, we set the dropout rate to zero at inference, so the generated text is fully controlled by the latent variable. We evaluate NAGAN on the COCO dataset with two metrics. Success rate denotes the proportion of edited sentences containing the desired word. For overlap, we first compute the ratio of the longest common subsequence's length to the maximal length of the two sentences, and then average the ratios over 100 successfully edited pairs. We compare our proposed method (called GD, gradient descent) against the offset vector method (called OV) (Zhao et al., 2018). In GD, the moving step is repeated for several iterations. For a fair comparison, we also allow OV to try the moving step Z_s := Z_s + β∆ for several iterations, where ∆ is the offset vector. We also compare NAGAN against FMGAN (Chen et al., 2018), which is equipped with latent variables but requires VAE pretraining. However, FMGAN cannot be used with GD because it adopts the Soft-Embedding approximation to tackle the non-differentiability problem, which cannot provide gradients for the latent variables during inference. As presented in Table 6, NAGAN(OV) and FMGAN(OV) are comparable, but NAGAN(GD) achieves a higher success rate with better overlap scores, as it benefits from the direct gradient signals of the non-autoregressive generator. We also provide overall results over 100 random pairs of the top-50 frequent words. The overall results are lower than the chosen cases because the random pairs may have different parts of speech, such as modifying red to people, which makes the task harder.

A.1.1 WHY AUTOREGRESSIVE TEXT GANS MAY IGNORE LATENT VARIABLES

Consider the explicit generative model with latent variables:

P_G(x) = E_{z∼p(z)} ∏_{i=1}^{L} P_G(x_i | x_{<i}, z). (14)

It can degenerate into a language model, where the latent variable z is ignored when generating x, i.e., the mutual information I_G(x, z) = 0. This problem has been found in Variational Autoencoders (VAEs), known as the KL vanishing problem (Bowman et al., 2016).
However, the reconstruction loss in VAEs optimizes a lower bound of I_G(x, z), proven by Li et al. (2017) in Corollary 3:

I_G(x, z) ≥ E_{x∼p_data} E_{z∼q(z|x)} [log P_G(x|z)] + H_{p_data}(x) = −L_recon + constant, (15)

where H denotes entropy, and q is the approximate posterior distribution. Therefore, optimizing L_recon avoids the degeneration, making VAEs able to learn latent representations. However, text GANs only use an adversarial loss, which merely matches P_G and P_real. We can easily construct an ideal generator defined as P_G(x_i | x_{<i}, z) = P_real(x_i | x_{<i}), where P_G perfectly fits P_real but I_G(x, z) equals 0. We conclude that a text GAN with an autoregressive generator is not guaranteed to learn latent representations. Things are different if G is a deterministic function, where the generator G is forced to use z if P_G fits P_real well. To explain this statement, we present another lower bound on the mutual information. Let G_θ be a generator with parameters θ, and p(z) be the prior distribution of z. We have:

I_{G_θ}(x, z) = E_{z∼p(z)} [D_KL(P_{G_θ}(x|z) ‖ P_{G_θ}(x))]
= E_{z∼p(z)} [D_KL(P_{G_θ}(x|z) ‖ P_real(x)) + E_{x∼P_{G_θ}(x|z)} log(P_real(x) / P_{G_θ}(x))] (16)
= E_{z∼p(z)} [D_KL(P_{G_θ}(x|z) ‖ P_real(x))] − D_KL(P_{G_θ}(x) ‖ P_real(x)) (17)
≥ min_{z*,θ*} D_KL(P_{G_θ*}(x|z*) ‖ P_real(x)) − D_KL(P_{G_θ}(x) ‖ P_real(x)). (18)

From Eq (17) to Eq (18), we relax the first term by choosing z* and θ* to minimize the KL divergence. In the first term of Eq (18), z* is a constant, and thus P_{G_θ*}(x|z*) can be regarded as a generator with a fixed input, which ignores the latent variable. In the second term, P_{G_θ}(x) is the expectation of P_{G_θ}(x|z) over p(z), which represents the model distribution considering latent variables. Thus the first term indicates the best KL divergence achievable if G ignores the latent variable, and the second term indicates the KL divergence when G_θ uses the latent variable.
Therefore, I_{G_θ}(x, z) is bounded below by the performance gap between the generators with and without the latent variable. If G is a deterministic function, P_{G_θ*}(x|z*) must be a one-point distribution⁵, which is far from fitting the real data. As long as G_θ can be effectively optimized, the second term of Eq (18) should be smaller than the first term. This indicates I_{G_θ}(x, z) > 0, so G_θ cannot ignore the latent variable. Moreover, when G_θ is optimized to approach the real distribution, the lower bound of the mutual information becomes larger, which usually indicates better representations (Chen et al., 2016; Hjelm et al., 2019).

A.1.2 SENTENCE LENGTH IN NON-AUTOREGRESSIVE GENERATORS

Autoregressive text generators use a special token (e.g., <eos>) to determine the length, but most non-autoregressive text generators require the sentence length before the parallel generation. Strictly speaking, a non-autoregressive text generator is a conditional generator, which generates the text conditioned on the length L and the input C, i.e., x = G(C, L). Therefore, the generation process is split into two steps: determine the sentence length, and then generate the sentence according to the length. We use the non-autoregressive generator only in the second step, where the first step can vary across tasks. In machine translation, a non-autoregressive text generator usually uses a length predictor to predict the target length from the input sentence (Gu et al., 2018), which is trained separately from the non-autoregressive generator. An alternative method is Length-Parallel Decoding (Shu et al., 2020), where the generator tries all possible lengths (by enumerating a pre-specified range), generates a set of sentences, and then selects the best one using an external ranking model. In unconditional text generation, we aim to match P_G(x) with the real distribution. Regarding the length L as another latent variable, P_G(x) can be formulated as a sampling process: L ∼ P(L), Z ∼ P(Z|L), x = G(Z, L). We estimate P(L) by the sentence length distribution of the training data, which helps P_G(x) approach P_real(x) well. In the unsupervised decipherment application, we already know the target length should be equal to the source length, so we can simply obtain the target length. However, it is theoretically possible to learn a length predictor P_C(L|y) to predict the target length, where y is the ciphertext. Then the decipher model F_{Y→X} can be formulated as a sampling process: Z = E(y, c_Y), L ∼ P_C(L|y), x = G(Z, c_X, L).
The classifier P C can be trained to minimize the final objective described in Section 3.4, by reinforcement learning or other optimization methods.
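For the unconditional case, the two-step sampling process above can be sketched as follows (a minimal pure-Python sketch; `generate` is a hypothetical stand-in for the non-autoregressive generator G):

```python
import random
from collections import Counter

def fit_length_distribution(corpus):
    """Estimate p(L) from the empirical sentence lengths of the training data."""
    counts = Counter(len(sentence) for sentence in corpus)
    total = sum(counts.values())
    lengths = sorted(counts)
    probs = [counts[l] / total for l in lengths]
    return lengths, probs

def sample_unconditionally(lengths, probs, z_dim, generate):
    """Two-step sampling: L ~ p(L), then Z ~ N(0, I) with shape (L, z_dim),
    then x = G(Z, L) via the (hypothetical) `generate` callback."""
    L = random.choices(lengths, weights=probs, k=1)[0]
    Z = [[random.gauss(0.0, 1.0) for _ in range(z_dim)] for _ in range(L)]
    return generate(Z, L)

# Toy corpus; the stand-in generator just emits L placeholder tokens.
corpus = [["a", "cat"], ["a", "dog", "runs"], ["hello"], ["a", "cat"]]
lengths, probs = fit_length_distribution(corpus)
sentence = sample_unconditionally(lengths, probs, 4, lambda Z, L: ["<tok>"] * L)
```

Because p(L) is estimated from the data, the length marginal of the generated samples matches that of the training set by construction.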

A.1.3 HOW DROPOUT WORKS IN GANS

This subsection is added during the discussion. Dropout is a common trick in training image GANs, but it is more than a simple regularizer when used in a GAN's generator. We provide two explanations of why dropout can stabilize the training process and balance fluency and diversity.

First, dropout can be regarded as noise that diversifies the generated samples. GAN training usually suffers from the mode collapse problem, which can be alleviated by keeping the generated samples diverse. We refer the reader to the instance noise trick (Sønderby et al., 2017), which also works by maintaining diversity. The authors claim that diverse samples stabilize the gradient flow by avoiding the situation where P_G and P_data have disjoint supports. When removing dropout in NAGAN, we observe spiky losses and an unstable training process; however, it still outperforms some baselines without MLE pretraining. The results of training without dropout are shown in Table 11.

Second, dropout can be regarded as a way of introducing extra latent variables. Unlike common networks, which learn to produce the same output as dropout masks change, it has been shown that random noise injected in the intermediate layers of a GAN's generator affects the generated samples (Karras et al., 2019). We can explicitly treat this noise as latent variables, where different noises are mapped to different samples during GAN training. In our case, the latent variables are the dropout masks. A linear layer with dropout can be formulated as:

y = (1 / (1 − p)) W (x ⊗ m),   (19)

where x is the input vector, y is the output vector, W is the weight matrix, m is the dropout mask, p is the dropout rate, and ⊗ is element-wise multiplication. Taking m as a random variable, we can derive the mean and the variance of y.
E[y] = (1 / (1 − p)) Σ_{m∈{0,1}^N} W (x ⊗ m) P(m) = (1 / (1 − p)) Σ_{i=1}^{N} W_i Σ_{m_i∈{0,1}} x_i m_i P(m_i) = (1 / (1 − p)) Σ_{i=1}^{N} W_i x_i (1 − p) = W x,   (20)

Var[y] = E[y²] − E[y]² = (1 / (1 − p))² Σ_{i=1}^{N} W_i² x_i² Var[m_i] = (1 / (1 − p) − 1) Σ_{i=1}^{N} W_i² x_i²,   (21)

where N is the dimension of x, W_i is the i-th column of W, and the squares in Eq (21) are taken element-wise. When p = 0, the variance of y is 0; for a larger p, the mean stays unchanged but the variance grows. In GAN training and inference, we usually keep the distribution of the dropout masks, i.e., the dropout rate, unchanged (Isola et al., 2017), so that the generator's distribution remains the same. However, if we reduce the dropout rate at inference, the distribution becomes more concentrated, yielding samples with higher quality but lower diversity. This phenomenon is not significant in autoregressive text GANs. One reason is that our implicit generative model uses a deterministic generator, where the generated sample is fully determined by the latent variables (including Z and the dropout masks); in contrast, an autoregressive generator samples from a word distribution at each step, which brings extra diversity, so the effect of dropout is weakened.
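A quick Monte Carlo check of the mean and variance derivation above, for a single output unit (pure-Python sketch; illustrative only):

```python
import random

random.seed(0)
N, p = 6, 0.25
w = [random.uniform(-1, 1) for _ in range(N)]
x = [random.uniform(-1, 1) for _ in range(N)]

# y = (1/(1-p)) * sum_i w_i * x_i * m_i, with m_i ~ Bernoulli(1 - p)
samples = []
for _ in range(100_000):
    m = [1 if random.random() > p else 0 for _ in range(N)]
    samples.append(sum(wi * xi * mi for wi, xi, mi in zip(w, x, m)) / (1 - p))

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

# Closed forms: E[y] = w.x  and  Var[y] = (1/(1-p) - 1) * sum_i (w_i x_i)^2
expected_mean = sum(wi * xi for wi, xi in zip(w, x))
expected_var = (1 / (1 - p) - 1) * sum((wi * xi) ** 2 for wi, xi in zip(w, x))
```

The empirical mean matches W x, and the empirical variance grows with p exactly as the closed form predicts.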

A.2 EXPLORING THE LATENT SPACE AND GENERATION PROCESS

This section is added during the discussion. For a better understanding of the learned latent space and the non-autoregressive generation process, we further conduct four experiments on the COCO dataset.

A.2.1 SENTENCE INTERPOLATION

This subsection is added during the discussion. We set L = 12 and independently sample two sequences of latent variables Z_1, Z_2 from N(0, I). Then we obtain the linearly interpolated latent variables Z_λ = (1 − λ)Z_1 + λZ_2 and translate them to the text space. Cases are shown in Table 7.
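The interpolation itself is straightforward (pure-Python sketch; the translation to text is done by the trained generator, which is omitted here):

```python
import random

random.seed(0)
L, z_dim = 12, 4
# Two independently sampled latent sequences, each of shape (L, z_dim)
Z1 = [[random.gauss(0, 1) for _ in range(z_dim)] for _ in range(L)]
Z2 = [[random.gauss(0, 1) for _ in range(z_dim)] for _ in range(L)]

def interpolate(Z1, Z2, lam):
    """Z_lambda = (1 - lambda) * Z1 + lambda * Z2, position- and dimension-wise."""
    return [[(1 - lam) * a + lam * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(Z1, Z2)]

steps = [interpolate(Z1, Z2, lam / 10) for lam in range(11)]  # lambda = 0.0 ... 1.0
```

Each element of `steps` would then be decoded into a sentence, producing the interpolation paths shown in Table 7.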

A.2.2 SMOOTHNESS OF SENTENCES WITH DIFFERENT LENGTHS

This subsection is added during the discussion. We randomly sample a sequence of latent variables Z and then concatenate a newly sampled z_{L+1} to the end of Z. We translate the two sequences to the text space and find that the two sentences are very similar, except that the length differs. We repeat this step and generate sentences with lengths from 10 to 17. We also try concatenating a newly sampled variable to the front of the sequence. The generated sentences are shown in Table 8.

A.2.3 CONNECTION BETWEEN Z i AND O i

This subsection is added during the discussion. As the length of Z is always equal to the length of the generated sentence, it seems natural to hypothesize that z_i is highly related to o_i. However, the following experiment negates this hypothesis. We first sample a sequence of latent variables Z (L = 10) and translate it to a sentence O. Then we randomly choose a position i and replace z_i with a new latent variable sampled from N(0, I). The new sequence, denoted Z', is translated to a new sentence O'. We repeat this trial 1,600 times and count the probability P(i, j) that o_j changes when z_i is modified. If the hypothesis were true, high probabilities should be observed on the diagonal of the matrix P. However, P(i, j) is highly correlated with the token position j regardless of i, as shown in Figure 4. We conclude that there is no strong connection between o_i and z_i, and that each latent variable plays a similar role in generating each token. This property is not desirable and can make NAGAN less interpretable. We believe that adding regularizers would be a good way to make tokens more correlated with nearby latent variables, which is left for future work.
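The counting procedure can be sketched as follows (pure-Python sketch with a toy stand-in generator; `decode` is hypothetical and simply maps each latent to a token id, unlike the real attention-based generator):

```python
import random

random.seed(0)
L, trials = 10, 1600

def decode(Z):
    """Toy stand-in for the trained generator: maps each latent to a token id.
    The real NAGAN generator mixes information across positions via attention."""
    return [round(z) for z in Z]

# counts[i][j]: number of trials where token o_j changed after resampling z_i
counts = [[0] * L for _ in range(L)]
for _ in range(trials):
    Z = [random.gauss(0, 1) for _ in range(L)]
    O = decode(Z)
    i = random.randrange(L)
    Z2 = list(Z)
    Z2[i] = random.gauss(0, 1)       # resample a single latent
    O2 = decode(Z2)
    for j in range(L):
        if O[j] != O2[j]:
            counts[i][j] += 1
# Dividing each row by the number of trials that modified position i
# yields the probability matrix P(i, j).
```

With this strictly per-position toy decoder only the diagonal of the matrix is nonzero; the experiment above shows that the trained NAGAN behaves very differently.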

A.2.4 INVESTIGATING GENERATION PROCESS

This subsection is added during the discussion. Autoregressive generators always produce tokens from left to right, but non-autoregressive generators produce the whole sequence in parallel. We investigate the generation process of NAGAN's generator and find some interesting properties: the non-autoregressive generator gradually determines the hidden states as the data flows through the Transformer layers, and it tends to determine some of the tokens first and the others later. We collect the hidden states of each Transformer layer and calculate the cosine similarities between the intermediate hidden states and the final outputs. To be specific, sim_{t,l} = cos(h_{t,l}, h_{t,n}), where h_{t,l} is the output of the l-th layer at the t-th position and n is the number of layers. In our experiment, n = 5 and 1 ≤ t ≤ 10. As shown in Figure 5, the similarities at each position usually increase slowly from the first layer to the last layer, which means the generator gradually determines the hidden states. Some tokens at the front (t = 1), in the middle (4 ≤ t ≤ 7), and at the end (t = 10) are determined before the 3rd layer, and the other tokens are determined in the last two layers. The tokens at the front are usually definite or indefinite articles, and the tokens at the end are full stops. The tokens in the middle are usually prepositions (e.g., in, top, and next), which split the whole sentence into two halves. This strategy may make the generation of the remaining tokens easier, because it resembles a masked language model once the nearby context is given.
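The similarity statistic can be computed as follows (pure-Python sketch; `hidden` is a hypothetical [n_layers][T][dim] array of collected hidden states, filled with random values here for illustration):

```python
import math
import random

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_matrix(hidden):
    """sim[l][t] = cos(h_{t,l}, h_{t,n}): compare each layer's output at
    position t with the final (n-th) layer's output at the same position."""
    final = hidden[-1]
    return [[cos(layer[t], final[t]) for t in range(len(layer))] for layer in hidden]

random.seed(0)
n_layers, T, dim = 5, 10, 8
hidden = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(T)]
          for _ in range(n_layers)]
sim = similarity_matrix(hidden)
```

Plotting `sim` as a heat map (layers as rows, positions as columns) reproduces the kind of visualization shown in Figure 5.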

A.3 IMPLEMENTATION DETAILS

Here we introduce the implementation details of NAGAN. The code is available in the supplementary material.

A.3.1 GENERATOR

The generator's Transformer is based on the architecture of the non-autoregressive text decoder (Gu et al., 2018), which contains n_layers Transformer blocks. The attention layers attending to the source sentence are replaced by attention layers attending to the latent variable. In the training stage, we introduce Gumbel noise when obtaining s_t. Introducing noise is a common technique in image GANs (Zhao et al., 2017; Sønderby et al., 2017), where it encourages the model to generate diverse samples and avoids the mode collapse problem in the early training stage. To be specific, s_t = MLP(h_t) + g_t, where g_t is sampled from the standard Gumbel distribution. We use Gumbel noise rather than other noises (e.g., Gaussian noise) because it is more suitable for categorical distributions (Jang et al., 2017). Unlike the Gumbel-Softmax trick, we do not use the noise or sample from the categorical distribution in inference. As mentioned in Section 3.1, the straight-through estimator is only used when obtaining gradients. We provide an alternative description of the implementation as follows:

ō_t = onehot(argmax_v s_{t,v}), õ_t = softmax(s_t / τ), o_t = stop_gradient(ō_t − õ_t) + õ_t,

where s_{t,v} is the v-th dimension of s_t. It means that we use o_t = ō_t in the forward pass and o_t = õ_t + constant when obtaining gradients.

Language Model Score. The language model score (Ke et al., 2019; Caccia et al., 2020) is the NLL of a language model trained on the test set and evaluated on 5,000 generated sentences. As suggested by de Masson d'Autume et al. (2019), this score may favor models that have the same architecture as the language model. Thus we use a 4-gram language model with Kneser-Ney smoothing (Ney et al., 1994) instead of an RNN language model.

n-gram Based Metrics.
To evaluate both the quality and the diversity (Shi et al., 2018), we adopt three BLEU scores: Forward BLEU (BLEU_F), Backward BLEU (BLEU_B), and their harmonic mean (BLEU_HA). Forward BLEU measures quality: it uses the test set as references and evaluates 5,000 generated sentences with the BLEU score. Backward BLEU measures diversity by swapping the roles of the test set and the generated sentences. We use the BLEU-5 score in our experiments. Some previous work adopts self-BLEU (Zhu et al., 2018) for evaluating diversity, which calculates the BLEU score of each generated sentence using the other generated sentences as references. That metric suffers from meaningless diversity: a model that randomly generates tokens can achieve a perfect self-BLEU score. We instead use Backward BLEU to measure diversity, which is symmetric with Forward BLEU and avoids the meaningless-diversity problem. Fréchet Embedding Distance (FED). FED (de Masson d'Autume et al., 2019) is a text version of the Fréchet Inception Distance (Heusel et al., 2017), which has been widely used in image GANs. It measures the similarity between the representation distributions of the test data and the generated data, where we use 5,000 generated sentences as the generated data. Sentences are represented as vectors by the Universal Sentence Encoder (Cer et al., 2018).
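The harmonic mean combining the two directions is simply (pure-Python sketch):

```python
def bleu_ha(bleu_f, bleu_b):
    """Harmonic mean of Forward BLEU (quality) and Backward BLEU (diversity).
    Returns 0.0 if either score is 0 to avoid division by zero."""
    if bleu_f == 0 or bleu_b == 0:
        return 0.0
    return 2 * bleu_f * bleu_b / (bleu_f + bleu_b)
```

The harmonic mean penalizes models that do well on only one axis: bleu_ha(0.9, 0.1) = 0.18, far below the arithmetic mean of 0.5.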

A.4.2 SYNTHETIC DATA

The synthetic data are generated by an HMM model with 100 hidden states. A randomly initialized HMM may have a flat distribution over all sequences, which is remarkably different from natural language text, so we fit the HMM on 5,000 real sentences from the COCO dataset before generating the synthetic data. However, the generated synthetic data are still different from the COCO dataset: the synthetic sentences have a fixed length (length = 20 in Figure 2) and a smaller vocabulary of 500 tokens. The parameters of the oracle HMM will be released. We generate 50,000 sentences as the training set. We do not use a validation or test set, because the Oracle NLL does not need real samples for evaluation. The Oracle NLL is always averaged over 5,000 generated sentences. The baselines with RNN architectures use an LSTM with 128 cells or a GRU with 256 cells; the MLE baseline is a 256-dim GRU. The other hyper-parameters of the baselines are set according to the official implementations. The hyper-parameters of NAGAN are shown in Table 9. In Table 10, we provide the best Oracle NLLs as a supplement.

For the COCO dataset (Chen et al., 2015), we follow previous work (Guo et al., 2018; Shi et al., 2018; Ke et al., 2019) and use the image captions as the real sentences, where the images are discarded. The dataset is processed by Shi et al. (2018) and contains 80,000 samples in the training set, 5,000 samples in the test set, and 4,838 words in the vocabulary. However, the dataset does not contain a validation set, so we randomly choose 5,000 samples from the training set as our validation set. For the SNLI dataset (Bowman et al., 2015), we follow previous work (Semeniuta et al., 2018; Zhao et al., 2018) to extract the sentences, where the connections between the paired sentences are ignored.
The SNLI dataset contains 714,667 samples in the training set, 5,000 samples in both the validation and test set, and 42,981 words in the vocabulary. We trim long sentences to the first 40 tokens. 
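Returning to the synthetic oracle: sampling fixed-length sequences from an HMM and scoring them with the Oracle NLL can be sketched as follows (pure-Python, with tiny illustrative parameters instead of the real 100-state, 500-token oracle):

```python
import math
import random

random.seed(0)
n_states, vocab_size, length = 3, 4, 20

def random_dist(n):
    """A random probability distribution over n outcomes (illustrative only)."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

start = random_dist(n_states)
trans = [random_dist(n_states) for _ in range(n_states)]
emit = [random_dist(vocab_size) for _ in range(n_states)]

def sample_sentence():
    """Ancestral sampling: states follow the Markov chain, tokens the emissions."""
    tokens, state = [], random.choices(range(n_states), weights=start)[0]
    for _ in range(length):
        tokens.append(random.choices(range(vocab_size), weights=emit[state])[0])
        state = random.choices(range(n_states), weights=trans[state])[0]
    return tokens

def oracle_nll(tokens):
    """NLL per token under the HMM, via the normalized forward algorithm."""
    alpha = [start[s] * emit[s][tokens[0]] for s in range(n_states)]
    ll = 0.0
    for t in range(1, len(tokens)):
        z = sum(alpha)
        ll += math.log(z)          # accumulate the log-mass before normalizing
        alpha = [a / z for a in alpha]
        alpha = [sum(alpha[s] * trans[s][s2] for s in range(n_states)) * emit[s2][tokens[t]]
                 for s2 in range(n_states)]
    ll += math.log(sum(alpha))
    return -ll / len(tokens)

sent = sample_sentence()
nll = oracle_nll(sent)
```

The Oracle NLL reported in Figure 2 is this quantity averaged over 5,000 sentences generated by the model under evaluation, scored by the fixed oracle HMM.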

Model | Oracle NLL
Pretrained models:
MLE (Graves, 2013) | 3.87
SeqGAN (Yu et al., 2017b) | 3.84
LeakGAN (Guo et al., 2018) | 3

The hyper-parameters of NAGAN are shown in Table 9, which are tuned to minimize the FED on the validation set. We report the generator architecture and the numbers of parameters of the generator and the discriminator in Table 12. The Transformer baseline and NAGAN have the same number of Transformer blocks, attention heads, and hidden feature size, with two differences: the baseline is an autoregressive generator trained by MLE, and NAGAN does not need word embeddings because it receives the latent variable Z as input. Unless otherwise specified, the other hyper-parameters of the baselines are set according to the official implementations. The reported results in Table 2 are averaged over three runs with different random seeds. As a supplement, we report the results with mean and variance in Table 11. We also show some generated sentences in Table 13.

A.4.4 IMPLEMENTATION OF ABLATION MODELS

Autoregressive (AR) Generators. In Section 4.2, we compare NAGAN against two models with AR generators. These two models differ from NAGAN only in the generator, which is implemented by an autoregressive Transformer with the same architecture except for the attention mask. AR(Soft) is implemented following Chen et al. (2018). At step t, AR(Soft) predicts the next-word distribution P_t = P(x_t|x_<t, z) ∈ R^V. Then we obtain the soft word embedding e_t = E P_t, where E ∈ R^{E_dim×V} is the word embedding matrix. After that, e_t is concatenated with the latent variable and fed into the Transformer as the next input. AR(Gumbel) instead samples the next word in one-hot representation:

o_t = onehot(argmax_v (log P(x_t = v|x_<t, z) + g_v)) ∈ {0, 1}^V.

The Gumbel-softmax trick is a reparameterization trick, where o_t can be regarded as the one-hot representation of a random sample from the categorical distribution P(x_t|x_<t, z) (Jang et al., 2017). The straight-through estimator is also adopted to obtain gradients for o_t. After that, o_t is converted to a word embedding e_t, which is similarly concatenated with z and fed into the Transformer as the next input. AR(Gumbel) uses random decoding in inference and thus keeps the same behavior between training and test.

Other Discriminator Architectures. In Section 4.2, we equip NAGAN with two different discriminators. The CNN discriminator is composed of 12 dilated residual blocks (Yu et al., 2017a), where each block sequentially contains a batch normalization layer (Ioffe & Szegedy, 2015), a 1D dilated convolutional layer, and a residual connection (He et al., 2016). The channel size is 300 and the kernel size is 3. The dilations of the first 4 convolutional layers are set to [1, 1, 2, 3], which is repeated twice for the remaining 8 layers. The convolutional layers do not change the sequence length; the feature is then fed into a max-pooling layer and an MLP as in Eq (10). The RNN discriminator is a one-layer bidirectional GRU with 256 hidden cells. The encoded feature is also processed by Eq (10) to obtain the final score D(O).
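The straight-through one-hot step used by both NAGAN and AR(Gumbel) can be sketched as follows (pure-Python sketch of the forward computation; in practice an autograd framework's stop-gradient operator realizes the backward path through the soft distribution):

```python
import math

def softmax(s, tau=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    m = max(s)
    e = [math.exp((v - m) / tau) for v in s]
    total = sum(e)
    return [v / total for v in e]

def straight_through(s, tau=1.0):
    """Forward pass returns the hard one-hot argmax; the soft distribution is
    what the gradient flows through (o = stop_grad(hard - soft) + soft)."""
    soft = softmax(s, tau)
    hard = [0.0] * len(s)
    hard[max(range(len(s)), key=lambda v: s[v])] = 1.0
    return hard, soft

s_t = [2.0, 0.5, -1.0]   # logits for a 3-word toy vocabulary
hard, soft = straight_through(s_t)
```

The discriminator always receives the hard one-hot vector, while gradients are computed as if the soft vector had been used, which is what lets the whole pipeline be trained end to end.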

A.5 APPLICATION I: MANIPULATE SENTENCES IN LATENT SPACE

In the Offset Vector (OV) method (Zhao et al., 2018), we first randomly generate 100k sentences and obtain the offset vector ∆ = Z̄_t − Z̄_s, where Z̄_t is the average of the latent variables whose corresponding sentences contain the target word, and Z̄_s is that for the source word. Then, we modify the latent variable by Z_s := Z_s + β∆ until the sentence generated from Z_s contains the target word or the process exceeds a pre-specified iteration number. Our proposed method, Gradient Descent (GD), edits the sentence by applying gradient descent over the latent variable. As described in Section 3.4, we maximize o_{t,x}, where t is the edited position and x is the target word. In addition to the maximization term, we add another term to keep the other parts unchanged, which can be formulated as maximizing Σ_{i≠t} o_{i,x_i}, where x_i is the i-th token of the source sentence. The final update step is

Z_s := Z_s + ∂(o_{t,x} + β Σ_{i≠t} o_{i,x_i}) / ∂Z_s,

where β = 0.1 in our experiments. The update step needs a pre-specified position t; we enumerate all editing positions t ∈ [1, L] and choose a successful one with the smallest changes. We also find that using optimizers other than plain gradient descent can improve the performance, where we try Adam (Kingma & Ba, 2015) and SGD with different learning rates.

A.6 APPLICATION II: UNSUPERVISED DECIPHERMENT

A.6.1 IMPLEMENTATION DETAILS

Unsupervised decipherment is a task similar to text style transfer, where we learn a sequence-to-sequence model without parallel data. Following several previous models on style transfer (Zhu et al., 2017; Yang et al., 2018; Dai et al., 2019), we adapt NAGAN to the task by introducing an encoder. An overview of the model is shown in Figure 6, which can be formulated as: F_{X→Y} := G(E(x, c_X), c_Y), F_{Y→X} := G(E(y, c_Y), c_X). c_X and c_Y are labels that specify the input type for the encoder and the desired type for the generator.
Note that G is a non-autoregressive generator whose output length is equal to the source length (the equal length is a requirement of the task). For the reconstruction loss and the cycle loss, we use:

L_rec,X = E_{x∼P_X}[d(x̃, G(E(x, c_X), c_X))],   (32)
L_cyc,X = E_{x∼P_X}[d(x, F_{Y→X}(F_{X→Y}(x)))],

where x̃ in Eq (32) is a noisy text modified from x, obtained by randomly dropping some tokens and swapping the order of others (He et al., 2020). d(x, G(·)) is a distance function, for which we adopt the cross-entropy:

d(x, G(·)) = −Σ_{t=1}^{L} log softmax(s_t)|_{x_t},

where s_t is defined in Eq (8), and softmax(·)|_{x_t} denotes the x_t-th dimension of the vector after softmax. The cross-entropy loss is commonly used in previous non-autoregressive generators (Gu et al., 2018). For the adversarial loss, we utilize a discriminator based on a language model following Yang et al. (2018), which is proven effective on style transfer tasks. In Eq (35), X̂ is the distribution from the generator, where x̂ is obtained in two steps: y ∼ P_Y, x̂ = F_{Y→X}(y). The adversarial loss for the generator is defined as:

L_adv,X = −E_{x̂∼X̂}[ (1/L_x̂) Σ_{t=1}^{L_x̂} log P_D(x̂_t | x̂_{<t}, c_X) ].

We utilize a Transformer as the encoder, whose input is the sum of the source sentence embedding and an embedding of the label c_X (or c_Y). The output of the Transformer is regarded as the latent variable Z ∈ R^{Z_dim×L} in Figure 6. The architecture is the same as the Transformer in the discriminator described in Appendix A.3.2. We adopt a Transformer-based language model as the discriminator, where the input is also added with the label embedding. The architecture of the Transformer is the same as the encoder's, except for the attention mask. Note that the discriminator should pass the gradient back to the generator, so we also use one-hot representations of sentences and the straight-through estimator. A previous work adopts the same method; we refer the interested reader to Tu et al. (2020) for more details.
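The two transfer directions and the cycle reconstruction can be sketched as follows (pure-Python; `E` and `G` are hypothetical toy stand-ins for the encoder and generator, chosen so that the cycle is exact, whereas real models only approximate it via the cycle loss):

```python
def make_transfer(E, G, c_src, c_tgt):
    """F_{src->tgt}(x) = G(E(x, c_src), c_tgt): encode with the source label,
    decode with the target label. Output length equals input length."""
    return lambda seq: G(E(seq, c_src), c_tgt)

# Toy stand-ins: the encoder removes the domain's shift (to reach a shared
# latent space) and the generator adds the target domain's shift back.
E = lambda seq, c: [(t - c) % 26 for t in seq]
G = lambda z, c: [(t + c) % 26 for t in z]

c_X, c_Y = 0, 3                 # domain labels
F_xy = make_transfer(E, G, c_X, c_Y)
F_yx = make_transfer(E, G, c_Y, c_X)

x = [7, 4, 11, 11, 14]          # "hello" as letter indices
y = F_xy(x)                     # transfer X -> Y
x_cycle = F_yx(y)               # transfer back Y -> X
```

Because both directions share the same encoder and generator (distinguished only by the labels), the cycle loss ties the two text spaces to a single shared latent space.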
For the synthetic data, we train our model for 50 epochs. Each training run costs around 15 hours with the devices described above. We also report the latency of one generator training step and of inference in Table 15. Compared with an autoregressive Transformer of the same architecture, NAGAN is slower in training (because our generator's updates need gradients from the discriminator) but faster in inference (a 6.66x speedup). The fast inference comes from the parallel decoding. Compared with autoregressive text GANs of the same architecture, NAGAN also has a faster training step, because autoregressive GANs need to read their generated tokens as input. For the unsupervised decipherment task, we train the model for 200k batches, evaluate it every 1.5k batches, and select the model with the best BLEU-4 on the validation set for the Word Substitution dataset (the best accuracy on the training set for the BrownW-200 dataset). Each training run costs around 37 hours with the same devices described above.



CONCLUSIONS

We present a novel text GAN, named NAGAN, which incorporates a non-autoregressive generator to facilitate efficient gradient-based training from scratch and the use of latent variables. NAGAN adopts a novel formulation for adversarial text generation, which connects it with image GANs and can potentially benefit from the success of image generation. As a preliminary work on adversarial non-autoregressive text generation, NAGAN shows promising results on generating diverse sentences without conditions, which may suggest a direction for solving the multi-modality problem (Gu et al., 2018) and can be applied to general sequence-to-sequence generation as future work.

Footnotes:
- FMGAN uses explicit generative models in pretraining and implicit ones in GAN training. For example, the reconstruction loss in VAEs alleviates degeneration. More is discussed in Appendix A.1.1.
- The reward is estimated by the discriminator, and it keeps changing during training even for the same x.
- L can be different for each sequence. See more details in Section 3.3 and Appendix A.1.2.
- P_{G_θ*}(x|z*) is equal to 1 when x = G_{θ*}(z*) and equal to 0 otherwise.
- Our implementation is based on https://github.com/salesforce/nonauto-nmt.
- The pretrained model can be downloaded at https://tfhub.dev/google/universal-sentence-encoder/3.
- We use the implementation of https://github.com/hmmlearn/hmmlearn.
- https://github.com/geek-ai/Texygen
- https://github.com/FudanNLP/Irl_gen
- IRL: https://github.com/FudanNLP/Irl_gen. SeqGAN, LeakGAN: https://github.com/geek-ai/Texygen. RelGAN: https://github.com/weilinie/RelGAN. ScratchGAN: https://github.com/deepmind/deepmind-research/tree/master/scratchgan. FMGAN: https://github.com/vijini/FM-GAN.



Figure 1: Architecture of NAGAN. (A) The generator converts the latent variable Z = [z_1 z_2 . . . z_L] to a sentence O = [o_1 o_2 . . . o_L], where each o_i is a one-hot vector. The gradient of the non-differentiable operation is estimated by the straight-through estimator. (B) The discriminator produces a score D(O) for the sentence O. The gradient from the discriminator can be passed back to the generator.

Figure 2: Training curves on synthetic data. The vertical dashed line denotes the end of pretraining for SeqGAN and LeakGAN.

Figure 3: Comparison to autoregressive (AR) generators on synthetic data of different lengths. Lower values are better.

Figure 4: Connection between tokens and latent variables at different positions. The element in the i-th row and the j-th column is the probability that the token x_j changes when modifying the latent variable z_i. The last token is a full stop and never changes.

Figure 5: Cases of cosine similarities between intermediate hidden states and the last outputs. The i-th row indicates the hidden states of the i-th Transformer layer (1 ≤ i ≤ 5).

Note that one training step in Figure 2 may be different across models. For MLE, ScratchGAN, NAGAN, and the pretraining part of SeqGAN and LeakGAN, one training step means one training epoch. For the GAN training part of SeqGAN and LeakGAN, one training step contains several updating steps of the generator and the discriminator, which is consistent with the original implementations.

A.4.3 REAL DATA

The discriminator is trained with a cross-entropy based loss (de Masson d'Autume et al., 2019):

L_D,X = −E_{x∼P_X}[ (1/L_x) Σ_{t=1}^{L_x} log P_D(x_t | x_{<t}, c_X) ] + E_{x̂∼X̂}[ (1/L_x̂) Σ_{t=1}^{L_x̂} log P_D(x̂_t | x̂_{<t}, c_X) ].   (35)

By swapping the roles of X and Y, we can obtain the other losses: L_cyc,Y, L_adv,Y, L_rec,Y, and L_D,Y. The final losses can be formulated as follows:

L_G = L_cyc,X + L_cyc,Y + α L_adv,X + α L_adv,Y + β L_rec,X + β L_rec,Y,   (37)
L_D = L_D,X + L_D,Y,   (38)

where α, β are hyper-parameters. G and D are optimized alternately with gradient-based methods.

OTHER DETAILS

For the COCO and SNLI datasets, we train our model for 100 epochs (each epoch contains 1,500 batches of samples), evaluate it every epoch, and select the model with the best FED on the validation set. Each training run uses approximately 4 Intel Xeon E5-2690 v4 CPUs at 2.60GHz and 1 Nvidia GeForce RTX 2080 Ti GPU. Although the results of 10-hour training are close to the reported performance (with a gap of 0.01 in terms of BLEU_HA), we run the full 100 epochs in around 28 hours.

Generation performance on COCO and SNLI datasets. Bold scores indicate the best performance in all models while underline scores are the best in non-pretrained models.

Performance with different discriminator architectures on the COCO dataset (columns: LM Score ↓, BLEU_F ↑, BLEU_B ↑, BLEU_HA ↑, FED ↓).

Unsupervised decipherment. # and * are reported in Chen et al. (2018) and He et al. (2020), respectively.

Comparison to autoregressive (AR) generators on SNLI dataset.

Results for sentence manipulation (modifying a sentence containing a source word into one containing a target word). OV = Offset Vector; GD = Gradient Descent (Ours).

We use the Word Substitution dataset (Yang et al., 2018) and BrownW-200 (Gomez et al., 2018) for the task, where the text is encrypted by the substitution and the Vigenère cipher, respectively. Following previous work, WordSub is reported with BLEU-4, and BrownW-200 is reported with accuracy. As shown in Table 5, NAGAN significantly outperforms existing methods. Compared with AR models, the NAR generator facilitates effective optimization of the cycle loss and the adversarial loss, whereas AR generators face the same optimization problem discussed in Section 4.2. The gradients of both losses can be back-propagated smoothly through our feed-forward generator, leading to a better alignment between the two text spaces and thus a better decipherment result.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, pp. 1097-1100, 2018. doi: 10.1145/3209978.3210080. URL https://doi.org/10.1145/3209978.3210080.

Cases of sentence interpolation. Gray words indicate unchanged parts compared with the preceding sentence.

Smoothness of sentences with different lengths. We convert one sentence to the next one by adding a new sampled latent variable to the end (or the front) of the latent variable sequence. Gray words indicate unchanged parts compared with the preceding sentence.

Hyper-parameters of NAGAN on the synthetic data and the real data.

Results on the synthetic data. NAGAN results are tagged with the dropout rate used at inference (the training dropout rate is always 0.25).

Generation performance on COCO and SNLI datasets with mean and variance of three runs with different random seeds.

Parameter sizes of all models when the vocabulary size is 4,838. Note that NAGAN does not need word embeddings because it receives latent variables as inputs.

Cases of unconditional generation on the COCO dataset.

In AR(Soft), the soft word embedding is concatenated with the latent variable z ∈ R^{Z_dim} and fed into the Transformer as the next input. The latent variable is sampled from the normal distribution N(0, I). Note that the model generates a sequence of embeddings in training, which does not represent any discrete sentence; we follow Chen et al. (2018) and use greedy decoding to generate text in inference. AR(Gumbel) is implemented following Kusner & Hernández-Lobato (2016): it predicts the next-word distribution at step t and obtains the next word in one-hot representation with the Gumbel-softmax trick (Jang et al., 2017).

Training and inference latency of an autoregressive Transformer and NAGAN on the COCO dataset. All results are evaluated with a batch size of 32.

A.3.2 DISCRIMINATOR

The discriminator's Transformer has the same architecture as the encoder in Gu et al. (2018), which contains n_layers Transformer blocks.

Max Gradient Penalty. Following Zhou et al. (2019), we calculate a max sampled gradient penalty over a batch of n interpolated samples. The intermediate sequences Ô_i are sampled in the following way: first, sample a sentence from the real data, whose one-hot representation is O_{r_i}; second, generate a fake sample O_{f_i} with the same length as O_{r_i}; finally,

Ô_i = λ_i O_{r_i} + (1 − λ_i) O_{f_i},

where λ_i is sampled uniformly from [0, 1]. Note that Ô_i may not be a one-hot sequence and does not represent a discrete sentence; this is not a problem because it is only used to regularize the discriminator.

Score Regularizer. We also adopt an L2-regularizer on the predicted score D(O), which is proposed to stabilize GAN training (Xu et al., 2019).

The training procedure is presented in Algorithm 1. In each iteration, after training the discriminator, we sample a new batch of latent variables Z_i and update the generator G via Eq (13).

Parameter Averaging. Some studies (Mescheder et al., 2017; Nagarajan & Kolter, 2017; Mescheder et al., 2018) analyze the training dynamics of GANs and show that the generator does not converge to a point but becomes a periodic function around the Nash equilibrium. Therefore we adopt a common method from image GANs called parameter averaging (Yazici et al., 2019). Suppose the generator's parameter at step t is θ_t. Then the exponential moving average is defined as

θ̃_t = γ θ̃_{t−1} + (1 − γ) θ_t,

where γ is often a number slightly smaller than 1. θ̃_t has a much smaller variance than θ_t, and θ̃_t is used for the evaluation.
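Parameter averaging can be sketched as follows (pure-Python over a parameter dict; illustrative only):

```python
def ema_update(avg_params, params, gamma=0.999):
    """One EMA step: theta_avg_t = gamma * theta_avg_{t-1} + (1 - gamma) * theta_t."""
    return {name: gamma * avg_params[name] + (1 - gamma) * params[name]
            for name in avg_params}

# Usage: keep a shadow copy of the generator's parameters and update it each
# training step; the shadow copy (not the raw parameters) is used for evaluation.
avg = {"w": 0.0}
for step_value in [1.0, 1.0, 1.0]:
    avg = ema_update(avg, {"w": step_value}, gamma=0.9)
```

With gamma = 0.9 and a constant raw parameter of 1.0, the shadow value after three steps is 1 − 0.9³ = 0.271, illustrating how the average smoothly tracks the oscillating raw parameters.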

A.4.1 EVALUATION METRICS

Oracle NLL. For synthetic data, we calculate the negative log-likelihood (NLL) per token of the oracle model, evaluated on 5,000 generated sentences. We do not use perplexity because the likelihood of implicit generative models is intractable.

The non-autoregressive generator is kept unchanged, which is the core of our model. Autoregressive generators suffer from the non-differentiability problem when optimizing the cycle loss and the adversarial loss (red arrows in Figure 6(A)), whereas our non-autoregressive generator can be optimized more effectively.
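The per-token oracle NLL can be sketched as follows. `oracle_logprob` is a hypothetical stand-in for the oracle model's conditional token log-probability; the toy oracle below is ours, for illustration only.

```python
import math

def oracle_nll_per_token(sentences, oracle_logprob):
    """NLL per token: average of -log p_oracle(w | prefix) over all tokens."""
    total_nll, total_tokens = 0.0, 0
    for sent in sentences:
        for i, w in enumerate(sent):
            total_nll -= oracle_logprob(sent[:i], w)  # hypothetical oracle call
            total_tokens += 1
    return total_nll / total_tokens

# Toy oracle: every token has probability 0.25 regardless of context.
toy_oracle = lambda prefix, w: math.log(0.25)
nll = oracle_nll_per_token([["a", "b"], ["c"]], toy_oracle)  # -> log 4
```

In the synthetic-data setting, the oracle is the known data-generating LSTM, so this quantity is exactly computable, unlike the likelihood of the implicit generator.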

A.6.2 EXPERIMENT DETAILS

The Word Substitution dataset (Yang et al., 2018) uses the word substitution cipher to encrypt the plaintext. The dataset contains 9,445 words in the vocabulary, 200,000 unpaired sentences for both X and Y in the training set, and 5,000 sentence pairs in each of the validation and test sets. Following previous work (He et al., 2020), we use BLEU-4 for evaluation. We show several cases on the Word Substitution dataset in Table 14.

The BrownW-200 dataset (Gomez et al., 2018) uses the Vigenère cipher for encryption. The dataset contains 200 words in the vocabulary and 51,606 (11,468) sentence pairs for the training (test) set. We break the connections of the paired sentences during training. Following previous work (Chen et al., 2018), we use accuracy for evaluation, calculated as the average proportion of words correctly deciphered. The dataset does not contain a validation set, so we choose the best model on the training set as our final model.

For hyper-parameters, we set α = 1, β = 1 at the beginning of training for the Word Substitution dataset. They then decay to 0 linearly over the first 100k batches. (We train the models for 200k batches.) This annealing method was proposed by He et al. (2020). Similarly, we set α = 0.3, β = 1 at the beginning for the BrownW-200 dataset. On both datasets, we use the RAdam (Liu et al., 2020) optimizer with a learning rate of 1e-3 to optimize the generator and the discriminator alternately.

Table 14: Cases of unsupervised decipherment on the WordSub dataset. We show the deciphered text and the golden answer; the ciphertext is omitted. Red words indicate decipherment errors.

Generated: cart cost is $ num dollars .
Golden: entry cost is $ num dollars .

Generated: the wait staff and bartenders are rude .
Golden: the wait staff and bartenders are rude .

Generated: everyone wants to help .
Golden: everyone wants to help .

Generated: all cleaned up we enjoyed the area and searched for some drinks and sunglasses .
Golden: all cleaned up we enjoyed the area and headed for some drinks and gaming .

Generated: our server john was incredible cute , patient , attentive and funny .
Golden: our server james was incredible : cute , patient , attentive and funny .
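The decipherment accuracy metric described above can be sketched as follows. Whether the proportion is averaged per sentence (as here) or pooled over all corpus tokens is our assumption; the toy data is for illustration only.

```python
def decipher_accuracy(predictions, references):
    """Average proportion of words correctly deciphered.

    Each prediction/reference pair is compared position-wise, and the
    per-sentence proportions are averaged over the corpus.
    """
    scores = []
    for pred, ref in zip(predictions, references):
        correct = sum(p == r for p, r in zip(pred, ref))
        scores.append(correct / len(ref))
    return sum(scores) / len(scores)

pred = [["entry", "cost", "is"], ["everyone", "wants", "help"]]
ref  = [["entry", "cost", "is"], ["everyone", "needs", "help"]]
acc = decipher_accuracy(pred, ref)  # -> (3/3 + 2/3) / 2 = 5/6
```

Because word substitution is a bijection on the vocabulary, position-wise word matching is a faithful measure of how much of the cipher key has been recovered.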

