IMPROVING SEQUENCE GENERATIVE ADVERSARIAL NETWORKS WITH FEATURE STATISTICS ALIGNMENT

Abstract

Generative Adversarial Networks (GANs) face great challenges in synthesizing sequences of discrete elements, such as mode dropping and unstable training. The binary classifier in the discriminator may limit the capacity of the learning signal and thus hinder the progress of adversarial training. To address these issues, in addition to the binary classification feedback, we harness a Feature Statistics Alignment (FSA) paradigm to deliver fine-grained signals in the latent high-dimensional representation space. Specifically, FSA forces the mean statistics of the fake data distribution to approach those of the real data as closely as possible in a finite-dimensional feature space. Experiments on synthetic and real benchmark datasets show superior performance in quantitative evaluation and demonstrate the effectiveness of our approach for discrete sequence generation. To the best of our knowledge, the proposed architecture is the first to employ feature alignment regularization in a Gumbel-Softmax based GAN framework for sequence generation.

1. INTRODUCTION

Unsupervised sequence generation is the cornerstone of a plethora of applications, such as machine translation (Wu et al., 2016), image captioning (Anderson et al., 2018), and dialogue generation (Li et al., 2017). The most common approach to autoregressive sequence modeling is to maximize the likelihood of each token in the sequence given the previous partial observation. However, using maximum likelihood estimation (MLE) for sequence modeling is inherently prone to the exposure bias problem (Bengio et al., 2015), which results from the discrepancy between the training and inference stages: during inference, the generator predicts the next token conditioned on its previously generated tokens, whereas during training it is conditioned on ground-truth prefixes, yielding an accumulating mismatch that grows with the length of the generated sequence. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) can serve as an alternative to models trained by MLE and have achieved promising results in generating sequences of discrete elements, in particular language sequences (Kusner & Hernández-Lobato, 2016; Yu et al., 2017; Lin et al., 2017; Guo et al., 2018; Fedus et al., 2018; Nie et al., 2019; de Masson d'Autume et al., 2019; Zhou et al., 2020; Scialom et al., 2020). GANs consist of two competing networks: a discriminator trained to distinguish generated samples from real data, and a generator that aims to produce high-quality samples that fool the discriminator. Although they succeed in avoiding exposure bias, GANs still suffer from intrinsic problems such as mode dropping, reward sparsity, and training instability.
To enrich the informativeness of the discriminator's training signal, several approaches have been proposed that exploit latent features, such as feature distribution matching (Zhang et al., 2017; Chen et al., 2018) and comparative discriminators (Lin et al., 2017; Zhou et al., 2020).

In this work, we propose to improve GANs for sequence generation by jointly considering feature statistics matching and a relativistic discriminator, which serve as fine-grained and coarse learning signals, respectively. We leverage the Feature Statistics Alignment (FSA) paradigm to embed the latent feature representations in a finite-dimensional feature space and force the distribution of generated samples to approach the real data distribution by minimizing the distance between their respective feature centroids. Intuitively, matching the mean feature representations of fake and real samples draws the two data distributions closer together. In addition, the relativistic discriminator (Jolicoeur-Martineau, 2019) is employed to measure comparative information between generated and real sequences, and it empirically proves effective during model training. Our experimental results illustrate the effectiveness of the FSA technique and of a large batch size in alleviating the vanishing-gradient problem and stabilizing training compared with vanilla Gumbel-Softmax GANs. Moreover, our models generate discrete text sequences of high quality in terms of semantic coherence and grammatical correctness, as evaluated with crowdsourcing. Furthermore, we empirically demonstrate that the proposed architecture outperforms most existing models in both quantitative and qualitative evaluation. To the best of our knowledge, the proposed framework is the first to adopt the feature statistics alignment paradigm in a Gumbel-Softmax based GAN framework for discrete sequence generation.
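To make the centroid-matching idea concrete, the following minimal NumPy sketch computes an FSA-style objective as the squared L2 distance between the mean feature vectors of a real and a fake batch. The function names `feature_centroid` and `fsa_loss` are ours for illustration, not from the paper; in the actual model, the features would be taken from an intermediate layer of the discriminator and the loss would be back-propagated into the generator.

```python
import numpy as np

def feature_centroid(features):
    """Mean feature vector over a batch: shape (batch, dim) -> (dim,)."""
    return np.asarray(features).mean(axis=0)

def fsa_loss(real_features, fake_features):
    """Squared L2 distance between the centroids of the real and fake
    batch features. Minimizing this pushes the mean statistics of the
    generated distribution toward those of the real data."""
    diff = feature_centroid(real_features) - feature_centroid(fake_features)
    return float(np.dot(diff, diff))
```

Because the loss compares batch-level statistics rather than individual samples, larger batches give a lower-variance estimate of the true distribution means, which is consistent with the paper's observation that large batch sizes help stabilize training.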

2. ADVERSARIAL SEQUENCE GENERATION

Adversarial sequence generation has attracted broad attention for its ability to avoid the exposure bias issue suffered by maximum likelihood estimation (MLE) when generating language sequences. Rooted in game theory, its goal is to train a generator network G(z; θ^(G)) that produces samples from the data distribution p_data(x) by decoding random noise z (e.g., drawn from a standard normal distribution) into the sequence x = G(z; θ^(G)), where the training signal is provided by a discriminator network D(x; φ^(D)) trained to distinguish between samples drawn from the real data distribution p_data and those produced by the generator. The minimax objective of adversarial training is formulated as:

\min_{\theta^{(G)}} \max_{\phi^{(D)}} \; \mathbb{E}_{x \sim p_{\mathrm{data}}} \left[ \log D(x; \phi^{(D)}) \right] + \mathbb{E}_{z \sim p_z} \left[ \log \left( 1 - D(G(z; \theta^{(G)}); \phi^{(D)}) \right) \right].

Despite the impressive results of GANs in sequence generation (Yu et al., 2017; Gulrajani et al., 2017; Scialom et al., 2020), there are still several fundamental issues in GAN training: (a) Training instability, which arises from the intrinsic nature of the minimax game in GANs; (b) Mode dropping, the phenomenon that GANs generate samples covering only a limited set of patterns in the real data distribution rather than attending to its diverse modes (Chen et al., 2018); (c) Reward sparsity, which occurs because the discriminator is easier to train than the generator, making it difficult for the generator to acquire instructive feedback (Zhou et al., 2020).

Figure 1: (a) Standard GANs using a binary classifier as the discriminator; (b) GANs with Feature Statistics Alignment and a relativistic discriminator, which provide more instructive signals for updating the generator.
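Since the framework relies on the Gumbel-Softmax relaxation to pass gradients through discrete token sampling, a minimal NumPy sketch of that sampling step may be helpful. The function `gumbel_softmax_sample` and its parameter names are ours for illustration; in the actual model this operation would be applied per time step over the generator's vocabulary logits, typically with an annealed temperature.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw a 'soft' one-hot sample from a categorical distribution
    parameterized by `logits` via the Gumbel-Softmax trick:
        y = softmax((logits + g) / tau),  with  g ~ Gumbel(0, 1).
    Lower temperatures `tau` yield samples closer to discrete one-hot
    vectors; higher temperatures yield smoother distributions."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(1e-10, 1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))              # Gumbel(0, 1) noise
    z = (np.asarray(logits) + g) / tau
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

The key property is that the output is a differentiable function of the logits, so the discriminator's feedback on the relaxed samples can flow back into the generator without REINFORCE-style policy gradients.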

