REFINING DEEP GENERATIVE MODELS VIA DISCRIMINATOR GRADIENT FLOW

Abstract

Deep generative modeling has seen impressive advances in recent years, to the point where it is now commonplace to see simulated samples (e.g., images) that closely resemble real-world data. However, generation quality is generally inconsistent for any given model and can vary dramatically between samples. We introduce Discriminator Gradient flow (DGflow), a new technique that improves generated samples via the gradient flow of entropy-regularized f-divergences between the real and the generated data distributions. The gradient flow takes the form of a non-linear Fokker-Planck equation, which can be easily simulated by sampling from the equivalent McKean-Vlasov process. By refining inferior samples, our technique avoids wasteful sample rejection used by previous methods (DRS & MH-GAN). Compared to existing works that focus on specific GAN variants, we show that our refinement approach can be applied to GANs with vector-valued critics and even to other deep generative models such as VAEs and Normalizing Flows. Empirical results on multiple synthetic, image, and text datasets demonstrate that DGflow leads to significant improvement in the quality of generated samples for a variety of generative models, outperforming the state-of-the-art Discriminator Optimal Transport (DOT) and Discriminator Driven Latent Sampling (DDLS) methods.

1. INTRODUCTION

Deep generative models (DGMs) have excelled at numerous tasks, from generating realistic images (Brock et al., 2019) to learning policies in reinforcement learning (Ho & Ermon, 2016). Among the variety of proposed DGMs, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have gained widespread popularity for their ability to generate high quality samples that resemble real data. Unlike Variational Autoencoders (VAEs) (Kingma & Welling, 2014) and Normalizing Flows (Rezende & Mohamed, 2015; Kingma & Dhariwal, 2018), GANs are likelihood-free methods; training is formulated as a minimax optimization problem involving a generator and a discriminator. The generator seeks to generate samples that are similar to the real data by minimizing a measure of discrepancy (between the generated samples and real samples) furnished by the discriminator. The discriminator is trained to distinguish the generated samples from the real samples. Once trained, the generator is used to simulate samples and the discriminator has traditionally been discarded. However, recent work has shown that discarding the discriminator is wasteful: it actually contains useful information about the underlying data distribution. This insight has led to sample improvement techniques that use this information to improve the quality of generated samples (Azadi et al., 2019; Turner et al., 2019; Tanaka, 2019; Che et al., 2020). Unfortunately, current methods either rely on wasteful rejection operations in the data space (Azadi et al., 2019; Turner et al., 2019), or require a sensitive diffusion term to ensure sample diversity (Che et al., 2020). Prior work has also focused on improving GANs with scalar-valued discriminators, which excludes a large family of GANs with vector-valued critics, e.g., MMDGAN (Li et al., 2017; Bińkowski et al., 2018) and OCFGAN (Ansari et al., 2020), as well as likelihood-based generative models.
In this work, we propose Discriminator Gradient flow (DGflow), which formulates sample improvement as refining inferior samples using the gradient flow of f-divergences between the generator and the real data distributions (Fig. 1). DGflow avoids wasteful rejection operations and can be used in a deterministic setting without a diffusion term. Existing state-of-the-art methods, specifically Discriminator Optimal Transport (DOT) (Tanaka, 2019) and Discriminator Driven Latent Sampling (DDLS) (Che et al., 2020), can be viewed as special cases of DGflow. Similar to DDLS, DGflow recovers the real data distribution when the gradient flow is simulated exactly. We further present a generalized framework that employs existing pre-trained discriminators to refine samples from a variety of deep generative models: we demonstrate that our method can be applied to GANs with vector-valued critics, and even to likelihood-based models such as VAEs and Normalizing Flows. Empirical results on synthetic datasets, and benchmark image (CIFAR10, STL10) and text (Billion Words) datasets, demonstrate that our gradient flow-based approach outperforms DOT and DDLS on multiple quantitative evaluation metrics. In summary, this paper's key contributions are:
• DGflow, a method to refine samples from deep generative models using the gradient flow of f-divergences;
• a framework that extends DGflow to GANs with vector-valued critics, VAEs, and Normalizing Flows;
• experiments on a variety of generative models trained on synthetic, image (CIFAR10 & STL10), and text (Billion Words) datasets, demonstrating that DGflow is effective in improving samples from generative models.

2. BACKGROUND: GRADIENT FLOWS

The following gives a brief introduction to gradient flows; we refer readers to the excellent overview by Santambrogio (2017) for a more thorough introduction. Let (X, ‖·‖₂) be a Euclidean space and F : X → R be a smooth energy function. The gradient flow of F is the smooth curve {x_t}_{t∈R₊} that follows the direction of steepest descent, i.e.,

x'(t) = -∇F(x(t)).    (1)

The value of the energy F is minimized along this curve. This idea of steepest descent curves can be generalized to arbitrary metric spaces via the minimizing movement scheme (Jordan et al., 1998). Of particular interest is the metric space of probability measures endowed with the Wasserstein distance (W_p); the Wasserstein distance is a metric and the W_p topology satisfies weak convergence of probability measures (Villani, 2008, Theorem 6.9). Gradient flows in the 2-Wasserstein space (P₂(Ω), W₂), i.e., the space of probability measures with finite second moments equipped with the 2-Wasserstein metric, have been studied extensively. Let {ρ_t}_{t∈R₊} be the gradient flow of a functional F in the 2-Wasserstein space, where ρ_t is absolutely continuous with respect to the Lebesgue measure. The curve {ρ_t}_{t∈R₊} satisfies the continuity equation (Ambrosio et al., 2008, Theorem 8.3.1),

∂_t ρ_t + ∇·(ρ_t v_t) = 0,    (2)

where the velocity field v_t in Eq. (2) is given by v_t(x) = -∇_x (δF/δρ)(ρ_t), with δF/δρ denoting the first variation of the functional F. Since the seminal work of Jordan et al. (1998), which showed that the Fokker-Planck equation is the gradient flow of a particular functional in the Wasserstein space, gradient flows in the Wasserstein metric have been a popular tool in the analysis of partial differential equations (PDEs). For example, they have been applied to the study of the porous-medium equation (Otto, 2001), crowd modeling (Maury et al., 2010; 2011), and mean-field games (Almulla et al., 2017).
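As a concrete illustration, the steepest-descent curve in Eq. (1) can be simulated with an explicit Euler discretization, x_{n+1} = x_n - η∇F(x_n). The sketch below is a minimal NumPy example; the quadratic energy F(x) = ½‖x‖² is our own toy choice, not from the paper, and its gradient flow contracts every initial point toward the minimizer at the origin.

```python
import numpy as np

def euler_gradient_flow(grad_F, x0, step=0.1, n_steps=100):
    """Explicit Euler discretization of the gradient flow x'(t) = -grad F(x(t))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # Follow the direction of steepest descent of the energy F.
        x = x - step * grad_F(x)
    return x

# Toy energy F(x) = 0.5 * ||x||^2, so grad F(x) = x; the flow decays to 0.
x_final = euler_gradient_flow(lambda x: x, np.array([2.0, -1.0]))
```

With step size η the iterates shrink by a factor (1 - η) each step, mirroring the exponential decay of the continuous-time flow.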
More recently, gradient flows of various distances used in the deep generative modeling literature have been proposed, notably those of the sliced Wasserstein distance (Liutkus et al., 2019), the maximum mean discrepancy (Arbel et al., 2019), the Stein discrepancy (Liu, 2017), and the Sobolev discrepancy (Mroueh et al., 2019). Gradient flows have also been used for learning non-parametric and parametric implicit generative models (Liutkus et al., 2019; Gao et al., 2019; 2020). As an example of the latter, Variational Gradient Flow (Gao et al., 2019) learns a mapping between latent vectors and samples evolved using the gradient flow of f-divergences. In this work, we present a method using gradient flows of entropy-regularized f-divergences for refining samples from deep generative models, employing existing discriminators as density-ratio estimators.

3. GENERATOR REFINEMENT VIA DISCRIMINATOR GRADIENT FLOW

This section describes our main contribution: Discriminator Gradient flow (DGflow). As an overview, we begin with the construction of the gradient flow of entropy-regularized f-divergences and describe its application to sample refinement. We then discuss how to simulate the gradient flow in the latent space of the generator, a procedure more suitable for high-dimensional datasets. Finally, we present a simple technique that extends our method to generative models that have not yet been studied in the context of refinement. Due to space constraints, we focus on conveying the key concepts and relegate details (e.g., proofs) to the appendix.

The entropy-regularized f-divergence functional is defined as F_μ^f(ρ) := D_f(μ‖ρ) - γH(ρ), where the f-divergence term D_f(μ‖ρ) ensures that the "distance" between the probability density ρ and the target density μ decreases along the gradient flow, and the differential entropy term H(ρ) improves diversity and expressiveness when the gradient flow is simulated for finite time-steps. We now construct the gradient flow of F_μ^f.

Lemma 3.1. Define the functional F_μ^f : P₂(Ω) → R as

F_μ^f(ρ) := ∫ f(ρ(x)/μ(x)) μ(x) dx + γ ∫ ρ(x) log ρ(x) dx,

where the first term is the f-divergence and the second the negative entropy, and f is a twice-differentiable convex function with f(1) = 0. The gradient flow of the functional F_μ^f(ρ) in the Wasserstein space (P₂(Ω), W₂) is given by the following PDE,

∂_t ρ_t(x) - ∇_x·(ρ_t(x) ∇_x f'(ρ_t(x)/μ(x))) - γΔ_x ρ_t(x) = 0,    (6)

where ∇_x· and Δ_x denote the divergence and Laplace operators respectively.

The proof is given in Appendix A.1. The PDE in Eq. (6) is a type of Fokker-Planck equation (FPE). FPEs have been studied extensively in the literature on stochastic processes and have a Stochastic Differential Equation (SDE) counterpart (Risken, 1996). In the case of Eq. (6), the equivalent SDE is given by

dx_t = -∇_x f'(ρ_t/μ)(x_t) dt + √(2γ) dw_t,    (7)

with drift and diffusion terms respectively, where dw_t denotes the standard Wiener process. Eq. (7) defines the evolution of a particle x_t under the influence of drift and diffusion. Specifically, it is a McKean-Vlasov process (Braun & Hepp, 1977), a type of non-linear stochastic process, since the drift term at any time t depends on the distribution ρ_t of the particle x_t. Eqs. (6) and (7) are equivalent in the sense that the distribution of the particle x_t in Eq. (7) solves the PDE in Eq. (6). Consequently, samples from the density ρ_t along the gradient flow can be obtained by first drawing samples x_0 ∼ ρ_0 and then simulating the SDE in Eq. (7). The SDE can be approximately simulated via the stochastic Euler scheme (also known as the Euler-Maruyama method) (Beyn & Kruse, 2011),

x_{τ_{n+1}} = x_{τ_n} - η ∇_x f'(ρ_{τ_n}/μ)(x_{τ_n}) + √(2γη) ξ_{τ_n},    (8)

where ξ_{τ_n} ∼ N(0, I), the time interval [0, T] is partitioned into equal intervals of size η, and τ_0 < τ_1 < ⋯ < τ_N denote the discretized time-steps. Eq. (8) provides a non-parametric procedure to refine samples from a generator g_θ: we let μ be the density of real samples and ρ_{τ_0} the density of samples generated from g_θ, obtained by first sampling from the prior latent distribution z ∼ p_Z(z) and then feeding z into g_θ. We first generate particles x_0 ∼ ρ_{τ_0} and then update the particles using Eq. (8) for N time steps. Given a binary classifier (discriminator) D that has been trained to distinguish between samples from μ and ρ_{τ_0}, the density-ratio ρ_{τ_0}(x)/μ(x) can be estimated via the well-known density-ratio trick (Sugiyama et al., 2012),

ρ_{τ_0}(x)/μ(x) = (1 - D(y = 1|x)) / D(y = 1|x) = exp(-d(x)),    (9)

where D(y = 1|x) denotes the conditional probability of the sample x being from μ and d(x) denotes the logit output of the classifier D. We term this procedure, where samples are refined via the gradient flow of f-divergences, Discriminator Gradient flow (DGflow).
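To make the refinement step concrete, the following is a minimal NumPy sketch of the stochastic Euler update in Eq. (8) for the KL divergence, where f(r) = r log r gives f'(exp(-d(x))) = 1 - d(x), so the drift -∇_x f'(ρ/μ) reduces to +∇_x d(x). The quadratic logit d and its gradient below are toy stand-ins for a trained discriminator, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def dgflow_refine_kl(x0, grad_d, eta=0.1, gamma=0.01, n_steps=25):
    """Euler-Maruyama simulation of the DGflow SDE in data space for the
    KL divergence f(r) = r log r. Since f'(exp(-d(x))) = 1 - d(x), the
    drift reduces to ascending the discriminator logit d, with a small
    diffusion term for diversity."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + eta * grad_d(x) + np.sqrt(2.0 * gamma * eta) * noise
    return x

# Toy logit d(x) = -||x - c||^2, peaking at a hypothetical "real" mode c;
# its gradient is -2 (x - c).
c = np.array([1.0, 1.0])
x = dgflow_refine_kl(np.zeros(2), lambda x: -2.0 * (x - c))
```

Under this toy logit the refined particle drifts toward the mode c, illustrating how samples move toward regions the discriminator judges as real.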

3.1. REFINEMENT IN THE LATENT SPACE

Eq. (8) requires a running estimate of the density-ratio ρ_{τ_n}(x)/μ(x), which can be approximated using the stale estimate ρ_{τ_n}(x)/μ(x) ≈ ρ_{τ_0}(x)/μ(x) for η → 0 and small N, where the density ρ_{τ_n} remains close to ρ_{τ_0}. However, our initial image experiments showed that refining directly in high-dimensional data-spaces with the stale estimate is problematic; error accumulates at each time-step, leading to a visible degradation in the quality of data samples (e.g., the appearance of artifacts in images). To tackle this problem, we propose refining the latent vectors before mapping them to samples in data-space using g_θ. We describe a procedure analogous to Eq. (8) but in the latent space, for generators g_θ that take a latent vector z ∈ Z as input and generate a sample x ∈ X. We first show in Lemma 3.2 that the density-ratio between two distributions in the latent space can be estimated via the density-ratio of the corresponding distributions in the data space.

Lemma 3.2. Let g : Z → X be a sufficiently well-behaved injective function where Z ⊆ R^n and X ⊂ R^m with m > n. Let p_Z(z), p_Ẑ(ẑ) be probability densities on Z, and q_X(x), q_X̂(x) be the densities of the pushforward measures g♯p_Z, g♯p_Ẑ respectively. Assume that p_Z(z) and p_Ẑ(ẑ) have the same support, and that the Jacobian matrix J_g has full column rank. Then, the density-ratio p_Ẑ(u)/p_Z(u) at the point u ∈ Z is given by

p_Ẑ(u)/p_Z(u) = q_X̂(g(u))/q_X(g(u)).

Algorithm 1: Refinement in the Latent Space using DGflow.
Require: first derivative of f (f'), generator (g_θ), discriminator (d_φ), number of update steps (N), step-size (η), noise factor (γ).
1: z_0 ∼ p_Z(z)    ▷ Sample from the prior.
2: for i ← 0, N do
3:     ξ_i ∼ N(0, I)
4:     z_{i+1} = z_i - η ∇_{z_i} f'(exp(-d_φ(g_θ(z_i)))) + √(2ηγ) ξ_i
5: end for
6: return g_θ(z_N)    ▷ The refined sample.

The proof is in Appendix A.2.
We let p_Ẑ(ẑ) be the density of the "correct" latent space distribution induced by a generator g_θ, i.e., p_Ẑ(ẑ) is the density of a probability measure whose pushforward under g_θ approximately equals the target data density μ. The density-ratio of the prior latent distribution p_Z(z) and p_Ẑ(ẑ) can now be computed by combining Lemma 3.2 with Eq. (9),

p_Z(u)/p_Ẑ(u) = ρ_{τ_0}(g_θ(u))/μ(g_θ(u)) = exp(-d(g_θ(u))).    (11)

Although a generator g_θ parameterized by a neural network may not satisfy the conditions of injectivity and a full column rank Jacobian matrix J_{g_θ}, Eq. (11) provides an approximation that works well in practice, as shown by our experiments. Combining Eq. (11) with Eq. (8) yields an update rule for refining samples in the latent space,

u_{τ_{n+1}} = u_{τ_n} - η ∇_u f'(p_{u_{τ_n}}/p_Ẑ)(u_{τ_n}) + √(2γη) ξ_{τ_n},    (12)

where u_{τ_0} ∼ p_Z(z) and the density-ratio p_{u_{τ_n}}/p_Ẑ is approximated using the stale estimate p_{u_{τ_0}}/p_Ẑ = exp(-d(g_θ(u))). We summarize the complete algorithm in Algorithm 1.
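Algorithm 1 can be sketched as follows. This is a minimal NumPy version with central finite differences standing in for autodiff, and a hypothetical composed logit d(g_θ(z)) in place of a trained generator-discriminator pair; it is an illustration of the update rule, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def num_grad(fn, z, eps=1e-5):
    """Central finite-difference gradient (stand-in for autodiff)."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        g[i] = (fn(z + e) - fn(z - e)) / (2 * eps)
    return g

def dgflow_latent(f_prime, d_of_g, z0, eta=0.1, gamma=0.01, n_steps=25):
    """Latent-space refinement (Algorithm 1 sketch):
    z <- z - eta * grad_z f'(exp(-d(g(z)))) + sqrt(2*eta*gamma) * xi."""
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        obj = lambda u: f_prime(np.exp(-d_of_g(u)))
        xi = rng.standard_normal(z.shape)
        z = z - eta * num_grad(obj, z) + np.sqrt(2.0 * eta * gamma) * xi
    return z

# KL divergence: f(r) = r log r, so f'(r) = 1 + log r.
f_prime_kl = lambda r: 1.0 + np.log(r)
# Hypothetical composed logit d(g(z)) peaking at z = (1, -1).
d_of_g = lambda z: -np.sum((z - np.array([1.0, -1.0])) ** 2)
z_refined = dgflow_latent(f_prime_kl, d_of_g, np.zeros(2))
```

For the KL divergence the objective collapses to 1 - d(g(z)), so the update ascends the logit in latent space, matching line 4 of Algorithm 1.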

3.2. REFINEMENT FOR ALL

Thus far, prior work (Azadi et al., 2019; Turner et al., 2019; Tanaka, 2019; Che et al., 2020) has focused on improving samples for GANs with scalar-valued discriminators, which comprise the canonical GAN as well as recent variants, e.g., WGAN (Gulrajani et al., 2017) and SNGAN (Miyato et al., 2018). Here, we propose a technique that extends our approach to refine samples from a larger class of DGMs, including GANs with vector-valued critics, VAEs, and Normalizing Flows. Let p_θ be the density of the samples generated by a generator g_θ and μ be the density of the real data distribution. We are interested in refining samples from g_θ; however, a corresponding density-ratio estimator for p_θ/μ is unavailable, as is the case with the aforementioned generative models. Let D_φ be a discriminator that has been trained on the same dataset but for a different generative model g_φ (e.g., let g_φ and D_φ be the generator and discriminator of SNGAN respectively). D_φ can be used to compute the density ratio p_φ/μ. A straightforward technique would be to use the crude approximation p_θ/μ ≈ p_φ/μ, which could work provided p_θ and p_φ are not too far from each other. Our experiments show that this simple approximation works only to a limited extent (see Appendix E). To improve upon this crude approximation, we propose to correct the density-ratio estimate. Specifically, a discriminator D_λ is initialized with the weights from D_φ and is fine-tuned on samples from g_φ and g_θ. D_φ and D_λ are then used to approximate the density-ratio p_θ/μ,

p_θ(x)/μ(x) = (p_φ(x)/μ(x)) · (p_θ(x)/p_φ(x)) = exp(-d_φ(x)) · exp(-d_λ(x)),    (13)

where d_φ and d_λ are the logits output by D_φ and D_λ, respectively. We term the network D_λ the density ratio corrector; our experiments show that it produces higher quality samples than using p_θ/μ ≈ p_φ/μ.

The estimate in Eq. (13) is similar to telescoping density-ratio estimation (TRE), a technique proposed in very recent independent work (Rhodes et al., 2020). In brief, Rhodes et al. (2020) show that classifier-based density ratio estimators perform poorly when the distributions are "too far apart": the classifier can easily distinguish between the distributions even with a poor estimate of the density ratio. TRE expands the standard density ratio into a telescoping product of more difficult-to-distinguish intermediate density ratios. Likewise, in Eq. (13), we treat p_φ as an intermediate distribution and estimate the final density-ratio as a product of two density-ratios.

4. RELATED WORK

Azadi et al. (2019) first proposed the idea of improving samples from a GAN's generator by discriminator rejection sampling (DRS), making use of the density-ratio provided by the discriminator to estimate the acceptance probability. Metropolis-Hastings GAN (MH-GAN) (Turner et al., 2019) improved upon the costly rejection sampling procedure via the Metropolis-Hastings algorithm. Unlike DGflow, both of these methods reject inferior samples instead of refining them. Our method is closely related to recent state-of-the-art sample refinement techniques, specifically Discriminator-Driven Latent Sampling (DDLS) (Che et al., 2020) and Discriminator Optimal Transport (DOT) (Tanaka, 2019). In fact, both of these methods can be seen as special cases of DGflow. DDLS treats a GAN as an energy-based model and uses Langevin dynamics to sample from the energy-based latent distribution p_t(z) ∝ p_Z(z) exp(d(g_θ(z))) induced by performing rejection sampling in the latent space. This distribution is the same as p_Ẑ(ẑ), which can be seen by rearranging terms in Eq. (11). If we use the KL divergence by setting f(r) = r log r, DGflow is equivalent to DDLS. However, there are practical differences that make DGflow more appealing.
DDLS requires estimation of the score function ∇_z {log p_Z(z) + d(g_θ(z))} to perform the update, which becomes undefined if z escapes the support of p_Z(z), e.g., in the case of the uniform prior distribution commonly used in GANs; handling such cases would require techniques such as projected gradient descent. This problem does not arise in the case of DGflow since it only uses the density-ratio that is implicitly defined by the discriminator. Moreover, DDLS uses Langevin dynamics, which requires the sensitive diffusion term to ensure diversity and to prevent points from collapsing to the maximum-likelihood point. In DGflow, sample diversity is ensured by the density-ratio term and the diffusion term serves as an enhancement. Note that DGflow performs well even without the diffusion term (i.e., with γ = 0; see Tables 13 & 14 in the appendix). This deterministic variant of DGflow is a practical alternative with one less hyperparameter to tune. DOT refines samples by constructing an Optimal Transport (OT) map induced by the WGAN discriminator. The OT map is realized by means of a deterministic optimization problem in the vicinity of the generated samples. If we further analyze the case of DGflow with γ = 0 and solve the resulting ordinary differential equation (ODE) using the backward Euler method,

u_{τ_{n+1}} = argmin_{u∈R^n} f'(p_{u_{τ_n}}/p_Ẑ)(u) + (1/2λ) ‖u - u_{τ_n}‖²,    (14)

DOT emerges as a special case when we consider a single update step of Eq. (14) using gradient descent and set f(t) = log(t) with λ = 1/2. This connection of DGflow to DOT, an optimal transport technique, is perhaps unsurprising given the relationship between gradient flows and the dynamical Benamou-Brenier formulation of optimal transport (Santambrogio, 2017). Recent work has also sought to improve generative models via sample evolution in the training/generation process.
In energy-based generative models (Arbel et al., 2021; Deng et al., 2020), the energy functions can be viewed as a component that improves some base generator. For example, the Generalized Energy-Based Model (GEBM) (Arbel et al., 2021) jointly trains a base generator, by minimizing a lower bound of the KL divergence, along with an energy function in an alternating fashion. Once trained, the energy function is used to refine samples from the base generator using Langevin dynamics, and serves a similar purpose to the discriminator in DDLS and DGflow. The Noise Conditional Score Network (NCSN) (Song & Ermon, 2019; 2020), a score-based generative model, can be seen as a gradient flow that refines a sample all the way from noise to data. Latent Optimization GAN (LOGAN) (Wu et al., 2020) optimizes a latent vector via natural gradient descent as part of the GAN training process. In contrast to these works, we primarily focus on refining samples from pretrained generative models using the gradient flow of f-divergences.

5. EXPERIMENTS

In this section, we present empirical results on various deep generative models trained on multiple synthetic and real world datasets. Our primary goals were to determine whether (a) DGflow is effective in improving the quality of samples from generative models, (b) the proposed extension to other generative models improves their sample quality, and (c) DGflow generalizes to different types of data and metrics. Note that we did not seek to achieve state-of-the-art results for the datasets studied, but rather to demonstrate that DGflow is able to significantly improve samples from the bare generators for different models. We experimented with three f-divergences, namely the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and the log D divergence (Gao et al., 2019). The specific forms of the functions f and their derivatives are tabulated in Table 7 (appendix). We compare DGflow with two state-of-the-art competing methods: DOT and DDLS. In this section we discuss the main results and relegate details to the appendix. Our code is available online at https://github.com/clear-nus/DGflow.
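For reference, the (f, f') pairs for the KL and JS divergences, following the convention f(1) = 0, can be written down directly. We omit the log D divergence here, whose exact form is given in Table 7; the dictionary layout below is our own organizational choice.

```python
import numpy as np

# f and f' for two of the f-divergences used with DGflow, with f(1) = 0.
F_DIVS = {
    "KL": {
        # f(r) = r log r
        "f": lambda r: r * np.log(r),
        "f_prime": lambda r: 1.0 + np.log(r),
    },
    "JS": {
        # f(r) = r log r - (1 + r) log((1 + r)/2)
        "f": lambda r: r * np.log(r) - (1.0 + r) * np.log((1.0 + r) / 2.0),
        "f_prime": lambda r: np.log(2.0 * r / (1.0 + r)),
    },
}
```

Plugging either `f_prime` into the latent update of Algorithm 1 (applied to the density-ratio estimate exp(-d)) selects the corresponding divergence for refinement.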

5.1. 2D DATASETS

We first tested DGflow on two synthetic datasets, 25Gaussians and 2DSwissroll, to visually inspect the improvement in the quality of generated samples. We generated 5000 samples from a trained WGAN-GP generator and refined them using DOT, DDLS, and DGflow. We performed refinement in the latent space for DDLS and directly in the data-space for DOT and DGflow. We plot the samples generated from the WGAN-GP generator (blue) and the refined samples using the different techniques (red) against the real samples from the training dataset (brown). Although the WGAN-GP generator learned the overall structure of the dataset, it also learned a number of spurious modes. DOT is able to refine the spurious samples, but only to a limited degree. In contrast, DDLS and DGflow are able to correct almost all spurious samples and recover the correct structure of the data. Visualizations for DGflow with different f-divergences can be found in the appendix (Fig. 4). We also compared the different methods quantitatively on two metrics: % high quality samples and kernel density estimate (KDE) score. A sample is classified as a high quality sample if it lies within 4 standard deviations of its nearest Gaussian. The KDE score is computed by first estimating the KDE using generated samples and then computing the log-likelihood of the training samples under the KDE estimate. We computed both metrics 10 times using 5000 samples and report the means in Table 1. The quantitative metrics reinforce the qualitative analysis and show that DDLS and DGflow significantly improve the samples from the generator, with DGflow performing slightly better than DDLS in terms of the KDE score. We also conducted experiments on the CIFAR10 and STL10 datasets to demonstrate the efficacy of DGflow in a real-world setting, following the setup of Tanaka (2019) for our image experiments.
We used the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Inception Score (IS) (Salimans et al., 2016) metrics to evaluate the quality of generated samples before and after refinement. A higher IS and a lower FID correspond to higher quality samples.

5.2. IMAGE EXPERIMENTS

We first applied DGflow to GANs with scalar-valued discriminators (e.g., WGAN-GP, SNGAN) trained on the CIFAR10 and STL10 datasets. Table 2 shows that DGflow significantly improves the quality of samples across the base models; a comparison with DOT (Table 11 in the appendix) shows that DGflow outperforms DOT on all models. In Table 3, we reproduce previously reported IS results for generative models and other sample improvement methods (DRS, MH-GAN, and DDLS) for completeness. DGflow performs the best in terms of relative improvement from the base score and even outperforms the state-of-the-art BigGAN (Brock et al., 2019), a conditional generative model, without the need for additional labels. Qualitatively, DGflow improves the vibrance of the samples and corrects deformations in the foreground object. Fig. 3 shows the change in the quality of samples when using DGflow: the leftmost column shows the images generated from the base models and the successive columns show the refined samples using DGflow over increments of 5 update steps. We then evaluated the ability of DGflow to refine samples from generative models without corresponding discriminators, namely MMDGAN, OCFGAN-GP, VAEs, and Normalizing Flows (Glow). We used SN-DCGAN (ns) as the surrogate discriminator D_φ for these models and fine-tuned density ratio correctors D_λ for each model as described in Section 3.2. Table 4 shows the FID scores achieved by these models without and with refinement using DGflow. We observe a clear improvement in the quality of samples when these generative models are combined with DGflow.

5.3. CHARACTER-LEVEL LANGUAGE MODELING

Finally, we conducted an experiment on the character-level language modeling task proposed by Gulrajani et al. (2017) to show that DGflow works on different types of data. We trained a character-level GAN language model on the Billion Words dataset (Chelba et al., 2013), pre-processed into 32-character long strings. We evaluated the generated samples using the JS-4 and JS-6 scores, which compute the Jensen-Shannon divergence between the 4-gram and 6-gram probabilities of the data generated by the model and the real data. Refinement with DGflow led to an improvement in both the JS-4 and JS-6 scores.
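A minimal implementation of the JS-n metric, computing the Jensen-Shannon divergence between empirical n-gram distributions of two corpora (natural log; the helper names and toy corpora are our own illustration):

```python
import collections
import math

def ngram_probs(strings, n):
    """Empirical distribution over character n-grams of a corpus."""
    counts = collections.Counter()
    for s in strings:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def js_n(real, generated, n=4):
    """JS divergence between the n-gram distributions of two corpora."""
    p, q = ngram_probs(real, n), ngram_probs(generated, n)
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in set(p) | set(q)}
    def kl(a, b):
        return sum(pa * math.log(pa / b[g]) for g, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical corpora score 0, and completely disjoint n-gram sets score log 2, the maximum for the JS divergence in natural log.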

6. CONCLUSION

In this paper, we proposed a technique to improve samples from deep generative models by refining them using the gradient flow of f-divergences between the real and the generator data distributions. We also presented a simple framework that extends the proposed technique to commonly used deep generative models: GANs, VAEs, and Normalizing Flows. Experimental results indicate that gradient flows provide an excellent alternative methodology for refining generative models. Moving forward, we are considering several technical enhancements to improve DGflow's performance. At present, DGflow uses a stale estimate of the density-ratio, which could adversely affect sample evolution when the gradient flow is simulated for a larger number of steps; how to efficiently update this estimate is an open question. Another related question is when the evolution of the samples should be stopped; running chains for too long may modify characteristics of the original sample (e.g., orientation and color), which may be undesirable. This issue does not just affect DGflow; a method for automatically stopping sample evolution could improve results across refinement techniques.

Similarly, in Deng et al. (2020), a discriminator that estimates the energy function is combined with a language model to train an energy-based text-generation model. Score-based generative modeling (SBGM) (Song & Ermon, 2019; 2020) is another active area of research closely related to energy-based models. The Noise Conditional Score Network (NCSN) (Song & Ermon, 2019; 2020), an SBGM, trains a neural network to estimate the score function of a probability density at various noise levels. Once trained, this score network is used to evolve samples from noise to the data distribution using Langevin dynamics.
NCSN can be viewed as a gradient flow that refines a sample all the way from noise to data; however, unlike DGflow, NCSN is a complete generative model in itself, not a sample refinement technique that can be applied to other generative models. Other Related Work: Monte Carlo techniques have been used for improving various components of generative models, e.g., Grover et al. (2018) proposed Variational Rejection Sampling, which performs rejection sampling in the latent space of VAEs to improve the variational posterior, and Grover et al. (2019) used likelihood-free importance sampling for bias correction in generative models. Wu et al. (2020) proposed Latent Optimization GAN (LOGAN), which optimizes the latent vector as part of the training process, unlike DGflow, which refines the latent vector post training.

D IMPLEMENTATION DETAILS

D.1 2D DATASETS

Datasets The 25Gaussians dataset was constructed by generating 100000 samples from a mixture of 25 equally likely 2D isotropic Gaussians with means {-4, -2, 0, 2, 4} × {-4, -2, 0, 2, 4} ⊂ R² and standard deviation 0.05. Once generated, the data-points were normalized by 2√2 following Tanaka (2019). The 2DSwissroll dataset was constructed by first generating 100000 samples of the 3D swissroll dataset using make_swiss_roll from scikit-learn with noise=0.25 and then keeping only dimensions {0, 2}. The generated samples were normalized by 7.5.
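The 25Gaussians construction can be sketched as follows (NumPy only; `make_25gaussians` is our own helper name, and the 2DSwissroll construction via scikit-learn's `make_swiss_roll` is omitted):

```python
import numpy as np

def make_25gaussians(n=100000, std=0.05, seed=0):
    """Mixture of 25 equally likely 2D isotropic Gaussians with means on
    the grid {-4,-2,0,2,4}^2 and std 0.05, normalized by 2*sqrt(2)."""
    rng = np.random.default_rng(seed)
    grid = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
    means = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
    idx = rng.integers(0, 25, size=n)          # pick a mode uniformly
    data = means[idx] + std * rng.standard_normal((n, 2))
    return data / (2.0 * np.sqrt(2.0))

data = make_25gaussians(1000)
```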

Base Models

We trained a WGAN-GP model for both datasets. The generator was a fully-connected network with ReLU non-linearities that mapped z ∼ N(0, I_{2×2}) to x ∈ R². Similarly, the discriminator was a fully-connected network with ReLU non-linearities that mapped x ∈ R² to R. We refer the reader to Gulrajani et al. (2017) for the exact network structures. The gradient penalty factor was set to 10. The models were trained for 10K generator iterations with a batch size of 256 using the Adam optimizer with a learning rate of 10⁻⁴, β₁ = 0.5, and β₂ = 0.9. We updated the discriminator 5 times for each generator iteration.

Hyperparameters We ran DOT for 100 steps and performed gradient descent using the Adam optimizer with a learning rate of 0.01 and β = (0., 0.9), as suggested by Tanaka (2019). DDLS was run for 50 iterations with a step-size of 0.01, and the Gaussian noise was scaled by a factor of 0.1, as suggested by Che et al. (2020). For DGflow, we set the step-size η = 0.01, the number of steps N = 100, and the noise regularizer γ = 0.01. We used the output from the WGAN-GP discriminator directly as a logit for estimating the density ratio for DDLS and DGflow.

Metrics We compared the different methods quantitatively on two metrics: % high quality samples and kernel density estimate (KDE) score. A sample is classified as a high quality sample if it lies within 4 standard deviations of its nearest Gaussian. The KDE score is computed by first estimating the KDE using generated samples and then computing the log-likelihood of the training samples under the KDE estimate. KDE was performed using sklearn.neighbors.KernelDensity with a Gaussian kernel and a kernel bandwidth of 0.1. The quantitative metrics were averaged over 10 runs with 5000 samples from each method.
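The two metrics above can be sketched as follows. These are NumPy stand-ins: our experiments used sklearn.neighbors.KernelDensity, which is replaced here by an explicit Gaussian-KDE log-likelihood so the sketch is self-contained; the function names are our own.

```python
import numpy as np

def pct_high_quality(samples, means, std=0.05, k=4.0):
    """Fraction of samples within k standard deviations of the nearest mode."""
    d = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=-1)
    return np.mean(d.min(axis=1) < k * std)

def kde_score(generated, train, bandwidth=0.1):
    """Mean log-likelihood of training points under a Gaussian KDE fitted
    to the generated samples (explicit form of sklearn's KernelDensity
    score with a Gaussian kernel)."""
    n, dim = generated.shape
    d2 = ((train[:, None, :] - generated[None, :, :]) ** 2).sum(-1)
    log_k = -d2 / (2.0 * bandwidth ** 2)
    log_norm = np.log(n) + (dim / 2.0) * np.log(2.0 * np.pi * bandwidth ** 2)
    # log p(x) = logsumexp over kernel centers minus the normalization.
    m = log_k.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1))
    return np.mean(lse - log_norm)
```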

D.2 IMAGE EXPERIMENTS

Datasets CIFAR10 (Krizhevsky et al., 2009) is a dataset of 60K natural RGB images of size 32 × 32 from 10 classes. STL10 (Coates et al., 2011) is a dataset of 100K natural RGB images of size 96 × 96 from 10 classes. We resized the STL10 dataset to 48 × 48 for SNGAN and WGAN-GP, and to 32 × 32 for MMDGAN, OCFGAN-GP, and VAE, since the respective base models were trained on these sizes.

Base Models for CIFAR10 We used the publicly available pre-trained models for WGAN-GP, SN-DCGAN (hi), and SN-DCGAN (ns). We refer the reader to Tanaka (2019) for exact details about these models. For SN-ResNet-GAN and OCFGAN-GP, we used the pre-trained models from Miyato et al. (2018) and Ansari et al. (2020) respectively. We used the respective discriminators of SN-DCGAN (ns), SN-DCGAN (hi), and WGAN-GP for density-ratio estimation when refining their generators. For the SN-ResNet-GAN (hi) generator, we used the SN-DCGAN (ns) discriminator, as the non-saturating loss provides a better density-ratio estimate than a discriminator trained using the hinge loss. We trained our own models for MMDGAN, VAE, and Glow. We used the generator and discriminator architectures shown in Table 6 for MMDGAN with d = 32. VAE used the same architecture with d = 64. Our Glow model was trained using the code available at https://github.com/y0ast/Glow-PyTorch with a batch size of 56 for 150 epochs. The density-ratio correctors, D_λ (see Section 3.2), were initialized with the weights from the SN-DCGAN (ns) released by Tanaka (2019). D_λ was then fine-tuned on images from SN-DCGAN (ns)'s generator and the generator being improved (e.g., MMDGAN and OCFGAN-GP) using SGD with a learning rate of 10⁻⁴ and a momentum of 0.9. We fine-tuned D_λ for 10000 iterations with a batch size of 64.

Base Models for STL10 We used the publicly available pre-trained models (Tanaka, 2019; Ansari et al., 2020) for WGAN-GP, SN-DCGAN (hi), SN-DCGAN (ns), and OCFGAN-GP.
We trained our own models for MMDGAN and VAE with the same architecture and training details as for CIFAR10. We fine-tuned the density-ratio correctors for STL10 for 5000 iterations, with all other details the same as for CIFAR10.

Hyperparameters We performed 25 updates of DGflow for CIFAR10 and STL10 with a step size of 0.1 for models that do not require density-ratio correction. For STL10 models that require density-ratio correction, we performed 15 updates with a step size of 0.05. The noise regularizer (γ), whenever used, was set to 0.01.

Metrics We used the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016) to evaluate the quality of generated samples before and after refinement. The IS measures the confidence in the classification of generated samples by a pre-trained InceptionV3 network, whereas the FID is the Fréchet distance between multivariate Gaussians fitted to the 2048-dimensional feature vectors extracted from the InceptionV3 network for real and generated data. Both metrics were computed using 50K samples for all models, except Glow, for which we used 10K samples. Following Tanaka (2019), we used the entire training and test set (60K images) for CIFAR10 and the entire unlabeled set (100K images) for STL10 as the set of real images for computing FID.
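Once feature statistics are extracted, the FID reduces to a closed-form Fréchet distance between two Gaussians. A NumPy sketch of that final step (the InceptionV3 feature extraction is omitted; computing Tr((Σ₁Σ₂)^{1/2}) via eigenvalues instead of a matrix square root is our implementation choice):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    # For PSD covariances, sigma1 @ sigma2 has real non-negative eigenvalues,
    # so Tr((sigma1 sigma2)^{1/2}) is the sum of their square roots.
    eig = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eig.real, 0.0, None)))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)
```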

D.3 CHARACTER-LEVEL LANGUAGE MODELING

Dataset We used the Billion Words dataset (Chelba et al., 2013), pre-processed into 32-character strings.

Base Model Our generator was a 1D CNN that followed the architecture used by Gulrajani et al. (2017).

Hyperparameters We performed 50 updates of DGflow with a step size of 0.1 and noise factor γ = 0.

Metrics The JS-4 and JS-6 scores were computed using the code provided by Gulrajani et al. (2017) at https://github.com/igul222/improved_wgan_training. We used 10000 samples from the models to compute the JS-4 score.

Fig. 4 shows the samples generated by WGAN-GP (leftmost, blue) and refined samples generated using DGflow with different f-divergences (red). Fig. 5 shows the deterministic component, -∇_x f′(ρ_0/µ)(x_0), of the velocity for different f-divergences on the 2D datasets. Fig. 6 (right) shows the latent-space distribution recovered by DGflow when applied in the latent space for the 2D datasets. This latent space is the same as the one derived by Che et al. (2020), i.e., p_t(z) ∝ p_Z(z) exp(d(g_θ(z))), which is shown in Fig. 6 (left) for both datasets.
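The JS-n score measures the Jensen-Shannon divergence between the n-gram distributions of real and generated strings. A self-contained sketch (our simplification of the reference implementation linked above):

```python
import math
from collections import Counter

def js_ngram(real, fake, n=4):
    """JS divergence between the n-gram distributions of two corpora,
    each given as a list of strings (n=4 gives the JS-4 score)."""
    def ngram_dist(corpus):
        counts = Counter(s[i:i + n] for s in corpus
                         for i in range(len(s) - n + 1))
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}

    p, q = ngram_dist(real), ngram_dist(fake)
    mix = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in set(p) | set(q)}

    def kl(a, b):
        return sum(v * math.log(v / b[g]) for g, v in a.items() if v > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

Identical corpora give a score of 0; corpora with disjoint n-gram supports give log 2.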


Table 11 shows the comparison of DGflow with DOT in terms of the Inception score for the CIFAR10 and STL10 datasets. DGflow outperforms DOT significantly for all the base GAN models on both datasets. Tables 13 and 14 compare the deterministic variant of DGflow (γ = 0) against DOT and DDLS. These results show that the diffusion term only serves as an enhancement for DGflow, not a necessity, and it outperforms competing methods even without added noise. Table 15 shows the results of DGflow on MMDGAN, OCFGAN-GP, and VAE models when the SN-DCGAN (ns) discriminator is used directly as a density-ratio estimator, without an additional density-ratio corrector.

Runtime DGflow performs a backward pass through d_φ ∘ g_θ to compute the gradient of the density ratio with respect to the latent vector. This results in the same runtime complexity as that of DOT and DDLS. Table 8 shows a comparison of the runtimes of DOT, DDLS, and DGflow on the 25Gaussians dataset under the same conditions. As expected, these refinement methods have similar runtimes in practice. The wall-clock time required for DGflow (KL) to refine 100 samples from different base models on the CIFAR10 and STL10 datasets is reported in Tables 9 and 10.
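The refinement loop itself is a plain Euler-Maruyama simulation in latent space. A toy NumPy sketch with an analytic stand-in for ∇_z d_φ(g_θ(z)) (in practice this gradient comes from a backward pass through the actual networks; the quadratic `grad_logit` below is purely illustrative):

```python
import numpy as np

def dgflow_refine(z0, grad_logit, eta=0.01, gamma=0.01, n_steps=100, seed=0):
    """KL-variant DGflow update:
    z <- z + eta * grad_z d(g(z)) + sqrt(2 * gamma * eta) * noise."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        drift = grad_logit(z)                        # stand-in for autograd
        noise = rng.standard_normal(z.shape)
        z = z + eta * drift + np.sqrt(2.0 * gamma * eta) * noise
    return z

# Toy "discriminator" logit d(g(z)) = -||z - c||^2 / 2, with gradient c - z:
c = np.array([1.0, -2.0])
z = dgflow_refine(np.zeros((64, 2)), lambda z: c - z)
```

Refinement moves the latents toward the region where the logit is largest, i.e., toward c.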



This implies that f(t) = t log t − t + 1, which is a twice-differentiable convex function with f(1) = 0. For further discussion of these techniques, please refer to Appendix C.



Figure 1: An illustration of refinement using DGflow, with the gradient flow in the 2-Wasserstein space P₂ (top) and the corresponding discretized SDE in the latent space Z (bottom). Image samples from the densities along the gradient flow are shown in the middle.

Figure 2: Qualitative comparison of DGflow (KL) with DOT and DDLS on synthetic 2D datasets.

Figure 3: Improvement in the quality of samples generated from the base model (leftmost columns) over the steps of DGflow for SN-ResNet-GAN and SN-DCGAN on the CIFAR10 and STL10 datasets respectively.

terms of the FID score and outperforms DOT on multiple models. The corresponding values of the Inception score can be found in the Appendix (Table

Figure 4: Qualitative comparison of DGflow with different f-divergences on the 25Gaussians and 2DSwissroll datasets.

Figure 5: A vector plot showing the deterministic component of the velocity, i.e., the drift -∇_x f′(ρ_0/µ)(x_0), for different f-divergences on the 25Gaussians and 2DSwissroll datasets.

Figures 7, 8, 9, and 10 show the samples generated by the base model (left) and the refined samples (right) using DGflow for the CIFAR10 and STL10 datasets.

Figure 6: The latent space recovered by DGflow (right) for the 2D datasets is the same as the one derived by Che et al. (2020) (left).

(a) WGAN-GP (b) WGAN-GP + DGflow (c) SN-DCGAN (hi) (d) SN-DCGAN (hi) + DGflow (e) SN-DCGAN (ns) (f) SN-DCGAN (ns) + DGflow (g) SN-ResNet-GAN (h) SN-ResNet-GAN + DGflow Figure 7: Samples from different models for the CIFAR10 dataset before (left) and after (right) refinement using DGflow.

Glow (h) Glow + DGflow Figure 8: Samples from different models for the CIFAR10 dataset before (left) and after (right) refinement using DGflow.

(a) WGAN-GP (b) WGAN-GP + DGflow (c) SN-DCGAN (hi) (d) SN-DCGAN (hi) + DGflow (e) SN-DCGAN (ns) (f) SN-DCGAN (ns) + DGflow Figure 9: Samples from different models for the STL10 dataset before (left) and after (right) refinement using DGflow.

VAE + DGflow Figure 10: Samples from different models for the STL10 dataset before (left) and after (right) refinement using DGflow.

Quantitative comparison on the 25Gaussians dataset. Higher scores are better.

Comparison of different variants of DGflow with DOT on the CIFAR10 and STL10 datasets. For SN-DCGAN, (hi) denotes the hinge loss and (ns) denotes the non-saturating loss. Lower scores are better. DGflow's results have been averaged over 5 random runs with the standard deviation in parentheses.

Inception scores of different generative models, DRS, MH-GAN, DDLS, and DGflow on the CIFAR10 dataset. Higher scores are better.

Comparison of different variants of DGflow applied to MMDGAN, OCFGAN-GP, VAE, and Glow models. Lower scores are better. Results have been averaged over 5 random runs with the standard deviation in parentheses.

Results of DGflow on a character-level GAN language model.



Network architectures used for MMDGAN and VAE models.

f -divergences and their derivatives.

Table 12 compares different variants of DGflow applied to MMDGAN, OCFGAN-GP, VAE, and Glow generators in terms of the Inception score. DGflow leads to a significant improvement in the quality of samples for all the models.

Runtime comparison of DOT, DDLS, and DGflow (KL) on the 25Gaussians dataset. The runtime is averaged over 100 runs with the standard deviation reported in parentheses.

Runtime of DGflow (KL) for models that do not require density-ratio correction, on a single GeForce RTX 2080 Ti GPU. The runtime is averaged over 100 runs with the standard deviation reported in parentheses.

Runtime of DGflow (KL) for models that require density-ratio correction, on a single GeForce RTX 2080 Ti GPU. The runtime is averaged over 100 runs with the standard deviation reported in parentheses.

Comparison of different variants of DGflow with DOT on the CIFAR10 and STL10 datasets. Higher scores are better.

Comparison of different variants of DGflow applied to MMDGAN, OCFGAN-GP, VAE, and Glow models. Higher scores are better.

Comparison of different variants of DGflow without diffusion (i.e., γ = 0) on the CIFAR10 and STL10 datasets. Lower scores are better.

Comparison of DDLS with DGflow (with and without diffusion) on the CIFAR10 dataset. Higher scores are better.

Comparison of different variants of DGflow applied to MMDGAN, OCFGAN-GP, and VAE models without density-ratio correction. Lower scores are better.
VAE 150.49 (.07) 151.76 (.01) 152.03 (.05) 151.88 (.11)

ACKNOWLEDGEMENTS

This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-RP-2019-011) to H. Soh. Thank you to J. Scarlett for his comments regarding the proofs.

A PROOFS

A.1 LEMMA 3.1

Proof. Gradient flows in the Wasserstein space are of the form of the continuity equation (see Ambrosio et al. (2008), page 281), i.e.,

$$\partial_t \rho_t + \nabla_x \cdot (\rho_t v_t) = 0. \tag{15}$$

The velocity field $v_t$ in Eq. (15) is given by

$$v_t = -\nabla_x \frac{\delta \mathcal{F}}{\delta \rho}(\rho_t), \tag{16}$$

where $\frac{\delta \mathcal{F}}{\delta \rho}(\rho)$ denotes the first variation of the functional $\mathcal{F}$. The first variation is defined as

$$\frac{\mathrm{d}}{\mathrm{d}\epsilon}\, \mathcal{F}(\rho + \epsilon \chi)\Big|_{\epsilon = 0} = \int \frac{\delta \mathcal{F}}{\delta \rho}(\rho)\, \mathrm{d}\chi,$$

where $\chi = \nu \rho$ for some $\nu \in \mathcal{P}_2(\Omega)$.

Let us derive an expression for the first variation of $\mathcal{F}$. In the following, we drop the dependence on $x$ for clarity. For the entropy-regularized $f$-divergence functional

$$\mathcal{F}(\rho) = \int f\!\left(\frac{\rho}{\mu}\right) \mu \,\mathrm{d}x + \gamma \int \rho \log \rho \,\mathrm{d}x,$$

we have

$$\frac{\mathrm{d}}{\mathrm{d}\epsilon}\, \mathcal{F}(\rho + \epsilon \chi)\Big|_{\epsilon = 0} = \int \left[ f'\!\left(\frac{\rho}{\mu}\right) + \gamma (\log \rho + 1) \right] \chi \,\mathrm{d}x, \quad \text{so that} \quad \frac{\delta \mathcal{F}}{\delta \rho}(\rho) = f'\!\left(\frac{\rho}{\mu}\right) + \gamma (\log \rho + 1).$$

Substituting $\frac{\delta \mathcal{F}}{\delta \rho}(\rho)$ in Eq. (16) we get

$$v_t = -\nabla_x f'\!\left(\frac{\rho_t}{\mu}\right) - \gamma \nabla_x \log \rho_t.$$

Substituting $v_t$ in Eq. (15) we get the gradient flow

$$\partial_t \rho_t = \nabla_x \cdot \left( \rho_t \nabla_x f'\!\left(\frac{\rho_t}{\mu}\right) \right) + \gamma \Delta_{xx} \rho_t,$$

where $\Delta_{xx}$ and $\nabla_x \cdot$ denote the Laplace and the divergence operators respectively.

A.2 LEMMA 3.2

Proof. Let $f$ be an integrable function on $\mathcal{X}$. If $J_g$ has full column rank and $g$ is an injective function, then we have the following change-of-variables equation (Ben-Israel, 1999; Gemici et al., 2016):

$$\int_{g(\mathcal{Z})} f(x)\,\mathrm{d}x = \int_{\mathcal{Z}} f(g(z)) \sqrt{\det\!\left[ J_g(z)^\top J_g(z) \right]} \,\mathrm{d}z.$$

This implies that the infinitesimal volumes $\mathrm{d}x$ and $\mathrm{d}z$ are related as $\mathrm{d}x = \sqrt{\det[J_g(z)^\top J_g(z)]}\,\mathrm{d}z$, and the densities $p_Z(z)$ and $q_X(x)$ are related as $p_Z(z) = q_X(g(z)) \sqrt{\det[J_g(z)^\top J_g(z)]}$. Similarly, $p_{\hat{Z}}(\hat{z}) = \hat{q}_X(g(\hat{z})) \sqrt{\det[J_g(\hat{z})^\top J_g(\hat{z})]}$. Finally, the density ratio $p_{\hat{Z}}(u)/p_Z(u)$ at a point $u \in \mathcal{Z}$ is given by

$$\frac{p_{\hat{Z}}(u)}{p_Z(u)} = \frac{\hat{q}_X(g(u)) \sqrt{\det[J_g(u)^\top J_g(u)]}}{q_X(g(u)) \sqrt{\det[J_g(u)^\top J_g(u)]}} = \frac{\hat{q}_X(g(u))}{q_X(g(u))}.$$

B A DISCUSSION ON DGflow FOR WGAN

We apply DGflow to WGAN models by treating the output of their critics as the logit for the estimation of the density ratio. However, it is well known that WGAN critics are not density-ratio estimators, as they are trained to maximize the 1-Wasserstein distance with an unconstrained output. In this section, we provide theoretical justification for the good performance of DGflow on WGAN models. We show that DGflow is related to the gradient flow of the entropy-regularized 1-Wasserstein functional $\mathcal{F}^W_\mu : \mathcal{P}_2(\Omega) \to \mathbb{R}$,

$$\mathcal{F}^W_\mu(\rho) = \sup_{\|d\|_{\mathrm{Lip}} \le 1} \left[ \mathbb{E}_{x \sim \mu}[d(x)] - \mathbb{E}_{x \sim \rho}[d(x)] \right] + \gamma \int \rho \log \rho \,\mathrm{d}x, \tag{27}$$

where $\mu$ denotes the target density and $\|d\|_{\mathrm{Lip}}$ denotes the Lipschitz constant of the function $d$. Let $d^*$ be the function that achieves the supremum in Eq. (27). This results in the functional

$$\mathcal{F}^W_\mu(\rho) = \int d^*(x) \mu(x)\,\mathrm{d}x - \int d^*(x) \rho(x)\,\mathrm{d}x + \gamma \int \rho \log \rho \,\mathrm{d}x.$$

Following a similar derivation as in Appendix A.1, the gradient flow of $\mathcal{F}^W_\mu(\rho)$ is given by the following PDE:

$$\partial_t \rho_t = -\nabla_x \cdot (\rho_t \nabla_x d^*) + \gamma \Delta_{xx} \rho_t.$$

If $d^*$ is approximated using the critic $d_\varphi$ of WGAN, we get the following gradient flow:

$$\partial_t \rho_t = -\nabla_x \cdot (\rho_t \nabla_x d_\varphi) + \gamma \Delta_{xx} \rho_t, \tag{30}$$

which is the same as the gradient flow of the entropy-regularized $f$-divergence with $f = r \log r$ (i.e., the KL divergence) when $d_\varphi$ is treated as a density-ratio estimator. Indeed, for $f = r \log r$ we have $f'(r) = \log r + 1$, and treating $d_\varphi$ as an estimator of the logit, i.e., $\log(\rho_t/\mu) = -d_\varphi$, the gradient flow simplifies to

$$\partial_t \rho_t = \nabla_x \cdot \left( \rho_t \nabla_x \log \frac{\rho_t}{\mu} \right) + \gamma \Delta_{xx} \rho_t = -\nabla_x \cdot (\rho_t \nabla_x d_\varphi) + \gamma \Delta_{xx} \rho_t. \tag{33}$$

The equality of Eq. (30) and Eq. (33) implies that DGflow approximates the gradient flow of the 1-Wasserstein distance when the critic of WGAN is used for density-ratio estimation.
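Concretely, the flow above is simulated by applying the Euler-Maruyama scheme to the McKean-Vlasov process associated with the Fokker-Planck equation (a restatement, for the KL/WGAN case, of the update used in the experiments of Appendix D, with step size $\eta$ and noise regularizer $\gamma$):

```latex
% McKean--Vlasov process for the KL gradient flow, with the WGAN critic
% $d_\varphi$ used as the log density-ratio estimator:
\mathrm{d}x_t = \nabla_x d_\varphi(x_t)\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}w_t
% Euler--Maruyama discretization with step size $\eta$ and
% $\xi_k \sim \mathcal{N}(0, I)$:
x_{k+1} = x_k + \eta\,\nabla_x d_\varphi(x_k) + \sqrt{2\gamma\eta}\,\xi_k
```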

C FURTHER DISCUSSION ON RELATED WORK

Energy-based & Score-based Generative Models DGflow is related to recently proposed energy-based generative models (Arbel et al., 2021; Deng et al., 2020); one can view the energy functions used in these methods as a component that improves some base model. For example, the Generalized Energy-Based Model (GEBM) (Arbel et al., 2021) jointly trains an implicit generative model with an energy function and uses Langevin dynamics to sample from the combination

