MONOFLOW: A UNIFIED GENERATIVE MODELING FRAMEWORK FOR DIVERGENCE GANS

Abstract

Generative adversarial networks (GANs) play a minmax two-player game via adversarial training. The conventional understanding of adversarial training is that the discriminator is trained to estimate a divergence and the generator learns to minimize this divergence. We argue that, despite the fact that many variants of GANs are developed following this paradigm, the existing theoretical understanding of GANs and the practical algorithms are inconsistent. In order to gain deeper theoretical insights and algorithmic inspiration for these GAN variants, we leverage Wasserstein gradient flows, which characterize the evolution of particles in the sample space. Based on this, we introduce a unified generative modeling framework, MonoFlow: the particle evolution is rescaled via an arbitrary monotonically increasing mapping. Under our framework, adversarial training can be viewed as a procedure that first obtains MonoFlow's vector field via the discriminator, after which the generator learns to parameterize the flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us identify what types of generator loss functions can lead to the successful training of GANs and suggests that GANs may have more loss designs beyond those developed in the literature (e.g., the non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are also included to validate the effectiveness of our framework.

1. INTRODUCTION

Generative adversarial nets (GANs) (Goodfellow et al., 2014; Jabbar et al., 2021) are a powerful generative modeling framework that has gained tremendous attention in recent years. GANs have achieved significant successes in applications, especially in high-dimensional image processing such as high-fidelity image generation (Brock et al., 2018; Karras et al., 2019), super-resolution (Ledig et al., 2017) and domain adaptation (Zhang et al., 2017). In the GAN framework, a discriminator d and a generator g play a minmax game. The discriminator is trained to distinguish real and fake samples and the generator is trained to generate fake samples to fool the discriminator. The equilibrium of the vanilla GAN is defined by

min_g max_d V(g, d) = E_{x∼p_data}[log σ(d(x))] + E_{z∼p_z}[log(1 − σ(d(g(z))))].   (1)

The elementary optimization approach to solve the minmax game is adversarial training. Previous perspectives explained it as first estimating the Jensen-Shannon divergence, after which the generator learns to minimize this divergence. Several variants of GANs have been developed based on this point of view for other probability divergences, e.g., the χ² divergence (Mao et al., 2017), the Kullback-Leibler (KL) divergence (Arbel et al., 2021) and general f-divergences (Nowozin et al., 2016; Uehara et al., 2016), while others are developed with Integral Probability Metrics (Arjovsky et al., 2017; Dziugaite et al., 2015; Mroueh et al., 2018b). However, we emphasize that the traditional understanding of GANs is incomplete, and here we present three non-negligible facts commonly associated with adversarial training that make it different from a standard variational divergence minimization (VDM) problem (Blei et al., 2017): 1. The estimated divergence is computed from the discriminator d(x). It is a function depending only on samples x and cannot capture the variability of the generator's distribution p_g. However, the optimal discriminator in Proposition 1 by Goodfellow et al.
(2014) requires p_g to be a functional variable as well. This issue was also raised in (Metz et al., 2017; Franceschi et al., 2022). 2. The generator typically minimizes a divergence with a missing term; e.g., the vanilla GAN only minimizes the second term of the Jensen-Shannon divergence, −E_{z∼p_z}[−log(1 − σ(d(g(z))))], where −log(1 − σ(d(g(z)))) is a monotonically increasing function of d(g(z)). 3. Practical algorithms are inconsistent with the theory: a heuristic trick, the "non-saturated loss", is commonly adopted to mitigate the gradient vanishing problem, but it still lacks a rigorous mathematical understanding. For example, the generator can minimize −E_{z∼p_z}[log σ(d(g(z)))], where log σ(d(g(z))) is also a monotonically increasing function of d(g(z)). It is known that the logit output d(x) of the binary classification problem in Eq. (1) is an estimator of the logarithm of the density ratio between two distributions (Sugiyama et al., 2012). To gain a deeper understanding of adversarial training of GANs, we study the Wasserstein gradient flow of the KL divergence, which characterizes a deterministic evolution of particles described by an ordinary differential equation (ODE). This ODE is a Euclidean gradient flow of a time-dependent log density ratio. Based on this ODE, we propose the MonoFlow framework: transforming the log density ratio by a monotonically increasing mapping such that the vector field of the gradient flow is rescaled along the same direction. Consequently, approximating and learning to parameterize MonoFlow is synonymous with adversarial training. Under our framework, we gain a comprehensive understanding of the training dynamics of GANs: the discriminator obtains a bijection of the log density ratio that suggests the vector field, and the generator learns to parameterize the particles of MonoFlow. All variants of divergence GANs are a subclass of our framework. Finally, we reveal that the discriminator and generator losses do not need to follow the same objective.
The discriminator maximizes an objective to obtain a bijection of the log density ratio. Then the generator loss can be any monotonically increasing mapping of this log density ratio. Our contributions are summarized as follows: • A novel generative modeling framework that unifies divergence GANs, providing a new, theoretically and practically consistent understanding of the underlying mechanism of the training dynamics of GANs. • We reveal the fundamental difference between VDM and adversarial training, which indicates that previous analyses of GANs from the perspective of VDM might not provide benefits; instead, we should treat GANs as a particle flow method. • An analysis of what types of generator loss functions can lead to the successful training of GANs. Our framework explains why and how the non-saturated loss works. • An algorithmic inspiration: GANs may admit more variants of loss designs than those already known.

2. WASSERSTEIN GRADIENT FLOWS

In this section, we review the definition of gradient flows in Wasserstein space (P(R^n), W_2), the space of Borel probability measures P(R^n) defined on R^n with finite second moments and equipped with the Wasserstein-2 metric. An absolutely continuous curve of probability measures {q_t}_{t≥0} ⊂ P(R^n) is a Wasserstein gradient flow of a functional F : P(R^n) → R if it satisfies the following continuity equation (Ambrosio et al., 2008),

∂q_t/∂t = div(q_t ∇_{W_2} F(q_t)),   (2)

where ∇_{W_2} F(q_t) = ∇_x δF(q_t)/δq_t is called the Wasserstein gradient of the functional, i.e., the Euclidean gradient of its first variation δF(q_t)/δq_t. Specifically, for the functional F(q_t) = ∫ log(q_t/p) dq_t, the KL divergence where p is a fixed target probability measure, we have δF(q_t)/δq_t = log(q_t/p) + 1. Hence, the Wasserstein gradient flow of the KL divergence reads the Fokker-Planck equation,

∂q_t/∂t = div(q_t (∇_x log q_t − ∇_x log p)).   (3)

As t → ∞, the stationary probability measure of q_t is the target p.

Figure 1: The illustration of a Wasserstein gradient flow and its particle evolution. In Wasserstein space, the blue curve is a gradient flow and the red dotted line is a geodesic. q_t evolves along a curve whose tangent vector is given by −∇_{W_2} F(q_t) such that the functional is always decreasing with time. Correspondingly, particles x_t ∼ q_t evolve in Euclidean space towards the target measure p with the vector field v_t = −∇_x (δF/δq_t)(x)|_{x=x_t}. Note that directly minimizing the Wasserstein-2 metric W_2(q_t, p) instead yields a path {q_t}_{t≥0} along the geodesic connecting q_0 and p in Wasserstein space.
Denoting the Euclidean path of random variables as {x_t}_{t≥0} ⊂ R^n with the initial condition x_0 ∼ q_0, we can define an ordinary differential equation (ODE) which describes the evolution of particles in R^n,

dx_t = (∇_x log p(x_t) − ∇_x log q_t(x_t)) dt := v_t(x_t) dt,   x_0 ∼ q_0,   (4)

where the vector field v_t of these particles is the negative Euclidean gradient of the functional's first variation. As shown in Figure 1, Wasserstein gradient flows establish a connection between the probability evolution in Wasserstein space and its associated particle evolution in Euclidean space. Applying the Itô integral to Langevin dynamics dx_t = ∇_x log p(x_t) dt + √2 dw_t, where w_t is a Wiener process, we obtain the same Fokker-Planck equation as in Eq. (3). This indicates that the deterministic particle evolution given by the ODE can be approximated via a stochastic differential equation (SDE): Langevin dynamics admits the same marginal probability measure q_t as Eq. (4). This relation between an SDE and its corresponding ODE was also studied in score-based diffusion models (Song et al., 2021). Langevin dynamics was first interpreted as the Wasserstein gradient flow of the KL divergence by Jordan et al. (1998); Otto (2001). It plays an important role in generative modeling as a sampling scheme. In order to transform noise into the target data distribution by Langevin dynamics, an essential step is to fit the data distribution using energy-based models (Song & Kingma, 2021) or to directly estimate its scores with score-matching techniques (Hyvärinen & Dayan, 2005; Vincent, 2011; Song & Ermon, 2019).
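As a quick illustration of the sampling view above, the following minimal sketch (plain Python; the 1D Gaussian target and all step sizes are illustrative choices, not from the paper) simulates Langevin dynamics with the Euler-Maruyama scheme and checks that the particle population approaches p:

```python
import math
import random

random.seed(0)

# Target p = N(2, 1), so the score is grad_x log p(x) = -(x - 2)
def score_p(x):
    return -(x - 2.0)

# Euler-Maruyama simulation of Langevin dynamics
# dx_t = grad_x log p(x_t) dt + sqrt(2) dw_t,
# whose marginal q_t solves the Fokker-Planck equation (3)
def langevin(n_particles=2000, n_steps=1000, dt=1e-2):
    xs = [random.gauss(-5.0, 1.0) for _ in range(n_particles)]  # x_0 ~ q_0
    for _ in range(n_steps):
        xs = [x + score_p(x) * dt + math.sqrt(2.0 * dt) * random.gauss(0.0, 1.0)
              for x in xs]
    return xs

xs = langevin()
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
# mean and var should be close to 2 and 1 respectively
```

Even though the particles start far in the tail (q_0 = N(−5, 1)), the stationary distribution of the chain is the target p, matching the equilibrium statement below Eq. (3).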

3. MONOFLOW: A UNIFIED GENERATIVE MODELING FRAMEWORK

This section presents our main contribution that connects gradient flows and divergence GANs in a unified framework. We first introduce MonoFlow, where the ODE evolution is rescaled via a monotonically increasing function. Consequently, learning to parameterize the rescaled flow by a neural network recovers the bi-level optimization dynamics of training GANs. This gives us a novel understanding of the hidden mechanism of adversarial training.

3.1. MONOFLOW

We consider the ODE in Eq. (4) with a fixed target measure p, e.g., a data distribution in a generative modeling scenario. Assume that we have a time-dependent log density ratio function log r_t(x) = log(p(x)/q_t(x)); the ODE can be rewritten as

dx_t = ∇_x log r_t(x_t) dt,   x_0 ∼ q_0.   (5)

This is a gradient flow in Euclidean space whose vector field is the gradient of the log density ratio. With a (strictly) monotonically increasing mapping h : R → R, where h is first-order differentiable, we can define another ODE:

dx_t = ∇_x h(log r_t(x_t)) dt = h′(log r_t(x_t)) ∇_x log r_t(x_t) dt,   x_0 ∼ q_0.   (6)

By transforming the time-dependent log density ratio under the mapping h, its first-order derivative rescales the vector field of the original gradient flow defined in Eq. (5). We call Eq. (6) MonoFlow. MonoFlow defines a different family of vector fields {v_t}_{t≥0} for the particle evolution, where v_t(x_t) = h′(log r_t(x_t)) ∇_x log r_t(x_t). Conversely, the vector fields {v_t}_{t≥0} also determine an absolutely continuous curve {q_t}_{t≥0} in Wasserstein space by the continuity equation, see Theorem 4.6 in (Ambrosio et al., 2008),

∂q_t/∂t = −div(q_t v_t),   (7)

under mild regularity conditions. Hence the probability evolution of MonoFlow is described by

∂q_t/∂t = div(D_t ∇_x q_t) − div(ζ_t^{−1} q_t ∇_x log p),   where D_t = ζ_t^{−1} = h′(log r_t).   (8)

Eq. (8) is a special case of convection-diffusion equations, where D_t is called the diffusion coefficient and ζ_t^{−1} is called the mobility. MonoFlow defines a positive diffusion coefficient. This has a physical interpretation: particles diffuse to spread probability mass over the target measure rather than concentrate. Next, we study the properties of MonoFlow.

Theorem 3.1. If h is strictly increasing, i.e., h′(·) > 0, the dissipation rate ∂F(q_t)/∂t of the KL divergence F(q_t) = ∫ log(q_t/p) dq_t satisfies

∂F(q_t)/∂t ≤ 0,   (9)

where the equality is achieved if and only if q_t = p, and the marginal probability q_t of MonoFlow evolves to p as t → ∞.

Theorem 3.1 shows that MonoFlow does not disturb the equilibrium of Eq. (3), and convergence to the target probability measure p is guaranteed. The negative dissipation rate ensures that the gradient flow curve {q_t}_{t≥0} of MonoFlow always decreases the KL divergence with time. MonoFlow is obtained by transforming the log density ratio arising from the Wasserstein gradient flow of the KL divergence. We can also obtain other forms of deterministic particle evolution by considering Wasserstein gradient flows of general f-divergences,

D_f(p||q) = ∫ f(p/q) dq,   (10)

where f : R_+ → R is a convex function and f(1) = 0.

Theorem 3.2. The Wasserstein gradient flow of an f-divergence characterizes the evolution of particles in R^n by

dx_t = r_t(x_t)^2 f″(r_t(x_t)) ∇_x log r_t(x_t) dt,   x_0 ∼ q_0.   (11)

A similar result can also be derived for the reversed f-divergences D_f(q||p) used by (Gao et al., 2019; Ansari et al., 2021). Theorem 3.2 shows that the particle evolution of the Wasserstein gradient flow of f-divergences is a special instance of MonoFlow, where h′(log r) = r^2 f″(r) > 0 because f is a convex function. It indicates that once a curve {q_t}_{t≥0} evolves with time t in Wasserstein space to decrease an f-divergence, it simultaneously decreases the KL divergence as well, since the dissipation rate of MonoFlow is negative.
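The rescaling property can be checked numerically. The sketch below is an illustrative 1D setup, not the paper's algorithm: to keep log r_t in closed form, q_t is moment-matched by a Gaussian fitted to the particles (the paper instead estimates the ratio with a discriminator). It runs the discretized MonoFlow for two strictly increasing choices of h and tracks the KL divergence, which should decrease monotonically as Theorem 3.1 predicts:

```python
import math
import random

random.seed(1)

# 1D MonoFlow sketch with target p = N(0, 1). Illustrative assumption: q_t is
# approximated by a Gaussian N(m, v) fitted to the particles so that
# log r_t = log p - log q_t has a closed form.
def kl_gauss(m, v):
    # KL(N(m, v) || N(0, 1)) in closed form
    return 0.5 * (v + m * m - 1.0 - math.log(v))

def run_monoflow(h_prime, n=1000, steps=300, dt=0.05):
    xs = [random.gauss(4.0, 0.5) for _ in range(n)]  # x_0 ~ q_0
    kls = []
    for _ in range(steps):
        m = sum(xs) / n
        v = sum((x - m) ** 2 for x in xs) / n
        kls.append(kl_gauss(m, v))
        new_xs = []
        for x in xs:
            # log r_t(x) and its gradient under the Gaussian estimate of q_t
            log_r = -0.5 * x * x + (x - m) ** 2 / (2.0 * v) + 0.5 * math.log(v)
            grad_log_r = -x + (x - m) / v
            # Eq. (6): the vector field is rescaled by h'(log r_t(x)) > 0
            new_xs.append(x + dt * h_prime(log_r) * grad_log_r)
        xs = new_xs
    return xs, kls

# h(u) = u and h(u) = 2u: both strictly increasing, so both should converge
results = {}
for name, h_prime in (("h(u)=u", lambda u: 1.0), ("h(u)=2u", lambda u: 2.0)):
    xs, kls = run_monoflow(h_prime)
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    results[name] = (m, v, kls)
```

For both choices of h the particle mean and variance approach those of the target and the recorded KL values are non-increasing, illustrating that rescaling the vector field by any h′ > 0 preserves the equilibrium q_t = p.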

3.2. PRACTICAL APPROXIMATIONS OF DENSITY RATIOS

We first discretize the ODE in Eq. (6) by the forward Euler method such that we obtain standard gradient ascent iterations with step size α and discretized time index k:

x_{k+1} = x_k + α ∇_x h(log r_k(x_k)),   t_{k+1} = t_k + α.   (12)

Therefore, we can sample initial particles x_0 ∼ q_0 and perform gradient ascent iterations by estimating the density ratio r_k(x) = p(x)/q_k(x) using samples from q_k and p. In order to enable a practical algorithm to obtain the time-dependent density ratio, we introduce a general framework that solves the following optimization problem, similar to Moustakides & Basioti (2019),

max_{d∈H} E_{x∼p}[ϕ(d(x))] + E_{x∼q_k}[ψ(d(x))],   (13)

where d : R^n → R is a discriminator, H is a class of measurable functions, and ϕ and ψ are scalar functions whose design is specified later.

Lemma 3.3. Solving Eq. (13), the optimal d* satisfies

d*(x) = T^{−1}(r(x)),   r(x) = p(x)/q_k(x),   with T(d(x)) := −ψ′(d(x))/ϕ′(d(x)).   (14)

Note that the resulting mapping T must be a bijection such that its inverse exists. Additional conditions on the hypotheses of ϕ and ψ can be found in Appendix A.3.

Remark: Note that two-sample density ratio estimation discards the density information of q_k. The functions d(x), r(x) only depend on x and cannot capture the variability of q_k.

To this end, we can train d to solve the optimization problem in Eq. (13), and the density ratio is approximated by T(d(x)). For example, in a standard binary classification problem we can design ϕ(d) = log σ(d) and ψ(d) = log(1 − σ(d)), for which d*(x) = log r(x) (Sugiyama et al., 2012).

Table 1: Objectives and generator losses of GAN variants, where f*(d) = sup_{r∈dom f}{rd − f(r)}.

                       ϕ(d)          ψ(d)              d*(x)            h_T(d)
Vanilla GAN            log σ(d)      log(1 − σ(d))     log r(x)         −log(1 − σ(d))
Non-saturated GAN      log σ(d)      log(1 − σ(d))     log r(x)         log σ(d)
f-GAN                  d             −f*(d)            f′(r(x))         d
b-GAN                  f′(d)         f(d) − d f′(d)    r(x)             d f′(d) − f(d)
Least-square GAN       −(d − 1)^2    −d^2              r(x)/(1 + r(x))  −(d − 1)^2
Generalized EBM (KL)   −(d + λ)      −exp(−d − λ)      −log r(x) − λ    exp(−d − λ)
Other types of density ratio estimation can be found in Table 1, as they have already been used in GAN variants, where q_k refers to the generator's distribution p_g. In practice, since the change of x_k is sufficiently small at every step k, we can use a single discriminator d(x) and perform a few gradient updates to solve Eq. (13) per iteration k to approximate the time-dependent density ratio r_k(x), which is identical to GAN training.
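For concreteness, here is a minimal sketch of the classification-based ratio estimator (the vanilla GAN row of Table 1). The setup is an illustrative assumption: with p = N(1, 1) and q = N(0, 1), the true log ratio is linear, log r(x) = x − 0.5, so a linear logit d(x) = a·x + b can represent the optimum d*(x) = log r(x) exactly; the optimizer settings are arbitrary:

```python
import math
import random

random.seed(3)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Binary-classification density ratio estimation with phi(d) = log sigma(d)
# and psi(d) = log(1 - sigma(d)). Illustrative data: p = N(1, 1), q = N(0, 1),
# whose true log ratio is log r(x) = x - 0.5.
xp = [random.gauss(1.0, 1.0) for _ in range(2000)]  # samples from p
xq = [random.gauss(0.0, 1.0) for _ in range(2000)]  # samples from q

a, b = 0.0, 0.0  # linear logit d(x) = a*x + b
lr = 0.5
for _ in range(600):
    ga = gb = 0.0
    for x in xp:  # gradient of the E_p[log sigma(d(x))] term
        w = 1.0 - sigmoid(a * x + b)
        ga += w * x
        gb += w
    for x in xq:  # gradient of the E_q[log(1 - sigma(d(x)))] term
        w = -sigmoid(a * x + b)
        ga += w * x
        gb += w
    a += lr * ga / (len(xp) + len(xq))
    b += lr * gb / (len(xp) + len(xq))
# a should approach 1 and b should approach -0.5 up to sampling error
```

The learned logit approximates log r(x), which is exactly the quantity the MonoFlow vector field needs at each iteration.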

3.3. PARAMETERIZATION OF THE DISCRETIZED MONOFLOW

The previous method directly pushes particles in Euclidean space towards the target distribution. We can instead use a neural network generator to mimic the distribution of these particles, i.e., the generator learns to draw samples. If we parameterize particles with a neural network generator g_θ that takes as input random noise z ∼ p_z and outputs particles x_{(θ,z)} = g_θ(z), the infinitesimal change of the generator's parameters is

dθ_t/dt = ∫ (∂g_{θ_t}(z)/∂θ_t) ∇_x h(log r_t(x_t)) p_z(z) dz,   where x_t = g_{θ_t}(z),   (15)

and ∂g_{θ_t}(z)/∂θ_t is the Jacobian of the neural network generator. Applying the forward Euler method with step size β, we have

θ_{k+1} = θ_k + β ∇_θ E_{z∼p_z}[h(log r_k(g_{θ_k}(z)))].   (16)

Eq. (16) can be regarded as amortizing particles of the gradient flow into a neural network generator by approximately solving θ_{k+1} = argmin_θ E_{z∼p_z}[||g_θ(z) − x_{k+1}||^2], where x_{k+1} = g_{θ_k}(z) + α ∇_x h(log r_k(g_{θ_k}(z))), with a one-step gradient descent (Wang & Liu, 2017). Consequently, by the chain rule we have dx_t = (∂g_{θ_t}(z)/∂θ_t) dθ_t; replacing dθ_t with Eq. (15),

dx_t = E_{z′∼p_z}[K_g^t(z, z′) ∇_x h(log r_t(x_t))] dt,   (17)

where K_g^t(z, z′) = ⟨∂g_{θ_t}(z)/∂θ_t, ∂g_{θ_t}(z′)/∂θ_t⟩ is the neural tangent kernel (NTK) (Jacot et al., 2018) defined by the generator. Note that Eq. (17) realizes Stein Variational Gradient Descent (Liu & Wang, 2016; Franceschi et al., 2022) if h is an identity mapping.
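The amortization step can be sanity-checked in closed form. In the sketch below, the linear generator g_θ(z) = μ + s·z and the placeholder field V (standing in for ∇_x h(log r_k)) are both illustrative assumptions. One gradient-descent step on the squared amortization loss with learning rate β/α reproduces the direct update of Eq. (16):

```python
import math
import random

random.seed(4)

# Linear generator g_theta(z) = mu + s*z, theta = (mu, s), Jacobian (1, z).
# V is an arbitrary smooth stand-in for the vector field grad_x h(log r_k).
mu, s = 0.5, 1.5
alpha, beta = 0.1, 0.05
zs = [random.gauss(0.0, 1.0) for _ in range(1000)]

def V(x):
    return math.tanh(2.0 - x)

n = len(zs)
gs = [mu + s * z for z in zs]     # current particles g_theta(z)
vs = [V(x) for x in gs]           # vector field at the particles

# Direct update (the field term of Eq. (16)): J(z)^T V(g(z)) with J = (1, z)
d_mu_direct = beta * sum(vs) / n
d_s_direct = beta * sum(z * v for z, v in zip(zs, vs)) / n

# One-step amortization: gradient of 0.5 * mean (g_theta(z) - x_plus)^2
# with fixed targets x_plus = g(z) + alpha * V(g(z)), learning rate beta/alpha
x_plus = [x + alpha * v for x, v in zip(gs, vs)]
g_mu = sum(g - t for g, t in zip(gs, x_plus)) / n
g_s = sum(z * (g - t) for z, g, t in zip(zs, gs, x_plus)) / n
d_mu_amortized = -(beta / alpha) * g_mu
d_s_amortized = -(beta / alpha) * g_s
# the two parameter updates coincide (up to floating-point error)
```

This is exactly the identity used above: since g_θ(z) − x_{k+1} = −α ∇_x h(log r_k) at θ = θ_k, the squared-loss gradient is the particle update projected through the generator's Jacobian.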

3.4. A UNIFIED FORMULATION OF DIVERGENCE GENERATIVE ADVERSARIAL NETS

Based on the above derivation, we propose a general formulation for divergence GANs. We clarify that GANs can be trained with different objective functions for the discriminator and the generator. All of these variants are algorithmic instantiations of the parameterized MonoFlow. The unified framework is summarized as follows: given a discriminator d and a generator g, the discriminator d learns to maximize

E_{x∼p_data}[ϕ(d(x))] + E_{z∼p_z}[ψ(d(g(z)))],   (18)

where p_data refers to the data distribution. Next, we train the generator g to minimize

−E_{z∼p_z}[h_T(d(g(z)))],   (19)

where h_T(d) = h(log T(d)) and h can be any strictly increasing function. We summarize some typical GAN variants in Table 1. We view adversarial training as maximizing Eq. (18) to obtain the density ratio, which suggests the vector field for MonoFlow, and minimizing Eq. (19) as learning to parameterize MonoFlow, corresponding to Eq. (16). In particular, we find that b-GAN (Uehara et al., 2016) realizes parameterized Wasserstein gradient flows of f-divergences, since its generator loss is aligned with Theorem 3.2 for the design of h.
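The mapping T(d) = −ψ′(d)/ϕ′(d) of Lemma 3.3 can be verified numerically against the rows of Table 1. The sketch below checks the vanilla GAN and least-square GAN rows with central finite differences (the evaluation points are arbitrary illustrative choices):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def num_deriv(f, d, eps=1e-6):
    # central finite difference
    return (f(d + eps) - f(d - eps)) / (2.0 * eps)

def T(phi, psi, d):
    # Lemma 3.3: T(d) := -psi'(d) / phi'(d)
    return -num_deriv(psi, d) / num_deriv(phi, d)

# Vanilla GAN row: phi = log sigma(d), psi = log(1 - sigma(d)).
# Then T(d) = sigma(d)/(1 - sigma(d)) = exp(d), so d*(x) = log r(x).
vanilla_T = [T(lambda u: math.log(sigmoid(u)),
               lambda u: math.log(1.0 - sigmoid(u)), d)
             for d in (-1.0, 0.3, 2.0)]

# Least-square GAN row: phi = -(d - 1)^2, psi = -d^2.
# Then T(d) = d/(1 - d), so d*(x) = r(x)/(1 + r(x)).
lsgan_T = [T(lambda u: -(u - 1.0) ** 2, lambda u: -u ** 2, d)
           for d in (0.2, 0.5, 0.8)]
```

Inverting T then reads off the d*(x) column of Table 1, which is exactly how the generator loss h_T(d) = h(log T(d)) recovers a monotone function of the log density ratio.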

4. UNDERSTANDING ADVERSARIAL TRAINING VIA MONOFLOW

The dominant understanding of adversarial training in GANs is that the generator learns to minimize the divergence estimated by the discriminator. However, as pointed out in Section 1, the theoretical explanation of GANs and their practical algorithms are inconsistent. In this section, through the lens of MonoFlow, we explain why this inconsistency of adversarial training can still lead to convergence to the target distribution and how it differs from a variational divergence minimization (VDM) problem for generative modeling.

4.1. WHY DOES THE ADVERSARIAL GAME WORK?

In an adversarial game, the discriminator is trained to maximize a lower bound of an f-divergence. This lower bound can be derived via the dual representation of f-divergences (Nguyen et al., 2010) between p_data and p_g,

D_f(p_data||p_g) = max_{d∈H} E_{x∼p_data}[d(x)] − E_{x∼p_g}[f*(d(x))],   r(x) = p_data(x)/p_g(x),   (20)

where f*(d) = sup_{r∈dom f}{rd − f(r)} is the convex conjugate of f(r) and H is a class of measurable functions. Note that for binary classification problems where we design specific ϕ and ψ, the corresponding optimization problem in Eq. (13) can be translated into an equivalent formulation of the above dual representation (Nowozin et al., 2016). Since the first term of the lower bound in Eq. (20) is irrelevant to p_g, the generator actually only learns to minimize the second term,

min_g −E_{x∼p_g}[f*(d(x))].   (21)

Meanwhile, the generator can also alternatively minimize the heuristic non-saturated loss −E_{x∼p_g}[d(x)], which has been proven to work well in practice. By Fenchel duality, the optimal d* is given by d* = f′(r), with the equality f*(d*) = r f′(r) − f(r). Fortunately, it can be simply verified that f′(r) and r f′(r) − f(r) are both monotonically increasing functions of the density ratio (as well as of the log density ratio). Hence, adversarial training with the vanilla loss and the non-saturated loss both fall into the framework of MonoFlow, which has theoretical guarantees.

Table 2: Comparison of convergence when using different f and h on three density ratio models: "✓" means the generator converges to the data distribution and "✗" means it does not. For r(x, θ) and r(x, θ_de), convergence means the generator parameter θ converges to the true value. For r_GAN(x), convergence means the parameter closely approximates the true value.
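The monotonicity claim above is easy to check numerically. The sketch below verifies on a grid that f′(r) and r f′(r) − f(r) = f*(f′(r)) are both increasing in r for the KL and Jensen-Shannon generator functions f (the grid bounds and tolerances are illustrative choices):

```python
import math

# Generator functions f for two f-divergences (convex, f(1) = 0)
fs = {
    "KL": lambda r: r * math.log(r),
    "JS": lambda r: r * math.log(2.0 * r / (1.0 + r)) + math.log(2.0 / (1.0 + r)),
}

def increasing(seq):
    return all(b >= a - 1e-9 for a, b in zip(seq, seq[1:]))

def check_monotone(f, lo=0.05, hi=20.0, n=400):
    rs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    eps = 1e-6
    # f'(r) by central differences, and the conjugate value r f'(r) - f(r)
    fprime = [(f(r + eps) - f(r - eps)) / (2.0 * eps) for r in rs]
    conj = [r * fp - f(r) for r, fp in zip(rs, fprime)]
    return increasing(fprime) and increasing(conj)

ok = {name: check_monotone(f) for name, f in fs.items()}
```

Since both the vanilla generator term f*(d*) = r f′(r) − f(r) and the non-saturated term d* = f′(r) are increasing in r, each defines a valid h for MonoFlow.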

Visualization results are included in Appendix B.1.

                  f convex?   h monotonically increasing?   r(x, θ)   r(x, θ_de)   r_GAN(x)
KL                Yes         Yes                           ✓         ✓            ✓
Forward KL        Yes         No                            ✓         ✗            ✗
Chi-Square        Yes         No                            ✓         ✗            ✗
Hellinger         Yes         No                            ✓         ✗            ✗
Jensen-Shannon    Yes         No                            ✓         ✗            ✗
Exp               No          Yes                           ✗         ✓            ✓

4.2. DIFFERENCE BETWEEN ADVERSARIAL TRAINING AND VARIATIONAL DIVERGENCE MINIMIZATION

VDM differs from adversarial training because it requires density information from the generator, as elaborated in the following. In variational divergence minimization, the generator g_θ usually defines a distribution via an explicit density function p_g(x; θ). For example, x can be reparameterized as a Gaussian random variable where θ are its mean and scale. We are interested in minimizing an f-divergence,

D_f(p_data||p_g) = E_{x∼p_g}[f(r(x, θ))] = Cost(θ),   (22)

where the density ratio r(x, θ) = p_data(x)/p_g(x; θ) is a function of x as well as of θ, capturing the variability of p_g. Integrating out x, the f-divergence can be written as a cost function of θ. Since f is convex, by Jensen's inequality this cost is minimized at zero when r(x, θ) is constant in x, meaning p_g = p_data. Similarly, under Fenchel duality, we can rewrite the f-divergence as

Cost(θ) = E_{x∼p_data}[d*(x, θ)] − E_{x∼p_g}[f*(d*(x, θ))],   where d*(x, θ) = f′(r(x, θ)).   (23)

Eq. (23) is different from Eq. (20) in that Cost(θ) depends on the first term of the dual representation. However, in an adversarial training scenario, the density ratio estimator r(x) or its bijection d(x) is only a function of the sample x. As also discussed by Metz et al. (2017); Franceschi et al. (2022), the smoothness between the discriminator and the variability of p_g is lost in the practical algorithm. Plugging r(x) or d(x) into the f-divergence to replace r(x, θ) or d*(x, θ), we can approximate f-divergences, but the approximated divergences can never be viewed as a cost function of θ anymore. This is the major disconnection between the theory and the practical algorithm of GANs, as Eq. (4) in (Goodfellow et al., 2014) is similar to Eq. (23).
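This disconnection can be made concrete with 1D Gaussians. The setup below is an illustrative assumption: p_data = N(0, 1), p_g = N(θ, 1), and the ratio is frozen at the current θ_0 = 1 (a function of x only, as in adversarial training). The detached reverse-KL gradient matches the true gradient, while the detached forward-KL gradient even has the wrong sign:

```python
import math
import random

random.seed(5)

# Current generator mean; detached ratio frozen at theta0:
# log r0(x) = log p_data(x) - log p_g(x; theta0) = -theta0*x + theta0**2/2
theta0 = 1.0

def pathwise_grad(dfdx, n=100000):
    # reparameterize x = theta + z and differentiate at theta = theta0:
    # d/dtheta E_z[f(r0(theta + z))] = E_z[(d/dx f(r0))(theta0 + z)]
    return sum(dfdx(theta0 + random.gauss(0.0, 1.0)) for _ in range(n)) / n

# Reverse KL, f(r) = -log r: (d/dx) f(r0(x)) = theta0 exactly, so the
# detached gradient equals the true gradient of KL(p_g||p_data) = theta**2/2
grad_rkl = pathwise_grad(lambda x: theta0)

# Forward KL, f(r) = r*log r: (d/dx) f(r0(x)) = -(log r0(x) + 1)*r0(x)*theta0
def dfdx_fkl(x):
    log_r0 = -theta0 * x + theta0 ** 2 / 2.0
    return -(log_r0 + 1.0) * math.exp(log_r0) * theta0

# The true gradient of KL(p_data||p_g) = theta**2/2 at theta0 is +1,
# but the detached estimate concentrates near -1.5: the wrong sign.
grad_fkl = pathwise_grad(dfdx_fkl)
```

This mirrors the Forward KL row of Table 2: with a detached ratio, the plug-in forward-KL "cost" is no longer a valid cost function of θ, while the (reverse) KL case happens to survive detachment.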

4.3. EMPIRICAL STUDY OF GAUSSIANS

This part highlights the differences between density ratio models in terms of the convergence of generators. We start from the simplest form of a generator p_{g_θ}: x = μ + sz, where z ∼ N(0, I) and θ = (μ, s), with μ the mean and s the scale matrix. Let the data distribution be p_data = N(μ_0, s_0^T s_0). By assuming the generator and data distributions are Gaussians, we can define three density ratio models. The first model is r(x, θ) = p_data(x)/p_g(x; θ), where the density ratio function depends on x and θ simultaneously. The second model is r(x, θ_de) = p_data(x)/p_g(x; θ_de), where θ_de means we detach the gradient of θ such that the second model cannot reflect the variability of p_g. The third model is r_GAN(x), where the density ratio is approximated by performing a single gradient update for the classifier using samples, as in standard adversarial training. We train the generator to minimize the following loss function with the above three density ratio models respectively (for r_GAN(x), we use the standard bi-level training),

min_θ E_{z∼p_z}[f(r)]   or equivalently   min_θ −E_{z∼p_z}[h(log r)].   (24)

Given f(r), we can rewrite it as a function of the log density ratio, h(log r) = −f(r). Similarly, given h(log r), we can write it as a function of the density ratio, f(r) = −h(log r). In this experiment, we consider five types of f-divergences (expressions summarized in Appendix B.1). In addition, we study a strictly increasing function h given by h(log r) = exp(1.5 log r) = r^{1.5}, whose f(r) = −r^{1.5} is concave. The results are summarized in Table 2, which verifies our analysis that the objective of VDM should be a convex function of the density ratio, whereas MonoFlow only requires it to be an increasing function of the log density ratio. r(x, θ_de) can recover the true f-divergence, but minimizing this f-divergence has no effect except for the KL divergence. Remark: Minimizing the KL divergence using the detached model r(x, θ_de) also leads to convergence.
This mechanism is called "Variational Inference via Wasserstein Gradient Flows" (Lambert et al., 2022) and it also has lower variance (Roeder et al., 2017).

5.1. ANALYZING PRACTICAL EFFECTIVENESS OF GENERATOR LOSS

In this part, we explain what types of generator loss functions can lead to the successful training of GANs through the lens of MonoFlow. We provide a study of the vanilla GAN, since the logit output of the binary classifier is the log density ratio, i.e., d(x) = log r(x) (see Table 1). We consider four generator losses which are monotonically increasing functions of the log density ratio: 1) Vanilla loss: h(d) = −log(1 − σ(d)); 2) Non-saturated (NS) loss: h(d) = log σ(d); 3) Maximum likelihood (MLE) loss: h(d) = exp(d); 4) Logit loss: h(d) = d. The plot of these functions is shown in Figure 2. The vanilla loss and the MLE loss do not work well in practice (Goodfellow, 2016), since at the initial training steps the generator is weak and d(x) = log(p_data(x)/p_g(x)) ≪ 0 for x ∼ p_g. As we may observe in Figure 2, the curves of the vanilla loss and the MLE loss are fairly flat on the left side, which means the derivative h′(·) is nearly zero. According to Eq. (6), such a rescaling scheme yields extremely small vector fields, resulting in the generator being trapped at the initial steps, as the infinitesimal change of particles dx_t ≈ 0. The NS and logit losses both have non-zero derivatives when d(x) < 0, despite the NS loss being flat on the right side. This is not a problem, since d(x) gradually increases from a negative value during training, and d(x) = 0 means p_g = p_data, i.e., the generator has converged.
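These slopes can be computed exactly. A minimal sketch evaluating h′(d) for the four losses at a strongly negative logit (the value d = −10 is an arbitrary stand-in for an early-training logit):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Derivatives h'(d) of the four generator losses, evaluated at d = -10,
# mimicking an early-training logit d(x) = log r(x) << 0
d = -10.0
h_vanilla = sigmoid(d)       # h(d) = -log(1 - sigma(d))  =>  h'(d) = sigma(d)
h_ns = 1.0 - sigmoid(d)      # h(d) = log sigma(d)        =>  h'(d) = 1 - sigma(d)
h_mle = math.exp(d)          # h(d) = exp(d)              =>  h'(d) = exp(d)
h_logit = 1.0                # h(d) = d                   =>  h'(d) = 1
# the vanilla and MLE slopes are ~4.5e-5, while the NS and logit slopes are ~1
```

The near-zero slopes of the vanilla and MLE losses are precisely the rescaling factors h′(log r) in Eq. (6) that shrink the vector field and trap the generator at initialization.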

5.2. AN EMBARRASSINGLY SIMPLE TRICK TO FIX VANILLA GAN

We have justified that GANs can work with any generator loss as long as it is a monotonically increasing mapping of the log density ratio and this mapping's derivative deviates from zero when log r(x) ≪ 0. We show the effect of shifting the generator loss of the vanilla GAN to the left by adding a constant C inside the sigmoid function,

h(d) = −log(1 − σ(d + C)).   (25)

By adding a constant, we obtain a better monotonically increasing function whose derivative deviates from zero significantly, see Figure 3. The neural network architecture used here is DCGAN (Radford et al., 2015); we follow the vanilla GAN framework where the log density ratio is given by the logit output of the binary classifier, and the model is trained for 15 epochs. The generated samples are shown in Figure 4. We observe that with C = 3 and C = 5, the generator losses in Eq. (25) begin to work, i.e., the generators output plausible fake images.
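A quick numerical check of the shifted slope (again, d = −10 is an arbitrary stand-in for an early-training logit):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Slope of the shifted vanilla loss h(d) = -log(1 - sigma(d + C)) is
# h'(d) = sigma(d + C): shifting left by C lifts the slope where d << 0
d = -10.0
slopes = {C: sigmoid(d + C) for C in (0, 3, 5)}
# sigma(-10) ~ 4.5e-5, sigma(-7) ~ 9.1e-4, sigma(-5) ~ 6.7e-3
```

Each unit of C multiplies the early-training slope by roughly e, which is why moderate shifts such as C = 3 or C = 5 are already enough to unstick the vanilla generator loss.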

6. RELATED WORKS

Gradient Flow: Wasserstein gradient flows of f-divergences have been previously studied in deep generative modeling as a refinement approach to improve sample quality (Ansari et al., 2021). A close work to ours is (Gao et al., 2019), where the authors proposed to use gradient flows of f-divergences to refine fake samples output by the generator, and the generator learns to minimize the squared distance between the refined samples and the original fake samples. However, neither of the above reveals the equivalence between gradient flows and divergence GANs. Furthermore, MonoFlow is a more general framework that covers existing gradient flows of f-divergences, and our method also applies to traditional loss designs as well as many other types of monotonically increasing functions. IPM GANs: Our framework unifies divergence GANs, since estimating a probability divergence is naturally related to density ratio estimation (Sugiyama et al., 2012). However, some variants of GANs are developed with Integral Probability Metrics (IPMs) (Sriperumbudur et al., 2009). For example, WGANs (Arjovsky et al., 2017; Gulrajani et al., 2017) estimate the Wasserstein-1 metric and then minimize this metric. While MonoFlow is associated with the Wasserstein-2 metric, minimizing a functional in P(R^n) naturally decreases the Wasserstein-2 metric as well. Other types of IPM GANs are MMD GAN (Dziugaite et al., 2015) and Sobolev GAN (Mroueh et al., 2018b). Both of them have been interpreted as gradient flow approaches (Mroueh & Nguyen, 2021; Mroueh et al., 2018a), but associated with different vector fields. Franceschi et al. (2022) studied the NTK view of GANs given a vector field specified by an IPM loss function, but lacks connections to divergence GANs. Diffusion Models: Diffusion models (Ho et al., 2020; Song et al., 2021; Luo, 2022) are another line of generative modeling frameworks.
This framework first perturbs data by adding noise at different scales to create a path {q_t}_{t≥0} interpolating between the data distribution and the noise distribution. Subsequently, generative modeling reverses {q_t}_{t≥0} as denoising. The similarity between MonoFlow and diffusion models is that they both involve particle evolution associated with different paths of marginal probabilities. The difference is as follows: the vector field of MonoFlow is obtained from the log density ratio, and the log density ratio must be corrected per iteration by gradient updates; diffusion models directly estimate vector fields with time-dependent neural networks and are straightforward particle methods.

7. CONCLUSIONS

We introduce a unified framework for GAN variants. Our framework provides a comprehensive understanding that helps us gain insights into why and how adversarial training works. The mechanism of adversarial training is not as adversarial as we used to think. It instead simulates an ODE system: the bi-level step of adversarial training can be regarded as first estimating the vector field of MonoFlow, after which the generator is updated to learn to draw particles guided by the vector field, which we call parameterizing MonoFlow. Therefore, all GAN variants discussed in this paper are equal at the methodological level: they are all different methods of estimating a bijection of the log density ratio and then mapping the log density ratio through different monotonically increasing functions. Compared to previous studies of GANs, our framework is highly consistent both theoretically and practically. The limitation of this paper is that our framework does not cover the variants of IPM GANs, since these variants give a vector field that is different from the gradient of the log density ratio. We leave this as future work.

A PROOFS

A.1 PROOF OF THEOREM 3.1

Equilibrium of MonoFlow: Without loss of generality, we assume the target measure p is properly normalized, i.e., ∫ dp = 1. To satisfy this assumption, p usually follows a Boltzmann distribution, i.e., p ∝ exp(−U), where the potential energy is U = −log p. MonoFlow defines a vector field v_t = h′(log r_t)(∇_x log p − ∇_x log q_t), such that by Theorem 4.6 in (Ambrosio et al., 2008) we can write the continuity equation as

∂q_t/∂t = div(q_t h′(log r_t)(∇_x log q_t − ∇_x log p)).

At equilibrium (the stationary distribution), the probability current is zero,

q_t h′(log r_t)(∇_x log q_t − ∇_x log p) = 0.

This condition reflects the boundary conditions of the continuity equation. Note that the equilibrium state of a general continuity equation does not necessarily indicate that the current (flux) has to be zero. MonoFlow is a special case where the drift force of the system is generated by the potential energy (a conservative force), such that achieving equilibrium is equivalent to the current being zero. Since h′(log r_t) > 0, we directly have q_t(∇_x log q_t − ∇_x log p) = 0. Hence q_t = p is the solution to the above differential equation, which is the same as finding the equilibrium of the Fokker-Planck equation in Eq. (3).

The dissipation rate: For any curve {q_t}_{t≥0} evolving according to the vector field {v_t}_{t≥0} in Wasserstein space, the dissipation rate of the functional is

∂F(q_t)/∂t = E_{q_t}[⟨∇_{W_2} F(q_t), v_t⟩],

where the Wasserstein gradient of the KL divergence is ∇_{W_2} F(q_t) = ∇_x log(q_t/p). Therefore, the dissipation rate of the KL divergence under MonoFlow is

∂F(q_t)/∂t = −E_{q_t}[h′(log r_t) ||∇_x log(q_t/p)||^2] ≤ 0,

and equality is achieved if and only if q_t = p. Hence, MonoFlow always decreases the KL divergence with time.

A.2 PROOF OF THEOREM 3.2

Define the functional F(q) of f-divergences as

F(q) = D_f(p||q) = ∫ f(p(x)/q(x)) q(x) dx.   (29)
where f : R_+ → R is a convex function, which we further assume to be twice differentiable. Let ϕ be an arbitrary test function on R^n. The first variation (functional derivative) δF/δq is defined by

∫ (δF/δq)(x) ϕ(x) dx = lim_{ε→0} [F(q + εϕ) - F(q)] / ε
= (d/dε) F(q + εϕ) |_{ε=0}
= (d/dε) ∫ f( p/(q + εϕ) )(x) (q(x) + εϕ(x)) dx |_{ε=0}
= ∫ [ f( p/(q + εϕ) )(x) ϕ(x) - f'( p/(q + εϕ) )(x) p(x)ϕ(x)/(q(x) + εϕ(x)) ] dx |_{ε=0}
= ∫ [ f(p/q) - f'(p/q)(p/q) ](x) ϕ(x) dx.   (30)

Thus,

δF/δq = f(r) - r f'(r),  where r = p/q.   (31)

Recall that the Wasserstein gradient of F(q) is the Euclidean gradient of the first variation, so we have

∇_{W_2} F(q) = ∇_x (δF/δq) = -r f''(r) ∇_x r.   (32)

The corresponding vector field is given by the negative Euclidean gradient (see Section 2); therefore the particle flow ODE of f-divergences can be written as

dx = -∇_x (δF/δq)(x) dt = r(x) f''(r(x)) ∇_x r(x) dt = r(x)² f''(r(x)) ∇_x log r(x) dt.   (33)

We have h'(log r) = r² f''(r) > 0 since f is convex (with f'' > 0), concluding the proof.

We use a slightly different notation: d(x) is the logit output of the classifier and σ(·) is the Sigmoid activation. For the sake of simplicity, we replace t_k by its index k, though this is not rigorous.

Figure 2: The plot of different generator losses as a function of d.

Figure 3: The plots of the vanilla losses with different constants C added.

Figure 4: Generated samples with different constants C. The experiments use the MNIST data set.

A.3 PROOF OF LEMMA 3.3

This proof is adapted from Lemmas 1 and 2 of Moustakides & Basioti (2019). Given the optimization problem

max_{d∈H} E_{x∼p} ϕ(d(x)) + E_{x∼q} ψ(d(x)),   (34)

where H is a class of measurable functions, we rewrite it as

max_{d∈H} ∫ [ p(x) ϕ(d(x)) + q(x) ψ(d(x)) ] dx = ∫ max_d [ p(x) ϕ(d(x)) + q(x) ψ(d(x)) ] dx,   (35)

where we apply the interchange of maximum and integral because the integral operator is independent of d. Since the maximization holds pointwise for every fixed x, setting the derivative with respect to d to zero, the optimal d*, the abbreviation of d*(x), satisfies

r(x) ϕ'(d*) + ψ'(d*) = 0,  where r(x) = p(x)/q(x).   (36)

We need to discuss under which sufficient conditions d* is the unique maximizer of the above problem. Denote l(d) = r ϕ(d) + ψ(d). To ensure that d* is the unique maximizer, l'(d) should satisfy

l'(d) > 0, ∀ d < d*  and  l'(d) < 0, ∀ d > d*.   (37)

We summarize two sufficient conditions:
1. ϕ and ψ are concave functions and the resulting mapping T(d) := -ψ'(d)/ϕ'(d) is a bijection.
2. ϕ' > 0 and the resulting mapping T is a strictly increasing mapping (also a bijection).

For condition 1, it is obvious that d* = T^{-1}(r) is the root of Eq. (36), and since ϕ and ψ are concave, l(d) is concave, which satisfies Eq. (37); therefore d* is the unique maximizer. For condition 2, since r = T(d*), we can write l'(d) = [T(d*) - T(d)] ϕ'(d). Since T is a strictly increasing mapping, we have T(d*) - T(d) > 0 for d < d* and T(d*) - T(d) < 0 for d > d*. Hence, together with ϕ' > 0, l'(d) satisfies the condition stated in Eq. (37). In Table 2, b-GAN satisfies condition 2 and the rest of the divergence GANs satisfy condition 1. Some examples:

• Binary classification: ϕ(d) = log σ(d) and ψ(d) = log(1 - σ(d)), giving r(x) = exp(d*(x)).
• Fenchel duality: ϕ(d) = d and ψ(d) = -f*(d), giving r(x) = (f*)'(d*(x)), where the convex conjugate is f*(d) = sup_{r∈dom f} { rd - f(r) }.
• Least-squares GAN: ϕ(d) = -(d - 1)² and ψ(d) = -d², giving r(x) = d*(x)/(1 - d*(x)).

Table 2: Different types of divergence GANs. f is a convex function and f* is the convex conjugate of f.
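The pointwise characterization d* = T^{-1}(r) in Lemma 3.3 can be checked directly: for a fixed ratio value r, maximize l(d) = r ϕ(d) + ψ(d) over a fine grid and compare against the closed-form inverse. The sketch below (with an arbitrary illustrative value r = 3) does this for the binary-classification and least-squares pairs:

```python
import math

def argmax_on_grid(l, lo=-10.0, hi=10.0, n=200001):
    # Brute-force maximizer of l over an evenly spaced grid on [lo, hi].
    best_d, best_v = lo, -math.inf
    for i in range(n):
        d = lo + (hi - lo) * i / (n - 1)
        v = l(d)
        if v > best_v:
            best_v, best_d = v, d
    return best_d

r = 3.0  # an arbitrary value of the density ratio p(x)/q(x)
sigmoid = lambda d: 1.0 / (1.0 + math.exp(-d))

# Binary classification: phi(d) = log sigma(d), psi(d) = log(1 - sigma(d)),
# whose pointwise maximizer should be d* = log r.
d_bc = argmax_on_grid(lambda d: r * math.log(sigmoid(d))
                      + math.log(1.0 - sigmoid(d)))

# Least-squares GAN: phi(d) = -(d - 1)^2, psi(d) = -d^2,
# whose pointwise maximizer should be d* = r / (1 + r).
d_ls = argmax_on_grid(lambda d: -r * (d - 1.0) ** 2 - d ** 2)

print(round(d_bc, 3), round(math.log(r), 3))  # both ≈ 1.099
print(round(d_ls, 3))                          # ≈ 0.75
```

The same grid check works for any (ϕ, ψ) pair satisfying either sufficient condition; only the closed-form inverse T^{-1} changes.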

B EXPERIMENTS

B.1 DETAILS FOR 4.3

Table 3: Explicit forms of f and h, where h(u) is written in terms of u = log r.

The generator is initialized at: and the target distribution is N( (0.0, 0.0), ( (1.00, 0.80), (0.80, 0.89) ) ). The visualisation results follow the order: KL, Forward KL, Chi-Square, Hellinger, Jensen-Shannon (GAN), Exp. r_GAN(x) uses a simple 2-layer discriminator with Leaky ReLU activations, whose logit output is taken as the log density ratio. This has no meaning in terms of reconstructing any probability divergence, but the GAN still works. We also ran experiments on CIFAR-10 and CelebA using the DCGAN architecture (Radford et al., 2015); the generated samples are shown in Figure 6.
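The relation h'(log r) = r² f''(r) from Theorem 3.2 makes the explicit h entries of Table 3 easy to check numerically. In the sketch below, the convex generators f are the standard textbook choices for KL, reverse KL, and Jensen-Shannon (stated here as assumptions, not read from the table), and the identity is verified by finite differences:

```python
import math

def second_derivative(f, r, h=1e-4):
    # Central finite-difference approximation of f''(r).
    return (f(r + h) - 2.0 * f(r) + f(r - h)) / (h * h)

# Standard convex generators f for D_f(p||q) with r = p/q (assumed forms):
f_kl = lambda r: r * math.log(r)                                        # KL(p||q)
f_rkl = lambda r: -math.log(r)                                          # reverse KL
f_js = lambda r: r * math.log(r) - (1.0 + r) * math.log((1.0 + r) / 2)  # JS (GAN)

sigmoid = lambda u: 1.0 / (1.0 + math.exp(-u))

rows = []
for u in (-2.0, 0.0, 1.5):
    r = math.exp(u)
    # h'(u) = r^2 f''(r), the rescaling recovered in the proof of Theorem 3.2.
    rows.append((u,
                 r * r * second_derivative(f_kl, r),
                 r * r * second_derivative(f_rkl, r),
                 r * r * second_derivative(f_js, r)))

for u, kl, rkl, js in rows:
    # Expected closed forms: e^u for KL, 1 for reverse KL, sigmoid(u) for JS.
    print(round(kl, 3), round(rkl, 3), round(js, 3),
          "| expected:", round(math.exp(u), 3), 1.0, round(sigmoid(u), 3))
```

The Jensen-Shannon case recovering h'(u) = σ(u) is the reason the vanilla GAN's flow is a sigmoid-rescaled version of the log-density-ratio gradient.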

