MONOFLOW: A UNIFIED GENERATIVE MODELING FRAMEWORK FOR DIVERGENCE GANS

Abstract

Generative adversarial networks (GANs) play a two-player minimax game via adversarial training. The conventional understanding of adversarial training is that the discriminator is trained to estimate a divergence and the generator learns to minimize this divergence. We argue that, despite the fact that many variants of GANs were developed following this paradigm, the existing theoretical understanding of GANs and the practical algorithms are inconsistent. To gain deeper theoretical insights and algorithmic inspiration for these GAN variants, we leverage Wasserstein gradient flows, which characterize the evolution of particles in the sample space. Based on this, we introduce a unified generative modeling framework, MonoFlow, in which the particle evolution is rescaled via an arbitrary monotonically increasing mapping. Under our framework, adversarial training can be viewed as a procedure that first obtains MonoFlow's vector field via the discriminator, after which the generator learns to parameterize the flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us identify which types of generator loss functions can lead to the successful training of GANs and suggests that GANs may admit more loss designs than those developed in the literature (e.g., the non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.

1. INTRODUCTION

Generative adversarial nets (GANs) (Goodfellow et al., 2014; Jabbar et al., 2021) are a powerful generative modeling framework that has gained tremendous attention in recent years. GANs have achieved significant success in applications, especially in high-dimensional image processing such as high-fidelity image generation (Brock et al., 2018; Karras et al., 2019), super-resolution (Ledig et al., 2017) and domain adaptation (Zhang et al., 2017). In the GAN framework, a discriminator $d$ and a generator $g$ play a minimax game. The discriminator is trained to distinguish real and fake samples, and the generator is trained to generate fake samples that fool the discriminator. The equilibrium of the vanilla GAN is defined by[foot_0]

$$\min_g \max_d V(g, d) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log \sigma[d(x)]\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - \sigma[d(g(z))]\big)\big]. \quad (1)$$

The elementary optimization approach to solve the minimax game is adversarial training, conventionally understood as divergence estimation followed by divergence minimization (… et al., 2015; Mroueh et al., 2018b). However, we emphasize that this traditional understanding of GANs is incomplete, and here we present three non-negligible facts, commonly associated with adversarial training, which distinguish it from the standard variational divergence minimization (VDM) problem (Blei et al., 2017):

1. The estimated divergence is computed from the discriminator $d(x)$. It is a function depending only on samples $x$ and cannot capture the variability of the generator's distribution $p_g$. However, the optimal discriminator in Proposition 1 of Goodfellow et al. (2014) requires $p_g$ to be a functional variable as well. This issue was also raised in (Metz et al., 2017; Franceschi et al., 2022).

2. The generator typically minimizes a divergence with a missing term; e.g., the vanilla GAN only minimizes the second term of the Jensen-Shannon divergence, $-\mathbb{E}_{z \sim p_z}\{-\log(1 - \sigma[d(g(z))])\}$, where $-\log(1 - \sigma[d(g(z))])$ is a monotonically increasing function of $d(g(z))$.

3. Practical algorithms are inconsistent with the theory: the heuristic "non-saturated loss" trick is commonly adopted to mitigate the gradient vanishing problem, but it still lacks a rigorous mathematical understanding. For example, the generator can instead minimize $-\mathbb{E}_{z \sim p_z}\{\log \sigma[d(g(z))]\}$, where $\log \sigma[d(g(z))]$ is also a monotonically increasing function of $d(g(z))$.

It is known that the logit output $d(x)$ of the binary classification problem in Eq. (1) is an estimator of the log density ratio between the two distributions (Sugiyama et al., 2012). To gain a deeper understanding of the adversarial training of GANs, we study the Wasserstein gradient flow of the KL divergence, which characterizes a deterministic evolution of particles described by an ordinary differential equation (ODE). This ODE is a Euclidean gradient flow of a time-dependent log density ratio. Based on this ODE, we propose the MonoFlow framework: transforming the log density ratio by a monotonically increasing mapping such that the vector field of the gradient flow is rescaled along the same direction. Consequently, approximating and learning to parameterize MonoFlow is synonymous with adversarial training. Under our framework, we gain a comprehensive understanding of the training dynamics of GANs: the discriminator obtains a bijection of the log density ratio that specifies the vector field, and the generator learns to parameterize the particles of MonoFlow. All variants of divergence GANs form a subclass of our framework. Finally, we reveal that the discriminator and generator losses do not need to follow the same objective: the discriminator maximizes an objective to obtain a bijection of the log density ratio, and the generator loss can then be any monotonically increasing mapping of this log density ratio.
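The gradient vanishing behind fact 3 can be made concrete numerically. The following sketch (our illustration, not code from the paper) compares the per-sample gradients, with respect to the discriminator logit $d(g(z))$, of the saturating generator loss $\log(1-\sigma(d))$ and the non-saturated loss $-\log \sigma(d)$:

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

# Per-sample generator losses as functions of the discriminator logit d(g(z)).
def saturating_loss(d):      # minimize E[log(1 - sigma(d))]
    return np.log(1.0 - sigmoid(d))

def non_saturating_loss(d):  # minimize -E[log sigma(d)]
    return -np.log(sigmoid(d))

# Closed-form gradients with respect to the logit d.
def grad_saturating(d):      # d/dd log(1 - sigma(d)) = -sigma(d)
    return -sigmoid(d)

def grad_non_saturating(d):  # d/dd [-log sigma(d)] = sigma(d) - 1
    return sigmoid(d) - 1.0

# Early in training the discriminator confidently rejects fakes: d(g(z)) << 0.
d = -10.0
print(grad_saturating(d))      # ~ -4.5e-05: the gradient has vanished
print(grad_non_saturating(d))  # ~ -1.0: the gradient remains useful
```

Both losses push $d(g(z))$ upward (equivalently, both maximize a monotonically increasing function of the logit), but only the non-saturated loss provides a usable gradient in the regime where fake samples are easily rejected.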
Our contributions are summarized as follows:
• A novel generative modeling framework that unifies divergence GANs, providing a new, theoretically and practically consistent understanding of the underlying mechanism of the training dynamics of GANs.
• We reveal the fundamental difference between VDM and adversarial training, which indicates that previous analyses of GANs from the perspective of VDM might not provide benefits; instead, we should treat GANs as a particle flow method.
• An analysis of what types of generator loss functions can lead to successful GAN training. Our framework explains why and how the non-saturated loss works.
• An algorithmic inspiration: GANs may admit more variants of loss designs than those already known.

2. WASSERSTEIN GRADIENT FLOWS

In this section, we review the definition of gradient flows in the Wasserstein space $(\mathcal{P}(\mathbb{R}^n), W_2)$, the space of Borel probability measures $\mathcal{P}(\mathbb{R}^n)$ defined on $\mathbb{R}^n$ with finite second moments, equipped with the Wasserstein-2 metric. An absolutely continuous curve of probability measures $\{q_t\}_{t \geq 0} \subset \mathcal{P}(\mathbb{R}^n)$ is a Wasserstein gradient flow if it satisfies the following continuity equation (Ambrosio et al., 2008),

$$\frac{\partial q_t}{\partial t} = \mathrm{div}\big( q_t \, \nabla_{W_2} \mathcal{F}(q_t) \big),$$

where $\nabla_{W_2} \mathcal{F}(q_t)$ is called the Wasserstein gradient of the functional $\mathcal{F}: \mathcal{P}(\mathbb{R}^n) \to \mathbb{R}$. The Wasserstein gradient is defined as $\nabla_x \frac{\delta \mathcal{F}(q_t)}{\delta q_t}$, i.e., the Euclidean gradient of the functional's first variation $\frac{\delta \mathcal{F}(q_t)}{\delta q_t}$. Specifically, for the functional $\mathcal{F}(q_t) = \int \log \frac{q_t}{p} \, \mathrm{d}q_t$, the KL divergence, where $p$ is a fixed target probability measure, we have $\frac{\delta \mathcal{F}(q_t)}{\delta q_t} = \log \frac{q_t}{p} + 1$. Hence, the Wasserstein gradient flow of
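To build intuition for the KL gradient flow just introduced, the following one-dimensional sketch simulates it as a deterministic particle ODE, dx/dt = -∇_x log(q_t(x)/p(x)) = ∇ log p(x) - ∇ log q_t(x). The Gaussian approximation of q_t and the target p = N(0, 1) are simplifying assumptions of ours for illustration; in GANs the discriminator would estimate the log density ratio instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(mu_p, 1); the log density ratio log(q_t / p) drives the flow.
mu_p = 0.0

# Particles initialized far from the target.
x = rng.normal(loc=5.0, scale=1.0, size=2000)

dt, n_steps = 0.05, 400
for _ in range(n_steps):
    # Gaussian approximation of q_t fitted to the current particles.
    m, s = x.mean(), x.std()
    # Velocity field of the KL Wasserstein gradient flow:
    #   v(x) = grad log p(x) - grad log q_t(x)
    #        = -(x - mu_p) + (x - m) / s**2   (both densities Gaussian)
    v = -(x - mu_p) + (x - m) / s**2
    # Forward-Euler step of the particle ODE dx/dt = v(x).
    x = x + dt * v

print(x.mean())  # ~ 0.0: the particles have flowed onto the target
```

The particle mean contracts toward mu_p at each Euler step, so after a few hundred steps the empirical distribution of the particles matches the target N(0, 1); this is the deterministic evolution that MonoFlow later rescales by a monotone mapping.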



foot_0: We use a slightly different notation: d(x) is the logit output of the classifier and σ(•) is the Sigmoid activation.

