MUTUAL CALIBRATION BETWEEN EXPLICIT AND IMPLICIT DEEP GENERATIVE MODELS

Abstract

Deep generative models are generally categorized into explicit models and implicit models. The former defines an explicit density form that allows likelihood inference, while the latter learns a flexible transformation from random noise to generated samples. To take full advantage of both models, we propose Stein Bridging, a novel joint training framework that connects an explicit (unnormalized) density estimator and an implicit sample generator via Stein discrepancy. We show that the Stein bridge 1) induces novel mutual regularization via kernel Sobolev norm penalization and Moreau-Yosida regularization, and 2) stabilizes the training dynamics. Empirically, we demonstrate that Stein Bridging helps the density estimator to accurately identify data modes and guides the sample generator to produce higher-quality samples, especially when the training samples are contaminated or limited.

1. INTRODUCTION

Deep generative models, as powerful unsupervised frameworks for learning the distribution of high-dimensional multi-modal data, have been extensively studied in recent literature. Typically, there are two types of generative models: explicit and implicit (Goodfellow et al., 2014). Explicit models define a density function of the distribution, while implicit models learn a mapping that generates samples by transforming an easy-to-sample random variable. Both models have their own power and limitations. The density form in explicit models endows them with convenience in characterizing the data distribution and inferring sample likelihoods. However, the unknown normalizing constant often causes computational intractability. On the other hand, implicit models, including generative adversarial networks (GANs), can directly generate vivid samples in various application domains including images, natural languages, graphs, etc. (Goodfellow et al., 2014; Radford et al., 2016; Arjovsky et al., 2017; Brock et al., 2019). Nevertheless, one important challenge is to design a training algorithm that does not suffer from instability and mode collapse. In view of this, it is natural to build a unified framework that takes full advantage of the two models and encourages them to compensate for each other. Intuitively, an explicit density estimator and a flexible implicit sampler could help each other's training given effective information sharing. On the one hand, the density estimate given by the explicit model can serve as a good metric of sample quality (Dai et al., 2017), and thus can be used for scoring the samples generated by the implicit model, or for detecting outliers and noise in the input true samples (Zhai et al., 2016). On the other hand, the samples generated by the implicit model could augment the dataset and help to alleviate mode collapse, especially when true samples are so insufficient that the explicit model would fail to capture an accurate distribution.
We refer to Appendix A for a more comprehensive literature review. Motivated by the discussions above, in this paper we propose a joint learning framework that enables mutual calibration between explicit and implicit generative models. In our framework, an explicit model is used to estimate the unnormalized density; in the meantime, an implicit generator model is exploited to minimize a certain statistical distance (such as the Wasserstein metric or the Jensen-Shannon divergence) between the distributions of the true and the generated samples. On top of these two models, a Stein discrepancy, acting as a bridge between generated samples and estimated densities, is introduced to push the two models toward a consensus. Unlike flow-based models (Nguyen et al., 2017; Kingma & Dhariwal, 2018; Papamakarios et al., 2017), our formulation does not impose invertibility constraints on the generative models and is thus flexible in utilizing general neural network architectures. Our main contributions are as follows.

• Theoretically, we prove that our method allows the two generative models to impose novel mutual regularization on each other. Specifically, our formulation penalizes a large kernel Sobolev norm of the critic in the implicit (WGAN) model, which ensures that the critic does not change abruptly on high-density regions and thus prevents the critic of the implicit model from becoming too strong during training. In the meantime, our formulation also smooths the function given by the Stein discrepancy through Moreau-Yosida regularization, which encourages the explicit model to seek more modes in the data distribution and thus alleviates mode collapse.

• In addition, we show that the joint training helps to stabilize the training dynamics. Compared with other common regularization approaches for GAN models that may shift the original optimum, our method facilitates convergence to an unbiased model distribution.
• Extensive experiments on synthetic and image datasets justify our theoretical findings and demonstrate that joint training helps the two models achieve better performance. On the one hand, the energy model can detect complicated modes in the data more accurately and distinguish out-of-distribution samples. On the other hand, the implicit model can generate higher-quality samples, especially when the training samples are contaminated or limited.

2. BACKGROUND

We briefly provide some technical background related to our model.

Energy Model. The energy model assigns each data point x ∈ R^d a scalar energy value E_φ(x), where E_φ(·) is called the energy function and is parameterized by φ. The model is expected to assign low energy to true samples according to a Gibbs distribution p_φ(x) = exp{-E_φ(x)}/Z_φ, where Z_φ is a normalizing constant depending on φ. The normalizing term Z_φ is often hard to compute, making training intractable, and various methods have been proposed to sidestep this term (see Appendix A).

Stein Discrepancy. Stein discrepancy (Gorham & Mackey, 2015; Liu et al., 2016; Chwialkowski et al., 2016; Oates et al., 2017; Grathwohl et al., 2020) is a measure of closeness between two probability distributions that does not require knowledge of the normalizing constant of one of the compared distributions. Let P and Q be two probability distributions on X ⊂ R^d, and assume Q has an (unnormalized) density q. The Stein discrepancy S(P, Q) is defined as S(P, Q) := sup_{f ∈ F} E_{x∼P}[A_Q f(x)] = sup_{f ∈ F} Γ(E_{x∼P}[∇_x log q(x) f(x)^T + ∇_x f(x)]), where F is often chosen to be a Stein class (see, e.g., Definition 2.1 in Liu et al. (2016)), f : R^d → R^{d'} is a vector-valued function called the Stein critic, and Γ is an operator that transforms a d × d' matrix into a scalar value. One common choice of Γ is the trace operator when d' = d; other forms, such as a matrix norm, can also be used when d' ≠ d (Liu et al., 2016). If F is a unit ball in some reproducing kernel Hilbert space (RKHS) with a positive definite kernel k, the resulting discrepancy is the Kernel Stein Discrepancy (KSD). More details are provided in Appendix B.

Wasserstein Metric. The Wasserstein metric is suitable for measuring distances between two distributions with non-overlapping supports (Arjovsky et al., 2017). The Wasserstein-1 metric between distributions P and Q is defined as W(P, Q) := min_γ E_{(x,y)∼γ}[||x - y||], where the minimization with respect to γ is over all joint distributions with marginals P and Q. By Kantorovich-Rubinstein duality, W(P, Q) has the dual representation W(P, Q) = max_D {E_{x∼P}[D(x)] - E_{y∼Q}[D(y)]}, where the maximization is over all 1-Lipschitz continuous functions D.

Sobolev space and Sobolev dual norm. Let L^2(P) be the Hilbert space on R^d equipped with the inner product <u, v>_{L^2(P)} := ∫_{R^d} uv dP(x). The (weighted) Sobolev space H^1 is defined as the closure of C_0^∞, the set of smooth functions on R^d with compact support, with respect to the norm ||u||_{H^1} := (∫_{R^d} (u^2 + ||∇u||^2) dP(x))^{1/2}, where P has a density. For v ∈ L^2, its Sobolev dual norm is the norm of v viewed as a linear functional on H^1, i.e., ||v||_{H^{-1}} := sup_{||u||_{H^1} ≤ 1} <u, v>_{L^2(P)}.
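As a concrete illustration of the energy-model formulation above, the following minimal sketch evaluates an unnormalized Gibbs density from an energy function. The quadratic energy E_φ(x) = 0.5 ||x - φ||^2 is a hypothetical choice made only for illustration, and the normalizer Z_φ is deliberately left uncomputed, which is exactly the intractability discussed above.

```python
import numpy as np

def energy(x, phi):
    # Hypothetical quadratic energy E_phi(x) = 0.5 * ||x - phi||^2,
    # chosen only to make the sketch concrete.
    return 0.5 * np.sum((x - phi) ** 2, axis=-1)

def unnormalized_density(x, phi):
    # p_phi(x) is proportional to exp{-E_phi(x)}; the normalizer Z_phi
    # is left uncomputed -- this is the intractable term the text discusses.
    return np.exp(-energy(x, phi))
```

Low energy corresponds to high (unnormalized) density, so training amounts to pushing the energy down on true samples.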

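The kernelized form of the Stein discrepancy admits a simple closed-form estimator. The sketch below is a minimal NumPy implementation of the V-statistic estimate of the squared Kernel Stein Discrepancy with an RBF kernel, assuming access only to the score function ∇_x log q(x) (no normalizing constant needed). The function name `ksd_rbf` and the fixed bandwidth are our own illustrative choices, not from the paper.

```python
import numpy as np

def ksd_rbf(x, score_q, h=1.0):
    """V-statistic estimate of the squared Kernel Stein Discrepancy
    between samples x ~ P and a density q known only through its score
    score_q(x) = grad_x log q(x), using an RBF kernel with bandwidth h.

    x: (n, d) array of samples; score_q maps (n, d) -> (n, d).
    """
    n, d = x.shape
    s = score_q(x)                               # (n, d) scores at samples
    diff = x[:, None, :] - x[None, :, :]         # (n, n, d) pairwise x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)          # (n, n) squared distances
    k = np.exp(-sqdist / (2 * h ** 2))           # RBF kernel matrix

    # Stein kernel u_q(x_i, x_j) =
    #   s_i^T s_j k + s_i^T grad_j k + s_j^T grad_i k + tr(grad_i grad_j k)
    grad_j_k = (diff / h ** 2) * k[..., None]    # grad w.r.t. x_j of k(x_i, x_j)
    term1 = (s @ s.T) * k
    term2 = np.einsum('id,ijd->ij', s, grad_j_k)
    term3 = np.einsum('jd,ijd->ij', s, -grad_j_k)  # grad_i k = -grad_j k
    term4 = (d / h ** 2 - sqdist / h ** 4) * k     # trace of cross-derivative
    return np.mean(term1 + term2 + term3 + term4)
```

The estimate is small when the samples come from q and grows as the sample distribution drifts away from q, which is what allows it to bridge a sample generator and an unnormalized density estimator.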

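For the Wasserstein-1 metric, the one-dimensional case has a convenient closed form: between two equal-sized empirical distributions, the optimal coupling simply matches sorted samples. The helper below is our own illustrative sketch, useful as a sanity check, and is not part of the paper's method.

```python
import numpy as np

def w1_empirical_1d(x, y):
    """Wasserstein-1 distance between two 1-D empirical distributions
    with the same number of samples: the optimal transport plan pairs
    the i-th smallest of x with the i-th smallest of y."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))
```

For example, two point masses at 0 and at 1 (each represented by repeated samples) are at W1 distance 1, while two datasets containing the same values in different orders are at distance 0.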