MUTUAL CALIBRATION BETWEEN EXPLICIT AND IMPLICIT DEEP GENERATIVE MODELS

Abstract

Deep generative models are generally categorized into explicit models and implicit models. The former defines an explicit density form that allows likelihood inference, while the latter targets a flexible transformation from random noise to generated samples. To take full advantage of both models, we propose Stein Bridging, a novel joint training framework that connects an explicit (unnormalized) density estimator and an implicit sample generator via Stein discrepancy. We show that the Stein bridge 1) induces novel mutual regularization via kernel Sobolev norm penalization and Moreau-Yosida regularization, and 2) stabilizes the training dynamics. Empirically, we demonstrate that Stein Bridging can facilitate the density estimator to accurately identify data modes and guide the sample generator to output more high-quality samples, especially when the training samples are contaminated or limited.

1. INTRODUCTION

Deep generative models, as powerful unsupervised frameworks for learning the distributions of high-dimensional multi-modal data, have been extensively studied in recent literature. Typically, there are two types of generative models: explicit and implicit (Goodfellow et al., 2014). Explicit models define a density function of the distribution, while implicit models learn a mapping that generates samples by transforming an easy-to-sample random variable. Both models have their own power and limitations. The density form in explicit models endows them with convenience in characterizing the data distribution and inferring sample likelihoods. However, the unknown normalizing constant often causes computational intractability. On the other hand, implicit models, including generative adversarial networks (GANs), can directly generate vivid samples in various application domains including images, natural languages, graphs, etc. (Goodfellow et al., 2014; Radford et al., 2016; Arjovsky et al., 2017; Brock et al., 2019). Nevertheless, one important challenge is to design a training algorithm that does not suffer from instability and mode collapse.

In view of this, it is natural to build a unified framework that takes full advantage of the two models and encourages them to compensate for each other. Intuitively, an explicit density estimator and a flexible implicit sampler could help each other's training given effective information sharing. On the one hand, the density estimate given by the explicit model can serve as a good metric for measuring the quality of samples (Dai et al., 2017), and thus can be used for scoring generated samples from the implicit model or for detecting outliers and noise in the input true samples (Zhai et al., 2016). On the other hand, the generated samples from the implicit model could augment the dataset and help to alleviate mode collapse, especially when true samples are insufficient, in which case the explicit model alone may fail to capture an accurate distribution.
We refer to Appendix A for a more comprehensive literature review. Motivated by the discussions above, in this paper, we propose a joint learning framework that enables mutual calibration between explicit and implicit generative models. In our framework, an explicit model is used to estimate the unnormalized density; in the meantime, an implicit generator model is exploited to minimize a certain statistical distance (such as the Wasserstein metric or Jensen-Shannon divergence) between the distributions of the true and the generated samples. On top of these two models, a Stein discrepancy, acting as a bridge between generated samples and estimated densities, is introduced to push the two models to achieve a consensus. Unlike flow-based models (Nguyen et al., 2017; Kingma & Dhariwal, 2018; Papamakarios et al., 2017), our formulation does not impose invertibility constraints on the generative models and thus is flexible in utilizing general neural network architectures. Our main contributions are as follows.

• Theoretically, we prove that our method allows the two generative models to impose novel mutual regularization on each other. Specifically, our formulation penalizes a large kernel Sobolev norm of the critic in the implicit (WGAN) model, which ensures that the critic does not change abruptly on high-density regions and thus prevents the critic of the implicit model from becoming too strong during training. In the meantime, our formulation also smooths the function given by the Stein discrepancy through Moreau-Yosida regularization, which encourages the explicit model to seek more modes in the data distribution and thus alleviates mode collapse.

• In addition, we also show that the joint training helps to stabilize the training dynamics. Compared with other common regularization approaches for GAN models that may shift the original optimum, our method facilitates convergence to an unbiased model distribution.
• Extensive experiments on synthetic and image datasets justify our theoretical findings and demonstrate that joint training can help the two models achieve better performance. On the one hand, the energy model can detect complicated modes in data more accurately and distinguish out-of-distribution samples. On the other hand, the implicit model can generate higher-quality samples, especially when the training samples are contaminated or limited.

2. BACKGROUND

We briefly provide some technical background related to our model.

Energy Model. The energy model assigns each data point x ∈ R^d a scalar energy value E_φ(x), where E_φ(·) is called the energy function and is parameterized by φ. The model is expected to assign low energy to true samples according to a Gibbs distribution p_φ(x) = exp{-E_φ(x)}/Z_φ, where Z_φ is a normalizing constant dependent on φ. The normalizing term Z_φ is often hard to compute, making the training intractable, and various methods have been proposed to detour this term (see Appendix A).

Stein Discrepancy. Stein discrepancy (Gorham & Mackey, 2015; Liu et al., 2016; Chwialkowski et al., 2016; Oates et al., 2017; Grathwohl et al., 2020) is a measure of closeness between two probability distributions that does not require knowledge of the normalizing constant of one of the compared distributions. Let P and Q be two probability distributions on X ⊂ R^d, and assume Q has an (unnormalized) density q. The Stein discrepancy S(P, Q) is defined as

S(P, Q) := sup_{f∈F} E_{x∼P}[A_Q f(x)] = sup_{f∈F} Γ(E_{x∼P}[∇_x log q(x) f(x)^⊤ + ∇_x f(x)]),   (1)

where F is often chosen to be a Stein class (see, e.g., Definition 2.1 in (Liu et al., 2016)), f : R^d → R^{d'} is a vector-valued function called the Stein critic, and Γ is an operator that transforms a d × d' matrix into a scalar value. One common choice of Γ is the trace operator when d' = d. One can also use other forms for Γ, such as a matrix norm when d' ≠ d (Liu et al., 2016). If F is a unit ball in some reproducing kernel Hilbert space (RKHS) with a positive definite kernel k, the definition induces the Kernel Stein Discrepancy (KSD). More details are provided in Appendix B.

Wasserstein Metric. The Wasserstein metric is suitable for measuring distances between two distributions with non-overlapping supports (Arjovsky et al., 2017).
The Wasserstein-1 metric between distributions P and Q is defined as W(P, Q) := min_γ E_{(x,y)∼γ}[‖x - y‖], where the minimization with respect to γ is over all joint distributions with marginals P and Q. By Kantorovich-Rubinstein duality, W(P, Q) has a dual representation

W(P, Q) := max_D {E_{x∼P}[D(x)] - E_{y∼Q}[D(y)]},   (2)

where the maximization is over all 1-Lipschitz continuous functions D.

Sobolev space and Sobolev dual norm. Let L^2(P) be the Hilbert space on R^d equipped with the inner product ⟨u, v⟩_{L^2(P)} := ∫_{R^d} uv dP(x). The (weighted) Sobolev space H^1 is defined as the closure of C_0^∞, the set of smooth functions on R^d with compact support, with respect to the norm ‖u‖_{H^1} := (∫_{R^d} (u^2 + ‖∇u‖_2^2) dP(x))^{1/2}. The Sobolev dual norm ‖v‖_{H^{-1}} is defined by (Evans, 2010)

‖v‖_{H^{-1}} := sup_{u∈H^1} { ⟨v, u⟩_{L^2} : ∫_{R^d} ‖∇u‖_2^2 dP(x) ≤ 1, ∫_{R^d} u(x) dP(x) = 0 }.

The constraint ∫_{R^d} u(x) dP(x) = 0 is necessary to guarantee the finiteness of the supremum, and the supremum can be equivalently taken over C_0^∞.

3. PROPOSED MODEL: STEIN BRIDGING

In this section, we formulate our model, Stein Bridging. A scheme of our framework is illustrated in Figure 1. Denote by P_real the underlying real distribution from which the data {x} are sampled. The formulation simultaneously learns two generative models — one explicit and one implicit — that represent estimates of P_real. The explicit generative model has a distribution P_E on X with explicit probability density proportional to exp(-E(x)), x ∈ X, where E is referred to as an energy function. We focus on the energy-based explicit model in our formulation since it does not enforce constraints or assume specific density forms. For specific instantiations, one can also consider other explicit models, such as autoregressive models, or directly use certain density forms such as a Gaussian distribution given domain knowledge. The implicit model transforms an easy-to-sample random noise z with distribution P_0 via a generator G to a sample x = G(z) with distribution P_G.
Note that for distribution P_E, we have its explicit density without the normalizing term, while for P_G and P_real, we have samples from the two distributions. Hence, we can use the Stein discrepancy (which does not require the normalizing constant) as a measure of closeness between the explicit distribution P_E and the real distribution P_real, and use the Wasserstein metric (which requires only samples from the two distributions) as a measure of closeness between the implicit distribution P_G and the real data distribution P_real. To jointly learn the two generative models P_G and P_E, arguably the most straightforward way is to minimize the sum of the Stein discrepancy and the Wasserstein metric: min_{E,G} W(P_real, P_G) + λ S(P_real, P_E), where λ ≥ 0. However, this approach appears no different from learning the two generative models separately. To achieve information sharing between the two models, we incorporate another term S(P_G, P_E) — called the Stein bridge — that measures the closeness between the explicit distribution P_E and the implicit distribution P_G:

min_{E,G} W(P_real, P_G) + λ_1 S(P_real, P_E) + λ_2 S(P_G, P_E),   (3)

where λ_1, λ_2 ≥ 0. The Stein bridge term in (3) pushes the two models to achieve a consensus.

Remark 1. Our formulation is flexible in choosing both the implicit and the explicit models. In (3), we can choose statistical distances other than the Wasserstein metric W(P_real, P_G) to measure the closeness between P_real and P_G, such as the Jensen-Shannon divergence, as long as its computation requires only samples from the two involved distributions. Hence, one can use GAN architectures other than WGAN to parametrize the implicit model. In addition, one can replace the first Stein discrepancy term S(P_real, P_E) in (3) by other statistical distances as long as their computation is efficient, and hence other explicit models can be used.
For instance, if the normalizing constant of P_E is known or easy to calculate, one can use the Kullback-Leibler (KL) divergence.

Remark 2. The choice of the Stein discrepancy for the bridging term S(P_G, P_E) is crucial: it cannot be replaced by other statistical distances such as the KL divergence, since the data-generating distribution does not have an explicit density form (not even up to a normalizing constant). This is exactly one important reason why Stein bridging is proposed: it requires only samples from the data distribution and the log-density of the explicit model, without knowledge of the normalizing constant that MCMC or other methods would otherwise have to estimate.

In our implementation, we parametrize the generator in the implicit model and the density estimator in the explicit model as G_θ(z) and p_φ(x), respectively. The Wasserstein term in (3) is implemented using its equivalent dual representation in (2) with a parametrized critic D_ψ(x). The two Stein terms in (3) can be implemented using (1) with either a Stein critic (parametrized as a neural network, i.e., f_w(x)) or the non-parametric Kernel Stein Discrepancy. Our implementation iteratively updates the explicit and implicit models. Details of model specifications and optimization are in Appendix E.2. We also compare with related works that attempt to combine the best of both worlds (such as energy-based GANs, contrastive learning and cooperative learning) in Appendix A.3.
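To make objective (3) concrete, the following is a minimal NumPy sketch of its three terms for one-dimensional toy data: the empirical Wasserstein-1 distance (computed by sorting, which is exact for equal-size 1-D samples) and a V-statistic estimate of the kernel Stein discrepancy with an RBF kernel. The function names, bandwidth, and the choice of taking the square root of the KSD estimate are our own illustrative assumptions; the paper's neural parametrizations are not reproduced here.

```python
import numpy as np

def w1_1d(x, y):
    """Empirical Wasserstein-1 between two equal-size 1-D samples.
    In 1-D the optimal coupling simply matches sorted samples."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def ksd2_1d(x, score, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy in 1-D
    with RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2)).
    `score` returns d/dx log q(x); for an energy model q ∝ exp(-E),
    score(x) = -E'(x), so no normalizing constant is needed."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    s = score(x)
    # Stein kernel: s(x)s(y)k + (s(x)-s(y))(x-y)k/h^2 + (1/h^2 - (x-y)^2/h^4)k
    u = (np.outer(s, s) * k
         + (s[:, None] - s[None, :]) * d * k / h**2
         + (1.0 / h**2 - d**2 / h**4) * k)
    return float(max(u.mean(), 0.0))

def stein_bridging_loss(x_real, x_gen, score_E, lam1=1.0, lam2=1.0):
    """Objective (3): W(P_real, P_G) + lam1*S(P_real, P_E) + lam2*S(P_G, P_E)."""
    return (w1_1d(x_real, x_gen)
            + lam1 * np.sqrt(ksd2_1d(x_real, score_E))
            + lam2 * np.sqrt(ksd2_1d(x_gen, score_E)))
```

With x_real drawn from N(0, 1), a matched generator and energy model (score(x) = -x) yield a near-zero loss, while a shifted generator and a mis-specified energy inflate all three terms.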

4. THEORETICAL ANALYSIS

In this section, we theoretically show that the Stein bridge allows the two models to facilitate each other's training by imposing certain regularizations on both the implicit and the explicit models, as well as stabilizing the training dynamics.

4.1. REGULARIZATION VIA STEIN BRIDGE

We first show the regularization effect of the Stein bridge on the Wasserstein critic. Define the kernel Sobolev dual norm as

‖D‖_{H^{-1}(P;k)} := sup_{u∈C_0^∞} { ⟨D, u⟩_{L^2(P)} : E_{x,x'∼P}[∇u(x)^⊤ k(x, x') ∇u(x')] ≤ 1, E_P[u] = 0 },

which can be viewed as a kernel generalization of the Sobolev dual norm defined in Section 2; it reduces to the Sobolev dual norm when k(x, x') = I(x = x') and P is the Lebesgue measure.

Theorem 1. Assume that {P_G}_G exhausts all continuous probability distributions and S is chosen as the kernel Stein discrepancy. Then problem (3) is equivalent to

min_E max_D E_{y∼P_E}[D(y)] - E_{x∼P_real}[D(x)] - (1/(4λ_2)) ‖D‖²_{H^{-1}(P_E;k)} + λ_1 S(P_real, P_E).

The kernel Sobolev norm regularization penalizes large variation of the Wasserstein critic D. In particular, observe that (Villani, 2008), if k(x, x') = I(x = x') and E_{P_E}[D] = 0, then ‖D‖_{H^{-1}(P_E;k)} = lim_{ε→0} W_2((1 + εD)P_E, P_E)/ε, where W_2 denotes the 2-Wasserstein metric. Hence, the Sobolev dual norm regularization ensures that D does not change abruptly on high-density regions of P_E, and thus reinforces the learning of the Wasserstein critic. The Stein bridge penalizes large variation of the Wasserstein critic, in the same spirit as, but in a different form from, gradient-based penalties (e.g., (Gulrajani et al., 2017; Roth et al., 2017)). It prevents the Wasserstein critic from becoming too strong during training and thus encourages mode exploration by the sample generator. To illustrate this, we conduct a case study where we train a generator over data sampled from a mixture of Gaussians (μ_1 = [-1, -1], μ_2 = [1, 1] and Σ = 0.2I); the results are shown in Fig. 2.

4.2. STABILIZING THE TRAINING DYNAMICS

In this subsection, we further show that Stein Bridging can help stabilize the adversarial training between the generator and the Wasserstein critic, with a local convergence guarantee. As is known, training the minimax game in GANs is difficult.
When using traditional gradient methods, the training suffers from oscillatory behaviors (Goodfellow, 2017; Liang & Stokes, 2019; Zhang & Yu, 2020). In order to better understand the optimization behaviors, we first compare the behaviors of WGAN, likelihood- and entropy-regularized WGAN, and our Stein Bridging under SGD via an easy-to-comprehend toy example in the one-dimensional case. Fig. 3 shows numerical results that compare the optimization behaviors of the above methods. As we can see, Stein Bridging achieves good convergence to the optimum point, while WGAN oscillates instead of converging. Entropy regularization (ER) can encourage the generator to seek more modes but makes the model diverge in this case. By contrast, likelihood regularization (LR) can help with training stability but shifts the converging point to a biased distribution. A recently proposed variational annealing strategy (VA) (Tao et al., 2019) for regularized GANs introduces a trade-off between convergence and unbiasedness. The detailed discussions and proofs are presented in Appendix D.1. We also generalize the convergence results to the multi-dimensional bilinear system F(ψ, θ) = θ^⊤Aψ + b^⊤θ + c^⊤ψ in Appendix D.2. Our theoretical results indicate that Stein Bridging can stabilize the minimax training of GANs without changing the optimum. In the experiments, we empirically validate our analysis.
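The oscillation of simultaneous gradient descent-ascent on a bilinear game, and the effect of an added stabilizing penalty, can be reproduced in a few lines. This is an illustrative simulation only: the damping coefficient `gamma` below is a generic stand-in for a regularization effect, not the exact dynamics induced by the Stein bridge.

```python
import numpy as np

def simulate(eta=0.1, gamma=0.0, steps=200):
    """Simultaneous gradient descent-ascent on the bilinear game
    min_theta max_psi f(theta, psi) = theta * psi, with an optional
    damping term gamma pulling both players toward the origin
    (the unique equilibrium theta = psi = 0)."""
    theta, psi = 1.0, 1.0
    for _ in range(steps):
        g_theta = psi + gamma * theta    # gradient wrt theta, plus damping
        g_psi = theta - gamma * psi      # gradient wrt psi, minus damping
        theta, psi = theta - eta * g_theta, psi + eta * g_psi
    return np.hypot(theta, psi)          # distance to the equilibrium

plain = simulate(gamma=0.0)   # pure GDA: the iterates spiral outward
damped = simulate(gamma=0.5)  # damped dynamics: converge to the optimum
```

Per step, pure GDA multiplies the distance to the equilibrium by sqrt(1 + eta^2) > 1, whereas the damped update contracts it, which matches the oscillation-versus-convergence contrast described above.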

5. EXPERIMENTS

In this section, we conduct experiments to verify the effectiveness of the proposed method from multifaceted views. We consider two synthetic datasets with mixtures of Gaussian distributions: Two-Circle and Two-Spiral. The first is composed of 24 Gaussian mixture components that lie on two circles. This dataset extends the 8-Gaussian-mixture scenario widely used in previous papers, so that we can use it to test the quality of generated samples and the mode coverage of the learned energy. The second dataset consists of 100 Gaussian mixture components whose centers are densely arranged on two centrally symmetric spiral-shaped curves. This dataset can be used to examine the power of generative models on complicated data distributions. The ground-truth distributions and samples are shown in Fig. 4(a) and Fig. 5(a). Furthermore, we also apply the method to the MNIST and CIFAR datasets, which require the model to deal with high-dimensional image data. In each dataset, we use observed samples as input of the model and leverage them for training. The details for each dataset are reported in Appendix E.1. In this section, we term the model Joint-W if using the Wasserstein metric in (3) and Joint-JS if using the JS divergence. We consider several competitors. For implicit generative models, we consider the counterparts without joint training with the energy model, which are equivalent to vanilla GAN and WGAN with gradient penalty (Gulrajani et al., 2017), as an ablation study. Also, as a comparison to the new regularization effects of Stein Bridging, we consider a recently proposed variational annealing regularization (Tao et al., 2019) for GANs (abbreviated GAN+VA/WGAN+VA). We employ a denoising auto-encoder to estimate the gradient for the regularization penalty, as proposed by (Alain & Bengio, 2014).
For explicit models, we also consider the counterparts without joint training with the generator model, i.e., directly training a Deep Energy Model (DEM) using the Stein discrepancy (Grathwohl et al., 2020). Besides, we compare with energy-calibrated GAN (EGAN) (Dai et al., 2017) and the Deep Directed Generative Model (DGM) (Kim & Bengio, 2017), which adopt contrastive divergence to train a sample generator together with an energy estimator. See Appendix A for a brief introduction to these methods and Appendix E.3 for implementation details.
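For reference, a Two-Circle-like dataset can be constructed as below. The radii, the 8/16 split of the 24 components across the two circles, and the component standard deviation are illustrative assumptions on our part; the actual settings are in Appendix E.1.

```python
import numpy as np

def two_circle_data(n_per_mode=50, std=0.05, seed=0):
    """Samples from a mixture of 24 Gaussians whose means lie on two
    concentric circles (8 on the inner, 16 on the outer -- an assumed split)."""
    rng = np.random.default_rng(seed)
    centers = []
    for radius, n_modes in [(1.0, 8), (2.0, 16)]:
        angles = 2 * np.pi * np.arange(n_modes) / n_modes
        centers += [(radius * np.cos(a), radius * np.sin(a)) for a in angles]
    centers = np.array(centers)          # (24, 2) mode centers
    idx = rng.integers(len(centers), size=n_per_mode * len(centers))
    # add isotropic Gaussian noise around each sampled center
    return centers[idx] + std * rng.normal(size=(len(idx), 2)), centers

X, centers = two_circle_data()
```

The returned `centers` array also serves the evaluation side: high-quality sample rates can be computed by thresholding each generated point's distance to its nearest mode center.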

5.1. DENSITY ESTIMATION OF EXPLICIT MODEL

Mode Coverage for Complicated Distributions. One advantage of joint learning is that the generator can help the density estimator capture a more accurate distribution. As shown in the Two-Circle case in Fig. 5, both Joint-JS and Joint-W manage to capture all Gaussian components while other methods miss some of the modes. In the Two-Spiral case in Fig. 4, Joint-JS and Joint-W closely fit the ground-truth distribution. Nevertheless, DEM misses one spiral while EGAN degrades to a uniform-like distribution. DGM manages to fit two spirals but allocates high densities to regions that have low densities in the ground-truth distribution. As a quantitative comparison, we study three evaluation metrics: KL & JS divergence and Area Under the Curve (AUC). The detailed information and results are given in Appendix E.4 and Table 5, respectively. The values show that Joint-W and Joint-JS provide better density estimation than all competitors by a large margin.

Density Rankings for High-Dimensional Digits. We also rank generated digits (and true digits) on MNIST w.r.t. the densities given by the energy model in Fig. 11, Fig. 12 and Fig. 13. As depicted in the figures, the digits with high densities (or low densities) given by Joint-JS possess sufficient diversity (the thickness, inclination angles, and shapes of the digits all vary). By contrast, all the digits with high densities given by DGM tend to be thin and the digits with low densities are very thick. Also, for EGAN, digits with high (or low) densities appear to have the same inclination angle (for high densities, '1' stays straight and '9' 'leans' to the left, while for low densities it is just the opposite), which indicates that DGM and EGAN tend to allocate high (or low) densities to data with certain modes and miss some modes that possess high densities in the ground-truth distribution. By contrast, our method manages to capture these complicated features of the data distribution.

Detection for Out-of-distribution Samples.
We further study model performance on detection of out-of-distribution samples. We consider CIFAR-10 images as positive samples and construct negative samples by (I) flipping images, (II) adding random noise, (III) overlaying two images and (IV) using images from the LSUN dataset, respectively. A good density model trained on CIFAR-10 is expected to assign high densities to positive samples and low densities to negative samples, with the exception of case (I) (flipped images are not truly negative samples, so the model should still assign them high densities). We use the density values to rank samples and calculate the AUC of false positive rate vs. true positive rate, reported in Table 2. Our model Joint-W manages to distinguish samples for (II), (III), (IV) and is not fooled by flipped images, while DEM and EGAN fail to detect out-of-distribution samples and DGM recognizes flipped images as negative samples.

Calibrating the explicit (unnormalized) density model with the implicit generator is expected to improve the quality of generated samples. In Fig. 5 we show the results of different generators on the Two-Circle and Two-Spiral datasets. In Two-Circle, a large number of generated samples given by GAN, WGAN-GP and DGM are located between two Gaussian components, and the boundary of each component is not distinguishable. Since the ground-truth densities of the regions between two components are very low, such generated samples are of low quality, which indicates that these models capture combinations of two dominant features (i.e., modes) in the data, but such combinations make no sense in practice. By contrast, Joint-JS and Joint-W alleviate this issue, reduce the number of low-quality samples and produce more distinguishable boundaries. In Two-Spiral, similarly, the generated samples given by GAN and WGAN-GP form a circle instead of two spirals, while the samples of DGM 'link' the two spirals. Joint-JS manages to focus more on true high-density regions compared to GAN, and Joint-W provides the best results.
To quantitatively measure the sample quality, we adopt Maximum Mean Discrepancy (MMD) and High-quality Sample Rate (HSR). The details are in Appendix E.4 and we report results in Table 5, where our models outperform the competitors by a large margin.
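MMD can be estimated from the two sample sets alone; a minimal unbiased (U-statistic) estimator with an RBF kernel is sketched below. The bandwidth is an assumption on our part — the paper's exact evaluation protocol is in Appendix E.4.

```python
import numpy as np

def mmd2_unbiased(X, Y, h=1.0):
    """Unbiased estimate of squared MMD between samples X (n, d) and Y (m, d)
    with RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    def gram(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * h ** 2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    # drop diagonal terms so the within-sample averages are unbiased
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())
```

The estimate is near zero (it can be slightly negative, being unbiased) when the two sample sets come from the same distribution, and grows as the distributions separate.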

5.2. SAMPLE QUALITY OF IMPLICIT MODEL

Sample Quality for Generated Images. We calculate the Inception Score (IS) and Fréchet Inception Distance (FID) to measure the sample quality on CIFAR-10. As shown in Table 1, Joint-W outperforms the other competitors by 0.2 and achieves a 5.6% improvement over WGAN-GP w.r.t. IS. As for FID, Joint-W slightly outperforms WGAN-GP and beats energy-based GAN and variational-annealing-regularized WGAN by a large margin. One possible reason is that both of these methods rely on entropy regularization, which encourages diversity of generated samples but has a negative effect on sample quality. Stein Bridging overcomes this issue via joint training with the explicit model. The performance of DGM is much worse than the others; in practice, DGM is hard to converge and suffers from severe instability in training.

Model Performance with Contaminated or Limited Data. As further discussion, we highlight that Stein Bridging has promising power in some extreme cases where the training samples are contaminated or limited. We consider a noised-data scenario and randomly add n noise points sampled from a Gaussian distribution N(0, σ_0 I) with σ_0 = 2 to the original true samples in the Two-Circle dataset. The results on the noised dataset are presented in Fig. 8(a), where we set the number of noise points n = [40, 100, 160, 300, 400, 600, 800, 1000] and report the HSRs of Joint-W and WGAN-GP. The noise in the data impacts the performance of both WGAN-GP and Joint-W, but comparatively, the performance decline of Joint-W is less significant than that of WGAN-GP, which indicates better robustness of joint training to noised data. We also consider a limited-data scenario where we vary the training sample size n = [100, 200, 300, 500, 700, 1000, 2000] in the Two-Spiral dataset and report the AUC of Joint-W and DEM. When the sample size decreases from 2000 to 100, the AUC value of DEM declines dramatically, showing its dependence on sufficient training samples.
By contrast, the AUC of Joint-W exhibits only a small decline when the sample size is more than 500 and suffers an obvious decline only when it falls below 300. This phenomenon demonstrates its lower sensitivity to data size.
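The FID reported above is the Fréchet distance between Gaussians fitted to Inception features of real and generated images: ‖μ_1 - μ_2‖² + Tr(Σ_1 + Σ_2 - 2(Σ_1Σ_2)^{1/2}). The distance itself can be sketched in plain NumPy (the symmetrized matrix square root below is a standard trick; the Inception feature-extraction step is omitted).

```python
import numpy as np

def _sqrtm_psd(A):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between Gaussians, the formula underlying FID:
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2})."""
    s2h = _sqrtm_psd(cov2)
    # Tr((cov1 cov2)^{1/2}) computed on the symmetrized product
    cross = _sqrtm_psd(s2h @ cov1 @ s2h)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2.0 * cross))
```

For two standard Gaussians with means shifted by a unit vector, the distance is exactly 1, which gives a quick sanity check on any FID implementation.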

5.3. ENHANCING THE STABILITY OF GAN

Joint training also helps to stabilize the training dynamics. In Fig. 6 we present the learning curves of Joint-W (resp. Joint-JS) compared with WGAN (resp. GAN) and likelihood- and entropy-regularized WGAN (resp. GAN). The curves show that joint training reduces the variance of metric values, especially during the second half of training. Furthermore, we visualize generated digits given by the same noise z in adjacent epochs in Fig. 7. The results show that Joint-W gives more stable generation across adjacent epochs, while the generated samples given by WGAN-GP and WGAN+VA exhibit obvious variation. In particular, some digits generated by WGAN-GP and WGAN+VA change from one class to another, which is quite similar to the oscillation without convergence discussed in Section 4. To quantitatively evaluate the bias in model distributions, we calculate distances between the means of 50000 generated digits (resp. images) and 50000 true digits (resp. images) in MNIST (resp. CIFAR-10). The results are reported in Table 4. We can see that the model distributions of the other competitors are more biased from the true data distribution than that of Joint-W.

6. CONCLUSIONS

In this paper, we unite the training of an implicit generative model (represented by GAN or WGAN) and an explicit generative model (represented by a deep energy-based model) via a bridging term of Stein discrepancy between the generator and the energy-based density estimator. Theoretically, we show that joint training i) enforces dual regularization effects on both models and thus encourages mode exploration, and ii) helps to facilitate the convergence of the minimax training dynamics. We also conduct extensive experiments on different tasks and applications to verify our theoretical findings and to demonstrate the superiority of our method over training generator models or energy-based models alone. Our formulation is flexible in handling various implicit or explicit models. As such, for future work, one can try other generative models, such as VAEs or flow-based models, as replacements for our GAN and energy-based models. It would also be interesting to exploit our formulation in the context of few-shot learning for generative models.

A LITERATURE REVIEW

We discuss related literature and shed light on the relationship between our work and others.

A.1 EXPLICIT GENERATIVE MODELS

Explicit generative models fit each instance with a scalar (unnormalized) density that is expected to explicitly capture the distribution behind the data. Such densities are often defined up to a constant and are called energy functions, which are common in undirected graphical models (LeCun et al., 2006). Hence, explicit generative models are also termed energy-based models. An early version of energy-based models is the FRAME (Filters, Random field, And Maximum Entropy) model (Zhu et al., 1997; Wu et al., 2000). Later on, some works leverage deep neural networks to model the energy function (Ngiam et al., 2011; Xie et al., 2016b) and pave the way for research on deep energy models (DEM) (e.g., (Liu & Wang, 2017; Kim & Bengio, 2017; Zhai et al., 2016; Haarnoja et al., 2017; Du & Mordatch, 2019; Nijkamp et al., 2019)). Apart from DEM, there are also other forms of deep explicit models based on restricted Boltzmann machines, like deep belief networks (Hinton et al., 2006) and deep Boltzmann machines (Salakhutdinov & Hinton, 2009). The normalizing constant of the energy function requires an intractable integral over all possible instances, which makes the model hard to learn via Maximum Likelihood Estimation (MLE). To solve this issue, some works propose to approximate the constant by MCMC methods (Geman & Geman, 1984; Neal, 2011). However, MCMC requires inner-loop sampling in each training step, which induces high computational costs. Another solution is to optimize an alternative surrogate loss function.
For example, contrastive divergence (CD) (Liu & Wang, 2017) is proposed to measure how much the KL divergence can be improved by running a small number of Markov chain steps towards the intractable likelihood, while score matching (SM) (Hyvärinen, 2005) detours the constant by minimizing the distance between gradients of log-likelihoods. A recent study (Grathwohl et al., 2020) uses the Stein discrepancy to train unnormalized models; the Stein discrepancy does not require the normalizing constant and makes the training tractable. Moreover, the intractable normalizing constant makes the model hard to sample from. To obtain accurate samples from unnormalized densities, many studies propose to approximate the generation by diffusion-based processes, like generative flow (Nguyen et al., 2017) and variational gradient descent (Liu & Wang, 2016). Also, a recent work (Hu et al., 2018) leverages the Stein discrepancy to design a neural sampler for unnormalized densities. The fundamental disadvantage of explicit models is that energy-based learning has difficulty accurately capturing the distribution of true samples due to the low-dimensional manifold structure of real-world instances (Liu & Wang, 2017).
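As a concrete instance of how score matching sidesteps the normalizing constant: for a unit-variance Gaussian model q_μ with score s(x) = -(x - μ), Hyvärinen's objective E[½ s(x)² + s'(x)] can be evaluated without ever computing Z, and is minimized at the sample mean. The 1-D Gaussian model below is our illustrative choice, not an example from the paper.

```python
import numpy as np

def score_matching_loss(x, mu):
    """Hyvarinen score-matching objective E[0.5 * s(x)^2 + s'(x)] for a
    unit-variance Gaussian model with score s(x) = -(x - mu), s'(x) = -1.
    No normalizing constant appears anywhere."""
    s = -(x - mu)
    return float(np.mean(0.5 * s ** 2 - 1.0))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=2000)
# grid search over mu: the loss is 0.5*E[(x - mu)^2] - 1, convex in mu
mus = np.linspace(0.0, 3.0, 301)
best_mu = mus[np.argmin([score_matching_loss(x, m) for m in mus])]
```

Because the objective here reduces to 0.5·E[(x - μ)²] - 1, its minimizer coincides with the sample mean, recovering the MLE without the partition function.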

A.2 IMPLICIT GENERATIVE MODELS

Implicit generative models focus on a generation mapping from random noise to generated samples. This mapping function is often called the generator and possesses better flexibility than explicit models. Two typical implicit models are the Variational Auto-Encoder (VAE) (Kingma & Welling, 2014) and Generative Adversarial Networks (GAN) (Goodfellow et al., 2014). VAE introduces a latent variable and attempts to maximize a variational lower bound on the likelihood of the joint distribution of the latent and observable variables, while GAN targets an adversarial game between the generator and a discriminator (or critic in WGAN) that aims to discriminate the generated and true samples. In this paper, we focus on GAN and its variants (e.g., WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), DCGAN (Radford et al., 2016), etc.) as the implicit generative model and leave the discussion of VAE for future work. Two important issues concerning GAN and its variants are instability of training and local optima. The typical local optima for GAN can be divided into two categories: mode-collapse (the model fails to capture all the modes in the data) and mode-redundance (the model generates modes that do not exist in the data). Recently there have been many attempts to solve these issues from various perspectives. One perspective is regularization. Two typical regularization methods are likelihood-based and entropy-based regularization, with prominent examples (Warde-Farley & Bengio, 2017) and (Li & Turner, 2018), which respectively leverage denoising feature matching and implicit gradient approximation to enforce the regularization constraints.
The likelihood and entropy regularizations respectively help the generator to focus on the data distribution and encourage more diverse samples. A recent work (Tao et al., 2019) uses Langevin dynamics to show that i) the entropy and likelihood regularizations are mathematically equivalent up to an opposite sign, and ii) both regularizations make the model converge to a surrogate point that is biased from the original data distribution. (Tao et al., 2019) then proposes a variational annealing strategy to empirically unite the two regularizations and tackle the biased distributions. To deal with the instability issue, there is also recent literature that takes an optimization perspective and proposes different algorithms to address the non-convergence of minimax game optimization (for instance, (Gemp & Mahadevan, 2018; Liang & Stokes, 2019; Gidel et al., 2019)). Moreover, the disadvantage of implicit models is the lack of explicit densities over instances, which prevents the black-box generator from characterizing the distribution behind the data.

A.3 ATTEMPTS TO COMBINE BOTH OF THE WORLDS

Recently, several studies have attempted to combine explicit and implicit generative models in different ways. For instance, (Zhao et al., 2017) proposes energy-based GAN, which leverages an energy model as the discriminator to distinguish generated from true samples. A similar idea is used by (Kim & Bengio, 2017) and (Dai et al., 2017), which let the discriminator estimate a scalar energy value for each sample. Such a discriminator is optimized to assign high energy to generated samples and low energy to true samples, while the generator aims at generating samples with low energy. The fundamental difference is that (Zhao et al., 2017) and (Dai et al., 2017) both aim at minimizing the discrepancy between the distributions of generated and true samples, while the motivation of (Kim & Bengio, 2017) is to minimize the KL divergence between the estimated density and the data distribution. (Kim & Bengio, 2017) adopts contrastive divergence (CD) to link MLE for the energy model over true data with the adversarial training of energy-based GAN. However, both the CD-based method and energy-based GAN have limited power for the generator and discriminator. First, if the generated samples resemble true samples, then the gradients for the discriminator given by true and generated samples are exactly opposite and counteract each other, so training stops before the discriminator captures an accurate data distribution. Second, since the objective boils down to minimizing the KL divergence (for (Kim & Bengio, 2017)) or Wasserstein distance (for (Dai et al., 2017)) between the model and true distributions, the issues concerning GAN (or WGAN), like training instability and mode-collapse, would also bother these methods. Another way of combination is cooperative training. (Xie et al., 2016a) (and its improved version (Xie et al., 2018)) leverages the samples of the generator as the MCMC initialization for an energy-based model.
The synthesized samples produced by finite-step MCMC are closer to the energy model, and the generator is optimized so that the finite-step MCMC revises its initial samples. Also, a recent work (Du et al., 2018) proposes to regard the explicit model as a teacher net that guides the training of the implicit generator as a student net, producing samples that overcome the mode-collapse issue. The main drawback of cooperative training is that it indirectly optimizes the discrepancy between the generator and the data distribution via the energy model as a 'mediator', which means that once the energy model gets stuck in a local optimum (e.g., mode-collapse or mode-redundance), the training of the generator is affected. In other words, the training of the two models would constrain rather than exactly compensate each other. Different from existing methods, our model considers three discrepancies simultaneously as a triangle to jointly train the generator and the estimator, enabling them to compensate and reinforce each other.

B BACKGROUND FOR STEIN DISCREPANCY

Assume $q(x)$ to be a continuously differentiable density supported on $\mathcal{X} \subset \mathbb{R}^d$ and $f: \mathbb{R}^d \to \mathbb{R}^{d'}$ a smooth vector-valued function. Define $\mathcal{A}_q[f(x)] = \nabla_x \log q(x)^\top f(x) + \nabla_x \cdot f(x)$ as the Stein operator. If $f$ is in a Stein class (satisfying some mild boundary conditions), then we have the following Stein identity property:

$$\mathbb{E}_{x\sim q}[\mathcal{A}_q[f(x)]] = \mathbb{E}_{x\sim q}[\nabla_x \log q(x)^\top f(x) + \nabla_x \cdot f(x)] = 0.$$

This property induces the Stein discrepancy between distributions $\mathbb{P}: p(x)$ and $\mathbb{Q}: q(x)$, $x \in \mathcal{X}$:

$$S(\mathbb{Q}, \mathbb{P}) = \sup_{f\in\mathcal{F}} \Gamma\big(\mathbb{E}_{x\sim q}[\mathcal{A}_p[f(x)]]\big) = \sup_{f\in\mathcal{F}} \Gamma\big(\mathbb{E}_{x\sim q}[\nabla_x \log p(x)^\top f(x) + \nabla_x \cdot f(x)]\big), \quad (1)$$

where $f$ is what we call the Stein critic, which ranges over the function space $\mathcal{F}$; if $\mathcal{F}$ is large enough, then $S(\mathbb{Q}, \mathbb{P}) = 0$ if and only if $\mathbb{Q} = \mathbb{P}$. Note that in (1) we do not need the normalizing constant of $p(x)$, which enables the Stein discrepancy to deal with unnormalized densities. If $\mathcal{F}$ is the unit ball of a Reproducing Kernel Hilbert Space (RKHS) with a positive definite kernel function $k(\cdot,\cdot)$, then the supremum in (1) has a closed form (see (Liu et al., 2016; Chwialkowski et al., 2016; Oates et al., 2017) for more details):

$$S_K(\mathbb{Q}, \mathbb{P}) = \mathbb{E}_{x,x'\sim q}[u_p(x, x')], \quad (5)$$

where

$$u_p(x,x') = \nabla_x \log p(x)^\top k(x,x') \nabla_{x'} \log p(x') + \nabla_x \log p(x)^\top \nabla_{x'} k(x,x') + \nabla_x k(x,x')^\top \nabla_{x'} \log p(x') + \mathrm{tr}(\nabla_{x,x'} k(x,x')).$$

Equation (5) gives the Kernel Stein Discrepancy (KSD).

C PROOFS OF RESULTS IN SECTION 4.1

C.1 PROOF OF THEOREM 1

Proof. Applying Kantorovich's duality on $W(\mathbb{P}_G, \mathbb{P}_{real})$ and using the exhaustiveness assumption on the generator, we rewrite the problem as

$$\min_{E,\mathbb{P}} \max_{D} \{\mathbb{E}_{\mathbb{P}}[D] - \mathbb{E}_{\mathbb{P}_{real}}[D] + \lambda_1 S(\mathbb{P}_{real}, \mathbb{P}_E) + \lambda_2 S(\mathbb{P}, \mathbb{P}_E)\}, \quad (6)$$

where the minimization with respect to $E$ is over all energy functions, the minimization with respect to $\mathbb{P}$ is over all probability distributions with continuous density, and the maximization with respect to $D$ is over all 1-Lipschitz continuous functions. Recall the definition of the kernel Stein discrepancy $S(\mathbb{P}, \mathbb{P}_E) = \mathbb{E}_{x,x'\sim\mathbb{P}}[(\nabla_x \log d\mathbb{P}/d\mathbb{P}_E(x))^\top k(x,x') \nabla_{x'} \log d\mathbb{P}/d\mathbb{P}_E(x')]$, where $d\mathbb{P}/d\mathbb{P}_E$ is the Radon-Nikodym derivative. Observe that $S(\mathbb{P}, \mathbb{P}_E)$ is infinite if $\mathbb{P}$ is not absolutely continuous with respect to $\mathbb{P}_E$.
Hence, to minimize the objective of (6), it suffices to consider those $\mathbb{P}$'s that are absolutely continuous with respect to $\mathbb{P}_E$. Introducing a variable replacement $h(x) := d\mathbb{P}/d\mathbb{P}_E(x) - 1$, problem (6) becomes

$$\min_{E,h} \max_{D} \Big\{ \mathbb{E}_{\mathbb{P}_E}[(1+h)D] - \mathbb{E}_{\mathbb{P}_{real}}[D] + \lambda_1 S(\mathbb{P}_{real}, \mathbb{P}_E) + \lambda_2 \cdot \mathbb{E}_{x,x'\sim\mathbb{P}}[\nabla_x \log(1+h(x))^\top k(x,x') \nabla_{x'} \log(1+h(x'))] \Big\}, \quad (7)$$

where the minimization with respect to $h$ is over all $L^1(\mathbb{P}_E)$ functions with $\mathbb{P}_E$-expectation zero. Fixing $E$, we claim that we can swap $\min_h$ and $\max_D$. Indeed, without loss of generality, we can restrict $D$ to be such that $D(x_0) = 0$ for some element $x_0$, as a constant shift does not change the value of $\mathbb{E}_{\mathbb{P}_E}[(1+h)D] - \mathbb{E}_{\mathbb{P}_{real}}[D]$. The set of Lipschitz functions that vanish at $x_0$ is a Banach space, and the set of 1-Lipschitz functions is compact (Weaver, 1999). Moreover, $L^1(\mathbb{P}_E)$ is also a Banach space and the objective function is linear in both $h$ and $D$. This verifies the conditions of Sion's minimax theorem, and thus the claim is proved. Swapping $\min_h$ and $\max_D$ in (7) and fixing $E$ and $D$, we consider

$$\begin{aligned} &\min_{h:\,\mathbb{E}_{\mathbb{P}_E}[h]=0} \Big\{\mathbb{E}_{\mathbb{P}_E}[hD] + \lambda_2 \cdot \mathbb{E}_{x,x'\sim\mathbb{P}}[\nabla_x \log(1+h(x))^\top k(x,x') \nabla_{x'} \log(1+h(x'))]\Big\} \\ =\; &\min_{h:\,\mathbb{E}_{\mathbb{P}_E}[h]=0} \mathbb{E}_{\mathbb{P}_E}[hD] + \lambda_2 \cdot \mathbb{E}_{x,x'\sim\mathbb{P}}\bigg[\Big(\frac{\nabla_x h(x)}{1+h(x)}\Big)^\top k(x,x') \frac{\nabla_{x'} h(x')}{1+h(x')}\bigg] \\ =\; &\min_{h:\,\mathbb{E}_{\mathbb{P}_E}[h]=0} \mathbb{E}_{\mathbb{P}_E}[hD] + \lambda_2 \cdot \mathbb{E}_{x,x'\sim\mathbb{P}_E}\big[\nabla_x h(x)^\top k(x,x') \nabla_{x'} h(x')\big], \end{aligned}$$

where the first equality follows from the chain rule of the derivative, and the second equality follows from a change of measure $d\mathbb{P} = (1+h)\,d\mathbb{P}_E$.
Introducing an auxiliary variable $r$ so that $r^2$ is an upper bound of $\mathbb{E}_{x,x'\sim\mathbb{P}_E}[\nabla_x h(x)^\top k(x,x') \nabla_{x'} h(x')]$, we have that

$$\begin{aligned} &\min_{h:\,\mathbb{E}_{\mathbb{P}_E}[h]=0} \mathbb{E}_{\mathbb{P}_E}[hD] + \lambda_2 \cdot \mathbb{E}_{x,x'\sim\mathbb{P}_E}\big[\nabla_x h(x)^\top k(x,x') \nabla_{x'} h(x')\big] \\ =\; &\min_{r\ge 0}\; \min_{h:\,\mathbb{E}_{\mathbb{P}_E}[h]=0} \Big\{\mathbb{E}_{\mathbb{P}_E}[hD] + \lambda_2 r^2 \;:\; \mathbb{E}_{x,x'\sim\mathbb{P}_E}\big[\nabla_x h(x)^\top k(x,x') \nabla_{x'} h(x')\big] \le r^2\Big\} \\ =\; &\min_{r\ge 0}\; \min_{h:\,\mathbb{E}_{\mathbb{P}_E}[h]=0} \Big\{r\,\mathbb{E}_{\mathbb{P}_E}[hD] + \lambda_2 r^2 \;:\; \mathbb{E}_{x,x'\sim\mathbb{P}_E}\big[\nabla_x h(x)^\top k(x,x') \nabla_{x'} h(x')\big] \le 1\Big\} \\ =\; &\min_{r\ge 0} \big\{\lambda_2 r^2 - r\,\|D\|_{H^{-1}(\mathbb{P}_E;k)}\big\} \;=\; -\frac{1}{4\lambda_2}\|D\|^2_{H^{-1}(\mathbb{P}_E;k)}, \end{aligned}$$

where the first equality holds because $\mathbb{E}_{x,x'\sim\mathbb{P}_E}[\nabla_x (rh)(x)^\top k(x,x') \nabla_{x'} (rh)(x')] = r^2\,\mathbb{E}_{x,x'\sim\mathbb{P}_E}[\nabla_x h(x)^\top k(x,x') \nabla_{x'} h(x')]$ for all $r \ge 0$, introducing the auxiliary variable $r^2 = \mathbb{E}_{x,x'\sim\mathbb{P}_E}[\nabla_x h(x)^\top k(x,x') \nabla_{x'} h(x')]$; the second equality follows from a change of variable from $h$ to $rh$; and the third equality follows from the definition of the kernel Sobolev dual norm. Plugging back into (7) yields the desired result.
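The closed-form KSD in (5) is straightforward to estimate from samples. Below is a minimal one-dimensional sketch (not the paper's code) with an RBF kernel, using an unnormalized standard Gaussian as $\mathbb{P}$; the bandwidth, sample sizes and test distributions are illustrative choices:

```python
import numpy as np

def ksd_rbf_1d(samples, score_p, h=1.0):
    """V-statistic estimate of the kernel Stein discrepancy S_K(Q, P) in one
    dimension with RBF kernel k(x, x') = exp(-(x - x')^2 / (2 h^2)).
    `samples` are draws from Q; `score_p(x)` is d/dx log p(x) of the possibly
    unnormalized density p -- the normalizing constant never enters."""
    x, y = samples[:, None], samples[None, :]
    diff = x - y
    K = np.exp(-diff**2 / (2 * h**2))
    dK_dx = -diff / h**2 * K                    # d k / d x
    dK_dy = diff / h**2 * K                     # d k / d x'
    d2K = (1.0 / h**2 - diff**2 / h**4) * K     # d^2 k / (d x d x')
    sx, sy = score_p(samples)[:, None], score_p(samples)[None, :]
    U = sx * K * sy + sx * dK_dy + dK_dx * sy + d2K   # u_p(x, x') from Eq. (5)
    return U.mean()

rng = np.random.default_rng(0)
score = lambda x: -x  # score of an unnormalized standard Gaussian
ksd_match = ksd_rbf_1d(rng.normal(0.0, 1.0, 2000), score)  # Q = P: near zero
ksd_shift = ksd_rbf_1d(rng.normal(2.0, 1.0, 2000), score)  # Q != P: clearly positive
```

Since only the score of $p$ is needed, the estimator applies directly to the unnormalized densities the paper works with.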

C.2 PROOF FOR THEOREM 2

Proof. Applying the definition of the Stein discrepancy to $S(\mathbb{P}_E, \mathbb{P}_G)$ and under the exhaustiveness assumption on $G$, we rewrite the problem as

$$\min_{E,\mathbb{P}} \max_{f} \{\lambda_1 S(\mathbb{P}_{real}, \mathbb{P}_E) + \lambda_2 \mathbb{E}_{y\sim\mathbb{P}}[\mathcal{A}_{\mathbb{P}_E} f(y)] + W(\mathbb{P}_{real}, \mathbb{P})\},$$

where the minimization with respect to $E$ is over the set of all energy functions; the minimization with respect to $\mathbb{P}$ is over all continuous distributions; and the maximization with respect to $f$ is over the Stein class of $\mathbb{P}_E$. Let us fix $E$. Using a similar argument as in the proof of Theorem 1, it suffices to restrict $\mathbb{P}$ to the set of distributions that are absolutely continuous with respect to $\mathbb{P}_E$, which can be identified with the set of $L^1(\mathbb{P}_E)$ functions with $\mathbb{P}_E$-mean zero and is thus Banach. Together with the compactness assumption on the Stein class, using Sion's minimax theorem, we can swap the minimization over $\mathbb{P}$ and the maximization over $f$. Now, fixing further $f$, consider

$$\min_{\mathbb{P}} \{\lambda_2 \mathbb{E}_{y\sim\mathbb{P}}[\mathcal{A}_{\mathbb{P}_E} f(y)] + W(\mathbb{P}_{real}, \mathbb{P})\}. \quad (8)$$

Recall the definition of the Wasserstein metric $W(\mathbb{P}_{real}, \mathbb{P}) = \min_{\gamma} \mathbb{E}_{(x,y)\sim\gamma}[\|x - y\|]$, where the minimization is over all joint distributions of $(x,y)$ with $x$-marginal $\mathbb{P}_{real}$ and $y$-marginal $\mathbb{P}$. We rewrite problem (8) as

$$\min_{\mathbb{P},\gamma} \big\{\mathbb{E}_{(x,y)\sim\gamma}[\lambda_2 \mathcal{A}_{\mathbb{P}_E} f(y) + \|x - y\|]\big\},$$

where $\gamma$ has marginals $\mathbb{P}_{real}$ and $\mathbb{P}$. Since $\mathbb{P}$ is unconstrained, the above problem is further equivalent to

$$\min_{\gamma} \big\{\mathbb{E}_{(x,y)\sim\gamma}[\lambda_2 \mathcal{A}_{\mathbb{P}_E} f(y) + \|x - y\|]\big\},$$

where the minimization is over all joint distributions of $(x,y)$ with $x$-marginal being $\mathbb{P}_{real}$.
Using the law of total expectation, the problem above is equivalent to

$$\begin{aligned} \min_{\{\gamma_x\}_{x\in\mathrm{supp}\,\mathbb{P}_{real}}} \mathbb{E}_{x\sim\mathbb{P}_{real}}\big[\mathbb{E}_{y\sim\gamma_x}[\lambda_2 \mathcal{A}_{\mathbb{P}_E} f(y) + \|x - y\| \mid x]\big] &= \mathbb{E}_{x\sim\mathbb{P}_{real}}\big[\min_{\gamma_x} \mathbb{E}_{y\sim\gamma_x}[\lambda_2 \mathcal{A}_{\mathbb{P}_E} f(y) + \|x - y\| \mid x]\big] \\ &= \mathbb{E}_{x\sim\mathbb{P}_{real}}\big[\min_{y\in\mathcal{X}} \{\lambda_2 \mathcal{A}_{\mathbb{P}_E} f(y) + \|x - y\|\}\big], \end{aligned}$$

where the minimization in the first line is over $\gamma_x$, the set of all conditional distributions of $y$ given $x$, with $x$ ranging over the support $\mathrm{supp}\,\mathbb{P}_{real}$ of $\mathbb{P}_{real}$; the exchange of min and expectation in the first equality follows from the interchangeability principle (Shapiro et al., 2009); and the second equality holds because the infimum can be restricted to the set of point masses. Finally, the original problem is equivalent to

$$\min_{E} \max_{f} \Big\{\lambda_1 S(\mathbb{P}_{real}, \mathbb{P}_E) + \mathbb{E}_{x\sim\mathbb{P}_{real}}\big[\min_{y\in\mathcal{X}} \{\lambda_2 \mathcal{A}_{\mathbb{P}_E} f(y) + \|x - y\|\}\big]\Big\}.$$

Therefore, the proof is completed using the definition of Moreau-Yosida regularization.

D DETAILS AND PROOFS IN SECTION 4.2

D.1 DISCUSSIONS ON ONE-DIMENSIONAL CASE

The training of the minimax game in GAN is difficult. When using traditional gradient methods, the training suffers from oscillatory behaviors (Goodfellow, 2017; Liang & Stokes, 2019). In order to better understand the optimization behaviors, we first study a one-dimensional linear system that provides some insight into this problem. Such a toy example (or a similar one) is also utilized by (Gidel et al., 2019; Nagarajan & Kolter, 2017) to shed light on the instability of WGAN training. Consider a linear critic $D_\psi(x) = \psi x$ and generator $G_\theta(z) = \theta z$. Then the Wasserstein GAN objective can be written as a constrained bilinear problem $\min_\theta \max_{|\psi|\le 1} \psi\mathbb{E}[x] - \psi\theta\mathbb{E}[z]$, which can be further simplified to an unconstrained version (the behaviors generalize to multidimensional cases (Gidel et al., 2019)):

$$\min_\theta \max_\psi \; \psi - \psi\theta. \quad (9)$$

Unfortunately, such a simple objective cannot guarantee convergence under traditional gradient methods like SGD with alternate updating: $\theta_{k+1} = \theta_k + \eta\psi_k$, $\psi_{k+1} = \psi_k + \eta(1 - \theta_{k+1})$. Such optimization suffers from an oscillatory behavior, i.e., the updated parameters circle around the optimum point $[\psi^*, \theta^*] = [0, 1]$ without converging to the center, as shown in Fig. 3(a).


A recent study (Liang & Stokes, 2019) theoretically shows that such oscillation is due to the interaction term in (9). One solution to the instability of GAN training is to add (likelihood) regularization, which has been widely studied in recent literature (Warde-Farley & Bengio, 2017; Li & Turner, 2018). With a regularization term, the objective becomes $\min_\theta \max_{|\psi|\le 1} \psi\mathbb{E}[x] - \psi\theta\mathbb{E}[z] - \lambda\mathbb{E}[\log \mu(\theta z)]$, where $\mu(\cdot)$ denotes the likelihood function and $\lambda$ is a hyperparameter. A recent study (Tao et al., 2019) proves that when $\lambda < 0$ (likelihood regularization), the extra term is equivalent to maximizing sample evidence, helping to stabilize GAN training; when $\lambda > 0$ (entropy regularization), the extra term maximizes sample entropy, which encourages diversity of the generator. Here we consider a Gaussian likelihood function for a generated sample $x'$, $\mu(x') = \exp(-\frac{1}{2}(x' - b)^2)$ up to a constant, whose parameter can be estimated by $b = \mathbb{E}[x]$. Then for a generated sample $x' = \theta z$, we have $\mathbb{E}[\log\mu(\theta z)] = -\frac{1}{2}\mathbb{E}[z^2]\theta^2 + \mathbb{E}[z]\mathbb{E}[x]\theta - \frac{1}{2}\mathbb{E}[x]^2$. As in the WGAN case, we consider $\mathbb{E}[x] = \mathbb{E}[z] = 1$. Assume $\mathrm{Var}[z] = 1$, so $\mathbb{E}[z^2] = \mathrm{Var}[z] + \mathbb{E}[z]^2 = 2$. Hence, for the analysis of likelihood- (and entropy-) regularized WGAN, we can study the following system:

$$\min_\theta \max_\psi \; \psi - \psi\theta - \lambda(\theta^2 - \theta).$$

When $\lambda = 0$, the above objective degrades to (9); when $\lambda < 0$ (likelihood regularization), the gradient of the regularization term pushes $\theta$ to shrink, which helps convergence; when $\lambda > 0$ (entropy regularization), the added term forms an amplifying force on $\theta$ and leads to divergence. Another issue of likelihood regularization is that the extra term changes the optimum point and makes the model converge to a biased distribution, as proved by (Tao et al., 2019). In this case, one can verify that the optimum point becomes $[\psi^*, \theta^*] = [-\lambda, 1]$, resulting in a bias. To avoid this issue, (Tao et al., 2019) proposes to gradually decrease $|\lambda|$ through training.
However, such a method would still get stuck in oscillation when $|\lambda|$ gets close to zero, as shown in Fig. 3(a). Finally, consider our proposed model. We also simplify the density estimator as a basic energy model $p_\phi(x) = \exp(-\frac{1}{2}x^2 - \phi x)$, whose score function is $\nabla_x \log p_\phi(x) = -x - \phi$. If we specify the two Stein discrepancies in (3) as KSD with kernel $k(x_1, x_2) = \mathbb{I}(x_1 = x_2)$, then

$$S(\mathbb{P}_{real}, \mathbb{P}_E) = \mathbb{E}_{x_1,x_2}\big[(\nabla_{x_1}\log p_\phi(x_1) - \nabla_{x_1}\log\mu(x_1))\, k(x_1,x_2)\, (\nabla_{x_2}\log p_\phi(x_2) - \nabla_{x_2}\log\mu(x_2))\big] = \mathbb{E}_x\big[(\nabla_x\log p_\phi(x) - \nabla_x\log\mu(x))^2\big] = (\phi + \mathbb{E}[x])^2.$$

Similarly, one can obtain $S(\mathbb{P}_G, \mathbb{P}_E) = (\phi + \theta\mathbb{E}[z])^2$. Therefore we arrive at the objective in (11):

$$\min_\theta \max_\psi \min_\phi \; \psi - \psi\theta + \frac{\lambda_1}{2}(1+\phi)^2 + \frac{\lambda_2}{2}(\theta+\phi)^2. \quad (11)$$

Interestingly, for any $\lambda_1, \lambda_2$, the optimum remains the same: $[\psi^*, \theta^*, \phi^*] = [0, 1, -1]$. We then show that the optimization converges to $[\psi^*, \theta^*, \phi^*]$.

Proposition 1. Alternate SGD on (11) geometrically decreases the squared norm $N_t = |\psi_t|^2 + |\theta_t - 1|^2 + |\phi_t + 1|^2$: for any $0 < \eta < 1$ with $\lambda_1 = \lambda_2 = 1$, $N_{t+1} \le (1 - \eta^2(1-\eta)^2)N_t$.

Proof. Instead of directly studying the optimization of (11), we first prove that the following problem converges to its unique optimum:

$$\min_\theta \max_\psi \min_\phi \; \theta\psi + \theta\phi + \frac{1}{2}\theta^2 + \phi^2. \quad (12)$$

Applying alternate SGD we have the iterations

$$\psi_{t+1} = \psi_t + \eta\theta_t, \qquad \phi_{t+1} = \phi_t - \eta(\theta_t + 2\phi_t) = (1-2\eta)\phi_t - \eta\theta_t, \qquad \theta_{t+1} = \theta_t - \eta(\psi_{t+1} + \phi_{t+1} + \theta_t) = -\eta\psi_t - \eta(1-2\eta)\phi_t + (1-\eta)\theta_t.$$

Then we obtain the relationship between adjacent iterations:

$$\begin{bmatrix}\psi_{t+1}\\ \phi_{t+1}\\ \theta_{t+1}\end{bmatrix} = \begin{bmatrix}1 & 0 & \eta\\ 0 & 1-2\eta & -\eta\\ -\eta & -\eta(1-2\eta) & 1-\eta\end{bmatrix} \begin{bmatrix}\psi_t\\ \phi_t\\ \theta_t\end{bmatrix} = M \begin{bmatrix}\psi_t\\ \phi_t\\ \theta_t\end{bmatrix}.$$

We further calculate the eigenvalues of the matrix $M$, which satisfy (denoting an eigenvalue by $\lambda$)

$$(\lambda - 1)^3 + 3\eta(\lambda - 1)^2 + 2\eta^2(1+\eta)(\lambda - 1) + 2\eta^3 = 0.$$

One can verify that the solutions to the above equation satisfy $|\lambda| < (1 - \eta + \eta^2)(1 + \eta - \eta^2)$.
Then we have the relationship

$$\left\|\begin{bmatrix}\psi_{t+1}\\ \phi_{t+1}\\ \theta_{t+1}\end{bmatrix}\right\|_2^2 = [\psi_t\;\phi_t\;\theta_t]\, M^\top M \begin{bmatrix}\psi_t\\ \phi_t\\ \theta_t\end{bmatrix} \le \lambda_m^2 \left\|\begin{bmatrix}\psi_t\\ \phi_t\\ \theta_t\end{bmatrix}\right\|_2^2,$$

where $\lambda_m$ denotes the eigenvalue of $M$ with the maximum absolute value. Hence, we have

$$\psi_{t+1}^2 + \phi_{t+1}^2 + \theta_{t+1}^2 \le (1 - \eta + \eta^2)(1 + \eta - \eta^2)\,[\psi_t^2 + \phi_t^2 + \theta_t^2]. \quad (13)$$

We then replace $\psi$, $\phi$ and $\theta$ in (13) by $\psi'$, $\phi'$ and $\theta'$ respectively and conduct the change of variables $\theta' = 1 - \theta$ and $\phi' = -1 - \phi$, which gives the conclusion of the proposition. As shown in Fig. 3(a), Stein Bridging achieves good convergence to the right optimum. Compared with (9), the objective (11) adds a new bilinear term $\phi\cdot\theta$, which acts as a connection between the generator and estimator, and two quadratic terms, which penalize the growth of parameter values through training. The added terms and the original terms in (11) cooperate to guarantee convergence to the unique optimum. In fact, the added terms $\frac{\lambda_1}{2}(1+\phi)^2 + \frac{\lambda_2}{2}(\theta+\phi)^2$ in (11) and the original terms $\psi - \psi\theta$ in WGAN both play necessary roles in guaranteeing convergence to the unique optimum point $[\psi^*, \theta^*, \phi^*] = [0, 1, -1]$. If we remove the critic and optimize $\theta$ and $\phi$ with the remaining loss terms, the training would converge, but not necessarily to $[\psi^*, \theta^*] = [0, 1]$ (since the optimum points are not unique in this case). On the other hand, if we remove the estimator, the system degrades to (9) and would not converge to the unique optimum point $[\psi^*, \theta^*] = [0, 1]$. If we consider both worlds and optimize the three terms together, the training converges to the unique global optimum $[\psi^*, \theta^*, \phi^*] = [0, 1, -1]$.
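The toy dynamics above can be reproduced numerically. Below is a small sketch (step size, iteration counts and initial points are our own choices) contrasting plain alternate SGD on (9) with alternate SGD on (11) with $\lambda_1 = \lambda_2 = 1$:

```python
import numpy as np

eta = 0.1

# Plain WGAN toy objective (9): min_theta max_psi  psi - psi * theta.
psi, theta = 1.0, 0.0
for _ in range(2000):
    theta = theta + eta * psi           # descent on theta: gradient is -psi
    psi = psi + eta * (1.0 - theta)     # ascent on psi: gradient is 1 - theta
dist_wgan = psi**2 + (theta - 1.0)**2   # stays O(1): oscillation around (0, 1)

# Stein Bridging toy objective (11) with lambda1 = lambda2 = 1.
psi, theta, phi = 1.0, 0.0, 0.0
for _ in range(2000):
    psi = psi + eta * (1.0 - theta)                   # ascent on psi
    phi = phi - eta * ((1.0 + phi) + (theta + phi))   # descent on phi
    theta = theta - eta * (-psi + (theta + phi))      # descent on theta
N_bridge = psi**2 + (theta - 1.0)**2 + (phi + 1.0)**2  # shrinks geometrically
```

The plain bilinear iteration preserves a near-invariant orbit around the optimum, while the bridged system contracts toward $[0, 1, -1]$, matching Proposition 1.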

D.2 GENERALIZATION TO BILINEAR SYSTEMS

Our analysis of the one-dimensional case suggests that we can add auxiliary variables to modify the objective and stabilize the training of a general bilinear system. The bilinear system is of wide interest to researchers studying the stability of GAN training ((Goodfellow, 2017; Liang & Stokes, 2019; Gidel et al., 2019; Gemp & Mahadevan, 2018; Zhang & Yu, 2020)). The general bilinear function can be written as

$$F(\psi, \theta) = \theta^\top A\psi - b^\top\theta - c^\top\psi, \quad (14)$$

where $\psi, \theta$ are both $r$-dimensional vectors, and the objective is $\min_\theta \max_\psi F(\psi, \theta)$, which can be seen as a basic form of various GAN objectives. Unfortunately, if we directly use simultaneous (resp. alternate) SGD to optimize such objectives, one obtains divergence (resp. fluctuation). To solve this issue, some recent papers propose several optimization algorithms, such as extrapolation from the past ((Gidel et al., 2019)), crossing the curl ((Gemp & Mahadevan, 2018)) and consensus optimization ((Liang & Stokes, 2019)). Also, (Liang & Stokes, 2019) shows that it is the interaction term, which generates non-zero values of $\nabla_{\theta\psi}F$ and $\nabla_{\psi\theta}F$, that leads to such training instability. Different from previous works that focus on the algorithmic perspective, we propose to add new auxiliary variables, which modify the objective function and allow the SGD algorithm to achieve convergence without changing the optimum points. Based on the minimax objective (14), we add an auxiliary $r$-dimensional variable $\phi$ (corresponding to the estimator in our model) to the original system and tackle the following problem:

$$\min_\theta \max_\psi \min_\phi \; F(\psi, \theta) + \alpha H(\phi, \theta), \quad (15)$$

where $H(\phi, \theta) = \frac{1}{2}(\theta + \phi)^\top B(\theta + \phi)$, $B = (AA^\top)^{1/2}$ and $\alpha$ is a non-negative constant. Theoretically, the new problem keeps the optimum points of (14) unchanged. Let $L(\psi, \phi, \theta) = F(\psi, \theta) + \alpha H(\phi, \theta)$.

Proposition 2. Assume the optimum point of $\min_\theta \max_\psi F(\psi, \theta)$ is $[\psi^*, \theta^*]$; then the optimum points of (15) are $[\psi^*, \theta^*, \phi^*]$ where $\phi^* = -\theta^*$.

Proof. The condition tells us that $\nabla_\theta F(\psi^*, \theta) = 0$ and $\nabla_\psi F(\psi, \theta^*) = 0$.
Then we derive the gradients of $L(\psi, \phi, \theta)$ (note that $B$ is symmetric):

$$\nabla_\psi L(\psi, \phi, \theta^*) = \nabla_\psi F(\psi, \theta^*) = 0, \qquad \nabla_\theta L(\psi^*, \phi, \theta) = \nabla_\theta F(\psi^*, \theta) + \alpha\nabla_\theta H(\phi, \theta) = \alpha B(\theta + \phi), \qquad \nabla_\phi L(\psi, \phi, \theta) = \alpha\nabla_\phi H(\phi, \theta) = \alpha B(\phi + \theta).$$

Setting the latter two gradients to zero, we get $\phi^* = -\theta^*$. Hence, the optimum point of (15) is $[\psi^*, \theta^*, \phi^*]$ where $\phi^* = -\theta^*$. The advantage of the new problem is that it can be solved by the SGD algorithm with a theoretical convergence guarantee. We formalize this in the following theorem.

Theorem 3. For the problem $\min_\theta \max_\psi \min_\phi L(\psi, \phi, \theta)$ with the alternate SGD algorithm, i.e.,

$$\psi_{t+1} = \psi_t + \eta\nabla_\psi L(\psi_t, \phi_t, \theta_t), \qquad \phi_{t+1} = \phi_t - \eta\nabla_\phi L(\psi_{t+1}, \phi_t, \theta_t), \qquad \theta_{t+1} = \theta_t - \eta\nabla_\theta L(\psi_{t+1}, \phi_{t+1}, \theta_t),$$

we achieve convergence to $[\psi^*, \theta^*, \phi^*]$ where $\phi^* = -\theta^*$, with at least a linear rate of $(1 - \eta_1 + \eta_2^2)(1 + \eta_2 - \eta_1^2)$, where $\eta_1 = \eta\sigma_{\min}$, $\eta_2 = \eta\sigma_{\max}$, and $\sigma_{\min}$ (resp. $\sigma_{\max}$) denotes the minimum (resp. maximum) singular value of the matrix $A$.

To prove Theorem 3, we prove a more general argument.

Lemma 1. Consider any first-order optimization method on (15), i.e.,

$$\psi_{t+1} \in \psi_0 + \mathrm{span}\big(\nabla_\psi L(\psi_0, \phi_0, \theta_0), \cdots, \nabla_\psi L(\psi_t, \phi_t, \theta_t)\big), \quad \forall t \in \mathbb{N},$$
$$\phi_{t+1} \in \phi_0 + \mathrm{span}\big(\nabla_\phi L(\psi_0, \phi_0, \theta_0), \cdots, \nabla_\phi L(\psi_t, \phi_t, \theta_t)\big), \quad \forall t \in \mathbb{N},$$
$$\theta_{t+1} \in \theta_0 + \mathrm{span}\big(\nabla_\theta L(\psi_0, \phi_0, \theta_0), \cdots, \nabla_\theta L(\psi_t, \phi_t, \theta_t)\big), \quad \forall t \in \mathbb{N}.$$

Let $\tilde\psi_t = V^\top(\psi_t - \psi^*)$, $\tilde\phi_t = U^\top(\phi_t - \phi^*)$, $\tilde\theta_t = U^\top(\theta_t - \theta^*)$, where $U$ and $V$ are the singular vector matrices of $A$ from the SVD $A = UDV^\top$. Then each triple $([\tilde\psi_t]_i, [\tilde\phi_t]_i, [\tilde\theta_t]_i)$, $1 \le i \le r$, follows, with step size $\sigma_i\eta$, the same update rule as the optimization method applied with step size $\eta$ to the unidimensional problem

$$\min_\theta \max_\psi \min_\phi \; \theta\psi + \theta\phi + \frac{1}{2}\theta^2 + \frac{1}{2}\phi^2, \quad (20)$$

where $\sigma_i$ denotes the $i$-th singular value on the diagonal of $D$.

Proof. The proof extends the proof of Lemma 3 in (Gidel et al., 2019).
The general class of first-order optimization methods yields the following updates:

$$\psi_{t+1} = \psi_0 + \sum_{s=0}^{t} \rho_{st}(A^\top\theta_s - c) = \psi_0 + \sum_{s=0}^{t} \rho_{st}A^\top(\theta_s - \theta^*),$$
$$\phi_{t+1} = \phi_0 + \sum_{s=0}^{t} \delta_{st}B(\theta_s + \phi_s),$$
$$\theta_{t+1} = \theta_0 + \sum_{s=0}^{t} \mu_{st}\big[A(\psi_s - \psi^*) + B(\theta_s + \phi_s)\big],$$

where $\rho_{st}, \delta_{st}, \mu_{st} \in \mathbb{R}$ depend on the specific optimization method (for example, in SGD, $\rho_{tt} = \delta_{tt} = \mu_{tt}$ is a non-zero constant for all $t$ and all other coefficients are zero). Using the SVD $A = UDV^\top$ and the facts $\theta^* = -\phi^*$ and $B = (AA^\top)^{1/2} = U(DD^\top)^{1/2}U^\top$, we have

$$V^\top(\psi_{t+1} - \psi^*) = V^\top(\psi_0 - \psi^*) + \sum_{s=0}^{t} \rho_{st}D^\top U^\top(\theta_s - \theta^*),$$
$$U^\top(\phi_{t+1} - \phi^*) = U^\top(\phi_0 - \phi^*) + \sum_{s=0}^{t} \delta_{st}\big[(DD^\top)^{1/2}U^\top(\theta_s - \theta^*) + (DD^\top)^{1/2}U^\top(\phi_s - \phi^*)\big],$$
$$U^\top(\theta_{t+1} - \theta^*) = U^\top(\theta_0 - \theta^*) + \sum_{s=0}^{t} \mu_{st}\big[DV^\top(\psi_s - \psi^*) + (DD^\top)^{1/2}U^\top(\theta_s - \theta^*) + (DD^\top)^{1/2}U^\top(\phi_s - \phi^*)\big],$$

and equivalently, writing $\tilde D = (DD^\top)^{1/2}$,

$$\tilde\psi_{t+1} = \tilde\psi_0 + \sum_{s=0}^{t} \rho_{st}D^\top\tilde\theta_s, \qquad \tilde\phi_{t+1} = \tilde\phi_0 + \sum_{s=0}^{t} \delta_{st}\tilde D(\tilde\theta_s + \tilde\phi_s), \qquad \tilde\theta_{t+1} = \tilde\theta_0 + \sum_{s=0}^{t} \mu_{st}\big[D\tilde\psi_s + \tilde D(\tilde\theta_s + \tilde\phi_s)\big].$$

Note that $D$ is a rectangular matrix with non-zero elements only on a diagonal block of size $r$. Hence, the above $r$-dimensional problem reduces to $r$ unidimensional problems:

$$[\tilde\psi_{t+1}]_i = [\tilde\psi_0]_i + \sum_{s=0}^{t}\rho_{st}\sigma_i[\tilde\theta_s]_i, \qquad [\tilde\phi_{t+1}]_i = [\tilde\phi_0]_i + \sum_{s=0}^{t}\delta_{st}\sigma_i\big([\tilde\theta_s]_i + [\tilde\phi_s]_i\big), \qquad [\tilde\theta_{t+1}]_i = [\tilde\theta_0]_i + \sum_{s=0}^{t}\mu_{st}\sigma_i\big([\tilde\psi_s]_i + [\tilde\theta_s]_i + [\tilde\phi_s]_i\big).$$

These iterations can be conducted independently in each dimension, where the optimization in the $i$-th dimension follows the same update rule, with step size $\sigma_i\eta$, as the method applied to problem (20). Furthermore, since problem (20) achieves convergence with a linear rate of $(1-\eta+\eta^2)(1+\eta-\eta^2)$ under alternate SGD (the proof is similar to that of (13)), the multi-dimensional problem (15) achieves convergence under SGD with at least a rate of $(1 - \eta_1 + \eta_2^2)(1 + \eta_2 - \eta_1^2)$, where $\eta_1 = \eta\sigma_{\min}$, $\eta_2 = \eta\sigma_{\max}$, and $\sigma_{\min}$ (resp. $\sigma_{\max}$) denotes the minimum (resp. maximum) singular value of the matrix $A$. This concludes the proof of Theorem 3.
Theorem 3 suggests that the added term $H(\phi, \theta)$ with auxiliary variables $\phi$ helps the SGD algorithm converge to the same optimum points as directly optimizing $F(\psi, \theta)$. Our method is related to the consensus optimization algorithm ((Liang & Stokes, 2019)), which adds a regularization term $\|\nabla_\theta F(\psi, \theta)\|^2 + \|\nabla_\psi F(\psi, \theta)\|^2$ to (14), resulting in extra quadratic terms for $\theta$ and $\psi$. The disadvantage of that method is the requirement of the Hessian matrix of $F(\psi, \theta)$, which is computationally expensive for high-dimensional data. By contrast, our solution only requires first-order derivatives.
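A minimal numerical check of (15) under alternate SGD, assuming a small, well-conditioned square matrix $A$ (our own choice) and computing $B = (AA^\top)^{1/2}$ via an eigendecomposition of the symmetric matrix $AA^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
A = 2.0 * np.eye(r) + 0.3 * rng.normal(size=(r, r))   # well-conditioned square A
b, c = rng.normal(size=r), rng.normal(size=r)

# B = (A A^T)^{1/2} via eigendecomposition of the symmetric matrix A A^T
w, V = np.linalg.eigh(A @ A.T)
B = V @ np.diag(np.sqrt(w)) @ V.T

psi_star = np.linalg.solve(A, b)       # solves grad_theta F = A psi - b = 0
theta_star = np.linalg.solve(A.T, c)   # solves grad_psi  F = A^T theta - c = 0

eta, alpha = 0.05, 1.0
psi, theta, phi = rng.normal(size=r), rng.normal(size=r), rng.normal(size=r)
for _ in range(20000):
    psi = psi + eta * (A.T @ theta - c)                                 # ascent on psi
    phi = phi - eta * alpha * (B @ (theta + phi))                       # descent on phi
    theta = theta - eta * (A @ psi - b + alpha * (B @ (theta + phi)))   # descent on theta
# Iterates approach [psi_star, theta_star, -theta_star], as Theorem 3 predicts.
```

The step size and the number of iterations are illustrative; per Theorem 3, the rate depends on the singular values of $A$, so poorly conditioned choices of $A$ would converge more slowly.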

E DETAILS FOR IMPLEMENTATIONS E.1 SYNTHETIC DATASETS

We provide details of the two synthetic datasets. The Two-Circle dataset consists of 24 Gaussian mixture components, where 8 of them are located on an inner circle with radius $r_1 = 4$ and 16 of them lie on an outer circle with radius $r_2 = 8$. For each Gaussian component, the covariance matrix is $0.2I = \sigma_1 I$ and the mean value is $[r_1\cos t, r_1\sin t]$, where $t = \frac{2\pi k}{8}$, $k = 1, \cdots, 8$, for the inner circle, and $[r_2\cos t, r_2\sin t]$, where $t = \frac{2\pi k}{16}$, $k = 1, \cdots, 16$, for the outer circle. We sample $N_1 = 2000$ points as true observed samples for model training. The Two-Spiral dataset contains 100 Gaussian mixture components whose centers locate on two spiral-shaped curves. For each Gaussian component, the covariance matrix is $0.5I = \sigma_2 I$ and the mean value is $[-c_1\cos c_1, c_1\sin c_1]$, where $c_1 = \frac{2\pi}{3} + \mathrm{linspace}(0, 0.5, 50)\cdot 2\pi$, for one spiral, and $[c_2\cos c_2, -c_2\sin c_2]$, where $c_2 = \frac{2\pi}{3} + \mathrm{linspace}(0, 0.5, 50)\cdot 2\pi$, for the other spiral. We sample $N_2 = 5000$ points as true observed samples.
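A sketch of a sampler for the Two-Circle dataset; the mixture weights are assumed uniform over the 24 components, which the text does not state explicitly:

```python
import numpy as np

def sample_two_circle(n, r1=4.0, r2=8.0, sigma1=0.2, seed=0):
    """Draw n points from the Two-Circle mixture: 8 components on a circle of
    radius r1 and 16 on a circle of radius r2, each with covariance sigma1 * I.
    Mixture weights are assumed uniform (an assumption, not stated in the text)."""
    rng = np.random.default_rng(seed)
    t_in = 2 * np.pi * np.arange(1, 9) / 8
    t_out = 2 * np.pi * np.arange(1, 17) / 16
    means = np.concatenate([
        np.stack([r1 * np.cos(t_in), r1 * np.sin(t_in)], axis=1),
        np.stack([r2 * np.cos(t_out), r2 * np.sin(t_out)], axis=1),
    ])                                            # (24, 2) component means
    idx = rng.integers(0, len(means), size=n)     # pick a component per point
    return means[idx] + np.sqrt(sigma1) * rng.normal(size=(n, 2))

X = sample_two_circle(2000)   # N1 = 2000 training samples
```

The Two-Spiral dataset can be generated analogously by placing the 100 component means on the two spiral curves above.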

E.2 MODEL SPECIFICATIONS AND TRAINING ALGORITHM

In different tasks, we consider different model specifications in order to meet the demand of capacity as well as to test effectiveness under various settings. Our proposed framework (3) adopts the Wasserstein distance for the first term and two Stein discrepancies for the second and third terms. We can write (3) in a more general form: $\min_{\theta,\phi} D_1(\mathbb{P}_{real}, \mathbb{P}_G) + \lambda_1 D_2(\mathbb{P}_{real}, \mathbb{P}_E) + \lambda_2 D_3(\mathbb{P}_G, \mathbb{P}_E)$, where $D_1, D_2, D_3$ denote three general discrepancy measures for distributions. As stated in our remark, $D_1$ can be specified as an arbitrary discrepancy measure for implicit generative models. Here we also use the JS divergence, the objective of the vanilla GAN. To distinguish them, we call the model using the Wasserstein distance (resp. JS divergence) Joint-W (resp. Joint-JS) in our experiments. On the other hand, the two Stein discrepancies in (3) can be specified by KSD (as defined by $S_k$ in (5)) or by the general Stein discrepancy with an extra critic (as defined by $S$ in (1)). Hence, the two specifications for $D_1$ and the two for $D_2$ ($D_3$) compose four combinations in total; we organize the objective for each case in Table 3. In our experiments, we use KSD with RBF kernels for $D_2$ and $D_3$ in Joint-W and Joint-JS on the two synthetic datasets. For MNIST with conditional training (given the digit class as model input), we also use KSD with RBF kernels. For MNIST and CIFAR with unconditional training (the class is not given as known information), we find that KSD cannot provide desirable results, so we adopt the general Stein discrepancy for higher model capacity. The objectives in Table 3 appear computationally expensive: in the worst case (using the general Stein discrepancy), there are two minimax operations, one from GAN or WGAN and one from Stein discrepancy estimation. To guarantee training efficiency, we alternately update the generator, estimator, Wasserstein critic and Stein critic over the parameters $\theta$, $\phi$, $\psi$ and $\pi$, respectively.
Specifically, in one iteration, we optimize the generator over $\theta$ and the estimator over $\phi$ with one step each, and then optimize the Wasserstein critic over $\psi$ with $n_d$ steps and the Stein critic over $\pi$ with $n_c$ steps. This training approach keeps the time complexity of the proposed method of the same order as that of GAN or WGAN, and the training time of our model is bounded within a constant factor of the time for training a GAN model. In our experiments, we set $n_d = n_c = 5$ and empirically find that Stein Bridging is about two times slower than WGAN on average. We present the training algorithm for Stein Bridging in Algorithm 1.
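The alternating schedule described above can be sketched as follows; the `step_*` callables are hypothetical stand-ins for one Adam/SGD update of the corresponding module, not the paper's actual implementation:

```python
def train(step_generator, step_estimator, step_w_critic, step_stein_critic,
          n_iters, n_d=5, n_c=5):
    """Alternating schedule: per outer iteration, n_d Wasserstein-critic steps
    and n_c Stein-critic steps, then one step each for estimator and generator."""
    for _ in range(n_iters):
        for _ in range(n_d):
            step_w_critic()       # inner maximization for the Wasserstein critic
        for _ in range(n_c):
            step_stein_critic()   # inner maximization for the Stein critic
        step_estimator()          # one step on the density estimator (phi)
        step_generator()          # one step on the sample generator (theta)

# Sanity check with counters standing in for the real update functions.
counts = {"g": 0, "e": 0, "d": 0, "c": 0}
train(lambda: counts.update(g=counts["g"] + 1),
      lambda: counts.update(e=counts["e"] + 1),
      lambda: counts.update(d=counts["d"] + 1),
      lambda: counts.update(c=counts["c"] + 1),
      n_iters=10)
```

With $n_d = n_c = 5$, each outer iteration costs a constant number of gradient steps, which is the basis for the constant-factor overhead claimed above.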

E.3 IMPLEMENTATION DETAILS

We give the network architectures and hyper-parameter settings for our model as well as each competitor in our experiments. The energy function is often parametrized as a sum of multiple experts ((Hinton, 1999)), and each expert can have various functional forms depending on the distributions. If using the sigmoid distribution, the energy function becomes (see Section 2.1 in (Kim & Bengio, 2017) for details)

$$E_\phi(x) = \sum_i \log(1 + e^{-(W_i n(x) + b_i)}),$$

where $n(x)$ maps the input $x$ to a feature vector and can be specified as a deep neural network, which corresponds to the deep energy model ((Ngiam et al., 2011)). When not using KSD, the implementation of the Stein critic $f$ and the operation $\Gamma$ in (1) remains an open problem. Some existing studies like (Hu et al., 2018) set $d' = 1$, in which case $f$ reduces to a scalar function from $d$-dimensional input to a one-dimensional scalar value. Such a setting reduces computational cost, since a large $d'$ leads to heavy computation during training. Empirically, in our experiments on image datasets, we find that setting $d' = 1$ provides similar performance to $d' = 10$ or $d' = 100$. Hence, we set $d' = 1$ in our experiments for efficiency. Besides, to further reduce computational cost, we let the two Stein critics share parameters, which empirically provides better performance than two different Stein critics. Another tricky point is how to design a proper $\Gamma$ when $d' \neq d$, where the trace operation is not applicable. One simple way is to set $\Gamma$ as some matrix norm. However, the issue is that using a matrix norm makes SGD learning hard: $\Gamma$ and the expectation in (1) cannot exchange order, in which case there is no unbiased mini-batch estimate of the gradient. Here, we specify $\Gamma$ as max-pooling over the dimensions of $\mathcal{A}_{p_\phi}[f_\pi(x)]$, i.e., the gradient back-propagates through the dimension with the largest absolute value at one time. Theoretically, such a setting guarantees that the value in each dimension reduces to zero through training, and we find it works well in practice.
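The product-of-experts energy above can be sketched as follows; the feature map `n` here is a toy stand-in rather than the deep network used in the experiments, and the sizes are arbitrary:

```python
import numpy as np

def softplus(t):
    # numerically stable log(1 + exp(t))
    return np.logaddexp(0.0, t)

def energy(x, W, b, n):
    """Sum-of-experts energy E(x) = sum_i log(1 + exp(-(W_i n(x) + b_i))).
    W has one row per expert; the unnormalized density is p(x) ~ exp(-E(x))."""
    feats = n(x)                    # (batch, feat_dim) feature vectors
    logits = feats @ W.T + b        # (batch, n_experts)
    return softplus(-logits).sum(axis=1)   # one energy value per sample

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))    # 4 experts over an 8-dim feature
b = rng.normal(size=4)
n = np.tanh                    # toy stand-in for the feature network n(x)
E = energy(rng.normal(size=(16, 8)), W, b, n)
```

Each expert contributes a non-negative softplus term, so the energy is always positive and the normalizing constant of $\exp(-E(x))$ never needs to be computed, which is exactly where the Stein discrepancy machinery applies.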



Footnotes. (1) The kernel Sobolev norm of $v$ is $\|v\|_{H(\mathbb{P};k)} = (\mathbb{E}_{x,x'\sim\mathbb{P}}[\nabla v(x)^\top k(x,x')\nabla v(x')])^{1/2}$, where $\mathbb{P}$ has a density. For $v \in L^2$, its Sobolev dual norm is $\|D\|_{H^{-1}(\mathbb{P};k)} = \sup_v\{\mathbb{E}_{\mathbb{P}}[vD] : \|v\|_{H(\mathbb{P};k)} \le 1,\ \mathbb{E}_{\mathbb{P}}[v] = 0\}$. (2) The experiment codes will be released. (3) Our theoretical discussions focus on WGAN; we also compare with the original GAN in the experiments. (4) Here, we adopt the most widely used alternate updating strategy. The simultaneous updating, i.e., $\theta_{k+1} = \theta_k + \eta\psi_k$ and $\psi_{k+1} = \psi_k + \eta(1 - \theta_k)$, would diverge in this case.



Figure 1: Model framework for Stein Bridging.

Figure 2: (a) The gradient norm of Wasserstein critic with (blue) and without (red) the Stein bridge when data are sampled from a mixture of Gaussian. (b) Contour of an energy model with one mode and empirical data from a distribution with a different mode (blue dots); (c) & (d) Contours of the Stein critics between the two distributions in (b) learned with and without the Stein bridge, respectively.

Figure 4: Comparison for density estimation. (a) True densities of real distribution and (b)∼(f) estimated densities given by the estimators of different methods on Two-Circle (upper line) and Two-Spiral (bottom line) datasets.

Figure 5: Comparison for generated sample quality. (a) True samples from real distribution and (b)∼(f) generated samples produced by the generators of different methods on Two-Circle (upper line) and Two-Spiral (bottom line) datasets.

Figure 6: Learning curves of Joint-W (resp. Joint-JS) compared with WGAN (resp. GAN) and its regularization-based variants on Two-Circle and Two-Spiral datasets.






Figure 9: Generated digits given by Joint-W on MNIST.

Figure 11: The generated digits (and real digits) with the highest densities and the lowest densities given by Joint-W.

Figure 13: The generated digits (and real digits) with the highest densities and the lowest densities given by EGAN.

$(\mathcal{A}_{\mathbb{P}_E}f)_{\lambda_2}(\cdot)$ denotes the (generalized) Moreau-Yosida regularization of the function $\mathcal{A}_{\mathbb{P}_E}f$ with parameter $\lambda_2$, i.e., $(\mathcal{A}_{\mathbb{P}_E}f)_{\lambda_2}(x) = \min_{y\in\mathcal{X}}\{\mathcal{A}_{\mathbb{P}_E}f(y) + \frac{1}{\lambda_2}\|x - y\|\}$.

Inception Scores (IS) and Fréchet Inception Distance (FID) on CIFAR-10.

Table 3: Objectives for different specifications of $D_1(\mathbb{P}_{real}, \mathbb{P}_G)$, $D_2(\mathbb{P}_{real}, \mathbb{P}_E)$ and $D_3(\mathbb{P}_G, \mathbb{P}_E)$. We specify $D_1$ as the Wasserstein distance or the JS divergence, and for $D_2$ and $D_3$ we consider the general Stein discrepancy or the kernel Stein discrepancy. Here we use $\mathcal{W}$, $\mathcal{JS}$ to denote Wasserstein distance and JS divergence respectively, and $S$, $S_k$ to represent general Stein discrepancy and kernel Stein discrepancy respectively. We omit the gradient penalty term for the Wasserstein distance here but use it in experiments.

- $(\mathcal{W}, S, S)$: $\min_\theta \min_\phi \max_\psi \max_\pi \; \mathbb{E}_{x\sim\mathbb{P}_{data}}[d_\psi(x)] - \mathbb{E}_{z\sim p_0}[d_\psi(G_\theta(z))] + \lambda_1 \mathbb{E}_{x\sim\mathbb{P}_{data}}[\mathcal{A}_{p_\phi}[f_\pi(x)]] + \lambda_2 \mathbb{E}_{z\sim p_0}[\mathcal{A}_{p_\phi}[f_\pi(G_\theta(z))]]$
- $(\mathcal{W}, S_k, S_k)$: $\min_\theta \min_\phi \max_\psi \; \mathbb{E}_{x\sim\mathbb{P}_{data}}[d_\psi(x)] - \mathbb{E}_{z\sim p_0}[d_\psi(G_\theta(z))] + \lambda_1 \mathbb{E}_{x,x'\sim\mathbb{P}_{data}}[u_{p_\phi}(x, x')] + \lambda_2 \mathbb{E}_{z,z'\sim p_0}[u_{p_\phi}(G_\theta(z), G_\theta(z'))]$
- $(\mathcal{JS}, S, S)$: $\min_\theta \min_\phi \max_\psi \max_\pi \; \mathbb{E}_{x\sim\mathbb{P}_r}[\log(d_\psi(x))] + \mathbb{E}_{z\sim p_0}[\log(1 - d_\psi(G_\theta(z)))] + \lambda_1 \mathbb{E}_{x\sim\mathbb{P}_{data}}[\mathcal{A}_{p_\phi}[f_\pi(x)]] + \lambda_2 \mathbb{E}_{z\sim p_0}[\mathcal{A}_{p_\phi}[f_\pi(G_\theta(z))]]$
- $(\mathcal{JS}, S_k, S_k)$: $\min_\theta \min_\phi \max_\psi \; \mathbb{E}_{x\sim\mathbb{P}_r}[\log(d_\psi(x))] + \mathbb{E}_{z\sim p_0}[\log(1 - d_\psi(G_\theta(z)))] + \lambda_1 \mathbb{E}_{x,x'\sim\mathbb{P}_{data}}[u_{p_\phi}(x, x')] + \lambda_2 \mathbb{E}_{z,z'\sim p_0}[u_{p_\phi}(G_\theta(z), G_\theta(z'))]$


Algorithm 1: Training Algorithm for Stein Bridging
REQUIRE: observed training samples {x} ∼ P_real.
REQUIRE: θ0, φ0, ψ0, π0: initial parameters for the generator, estimator, Wasserstein critic and Stein critic models, respectively. α_E = 0.0002, β1_E = 0.9, β2_E = 0.999: Adam hyper-parameters for the explicit model. α_I = 0.0002, β1_I = 0.5, β2_I = 0.999: Adam hyper-parameters for the implicit model. λ1 = 1, λ2: weights for D2 and D3 (we suggest increasing λ2 from 0 to 1 through training). n_d = 5, n_c = 5: number of iterations for the Wasserstein critic and the Stein critic, respectively, before one iteration for the generator and estimator. B = 100: batch size.
while not converged do:
    for n_d iterations: sample B training samples {x_i} from P_real and B random noise vectors {z_i} ∼ P_0, obtain generated samples {G_θ(z_i)}, and update the Wasserstein critic ψ;
    for n_c iterations: sample B random noise vectors {z_i} ∼ P_0, obtain generated samples {G_θ(z_i)}, and update the Stein critic π;
    sample B random noise vectors {z_i} ∼ P_0, obtain generated samples, and update the density estimator φ;
    update the sample generator θ;
OUTPUT: trained sample generator G_θ(z) and density estimator p_φ(x).

For synthetic datasets, we set the noise dimension to 4. All generators are specified as three-layer fully-connected (FC) neural networks with layer sizes 4-128-128-2, and all Wasserstein critics (or the discriminators in the JS-divergence-based GAN) are also three-layer FC networks with layer sizes 2-128-128-1. For the estimators, we set the expert number to 4 and the feature function n(x) is an FC network with layer sizes 2-128-128-4; in the last layer we sum the outputs from the experts as the energy value E(x). The activation units are searched within [LeakyReLU, tanh, sigmoid, softplus], the learning rate within [1e-6, 1e-5, 1e-4, 1e-3, 1e-2], and the batch size within [50, 100, 150, 200]. The gradient penalty weight for WGAN is searched within [0, 0.1, 1, 10, 100].

For the MNIST dataset, we set the noise dimension to 100.
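The alternating schedule of Algorithm 1 (n_d Wasserstein-critic and n_c Stein-critic steps per generator/estimator update, with λ2 annealed from 0 to 1) can be sketched as follows. This is our own schematic: the losses and gradient steps are placeholders, not the paper's exact updates; only the update schedule and annealing are shown.

```python
# Sketch of Algorithm 1's update schedule (losses omitted).
n_d, n_c, B, T = 5, 5, 100, 10   # critic iterations, batch size, outer iterations
lam1, lam2 = 1.0, 0.0            # lam2 is annealed from 0 to 1 over training

updates = {"w_critic": 0, "stein_critic": 0, "estimator": 0, "generator": 0}

for t in range(T):
    for _ in range(n_d):          # n_d steps on the Wasserstein critic psi
        updates["w_critic"] += 1
    for _ in range(n_c):          # n_c steps on the Stein critic pi
        updates["stein_critic"] += 1
    updates["estimator"] += 1     # one step on the density estimator phi
    updates["generator"] += 1     # one step on the sample generator theta
    lam2 = min(1.0, lam2 + 1.0 / T)  # anneal lam2 toward 1

print(updates)  # {'w_critic': 50, 'stein_critic': 50, 'estimator': 10, 'generator': 10}
```

The inner critic loops mirror the WGAN convention of training the critic to near-optimality before each generator step; the same ratio is reused for the Stein critic.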
All critics/discriminators are implemented as four-layer networks where the first two layers adopt convolution operations with filter size 5 and stride [2, 2] and the last two layers are FC; the layer sizes are 1-64-128-256-1. All generators are implemented as four-layer networks where the first two layers are FC and the last two adopt deconvolution operations with filter size 5 and stride [2, 2]; the layer sizes are 100-256-128-64-1. For the estimators, we set the expert number to 128 and the feature function is the same as the Wasserstein critic except that the size of the last layer is 128; we then sum the outputs from the experts as the energy value. The activation units are searched within [ReLU, LeakyReLU, tanh], the learning rate within [2e-5, 2e-4, 2e-3, 2e-2], and the batch size within [32, 64, 100, 128]. The gradient penalty weight for WGAN is searched within [1, 10, 100, 1000].

For the CIFAR dataset, we adopt the same architectures as DCGAN for the critics and generators. For the estimator, the architecture of the feature function is the same as the critics except the last layer, where we set the expert number to 128 and sum the outputs as the energy value. The architecture of the Stein critic is the same as that of the Wasserstein critic for both the MNIST and CIFAR datasets. In other words, we consider d = 1 in (1) and further simplify f as an average of each dimension; empirically, we found this setting provides efficient computation and decent performance.

Table 5: Quantitative results including MMD (lower is better) and HSR (higher is better) as the metrics for the quality of generated samples, and KLD (lower is better), JSD (lower is better) and AUC (higher is better) as the metrics for the accuracy of estimated densities on the Two-Circle and Two-Spiral datasets.

E.4 EVALUATION METRICS

We adopt quantitative metrics to evaluate the performance of each method on different tasks. In Section 4.1, we use two metrics to test sample quality: Maximum Mean Discrepancy (MMD) and High-quality Sample Rate (HSR). MMD measures the discrepancy between two distributions X and Y: MMD(X, Y) = || (1/m) Σ_{i=1}^m Φ(x_i) − (1/n) Σ_{j=1}^n Φ(y_j) ||, where x_i and y_j denote samples from X and Y respectively and Φ maps each sample to an RKHS. Here we use the RBF kernel and calculate MMD between generated samples and true samples. HSR measures the rate of high-quality samples over all generated samples. For the Two-Circle dataset, we define generated points whose distance from the nearest Gaussian component is less than σ1 as high-quality samples; we generate 2000 points in total to compute HSR. For the Two-Spiral dataset, we set the distance threshold to 5σ2 and generate 5000 points. For CIFAR, we use the Inception V3 network in TensorFlow as the pre-trained classifier to calculate the Inception Score.

In Section 4.2, we use three metrics to characterize the performance of density estimation: KL divergence, JS divergence and AUC. We divide the map into a 300 meshgrid, calculate the unnormalized density value of each point given by the estimators, and compute the KL and JS divergences between the estimated density and the ground-truth density. Besides, we select the centers of the Gaussian components as positive examples (expected to have high densities), randomly sample 10 points within a circle around each center as negative examples (expected to have relatively low densities), and rank them according to the densities given by the model. We then obtain the area under the curve (AUC) for the false-positive rate vs. the true-positive rate.
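The (biased, V-statistic) MMD estimate with an RBF kernel can be sketched as follows; this is our own minimal version, and the paper's bandwidth and sample counts may differ:

```python
import numpy as np

def rbf(A, B, h=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 h^2)) for all pairs of rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h))

def mmd2(X, Y, h=1.0):
    # Biased estimate of || (1/m) sum_i Phi(x_i) - (1/n) sum_j Phi(y_j) ||^2
    # in the RKHS induced by the RBF kernel
    return rbf(X, X, h).mean() + rbf(Y, Y, h).mean() - 2.0 * rbf(X, Y, h).mean()

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 2))
good = rng.standard_normal((500, 2))       # "generated" samples matching real
bad = rng.standard_normal((500, 2)) + 2.0  # mode-shifted "generated" samples
print(mmd2(real, good), mmd2(real, bad))   # first value is much smaller
```

Lower MMD between generated and true samples indicates a better match of the two distributions, which is how the metric is used in Section 4.1.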

