GENERATIVE LEARNING WITH EULER PARTICLE TRANSPORT

Abstract

We propose an Euler particle transport (EPT) approach for generative learning. The proposed approach is motivated by the problem of finding the optimal transport map from a reference distribution to a target distribution characterized by the Monge-Ampère equation. Interpreting the infinitesimal linearization of the Monge-Ampère equation from the perspective of gradient flows in measure spaces leads to a stochastic McKean-Vlasov equation. We use the forward Euler method to solve this equation. The resulting forward Euler map pushes forward a reference distribution to the target. This map is the composition of a sequence of simple residual maps, which are computationally stable and easy to train. The key task in training is the estimation of the density ratios or differences that determine the residual maps. We estimate the density ratios (differences) based on the Bregman divergence with a gradient penalty using deep density-ratio (difference) fitting. We show that the proposed density-ratio (difference) estimators do not suffer from the "curse of dimensionality" if data is supported on a lower-dimensional manifold. Numerical experiments with multi-mode synthetic datasets and comparisons with existing methods on real benchmark datasets support our theoretical results and demonstrate the effectiveness of the proposed method.

1. INTRODUCTION

The ability to efficiently sample from complex distributions plays a key role in a variety of prediction and inference tasks in machine learning and statistics (Salakhutdinov, 2015). The long-standing methodology for learning an underlying distribution relies on an explicit statistical model of the data, which can be difficult to specify in many applications such as image analysis, computer vision and natural language processing. In contrast, implicit generative models do not assume a specific form of the data distribution, but rather learn a nonlinear map that transforms a reference distribution to the target distribution. This modeling approach has been shown to achieve impressive performance in many machine learning tasks (Reed et al., 2016; Zhu et al., 2017). Generative adversarial networks (GAN) (Goodfellow et al., 2014), variational auto-encoders (VAE) (Kingma & Welling, 2014) and flow-based methods (Rezende & Mohamed, 2015) are important representatives of implicit generative models. In this paper, we propose an Euler particle transport (EPT) approach for learning a generative model by integrating ideas from optimal transport, numerical ODEs, density-ratio estimation and deep neural networks. We formulate the problem of generative learning as that of finding a nonlinear transform that pushes forward a reference to the target based on the quadratic Wasserstein distance. Since it is challenging to solve the resulting Monge-Ampère equation directly, we consider the continuity equation derived from the linearization of the Monge-Ampère equation, which defines a gradient flow converging to the target distribution. We solve the McKean-Vlasov equation associated with the gradient flow using the forward Euler method. The resulting EPT map that pushes forward the reference distribution to the target distribution is a composition of a sequence of simple residual maps, which are computationally stable and easy to train.
The residual maps are completely determined by the density ratios between the distributions at the current iterations and the target distribution. We estimate the density ratios based on the Bregman divergence with a gradient regularizer using deep density-ratio fitting. We establish bounds on the approximation errors due to the linearization of the Monge-Ampère equation, the Euler discretization of the McKean-Vlasov equation, and deep density-ratio estimation. Our result on the error rate of the proposed density-ratio estimators improves the minimax rate of nonparametric estimation by exploiting the low-dimensional structure of the data, thereby circumventing the "curse of dimensionality". Experimental results on multi-mode synthetic data and comparisons with state-of-the-art GANs on benchmark data support our theoretical findings and demonstrate that EPT is computationally more stable and easier to train than GANs. Using simple ReLU ResNets without batch normalization or spectral normalization, we obtain results that are better than or comparable with those of GANs trained with such tricks.

2. EULER PARTICLE TRANSPORT

Let X ∈ ℝ^m be a random vector with distribution ν, and let Z be a random vector with distribution µ. We assume that µ has a known and simple form. Our goal is to construct a transformation T such that T_#µ = ν, where T_#µ denotes the push-forward distribution of µ by T, that is, the distribution of T(Z). Then we can sample from ν by first generating Z ∼ µ and then calculating T(Z). In practice, ν is unknown and only a random sample {X_i}_{i=1}^n i.i.d. ∼ ν is available, so we must construct T based on the sample. There may exist multiple transports T with T_#µ = ν. The optimal transport is the one that minimizes the quadratic Wasserstein distance between µ and ν, defined by

W_2(µ, ν) = { inf_{γ ∈ Γ(µ,ν)} E_{(Z,X)∼γ}[ ‖Z − X‖²₂ ] }^{1/2},   (1)

where Γ(µ, ν) denotes the set of couplings of (µ, ν) (Villani, 2008; Ambrosio et al., 2008). Suppose that µ and ν have densities q and p with respect to the Lebesgue measure, respectively. Then the optimal transport map T such that T_#µ = ν is characterized by the Monge-Ampère equation (Brenier, 1991; McCann, 1995; Santambrogio, 2015). Specifically, the minimization problem in (1) admits a unique solution γ = (1, T)_#µ with T = ∇Ψ, µ-a.e., where 1 is the identity map and ∇Ψ is the gradient of the potential function Ψ : ℝ^m → ℝ. This function is convex and satisfies the Monge-Ampère equation

det(∇²Ψ(z)) = q(z) / p(∇Ψ(z)),  z ∈ ℝ^m.   (2)

Therefore, to find the optimal transport T, it suffices to solve (2) for Ψ. However, it is challenging to solve this degenerate elliptic equation due to its highly nonlinear nature. Below we describe the proposed EPT method for obtaining an approximate solution of the Monge-Ampère equation (2).
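As a standard worked example (illustrative; not taken from the paper), equation (2) can be verified in closed form when both distributions are Gaussian:

```latex
% Illustrative check of (2): reference mu = N(0, I_m) with density q,
% target nu = N(a, sigma^2 I_m) with density p.
% The optimal map is linear, with convex potential
\Psi(z) = \tfrac{\sigma}{2}\lVert z\rVert_2^2 + a^\top z,
\qquad T(z) = \nabla\Psi(z) = \sigma z + a,
\quad\text{so}\quad \nabla^2\Psi(z) = \sigma I_m .
% Both sides of the Monge--Ampere equation then equal sigma^m:
\det(\nabla^2\Psi(z)) = \sigma^m,
\qquad
\frac{q(z)}{p(\nabla\Psi(z))}
= \frac{(2\pi)^{-m/2}\, e^{-\lVert z\rVert_2^2/2}}
       {(2\pi\sigma^2)^{-m/2}\, e^{-\lVert \sigma z\rVert_2^2/(2\sigma^2)}}
= \sigma^m .
```

In this special case the potential is available explicitly; the difficulty addressed by EPT is that no such closed form exists in general.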
It consists of the following steps: (a) linearizing (2) via residual maps, (b) determining the velocity fields governing the stochastic McKean-Vlasov equation resulting from the linearization, (c) calculating the forward Euler particle transport map, and (d) training the EPT map by estimating the velocity fields from data. Since the velocity fields are completely determined by density ratios, this step amounts to nonparametric density-ratio estimation. We also provide bounds on the errors due to linearization, discretization and estimation. Mathematical details and proofs are given in the appendix.

Linearization via residual map. A basic approach to addressing the difficulty due to nonlinearity is linearization. We use a linearization method based on the residual map

T_{t,Φ_t} = 1 + t∇Φ_t,  t ≥ 0,   (3)

where Φ_t : ℝ^m → ℝ is a function to be chosen such that the law of T_{t,Φ_t}(Z) approaches ν as t increases (Villani, 2008). We give the specific form of Φ_t below; see Theorem B.1 in the appendix for details. This linearization scheme leads to the stochastic process X_t : ℝ^m → ℝ^m satisfying the McKean-Vlasov equation

(d/dt) X_t(x) = v_t(X_t(x)),  t ≥ 0, with X_0 ∼ µ, µ-a.e. x ∈ ℝ^m,   (4)

where v_t is the velocity vector field of X_t. In addition, we have v_t = ∇Φ_t, so v_t also determines the residual map (3). The details of the derivation are given in Theorems B.1 and B.2 in the appendix. Therefore, estimating the residual map (3) is equivalent to estimating v_t. The movement of X_t along t is completely governed by v_t, given the initial value. We choose v_t to decrease the discrepancy between µ_t, the distribution of X_t at time t, and the target ν with respect to a properly chosen measure. An equivalent formulation of (4) is through the gradient flow {µ_t}_{t≥0} with {v_t}_{t≥0} as its velocity fields; see Proposition B.1 in the appendix. Computationally, it is more convenient to work with (4).

Determining velocity field

The basic intuition is that we should move in the direction that decreases the difference between µ_t and the target ν. We use an energy functional L[µ_t] to measure this difference. An important energy functional is the f-divergence (Ali & Silvey, 1966),

L[µ_t] = D_f(µ_t ‖ ν) = ∫_{ℝ^m} p(x) f(q_t(x)/p(x)) dx,   (5)

where q_t is the density of µ_t, p is the density of ν, and f : ℝ_+ → ℝ is assumed to be a twice-differentiable convex function with f(1) = 0. We choose Φ_t such that L[µ_t] is minimized. We show in Theorem B.1 in the appendix that Φ_t(x) = −f′(r_t(x)) and v_t(x) = ∇Φ_t(x). Therefore,

v_t(x) = −f″(r_t(x)) ∇r_t(x),  where r_t(x) = q_t(x)/p(x), x ∈ ℝ^m.

For example, if we use the χ²-divergence with f(c) = (c − 1)²/2, then v_t(x) = −∇r_t(x).

The forward Euler method. Numerically, we need to discretize the McKean-Vlasov equation (4). Let s > 0 be a small step size. We use the forward Euler method defined iteratively by

T_k = 1 + s v_k,   (6)
X_{k+1} = T_k(X_k),   (7)
µ_{k+1} = (T_k)_# µ_k,   (8)

where X_0 ∼ µ, µ_0 = µ, and v_k is the velocity field at the kth step, k = 0, 1, ..., K, for some large K. The particle process {X_k}_{k≥0} is a discretized version of the continuous process {X_t}_{t≥0} in (4). The final transport map is the composition of a sequence of simple residual maps T_0, T_1, ..., T_K, i.e., T = T_K ∘ T_{K−1} ∘ ⋯ ∘ T_0. This updating scheme is based on the forward Euler method for solving equation (4), which is why we refer to the proposed method as Euler particle transport (EPT).

Training EPT. When the target ν is unknown and only a random sample is available, it is natural to learn ν by first estimating the discrete velocity fields v_k at the sample level and then plugging the estimators of v_k into (6). For example, if we use the f-divergence as the energy functional, estimating v_k(x) = −f″(r_k(x))∇r_k(x) boils down to estimating the density ratios r_k(x) = q_k(x)/p(x) dynamically at each iteration k.
Nonparametric density-ratio estimation using the Bregman divergence and a gradient regularizer is discussed in Section 4 below. Let v̂_k be the estimated velocity field at the kth iteration. The kth estimated residual map is T̂_k = 1 + s v̂_k. Finally, the trained map is

T̂ = T̂_K ∘ T̂_{K−1} ∘ ⋯ ∘ T̂_0.   (9)

Theoretical guarantees. We establish the following bound on the approximation error due to the linearization of the Monge-Ampère equation under appropriate conditions:

W_2(µ_t, ν) = O(e^{−λt})   (10)

for some λ > 0; see Proposition B.1 in the appendix. Therefore, µ_t converges to ν exponentially fast as t → ∞. For an integer K ≥ 1 and a small s > 0, let {µ^s_t : t ∈ [ks, (k+1)s), k = 0, ..., K} be a piecewise constant interpolation between µ_{ks} and µ_{(k+1)s}, k = 0, 1, ..., K. Under the assumption that the velocity fields v_t are Lipschitz continuous with respect to (x, µ_t), the discretization error of µ^s_t can be bounded on a finite time interval [0, T) as follows:

sup_{t∈[0,T)} W_2(µ_t, µ^s_t) = O(s).   (11)

The proof of (11) is given in Proposition B.2 in the appendix. The error bounds (10) and (11) imply that the distribution of the particles X_k generated by the EPT map defined in (7), with a small s and a sufficiently large k, converges to the target ν at the rate of the discretization size s. When training the EPT map, we use deep neural networks to estimate the density ratios (density differences) from samples. In Theorem 4.1, we provide an estimation error bound that improves the minimax rate of deep nonparametric estimation by exploiting the low-dimensional structure of the data and circumvents the "curse of dimensionality." This result is of independent interest in nonparametric estimation using deep neural networks.

3. IMPLEMENTATION

We now describe how to implement EPT and train the transport map T̂ with an i.i.d. sample {X_i}_{i=1}^n ⊂ ℝ^m from an unknown target distribution ν. The EPT map is trained via the forward Euler iteration (6)-(8) with a small step size s > 0. The resulting map is a composition of a sequence of residual maps, i.e., T̂_K ∘ T̂_{K−1} ∘ ⋯ ∘ T̂_0 for a large K. As implied by Theorem 4.1 in Section 4, each T_k, k = 0, ..., K, can be estimated with high accuracy by T̂_k = 1 + s v̂_k, where v̂_k(x) = −f″(R̂_φ(x))∇R̂_φ(x). Here R̂_φ is the density-ratio estimator defined in (14) below, based on particles {Y_i}_{i=1}^n ∼ q_k and the data {X_i}_{i=1}^n ∼ p. Therefore, according to the EPT map (9), the particles T̂(Ỹ_i) ≡ T̂_K ∘ T̂_{K−1} ∘ ⋯ ∘ T̂_0(Ỹ_i), i = 1, ..., n, serve as samples drawn from the target distribution ν, where the particles {Ỹ_i}_{i=1}^n ⊂ ℝ^m are sampled from a simple reference distribution µ. In many applications, high-dimensional complex data such as images and texts tend to have low-dimensional latent features. To learn generative models with latent low-dimensional structures, it is beneficial to have the option of first sampling particles {Z_i}_{i=1}^n from a low-dimensional reference distribution µ̃ ∈ P_2(ℝ^d) with d ≪ m. Then we apply T̂ to the particles Ỹ_i = G_θ(Z_i), i = 1, ..., n, where we introduce another deep neural network G_θ : ℝ^d → ℝ^m with parameter θ. We can estimate G_θ by fitting the pairs {(Z_i, Ỹ_i)}_{i=1}^n. We describe the EPT algorithm below.

• Outer loop for modeling low-dimensional latent structure (optional)
  - Sample {Z_i}_{i=1}^n ⊂ ℝ^d from a low-dimensional reference distribution µ̃ and let Ỹ_i = G_θ(Z_i), i = 1, 2, ..., n.
  - Inner loop for finding the push-forward map
    * If there are no outer loops, sample Ỹ_i ∼ µ, i = 1, ..., n.
    * Get v̂(x) = −f″(R̂_φ(x))∇R̂_φ(x) by solving (14) below with Y_i = Ỹ_i. Set T̂ = 1 + s v̂ with a small step size s.
    * Update the particles Ỹ_i = T̂(Ỹ_i), i = 1, ..., n.
  - End inner loop
  - If there are outer loops, update the parameter θ of G_θ(·) by solving min_θ Σ_{i=1}^n ‖G_θ(Z_i) − Ỹ_i‖²₂ / n.
• End outer loop
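To make the particle update concrete, here is a minimal one-dimensional sketch of the inner loop under strong simplifying assumptions (not the paper's implementation): the KL divergence f(c) = c log c is used, so v_k = −∇ log r_k = ∇ log p − ∇ log q_k, and the deep density-ratio estimator is replaced by an oracle Gaussian moment-matching fit of q_k, which is adequate only in this linear-Gaussian toy setting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D sketch: push particles from the reference N(0, 0.5^2) toward the
# target nu = N(2, 1). With f(c) = c*log(c) (KL), the velocity field is
# v_k = -grad log r_k = grad log p - grad log q_k. As a stand-in for the
# deep density-ratio estimator, q_k is approximated by a Gaussian fitted to
# the current particles by moment matching (an assumption of this sketch).
mu_p, sd_p = 2.0, 1.0                 # target mean and standard deviation
Y = rng.normal(0.0, 0.5, size=5000)   # initial particles from the reference
s = 0.1                               # Euler step size

for k in range(300):
    m, sd = Y.mean(), Y.std()                     # Gaussian fit of q_k
    v = (Y - m) / sd**2 - (Y - mu_p) / sd_p**2    # grad log p - grad log q_k
    Y = Y + s * v                                 # forward Euler map: 1 + s*v_k

print(Y.mean(), Y.std())   # both should approach the target's mean 2 and std 1
```

In this toy setting the update is linear in the particles, so the particle cloud stays Gaussian and its moments converge geometrically to those of the target; in the actual method the moment-matching oracle is replaced by the deep density-ratio estimator of Section 4.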

4. DEEP DENSITY-RATIO AND DENSITY-DIFFERENCE FITTING

The evaluation of the velocity fields depends on the dynamic estimation of a discrepancy between the push-forward distribution q_t and the target distribution p. Density-ratio and density-difference fitting with the Bregman score provides a unified framework for such discrepancy estimation without estimating each density separately (Gneiting & Raftery, 2007; Dawid, 2007; Sugiyama et al., 2012a;b; Kanamori & Sugiyama, 2014). Let r(x) = q(x)/p(x) be the density ratio between a given density q(x) and the target p(x). Let g : ℝ → ℝ be a differentiable and strictly convex function. The separable Bregman score with the base probability density p for measuring the discrepancy between r and a measurable function R : ℝ^m → ℝ is

B(r, R) = E_{X∼p}[g′(R(X))R(X) − g(R(X))] − E_{X∼q}[g′(R(X))].

Here we focus on the widely used least-squares density-ratio (LSDR) fitting with g(c) = (c − 1)² as a working example, i.e.,

B_LSDR(r, R) = E_{X∼p}[R(X)²] − 2E_{X∼q}[R(X)] + 1.   (12)

Other choices of g, such as g(c) = c log c − (c + 1) log(c + 1), which corresponds to estimating r via logistic regression (LR), and the scenario of density-difference fitting are presented in detail in Section B.3.1.
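For instance, substituting g(c) = (c − 1)² into the separable Bregman score recovers (12) directly:

```latex
\begin{align*}
B(r,R) &= \mathbb{E}_{X\sim p}\big[2(R-1)R - (R-1)^2\big] - \mathbb{E}_{X\sim q}\big[2(R-1)\big]
  && \text{(with } g(c)=(c-1)^2,\ g'(c)=2(c-1))\\
 &= \mathbb{E}_{X\sim p}\big[R^2 - 1\big] - 2\,\mathbb{E}_{X\sim q}[R] + 2
  && \text{(expand: } 2R^2-2R-R^2+2R-1 = R^2-1)\\
 &= \mathbb{E}_{X\sim p}[R^2] - 2\,\mathbb{E}_{X\sim q}[R] + 1 = B_{\mathrm{LSDR}}(r,R).
\end{align*}
```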

Gradient regularizer

The distributions of real data may have a low-dimensional structure, with their support concentrated on a low-dimensional manifold, which may cause the f-divergence to be ill-posed due to non-overlapping supports. To exploit such underlying low-dimensional structures and avoid ill-posedness, we derive a simple weighted gradient regularizer

(1/2) E_p[g″(R) ‖∇R‖²₂],   (13)

motivated by recent works on smoothing via noise injection (Sønderby et al., 2017; Arjovsky & Bottou, 2017). This serves as a regularizer for deep density-ratio fitting. For example, with g(c) = (c − 1)², the resulting gradient regularizer is E_p[‖∇R‖²₂], which recovers the well-known squared Sobolev semi-norm in nonparametric statistics. Gradient regularization stabilizes and improves the long-time performance of EPT. The detailed derivation is presented in Section B.3.2.

LSDR estimation with gradient regularizer. Let {X_i}_{i=1}^n and {Y_i}_{i=1}^n be two collections of i.i.d. data from the densities p(x) and q(x), respectively. Let H ≡ H_{D,W,S,B} be the set of ReLU neural networks R_φ with parameter φ, depth D, width W, size S, and ‖R_φ‖_∞ ≤ B. We combine the least-squares loss (12) with the gradient regularizer (13) as our objective function. The resulting gradient-regularized LSDR estimator of r = q/p is given by

R̂_φ ∈ argmin_{R_φ∈H} (1/n) Σ_{i=1}^n [R_φ(X_i)² − 2R_φ(Y_i)] + α (1/n) Σ_{i=1}^n ‖∇R_φ(X_i)‖²₂,   (14)

where α ≥ 0 is a regularization parameter.
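To illustrate the objective in (14) outside the deep-network setting, the same gradient-regularized LSDR criterion can be minimized in closed form over a linear-in-features model R(x) = w·φ(x). The polynomial feature map below is a hypothetical choice for a one-dimensional sanity check, not part of the proposed method:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    # hypothetical feature map phi(x) = [1, x, x^2] for a 1-D sanity check
    return np.stack([np.ones_like(x), x, x**2], axis=1)

def d_features(x):
    # derivative phi'(x) of the feature map, used by the gradient penalty
    return np.stack([np.zeros_like(x), np.ones_like(x), 2.0 * x], axis=1)

def lsdr_fit(xs_p, ys_q, alpha=0.1):
    """Gradient-regularized least-squares density-ratio fit of r = q/p.

    Minimizes mean R(X)^2 - 2 mean R(Y) + alpha * mean |R'(X)|^2 over
    linear models R(x) = w . phi(x); for this class the minimizer is
    available in closed form: w = (A + alpha * G)^{-1} b.
    """
    P, dP, Q = features(xs_p), d_features(xs_p), features(ys_q)
    A = P.T @ P / len(xs_p)      # empirical mean of phi(X) phi(X)^T, X ~ p
    G = dP.T @ dP / len(xs_p)    # empirical mean of phi'(X) phi'(X)^T
    b = Q.mean(axis=0)           # empirical mean of phi(Y), Y ~ q
    w = np.linalg.solve(A + alpha * G, b)
    return lambda x: features(np.atleast_1d(np.asarray(x, dtype=float))) @ w

# sanity check: when q = p, the true ratio is identically 1
xs = rng.normal(size=20000)    # sample from p = N(0, 1)
ys = rng.normal(size=20000)    # sample from q = N(0, 1)
R = lsdr_fit(xs, ys)
print(R(0.0)[0])   # should be close to 1
```

The deep estimator in (14) replaces the fixed feature map with a ReLU network and the closed-form solve with SGD, but the loss being minimized is the same.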

Estimation error bound

We first show that the density ratio r is identifiable through the objective function by proving that, at the population level, we can recover r by minimizing

B^α_LSDR(R) = B_LSDR(r, R) + α E_p[‖∇R‖²₂] + C,

where B_LSDR is defined in (12) and C = E_{X∼p}[r²(X)] − 1.

Lemma 4.1. For any α ≥ 0, we have r ∈ argmin_R B^α_LSDR(R). In addition, B^α_LSDR(R) ≥ 0 for any R with E_{X∼p}[R²(X)] < ∞, and B^α_LSDR(R) = 0 iff R(x) = r(x) = 1, (q, p)-a.e. x ∈ ℝ^m.

This identifiability result shows that the target density ratio is the unique minimizer of the population version of the empirical criterion in (14). It provides the basis for establishing the convergence result of deep nonparametric density-ratio estimation. Next, we bound the nonparametric estimation error ‖R̂_φ − r‖_{L²(ν)} under the assumptions that the support of ν is concentrated on a compact low-dimensional manifold and r is Lipschitz continuous. Let M ⊆ [−c, c]^m be a Riemannian manifold (Lee, 2010) with dimension d_M ≪ m, condition number 1/τ, volume V, and geodesic covering regularity R, and let M = O(d_M ln(mVR/τ)) ≪ m. Denote

M_ε = {x ∈ [−c, c]^m : inf{‖x − y‖₂ : y ∈ M} ≤ ε},  ε ∈ (0, 1).

Theorem 4.1. Suppose that the support of ν is contained in M_ε and that r is Lipschitz continuous with Lipschitz constant L. Set D = O(log n), W = O(n^{M/(2(2+M))}/log n), S = O(n^{(M−2)/(M+2)}/log⁴ n), and B = 2B̄ (with B̄ an upper bound on ‖r‖_∞). Then,

E_{{X_i,Y_i}_{i=1}^n}[ ‖R̂_φ − r‖²_{L²(ν)} ] ≤ C(B² + cLmM) n^{−2/(2+M)},

where C is a universal constant.

The error bound established in Theorem 4.1 for nonparametric deep density-ratio fitting is new. This result is of independent interest for nonparametric estimation with deep neural networks. The derived rate O(n^{−2/(2+M)}) is faster than the optimal rate O(n^{−2/(2+m)}) for nonparametric estimation of a Lipschitz target in ℝ^m (Stone, 1982; Schmidt-Hieber, 2020), as long as the intrinsic dimension M of the data is much smaller than the ambient dimension m.
Therefore, the proposed density-ratio estimators circumvent the "curse of dimensionality" if data is supported on a lower-dimensional manifold.

5. RELATED WORK

We discuss connections between EPT and existing related works. Existing generative models, such as VAEs, GANs and flow-based methods, parameterize a transport map with a neural network, say G, that solves

min_G D(G_#µ, ν),   (15)

where D(·, ·) is an integral probability discrepancy. The original GAN (Goodfellow et al., 2014), f-GAN (Nowozin et al., 2016) and WGAN (Arjovsky et al., 2017) solve the dual form of (15) by parameterizing the dual variable with another neural network, with D being the JS-divergence, the f-divergence and the 1-Wasserstein distance, respectively. Based on the fact that the 1-Wasserstein distance can be evaluated from samples via linear programming (Sriperumbudur et al., 2012), Liu et al. (2018) and Genevay et al. (2018) proposed training the primal form of WGAN via a two-stage method that solves the linear program. SWGAN (Deshpande et al., 2018) and MMDGAN (Li et al., 2017; Binkowski et al., 2018) use the sliced quadratic Wasserstein distance and the maximum mean discrepancy (MMD) as D, respectively. The vanilla VAE (Kingma & Welling, 2014) approximately solves the primal form of (15) with the KL-divergence loss under the framework of variational inference. Several authors have proposed methods that use optimal transport losses, such as various forms of Wasserstein distances between the distribution of learned latent codes and the prior distribution, as regularizers in VAEs to improve performance; these methods include WAE (Tolstikhin et al., 2018), Sliced WAE (Kolouri et al., 2019) and Sinkhorn AE (Patrini et al., 2019). Discrete-time flow-based methods minimize (15) with the KL-divergence loss (Rezende & Mohamed, 2015; Dinh et al., 2015; 2017; Kingma et al., 2016; Papamakarios et al., 2017; Kingma & Dhariwal, 2018). Grathwohl et al. (2019) proposed an ODE flow approach for fast training in such methods using the adjoint equation (Chen et al., 2018b).
SVGD (Liu, 2017) and the proposed EPT are both particle methods based on gradient flows in measure spaces. However, SVGD samples from an unnormalized density, while EPT focuses on generative learning, i.e., learning the distribution from samples. At the population level, projecting the velocity fields of EPT with the KL divergence onto a reproducing kernel Hilbert space recovers the velocity fields of SVGD; the proof is given in Appendix B.5. Score-based methods (Song & Ermon, 2019; 2020; Ho et al., 2020) are also particle methods, based on the unadjusted Langevin flow and deep score estimators. At the population level, the velocity fields of these score-based methods are random since they contain a Brownian motion term, while the velocity fields of EPT are deterministic. At the sample level, score-based methods need to learn a vector-valued deep score function, while EPT only needs to estimate density ratios, which are scalar functions.

6. EXPERIMENTS

The implementation details on numerical settings, network structures, SGD optimizers and hyper-parameters are given in the appendix. All experiments were performed using NVIDIA Tesla K80 GPUs. The PyTorch code of EPT is available at https://github.com/anonymous/EPT.

2D examples. We use EPT to learn 2D distributions adapted from Grathwohl et al. (2019) with multiple modes and density ridges. The first row in Figure 1 shows kernel density estimation (KDE) plots of 50k samples from the target distributions, including (from left to right) 8Gaussians, pinwheel, moons, checkerboard, 2spirals, and circles. The second and third rows show the KDE plots of samples learned via EPT with the f-divergence/Lebesgue norm (left six panels of the second/third row), and surface plots of the estimated density ratio/difference after 20k iterations of EPT with the f-divergence/Lebesgue norm (right six panels of the second/third row), respectively. Clearly, the samples generated via EPT are nearly indistinguishable from the target samples, and the estimated density-ratio/difference functions are approximately equal to 1/0, indicating that the learned distribution matches the target well. We further visualize the transport maps learned for 5squares and large4gaussians starting from 4squares and small4gaussians, respectively, using 200 particles connected with grey lines. As shown in the left two plots of Figure 2, the central square of 5squares is learned better with the gradient penalty, which is consistent with the estimated density ratios shown in the right two plots of Figure 2. For large4gaussians, the learned transport map exhibits some optimality under the quadratic Wasserstein distance, as suggested by the clear correspondence between the samples in the left two plots of Figure 2. We further compare EPT using the outer loop with generative models including WGAN, SNGAN and MMDGAN.
We considered different f-divergences, including Pearson's χ², KL, JS and logD (Gao et al., 2019), and different deep density-ratio fitting methods (LSDR and LR). Table 1 shows the FID (Heusel et al., 2017), evaluated with five bootstrap samplings, of EPT with the four divergences on CIFAR10. We can see that EPT using ReLU ResNets, without batch normalization or spectral normalization, attains FID scores comparable with (and usually better than) the state-of-the-art generative models. Comparisons of real samples and learned samples on MNIST, CIFAR10 and CelebA are shown in Figure 4, where the high-fidelity learned samples are visually comparable to the real samples.

7. CONCLUSION

EPT is a new approach for generative learning via training a transport map that pushes forward a reference to the target. It uses the forward Euler method to solve the McKean-Vlasov equation, which results from linearizing the Monge-Ampère equation characterizing the optimal transport map. The EPT map is a composition of a sequence of simple residual maps. The key task in training is the estimation of the density ratios that completely determine the residual maps. We estimate the density ratios based on the Bregman divergence with a gradient penalty using deep density-ratio fitting. We establish bounds on the approximation errors due to linearization, discretization, and density-ratio estimation. These results provide strong theoretical guarantees for the proposed method and ensure that the EPT map converges quickly to the target. We also show that the proposed density-ratio (difference) estimators do not suffer from the "curse of dimensionality" if the data is supported on a lower-dimensional manifold. This is an interesting result in itself, since density-ratio estimation is a basic problem in machine learning and statistics. Because EPT is easy to train, computationally stable, and enjoys strong theoretical guarantees, we expect it to be a useful addition to the methods for generative learning. The proposed EPT method is motivated by the Monge-Ampère equation that characterizes the optimal transport map. However, while the EPT map pushes forward a reference distribution to the target, it is not an estimate of the optimal transport map itself. How to consistently estimate the Monge-Ampère optimal map remains a challenging open problem.
Algorithm 1: EPTv1: Euler particle transport
Input: K ∈ ℕ*, s > 0, α > 0  // maximum loop count, step size, regularization coefficient
X_i ∼ ν, Ỹ_i^0 ∼ µ, i = 1, 2, ..., n  // real samples, initial particles
k ← 0
while k < K do
    R̂_φ^k ∈ argmin_{R_φ} (1/n) Σ_{i=1}^n [R_φ(X_i)² + α‖∇R_φ(X_i)‖²₂ − 2R_φ(Ỹ_i^k)] via SGD  // estimate the density ratio
    v̂_k(x) = −f″(R̂_φ^k(x))∇R̂_φ^k(x)  // approximate the velocity field
    T̂_k = 1 + s v̂_k  // define the forward Euler map
    Ỹ_i^{k+1} = T̂_k(Ỹ_i^k), i = 1, 2, ..., n  // update particles
    k ← k + 1
end
Output: Ỹ_i^K ∼ µ̂_K, i = 1, 2, ..., n  // transported particles

Evaluation metrics. The Fréchet Inception Distance (FID) (Heusel et al., 2017) computes the Wasserstein distance W_2 between the summary statistics (mean µ and covariance Σ) of real samples x and generated samples g in the feature space of the Inception-v3 model (Szegedy et al., 2016), i.e.,

FID = ‖µ_x − µ_g‖²₂ + Tr(Σ_x + Σ_g − 2(Σ_x Σ_g)^{1/2}).

Here, FID is reported with the TensorFlow implementation, and lower FID is better.

Network architectures and hyper-parameter settings. We employed the ResNet architectures used by Gao et al. (2019) in our EPT algorithm. In particular, batch normalization (Ioffe & Szegedy, 2015) and spectral normalization (Miyato et al., 2018) were omitted for EPT-LSDR-χ². To train the neural networks, we used the RMSProp optimizer with learning rate 0.0001 and batch size 100. The inputs {Z_i}_{i=1}^n in EPTv2 (Algorithm 2) were generated from a 128-dimensional standard normal distribution on all three datasets. Hyper-parameters are listed in Table A3, where IL denotes the number of inner loops in each outer loop. Even without outer loops, EPTv1 (Algorithm 1) can generate images on MNIST and CIFAR10 as well, by making use of a large set of particles; Table A4 shows the hyper-parameters. We illustrate the convergence of the learning dynamics of EPTv1 on the synthetic datasets pinwheel, checkerboard and 2spirals.
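The FID formula above is the Fréchet distance between two Gaussians applied to Inception feature statistics. A minimal NumPy sketch (the helper `frechet_distance` is illustrative, not the paper's code) computes Tr((Σ_xΣ_g)^{1/2}) from the eigenvalues of Σ_xΣ_g, which are real and nonnegative for symmetric positive semi-definite covariances:

```python
import numpy as np

def frechet_distance(mu1, S1, mu2, S2):
    """Frechet distance between Gaussians N(mu1, S1) and N(mu2, S2).

    FID applies this formula to Inception-v3 feature statistics; here it is
    computed directly from the given moments. Tr((S1 S2)^{1/2}) is obtained
    from the eigenvalues of S1 @ S2 (real and nonnegative when S1, S2 are
    symmetric positive semi-definite).
    """
    diff = np.asarray(mu1) - np.asarray(mu2)
    eigvals = np.linalg.eigvals(np.asarray(S1) @ np.asarray(S2))
    covmean_trace = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(S1) + np.trace(S2) - 2.0 * covmean_trace)

# identical statistics give distance 0; shifting the mean by 1 gives 1
print(frechet_distance([0.0], [[1.0]], [1.0], [[1.0]]))  # prints 1.0
```

Practical FID implementations estimate (µ, Σ) from large sample sets of Inception features and typically use a matrix square root (e.g., scipy.linalg.sqrtm) with a small jitter for numerical stability.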
As shown in Figure 5, on the three test datasets, the dynamics of the estimated LSDR fitting loss in (14) with α = 0 and of the estimated gradient norm E_{X∼q_k}[‖∇R̂_φ(X)‖₂] demonstrate that the estimated LSDR loss converges to the theoretical value −1.

Algorithm 2: EPTv2: Euler particle transport with latent structure
Input: IL, OL ∈ ℕ*, s > 0, α > 0  // maximum inner loop count, maximum outer loop count, step size, regularization coefficient
X_i ∼ ν, i = 1, 2, ..., n  // real samples
G_θ^0 ← G_θ^init  // initialize the transport map
j ← 0
while j < OL do  /* outer loop */
    Z_i^j ∼ µ̃, i = 1, 2, ..., n  // latent particles
    Ỹ_i^0 = G_θ^j(Z_i^j), i = 1, 2, ..., n  // intermediate particles
    k ← 0
    while k < IL do  /* inner loop */
        R̂_φ^k ∈ argmin_{R_φ} (1/n) Σ_{i=1}^n [R_φ(X_i)² + α‖∇R_φ(X_i)‖²₂ − 2R_φ(Ỹ_i^k)] via SGD  // estimate the density ratio
        v̂_k(x) = −f″(R̂_φ^k(x))∇R̂_φ^k(x)  // approximate the velocity field
        T̂_k = 1 + s v̂_k  // define the forward Euler map
        Ỹ_i^{k+1} = T̂_k(Ỹ_i^k), i = 1, 2, ..., n  // update particles
        k ← k + 1
    end
    G_θ^{j+1} ∈ argmin_{G_θ} (1/n) Σ_{i=1}^n ‖G_θ(Z_i^j) − Ỹ_i^{IL}‖²₂ via SGD  // fit the transport map
    j ← j + 1
end
Output: G_θ^{OL} : ℝ^d → ℝ^m

A.3 LEARNING AND INFERENCE

The learning process of EPT performs particle evolution by solving the McKean-Vlasov equation using forward Euler iterations. The iterations rely on the estimation of the density ratios (differences) between the push-forward distributions and the target distribution. To make inference more efficient, we propose EPTv2 based on EPTv1. EPTv2 takes advantage of a neural network to fit the push-forward map. Inference with EPTv2 is fast since the push-forward map is parameterized as a neural network and only forward propagation is involved. These aspects distinguish EPTv2 from score-based generative models (Song & Ermon, 2019; 2020), which simulate Langevin dynamics to generate samples.

Proposition B.1. (i) The following continuity equation holds in the sense of distributions:

∂µ_t/∂t = −∇·(µ_t v_t) in ℝ_+ × ℝ^m, with µ_0 = µ.   (B-8)

(ii) Energy decay along the gradient flow:

(d/dt) L[µ_t] = −‖v_t‖²_{L²(µ_t, ℝ^m)}, a.e. t ∈ ℝ_+.

In addition, W_2(µ_t, ν) = O(exp(−λt)) if L[µ] is λ-geodesically convex with λ > 0.¹
(iii) Conversely, if {µ_t}_t is the solution of the continuity equation (B-8) in (i) with v_t(x) specified by (B-9) in (ii), then {µ_t}_t is a gradient flow of L[·].

Remark B.1. In part (ii) of Proposition B.1, for general f-divergences, we assume the functional L to be λ-geodesically convex for the convergence of µ_t to the target ν in the quadratic Wasserstein distance. However, for the KL divergence, the convergence can be guaranteed if ν satisfies the log-Sobolev inequality (Otto & Villani, 2000). In addition, distributions that are strongly log-concave outside a bounded region, but not necessarily log-concave inside the region, satisfy the log-Sobolev inequality; see, for example, Holley & Stroock (1987). Here the functional L can even be nonconvex; an example is densities with a double-well potential.

Remark B.2. Equation (8.48) in Proposition 8.4.6 of Ambrosio et al.
(2008) shows the (local) connection between the velocity v_t of the gradient flow µ_t and the optimal transport along µ_t: let T_{µ_t}^{µ_{t+h}} be the optimal transport from µ_t to µ_{t+h}; then T_{µ_t}^{µ_{t+h}} = I + h v_t + o(h) in L^p. So, locally, I + h v_t approximates the optimal transport map from µ_t to µ_{t+h} on [t, t+h] for a small h.

Proof. (i) The continuity equation (B-8) follows directly from the definition of the gradient flow; see page 281 in Ambrosio et al. (2008). (ii) The first equality follows from the chain rule and integration by parts; see Theorem 24.2 of Villani (2008). The second one, on linear convergence, follows from Theorem 24.7 of Villani (2008), where the assumption on λ in equation (24.6) is equivalent to the λ-geodesic convexity assumption here. (iii) Similar to (i); see page 281 in Ambrosio et al. (2008).

Theorem B.1. (i) Representation of the velocity fields: if the density q_t of µ_t is differentiable, then

v_t(x) = −∇F′(q_t(x)), µ_t-a.e. x ∈ ℝ^m.   (B-9)

(ii) If we let Φ be time-dependent in (B-4)-(B-5), i.e., Φ_t, then the linearized Monge-Ampère equations (B-4)-(B-5) are the same as the continuity equation (B-8), with Φ_t(x) = −F′(q_t(x)).

Proof. (i) Recall that L[µ] is a functional on P_2^a(ℝ^m). By classical results in the calculus of variations (Gelfand & Fomin, 2000),

∂L[q]/∂q (x) = (d/dt) L[q + tg] |_{t=0} = F′(q(x)),

where ∂L[q]/∂q denotes the first variation of L[·] at q, and q, g are the densities of µ and of an arbitrary ξ ∈ P_2^a(ℝ^m), respectively. Let L_F(z) = zF′(z) − F(z) : ℝ → ℝ. Some algebra shows that ∇L_F(q(x)) = q(x)∇F′(q(x)).
Then it follows from Theorem 10.4.6 in Ambrosio et al. (2008) that $\nabla F'(q(x)) = \partial^o L(\mu)$,

¹We say that L is λ-geodesically convex if there exists a constant λ > 0 such that for every $\mu_1, \mu_2 \in \mathcal{P}^a_2(\mathbb{R}^m)$ there exists a constant-speed geodesic $\gamma: [0,1] \to \mathcal{P}^a_2(\mathbb{R}^m)$ with $\gamma_0 = \mu_1$, $\gamma_1 = \mu_2$ and
$$L(\gamma_s) \le (1-s)L(\mu_1) + sL(\mu_2) - \frac{\lambda}{2}s(1-s)\,d^2(\mu_1,\mu_2), \quad \forall s \in [0,1],$$
where d is a metric on $\mathcal{P}^a_2(\mathbb{R}^m)$ such as the quadratic Wasserstein distance.

Proposition B.2. Suppose that the velocity fields $v_t$ are Lipschitz continuous with respect to $(x, \mu_t)$; that is, there exists a finite constant $L_v > 0$ such that
$$\|v_t(x) - v_{t'}(x')\| \le L_v\big[\|x - x'\| + W_2(\mu_t, \mu_{t'})\big], \quad t, t' > 0,\ x, x' \in \mathbb{R}^m. \tag{B-10}$$
Then for any finite T > 0, the bound (11) on the discretization error holds: $\sup_{t\in[0,T]} W_2(\mu_t, \mu^s_t) = O(s)$.

Remark B.3. If we take f(x) = (x-1)²/2 in Lemma B.1, then the velocity fields are $v_t(x) = -\nabla r_t(x)$, where $r_t(x) = q_t(x)/p(x)$. In the proof of Theorem B.1, part (ii), it is shown that $q_t$ satisfies $dq_t/dt = -\nabla\cdot(q_t\nabla\Phi_t)$. Thus, for this simple f-divergence, verifying the Lipschitz condition (B-10) amounts to verifying that $\nabla r_t(x)$ is Lipschitz in the sense of (B-10).

Proof. Without loss of generality, let K = T/s > 1 be an integer. Recall that for $t \in [ks, (k+1)s)$, $\mu^s_t = (T^{k,s}_t)_\#\mu_k$ is the piecewise interpolation between $\mu_k$ and $\mu_{k+1}$, where $T^{k,s}_t = \mathbf{1} + (t - ks)v_k$, $\mu_k$ is defined in (16)-(18) with $v_k = v_{ks}$, i.e., the continuous velocity in (B-9) at time ks, k = 0, ..., K-1, and $\mu_0 = \mu$. Under assumption (B-10), one can first show, in a way similar to the proof of Lemma 10 in Arbel et al. (2019), that
$$W_2(\mu_{ks}, \mu_k) = O(s). \tag{B-11}$$
Let Γ be the optimal coupling between $\mu_k$ and $\mu_{ks}$, and $(X, Y) \sim \Gamma$. Let $X_t = T^{k,s}_t(X)$ and let $Y_t$ be the solution of (4) with $Y_{ks} = Y$ and $t \in [ks, (k+1)s)$. Then $X_t \sim \mu^s_t$, $Y_t \sim \mu_t$, and $Y_t = Y + \int_{ks}^{t} v_{\tilde t}(Y_{\tilde t})\,d\tilde t$.
It follows that
$$W_2^2(\mu_t, \mu_{ks}) \le E[\|Y_t - Y\|_2^2] = E\Big[\Big\|\int_{ks}^{t} v_{\tilde t}(Y_{\tilde t})\,d\tilde t\Big\|_2^2\Big] \le E\Big[\Big(\int_{ks}^{t} \|v_{\tilde t}(Y_{\tilde t})\|_2\,d\tilde t\Big)^2\Big] \le O(s^2), \tag{B-12}$$
where the first inequality follows from the definition of $W_2$, and the last inequality follows from the uniform boundedness assumption on $v_t$. Similarly,
$$W_2^2(\mu_k, \mu^s_t) \le E[\|X - X_t\|_2^2] = E[\|(t-ks)v_k(X)\|_2^2] \le O(s^2). \tag{B-13}$$
Then,
$$W_2(\mu_t, \mu^s_t) \le W_2(\mu_t, \mu_{ks}) + W_2(\mu_{ks}, \mu_k) + W_2(\mu_k, \mu^s_t) \le O(s),$$
where the first inequality follows from the triangle inequality (see, for example, Lemma 5.3 in Santambrogio, 2015), and the second follows from (B-11)-(B-13).

Recall the statement of the theorem: suppose the topological parameters of $\mathcal{H}_{D,W,S,B}$ satisfy $D = O(\log n)$, $W = O(n^{M/(2(2+M))}/\log n)$, $S = O(n^{(M-2)/(M+2)}/\log^4 n)$, and the bound parameter is taken as 2B. Then
$$E_{\{X_i,Y_i\}_1^n}\big[\|\hat R_\phi - r\|^2_{L^2(\nu)}\big] \le C(B^2 + cLmM)\,n^{-2/(2+M)},$$
where C is a universal constant.

Proof. We use B(R) to denote $B^0_{LSDR}(R) - C$ for simplicity, i.e.,
$$B(R) = E_{X\sim p}[R(X)^2] - 2E_{X\sim q}[R(X)]. \tag{B-14}$$
Rewrite (20) with α = 0 as
$$\hat R_\phi \in \arg\min_{R_\phi\in\mathcal{H}_{D,W,S,B}} \hat B(R_\phi) = \frac{1}{n}\sum_{i=1}^{n}\big(R_\phi(X_i)^2 - 2R_\phi(Y_i)\big). \tag{B-15}$$
By Lemma B.2 and Fermat's rule (Clarke, 1990), we know $0 \in \partial B(r)$. Then, for any R, a direct calculation yields
$$\|R - r\|^2_{L^2(\nu)} = B(R) - B(r) - \langle \partial B(r), R - r\rangle = B(R) - B(r). \tag{B-16}$$
For any $\bar R_\phi \in \mathcal{H}_{D,W,S,B}$ we have
$$\|\hat R_\phi - r\|^2_{L^2(\nu)} = B(\hat R_\phi) - B(r) = B(\hat R_\phi) - \hat B(\hat R_\phi) + \hat B(\hat R_\phi) - \hat B(\bar R_\phi) + \hat B(\bar R_\phi) - B(\bar R_\phi) + B(\bar R_\phi) - B(r)$$
$$\le 2\sup_{R\in\mathcal{H}_{D,W,S,B}} |\hat B(R) - B(R)| + \|\bar R_\phi - r\|^2_{L^2(\nu)}, \tag{B-17}$$
where the inequality uses the definition of $\hat R_\phi$, $\bar R_\phi$, and (B-16). We prove the theorem by upper bounding the expected value of the right-hand side of (B-17). To this end, we need the following auxiliary results (B-18)-(B-20):
$$E_{\{Z_i\}_i^n}\Big[\sup_R |\hat B(R) - B(R)|\Big] \le 4C_1(2B+1)G(\mathcal{H}), \tag{B-18}$$
where $G(\mathcal{H}) = E_{\{Z_i,\epsilon_i\}_i^n}\sup_{R\in\mathcal{H}_{D,W,S,B}} \big|\frac{1}{n}\sum_{i=1}^n \epsilon_i R(Z_i)\big|$ is the Gaussian complexity of $\mathcal{H}_{D,W,S,B}$ (Bartlett & Mendelson, 2002).

Proof of (B-18). Let $g(c) = c^2 - c$ and, for $z = (x,y) \in \mathbb{R}^m\times\mathbb{R}^m$, define $\tilde R(z) = R(x)^2 - R(y)$. Denote $Z = (X,Y)$ and $Z_i = (X_i, Y_i)$, i = 1, ..., n, with X, X_i i.i.d.
∼ p and $Y, Y_i$ i.i.d. ∼ q. Let $\tilde Z_i$ be an i.i.d. copy of $Z_i$, and let $\sigma_i$ ($\epsilon_i$) be i.i.d. Rademacher (standard normal) variables independent of $Z_i$ and $\tilde Z_i$. Then $B(R) = E_Z[\tilde R(Z)] = \frac{1}{n}\sum_i E_{\tilde Z_i}[\tilde R(\tilde Z_i)]$ and $\hat B(R) = \frac{1}{n}\sum_{i=1}^n \tilde R(Z_i)$. Denote by $\mathcal{R}(\mathcal{H}) = \frac{1}{n}E_{\{Z_i,\sigma_i\}_i^n}\big[\sup_{R\in\mathcal{H}_{D,W,S,B}} |\sum_{i=1}^n \sigma_i R(Z_i)|\big]$ the Rademacher complexity of $\mathcal{H}_{D,W,S,B}$ (Bartlett & Mendelson, 2002). Then,
$$E_{\{Z_i\}_i^n}\Big[\sup_R |\hat B(R) - B(R)|\Big] = \frac{1}{n}E_{\{Z_i\}_i^n}\Big[\sup_R \Big|\sum_{i=1}^n \big(E_{\tilde Z_i}[\tilde R(\tilde Z_i)] - \tilde R(Z_i)\big)\Big|\Big] \le \frac{1}{n}E_{\{Z_i,\tilde Z_i\}_i^n}\Big[\sup_R \Big|\sum_{i=1}^n \big(\tilde R(\tilde Z_i) - \tilde R(Z_i)\big)\Big|\Big]$$
$$= \frac{1}{n}E_{\{Z_i,\tilde Z_i,\sigma_i\}_i^n}\Big[\sup_R \Big|\sum_{i=1}^n \sigma_i\big(\tilde R(\tilde Z_i) - \tilde R(Z_i)\big)\Big|\Big] \le \frac{1}{n}E_{\{Z_i,\sigma_i\}_i^n}\Big[\sup_R \Big|\sum_{i=1}^n \sigma_i\tilde R(Z_i)\Big|\Big] + \frac{1}{n}E_{\{\tilde Z_i,\sigma_i\}_i^n}\Big[\sup_R \Big|\sum_{i=1}^n \sigma_i\tilde R(\tilde Z_i)\Big|\Big]$$
$$= 2\mathcal{R}(g\circ\mathcal{H}) \le 4(2B+1)\mathcal{R}(\mathcal{H}) \le 4C_1(2B+1)G(\mathcal{H}),$$
where the first inequality follows from Jensen's inequality, the second equality holds since $\sigma_i(\tilde R(\tilde Z_i) - \tilde R(Z_i))$ and $\tilde R(\tilde Z_i) - \tilde R(Z_i)$ have the same distribution, the next equality holds since the two symmetrized terms have the same distribution, and the last two inequalities follow from the Lipschitz contraction property (the Lipschitz constant of g on $\mathcal{H}_{D,W,S,B}$ is bounded by 2B+1) and the relationship between Gaussian and Rademacher complexities; see Theorem 12 and Lemma 4 in Bartlett & Mendelson (2002), respectively.
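The uniform deviation $\sup_R |\hat B(R) - B(R)|$ that (B-18) controls shrinks at the $n^{-1/2}$ complexity rate. For a toy finite class this can be checked by Monte Carlo; the cosine class and the choice p = q = N(0,1) below are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)

def sup_dev(n, trials=200):
    # Monte Carlo estimate of E[sup_R |B_hat(R) - B(R)|] over the class
    # {R_j(x) = cos(j x)}_{j=1..5}, with p = q = N(0,1); closed forms:
    # E[cos(jX)] = exp(-j^2/2), E[cos(jX)^2] = (1 + exp(-2 j^2)) / 2
    devs = []
    for _ in range(trials):
        X, Y = rng.normal(size=n), rng.normal(size=n)
        dev = max(
            abs(np.mean(np.cos(j * X) ** 2) - 2 * np.mean(np.cos(j * Y))
                - (0.5 * (1 + np.exp(-2 * j ** 2)) - 2 * np.exp(-j ** 2 / 2)))
            for j in range(1, 6)
        )
        devs.append(dev)
    return float(np.mean(devs))

ratio = sup_dev(250) / sup_dev(1000)
print(ratio)  # ≈ 2: quadrupling n halves the uniform deviation, an O(n^{-1/2}) rate
```

The same scaling holds for any fixed function class with bounded variance; the deep network class in the theorem replaces the factor 5 with the Gaussian complexity term.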

To bound $G(\mathcal{H})$, write
$$G(\mathcal{H}) = E_{\{Z_i\}_{i=1}^n}\Big[E_{\{\epsilon_i\}}\Big[\sup_{R\in\mathcal{H}_{D,W,S,B}}\Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i R(Z_i)\Big| \,\Big|\, \{Z_i\}_{i=1}^n\Big]\Big].$$
Conditioning on $\{Z_i\}_{i=1}^n$, for all $R, \bar R \in \mathcal{H}_{D,W,S,B}$ it is easy to check that
$$\mathbb{V}\Big[\frac{1}{n}\sum_{i=1}^n \epsilon_i\big(R(Z_i) - \bar R(Z_i)\big)\Big]^{1/2} = \frac{d_{\mathcal{H}^2}(R,\bar R)}{\sqrt{n}}, \quad\text{where } d_{\mathcal{H}^2}(R,\bar R) = \frac{1}{\sqrt{n}}\Big(\sum_{i=1}^n \big(R(Z_i) - \bar R(Z_i)\big)^2\Big)^{1/2}.$$

By Lemma B.1, the vector fields corresponding to the Lebesgue norm $\frac{1}{2}\|\mu - \nu\|^2_{L^2(\mathbb{R}^m)} = \frac{1}{2}\int_{\mathbb{R}^m} |q(x) - p(x)|^2\,dx$ are $v_t = \nabla p(x) - \nabla q_t(x)$. Next, we show that the vector field $v^{\mathrm{mmd}}_t$ is exactly the projection of $v_t$ onto the reproducing kernel Hilbert space $\mathcal{H}^m = \mathcal{H}^{\otimes m}$. Indeed,
$$v^{\mathrm{mmd}}_t(x) = \int \nabla_x K(x,z)\,d\nu(z) - \int \nabla_x K(x,z)\,d\mu_t(z) = \int \nabla_x K(x,z)p(z)\,dz - \int \nabla_x K(x,z)q_t(z)\,dz.$$
By the definition of the reproducing kernel,
$$p(x) = \langle p(\cdot), K(x,\cdot)\rangle_{\mathcal{H}} = \int K(x,z)p(z)\,dz \quad\text{and}\quad q_t(x) = \langle q_t(\cdot), K(x,\cdot)\rangle_{\mathcal{H}} = \int K(x,z)q_t(z)\,dz.$$
Hence,
$$v_t(x) = \nabla p(x) - \nabla q_t(x) = \int \nabla_x K(x,z)\big(p(z) - q_t(z)\big)\,dz = v^{\mathrm{mmd}}_t(x).$$
This completes the proof.
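The projected velocity field $v^{\mathrm{mmd}}_t$ can be computed from samples alone: it is the difference of two kernel-gradient averages, one over target samples and one over current particles. A minimal 1D numpy sketch with an RBF kernel (bandwidth, step size and particle count are illustrative choices, not from the paper):

```python
import numpy as np

def mmd_velocity(x, target, current, h=2.0):
    # v_mmd(x) = E_{z~nu}[grad_x K(x,z)] - E_{z~mu_t}[grad_x K(x,z)],
    # with RBF kernel K(x,z) = exp(-(x-z)^2 / (2h))
    def mean_grad_k(z):
        d = x[:, None] - z[None, :]                      # d[i,k] = x_i - z_k
        return (-d / h * np.exp(-d**2 / (2 * h))).mean(axis=1)
    return mean_grad_k(target) - mean_grad_k(current)

rng = np.random.default_rng(4)
cur = rng.normal(3.0, 1.0, 300)   # particles from the reference
tgt = rng.normal(0.0, 1.0, 300)   # samples from the target
for _ in range(1500):
    cur = cur + 0.5 * mmd_velocity(cur, tgt, cur)
print(cur.mean())                 # drifts from 3 toward the target mean 0
```

With this construction the particle update needs only kernel evaluations, no density-ratio estimator, which is the sense in which MMD flow is the kernel-projected special case.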

B.5 PROOF OF THE RELATION BETWEEN EPT AND SVGD

Proof. Let f(u) = u log u in (5). With this f, the velocity fields are $v_t = -f''(r_t)\nabla r_t = -\frac{\nabla r_t(x)}{r_t(x)}$. Let g be in a Stein class associated with $q_t$. Then
$$\langle v_t, g\rangle_{\mathcal{H}(q_t)} = -\int g(x)^T\frac{\nabla r_t(x)}{r_t(x)}\,q_t(x)\,dx = -\int g(x)^T\nabla\log r_t(x)\,q_t(x)\,dx$$
$$= -E_{X\sim q_t}\big[g(X)^T\nabla\log q_t(X) - g(X)^T\nabla\log p(X)\big]$$
$$= -E_{X\sim q_t}\big[g(X)^T\nabla\log q_t(X) + \nabla\cdot g(X)\big] + E_{X\sim q_t}\big[g(X)^T\nabla\log p(X) + \nabla\cdot g(X)\big]$$
$$= -E_{X\sim q_t}[\mathcal{T}_{q_t}g] + E_{X\sim q_t}[\mathcal{T}_p g] = E_{X\sim q_t}[\mathcal{T}_p g],$$
where the last equality follows from restricting g to a Stein class associated with $q_t$, i.e., $E_{X\sim q_t}[\mathcal{T}_{q_t}g] = 0$. This is exactly the velocity field of SVGD (Liu, 2017).
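The Stein velocity above is straightforward to simulate. A minimal numpy sketch with an RBF kernel and a 1D Gaussian target (kernel bandwidth, step size and particle count are illustrative choices):

```python
import numpy as np

def svgd_velocity(x, grad_log_p, h=1.0):
    # E_{X~q_t}[K(X, .) grad log p(X) + grad_X K(X, .)], RBF kernel K
    d = x[:, None] - x[None, :]            # d[j, i] = x_j - x_i
    K = np.exp(-d**2 / (2 * h))
    grad_K = -d / h * K                    # derivative of K(x_j, x_i) in x_j
    return (K * grad_log_p(x)[:, None] + grad_K).mean(axis=0)

rng = np.random.default_rng(1)
x = rng.normal(4.0, 0.5, 300)              # initial particles
for _ in range(500):
    x = x + 0.1 * svgd_velocity(x, lambda z: -z)   # target p = N(0,1)
print(x.mean(), x.std())
```

The attraction term drives particles toward high-density regions of p, while the kernel-gradient term keeps them spread out, so the empirical distribution approaches N(0,1) rather than collapsing to its mode.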



Theorem 4.1. Assume supp(r) = $\mathcal{M}$ and that r(x) satisfies |r(x)| ≤ B for a finite constant B > 0 and is Lipschitz continuous with Lipschitz constant L. Suppose the topological parameters of $\mathcal{H}_{D,W,S,B}$ in (14) with α = 0 satisfy $D = O(\log n)$, $W = O(n^{M/(2(2+M))}/\log n)$, $S = O(n^{(M-2)/(M+2)}/\log^4 n)$, and the bound parameter is taken as 2B. Then $E_{\{X_i,Y_i\}_1^n}[\|\hat R_\phi - r\|^2_{L^2(\nu)}] \le C(B^2 + cLmM)\,n^{-2/(2+M)}$, where C is a universal constant.

likelihood training, Chen et al. (2018a) and Zhang et al. (2018) considered continuous-time flows. Chen et al. (2018a) proposed a gradient flow in measure spaces in the framework of variational inference and discretized it with the implicit movement minimizing scheme (De Giorgi, 1993; Jordan et al., 1998). Zhang et al. (2018) considered gradient flows in measure spaces with time-invariant velocity fields. CFGGAN (Johnson & Zhang, 2018), derived from the perspective of optimization in function space, is a special form of EPT with L[·] taken as the KL divergence. SW flow (Liutkus et al., 2019) and MMD flow (Arbel et al., 2019) are gradient flows in measure spaces. MMD flow can be recovered from EPT by first choosing L[·] as the Lebesgue norm and then projecting the corresponding velocity fields onto reproducing kernel Hilbert spaces; see Appendix B.4 for a proof. However, neither SW flow nor MMD flow can model hidden low-dimensional structure with the particle sampling procedure.

Figure 1: KDE plots of the target samples (first row), KDE plots of the learned samples via EPT with f-divergence / Lebesgue norm (left six panels of the second/third row), and surface plots of the estimated density ratio/difference after 20k iterations of EPT with f-divergence / Lebesgue norm (right six panels of the second/third row).

Figure 2: Learned transport maps (left two panels) and estimated density ratios (right two panels) when learning 5squares from 4squares and large4gaussians from small4gaussians.

Results on Benchmark Image Data. We show the performance of EPT on the benchmark datasets MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky & Hinton, 2009) and CelebA (Liu et al., 2015), using ReLU ResNets without batch normalization or spectral normalization. The particle evolutions on MNIST and CIFAR10 without the outer loop are shown in Figure 3. Clearly, EPT can transport samples from a multivariate normal distribution to a target distribution.

Figure 3: Particle evolution of EPT on MNIST and CIFAR10.

Figure 4: Visual comparisons between real images (left 3 panels) and generated images (right 3 panels) by EPT-LSDR-χ 2 on MNIST, CIFAR10 and CelebA.

Figure 5: The numerical convergence of EPTv1 on simulated datasets. First row: LSDR fitting loss (14) with α = 0 vs. iterations on pinwheel, checkerboard and 2spirals. Second row: estimates of the gradient norm $E_{X\sim q_k}[\|\nabla R_\phi(X)\|^2]$ vs. iterations on pinwheel, checkerboard and 2spirals.

Assume supp(r) = $\mathcal{M}$ and r(x) is Lipschitz continuous with bound B and Lipschitz constant L. Suppose the topological parameters of $\mathcal{H}_{D,W,S,B}$ in (20) with α = 0 satisfy $D = O(\log n)$, $W = O(n^{M/(2(2+M))}/\log n)$ and $S = O(n^{(M-2)/(M+2)}/\log^4 n)$.

is simply the gradient of the density ratio. Other types of velocity fields can be obtained by using different energy functionals, such as the Lebesgue norm of the density difference, i.e., $L[\mu_t] = \int_{\mathbb{R}^m} |q_t(x) - p(x)|^2\,dx$; see Section B.2 for details.
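As an illustration of how such a velocity field moves particles, here is a minimal forward-Euler sketch in 1D, assuming Gaussian closed forms for the scores (the KL-type velocity ∇log p − ∇log q_t stands in for the paper's estimated density ratio):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, 2000)        # particles from the reference N(3, 1)
s = 0.1                               # Euler step size
for _ in range(100):
    m = x.mean()                      # current distribution is (approx.) N(m, 1)
    v = -x + (x - m)                  # grad log p - grad log q_t for target N(0, 1)
    x = x + s * v                     # forward Euler update
print(x.mean(), x.std())              # mean contracts toward the target mean 0
```

Here the closed-form velocity reduces to a constant shift per step, so the particle mean contracts geometrically while the spread is preserved; in the actual method the shift is replaced by the gradient of an estimated ratio network.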

Mean (standard deviation) of FID scores on CIFAR10. The FID score of NCSN is reported in Song & Ermon (2019), and the results in the right table are adapted from Arbel et al. (2018).

Hyper-parameters in EPT with outer loops on real image datasets.

Hyper-parameters in EPT without outer loops on real image datasets.

Observing the diameter of $\mathcal{H}_{D,W,S,B}$ under $d_{\mathcal{H}^2}$, the first inequality follows from the chaining bound (Theorem 8.1.3 in Vershynin, 2018), the second inequality holds since $d_{\mathcal{H}^2} \le d_{\mathcal{H}^\infty}$, and the third inequality uses the relationship between the metric entropy and the VC-dimension of the ReLU networks $\mathcal{H}_{D,W,S,B}$ (Anthony & Bartlett, 2009).

For the MMD functional, the first variation is
$$\frac{\partial L[\mu]}{\partial\mu}(x) = \int K(x,z)\,d\mu(z) - \int K(x,z)\,d\nu(z).$$

APPENDIX

In the appendix, we provide the implementation details of the numerical settings, network structures, SGD optimizers, and hyper-parameters used in the paper. We show the numerical convergence of EPT on simulated datasets and compare the learning and inference of EPT with those of other generative models. We give detailed theoretical background and proofs of the results mentioned in the paper. We also provide proofs that MMD flow and SVGD can be derived from EPT by choosing appropriate energy functionals.

A APPENDIX: NUMERICAL EXPERIMENTS A.1 IMPLEMENTATION DETAILS, NETWORK STRUCTURES, HYPER-PARAMETERS

We provide the details of two versions of the EPT algorithm, EPTv1 in Algorithm 1 and EPTv2 in Algorithm 2 below. In Algorithm 1, we describe the algorithm without outer loops. In Algorithm 2, we describe the algorithm with a latent structure and outer loops.

A.1.1 2D EXAMPLES

Experiments on 2D examples in our work were performed with deep LSDR fitting and the Pearson χ² divergence. We use EPTv1 (Algorithm 1) without outer loops. In the inner loops, only a multilayer perceptron (MLP) is used for dynamic estimation of the density ratio between the model distribution $q_k$ and the target distribution p. The network structure and hyper-parameters of EPT and deep LSDR fitting are shared across all 2D experiments. We use EPT to push particles from a pre-drawn pool of 50k i.i.d. Gaussian particles, evolving them for 20k steps. We used RMSProp with learning rate 0.0005 and batch size 1k as the SGD optimizer. The details are given in Table A1 and Table A2. We note that s is the step size, n is the number of particles, α is the penalty coefficient, and T is the number of mini-batch gradient descent steps of deep LSDR fitting or deep logistic regression in each inner loop hereinafter.

For convenience, we first give the following notation to be used in this section. Let $\mathcal{P}_2(\mathbb{R}^m)$ denote the space of Borel probability measures on $\mathbb{R}^m$ with finite second moments, and let $\mathcal{P}^a_2(\mathbb{R}^m)$ denote the subset of $\mathcal{P}_2(\mathbb{R}^m)$ whose measures are absolutely continuous with respect to the Lebesgue measure (all distributions are assumed to satisfy this assumption hereinafter). $\mathrm{Tan}_\mu\mathcal{P}_2(\mathbb{R}^m)$ denotes the tangent space to $\mathcal{P}_2(\mathbb{R}^m)$ at µ. Let $AC_{loc}(\mathbb{R}_+, \mathcal{P}_2(\mathbb{R}^m))$ denote the space of locally absolutely continuous curves in $\mathcal{P}_2(\mathbb{R}^m)$. With $\mathbf{1}$, det and tr, we refer to the identity map, the determinant and the trace. We use ∇, ∇² and ∆ to denote the gradient (or Jacobian), the Hessian and the Laplace operators, respectively.

We are now ready to describe the proposed method in a mathematically rigorous fashion and provide theoretical guarantees. Let X ∼ q, $\tilde X = T_{t,\Phi}(X)$, and denote the distribution of $\tilde X$ by $\tilde q$.
With a small t, the map $T_{t,\Phi}$ is invertible according to the implicit function theorem, and we have the change-of-variables formula (B-1). Using the fact that $\frac{d}{dt}\big|_{t=0}\det(A + tB) = \det(A)\,\mathrm{tr}(A^{-1}B)$ for all $A, B \in \mathbb{R}^{m\times m}$ with A invertible, and applying a first-order Taylor expansion to (B-1), we obtain (B-3). Letting t → 0 in (B-2) and (B-3), we obtain a random process $\{x_t\}$ and its law $q_t$ satisfying
$$\frac{dx_t}{dt} = \nabla\Phi(x_t), \quad\text{with } x_0 \sim q, \tag{B-4}$$
$$\frac{d\ln q_t(x_t)}{dt} = -\Delta\Phi(x_t), \quad\text{with } q_0 = q. \tag{B-5}$$
Equations (B-4) and (B-5), resulting from the linearization of the Monge-Ampère equation (2), can be interpreted as gradient flows in measure spaces (Ambrosio et al., 2008). Thanks to this connection, we can resort to solving a continuity equation characterized by a type of McKean-Vlasov equation, an ODE system that is easier to handle.
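A quick numerical check of (B-4)-(B-5), under the illustrative choice Φ(x) = −x²/2 (so ∇Φ(x) = −x and ∆Φ = −1) with q₀ = N(0,1): the flow is $x_t = x_0 e^{-t}$, its pushforward is $q_t = N(0, e^{-2t})$, and $d\ln q_t(x_t)/dt$ should equal −∆Φ = 1:

```python
import math

def log_qt_at_particle(x0, t):
    # q_t = N(0, e^{-2t}) is the pushforward of N(0,1) under x_t = x0 * e^{-t}
    xt = x0 * math.exp(-t)
    var = math.exp(-2 * t)
    return -0.5 * math.log(2 * math.pi * var) - xt**2 / (2 * var)

x0, t, dt = 0.7, 0.3, 1e-6
deriv = (log_qt_at_particle(x0, t + dt) - log_qt_at_particle(x0, t - dt)) / (2 * dt)
print(deriv)  # ≈ 1.0 = -ΔΦ, as (B-5) predicts
```

The density along each particle trajectory grows at exactly the rate the divergence of the velocity field contracts space, which is the content of (B-5).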

B.2 GRADIENT FLOWS

be an energy functional satisfying $\nu \in \arg\min L[\cdot]$, where $F(\cdot): \mathbb{R}_+ \to \mathbb{R}$ is a twice-differentiable convex function. Among the widely used metrics on $\mathcal{P}^a_2(\mathbb{R}^m)$ in implicit generative learning, the following two are important examples of $L[\cdot]$: (1) the f-divergence given in (5) (Ali & Silvey, 1966); (2) the Lebesgue norm of the density difference (B-7). A gradient flow of $L[\cdot]$ is a curve $\{\mu_t\}_{t\in\mathbb{R}_+}$ in $\mathcal{P}^a_2(\mathbb{R}^m)$ whose velocity vector field $v_t$ lies in $\mathrm{Tan}_{\mu_t}\mathcal{P}_2(\mathbb{R}^m)$ for a.e. $t \in \mathbb{R}_+$. The gradient flow $\{\mu_t\}_{t\in\mathbb{R}_+}$ of $L[\cdot]$ enjoys the following nice properties.

where $\partial^o L(\mu)$ denotes the element of $\partial L(\mu)$ with minimum length. The above display and the definition of the gradient flow imply the representation of the velocity fields $v_t$.

(ii) The time-dependent form of (B-4)-(B-5) reads
$$\frac{dx_t}{dt} = \nabla\Phi_t(x_t), \ \text{ with } x_0 \sim q, \qquad \frac{d\ln q_t(x_t)}{dt} = -\Delta\Phi_t(x_t), \ \text{ with } q_0 = q.$$
By the chain rule, substituting the first equation into the second, we have
$$\frac{\partial_t q_t(x_t)}{q_t(x_t)} + \nabla\ln q_t(x_t)\cdot\nabla\Phi_t(x_t) = -\Delta\Phi_t(x_t),$$
which implies
$$\partial_t q_t = -\nabla\cdot(q_t\nabla\Phi_t).$$
By (B-9), this coincides with the continuity equation (B-8) with $v_t = \nabla\Phi_t = -\nabla F'(q_t(x))$.

Theorem B.1 and Proposition B.1 imply that $\{\mu_t\}_t$, the solution of the continuity equation (B-8) with $v_t(x) = -\nabla F'(q_t(x))$, converges rapidly to the target distribution ν. Furthermore, under mild regularity conditions on the velocity fields, the continuity equation has the following representation: the solution of the continuity equation (B-8) can be represented as $\mu_t = (X_t)_\#\mu$, where $X_t(x): \mathbb{R}_+\times\mathbb{R}^m \to \mathbb{R}^m$ satisfies the McKean-Vlasov equation (4).

Proof. The Lipschitz assumption on $v_t$ implies the existence and uniqueness of solutions of the McKean-Vlasov equation (4) according to classical results in ODE theory (Arnold, 2012). By the uniqueness of solutions of the continuity equation (see Proposition 8.1.7 in Ambrosio et al., 2008), it suffices to show that $\mu_t = (X_t)_\#\mu$ satisfies the continuity equation (B-8) in the weak sense.
This can be done by standard test-function and smoothing-approximation arguments; see Theorem 4.4 in Santambrogio (2015) for details.

As shown in Lemma B.1 below, the velocity fields associated with the f-divergence (5) and the Lebesgue norm (B-7) are determined by the density ratio and the density difference, respectively.

Lemma B.1. The velocity fields $v_t$ satisfy $v_t(x) = -f''(r_t(x))\nabla r_t(x)$ for the f-divergence (5), where $r_t = q_t/p$, and $v_t(x) = \nabla p(x) - \nabla q_t(x)$ for the Lebesgue norm (B-7).

Proof. By definition, the first variation of (5) is $f'(q_t/p)$ and that of (B-7) is $q_t - p$. The desired result then follows from the above display and (B-9).

Several methods have been developed to estimate density ratios and density differences in the literature. Examples include probabilistic classification approaches, moment matching and direct density-ratio (difference) fitting; see Sugiyama et al. (2012a;b), Kanamori & Sugiyama (2014), Mohamed & Lakshminarayanan (2016) and the references therein.
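For Gaussians the density ratio is available in closed form, which gives a cheap consistency check of the ratio-based velocity (up to the sign convention): with p = N(0,1) and q_t = N(m,1), $r_t(x) = \exp(mx - m^2/2)$, so $\nabla r_t = m\,r_t$. A sketch (pure Python, with a finite difference standing in for the analytic gradient):

```python
import math

def r(x, m):
    # density ratio q/p for p = N(0,1), q = N(m,1)
    return math.exp(m * x - m * m / 2)

m, x, h = 0.8, 0.4, 1e-6
fd_grad = (r(x + h, m) - r(x - h, m)) / (2 * h)   # finite-difference gradient of r
print(fd_grad, m * r(x, m))                        # the two values agree: grad r = m * r
```

In the method itself, r is unknown and the gradient is taken through the fitted ratio network instead.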

B.3.1 BREGMAN SCORE FOR DENSITY RATIO/DIFFERENCE

The separable Bregman score with base probability measure p, measuring the discrepancy between a measurable function $R: \mathbb{R}^m \to \mathbb{R}$ and the density ratio r, is
$$B_{\text{ratio}}(r, R) = E_{X\sim p}\big[g'(R(X))R(X) - g(R(X))\big] - E_{X\sim q}\big[g'(R(X))\big],$$
where g is a twice-differentiable convex function. It can be verified that $B_{\text{ratio}}(r, R) \ge B_{\text{ratio}}(r, r)$, where the equality holds iff R = r.

For deep density-difference fitting, a neural network $D: \mathbb{R}^m \to \mathbb{R}$ is utilized to estimate the density difference d(x) = q(x) - p(x) between a given density q and the target p. The separable Bregman score with base probability measure w, measuring the discrepancy between D and d, can be derived similarly.

Here, we focus on the widely used least-squares density-ratio (LSDR) fitting with g(c) = (c-1)² as a working example for estimating the density ratio r. The LSDR loss function is, up to an additive constant independent of R,
$$E_{X\sim p}[R(X)^2] - 2E_{X\sim q}[R(X)].$$

B.3.2 GRADIENT PENALTY

We consider a noise-convolved form of $B_{\text{ratio}}(r, R)$ with Gaussian noise $\epsilon \sim N(0, \alpha I)$. Using equations (13)-(17) in Roth et al. (2017), we obtain a second-order approximation in which $\frac{1}{2}E_p[g''(R)\|\nabla R\|_2^2]$ serves as a regularizer for deep density-ratio fitting when g is twice differentiable.
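Combining the LSDR loss with the penalty, the regularized empirical objective is $\frac{1}{n}\sum_i R(X_i)^2 - \frac{2}{n}\sum_i R(Y_i)$ plus $\frac{\alpha}{2}$ times the average of $g''(R)\|\nabla R\|^2$ over the p-samples. A numpy sketch evaluating it for two candidate ratio models (the Gaussian pair, the finite-difference gradient and the perturbed candidate are illustrative assumptions, not the paper's deep network):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, 50_000)    # samples from p = N(0,1)
Y = rng.normal(1.0, 1.0, 50_000)    # samples from q = N(1,1)
alpha, h = 0.1, 1e-4

def penalized_lsdr_loss(R):
    # (1/n) sum R(X_i)^2 - (2/n) sum R(Y_i) + (alpha/2) E_p[g''(R) |grad R|^2],
    # with g(c) = (c-1)^2 so g'' = 2; finite differences stand in for backprop
    grad_R = (R(X + h) - R(X - h)) / (2 * h)
    return np.mean(R(X)**2) - 2 * np.mean(R(Y)) + 0.5 * alpha * np.mean(2 * grad_R**2)

true_r = lambda x: np.exp(x - 0.5)          # q/p in closed form
bumped = lambda x: true_r(x) + 0.5          # a perturbed candidate
print(penalized_lsdr_loss(true_r), penalized_lsdr_loss(bumped))
```

The exact ratio attains the smaller objective value, matching Lemma B.2's characterization of the minimizer; in practice R is a network and the penalty is computed by automatic differentiation.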

B.3.3 PROOFS IN SECTION 4

Below we prove Theorem 4.1 in Section 4.

Lemma B.2. For given densities p(x) and q(x), let r(x) = q(x)/p(x) with $C = E_{X\sim q}[r(X)] - 1 < \infty$. For any α ≥ 0, define the nonnegative functional $B^\alpha_{LSDR}$ as in (20). Then $r \in \arg\min_{\text{measurable } R} B^0_{LSDR}(R)$.

Proof. By definition, it is easy to check that $B^0_{LSDR}(R) = B_{\text{ratio}}(r, R) - B_{\text{ratio}}(r, r)$, where $B_{\text{ratio}}(r, R)$ is the Bregman score with base probability measure p between R and r. Then $r \in \arg\min_{\text{measurable } R} B^0_{LSDR}(R)$ follows from the fact that $B_{\text{ratio}}(r, R) \ge B_{\text{ratio}}(r, r)$, with equality iff the corresponding Bregman divergence equals 0, which is further equivalent to R/r being constant (q,p)-a.e.; the constant equals 1 since r is a density ratio.

The covering argument below uses the relationship between the metric entropy and the VC-dimension of the ReLU networks $\mathcal{H}_{D,W,S,B}$ (Anthony & Bartlett, 2009). Let $\tilde r$ be an extension of the restriction of r to $\mathcal{M}$, defined similarly to g on page 30 in Shen et al. (2019). Since the target r is assumed Lipschitz continuous with bound B and Lipschitz constant L, let the approximation tolerance be small enough; then by Theorem 4.

B.4 THE RELATIONSHIP BETWEEN EPT AND MMD FLOW

Here we show that MMD flow can be considered a special case of EPT.

Proof. Let $\mathcal{H}$ be a reproducing kernel Hilbert space with a characteristic kernel K(x, z). Recall that in MMD flow,

