LIPSCHITZ REGULARIZED GRADIENT FLOWS AND LATENT GENERATIVE PARTICLES

Abstract

Lipschitz regularized f-divergences are constructed by imposing a bound on the Lipschitz constant of the discriminator in the variational representation. These divergences interpolate between the Wasserstein metric and f-divergences and provide a flexible family of loss functions for non-absolutely continuous (e.g. empirical) distributions, possibly with heavy tails. We first construct Lipschitz regularized gradient flows on the space of probability measures based on these divergences. Examples of such gradient flows are Lipschitz regularized Fokker-Planck and porous medium partial differential equations (PDEs) for the Kullback-Leibler and α-divergences, respectively. The regularization corresponds to imposing a Courant-Friedrichs-Lewy numerical stability condition on the PDEs. For empirical measures, the Lipschitz regularization on gradient flows induces a numerically stable transporter/discriminator particle algorithm, where the generative particles are transported along the gradient of the discriminator. The gradient structure leads to a regularized Fisher information which is the total kinetic energy of the particles and can be used to track the convergence of the algorithm. The Lipschitz regularized discriminator can be implemented via neural network spectral normalization, and the particle algorithm generates approximate samples from possibly high-dimensional distributions known only from data. Notably, our particle algorithm can generate synthetic data even in small sample size regimes. A new data processing inequality for the regularized divergence allows us to combine our particle algorithm with representation learning, e.g. autoencoder architectures. The resulting particle algorithm in latent space yields markedly improved generative properties in terms of efficiency and quality of the synthetic samples. From a statistical mechanics perspective, the encoding can be interpreted dynamically as learning a better mobility for the generative particles.

1. INTRODUCTION

We construct new algorithms that are capable of efficiently transporting arbitrary empirical distributions to a target data set. The transport of the empirical distribution is constructed as a (discretized) gradient flow in probability space for Lipschitz-regularized f-divergences. Samples are viewed as particles and are transported along the gradient of the discriminator of the divergence towards the target data set. We take advantage of representation learning concepts, e.g. autoencoders, and make these algorithms efficient even in high-dimensional sample spaces by defining particle algorithms in latent space. Their accuracy is guaranteed by a new data processing inequality. One of our main tools is the family of Lipschitz regularized f-divergences, which interpolate between the Wasserstein metric and f-divergences. Such divergences, Dupuis & Mao (2022); Birrell et al. (2022a;c), discussed in Section 2, provide a flexible family of loss functions for non-absolutely continuous distributions. In machine learning one needs to build algorithms that handle target distributions Q which are singular, either by their intrinsic nature, such as probability densities concentrated on low-dimensional structures, and/or because Q is usually only known through N samples (the corresponding empirical distribution Q^N is always singular). Another key ingredient in our construction is that we build gradient flows where mass is transported along the gradient of the optimal discriminator in the variational formulation of the divergences. The time discretization of such gradient flows for empirical distributions gives rise to a so-called transporter/discriminator particle algorithm which transports an initial empirical distribution P^N toward the target Q^N. The Lipschitz regularization provides numerically stable, mesh-free particle algorithms that can act as generative models for high-dimensional target distributions.
Moreover, the gradient structure yields a dissipation functional which corresponds to the kinetic energy of the particles (a Lipschitz regularized version of the Fisher information) and which can be used to control the convergence of the algorithm. The third new element in our methods is the use of representation learning to reduce the sample space dimension. We construct latent particle algorithms by building a Lipschitz regularized gradient flow in latent space. The fidelity of the latent space particle algorithm is guaranteed by a new data processing inequality for Lipschitz regularized divergences, which ensures that convergence in latent space implies convergence in real sample space. The proposed generative approach is validated on a wide variety of datasets and applications, ranging from image generation to gene expression data integration.

Related work. Our approach is inspired by the MMD and KALE gradient flows from Arbel et al. (2019); Glaser et al. (2021), based on an entropic regularization of the MMD metrics, and related work using the Kernelized Sobolev Discrepancy, Mroueh et al. (2019). Furthermore, the recent work of Dupuis & Mao (2022); Birrell et al. (2022a) built the mathematical foundations for a large class of new divergences which contains the Lipschitz regularized f-divergences and used them to construct GANs, in particular symmetry-preserving GANs, Birrell et al. (2022c). Lipschitz regularizations (or the related spectral normalization) have been shown to improve the stability of GANs Miyato et al. (2018); Arjovsky et al. (2017); Gulrajani et al. (2017). Our particle algorithms share similarities with GANs Goodfellow et al. (2014); Arjovsky et al. (2017), sharing the same discriminator but having a different generator step. They are also broadly related to the Wasserstein gradient flows of Fan et al. (2022), which build a suitable neural method for JKO-type schemes, Jordan et al. (1998).
Furthermore, our methods are closely related to continuous time normalizing flows (NF) Chen et al. (2018a); Köhler et al. (2020); Chen et al. (2018b), diffusion models Sohl-Dickstein et al. (2015); Ho et al. (2020) and score-based generative flows Song & Ermon (2020); Song et al. (2021). However, the aforementioned continuous time models, along with variational autoencoders Kingma & Welling (2013) and energy based methods LeCun et al. (2006), are all likelihood-based. On the other hand, particle gradient flows such as the ones proposed here can be classified in the same category of generative models as GANs. Here there is more flexibility in selecting the loss function in terms of a suitable divergence or probability metric, enabling the comparison of even mutually singular distributions, e.g. Arjovsky et al. (2017). In Section A and Section F.1 we further compare our particle methods to other generative particle algorithms such as RKHS-based gradient flows and score-matching methods. Gradient flows in probability spaces related to the Kullback-Leibler (KL) divergence, such as the Fokker-Planck equations and Langevin dynamics Roberts & Tweedie (1996); Durmus & Moulines (2017) or Stein variational gradient descent Liu & Wang (2016); Liu (2017); Lu et al. (2019), form the basis of a variety of sampling algorithms when the target distribution Q has a known density (up to normalization). The weighted porous medium equations form another family of gradient flows, based on α-divergences Markowich & Villani (2000); Otto (2001); Ambrosio et al. (2005); Dolbeault et al. (2008); Vázquez (2014), which are very useful in the presence of heavy tails. Our gradient flows are Lipschitz regularizations of such classical PDEs (Fokker-Planck and porous medium equations); see Appendix B for a PDE and numerical analysis perspective on such flows.
Finally, deterministic particle methods and associated probability flow ODEs, such as the ones derived here for Lipschitz-regularized gradient flows for (f, Γ)-divergences, were considered in recent works for classical KL divergences and associated Fokker-Planck equations as sampling tools Maoutsa et al. (2020); Boffi & Vanden-Eijnden (2022), for Bayesian inference Reich & Weissmann (2021) and as generative models Song et al. (2021). Our latent generative particles approach is inspired by latent diffusion models using autoencoders Rombach et al. (2021) and by autoencoders used for model reduction in coarse-graining for molecular dynamics, Vlachas et al. (2022); Wang & Gómez-Bombarelli (2019); Stieffenhofer et al. (2021).

2. LIPSCHITZ-REGULARIZED f-DIVERGENCES

In the paper Dupuis & Mao (2022), continued in Birrell et al. (2022a), a new general class of divergences was constructed which interpolates between f-divergences and integral probability metrics and inherits desirable properties from both. In this paper we focus on one specific family, which we view as a Lipschitz regularization of the KL divergence (or of f-divergences), or equivalently as an entropic regularization of the 1-Wasserstein metric.

Notation. We denote by P(R^d) the space of all Borel probability measures on R^d and by P_1(R^d) = {P ∈ P(R^d) : ∫|x| dP(x) < ∞} the subset with finite first moment. We denote by C_b(R^d) the space of bounded continuous functions and by Γ_L the set of Lipschitz continuous functions on R^d with Lipschitz constant at most L (note the scaling property aΓ_L = Γ_{aL}).

f-divergences. If f : [0, ∞) → R is strictly convex and lower semicontinuous with f(1) = 0, the f-divergence of P with respect to Q is defined by D_f(P∥Q) = E_Q[f(dP/dQ)] if P ≪ Q, and is set to +∞ otherwise. We have the variational representation (see e.g. Birrell et al. (2022a) for a proof)

D_f(P∥Q) = sup_{ϕ∈C_b(R^d)} { E_P[ϕ] − inf_{ν∈R} { ν + E_Q[f*(ϕ − ν)] } },   (1)

where f*(s) = sup_{t∈R}{st − f(t)} is the Legendre-Fenchel transform of f. We will use the KL divergence with

f_KL(x) = x log x  and the α-divergences  f_α(x) = (x^α − 1)/(α(α − 1)),   (2)

with Legendre transforms f*_KL(y) = e^{y−1} and f*_α(y) ∝ y^{α/(α−1)} (see the Appendix). For KL the infimum over ν can be solved analytically and yields the Donsker-Varadhan representation with a log E_Q[e^ϕ] term (see Birrell et al. (2022b) for more on variational representations).

Wasserstein metrics. The 1-Wasserstein metric W^{Γ_1}(P, Q) with transport cost |x − y| is an integral probability metric, see Arjovsky et al. (2017). Keeping the Lipschitz constant as a regularization parameter, we set W^{Γ_L}(P, Q) = sup_{ϕ∈Γ_L} {E_P[ϕ] − E_Q[ϕ]} and note that W^{Γ_L}(P, Q) = L W^{Γ_1}(P, Q).

Lipschitz-regularized f-divergences. The Lipschitz regularized f-divergences are defined directly in terms of their variational representations, by replacing the optimization over bounded continuous functions in (1) by Lipschitz continuous functions in Γ_L:
D_f^{Γ_L}(P∥Q) := sup_{ϕ∈Γ_L} { E_P[ϕ] − inf_{ν∈R} { ν + E_Q[f*(ϕ − ν)] } }.   (3)

Some important properties of Lipschitz regularized f-divergences, summarizing results from Dupuis & Mao (2022); Birrell et al. (2022a), are given in Theorem 2.1. It is assumed there that f is superlinear (called admissible in Birrell et al. (2022a)), that is, lim_{s→∞} f(s)/s = +∞. This excludes the α-divergences with α < 1, for which the existence of optimizers is a more delicate problem, although parts of the theorem remain true.

Theorem 2.1. Assume that f is superlinear and strictly convex. Then for P, Q ∈ P_1(R^d) we have:

1. Infimal convolution formula: D_f^{Γ_L}(P∥Q) = inf_{γ∈P(Ω)} { W^{Γ_L}(P, γ) + D_f(γ∥Q) }. In particular, 0 ≤ D_f^{Γ_L}(P∥Q) ≤ min{ D_f(P∥Q), W^{Γ_L}(P, Q) }.

2. Interpolation and limiting behavior of D_f^{Γ_L}(P∥Q):

lim_{L→∞} D_f^{Γ_L}(P∥Q) = D_f(P∥Q)  and  lim_{L→0} (1/L) D_f^{Γ_L}(P∥Q) = W^{Γ_1}(P, Q).   (4)

3. Optimizers: There exists an optimizer ϕ^{L,*} ∈ Γ_L, unique up to a constant on supp(P) ∪ supp(Q).

Remark 2.2. The optimizer γ^{L,*} in the infimal convolution formula exists, is unique, and satisfies dγ^{L,*} ∝ (f*)′(ϕ^{L,*}) dQ (see Birrell et al. (2022a) for details). For example, for KL, dγ^{L,*} ∝ e^{ϕ^{L,*}} dQ.

Gradient flows of f-divergences. The Fokker-Planck equation (FPE)

∂_t p_t = div( p_t ∇ log(p_t/q) ),   (5)

where p_t and q are the densities at time t and the stationary density respectively, is the gradient flow of the KL divergence KL(p_t∥q). A similar result relates weighted porous medium equations and gradient flows for f-divergences, Otto (2001). From a generative model perspective, where Q is known only through samples (and may not have a density in the first place, as Q is typically concentrated on a low-dimensional structure), one cannot use such flows without further regularization. Score matching and diffusion models start by regularizing the data, adding a small amount of noise (see Sohl-Dickstein et al. (2015); Ho et al. (2020) and Song & Ermon (2020); Song et al. (2021)). Next, we propose a different and complementary approach: regularizing the divergence in (5) directly. We refer to Section A and Section F.1 for further connections between these approaches, and to the last example in Section 6.
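Two of the ingredients above can be checked numerically. The following plain-NumPy sketch (sample sizes and the test discriminator ϕ are illustrative choices, not from the paper) verifies (i) that for f_KL the inner infimum over ν in (1) collapses to the Donsker-Varadhan form E_P[ϕ] − log E_Q[e^ϕ], and (ii) that in one dimension the 1-Wasserstein distance between two equal-size empirical measures (the L → 0 endpoint of the interpolation (4), after dividing by L) is the mean absolute difference of sorted samples.

```python
import numpy as np

rng = np.random.default_rng(0)
xP = rng.normal(1.0, 1.0, 5000)   # samples from P
xQ = rng.normal(0.0, 1.0, 5000)   # samples from Q

# (i) KL case: E_P[phi] - inf_nu { nu + E_Q[e^{phi - nu - 1}] }  vs  Donsker-Varadhan
phi = np.tanh                      # an arbitrary (1-Lipschitz) test discriminator
mP = phi(xP).mean()
c = np.exp(phi(xQ) - 1.0).mean()   # E_Q[e^{phi - 1}]
nus = np.linspace(-5.0, 5.0, 200001)
val_grid = (mP - nus - np.exp(-nus) * c).max()   # brute-force inner infimum over nu
val_dv = mP - np.log(np.exp(phi(xQ)).mean())     # closed form: E_P[phi] - log E_Q[e^phi]
print(abs(val_grid - val_dv))                    # essentially zero

# (ii) 1D empirical 1-Wasserstein distance: optimal transport matches sorted samples
w1 = np.abs(np.sort(xP) - np.sort(xQ)).mean()
print(w1)  # close to 1.0, the mean shift between the two Gaussians
# the scaling property W^{Gamma_L} = L * W^{Gamma_1} then rescales this by L
```

The grid search over ν is of course only for verification; in the algorithms below ν is updated by gradient ascent alongside the discriminator parameters.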

3. LIPSCHITZ-REGULARIZED GRADIENT FLOWS

Lipschitz-regularized gradient flows. Given a target probability measure Q, we build an evolution equation for probability measures based on the Lipschitz regularized f-divergence D_f^{Γ_L}(P∥Q) by considering the PDE

∂_t P_t = div( P_t ∇ (δD_f^{Γ_L}(P_t∥Q)/δP_t) ),  P_0 = P ∈ P_1(R^d),   (6)

where δD_f^{Γ_L}(P∥Q)/δP is the first variation of D_f^{Γ_L}(P∥Q) (discussed below in Theorem 3.1). An advantage of the Lipschitz regularized f-divergences is their ability to compare singular measures, and so (6) is to be understood in the sense of distributions (integrating against test functions). For this reason we use the probability measure notation P_t in (6), instead of the density notation p_t as in the FPE (5). In the limit L → ∞, and if P ≪ Q, (6) yields the FPE (5) (for the KL divergence) and the weighted porous medium equation (for α-divergences), Otto (2001); Dolbeault et al. (2008), see Appendix B. The following theorem was first proved in Dupuis & Mao (2022) for KL and can be generalized to the f-divergences considered in Theorem 2.1 (see the proof in Appendix C).

Theorem 3.1. Assume f is superlinear and strictly convex and P, Q ∈ P_1(R^d). Then we have δD_f^{Γ_L}(P∥Q)/δP (P) = ϕ^{L,*}. In more detail, let ρ be a signed measure of total mass 0 and write ρ = ρ_+ − ρ_−, where ρ_± ∈ P_1(R^d) are mutually singular. If P + ϵρ ∈ P_1(R^d) for sufficiently small |ϵ|, then D_f^{Γ_L}(P + ϵρ∥Q) is differentiable at ϵ = 0 and

lim_{ϵ→0} (1/ϵ) [ D_f^{Γ_L}(P + ϵρ∥Q) − D_f^{Γ_L}(P∥Q) ] = ∫ ϕ^{L,*} dρ.

Combining Theorem 3.1 with (6) leads to a new class of PDEs.

Transporter/Discriminator PDE:

∂_t P_t = div( P_t ∇ϕ_t^{L,*} ),  where  ϕ_t^{L,*} = argmax_{ϕ∈Γ_L} { E_{P_t}[ϕ] − E_Q[f*(ϕ)] }.   (9)

Remark 3.2. (a) The transporter/discriminator PDE (9) makes sense when P and Q are replaced by their empirical measures P̂^N, Q̂^N based on N i.i.d. samples. This is the basis of our numerical algorithm in Section 4 (see Algorithm 1). (b) The PDE (9) also makes sense if P and Q are mutually singular (e.g. when Q is supported on a low-dimensional structure). We can view (9) as a Lipschitz regularization of classical PDEs which allows particle-based approximations based on data. In particular, the Lipschitz condition ϕ ∈ Γ_L enforces a finite speed of propagation, of at most L, in the transport equation (9). This is in sharp contrast with the Fokker-Planck equation given in Appendix B, which is a diffusion equation; see Appendix B.2 for more details and practical implications.
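The finite speed of propagation can be seen already at the level of a single forward-Euler step of the transport dynamics: if ϕ ∈ Γ_L then |∇ϕ| ≤ L, so no particle can move farther than ∆t·L per step. A minimal sketch with an explicit L-Lipschitz function standing in for the optimal discriminator (an illustrative choice, not the optimizer of (9)):

```python
import numpy as np

L, dt = 2.0, 0.1
rng = np.random.default_rng(2)
Y = rng.normal(0.0, 3.0, 1000)                   # particle positions

# phi(x) = L*sqrt(1 + x^2) is L-Lipschitz: |phi'(x)| = L|x|/sqrt(1 + x^2) < L
grad_phi = lambda x: L * x / np.sqrt(1.0 + x**2)

Y_next = Y - dt * grad_phi(Y)                    # one transport step
max_step = np.abs(Y_next - Y).max()
print(max_step)  # strictly below the "speed limit" dt * L = 0.2
```

This per-step bound is exactly the CFL-type stability mechanism discussed for Algorithm 1 below.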

4. LIPSCHITZ-REGULARIZED GENERATIVE PARTICLES

In this section we build a numerical algorithm to solve the transporter/discriminator gradient flow when N i.i.d. samples from the target distribution are given. For a map T : R^d → R^d and P ∈ P(R^d), the pushforward measure is denoted by T_#P (i.e. T_#P(A) = P(T^{−1}(A))). The forward-Euler discretization of (9) yields:

Euler method for the transporter/discriminator PDE:

P_{n+1} = (I − ∆t ∇ϕ_n^{L,*})_# P_n,  where  ϕ_n^{L,*} = argmax_{ϕ∈Γ_L} { E_{P_n}[ϕ] − E_Q[f*(ϕ)] }.   (10)

When only N i.i.d. samples {X^{(i)}}_{i=1}^N of the target distribution Q are available, we build a particle system by considering N i.i.d. samples {Y^{(i)}}_{i=1}^N from some initial measure P (M ≠ N samples are also possible) and (10) becomes:

Lipschitz regularized generative particles:

Y_{n+1}^{(i)} = Y_n^{(i)} − ∆t ∇ϕ_n^{L,*}(Y_n^{(i)}),  ϕ_n^{L,*} = argmax_{ϕ∈Γ_L} { (1/N) Σ_{i=1}^N ϕ(Y_n^{(i)}) − (1/N) Σ_{i=1}^N f*(ϕ(X^{(i)})) }.   (11)

The empirical measure P̂_n^N = N^{−1} Σ_{i=1}^N δ_{Y_n^{(i)}} built from (11) gives a solution of (10) if we use as target the empirical measure Q̂^N = N^{−1} Σ_{i=1}^N δ_{X^{(i)}} and as initial condition the empirical measure P̂^N = N^{−1} Σ_{i=1}^N δ_{Y_0^{(i)}}. Finally, we note that (11) is a time-discretization of the Lagrangian formulation of (9), i.e. the ODE/variational problem

(d/dt) Y_t = −∇ϕ^{L,*}(Y_t, t),  where  ϕ^{L,*} = argmax_{ϕ∈Γ_L} { E_{P_t}[ϕ] − inf_{ν∈R}{ ν + E_Q[f*(ϕ − ν)] } }.   (12)

Algorithm 1: Lipschitz regularized generative particles algorithm

Require: f defined in (2) and its Legendre conjugate f*; L: Lipschitz constant; ν: scalar parameter for optimizing the f-divergence; T: number of particle updates; γ: time step size; N: number of particles.
Require: W = {W_l}_{l=1}^D: parameters of the neural network ϕ : R^d → R; D: depth of the network; δ: learning rate of the network; T_NN: number of network updates.
Result: {Y_T^{(i)}}_{i=1}^N
  Sample {X^{(i)}}_{i=1}^N ∼ Q, a batch from the real data
  Sample {Y_0^{(i)}}_{i=1}^N ∼ P_0 = P, a batch of prior samples
  Initialize ν ← 0, W randomly, and W_l ← L^{1/D} · W_l / ∥W_l∥_2
  for n = 0 to T − 1 do
      for m = 0 to T_NN − 1 do
          grad_{W,ν} ← ∇_{W,ν} [ N^{−1} Σ_{i=1}^N ϕ(Y_n^{(i)}; W) − N^{−1} Σ_{i=1}^N f*( ϕ(X^{(i)}; W) − ν ) − ν ]
          W ← W + δ · grad_W,  ν ← ν + δ · grad_ν
          W_l ← L^{1/D} · W_l / ∥W_l∥_2
      end
      Y_{n+1}^{(i)} ← Y_n^{(i)} − γ ∇ϕ_n^L(Y_n^{(i)}; W),  i = 1, …, N
  end
  // The height of ϕ is adjusted by the optimization over ν.

The Lipschitz constraint keeps |∇ϕ| ≤ L, and thus the particle speed is bounded by L. Hence the Lipschitz regularization imposes a speed limit L on the particles, ensuring the stability of the algorithm for suitable choices of L. This implicit speed limit is reminiscent of the Courant-Friedrichs-Lewy (CFL) condition for the stability of discretization schemes. These are fundamental features for the performance and stability of Algorithm 1 derived from (11) (see Section 6 and Appendix B).
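To make the structure of Algorithm 1 concrete, here is a deliberately minimal NumPy sketch of the transporter/discriminator loop in one dimension for the KL case, f*_KL(y) = e^{y−1}. To stay self-contained it replaces the neural network by an affine discriminator ϕ(y) = wy + b, for which the Lipschitz constraint is simply |w| ≤ L (enforced by clipping, a one-parameter stand-in for spectral normalization); all step sizes and sample sizes are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 500, 1.0
dt, lr, T, T_nn = 0.05, 0.1, 300, 25

X = rng.normal(2.0, 1.0, N)    # target samples from Q
Y = rng.normal(0.0, 1.0, N)    # generative particles, Y_0 ~ P
gap0 = abs(Y.mean() - X.mean())

w, b, nu = 0.0, 0.0, 0.0       # affine discriminator phi(y) = w*y + b, Lip(phi) = |w|
for n in range(T):
    # inner loop: ascend  J = mean phi(Y) - nu - mean e^{phi(X) - nu - 1}
    for _ in range(T_nn):
        eX = np.exp(w * X + b - nu - 1.0)
        gw = Y.mean() - (eX * X).mean()
        gb = 1.0 - eX.mean()
        gnu = eX.mean() - 1.0
        w, b, nu = w + lr * gw, b + lr * gb, nu + lr * gnu
        w = np.clip(w, -L, L)  # Lipschitz constraint for the affine discriminator
    # transport step: for an affine phi, grad phi is the constant w
    Y = Y - dt * w

gap = abs(Y.mean() - X.mean())
print(gap0, gap)
```

With a genuinely nonlinear (spectrally normalized) network the same loop also matches higher moments; the affine case above only transports the particle mean toward the target mean, which already exhibits the warm-started inner maximization and the bounded-speed outer transport step.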

Kinetic energy of particles:

The gradient structure implies (see Theorem B.1) that, for (9), the derivative of the regularized divergence satisfies

(d/dt) D_f^{Γ_L}(P_t∥Q) = −I_f^{Γ_L}(P_t∥Q),  where  I_f^{Γ_L}(P_t∥Q) = E_{P_t}[ |∇ϕ^{L,*}|² ],

which is interpreted as a Lipschitz-regularized Fisher information. As L → ∞ one recovers, for example, the Fisher information used for the Fokker-Planck equation. For Algorithm 1 the Lipschitz-regularized Fisher information

I_f^{Γ_L}(P̂_n^N ∥ Q̂^N) = ∫ |∇ϕ_n^{L,*}|² dP̂_n^N(x) = (1/N) Σ_{i=1}^N |∇ϕ_n^{L,*}(Y_n^{(i)})|²   (13)

is equal to the total kinetic energy of the particles, since ∇ϕ_n^{L,*}(Y_n^{(i)}) is (up to sign) the velocity of the i-th particle at time n. Clearly, when the total kinetic energy I_f^{Γ_L}(P̂_n^N ∥ Q̂^N) is zero, the algorithm stops.
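In an implementation, the kinetic energy (13) is a one-line diagnostic: the mean squared particle velocity, with a small threshold serving as a natural stopping rule for Algorithm 1. A sketch, with an arbitrary stand-in velocity field in place of ∇ϕ_n^{L,*}:

```python
import numpy as np

def kinetic_energy(grad_phi_at_particles):
    """Empirical Lipschitz-regularized Fisher information (13): mean squared velocity."""
    v = np.asarray(grad_phi_at_particles)
    return (v**2).sum(axis=-1).mean() if v.ndim > 1 else (v**2).mean()

# example: 1000 particles in R^2 with small stand-in velocities near convergence
rng = np.random.default_rng(3)
V = 0.01 * rng.normal(size=(1000, 2))
I = kinetic_energy(V)

tol = 1e-3
stop = I < tol    # stopping rule: particles have (nearly) stopped moving
print(I, stop)
```

The same quantity, tracked across iterations, gives the discrete analogue of the dissipation identity d/dt D = −I of Theorem B.1.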

5. LATENT GENERATIVE PARTICLES: GRADIENT FLOWS IN LATENT SPACE

A standard paradigm of machine learning is that target measures are often supported on low-dimensional structures. We leverage this insight, in the form of an autoencoder, to construct particle algorithms in a latent, lower-dimensional space. The resulting latent particle algorithms are both more accurate and more efficient, even in high-dimensional sample spaces, and their performance is guaranteed by a new data processing inequality in Theorem 5.1. Assume Q = Q^Y is supported on some low-dimensional set S ⊂ Y = R^d, and that an encoder map E : Y → Z, where Z ⊂ R^{d′} with d′ < d, and a decoder map D : Z → Y are invertible on S, i.e. D∘E(S) = D(Z) = S. We denote by E_#Q^Y the image of the measure Q^Y under the map E, i.e. for A ⊂ Z we define E_#Q^Y(A) = Q^Y(E^{−1}(A)), and likewise for D_#P^Z. The following theorem expresses how information remains controlled under encoding/decoding and guarantees the performance of the approximation D_#P^Z in the real space. The latter is achieved by an a posteriori estimate (14), in the sense of numerical analysis, where the approximation in the tractable latent space Z bounds the error in the real space Y.

Theorem 5.1. Suppose that:
1. Perfect encoding. For Q^Y the encoder E and the decoder D are such that D_#E_#Q^Y = Q^Y.
2. Lipschitz decoder. The decoder is Lipschitz continuous with Lipschitz constant a_D.
Then, for any P^Z ∈ P_1(Z), we have

D_f^{Γ_L}( D_#P^Z ∥ Q^Y ) ≤ D_f^{Γ_{a_D L}}( P^Z ∥ E_#Q^Y ).   (14)

This theorem, and more general versions thereof for representation learning tools beyond autoencoders, is proved in Appendix D.2. The proof is a consequence of a new, tighter data processing inequality derived in Birrell et al. (2022a) that involves transformations of both probabilities and discriminator spaces Γ.

Remark 5.2. In practice an autoencoder is trained on data using the empirical measure Q̂^N and suitable loss functions and neural network architectures. Assumption 2 in Theorem 5.1 can easily be enforced using e.g. spectral normalization. Assumption 1 is a reasonable, but somewhat idealized, version of the requirement that the autoencoder adequately captures the features of the dataset Q. In particular, the dimension of the latent space Z needs to be selected carefully (see Section 6).

Gradient flow in latent spaces. If ϕ_t^Z is the discriminator in latent space leading to the gradient flow (9), ∂_t P_t^Z = div(P_t^Z ∇ϕ_t^Z), then, in the particle algorithm, each particle is transported following (the time-discretization of) the ODE ż_t = −∇ϕ_t^Z(z_t), as in Section 4. Algorithm 2 can be found in Appendix D.3. Upon decoding, the transport ODE in real space is

ẏ_t = (∂D/∂z)(z_t) ż_t = − (∂D/∂z)(z_t) (∂D/∂z)(z_t)^T ∇_y ϕ_t^Y(D(z_t)),   (15)

where (∂D/∂z)(z_t) is the Jacobian of D at the point z_t and the reconstructed discriminator ϕ^Y is given by ϕ^Z = ϕ^Y ∘ D. Using (15) we can therefore interpret the encoding as learning a mobility μ_t = (∂D/∂z)(z_t)(∂D/∂z)(z_t)^T, i.e., learning a better geometry in real space. This leads to a gradient flow in real space with non-trivial mobility, cf. (9):

∂_t P_t^Y = div( μ_t P_t^Y ∇ϕ_t^Y ).   (16)

We note that the mobility concept is well known in computational materials science, where it is used to model the kinetics of species and interfaces, see for instance Cahn (1965); Zhu et al. (1999); Wang et al. (2020). Finally, we note a similar computation to (15) in Mroueh et al. (2019) regarding the interpretation of GANs as a gradient flow. The differences (and similarities) between the (Lipschitz-regularized) generative particle algorithm (GPA) and GANs are summarized in Figure 4 and Table 1, where in the latter we also include a comparison between mobilities.
                   GPA                           GPA in a latent space                  GAN
Discriminator      ϕ^Y ∈ Lip(Y)                  ϕ^Z ∈ Lip(Z)                           ϕ^Y ∈ Lip(Y)
Generator          (I_Y − ∆t∇ϕ^Y)_# P_n^Y        D ∘ (I_Z − ∆t∇ϕ^Z)_# P_n^Z             G_θ(z), z ∼ N(0, I_Z)
Updates            particles y ∈ Y               particles z ∈ Z                        generator parameters θ ∈ R^{|θ|}
Mobility μ (16)    I_Y                           (∂D/∂z)(∂D/∂z)^T                       (∂G(θ_t,z)/∂θ)(∂G(θ_t,z)/∂θ)^T

Table 1: Comparison of features of GPA, GPA in latent space, and GAN. Note that GPA consists of one neural network (the discriminator), while a GAN has two neural networks, the discriminator and the generator. We use the notation Y = R^d and Z = E(Y) to distinguish the real space and the latent space. A schematic diagram in Figure 4 shows how the latent GPA and the GAN interact between the latent and the real spaces.
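The latent-space mobility can be probed numerically: with a decoder D and its Jacobian J = ∂D/∂z (a d × d′ matrix), the induced real-space mobility acting on ∇_yϕ^Y is the symmetric positive semi-definite matrix J Jᵀ, of rank at most d′. A sketch with a hypothetical toy decoder (not a trained model) and a finite-difference Jacobian:

```python
import numpy as np

def decoder(z):
    """Hypothetical decoder D: R^2 -> R^3 (illustrative only, not a trained model)."""
    return np.array([np.sin(z[0]), z[0] * z[1], np.cos(z[1])])

def jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian dD/dz, here of shape (3, 2)."""
    z = np.asarray(z, dtype=float)
    cols = [(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(len(z))]
    return np.stack(cols, axis=1)

z = np.array([0.3, -0.7])
J = jacobian(decoder, z)
mu = J @ J.T                      # mobility: 3x3, PSD, rank <= 2

eigs = np.linalg.eigvalsh(mu)     # ascending eigenvalues
print(eigs)                       # nonnegative; smallest ~ 0 since rank(mu) <= 2
```

The rank deficiency is the point: the mobility confines the transport in real space to the d′-dimensional tangent directions of the decoded manifold.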

6. EXPERIMENTS

We present three types of experiments: (1) generating MNIST images from a small number of data, see Figure 1; (2) merging gene expression data sets, which are very high-dimensional (and thus require a latent space description) but also have a small number of samples, on the order of low hundreds, since they correspond to patients, see Figure 2; (3) generating heavy-tailed distributions, where KL- or maximum-likelihood-based methodologies will necessarily fail since an f-divergence is required, see Figure 3. Overall, we found that GPA performs best in learning from a small number of samples and in learning distributions with heavy tails, and exhibits enhanced learning when a latent space is available.

1. Learning MNIST from a few data. Our (f_KL, Γ_1)-GPA is found to perform well in generating images using a small number of samples, while GANs struggle with limited data. We stress, from the digit-conditional MNIST data generation example in Figure 1, that (f_KL, Γ_1)-GPA could generate images of all ten digit labels from 200 samples. We compare the performance with (f_KL, Γ_1)-GAN and the well-known Wasserstein GAN. FID values and the improvement of our result with latent GPA can be found in Appendix F.2.

2. Merging of microarray gene expression data sets

Using GPA, we can transport arbitrary source data to arbitrary target data even if a relatively small number of samples is available from the target. Furthermore, combining our algorithm with representation learning, we introduce a bioinformatics application of our algorithm in the analysis of gene expression data.

Figure 1 caption: The (f_KL, Γ_1)-GPA was able to learn digits from a small data set, while the other methods failed. Using sufficiently large training data, GANs performed better in capturing the scale, which can be observed in the more intense color contrast between a digit and its background. See FID scores in Table 6.

Gene expression datasets are not only high-dimensional but also small-sized; thus it is crucial to increase the sample size by integrating all available datasets from the same disease. However, this is not a straightforward process, since it is well known that gene expression datasets may have different statistics even when they target the same disease, a phenomenon referred to as "batch effects", Tran et al. (2020). We propose to mitigate batch effects via the latent generative particle algorithm and match the statistics of the two datasets. Figure 2 presents the results of applying our algorithm to two breast cancer datasets from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/). More technical details and experiments can be found in Appendix F.3.

3. GPA and porous medium equations for heavy-tailed targets. Similarly to the score-based methods (see equation 22 in Section A) developed in Boffi & Vanden-Eijnden (2022) to solve the FP equation (5), our proposed methods allow us to develop new particle algorithms for solving porous medium equations with steady-state probability measure Q and density q:

∂_t p_t = div( p_t ∇f′(p_t/q) ).   (17)

For the special case where the f-divergence is an α-divergence with f_α(x) ∼ x^α, we obtain a power law in equation (17) and ultimately the well-known porous medium equations.

7. REPRODUCIBILITY STATEMENT

Each figure lists the main parameters, such as the divergence-specifying f, the Lipschitz constant L, and dataset parameters. In Appendix Section E, the experimental settings for the experiments in the main text and the appendix are described in the following aspects: • Data sets

A CONNECTIONS OF GPA WITH SCORE-BASED GENERATIVE METHODS

In this section we discuss connections of GPAs with diffusion-type generative models, in particular with score-based models, which seem to be the most closely related. Score-based generative modeling (SGM) relies closely on concepts and methods related to Langevin samplers, e.g. Durmus & Moulines (2017). Given R^d-valued samples {Y^{(i)}}_{i=1}^N from an unknown probability measure Q with density q, we want to produce more realizations from Q. In SGMs, the score of the unknown distribution, ∇ log q(y), is learned from the training set. An optimization problem to learn the score is defined as follows, Song & Ermon (2020): we search for s : R^d → R^d in a function space F parametrized by θ (typically neural networks),

min_θ L(θ) = min_θ (1/2) ∫_{R^d} ∥s(y; θ) − ∇ log q(y)∥² q(y) dy.   (18)

The key observation in (18) is that the loss functional can be estimated via available samples from Q without any evaluation of the density q (of Q) or of ∇ log q, Hyvärinen (2005). Indeed, by expanding the square and integrating by parts in (18) we arrive at an equivalent loss functional L_0(θ):

argmin_θ L(θ) = argmin_θ L_0(θ) = argmin_θ E_Q[ (1/2) ∥s(y; θ)∥²_2 + ∇_y · s(y; θ) ].   (19)

Since computing the divergence of s(y; θ), when it is expressed as a neural network, is quite expensive, the complexity of the loss functional (19) can be further reduced by using a Hutchinson-type randomization, Hutchinson (1989), for the efficient evaluation of ∇_y · s(y; θ), yielding the so-called "sliced score matching" method, Song et al. (2020). Lastly, we remark that the selection of the loss functional (18) may initially appear somewhat ad hoc, as in many other successful machine learning algorithms. However, using the Girsanov theorem it can be shown that (18) has rigorous information-theoretic underpinnings, because it is related to the minimization of the KL divergence on path space, Song et al.
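The sample-based objective (19) can be sanity-checked on a case where the score is known. For q = N(0,1) the true score is ∇ log q(y) = −y; parametrizing s_θ(y) = −θy (an illustrative one-parameter family, not a neural network), the empirical version of (19) is ½θ²·mean(y²) − θ, minimized near θ = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 20000)       # samples from q = N(0, 1)

# s_theta(y) = -theta*y  =>  ||s||^2/2 = theta^2 y^2 / 2  and  div s = -theta
def ism_loss(theta):
    return 0.5 * theta**2 * (y**2).mean() - theta

thetas = np.linspace(0.0, 2.0, 2001)
losses = np.array([ism_loss(t) for t in thetas])
theta_star = thetas[losses.argmin()]
print(theta_star)  # close to 1: the loss recovers the true score s(y) = -y
```

Note that no evaluation of q or ∇ log q appears in the loss, only samples, which is exactly the point of the integration-by-parts identity behind (19).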
After learning the score from the minimization of (19), namely ŝ(y) = s(y; θ*) ≈ ∇ log q(y), there are several directions that can be taken towards generative modeling. The most naïve option is to simulate trajectories of the overdamped Langevin dynamics using the learned score:

dY_t = ŝ(Y_t) dt + √2 dW_t,  where ŝ(y) ≈ ∇ log q(y).   (20)

As discussed in Song & Ermon (2020), naïvely applying this approach is not practical: (a) Langevin dynamics will not sample from the true data distribution when the data lie on a lower-dimensional manifold; (b) the score cannot be accurately estimated in regions with little data; (c) Langevin dynamics mixes poorly, and therefore generates poorly, especially for multi-modal systems. To overcome these challenges, Song & Ermon (2020); Song et al. (2021) propose to add noise to the available data from Q in a systematic fashion. In Song & Ermon (2020), the authors build a noise-conditional score network, in which a sequence of noisy datasets is generated by adding different levels of noise to the given dataset. In the end, one simulates an annealed Langevin dynamics analogue of (20):

dY_t = σ²(t) ∇ log q(Y_t) dt + √2 σ(t) dW_t,   (21)

with annealing schedule σ(t). The distribution of Y_t at some finite cut-off time T is supposed to approximate the data distribution q. Indeed, (21) is shown to mix better, and to estimate the score better in regions of low probability, Song & Ermon (2020). In Song et al. (2021), the authors propose an alternative method by considering a forward-backward formulation. However, all these Langevin dynamics approaches require tuning, mainly because the annealing schedule or the forward model needs to be user-prescribed and is problem-dependent, as in all annealing methods.

Connections between score-based methods and GPA. The distribution p_t of the solution Y_t of the SGM (20) solves the classical Fokker-Planck (FP) equation (5), given a target distribution q.
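As a quick numerical illustration of (20)-(21) and of the corresponding Fokker-Planck stationary state, the annealed dynamics can be simulated for a toy target q = N(0,1), where the exact score −y replaces the learned network; the particle cloud settles at the target regardless of the (here arbitrary, illustrative) schedule, since N(0,1) is stationary for every σ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles, n_steps, h = 2000, 1500, 0.01
Y = rng.normal(0.0, 3.0, n_particles)          # far-from-target initialization
sigmas = np.geomspace(2.0, 0.5, n_steps)       # decreasing annealing schedule

score = lambda y: -y                            # grad log q for q = N(0, 1)
for s in sigmas:                                # Euler-Maruyama for (21)
    Y = Y + (s**2) * score(Y) * h + np.sqrt(2.0 * h) * s * rng.normal(size=n_particles)

print(Y.mean(), Y.std())   # roughly (0, 1)
```

The role of the schedule shows up only for hard (multi-modal, manifold-supported) targets, where large σ early on improves mixing and score estimation in low-density regions.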
The FP equation is of course a diffusion equation, but it can also be viewed as a transport equation of the form

∂_t p_t = ∇ · ( v(x, t) p_t ),  with velocity field  v(x, t) = ∇ (δKL(p_t∥q)/δp_t).

In analogy to (12), we can write the solution of this transport formulation as the density of particles evolving according to its Lagrangian formulation

(d/dt) Y_t = −v(Y_t, t) = ∇ log q(Y_t) − ∇ log p_t(Y_t),  where Y_t ∼ P_t.   (22)

In Song et al. (2021), the authors proposed the deterministic probability flow (22) as an alternative to stochastic generative samplers such as (20) and (21), due to advantages related to better statistical estimators. From the perspective of GPAs, we can rewrite (22) as an equivalent transport/variational problem that employs the variational formulation (1) of the KL divergence. First, we note that ϕ*(x) = log(p_t(x)/q(x)) is the maximizer (discriminator) of this variational representation, Birrell et al. (2022b). Next, notice that (22) can also be written in the equivalent particle/variational form (12), which for the KL divergence, and due to ϕ*(x) = log(p_t(x)/q(x)), takes the form

(d/dt) Y_t = −∇ϕ*(Y_t, t),  where  ϕ* = argmax_ϕ { E_{P_t}[ϕ] − inf_{ν∈R}{ ν + E_Q[e^{ϕ−ν−1}] } }.   (23)

It is evident that (23) is mathematically equivalent to (22), since ϕ*(x) = log(p_t(x)/q(x)). However, the particle/variational problem (23), through its generalization (12), allows us to use a much wider variety of divergences which, unlike the KL used in score-based methods, are suitable for probability distributions supported on lower-dimensional manifolds, singular distributions (such as empirical distributions), or heavy-tailed distributions. These are precisely the features of the GPA developed in this paper, based on (f, Γ)-divergences.
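The deterministic flow (22) needs ∇ log p_t, which is unavailable for a particle cloud; a crude illustration is to approximate p_t at each step by a Gaussian fitted to the current particles (a moment-matching stand-in, not the paper's method or a score network). For q = N(0,1) this yields a deterministic particle system whose mean and variance relax to the target moments (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
h, n_steps = 0.01, 800
Y = rng.normal(3.0, 2.0, 4000)     # particles, p_0 approx N(3, 4)

for _ in range(n_steps):
    m, v = Y.mean(), Y.var()
    # dY/dt = grad log q(Y) - grad log p_t(Y), with p_t fitted as N(m, v)
    vel = -Y + (Y - m) / v
    Y = Y + h * vel                # forward Euler for the probability flow (22)

print(Y.mean(), Y.var())           # relaxes toward the target moments (0, 1)
```

Unlike the stochastic samplers (20)-(21), no noise is injected: the particles move along a deterministic velocity field, which is exactly the structure shared with the GPA transport step (23).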

B PDE, CONVERGENCE & LEARNING RATES

Here, we discuss how PDE perspectives and tools provide insights into the analysis, stability and convergence of the proposed generative particle algorithms. We focus on continuous time for convenience. Recalling Theorem 2.1, part 4, and that $\gamma^{L,*} \to P$ as $L\to\infty$ (if P is absolutely continuous with respect to Q, Remark 2.2), (9) becomes a Lipschitz-regularized $f$-divergence gradient flow, with the $f$-divergence flow as its $L\to\infty$ limit:
$$\partial_t P_t = \mathrm{div}\Big(P_t \nabla f'\Big(\frac{d\gamma^{L,*}}{dQ}\Big)\Big) \ (\text{Lipschitz regularized } f\text{-divergence flow}) \ \xrightarrow[L\to\infty]{} \ \partial_t P_t = \mathrm{div}\Big(P_t \nabla f'\Big(\frac{dP_t}{dQ}\Big)\Big) \ (f\text{-divergence flow}) \qquad (24)$$
The right-hand side of the regularized flow in (24) is a nonlinear operator which encodes the Lipschitz regularization in the discriminator space. This defines a new class of PDE gradient flows where absolute continuity between $P_t$ and Q for every $t\ge0$ is not required, contrary to gradient flows of $f$-divergences (obtained as $L\to\infty$). We discuss these connections next in the context of two well-known PDEs, the (linear) Fokker-Planck and the (nonlinear) porous medium equation. Rewriting the limiting equation in terms of the density $h_t = \frac{dP_t}{dQ}$ we have:

1. Lipschitz-regularized Fokker-Planck. For the KL divergence, $f(x) = x\log x$, we obtain
$$\partial_t P_t = \mathrm{div}\Big(P_t \nabla \log \frac{d\gamma_t^{\Gamma_L,*}}{dQ}\Big) \ \xrightarrow[L\to\infty]{} \ \partial_t h_t = (\Delta + \nabla\log q \cdot \nabla)\, h_t \qquad (25)$$
2. Lipschitz-regularized weighted porous medium equation. For the $\alpha$-divergence, $f_\alpha(x) = \frac{x^\alpha-1}{\alpha(\alpha-1)}$, we obtain
$$\partial_t p_t = \frac{1}{\alpha-1}\,\mathrm{div}\Big(p_t \nabla \Big(\frac{\eta_t^{\Gamma_L,*}}{q}\Big)^{\alpha-1}\Big) \ \xrightarrow[L\to\infty]{} \ \partial_t h_t = \frac{1}{\alpha}(\Delta + \nabla\log q \cdot \nabla)\, h_t^\alpha \qquad (26)$$
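The limiting Fokker-Planck equation for the likelihood ratio, $\partial_t h_t = (\Delta + \nabla\log q\cdot\nabla)h_t$, can be checked with a small explicit finite-difference solve. The grid, the toy initial ratio $h_0$, and the crude Neumann boundary handling below are illustrative assumptions; $q$ is the standard Gaussian, so $\nabla\log q(x) = -x$, and $h_t$ relaxes to the constant 1.

```python
import numpy as np

# 1D grid; target q = N(0,1), so grad log q = -x
x = np.linspace(-5, 5, 201)
dx = x[1] - x[0]
q = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
h = 1.0 + 0.5 * np.sin(x)          # toy initial likelihood ratio dP_0/dQ
h *= 1.0 / np.sum(h * q * dx)      # normalize so P_0 integrates to 1

dt = 0.4 * dx ** 2                 # explicit scheme: diffusive stability restriction
for _ in range(20000):
    lap = (np.roll(h, 1) - 2 * h + np.roll(h, -1)) / dx ** 2
    grad = (np.roll(h, -1) - np.roll(h, 1)) / (2 * dx)
    h = h + dt * (lap - x * grad)  # h_t evolves by (Laplacian + grad log q . grad) h
    h[0], h[-1] = h[1], h[-2]      # crude no-flux boundary handling
```

The step-size restriction $\Delta t \lesssim \Delta x^2$ used here is the diffusive analogue of the CFL-type stability condition discussed below for the particle schemes.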

B.1 CONVERGENCE TO EQUILIBRIUM AND FUNCTIONAL INEQUALITIES

Functional inequalities are fundamental tools for guaranteeing the convergence of gradient flow PDEs to their equilibrium states, and are therefore natural tools for studying convergence properties of the corresponding particle-based algorithms. A first step is to compute the rate of change of the divergence along solutions $P_t$ of (9).

Theorem B.1 (Lipschitz regularized dissipation). Along a trajectory of a smooth solution $\{P_t\}_{t\ge0}$ of (9) with source probability distribution P, we have the rate of decay identity
$$\frac{d}{dt}D_f^{\Gamma_L}(P_t\|Q) = -I_f^{\Gamma_L}(P_t\|Q) \le 0, \qquad (27)$$
where we define the Lipschitz-regularized Fisher information as
$$I_f^{\Gamma_L}(P\|Q) = \int |\nabla\phi^{L,*}|^2\, P(dx) = E_P\left[\Big|\nabla f'\Big(\frac{d\gamma^{L,*}}{dQ}\Big)\Big|^2\right]. \qquad (28)$$
Consequently, for any $T\ge 0$, we have $D_f^{\Gamma_L}(P_T\|Q) = D_f^{\Gamma_L}(P\|Q) - \int_0^T I_f^{\Gamma_L}(P_s\|Q)\,ds$.

Remark B.2. (a) For the generative particles, the Lipschitz-regularized Fisher information can be interpreted as their total kinetic energy, see Paragraph 12. (b) When $f = f_{KL}$, as $L\to\infty$, we recover the usual Fisher information $I_f(P\|Q) = E_P\big[|\nabla\log\frac{p}{q}|^2\big]$, which is used to prove convergence to the equilibrium state for the Fokker-Planck equation (25).

Functional inequalities, such as the classical Poincaré and logarithmic Sobolev-type inequalities and their many generalizations, see Markowich & Villani (2000); Otto & Villani (2000); Toscani & Villani (2000); Dolbeault et al. (2008); Wang (2005), are a powerful tool for proving convergence to equilibrium (e.g. exponential or polynomial convergence), building on dissipation estimates such as Theorem B.1. For example, if for some $\lambda > 0$ a Sobolev-type inequality $D_f^{\Gamma}(P\|Q) \le \frac{1}{\lambda} I_f^{\Gamma}(P\|Q)$ holds (true when Q is sub-Gaussian), then we obtain exponential convergence to Q for any $P_0$: $D_f^{\Gamma}(P_t\|Q) \le e^{-\lambda t} D_f^{\Gamma}(P_0\|Q)$. There exist various results, e.g. Carrillo et al. (2006); Dolbeault et al. (2008); Markowich & Villani (2000); Wang (2008), reviewed in Sections B.3 and B.4, on functional inequalities when $L = \infty$, which have been used to prove convergence to equilibrium (at exponential or polynomial rate) for the Fokker-Planck and/or the porous medium equation when the target distribution is a Gaussian, stretched exponential, or Student-t type distribution. The existence of functional inequalities for Lipschitz-regularized gradient flows is an open question.
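The dissipation identity (27) can be verified exactly in a Gaussian KL example, where $p_t = N(m_t, 1)$, $q = N(0,1)$, the discriminator $\phi^* = \log(p_t/q)$ has constant gradient $m_t$, and hence $I = m_t^2$ and $D_{KL} = m_t^2/2$. This closed-form case is an illustration of the unregularized ($L$ effectively inactive) setting, not the Lipschitz-constrained one.

```python
import numpy as np

dt = 1e-3
m = 3.0                           # p_t = N(m, 1), target q = N(0, 1)
kl = [m ** 2 / 2]                 # D_KL(N(m,1) || N(0,1)) = m^2/2
fisher = []
for _ in range(5000):
    grad_phi = m                  # grad phi*(x) = grad log(p_t/q)(x) = m (constant)
    fisher.append(grad_phi ** 2)  # I(P_t||Q) = E_P |grad phi*|^2 = m^2
    m = m - dt * grad_phi         # every particle (hence the mean) moves by -grad phi*
    kl.append(m ** 2 / 2)

lhs = kl[-1]                      # D(P_T || Q)
rhs = kl[0] - dt * sum(fisher)    # D(P_0||Q) minus Riemann sum of I(P_s||Q)
```

Up to the time-discretization error, `lhs` and `rhs` agree, which is the integrated form of (27): the divergence decreases exactly by the accumulated kinetic energy of the particles.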

B.2 CFL CONDITION AND LEARNING RATES FOR PARTICLES

For a Lipschitz-regularized gradient flow (24), the transporter/discriminator representation (9) implies that the domain of dependence is determined by the velocity fields $\nabla\phi_t^{L,*}$, whose norm is bounded by the Lipschitz constant L. Therefore the domain of dependence of the solution is finite and is contained in a cone of slope L that emanates from any point (x, t) back to the time plane t = 0. From a numerical analysis point of view, (10) is an explicit numerical scheme,
$$\frac{p_{n+1} - p_n}{\Delta t} = \mathrm{div}\big(p_n \nabla\phi_n^{L,*}(x)\big).$$
For corresponding spatial discretization schemes there is an abundance of numerical methods which we can use to gain numerical analysis insight into our particle schemes (10). In particular, the Courant-Friedrichs-Lewy (CFL) condition for stability of discrete schemes asserts that a numerical method can be convergent only if its numerical domain of dependence contains the true domain of dependence of the continuous PDE, LeVeque (2007). In our context, the CFL condition reads
$$\sup_x |\nabla\phi_t^{L,*}(x)|\, \frac{\Delta t}{\Delta x} \le C_{max},$$
where $C_{max} = 1$ for such explicit schemes, LeVeque (2007). Clearly, the Lipschitz regularization enforces a CFL-type condition with a learning rate $\Delta t$ proportional to the inverse of L. It remains an open question how to rigorously extend this CFL analysis to particle-based algorithms where the spatial discretization grid $\Delta x$ is known only implicitly, as noted in Remark 4.1(b); see also related questions in Carrillo et al. (2017). Nevertheless, in our experiments we explore the inversely proportional relation between L and $\Delta t$ suggested by the CFL analysis.
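The role of the CFL condition can be seen on the simplest transport surrogate: advection with a constant velocity of size L, the bound on $|\nabla\phi^{L,*}|$. The upwind scheme, grid, and step sizes below are illustrative assumptions.

```python
import numpy as np

def advect(dt, steps=200):
    # explicit upwind scheme for p_t + L p_x = 0, with velocity bounded by L > 0
    L, dx = 2.0, 0.01
    x = np.arange(0.0, 1.0, dx)
    p = np.exp(-((x - 0.3) ** 2) / 0.002)          # initial bump
    for _ in range(steps):
        p = p - L * dt / dx * (p - np.roll(p, 1))  # periodic upwind step
    return p

stable = advect(dt=0.004)    # Courant number L*dt/dx = 0.8 <= C_max = 1
unstable = advect(dt=0.008)  # Courant number 1.6 > 1: violates the CFL condition
```

With Courant number 0.8 the solution stays bounded, while 1.6 violates the CFL bound and the high-frequency modes blow up, mirroring the inverse relation between L and the admissible learning rate $\Delta t$.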

B.3 FOKKER-PLANCK EQUATION, AND ITS CONVERGENCE TO EQUILIBRIUM STATE

Generalized Fokker-Planck as the gradient flow of f-divergences. Let $p_t$ be the density of $P_t$. The associated gradient flow is given by the generalized Fokker-Planck equation
$$\partial_t p_t = \nabla\cdot\Big(p_t\nabla\frac{\delta D_f(P\|Q)}{\delta P}(p_t)\Big) = \nabla\cdot\Big(p_t\nabla f'\Big(\frac{p_t}{q}\Big)\Big). \qquad (29)$$
The Fokker-Planck as gradient flow of KL. When $f = f_{KL}$, we obtain the well-known Fokker-Planck equation $\partial_t p_t - \Delta p_t + \nabla\cdot\big(p_t\frac{\nabla q}{q}\big) = 0$.

B.3.1 EXPONENTIAL DECAY WHEN $q \propto e^{-V}$ AND V IS λ-CONVEX

In this section we assume for simplicity that the probability densities of both source and target distributions exist and are denoted by p, q. We consider the Cauchy problem for the Fokker-Planck equation given in Section B with
$$p(t=0,\cdot) = p \ge 0 \quad \text{and} \quad \int p = 1. \qquad (30)$$
The next theorem, from Markowich & Villani (2000), gives conditions under which a probability measure satisfies a logarithmic Sobolev inequality and, consequently, exponential decay holds.

Theorem B.3. Let $q \in L^1(\mathbb{R}^d)$ and let V be λ-convex (i.e. $D^2V(x) \ge \lambda I_d$ for all $x\in\mathbb{R}^d$), where $I_d$ is the identity matrix of dimension d. Then q satisfies a logarithmic Sobolev inequality with constant λ, i.e. $D_{KL}(p\|q) \le \frac{1}{2\lambda} I(p\|q)$, and the solution of the homogeneous Fokker-Planck equation converges to equilibrium in KL divergence at rate at least $e^{-2\lambda t}$.

Typical examples that satisfy the conditions of Theorem B.3 are
$$q(x) = \frac{e^{-|x|^\beta}}{\int e^{-|x|^\beta}\,dx}, \quad x\in\mathbb{R}^d, \ \beta \ge 2. \qquad (31)$$
When β = 2, the target probability distribution with density q is the Gaussian with variance σ and zero mean, i.e. $q(x) = \frac{1}{(2\pi\sigma)^{d/2}}e^{-\frac{|x|^2}{2\sigma}}$ (i.e. $V(x) = \frac{|x|^2}{2\sigma}$). By applying Theorem B.3, we get that for any initial probability distribution P which is absolutely continuous with respect to Q,
$$D_{KL}(p_t\|q) \le D_{KL}(p_0\|q)\,e^{-2t/\sigma}, \qquad (32)$$
where we have also used the Stam-Gross logarithmic Sobolev inequality, i.e. $D_{KL}(p\|q) \le \frac{\sigma}{2}I(p\|q)$, see formula (14) in Markowich & Villani (2000).
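The decay estimate (32) can be checked in closed form for the Ornstein-Uhlenbeck relaxation toward $q = N(0,\sigma)$, for which the Gaussian law of $p_t$ is explicit; the initial condition and σ below are illustrative choices.

```python
import numpy as np

sigma = 2.0           # target q = N(0, sigma): V(x) = |x|^2/(2 sigma), lambda = 1/sigma
m0, v0 = 3.0, 0.5     # initial p_0 = N(3, 0.5)

def kl_gauss(m, v, s):
    # D_KL( N(m, v) || N(0, s) )
    return 0.5 * (v / s + m ** 2 / s - 1.0 - np.log(v / s))

kl0 = kl_gauss(m0, v0, sigma)
ok = True
for t in np.linspace(0.1, 5.0, 50):
    mt = m0 * np.exp(-t / sigma)                        # OU mean relaxation
    vt = sigma + (v0 - sigma) * np.exp(-2 * t / sigma)  # OU variance relaxation
    # check the exponential bound (32): KL(p_t||q) <= KL(p_0||q) exp(-2t/sigma)
    ok = ok and kl_gauss(mt, vt, sigma) <= kl0 * np.exp(-2 * t / sigma) + 1e-12
```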
B.3.2 POLYNOMIAL DECAY WHEN $q\propto e^{-V}$ AND V IS DEGENERATELY CONVEX AT INFINITY

We consider a potential $V \in W^{2,\infty}_{loc}$ such that $\int q = 1$ and V is degenerately convex at infinity, i.e.
$$U(u) - a \le V(u) \le U(u) + b, \qquad (33)$$
where a, b are nonnegative constants and U is degenerately convex, i.e.
$$D^2U(u) \ge c(1+|u|)^{\beta-2}, \quad c>0, \ \beta\in(0,2). \qquad (34)$$
Without loss of generality we assume that U attains its unique minimum at 0. We further assume that for some $b, c, C_0 > 0$,
$$\nabla V(u)\cdot u \ge c|u|^b - C_0. \qquad (35)$$
A typical potential that satisfies (33), (34) and (35) is $V = |x|^\beta$ with $0<\beta<2$. Before we state the next theorem, from Toscani & Villani (2000), we define the quantities
$$M_s(p) := \int p(x)(1+|x|^2)^{s/2}\,dx, \ s>2, \quad \text{and} \quad \delta := \frac{2-\beta}{2(2-\beta)+(s-2)} \in \Big(0,\frac{1}{2}\Big). \qquad (36)$$
Theorem B.4. Let V be a potential satisfying assumptions (33), (34) and (35). Let $p_0$ be a probability density such that $D_{KL}(p_0\|q) < \infty$ and $M_s(p_0) < \infty$, with $M_s$ given in (36) for some s > 2. Let also $\{p_t\}_{t\ge0}$ be a (smooth) solution of the Fokker-Planck equation with potential V and initial datum $p_0$. Then there is a constant C, depending on $D_{KL}(p_0\|q)$, $M_s(p_0)$ and s, such that for all t > 0,
$$D_{KL}(p_t\|e^{-V}) \le \frac{C}{t^\kappa}, \quad \text{with } \kappa = \frac{1-2\delta}{\delta} = \frac{s-2}{2-\beta}, \qquad (37)$$
where δ is given in (36). Note that as β → 2 one recovers the usual logarithmic Sobolev inequality as discussed in Section B.3.1. We summarize the above examples in the following table.

Table 2: Rate of convergence to equilibrium state $q\propto e^{-V}$ in KL divergence
  $q\propto e^{-|x|^\beta}$, β ≥ 2 : at least $e^{-2\lambda t}$
  Special case N(0, σ) : at least $e^{-2\lambda t}$, with λ = 1/σ
  $q\propto e^{-|x|^\beta}$, 0 < β < 2 : $O(t^{-\kappa})$, κ as in (37)

The gradient flow of f-divergences for the likelihood ratio. One may rewrite (29) in terms of the likelihood ratio $h_t = \frac{dp_t}{dq}$. By using the operator identity (q denoting the multiplication operator by the function q), i.e.
$\nabla q = q(\nabla + \nabla\log q)$, we have that $\nabla\cdot\big(p_t\nabla f'\big(\frac{p_t}{q}\big)\big) = q(\nabla + \nabla\log q)\cdot\big(h_t\nabla f'(h_t)\big)$, and thus we can rewrite (29) as
$$\partial_t h_t(x) = (\nabla + \nabla\log q)\cdot\big(h_t\nabla f'(h_t)\big). \qquad (39)$$
Moreover, if we denote by $\nabla^*$ the adjoint of ∇ on $L^2(q)$, we have $\nabla^* = -(\nabla + \nabla\log q)$, and thus (39) takes the form
$$\partial_t h_t(x) = -\nabla^*\big(h_t\nabla f'(h_t)\big). \qquad (40)$$
Let now $f_\alpha(x) = \frac{x^\alpha-1}{\alpha(\alpha-1)}$. Since $h\nabla f'_\alpha(h) = \frac{1}{\alpha-1}h\nabla h^{\alpha-1} = \frac{1}{\alpha}\nabla h^\alpha$, we obtain
$$\partial_t h_t(x) = \frac{1}{\alpha}(\Delta + \nabla\log q\cdot\nabla)\,h_t^\alpha, \quad t\ge0, \ x\in\mathbb{R}^d, \qquad (41)$$
which, together with a non-negative initial condition $h(x,0) = h_0(x)$, $x\in\mathbb{R}^d$, is called the weighted porous medium equation. For existence and uniqueness, see Dolbeault et al. (2008).

Remark B.5. The formula for $f^*_\alpha$ is given by
$$f^*_\alpha(y) = \begin{cases} \frac{(\alpha-1)^{\frac{\alpha}{\alpha-1}}}{\alpha}\, y^{\frac{\alpha}{\alpha-1}}\,\mathbf{1}_{y>0} + \frac{1}{\alpha(\alpha-1)}, & \alpha > 1,\\ \infty\,\mathbf{1}_{y\ge0} + \Big(\frac{1}{\alpha(1-\alpha)^{\frac{\alpha}{1-\alpha}}}|y|^{-\frac{\alpha}{1-\alpha}} - \frac{1}{\alpha(1-\alpha)}\Big)\mathbf{1}_{y<0}, & \alpha\in(0,1). \end{cases}$$
Remark B.6. For completeness, we discuss a related gradient flow known as the granular media equation: the 2-Wasserstein gradient flow of $F(p) = \frac{1}{2}MMD[p,q]^2$, where MMD[p, q] is the maximum mean discrepancy (MMD), Gretton et al. (2012). Recalling (2), the MMD is defined as $MMD[p,q] = \sup_{g\in B_{RKHS}(0,1)}\{E_Q[g] - E_P[g]\}$, and the (unnormalized) maximizer
$$\phi^*(z) = \int k(x,z)p(x)\,dx - \int k(x,z)q(x)\,dx = k\star p(z) - k\star q(z) \qquad (42)$$
is called the witness function between the probability densities q and p. In fact, φ* is the difference between the mean embeddings of p and q, which allows the MMD to be rewritten as the RKHS norm of the unnormalized witness, i.e.
$$MMD[p,q] = \|\phi^*\|_H. \qquad (43)$$
The gradient flow equation associated to F then leads to the granular media equation,
$$\partial_t p_t(x) = \mathrm{div}\big(p_t\nabla(k\star p_t - k\star q)\big) \equiv \mathrm{div}(p_t\nabla\phi^*_t). \qquad (44)$$
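The witness function in Remark B.6 yields a concrete particle scheme: moving particles along $-\nabla\phi^*$ with $\phi^* = k\star p - k\star q$ decreases the MMD. Below is a minimal sample-based sketch with a Gaussian kernel; the sample sizes, bandwidth, step size, and initial offset are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
yq = rng.normal(0.0, 1.0, size=400)   # target samples from q
yp = rng.normal(2.0, 1.0, size=400)   # particle positions p_t

def k(a, b):  # Gaussian kernel k(x, y) = exp(-(x-y)^2/2)
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / 2)

def mmd2(a, b):
    return k(a, a).mean() - 2 * k(a, b).mean() + k(b, b).mean()

m0 = mmd2(yp, yq)
gamma = 1.0
for _ in range(500):
    d_p = yp[:, None] - yp[None, :]   # z - x for x ~ p
    d_q = yp[:, None] - yq[None, :]   # z - x for x ~ q
    # grad phi*(z) = mean_p grad_z k(x,z) - mean_q grad_z k(x,z)
    grad_phi = (-d_p * np.exp(-d_p ** 2 / 2)).mean(axis=1) \
             - (-d_q * np.exp(-d_q ** 2 / 2)).mean(axis=1)
    yp = yp - gamma * grad_phi        # granular media particle update (44)
```

After the flow, the (biased, sample-based) squared MMD between particles and target is much smaller than its initial value; as noted in Figure 5, for well-separated modes this flow can stall, which is one motivation for the Lipschitz-regularized divergences.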

B.4.2 FUNCTIONAL INEQUALITIES FOR THE WEIGHTED POROUS MEDIUM EQUATION

In this section, we apply Theorem 4.5 in Dolbeault et al. (2008) to the weighted porous medium equation for the likelihood ratio $h_t = \frac{p_t}{q}$ and prove polynomial decay in KL and χ²-divergence. Before stating the result, we first define the $L^r$-Poincaré inequality and the $L^r$-logarithmic Sobolev inequality (see also Dolbeault et al. (2008)).

Definition B.7. Let q be a probability measure on a Riemannian manifold (M, g). For any smooth function $f\in C^1(M)$, the entropy and variance are defined as
$$\mathrm{Ent}_q(f) := \int f\log\Big(\frac{f}{\int f\,dq}\Big)\,dq, \qquad \mathrm{Var}_q(f) := \int\Big(f - \int f\,dq\Big)^2 dq.$$
Definition B.8. Let q be a probability measure on a Riemannian manifold (M, g), and let ν be a positive measure on (M, g). We assume that $r \in (0,1]$. We say that (q, ν) satisfies the $L^r$-Poincaré inequality with constant $C_P$ if and only if, for any nonnegative function $f\in C^1(M)$,
$$\mathrm{Var}_q(f^{2r})^{1/r} \le C_P\int|\nabla f|^2\,d\nu.$$
We say that (q, ν) satisfies the $L^r$-logarithmic Sobolev inequality with constant $C_{LS}$ if and only if, for any nonnegative function $f\in C^1(M)$,
$$\mathrm{Ent}_q(f^{2r})^{1/r} \le C_{LS}\int|\nabla f|^2\,d\nu.$$
Theorem B.9. If (q, q) satisfies an $L^{2/3}$-Poincaré inequality with constant $C_P > 0$, then for any non-negative initial condition $h_0 \equiv \frac{p_0}{q}\in L^2(q)$ we have, for every $t\ge0$,
$$\chi^2(p_t\|q) \le \Big(\chi^2(p_0\|q)^{-1/2} + \frac{8}{9\,C_P}\,t\Big)^{-2}. \qquad (49)$$
Conversely, if the above inequality is satisfied for any $h_0$, then (q, q) satisfies an $L^{2/3}$-Poincaré inequality with constant $C_P > 0$.

Theorem B.10. Let α > 1. If (q, q) satisfies an $L^{1/\alpha}$-logarithmic Sobolev inequality with constant $C_{LS} > 0$, then for any non-negative initial condition $h_0$ such that $D_{KL}(p_0\|q)<\infty$ we have, for every $t\ge0$,
$$D_{KL}(p_t\|q) \le \Big([D_{KL}(p_0\|q)]^{1-\alpha} + \frac{4(\alpha-1)}{\alpha\, C_{LS}}\,t\Big)^{-1/(\alpha-1)}.$$
Conversely, if the above inequality is satisfied for any $h_0$, then (q, q) satisfies an $L^{1/\alpha}$-logarithmic Sobolev inequality with constant $C_{LS} > 0$.
Next we discuss two examples of probability distributions satisfying the $L^r$-Poincaré inequality and the $L^r$-logarithmic Sobolev inequality:

1. Let $r\in(0,1]$ and $\beta\in[\frac12,1)$. The probability measure $dq = \frac{1}{2\Gamma(1+\frac{1}{\beta})}e^{-|x|^\beta}dx$, $x\in\mathbb{R}$, satisfies an $L^r$-Poincaré inequality and an $L^r$-logarithmic Sobolev inequality.
2. Let $r\in[\frac12,1)$. Then, for $\beta > \frac{2r}{1-r}$, the probability measure $dq = \frac{\beta}{(1+|x|)^{1+\beta}}dx$, $x\in\mathbb{R}$, satisfies an $L^r$-Poincaré inequality and an $L^r$-logarithmic Sobolev inequality.

Table 3: Rate of convergence to equilibrium state $q\propto e^{-V}$ in χ²-divergence
  $q = \frac{e^{-|x|^\beta}}{2\Gamma(1+1/\beta)}$, 0 < r ≤ 1, 1/2 ≤ β < 1 : given in (49)
  $q = \frac{\beta}{(1+|x|)^{1+\beta}}$, 1/2 ≤ r < 1, β ≥ 2r/(1-r) : given in (49)

C FIRST VARIATION OF REGULARIZED DIVERGENCES

In this section we prove the following theorem.

Theorem C.1. Assume f is superlinear and strictly convex and $P, Q \in P_1(\mathbb{R}^d)$.
1. For $y\notin \mathrm{supp}(Q)$ define $\phi^{L,*}(y) = \inf_{x\in\mathrm{supp}(Q)}\{\phi^{L,*}(x) + L|x-y|\}$; then $\phi^{L,*}$ is Lipschitz continuous on $\mathbb{R}^d$.
2. $\phi^{L,*} = \sup\{h(x) : h\in\Gamma_L,\ h(y) = \phi^{L,*}(y) \text{ for every } y\in\mathrm{supp}(Q)\}$.
3. Let ρ be a signed measure of total mass 0 and let $\rho = \rho^+ - \rho^-$, where $\rho^\pm \in P_1(K)$ are mutually singular. If $P + \epsilon\rho \in P_1(K)$ for sufficiently small |ε|, then $D_f^{\Gamma_L}(P+\epsilon\rho\|Q)$ is differentiable at ε = 0 and we have
$$\lim_{\epsilon\to0}\frac{1}{\epsilon}\Big(D_f^{\Gamma_L}(P+\epsilon\rho\|Q) - D_f^{\Gamma_L}(P\|Q)\Big) = \int\phi^{L,*}\,d\rho.$$
In other words, we have
$$\frac{\delta D_f^{\Gamma_L}(P\|Q)}{\delta P}(P) = \phi^{L,*}.$$
Proof. The proof of 1. is straightforward using the triangle inequality for norms. For 2., since $h\in\Gamma_L$, we have $h(x) \le h(y) + L\|x-y\|$. This implies that for $y\in\mathrm{supp}(Q)$ and $x\notin\mathrm{supp}(Q)$,
$$h(x) \le \inf_{y\in\mathrm{supp}(Q)}\{h(y) + L\|x-y\|\} = \inf_{y\in\mathrm{supp}(Q)}\{\phi^{L,*}(y) + L\|x-y\|\} = \phi^{L,*}(x).$$
Since $\phi^{L,*}\in\Gamma_L$, this concludes the proof. For 3., we use the variational formula (3) for $D_f^{\Gamma_L}(P+\epsilon\rho\|Q)$, where we suppose that $P+\epsilon\rho\in P_1(\mathbb{R}^d)$:
$$D_f^{\Gamma_L}(P+\epsilon\rho\|Q) = \sup_{\phi\in\Gamma_L}\Big\{E_{P+\epsilon\rho}[\phi] - \inf_{\nu\in\mathbb{R}}\{\nu + E_Q[f^*(\phi-\nu)]\}\Big\} \ge \int\phi^{*,L}\,d(P+\epsilon\rho) - \inf_{\nu\in\mathbb{R}}\Big\{\nu + \int f^*(\phi^{*,L}-\nu)\,dQ\Big\} = \epsilon\int\phi^{*,L}\,d\rho + D_f^{\Gamma_L}(P\|Q).$$
Thus
$$\liminf_{\epsilon\to0^+}\frac{1}{\epsilon}\Big(D_f^{\Gamma_L}(P+\epsilon\rho\|Q) - D_f^{\Gamma_L}(P\|Q)\Big) \ge \int\phi^{*,L}\,d\rho.$$
For the other direction, define $F(\epsilon) = D_f^{\Gamma_L}(P+\epsilon\rho\|Q)$. By Theorems 18 and 71 in Birrell et al. (2022a), F is convex, lower semicontinuous and finite on $[0,\epsilon_0]$. By convexity, F is differentiable on $(0,\epsilon_0)$ except at countably many points. Let $\epsilon\in(0,\epsilon_0)$ be a point of differentiability and let δ > 0 be small.
Also, let $\phi^{*,L}_\epsilon$ be the optimizer of $D_f^{\Gamma_L}(P+\epsilon\rho\|Q)$ satisfying $\phi^{*,L}_\epsilon(0) = 0$, so that
$$D_f^{\Gamma_L}(P+\epsilon\rho\|Q) = \int\phi^{*,L}_\epsilon\,d(P+\epsilon\rho) - \inf_{\nu\in\mathbb{R}}\Big\{\nu + \int f^*(\phi^{*,L}_\epsilon - \nu)\,dQ\Big\}.$$
By the same argument as before,
$$D_f^{\Gamma_L}(P+(\epsilon+\delta)\rho\|Q) - D_f^{\Gamma_L}(P+\epsilon\rho\|Q) \ge \delta\int\phi^{*,L}_\epsilon\,d\rho \quad\text{and}\quad D_f^{\Gamma_L}(P+(\epsilon-\delta)\rho\|Q) - D_f^{\Gamma_L}(P+\epsilon\rho\|Q) \ge -\delta\int\phi^{*,L}_\epsilon\,d\rho, \qquad (55)$$
which gives
$$\int\phi^{*,L}_\epsilon\,d\rho \le \lim_{\delta\to0}\frac{1}{\delta}\Big(D_f^{\Gamma_L}(P+(\epsilon+\delta)\rho\|Q) - D_f^{\Gamma_L}(P+\epsilon\rho\|Q)\Big) = F'(\epsilon) = \lim_{\delta\to0}\frac{1}{\delta}\Big(D_f^{\Gamma_L}(P+\epsilon\rho\|Q) - D_f^{\Gamma_L}(P+(\epsilon-\delta)\rho\|Q)\Big) \le \int\phi^{*,L}_\epsilon\,d\rho.$$
Consequently,
$$F'(\epsilon) = \int\phi^{*,L}_\epsilon\,d\rho. \qquad (57)$$
Let $F'_+(0) = \lim_{\epsilon\to0^+}\frac{1}{\epsilon}(F(\epsilon) - F(0))$. By convexity, for any sequence $\{\epsilon_n\}_{n\in\mathbb{N}}$ with $\epsilon_0 > \epsilon_n \downarrow 0$, we have $F'_+(0) = \lim_{n\to\infty}F'(\epsilon_n) = \lim_{n\to\infty}\int\phi^{*,L}_{\epsilon_n}\,d\rho$. By applying the Arzelà-Ascoli theorem to $\phi^{*,L}_{\epsilon_n}$, followed by a diagonalization argument, there exists a subsequence $\{n_k\}_{k\ge0}\subset\{n\}_{n\ge0}$ such that $\phi^{*,L}_{\epsilon_{n_k}}$ converges locally uniformly. By the lower semicontinuity of $D_f^{\Gamma_L}(\cdot\|Q)$, we have
$$D_f^{\Gamma_L}(P\|Q) \le \liminf_{n\to\infty}D_f^{\Gamma_L}(P+\epsilon_n\rho\|Q) = \liminf_{n\to\infty}\Big\{E_{P+\epsilon_n\rho}[\phi^{*,L}_{\epsilon_n}] - \inf_{\nu\in\mathbb{R}}\{\nu + E_Q[f^*(\phi^{*,L}_{\epsilon_n}-\nu)]\}\Big\} \le \liminf_{n\to\infty}E_{P+\epsilon_n\rho}[\phi^{*,L}_{\epsilon_n}] - \limsup_{n\to\infty}\inf_{\nu\in\mathbb{R}}\{\nu + E_Q[f^*(\phi^{*,L}_{\epsilon_n}-\nu)]\} \le E_P[\phi^{*,L}_0] - \inf_{\nu\in\mathbb{R}}\{\nu + E_Q[f^*(\phi^{*,L}_0-\nu)]\} \le D_f^{\Gamma_L}(P\|Q),$$
where for the second inequality we use the dominated convergence theorem, (57), and Fatou's lemma.

For generalization purposes, we identify these mappings as Dirac kernels $K_E(y,dz) = \delta_{E(y)}(dz)$ and $K_D(z,dy) = \delta_{D(z)}(dy)$.
1. The pullback function induced by the kernel $K_E$ is given by $K_E[f](y) := \int f(z')K_E(y,dz') = f(E(y))$ for $f\in M_b(Z)$, $y\in Y$.
2. The push-forward measure induced by the kernel $K_E$ maps $P(Y)\to P(Z)$ and is given by $K_E[P](B) := \int K_E(y,B)P(dy) = P\circ E^{-1}(B)$ for $P\in P(Y)$ and a Z-measurable set B.
Likewise, the kernel $K_D$ induces a pullback function and a push-forward measure in the opposite direction.
In the previous formulation, the $Q_Y$-perfect encoding property $D_\#E_\#Q_Y = Q_Y$ can be rewritten as $Q_Y = K_D[K_E[Q_Y]]$. Given any $P_Y\in P(Y)$, we have the latent probability measure $P_Z = K_E[P_Y]\in P(Z)$ and the reconstructed probability measure $\bar{P}_Y = K_D[K_E[P_Y]]\in P(D(Z))$, where $D(Z)\subset Y = \mathbb{R}^d$. In general, $\bar{P}_Y \ne P_Y$. Transition probability kernels are defined in the form of conditional distributions: $K_p(y,dz) = p(dz|y)$ from Y to Z and $K_q(z,dy) = q(dy|z)$ from Z to Y. The kernel-induced pullback functions $K_p[f](y) = \int f(z')p(dz'|y) = E_{Z|Y=y\sim p(dz|y)}[f(Z)|Y=y]$ and $K_q[g](z) = \int g(y')q(dy'|z) = E_{Y|Z=z\sim q(dy|z)}[g(Y)|Z=z]$ are interpreted as conditional expectations. In addition, the kernels induce push-forward measures $P_Z(dz) = \int p(dz|y)P_Y(dy)$ for $P_Y\in P(\mathbb{R}^d)$, and $R_Y(dy) = \int q(dy|z)R_Z(dz)$ for $R_Z\in P(\mathbb{R}^{d'})$. For the Q-perfect encoding property, we require these kernels to satisfy $Q_Y(dy) = \int\!\!\int q(dy|z)\,p(dz|y')\,Q_Y(dy')$.

D.2 DATA PROCESSING INEQUALITY AND PROOF OF THEOREM 5.1

The proof of Theorem 5.1 is a consequence of a new, tighter data processing inequality derived in Birrell et al. (2022a) that involves transformations of both probabilities and discriminator spaces Γ.

Theorem D.1 (Data processing inequality for (f, Γ)-divergences). Given a real valued convex function f, $P, Q\in P(\Omega)$, and a probability kernel K from (Ω, M) to (N, N), if Γ ⊂ N is nonempty, then $D_f^\Gamma(K[P]\|K[Q]) \le D_f^{K[\Gamma]}(P\|Q)$.

Proof. From the variational formulation of the divergences, we have
$$D_f^\Gamma(K[P]\|K[Q]) = \sup_{\phi\in\Gamma,\,\nu\in\mathbb{R}}\Big\{\int\!\!\int(\phi(y)-\nu)K(x,dy)P(dx) - \int\!\!\int f^*(\phi(y)-\nu)K(x,dy)Q(dx)\Big\}. \qquad (62)$$
Since f* is convex, Jensen's inequality gives $\int f^*(\phi(y)-\nu)K(x,dy) \ge f^*\big(\int(\phi(y)-\nu)K(x,dy)\big)$ for all x ∈ Ω. Hence,
$$D_f^\Gamma(K[P]\|K[Q]) \le \sup_{\phi\in\Gamma,\,\nu\in\mathbb{R}}\big\{E_P[K[\phi]-\nu] - E_Q[f^*(K[\phi]-\nu)]\big\} = D_f^{K[\Gamma]}(P\|Q).$$
We now state and prove a generalized version of Theorem 5.1.

Theorem D.2. Suppose that:
1. Perfect encoding: for $Q_Y$, the encoder E and decoder D are such that $K_D[K_E[Q_Y]] = Q_Y$.
2. $K_D[\Gamma_Y]\subset\Gamma_Z$: the pullback functions induced by the decoder kernel are included in the latent discriminator space.
Then, for any $P_Z\in P(\mathbb{R}^{d'})$ we have $D_f^{\Gamma_Y}(K_D[P_Z]\|Q_Y) \le D_f^{\Gamma_Z}(P_Z\|K_E[Q_Y])$.

Proof. Since the encoder E and the decoder D perfectly reconstruct $Q_Y$, $D_f^{\Gamma_Y}(K_D[P_Z]\|Q_Y) = D_f^{\Gamma_Y}(K_D[P_Z]\|K_D[K_E[Q_Y]])$. By the data processing inequality, $D_f^{\Gamma_Y}(K_D[P_Z]\|K_D[K_E[Q_Y]]) \le D_f^{K_D[\Gamma_Y]}(P_Z\|K_E[Q_Y])$. By the assumption $K_D[\Gamma_Y]\subset\Gamma_Z$, $D_f^{K_D[\Gamma_Y]}(P_Z\|K_E[Q_Y]) \le D_f^{\Gamma_Z}(P_Z\|K_E[Q_Y])$.
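In the limit where Γ contains all measurable functions, Theorem D.1 reduces to the classical data processing inequality $D_f(K[P]\|K[Q]) \le D_f(P\|Q)$, which is easy to check numerically for discrete distributions. The distributions and the kernel below are random illustrative choices, with f corresponding to the KL divergence.

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.random(5); p /= p.sum()       # P on a 5-point space
q = rng.random(5); q /= q.sum()       # Q on the same space
K = rng.random((5, 3))
K /= K.sum(axis=1, keepdims=True)     # probability kernel (row-stochastic matrix)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

kp, kq = p @ K, q @ K                 # pushforwards K[P], K[Q]
# data processing: KL(K[P] || K[Q]) <= KL(P || Q)
```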

D.3 LATENT GENERATIVE PARTICLES ALGORITHM

Result: $\{Y_T^{(i)}\}_{i=1}^N$
  Sample $\{\bar{X}^{(i)} = E(X^{(i)})\in\mathbb{R}^{d'}\}_{i=1}^N$, where $\{X^{(i)}\}\sim Q$ is a batch from the real data
  Sample $\{\bar{Y}_0^{(i)} = E(Y^{(i)})\in\mathbb{R}^{d'}\}_{i=1}^N$, where $\{Y^{(i)}\}\sim P_0 = P$ is a batch of prior samples
  Apply the Lipschitz regularized generative particles algorithm (Algorithm 1) to $\bar{X}^{(i)}$ and $\bar{Y}_0^{(i)}$
  Reconstruct $Y_T^{(i)} = D(\bar{Y}_T^{(i)})$

E EXPERIMENTAL SETTING

Neural network architectures. The discriminator φ (analogous to the GAN setting) is implemented using a neural network. In Table 4 we provide the architectures of the neural networks used to produce the experimental results. The Lipschitz constraint on φ is implemented by spectral normalization (SN): the weight matrix in each of the D layers is rescaled to spectral norm $\|W_l\|_2 = L^{1/D}$, which acts as a hard constraint. We also tried imposing a gradient penalty term in the loss (a soft constraint), but it required additional tuning of the initial weight scales and did not reliably keep the particle speeds bounded by L. We therefore impose the hard constraint throughout the paper.

Data sets and important parameters. See Table 5. More details can be found in the supplementary material README.md. We compare GPA with other generative dynamics, such as RKHS-based methods (Figure 5) and score-based methods, on examples of 2D mixtures of Gaussians (Figure 6).

(a) Image data (MNIST, CIFAR10), CNN discriminator:
  5×5 Conv SN, 2×2 stride (1 → ch1), leaky ReLU
  5×5 Conv SN, 2×2 stride (ch1 → ch2), leaky ReLU
  5×5 Conv SN, 2×2 stride (ch2 → ch3), leaky ReLU
  Flatten with dimension ℓ3
  $W_4\in\mathbb{R}^{\ell_3\times d}$ with SN, $b_4\in\mathbb{R}^d$, ReLU
  $W_5\in\mathbb{R}^{d\times1}$ with SN, $b_5\in\mathbb{R}$, Linear
(b) Low dimensional data with dimension d, FNN discriminator:
  $W_1\in\mathbb{R}^{d\times\ell_1}$ with SN, $b_1\in\mathbb{R}^{\ell_1}$, ReLU
  $W_2\in\mathbb{R}^{\ell_1\times\ell_2}$ with SN, $b_2\in\mathbb{R}^{\ell_2}$, ReLU
  $W_3\in\mathbb{R}^{\ell_2\times\ell_3}$ with SN, $b_3\in\mathbb{R}^{\ell_3}$, ReLU
  $W_4\in\mathbb{R}^{\ell_3\times1}$ with SN, $b_4\in\mathbb{R}$, Linear
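The hard constraint can be sketched as follows: rescaling each of the D weight matrices to spectral norm $L^{1/D}$ makes a ReLU network at most L-Lipschitz, since ReLU is 1-Lipschitz and Lipschitz constants compose multiplicatively. The layer shapes and L below are illustrative; practical implementations estimate the spectral norm by power iteration during training rather than by an exact SVD.

```python
import numpy as np

def spectrally_normalize(weights, L):
    # rescale each of the D weight matrices to spectral norm L**(1/D)
    D = len(weights)
    target = L ** (1.0 / D)
    return [W * (target / np.linalg.norm(W, 2)) for W in weights]

rng = np.random.default_rng(4)
Ws = [rng.normal(size=(64, 32)),   # layer 1: R^32 -> R^64
      rng.normal(size=(64, 64)),   # layer 2: R^64 -> R^64
      rng.normal(size=(1, 64))]    # layer 3: R^64 -> R
Ws_sn = spectrally_normalize(Ws, L=10.0)

# ReLU is 1-Lipschitz, so the network Lipschitz constant is bounded by the
# product of the layer spectral norms, which equals L by construction
lip_bound = float(np.prod([np.linalg.norm(W, 2) for W in Ws_sn]))
```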

F.2 ADDITIONAL LATENT GENERATIVE PARTICLES EXAMPLE

We applied the $(f_{KL}, \Gamma_1)$ generative particle algorithm in the latent space and reconstructed to the high dimensional image data. In the high dimensional space, we first sampled initial particles from the logistic distribution and target data in $[0,1]^{28\times28}$ for MNIST and in $[0,1]^{32\times32\times3}$ for CIFAR10. For each of MNIST and CIFAR10, an autoencoder with a 128-dimensional latent space was trained, and GPA was then run in the 128-dimensional latent space. The number of training samples is N = 200 and 2000.

F.3 MICROARRAY GENE EXPRESSION DATA

The flexibility in the choice of source distribution and the small sample size regime enable our generative particles algorithm to be used for medical data-processing purposes. In addition, using the latent generative particles scheme, we can effectively handle high-dimensional data in a significantly reduced dimension; gene expression data, for example, typically has dimension of 5 to 6 × 10^5, depending on the probes. We suggest batch normalization/data merging as an application of our algorithm.

Dataset. We tested on publicly available gene expression data sets from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/):
• Breast cancer: Accession numbers GSE47109 (206 samples) and GSE10843 (245 samples).
The former forms the source dataset, and the latter forms the target dataset. The source and the target data lie in the same dimensional space $\mathbb{R}^{54{,}675}$.

Auto-encoder. We applied PCA on the combined matrix of the source and target data, which were first normalized to mean zero and variance one. The normalized PCA can be interpreted as a linear auto-encoder. The PCA decoder is Lipschitz continuous with $L = \sqrt{d'}$: let $y = \sum_{i=1}^{d'}z_iv_i$ and $y' = \sum_{i=1}^{d'}z_i'v_i \in \mathbb{R}^d$. The decoder $D(z) = \sum_{i=1}^{d'}z_iv_i$ satisfies
$$\|D(z) - D(z')\| = \Big\|\sum_{i=1}^{d'}(z_i - z_i')v_i\Big\| \le \|z - z'\|\sqrt{\sum_{i=1}^{d'}\|v_i\|^2} = \sqrt{d'}\,\|z - z'\|$$
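The decoder bound can be checked numerically with a small PCA autoencoder; the data matrix and $d'$ below are illustrative. Since the principal directions $v_i$ are orthonormal, the PCA decoder is in fact an isometry, so the $\sqrt{d'}$ bound obtained from Cauchy-Schwarz is a conservative Lipschitz constant.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 50))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize to mean 0, variance 1

d_prime = 5
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:d_prime].T                         # columns v_i: orthonormal PCA directions

encode = lambda y: y @ V                   # E : R^50 -> R^5
decode = lambda z: z @ V.T                 # D : R^5 -> R^50, D(z) = sum_i z_i v_i

z1, z2 = rng.normal(size=d_prime), rng.normal(size=d_prime)
ratio = float(np.linalg.norm(decode(z1) - decode(z2)) / np.linalg.norm(z1 - z2))
```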



bounded continuous functions, and by $\Gamma_L = \{f:\mathbb{R}^d\to\mathbb{R} : |f(x)-f(y)|\le L|x-y| \text{ for all } x, y\}$ the Lipschitz continuous functions with Lipschitz constant bounded by L (note that aΓ

; W) values to reside in the domain of f* and allows us to avoid the degeneracy. Remark 4.1. (a) The transport mechanism given by (11) is linear. However, nonlinear interactions between particles are introduced via the discriminator: the velocity field $\nabla\phi_n^{L,*}$ depends on all particles that comprise $P_n^N$ and $Q^N$ at each step n. (b) Computationally, the discriminator optimization (over Lipschitz continuous functions) is implemented, for example, via spectral normalization for neural network architectures. Moreover, the gradient of the discriminator is computed only at the positions of the particles. (c) The Lipschitz bound L on the discriminator space implies a pointwise bound $|\nabla\phi_n^{L,*}(Y^{(i)}$

Figure 1: MNIST; digits learned by different generative models (columns) with different numbers of training data (rows). The (f_KL, Γ_1)-GPA was able to learn digits from a small data set, while the other methods failed. Using sufficiently large training data, GANs performed better in capturing the scale, which can be observed in the more intense color contrast between a digit and its background. See FID scores in Table 6.

Figure 2: (Gene expression data; Merging Breast Cancer datasets). We merged gene data using our latent GPA in significantly lower dimensions. Two distinct gene data sets from the same disease are reduced from 54,675 dimensions to d′ = 2, 5, 10, 20, 50, 100, 200 using normalized PCA. Then, latent particles are transported using GPA. Blue: source (206 samples), red: target (245 samples), black: transported (206 samples). (a) Latent particles in $\mathbb{R}^{d'}$ with d′ = 20, encoded by the PCA. (b) Transported samples reconstructed in the real space. The 2D visualizations are obtained using the UMAP algorithm, McInnes et al. (2018). (c) The MMD distance, Gretton et al. (2012), between the reconstructed datasets. Blue: MMD(P^Y_0, P^Y_T), red: MMD(Q^Y, P^Y_T), T = 25K. The transported distribution has smaller distance from the target distribution when d′ = 5, 10, 20, 50, 200.

(a) α = 2 Lipschitz-regularized GPA (b) Score matching and annealed Langevin dynamics

Figure 3: (2D Mixture of Student-t) (f_α, Γ_1)-GPA with α = 2 for a heavy tailed target and comparison with a score based model. 200 target samples from Student-t(ν) with ν = 0.5 are provided to transport 500 particles which are uniformly distributed in the plotted region at time t = 0. Blue: target, Orange: output. (a) The choice of divergence f_α with α = 2 and propagation of particles through the (f_α, Γ_1)-GPA captures the heavy-tailed target. (b) The noise conditional score network (NCSN, score based model, Song et al.) evolves particles by learning the vector field, i.e. the score ∇log Q(x), from data. However, a mixture of disjoint distributions makes it hard to learn the score where the data is sparse. NCSN tackles the problem by injecting different levels of noise into the data and learning the scores of the noise-injected distributions. It then propagates particles through annealed Langevin dynamics using a sequence of scores s_σ with different noise levels σ. When the level of injected noise becomes as small as σ ≤ 1, score matching of the perturbed data for the Student-t was extremely hard. In addition, particle transport through (annealed) Langevin dynamics might not converge to the heavy tailed distribution. A similar comparison for a mixture of Gaussians can be found in Figure 6.

(25) 2. Lipschitz-regularized Weighted Porous Medium equation (WPME). For the α-divergence with $f_\alpha(x) = \frac{1}{\alpha(\alpha-1)}x^\alpha$ we obtain a regularization of the porous medium equation, Otto (2001); Dolbeault et al. (2008).

κ as in (37)

B.4 WEIGHTED POROUS MEDIUM EQUATIONS AND THEIR CONVERGENCE TO EQUILIBRIUM STATE

B.4.1 WEIGHTED POROUS MEDIUM EQUATION

$\in \mathrm{Lip}_L(\mathbb{R}^d)$. For simplicity, from now on we denote by n the convergent subsequence. At this point, we recall that for any $\epsilon\in(0,\epsilon_0)$, $\phi^{*,L}_\epsilon(0) = 0$. For any x, $|\phi^{*,L}_\epsilon(x) - \phi^{*,L}_\epsilon(0)| \le L\|x\|$, which implies that $|\phi^{*,L}_\epsilon(x)| \le L\|x\|$. Thus, by the dominated convergence theorem,
$$F'_+(0) = \lim_{n\to\infty}\int\phi^{*,L}_{\epsilon_n}\,d\rho = \int\phi^{*,L}_0\,d\rho.$$

Since both sides of the inequality coincide, $\phi^{*,L}_0$ must be the optimizer. By Theorem 3.1, part 1, and Theorem C.1, part 2, we have that $\phi^{*,L}_0(x) \le \phi^{*,L}(x)$ for all x. Thus $F'_+(0) = \int\phi^{*,L}_0\,d\rho \le \int\phi^{*,L}\,d\rho$, which concludes the proof.

D DETAILS ON LATENT GENERATIVE PARTICLES

D.1 GENERALIZATION OF ENCODER AND DECODER FUNCTIONS

In Section 5, we built Theorem 5.1 based on an encoder map $E: Y\to Z$ (for instance $Y = \mathbb{R}^d$, $Z\subset\mathbb{R}^{d'}$, $d' \ll d$) and a decoder map $D: Z\to Y$.

Latent Lipschitz regularized generative particles algorithm
Require: f defined in (2) and its Legendre conjugate f*; L: Lipschitz constant; ν: scalar parameter for optimizing the f-divergence; T: number of updates for the particles; γ: time step size; N: number of particles.
Require: $W = \{W_l\}_{l=1}^D$: parameters of the neural network $\phi:\mathbb{R}^{d'}\to\mathbb{R}$; D: depth of the neural network; δ: learning rate of the neural network; $T_{NN}$: number of updates for the neural network.
Require: $E:\mathbb{R}^d\to\mathbb{R}^{d'}$: trained encoder; $D:\mathbb{R}^{d'}\to\mathbb{R}^d$: trained decoder.
Result: $\{Y_T^{(i)}\}_{i=1}^N$

Figure 4: Workflow of different generative models. green: real space, yellow: latent space, blue: parameter space

by the Cauchy-Schwarz inequality.)

Outputs. See Figures 8 and 9 for the transported particles in the latent space for varying d′ = 2, 5, 10, 20, 50, 100, 200, and in the reconstructed space.

Figure 5: (2D Mixture of Gaussians 1) Comparison with RKHS-based generative dynamics. (a) (f_KL, Γ_L)-generative particles algorithm with different values of L. The particles are transported to the 4 wells faster as L gets larger; however, for large L (L ≥ 100) the algorithm becomes unstable. Learning rates are chosen as γ = 1.0, δ = 0.005. (c) The KALE flow can be compared with the (f_KL, Γ_L)-generative particles algorithm as a different regularization technique: the KALE gradient flow regularizes the RKHS norm of φ*, while the (f_KL, Γ_L)-generative particles algorithm regularizes the norm of ∇φ*. The KALE flow, Glaser et al. (2021), fails to capture the 4 wells in a reasonable amount of time. Here a Gaussian kernel with σ = 0.5 is chosen for the RKHS kernel, and the learning rate is 0.001. (d) Bottom: MMD gradient flow, Arbel et al. (2019), without extra noise; a Gaussian kernel with σ = 0.5 is used for the RKHS. Top: for comparison, the KL gradient flow trained with the unadjusted Langevin algorithm (ULA), Durmus & Moulines (2017). The comparison with KL ULA and the MMD flow suggests that the use of regularization enables convergence without further techniques such as adding noise.

Figure 6: (2D Mixture of Gaussians) (f KL , Γ 1 )-GPA and the score based model (Noise conditional score network, NCSN). 200 target samples from Mixture of Gaussians with σ Q = 1.0 are provided to transport 500 particles which are uniformly distributed in the plotted region at time t = 0. Blue: target, Orange: output. (a) The choice of divergence f KL and propagation of particles through the (f KL , Γ 1 )-GPA captures the target. (b) shows learning a mixture of Gaussians is tractable using NCSN. Compare with a heavy-tailed target example in Figure 3.

Figure 7: GPA in the 128-dimensional latent space. 200 samples are generated from 200 and 2000 training data. See FID values in Table 6.

Figure 8: (Gene expression data, BreastCancer) Latent samples. blue: source, red: target, black: transported. (h) The distance between the latent distributions. blue: MMD(P Z 0 , P Z T ), red: MMD(Q Z , P Z T ), black: MMD(P Z 0 , Q Z ) with T = 25, 000.

Figure 9: (Gene expression data, BreastCancer) Reconstructed samples. blue: source, red: target, black: transported. (h) The distance between the reconstructed distributions. blue: MMD(P Y 0 , P Y T ), red: MMD(Q Y , P Y T ), black: MMD(P Y 0 , Q Y ) with T = 25, 000.

It is well-known, Jordan et al. (1998), that the Fokker-Planck equation (FPE) can be viewed as the gradient flow of the KL divergence, $\partial_t p_t = \mathrm{div}\big(p_t\nabla\frac{\delta D_{KL}(p_t\|q)}{\delta p_t}\big)$.

Bonforte et al. (2010); Otto (2001); Dolbeault et al. (2008), used for applications to actual porous medium flow, typically in dimension 3. Here, however, we propose porous medium equations and associated particle algorithms as statistical learning tools for pdfs with heavy tails. For instance, score-based methods are KL-based, see Song et al. and the discussion in Section A, and hence are not suitable for heavy tailed distributions: we refer to the collapse of algorithms that minimize the KL divergence observed in Figure 2 in Birrell et al. (2022a), in stark contrast to the stable behavior of

In Supplementary material, source code, all dependent libraries, and documentation (README.md) are attached. README.md specifies the required open-source libraries and the entire parameters set including random seeds for reproducing individual experiments.


Table 4: Neural network architectures of the discriminator $\phi:\mathbb{R}^d\to\mathbb{R}$.

Table 5: Data sets and important parameters.

(a) Final FID for MNIST: GPA and GANs.
  (f_KL, Γ_1)-GPA: 4571.98, 5143.55
  (f_KL, Γ_1)-GAN: 5603.55, 1270.13
  Wasserstein-GAN: 5653.20, 1879.18

Table 6: Conditional image generation performance summary. See Figures 1 and 7.

