INVERTIBLE NORMALIZING FLOW NEURAL NETWORKS BY JKO SCHEME

Abstract

Normalizing flows are a class of deep generative models for efficient sampling and density estimation. In practice, a flow often appears as a chain of invertible neural network blocks. To facilitate training, past works have regularized flow trajectories and designed special network architectures. The current paper develops a neural ODE flow network inspired by the Jordan-Kinderlehrer-Otto (JKO) scheme, which allows an efficient block-wise training procedure: as the JKO scheme unfolds the dynamics of the gradient flow, the proposed model naturally stacks residual network blocks one by one, reducing the memory load as well as the difficulty of training deep networks. We also develop an adaptive time-reparametrization of the flow network with a progressive refinement of the trajectory in probability space, which improves the optimization efficiency and model accuracy in practice. On high-dimensional generative tasks for tabular data, JKO-iFlow can process larger data batches and performs competitively with or better than continuous and discrete flow models, using 10x fewer iterations (i.e., batches) and significantly less time per iteration.

1. INTRODUCTION

Generative models have been widely studied in statistics and machine learning to infer data-generating distributions and sample from the estimated distributions (Ronquist et al., 2012; Goodfellow et al., 2014; Kingma & Welling, 2014; Johnson & Zhang, 2019). The normalizing flow has recently become a very popular generative framework. In short, a flow-based model learns the data distribution via an invertible mapping F between the data density p_X(X), X ∈ R^d, and the target standard multivariate Gaussian density p_Z(Z), Z ∼ N(0, I_d) (Kobyzev et al., 2020). Benefits of the approach include efficient sampling and explicit likelihood computation. To make flow models practically useful, past works have made great efforts to develop flow models that facilitate training (e.g., in terms of loss objectives and computational techniques) and induce smooth trajectories (Dinh et al., 2017; Grathwohl et al., 2019; Onken et al., 2021). Among flow models, continuous normalizing flow (CNF) transports the data density to that of the target through continuous dynamics (e.g., neural ODE (Chen et al., 2018)). CNF models have shown promising performance on generative tasks (Kobyzev et al., 2020). However, a known computational challenge of CNF models is model regularization, primarily due to the non-uniqueness of the flow transport. To regularize the flow model and guarantee invertibility, Behrmann et al. (2019) adopted spectral normalization of block weights, which incurs additional computation. Meanwhile, Liutkus et al. (2019) proposed the sliced-Wasserstein distance, Finlay et al. (2020); Onken et al. (2021) utilized optimal-transport costs, and Xu et al. (2022) proposed a Wasserstein-2 regularization.
Although regularization is important to maintain invertibility for general-form flow models and improves performance in practice, merely using regularization does not resolve the non-uniqueness of the flow, and variation remains in the trained flow depending on initialization. Besides unresolved challenges in regularization, several practical difficulties remain when training such models. In many settings, flows consist of stacked blocks, each of which can be arbitrarily complex. Training such deep models often places high demand on computational resources, numerical accuracy, and memory consumption. In addition, how to determine the flow depth (e.g., the number of blocks) is also unclear. In this work, we propose JKO-iFlow, a normalizing flow network which unfolds the Wasserstein gradient flow via a neural ODE invertible network, inspired by the JKO scheme (Jordan et al., 1998). The JKO scheme, cf. (5), can be viewed as a proximal step to unfold the Wasserstein gradient flow minimizing the KL divergence (relative entropy) between the current density and the equilibrium. Each block in the flow model implements one step of the JKO scheme and can be trained given the previous blocks. As the JKO scheme pushes forward the density to approximate the solution of the Fokker-Planck equation of a diffusion process with small step size, the trained flow model induces a smooth trajectory of density evolution, as shown in Figure 1. The small-step-size assumption of the theory does not restrict training in practice, where one can use larger step sizes coupled with numerical integration techniques. The proposed JKO-iFlow model can be viewed as trained to learn the unique transport map following the Fokker-Planck equation. Unlike most CNF models, where all the residual blocks are initialized together and trained end-to-end, the proposed model allows block-wise training, which reduces memory and computational load.
We further introduce time reparametrization with progressive refinement in computing the flow network, where each block corresponds to a point on the density evolution trajectory in the space of probability measures. Algorithmically, one can thus determine the number of blocks adaptively and refine the trajectory determined by existing blocks. Empirically, such procedures yield performance competitive with other CNF models at significantly less computation. The JKO-iFlow approach proposed in this work also suggests a potential constructive approximation analysis of deep flow models. Method-wise, the proposed model differs from other recent JKO deep models; we refer to Section 1.1 for more details. In summary, the contributions include:
• We propose a neural ODE model where each residual block computes a JKO step and the training objective can be computed by integrating the ODE on data samples. The network has a general form, and invertibility is satisfied due to the regularity of the optimal pushforward map that minimizes the objective in each JKO step.
• We develop a block-wise procedure to train the invertible JKO-iFlow network, which determines the number of blocks adaptively. We also propose a technique to reparametrize and refine an existing JKO-iFlow probability trajectory, which removes unnecessary blocks and increases the overall accuracy.
• Experiment-wise, JKO-iFlow greatly reduces memory consumption and the amount of computation, with performance competitive with or better than several existing continuous and discrete flow models.

1.1. RELATED WORKS

For deep generative models, popular approaches include generative adversarial networks (GAN) (Goodfellow et al., 2014; Gulrajani et al., 2017; Isola et al., 2017) and variational auto-encoders (VAE) (Kingma & Welling, 2014; 2019). Apart from known training difficulties (e.g., mode collapse (Salimans et al., 2016) and posterior collapse (Lucas et al., 2019)), these models do not provide likelihood or inference of data density. The normalizing flow framework (Kobyzev et al., 2020) has been extensively developed, including continuous flow (Grathwohl et al., 2019), Monge-Ampère flow (Zhang et al., 2018), discrete flow (Chen et al., 2019), graph flow (Liu et al., 2019), etc. Efforts have been made to develop novel invertible mapping structures (Dinh et al., 2017; Papamakarios et al., 2017), regularize the flow trajectories (Finlay et al., 2020; Onken et al., 2021), and extend the use to non-Euclidean data (Mathieu & Nickel, 2020; Xu et al., 2022). Despite such efforts, the modeling and computational challenges of normalizing flows include regularization, the large model size when using a large number of residual blocks (which cannot be determined a priori), and the associated memory and computational load. In parallel to continuous normalizing flows, which are neural ODE models, neural SDE models have become an emerging tool for generative tasks. Diffusion processes and Langevin dynamics in deep generative models have been studied in score-based generative models (Song & Ermon, 2019; Ho et al., 2020; Block et al., 2020; Song et al., 2021) under a different setting. Specifically, these models estimate the score function (i.e., the gradient of the log probability density with respect to data) of the data distribution via neural network parametrization, which may encounter challenges in learning and sampling of high-dimensional data and call for special techniques (Song & Ermon, 2019). The recent work of Song et al.
(2021) developed reverse-time SDE sampling for score-based generative models and adopted the connection to neural ODEs to compute the likelihood; using the same idea of backward SDE, Zhang & Chen (2021) proposed joint training of forward and backward neural SDEs. Theoretically, latent diffusion (Tzen & Raginsky, 2019b;a) was used to analyze neural SDE models. The current work focuses on a neural ODE model where the deterministic vector field f(x, t) is learned following a JKO scheme of the Fokker-Planck equation. Unlike neural SDE approaches, our approach involves no sampling of SDE trajectories nor learning of the score function. Our obtained residual network is also invertible, which cannot be achieved by the diffusion models above. We experimentally obtain competitive or improved performance on simulated and high-dimensional tabular data. JKO-inspired deep models have been studied in several recent works. Bunne et al. (2022) reformulated the JKO step as minimizing an energy function over convex functions. The JKO scheme has also been used to discretize the Wasserstein gradient flow to learn a deep generative model in (Alvarez-Melis et al., 2021; Mokrov et al., 2021), which adopted input convex neural networks (ICNN) (Amos et al., 2017). ICNN, as a special type of network architecture, may have limited expressiveness (Rout et al., 2022; Korotin et al., 2021). In addition to using gradients of ICNN, Fan et al. (2021) proposed to parametrize the transport in a JKO step by a residual network but identified difficulty in calculating the push-forward distribution. The approach in Fan et al. (2021) also relies on a variational formulation which requires training an additional network, similar to the discriminator in GAN, using inner loops.
In contrast, our method trains an invertible neural ODE flow network, which enables the flow from data density to normal and backward, as well as the computation of the transported density by integrating the divergence of the velocity field along ODE solutions. The objective in the JKO step of minimizing KL divergence can also be computed directly without any inner-loop training, cf. Section 4. For the expressiveness of deep generative models, universal approximation properties of deep neural networks for representing probability distributions have been developed in several works. Lee et al. (2017) established approximation by compositions of Barron functions (Barron, 1993); Bailey & Telgarsky (2018) developed a space-filling approach, which was generalized in Perekrestenko et al. (2020; 2021); Lu & Lu (2020) constructed a deep ReLU network with guaranteed approximation under integral probability metrics, using techniques of empirical measures and optimal transport. These results show that deep neural networks can provably transport one source distribution to a target one with sufficient model capacity, under certain regularity conditions on the pair of densities. In our proposed flow model, each residual block is trained to approximate the vector field f(x, t) that induces the Fokker-Planck equation, cf. Section 3.2. Our model potentially leads to a constructive approximation analysis of neural ODE flow models to generate the data density p_X.

2. PRELIMINARIES

Normalizing flow. A normalizing flow can be mathematically expressed via a density evolution equation of ρ(x, t) such that ρ(x, 0) = p_X and, as t increases, ρ(x, t) approaches p_Z ∼ N(0, I_d) (Tabak & Vanden-Eijnden, 2010). Given an initial distribution ρ(x, 0), such a flow typically is not unique. We consider flows induced by an ODE of x(t) in R^d,

ẋ(t) = f(x(t), t), x(0) ∼ p_X. (1)

The marginal density of x(t) is denoted as p(x, t), and it evolves according to the continuity equation (Liouville equation) of (1), written as

∂_t p + ∇·(pf) = 0, p(x, 0) = p_X(x). (2)

Ornstein-Uhlenbeck (OU) process. Consider a Langevin dynamic denoted by the SDE dX_t = −∇V(X_t)dt + √2 dW_t, where V is the potential of the equilibrium density. We focus on the case of normal equilibrium, that is, V(x) = |x|²/2, and then p_Z ∝ e^{−V}. In this case the process is known as the (multivariate) OU process. Suppose X_0 ∼ p_X, and let the density of X_t be ρ(x, t), also denoted as ρ_t(·). The Fokker-Planck equation describes the evolution of ρ_t towards the equilibrium p_Z as

∂_t ρ = ∇·(ρ∇V + ∇ρ), V(x) := |x|²/2, ρ(x, 0) = p_X(x). (3)

Under generic conditions, ρ_t converges to p_Z exponentially fast. For the Wasserstein-2 distance and the standard normal p_Z, a classical argument gives (take C = 1 in Eqn. (6) of Bolley et al. (2012))

W_2(ρ_t, p_Z) ≤ e^{−t} W_2(ρ_0, p_Z), t > 0. (4)

JKO scheme. The seminal work of Jordan et al. (1998) established a time discretization scheme of the solution to (3) by the gradient flow minimizing KL(ρ‖p_Z) under the Wasserstein-2 metric in probability space. Denote by P the space of all probability densities on R^d with finite second moment. The JKO scheme at the k-th step with step size h > 0, starting from ρ^(0) = ρ_0 ∈ P, is written as

ρ^(k+1) = argmin_{ρ∈P} F[ρ] + (1/2h) W_2²(ρ^(k), ρ), F[ρ] := KL(ρ‖p_Z). (5)

It was proved in Jordan et al. (1998) that as h → 0, ρ^(k) converges to the solution ρ(·, kh) of (3) for all k, and the convergence ρ_h(·, t) → ρ(·, t) holds strongly in L¹(R^d × (0, T)) for finite T, where ρ_h is the piecewise-constant interpolation on (0, T) of ρ^(k).
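The contraction (4) can be verified concretely in one dimension, where both the OU marginals and the Wasserstein-2 distance between Gaussians have closed forms. A minimal sketch (function names are ours, not from the paper):

```python
import math

# For 1D Gaussians, W2(N(m1, s1^2), N(m2, s2^2)) = sqrt((m1-m2)^2 + (s1-s2)^2).
def w2_gauss(m1, s1, m2, s2):
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# OU process dX = -X dt + sqrt(2) dW started at N(m0, s0^2) stays Gaussian:
# mean m_t = m0 * e^{-t}, variance s_t^2 = 1 + (s0^2 - 1) * e^{-2t}.
def ou_marginal(m0, s0, t):
    m_t = m0 * math.exp(-t)
    var_t = 1.0 + (s0 ** 2 - 1.0) * math.exp(-2.0 * t)
    return m_t, math.sqrt(var_t)

m0, s0 = 3.0, 2.0
for t in [0.5, 1.0, 2.0]:
    m_t, s_t = ou_marginal(m0, s0, t)
    lhs = w2_gauss(m_t, s_t, 0.0, 1.0)               # W2(rho_t, p_Z)
    rhs = math.exp(-t) * w2_gauss(m0, s0, 0.0, 1.0)  # e^{-t} W2(rho_0, p_Z)
    assert lhs <= rhs + 1e-12                        # contraction (4) holds
```

The bound is tight at t = 0 and strict for t > 0, consistent with the exponential convergence stated above.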

3. JKO SCHEME BY NEURAL ODE

Given i.i.d. observed data samples X_i ∈ R^d, i = 1, ..., N, drawn from some unknown density p_X, the goal is to train an invertible neural network that transports the density p_X to an a priori specified density p_Z in R^d, where each data sample X_i is mapped to a code Z_i. A prototypical choice of p_Z is the standard multivariate Gaussian N(0, I_d). By a slight abuse of notation, we denote by p_X and p_Z both the distributions and the density functions of data X and code Z, respectively.

3.1. THE OBJECTIVE OF JKO STEP

We are to specify f(x, t) in the ODE (1), parametrized and learned by a neural ODE, such that the induced density evolution of p(x, t) converges to p_Z as t increases. We start by dividing the time horizon [0, T] into finite subintervals with step size h; let t_k = kh and I_{k+1} := [t_k, t_{k+1}). Define p_k(x) := p(x, kh), namely the density of x(t) at t = kh. The solution of (1) determined by the vector field f(x, t) on t ∈ I_{k+1} (assuming the ODE is well-posed (Sideris, 2013)) gives a one-to-one mapping T_{k+1} on R^d, s.t. T_{k+1}(x(t_k)) = x(t_{k+1}), and T_{k+1} transports p_k into p_{k+1}, i.e., (T_{k+1})_# p_k = p_{k+1}, where we denote by T_# p the push-forward of distribution p by T, such that (T_# p)(·) = p(T^{−1}(·)). Suppose we can find f(·, t) on I_{k+1} such that the corresponding T_{k+1} solves the JKO scheme (5); then with small h, p_k approximates the solution to the Fokker-Planck equation (3), which flows towards p_Z. By the Monge formulation of the Wasserstein-2 distance between p and q,

W_2²(p, q) = min_{T: T_# p = q} E_{x∼p} ‖x − T(x)‖²,

solving for the transported density p_{k+1} by (5) is equivalent to solving for the transport T_{k+1} by

T_{k+1} = argmin_{T: R^d → R^d} F[T] + (1/2h) E_{x∼p_k} ‖x − T(x)‖², F[T] = KL(T_# p_k ‖ p_Z). (6)

The equivalence between (5) and (6) is proved in Lemma A.1. Furthermore, the following proposition gives that the value of F[T] can be computed from f(x, t) on t ∈ I_{k+1} once p_k is determined by f(x, t) for t ≤ t_k. The counterpart for convex-function-based parametrization of T_k was given in Theorem 1 of Mokrov et al. (2021), where the computation using the change of variable differs, as we adopt an invertible neural ODE approach here. The proof is left to Appendix A.

Proposition 3.1. Given p_k, up to a constant c independent of f(x, t) on t ∈ I_{k+1},

KL(T_# p_k ‖ p_Z) = E_{x(t_k)∼p_k} ( V(x(t_{k+1})) − ∫_{t_k}^{t_{k+1}} ∇·f(x(s), s) ds ) + c. (7)
By Proposition 3.1, the minimization (6) is equivalent to

min_{ {f(x,t)}_{t∈I_{k+1}} } E_{x(t_k)∼p_k} ( V(x(t_{k+1})) − ∫_{t_k}^{t_{k+1}} ∇·f(x(s), s) ds + (1/2h) ‖x(t_{k+1}) − x(t_k)‖² ), (8)

where x(t_{k+1}) = x(t_k) + ∫_{t_k}^{t_{k+1}} f(x(s), s) ds. Taking a neural ODE approach, we parametrize {f(x, t)}_{t∈I_{k+1}} as a residual block with parameter θ_{k+1}, and then (8) reduces to minimizing over θ_{k+1}. This leads to the block-wise learning algorithm introduced in Section 4.
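To make the per-block objective concrete, the following sketch evaluates (8) on samples for a single forward-Euler step with a hypothetical linear field f(x) = Ax + b, for which the divergence is simply tr(A); in the actual model the linear field is replaced by a neural network block and a numerical integrator:

```python
import numpy as np

rng = np.random.default_rng(0)

def jko_step_loss(A, b, x, h):
    """Monte-Carlo estimate of the JKO objective (8) for one residual step
    x -> x + h*f(x) with an illustrative linear field f(x) = A x + b.
    With one forward-Euler step, the divergence integral is h * tr(A)."""
    f = x @ A.T + b
    x_next = x + h * f
    V = 0.5 * np.sum(x_next ** 2, axis=1)        # V(x) = |x|^2 / 2
    div_int = h * np.trace(A)                    # integral of div f over the step
    w2 = 0.5 / h * np.sum((x_next - x) ** 2, axis=1)
    return np.mean(V - div_int + w2)

# Samples from the current density p_k (here a shifted Gaussian).
x = rng.normal(loc=2.0, scale=1.0, size=(4000, 2))
h = 0.1

# A field pushing mass toward the origin should score better than no motion.
A_good, b0 = -0.5 * np.eye(2), np.zeros(2)
A_zero = np.zeros((2, 2))
assert jko_step_loss(A_good, b0, x, h) < jko_step_loss(A_zero, b0, x, h)
```

The W_2 term penalizes large per-step displacement, so minimizing (8) over θ_{k+1} trades off moving toward p_Z against moving too far in one step.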

3.2. INFINITESIMAL OPTIMAL f (x, t)

In each JKO step of (8), let p = p_k denote the current density and q = p_Z the target equilibrium density. In this subsection, we show that the optimal f in (8) with small h reveals the difference between the score functions of the target and current densities. Thus minimizing the objective (8) searches for a neural network parametrization of the score function ∇ log ρ_t without the denoising score matching used in diffusion-based models (Ho et al., 2020; Song et al., 2021). Consider a general equilibrium distribution q with a differentiable potential V. To analyze the optimal pushforward mapping in the small-h limit, we shift the time interval [kh, (k+1)h] to [0, h] to simplify notation. Then (8) reduces to

min_{ {f(x,t)}_{t∈[0,h)} } E_{x(0)∼p} ( V(x(h)) − ∫_0^h ∇·f(x(s), s) ds + (1/2h) ‖x(h) − x(0)‖² ), (9)

where x(h) = x(0) + ∫_0^h f(x(s), s) ds. In the limit h → 0+, formally, x(h) − x(0) = h f(x(0), 0) + O(h²), and supposing V of q is C², V(x(h)) = V(x(0)) + h ∇V(x(0)) · f(x(0), 0) + O(h²). For any differentiable density ρ, the (Stein) score function is defined as s_ρ = ∇ log ρ, and we have ∇V = −s_q. Taking the formal expansion in orders of h, the objective in (9) is written as

E_{x∼p} ( V(x) + h ( −s_q(x) · f(x, 0) − ∇·f(x, 0) + (1/2) ‖f(x, 0)‖² ) + O(h²) ). (10)

Note that E_{x∼p} V(x) is independent of f(x, t), and the O(h)-order term in (10) involves f(x, 0) only; thus the minimization of the leading term is equivalent to

min_{f(·) = f(·,0)} E_{x∼p} ( −T_q f + (1/2) ‖f‖² ), T_q f := s_q · f + ∇·f, (11)

where T_q is known as the Stein operator (Stein, 1972). The term T_q f in (11) echoes that the derivative of the KL divergence with respect to the transport map gives the Stein operator (Liu & Wang, 2016), and the Wasserstein-2 regularization gives an L² regularization in (11). Let L²(p) be the L² space on (R^d, p(x)dx); for a vector field v on R^d, v ∈ L²(p) if ∫ |v(x)|² p(x) dx < ∞. One can verify that, when both s_p and s_q are in L²(p), the minimizer of (11) is f*(·, 0) = s_q − s_p. This shows that the infinitesimal optimal f(x, t) equals the difference of the score functions of the equilibrium and the current density.
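The claim that f* = s_q − s_p minimizes (11) can be checked numerically in one dimension with Gaussian p and q, where both score functions are linear. A sketch with Monte-Carlo expectations (names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma = 1.5, 0.8
x = rng.normal(m, sigma, size=200000)   # samples from p = N(m, sigma^2)

s_q = lambda t: -t                      # score of q = N(0, 1)
s_p = lambda t: -(t - m) / sigma ** 2   # score of p = N(m, sigma^2)

def stein_objective(a, b):
    """E_p[-(s_q f + div f) + 0.5 f^2] for the linear field f(x) = a x + b,
    whose divergence is the constant a."""
    f = a * x + b
    return np.mean(-(s_q(x) * f + a) + 0.5 * f ** 2)

# Coefficients of the claimed minimizer f* = s_q - s_p (itself linear here).
a_star = -1.0 + 1.0 / sigma ** 2
b_star = -m / sigma ** 2
best = stein_objective(a_star, b_star)
for da, db in [(0.3, 0.0), (-0.3, 0.0), (0.0, 0.5), (0.3, -0.3)]:
    assert best < stein_objective(a_star + da, b_star + db)
```

By the Stein identity E_p[s_p f + ∇·f] = 0, the objective equals (1/2)‖f − f*‖²_{L²(p)} up to a constant, so f* is the unique minimizer; the perturbation test above reflects exactly this strong convexity.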

3.3. INVERTIBILITY OF FLOW MODEL AND EXPRESSIVENESS

At time t the current density of x(t) is ρ_t, and the analysis in Section 3.2 implies that the optimal vector field f(x, t) has the expression

f(x, t) = s_q − s_{ρ_t} = −∇V − ∇ log ρ_t. (12)

With this f(x, t), the Liouville equation (2) coincides with the Fokker-Planck equation (3). This is consistent with the fact that the JKO scheme with small h recovers the solution to the Fokker-Planck equation. Under proper regularity conditions on V and the initial density ρ_0, the r.h.s. of (12) is also regular over space and time. This leads to two consequences, in approximation and in learning. Approximation-wise, the regularity of f(x, t) allows constructing a k-th residual block in the flow network to approximate {f(x, t)}_{t∈I_k} when there is sufficient model capacity, by classical universal approximation theory of shallow networks (Barron, 1993; Yarotsky, 2017). The JKO-iFlow model proposed in this work thus suggests a constructive proof of the expressiveness of the invertible neural ODE model to generate any sufficiently regular density p_X, which we further discuss in the last section. For learning, when properly trained with sufficient data, the neural ODE vector field f(x, t; θ_k) will learn to approximate (12). This can be viewed as inferring the score function of ρ_t, and it also leads to invertibility of the trained flow net in theory: if the trained f(x, t; θ_k) is close enough to (12), it will also have a bounded Lipschitz constant. Then the residual block is invertible as long as the step size h is sufficiently small, e.g., less than 1/L, where L is the Lipschitz bound of f(x, t; θ_k). In practice, we typically use a smaller h than needed merely for invertibility (as allowed by the model budget), so that the flow network can more closely track the Fokker-Planck equation of the diffusion process. The invertibility of the proposed model is numerically verified in experiments (see Table 1).
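The invertibility argument above is the standard one for residual maps: when h · Lip(f) < 1, the block x ↦ x + h f(x) can be inverted by fixed-point iteration. A minimal sketch with an illustrative Lipschitz field (not the trained network):

```python
import numpy as np

def invert_residual(f, y, h, n_iter=100):
    """Invert x + h*f(x) = y by the fixed-point iteration x <- y - h*f(x),
    which is a contraction whenever h * Lip(f) < 1 (Banach fixed point)."""
    x = y.copy()
    for _ in range(n_iter):
        x = y - h * f(x)
    return x

f = np.tanh                       # Lip(tanh) = 1
h = 0.5                           # h * Lip(f) = 0.5 < 1
x_true = np.array([0.3, -1.2, 2.0])
y = x_true + h * f(x_true)        # forward pass of the residual block
x_rec = invert_residual(f, y, h)  # backward pass recovers the input
assert np.max(np.abs(x_rec - x_true)) < 1e-8
```

The iteration error contracts by a factor h · Lip(f) per step, so a modest number of iterations recovers the input to machine precision.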

4. TRAINING OF JKO-IFLOW

4.1. BLOCK-WISE TRAINING

Note that the training of the (k+1)-th block in (8) can be conducted once the previous k blocks are trained. Specifically, with finite training data {X_i = x_i(0)}_{i=1}^n, the expectation E_{x(t_k)∼p_k} in (8) is replaced by the sample average over {x_i(kh)}_{i=1}^n, which can be computed from the previous k blocks. Note that for each given x(t_k), both x(t_{k+1}) and the integral of ∇·f in (8) can be computed by a numerical neural ODE integrator. Following previous works, we use the Hutchinson trace estimator (Hutchinson, 1989; Grathwohl et al., 2019) to estimate ∇·f in high dimensions. Applying the numerical integrator in computing (8), we denote the resulting k-th residual block abstractly as f_{θ_k} with trainable parameters θ_k. We terminate training more blocks once the per-dimension loss falls below ϵ (line 6 of Algorithm 1). Lastly, the heuristic approach in line 5 of training a "free block" (i.e., a block without the W_2 loss) is to flow the push-forward density p_L closer to p_Z, where the former is obtained through the first L blocks and the latter denotes the Gaussian density at equilibrium. Note that Algorithm 1 significantly reduces memory and computational complexity: only one block is trained when optimizing (8), regardless of flow depth. Therefore, one can use larger data batches and a more refined numerical integrator without memory explosion. In addition, one can train each block for a fixed number of epochs using either back-propagation or the neural ODE integrator (Grathwohl et al., 2019, adjoint method). We found that direct back-propagation enables faster training but may also lead to greater numerical errors and memory consumption. Despite the greater inaccuracies, we observed similar empirical performance across both methods for JKO-iFlow, possibly because block-wise training accumulates fewer errors than a generic flow model composed of multiple blocks.
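The Hutchinson estimator mentioned above replaces the exact divergence tr(∂f/∂x) with randomized matrix-vector products. A minimal sketch on an explicit matrix (in the flow network, the matvec would be a vector-Jacobian product of the block):

```python
import numpy as np

rng = np.random.default_rng(2)

def hutchinson_trace(matvec, d, n_probes=20000):
    """Estimate tr(A) from matrix-vector products only: tr(A) = E[v^T A v]
    for Rademacher probes v, so the full Jacobian is never formed."""
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=d)
        total += v @ matvec(v)
    return total / n_probes

A = rng.normal(size=(10, 10))
est = hutchinson_trace(lambda v: A @ v, 10)
assert abs(est - np.trace(A)) < 0.5   # unbiased, variance O(1/n_probes)
```

In practice one or a few probes per sample per step suffice, since the Monte-Carlo noise averages out over the training batch.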

4.2. IMPROVED COMPUTATION OF TRAJECTORIES IN PROBABILITY SPACE

We adopt two additional computational techniques to facilitate learning of the trajectories in probability space, represented by the sequence of densities p_k, k = 1, ..., K, associated with the K residual blocks of the proposed normalizing flow network. The two techniques are illustrated in Figure 2; additional details can be found in Appendix B.
• Trajectory reparametrization. We empirically observe fast decay of the movements W_2²(T_# p_k, p_k); in other words, the initial blocks transport the densities much further than the later ones do. This is especially unwanted because, in order to train the current block, the flow model needs to transport data through all previous blocks, yet the current block barely contributes to the density transport. Hence, instead of setting t_k := kh with fixed increments per block, we reparametrize the values of t_k through an adaptive procedure based entirely on the W_2 distance at each block and the averaged W_2 distance over all blocks.
• Progressive refinement. To improve the probability trajectory obtained by the trained residual blocks, we propose a refinement technique that trains additional residual blocks based on the time steps t_k obtained after reparametrization. In practice, refinement is useful when the time increment t_{k+1} − t_k for certain blocks is too large; in those cases, numerical inaccuracies may arise as the loss (8) is computed over a longer time horizon. More precisely, we increase the number of JKO steps parametrized by residual blocks: in practice, we train C additional "intermediate" blocks for density transport between p_k and p_{k+1} at each k.
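An illustrative sketch of the reparametrization update described above (not the exact Algorithm 2 in Appendix B; names and default values are ours): each block's step size is rescaled toward equalizing per-block movement, blended with the old step by an inertia parameter η and capped at h_max.

```python
def reparametrize_steps(h_old, moves, eta=0.5, h_max=1.0):
    """Rescale per-block step sizes so W2 movements equalize: a block that
    moved more than average gets a smaller step and vice versa, blended
    with the old step by inertia eta and capped at h_max."""
    s_bar = sum(moves) / len(moves)            # average movement over blocks
    h_new = []
    for h, s in zip(h_old, moves):
        target = h * s_bar / s                 # equalize arc length ~ h * speed
        h_new.append(min(h_max, eta * h + (1 - eta) * target))
    return h_new

# Early blocks move far, later blocks barely move (the decaying pattern).
h_old = [0.25, 0.25, 0.25, 0.25]
moves = [2.0, 1.0, 0.5, 0.1]
h_new = reparametrize_steps(h_old, moves)
# Steps shrink where movement was large and grow where it was small.
assert h_new[0] < h_old[0] and h_new[-1] > h_old[-1]
```

After reparametrization, refinement would insert C intermediate blocks on the sub-intervals of any step that remains too long.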

[Figure 2: illustration of the probability trajectory movement (W_2 per block, f_1, ..., f_6) under reparametrization and refinement.]

5. EXPERIMENT

We first evaluate generative performance on two-dimensional simulated data. We then perform unconditional and conditional generation on high-dimensional real tabular data. We also show JKO-iFlow's generative performance on MNIST. Additional details are in Appendix C.

Table 1: Inversion error E_{x∼p_X} ‖T_θ^{−1}(T_θ(x)) − x‖_2 of JKO-iFlow computed from the sample average on test data, where T_θ denotes the transport mapping over all the blocks of the trained flow network.

Datasets: GAS, MINIBOONE, BSD300, Rose, Fractal tree, Olympic rings, Checkerboard.
Inversion errors: 1.48e-5, 1.58e-6, 1.09e-6, 1.53e-5, 3.30e-6, 3.58e-5, 2.24e-6, 3.07e-5.

Table 2: Numerical metrics on high-dimensional real datasets. All competitors are trained with 10 times more iterations (i.e., batches), because their performance under the same number of iterations is not comparable to JKO-iFlow's. Complete results are shown in Table A.1.

We report two metrics. The first is the number of iterations; for JKO-iFlow it is measured as the sum of iterations over all blocks. Using this metric allows us to compare performance across models under a fixed-budget framework in terms of the batches available to the model. The second is the maximum mean discrepancy (MMD) (Gretton et al., 2012; Onken et al., 2021), which measures the difference between two distributions based on samples. Additional details for MMD appear in Appendix C.4. We also report negative log-likelihood as an additional metric in Table A.1.

Conditional generation. Due to the increasing need for conditional generation, we also apply JKO-iFlow to generate samples from the conditional distribution X|Y. Most existing conditional generative methods treat Y as an additional input of the generator, leading to potential training difficulties. Instead, we follow the IGNN approach (Xu et al., 2022), which incurs minimal changes to our training. Additional details are in Appendix C.5.

MNIST. We illustrate the generative quality of JKO-iFlow using an autoencoder. Consider a pre-trained encoder Enc : R^784 → R^d and decoder Dec : R^d → R^784 such that Dec(Enc(X)) ≈ X for a flattened image X. We choose d = 16. The encoder (resp. decoder) uses one fully-connected layer followed by the ReLU (resp. sigmoid) activation. Then JKO-iFlow is trained on the N encoded images {Enc(X_i)}_{i=1}^N, and the trained model gives an invertible transport mapping (over all residual blocks) T_θ : R^d → R^d.
The images are generated by sampling noise Z ∼ N(0, I_d) and passing it through the backward flow followed by the decoder, namely Dec(T_θ^{−1}(Z)). The generated images are shown in Figure 6.
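The MMD metric referenced above has a simple sample-based form. A minimal sketch with a Gaussian kernel (the bandwidth choice here is ours, not the paper's evaluation protocol):

```python
import numpy as np

rng = np.random.default_rng(3)

def mmd2(X, Y, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets X and Y under the
    Gaussian kernel k(x, y) = exp(-|x - y|^2 / (2 * bandwidth^2))."""
    def k(A, B):
        d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

X = rng.normal(0.0, 1.0, size=(500, 2))
Y_close = rng.normal(0.0, 1.0, size=(500, 2))   # same distribution
Y_far = rng.normal(3.0, 1.0, size=(500, 2))     # shifted distribution
assert mmd2(X, Y_close) < mmd2(X, Y_far)
```

In evaluation, one distribution is the generated sample set and the other is held-out test data; smaller MMD indicates better distribution match.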

6. DISCUSSION

The work can be extended in several directions. Application to larger-scale image datasets, by adopting convolutional layers, would further verify the usefulness of the proposed method. Applications to generative tasks on graph data, by incorporating graph neural network layers in the JKO-iFlow model, are also of interest. This includes conditional generative tasks, for which first results on toy data are shown in this work. On the methodology side, the time-continuity of the parametrization of the residual blocks (a consequence of the smoothness of the Fokker-Planck flow) has not been exploited in this work, and may further improve model capacity as well as learning efficiency. Theoretically, the expressiveness of the flow model to generate any regular data distribution can be analyzed based on Section 3.3. To sketch a road map, a block-wise approximation guarantee of f(x, t) as in (12) can lead to approximation of the Fokker-Planck flow (3), which pushes forward the density to be ϵ-close to normality in T = log(1/ϵ) time, cf. (4). Reversing the time of the ODE then leads to an approximation of the initial density ρ_0 = p_X by flowing backward in time from T to zero. Further analysis under technical assumptions is left to future work.

A. PROOFS

A.1. PROOFS IN SECTION 3

Lemma A.1. Suppose p and q are two densities on R^d in P. The following two problems,

min_{ρ∈P} L_ρ[ρ] = KL(ρ‖q) + (1/2h) W_2²(p, ρ), (13)

min_{T: R^d → R^d} L_T[T] = KL(T_# p‖q) + (1/2h) E_{x∼p} ‖x − T(x)‖², (14)

have the same minimum, and (a) if T* : R^d → R^d is a minimizer of (14), then ρ* = (T*)_# p is a minimizer of (13); (b) if ρ* is a minimizer of (13), then the optimal transport from p to ρ* minimizes (14).

Proof of Lemma A.1. Let the minimum of (14) be L*_T, and that of (13) be L*_ρ.

Proof of (a): Suppose L_T achieves its minimum at T*; then T* is the optimal transport from p to ρ* = (T*)_# p, because otherwise L_T could be further improved. By definition of L_ρ, we have L*_T = L_T[T*] = L_ρ[ρ*] ≥ L*_ρ. We claim that L*_T = L*_ρ. Otherwise, there is another ρ′ such that L_ρ[ρ′] < L*_T. Let T′ be the optimal transport from p to ρ′; then L_T[T′] = L_ρ[ρ′] < L*_T, contradicting that L*_T is the minimum of L_T. This also shows that L_ρ[ρ*] = L*_T = L*_ρ, that is, ρ* is a minimizer of L_ρ.

Proof of (b): Suppose L_ρ achieves its minimum at ρ*. Let T* be the optimal transport from p to ρ*; then E_{x∼p} |x − T*(x)|² = W_2(p, ρ*)², and then L_T[T*] = L_ρ[ρ*] = L*_ρ, which equals L*_T as proved in (a). This shows that T* is a minimizer of L_T.

Proof of Proposition 3.1. Given p_k being the density of x(t) at t = kh, recall that T is the solution map from x(t) to x(t + h). We denote ρ_t := p_k and ρ_{t+h} := T_# p_k. By definition,

KL(T_# p_k ‖ p_Z) = E_{x∼ρ_{t+h}} ( log ρ_{t+h}(x) − log p_Z(x) ). (15)

Because p_Z ∝ e^{−V}, V(x) = |x|²/2, we have log p_Z(x) = −V(x) + c_1 for some constant c_1. Thus

E_{x∼ρ_{t+h}} log p_Z(x) = E_{x(t)∼ρ_t} log p_Z(x(t + h)) = c_1 − E_{x(t)∼ρ_t} V(x(t + h)). (16)
To compute the first term in (15), note that

E_{x∼ρ_{t+h}} log ρ_{t+h}(x) = E_{x(t)∼ρ_t} log ρ_{t+h}(x(t + h)), (17)

and by the expression (called the "instantaneous change-of-variable formula" in the normalizing flow literature (Chen et al., 2018), which we derive directly below)

(d/dt) log ρ(x(t), t) = −∇·f(x(t), t), (18)

we have that for each value of x(t),

log ρ_{t+h}(x(t + h)) = log ρ(x(t + h), t + h) = log ρ(x(t), t) − ∫_t^{t+h} ∇·f(x(s), s) ds.

Inserting back into (17), we have

E_{x∼ρ_{t+h}} log ρ_{t+h}(x) = E_{x(t)∼ρ_t} log ρ_t(x(t)) − E_{x(t)∼ρ_t} ∫_t^{t+h} ∇·f(x(s), s) ds.

The first term is determined by ρ_t = p_k and is thus a constant c_2 independent of f(x, t) on t ∈ [kh, (k+1)h]. Together with (16), we have shown that

r.h.s. of (15) = c_2 − E_{x(t)∼ρ_t} ∫_t^{t+h} ∇·f(x(s), s) ds − c_1 + E_{x(t)∼ρ_t} V(x(t + h)),

which proves (7).

Derivation of (18): by the chain rule,

(d/dt) log ρ(x(t), t) = ( ∇ρ(x(t), t) · ẋ(t) + ∂_t ρ(x(t), t) ) / ρ(x(t), t) = ( ∇ρ · f − ∇·(ρf) ) / ρ |_{(x(t), t)} (by (1) and (2)) = −∇·f(x(t), t).

B. TECHNICAL DETAILS OF SECTION 4.2

Although the layer-wise training formulation in Section 4.1 enjoys the aforementioned benefits, there exist undesirable movement patterns along the trajectory. Empirically, the movement by initial blocks f_{θ_k} is much larger than by later ones. The blue curve labeled "Phase 1" in Figure A.2a visualizes one typical pattern of the movement measured by W_2 distances. In fact, this phenomenon is not specific to training flow networks by the JKO scheme. It essentially arises from the smaller gradient magnitude at later estimates, which gradually approach a local minimum during optimization. In particular, such irregular movement also appears in gradient descent in vector space. We thus propose a reparametrize-and-refine technique.

B.1 VECTOR-SPACE CASE

We first motivate our method with optimization in vector space. Suppose our goal is to find a local minimum x* of F(x) for a nonlinear differentiable function F: R^d → R. Starting at x^(0), consider the following sequential optimization problem, where x^(t) denotes the estimate at the t-th iteration and h_t is a pre-specified regularization parameter:

  x^(t+1) = argmin_x F(x) + (1/(2h_t)) ||x − x^(t)||_2^2.

Using the first-order Taylor expansion F(x) ≈ F(x^(t)) + ∇F(x^(t))^T (x − x^(t)) at x^(t), we get

  x^(t+1) = x^(t) − h_t g_t,  g_t := ∇_x F(x^(t)).

Define the arc length of the iterates S_t := ||x^(t+1) − x^(t)||_2 = h_t ||g_t||_2. In practice, the magnitude of S_t approaches zero as x^(t) → x*; this is typical, since the gradient becomes small as the estimates approach the local minimum. We thus propose Algorithm 2 to resolve this uneven arc-length issue; it takes as input the iterates x^(t,old) and step sizes h_t^old from the previous trajectory.

We first motivate and explain the reparametrization step in line 3. Mathematically, we want the arc lengths defined by the re-optimized iterates x^(t,new) to satisfy S_t^new ≈ S for a common target arc length S, which suggests h_t^new ≈ S / ||g_t^new||_2. The quantity ||g_t^new||_2 is unknown before re-optimization takes place, so we approximate it by ||g_t^old||_2 = S_t^old / h_t^old, giving h_t^new ≈ S h_t^old / S_t^old. In practice, using the quantity S h_t^old / S_t^old alone to update h_t can be undesirable, because larger h_t^new tend to cause non-smooth trajectories and inaccurate final estimates. We thus introduce inertia controlled by a parameter η and upper bound the largest h_t by h_max to allow more flexibility.

We now explain the refinement step in line 4. We interpolate C ≥ 0 intermediate points between each pair (x^(t,new), x^(t+1,new)). For instance, if C = 1, we optimize for the "mid-point" x^(t+1/2,new) before reaching x^(t+1,new). This ensures smoother new trajectories {x^(t,new)}_{t≥1} and a potentially more accurate final estimate.
Figure A.3 illustrates this behavior and our solution on minimizing the Muller-Brown energy potential in R^2.
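As an illustration, the line-3 reparametrization update can be sketched on a toy gradient-descent trajectory. This is a minimal sketch under stated assumptions: the target arc length S is taken as the mean of the old arc lengths, and the inertia is a simple convex blend with weight η, details the text leaves implicit.

```python
import numpy as np

def reparametrize_steps(xs_old, hs_old, eta=0.5, h_max=2.0):
    """One pass of the line-3 update: h_t^new ~ S * h_t^old / S_t^old,
    blended with the old step by inertia eta and capped at h_max.
    Assumption: the target arc length S is the mean of the old arc lengths."""
    S_old = np.linalg.norm(np.diff(xs_old, axis=0), axis=1)  # S_t^old
    S = S_old.mean()                                         # assumed target S
    hs_prop = S * hs_old / np.maximum(S_old, 1e-12)
    hs_new = (1 - eta) * hs_old + eta * hs_prop
    return np.minimum(hs_new, h_max)

def gd_trajectory(x0, grad, hs):
    """Re-run the iterates x^(t+1) = x^(t) - h_t * grad(x^(t))."""
    xs = [np.asarray(x0, float)]
    for h in hs:
        xs.append(xs[-1] - h * grad(xs[-1]))
    return np.stack(xs)

# Toy quadratic F(x) = 0.5 * ||x||^2, so grad F(x) = x; arc lengths shrink
# geometrically, and the update enlarges the later (too-small) steps.
grad = lambda x: x
hs_old = np.full(8, 0.3)
xs_old = gd_trajectory(np.array([2.0, 1.0]), grad, hs_old)
hs_new = reparametrize_steps(xs_old, hs_old)
```

On this quadratic the old arc lengths decay by a factor 0.7 per step, so the update assigns larger step sizes to later iterations, evening out the movement.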

B.2 JKO FLOW NETWORK REPARAMETRIZATION

Although Algorithm 2 is developed for reparametrizing and refining trajectories in vector space R^d, it can be directly used to reparametrize h for JKO-iFlow by replacing the arc length S_t between consecutive iterates in vector space with the W_2 movement in probability space of the residual block. More precisely, let L be the total number of blocks trained via Algorithm 1, and denote h_k := t_{k+1} − t_k the "step size" of block f_θ_k. Replace the iterates x^(t) in vector space with x(t_k), the mapping of the data through the first k − 1 blocks. The arc length S_t then becomes the W_2 distance, which can be easily computed using N samples {x_i(t_k)}_{i=1}^N along each step of the trajectory. The refinement step becomes training additional residual blocks by optimizing (8).

Algorithm 2 Trajectory improvement (vector-space case)
Require: Penalty factors h_t^old and iterates x^(t,old) for t = 1, ..., T. Hyper-parameters h_max > 0 and η ∈ (0, 1].
1: Compute S_t^old := ||x^(t+1,old) − x^(t,old)||_2 and the target arc length S.
2: for t = 1, ..., T do
3:   Reparametrize: update h_t^new from h_t^old using S h_t^old / S_t^old with inertia η, capped at h_max.
4:   Refine: interpolate C intermediate points between consecutive iterates and re-optimize along the new step sizes.
5: end for
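The W_2 movement between consecutive block outputs can be estimated from the N samples. Below is a sketch that computes the exact empirical W_2 by brute force over assignments, viable only for tiny N; a practical implementation would use an OT solver (e.g., the POT package) on minibatches instead.

```python
import itertools
import numpy as np

def empirical_w2(X, Y):
    """Exact W2 between two empirical measures with N points each.

    Brute-force over the N! assignments -- only for tiny N; in practice use
    an OT solver (e.g. POT's ot.emd2) on minibatch samples {x_i(t_k)}.
    """
    N = len(X)
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    best = min(sum(cost[i, p[i]] for i in range(N))
               for p in itertools.permutations(range(N)))
    return np.sqrt(best / N)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))      # samples x_i(t_k)
shift = np.array([3.0, 4.0])         # X + shift plays the role of x_i(t_{k+1})
```

For a pure translation the optimal coupling is the identity pairing, so `empirical_w2(X, X + shift)` equals the norm of the shift, a convenient sanity check.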

C EXPERIMENTAL DETAILS

C.1 CHOICE OF t_k IN ALGORITHM 1

Recall that to train JKO-iFlow, one needs as input a sequence of t_k, where the k-th JKO block integrates from t_k to t_{k+1}. Although the selection of t_k varies by problem, we consider two choices in our settings.
• Constant increment. Denote h_k := t_{k+1} − t_k; we let h_k ≡ c_1 for a constant c_1 > 0. On many experiments for two-dimensional toy data and high-dimensional data, we use c_1 = 1.
• Constant multiplier. Given t_0 > 0 and a constant c_2 > 1, we let t_{k+1} := c_2 t_k. The rationale is that, empirically, the W_2 movement as in (6) tends to be larger at the initial blocks than at the later ones, so moving the later blocks more than the initial ones renders the movements more uniform and thus facilitates training. On some experiments for two-dimensional toy data and high-dimensional data, we let t_0 = 0.75 and c_2 = 1.2.
We acknowledge that many other choices are possible. We also emphasize that, due to the reparametrization and refinement techniques proposed in Section 4.2, the values of t_k are adaptively updated based on data, and the adapted values yield more uniform W_2 movements over the blocks, as shown in Section 5.

C.2 OTHER SETUP DETAILS

All experiments are conducted using PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019). Regarding network architecture:



Figure 1: Comparison of JKO-iFlow (proposed) and other flow models. The JKO scheme approximates the transport of a diffusion process and the ResNet is trained block-wise.

Figure 2: Diagram illustrating trajectory reparametrization and refinement. The top panel shows the original trajectory under three blocks via Algorithm 1. The bottom panel shows the trajectory under six blocks after reparametrization and refinement, which renders the W2 movements more even.

EXPERIMENTAL SETUP

Competing methods and metrics. We compare JKO-iFlow with five other models: four flow-based models and one diffusion model. The first two are the continuous flow models FFJORD (Grathwohl et al., 2019) and OT-Flow (Onken et al., 2021). The next two are the discrete flow models IResNet (Behrmann et al., 2019) and iGNN (Xu et al., 2022), the latter of which replaces the expensive spectral normalization in IResNet with a Wasserstein-2 regularization to promote smoothness. The last is the score-based generative model built on a neural stochastic differential equation (Song et al., 2021), which we call ScoreSDE for short. We are primarily interested in two types of criteria. The first is computational efficiency in terms of the number of iterations (i.e., batches the model uses in training) and the training time per iteration. Due to the block-wise training scheme, JKO-iFlow can process larger data batches per iteration.

Figure 3: Two-dimensional simulated datasets. The samples generated by JKO-iFlow in (a) are closer to the true data than those of the competitors in (b)-(e). Under the bandwidth selected more carefully via the sample-median technique, MMD[m] in (20) of JKO-iFlow is also closer to the threshold τ in (21) than the others. (f)-(h) visualize generation by JKO-iFlow on more examples.

Figure 4: Conditional graph node feature generation by JKO-iFlow and iGNN. We visualize the conditionally generated samples upon projecting down to the first two principal components determined by true X|Y . We visualize generation at two different values of Y .

Two-dimensional toy data. Figures 3a-3d compare JKO-iFlow with the competitors on non-conditional generation, where the subcaptions indicate the MMD values (20) under both bandwidths and the corresponding thresholds τ in (21). We omit IResNet with spectral normalization as it yields results similar to W2 IResNet. The generative quality of JKO-iFlow is the closest to the ground truth, and when the MMD bandwidth is selected more carefully via the sample-median technique, JKO-iFlow also yields a smaller MMD than the others. Meanwhile, Figures 3f-3h show the satisfactory generative performance of JKO-iFlow on other examples. In the Appendix, Figure A.2 compares the performance of JKO-iFlow before and after applying the technique described in Section 4.2, where the generative quality improves after several reparametrization moving iterations, and Figure A.4 shows additional unconditional and conditional generation results.

High-dimensional tabular data. For conditional graph node feature generation, Figure 4 compares JKO-iFlow with iGNN on the solar dataset introduced in iGNN. The results show that JKO-iFlow yields competitive or clearly better MMD values on the conditional distributions X|Y with the most and second-most observations, respectively. Next,

(a) Components of loss (8) over moving iterations. (b) Results at moving iteration 1. (c) Results at moving iteration 5.

Figure 5: MINIBOONE, reparametrization moving iterations of JKO-iFlow. We plot the different components of the loss objective (8) over t_k. In (a), the results at moving iteration 5 are obtained by applying Algorithm 2 (modified for training the flow model) 4 times; the reparametrization gives more uniform W_2 losses after the moving iterations. On this example, the generative performance is good both before and after the moving iterations, cf. plots (b) and (c).

Figure 6: MNIST generation by JKO-iFlow coupled with a pre-trained auto-encoder.

Appendix C shows the complete results, including the number of training iterations and the test log-likelihood. Overall, we remark that comparisons using MMD[m] (i.e., MMD with bandwidth selected by the sample-median technique) align best with the visual comparisons in Figure A.1 of Appendix C.6, so we suggest MMD[m] as the more reliable metric among those we used. Furthermore, we illustrate the reparametrization technique on MINIBOONE in Figure 5, where the benefit appears as a flow trajectory with more uniform movement at a competitive generative performance.

• Figure A.1 visualizes the principal component projections of the samples generated by JKO-iFlow and the competitors on the high-dimensional real datasets.
• Figure A.2 visualizes the components of loss (8) and the resulting generated samples.
• Figure A.3 visualizes the trajectory of estimates in R^2 when minimizing the Muller-Brown energy potential.
• Figure A.4 shows additional unconditional and conditional samples generated by JKO-iFlow on toy data.

Figure A.1: Generative quality on high-dimensional datasets via PCA projection of generated samples. The generative quality in general aligns with the MMD[m] values shown in Tables 2 and A.1.

Figure A.2: Rose, reparametrization moving iterations of JKO-iFlow. The plots and setup are identical to Figure 5. We observe improved generative quality after the moving iterations.

Figure A.3: Reparametrization and refinement moving iterations in vector space based on Algorithm 2. The task is to estimate a local minimizer of the Muller-Brown energy potential. We see that arc lengths between consecutive iterates become more even in magnitude over more reparametrization and refinement moving iterations.

Figure A.4: Additional unconditional and conditional generation on simulated toy datasets by JKO-iFlow.

Algorithm 1 Block-wise JKO-iFlow training
Require: Time stamps {t_k}, training data, termination criterion Ter and tolerance level ϵ, maximal number of blocks L_max.
1: Initialize k = 1.
2: while Ter(k) > ϵ and k ≤ L_max do
3:   Optimize f_θ_k by minimizing (8) with mini-batch sample approximation, given {f_θ_i}_{i=1}^{k−1}. Set k ← k + 1.
4: end while
5: L ← k. Optimize f_θ_{L+1} using (8) with h = ∞. ▷ Free block, no W_2 regularization.

This leads to a block-wise training of the normalizing flow network, as summarized in Algorithm 1. Regarding the input parameters, we found that the generative performance of JKO-iFlow may vary with the starting choice of t_k, but a simple choice such as t_k = k often yields reasonably good performance. We discuss the initial selection of t_k further in Appendix C.1. Meanwhile, one can use any suitable termination criterion Ter(k) in line 2 of Algorithm 1. In our experiments, we monitor the per-dimension W_2 loss W_2^2(T_# p_k, p_k).
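The control flow of the block-wise loop can be sketched as follows. Here `train_block` and `terminate` are illustrative stand-ins, not the paper's API: the former represents optimizing one block on objective (8) over [t_{k-1}, t_k] given the frozen earlier blocks, and the latter represents the criterion Ter(k).

```python
def train_jko_iflow(ts, train_block, terminate, eps, L_max):
    """Skeleton of Algorithm 1 (block-wise training).

    ts:          time stamps t_0 < t_1 < ...; block k integrates [t_{k-1}, t_k].
    train_block: stand-in for optimizing f_{theta_k} on objective (8),
                 called as train_block(k, t0, t1, blocks, h).
    terminate:   stand-in for Ter(k), e.g. per-dimension W2 movement.
    """
    blocks = []
    k = 1
    while terminate(k, blocks) > eps and k <= L_max:
        h = ts[k] - ts[k - 1]                       # JKO step size of block k
        blocks.append(train_block(k, ts[k - 1], ts[k], blocks, h))
        k += 1
    # final "free" block: h = infinity, i.e. no W2 regularization
    blocks.append(train_block(k, None, None, blocks, float("inf")))
    return blocks
```

With mock callables one can verify that the loop stops once the criterion drops below ϵ and that exactly one free block is appended afterwards.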






• For simulated 2D data, high-dimensional real data, and MNIST with the pre-trained auto-encoder: each residual block uses fully-connected layers of the form d → H → H → d, where d (resp. H) is the feature (resp. hidden) dimension. The hidden dimension varies by example, in the range 128∼512.
• For conditional graph node feature generation: each residual block uses one ChebNet input layer of order 3 followed by two fully-connected layers, with hidden dimension H = 64 in all hidden layers.
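A minimal NumPy sketch of the d → H → H → d residual block is below. The crude weight rescaling (spectral norm ≤ 0.9 per layer, so the residual branch is a contraction and x + g(x) is invertible) is our own illustration of the standard sufficient condition, not the paper's training procedure.

```python
import numpy as np

def make_block(d, H, scale=0.9, seed=0):
    """One residual block x -> x + g(x) with g: d -> H -> H -> d.

    Each weight matrix is rescaled to spectral norm <= scale < 1; with
    1-Lipschitz tanh, g is then a contraction, so x + g(x) is invertible
    by fixed-point iteration (a sketch, not the paper's procedure).
    """
    rng = rng_obj = np.random.default_rng(seed)
    Ws = [rng.standard_normal(s) for s in [(H, d), (H, H), (d, H)]]
    Ws = [scale * W / np.linalg.norm(W, 2) for W in Ws]  # 2-norm = spectral norm

    def g(x):
        h = np.tanh(Ws[0] @ x)
        h = np.tanh(Ws[1] @ h)
        return Ws[2] @ h

    def forward(x):
        return x + g(x)

    return forward, g
```

Inversion of y = x + g(x) is then the fixed-point iteration x ← y − g(x), which converges geometrically because g is a contraction.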

Table A.1: Numerical metrics on high-dimensional real datasets, in addition to those in Table 2. Compared with the flow-based models, JKO-iFlow takes many fewer iterations to reach a small enough MMD value. Although ScoreSDE is the fastest, its performance, even with 100 times more iterations than JKO-iFlow, is still worse in terms of MMD[m] on all datasets except GAS. We advocate the comparison using MMD[m] because the results align with the visual comparisons in Figure A.1.

Table A.2: MMD[c] and negative log-likelihood results of OT-Flow and FFJORD, as taken from (Onken et al., 2021). We include them for comparison against ours in Table A.1. The models in the previous studies use comparable model sizes (especially OT-Flow), and their numerical results are in some cases much smaller than ours due to significantly longer training time.

C.3 DATASETS

For two-dimensional simulated examples, we generate fresh random draws of 10000 training samples at each training epoch. The four high-dimensional real datasets (POWER, GAS, HEP-MASS, MINIBOONE) come from the University of California Irvine (UCI) machine learning data repository. These datasets are commonly used to compare flow models (Grathwohl et al., 2019; Finlay et al., 2020; Onken et al., 2021) . The solar dataset as used in iGNN (Xu et al., 2022) is retrieved from the National Solar Radiation Database (NSRDB).

C.4 MMD METRICS

Besides visual comparison, the maximum mean discrepancy (MMD) (Gretton et al., 2012; Onken et al., 2021) provides a quantitative way to evaluate the performance of generative models. Given samples X := {x_i}_{i=1}^N and Y := {y_j}_{j=1}^M and a kernel function k(x, y), we compute

  MMD^2(X, Y) = (1/N^2) Σ_{i,i′} k(x_i, x_{i′}) + (1/M^2) Σ_{j,j′} k(y_j, y_{j′}) − (2/(NM)) Σ_{i,j} k(x_i, y_j).   (20)

For our purpose, we use the Gaussian kernel k(x, y) := exp(−||x − y||^2/h) with bandwidth h. We select the bandwidth both as a constant value h_c = 2 and via the "sample-median technique" (Gretton et al., 2012). In this setting, MMD is an impartial evaluation metric, as it is not used to train JKO-iFlow or any of the competing methods.

We can also determine the statistical significance of an MMD value. First, compute the threshold

  τ := Q_{1−α}({MMD(X_{I_1^b}, X_{I_2^b})}_{b=1}^B),   (21)

where Q_{1−α} denotes the upper (1 − α)-quantile of a set of scalars and I_j^b ⊂ {1, ..., N} denotes the j-th index set in the b-th bootstrap without replacement. Under the null hypothesis that X and Y are drawn from the same distribution, the hypothesis is rejected if MMD exceeds the threshold τ, and the Type-I error is controlled at level α. Thus, if the MMD values of two models both exceed τ, we prefer the model with the smaller MMD; if both values are under τ, the two models generate equally well. In our experiments, we use B = 1000 bootstraps, each of which takes 50% of the test samples.
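The metric and the bootstrap threshold can be sketched as follows. The exact statistic in (20)-(21) (biased vs. unbiased estimator, the precise subsampling scheme) is our assumption from the surrounding description, so treat this as an illustration rather than the paper's evaluation code.

```python
import numpy as np

def mmd2(X, Y, h):
    """Biased (V-statistic) squared MMD with Gaussian kernel of bandwidth h."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / h)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def median_bandwidth(X, Y):
    """'Sample-median technique' (assumed form): median pairwise squared distance."""
    Z = np.concatenate([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.median(d2[np.triu_indices(len(Z), 1)])

def mmd_threshold(X, h, B=200, alpha=0.05, frac=0.5, seed=0):
    """Bootstrap threshold tau: upper (1-alpha) quantile of MMD between two
    disjoint subsamples of X (each a `frac` fraction, without replacement),
    approximating the null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    n = int(frac * len(X))
    vals = []
    for _ in range(B):
        idx = rng.permutation(len(X))
        vals.append(mmd2(X[idx[:n]], X[idx[n:2 * n]], h))
    return np.quantile(vals, 1 - alpha)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
```

As a sanity check, the MMD of a sample against itself vanishes, while a clearly shifted copy exceeds the bootstrap threshold.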

C.5 CONDITIONAL GENERATION

We follow the conditional generation scheme proposed in iGNN (Xu et al., 2022). More precisely, when the response variable Y is categorical with K classes, iGNN designs the target distribution as a Gaussian mixture model. Thus, instead of flowing from the data density p_X to the noise density p_Z, iGNN flows from the conditional data density p_{X|Y} to p_{H|Y}, where H|Y ∼ N(µ_Y, σ^2 I). One can then minimize the negative log-likelihood −log p_{X|Y} using log p_{H|Y} and the change-of-variable formula. To use JKO-iFlow for conditional generation in this setting, we thus only need to modify the objective (8): instead of using V_Z based on Z ∼ N(0, I_d), we use V_{H|Y} based on the Gaussian mixture H|Y ∼ N(µ_Y, σ^2 I).
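Concretely, the modified potential is just the Gaussian negative log-density (up to an additive constant) centered at the class mean; `mu_y` and `sigma` below follow the H|Y ∼ N(µ_Y, σ²I) notation above.

```python
import numpy as np

def V_cond(h, mu_y, sigma=1.0):
    """Class-conditional potential V_{H|Y}(h) = ||h - mu_y||^2 / (2 sigma^2),
    the negative log-density of N(mu_y, sigma^2 I) up to a constant.
    With mu_y = 0 and sigma = 1 this recovers V_Z(h) = |h|^2 / 2."""
    return ((h - mu_y) ** 2).sum(-1) / (2 * sigma ** 2)
```

Swapping V_Z for V_cond in objective (8), with one mean µ_Y per class, is the only change needed for conditional generation.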

C.6 ADDITIONAL RESULTS

We present complete results in addition to those in Section 5. In particular,
• Table A.1 contains the complete numerical results of JKO-iFlow against the competitors on the high-dimensional real datasets. For ScoreSDE, we use the implementation in (Huang et al., 2021), which computes the evidence lower bound (ELBO) of the data log-likelihood, as reported in the last column.
• Table A.2 contains the MMD and negative log-likelihood results for OT-Flow and FFJORD, as taken from the original papers.

