LEARNING CONTINUOUS NORMALIZING FLOWS FOR FASTER CONVERGENCE TO TARGET DISTRIBUTION VIA ASCENT REGULARIZATIONS

Abstract

Normalizing flows (NFs) have been shown to be advantageous in modeling complex distributions and improving sampling efficiency for unbiased sampling. In this work, we propose a new class of continuous NFs, ascent continuous normalizing flows (ACNFs), that makes a base distribution converge faster to a target distribution. As solving such a flow is non-trivial and rarely possible in closed form, we propose a practical implementation to learn flexibly parametric ACNFs via ascent regularization and apply it in two learning cases: maximum likelihood learning for density estimation, and minimizing the reverse KL divergence for unbiased sampling and variational inference. The learned ACNFs demonstrate faster convergence towards the target distributions, therefore achieving better density estimation, unbiased sampling, and variational approximation at lower computational cost. Furthermore, the flows are shown to stabilize themselves to mitigate performance deterioration and are less sensitive to the choice of the training flow length T.

1. INTRODUCTION

Normalizing flows (NFs) provide a flexible way to define an expressive but tractable distribution, requiring only a base distribution and a chain of bijective transformations (Papamakarios et al., 2021). Neural ODE (Chen et al., 2018) extends discrete normalizing flows (Dinh et al., 2014; 2016; Papamakarios et al., 2017; Ho et al., 2019) to a continuous-time analogue by defining the transformation via a differential equation, substantially expanding model flexibility in comparison to the discrete alternatives. (Grathwohl et al., 2018; Chen and Duvenaud, 2019) propose a computationally cheaper way to estimate the trace of the Jacobian to accelerate training, while other methods focus on increasing flow expressiveness by, e.g., augmenting with additional states (Dupont et al., 2019; Massaroli et al., 2020), or adding stochastic layers between discrete NFs to alleviate the topological constraint (Wu et al., 2020). Recent diffusion models (Hodgkinson et al., 2020; Ho et al., 2020; Song et al., 2020; Zhang and Chen, 2021) extend the scope of continuous normalizing flows (CNFs) with stochastic differential equations (SDEs). Although these diffusion models significantly improve the quality of the generated images, the introduced diffusion comes with costs: some models no longer allow for tractable density estimation, or their practical implementations rely on a long chain of discretizations and thus need relatively more computation than tractable CNF methods, which can be critical for some use cases such as online inference. (Finlay et al., 2020; Onken et al., 2021; Yang and Karniadakis, 2020) introduce several regularizations to learn simpler dynamics using optimal transport theory, which decrease the number of discretization steps in integration and thus reduce training time. (Kelly et al., 2020) extends the $L_2$ transport cost to regularize an arbitrary order of dynamics.
Although these regularizations are beneficial for decreasing the computational costs of simulating flows, they do not improve the slow convergence of the density to the target distribution exhibited by trained vanilla CNF models, as shown in Figure 1. To accelerate flow convergence, STEER (Ghosh et al., 2020) and TO-FLOW (Du et al., 2022) propose to optimize the flow length T in two different ways: STEER randomly samples the length during training, while TO-FLOW establishes a subproblem for T during training.

[Figure 2: The log-likelihood estimates of trained vanilla CNF models with various flow lengths T_n, and of the steepest ACNF with dynamics defined in eq. (6), at different t on the 2-moon distribution. All vanilla CNF models reach their maximum around T_n and deteriorate rapidly afterwards, while the log-likelihood estimate of the ACNF rises rapidly at the start and increases monotonically.]

To understand the effectiveness of these methods, we train multiple Neural ODE models with different flow lengths T_n on a 2-moon distribution and examine the estimated log-likelihoods in Figure 2. Although sampling or optimizing T dynamically performs a model selection during training and lets models reach higher estimates with shorter flows, it cannot prevent the divergence after T_n. Furthermore, shorter flows are more limited in expressiveness for reaching higher maximum likelihoods and are sensitive to the flow length. In this work, we present a new family of CNFs, ascent continuous normalizing flows (ACNFs), to address the aforementioned problems. An ACNF is a flow that transforms a base distribution monotonically into a target distribution, with dynamics imposed to follow the steepest such flow. However, solving the steepest flow is non-trivial and rarely possible, so we propose a practical implementation to learn parametric ACNFs via ascent regularization.
Learned ACNFs exhibit three main beneficial behaviors: 1) faster convergence to target distribution with less computation; 2) self-stabilization to mitigate flow deterioration; and 3) insensitivity to flow training length T . We demonstrate these behaviors in three use cases: modeling data distributions; learning annealed samplers for unbiased sampling; and learning a tractable but more flexible variational approximation.

2. CONTINUOUS NORMALIZING FLOWS

Considering a time-t transformation z(t) = Φ_t(x) on the initial value x, i.e. z(0) = x, the change of variables theorem reveals the relation between the transformed distribution p_t(z(t)) and p(x):

$$p_t(z(t)) = |\det J_{\Phi_t}(x)|^{-1} \, p(x), \quad (1)$$

where J_{Φ_t} is the Jacobian matrix of Φ_t. As Φ_t normalizes x towards some base distribution, p_t(z(t)) is referred to as the normalized distribution at time t, starting from the data distribution p(x). A continuous normalizing flow is the infinitesimal limit of a chain of discrete flows, and the infinitesimal transformation is specified by an ordinary differential equation (ODE):

$$\frac{dz(t)}{dt} = \frac{d\Phi_t(x)}{dt} = f(z(t), t). \quad (2)$$

The instantaneous change of variables theorem (Chen et al., 2018, theorem 1) shows that the infinitesimal change of log p_t(z(t)) is:

$$\frac{d \log p_t(z(t))}{dt} = -\nabla \cdot f(z(t), t). \quad (3)$$

Thus, the log-normalized distribution log p_t(z(t)) can be obtained by integrating eq. (3) backwards with a common approximation to the base distribution µ at time T, i.e. p_T ≈ µ:

$$\log p_t(z(t)) = \log p_T(z(T)) - \int_t^T \nabla \cdot f(z(\tau), \tau)\, d\tau \approx \log \mu(z(T)) - \int_t^T \nabla \cdot f(z(\tau), \tau)\, d\tau,$$

where $z(t) = x + \int_0^t f(z(\tau), \tau)\, d\tau$. The accuracy of log p_0(x) obtained from the right-hand side depends on the approximation error of p_T to µ, and the error can vary at different z(T). To avoid these problems in analysis and to investigate how the flow length affects the modeled distribution, we introduce p̂_t(x), the estimated density of a t-length flow Φ_t, given via the change of variables theorem:

$$\hat{p}_t(x) = |\det J_{\Phi_t}(x)| \, \mu(\Phi_t(x)). \quad (4)$$

As indicated by eq. (4) and Figure 3, p̂_t initiates at the base distribution, i.e. p̂_0(x) = µ(x). Combining eq. (1) and eq. (4), the estimated density p̂_t relates to the normalized distribution p_t(z(t)) as:

$$\frac{\hat{p}_t(x)}{p(x)} = \frac{\mu(\Phi_t(x))}{p_t(\Phi_t(x))} = \frac{\mu(z(t))}{p_t(z(t))}.$$

This shows that as p_t → µ, p̂_t(x) → p(x). When there exists a flow whose normalized density equals the base distribution, i.e.
p_T = µ, then the estimated likelihood becomes exact for the data distribution, i.e. p̂_T(x) = p(x). Like the instantaneous change of variables theorem in eq. (3), we derive the infinitesimal change of the time-t estimated log-likelihood:

Proposition 1 (Instantaneous Change of Log-likelihood Estimate). Let z(t) be a finite continuous random variable at time t, given as the solution of a differential equation $\frac{dz(t)}{dt} = f(z(t), t)$ with initial value z(0) = x. Assuming that p̂_0 = µ at t = 0 and f is uniformly Lipschitz continuous in z and t, the change in the estimated log-likelihood log p̂_t(x) at t follows the differential equation:

$$\frac{d \log \hat{p}_t(x)}{dt} = \nabla \cdot f(z(t), t) + \nabla \log \mu(z(t)) \cdot f(z(t), t). \quad (5)$$

Proof. See Appendix A.1 for the detailed derivation and its relation to eq. (3).

Unlike the integral for log p_t(z(t)), which relies on the approximation p_T ≈ µ and requires solving the whole trajectory z(τ), τ ∈ [0, T], log p̂_t(x) can be evaluated exactly and simultaneously with z(t) for any t:

$$\log \hat{p}_t(x) = \log \mu(x) + \int_0^t \big( \nabla \cdot f(z(\tau), \tau) + \nabla \log \mu(z(\tau)) \cdot f(z(\tau), \tau) \big)\, d\tau.$$
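As a sanity check, the integral above can be verified numerically against the closed form of eq. (4). The example below is a minimal sketch of ours (not the paper's implementation), assuming a 1-D standard-Gaussian base µ and time-invariant affine dynamics f(z, t) = a z + b, for which Φ_T(x) = e^{aT} x + (b/a)(e^{aT} − 1) and log |det J_{Φ_T}| = a T:

```python
import math

def log_mu(z):                      # log N(z; 0, 1), the base distribution
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

a, b, x, T = 0.5, 0.2, 1.0, 1.0     # affine dynamics f(z, t) = a*z + b

# Integrate [z(t), log p_hat_t(x)] jointly with Euler steps (Proposition 1):
# dz/dt = f(z, t),  d log p_hat / dt = div(f) + grad(log mu)(z) . f(z, t)
n = 200_000
dt = T / n
z, log_p_hat = x, log_mu(x)         # p_hat_0 = mu
for _ in range(n):
    f = a * z + b
    log_p_hat += (a + (-z) * f) * dt   # div(f) = a, grad log mu(z) = -z
    z += f * dt

# Closed form via eq. (4): log p_hat_T(x) = log|det J_Phi_T| + log mu(Phi_T(x))
phi_T = math.exp(a * T) * x + (b / a) * (math.exp(a * T) - 1.0)
closed = a * T + log_mu(phi_T)
print(abs(log_p_hat - closed))      # agreement up to discretization error
```

The two evaluations agree up to the Euler discretization error, illustrating that log p̂_t(x) can indeed be accumulated alongside z(t) without solving the flow beyond t.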

3. ASCENT CONTINUOUS NORMALIZING FLOWS

By using the KL divergence as a distance measure between distributions, we have the following duality:

$$\mathrm{KL}(p(x) \| \hat{p}_t(x)) = \mathrm{const} - \int p(x) \log \hat{p}_t(x)\, dx = \mathrm{KL}(p_t(z(t)) \| \mu(z(t))),$$

so that maximum likelihood learning of p̂_T(x) for data samples from p(x) is equivalent to minimizing 1) the forward KL divergence between p(x) and p̂_t(x), by the first equality; and 2) the reverse KL divergence in the normalization direction, by the second equality. We can measure the rate of the KL divergence or the expected log-likelihood by their time derivatives, and define ascent continuous normalizing flows (ACNFs) as flows that monotonically increase the expected log-likelihood, or equivalently decrease the KL divergence:

$$\frac{\partial}{\partial t} \int p(x) \log \hat{p}_t(x)\, dx \ge 0, \quad \text{or} \quad \frac{\partial}{\partial t} \mathrm{KL}(p_t(z(t)) \| \mu(z(t))) \le 0.$$

By applying the calculus of variations, we can find the dynamics for the steepest descent of the reverse KL divergence, or equivalently the steepest ascent of the expected log-likelihood:

Theorem 1 (Dynamics for Steepest Ascent Continuous Normalizing Flows). Let z(t) be a finite continuous random variable given as the solution of a differential equation $\frac{dz(t)}{dt} = f(z(t), t)$ with initial value z(0) = x. Its probability density p_t(z(t)) is subject to the continuity equation $\partial_t p_t + \nabla \cdot (p_t f) = 0$. The dynamics of the steepest flow for decreasing KL(p_t(z(t)) ∥ µ(z(t))) is:

$$f^*(z(t), t) = \nabla \log \mu(z(t)) - \frac{\nabla p_t(z(t))}{p_t(z(t))} = \nabla \log \mu(z(t)) - \nabla \log p_t(z(t)). \quad (6)$$

Proof. See Appendix A.2 for the detailed derivation.

The steepest dynamics is the difference between two gradients, ∇ log µ and ∇ log p_t, w.r.t. the state z(t). There are a few important implications of eq. (6): 1) the dynamics is time-variant, as p_t evolves along the flow of z(t); 2) the dynamics at time t only depends on the current state z(t), so no history is needed; 3) the flow is initiated at the difference between ∇ log µ(x) and ∇ log p(x), gradually slows down, and eventually stops when p_t converges to µ.
The convergence rate of the steepest flow can also be shown to be the negative Fisher divergence,

$$\frac{\partial \mathrm{KL}(p_t \| \mu)}{\partial t} = -F(p_t \| \mu) = -\mathbb{E}_{p_t} \| \nabla \log \mu(z) - \nabla \log p_t(z) \|_2^2,$$

so this optimal deterministic CNF is related to (overdamped) Langevin diffusion; see Appendix A.3 for the derivation of the convergence rate and a detailed discussion of their relation. This optimal flow can also be considered a special instance of the Wasserstein gradient flow (Ambrosio et al., 2005) with the KL divergence as the energy functional. Previous works (Finlay et al., 2020; Yang and Karniadakis, 2020; Onken et al., 2021) apply optimal transport theory to regularize flow dynamics in Euclidean space, while the Wasserstein gradient flow, or eq. (6), instead regularizes the flow in probability measure space. We refer readers to (Ambrosio et al., 2005) for an accessible introduction. In some special cases, the flow can be solved by introducing an auxiliary potential, V(z, t) = p_t(z)/µ(z), which satisfies the partial differential equation (PDE):

$$\frac{\partial V(z, t)}{\partial t} = \Delta V(z, t) + 2 \nabla \log \mu(z) \cdot \nabla V(z, t) + \nabla \log V(z, t) \cdot \nabla V(z, t), \quad (7)$$

with the initial condition $V(z(0), 0) = \frac{p_0(z(0))}{\mu(z(0))} = \frac{p(x)}{\mu(x)}$. See Appendix A.4 for its derivation. Solving this PDE for p_t(z(t)) is non-trivial, as the closed-form solution is typically unknown. JKO integration is used in the literature (Mokrov et al., 2021; Fan et al., 2021), approximating the dynamics of the density p_t by time discretization. However, it requires knowing the initial condition, while p(x) is generally unknown and needs to be modeled from data. (Tabak and Vanden-Eijnden, 2010) propose to approximate p(x) by a spatial discretization of samples, which can hardly be scaled up even to intermediate dimensions.
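For Gaussian µ and Gaussian p_0 the steepest dynamics in eq. (6) is linear in z, so p_t stays Gaussian and eq. (6) reduces to two moment ODEs. This is a worked example of ours, not from the paper: with µ = N(0, 1) and p_t = N(m_t, v_t), f*(z) = −z + (z − m_t)/v_t, giving dm/dt = −m and dv/dt = 2(1 − v). The sketch below integrates these moments and checks that KL(p_t ∥ µ) decreases monotonically at exactly the negative-Fisher-divergence rate:

```python
import math

def kl(m, v):        # KL( N(m, v) || N(0, 1) )
    return 0.5 * (m * m + v - 1.0 - math.log(v))

def fisher(m, v):    # E_{p_t} || grad log mu - grad log p_t ||^2 for these Gaussians
    return m * m + (1.0 - v) ** 2 / v

m, v, dt = 3.0, 0.25, 1e-4
kls, checks = [kl(m, v)], []
for _ in range(50_000):              # integrate the moment ODEs to t = 5
    dm, dv = -m, 2.0 * (1.0 - v)     # induced by f* in eq. (6)
    m2, v2 = m + dm * dt, v + dv * dt
    # numerical dKL/dt against the claimed rate -Fisher(p_t || mu)
    checks.append(abs((kl(m2, v2) - kl(m, v)) / dt + fisher(m, v)))
    m, v = m2, v2
    kls.append(kl(m, v))

assert all(b <= a + 1e-12 for a, b in zip(kls, kls[1:]))  # monotone decrease
print(max(checks), kls[-1])          # small rate-check error, final KL near 0
```

The moment ODEs are our own reduction under the Gaussian assumption; the monotone decrease and the rate matching are exactly the ACNF properties claimed above.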
To tackle these difficulties and accelerate unregularized flows towards faster convergence, we propose ascent regularization to learn parametric ACNFs, inspired by previous works (Yang and Karniadakis, 2020; Onken et al., 2021; Finlay et al., 2020; Kelly et al., 2020; Ghosh et al., 2020) that enforce certain flow behaviors via regularization during training. Ascent regularization penalizes the difference between the parametric dynamics and the steepest dynamics, $\|f_\theta - f^*\|_2^2$, which requires evaluating the score function ∇ log p_t(z(t)). We therefore derive the instantaneous change of the score function:

Theorem 2 (Instantaneous Change of Score Function). Let z(t) be a finite continuous random variable with probability density p_t(z(t)) at time t. Let $\frac{dz(t)}{dt} = f(z(t), t)$ be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and t, the infinitesimal change in the gradient of the log-density at t is:

$$\frac{d \nabla \log p_t(z(t))}{dt} = -\nabla \log p_t(z(t)) \, \frac{\partial f(z(t), t)}{\partial z(t)} - \nabla \big( \nabla \cdot f(z(t), t) \big). \quad (8)$$

Proof. See Appendix A.5 for the detailed derivation.

∇ log p_t(z(t)) follows a linear matrix differential equation, where the linear coefficient is the Jacobian and the bias term is the gradient of the divergence. Note that an alternative proof can be found in the concurrent work (Lu et al., 2022, theorem D.1). We discuss the training of ACNFs in two learning cases: maximum likelihood learning for data modeling and density estimation in Section 4, and minimizing the reverse KL divergence to learn annealed samplers for unbiased sampling in Section 5.
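Theorem 2 can also be checked in closed form. For the affine dynamics f(z, t) = a z + b with z(0) = x drawn from N(0, 1) (a toy case of ours, not the paper's setup), the pushforward density stays Gaussian and the score along a trajectory is ∇ log p_t(z(t)) = −x e^{−a t}, while eq. (8) reduces to ds/dt = −a s because ∂f/∂z = a and ∇(∇ · f) = 0. A minimal numerical check:

```python
import math

a, b, x, T = 0.7, -0.3, 1.3, 1.0   # f(z, t) = a*z + b, z(0) = x, p_0 = N(0, 1)

# Integrate the score s(t) = grad log p_t(z(t)) by Theorem 2:
# ds/dt = -s * (df/dz) - grad(div f) = -a * s   (df/dz = a, div f = a constant)
n = 100_000
dt = T / n
s = -x                              # grad log p_0(x) = -x for p_0 = N(0, 1)
for _ in range(n):
    s += (-a * s) * dt

closed = -x * math.exp(-a * T)      # closed-form score along the trajectory
print(abs(s - closed))
```

Since the Gaussian stays Gaussian under this flow, the ODE of Theorem 2 reproduces the analytic score to within the Euler step error.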

Algorithm 1 Maximum likelihood learning of ACNF with ascent regularization

Require: data samples X = {x_j}_{j=1,...,M}, parametric flow dynamics f_θ, flow length T, ascent regularization coefficient λ, mini-batch size N, base distribution µ
Initialize θ
while θ is not converged do
    Sample a mini-batch of N data x_i ∼ X
    Integrate the augmented states [z_i(t), log p̂_t(x_i)] forward with initial value [x_i, log µ(x_i)] from 0 to T
    Integrate the augmented states [z_i(t), ∇ log p_t(z_i(t))] backwards with initial value [z_i(T), ∇ log µ(z_i(T))] from T to 0
    Compute the loss function L in eq. (9) and ∇_θ L by the adjoint sensitivity method
    Update θ by a gradient descent algorithm
end while

Algorithm 2 Training ACNF as an annealed sampler for unbiased sampling with ascent regularization

Require: target distribution π = γ/Z, parametric flow dynamics f_θ, flow length T, number of samples N, ascent regularization coefficient λ, base distribution µ
Initialize θ
while θ is not converged do
    Sample z_i(0) ∼ p_0 = µ
    Evaluate log µ(z_i(0)) and ∇ log µ(z_i(0))
    Integrate the augmented states [z_i(t), log p_t(z_i(t)), ∇ log p_t(z_i(t))] with initial value [z_i(0), log µ(z_i(0)), ∇ log µ(z_i(0))] from 0 to T
    Evaluate log w(z_i(T)) = log γ(z_i(T)) − log p_T(z_i(T))
    Compute the loss function L in eq. (10) and ∇_θ L by the adjoint sensitivity method
    Update θ by a gradient descent algorithm
end while

4. MAXIMUM LIKELIHOOD LEARNING OF ACNF FOR DENSITY ESTIMATION VIA ASCENT REGULARIZATION

For maximum likelihood learning of p̂_T to fit data, the total objective with ascent regularization is:

$$\min_\theta L = \frac{1}{N} \sum_{i=1}^N \left[ -\log \hat{p}_T(x_i; \theta) + \lambda \int_0^T \big\| \nabla \log p_t(z_i(t); \theta) - \nabla \log \mu(z_i(t)) + f(z_i(t), t; \theta) \big\|_2^2 \, dt \right], \quad (9)$$

where λ is the ascent regularization coefficient controlling the trade-off between maximizing the likelihood and regularizing the ascent behavior of the learned dynamics. When λ = 0, ACNF degrades to a vanilla CNF. The first term in eq. (9) is obtained by integrating eq. (5) over [0, T] simultaneously with z(t), while the ascent regularization can be integrated backwards with the augmented initial value [z(T), ∇ log p_T(z(T))], where ∇ log p_T(z(T)) ≈ ∇ log µ(z(T)). We summarize the pseudo-code for maximum likelihood learning of ACNFs in Algorithm 1. We show an interpretation of ascent regularization as score matching in Appendix A.6; thus Algorithm 1 can be implemented in more efficient ways like (Lu et al., 2022; Song et al., 2021) in some cases.
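To make the steps of Algorithm 1 concrete, the sketch below trains a deliberately tiny ACNF in 1-D. Everything here is an illustrative assumption of ours, not the paper's implementation: the dynamics is restricted to a linear family f_θ(z) = θ₀ z + θ₁ (the paper uses a hypernetwork), gradients come from central finite differences rather than the adjoint method, and integration is plain Euler. It fits data from N(1.5, 0.7²) with base µ = N(0, 1) under the loss of eq. (9):

```python
import math, random

random.seed(0)
DATA = [random.gauss(1.5, 0.7) for _ in range(96)]   # toy data set
T, STEPS, LAM = 1.0, 40, 0.1                         # flow length, Euler steps, ascent coef

def log_mu(z):
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def loss(theta, xs):
    t0, t1 = theta
    dt, total = T / STEPS, 0.0
    for x in xs:
        # forward pass: z(t) and log p_hat_t(x) via Proposition 1, storing the trajectory
        z, lp, traj = x, log_mu(x), [x]
        for _ in range(STEPS):
            f = t0 * z + t1
            lp += (t0 + (-z) * f) * dt           # div(f) = t0, grad log mu(z) = -z
            z += f * dt
            traj.append(z)
        # backward pass for the score via Theorem 2: ds/dt = -t0 * s, s(T) = -z(T)
        s, pen = -traj[-1], 0.0
        for k in range(STEPS, 0, -1):
            s -= (-t0 * s) * dt                  # step backwards in time
            zk = traj[k - 1]
            f = t0 * zk + t1
            pen += (s - (-zk) + f) ** 2 * dt     # ||grad log p_t - grad log mu + f||^2
        total += -lp + LAM * pen
    return total / len(xs)

theta, lr, eps = [0.0, 0.0], 0.02, 1e-4
first = loss(theta, DATA)
for _ in range(80):                              # finite-difference gradient descent
    grad = []
    for j in range(2):
        tp, tm = theta[:], theta[:]
        tp[j] += eps; tm[j] -= eps
        grad.append((loss(tp, DATA) - loss(tm, DATA)) / (2 * eps))
    theta = [p - lr * g for p, g in zip(theta, grad)]
final = loss(theta, DATA)
print(first, final)                              # the regularized NLL should drop
```

The two integration passes mirror the forward and backward augmented-state integrations of Algorithm 1; only the gradient computation is swapped for finite differences to keep the sketch dependency-free.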

5. LEARNING ACNF AS ANNEALED SAMPLER FOR UNBIASED SAMPLING

Besides modeling data samples and performing density estimation, an NF used as a sampler has been shown to be more sample-efficient in Annealed Importance Sampling (AIS) (Neal, 2001) than classic MCMC methods (Arbel et al., 2021). A typical AIS and its extensions use a sequence of annealed targets {π_k}_{k=0:K} that bridges an easy-to-sample, tractable distribution π_0 = µ to the target π_K := π = γ(·)/Z, which is known up to the normalization constant. SNF (Wu et al., 2020) and AFT (Arbel et al., 2021) propose to fit K discrete NFs, each approximating the transport map between π_{k−1} and π_k. However, the rate of sampling convergence depends on the pre-defined annealed targets. Besides, a larger number of annealing steps K is needed to decrease the variance of the estimator, which comes at additional computational cost (Doucet et al., 2022). As an ACNF can also define a flow from a base distribution to a target distribution, it can learn a continuous flow of annealed targets instead of the pre-defined discrete ones, and later generate samples. Different from (Grosse et al., 2013), the annealed target defined by an ACNF does not require a specific form of distribution. As the ACNF enforces faster convergence to the target distribution, an ACNF sampler potentially generates better samples than a CNF or linear annealed scheduling, especially with a limited number of steps K; thus estimates, e.g. of the logarithm of the normalization constant log Z, are more accurate. The importance weight of a sample z_i(T) is

$$\log w(z_i(T)) = \log \gamma(z_i(T)) - \log p_T(z_i(T)).$$

With ascent regularization as in the previous section, the total objective becomes:

$$\min_\theta L = \frac{1}{N} \sum_{i=1}^N \left[ -\log w(z_i(T); \theta) + \lambda \int_0^T \big\| \nabla \log p_t(z_i(t); \theta) - \nabla \log \mu(z_i(t)) + f(z_i(t), t; \theta) \big\|_2^2 \, dt \right], \quad (10)$$

where f(z(t), t; θ) is the annealed generation dynamics. Unlike the previous section, as the sampler initiates from the base distribution, log p_t(z_i(t)) and ∇ log p_t(z_i(t)) are integrated simultaneously with $z_i(t) = z_i(0) + \int_0^t f(z_i(\tau), \tau)\, d\tau$, where z_i(0) ∼ µ.
We summarize the pseudo-code for learning an ACNF annealed sampler in Algorithm 2. Once the ACNF sampler is learned, it can generate unbiased samples: generate one-shot samples from the ACNF with a flow length t chosen according to the computation budget, then correct the samples by resampling according to importance weights as in (Müller et al., 2019), or by Markov chain Monte Carlo with Metropolis-Hastings correction.
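The importance weights defined above directly yield an estimate of the normalization constant, since E_{p_T}[γ(z)/p_T(z)] = Z, so log Ẑ = logsumexp(log w) − log N. As a minimal sketch with made-up distributions (a fixed Gaussian stands in for the learned flow density p_T; the target γ = 3 · N(2, 1) is our toy choice), the estimator recovers log Z:

```python
import math, random

random.seed(1)

def log_normal(z, m, s):               # log N(z; m, s^2)
    return -0.5 * ((z - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)

Z_TRUE = 3.0                            # unnormalized target gamma = Z * N(2, 1)
def log_gamma(z):
    return math.log(Z_TRUE) + log_normal(z, 2.0, 1.0)

# stand-in for the flow's tractable sampler density p_T, here N(0, 2^2)
N = 200_000
zs = [random.gauss(0.0, 2.0) for _ in range(N)]
log_w = [log_gamma(z) - log_normal(z, 0.0, 2.0) for z in zs]

m = max(log_w)                          # logsumexp trick for stability
log_Z_hat = m + math.log(sum(math.exp(lw - m) for lw in log_w)) - math.log(N)
print(log_Z_hat, math.log(Z_TRUE))
```

A trained ACNF would simply replace the stand-in Gaussian with its one-shot samples and the exactly evaluated log p_T from the augmented integration.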

6.1. DENSITY ESTIMATION ON TOY 2D DISTRIBUTIONS

Before we deploy ACNF for modeling complex distributions, we first examine it on a 2-modal Gaussian mixture in 2D, using a standard Gaussian as the base distribution. Figure 4 shows that the potential field of the learned ACNF is very similar to the numerical PDE solution of eq. (7), while the potential of the CNF converges much more slowly than that of the ACNF and then diverges after T. See Appendix A.7 for experiment details and a comparison of the choices of λ and T, and of other regularization methods. We then train a vanilla CNF, RNODE (Finlay et al., 2020), and ACNFs to model various 2D toy distributions and visualize the density estimates along the flows. Figure 5 shows the densities at t ∈ [0, 2T], T = 10, for the learned CNF and ACNFs with various regularization coefficients on the 2-moon distribution. The densities that are close to the target distribution are highlighted inside the red border. We show that even slight regularization makes the learned flows 1) converge much faster towards the target and 2) maintain their best estimates for a long time after T. As seen on the left of Figure 6, the quantitative evaluation of the log-likelihood estimates supports the same conclusion. More analysis on different T and experiment setups is given in Appendix A.8. One may suspect that more complex dynamics explains the faster ascent of the likelihood estimates. To validate the actual improvements by ACNF, we report the number of function evaluations (NFEs) as in (Finlay et al., 2020), counting the times a numerical solver calls the dynamics function during integration, with and without the log-likelihood estimate, for all models. The marks on the left of Figure 6 show NFEs along the flows, while the middle plot shows log-likelihood estimates versus NFEs. ACNFs clearly learn even less complex dynamics than CNF and RNODE, and the log-likelihood gain per NFE of ACNFs is much higher than that of the two baselines, especially at the early stage.
Regarding ascent regularization, a larger coefficient leads to a more rapid initial gain in log-likelihood; however, too large a regularization over-constrains the model and prevents it from reaching a good maximum. A moderate regularization benefits both the maximum likelihood and faster convergence. Furthermore, we report NFEs at t/T = 1 for CNF, RNODE, and ACNFs trained with various λ = 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05 and flow lengths T = 0.5, 1, 5, 10 on the right of Figure 6. ACNFs generally have lower NFEs than CNFs and RNODEs, and most models report the lowest NFEs at T = 1. This indicates that optimizing T like TO-FLOW (Du et al., 2022) and STEER (Ghosh et al., 2020) may decrease the computational cost at T; however, neither strategy can accelerate the convergence of the flow or prevent deterioration like ACNFs, as shown by the density estimate of the CNF in Figure 14 in Appendix A.8. Figure 7 shows density evaluations on other multi-modal distributions. Learned ACNFs show faster convergence than CNFs for all distributions and even give a better maximum density estimate on challenging tasks, e.g. the Olympics distribution.

6.2. DENSITY ESTIMATION ON REAL DATASETS

We demonstrate density estimation on real-world benchmark datasets, including POWER, GAS, HEPMASS, and MINIBOONE from the UCI machine learning repository, and BSDS300 natural image patches. Like FFJORD, all tabular datasets and BSDS300 are pre-processed as in (Papamakarios et al., 2017). Table 1 reports the averaged NLLs on test data for FFJORD, RNODE, and ACNFs trained with different λ. A detailed description of the experiments and models is given in Appendix A.9. Although FFJORD with multi-step flows increases the flexibility of the flows, it tends to perform worse than the base distribution initially and then improves the NLL mainly at the late stage of the flow. A larger ascent regularization for ACNFs contributes to a more rapid initial increase in NLL, i.e. these flows transform the base distribution faster towards the data distribution. When training on HEPMASS and BSDS300, too large a regularization coefficient impedes the model from converging.

6.3. ACNF AS A FASTER ANNEALING SAMPLING PROPOSAL FOR UNBIASED SAMPLING

[Table 1 footnotes: † FFJORD uses multi-step flow models for some datasets, so the total flow length is no longer the training configuration T but T times the number of flow steps; T listed here refers to the total flow length. ‡ The flow length t after T is set slightly differently among datasets due to the multi-step FFJORD: 1.2T for POWER and GAS, 1.1T for HEPMASS, and 1.25T for MINIBOONE and BSDS300, but it is always the same across different models. § FFJORDs are trained to match the performance as originally reported.]

Following Algorithm 2, we train a CNF and ACNFs with regularization coefficients λ = 0.0001, 0.001, 0.01 to learn the flow of annealed targets. We evaluate the estimates of log Z on a Gaussian mixture target with 8 components, whose means are fixed evenly in space with standard deviations of 0.3; the base distribution is a Gaussian, N(0, 3²I), to give adequate support. Figure 8 compares the estimates evaluated along the flows and reports estimates versus NFEs. We benchmark CNF and ACNFs against the linear annealed target log γ_k(·) = β_k log γ(·) + (1 − β_k) log π_0, with scheduling β_k = k/K = t_k/T and K = 20, using a {170, 25, 10}-step Metropolis sampler between intermediate targets as in (Arbel et al., 2021; Wu et al., 2020). As ACNFs converge faster towards the target than the CNF, the estimates from one-shot ACNF samples are less biased than those from the CNF, especially at the beginning of the flows. Besides, ACNFs are more computationally efficient in terms of accuracy gain per NFE. ACNFs with λ = 0.01, 0.001 show less biased estimates earlier than even the best-tuned linear annealed target. Moreover, the linear annealed target requires at least one order of magnitude more computation than ACNFs for comparable accuracy, due to the slow mixing of the Metropolis sampler, and its performance is very sensitive to the number of MC steps.

[Figure 9: Reconstructions from VAE-ACNF and VAE, and original data, for some challenging samples.]
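The linear annealed baseline above is standard AIS: β_k = k/K with a few Metropolis steps per intermediate target, and log Ẑ = logsumexp(log w) − log N. The toy 1-D sketch below (our own target with Z = 2, not the paper's 8-component mixture) just illustrates the mechanics that the baseline's accuracy depends on:

```python
import math, random

random.seed(2)

def log_pi0(z):                          # base pi_0 = N(0, 1)
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def log_gamma(z):                        # unnormalized target, true Z = 2
    return math.log(2.0) - 0.5 * (z - 1.0) ** 2 - 0.5 * math.log(2 * math.pi)

K, MH_STEPS, N = 30, 5, 3000
betas = [k / K for k in range(K + 1)]    # linear schedule beta_k = k / K

def log_anneal(z, b):                    # log gamma_k = b*log gamma + (1-b)*log pi_0
    return b * log_gamma(z) + (1.0 - b) * log_pi0(z)

log_ws = []
for _ in range(N):
    z, log_w = random.gauss(0.0, 1.0), 0.0
    for b_prev, b in zip(betas, betas[1:]):
        log_w += log_anneal(z, b) - log_anneal(z, b_prev)   # AIS weight update
        for _ in range(MH_STEPS):                            # Metropolis at target b
            prop = z + random.gauss(0.0, 1.0)
            if math.log(random.random()) < log_anneal(prop, b) - log_anneal(z, b):
                z = prop
    log_ws.append(log_w)

m = max(log_ws)
log_Z_hat = m + math.log(sum(math.exp(w - m) for w in log_ws)) - math.log(N)
print(log_Z_hat, math.log(2.0))
```

Shrinking MH_STEPS or K degrades the estimate, which is the sensitivity to the number of MC steps observed for the baseline above; the ACNF sampler avoids the inner Metropolis chains entirely.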

6.4. VARIATIONAL INFERENCE WITH ACNFS

In addition to density estimation and unbiased sampling, a CNF provides a more flexible variational approximation to improve variational inference (Rezende and Mohamed, 2015). We follow the experiment setup of (Grathwohl et al., 2018), which uses an encoder/decoder with a CNF architecture: the encoder gives the base latent posterior approximation for the CNF, and the decoder maps the latent inference at the end of the flow back to the observation dimension. To train a VAE-ACNF model, the log weight in eq. (10) is replaced by an ELBO estimate as in (Kingma and Welling, 2014). We

7. SCOPES AND LIMITATIONS

While we have demonstrated that ascent regularization is effective for learning flows that converge faster to target distributions, there are still a number of limitations, which we would like to address in the future. First, more efficient implementations of the score function evaluation, e.g. via estimators, model design, or learning via score matching (Song et al., 2021), could accelerate training for high-dimensional problems. Second, the Hypernet (Ha et al., 2016) used in the experiments proved suitable for illustrating the faster convergence behavior of ACNF, as time exerts a large impact on the dynamics; however, it is slower to train than simpler network architectures. A better architecture may improve training speed while maintaining the desired characteristics of the flows. Third, although the proposed ACNF and ascent regularization have been discussed within the CNF framework, the concept can easily be extended to score-based models and other stochastic flows. Finally, ACNF and ascent regularization can be applied to a sequence of distributions, e.g. for inference on sequential data.

8. CONCLUSION

We introduce ACNFs, a new class of CNFs that defines flows with monotonic convergence toward a target distribution. We derive the dynamics of the steepest ACNF and propose a practical implementation to learn parametric ACNFs via ascent regularization. We demonstrate ACNF in three use cases: modeling data and performing density estimation, learning an annealed sampler for unbiased sampling, and learning a variational approximation for variational inference. The learned ACNFs exhibit three beneficial behaviors: 1) faster convergence to the target distribution with less computation; 2) self-stabilization to mitigate performance deterioration; 3) insensitivity to the training flow length T. Experiments on both toy distributions and real-world datasets demonstrate the effectiveness of ascent regularization for learning ACNFs for various purposes.

A APPENDIX

A.1 PROOF OF THE INSTANTANEOUS CHANGE OF LOG-LIKELIHOOD ESTIMATE

Proposition (Instantaneous Change of Log-likelihood Estimate). Let z(t) be a finite continuous random variable at time t, given as the solution of a differential equation $\frac{dz(t)}{dt} = f(z(t), t)$ with initial value z(0) = x. Assuming that p̂_0 = µ at t = 0 and f is uniformly Lipschitz continuous in z and t, the change in the estimated log-likelihood log p̂_t(x) at t follows the differential equation:

$$\frac{d \log \hat{p}_t(x)}{dt} = \nabla \cdot f(z(t), t) + \nabla \log \mu(z(t)) \cdot f(z(t), t).$$

Proof. To prove this proposition, we take the infinitesimal limit of finite changes of log p̂_t(x) through time. As f is assumed to be Lipschitz continuous in z(t) and t, Φ_t(x) is the unique solution of the ODE (eq. (2)) at time t with initial value x:

$$\Phi_t(x) = x + \int_0^t f(\Phi_\tau(x), \tau)\, d\tau.$$

We also denote the transformation of z(t) over an ε change in time as:

$$z(t + \epsilon) = \Phi_\epsilon(z(t)) = \Phi_{t+\epsilon}(x).$$

Using the definition of the estimated density p̂_t(x) in eq. (4), the infinitesimal limit is:

$$\frac{d \log \hat{p}_t(x)}{dt} := \lim_{\epsilon \to 0^+} \frac{1}{\epsilon} \Big( \log |\det J_{\Phi_{t+\epsilon}}(x)| - \log |\det J_{\Phi_t}(x)| + \log \mu(\Phi_{t+\epsilon}(x)) - \log \mu(\Phi_t(x)) \Big)$$
$$= \lim_{\epsilon \to 0^+} \frac{1}{\epsilon} \Big( \log |\det J_{\Phi_{t+\epsilon}}(x)| - \log |\det J_{\Phi_t}(x)| \Big) + \lim_{\epsilon \to 0^+} \frac{1}{\epsilon} \Big( \log \mu(\Phi_{t+\epsilon}(x)) - \log \mu(\Phi_t(x)) \Big).$$

The derivation of the first term is very similar to (Chen et al., 2018, theorem 1), except for the sign of the function:

$$\lim_{\epsilon \to 0^+} \frac{1}{\epsilon} \Big( \log |\det J_{\Phi_{t+\epsilon}}(x)| - \log |\det J_{\Phi_t}(x)| \Big) = \lim_{\epsilon \to 0^+} \frac{1}{\epsilon} \Big( \log |\det J_{\Phi_\epsilon}(z(t))| + \log |\det J_{\Phi_t}(x)| - \log |\det J_{\Phi_t}(x)| \Big) = \lim_{\epsilon \to 0^+} \frac{1}{\epsilon} \log |\det J_{\Phi_\epsilon}(z(t))|,$$

where we summarize the main steps:

$$\lim_{\epsilon \to 0^+} \frac{\log |\det J_{\Phi_\epsilon}(z(t))|}{\epsilon} = \lim_{\epsilon \to 0^+} \frac{\frac{\partial}{\partial \epsilon} \log |\det J_{\Phi_\epsilon}(z(t))|}{\frac{\partial}{\partial \epsilon} \epsilon} \quad \text{(L'H\^opital's rule)}$$
$$= \lim_{\epsilon \to 0^+} \frac{\frac{\partial}{\partial \epsilon} |\det J_{\Phi_\epsilon}(z(t))|}{|\det J_{\Phi_\epsilon}(z(t))|} \quad (|\det J_{\Phi_\epsilon}(z(t))| \to 1)$$
$$= \lim_{\epsilon \to 0^+} \frac{\partial}{\partial \epsilon} |\det J_{\Phi_\epsilon}(z(t))|$$
$$= \lim_{\epsilon \to 0^+} \mathrm{Tr}\left( \mathrm{adj}\left( \frac{\partial \Phi_\epsilon(z(t))}{\partial z(t)} \right) \frac{\partial}{\partial \epsilon} \frac{\partial \Phi_\epsilon(z(t))}{\partial z(t)} \right) \quad \text{(Jacobi's formula)}$$
$$= \mathrm{Tr}\left( \lim_{\epsilon \to 0^+} \frac{\partial}{\partial \epsilon} \frac{\partial \Phi_\epsilon(z(t))}{\partial z(t)} \right) \quad \text{(the adjugate matrix} \to I \text{ as } \epsilon \to 0^+\text{)}$$
$$= \mathrm{Tr}\left( \lim_{\epsilon \to 0^+} \frac{\partial}{\partial \epsilon} \frac{\partial}{\partial z(t)} \big( z(t) + \epsilon f(z(t), t) + o(\epsilon^2) + \dots \big) \right)$$
$$= \mathrm{Tr}\left( \lim_{\epsilon \to 0^+} \frac{\partial}{\partial \epsilon} \left( I + \epsilon \frac{\partial f(z(t), t)}{\partial z(t)} + o(\epsilon^2) + \dots \right) \right)$$
$$= \mathrm{Tr}\left( \lim_{\epsilon \to 0^+} \left( \frac{\partial f(z(t), t)}{\partial z(t)} + o(\epsilon) + \dots \right) \right) = \nabla \cdot f(z(t), t).$$

Before deriving the second term, we take the first-order Taylor expansion of log µ(Φ_ε(z(t))) at z(t) = Φ_t(x):

$$\log \mu(\Phi_{t+\epsilon}(x)) = \log \mu(\Phi_\epsilon(z(t))) = \log \mu(z(t)) + \nabla \log \mu(z(t)) \cdot \big( \Phi_\epsilon(z(t)) - z(t) \big) + o(\epsilon^2) + \dots,$$

hence,

$$\lim_{\epsilon \to 0^+} \frac{\log \mu(\Phi_{t+\epsilon}(x)) - \log \mu(\Phi_t(x))}{\epsilon} = \lim_{\epsilon \to 0^+} \left( \nabla \log \mu(z(t)) \cdot \frac{\Phi_\epsilon(z(t)) - z(t)}{\epsilon} + o(\epsilon) + \dots \right) = \nabla \log \mu(z(t)) \cdot f(z(t), t).$$

Therefore, the differential of log p̂_t(x) is:

$$\frac{d \log \hat{p}_t(x)}{dt} = \nabla \cdot f(z(t), t) + \nabla \log \mu(z(t)) \cdot f(z(t), t).$$

To show the relation between the two differentials $\frac{d \log \hat{p}_t(x)}{dt}$ and $\frac{d \log p_t(z(t))}{dt}$, we first need the relation between log p̂_t(x) and log p_t(z(t)):

$$\log \hat{p}_t(x) = \log p(x) + \log \mu(z(t)) - \log p_t(z(t)).$$

Taking the total derivative of both sides of the last equation:

$$\frac{d \log \hat{p}_t(x)}{dt} = \frac{d \log \mu(z(t))}{dt} - \frac{d \log p_t(z(t))}{dt} = \nabla \log \mu(z(t)) \cdot f(z(t), t) - \frac{d \log p_t(z(t))}{dt} = \nabla \log \mu(z(t)) \cdot f(z(t), t) + \nabla \cdot f(z(t), t).$$

The total derivative $\frac{d \log \hat{p}_t(x)}{dt}$ is defined on the fixed variable x, while the infinitesimal change on the r.h.s. is evaluated on the variable z(t), so solving log p̂_t(x) requires simulating z(t) simultaneously. Unlike log p_t(z(t)), which is solved in the direction reversed to solving z(t), log p̂_t(x) only needs the trajectory z(τ), τ ∈ [0, t], while log p_t(z(t)) requires the whole trajectory z(τ), τ ∈ [0, T]. Therefore, using log p̂_t(x) is more advantageous when evaluating models at any t other than T, or at multiple t. As for training, since p_T is specified as µ at T, maximizing log p(x) in a vanilla CNF is essentially equivalent to maximizing log p̂_T(x) in an ACNF.

If we take the time partial derivative of the log-likelihood relation, then

$$\frac{\partial \log \hat{p}_t(x)}{\partial t} = \frac{\partial \log \mu(z(t))}{\partial t} - \frac{\partial \log p_t(z(t))}{\partial t} = -\frac{\partial \log p_t(z(t))}{\partial t},$$

so the convergence rate of the distribution estimate p̂_t(x) towards p(x) is equal to that of the normalized distribution p_t(z) towards µ(z).

A.2 PROOF

f * (z(t), t) = ∇ log µ(z(t)) - ∇p t (z(t)) p t (z(t)) = ∇ log µ(z(t)) -∇ log p t (z(t)). To keep this proof simple, we derive this theorem in Euclidean space. If readers are familiar with non-Euclidean metric spaces, we refer more rigid of Wasserstein gradient flow proof in (Ambrosio et al., 2005) . Proof. Assuming that N samples X = {x i } i=1:N ∈ R N d are drawn from p(x), the averaged negative estimated log-likelihood at time t is: J(Φ t ) = - 1 N N i=1 log pt i ) = 1 N N i=1 (log p t (Φ t (x i )) -log µ(Φ t (x i )) -log p(x i )) . Using the chain rule, the derivative of J(Φ t ) w.r.t. Φ t (x i ) is: [∇J(Φ t )] i = ∇ log p t (Φ t (x i )) -∇ log µ(Φ t (x i )), where ∇J(Φ t ) is a matrix that each row is for each sample i = 1, 2, . . . , N and each column is for each dimension j = 1, 2, . . . , d. To numerically compute the solutions of Euler-Lagrange equation, i.e. ∇J(Φ t ) = 0, we use gradient descent to define the evolution of transformation Φ t for each x i : dΦ t (x i ) dt = -[∇J(Φ t )] i = ∇ log µ(Φ t (x i )) -∇ log p t (Φ t (x i )), which evolves Φ in the direction that decreases J(Φ) most rapidly, starting at initial Φ 0 (x i ) = x i . The next step is to extend the assumption of the finite number of data samples N to infinity, i.e. N → ∞, therefore, the objective J(Φ t ) at time t is updated as: J(Φ t ) = - U log pt (x)dx = U (log p t (Φ t (x)) -log µ(Φ t (x)) -log p(x)) dx = U L(x, Φ t (x), ∇Φ t (x))dx, where x ∈ U ⊆ R d and L(x, Φ t (x), ∇Φ t (x)) = log p t (Φ t (x)) -log µ(Φ t (x)) -log p(x). For each j dimension of Φ t , the functional derivative of J(Φ t ) w.r.t. [Φ t ] j is: δJ(Φ t ) δ[Φ t ] j = ∂L ∂[Φ t ] j (x, Φ t (x), ∇Φ t (x)) -∇ • ∂L ∂∇[Φ t ] j (x, Φ t (x), ∇Φ t (x)) = [∇ log p t (Φ t (x))] j -[∇ log µ(Φ t (x))] j , as ∂L ∂∇[Φt]j = 0. 
Therefore, the gradient descent that defines the evolution of the transformation Φ_t is:

dΦ_t(x)/dt = −δJ(Φ_t)/δΦ_t = ∇ log µ(Φ_t(x)) − ∇ log p_t(Φ_t(x)),

and thus the dynamics of the steepest ascent continuous normalizing flow is:

f*(z(t), t) = dΦ_t(x)/dt = ∇ log µ(z(t)) − ∇ log p_t(z(t)).
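As a sanity check of this steepest-ascent dynamics, one can simulate particles under f*(z, t) = ∇ log µ(z) − ∇ log p_t(z) in a one-dimensional Gaussian case, where ∇ log p_t can be estimated from the particles themselves. The sketch below is our own illustration (not the paper's implementation): it assumes p_t stays Gaussian and plugs the empirical variance into the score.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 2.0, size=20000)   # particles from the base p_0 = N(0, 4)

dt, T = 0.01, 4.0
for _ in range(int(T / dt)):
    var = z.var()                      # Gaussian fit: ∇ log p_t(z) ≈ -z / var
    f_star = -z - (-z / var)           # f* = ∇ log µ(z) - ∇ log p_t(z), µ = N(0, 1)
    z = z + dt * f_star                # Euler step of dz/dt = f*(z, t)

print(abs(z.var() - 1.0))              # ≈ 0: particles reach the target variance
```

Because the score estimate uses the empirical variance, the fixed point of the discrete update is exactly unit sample variance, so the particle cloud settles on the target.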

A.3 CONVERGENCE RATE OF OPTIMAL ASCENT CONTINUOUS NORMALIZING FLOWS AND ITS RELATION TO LANGEVIN DYNAMICS

The convergence rate of the KL divergence w.r.t. t can be derived as follows (we start from a general flow dynamics f):

∂/∂t KL(p_t(z)∥µ(z)) = ∂/∂t KL(p(x)∥p̂_t(x))
= −∫ p(x) ∂ log p̂_t(x)/∂t dx
= −∫ p(x) (∇ · f(z(t), t) + ∇ log µ(z(t)) · f(z(t), t)) dx
= −∫ p_t(z(t)) (∇ · f(z(t), t) + ∇ log µ(z(t)) · f(z(t), t)) dz(t)
= −∫ p_t(z) (Σ_i ∂f_i(z, t)/∂z_i + Σ_i ∂ log µ(z)/∂z_i f_i(z, t)) dz
= −∫ Σ_i (−f_i(z, t) ∂p_t(z)/∂z_i + p_t(z) ∂ log µ(z)/∂z_i f_i(z, t)) dz
= −∫ Σ_i p_t(z) (−∂ log p_t(z)/∂z_i + ∂ log µ(z)/∂z_i) f_i(z, t) dz
= −E_{p_t}[(∇ log µ(z) − ∇ log p_t(z)) · f(z, t)].   (12)

(Liu, 2017)[Theorem 3.1] gives a similar derivation from the discrete transformation perspective and links it to Stein variational gradient flows. When the dynamics f is equal to the fastest flow dynamics f* in eq.(6), the convergence rate becomes the negative Fisher divergence (estimated w.r.t. p_t):

∂/∂t KL(p_t(z)∥µ(z)) = −E_{p_t}[∥∇ log p_t(z) − ∇ log µ(z)∥²₂].

This convergence rate can easily be shown to equal that of overdamped Langevin diffusion dynamics in the case β = 1, defined via the stochastic differential equation:

dz(t) = ∇ log µ(z(t)) dt + √(2β⁻¹) dW_t,

where W_t is a Brownian motion. Under the Langevin dynamics, the transformed distribution satisfies the PDE:

∂p_t(z)/∂t = −∇ · (p_t(z) ∇ log µ(z)) + β⁻¹ Δp_t(z) = −∇ · (p_t(z) ∇ log µ(z)) + β⁻¹ ∇ · (∇p_t(z)) = −∇ · (p_t(z)(∇ log µ(z) − β⁻¹ ∇ log p_t(z))).

The last line recovers the steepest gradient flow dynamics of eq.(6) when β = 1. Therefore, the optimal ascent continuous normalizing flow and overdamped Langevin dynamics transform a distribution equivalently when β = 1, and this Fokker-Planck equation is linear (w.r.t. p_t(z)) and deterministic although Langevin dynamics is stochastic. The main difference between these two flows is that the dynamics of the (optimal) ascent continuous normalizing flow is deterministic, and so is any particular sample trajectory, while Langevin dynamics defines a stochastic process whose sample trajectories are stochastic.
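The marginal equivalence with overdamped Langevin dynamics at β = 1 can be checked numerically in a Gaussian case: for µ = N(0, 1) and p_0 = N(0, σ_0²), both the deterministic ascent flow and the Langevin SDE yield Gaussian marginals with variance σ_t² = 1 + (σ_0² − 1)e^{−2t}. The sketch below (our own illustration) simulates Euler-Maruyama Langevin particles and compares their variance with this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 2.0, size=50000)    # p_0 = N(0, 4), target µ = N(0, 1)

dt, T = 0.005, 1.0
for _ in range(int(T / dt)):
    # Euler-Maruyama step of dz = ∇ log µ(z) dt + sqrt(2) dW (β = 1)
    z = z - z * dt + np.sqrt(2 * dt) * rng.normal(size=z.size)

var_flow = 1.0 + (4.0 - 1.0) * np.exp(-2.0 * T)   # ascent-flow marginal variance
print(abs(z.var() - var_flow))                     # small: the marginals agree
```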

A.4 DERIVATION OF POTENTIAL FIELD PDE

The optimal dynamics defined in eq.(6) can be rewritten in terms of the potential function V(z, t), defined as V(z, t) := p_t(z)/µ(z):

f* = ∇ log µ(z(t)) − ∇ log p_t(z(t)) = −∇ log V(z(t), t).

Published as a conference paper at ICLR 2023

The continuity equation gives the time derivative of the transformed density p_t(z(t)) at t:

∂p_t(z(t))/∂t = −∇ · (p_t(z(t)) f(z(t), t)) = −p_t(z(t)) ∇ · f(z(t), t) − ∇p_t(z(t)) · f(z(t), t).

Therefore, the time derivative of log p_t(z) with the dynamics defined in eq.(14) is:

∂ log p_t(z(t))/∂t = (1/p_t(z(t))) ∂p_t(z(t))/∂t = −∇ · f(z(t), t) − ∇ log p_t(z(t)) · f(z(t), t) = Δ log V(z(t), t) + ∇ log p_t(z(t)) · ∇ log V(z(t), t).

Using the last equation, the time derivative of log V(z, t) is derived as:

∂ log V(z, t)/∂t := ∂ log p_t(z(t))/∂t − ∂ log µ(z(t))/∂t
= −∇ · f(z(t), t) − ∇ log p_t(z(t)) · f(z(t), t) − ∇ log µ(z(t)) · f(z(t), t)
= Δ log V(z(t), t) + (∇ log p_t(z(t)) + ∇ log µ(z(t))) · ∇ log V(z(t), t)
= Δ log V(z(t), t) + (2∇ log µ(z(t)) + ∇ log V(z(t), t)) · ∇ log V(z(t), t),

therefore, the time derivative of the potential field is:

∂V(z, t)/∂t = ΔV(z, t) + 2∇ log µ(z) · ∇V(z, t) + ∇ log V(z, t) · ∇V(z, t).

When t = 0, V(x, 0) = p(x)/µ(x); when t → ∞, V(z, t) ≡ 1, ∀z.
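The potential-field PDE above can be integrated directly with an explicit finite-difference scheme. The one-dimensional sketch below is our own illustration (with an arbitrary Gaussian pair p_0 = N(0, 2) and µ = N(0, 1), not the paper's py-pde setup); the deviation of V from the flat field V ≡ 1 should shrink over time:

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 101)
dx = x[1] - x[0]
grad_log_mu = -x                         # ∇ log µ for µ = N(0, 1)
p0 = np.exp(-x**2 / 4) / np.sqrt(4 * np.pi)
mu = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
V = p0 / mu                              # initial condition V(x, 0) = p_0(x) / µ(x)

dt = 2e-4
dev0 = np.abs(V - 1).max()
for _ in range(int(1.0 / dt)):           # integrate the PDE up to t = 1
    gV = np.gradient(V, dx)
    lap = np.gradient(gV, dx)
    glogV = np.gradient(np.log(V), dx)
    V = V + dt * (lap + 2 * grad_log_mu * gV + glogV * gV)

print(np.abs(V - 1).max() < dev0)        # True: the field flattens towards V ≡ 1
```

The explicit step size must satisfy the diffusion stability limit dt < dx²/2, which is why dt is chosen small relative to the grid spacing here.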

A.5 INSTANTANEOUS CHANGE OF SCORE FUNCTION

Theorem (Instantaneous Change of Score Function). Let z(t) be a finite continuous random variable with probability density p_t(z(t)) at time t, and let dz(t)/dt = f(z(t), t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and t, the infinitesimal change in the gradient of the log-density at t is:

d∇ log p_t(z(t))/dt = −∇ log p_t(z(t)) ∂f(z(t), t)/∂z(t) − ∇(∇ · f(z(t), t)).

Proof. As f is assumed to be Lipschitz continuous in z(t) and t, Φ_t(x) represents the unique solution map. We denote the transformation over an ε change in time, and its reverse, as:

z(t + ε) = Φ_ε(z(t)),   z(t) = Φ_{−ε}(z(t + ε)),

and apply the change of variable theorem on log p_{t+ε}(z(t + ε)), defined on the variable z(t + ε):

log p_{t+ε}(z(t + ε)) = log p_t(z(t)) − log |det J_{Φ_ε}(z(t))| = log p_t(Φ_{−ε}(z(t + ε))) − log |det J_{Φ_ε}(Φ_{−ε}(z(t + ε)))|.

Taking the derivative of log p_{t+ε}(z(t + ε)) w.r.t. z(t + ε) on both the l.h.s. and r.h.s. of the last equation:

∇ log p_{t+ε}(z(t + ε)) = (∇ log p_t(z(t)) − ∇ log |det J_{Φ_ε}(z(t))|) ∂Φ_{−ε}(z(t + ε))/∂z(t + ε),

and the infinitesimal limit of the finite change of the gradient of the log-density can be defined:

d∇ log p_t(z(t))/dt := lim_{ε→0⁺} (1/ε) (∇ log p_{t+ε}(z(t + ε)) − ∇ log p_t(z(t)))
= lim_{ε→0⁺} (1/ε) ((∇ log p_t(z(t)) − ∇ log |det J_{Φ_ε}(z(t))|) ∂Φ_{−ε}(z(t + ε))/∂z(t + ε) − ∇ log p_t(z(t)))
= ∇ log p_t(z(t)) lim_{ε→0⁺} (1/ε) ((∂Φ_ε(z(t))/∂z(t))⁻¹ − I) − lim_{ε→0⁺} (1/ε) ∇ log |det J_{Φ_ε}(z(t))| (∂Φ_ε(z(t))/∂z(t))⁻¹
= −∇ log p_t(z(t)) ∂f(z(t), t)/∂z(t) − ∇(∇ · f(z(t), t)),

where the two limits are derived in detail:

lim_{ε→0⁺} (1/ε) ((∂Φ_ε(z(t))/∂z(t))⁻¹ − I)
= lim_{ε→0⁺} (1/ε) ((∂/∂z(t) (z(t) + εf(z(t), t) + O(ε²)))⁻¹ − I)
= lim_{ε→0⁺} (1/ε) ((I + ε ∂f(z(t), t)/∂z(t) + O(ε²))⁻¹ − I)
= lim_{ε→0⁺} (1/ε) (I − ε ∂f(z(t), t)/∂z(t) + O(ε²) − I)   (inverse by geometric power series expansion)
= lim_{ε→0⁺} (−∂f(z(t), t)/∂z(t) + O(ε))
= −∂f(z(t), t)/∂z(t),

and

lim_{ε→0⁺} (1/ε) ∇ log |det J_{Φ_ε}(z(t))| (∂Φ_ε(z(t))/∂z(t))⁻¹
= lim_{ε→0⁺} (1/ε) ∇ log |det J_{Φ_ε}(z(t))| (I − ε ∂f(z(t), t)/∂z(t) + O(ε²))
= lim_{ε→0⁺} (1/ε) ∇ log |det J_{Φ_ε}(z(t))| − lim_{ε→0⁺} ∇ log |det J_{Φ_ε}(z(t))| ∂f(z(t), t)/∂z(t)   (the second limit vanishes since ∇ log |det J_{Φ_ε}(z(t))| → ∇ log |det I| = 0)
= ∇ lim_{ε→0⁺} (log |det J_{Φ_ε}(z(t))| / ε)
= ∇(∇ · f(z(t), t)).

Therefore, ∇ log p_t(z(t)) follows a linear matrix differential equation, where the linear matrix is defined by the Jacobian ∂f(z(t), t)/∂z(t) and the bias term is the gradient of the divergence of the differential function, ∇(∇ · f(z(t), t)).
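The theorem can be checked in closed form for a one-dimensional linear flow. With f(z) = −z and a Gaussian p_0 = N(0, σ_0²), the score along a trajectory is known analytically, and the theorem reduces to ds/dt = s, since ∂f/∂z = −1 and ∇(∇ · f) = 0. A small numerical check (our own illustration):

```python
import numpy as np

sigma0_sq, z0 = 2.0, 1.5          # p_0 = N(0, 2); trajectory starting point

def score_along_path(t):
    # for f(z) = -z: z(t) = z0 e^{-t} and p_t = N(0, sigma0_sq e^{-2t})
    zt = z0 * np.exp(-t)
    return -zt / (sigma0_sq * np.exp(-2 * t))

t, h = 0.7, 1e-5
lhs = (score_along_path(t + h) - score_along_path(t - h)) / (2 * h)
rhs = score_along_path(t)          # theorem: ds/dt = -s ∂f/∂z - ∇(∇·f) = s
print(abs(lhs - rhs))              # ≈ 0
```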

A.6 INTERPRETING ASCENT REGULARIZATION AS SCORE MATCHING OBJECTIVE

To show that the ascent regularization in eq.(9) and eq.(10) relates to a score matching objective, we first assume a diffusion process defined via a stochastic differential equation (SDE):

dz(t) = h(z(t), t) dt + g(t) dW(t),   z(0) = x, x ∼ p(x),

where W(t) is a Brownian motion; we denote by p_t(z(t)) the marginal distribution at time t and by P_T the path measure of the SDE up to time T. (Anderson, 1982) shows that the reverse-time process is also a diffusion process which shares the same marginals as the forward process:

dz(t) = (h(z(t), t) − g²(t) ∇ log p_t(z(t))) dt + g(t) dW̄(t),   z(T) ∼ p_T,

where W̄(t) is a reverse-time Brownian motion. The reverse-time diffusion induces the conditional path measure P(·|z(T)). As the score function ∇ log p_t(z(t)) is generally unknown for an arbitrary diffusion process, we approximate the reverse-time diffusion by a secondary reverse-time diffusion process with a parametric score function:

dz(t) = (h(z(t), t) − g²(t) s_θ(z(t), t)) dt + g(t) dW̄(t),   z(T) ∼ p_T,

which induces the conditional path measure P̂^θ_T(·|z(T)) to approximate P_T(·|z(T)). Under some regularity conditions that permit the definition of the Radon-Nikodym derivative dP_T(·|z(T))/dP̂^θ_T(·|z(T)), the Girsanov theorem gives the expectation of the KL divergence between the two path measures:

E_{p_T}[KL(P_T(·|z(T))∥P̂^θ_T(·|z(T)))] = −E_P[log dP̂^θ_T(·|z(T))/dP_T(·|z(T))]
= E_P[∫₀^T g(t)(s_θ(z(t), t) − ∇ log p_t(z(t))) dW̄_t + (1/2) ∫₀^T g²(t)∥s_θ(z(t), t) − ∇ log p_t(z(t))∥² dt]
= (1/2) E_P[∫₀^T g²(t)∥s_θ(z(t), t) − ∇ log p_t(z(t))∥² dt].

Using the chain rule of the KL divergence, the KL divergence between the two path measures is:

KL(P_T∥P̂^θ_T) = KL(p_T(z(T))∥µ(z(T))) + E_{p_T}[KL(P_T(·|z(T))∥P̂^θ_T(·|z(T)))]
= KL(p_T(z(T))∥µ(z(T))) + (1/2) E_P[∫₀^T g²(t)∥s_θ(z(t), t) − ∇ log p_t(z(t))∥² dt]
= KL(p(x)∥p̂(x)) + (1/2) E_P[∫₀^T g²(t)∥s_θ(z(t), t) − ∇ log p_t(z(t))∥² dt].
Assume that the parametric dynamics f(z(t), t; θ) = ∇ log µ(z(t)) − s_θ(z(t), t) has a similar structure to the optimal dynamics in eq.(6), with s_θ(z(t), t) approximating ∇ log p_t(z(t)), and that g(t) ≡ √(2λ); then we recover the total learning objective with ascent regularization coefficient λ in eq.(9). Therefore, the total objective is equivalent to minimizing the KL divergence of the two path measures on the joint (infinite-dimensional) variable space. A similar analysis can also be applied to the objective in eq.(10). When λ = β⁻¹ = g²(t)/2 and the learned score s_θ(z(t), t) matches ∇ log p_t(z(t)) so that h(z(t), t) = ∇ log µ(z(t)), the SDE in eq.(18) becomes the overdamped Langevin dynamics in eq.(13), as well as the optimal ACNF (eq.(6)) with critical damping dynamics, i.e. λ = β⁻¹ = 1.

As the ascent regularization can be interpreted as a score matching objective, it is possible to implement Algorithm 1 and Algorithm 2 in a more time-efficient way for training, as in (Lu et al., 2022; Song et al., 2021). However, note that the explicit score matching objective can hardly be used directly in the implementation, as ∇ log p_t(z(t)) is intractable in general and needs to be evaluated, e.g., via the score function integral in the ascent regularization. (Lu et al., 2022; Song et al., 2021; 2020; Ho et al., 2020) use surrogates such as denoising score matching. To enable practical training, the denoising score matching objective relies on the explicit form of the conditional (noised) distributions ∇ log p_{t|0}(z(t)|z(0)), e.g. Gaussian. For image or data generation tasks, the Gaussian assumption may not seem so limiting as long as the chain of discrete transformations is adequately long for adequate expressivity of the marginal distribution at T. However, for inference tasks, e.g. using flows as variational approximations or annealed samplers, constraining the distribution induced by flows with a Gaussian assumption can hinder their approximation potential for the true posterior.

Figure 10: Comparison of the log potential field, log V(z(t), t), evaluated on trained vanilla CNF, RNODE with regularization coefficient 0.1, and ACNF models with regularization coefficients λ = 0.1 and 1, for the 2-modal Gaussian mixture along the flow at t ∈ [0, 2T], together with the numerical PDE solutions of eq.(7). Color indicates the value of the field: turquoise is 0, and the lighter the color, the larger the value, and vice versa.
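As a minimal illustration of the denoising score matching surrogate mentioned above (our own toy setup, not the paper's training procedure): for Gaussian data with Gaussian noising, the minimizer of the denoising objective recovers the marginal score −z/(1 + σ²):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x = rng.normal(size=200000)                 # data x ~ N(0, 1)
z = x + sigma * rng.normal(size=x.size)     # noised z, so z ~ N(0, 1 + σ²)

# fit a linear score model s(z) = a z by least squares against the
# conditional score ∇_z log p(z | x) = -(z - x) / σ²
target = -(z - x) / sigma**2
a = (z * target).sum() / (z * z).sum()

true_a = -1.0 / (1.0 + sigma**2)            # marginal score slope of N(0, 1 + σ²)
print(abs(a - true_a))                      # ≈ 0
```

The conditional score is available in closed form here, which is exactly what makes denoising score matching practical, and what the Gaussian assumption discussed above buys.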

A.7 ANALYSIS ON A TOY EXAMPLE: FROM A GAUSSIAN TO A MIXTURE OF GAUSSIAN

Before we deploy ACNF on complex distributions, we first demonstrate its validity on a simpler problem: learning a 2-modal Gaussian mixture with a standard Gaussian as the base distribution. Since the density of the target distribution is known in this case, we can numerically solve the potential field V(z, t) for t ∈ [0, T] in eq.(7), even though an exact solution is still hard to obtain for this simple case. The PDE solutions presented in Figure 4 and Figure 10 are computed with the py-pde package. A fixed Cartesian grid is used, with the same center locations as the other potential fields evaluated from density estimations. The PDE solver in py-pde uses the finite difference method, and we choose an explicit solver to keep the simulation simple. To define the parametric dynamics function for training, we use hypernetworks (Ha et al., 2016), where a smaller network generates the weights of the layers. This architecture is suitable for demonstrating ACNFs, as the dynamics function is supposed to evolve with time by changing the weights through the hypernetworks. We follow the same implementation of hypernetworks as Neural ODE and use torchdiffeq for the ODE solution and the adjoint method.

The last row of Figure 10, like Figure 4, shows the logarithm of the potential solutions, while the other rows show the log potential fields of the learned flows evaluated by the ratio p(x)/p̂_t(x) when the training T is set to 10. Without ascent regularization, the potential field converges more slowly and only gets close to a uniform field at T. After T, some areas start to be under- or over-represented as the learned flow continues to move samples towards the center of the field. In contrast, the flows learned with ascent regularization transform densities faster towards the target distribution. When the ascent regularization coefficient λ is 1, the evolution of the potential fields is very similar to that of the PDE solutions, which indicates that the learned flow is close to the optimal ascent continuous normalizing flow.
Apart from vanilla CNF, we train RNODE models to demonstrate the effect of kinetic energy regularization on the transformation of distributions. As shown in (Finlay et al., 2020), the optimal flow that minimizes the L2 transport cost induces straight sample trajectories along which samples travel at constant speed. Figure 11 shows the flows of RNODE models trained under the same configurations as Figure 4, with kinetic energy regularization coefficients 0, 0.01, 0.1, 1 respectively. Although RNODEs learn simpler ODE functions with lower NFEs compared to the flow without regularization, these flows do not make the transformed distributions converge faster; they are even slower at larger regularization coefficients. Like vanilla CNF, RNODE does not prevent the distribution from deteriorating after T. The NFEs for each flow in Figure 11, at the time that the transformed distribution gives the maximum estimated log-likelihood, are 38, 38, 36, 32, while those of the flows by ACNF are 26, 32, 36 under λ = 0.01, 0.1, 1, even though ascent regularization does not explicitly regularize for simpler ODE functions. We also tried the Frobenius norm regularization on the Jacobian suggested by (Finlay et al., 2020), HJB regularization (Onken et al., 2021; Yang and Karniadakis, 2020) and second-order regularization (Kelly et al., 2020); however, the evolution of the potential fields under these regularizations does not differ much from that of vanilla CNF and the RNODEs shown. To demonstrate the effect of the flow length T in the training configuration, we also train vanilla CNF and ACNFs with other flow lengths, e.g. T = 5, and ascent regularization factors 0, 0.01, 0.1, 1, and evaluate the learned flows at t ∈ [0, 2] as in Figure 4. Under suitable conditions such that an optimal ACNF exists between the base and target distributions, the flow is almost independent of the choice of flow length T.
Comparing Figure 12 (T = 5) with Figure 4 (T = 1), both tested on t ∈ [0, 2], the flow by vanilla CNF is idle at the early stage for T = 5 and is very sensitive to the choice of T, while the flows with ascent regularization are almost independent of the choice of T, which may make tedious model selection over different T, or optimizing T (Ghosh et al., 2020; Du et al., 2022), no longer necessary.

A.8 DENSITY ESTIMATION ON 2D TOY DISTRIBUTIONS

Like Section A.7, we specify the dynamics model by hypernetworks; all hypernetworks are defined with one hidden layer of 32 units, and the width of the hypernetworks is 64 for learning all 2-dimensional distributions. As shown in the last section, the flows learned with ascent regularization are almost insensitive to T for the Gaussian mixture. To examine whether this conclusion still applies to more complex distributions, we retrain ACNF models with ascent regularization coefficients λ = 0.0001, 0.0005, 0.001, 0.005 under different flow lengths T = 10, 5, 1, 0.5. Figure 13 (T = 5) and Figure 14 (T = 1) show the evolution of the density estimations for each model at t ∈ [0, 2T], like Figure 5 (T = 10). When decreasing T from 10 to 5, the density estimations are almost identical under the same regularization coefficients. When T decreases from 5 to 1, the highlighted area shrinks slightly at low regularization coefficients, e.g. 0.0001, 0.0005. A model trained with a smaller T may therefore require a larger λ. The highlighted areas are larger for the 2-moon, 2-circle and checkerboard distributions than for the Olympics distribution, since the Olympics distribution is more challenging and requires a relatively large regularization coefficient.

A.9 DENSITY ESTIMATION ON TABULAR DATASETS

For tabular datasets, we follow the experiment setup and model configurations recommended by FFJORD (Grathwohl et al., 2018), and all data are pre-processed according to (Papamakarios et al., 2017). We found that the concatenation layer used in FFJORD, which concatenates time t and states z(t) into a flat input vector for the differential function, dilutes the ascent regularization on the parameters, especially when data dimensions are high, e.g. for the MINIBOONE and BSDS300 datasets.

Table 3: Model architectures of ACNFs for density estimations on tabular data reported in Table 1.
Nevertheless, the hypernetwork architecture used in the previous sections, even a deeper one, turns out to be inadequate to reach a log-likelihood evaluation similar to FFJORD and is slow to train. To tackle this issue, we use an encoder to map the states z(t) to a lower dimension, apply the weights produced by the hypernetworks to the encodings, and let a decoder map the transformed encodings back to the data dimension. We summarize the model architectures and training configurations for each dataset in Table 3.
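A minimal NumPy sketch of the encoder-hypernetwork-decoder dynamics described above; all layer sizes here are hypothetical, and the actual architectures and training configurations are those in Table 3:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_enc, h = 20, 4, 16                       # data dim, encoding dim, hypernet width

W_enc = rng.normal(scale=0.1, size=(d, d_enc))         # encoder (fixed)
W_dec = rng.normal(scale=0.1, size=(d_enc, d))         # decoder (fixed)
W_h1 = rng.normal(scale=0.5, size=(1, h))              # hypernetwork layers that
W_h2 = rng.normal(scale=0.1, size=(h, d_enc * d_enc))  # map t to layer weights

def dynamics(z, t):
    """f(z, t): the hypernetwork generates time-dependent weights on the encodings."""
    Wt = (np.tanh(np.array([[t]]) @ W_h1) @ W_h2).reshape(d_enc, d_enc)
    e = np.tanh(z @ W_enc)                    # encode the state to a lower dimension
    return np.tanh(e @ Wt) @ W_dec            # transform the encodings, decode back

z = rng.normal(size=(8, d))
print(dynamics(z, 0.3).shape)                 # (8, 20)
```

The design point is that time dependence enters only through the generated weights Wt, so the same small hypernetwork lets the dynamics evolve with t while the encoder and decoder keep the flow computation in a low-dimensional space.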

A.10 ACNFS AS ANNEALED SAMPLER FOR UNBIASED SAMPLING AND ESTIMATE OF NORMALIZATION CONSTANT

To extend the ACNF annealed sampler with stochasticity, we replace the discrete NF blocks in SNF by the discrete realization of each adaptive step of ACNF, each followed by a stochastic block, e.g. a discrete Langevin flow or an MCMC flow as in SNF. The original importance weight update for discrete flows is replaced by the integral of the negative divergence of the dynamics, and resampling steps are added as in AFT (Arbel et al., 2021). The complete algorithm is summarized in Algorithm 3. Figure 18 shows the generated samples of all the methods reported in Figure 8, plus adding MC steps on top of the trained ACNF to form SNF models by Algorithm 3. As in the quantitative evaluation in Figure 8, the learned ACNF with regularization coefficient λ = 0.01 converges distinctly faster than CNF, the best-tuned linear annealed target and the less regularized ACNFs, while using less computation. The add-on MC steps on the trained ACNF boost the convergence slightly, as shown by the last two rows. Although the diffeomorphism constraint does not show much effect in limiting the expressiveness of CNF/ACNF in this experiment, adding stochastic blocks is still very beneficial, especially at the beginning stage of the flows.
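The added resampling step can be sketched generically; the helper below (our own naming, a simplified stand-in for the resampling used in Algorithm 3 and AFT) performs systematic resampling on the importance weights:

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: returns indices of the surviving particles."""
    n = len(weights)
    w = weights / weights.sum()
    positions = (rng.random() + np.arange(n)) / n   # one stratified uniform per slot
    return np.searchsorted(np.cumsum(w), positions)

rng = np.random.default_rng(0)
log_w = np.array([-1.0, 0.0, 2.0, -3.0])   # log importance weights of 4 particles
idx = systematic_resample(np.exp(log_w), rng)
print(idx)                                  # the high-weight particle dominates
```

Systematic resampling uses a single uniform draw for all slots, giving lower variance than multinomial resampling while keeping the resampled set unbiased in expectation.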



https://github.com/rtqichen/torchdiffeq



Figure 1: Distribution transformations of two learned flows for a 1d Gaussian mixture from a Gaussian distribution at t ∈ [0, 4T]. Although the two flows reach similar densities at T, the density of ACNF converges faster to the target distribution before T and diverges more slowly after T than that of CNF. Color indicates the density of the true Gaussian mixture.

Figure 3: Upper: transformations of variables and densities in the normalization and sampling directions. Lower left: data samples (orange) and the grid of state transformations (blue) along the normalization direction. Lower right: density estimation p̂_t along the sampling direction.

Figure 4: Comparison on log potential field along the flow by trained vanilla CNF and ACNF with λ = 1 and the numerical PDE solutions of eq.(7) for 2-modal Gaussian mixture at t ∈ [0, 2T ] . Color indicates the value of field: turquoise is 0 and the lighter the color is the larger the value is.

Figure 5: Comparison on density evaluation of trained vanilla CNF and ACNFs with λ = 0.0001, 0.0005, 0.001, 0.005 on the 2-moon distribution along the integral t ∈ [0, 2T].

Different from maximum likelihood learning in Section 4, training ACNF for annealed sampling minimizes the reverse KL divergence, KL(p_T(z(T))∥π(z(T))). It can be evaluated up to a constant via the logarithm of the importance weights of samples, log w(z_i(T)) = log γ(z_i(T)) − log p_T(z_i(T)). With ascent regularization as in the previous section, the total objective becomes:

Figure 6: Left: comparison of estimated log-likelihoods of models trained under different regularization coefficients λ, as in Figure 5. Middle: log-likelihood vs. NFE for the same models. Right: comparison of NFEs evaluated at t/T = 1 for vanilla CNF, RNODE and ACNF trained with various flow lengths T and λ.

Figure 7: Comparison of density estimations of trained ACNF and vanilla CNF models on various two-dimensional toy distributions along the flows with increasing t ∈ [0, 2T].

Figure 8: Left: comparison on estimated log Z by different methods along the flow over 5 different runs. Right: estimated log Z vs NFEs. See Figure 18 for generated sample comparisons.

Figure 18 in Appendix A.10 shows the generated samples by all methods in Figure 8. Adding MC steps with learned ACNFs can further accelerate sample convergence and increase the expressiveness of the flows.

We evaluate VAE-ACNF against VAE-FFJORD and vanilla VAE without flow on MNIST data. To make a fair comparison, we fix the learned encoder-decoder when training all flows. A detailed description of the model architecture and experimental setup can be found in Appendix A.11. The averaged negative ELBO on test data of the VAE is 86.50, and Table 2 reports that of VAE-FFJORD and VAE-ACNFs with λ = 1e-4, 1e-3 along the flows. Compared to VAE-FFJORD, VAE-ACNFs show a faster descent of the negative ELBO at the initial stage of the flows, and a larger coefficient shows faster convergence of the variational approximation. VAE-ACNFs also circumvent the flow deterioration seen with VAE-FFJORD, thanks to the self-stabilization behavior of ACNF. Figure 9 and Figure 19 in Appendix A.11 show some reconstruction examples from VAE-ACNF. These reconstructions tend to correct some defects in the original images and add details to strengthen identities while remaining sharp.

Figure 11: Comparison of the log potential field, log V(z(t), t), evaluated on trained vanilla CNF and RNODE (Finlay et al., 2020) models with T = 1 for the 2-modal Gaussian mixture along the flows at t ∈ [0, 2T], as in Figure 4. The kinetic energy regularization coefficients are 0, 0.01, 0.1, 1 respectively. Color indicates the value of the field: turquoise is 0, and the lighter the color, the larger the value, and vice versa.

Figure 12: Comparison on log potential field, log V (z(t), t) of trained vanilla CNF and ACNF models with T = 5 for 2-modal Gaussian mixture, evaluated along the flows at t ∈ [0, 2] as Figure 4. The ascent regularization coefficients λ are 0, 0.01, 0.1, 1 respectively.

Figure 13: Comparison on density estimations of trained vanilla CNF and ACNFs with regularization coefficients λ = 0.0001, 0.0005, 0.001, 0.005 and T = 5 on 2-moon distribution at t ∈ [0, 2T ].

Figure 14: Comparison on density estimations of trained vanilla CNF and ACNFs with regularization coefficients λ = 0.0001, 0.0005, 0.001, 0.005 and T = 1, on 2-moon distribution at t ∈ [0, 2T ].

Figure 15: Comparison on density estimations of trained vanilla CNF and ACNFs with regularization coefficients λ = 0.0001, 0.0005, 0.001, 0.005 and T = 10 on 2-circle distribution at t ∈ [0, 2T ].

Figure 16: Comparison on density estimations of trained vanilla CNF and ACNFs with regularization coefficients λ = 0.0001, 0.0005, 0.001, 0.005 and T = 10 on Olympics distribution at t ∈ [0, 2T ].

Figure 17: Comparison on density estimations of trained vanilla CNF and ACNFs with regularization coefficients λ = 0.0001, 0.0005, 0.001 and T = 10 on checkerboard distribution at t ∈ [0, 2T ].

Figure 19: More reconstructed samples from VAE-ACNF, vanilla VAE and the original data. In each group of three rows, the first row is the reconstruction from VAE-ACNF, the second is the reconstruction from vanilla VAE, and the last shows the original data samples.

Table 1: Averaged negative log-likelihoods (NLLs) on test data for density estimation.

Table 2: Averaged negative ELBO on the MNIST dataset under different lengths of flow t.

VAE-FFJORD: 85.90, 85.07, 83.96, 83.26, 82.88 (82.82 †), 89.74
VAE-ACNF, λ = 1e-4: 85.62, 84.60, 83.45, 83.06, 82.74, 85.67
VAE-ACNF, λ = 1e-3: 84.70, 83.95, 83.22, 82.53, 82.80, 84.37
† originally reported in FFJORD

Its probability density p_t(z(t)) is subject to the continuity equation ∂_t p_t + ∇ · (p_t f) = 0. The dynamics of the steepest flow for decreasing KL(p_t(z(t))∥µ(z(t))) is f*(z(t), t) = ∇ log µ(z(t)) − ∇ log p_t(z(t)).

ACKNOWLEDGMENTS

This work is supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP). Y. K. is a member of the ELLIIT Strategic Research Area at Lund University.

ETHICS STATEMENT

As this work mainly proposes a flow-based model and a practical implementation for learning it, it does not involve human subjects, practices related to dataset releases, or security and privacy issues. At this stage of the study, we do not foresee effects of potential system failures due to weaknesses in the proposed methods.

REPRODUCIBILITY STATEMENT

All propositions (Proposition 1) and theorems (Theorems 1 and 2) proposed in this paper are proved in detail in Appendix A.1, A.2 and A.5, along with other minor derivations mentioned in the main body of the paper. The pseudo-code for both learning cases is provided in Algorithm 1 and Algorithm 2. The datasets, models and experiment setups for each demonstration are described in detail in Appendix A.8 ∼ A.11. Furthermore, we attach some source code in the supplementary material for further checking.

MCMC update with π-invariant kernel via Metropolis-Hastings; (5-7) ACNF with ascent regularization factors λ = 0.0001, 0.001, 0.01; (8-9) SNF with trained ACNF, λ = 0.01 (as the 7th row), and {1, 5} MC steps as the stochastic block, as in Algorithm 3.

A.11 VARIATIONAL INFERENCE WITH ACNFS

Our experiment setup mimics (Grathwohl et al., 2018); the encoder and decoder are defined by 7-layer neural networks with the latent dimension specified as 64. The first 6 layers of the encoder are implemented as gated convolutional networks, and the last one is a linear layer that outputs the mean and diagonal covariance. For the decoder, the first 6 layers are also gated convolutional networks, while the last layer is a vanilla convolutional network. We define the length of flow for both VAE-FFJORD and VAE-ACNF as T = 1 and the number of steps as 2. The networks modeling the differential function of the flows are the modified hypernetworks as for the tabular datasets, with 4 layers, and the activation function is tanh. All models reported in Table 2 are trained under the same learning rate of 0.001, the Adam optimizer and a batch size of 100. Figure 19 shows more reconstructed samples from VAE-ACNF and vanilla VAE, with a comparison to the original data. In general, the reconstructions from VAE-ACNF are smoother than those from vanilla VAE and the original data samples. Figure 9 shows some examples that are challenging for the VAE to reconstruct. VAE-ACNF tends to reconstruct images by adding more details, not only to make them smoother but also possibly to strengthen their class identity. Furthermore, due to the coarse variational approximation, some reconstructions of the VAE fail to retain the features of the original data and change the class identity.

