LEARNING CONTINUOUS NORMALIZING FLOWS FOR FASTER CONVERGENCE TO TARGET DISTRIBUTION VIA ASCENT REGULARIZATIONS

Abstract

Normalizing flows (NFs) have been shown to be advantageous for modeling complex distributions and for improving the efficiency of unbiased sampling. In this work, we propose a new class of continuous NFs, ascent continuous normalizing flows (ACNFs), that makes a base distribution converge faster to a target distribution. As solving such a flow exactly is non-trivial and rarely feasible, we propose a practical implementation that learns flexibly parametrized ACNFs via ascent regularization, and we apply it in two learning settings: maximum likelihood learning for density estimation, and minimization of the reverse KL divergence for unbiased sampling and variational inference. The learned ACNFs demonstrate faster convergence towards the target distributions, thereby achieving better density estimates, unbiased sampling, and variational approximations at lower computational cost. Furthermore, the flows stabilize themselves to mitigate performance deterioration and are less sensitive to the choice of the training flow length T.

1. INTRODUCTION

Normalizing flows (NFs) provide a flexible way to define an expressive but tractable distribution, requiring only a base distribution and a chain of bijective transformations (Papamakarios et al., 2021). Neural ODE (Chen et al., 2018) extends discrete normalizing flows (Dinh et al., 2014; 2016; Papamakarios et al., 2017; Ho et al., 2019) to a continuous-time analogue by defining the transformation via a differential equation, substantially expanding model flexibility in comparison to the discrete alternatives. Grathwohl et al. (2018) and Chen and Duvenaud (2019) propose computationally cheaper ways to estimate the trace of the Jacobian and thereby accelerate training, while other methods focus on increasing flow expressiveness by, e.g., augmenting the state with additional dimensions (Dupont et al., 2019; Massaroli et al., 2020), or adding stochastic layers between discrete NFs to alleviate the topological constraint (Wu et al., 2020). Recent diffusion models (Hodgkinson et al., 2020; Ho et al., 2020; Song et al., 2020; Zhang and Chen, 2021) extend the scope of continuous normalizing flows (CNFs) with stochastic differential equations (SDEs). Although these diffusion models significantly improve the quality of generated images, the introduced diffusion comes at a cost: some models no longer allow tractable density estimation, and practical implementations rely on a long chain of discretization steps, thus needing substantially more computation than tractable CNF methods, which can be critical for some use cases such as online inference. Finlay et al. (2020), Onken et al. (2021), and Yang and Karniadakis (2020) introduce several regularizations based on optimal transport theory to learn simpler dynamics, which decrease the number of discretization steps in the integration and thus reduce training time. Kelly et al. (2020) extend the L_2 transport cost to regularize derivatives of the dynamics of arbitrary order.
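The cheaper trace estimation referenced above rests on Hutchinson's stochastic trace estimator, which needs only matrix-free Jacobian-vector products. The following minimal NumPy sketch is our own illustration of the idea (not the cited papers' code); the function name and the linear-map sanity check are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(jvp, dim, n_samples=1000):
    """Estimate tr(J) as E[v^T J v] over Rademacher probe vectors v.

    `jvp(v)` returns the Jacobian-vector product J @ v; only such
    matrix-free products are needed, which is what makes the estimator
    cheap inside a neural ODE, where forming J explicitly is expensive.
    """
    est = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe
        est += v @ jvp(v)
    return est / n_samples

# Sanity check against the explicit Jacobian of a linear map f(z) = A z,
# whose Jacobian is simply A.
A = rng.normal(size=(5, 5))
approx = hutchinson_trace(lambda v: A @ v, dim=5)
exact = np.trace(A)
```

The estimator is unbiased, and its variance shrinks as 1/n_samples; in practice (e.g., in FFJORD-style training) a single probe per data point per step is often used, since the noise averages out over the minibatch.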
Although these regularizations are beneficial for decreasing the computational cost of simulating flows, they do not remedy the slow convergence of the density to the target distribution exhibited by trained vanilla CNF models, as shown in Figure 1. To accelerate flow convergence, STEER (Ghosh et al., 2020) and TO-FLOW (Du et al., 2022) propose to optimize the flow length T in two different ways: STEER randomly samples the length during training, while TO-FLOW solves a subproblem for T during training. To understand the effectiveness of these methods, we train multiple Neural ODE models with different flow lengths T_n on a 2-moon distribution and examine the flows via their estimated log-likelihoods in Figure 2. Although sampling or optimizing T dynamically performs a model selection during training and lets models reach higher estimates with shorter flows, it cannot prevent the divergence after T_n. Furthermore, shorter flows are more limited in expressiveness, attaining lower maximum likelihoods, and remain sensitive to the flow length.

In this work, we present a new family of CNFs, ascent continuous normalizing flows (ACNFs), to address the aforementioned problems. An ACNF is a flow that transforms a base distribution monotonically towards a target distribution, with dynamics imposed to follow the steepest ACNF. However, solving such a steepest flow is non-trivial and rarely feasible, so we propose a practical implementation that learns parametric ACNFs via ascent regularization.

Figure 2: The log-likelihood estimates of trained vanilla CNF models with various flow lengths T_n, and of the steepest ACNF with dynamics defined in eq. (6), at different t on the 2-moon distribution. All vanilla CNF models reach their maximum around T_n and deteriorate rapidly afterwards, while the log-likelihood estimate of the ACNF rises rapidly at first and increases monotonically.
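The paper's ascent regularizer, eq. (6), is not reproduced in this excerpt. As a hedged sketch of the underlying idea only, monotone ascent of the log-likelihood along flow time can be encouraged by a hinge penalty on any decrease between consecutive time checkpoints; the function name, the checkpointing scheme, and the hinge form below are our illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def ascent_penalty(log_probs):
    """Hinge penalty on decreases of log-likelihood along flow time.

    `log_probs` holds estimates of E[log p_t(z(t))] at increasing
    checkpoints t_0 < ... < t_K. An ascent flow should make this
    sequence non-decreasing, so we sum the positive parts of the
    negated increments (zero whenever the sequence only goes up).
    """
    increments = np.diff(np.asarray(log_probs, dtype=float))
    return float(np.sum(np.maximum(0.0, -increments)))

# Monotonically increasing log-likelihoods incur no penalty ...
flat = ascent_penalty([-3.0, -2.0, -1.5])
# ... while a dip between checkpoints is penalized by its depth.
dip = ascent_penalty([-3.0, -2.0, -2.5])
```

Added to a maximum-likelihood or reverse-KL training loss, such a term pushes the learned dynamics towards flows whose density improves monotonically in t, which is consistent with the self-stabilizing behavior described in the abstract.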
Learned ACNFs exhibit three main beneficial behaviors: 1) faster convergence to target distribution with less computation; 2) self-stabilization to mitigate flow deterioration; and 3) insensitivity to flow training length T . We demonstrate these behaviors in three use cases: modeling data distributions; learning annealed samplers for unbiased sampling; and learning a tractable but more flexible variational approximation.

2. CONTINUOUS NORMALIZING FLOWS

Consider a time-$t$ transformation $z(t) = \Phi_t(x)$ of the initial value $x$, i.e. $z(0) = x$. The change of variables theorem relates the transformed distribution $p_t(z(t))$ to $p(x)$:

$$p_t(z(t)) = \left|\det J^{-1}_{\Phi_t}(x)\right| p(x), \tag{1}$$

where $J_{\Phi_t}$ is the Jacobian matrix of $\Phi_t$. As $\Phi_t$ normalizes $x$ towards some base distribution, $p_t(z(t))$ is referred to as the normalized distribution at time $t$, starting from the data distribution $p(x)$. A continuous normalizing flow is the infinitesimal limit of a chain of discrete flows, and the infinitesimal transformation is specified by an ordinary differential equation (ODE):

$$\frac{dz(t)}{dt} = \frac{d\Phi_t(x)}{dt} = f(z(t), t). \tag{2}$$

The instantaneous change of variables theorem (Chen et al., 2018, Theorem 1) shows that the infinitesimal change of $\log p_t(z(t))$ is:

$$\frac{d\log p_t(z(t))}{dt} = -\nabla \cdot f(z(t), t). \tag{3}$$
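Equations (2) and (3) can be integrated jointly to track both the sample and its log-density along the flow. The following NumPy sketch uses a plain Euler scheme (our own illustration; the function names are ours) and checks it against a linear contraction $f(z,t) = -z$ in $d$ dimensions, for which $\nabla \cdot f = -d$, so eq. (3) gives $\log p_t(z(t)) = \log p(x) + d\,t$ exactly:

```python
import numpy as np

def integrate_cnf(x, f, div_f, t1, steps=1000):
    """Euler-integrate dz/dt = f(z, t) together with the log-density
    change d log p_t(z(t))/dt = -div f(z(t), t) from eq. (3)."""
    z = np.array(x, dtype=float)
    delta_logp = 0.0
    dt = t1 / steps
    for k in range(steps):
        t = k * dt
        delta_logp += -div_f(z, t) * dt  # accumulate eq. (3)
        z = z + f(z, t) * dt             # step eq. (2)
    return z, delta_logp

# Linear contraction in d = 3 dimensions: f(z, t) = -z, div f = -3.
d = 3
x = np.ones(d)
z, dlogp = integrate_cnf(x, lambda z, t: -z, lambda z, t: -float(d), t1=1.0)
# Analytically: z(1) = x * exp(-1) and delta_logp = d * 1.0 = 3.0.
```

In practice an adaptive ODE solver replaces the fixed-step Euler loop, and the divergence term is estimated stochastically in high dimensions (as discussed in the introduction), but the coupled integration structure is the same.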

Figure 1: Distribution transformations of two learned flows for a 1d Gaussian mixture, starting from a Gaussian base distribution, at t ∈ [0, 4T]. Although the two flows reach similar densities at T, the density of the ACNF converges faster to the target distribution before T and diverges more slowly after T than that of the CNF. Color indicates the density of the true Gaussian mixture.

