BUILDING NORMALIZING FLOWS WITH STOCHASTIC INTERPOLANTS

Abstract

A generative model based on a continuous-time normalizing flow between any pair of base and target probability densities is proposed. The velocity field of this flow is inferred from the probability current of a time-dependent density that interpolates between the base and the target in finite time. Unlike conventional normalizing flow inference methods based on the maximum likelihood principle, which require costly backpropagation through ODE solvers, our interpolant approach leads to a simple quadratic loss for the velocity itself, expressed in terms of expectations that are readily amenable to empirical estimation. The flow can be used to generate samples from either the base or the target, and to estimate the likelihood at any time along the interpolant. In addition, the flow can be optimized to minimize the path length of the interpolant density, thereby paving the way for building optimal transport maps. In situations where the base is a Gaussian density, we show that the velocity of our normalizing flow can also be used to construct a diffusion model that samples the target and estimates its score. However, our approach shows that we can bypass this diffusion completely and work at the level of the probability flow with greater simplicity, opening an avenue for methods based solely on ordinary differential equations as an alternative to those based on stochastic differential equations. Benchmarking on density estimation tasks illustrates that the learned flow can match and surpass conventional continuous flows at a fraction of the cost, and compares well with diffusion models on image generation on CIFAR-10 and ImageNet 32×32. The method scales ab-initio ODE flows to previously unreachable image resolutions, demonstrated up to 128×128.

1. INTRODUCTION

Contemporary generative models have primarily been designed around the construction of a map between two probability distributions that transforms samples from the first into samples from the second. While progress has been made from various angles with tools such as implicit maps (Goodfellow et al., 2014; Brock et al., 2019) and autoregressive maps (Menick & Kalchbrenner, 2019; Razavi et al., 2019; Lee et al., 2022), we focus on the case where the map has a clear associated probability flow. Advances in this domain, namely from flow and diffusion models, have arisen through the introduction of algorithms or inductive biases that make learning this map, and the Jacobian of the associated change of variables, more tractable. The challenge is to choose what structure to impose on the transport to best reach a complex target distribution from a simple one used as base, while maintaining computational efficiency. In the continuous-time perspective, this problem can be framed as the design of a time-dependent map, X_t(x) with t ∈ [0, 1], which functions as the push-forward of the base distribution at time t = 0 onto some time-dependent distribution that reaches the target at time t = 1. Assuming that these distributions have densities supported on Ω ⊆ R^d, say ρ_0 for the base and ρ_1 for the target, this amounts to constructing X_t : Ω → Ω such that

if x ∼ ρ_0 then X_t(x) ∼ ρ_t for some density ρ_t such that ρ_{t=0} = ρ_0 and ρ_{t=1} = ρ_1.   (1)

Figure 1: The density ρ_t(x) produced by the stochastic interpolant based on (5) between a standard Gaussian density and a Gaussian mixture density with three modes. Also shown in white are the flow lines of the map X_t(x) our method produces.
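As a minimal illustration of the push-forward in (1), the following sketch hand-builds a one-dimensional time-dependent map X_t that transports a standard Gaussian base ρ_0 onto a Gaussian target ρ_1 = N(μ, σ²). The affine form of the map and all names here are illustrative choices, not part of the method itself, which learns X_t from a velocity field.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5  # illustrative target N(mu, sigma^2)

def X(t, x):
    # Interpolates the identity (t = 0) with the affine map x -> mu + sigma * x
    # (t = 1), so X_0(x) = x and X_1(x) ~ rho_1 whenever x ~ rho_0.
    return (1.0 - t) * x + t * (mu + sigma * x)

x0 = rng.standard_normal(100_000)  # samples from the base rho_0 = N(0, 1)
x1 = X(1.0, x0)                    # pushed-forward samples at t = 1
print(x1.mean(), x1.std())         # ~ (2.0, 0.5)
```

At intermediate times t the same samples X(t, x0) follow the interpolating density ρ_t, which is the quantity the continuity equation below constrains.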
One convenient way to represent this time-continuous map is to define it as the flow associated with the ordinary differential equation (ODE)

Ẋ_t(x) = v_t(X_t(x)), X_{t=0}(x) = x,   (2)

where the dot denotes the derivative with respect to t and v_t(x) is the velocity field governing the transport. This is equivalent to saying that the probability density function ρ_t(x), defined as the push-forward of the base ρ_0(x) by the map X_t, satisfies the continuity equation (see e.g. (Villani, 2009; Santambrogio, 2015) and Appendix A)

∂_t ρ_t + ∇ · (v_t ρ_t) = 0, with ρ_{t=0} = ρ_0 and ρ_{t=1} = ρ_1,   (3)

and the inference problem becomes to estimate a velocity field such that (3) holds. Here we propose a solution to this problem based on introducing a time-differentiable interpolant I_t : Ω × Ω → Ω such that

I_{t=0}(x_0, x_1) = x_0 and I_{t=1}(x_0, x_1) = x_1.   (4)

A useful instance of such an interpolant that we will employ is

I_t(x_0, x_1) = cos(½πt) x_0 + sin(½πt) x_1,   (5)

though we stress that the framework we propose applies to any I_t(x_0, x_1) satisfying (4), under mild additional assumptions on ρ_0, ρ_1, and I_t specified below. Given this interpolant, we then construct the stochastic process x_t by sampling x_0 from ρ_0 and x_1 from ρ_1 independently, and passing them through I_t:

x_t = I_t(x_0, x_1), x_0 ∼ ρ_0, x_1 ∼ ρ_1 independent.   (6)

We refer to the process x_t as a stochastic interpolant. Under this paradigm, we make the following key observations as our main contributions in this work:

• The probability density ρ_t(x) of x_t connecting the two densities, henceforth referred to as the interpolant density, satisfies (3) with a velocity v_t(x) that is the unique minimizer of a simple quadratic objective. This result is the content of Proposition 1 below, and it can be leveraged to estimate v_t(x) in a parametric class (e.g.
using deep neural networks) to construct a generative model through the solution of the probability flow equation (2), which we call InterFlow.

• By specifying an interpolant density, the method separates the task of minimizing the objective from that of discovering a path between the base and target densities. This is in contrast with conventional maximum likelihood (MLE) training of flows, where one is forced to couple the choice of path in the space of measures to the maximization of the objective.

• We show that the Wasserstein-2 (W_2) distance between the target density ρ_1 and the density ρ̂_1 obtained by transporting ρ_0 using an approximate velocity v̂_t in (2) is controlled by our objective function. We also show that the value of the objective on v̂_t during training can be used to check the convergence of this learned velocity field towards the exact v_t.

• We show that our approach can be generalized to shorten the path length of the interpolant density and optimize the transport by additionally maximizing our objective over the interpolant I_t(x_0, x_1) and/or adjustable parameters in the base density ρ_0.

• By choosing ρ_0 to be a Gaussian density and using (5) as the interpolant, we show that the score of the interpolant density, ∇ log ρ_t, can be explicitly related to the velocity field v_t. This allows us to draw a connection between our approach and score-based diffusion models, providing theoretical groundwork for future exploration of this duality.

• We demonstrate the feasibility of the method on toy and high-dimensional tabular datasets, and show that it matches or surpasses conventional ODE flows at lower cost, since it avoids backpropagation through ODE solves. We also demonstrate our approach on image generation for CIFAR-10 and ImageNet 32×32, and show that it scales well to larger sizes, e.g. on the 128×128 Oxford flowers dataset.
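The interpolant construction (5)-(6) and the empirical estimation of a quadratic objective can be sketched in a few lines. The specific loss form below, E[|v(t, x_t)|² − 2 (∂_t I_t) · v(t, x_t)], is an assumption based on the description of a "simple quadratic loss for the velocity ... expressed in terms of expectations" (the precise statement is in Proposition 1); `v`, the Gaussian target, and all names are illustrative stand-ins for a trainable parametric model and real data.

```python
import numpy as np

rng = np.random.default_rng(0)

def I(t, x0, x1):
    # Trigonometric interpolant (5): I_t(x0, x1) = cos(pi t / 2) x0 + sin(pi t / 2) x1.
    return np.cos(0.5 * np.pi * t) * x0 + np.sin(0.5 * np.pi * t) * x1

def dIdt(t, x0, x1):
    # Time derivative of the interpolant, used inside the quadratic objective.
    return 0.5 * np.pi * (-np.sin(0.5 * np.pi * t) * x0 + np.cos(0.5 * np.pi * t) * x1)

def loss(v, n=50_000, d=2):
    # Monte Carlo estimate of E[ |v(t, x_t)|^2 - 2 dI_t/dt . v(t, x_t) ]
    # over t ~ U(0, 1) and independent draws x0 ~ rho_0, x1 ~ rho_1 as in (6).
    t = rng.uniform(size=(n, 1))
    x0 = rng.standard_normal((n, d))        # base samples, rho_0 = N(0, I)
    x1 = 3.0 + rng.standard_normal((n, d))  # target samples (illustrative Gaussian)
    xt = I(t, x0, x1)                       # stochastic interpolant x_t
    vt = v(t, xt)
    return np.mean(np.sum(vt**2 - 2.0 * dIdt(t, x0, x1) * vt, axis=-1))
```

A candidate velocity that tracks the mean drift of the interpolant attains a negative loss value, while the trivial velocity v = 0 gives exactly zero; in practice one would minimize this estimate over the parameters of a neural network and then generate samples by integrating the probability flow equation (2).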

