FLOW MATCHING FOR GENERATIVE MODELING

Abstract

We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples, which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.



However, aside from diffusion, which can be trained efficiently via, e.g., denoising score matching (Vincent, 2011), no scalable CNF training algorithms are known. Indeed, maximum likelihood training (e.g., Grathwohl et al. (2018)) requires expensive numerical ODE simulations, while existing simulation-free methods either involve intractable integrals (Rozen et al., 2021) or biased gradients (Ben-Hamu et al., 2022). The goal of this work is to propose Flow Matching (FM), an efficient simulation-free approach to training CNF models, allowing the adoption of general probability paths to supervise CNF training. Importantly, FM breaks the barriers for scalable CNF training beyond diffusion, and sidesteps the need to reason about diffusion processes to directly work with probability paths. In particular, we propose the Flow Matching objective (Section 3), a simple and intuitive training objective to regress onto a target vector field that generates a desired probability path. We first show that we can construct such target vector fields through per-example (i.e., conditional) formulations. Then, inspired by denoising score matching, we show that a per-example training objective, termed Conditional Flow Matching (CFM), provides equivalent gradients and does not require explicit knowledge of the intractable target vector field. Furthermore, we discuss a general family of per-example probability paths (Section 4) that can be used for Flow Matching, which subsumes existing diffusion paths as special instances. Even on diffusion paths, we find that using FM provides more robust and stable training, and achieves superior performance compared to score matching. Furthermore, this family of probability paths also includes a particularly interesting case: the vector field that corresponds to an Optimal Transport (OT) displacement interpolant (McCann, 1997).
We find that conditional OT paths are simpler than diffusion paths, forming straight-line trajectories, whereas diffusion paths result in curved paths. These properties seem to empirically translate to faster training, faster generation, and better performance. We empirically validate Flow Matching and the construction via Optimal Transport paths on ImageNet, a large and highly diverse image dataset. We find that we can easily train models to achieve favorable performance in both likelihood estimation and sample quality amongst competing diffusion-based methods. Furthermore, we find that our models produce better trade-offs between computational cost and sample quality compared to prior methods. Figure 1 depicts selected unconditional ImageNet 128×128 samples from our model.
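The straightness of conditional OT paths can be sketched numerically. The snippet below is a minimal illustration, assuming the straight-line interpolation $x_t = (1-t)\,x_0 + t\,x_1$ between a noise sample $x_0$ and a data sample $x_1$ (the precise conditional paths, including their scaling parameters, are defined in Section 4); `x1` here is a hypothetical stand-in for a data point.

```python
import numpy as np

# Straight-line (conditional OT) interpolation between a noise sample x0
# and a data sample x1: x_t = (1 - t) * x0 + t * x1.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(2)   # noise sample from the prior
x1 = np.array([3.0, -1.0])    # hypothetical "data" point

ts = np.linspace(0.0, 1.0, 101)
path = (1.0 - ts)[:, None] * x0 + ts[:, None] * x1

# Finite-difference velocities along the path: constant for a straight line,
# which is what makes these trajectories easy for ODE solvers to follow.
vel = np.diff(path, axis=0) / np.diff(ts)[:, None]
assert np.allclose(vel, vel[0])                 # constant velocity x1 - x0
assert np.allclose(path[0], x0) and np.allclose(path[-1], x1)
print("constant velocity along path:", vel[0])
```

A diffusion path between the same endpoints would trace a curved trajectory with time-varying velocity, which is the intuition behind the faster sampling reported above.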

2. PRELIMINARIES: CONTINUOUS NORMALIZING FLOWS

Let $\mathbb{R}^d$ denote the data space with data points $x = (x^1, \ldots, x^d) \in \mathbb{R}^d$. Two important objects we use in this paper are: the probability density path $p : [0,1] \times \mathbb{R}^d \to \mathbb{R}_{>0}$, which is a time-dependent¹ probability density function, i.e., $\int p_t(x)\,dx = 1$, and a time-dependent vector field, $v : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$. A vector field $v_t$ can be used to construct a time-dependent diffeomorphic map, called a flow, $\phi : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$, defined via the ordinary differential equation (ODE):
$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)), \qquad \phi_0(x) = x.$$
Previously, Chen et al. (2018) suggested modeling the vector field $v_t$ with a neural network, $v_t(x; \theta)$, where $\theta \in \mathbb{R}^p$ are its learnable parameters, which in turn leads to a deep parametric model of the flow $\phi_t$, called a Continuous Normalizing Flow (CNF). A CNF is used to reshape a simple prior density $p_0$ (e.g., pure noise) into a more complicated one, $p_1$, via the push-forward equation
$$p_t = [\phi_t]_* p_0,$$
where the push-forward (or change of variables) operator $*$ is defined by
$$[\phi_t]_* p_0(x) = p_0\big(\phi_t^{-1}(x)\big)\left|\det\frac{\partial \phi_t^{-1}}{\partial x}(x)\right|.$$
A vector field $v_t$ is said to generate a probability density path $p_t$ if its flow $\phi_t$ satisfies the push-forward equation above. One practical way to test whether a vector field generates a probability path is the continuity equation, which is a key component in our proofs; see Appendix A. We recap more information on CNFs, in particular how to compute the probability $p_1(x)$ at an arbitrary point $x \in \mathbb{R}^d$, in Appendix C.
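The flow ODE above can be solved with an off-the-shelf numerical integrator. The sketch below uses a toy linear field $v_t(x) = a\,x$ as a stand-in for a learned network $v_t(x; \theta)$; its exact flow is $\phi_t(x) = e^{at}x$, which lets us check the solver's output. This is only an illustration of the ODE in the text, not the paper's training procedure.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy vector field v_t(x) = a * x, standing in for a learned v_t(x; theta).
a = 0.5
v = lambda t, x: a * x

# Integrate d/dt phi_t(x) = v_t(phi_t(x)) with phi_0(x) = x_init from t=0 to t=1.
x_init = np.array([1.0, -2.0])
sol = solve_ivp(v, t_span=(0.0, 1.0), y0=x_init, rtol=1e-8, atol=1e-8)

phi_1 = sol.y[:, -1]
# The exact flow of a linear field is phi_t(x) = exp(a*t) * x.
assert np.allclose(phi_1, np.exp(a) * x_init, atol=1e-5)
print("phi_1(x):", phi_1)
```

Sampling from a trained CNF follows the same pattern: draw $x \sim p_0$ and integrate the learned field from $t=0$ to $t=1$.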

3. FLOW MATCHING

Let $x_1$ denote a random variable distributed according to some unknown data distribution $q(x_1)$. We assume we only have access to data samples from $q(x_1)$ but have no access to the density function itself. Furthermore, we let $p_t$ be a probability path such that $p_0 = p$ is a simple distribution, e.g., the standard normal distribution $p(x) = \mathcal{N}(x \mid 0, I)$, and let $p_1$ be approximately equal in distribution to $q$. We will later discuss how to construct such a path. The Flow Matching objective is then designed to match this target probability path, which will allow us to flow from $p_0$ to $p_1$.
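As a preview of the objective developed below, the following is a minimal sketch of a Conditional Flow Matching (CFM)-style training step, assuming the straight-line conditional path $x_t = (1-t)\,x_0 + t\,x_1$ with target velocity $u_t = x_1 - x_0$ (one instantiation from Section 4; the paper's general family is broader). The `model` function is a hypothetical placeholder for the learned field $v_t(x; \theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(t, x):
    # Placeholder "network" v_t(x; theta): predicts a constant velocity.
    return np.full_like(x, 0.1)

def cfm_loss(x1_batch):
    """Monte Carlo estimate of a CFM-style regression loss (toy sketch)."""
    x0 = rng.standard_normal(x1_batch.shape)       # noise samples from p_0
    t = rng.uniform(size=(x1_batch.shape[0], 1))   # t ~ U[0, 1]
    xt = (1.0 - t) * x0 + t * x1_batch             # point on the conditional path
    target = x1_batch - x0                         # conditional target velocity
    pred = model(t, xt)
    return np.mean(np.sum((pred - target) ** 2, axis=1))

batch = rng.standard_normal((64, 2))               # stand-in "data" batch
loss = cfm_loss(batch)
assert np.isfinite(loss) and loss >= 0.0
print("toy CFM loss:", loss)
```

The key point, developed in Section 3, is that regressing onto these per-example targets yields the same gradients as regressing onto the intractable marginal vector field.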



¹We use subscripts to denote the time parameter, e.g., $p_t(x)$.



Deep generative models are a class of deep learning algorithms aimed at estimating, and sampling from, an unknown data distribution. The recent influx of amazing advances in generative modeling, e.g., for image generation (Ramesh et al., 2022; Rombach et al., 2022), is mostly facilitated by the scalable and relatively stable training of diffusion-based models (Ho et al., 2020; Song et al., 2020b). However, the restriction to simple diffusion processes leads to a rather confined space of sampling probability paths, resulting in very long training times and the need to adopt specialized methods (e.g., Song et al. (2020a); Zhang & Chen (2022)) for efficient sampling. In this work we consider the general and deterministic framework of Continuous Normalizing Flows (CNFs; Chen et al. (2018)). CNFs are capable of modeling arbitrary probability paths and are in particular known to encompass the probability paths modeled by diffusion processes (Song et al., 2021).

Figure 1: Unconditional ImageNet-128 samples of a CNF trained using Flow Matching with Optimal Transport probability paths.

