FLOW STRAIGHT AND FAST: LEARNING TO GENERATE AND TRANSFER DATA WITH RECTIFIED FLOW

Abstract

We present rectified flow, a simple approach to learning (neural) ordinary differential equation (ODE) models to transport between two empirically observed distributions π0 and π1, hence providing a unified solution to generative modeling and domain transfer, among various other tasks involving distribution transport. The idea of rectified flow is to learn the ODE to follow the straight paths connecting the points drawn from π0 and π1 as much as possible. This is achieved by solving a straightforward nonlinear least squares optimization problem, which can be easily scaled to large models without introducing extra parameters beyond standard supervised learning. The straight paths are the shortest paths between two points, and can be simulated exactly without time discretization, hence yielding computationally efficient models. We show that, by learning a rectified flow from data, we effectively turn an arbitrary coupling of π0 and π1 into a new deterministic coupling with provably non-increasing convex transport costs. In addition, with a "reflow" procedure that iteratively learns a new rectified flow from the data bootstrapped from the previous one, we obtain a sequence of flows with increasingly straight paths, which can be simulated accurately with coarse time discretization in the inference phase. In empirical studies, we show that rectified flow performs superbly on image generation and image-to-image translation. In particular, our method yields nearly straight flows that give high-quality results even with a single Euler discretization step. Code is available at https://github.com/gnobitab/RectifiedFlow.
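The nonlinear least squares objective above can be sketched in a few lines: given a draw (X0, X1) from a coupling of π0 and π1, the drift v(x, t) is regressed onto the straight-line direction X1 − X0, evaluated at the linear interpolation Xt = t X1 + (1 − t) X0. The names below are hypothetical, and a single pair with a fixed t stands in for the expectation over data and time.

```python
import numpy as np

def rectified_flow_loss(v, x0, x1, t):
    """Nonlinear least-squares objective of rectified flow (a sketch).

    The drift v(x, t) is regressed onto the constant direction x1 - x0
    of the straight path, evaluated at the linear interpolation
    x_t = t*x1 + (1-t)*x0.
    """
    xt = t * x1 + (1.0 - t) * x0   # point on the straight path
    target = x1 - x0               # direction of the straight path
    residual = v(xt, t) - target
    return np.mean(residual ** 2)

# Toy check: for a single deterministic pair, the exact straight-line
# drift achieves zero loss.
x0 = np.zeros(4)
x1 = np.ones(4)
v_exact = lambda x, t: x1 - x0
print(rectified_flow_loss(v_exact, x0, x1, t=0.3))  # 0.0
```

In practice v is a neural network and the loss is averaged over sampled pairs (X0, X1) and uniformly sampled t ∈ [0, 1], which is standard supervised regression with no extra parameters.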

1. INTRODUCTION

Compared with supervised learning, the shared difficulty of the various forms of unsupervised learning is the lack of paired input/output data that makes standard regression or classification possible. The crux of many unsupervised methods is to find meaningful correspondences between points from two distributions. For example, generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) (e.g., Goodfellow et al., 2014; Kingma & Welling, 2013; Dinh et al., 2016) seek to map data points to latent codes following a simple elementary (e.g., Gaussian) distribution, with which the data can be generated and manipulated. On the other hand, domain transfer methods find mappings to transfer points between two different data distributions, both observed empirically, for the purpose of image-to-image translation, style transfer, and domain adaptation (e.g., Zhu et al., 2017; Flamary et al., 2016; Trigila & Tabak, 2016; Peyré et al., 2019). These tasks can be framed in a unified way as finding a transport map between two distributions:

Learning Transport Mapping. Given empirical observations of two distributions π0, π1 on R^d, find a transport map T : R^d → R^d which, in the infinite-data limit, gives Z1 := T(Z0) ∼ π1 when Z0 ∼ π0; that is, (Z0, Z1) is a coupling (a.k.a. transport plan) of π0 and π1.

We should note that the answer to this problem is not unique, because there are often infinitely many transport maps between two distributions. Optimal transport (OT) (e.g., Villani, 2021; Ambrosio et al., 2021; Figalli & Glaudo, 2021; Peyré et al., 2019) addresses the more challenging problem of finding an optimal coupling that minimizes a notion of transport cost, typically of the form E[c(Z1 − Z0)], where c : R^d → R is a cost function, such as c(x) = ∥x∥². However, for the generative and transfer modeling tasks above, the transport cost is not of direct interest, even though it induces a number of desirable properties.
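To make the non-uniqueness of couplings concrete, the toy sketch below (names hypothetical) compares two valid couplings of two one-dimensional Gaussian samples under the quadratic cost: a random pairing, and the monotone (sorted) pairing, which in one dimension is the OT coupling for convex costs.

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.normal(size=1000)           # samples from pi_0
z1 = rng.normal(loc=3.0, size=1000)  # samples from pi_1

cost = lambda a, b: np.mean((b - a) ** 2)  # quadratic transport cost c(x) = ||x||^2

# Independent (random) coupling: any pairing of the samples transports
# pi_0 to pi_1 in the empirical sense.
perm = rng.permutation(len(z1))
random_cost = cost(z0, z1[perm])

# Monotone coupling: sorting both samples pairs quantiles with quantiles,
# which minimizes the cost in 1-D for convex cost functions.
sorted_cost = cost(np.sort(z0), np.sort(z1))

print(random_cost > sorted_cost)  # the monotone coupling is cheaper: True
```

Both pairings are legitimate transport plans between the two empirical distributions; they differ only in the cost they incur, which illustrates why extra criteria are needed to single out a coupling.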
Hence, it is not necessary to accurately solve the OT problem, given the high difficulty of doing so. An important question is to identify relaxed notions of optimality that are of direct interest for ML tasks and are easier to enforce in practice. Several lines of techniques have been developed depending on how the map T is represented and trained. In traditional generative models, T is parameterized as a neural network and trained with either GAN-type minimax algorithms or (approximate) maximum likelihood estimation (MLE). However, GANs suffer from numerical instability and mode-collapse issues, and require substantial engineering effort and human tuning, which tend to transfer poorly across different model architectures and datasets. On the other hand, MLE tends to be intractable for complex models, and hence requires either approximate (variational or Monte Carlo) inference techniques such as those used in VAEs, or special model structures that yield tractable likelihoods, such as normalizing flows and auto-regressive models, which forces difficult trade-offs between expressive power and computational cost. Recently, advances have been made by representing the transport plan implicitly as a continuous-time process, including flow models based on neural ordinary differential equations (ODEs) (e.g., Chen et al., 2018; Papamakarios et al., 2021; Song et al., 2020a) and diffusion models based on stochastic differential equations (SDEs) (e.g., Song et al., 2020b; Ho et al., 2020; Tzen & Raginsky, 2019; De Bortoli et al., 2021; Vargas et al., 2021). In these models, a neural network is trained to represent the drift force of the process, and a numerical ODE/SDE solver is used to simulate the process during inference. By leveraging the mathematical structures of ODEs/SDEs, these continuous-time models can be trained efficiently without resorting to minimax or traditional approximate inference techniques.
The most notable examples are score-based generative models (Song & Ermon, 2019; 2020; Song et al., 2020b) and denoising diffusion probabilistic models (DDPM) (Ho et al., 2020), which have recently achieved impressive empirical results on image generation (e.g., Dhariwal & Nichol, 2021). However, compared with traditional "one-step" models like GANs and VAEs, continuous-time models are effectively "infinite-step" and incur high computational cost at inference time: drawing a single point (e.g., an image) requires solving the ODE/SDE with a numerical solver that must repeatedly call the expensive neural drift a large number of times. Moreover, in existing approaches, generative modeling and domain transfer are typically treated separately, and techniques often need to be extended to solve domain transfer problems; see e.g., Cycle
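The inference cost can be seen in a minimal Euler sampler (a sketch with hypothetical names): each step makes one call to the drift, so the number of steps directly controls the number of expensive network evaluations, and for a flow whose drift is constant along its paths, a single step is already exact.

```python
import numpy as np

def euler_sample(v, x0, n_steps):
    """Simulate dX_t = v(X_t, t) dt on [0, 1] with the Euler method.

    Each step makes one call to the drift v; for a neural drift this is
    the expensive part, so inference cost scales linearly with n_steps.
    """
    x, dt = np.array(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# For a perfectly straight flow the drift is constant along each path,
# so one Euler step already gives the exact solution (hypothetical
# constant drift used for illustration).
v_straight = lambda x, t: np.ones_like(x)
x0 = np.zeros(3)
one_step = euler_sample(v_straight, x0, n_steps=1)
many_steps = euler_sample(v_straight, x0, n_steps=100)
print(np.allclose(one_step, many_steps))  # True
```

For a curved drift the two results would differ, and many steps would be needed to control the discretization error; this is the gap that straightening the flow is meant to close.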



Figure 1: The trajectories of rectified flows for image generation (π0: standard Gaussian noise, π1: cat faces, top two rows), and image transfer between human and cat faces (π0: human faces, π1: cat faces, bottom two rows), when simulated using the Euler method with step size 1/N for N steps. The first rectified flow induced from the training data (1-rectified flow) yields good results with a very small number (e.g., N ≥ 2) of steps; the straightened reflow induced from 1-rectified flow (denoted as 2-rectified flow) has nearly straight trajectories and yields good results even with a single discretization step.
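The reflow step that produces 2-rectified flow can be sketched as follows (hypothetical names; a fixed affine map stands in for the trained flow): noise samples are pushed through the current flow, and the resulting deterministic pairs (X0, X1) become the training data for the next, straighter rectified flow.

```python
import numpy as np

def reflow_pairs(simulate_flow, x0_samples):
    """Bootstrap training pairs for the next rectified flow ("reflow").

    Each sample x0 is pushed through the current flow to obtain
    x1 = Flow(x0); the pairs (x0, x1) form the deterministic coupling
    used to train the next rectified flow. Sketch only: `simulate_flow`
    stands in for an ODE solve of the learned drift.
    """
    x1_samples = np.array([simulate_flow(x0) for x0 in x0_samples])
    return x0_samples, x1_samples

# Hypothetical "flow": a fixed affine map in place of a trained model.
flow = lambda x: 2.0 * x + 1.0
x0 = np.linspace(-1.0, 1.0, 5)
x0_out, x1_out = reflow_pairs(flow, x0)
print(x1_out)  # pushed-forward samples, paired deterministically with x0
```

Training a fresh rectified flow on these pairs, instead of on an independent coupling of noise and data, is what straightens the trajectories from one reflow iteration to the next.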

