CONVEX POTENTIAL FLOWS: UNIVERSAL PROBABILITY DISTRIBUTIONS WITH OPTIMAL TRANSPORT AND CONVEX OPTIMIZATION

Abstract

Flow-based models are powerful tools for designing probabilistic models with tractable density. This paper introduces Convex Potential Flows (CP-Flow), a natural and efficient parameterization of invertible models inspired by optimal transport (OT) theory. CP-Flows are the gradient map of a strongly convex neural potential function. The convexity implies invertibility and allows us to resort to convex optimization to solve the convex conjugate for efficient inversion. To enable maximum likelihood training, we derive a new gradient estimator of the log-determinant of the Jacobian, which involves solving an inverse-Hessian vector product using the conjugate gradient method. The gradient estimator has constant memory cost, and can be made effectively unbiased by reducing the error tolerance level of the convex optimization routine. Theoretically, we prove that CP-Flows are universal density approximators and are optimal in the OT sense. Our empirical results show that CP-Flow performs competitively on standard benchmarks of density estimation and variational inference.

1. INTRODUCTION

Normalizing flows (Dinh et al., 2014; Rezende & Mohamed, 2015) have recently gathered much interest within the machine learning community, ever since their breakthrough in modelling high-dimensional image data (Dinh et al., 2017; Kingma & Dhariwal, 2018). They are characterized by an invertible mapping that can reshape the distribution of the input data into a simpler or more complex one. To enable efficient training, numerous tricks have been proposed to impose structural constraints on the parameterization, such that the density of the model can be computed tractably.

We ask the following question: "what is the natural way to parameterize a normalizing flow?" To gain intuition, we start from the one-dimensional case. If a function f : R → R is continuous, it is invertible (injective onto its image) if and only if it is strictly monotonic. This means that if we are only allowed to move the probability mass continuously without flipping the order of the particles, then we can only rearrange them by changing the distances between them.

In this work, we seek to generalize this intuition of monotone rearrangement in 1D. We do so by motivating the parameterization of normalizing flows from an optimal transport perspective, which allows us to define a notion of rearrangement cost (Villani, 2008). It turns out that if we want the output of a flow to follow some desired distribution, then, under mild regularity conditions, the unique optimal mapping is characterized by a convex potential (Brenier, 1991). In light of this, we propose to parameterize normalizing flows by the gradient map of a (strongly) convex potential. Owing to this theoretical insight, the proposed method is provably universal and optimal: the proposed flow family can approximate arbitrary distributions, and it requires the least amount of transport cost.
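To make the 1D picture concrete: between two continuous distributions, the monotone rearrangement is T = F_Y^{-1} ∘ F_X, the composition of the source CDF with the target quantile function. The sketch below (standard library only; the example distributions and all function names are our own illustrative choices, not the paper's code) monotonically maps a standard normal onto an Exponential(1) distribution:

```python
import math

def gaussian_cdf(x):
    # CDF of the standard normal, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exp_inverse_cdf(u, rate=1.0):
    # quantile function of Exponential(rate)
    return -math.log(1.0 - u) / rate

def monotone_rearrangement(x):
    # T = F_Y^{-1} o F_X pushes N(0, 1) samples onto Exponential(1)
    return exp_inverse_cdf(gaussian_cdf(x))

# T is strictly increasing, so it never flips the order of particles;
# in 1D this monotone map is also the optimal-transport map.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [monotone_rearrangement(x) for x in xs]
```

Since T sends quantiles to quantiles, the source median 0 lands on the target median ln 2.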
Furthermore, the parameterization with convex potentials allows us to formulate model inversion and gradient estimation as convex optimization problems. As such, we make use of existing tools from the convex optimization literature to cheaply and efficiently estimate all quantities of interest. In terms of the benefits of parameterizing a flow as a gradient field, the convex potential is an R^d → R function, unlike most existing discrete-time flows, which are R^d → R^d mappings. This makes CP-Flow relatively compact. It is also arguably easier to design a convex architecture, as we do not need to satisfy constraints such as orthogonality or Lipschitzness; the latter two usually require a direct or an iterative reparameterization of the parameters. Finally, it is possible to incorporate additional structure such as equivariance (Cohen & Welling, 2016; Zaheer et al., 2017) into the flow's parameterization, making CP-Flow a more flexible general-purpose density model.
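A minimal NumPy sketch of both points, under our own illustrative choices (a one-hidden-layer potential with nonnegative output weights plus a quadratic term, in the spirit of input-convex networks; all names and sizes are ours): a strongly convex potential G, its gradient map f = ∇G serving as the flow, and inversion of f by minimizing the convex objective G(x) − ⟨y, x⟩ with plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 2, 8
W = rng.standard_normal((h, d))     # hidden-layer weights (unconstrained)
b = rng.standard_normal(h)          # hidden-layer biases
a = np.abs(rng.standard_normal(h))  # output weights kept nonnegative => convexity
alpha = 0.5                         # quadratic term => strong convexity

def potential(x):
    # G(x) = (alpha/2)||x||^2 + sum_i a_i * softplus(w_i . x + b_i), convex in x
    return 0.5 * alpha * x @ x + a @ np.logaddexp(0.0, W @ x + b)

def flow(x):
    # f(x) = grad G(x): the forward map, written analytically
    sig = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return alpha * x + W.T @ (a * sig)

def invert(y, steps=5000, lr=0.02):
    # f^{-1}(y) = argmin_x G(x) - <y, x>, a strongly convex problem;
    # its gradient is flow(x) - y, so gradient descent converges to f^{-1}(y)
    x = np.zeros_like(y)
    for _ in range(steps):
        x -= lr * (flow(x) - y)
    return x

x0 = np.array([0.7, -1.2])
y0 = flow(x0)
x_rec = invert(y0)
```

Strong convexity of G makes the inversion objective well-conditioned, which is what licenses the use of off-the-shelf convex solvers in place of the naive gradient descent shown here.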

2. BACKGROUND: NORMALIZING FLOWS AND OPTIMAL TRANSPORT

Normalizing flows are characterized by a differentiable, invertible neural network f such that the probability density of the network's output can be computed conveniently using the change-of-variable formula

p_Y(f(x)) = p_X(x) \left| \det \frac{\partial f(x)}{\partial x} \right|^{-1} \iff p_Y(y) = p_X(f^{-1}(y)) \left| \det \frac{\partial f^{-1}(y)}{\partial y} \right|   (1)

where the Jacobian determinant term captures the local expansion or contraction of the density near x (resp. y) induced by the mapping f (resp. f^{-1}), and p_X is the density of a random variable X. The invertibility requirement has led to the design of many special neural network parameterizations, such as triangular maps, ordinary differential equations, and orthogonality or Lipschitz constraints.

Universal Flows  For a general learning framework to be meaningful, a model needs to be flexible enough to capture variations in the data distribution. In the context of density modeling, this corresponds to the model's capability to represent arbitrary probability distributions of interest. Even though there exists a long history of literature on the universal approximation capability of deep neural networks (Cybenko, 1989; Lu et al., 2017; Lin & Jegelka, 2018), invertible neural networks generally have limited expressivity and cannot approximate arbitrary functions. However, for the purpose of approximating a probability distribution, it suffices to show that the distribution induced by a normalizing flow is universal. Among the many ways to establish distributional universality of flow-based methods (e.g. Huang et al., 2018; 2020b; Teshima et al., 2020; Kong & Chaudhuri, 2020), one particular approach is to approximate a deterministic coupling between probability measures. Given a pair of probability densities p_X and p_Y, a deterministic coupling is a mapping g such that g(X) ∼ p_Y if X ∼ p_X. We seek a coupling that is invertible, or at least can be approximated by invertible mappings.

Optimal Transport  Let c(x, y) be a cost function.
The Monge problem (Villani, 2008) pertains to finding the optimal transport map g that realizes the minimal expected cost

J_c(p_X, p_Y) = \inf_{\tilde{g} : \tilde{g}(X) \sim p_Y} \mathbb{E}_{X \sim p_X}\left[ c(X, \tilde{g}(X)) \right]   (2)

When the second moments of X and Y are both finite, and X is regular enough (e.g. has a density), the special case of c(x, y) = ||x − y||^2 has an interesting solution, a celebrated theorem due to Brenier (1987; 1991):

Theorem 1 (Brenier's Theorem; Theorem 1.22 of Santambrogio (2015)). Let μ, ν be probability measures with finite second moments, and assume μ has a Lebesgue density p_X. Then there exists a convex potential G such that the gradient map g := ∇G (defined up to a null set) uniquely solves the Monge problem in eq. (2) with the quadratic cost function c(x, y) = ||x − y||^2.

Some recent works are also inspired by Brenier's theorem and utilize a convex potential to parameterize a critic model, starting from Taghvaei & Jalali (2019), further built upon by Makkuva et al. (2019), who parameterize a generator with a convex potential, and concurrently by Korotin et al. (2019). Our work sets itself apart from these prior works in that it is entirely likelihood-based, minimizing the (empirical) KL divergence as opposed to an approximate optimal transport cost.
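For 1D Gaussian marginals, Brenier's map can be written in closed form, which makes Theorem 1 easy to check by hand: the convex potential G(x) = (σ_Y/σ_X)(x − μ_X)²/2 + μ_Y x has the monotone affine gradient map ∇G(x) = μ_Y + (σ_Y/σ_X)(x − μ_X), which pushes N(μ_X, σ_X²) onto N(μ_Y, σ_Y²). A sketch with our own illustrative parameters:

```python
mu_x, sigma_x = 1.0, 2.0   # source: N(1, 4)
mu_y, sigma_y = -3.0, 0.5  # target: N(-3, 0.25)

def brenier_map(x):
    # gradient of the convex potential
    # G(x) = (sigma_y / sigma_x) * (x - mu_x)**2 / 2 + mu_y * x
    return mu_y + (sigma_y / sigma_x) * (x - mu_x)

# An affine map a*x + c sends N(mu_x, sigma_x^2) to N(a*mu_x + c, a^2 sigma_x^2);
# here a = sigma_y / sigma_x > 0, so the map is monotone and hits the target exactly.
mapped_mean = brenier_map(mu_x)
mapped_sd = (sigma_y / sigma_x) * sigma_x
```

Among all maps transporting the source Gaussian to the target, this gradient of a convex potential is the one minimizing the expected squared displacement in eq. (2).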

