WASSERSTEIN-2 GENERATIVE NETWORKS

Abstract

We propose a novel end-to-end non-minimax algorithm for training optimal transport mappings for the quadratic cost (Wasserstein-2 distance). The algorithm uses input convex neural networks and a cycle-consistency regularization to approximate the Wasserstein-2 distance. In contrast to popular entropic and quadratic regularizers, cycle-consistency does not introduce bias and scales well to high dimensions. On the theoretical side, we estimate the properties of the generative mapping fitted by our algorithm. On the practical side, we evaluate our algorithm on a wide range of tasks: image-to-image color transfer, latent-space optimal transport, image-to-image style transfer, and domain adaptation.

1. INTRODUCTION

The generative learning framework has become widespread over the last couple of years, tentatively starting with the introduction of generative adversarial networks (GANs) by Goodfellow et al. (2014). The framework aims to define a stochastic procedure to sample from a given complex probability distribution Q on a space Y ⊂ R^D, e.g. a space of images. The usual generative pipeline includes sampling from a tractable distribution P on a space X and applying a generative mapping g : X → Y that transforms P into the desired Q.

In many cases, for given probability distributions P, Q there may exist several different generative mappings. For example, the mapping in Figure 1b seems better than the one in Figure 1a and should be preferred: the mapping in Figure 1b is straightforward, well-structured and invertible. Existing generative learning approaches mainly do not focus on the structural properties of the generative mapping. A reasonable question is how to find a generative mapping g♯P = Q that is well-structured; typically, the better the structure of the mapping, the easier it is to find such a mapping. There are many ways to define what a well-structured mapping is, but usually such a mapping is expected to be continuous and, if possible, invertible. One may note that when P and Q are both one-dimensional (X, Y ⊂ R), the only class of mappings g : X → Y satisfying these properties is the class of monotone mappings,¹ i.e. mappings satisfying (g(x) − g(x′)) · (x − x′) > 0 for all x, x′ ∈ X with x ≠ x′. The intuition of the one-dimensional case can be easily extended to X, Y ⊂ R^D by requiring the analogous condition to hold:

⟨g(x) − g(x′), x − x′⟩ > 0 for all x, x′ ∈ X, x ≠ x′.    (1)

Condition (1) is called monotonicity, and every surjective function satisfying this condition is invertible.
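As a sanity check of condition (1), the gradient of a strongly convex function is always monotone. The following minimal numeric sketch (our own illustration, not part of the paper's method; all names are ours) verifies this for the gradient of a convex quadratic:

```python
import numpy as np

# Monotonicity check for condition (1): the gradient of a convex function
# is a monotone mapping. Here psi(x) = 0.5 * x^T (A^T A + I) x is strongly
# convex, so g(x) = (A^T A + I) x should satisfy <g(x) - g(x'), x - x'> > 0.
rng = np.random.default_rng(0)
D = 8
A = rng.normal(size=(D, D))
M = A.T @ A + np.eye(D)  # symmetric positive definite

def g(x):
    return M @ x  # g = gradient of the convex quadratic psi

inner_products = []
for _ in range(1000):
    x, x_prime = rng.normal(size=D), rng.normal(size=D)
    inner_products.append((g(x) - g(x_prime)) @ (x - x_prime))

print(min(inner_products))  # strictly positive
```

Since M = AᵀA + I ⪰ I, each inner product equals (x − x′)ᵀM(x − x′) ≥ ‖x − x′‖², hence is strictly positive for distinct points.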
In the one-dimensional case, for any pair of continuous P, Q with non-zero density there exists a unique monotone generative map given by g(x) = F_Q^{−1}(F_P(x)) McCann et al. (1995), where F_P(·) and F_Q(·) are the cumulative distribution functions of P and Q, respectively. However, for D > 1 there might exist more than one generative monotone mapping. For example, when P = Q are standard 2-dimensional Gaussian distributions, all rotations by angles −π/2 < α < π/2 are monotone and preserve the distribution. One may impose uniqueness by considering only maximal monotone mappings Peyré (2018), i.e. mappings g : X → Y satisfying, for every N = 2, 3, . . . and all distinct points x_1, . . . , x_N ∈ X (with the convention x_{N+1} ≡ x_1),

∑_{n=1}^{N} ⟨g(x_n), x_n − x_{n+1}⟩ > 0.    (2)

Condition (2) is called cycle monotonicity and also implies the "usual" monotonicity (1). Importantly, for almost every two continuous probability distributions P, Q on X = Y = R^D there exists a unique cycle monotone mapping g : X → Y satisfying g♯P = Q, see McCann et al. (1995). Thus, instead of searching for an arbitrary generative mapping, one may significantly reduce the considered approximating class of mappings by using only cycle monotone ones.

According to Rockafellar (1966), every cycle monotone mapping g is contained in a sub-gradient of some convex function ψ : X → R. Thus, every convex class of functions may produce cycle monotone mappings (by considering sub-gradients of these functions). In practice, deep input convex neural networks (ICNNs, see Amos et al. (2017)) can be used as a class of convex functions. Formally, to fit a cycle monotone generative mapping, one may apply any existing approach, such as GANs Goodfellow et al. (2014), with the set of generators restricted to gradients of ICNNs. However, GANs typically require solving a minimax optimization problem. It turns out that cycle monotone generators are strongly related to the Wasserstein-2 distance (W_2). The approaches by Taghvaei & Jalali (2019); Makkuva et al.
(2019) use the dual form of W_2 to find the optimal generative mapping, which is cycle monotone. The predecessor of both approaches is the gradient-descent algorithm for computing the W_2 distance by Chartrand et al. (2009). The drawback of all these methods is similar to that of GANs: their optimization objectives are minimax.

Cyclically monotone generators require that both spaces X and Y have the same dimension, which poses no practical limitation. Indeed, it is possible to combine a generative mapping with a decoder of a pre-trained autoencoder, i.e. to train a generative mapping into a latent space. It should also be noted that cases with equal dimensions of X and Y are common in computer vision. A typical example is image-to-image style transfer, where both the input and the output images have the same size and number of channels. Other examples include image-to-image color transfer, domain adaptation, etc.

In this paper, we develop the concept of cyclically monotone generative learning. The main contributions of the paper are as follows:

1. Developing an end-to-end non-minimax algorithm for training cyclically monotone generative maps, i.e. optimal maps for the quadratic transport cost (Wasserstein-2 distance).
2. Proving a theoretical bound on the approximation properties of the transport mapping fitted by the developed approach.
3. Developing a class of Input Convex Neural Networks whose gradients are used to approximate cyclically monotone mappings.
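The third contribution rests on the fact that gradients of input convex networks are cyclically monotone. The following toy sketch (our own illustrative construction, not the paper's architecture) builds a simple input convex function and checks the cycle monotonicity condition (2) on random cycles of points:

```python
import numpy as np

# A tiny ICNN-style convex function: psi(x) = a^T softplus(W x + b) + 0.5*c*||x||^2
# with a >= 0 and c > 0, which is (strongly) convex in x. Its gradient should
# therefore satisfy the cycle monotonicity condition (2) on any cycle of points.
rng = np.random.default_rng(1)
D, H, c = 4, 16, 0.5
W = rng.normal(size=(H, D))
b = rng.normal(size=H)
a = np.abs(rng.normal(size=H))  # non-negative weights keep psi convex

def grad_psi(x):
    s = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # sigmoid = derivative of softplus
    return W.T @ (a * s) + c * x

# Check condition (2) on random cycles x_1, ..., x_N with x_{N+1} = x_1.
for N in (2, 5, 10):
    xs = rng.normal(size=(N, D))
    total = sum(grad_psi(xs[n]) @ (xs[n] - xs[(n + 1) % N]) for n in range(N))
    assert total > 0
```

By strong convexity of ψ (modulus c), the cycle sum is bounded below by (c/2) ∑_n ‖x_n − x_{n+1}‖², which is strictly positive for distinct points.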



¹ We consider only monotone increasing mappings. Decreasing mappings have analogous properties.



Figure 1: Two possible generative mappings that transform distribution P to distribution Q: (a) an arbitrary mapping; (b) the monotone mapping.
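In the one-dimensional setting, the monotone increasing mapping between two distributions can be estimated from samples by matching empirical CDFs. A minimal sketch with synthetic Gaussian data (all parameters here are illustrative, not from the paper):

```python
import numpy as np

# Empirical version of the 1-D monotone map g = F_Q^{-1}(F_P(x)):
# sorting samples of P and Q and matching order statistics yields a
# monotone increasing map. Synthetic data: P = N(0, 1), Q = N(2, 2^2),
# for which the exact map is g(x) = 2 + 2x.
rng = np.random.default_rng(2)
n = 200_000
xs = np.sort(rng.normal(0.0, 1.0, n))  # samples from P
ys = np.sort(rng.normal(2.0, 2.0, n))  # samples from Q

def g(x):
    # rank of x under the empirical CDF of P -> matching order statistic of Q
    idx = np.searchsorted(xs, x)
    return ys[np.clip(idx, 0, n - 1)]

grid = np.linspace(-2.0, 2.0, 9)
mapped = g(grid)
assert np.all(np.diff(mapped) > 0)  # monotone increasing
assert np.max(np.abs(mapped - (2 + 2 * grid))) < 0.1  # close to the exact map
```

Matching order statistics is the empirical counterpart of g(x) = F_Q^{−1}(F_P(x)); as the sample size grows, the estimate converges to the unique monotone map between P and Q.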

For example, GAN-based approaches, such as f-GAN by Nowozin et al. (2016); Yadav et al. (2017), W-GAN by Arjovsky et al. (2017), and others Li et al. (2017); Mroueh & Sercu (2017), approximate the generative mapping by a neural network with a problem-specific architecture.

