WASSERSTEIN-2 GENERATIVE NETWORKS

Abstract

We propose a novel end-to-end non-minimax algorithm for training optimal transport mappings for the quadratic cost (Wasserstein-2 distance). The algorithm uses input convex neural networks and a cycle-consistency regularization to approximate the Wasserstein-2 distance. In contrast to popular entropic and quadratic regularizers, cycle-consistency does not introduce bias and scales well to high dimensions. On the theoretical side, we estimate the properties of the generative mapping fitted by our algorithm. On the practical side, we evaluate our algorithm on a wide range of tasks: image-to-image color transfer, latent space optimal transport, image-to-image style transfer, and domain adaptation.

The condition (2) is called cyclic monotonicity and also implies the "usual" monotonicity (1). Importantly, for almost every two continuous probability distributions P, Q on X = Y = R^D there exists a unique cyclically monotone mapping g : X → Y satisfying g • P = Q, see McCann et al. (1995). Thus, instead of searching for an arbitrary generative mapping, one may significantly reduce the considered approximating class of mappings by using only cyclically monotone ones. According to Rockafellar (1966), every cyclically monotone mapping g is contained in the sub-gradient of some convex function ψ : X → R. Thus, every convex class of functions may produce cyclically monotone mappings (by considering the sub-gradients of these functions). In practice, deep input convex neural networks (ICNNs, see Amos et al. (2017)) can be used as a class of convex functions. Formally, to fit a cyclically monotone generative mapping, one may apply any existing approach, such as GANs (Goodfellow et al., 2014), with the set of generators restricted to gradients of ICNNs. However, GANs typically require solving a minimax optimization problem. It turns out that cyclically monotone generators are strongly related to the Wasserstein-2 distance (W₂). The approaches by Taghvaei & Jalali (2019); Makkuva et al.
(2019) use the dual form of W₂ to find the optimal generative mapping, which is cyclically monotone. The predecessor of both approaches is the gradient-descent algorithm for computing the W₂ distance by Chartrand et al. (2009). The drawback of all these methods is similar to that of GANs: their optimization objectives are minimax. Cyclically monotone generators require both spaces X and Y to have the same dimension, but this poses no practical limitation. Indeed, one can combine a generative mapping with the decoder of a pre-trained autoencoder, i.e. train a generative mapping into a latent space. It should also be noted that cases with equal dimensions of X and Y are common in computer vision. A typical example is image-to-image style transfer, where both the input and the output images have the same size and number of channels. Other examples include image-to-image color transfer, domain adaptation, etc.

In this paper, we develop the concept of cyclically monotone generative learning. The main contributions of the paper are as follows:

1. Developing an end-to-end non-minimax algorithm for training cyclically monotone generative maps, i.e. optimal maps for the quadratic transport cost (Wasserstein-2 distance).
2. Proving a theoretical bound on the approximation properties of the transport mapping fitted by the developed approach.
3. Developing a class of input convex neural networks whose gradients are used to approximate cyclically monotone mappings.

1. INTRODUCTION

The generative learning framework has become widespread over the last couple of years, tentatively starting with the introduction of generative adversarial networks (GANs) by Goodfellow et al. (2014). The framework aims to define a stochastic procedure to sample from a given complex probability distribution Q on a space Y ⊂ R^D, e.g. a space of images. The usual generative pipeline includes sampling from a tractable distribution P on a space X and applying a generative mapping g : X → Y that transforms P into the desired Q. In many cases, for probability distributions P, Q there may exist several different generative mappings. For example, the mapping in Figure 1b seems to be better than the one in Figure 1a and should be preferred: the mapping in Figure 1b is straightforward, well-structured and invertible. Existing generative learning approaches mainly do not focus on the structural properties of the generative mapping. For example, GAN-based approaches, such as f-GAN by Nowozin et al. (2016); Yadav et al. (2017), W-GAN by Arjovsky et al. (2017) and others (Li et al., 2017; Mroueh & Sercu, 2017), approximate the generative mapping by a neural network with a problem-specific architecture. A reasonable question is how to find a generative mapping g • P = Q that is well-structured. Typically, the better the structure of the mapping, the easier it is to find such a mapping. There are many ways to define what a well-structured mapping is, but usually such a mapping is expected to be continuous and, if possible, invertible. One may note that when P and Q are both one-dimensional (X, Y ⊂ R^1), the only class of mappings g : X → Y satisfying these properties is that of monotone mappings, i.e. mappings satisfying, for all x, x' ∈ X (x ≠ x'), (g(x) − g(x')) · (x − x') > 0. The intuition of 1-dimensional spaces can be easily extended to X, Y ⊂ R^D. We can require a similar condition to hold true: for all x, x' ∈ X (x ≠ x'),

⟨g(x) − g(x'), x − x'⟩ > 0.
(1)

The condition (1) is called monotonicity, and every surjective function satisfying it is invertible. In the one-dimensional case, for any pair of continuous P, Q with non-zero density there exists a unique monotone generative map given by g(x) = F_Q^{-1}(F_P(x)) (McCann et al., 1995), where F_P(•) and F_Q(•) are the cumulative distribution functions of P and Q respectively. However, for D > 1 there may exist more than one monotone generative mapping. For example, when P = Q are standard 2-dimensional Gaussian distributions, all rotations by angles −π/2 < α < π/2 are monotone and preserve the distribution. One may impose uniqueness by considering only maximal (Peyré, 2018) monotone mappings g : X → Y satisfying, for all N = 2, 3, . . . and all N distinct points x_1, . . . , x_N ∈ X (with x_{N+1} ≡ x_1):

Σ_{n=1}^{N} ⟨g(x_n), x_{n+1} − x_n⟩ ≤ 0.  (2)

For a transport cost c : X × Y → R, the associated optimal mapping problem is

min_{g•P=Q} ∫_X c(x, g(x)) dP(x).  (3)

Equation (3) is also known as Monge's formulation of optimal transportation (Villani, 2008). The principal OT generative method of Seguy et al. (2017) is based on optimizing the regularized dual form of the transport cost (3). It fits two potentials ψ, ψ̄ (primal and conjugate) and then uses the barycentric projection to establish the desired (third) generative network g. Although the method uses a non-minimax optimization objective, it is not end-to-end (it consists of two sequential steps). In the case of the quadratic transport cost c(x, y) = ½‖x − y‖², the value of (3) is known as the square of the Wasserstein-2 distance:

W₂²(P, Q) = min_{g•P=Q} ∫_X ½‖x − g(x)‖² dP(x).  (4)

It has been well studied in the literature (Brenier, 1991; McCann et al., 1995; Villani, 2003; 2008) and has many useful properties, which we discuss in Section 3 in more detail. The optimal mapping for the quadratic cost is cyclically monotone. Several algorithms exist for finding this mapping (Lei et al., 2019; Taghvaei & Jalali, 2019; Makkuva et al., 2019). The recent approach by Taghvaei & Jalali (2019) uses the gradient-descent-based algorithm by Chartrand et al. (2009) for computing W₂.
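In one dimension, the monotone map g = F_Q^{-1} ∘ F_P can be estimated from samples simply by matching empirical quantiles (sorting both samples). A small illustrative sketch (the Gaussian parameters below are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=5000)   # samples from P = N(0, 1)
ys = rng.normal(3.0, 2.0, size=5000)   # samples from Q = N(3, 4)

# Empirical g(x) = F_Q^{-1}(F_P(x)): the k-th smallest x is mapped to the
# k-th smallest y; np.interp extends this map monotonically to new points.
xs_sorted, ys_sorted = np.sort(xs), np.sort(ys)
def g(x):
    return np.interp(x, xs_sorted, ys_sorted)

mapped = g(xs)
print(mapped.mean(), mapped.std())     # close to 3 and 2: g pushes P onto Q

# Monotonicity (1): (g(x) - g(x')) * (x - x') > 0 for x != x'.
grid = np.linspace(-3.0, 3.0, 100)
assert np.all(np.diff(g(grid)) > 0)
```

The fitted map is piecewise linear and strictly increasing, which is exactly the 1-D monotone (and hence cyclically monotone) generator discussed above.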
The key idea is to approximate the optimal potential ψ* by an ICNN (Amos et al., 2017) and to extract the optimal generator g* = ∇ψ* from its gradient. The method is impractical due to its high computational complexity: during the main optimization cycle, it solves an additional optimization sub-problem. The inner problem is convex but computationally costly. This was noted in the original paper and de facto confirmed by the lack of experiments with complex distributions. A refinement of this approach is proposed by Makkuva et al. (2019): the inner optimization sub-problem is removed, and a network is used to approximate its solution. This speeds up the computation, but the problem is still minimax.

3. PRELIMINARIES

In this section, we recall the properties of the W₂ distance (4) and its relation to cyclically monotone mappings. Throughout the paper, we assume that P and Q are continuous distributions on X = Y = R^D with finite second moments. This condition guarantees that (3) is well-defined in the sense that the optimal mapping g* always exists. It follows from (Villani, 2003, Brenier's Theorem 2.12) that its restriction to the support of P is unique (up to values on negligible sets) and invertible. The symmetric characteristics apply to its inverse (g*)^{-1}, which induces symmetry in definition (4) for the quadratic cost. According to Villani (2003), the dual form of (4) is given by

W₂²(P, Q) = ∫_X ½‖x‖² dP(x) + ∫_Y ½‖y‖² dQ(y) − min_{ψ∈Convex} [∫_X ψ(x) dP(x) + ∫_Y ψ̄(y) dQ(y)],  (5)

where the sum of the first two integrals is denoted by Const(P, Q), the minimum is taken over all convex functions (potentials) ψ : X → R ∪ {∞}, and ψ̄(y) = max_{x∈X} [⟨x, y⟩ − ψ(x)] is the convex conjugate (Fenchel, 1949) of ψ, which is also a convex function ψ̄ : Y → R ∪ {∞}. We call the value of the minimum in (5) the cyclically monotone correlations and denote it by Corr(P, Q). By equating (5) with (4), one may derive the formula

Corr(P, Q) = max_{g•P=Q} ∫_X ⟨x, g(x)⟩ dP(x).  (6)

Note that −Corr(P, Q) can be viewed as the optimal transport cost for the bilinear cost function c(x, y) = −⟨x, y⟩; see McCann et al. (1995). Thus, searching for the optimal transport map g* for W₂ is equivalent to finding the mapping which maximizes the correlations (6). It is known for the W₂ distance that the gradient g* = ∇ψ* of the optimal potential ψ* readily gives the minimizer of (4); see Villani (2003). Being the gradient of a convex function, it is necessarily cyclically monotone. In particular, the inverse mapping can be obtained by taking the gradient (w.r.t. the input) of the conjugate ψ̄* of the optimal potential (McCann et al., 1995). Thus, we have

(g*)^{-1}(y) = (∇ψ*)^{-1}(y) = ∇ψ̄*(y).  (7)
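For one-dimensional Gaussians, every quantity in (4)-(6) has a closed form, which gives a quick sanity check of the decomposition W₂² = Const(P, Q) − Corr(P, Q) behind (5). The check below is illustrative (not from the paper) and uses the optimal map g(x) = μ_Q + (σ_Q/σ_P)(x − μ_P):

```python
import numpy as np

rng = np.random.default_rng(0)

def check(mp, sp, mq, sq):
    # Closed forms for P = N(mp, sp^2), Q = N(mq, sq^2), cost |x - y|^2 / 2:
    w2 = ((mp - mq) ** 2 + (sp - sq) ** 2) / 2   # squared W2 distance
    const = (mp**2 + sp**2 + mq**2 + sq**2) / 2  # Const(P, Q) in (5)
    corr = mp * mq + sp * sq                     # Corr(P, Q) = E[x * g(x)] in (6)
    assert abs(w2 - (const - corr)) < 1e-12      # identity behind (5)

for _ in range(100):
    mp, mq = rng.normal(size=2)
    sp, sq = rng.uniform(0.5, 2.0, size=2)
    check(mp, sp, mq, sq)
print("W2^2 = Const(P, Q) - Corr(P, Q) verified on 100 random Gaussian pairs")
```

The identity holds exactly here because (mp² + sp² + mq² + sq²)/2 − (mp·mq + sp·sq) = ((mp − mq)² + (sp − sq)²)/2.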
In fact, one may approximate the primal potential ψ by a parametric class Θ of input convex functions ψ_θ and optimize the correlations

min_{θ∈Θ} Corr(P, Q | ψ_θ) = min_{θ∈Θ} [∫_X ψ_θ(x) dP(x) + ∫_Y ψ̄_θ(y) dQ(y)]  (8)

in order to extract the approximate optimal generator g_{θ†} : X → Y from the approximate potential ψ_{θ†}. Note that, in general, it is not true that g_{θ†} • P will be equal to Q. However, we prove that if Corr(P, Q | ψ_{θ†}) is close to Corr(P, Q), then g_{θ†} • P ≈ Q; see our Theorem A.3 in Appendix A.2. The optimization of (8) can be performed via stochastic gradient descent. It is possible to get rid of the conjugate ψ̄_θ and extract an analytic formula for the gradient of (8) w.r.t. the parameters θ by using ψ_θ only; see the derivations in Taghvaei & Jalali (2019); Chartrand et al. (2009):

∂Corr(P, Q | ψ_θ)/∂θ = ∫_X [∂ψ_θ(x)/∂θ] dP(x) − ∫_Y [∂ψ_θ(x)/∂θ] dQ(y),

where ∂ψ_θ/∂θ in the second integral is computed at x = (∇ψ_θ)^{-1}(y), i.e. the inverse value of y under ∇ψ_θ. In practice, both integrals are replaced by their Monte Carlo estimates over random mini-batches from P and Q. Yet to compute the second integral, one needs to recover the inverse values of the current mapping ∇ψ_θ for all y ∼ Q in the mini-batch. To do this, the following optimization sub-problem has to be solved for each y ∼ Q in the mini-batch:

x = (∇ψ_θ)^{-1}(y) ⟺ x = arg max_{x∈X} [⟨x, y⟩ − ψ_θ(x)].  (9)

The optimization problem (9) is convex but computationally costly, since it requires computing the gradient of ψ_θ multiple times and ψ_θ is in general a large neural network. Besides, during the iterations over θ, a new independent batch of samples arrives each time. This makes it hard to reuse information about the solution of (9) from the previous gradient descent step over θ in (8).
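The sub-problem (9) is a concave maximization, so plain gradient ascent already solves it; a minimal sketch with a quadratic potential ψ(x) = xᵀAx/2 standing in for a large ICNN (the matrix A, step size and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A convex potential psi(x) = x^T A x / 2 (A positive definite) stands in
# for a large ICNN psi_theta; its gradient is simply A x.
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)
grad_psi = lambda x: A @ x

y = rng.normal(size=5)

# Solve (9): x = argmax_x [<x, y> - psi(x)] by plain gradient ascent.
x = np.zeros(5)
for _ in range(500):
    x += 0.05 * (y - grad_psi(x))   # gradient of <x, y> - psi(x)

# For this psi the inverse of grad psi is available exactly: x = A^{-1} y.
print(np.allclose(x, np.linalg.solve(A, y), atol=1e-6))
```

For a deep ICNN no closed-form inverse exists, and each of the 500 steps above becomes a full backward pass through the network, which is exactly the cost the text refers to.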

4. AN END-TO-END NON-MINIMAX ALGORITHM

In Subsection 4.1, we describe our novel end-to-end algorithm with a non-minimax optimization objective for fitting cyclically monotone generative mappings. In Subsection 4.2, we state our main theoretical results on the approximation properties of the proposed algorithm.

4.1. ALGORITHM

To simplify the inner optimization procedure for inverting the values of the current ∇ψ_θ, one may consider the following variational approximation of the main objective:

min_{ψ∈Convex} Corr(P, Q | ψ) = min_{ψ∈Convex} [∫_X ψ(x) dP(x) + ∫_Y max_{x∈X} (⟨x, y⟩ − ψ(x)) dQ(y)]
= min_{ψ∈Convex} [∫_X ψ(x) dP(x) + max_{T : Y→X} ∫_Y (⟨T(y), y⟩ − ψ(T(y))) dQ(y)],  (10)

where the inner maximum over x equals ψ̄(y), and by considering arbitrary measurable functions T we obtain a variational lower bound which matches the true value for T = (∇ψ)^{-1} = ∇ψ̄. Thus, a possible approach is to approximate both the primal and dual potentials by two different networks ψ_θ and ψ_ω and solve the optimization problem w.r.t. the parameters θ, ω, e.g. by stochastic gradient descent/ascent (Makkuva et al., 2019). Yet such a problem is still minimax. Thus, it suffers from typical problems such as convergence to local saddle points and instabilities during training, and it usually requires a non-trivial choice of hyperparameters. We propose a method to get rid of the minimax objective by imposing additional regularization. Our key idea is to add a regularization term R_Y(θ, ω) which stimulates cycle consistency (Zhu et al., 2017), i.e. the optimized generative mappings g_θ = ∇ψ_θ and g_ω^{-1} = ∇ψ_ω should be mutually inverse:

R_Y(θ, ω) = ∫_Y ‖g_θ • g_ω^{-1}(y) − y‖² dQ(y) = ∫_Y ‖∇ψ_θ • ∇ψ_ω(y) − y‖² dQ(y).  (11)

From the previous discussion and equation (7), we see that cycle consistency is a quite natural condition for the W₂ distance. More precisely, if ∇ψ_θ and ∇ψ_ω are exactly inverse to each other (assuming ∇ψ_θ is injective), then ψ_ω is the convex conjugate of ψ_θ up to a constant. In contrast to the regularization used in Seguy et al. (2017), the proposed penalty uses not the values of the potentials ψ_θ, ψ_ω themselves but the values of their gradients (the generators). This helps to stabilize the value of the regularization term, which in the case of Seguy et al. (2017) may take extremely high values due to the fact that convex potentials grow fast in absolute value.
Our proposed regularization leads to the following non-minimax optimization objective (λ > 0):

min_{θ,ω} [∫_X ψ_θ(x) dP(x) + ∫_Y (⟨∇ψ_ω(y), y⟩ − ψ_θ(∇ψ_ω(y))) dQ(y) + (λ/2) R_Y(θ, ω)] =: min_{θ,ω} Corr(P, Q | ψ_θ, ψ_ω; λ).  (12)

The practical optimization procedure is given in Algorithm 1. We replace all the integrals by Monte Carlo estimates over random mini-batches from P and Q. To perform the optimization, we use stochastic gradient descent over the parameters θ, ω of the primal ψ_θ and dual ψ_ω potentials. We use automatic differentiation to evaluate ∇ψ_ω, ∇ψ_θ and the gradients of (12) w.r.t. θ, ω.

Algorithm 1 (one optimization step; repeated until convergence):
1. Sample mini-batches X ∼ P and Y ∼ Q of size K;
2. Compute the Monte-Carlo estimate of the correlations:
L_Corr := (1/K) Σ_{x∈X} ψ_θ(x) + (1/K) Σ_{y∈Y} [⟨∇ψ_ω(y), y⟩ − ψ_θ(∇ψ_ω(y))];
3. Compute the Monte-Carlo estimate of the cycle-consistency regularizer:
L_Cycle := (1/K) Σ_{y∈Y} ‖∇ψ_θ • ∇ψ_ω(y) − y‖²;
4. Compute the total loss L_Total := L_Corr + (λ/2) L_Cycle;
5. Perform a gradient step over {θ, ω} by using ∂L_Total/∂{θ, ω}.

The time of a gradient step is proportional to the time required to compute the value of ψ_θ(x); we empirically measured that this factor roughly equals 8-12, depending on the particular architecture of the ICNN ψ_θ(x). We discuss the time complexity of a gradient step of our method in more detail in Appendix C.2. In Subsection 5.1, we show that our non-minimax approach converges up to 10x faster than the minimax alternatives by Makkuva et al. (2019) and Taghvaei & Jalali (2019).

4.2. APPROXIMATION PROPERTIES

Our gradient-descent-based approach described in Subsection 4.1 computes Corr(P, Q) by approximating it with a restricted set of convex potentials. Let (ψ†, ψ‡) be a pair of potentials obtained by the optimization of the correlations. Formally, the fitted generators g† = ∇ψ† and (g‡)^{-1} = ∇ψ‡ are byproducts of the optimization (12). We provide guarantees that the generated distribution g† • P is indeed close to Q and that the inverse mapping (g‡)^{-1} pushes Q close to P.

Theorem 4.1 (Generative Property for Approximators of Regularized Correlations). Let P, Q be two continuous probability distributions on X = Y = R^D with finite second moments. Let ψ* : X → R be the optimal convex potential:

ψ* = arg min_{ψ∈Convex} Corr(P, Q | ψ) = arg min_{ψ∈Convex} [∫_X ψ(x) dP(x) + ∫_Y ψ̄(y) dQ(y)].  (13)

Let two differentiable convex functions ψ† : X → R and ψ‡ : Y → R satisfy, for some ε ∈ R,

Corr(P, Q | ψ†, ψ‡; λ) ≤ ∫_X ψ*(x) dP(x) + ∫_Y ψ̄*(y) dQ(y) + ε = Corr(P, Q) + ε.  (14)

Assume that ψ† is β†-strongly convex (β† > 1/λ > 0) and B†-smooth (B† ≥ β†). Assume that ψ‡ has a bijective gradient ∇ψ‡. Then the following inequalities hold true:
1. Correlation Upper Bound (the regularized correlations dominate the true ones): Corr(P, Q | ψ†, ψ‡; λ) ≥ Corr(P, Q), i.e. ε ≥ 0;
2. Forward Generative Property (the mapping g† = ∇ψ† pushes P to be O(ε)-close to Q): W₂²(g† • P, Q) = W₂²(∇ψ† • P, Q) ≤ ε (B†)²/(λβ† − 1) · (1/√β† + √λ)²;
3. Inverse Generative Property (the mapping (g‡)^{-1} = ∇ψ‡ pushes Q to be O(ε)-close to P): W₂²((g‡)^{-1} • Q, P) = W₂²(∇ψ‡ • Q, P) ≤ ε/(β† − 1/λ).

Informally, Theorem 4.1 states that the better we approximate the correlations between P and Q by the potentials ψ†, ψ‡, the closer we may expect the generated distributions g† • P and (g‡)^{-1} • Q to be to Q and P respectively in the W₂ sense. We prove Theorem 4.1 and provide extra discussion on smoothness and strong convexity in Section A.2.
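The bounds of Theorem 4.1 can be verified exactly in one dimension, where for quadratic potentials ψ†(x) = a·x²/2 + b·x we have β† = B† = a and all quantities have closed forms. The check below, with P = N(0, 1), Q = N(2, 0.25) and λ = 3 chosen purely for illustration, evaluates ε from (14) and tests the three inequalities:

```python
import numpy as np

rng = np.random.default_rng(0)
LAM = 3.0
MU, SIG = 2.0, 0.5                 # Q = N(MU, SIG^2), P = N(0, 1)
EY, EY2 = MU, MU**2 + SIG**2       # first two moments of Q
CORR = SIG                         # Corr(P, Q) = mu_P*mu_Q + sig_P*sig_Q

def reg_corr(a, b, c, d):
    # Closed form of the regularized correlations (12) for
    # psi_dag(x) = a x^2/2 + b x and psi_ddag(y) = c y^2/2 + d y.
    t1, t2 = c * EY + d, c**2 * EY2 + 2 * c * d * EY + d**2   # E[t], E[t^2], t = grad psi_ddag(y)
    corr = a / 2 + (c * EY2 + d * EY) - a / 2 * t2 - b * t1
    cycle = (a*c - 1)**2 * EY2 + 2 * (a*c - 1) * (a*d + b) * EY + (a*d + b)**2
    return corr + LAM / 2 * cycle

for _ in range(1000):
    a = rng.uniform(0.4, 2.0)      # strong convexity beta = a > 1/LAM
    b, c, d = rng.uniform(-3, 3), rng.uniform(0.2, 3.0), rng.uniform(-5, 5)
    eps = reg_corr(a, b, c, d) - CORR
    assert eps >= -1e-9                                        # part 1: eps >= 0
    w2_fwd = ((b - MU)**2 + (a - SIG)**2) / 2                  # W2^2(grad psi_dag . P, Q)
    assert w2_fwd <= eps * a**2 / (LAM*a - 1) * (1/np.sqrt(a) + np.sqrt(LAM))**2 + 1e-9
    w2_inv = ((c*MU + d)**2 + (c*SIG - 1)**2) / 2              # W2^2(grad psi_ddag . Q, P)
    assert w2_inv <= eps / (a - 1/LAM) + 1e-9
print("Theorem 4.1 bounds hold on 1000 random 1-D instances")
```

Here the pushforwards of Gaussians by affine maps remain Gaussian, so the W₂² terms are computed with the closed Gaussian formula rather than estimated from samples.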
Additionally, we derive Theorem A.3, which states analogous generative properties for the mapping obtained by the base method (8) with a single potential and no regularization. Due to the Forward Generative Property of Theorem 4.1, one may view the optimization of the regularized correlations (12) as a process of minimizing W₂² between the forward generated distribution g† • P and the true distribution Q (the same applies to the inverse property). The Wasserstein-2 distance prevents the dropping of distant modes due to the quadratic cost; see the experiment in Figure 9 in Appendix C.3. The following theorem demonstrates that we can indeed approximate the correlations as well as required, provided the approximating classes of potentials are large enough.

Theorem 4.2 (Approximability of Correlations). Let P, Q be two continuous probability distributions on X = Y = R^D with finite second moments. Let ψ* : X → R be the optimal convex potential. Let Ψ_X, Ψ_Y be classes of differentiable convex functions X → R and Y → R respectively, such that:
1. ∃ψ_X ∈ Ψ_X whose gradient is ε_X-close to the forward mapping ∇ψ* in the L²(X → R^D, P) sense:

‖∇ψ_X − ∇ψ*‖²_P := ∫_X ‖∇ψ_X(x) − ∇ψ*(x)‖² dP(x) ≤ ε_X,

and ψ_X is B_X-smooth;
2. ∃ψ_Y ∈ Ψ_Y whose gradient is ε_Y-close to the inverse mapping ∇ψ̄* in the L²(Y → R^D, Q) sense:

‖∇ψ_Y − ∇ψ̄*‖²_Q := ∫_Y ‖∇ψ_Y(y) − ∇ψ̄*(y)‖² dQ(y) ≤ ε_Y.

Let (ψ†, ψ‡) be the minimizers of the regularized correlations within Ψ_X × Ψ_Y:

(ψ†, ψ‡) = arg min_{ψ∈Ψ_X, ψ'∈Ψ_Y} Corr(P, Q | ψ, ψ'; λ).  (15)

Then the regularized correlations for (ψ†, ψ‡) satisfy the following inequality:

Corr(P, Q | ψ†, ψ‡; λ) ≤ Corr(P, Q) + (λ/2)(B_X √ε_X... ) — more precisely,

Corr(P, Q | ψ†, ψ‡; λ) ≤ Corr(P, Q) + (λ/2)(B_X √ε_Y + √ε_X)² + (B_X √ε_Y + √ε_X) · √ε_Y + (B_X/2) ε_Y,

i.e. the regularized correlations do not exceed the true correlations plus an O(ε_X + ε_Y) term. By combining Theorem 4.2 with Theorem 4.1, we conclude that the solutions ψ†, ψ‡ of (15) push P and Q to be O(ε_X + ε_Y)-close to Q and P respectively.
In practice, it is reasonable to use input convex neural networks as the classes of functions Ψ_X, Ψ_Y: fully-connected ICNNs satisfy a universal approximation property (Chen et al., 2018). In Appendix A.3, we prove that our method can be applied in the latent space scenario. Theorem A.4 states that the distance between the target and generated (latent space generative map combined with the decoder) distributions can be upper bounded by the quality of the latent fit and the reconstruction loss of the auto-encoder. In Appendix A.4, we prove Theorem A.5, which demonstrates how our method can be applied to non-continuous distributions P and Q.

5. EXPERIMENTS

In this section, we experimentally evaluate the proposed model. In Subsection 5.1, we apply our method to estimate optimal transport maps in the Gaussian setting. In Subsection 5.2, we consider latent space mass transport. In Subsection 5.3, we experiment with image-to-image style translation. In Appendix C, we provide training details and additional experiments on color transfer, domain adaptation and toy examples. The architectures of the input convex networks that we use (Dense/Conv ICNNs) are described in Appendix B. The provided results are not intended to represent the state-of-the-art for any particular task; the goal is to show the feasibility of our approach and architectures.

5.1. OPTIMAL TRANSPORT BETWEEN GAUSSIANS

We test our method in the Gaussian setting P, Q = N(μ_P, Σ_P), N(μ_Q, Σ_Q). It is one of the few setups where the ground truth OT mapping exists in closed form. We compare our method [W2GN] with the quadratic regularization approach by Seguy et al. (2017) [LSOT] and the minimax approaches by Taghvaei & Jalali (2019) [MM-1] and Makkuva et al. (2019) [MM-2]. To assess the quality of the recovered transport map ∇ψ†, we consider the unexplained variance percentage:

L²-UVP(∇ψ†) = 100 · ‖∇ψ† − ∇ψ*‖²_P / Var(Q) %.

Here ∇ψ* is the optimal transport map. For values ≈ 0%, ∇ψ† is a good approximation of the OT map; for values ≥ 100%, the map ∇ψ† is nearly useless. Indeed, the trivial baseline ∇ψ⁰(x) ≡ E_Q[y] provides L²-UVP(∇ψ⁰) = 100%. As seen from Table 1, LSOT leads to a high error which grows drastically with the dimension. The W2GN, MM-1 and MM-2 approaches perform nearly equally in this metric, which is expected since they optimize analogous objectives. These methods compute optimal transport maps with low error (L²-UVP < 3% even in R^4096). However, as seen from the convergence plot in Figure 2, our approach converges several times faster; this naturally follows from the fact that the MM approaches contain an inner optimization cycle. We discuss the experiment in more detail in Appendix C.4.

5.2. LATENT SPACE MASS TRANSPORT

Table 2: FID scores for 64 × 64 generated images.

We test our algorithm on CelebA image generation (64 × 64). First, we construct the latent space distribution by using a non-variational convolutional auto-encoder to encode the images into 128-dimensional latent vectors. Next, we use a pair of DenseICNNs to fit a cyclically monotone mapping that transforms standard normal noise into the latent space distribution. In Figure 3, we present images generated directly by sampling from standard normal noise before (1st row) and after (2nd row) applying our transport map.
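The L²-UVP metric above is easy to evaluate whenever the ground-truth map is known. For Gaussians with diagonal covariances, the OT map is coordinate-wise affine, so the following illustrative sketch (not the paper's benchmark code) computes the metric directly and reproduces the 100% score of the constant baseline ∇ψ⁰(x) = E_Q[y]:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
mu_p, mu_q = rng.normal(size=D), rng.normal(size=D)
sp, sq = rng.uniform(0.5, 2.0, size=D), rng.uniform(0.5, 2.0, size=D)  # std devs

x = mu_p + sp * rng.normal(size=(100_000, D))        # samples from P

ot_map = lambda z: mu_q + sq / sp * (z - mu_p)       # exact OT map P -> Q (diagonal case)
var_q = np.sum(sq ** 2)                               # total variance of Q

def l2_uvp(candidate):
    # L2-UVP(T) = 100 * E_P ||T(x) - grad psi*(x)||^2 / Var(Q) %
    return 100 * np.mean(np.sum((candidate(x) - ot_map(x)) ** 2, axis=1)) / var_q

print(l2_uvp(ot_map))                                    # exact map: 0 %
print(l2_uvp(lambda z: np.broadcast_to(mu_q, z.shape)))  # constant baseline: ~100 %
```

The constant map scores ~100% because E_P‖E_Q[y] − ∇ψ*(x)‖² equals the total variance of Q.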
While our generative mapping does not perform significant changes, its effect is visible and is confirmed by an improvement in the Fréchet Inception Distance (FID, see Heusel et al. (2017)); see Table 2. For comparison, we also provide the score of a recent Wasserstein GAN by Liu et al. (2019). In Appendix C.5, we provide additional examples and a visualization of the latent space.

Figure 3: Images decoded from standard Gaussian latent noise (1st row) and decoded from the same noise transferred by our cyclically monotone map (2nd row).

5.3. IMAGE-TO-IMAGE STYLE TRANSLATION

In the problem of unpaired style transfer, the learner gets two image datasets, each with its own attributes, e.g. each dataset consists of landscapes related to a particular season. The goal is to fit a mapping capable of transferring the attributes of one dataset to the other, e.g. changing a winter landscape to the corresponding summer landscape. Our generative model fits a cyclically monotone mapping. However, the desired style transfer may not be cyclically monotone; thus, our model may transfer only some of the required attributes. For example, for winter-to-summer transfer our model learned to colorize the trees green, yet it experiences problems with replacing snow masses with grass. As Seguy et al. (2017) noted, OT is permutation invariant: it does not take into account the relations between dimensions, e.g. neighbouring pixels or the channels of one pixel. Thus, OT may struggle to fit the optimal generative mapping via convolutional architectures (which are designed to preserve the local structure of the image). Figures 4a and 4b demonstrate highlights of our model; we provide examples where the model does not perform well in Appendix C.8. To fix the above-mentioned issue, one may consider OT for the quadratic cost defined on the Gaussian pyramid of an image (Burt & Adelson, 1983) or, similarly to the perceptual losses used for super-resolution (Johnson et al., 2016), consider a perceptual quadratic cost. This serves as a challenge for our further research.

6. CONCLUSION

In this paper, we developed an end-to-end algorithm with a non-minimax objective for training cyclically monotone generative mappings, i.e. optimal transport mappings for the quadratic cost. Additionally, we established a theoretical justification of our method from the approximation point of view. The results of the computational experiments confirm the potential of the algorithm in various practical problems: latent space mass transport, image-to-image color/style transfer, and domain adaptation.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.

A PROOFS

In Subsection A.1, we provide important additional properties of the Wasserstein-2 distance and of the related L² spaces required to prove our main theoretical results. In Subsection A.2, we prove our main Theorems 4.1 and 4.2. Additionally, we show how our proofs can be translated to the simpler case (Theorem A.3), i.e. optimizing the correlations with a single potential (8) by using the basic approach of Taghvaei & Jalali (2019). In Subsection A.3, we justify the pipeline of Figure 13: we prove a theorem that makes it possible to estimate the quality of the latent space generative mapping combined with the decoding part of the auto-encoder. In Subsection A.4, we prove a useful fact which makes it possible to apply our method to distributions which do not have a density.

A.1 PROPERTIES OF WASSERSTEIN-2 METRIC AND RELATION TO L² SPACES

To prove our results, we need to introduce Kantorovich's formulation of optimal transport (Kantorovitch, 1958), which extends Monge's formulation (3). For a given transport cost c : X × Y → R and probability distributions P and Q on X and Y respectively, we define

Cost(P, Q) = min_{μ∈Π(P,Q)} ∫_{X×Y} c(x, y) dμ(x, y),  (17)

where Π(P, Q) is the set of all probability measures on X × Y whose marginals are P and Q respectively (transport plans). If the optimal transport solution exists in the form of a mapping g* : X → Y minimizing (3), then the optimal transport plan in (17) is given by μ* = [id, g*] • P. Otherwise, formulation (17) can be viewed as a relaxation of (3). For the quadratic cost c(x, y) = ½‖x − y‖², the square root of (17) defines the Wasserstein-2 distance (W₂), a metric on the space of probability distributions. In particular, it satisfies the triangle inequality: for every triplet of probability distributions P_1, P_2, P_3 on X ⊂ R^D, we have

W₂(P_1, P_3) ≤ W₂(P_1, P_2) + W₂(P_2, P_3).  (18)

We will also need the following lemma.

Lemma A.1 (Lipschitz property of the Wasserstein-2 distance).
Let P_1, P_2 be two probability distributions with finite second moments on X_1 ⊂ R^{D_1}. Let T : X_1 → X_2 ⊂ R^{D_2} be a measurable mapping with Lipschitz constant bounded by L. Then the following inequality holds true:

W₂(T • P_1, T • P_2) ≤ L · W₂(P_1, P_2),  (19)

i.e. the distance between P_1 and P_2 mapped by T does not exceed the initial distance multiplied by the Lipschitz constant L.

Proof. Let μ* ∈ Π(P_1, P_2) be the optimal transport plan between P_1 and P_2. Consider the distribution on X_2 × X_2 given by μ = T • μ*, where the mapping T is applied component-wise. The left and right marginals of μ are equal to T • P_1 and T • P_2 respectively. Thus, μ ∈ Π(T • P_1, T • P_2) is a transport plan between T • P_1 and T • P_2, and its cost is not smaller than the optimal one, i.e.

W₂²(T • P_1, T • P_2) ≤ ∫_{X_2×X_2} ½‖x − x'‖² dμ(x, x').  (20)

Next, we use the Lipschitz property of T and derive

∫_{X_2×X_2} ½‖x − x'‖² dμ(x, x') = ∫_{X_1×X_1} ½‖T(x) − T(x')‖² dμ*(x, x') ≤ ∫_{X_1×X_1} ½ L² ‖x − x'‖² dμ*(x, x') = L² · W₂²(P_1, P_2),  (21)

where the first equality uses μ = T • μ*. To finish the proof, we combine (20) with (21) and obtain the desired inequality (19).

Throughout the rest of the paper, we use L²(X → R^D, P) to denote the Hilbert space of functions f : X → R^D whose squared norm is integrable w.r.t. the probability measure P. The corresponding inner product for f_1, f_2 ∈ L²(X → R^D, P) is denoted by ⟨f_1, f_2⟩_P := ∫_X ⟨f_1(x), f_2(x)⟩ dP(x). We use ‖·‖_P = √⟨·, ·⟩_P to denote the corresponding norm induced by the inner product.

Lemma A.2 (L² inequality for the Wasserstein-2 distance). Let P be a probability distribution on X_1 ⊂ R^{D_1}. Let T_1, T_2 ∈ L²(X_1 → R^{D_2}, P). Then the following inequality holds true:

½ ‖T_1 − T_2‖²_P ≥ W₂²(T_1 • P, T_2 • P).

Proof. We define the transport plan μ = [T_1, T_2] • P between T_1 • P and T_2 • P and, similarly to the previous Lemma A.1, use the fact that its cost is not smaller than the optimal cost.
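Both lemmas are easy to check numerically in one dimension, where the optimal coupling between two empirical distributions of equal size simply matches sorted samples; a small illustration (the maps T, T1, T2 below are arbitrary Lipschitz choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def w2sq(a, b):
    # Squared W2 (cost |x - y|^2 / 2) between two empirical 1-D distributions
    # with equally many atoms: the optimal (monotone) coupling sorts both.
    return 0.5 * np.mean((np.sort(a) - np.sort(b)) ** 2)

p1 = rng.normal(0.0, 1.0, size=4000)
p2 = rng.normal(1.0, 2.0, size=4000)

# Lemma A.1: T = 2*tanh is Lipschitz with constant L = 2.
L = 2.0
T = lambda z: L * np.tanh(z)
assert w2sq(T(p1), T(p2)) <= L**2 * w2sq(p1, p2) + 1e-9

# Lemma A.2: push one distribution through two maps T1, T2.
T1, T2 = np.sin, np.cos
half_l2 = 0.5 * np.mean((T1(p1) - T2(p1)) ** 2)   # (1/2) ||T1 - T2||_P^2
assert half_l2 >= w2sq(T1(p1), T2(p1)) - 1e-9
print("Lemmas A.1 and A.2 hold on the samples")
```

In Lemma A.2 the left-hand side is the cost of the particular plan [T1, T2] • P, which can only overestimate the optimal transport cost, exactly as in the proof above.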

A.2 PROOFS OF THE MAIN THEORETICAL RESULTS

First, we prove our main Theorem 4.1. Then we formulate and prove its analogue (Theorem A.3) for the basic correlation optimization method (8) with a single convex potential. Next, we prove our main Theorem 4.2. At the end of the subsection, we discuss the constants appearing in the theorems: the strong convexity and smoothness parameters.

Proof of Theorem 4.1. We split the proof into three subsequent parts.

Part 1. Upper Bound on Correlations. First, we establish a lower bound for the regularized correlations Corr(P, Q | ψ†, ψ‡; λ), omitting the regularization term:

∫_X ψ†(x) dP(x) + ∫_Y [⟨y, ∇ψ‡(y)⟩ − ψ†(∇ψ‡(y))] dQ(y)
= ∫_Y ψ†(∇ψ̄*(y)) dQ(y) + ∫_Y [⟨y, ∇ψ‡(y)⟩ − ψ†(∇ψ‡(y))] dQ(y)  (22)
= ∫_Y [ψ†(∇ψ̄*(y)) − ψ†(∇ψ‡(y))] dQ(y) + ∫_Y ⟨y, ∇ψ‡(y)⟩ dQ(y)
≥ ∫_Y [⟨∇ψ†(∇ψ‡(y)), ∇ψ̄*(y) − ∇ψ‡(y)⟩ + (β†/2)‖∇ψ̄*(y) − ∇ψ‡(y)‖²] dQ(y)  (23)
+ ∫_Y ⟨y, ∇ψ‡(y)⟩ dQ(y) + ∫_Y ⟨y, ∇ψ̄*(y)⟩ dQ(y) − ∫_Y ⟨y, ∇ψ̄*(y)⟩ dQ(y)  (24)
= ⟨∇ψ† • ∇ψ‡, ∇ψ̄* − ∇ψ‡⟩_Q + (β†/2)‖∇ψ̄* − ∇ψ‡‖²_Q − ⟨id_Y, ∇ψ̄* − ∇ψ‡⟩_Q + Corr(P, Q)  (25)
= ⟨∇ψ† • ∇ψ‡ − id_Y, ∇ψ̄* − ∇ψ‡⟩_Q + (β†/2)‖∇ψ̄* − ∇ψ‡‖²_Q + Corr(P, Q)
= Corr(P, Q) + ½ ‖(1/√β†)(∇ψ† • ∇ψ‡ − id_Y) + √β† (∇ψ̄* − ∇ψ‡)‖²_Q − (1/(2β†)) ‖∇ψ† • ∇ψ‡ − id_Y‖²_Q.  (26)

Here (22) uses ∇ψ̄* • Q = P; (23) uses the β†-strong convexity of ψ†; in (24) we add and subtract ∫_Y ⟨y, ∇ψ̄*(y)⟩ dQ(y) = Corr(P, Q). We add the omitted regularization term (λ/2)‖∇ψ† • ∇ψ‡ − id_Y‖²_Q back to (26) and obtain the following bound:

Corr(P, Q | ψ†, ψ‡; λ) ≥ Corr(P, Q) + ½ (λ − 1/β†) ‖∇ψ† • ∇ψ‡ − id_Y‖²_Q + ½ ‖(1/√β†)(∇ψ† • ∇ψ‡ − id_Y) + √β† (∇ψ̄* − ∇ψ‡)‖²_Q.  (27)

Since λ > 1/β†, the obtained inequality proves that the true correlations Corr(P, Q) are upper bounded by the regularized correlations Corr(P, Q | ψ†, ψ‡; λ). Note that if the optimal potential is at least β†-strongly convex, the bound (27) is tight: indeed, it turns into an equality when we substitute ∇ψ‡ = (∇ψ†)^{-1} = ∇ψ̄*.

Part 2. Inverse Generative Property. We continue the derivations of Part 1.
Let u = ∇ψ† • ∇ψ‡ − id_Y and v = ∇ψ̄* − ∇ψ‡. By matching (27) with (14), we obtain

ε ≥ ½ (λ − 1/β†) ‖u‖²_Q + ½ ‖u/√β† + √β† v‖²_Q.  (28)

Now we derive an upper bound for ‖v‖²_Q. For a fixed u we have ‖u/√β† + √β† v‖²_Q ≤ 2ε − (λ − 1/β†) ‖u‖²_Q. Next, we apply the triangle inequality:

√β† ‖v‖_Q ≤ ‖u/√β† + √β† v‖_Q + ‖u‖_Q/√β† ≤ √(2ε − (λ − 1/β†) ‖u‖²_Q) + ‖u‖_Q/√β†.  (29)

The right-hand side of (29) attains its maximal value √(2ε/(1 − 1/(λβ†))) at ‖u‖_Q = √(2ε/(λ²β† − λ)). We conclude that

‖∇ψ̄* − ∇ψ‡‖²_Q = ‖v‖²_Q ≤ 2ε/(β† − 1/λ).

Finally, we apply the L² inequality of Lemma A.2 to the distribution Q and the mappings ∇ψ̄* and ∇ψ‡, and obtain

W₂²(∇ψ‡ • Q, P) ≤ ε/(β† − 1/λ),

i.e. the desired upper bound on the distance between the generated and the target distribution.

Part 3. Forward Generative Property. We recall the bound (28). Since all the summands on its right-hand side are non-negative, we derive

‖u‖²_Q ≤ 2ε/(λ − 1/β†).  (30)

We will use (30) to obtain an upper bound on ‖∇ψ̄† − ∇ψ̄*‖_Q. To begin with, we note that since ψ† is β†-strongly convex, its convex conjugate ψ̄† is (1/β†)-smooth. Thus, the gradient ∇ψ̄† is (1/β†)-Lipschitz, and for all y, y' ∈ Y:

‖∇ψ̄†(y) − ∇ψ̄†(y')‖ ≤ (1/β†) ‖y − y'‖.  (31)

We square both sides of (31), substitute y := ∇ψ†(∇ψ̄†(y)) = y and y' := ∇ψ†(∇ψ‡(y)), and integrate over Y w.r.t. Q. The obtained inequality is as follows:

∫_Y ‖∇ψ̄†(y) − ∇ψ‡(y)‖² dQ(y) ≤ (1/(β†)²) ∫_Y ‖∇ψ† • ∇ψ̄†(y) − ∇ψ† • ∇ψ‡(y)‖² dQ(y).  (32)

Next, we derive

‖∇ψ̄† − ∇ψ‡‖²_Q ≤ (1/(β†)²) ∫_Y ‖∇ψ† • ∇ψ̄†(y) − ∇ψ† • ∇ψ‡(y)‖² dQ(y) = (1/(β†)²) ∫_Y ‖y − (y + u(y))‖² dQ(y) = ‖u‖²_Q/(β†)²,  (33)

since ∇ψ† • ∇ψ̄†(y) = y and ∇ψ† • ∇ψ‡(y) = y + u(y); in the transition to line (33), we use the previously obtained inequality (32). Next, we use the triangle inequality for ‖·‖_Q to bound

‖∇ψ̄† − ∇ψ̄*‖_Q ≤ ‖∇ψ̄† − ∇ψ‡‖_Q + ‖∇ψ‡ − ∇ψ̄*‖_Q ≤ √(2ε/(λ − 1/β†)) · (1/β†) + √(2ε/(β† − 1/λ)) = √(2ε/(λ − 1/β†)) · (1/β† + √(λ/β†)) = √(2ε/(λβ† − 1)) · (1/√β† + √λ).  (34)

Next, we derive a lower bound for the left-hand side of (34) by using the B†-smoothness of ψ†.
For all $x, x' \in X$ we have

$$\big\|\nabla\psi^\dagger(x) - \nabla\psi^\dagger(x')\big\| \leq B^\dagger\|x - x'\|. \qquad (35)$$

We raise both parts of (35) to the square, substitute $x = \nabla\overline{\psi^\dagger}(y)$ and $x' = \nabla\overline{\psi^*}(y)$, and integrate over Y w.r.t. Q:

$$\int_Y \big\|\nabla\psi^\dagger\circ\nabla\overline{\psi^\dagger}(y) - \nabla\psi^\dagger\circ\nabla\overline{\psi^*}(y)\big\|^2\, dQ(y) \leq (B^\dagger)^2\int_Y \big\|\nabla\overline{\psi^\dagger}(y) - \nabla\overline{\psi^*}(y)\big\|^2\, dQ(y). \qquad (36)$$

Next, we use (36) to derive

$$\big\|\nabla\overline{\psi^\dagger} - \nabla\overline{\psi^*}\big\|^2_Q \geq \tfrac{1}{(B^\dagger)^2}\int_Y \big\|\underbrace{\nabla\psi^\dagger\circ\nabla\overline{\psi^\dagger}(y)}_{=y} - \nabla\psi^\dagger\circ\nabla\overline{\psi^*}(y)\big\|^2\, dQ(y)$$
$$= \tfrac{1}{(B^\dagger)^2}\int_X \big\|\nabla\psi^*(x) - \nabla\psi^\dagger(x)\big\|^2\, dP(x) \geq \tfrac{2}{(B^\dagger)^2}\, W_2^2\big(\nabla\psi^\dagger\circ P,\, Q\big). \qquad (37)$$

In line (37), we use the $L^2$ property of the Wasserstein-2 distance (Lemma A.2). Combining (34) and (37), we conclude that

$$W_2^2\big(\nabla\psi^\dagger\circ P,\, Q\big) \leq (B^\dagger)^2\cdot\frac{\epsilon}{\lambda\beta^\dagger - 1}\cdot\Big(\frac{1}{\sqrt{\beta^\dagger}} + \sqrt{\lambda}\Big)^2,$$

and finish the proof.

It is quite straightforward to formulate an analogous result for the basic optimization method with a single potential (8). We summarise the statement in the following

Theorem A.3 (Generative Property for Approximators of Correlations). Let P, Q be two continuous probability distributions on $Y = X = \mathbb{R}^D$ with finite second moments. Let $\psi^*: X \to \mathbb{R}$ be the convex minimizer of Corr(P, Q | ψ). Let a differentiable, $\beta^\dagger$-strongly convex and $B^\dagger$-smooth ($B^\dagger \geq \beta^\dagger > 0$) function $\psi^\dagger: X \to \mathbb{R}$ satisfy

$$\text{Corr}(P, Q \,|\, \psi^\dagger) \leq \int_X \psi^*(x)\, dP(x) + \int_Y \overline{\psi^*}(y)\, dQ(y) + \epsilon = \text{Corr}(P,Q) + \epsilon. \qquad (38)$$

Then the following inequalities hold true:

1. Forward Generative Property (the map $g^\dagger = \nabla\psi^\dagger$ pushes P to be $O(\epsilon)$-close to Q):
$$W_2^2\big(g^\dagger\circ P,\, Q\big) = W_2^2\big(\nabla\psi^\dagger\circ P,\, Q\big) \leq B^\dagger\epsilon;$$

2. Inverse Generative Property (the map $(g^\dagger)^{-1} = \nabla\overline{\psi^\dagger} = (\nabla\psi^\dagger)^{-1}$ pushes Q to be $O(\epsilon)$-close to P):
$$W_2^2\big((g^\dagger)^{-1}\circ Q,\, P\big) \leq \frac{\epsilon}{\beta^\dagger}.$$

Proof. First, we note that $\epsilon \geq 0$ by the definition of $\psi^*$. Next, we repeat the first part of the proof of Theorem 4.1 by substituting $\psi^\ddagger := \overline{\psi^\dagger}$. Thus, by using $\nabla\psi^\dagger\circ\nabla\overline{\psi^\dagger} = \mathrm{id}_Y$, we obtain the following simple analogue of formula (27):

$$\text{Corr}(P, Q \,|\, \psi^\dagger) \geq \text{Corr}(P,Q) + \frac{\beta^\dagger}{2}\big\|\nabla\overline{\psi^*} - \nabla\overline{\psi^\dagger}\big\|^2_Q, \quad \text{i.e.} \quad \epsilon \geq \frac{\beta^\dagger}{2}\big\|\nabla\overline{\psi^*} - \nabla\overline{\psi^\dagger}\big\|^2_Q. \qquad (39)$$

Thus, by using Lemma A.2, we immediately derive $W_2^2(\nabla\overline{\psi^\dagger}\circ Q,\, P) \leq \frac{\epsilon}{\beta^\dagger}$, i.e. the inverse generative property.
To derive the forward generative property, we note that $B^\dagger$-smoothness of $\psi^\dagger$ means that $\overline{\psi^\dagger}$ is $\frac{1}{B^\dagger}$-strongly convex. Thus, due to the symmetry of the objective, we can repeat all the derivations w.r.t. $\overline{\psi^\dagger}$ instead of $\psi^\dagger$ in order to prove $W_2^2(\nabla\psi^\dagger\circ P,\, Q) \leq B^\dagger\epsilon$, i.e. the forward generative property.

Proof of Theorem 4.2. We repeat the first part of the proof of Theorem 4.1, but instead of exploiting the strong convexity of $\psi_X$ to obtain an upper bound on the regularized correlations, we use $B_X$-smoothness to obtain a lower bound. The resulting analogue of (27) is as follows:

$$\text{Corr}(P, Q \,|\, \psi_X, \psi_Y; \lambda) - \text{Corr}(P,Q)$$
$$\leq \tfrac12\Big(\lambda - \tfrac{1}{B_X}\Big)\big\|\nabla\psi_X\circ\nabla\psi_Y - \mathrm{id}_Y\big\|^2_Q + \tfrac12\Big\|\tfrac{1}{\sqrt{B_X}}\big(\nabla\psi_X\circ\nabla\psi_Y - \mathrm{id}_Y\big) + \sqrt{B_X}\big(\nabla\overline{\psi^*} - \nabla\psi_Y\big)\Big\|^2_Q \qquad (40)$$
$$\leq \tfrac12\Big(\lambda - \tfrac{1}{B_X}\Big)\big\|\nabla\psi_X\circ\nabla\psi_Y - \mathrm{id}_Y\big\|^2_Q + \tfrac12\Big(\tfrac{1}{\sqrt{B_X}}\big\|\nabla\psi_X\circ\nabla\psi_Y - \mathrm{id}_Y\big\|_Q + \sqrt{B_X}\big\|\nabla\overline{\psi^*} - \nabla\psi_Y\big\|_Q\Big)^2 \qquad (41)$$
$$= \tfrac{\lambda}{2}\big\|\nabla\psi_X\circ\nabla\psi_Y - \mathrm{id}_Y\big\|^2_Q + \big\|\nabla\psi_X\circ\nabla\psi_Y - \mathrm{id}_Y\big\|_Q\cdot\big\|\nabla\overline{\psi^*} - \nabla\psi_Y\big\|_Q + \tfrac{B_X}{2}\big\|\nabla\overline{\psi^*} - \nabla\psi_Y\big\|^2_Q. \qquad (42)$$

In the transition from line (40) to (41), we apply the triangle inequality. For every $y \in Y$ we have

$$\big\|\nabla\psi_X\circ\nabla\psi_Y(y) - \nabla\psi_X\circ\nabla\overline{\psi^*}(y)\big\| \leq B_X\cdot\big\|\nabla\psi_Y(y) - \nabla\overline{\psi^*}(y)\big\|.$$

We raise both parts of this inequality to the power 2 and integrate over Y w.r.t. Q. We obtain

$$\big\|\nabla\psi_X\circ\nabla\psi_Y - \nabla\psi_X\circ\nabla\overline{\psi^*}\big\|^2_Q \leq (B_X)^2\cdot\big\|\nabla\psi_Y - \nabla\overline{\psi^*}\big\|^2_Q \leq (B_X)^2\cdot\epsilon_Y. \qquad (43)$$

Now we recall that since $\nabla\overline{\psi^*}\circ Q = P$, we have

$$\big\|\nabla\psi_X\circ\nabla\overline{\psi^*} - \underbrace{\nabla\psi^*\circ\nabla\overline{\psi^*}}_{\mathrm{id}_Y}\big\|^2_Q = \big\|\nabla\psi_X - \nabla\psi^*\big\|^2_P \leq \epsilon_X. \qquad (44)$$

Next, we combine (43) and (44) with the triangle inequality to bound

$$\big\|\nabla\psi_X\circ\nabla\psi_Y - \mathrm{id}_Y\big\|_Q \leq \big\|\nabla\psi_X\circ\nabla\psi_Y - \nabla\psi_X\circ\nabla\overline{\psi^*}\big\|_Q + \big\|\nabla\psi_X\circ\nabla\overline{\psi^*} - \mathrm{id}_Y\big\|_Q \leq B_X\sqrt{\epsilon_Y} + \sqrt{\epsilon_X}. \qquad (45)$$

We substitute all the bounds into (42):

$$\text{Corr}(P, Q \,|\, \psi_X, \psi_Y; \lambda) - \text{Corr}(P,Q) \leq \tfrac{\lambda}{2}\big(B_X\sqrt{\epsilon_Y} + \sqrt{\epsilon_X}\big)^2 + \big(B_X\sqrt{\epsilon_Y} + \sqrt{\epsilon_X}\big)\cdot\sqrt{\epsilon_Y} + \tfrac{B_X}{2}\epsilon_Y, \qquad (46)$$

and finish the proof by using $\text{Corr}(P, Q \,|\, \psi^\dagger, \psi^\ddagger; \lambda) \leq \text{Corr}(P, Q \,|\, \psi_X, \psi_Y; \lambda)$, which follows from the definition of $\psi^\dagger, \psi^\ddagger$. One may formulate and prove an analogous result for the basic optimization method with a single potential (8).
However, we do not include this in the paper since a similar result exists, see Taghvaei & Jalali (2019). All our theoretical results require smoothness or strong convexity properties of the potentials. We note that the assumptions of smoothness and strong convexity also appear in other papers on Wasserstein-2 optimal transport, see e.g. Paty et al. (2019). The property of B-smoothness of a convex function ψ means that its gradient ∇ψ has a Lipschitz constant bounded by B. In our case, the constant B serves as a reasonable measure of complexity of the fitted mapping ∇ψ: it estimates how much the mapping can warp the space. Strong convexity is dual to smoothness in the sense that the convex conjugate ψ̄ of a β-strongly convex function ψ is 1/β-smooth (and vice versa), see Kakade et al. (2009). In our case, a β-strongly convex potential means that its inverse gradient mapping (∇ψ)^{-1} = ∇ψ̄ cannot significantly warp the space, i.e. it has a Lipschitz constant bounded by 1/β. Recall the setting of our Theorem 4.2, and assume that the optimal transport map ∇ψ* between P and Q is the gradient of a β-strongly convex (β > 0) and B-smooth (B < ∞) function. In this case, by considering the classes Ψ_X = Ψ_Y of all min(β, 1/B)-strongly convex and max(B, 1/β)-smooth functions and using our method (for any λ > 1/β), we exactly compute the correlations and find the optimal map ∇ψ*.

A.3 FROM LATENT SPACE TO DATA SPACE

In the setting of latent space mass transport, we fit a generative mapping to the latent space of an autoencoder and combine it with the decoder to obtain a generative model. The natural question is how close the decoded distribution is to the real data distribution S used to train the encoder. The following theorem states that the distribution distance of the combined model naturally splits into two parts: the quality of the latent fit and the reconstruction loss of the autoencoder.

Theorem A.4 (Decoding Theorem). Let S be the real data distribution on $\mathcal{S} \subset \mathbb{R}^K$. Let $u: \mathcal{S} \to Y = \mathbb{R}^D$ be the encoder and $v: Y \to \mathbb{R}^K$ be an L-Lipschitz decoder. Assume that a latent space generative model has fitted a map $g^\dagger: X \to Y$ that pushes some latent distribution P on $X = \mathbb{R}^D$ to be close to $Q = u\circ S$ in the $W_2^2$-sense, i.e. $W_2^2(g^\dagger\circ P, Q) \leq \epsilon$. Then the following inequality holds true:

$$W_2\big(\underbrace{v\circ g^\dagger\circ P}_{\text{generated data distribution}},\, S\big) \leq L\sqrt{\epsilon} + \Big(\underbrace{\tfrac12\,\mathbb{E}_S\|s - v\circ u(s)\|^2_2}_{\text{autoencoder's reconstruction loss}}\Big)^{\frac12}, \qquad (47)$$

where $v\circ g^\dagger$ is the combined generative model.

Proof. We apply the triangle inequality and obtain

$$W_2(v\circ g^\dagger\circ P,\, S) \leq W_2(v\circ g^\dagger\circ P,\, v\circ Q) + W_2(v\circ Q,\, S). \qquad (48)$$

Let $P^\dagger = g^\dagger\circ P$ be the fitted latent distribution ($W_2^2(P^\dagger, Q) \leq \epsilon$). We use the Lipschitz Wasserstein-2 property of Lemma A.1 and obtain

$$L\sqrt{\epsilon} \geq L\cdot W_2(P^\dagger, Q) \geq W_2(v\circ P^\dagger,\, v\circ Q) = W_2(v\circ g^\dagger\circ P,\, v\circ Q). \qquad (49)$$

Next, we apply the $L^2$ property (Lemma A.2) to the mappings $\mathrm{id}_\mathcal{S}$, $v\circ u$ and the distribution S, and derive

$$\tfrac12\,\mathbb{E}_S\|s - v\circ u(s)\|^2_2 \geq W_2^2(S,\, v\circ Q). \qquad (50)$$

The desired inequality (47) immediately results from combining (48) with (49) and (50).

A.4 EXTENSION TO THE NON-EXISTENT DENSITY CASE

Our main theoretical results require the distributions P, Q to have finite second moments and a density on $X = Y = \mathbb{R}^D$. While the existence of second moments is a reasonable condition, in the majority of practical use cases the density might not exist. Moreover, it is typically assumed that the supports of the distributions are manifolds of dimension lower than D, or even discrete sets. One may artificially smooth the distributions P, Q by convolving them with white Gaussian noise $\Lambda = \mathcal{N}(0, \sigma^2 I_D)$ and find a generative mapping $g^\dagger: X \to Y$ between the smoothed $P*\Lambda$ and $Q*\Lambda$. For Wasserstein-2 distances, it is natural that the generative properties of $g^\dagger$ as a mapping between $P*\Lambda$ and $Q*\Lambda$ transfer to the generative properties of $g^\dagger$ as a mapping between P and Q, but with some bias depending on the statistics of Λ. Under the assumptions of Theorem A.5 below, we have

$$W_2(T\circ P,\, Q) \leq (L+1)\,\sigma\sqrt{\tfrac{D}{2}} + \sqrt{\epsilon}.$$

Proof. We apply the triangle inequality twice and obtain

$$W_2(T\circ P,\, Q) \leq W_2\big(T\circ P,\, T\circ[P*\Lambda]\big) + W_2\big(T\circ[P*\Lambda],\, [Q*\Lambda]\big) + W_2\big([Q*\Lambda],\, Q\big).$$

Consider a transport plan $\mu(y, y') \in \Pi([Q*\Lambda], Q)$ satisfying $\mu(y \,|\, y') = \Lambda(y - y')$ for $y' \sim Q$. The cost of μ is given by

$$\int_Y\int_Y \frac{\|y - y'\|^2_2}{2}\, d\Lambda(y - y')\, dQ(y') = \int_Y \frac{D\sigma^2}{2}\, dQ(y') = \frac{D\sigma^2}{2}.$$

Since the plan is not necessarily optimal, we conclude that $W_2^2([Q*\Lambda], Q) \leq \frac{D\sigma^2}{2}$. Analogously, we conclude that $W_2^2(P, [P*\Lambda]) \leq \frac{D\sigma^2}{2}$. Next, we apply the Lipschitz property of Wasserstein-2 (Lemma A.1) and obtain

$$W_2\big(T\circ P,\, T\circ[P*\Lambda]\big) \leq L\cdot W_2\big(P,\, [P*\Lambda]\big) \leq L\sigma\sqrt{\tfrac{D}{2}}.$$

Finally, we combine all the obtained bounds and derive $W_2(T\circ P, Q) \leq (L+1)\,\sigma\sqrt{\frac{D}{2}} + \sqrt{\epsilon}$.
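Per the footnote, convolving with Λ amounts to adding Gaussian noise to samples. A minimal sketch (`sampler` is a hypothetical batch sampler, not from the paper's code) that draws from P∗Λ and empirically checks the plan cost Dσ²/2 used in the proof:

```python
import torch

def smoothed_sample(sampler, sigma, batch):
    """Draw a batch from P * Lambda with Lambda = N(0, sigma^2 I_D):
    convolving with Gaussian noise equals adding the noise to samples."""
    x = sampler(batch)
    return x + sigma * torch.randn_like(x)

# Sanity check of the plan cost from the proof: the coupling (x, x + noise)
# has expected cost E ||noise||^2 / 2 = D * sigma^2 / 2, which upper-bounds
# W_2^2(P * Lambda, P) since this plan is not necessarily optimal.
D, sigma = 16, 0.1
x = torch.randn(200000, D)                       # toy P = N(0, I_D)
x_smoothed = smoothed_sample(lambda _: x, sigma, None)
cost = 0.5 * ((x_smoothed - x) ** 2).sum(dim=1).mean()
# cost is close to D * sigma^2 / 2 = 0.08
```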

B NEURAL NETWORK ARCHITECTURES

In Subsection B.1, we describe the general architecture of input convex networks. In the subsequent subsections, we describe the particular realisations of this general architecture that we use in the experiments: DenseICNN in Subsection B.2 and ConvICNN in Subsection B.3.

B.1 GENERAL INPUT-CONVEX ARCHITECTURE

We approximate convex potentials by Input Convex Neural Networks Amos et al. (2017). The overall architecture is schematically presented in Figure 5. An input convex network consists of two principal blocks: 1. The Linear (L) block consists of linear layers; activation functions and pooling operators in this block are also linear, e.g. identity activations or average pooling. 2. The Convexity Preserving (CP) block consists of linear layers with non-negative weights (excluding biases); activations and pooling operators in this block are convex and monotone. Within the blocks, it is possible to use arbitrary skip connections obeying the stated rules. Neurons of the L block can be arbitrarily connected to those of the CP block by applying a convex activation and adding the result with a positive weight. It follows from convex function arithmetic that every neuron (including the output one) in the architecture of Figure 5 is a convex function of the input. In our case, we expect the network to easily fit the identity generative mapping g(x) = ∇ψ(x) = x, i.e. the quadratic function ψ(x) = ‖x‖²/2 + c. Thus, we mainly insert quadratic activations between the L and CP blocks, which differs from Amos et al. (2017), where no activation was used. Gradients of input-quadratic functions correspond to linear warps of the input and are intuitively highly useful as building blocks (in particular, for fitting the identity mapping). We use specific architectures which fit the general scheme shown in Figure 5: ConvICNN is used for image-processing tasks, and DenseICNN is used otherwise. The exact architectures are described in the subsequent subsections. We use the CELU function as the convex and monotone activation (within the CP block) in all the networks. We also tried SoftPlus among some other continuous and differentiable functions, yet this negatively impacted the performance.
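The L/CP scheme above can be sketched in a few lines of PyTorch. This is an illustrative toy network (the class name and layer sizes are ours, not the paper's DenseICNN/ConvICNN), but it follows the stated rules: quadratic activations between the L and CP blocks, CELU inside the CP block, and non-negative CP weights enforced by clipping:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalICNN(nn.Module):
    """Toy input-convex network following the L/CP scheme of Figure 5."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        # L block: unconstrained linear maps applied directly to the input.
        self.L1 = nn.Linear(dim, hidden)
        self.L2 = nn.Linear(dim, hidden)
        # CP block: linear layers whose weights (not biases) must stay >= 0.
        self.CP1 = nn.Linear(hidden, hidden)
        self.CP2 = nn.Linear(hidden, 1)

    def forward(self, x):
        # Quadratic activation between L and CP blocks: convex but not
        # monotone, which is allowed for connections entering the CP block.
        z = F.celu(self.CP1(self.L1(x) ** 2) + self.L2(x) ** 2)
        return self.CP2(z)

    def clamp_weights(self):
        # Convexity is preserved only while CP weights are non-negative;
        # clipping after each optimizer step enforces this.
        for layer in (self.CP1, self.CP2):
            layer.weight.data.clamp_(min=0)
```

After `clamp_weights()`, the output is a convex function of the input, which can be checked numerically via midpoint convexity: ψ((a+b)/2) ≤ (ψ(a)+ψ(b))/2.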
The usage of the ReLU function is also possible, but the gradient of the potential would then be discontinuous. Thus, it would not be Lipschitz, and the insights of our Theorems 4.1 and 4.2 may not apply. As a convex and monotone pooling (within the CP block), it is possible to use Average or LogSumExp pooling (smoothed max pooling). Pure max pooling should be avoided for the same reason as the ReLU activation. However, in the ConvICNN architecture we use strided convolutions instead of pooling, see Subsection B.3. In order to use the insights of Theorem 4.1, we impose strong convexity and smoothness on the potentials. As we noted in Appendix A.2, B-smoothness of a convex function is equivalent to 1/B-strong convexity of its conjugate (and vice versa). Thus, we make both networks ψ_θ, ψ_ω β := 1/B-strongly convex, and the cycle regularization approximately maintains the corresponding 1/β = B-smoothness, since ∇ψ_θ ≈ (∇ψ_ω)^{-1} and ∇ψ_ω ≈ (∇ψ_θ)^{-1}. In practice, we achieve strong convexity by adding the extra term (β/2)‖x‖² to the output of the final neuron of the network. In all our experiments, we set β^{-1} = 10^6. In addition to smoothing, strong convexity guarantees that ∇ψ_θ and ∇ψ_ω are bijections, which is used in Theorems 4.1 and 4.2.

B.2 DENSE INPUT CONVEX NEURAL NETWORK

For DenseICNN, we implement a Convex Quadratic layer, each output neuron of which is a convex quadratic function of the input. More precisely, for each input $x \in \mathbb{R}^{N_{in}}$ it outputs $(cq_1(x), \ldots, cq_{N_{out}}(x)) \in \mathbb{R}^{N_{out}}$, with $cq_n(x) = \langle x, A_n x\rangle + \langle b_n, x\rangle + c_n$ for a positive semi-definite quadratic form $A_n \in \mathbb{R}^{N_{in}\times N_{in}}$, vector $b_n \in \mathbb{R}^{N_{in}}$ and constant $c_n \in \mathbb{R}$. Note that for large $N_{in}$, the size of such a layer grows fast, i.e. as $O(N_{in}^2 \cdot N_{out})$. To fix this issue, we represent each quadratic matrix as a product $A_n = F_n^T F_n$, where $F_n \in \mathbb{R}^{r\times N_{in}}$ is a matrix of rank at most r. This restricts the optimization to positive semi-definite (and, in particular, symmetric) quadratic forms and reduces the number of weights stored for the quadratic part to $O(r \cdot N_{in} \cdot N_{out})$. The resulting quadratic forms $A_n$ have rank at most r. The architecture is shown in Figure 6. We use Convex Quadratic layers in DenseICNN to connect the input directly to the layers of a fully connected network. Note that such layers (even at full rank) do not blow up the size of the network when the input dimension is low, e.g. in the problem of color transfer. The hyperparameters of DenseICNN are the widths of the layers and the ranks of the convex input-quadratic layers. For simplicity, we use the same rank r for all the layers. We denote the width of the first Convex Quadratic layer by $h_0$ and the width of the (k+1)-th Convex Quadratic and k-th Linear layers by $h_k$. The complete hyperparameter set of the network is given by $[r; h_0; h_1, \ldots, h_K]$.
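The low-rank factorization described above can be sketched as follows; this is an illustrative PyTorch module (the class name and initialization scale are ours), not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ConvexQuadratic(nn.Module):
    """Low-rank convex quadratic layer: cq_n(x) = <x, A_n x> + <b_n, x> + c_n
    with A_n = F_n^T F_n of rank at most r, stored via F_n in R^{r x N_in}."""
    def __init__(self, n_in, n_out, rank):
        super().__init__()
        self.F = nn.Parameter(torch.randn(n_out, rank, n_in) * 0.1)  # one F_n per output
        self.linear = nn.Linear(n_in, n_out)                         # b_n and c_n

    def forward(self, x):
        # <x, F_n^T F_n x> = ||F_n x||^2, so each A_n is positive
        # semi-definite by construction, with no extra constraints needed.
        quad = torch.einsum('ori,bi->bor', self.F, x).pow(2).sum(dim=2)
        return quad + self.linear(x)
```

Storing `F` instead of full matrices keeps the parameter count at O(r·N_in·N_out), as stated in the text.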

B.3 CONVOLUTIONAL INPUT CONVEX NEURAL NETWORK

We apply convolutional networks to the problem of unpaired image-to-image style transfer. The architecture of ConvICNN is shown in Figure 7 . The network takes an input image (128 × 128 with 3 RGB channels) and outputs a single value. The gradient of the network w.r.t. the input serves as a generator in our algorithm. 

C EXPERIMENTAL DETAILS AND EXTRA RESULTS

In the first subsection, we describe general training details. In the second subsection, we discuss the computational complexity of a single gradient step of our method. Each subsequent subsection corresponds to a particular problem and provides additional experimental results and training details: toy experiments in Subsection C.3 and comparison with minimax approach in Subsection C.4, latent space optimal transport in Subsection C.5, image-to-image color transfer in C.6, domain adaptation in Subsection C.7, image-to-image style transfer in Subsection C.8.

C.1 GENERAL TRAINING DETAILS

The code is written in the PyTorch framework. The networks are trained on a single GTX 1080Ti. The numerical optimization procedure is provided in Algorithm 1 of Section 4.1. In each experiment, both the primal ψ_θ and the conjugate ψ_ω potentials have the same network architecture. The minimization of (12) is done via mini-batch stochastic gradient descent with weight clipping (excluding biases) in the CP block to [0, +∞). We use the Adam optimizer Kingma & Ba (2014). For every particular task, we pretrain the potential network ψ_θ by minimizing the mean squared error to satisfy ∇ψ_θ(x) ≈ x and copy the weights to ψ_ω. This provides a good initialization for the main training, i.e. ∇ψ_θ and ∇ψ_ω are approximately mutually inverse at the start. During testing, we noted that our method converges faster if we disable back-propagation through the term ∇ψ_ω which appears twice in the second line of (12); in this case, the derivative w.r.t. ω is computed by using the regularization terms only. This heuristic allows us to save additional memory and computation time because a smaller computational graph is built. We used the heuristic in all the experiments. In the experiments with high-dimensional data (latent space optimal transport, domain adaptation and style transfer), we add the following extra regularization term to the main objective (12):

$$R_X(\theta, \omega) = \int_X \big\|g_\omega^{-1}\circ g_\theta(x) - x\big\|^2\, dP(x) = \int_X \big\|\nabla\psi_\omega\circ\nabla\psi_\theta(x) - x\big\|^2\, dP(x). \qquad (51)$$

Term (51) is analogous to the term $R_Y(\theta, \omega)$ given by (11): it also keeps the forward $g_\theta$ and inverse $g_\omega^{-1}$ generative mappings approximately mutually inverse. From the theoretical point of view, it is straightforward to obtain approximation guarantees similar to those of Theorems 4.1 and 4.2 for the optimization with both terms $R_X$ and $R_Y$. However, we do not include $R_X$ in the proofs in order to keep them simple.
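As a rough illustration of one step of the described procedure — not the paper's exact objective (12), and omitting the stop-gradient heuristic and the extra R_X term — a cycle-regularized correlation step can be sketched as follows, assuming each potential maps a batch of shape (B, D) to values of shape (B,):

```python
import torch

def grad_map(psi, x):
    # The generative map is the input-gradient of the scalar potential.
    return torch.autograd.grad(psi(x).sum(), x, create_graph=True)[0]

def w2gn_step(psi_theta, psi_omega, opt, x_batch, y_batch, lam):
    """One mini-batch step on a cycle-regularized correlation objective
    (an illustrative reading of Algorithm 1; details differ in the paper)."""
    y = y_batch.clone().requires_grad_(True)
    y_inv = grad_map(psi_omega, y)  # candidate inverse map, pushes Q toward P
    # Correlation part: int psi_theta dP + int [<y, grad psi_omega(y)>
    #                                           - psi_theta(grad psi_omega(y))] dQ
    corr = psi_theta(x_batch).mean() + \
           ((y * y_inv).sum(dim=1) - psi_theta(y_inv)).mean()
    # Cycle-consistency: grad psi_theta o grad psi_omega should be id_Y.
    cycle = ((grad_map(psi_theta, y_inv) - y) ** 2).sum(dim=1).mean()
    loss = corr + lam * cycle
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The key point the sketch demonstrates is that the objective is a plain minimization over both potentials; no inner adversarial loop is required.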

C.2 COMPUTATIONAL COMPLEXITY

The time required to evaluate the value of ( 12) and its gradient w.r.t. θ, ω is comparable up to a constant factor to that of a single evaluation of ψ θ (x). This claim follows from the well-known fact that gradient evaluation ∇ θ h θ (x) of h θ : R D → R, when parameterized as a neural network, requires time proportional to the size of the computational graph. Hence gradient computation requires computational time proportional to the time for evaluating the function h θ (x) itself. The same holds true when computing the derivative with respect to x. Thus, the number of operations required to compute different terms in ( 12), e.g. ∇ψ ω (y), ψ θ ∇ψ ω (y) and ∇ψ θ • ∇ψ ω (y), is also linear w.r.t. the computation time of ψ θ (x) or, equivalently, ψ ω (x). As a consequence, the time required for the forward pass of ( 12) is larger than the forward pass for ψ θ (x) only up to a constant factor. Thus, the backward pass for (12) with respect to parameters of ICNNs θ and ω is also linear in the computation time of ψ θ (x). We empirically measured that for our DenseICNN potentials, the computation of gradient of (12) w.r.t. parameters θ, ω requires roughly 8-12x more time than the computation of ψ θ (x). Evaluating ∇ψ θ , ∇ψ ω takes roughly 3-4x more time than evaluating ψ θ (x).
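The constant-factor claim can be checked with a crude timing sketch (the architecture below is an arbitrary stand-in for a potential network, and absolute numbers are machine-dependent; the ratio is the point):

```python
import time
import torch

def avg_time(fn, iters=200):
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

psi = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.CELU(),
                          torch.nn.Linear(256, 1))
x = torch.randn(512, 64, requires_grad=True)
t_fwd = avg_time(lambda: psi(x).sum())
t_grad = avg_time(lambda: torch.autograd.grad(psi(x).sum(), x)[0])
ratio = t_grad / t_fwd  # a small constant factor, independent of network size
```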

C.3 TOY EXPERIMENTS

In this subsection, we test our algorithm on 2D toy distributions from Gulrajani et al. (2017); Seguy et al. (2017). In all the experiments, the distribution P is standard Gaussian noise and Q is a Gaussian mixture or a Swiss roll. Both the primal and the conjugate potentials ψ_θ and ψ_ω have the DenseICNN [2; 128; 128, 64] architecture. Each network has roughly 25000 trainable parameters; some of them vanish during training because of the weight clipping. For each particular problem, the networks are trained for 30000 iterations with 1024 samples in a mini-batch. The Adam optimizer Kingma & Ba (2014) with lr = 10^{-3} is used. We put λ = 1 in our cycle regularization and impose an additional 10^{-10} L₁-regularization on the weights. For the case when Q is a mixture of 8 Gaussians, the intermediate learned distributions are shown in Figure 8. The overall structure of the forward mapping has already been learned by iteration 200, while the inverse mapping gets learned only by iteration ≈2000. This can be explained by the smoothness of the desired optimal mappings ∇ψ* and ∇ψ̄*. The inverse mapping ∇ψ̄* has a large Lipschitz constant because it has to unsqueeze the dense masses of the 8 Gaussians. In contrast, the forward mapping ∇ψ* has to squeeze the distribution. Thus, it is expected to have a lower Lipschitz constant (everywhere except the neighbourhood of the central point, which is a fixed point of ∇ψ* due to symmetry).

Additional examples (Gaussian mixtures & Swiss roll) are shown in Figures 10a, 10b and 9. When Q is a mixture of 100 Gaussians (Figure 9), our model learns all of the modes and does not suffer from mode dropping. We do not state that the fit is perfect but emphasize that mode collapse also does not happen. Note that all our theoretical results require the distributions P, Q to be smooth. Our method fits a continuous optimal transport map via ICNNs with the CELU activation, which are differentiable w.r.t. the input. At the same time, Makkuva et al. (2019) also uses ICNNs whose gradient is discontinuous w.r.t. the input (e.g. due to ReLU activations) to fit discontinuous generative mappings. We do not know whether our theoretical results can be directly generalized to the discontinuous-mapping case (without using smoothing as we suggested in Subsection A.4). However, we note that the usage of ICNNs with a discontinuous gradient naturally leads to "torn" generated distributions, see the example in Figure 11. While the fitted mapping is indeed close to the true Swiss roll in the W₂ sense, it clearly suffers from the "torn" effect. From the practical point of view, this effect seems to be similar to mode collapse, a well-known disease of GANs.

C.4 GAUSSIAN OPTIMAL TRANSPORT DETAILS AND DISCUSSION

We consider the Gaussian setting P, Q = N(0, Σ_P), N(0, Σ_Q), for which the ground truth OT solution has a closed form, see (Álvarez-Esteban et al., 2016, Theorem 2.3). Considering non-centered P, Q is unnecessary since $W_2^2(P,Q) = \|\mu_P - \mu_Q\|^2 + W_2^2(P_0, Q_0)$, where $P_0, Q_0$ are the centered copies of P, Q. In D-dimensional space, $\sqrt{\Sigma_P}$ ($\sqrt{\Sigma_Q}$ analogously) is initialized as $S_P^T \Lambda S_P$, where $S_P \in \mathbb{O}_D$ is a random rotation and Λ is diagonal with eigenvalues $[\frac12, \ldots, \frac12 b^k, \ldots, 2]$, $b = \sqrt[D-1]{4}$. The ground truth optimal transport map from P to Q is linear and given by

$$\nabla\psi^*(x) = \Sigma_P^{-\frac12}\Big(\Sigma_P^{\frac12}\,\Sigma_Q\,\Sigma_P^{\frac12}\Big)^{\frac12}\Sigma_P^{-\frac12}\, x.$$

We compare our approach with the method by Seguy et al. (2017) [LSOT] and the minimax approaches by Taghvaei & Jalali (2019) [MM-1] and Makkuva et al. (2019) [MM-2]. For LSOT, we minimize

$$\min_{\psi_\theta,\psi_\omega}\int_X \psi_\theta(x)\, dP(x) + \int_Y \psi_\omega(y)\, dQ(y) + \frac{1}{2\epsilon}\int_{X\times Y}\big(\langle x, y\rangle - \psi_\theta(x) - \psi_\omega(y)\big)^2_+\, d(P\times Q)(x, y),$$

where we define $\psi_\theta = \frac{\|x\|^2}{2} - u_\theta(x)$ and $\psi_\omega = \frac{\|y\|^2}{2} - v_\omega(y)$ to make the LSOT notation $(u_\theta, v_\omega)$ match our notation $(\psi_\theta, \psi_\omega)$. We use the L2 regularizer (but not the entropic one) since it empirically works better (as noted in the LSOT paper). The potentials ψ_θ, ψ_ω are NOT restricted to be convex. The transport plan T∘P ≈ Q is recovered via the barycentric projection, see (Seguy et al., 2017, Section 4). We set λ = min(D, 50) for our method and ε = 0.01 for LSOT (chosen empirically). In all the methods, we use DenseICNN[1; D; D, D/2] of Subsection B.2. In LSOT, we do not convexify the networks (do not clamp the weights), i.e. they are "usual" unrestricted neural networks (empirically the best option), as originally implied in the LSOT paper. It follows from Table 1 in Section 5.1 that LSOT leads to a high bias error which grows drastically with the dimension. While theoretically ε → 0 should solve the bias problem, in practice small ε leads to optimization instabilities. Another LSOT drawback is that the estimation of the regularizer requires sampling from the joint measure P × Q on $\mathbb{R}^{2D}$: for the majority of pairs (x, y), the L2 regularizer vanishes (the effect worsens as D → ∞). In contrast to LSOT, our cycle regularizer uses samples only from the marginal measures (Q or P) on $\mathbb{R}^D$. The biasing effect of our method is studied theoretically (Theorems 4.1, 4.2), and the bias is nonexistent when the optimal potentials ψ*, ψ̄* are contained in the approximating function classes.
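The closed-form map can be computed and checked directly; a NumPy sketch (function names are ours) following the formula above:

```python
import numpy as np

def psd_sqrt(M):
    # Symmetric PSD square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_ot_map(cov_p, cov_q):
    """Ground-truth linear OT map x -> T x between N(0, cov_p) and N(0, cov_q):
    T = cov_p^{-1/2} (cov_p^{1/2} cov_q cov_p^{1/2})^{1/2} cov_p^{-1/2}."""
    rp = psd_sqrt(cov_p)
    rp_inv = np.linalg.inv(rp)
    return rp_inv @ psd_sqrt(rp @ cov_q @ rp) @ rp_inv

# T pushes N(0, cov_p) to N(0, cov_q), i.e. T cov_p T^T = cov_q.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
cov_p = A @ A.T + 0.1 * np.eye(4)
cov_q = B @ B.T + 0.1 * np.eye(4)
T = gaussian_ot_map(cov_p, cov_q)
```

Since T is symmetric positive definite, it is the gradient of the convex quadratic ψ*(x) = ⟨x, Tx⟩/2, in line with the cyclically monotone parameterization used throughout the paper.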
W2GN (ours), MM-1 and MM-2 are all capable of computing the maps and distances with low error (L²-UVP < 3% even in $\mathbb{R}^{4096}$). However, as seen from the convergence plots in Figure 12, our approach converges several times faster; this naturally follows from the fact that the MM-1 and MM-2 approaches contain an inner optimization cycle.

C.5 LATENT SPACE OPTIMAL TRANSPORT DETAILS

We follow the pipeline of Figure 13 below. The latent space distribution is constructed by using a convolutional autoencoder to encode CelebA images into 128-dimensional latent vectors. To train the autoencoder, we use a perceptual loss on the features of a pre-trained VGG-16 network.

Figure 13: The pipeline of latent space mass transport.

We use DenseICNN [4; 256; 256, 128, 64] to fit a cyclically monotone generative mapping transforming standard normal noise into the latent space distribution. For each problem, the networks are trained for 100000 iterations with 128 samples in a mini-batch. The Adam optimizer with lr = 3·10^{-4} is used. We put λ = 100 as the cycle regularization parameter. We provide additional examples of generated images in Figure 14. We also visualize the latent space distribution of the autoencoder and the distribution fitted by the generative map in Figure 15. The FID scores presented in Table 2 are computed via a PyTorch implementation of the FID score. As a benchmark score, we add WGAN-QC by Liu et al. (2019). Finally, we emphasize that the cyclically monotone generative mapping that we fit is explicit. Similarly to Normalizing Flows Rezende & Mohamed (2015), and in contrast to other methods such as Lei et al. (2019), it provides a tractable density in the latent space. Since ∇ψ_ω ≈ (∇ψ_θ)^{-1} is differentiable and injective, one may use the change of variables formula for the density,

$$q(y) = \det\big[\nabla^2\psi_\omega(y)\big]\cdot p\big(\nabla\psi_\omega(y)\big),$$

to study the latent space distribution.
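The change of variables formula can be evaluated with automatic differentiation. A sketch for a single point y ∈ R^D (assuming a hypothetical `psi_omega` that maps a tensor of shape (D,) to a scalar; for a strongly convex potential the Hessian is positive definite, so `logdet` is well defined):

```python
import math
import torch

def latent_log_density(psi_omega, y, log_p):
    """log q(y) = log det[Hessian psi_omega(y)] + log p(grad psi_omega(y))."""
    y = y.detach().requires_grad_(True)
    g = torch.autograd.grad(psi_omega(y), y, create_graph=True)[0]
    # Build the Hessian row by row by differentiating each gradient entry.
    H = torch.stack([torch.autograd.grad(g[i], y, retain_graph=True)[0]
                     for i in range(y.numel())])
    return torch.logdet(H) + log_p(g)

# Toy check: psi_omega(y) = ||y||^2 has gradient 2y and Hessian 2I.
log_p = lambda x: -0.5 * (x ** 2).sum() - 0.5 * x.numel() * math.log(2 * math.pi)
y = torch.tensor([0.3, -0.1, 0.2])
logq = latent_log_density(lambda t: (t ** 2).sum(), y, log_p)
```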

C.6 COLOR TRANSFER

The problem of color transfer between images is to map the color palette of one image into the palette of the other in order to make it look and "feel" similar to the original. Optimal transport can be applied to color transfer, but it is sensitive to noise and outliers. To avoid these problems, several relaxations were proposed Rabin et al. (2014); Paty et al. (2019). These approaches solve a discrete version of the Wasserstein-2 OT problem. The computation of the optimal transport cost for large images is barely feasible or infeasible at all due to the extreme size of the color palettes. Thus, a reduction of the pixel color palette by k-means clustering is usually performed to make the OT computation feasible, yet such a reduction may lose color information. Our algorithm uses mini-batch stochastic optimization and thus has no limitations on the size of the color palettes. During training, we sequentially input mini-batches of the images' pixels (∈ R³) into potential networks with the DenseICNN [3; 128; 128, 64] architecture. The networks are trained for 5000 iterations with 1024 pixels in a mini-batch. The Adam optimizer with lr = 10^{-3} is used. We put λ = 3 as the cycle regularization parameter and impose an extra 10^{-10} L₁-penalty on the weights. The color transfer results for ≈10-megapixel images are presented in Figure 16a; the corresponding color palettes are given in Figure 16b. Additional examples of color transfer are given in Figure 17.
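Applying the fitted map to a full palette is then a matter of batching the pixels. A sketch (`grad_psi` stands in for the fitted generator ∇ψ_θ, assumed to map a batch of RGB colors of shape (B, 3) to (B, 3)):

```python
import torch

def transfer_colors(img, grad_psi, batch=4096):
    """Push every pixel (a point in R^3) of `img` (H, W, 3, values in [0, 1])
    through the generator; mini-batching removes any palette-size limit."""
    palette = img.reshape(-1, 3)
    mapped = torch.cat([grad_psi(p) for p in palette.split(batch)])
    return mapped.clamp(0.0, 1.0).reshape(img.shape)
```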

C.7 DOMAIN ADAPTATION

The domain adaptation problem is to learn a model f (e.g. a classifier) from a source distribution Q that has to perform well on a different (but related) target distribution P. Most of the OT-based methods solve domain adaptation explicitly by transforming distribution P into Q and then applying the model f to the generated samples. In some cases, the mapping g: X → Y (which transforms P to Q) is obtained by solving a discrete OT problem Courty et al.

[Footnote: Images may have unequal sizes, yet they are assumed to have the same number of channels, e.g. RGB ones.]
[Footnote: Since our model is parametric, the complexity of the fitted generative mapping g†: R³ → R³ between the palettes depends on the size of the potential networks.]

We address the unsupervised domain adaptation problem, which is the most difficult variant of this task: labels are available only in the source domain, so we do not use any label information. Our method trains g as the gradient of a convex function, and the mapping can be applied to new arriving samples which are not present in the training set. We test our model on the MNIST (≈60000 images; 28 × 28) and USPS (≈10000 images; rescaled to 28 × 28) digit datasets and perform USPS → MNIST domain adaptation. To do this, we train a LeNet classifier h on MNIST to ≥99% accuracy. Then, we apply h to both datasets and extract the 84 features of its last layer. Thus, we form the distributions Q (features for MNIST) and P (features for USPS). To fit a cycle monotone domain adaptation mapping, we use DenseICNN [32; 128; 128, 128] potentials. We train our model on mini-batches of 64 samples for 10000 iterations with cycle regularization λ = 1000. We use the Adam optimizer with lr = 10^{-4} and impose a 10^{-7} L₁-penalty on the weights of the networks. Similarly to Seguy et al. (2017), we compare the accuracy of a 1-NN MNIST classifier f applied to the features x ~ P of USPS with the same classifier applied to the mapped features g†(x).
1-NN is chosen as the classification model in order to eliminate any influence of the base classification model on the domain adaptation and to directly estimate the effect provided by our cycle monotone map. The results of the experiment are presented in Table 3. Since the domain adaptation quality highly depends on the quality of the extracted features, we repeat the experiment 3 times, i.e. we train 3 LeNet MNIST classifiers for feature extraction, and report the mean and deviation of the results. For benchmarking purposes, we also add the score of the 1-NN classifier applied to the features of USPS transported to MNIST features by discrete optimal transport, which can be considered the "most straightforward" optimal transport map. Our reported scores are comparable to those reported by Seguy et al. (2017). We did not reproduce their experiments since Seguy et al. (2017) does not provide the source code for domain adaptation; thus, we refer the reader directly to the paper's reported scores (Table 1 of Seguy et al. (2017), first column with scores). For visualization purposes, we plot the two main components of the PCA decomposition of the feature spaces (for one of the conducted experiments) in Figure 18: MNIST features, USPS features mapped by our method, and original USPS features.
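The evaluation protocol reduces to a 1-NN lookup in the source feature space; a minimal sketch:

```python
import torch

def one_nn_accuracy(src_feats, src_labels, tgt_feats, tgt_labels):
    """1-NN classifier fit on source (MNIST) features and applied to (mapped)
    target (USPS) features; isolates the effect of the transport map itself."""
    dists = torch.cdist(tgt_feats, src_feats)  # pairwise Euclidean distances
    pred = src_labels[dists.argmin(dim=1)]
    return (pred == tgt_labels).float().mean().item()
```

Accuracy is then compared between the raw target features and the features mapped by g†.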

C.8 IMAGE-TO-IMAGE STYLE TRANSFER

We experiment with ConvICNN potentials on the publicly available Winter2Summer and Photo2Cezanne datasets containing 256 × 256 pixel images. We train our model on mini-batches of 8 randomly cropped 128 × 128 pixel RGB image parts. As additional augmentation, we use random rotations (±π/18), random horizontal flips, and the addition of small Gaussian noise (σ = 0.01). The networks are trained for 20000 iterations with cycle regularization λ = 35000. We use the Adam optimizer and impose an additional 10^{-1} L₁-penalty on the weights of the networks. Our scheme of style transfer between datasets is presented in Figure 19. We provide additional results for the Winter2Summer and Photo2Monet datasets in Figures 20b and 20a respectively. In all cases, our networks change colors but preserve the structure of the image. In none of the results did we note that the model removes large snow masses (for the winter-to-summer transform) or covers green trees with white snow (for summer-to-winter). We do not know the exact explanation for this behavior, but we suppose that the desired image manipulation simply may not be cycle monotone. For completeness of the exposition, we provide some results for cases where our model does not perform well (Figure 21). In Figure 21a the model simply increases the green color component, while in Figure 21b it decreases this component. Although in many cases this is actually enough to transform winter to summer (or vice versa), sometimes more advanced manipulations are required. In the described experiments, we applied our method directly to the original images without any specific preprocessing or feature extraction. The model captures some of the required attributes to transfer, but sometimes it does not produce the expected results. To fix this issue, one may consider OT for the quadratic cost defined on features extracted from the image or on embeddings of images.



Footnotes:
- We consider only monotone increasing mappings. Decreasing mappings have analogous properties.
- Commonly, in OT it is assumed that dim X = dim Y.
- In practice, the continuity condition can be assumed to hold true. Indeed, widely used heuristics, such as adding small Gaussian noise to data Sønderby et al. (2016), make the considered distributions continuous.
- For example, in the case of the identity map g(x) = ∇ψ(x) = x, we have quadratic growth: ψ(x) = ‖x‖²/2 + c.
- The code is written in the PyTorch framework and is publicly available at https://github.com/iamalexkorotin/Wasserstein2GenerativeNetworks.
- From the practical point of view, smoothing is equal to adding random noise distributed according to Λ to the samples from P, Q respectively.
- Unlike activations within the convexity preserving block, a convex activation between the L and CP blocks may not be monotone, e.g. σ(x) = x² can be used as an activation.
- It is possible to insert batch norm and dropout into the L and CP blocks as well as between them. These layers do not affect convexity since they can be considered (during inference) as linear layers with non-negative weights.
- Imposing smoothness & strong convexity can be viewed as a regularization of the mapping: it does not perform too large/small warps of the input. See e.g. Paty et al. (2019).
- We also tried to use softplus, an exponent on the weights, and regularization instead of clipping, but none of these worked well.
- The term ⟨∇ψ_ω(y), y⟩ becomes redundant for the optimization, yet it remains useful for monitoring the convergence.
- https://github.com/mseitzer/pytorch-fid
- In contrast to our method, it cannot be directly applied to out-of-train-sample examples. Moreover, its computation is infeasible for large datasets.
- https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix



(a) An Arbitrary Mapping. (b) The Monotone Mapping.

Figure 1: Two possible generative mappings that transform distribution P to distribution Q.

Methods by Taghvaei & Jalali (2019) [MM-1] and by Makkuva et al. (2019) [MM-2].

Comparison of L2-UVP (%) for the LSOT, MM-1, MM-2 and W2GN (ours) methods in dimensions D = 2, 4, ..., 2^12.
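The L2-UVP metric used in this comparison can be estimated from samples. A sketch under the usual definition (unexplained variance percentage: the squared L2(P) error of the fitted map relative to the variance of Q; the function and argument names here are ours):

```python
def l2_uvp(T_hat, T_star, xs, var_q):
    """L2-unexplained variance percentage: 100 * ||T_hat - T*||^2_{L2(P)} / Var(Q),
    estimated over a sample xs drawn from P."""
    mse = sum(sum((a - b) ** 2 for a, b in zip(T_hat(x), T_star(x)))
              for x in xs) / len(xs)
    return 100.0 * mse / var_q

# toy 2-D check: identity map vs the true map shifted by (1, 0)
xs = [(0.0, 0.0), (1.0, 1.0)]
score = l2_uvp(lambda x: x, lambda x: (x[0] + 1.0, x[1]), xs, var_q=4.0)
# score = 100 * 1.0 / 4.0 = 25.0 (%)
```

A perfect fit gives 0%; a map no better than predicting the mean of Q gives about 100%.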

Figure 2: Comparison of convergence of W2GN (ours), MM-1, MM-2 approaches.

The principal model for solving this problem is CycleGAN by Zhu et al. (2017). It uses 4 networks and optimizes a minimax objective to train the model; our method uses 2 networks and has a non-minimax objective. We experiment with ConvICNN potentials on the publicly available Winter2Summer and Photo2Cezanne datasets containing 256 × 256 pixel images. We train our model on mini-batches of randomly cropped 128 × 128 pixel RGB image parts. The results on the Winter2Summer and Photo2Cezanne datasets applied to random 128 × 128 crops are shown in Figure 4.

(a) Winter2Summer dataset results. (b) Photo2Cezanne dataset results.

Figure 4: Results of image-to-image style transfer by ConvICNN, 128 × 128 pixel images.

In the transition to line (22), we use the change-of-variables formula P = ∇ψ* • Q. In line (23), we use the β†-strong convexity of the function ψ†, and then add a zero term in line (24). Next, for simplicity, we replace the integral notation with the L2(Y → R^D, Q) notation starting from line (25).
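The strong-convexity step in line (23) relies on the standard first-order characterization; as a sketch (in the notation of this proof):

```latex
% \beta^{\dagger}-strong convexity of \psi^{\dagger} gives, for all y, y' \in \mathcal{Y}:
\psi^{\dagger}(y') \;\ge\; \psi^{\dagger}(y)
  + \langle \nabla\psi^{\dagger}(y),\, y' - y \rangle
  + \frac{\beta^{\dagger}}{2}\,\| y' - y \|^{2}.
```

Rearranging this inequality is what produces the quadratic lower bound used in the derivation.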

Theorem A.5 (De-smoothing Wasserstein-2 Property). Let P, Q be two probability distributions on X = Y = R^D with finite second moments. Let Λ = N(0, σ²I_D) be Gaussian white noise. Let P ∗ Λ and Q ∗ Λ be versions of P and Q smoothed by Λ. Let T : X → Y be an L-Lipschitz measurable map satisfying W2(T • [P ∗ Λ], [Q ∗ Λ]) ≤ √ε. Then the following inequality holds true:
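In practice, drawing from the smoothed distributions P ∗ Λ and Q ∗ Λ with Λ = N(0, σ²I_D) reduces to adding independent Gaussian noise to the samples, as noted in the footnotes. A minimal sketch (the function name is ours):

```python
import random

def sample_smoothed(samples, sigma):
    """Draw from P * Lambda with Lambda = N(0, sigma^2 I): add independent
    Gaussian noise to each coordinate of each sample from P."""
    return [tuple(x_i + random.gauss(0.0, sigma) for x_i in x)
            for x in samples]

points = [(0.0, 0.0), (1.0, 2.0)]         # samples from P
smoothed = sample_smoothed(points, sigma=0.1)  # samples from P * Lambda
```

The theorem then bounds how close T stays to an OT map between the original (unsmoothed) P and Q when it transports the smoothed distributions accurately.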

Figure 5: General architecture of an Input Convex Neural Network.
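The recursion depicted by the ICNN architecture can be sketched as follows: each hidden state combines a non-negatively weighted transform of the previous state with an unconstrained transform of the input, followed by a convex non-decreasing activation. This is a toy pure-Python version (names and the specific tiny weights are ours):

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def icnn_forward(x, Wx, Wz):
    """Toy dense ICNN: z_1 = relu(Wx_0 x), z_{k+1} = relu(Wz_k z_k + Wx_k x).
    Convexity of the output in x holds as long as every Wz_k is entrywise
    non-negative; the input weights Wx_k may be arbitrary."""
    z = relu(matvec(Wx[0], x))
    for Wz_k, Wx_k in zip(Wz, Wx[1:]):
        assert all(w >= 0.0 for row in Wz_k for w in row)
        z = relu(vadd(matvec(Wz_k, z), matvec(Wx_k, x)))
    return sum(z)  # scalar potential psi(x)

Wx = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, -0.5]]]
Wz = [[[1.0, 1.0]]]
psi = icnn_forward([2.0, 1.0], Wx, Wz)  # psi([2, 1]) = 3.5
```

The generative map itself is the gradient ∇ψ of this scalar potential, obtained by automatic differentiation in the actual implementation.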

Figure 6: Dense Input Convex Neural Network.

Figure 7: Convolutional Input Convex Neural Network. The linear (L) and convexity-preserving (CP) blocks are successive, and no skip connections are used. Block L consists of two separate parts with stacked convolutions without an intermediate activation; the square of the second part is added to the first part and used as the input to the CP block. All convolutional layers of the network have 128 channels (zero-padding with offset 1 is used).
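The "square of the second part added to the first part" construction in the L block is what makes its output convex in the input: a linear map is convex, and the square of a linear map is convex, so their sum is convex. A 1-D toy analogue (weights and names are ours, for illustration only):

```python
def l_block(x, w1, w2):
    """Toy 1-D analogue of the L block: two linear parts, with the square
    of the second added to the first. x -> w1*x + (w2*x)**2 is convex in x,
    so it is a valid input to the convexity-preserving (CP) block."""
    return w1 * x + (w2 * x) ** 2

y = l_block(2.0, w1=0.5, w2=1.0)  # 0.5*2 + (1.0*2)^2 = 5.0
```

In the full network, w1 and w2 are stacks of 128-channel convolutions applied to the image.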

(a) Initial distributions P and Q. (b) Generated distributions after 200 iterations. (c) Generated distributions after 2000 iterations. (d) Generated distributions after 30000 iterations.

Figure 8: Convergence stages of our algorithm applied to fitting cycle monotone mappings (forward and inverse) between distributions P (Gaussian) and Q (Mixture of 8 Gaussians).

Figure 9: Mixture of 100 Gaussians Q and distribution ∇ψ θ • P fitted by our algorithm.

(a) Mixture of 49 Gaussians Q and distribution ∇ψ θ • P ≈ Q fitted by our algorithm. (b) Swiss Roll distribution Q and distribution ∇ψ θ • P ≈ Q fitted by our algorithm.

Figure 10: Toy distributions fitted by our algorithm.

Figure 11: An example of a "torn" generative mapping to a Swiss Roll by a gradient of ICNN with ReLU activations.

Taghvaei & Jalali (2019) [MM-1], Makkuva et al. (2019) [MM-2]. We recall the details of LSOT's regularized optimization of OT maps and distances. The objective is given by min

Figure 12: Comparison of convergence speed of W2GN, MM-1 and MM-2 approaches in dimensions D = 64, 256, 1024, 4096.

Figure 14: Images decoded from standard Gaussian latent noise (1st row) and decoded from the same noise transferred by our cycle monotone map (2nd row).

Figure 15: A pair of main principal components of the CelebA autoencoder's latent space. From left to right: Z ∼ N(0, I) [blue], Z mapped by W2GN [red], the true autoencoder latent space [green]. The PCA decomposition is fitted on the autoencoder's latent space [green].

(a) Original images (on the left) and images obtained by color transfer (on the right). The sizes of images are 3300 × 4856 (first) and 2835 × 4289 (second). (b) Color palettes (3000 random pixels, best viewed in color) for the original images (on the left) and for images with transferred color (on the right).

Figure 16: Results of Color Transfer between high resolution images (≈ 10 megapixel) by a pixelwise cycle monotone mapping.

(a) Original images (on the left) and images obtained by color transfer (on the right). (b) Color palettes (3000 random pixels, best viewed in color) for the original images (on the left) and for images with transferred color (on the right).

Figure 17: Results of Color Transfer between images by a pixel-wise cycle monotone mapping.
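Pixel-wise color transfer treats each image as a point cloud of RGB values and applies the fitted cycle monotone map independently to every pixel. Applying such a map can be sketched as follows (the function name and the toy channel-swap map are ours; in the actual method, g would be the gradient of a trained ICNN potential):

```python
def transfer_colors(image, g):
    """Apply a fitted pixel-wise map g: R^3 -> R^3 to every RGB pixel.
    `image` is a list of rows of (r, g, b) tuples; g is any callable."""
    return [[g(pixel) for pixel in row] for row in image]

# toy stand-in map: swap the red and blue channels
img = [[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]]
out = transfer_colors(img, lambda p: (p[2], p[1], p[0]))
```

Because the map acts per pixel, it scales to arbitrarily large images (such as the ≈ 10 megapixel examples in Figure 16) with memory cost linear in the number of pixels.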

Figure 18: A pair of main principal components of the feature spaces. From left to right: MNIST feature space, USPS features mapped by W2GN, original USPS feature space. The PCA decomposition is fitted on MNIST features. Best viewed in color (different colors represent different classes of digits 0-9).

Figure 19: Schematically presented image-to-image style transfer by a pair of ConvICNN fitted by our method.

(a) Results of cycle monotone image-to-image style transfer by ConvICNN on Photo2Cezanne dataset, 128 × 128 pixel images. (b) Results of cycle monotone image-to-image style transfer by ConvICNN on Winter2Summer dataset, 128 × 128 pixel images.

Figure 20: Additional results of image-to-image style transfer on Winter2Summer and Photo2Monet datasets.

1-NN classification accuracy on USPS → MNIST domain adaptation problem.
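The 1-NN evaluation protocol assigns each (mapped) target feature the label of its nearest source feature. A minimal sketch of the metric (the function name and toy data are ours):

```python
def one_nn_accuracy(train, test):
    """1-nearest-neighbour accuracy: each test feature receives the label
    of its closest (squared Euclidean) training feature.
    train/test are lists of (feature_tuple, label) pairs."""
    def nearest_label(x):
        return min(train, key=lambda t: sum((a - b) ** 2
                                            for a, b in zip(t[0], x)))[1]
    hits = sum(nearest_label(x) == y for x, y in test)
    return hits / len(test)

train = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
test = [((0.1, 0.0), 0), ((0.9, 1.0), 1), ((0.6, 0.6), 0)]
acc = one_nn_accuracy(train, test)  # 2 of 3 test points classified correctly
```

In the domain adaptation experiment, `train` would hold labeled MNIST features and `test` the USPS features after being transported by the fitted W2GN map.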

ACKNOWLEDGEMENTS

The work was partially supported by the Russian Foundation for Basic Research grant 21-51-12005 NNIO_a.

APPENDIX

the domain adaptation in Subsection C.7 or the latent space mass transport in Subsection 5.2). This remains a challenge for our future research.

