OTCOP: LEARNING OPTIMAL TRANSPORT MAP VIA CONSTRAINT OPTIMIZATION

Abstract

The approximation power of neural networks makes them an ideal tool for learning optimal transport maps. However, existing methods are mostly based on the Kantorovich duality and require regularizations and/or special network structures. In this paper, we propose a direct constraint optimization algorithm for the computation of optimal transport maps based on the Monge formulation. We solve this constraint optimization problem using three different methods: the standard Lagrangian multiplier method, the augmented Lagrangian method, and the alternating direction method of multipliers (ADMM). We demonstrate the high accuracy of the learned optimal transport maps on high dimensional benchmarks. Moreover, we show that our methods reduce regularization effects and accurately learn the target distributions at a lower transport cost.

1. INTRODUCTION

There has been great interest in applying modern machine learning techniques to finding optimal transport maps between two distributions. Different from traditional computational methods that solve PDEs for optimal transport maps (Benamou & Brenier (2000); Angenent et al. (2003); Li et al. (2018)), modern machine learning techniques aim to solve the problem directly by optimization. The Sinkhorn distance method (Cuturi (2013); Peyré et al. (2019)) and the regularized OT dual (Seguy et al. (2017)) have been used to find large scale optimal transport maps between discrete probability distributions and to train generative networks (Genevay et al. (2018); Sanjabi et al. (2018)). A geometric treatment is provided in Gu et al. (2013). The Input Convex Neural Network (ICNN) is used to construct a convex Brenier potential for finding optimal transport maps between continuous distributions (Makkuva et al. (2020)) and has recently been used in population dynamics (Bunne et al. (2022)), combining the ICNN and Sinkhorn distance methods (Amos et al. (2022)). Despite these successes, most methods are based on the duality formulation and avoid a direct treatment of the Monge problem. In this paper, we focus on the direct solution of the Monge problem. The Monge problem (Monge (1781)) directly seeks the optimal transport map and is a nonlinear constraint optimization problem. The major numerical difficulty is that the problem is nonlinear and includes the constraint that the push-forward distribution equals the target distribution, which is difficult to implement. Therefore, most optimal transport algorithms avoid solving the Monge problem directly and instead use the Kantorovich duality (Kantorovich (1942)), for which the objective function is linear and, for the quadratic cost, the transport map is obtained by taking the gradient of the Brenier potential.
However, these two problems are not always identical (Villani (2009)), and it is desirable to find a direct approach to the Monge problem. The Monge problem has been solved numerically using optimization-based methods with polynomial approximations. For example, a Lagrangian penalty method was used to find optimal transport maps approximated by polynomials for Bayesian inference (El Moselhy & Marzouk (2012)), and space discretization was used in Haber et al. (2010) to calculate the Jacobian matrix of the transport maps and transfer the optimization to finite dimensional spaces. However, these approaches are limited to low dimensions, as the number of grid points grows exponentially with the dimension. Considering the success of deep neural networks in approximating high dimensional data, the integration of classical constraint optimization methods and neural networks holds promise. One successful application of optimal transport theory to deep learning is the Wasserstein Generative Adversarial Network (WGAN) (Arjovsky et al. (2017)). However, WGAN only uses the optimal transport distance as a loss function and does not aim to find the optimal transport map. It is desirable to study whether the transport cost of the map learned by WGAN or other networks can be lowered using an algorithm for finding optimal transport maps. This paper presents a new approach for finding optimal transport maps between two continuous distributions. We make the following contributions:
• We integrate three constraint optimization algorithms, the standard Lagrangian (SL), the augmented Lagrangian method (AL) and the alternating direction method of multipliers (ADMM), with neural networks to solve the Monge problem of optimal transport with provable guarantees (Theorems 1-3).
• We show that our method is able to find an accurate optimal transport map between Gaussian distributions, both theoretically (Theorem 2) and experimentally. Moreover, we apply our method to WGAN and show that our method can find a generative map with lower transport cost without sacrificing the quality of the outputs.
• We compare the three algorithms and find that the SL algorithm introduces errors but is simple and easy to implement, while the AL and ADMM algorithms can find exact results and are more robust, and ADMM gives a lower transport cost in general.

Notations. We write α_d = (α, ..., α) ∈ R^d, and α_{d×d} for the constant d × d matrix with entries α. The transport cost of a map T, which pushes the distribution µ to ν, is defined to be E_{x∼µ}[|x − Tx|²].

2.1. THE MONGE PROBLEM

Let (X, µ), (Y, ν) be two separable metric probability spaces. The Monge problem is to find a transport map T : X → Y that realizes the infimum

inf_{T : T_#µ = ν} ∫_X c(x, Tx) dµ(x),  (1)

where T_#µ denotes the push-forward of µ and c : X × Y → R_+ is a lower semicontinuous Borel measurable cost function. In this paper we simply take the quadratic cost c(x, y) = |x − y|², but our method applies to other cost functions. The existence of solutions to the Monge problem is a difficult question and does not always hold. However, under suitable conditions, for example for continuous distributions without atoms, existence and uniqueness are guaranteed (see, for example, (Villani, 2009, Theorem 5.30)). Therefore, we focus here on learning transport maps between continuous distributions. For discrete distributions, one can apply dequantization techniques to transform them into continuous distributions (Ho et al. (2019)).
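In one dimension the Monge map for the quadratic cost has a closed form, the monotone rearrangement T = F_ν^{-1} ∘ F_µ, which gives a quick sanity check of definition (1). The sketch below (a hypothetical example, not from the paper) pushes N(0, 1) onto N(2, 4) with T(x) = 2 + 2x and compares the Monte Carlo transport cost with the Gaussian W_2² value (m_1 − m_2)² + (σ_1 − σ_2)² = 5:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)      # samples from mu = N(0, 1)

# Monotone rearrangement F_nu^{-1}(F_mu(x)) for mu = N(0, 1), nu = N(2, 4)
T = lambda s: 2.0 + 2.0 * s

mc_cost = np.mean((x - T(x)) ** 2)    # Monte Carlo estimate of E|x - Tx|^2
exact = (0.0 - 2.0) ** 2 + (1.0 - 2.0) ** 2   # Gaussian W2^2 formula = 5
print(mc_cost, exact)
```

The same check applies in higher dimensions with the Gaussian formulas of Section 2.3.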

2.2. THE MONGE PROBLEM AS CONSTRAINT OPTIMIZATION

In order to solve the Monge problem, we use a generative network T_θ with parameter set θ, which takes random samples from the distribution µ as input and generates samples representing the target distribution ν. As can be seen from the definition, the Monge problem is a constraint optimization problem, but the constraint T_#µ = ν is highly nonlinear. In order to impose this constraint, we take d(·|·) to be a distance function (such as the Wasserstein distance, the MMD (Gretton et al. (2012)) or an IPM (Müller (1997))) or a probability divergence (such as the Kullback-Leibler (KL) divergence). The constraint optimization problem reads

min_θ E_{x∼µ}|x − T_θx|², s.t. d(T_{θ#}µ|ν) = 0.  (2)

The objective of this paper is to solve the above problem using techniques from constraint optimization theory (Bertsekas (2014)). Since a neural network may not fully realize the target distribution ν, the above problem can be relaxed to

min_θ E_{x∼µ}|x − T_θx|², s.t. d(T_{θ#}µ|ν) ≤ α.  (3)

As α goes to zero, we can prove that the solution T_{θ_α} to problem (3) converges to the solution of the original Monge problem (1). The following theorem holds:

Theorem 1. Let µ, ν be two absolutely continuous probability measures on R^d with finite second moments. Let T_θ be given by a neural network with bounded width (each layer with at least 2d + 2 neurons), arbitrary depth, and non-affine activation functions. Suppose for any α > 0 there exists a solution θ*_α to problem (3); then as α → 0, T_{θ*_α} → T, where T is a solution of the Monge problem (1). Moreover, sup_{x∈X} |T_{θ*_α}(x) − T(x)| ≤ Cα for some constant C on any compact subset X ⊂ R^d.

The proof of the theorem follows from the universal approximation theorem (Kidger & Lyons (2020)) and the existence theorem for the Monge problem (Villani (2009)), and is given in Appendix A.1.

2.3. EXAMPLE: THE MONGE PROBLEM FROM GAUSSIAN TO GAUSSIAN

For the case when µ = N(X_1, Σ_1) and ν = N(X_2, Σ_2) are two multivariate normal distributions, the optimal transport map is unique and is explicitly given by T* : x → X_2 + A*(x − X_1) with A* = Σ_1^{-1/2}(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{-1/2} (Olkin & Pukelsheim (1982)). Taking d = D_KL to be the KL divergence, we prove that the solution to problem (2) with T_θx = Ax + b (θ = {A ∈ R^{d×d}, b ∈ R^d}) is the optimal transport map T* (Theorem 2 in Appendix A.2).
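The closed-form map can be verified numerically. A minimal sketch (with hypothetical covariances, chosen only for illustration) builds A* via matrix square roots and checks the push-forward condition A* Σ_1 A*^T = Σ_2:

```python
import numpy as np
from scipy.linalg import sqrtm

S1 = np.array([[2.0, 0.5], [0.5, 1.0]])   # source covariance (hypothetical)
S2 = np.array([[4.0, 1.0], [1.0, 4.0]])   # target covariance (hypothetical)

S1h = np.real(sqrtm(S1))                  # Sigma_1^{1/2}
S1hi = np.linalg.inv(S1h)                 # Sigma_1^{-1/2}
# A* = Sigma_1^{-1/2} (Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2} Sigma_1^{-1/2}
A = S1hi @ np.real(sqrtm(S1h @ S2 @ S1h)) @ S1hi

# x ~ N(X1, S1) implies A x + b ~ N(A X1 + b, A S1 A^T);
# the map pushes mu onto nu exactly when A S1 A^T = S2
print(np.allclose(A @ S1 @ A.T, S2, atol=1e-8))
```

This is the formula our experiments in Section 4.1 compare against.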

3. CONSTRAINT OPTIMIZATION FOR OPTIMAL TRANSPORT

We propose to leverage three different algorithms to solve the constraint problem (2).

3.1. PENALTY METHOD (OTCOP-P)

Standard Lagrangian (SL). We introduce a Lagrange multiplier λ and take the Lagrangian function

L_SL(θ, λ) = E_{x∼µ}|x − T_θx|² + λ d(T_{θ#}µ|ν).  (4)

The solution to problem (2) is then a saddle point of the above Lagrangian. By duality theory, for each α ≥ 0, problem (3) corresponds to the dual problem min_θ L_SL(θ, λ) for some λ ∈ [0, ∞]. Hence we can take a suitable λ to solve problem (2) approximately. According to Brenier's polar factorization theorem (Brenier (1991)), the optimal transport map should satisfy ∇ × T = 0, so an additional term |∇ × T|² could be added to the Lagrangian to impose this condition. We show experimentally that this condition is almost satisfied even without such a term, and we do not include it in our implementations.

Quadratic penalty (QP). Instead of the term λ d(T_{θ#}µ|ν), we can take a quadratic penalty loss

L_QP(θ, ρ) = E_{x∼µ}|x − T_θx|² + (ρ/2)(d(T_{θ#}µ|ν))².  (5)

As ρ goes to infinity, constraint violations are penalized with increasing severity. For example, we can multiply ρ_k at the k-th training step by a constant bigger than 1 and update the parameter θ by gradient descent, treating ρ as a constant.

Convergence. Suppose there exists a global minimizer of problem (2), θ_k is the exact minimizer of L_QP(θ, ρ_k), and ρ_k ↑ ∞. Then any limit point of the sequence {θ_k} is a solution to problem (2). Moreover, for any ε > 0, there exists a sufficiently large K > 0 such that |θ_k − θ*| ≤ ε for k ≥ K (see (Nocedal & Wright, 1999, Theorem 17.1)). In addition, without assuming a global minimizer, for a sequence θ_k such that ∇_θ L_QP(θ_k; ρ_k) → 0, all its limit points θ* satisfy the Karush-Kuhn-Tucker (KKT) conditions, and there exists a subsequence such that lim_{k→∞} ρ_k d(T_{θ_k#}µ|ν) = λ*, where λ* is the multiplier that satisfies the KKT condition (see Appendix A.4 for the KKT conditions and (Nocedal & Wright, 1999, Theorem 17.2) for the proof).
Advantages and disadvantages. The penalty method is simple and easy to implement. However, since the optimal value of the Lagrange multiplier λ is unknown (SL), or the optimal penalty parameter ρ is infinite (QP), the penalty always introduces errors and the exact solution to problem (2) cannot be reached. Moreover, the Hessian ∇²_{θθ} L_QP becomes singular as ρ goes to infinity, which causes ill-conditioning. These issues can be resolved by the methods below, at the expense of higher computational cost.
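The penalty bias can be seen on the two-dimensional linear example analyzed in Section 4.1 (µ = N(0, I_2), ν = N(0, [[4,1],[1,4]]), T_θx = Ax with A = [[a,b],[b,a]]), where both the transport cost and D_KL have closed forms. The sketch below (using a scipy solver in place of gradient descent, an implementation choice not from the paper) increases ρ geometrically; the iterate approaches the optimal parameters ((√5+√3)/2, (√5−√3)/2) ≈ (1.984, 0.252) only as ρ grows, illustrating the penalty error discussed above:

```python
import numpy as np
from scipy.optimize import minimize

def cost(t):                      # transport cost 2(1-a)^2 + 2b^2
    a, b = t
    return 2 * (1 - a) ** 2 + 2 * b ** 2

def kl(t):                        # closed-form D_KL(T_# mu | nu); needs a^2 != b^2
    a, b = t
    return (8*a*a - 4*a*b + 8*b*b - 30
            - 15 * np.log((a*a - b*b) ** 2 / 15.0)) / 30.0

opts = {"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000}
theta = np.array([1.5, 0.5])
for rho in [1e0, 1e2, 1e4, 1e6, 1e8]:   # penalize violations ever more severely
    theta = minimize(lambda t: cost(t) + 0.5 * rho * kl(t) ** 2,
                     theta, method="Nelder-Mead", options=opts).x
print(theta, kl(theta))   # theta near (1.984, 0.252), kl nearly 0
```

Because the penalty is quadratic in a divergence that is itself quadratic near the constraint set, the infeasibility decays slowly in ρ, consistent with the errors noted above.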

3.2. THE AUGMENTED LAGRANGIAN METHOD (OTCOP-AL)

In order to overcome the above issues, we can use the augmented Lagrangian method, taking the loss function

L_AL(θ, λ, ρ) = E_{x∼µ}|x − T_θx|² + λ d(T_{θ#}µ|ν) + (ρ/2)(d(T_{θ#}µ|ν))².  (6)

This function combines the standard Lagrangian penalty (4) and the quadratic penalty (5). At the k-th iteration, we fix λ_k, ρ_k and solve θ_k = arg min_θ L_AL(θ, λ_k, ρ_k). After the minimization, we update λ by λ_{k+1} = λ_k + ρ_k d(T_{θ_k#}µ|ν). Comparing the KKT conditions of the SL and the AL (see Appendix A.4) implies λ_k + ρ_k d(T_{θ_k#}µ|ν) ≈ λ* when λ_k is close to λ*. Hence d(T_{θ_k#}µ|ν) ≈ (λ* − λ_k)/ρ_k. Compared to the quadratic penalty method, for which d(T_{θ_k#}µ|ν) ≈ λ*/ρ_k, the infeasibility of θ_k is much smaller. Moreover, for a suitable choice of ρ, a local solution of (2) is a strict local minimizer of L_AL(θ, λ, ρ) ((Nocedal & Wright, 1999, Chapter 17)).

Convergence. One of the nice properties of the AL method is that for the exact Lagrange multiplier λ*, the solution θ* of problem (2) is a strict minimizer of L_AL(θ, λ*, ρ) for all ρ sufficiently large. The existence of a threshold ρ̄ is proved under the condition that ∇²_θ L_SL(θ*, λ*) is locally strictly positive definite ((Nocedal & Wright, 1999, Theorem 17.6)). Thus we can increase ρ at each minimization step, and once ρ exceeds the threshold ρ̄, gradient descent can find the local minimizer around θ*.

Advantages and disadvantages. The AL method introduces multiplier estimates and reduces the likelihood that large values of ρ will be needed to obtain good feasibility and accuracy. The method is also simple and easy to implement. However, since this is a min-max method, training may experience oscillations and slower convergence rates.
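A minimal sketch of the AL update on the same two-dimensional linear Gaussian example of Section 4.1. To keep the multiplier well defined in this toy sketch, we replace the scalar divergence by the vector of covariance residuals c(θ) = (a²+b²−4, 2ab−1), whose vanishing is equivalent to T_{θ#}µ = ν here (a simplification of the paper's setting, which penalizes a divergence d); a scipy solver stands in for gradient descent:

```python
import numpy as np
from scipy.optimize import minimize

def cost(t):                      # transport cost 2(1-a)^2 + 2b^2
    a, b = t
    return 2 * (1 - a) ** 2 + 2 * b ** 2

def c(t):                         # push-forward residuals: A A^T = Sigma_2
    a, b = t
    return np.array([a*a + b*b - 4.0, 2*a*b - 1.0])

opts = {"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000}
theta, lam, rho = np.array([1.5, 0.5]), np.zeros(2), 10.0
for _ in range(20):
    theta = minimize(lambda t: cost(t) + lam @ c(t) + 0.5 * rho * c(t) @ c(t),
                     theta, method="Nelder-Mead", options=opts).x
    lam = lam + rho * c(theta)    # multiplier update: lambda <- lambda + rho * c
print(theta, c(theta))            # feasible and close to (1.984, 0.252)
```

With the multiplier estimate, the constraint is met to high accuracy at a moderate, fixed ρ, in contrast to the quadratic penalty, which needs ρ → ∞.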

3.3. THE ADMM METHOD (OTCOP-ADMM)

The ADMM method blends decomposition techniques with the AL method and provides an efficient way to solve constraint optimizations (Boyd et al. (2011)). Let S be the set S = {T_θ : T_{θ#}µ = ν}. Problem (2) can be rewritten in the form

min_θ E_{x∼µ}|x − T_θx|² + 1_S(T_θ),

where 1_S is the indicator function that equals 0 if T_θ ∈ S and ∞ if T_θ ∉ S. In order to apply the ADMM method, we rewrite the above problem in the form

min_{θ_1, θ_2} E_{x∼µ}|x − T_{θ_1}x|² + 1_S(T_{θ_2}), s.t. T_{θ_1} = T_{θ_2}.  (8)

In the ADMM method, we alternately update θ_1 and θ_2: first we hold θ_2 constant and minimize over θ_1, and then we project T_{θ_2} onto the set S. In detail, we introduce the loss function

L_ADMM(θ_1, θ_2, Λ, ρ) = E_{x∼µ}|x − T_{θ_1}x|² + 1_{d(T_{θ_2#}µ|ν)=0} + E_{x∼µ}[Λ^T(T_{θ_1}x − T_{θ_2}x)] + (ρ/2) E_{x∼µ}[|T_{θ_1}x − T_{θ_2}x|²],

where Λ ∈ R^d is the multiplier. The training procedure is given by
1. θ_1^{k+1} = arg min_{θ_1} L_ADMM(θ_1, θ_2^k, Λ^k, ρ) (with 1_{d(T_{θ_2#}µ|ν)=0} = 0);
2. θ_2^{k+1} = arg min_{θ_2} d(T_{θ_2#}µ|ν);
3. Λ^{k+1} = Λ^k + ρ E_{x∼µ}[T_{θ_1^{k+1}}x − T_{θ_2^{k+1}}x].

Convergence. The convergence of the ADMM method is only known to hold under convexity or for certain non-convex problems (Boyd et al. (2011)). Using the results of Wang et al. (2019), we can prove convergence of the ADMM method if we relax problem (8) to

min_{θ_1, θ_2} E_{x∼µ}|x − T_{θ_1}x|² + η_ε(d(T_{θ_2#}µ|ν)), s.t. θ_1 = θ_2,

where η_ε is a mollifier which converges to the δ-function as ε → 0. We show that the ADMM method converges to the KKT points, and if the corresponding Lagrangian is a Kurdyka-Łojasiewicz function, the ADMM method converges globally to the unique solution (see Appendix A.3 for details). As a consequence, there exists a convergent sequence of the ADMM method that solves problem (8) approximately.

Advantages and disadvantages.
The advantage of ADMM is that it decomposes the Monge problem into two sub-problems: minimizing the transport cost and minimizing D_KL. Compared to the AL method, ADMM solves two decomposed minimization problems and may also converge faster in some situations (Wang et al. (2019)). However, the method also solves a min-max problem and may face oscillations during training.
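A parameter-level sketch of the splitting on the two-dimensional linear Gaussian example of Section 4.1. For this linear family with µ = N(0, I_2), E_{x∼µ}|T_{θ_1}x − T_{θ_2}x|² = 2‖θ_1 − θ_2‖², so the coupling and the multiplier can be written directly on the parameters (a simplification of the function-space multiplier Λ above; a scipy solver stands in for gradient descent). Step 2 below is the projection step, arg min of the divergence, which the warm start sends to the nearest of its minimizers:

```python
import numpy as np
from scipy.optimize import minimize

def cost(t):                      # transport cost of T_{theta_1}
    a, b = t
    return 2 * (1 - a) ** 2 + 2 * b ** 2

def kl(t):                        # closed-form D_KL(T_{theta_2 #} mu | nu)
    a, b = t
    return (8*a*a - 4*a*b + 8*b*b - 30
            - 15 * np.log((a*a - b*b) ** 2 / 15.0)) / 30.0

opts = {"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000}
rho = 2.0
t1 = np.array([1.5, 0.5]); t2 = t1.copy(); u = np.zeros(2)
for _ in range(50):
    # step 1: transport-cost network, coupled to t2 via multiplier and penalty
    t1 = minimize(lambda t: cost(t) + u @ (t - t2) + rho * (t - t2) @ (t - t2),
                  t1, method="Nelder-Mead", options=opts).x
    # step 2: distribution network, pure projection onto the minimizers of d
    t2 = minimize(kl, t2, method="Nelder-Mead", options=opts).x
    # step 3: multiplier update on the consensus gap
    u = u + 2 * rho * (t1 - t2)
print(t1, t2, kl(t2))   # t1 ~ t2 ~ (1.984, 0.252)
```

The two networks converge to consensus at the feasible point with the lowest transport cost, mirroring the splitting behavior reported in Section 4.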

3.4. THE DISTANCE d

We need a distance or divergence functional d to compare the generated distribution with the target distribution, and we require d to be computable from samples.

The KL divergence D_KL. When the density of the target distribution is known, we can use a normalizing flow network (Rezende & Mohamed (2015)) as T_θ, and D_KL is computed via

D_KL(T_{θ#}µ|ν) = E_{x∼µ}[log µ(x) − log det J_{T_θ}(x) − log ν(T_θx)],

where J_{T_θ} is the Jacobian of the transport map, which can be computed using the normalizing flow network.

When the target density is unknown. We reverse the KL divergence and take

D_KL(ν|T_{θ#}µ) = E_{x∼ν}[log ν(x) − log µ(T_θ^{-1}x) − log det J_{T_θ^{-1}}(x)].

We can drop the first term in the bracket since it is a constant, and take L_SL in (4) as

L_SL(θ, λ) = E_{x∼µ}|x − T_θx|² + λ E_{x∼ν}[−log µ(T_θ^{-1}x) − log det J_{T_θ^{-1}}(x)].

For the ADMM method, the second step changes to θ_2^k ← arg min_θ E_{x∼ν}[−log µ(T_θ^{-1}x) − log det J_{T_θ^{-1}}(x)]. In order to apply the AL method, one can use E_{x∼ν}[−∇_θ log µ(T_θ^{-1}x) − ∇_θ log det J_{T_θ^{-1}}(x)] to replace the gradient of the D_KL terms in (6).

Test function as multiplier. A weak form of the constraint T_{θ#}µ = ν is that ∫ f(T_θx) dµ(x) = ∫ f(x) dν(x) for any measurable function f. We can therefore introduce a discriminator network f_w as Lagrange multiplier, and the Lagrangian becomes

L_SL(θ, w) = E_{x∼µ}[|x − T_θx|²] + E_{x∼µ}[f_w(T_θx)] − E_{z∼ν}[f_w(z)].

The augmented Lagrangian then becomes

L_AL(θ, w, ρ) = E_{x∼µ}[|x − T_θx|²] + E_{x∼µ}[f_w(T_θx)] − E_{z∼ν}[f_w(z)] + (ρ/2)(E_{x∼µ}[f_w(T_θx)] − E_{z∼ν}[f_w(z)])².

The training procedure is given by
1. θ^k = arg min_θ L_AL(θ, w^k, ρ_k);
2. w^{k+1} = w^k + ρ_k (E_{x∼µ}[∇_w f_w(T_θx)] − E_{z∼ν}[∇_w f_w(z)]);
3. assign ρ_{k+1} ≥ ρ_k.

To apply the ADMM method, we take the loss function

L_ADMM(θ_1, θ_2, w_1, w_2, ρ) = E_{x∼µ}[|x − T_{θ_1}x|²] + d_{w_2}(T_{θ_2#}µ|ν) + E_{x∼µ}[f_{w_1}(T_{θ_1}x) − f_{w_1}(T_{θ_2}x)] + (ρ/2)(E_{x∼µ}[f_{w_1}(T_{θ_1}x) − f_{w_1}(T_{θ_2}x)])².
Then the optimization is decoupled into an optimization over θ_1 that learns the optimal transport map and an optimization over θ_2 that learns the target distribution. Here we can take, for example, d_{w_2} to be the loss function of Wasserstein GANs with gradient penalty (Gulrajani et al. (2017)):

d_{w_2}(T_{θ_2#}µ|ν) = E_{x∼µ}[f_{w_2}(T_{θ_2}x)] − E_{z∼ν}[f_{w_2}(z)] + c E_{x̂∼γ}[(‖∇_{x̂} f_{w_2}(x̂)‖_2 − 1)²],

where c is a positive constant and γ is the distribution of linear interpolations between T_{θ_2#}µ and ν. The training procedure is given by
1. θ_1^{k+1} = arg min_{θ_1} L_ADMM(θ_1, θ_2^k, w_1^k, w_2^k, ρ_k);
2. (θ_2^{k+1}, w_2^{k+1}) = arg min_{θ_2} max_{w_2} L_ADMM(θ_1^{k+1}, θ_2, w_1^k, w_2, ρ_k), updated in the same way as WGAN with gradient penalty;

3. w_1^{k+1} = w_1^k + ρ_k (E_{x∼µ}[∇_{w_1} f_{w_1}(T_{θ_1}x)] − E_{z∼ν}[∇_{w_1} f_{w_1}(z)]).
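The change-of-variables expression for D_KL(T_{θ#}µ|ν) above can be checked on a linear map, where the Jacobian is constant and the exact KL between Gaussians is available in closed form (the map A, b below is hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
S2 = np.array([[4.0, 1.0], [1.0, 4.0]])
mu = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
nu = multivariate_normal(mean=[1.0, 1.0], cov=S2)

A = np.array([[1.5, 0.2], [0.2, 1.5]])   # hypothetical linear map T x = A x + b
b = np.array([1.0, 1.0])

x = rng.standard_normal((100_000, 2))    # samples from mu = N(0, I_2)
y = x @ A.T + b
logdet = np.log(abs(np.linalg.det(A)))   # log det J_T is constant for linear T

# D_KL(T_# mu | nu) = E_mu[log mu(x) - log det J_T(x) - log nu(T x)]
kl_mc = np.mean(mu.logpdf(x) - logdet - nu.logpdf(y))

# Closed form between N(b, A A^T) and N((1,1), S2): the means coincide here
P = A @ A.T
kl_cf = 0.5 * (np.trace(np.linalg.inv(S2) @ P) - 2
               + np.log(np.linalg.det(S2) / np.linalg.det(P)))
print(kl_mc, kl_cf)   # agree up to Monte Carlo error
```

The same sample-based estimator is what the normalizing-flow implementation evaluates during training.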

3.5. IMPLEMENTATION OF THE ALGORITHMS

The implementations of the algorithms are given in Algorithms 1-3. Here we give the implementation for the case when d is evaluated directly, without a discriminator network. For the case when a test function is used as multiplier, see Section 3.4 and Appendix ?? for details.

Algorithm 3 (OTCOP-ADMM):
for k steps do
  for m_1 steps do
    θ_1 ← θ_1 − η ∇_{θ_1} L_ADMM(θ_1, θ_2, λ, ρ)
  end for
  for m_2 steps do
    θ_2 ← θ_2 − η ∇_{θ_2} d(T_{θ_2#}µ|ν)
    w_2 ← w_2 + η ∇_{w_2} d(T_{θ_2#}µ|ν)  (if d = d_w)
  end for
  Update λ ← λ + ρ d(T_{θ_1#}µ|T_{θ_2#}µ)
end for

4. EXPERIMENTS

4.1. MULTIVARIATE NORMAL DISTRIBUTIONS

Linear maps. First, we consider optimal transport between multivariate normal distributions µ = N(X_1, Σ_1) and ν = N(X_2, Σ_2). For maps of the form T_θx = Ax + b with θ = {A ∈ R^{d×d}, b ∈ R^d}, we prove in Theorem 2 that the solution to problem (2) gives the correct solution to the Monge problem. Indeed, in this case the SL, AL and ADMM algorithms reduce to the optimization of an objective function subject to a nonlinear constraint, which can be analyzed theoretically by constraint optimization theory (Bertsekas (2014)). Here we take X_1 = 0_2, Σ_1 = I_2 and X_2 = 0_2, Σ_2 = [[4, 1], [1, 4]], and let T_θx = Ax with A = [[a, b], [b, a]] (θ = {a, b}). Problem (2) then reduces to

min_{a,b} 2(1 − a)² + 2b², s.t. D_KL(T_{θ#}µ|ν) = (1/30)(8a² − 4ab + 8b² − 30 − 15 log((a² − b²)²/15)) = 0.

The landscapes of D_KL and L_SL(θ, λ = 1), as well as the value of min_θ L_SL(θ, λ) as a function of λ, are plotted in Figure 1. As can be seen from the figure, the function D_KL(T_{θ#}µ|ν) has multiple minimizers (red points), whereas the Lagrangian L_SL(θ, 1) has a unique global minimizer (blue point). Hence minimizing the Lagrangian L_SL helps to find the optimal transport map: it finds the map with the lowest transport cost among all maps that realize the target distribution. More importantly, the landscape of D_KL divides the whole domain into four pieces, and starting in each piece, gradient descent converges to one of four different points. In contrast, the landscape of the Lagrangian changes dramatically, and gradient descent converges to a single point from any starting point.

The Lagrangian introduces errors for finite λ. As can be seen from the leftmost panel, the constraint D_KL = 0 is more relaxed as λ becomes smaller. However, for large λ, the barrier region becomes wider and gradient descent may converge to a local minimizer away from the optimal result.
For small λ, gradient descent easily finds the global minimizer, but since the penalty introduces an error, the relaxation of the constraint can also push the minimizer away from the solution of the Monge problem. The choice of λ needs to take this tradeoff into consideration. Training with a one-layer linear neural network confirms the above analysis.

Training with neural networks with nonlinear activation functions. Next we present our results on the training of Gaussian-to-Gaussian distributions with neural networks with nonlinear activation functions. We take X_1 = 0_d, X_2 = 1_d and Σ_1 = I_d, Σ_2 = 3I_d + 1_{d×d}. Theoretical analysis gives that the optimal transport distance is 2d (see Theorem 2). Since we use D_KL as the distance function, which is always nonnegative, the QP method behaves similarly to the SL method with a different multiplier; hence results for the QP method are not presented here. The results are given in Table 1, where the 784D Gaussian case is taken as X_1 = 0_{784}, X_2 = 2·1_{784} with Σ_1 = Σ_2 = I_{784} (the theoretical optimal transport distance is 784 × 4). As can be seen from the table, all three algorithms give good results and learn the optimal transport map with high accuracy. Training is done using a neural network of width d and depth 10 with tanh activations for all cases except the 78D Gaussian, for which a depth-100 neural network is used in order to learn the correlations correctly. Remarkably, our method gives good results even for the highly correlated distributions in high dimensions.

ADMM learns a lower transport cost. From the figure, we can see that the ADMM method learns a lower transport cost than the other methods.
This is because minimizing D_KL between the generated and target distributions during training need not converge to the solution of the Monge problem, as illustrated in the simple 2D case above. Hence, the splitting enables ADMM to learn a transport map with lower transport cost. This can also be seen from the learning curves in Figure 4 in Appendix A.5: the second network (T_{θ_2}) has a higher transport cost when learning the target distribution, whereas the first network (T_{θ_1}) has a lower transport cost while the learned target distribution remains accurate.

4.2. GAUSSIAN TO GAUSSIAN MIXTURES

We take a four-component Gaussian mixture, each component with covariance matrix 0.5I_2 and centers lying on the four corners of the square [−1, 1]², and learn the optimal transport map from the two-dimensional standard Gaussian to this mixture distribution. We plot the Jacobian graph of the learned map and the value of D_KL between the target distribution and the network prediction in Figure 2. As can be seen from the figure, solely minimizing D_KL does not recover the right directions of the optimal transport map. Using our method, the learned map is more balanced and approximately satisfies the condition ∇ × Tx = 0. Note that we do not include the penalty term |∇ × Tx|². From the learning curves, we confirm the finding above that ADMM learns a lower transport cost than the other methods. Here the transport cost of the second network converges to around 0.07, while the transport cost of the first network converges to about half that of the second network.

AL and ADMM methods are more robust. Compared to SL, AL gives a more robust result with respect to the value of λ. For a well-chosen λ, SL performs as well as AL; however, if λ is not chosen properly, the obtained transport cost is higher or the target distribution is not well realized. For the ADMM method, the choice of ρ affects the training process, but over a wide range it has little effect on the final result (see Appendix A.5 for graphs indicating this finding). This confirms the benefits of the AL and ADMM methods reported in the literature (Nocedal & Wright (1999); Boyd et al. (2011)).
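The curl condition referenced above can be checked with finite differences: a map that is the gradient of a convex potential satisfies ∂T_1/∂x_2 − ∂T_2/∂x_1 = 0 in 2D. A small sketch (hypothetical maps, not the trained network) contrasts a symmetric linear map, which passes, with a rotation, which preserves a radially symmetric distribution but is not curl-free and hence not optimal:

```python
import numpy as np

def curl2d(T, x, h=1e-5):
    """Finite-difference dT1/dx2 - dT2/dx1 at point x for a map T: R^2 -> R^2."""
    e1, e2 = np.array([h, 0.0]), np.array([0.0, h])
    dT1_dx2 = (T(x + e2)[0] - T(x - e2)[0]) / (2 * h)
    dT2_dx1 = (T(x + e1)[1] - T(x - e1)[1]) / (2 * h)
    return dT1_dx2 - dT2_dx1

sym = lambda x: np.array([[2.0, 0.5], [0.5, 1.0]]) @ x   # gradient of a convex quadratic
rot = lambda x: np.array([[0.0, -1.0], [1.0, 0.0]]) @ x  # 90-degree rotation

x0 = np.array([0.3, -0.7])
print(curl2d(sym, x0), curl2d(rot, x0))   # ~0 and -2
```

The same check, averaged over sample points, is what the Jacobian plots in Figure 2 visualize.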

4.3. IMPROVED TRANSPORT COST OF GANS

We use the test function as multiplier and test the performance of the algorithms described in Section 3.4.

The SL method helps overcome the vanishing gradient issue. The constraint that f_w is a 1-Lipschitz function is important in WGAN: without it, the training of WGAN may face vanishing gradients, as illustrated in Figure 6 in Appendix A.6. However, when the transport cost is added to the cost function as in (12), training no longer faces this difficulty and the target distribution can be learned. Note that we do not use a penalty term on the discriminator network as in the gradient penalty method; the transport cost is only a function of the generator network, and no regularization of the discriminator network is needed.

All three methods significantly reduce the transport cost of GANs. Compared to WGAN-GP, which has a transport cost around 3.68, all three methods (SL, AL, ADMM) described in Section 3.4 show significantly lower transport costs (0.077 for SL, 0.071 for AL and 0.049 for ADMM). The transport losses are plotted in Figure 7 in Appendix A.6. However, the SL method is sensitive to the choice of λ, while the AL and ADMM methods are more robust.

MNIST. We also train WGANs on the MNIST dataset. Using WGAN with SL (λ = 1), we obtain a transport cost around 1.81, compared to 1.87 with WGAN alone. MNIST-like samples generated by the learned optimal map are plotted in Figure 3. Therefore, our method lowers the transport cost of WGAN while keeping the quality of the generated distributions.

Conclusion. We have shown that the incorporation of constraint optimization tools provides a direct and efficient way to compute optimal transport maps. By solving the Monge problem directly, our method avoids special network structures and the dual problem. Moreover, applying our method to WGAN yields a lower transport cost for the generative network without sacrificing the quality of the generated data.

A.2 PROOF OF THEOREM 2

Theorem 2.
Let µ = N(X_0, Σ_0) and ν = N(X_1, Σ_1) be two multivariate normal distributions with X_0, X_1 ∈ R^d and Σ_0, Σ_1 ∈ R^{d×d}. Let T_θ : R^d → R^d be the linear map y = Ax + b with θ = (A, b), A ∈ R^{d×d}, b ∈ R^d. Then there exists a unique solution to the problem

θ* = arg min_θ ∫_{R^d} |x − T_θx|² dµ(x), s.t. D_KL(T_{θ#}µ|ν) = 0,  (17)

and the corresponding transport map T_{θ*} is the optimal transport map in the sense of (1).

Proof. For x ∼ µ, the linear transformation y = Ax + b also follows a multivariate normal distribution, y ∼ ρ = N(AX_0 + b, AΣ_0A^T). Using the formula for the KL divergence between two multivariate normal distributions,

D_KL(ρ|ν) = (1/2)[ tr(Σ_1^{-1}AΣ_0A^T) + (X_1 − (AX_0 + b))^T Σ_1^{-1} (X_1 − (AX_0 + b)) − d + log( det Σ_1 / det(AΣ_0A^T) ) ].

Since tr B − log det B − d ≥ 0 for any positive definite matrix B, with equality if and only if B = I_d, and (X_1 − (AX_0 + b))^T Σ_1^{-1} (X_1 − (AX_0 + b)) ≥ 0, D_KL(ρ|ν) = 0 is equivalent to

Σ_1^{-1}AΣ_0A^T = I_d, X_1 − (AX_0 + b) = 0.

The objective function in (17) can be calculated as

∫_{R^d} |x − T_θx|² dµ(x) = ∫_{R^d} |x − (Ax + b)|² dµ(x)
= E_{x∼µ}[((I − A)x − b)^T((I − A)x − b)]
= tr((I − A)^T(I − A)(Σ_0 + X_0X_0^T)) − 2b^T(I − A)X_0 + b^Tb
= tr((I − A)^T(I − A)Σ_0) + |(I − A)X_0 − b|².

Therefore, the optimization problem (17) becomes

min_{A ∈ R^{d×d}, b ∈ R^d} tr((I − A)^T(I − A)Σ_0) + |(I − A)X_0 − b|², s.t. Σ_1^{-1}AΣ_0A^T = I_d, X_1 − (AX_0 + b) = 0.

Eliminating b, this is equivalent to

min_{A ∈ R^{d×d}} tr((I − A)^T(I − A)Σ_0) + |X_0 − X_1|², s.t. AΣ_0A^T = Σ_1.

Since tr((I − A)^T(I − A)Σ_0) = tr(Σ_0) − 2tr(A^TΣ_0) + tr(AΣ_0A^T), and tr(AΣ_0A^T) = tr(Σ_1) is constant under the constraint, the above problem is equivalent to

max_{A ∈ R^{d×d}} tr(A^TΣ_0), s.t. AΣ_0A^T = Σ_1.

Let R = Σ_0^{1/2}A^TΣ_1^{-1/2}; then R^TR = I_d and the above problem can be rewritten as

max_{R ∈ R^{d×d}; R^TR = I_d} tr(RΣ_1^{1/2}Σ_0^{1/2}).

The maximum over orthogonal R is attained when RΣ_1^{1/2}Σ_0^{1/2} is symmetric positive semidefinite, which gives A* = Σ_0^{-1/2}(Σ_0^{1/2}Σ_1Σ_0^{1/2})^{1/2}Σ_0^{-1/2} and b* = X_1 − A*X_0.
The above formula (A * , b * ) is the same as that obtained by analytical method for the optimal transport and is shown to be the unique solution to the Monge problem (1).

A.3 PROOF OF THE CONVERGENCE OF THE ADMM METHOD

Theorem 3. Assume T_θ is given by a neural network with Lipschitz continuous activation functions and satisfying ‖T_θ‖ → ∞ as θ → ∞. For sufficiently large ρ, the ADMM method generates a bounded sequence that converges to a stationary point of the Lagrangian

L_ε(θ_1, θ_2, Λ, ρ) = E_{x∼µ}|x − T_{θ_1}x|² + η_ε(d(T_{θ_2#}µ|ν)) + E_{x∼µ}[Λ^T(T_{θ_1}x − T_{θ_2}x)] + (ρ/2) E_{x∼µ}[|T_{θ_1}x − T_{θ_2}x|²].

Proof. We only need to check the conditions in Wang et al. (2019). First we show the objective function is coercive: for bounded θ_1 − θ_2 and θ_1 → ∞, E_{x∼µ}|x − T_{θ_1}x|² + η_ε(d(T_{θ_2#}µ|ν)) = ∞

At each training step (taking ρ to be constant), the KKT condition for the QP loss (5) is

∇_θ L_QP(θ_k, ρ_k) = ∇_θ E_{x∼µ}(|x − T_θx|²)(θ_k) + ρ_k d(T_{θ_k#}µ|ν)(∇_θ d(T_{θ#}µ|ν))(θ_k) = 0,

and the KKT conditions for the AL loss (6) are

∇_θ L_AL(θ_k, λ_k, ρ_k) = ∇_θ E_{x∼µ}(|x − T_θx|²)(θ_k) + (λ_k + ρ_k d(T_{θ_k#}µ|ν))(∇_θ d(T_{θ#}µ|ν))(θ_k) = 0,
∇_λ L_AL(θ_k, λ_k, ρ_k) = d(T_{θ_k#}µ|ν) = 0.

A.5 TRAINING RESULTS

The results for the 78D Gaussian are plotted below. We demonstrate the robustness of the AL and ADMM methods compared to the SL method; the training curves for the three methods are plotted in Figure 5.

A.6 EXPERIMENT DETAILS

Code for the numerical experiments is available at https://github.com/otcop/otcop.git. Table 1 and Figures 2 and 3 are produced using a normalizing flow network with planar transformation layers. The initial and target densities are known and samples are drawn from these distributions. The algorithms described in Sections 3.1-3.3 are used, and the KL divergence is computed via (11). For the WGAN example, we use a fully connected neural network with hidden layers of width 400. Gradient penalty is used for WGAN-GP; for the SL, AL and ADMM methods, the algorithm is described in Section 3.4, with λ = 1 for SL, ρ = 10^{-4} for AL and ρ = 10^{-5} for ADMM.



Figure 1: Graph of the KL error and the loss function of the SL method (left: min_θ L_SL(θ, λ) as well as the corresponding D_KL and transport cost as functions of λ; middle: landscape of D_KL as a function of a, b; right: landscape of L_SL(θ, 1) as a function of a, b). Red points: minimizers of D_KL; blue point: minimizer of L_SL(θ, 1).

Figure 2: Graph of the KL error and the loss function for the minimization of D_KL alone (min D_KL), the SL, the AL and the ADMM methods (from left to right). The D_KL curve is the decreasing line and the transport cost is the increasing line. In the bottom right panel, the orange line is for the first network and the blue line is for the second network.

Figure 3: MNIST-like samples generated by WGAN with the SL penalty.

and thus is coercive. The Lipschitz continuity assumption implies that the objective function is also Lipschitz continuous. Moreover, the solutions to the subproblems min_{θ_1} E_{x∼µ}|x − T_{θ_1}x|² and min_{θ_2} η_ε(d(T_{θ_2#}µ|ν)) are also Lipschitz continuous. Thus the conditions in Wang et al. (2019) hold, and we conclude that the ADMM method converges.

A.4 THE KKT CONDITIONS

The KKT conditions characterize the saddle points of the Lagrangian. The KKT conditions for the SL Lagrangian (4): at the saddle point (θ*, λ*),

∇_θ L_SL(θ*, λ*) = ∇_θ E_{x∼µ}(|x − T_θx|²)(θ*) + λ*(∇_θ d(T_{θ#}µ|ν))(θ*) = 0,
∇_λ L_SL(θ*, λ*) = d(T_{θ*#}µ|ν) = 0.

Figure 4: Training curves of the three algorithms for the 78D Gaussian benchmark (the increasing line is the transport cost, and the decreasing line is D_KL. Left: the SL method; middle: the AL method; right: the ADMM method, where the orange line is for the first network and the blue line for the second network).

Figure 5: Graph of the training curves; increasing line: transport loss, decreasing line: D_KL. Top: SL with λ = 0.1 (left) and λ = 10 (right). Middle: AL with λ = 0.1 (left) and λ = 10 (right). Bottom: ADMM with Λ = 0.1·1 (left) and Λ = 10·1 (right).

Figure 6: Results of WGAN without gradient penalty and of the SL method. From left to right: the discriminator loss, the generator loss, and generated samples; top: WGAN without gradient penalty; bottom: the SL method (equation (12)).

Training results on Gaussian and Gaussian mixtures


Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 1999.
Ingram Olkin and Friedrich Pukelsheim. The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257-263, 1982.
Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

A APPENDIX

A.1 PROOF OF THEOREM 1

The proof of Theorem 1 is given below by combining the existence theorem for the Monge problem with the universal approximation property of neural networks.

Proof. Given two probability measures µ, ν satisfying the assumptions of Theorem 1, (Villani, 2009, Theorem 5.20) implies that there exists a unique deterministic solution T̄ to the Monge problem, and that T̄ is a Borel map. By Lusin's theorem, there exists a continuous function T ∈ C(R^d) such that T = T̄ almost everywhere on any compact subset X ⊂ R^d. Since T_θ is given by an arbitrary-depth neural network, the universal approximation theorem of Kidger & Lyons (2020) gives: for any ε > 0, there exists θ such that sup_{x∈X} |T_θ(x) − T(x)| < ε. Fix α and take ε < α/2; since d(T_{θ#}µ|ν) ≤ α, we get that the solution T_{θ*_α} to problem (3) satisfies (15). Let {α_k} be a sequence converging to zero and T_{θ_k} the solution to the corresponding problem (3); then d(T_{θ_k#}µ|ν) ≤ α_k → 0. On the other hand, taking the limit α → 0 in (15), and combining this with the previous inequality, implies (16).

