KERNEL NEURAL OPTIMAL TRANSPORT

Abstract

We study the Neural Optimal Transport (NOT) algorithm which uses the general optimal transport formulation and learns stochastic transport plans. We show that NOT with the weak quadratic cost may learn fake plans which are not optimal. To resolve this issue, we introduce kernel weak quadratic costs. We show that they provide improved theoretical guarantees and practical performance. We test NOT with kernel costs on the unpaired image-to-image translation task. (a) Celeba (female) → anime, 128 × 128. (b) Outdoor → church, 128 × 128. Figure 1: Unpaired image-to-image translation (one-to-many) by Kernel Neural Optimal Transport.

1. INTRODUCTION

Neural methods have become widespread in Optimal Transport (OT) starting from the introduction of the large-scale OT (Genevay et al., 2016; Seguy et al., 2018) and the Wasserstein Generative Adversarial Networks (Arjovsky et al., 2017) (WGANs) . Most existing methods employ the OT cost as the loss function to update the generator in GANs (Gulrajani et al., 2017; Sanjabi et al., 2018; Petzka et al., 2018) . In contrast to these approaches, (Korotin et al., 2023; Rout et al., 2022; Daniels et al., 2021; Fan et al., 2022a; Korotin et al., 2023) have recently proposed scalable neural methods to compute the OT plan (or map) and use it directly as the generative model. In this paper, we focus on the Neural Optimal Transport (NOT) algorithm (Korotin et al., 2023) . It is capable of learning optimal deterministic (one-to-one) and stochastic (one-to-many) maps and plans for quite general strong and weak (Gozlan et al., 2017; Gozlan & Juillet, 2020; Backhoff-Veraguas et al., 2019) transport costs. In practice, the authors of NOT test it on the unpaired image-to-image translation task (Korotin et al., 2023, 5) with the weak quadratic cost (Alibert et al., 2019, 5.2) . Contributions. We conduct the theoretical and empirical analysis of the saddle point optimization problem of NOT algorithm for the weak quadratic cost. We show that it may have a lot of fake solutions which do not provide an OT plan. We show that NOT indeed might recover them ( 3.1). We propose weak kernel quadratic costs and prove that they solve this issue ( 3.2). Practically, we show how NOT with kernel costs performs on the unpaired image-to-image translation task ( 5). Notations. We use X , Y, Z to denote Polish spaces and P(X ), P(Y), P(Z) to denote the respective sets of probability distributions on them. For a distribution P, we denote its mean and covariance matrix by m P and Σ P , respectively. We denote the set of probability distributions on X × Y with marginals P and Q by Π(P, Q). For a measurable map T : X × Z → Y (or T x : Z → Y), we denote the associated push-forward operator by T ♯ (or T x ♯). We use H to denote a Hilbert space (feature space). Its inner product is ⟨•, •⟩ H , and ∥ • ∥ H is the corresponding norm. 

2. BACKGROUND ON OPTIMAL TRANSPORT

Strong OT formulation. For distributions P ∈ P(X ), Q ∈ P(Y) and a cost function c : X ×Y → R, Kantorovich's (Villani, 2008) primal formulation of the optimal transport cost (Figure 2a ) is Cost(P, Q) def = inf π∈Π(P,Q) X ×Y c(x, y)dπ(x, y), where the minimum is taken over all transport plans π, i.e., distributions on X × Y whose marginals are P and Q. The optimal π * ∈ Π(P, Q) is called the optimal transport plan. A popular example of an OT cost for X = Y = R D is the Wasserstein-2 (W 2 2 ), i.e., formulation (1) for c(x, y) = 1 2 ∥x -y∥ 2 . (a) Strong OT formulation (1). (b) Weak OT formulation (2). Figure 2 : Strong (Kantorovich's) and weak (Gozlan et al., 2017) optimal transport formulations. Weak OT formulation. Let C : X × P(Y) → R be a weak cost (Gozlan et al., 2017) , i.e., a function which takes a point x ∈ X and a distribution of y ∈ Y as inputs. The weak OT cost between P, Q is Cost(P, Q) def = inf π∈Π(P,Q) X C x, π(•|x) dπ(x), where π(•|x) denotes the conditional distribution (Figure 2b ). Weak OT (2) subsumes strong OT formulation (1) for C(x, µ) = Y c(x, y)dµ(y). An example of a weak OT cost for X = Y = R D is the γ-weak (γ ≥ 0) Wasserstein-2 (W 2,γ ), i.e., formulation (2) with the γ-weak quadratic cost C 2,γ x, µ def = Y 1 2 ∥x -y∥ 2 dµ(y) - γ 2 Var(µ) = Y 1 2 ∥x - Y y dµ(y)∥ 2 dµ(y) + 1 -γ 2 Var(µ), where Var(µ) denotes the variance of µ: Var(µ) def = Y ∥y - Y y ′ dµ(y ′ )∥ 2 dµ(y) = 1 2 Y×Y ∥y -y ′ ∥ 2 dµ(y)dµ(y ′ ). For γ = 0, the transport cost (3) is strong, i.e., W 2 = W 2,0 . If the cost C(x, µ) is lower bounded, lower-semicontinuous and convex in the second argument, then we say that it is appropriate. For appropriate costs, the minimizer π * of (2) always exists (Backhoff-Veraguas et al., 2019, 1.3.1) . Since Var(µ) is concave and non-negative, cost (3) is appropriate when γ ∈ [0, 1]. For appropriate costs the (2) admits the following dual formulation: Cost(P, Q) = sup f X f C (x)dP(x) + Y f (y)dQ(y), where f are upper-bounded, continuous and not rapidly decreasing functions, see (Backhoff-Veraguas et al., 2019, 1.3.2) , and f C (x) Neural Optimal Transport (NOT). In (Korotin et al., 2023) , the authors propose an algorithm to implicitly learn an OT plan π * with neural nets (Figure 3 ). They introduce a (latent) atomless distribution S ∈ P(Z), e.g., Z = R Z and S = N (0, I Z ), and search for a function T * : X ×Z → Y (stochastic OT map) which satisfies T * x ♯S = π * (y|x) for some OT plan π * . That is, given x ∈ X , function T * pushes the distribution S to the conditional distribution π * (y|x) of an OT plan π * . In particular, T * satisfies the distributionpreserving condition T * ♯(P×S) = Q. To get T * , the authors use (5) to derive an equivalent dual form: Cost(P, Q) = sup f inf T X C x, T x ♯S - Z f T x (z) dS(z) dP(x) + Y f (y)dQ(y), where the inf is taken over measurable functions T : X × Z → Y. The functional under sup f inf T is denoted by L(f, T ). For every optimal potential f * ∈ arg sup f inf T L(f, T ), it holds that T * ∈ arg inf T L(f * , T ), see (Korotin et al., 2023, Lemma 4 ). Consequently, one may extract optimal maps T * from optimal saddle points (f * , T * ) of problem (6). In practice, saddle point problem (6) can be approached with neural nets f ω , T θ and the stochastic gradient descent-ascent (Korotin et al., 2023 , Algorithm 1). The limitation (Korotin et al., 2023, 6) of NOT algorithm is that arg inf T set of f * in (7) may contain not only optimal transport maps but other functions as well. As a result, the function T * recovered from a saddle point (f * , T * ) may be not an optimal stochastic map. In this paper, we show that for the γ-weak quadratic cost (3) this may be problematic: the arg inf T sets might contain fake solutions T * ( 3.1). To resolve the issue, we propose kernel γ-weak quadratic costs ( 3.2). Convex order. For two probability distributions P, Q on R D , we write P ⪯ Q if for all convex functions h : R D → R it holds h(x)dP(x) ≤ h(x)dQ(x). The relation "⪯" is a partial order, i.e., not all P, Q are comparable. If P ⪯ Q, then m P = m Q and Σ P ⪯ Σ Q (Scarsini, 1998, Lemma 3) . If P, Q are Gaussians, P ⪯ Q holds if and only if m P = m Q and Σ P ⪯ Σ Q (Scarsini, 1998, Theorem 4) . For P, Q, we define the projection of P onto the convex set of distributions which are ⪯ Q as Proj ⪯Q (P) = arg inf P ′ ⪯Q W 2 2 (P, P ′ ). ( ) The infimum is attained uniquely (Gozlan & Juillet, 2020, Proposition 1.1). There exists a 1-Lipschitz, continuously diffirentiable and convex function ϕ : R D → R satisfying Proj ⪯Q (P) = ∇ϕ♯P, see (Gozlan & Juillet, 2020 , Theorem 1.2) for details. Figure 4 : Every optimal restricted potential ϕ * satisfies ∇ϕ * ♯P = Proj ⪯Q ( 1 γ ♯P) ⪯ Q. Weak OT with the quadratic cost (γ > 0). For the γ-weak quadratic cost (3) on X = Y = R D , (Gozlan & Juillet, 2020, 5) , (Alibert et al., 2019, 5. 2) prove that there exists a continuously differentiable convex function ϕ * : R D → R such that π * ∈ Π(P, Q) is optimal if and only if Y y dπ * (y|x) = ∇ϕ * (x) holds true P-almost surely. In general, π * is not unique. We say that ϕ * is an optimal restricted potential. It may be not unique as a function R D → R, but ∇ϕ * is uniquely defined P-almost everywhere. It holds true that ∇ϕ * is 1 γ -Lipschitz; ∇ϕ * ♯P = Proj ⪯Q ( 1 γ ♯P); it is the OT map between P and Proj ⪯Q ( 1 γ ♯P) for the strong quadratic cost (Figure 4 ). The function ϕ * maximizes the dual form alternative to (5): W 2 2,γ (P, Q) = max ϕ∈Cvx( 1 γ ) X ∥x∥ 2 2 dP(x)+ Y ∥y∥ 2 2 dQ(x)- X ϕ(x)dP(x)+ Y ϕ(y)dQ(y) , where Cvx( 1 γ ) denotes the set of 1 γ -smooth convex functions ϕ : R D → R. Duality formula (9) appears in (Alibert et al., 2019, 5. 2), (Gozlan & Juillet, 2020, Theorem 1.2 & 5) but with different parametrization. In Appendix F, for completeness of the exposition, we derive (9) from the results of (Gozlan & Juillet, 2020) by the change of variables.

3. SOLVING ISSUES OF NEURAL OPTIMAL TRANSPORT

In what follows, we consider X = Y ⊂ R D . In 3.1, we theoretically derive that arg inf T sets (7) for the γ-weak quadratic cost (3) may indeed contain functions which are not stochastic OT maps. In 3.2, we introduce kernel weak quadratic costs and prove that they do not suffer from this issue, i.e., all functions in sets arg inf T for all optimal f * are stochastic OT maps. In 3.3, we discuss the practical aspects of learning with kernels. We give the proofs of all the statements in Appendix G.

3.1. FAKE SOLUTIONS FOR THE WEAK QUADRATIC COST

In this subsection, we consider X = Y = R D . We show that arg inf T L(f * , T ) sets ( 7) of optimal potentials f * in (5) for the γ-weak quadratic cost (3), in general, may contain functions T which are not stochastic OT maps. We call such T fake solutions. To show why one should be concerned about fake solutions, we emphasize their key defect below. Lemma 1 (Fake solutions are not distribution-preserving). Let f ∈ arg sup f inf T L(f, T ) and T † ∈ arg inf T L(f * , T ) be a fake solution. Then it holds that T † ♯(P × S) ̸ = Q. Throughout the section, we assume that P, Q have finite second moments. We analyse the potentials of the form f * (y) = 1 2 ∥y∥ 2 -ϕ * (y), where ϕ * : R D → R is an optimal restricted potential. To begin with, we show that such potentials are indeed optimal potentials for dual forms (5) and (6). Lemma 2 (Optimal restricted potentials provide dual form maximizers). Let ϕ * : R D → R be an optimal restricted potential. Assume that ϕ * takes only finite values. Then f * (y) def = 1 2 ∥y∥ 2 -ϕ * (y) maximizes dual formulations (5) and (6). For a convex ψ : R D → R, we denote the area around point y ∈ R D in which ψ is linear by U ψ (y) = {y ′ ∈ R D such that ∀x ∈ ∂ y ψ(y) it holds ψ(y ′ ) = ψ(y) + ⟨x, y ′ -y⟩} ⊇ {y}. (10) Proposition 1 (Convexity of sets of local linearity of a convex function). Set U ψ (y) is convex. Our following theorem provides a full characterization of the arg inf T L(f * , T ) sets in view. Theorem 1 (Characterization of saddle points with optimal restricted f * ). Let f * (y) = ∥y∥ 2 2 -ϕ * (y), where ϕ * is an optimal restricted potential. Assume that ϕ * takes only finite values. Then it holds true that arg inf T L(f * , T ) is a convex set and T † ∈ arg inf T L(f * , T ) ⇔ Z T † x (z)dS(z) = ∇ϕ * (x) holds true P-almost everywhere; T † x (z) ∈ U ψ ∇ϕ * (x) holds true P×S-almost everywhere, where ψ(y) (Kakade et al., 2009) . We define the optimal barycentric projection T * def = ϕ * (y) -γ 2 ∥y∥ 2 . Note that ψ is convex since ϕ * is 1 γ -smooth x (z) def = ∇ϕ * (x) for (x, z) ∈ X×Z; it does not depend on z ∈ Z. The function depends on the choice of optimal ϕ * ; we are interested only in its values in the support of P, where ∇ϕ * is unique ( 2). From definition (10), we see that ∇ϕ * (x) ∈ U ψ ∇ϕ * (x) and T * satisfies both conditions on the right side of (11). Thus, we have T * ∈ arg inf T L(f * , T ). Lemma 3 (The barycentric projection is not always a stochastic OT map). The following holds true T * is a stochastic OT map (a) ⇐⇒ Proj ⪯Q 1 γ ♯P = Q (b) =⇒ T * is the unique stochastic OT map. We use the word stochastic but T * is actually deterministic since it does not depend on z. From our Lemma 3, we derive that if Proj ⪯Q ( 1 γ ♯P) ̸ = Q, it holds that (f * , T * ) is a fake saddle point. Corollary 1 (Existence of fake saddle points). Assume that Proj ⪯Q ( 1 γ ♯P) ̸ = Q. Then problem (6) has optimal saddle points (f * , T * ) in which T * is not a stochastic OT map. Beside T * , our Theorem 1 can be used to construct arbitrary many fake solutions which are not OT maps. Let T * be any stochastic OT map and T † ∈ arg inf T L(f * , T ) satisfy T † ̸ = T * and Var T † ♯(P×S) ≤ Var(Q). For example, T † may be another stochastic OT map or the optimal barycentric projection T * . For any α ∈ (0, 1) consider T α = αT * + (1 -α)T † ∈ arg inf T L(f * , T ). Proposition 2 (Interpolant is not a stochastic OT map). Assume that Proj ⪯Q ( 1 γ ♯P) ̸ = Q. Then T α ♯(P × S) ̸ = Q. Consequently, T α is not a stochastic OT map between P and Q. It follows that a necessary condition for non-existence of fake saddle points is Proj ⪯Q ( 1 γ ♯P) = Q. This requirement is very restrictive and naturally prohibits using large values of γ. Also, due to our Lemma 3, the OT plan between P, Q must be deterministic. From the practical point of view, this requirement means that there will be no diversity in samples T * x (z) for a fixed x and z ∼ S. On the other hand, if Proj ⪯Q ( 1 γ ♯P) ̸ = Q, i.e., P is not γ-times more disperse than Q, the optimization may indeed converge to fake solutions. To show this, we consider the following example. 5a ) and run NOT (Korotin et al., 2023 , Algorihm 1) for γ ∈ { 1 2 , 3 4 , 1}-weak quadratic costs C 2,γ . We show the learned stochastic maps Tx (z) and their barycentric projections T (x) 5b, 5c, 5d . 2020, Corollary 2.1) . Next, we use our Lemma 3 and derive that the OT plan is unique, deterministic and equals the barycentric projection. The latter is the OT map between P and Q for the quadratic cost. It is given by ∇ϕ * (x) = 2x (Álvarez-Esteban et al., 2016, Theorem 2.3) . In Figure 5b (when γ = 1 Toy 2D example. We consider P = N (0, [ 1 2 ] 2 I 2 ), Q = N (0, I 2 ) (Figure def = Z T x (z)dS(z) in Figures Good case. When γ ≤ 1 2 , we have 1 γ ♯P = N (0, [ 1 2γ ] 2 I 2 ) with 1 2γ ≥ 1. Since the distributions 1 γ ♯P and Q are Gaussians and Σ Q ⪯ Σ 1 γ ♯P , we conclude that Proj ⪯Q ( 1 γ ♯P) = Q (Gozlan & Juillet, 2 ), we have T x (z) = T (x) ≈ 2x = ∇ϕ * (x) and T ♯(P×S) ≈ Q. Thus, NOT correctly learns the (unique and deterministic) OT plan. Bad case. When γ > 1 2 , we have 1 γ ♯P = N (0, [ 1 2γ ] 2 I 2 ) with 1 2γ < 1. Since 1 γ ♯P and Q are Gaussians and Σ Q ⪰ Σ 1 γ ♯P , we conclude that 1 γ P ⪯ Q (recall 2). Thus, Proj ⪯Q ( 1 γ ♯P) = 1 γ ♯P ̸ = Q by definition of the projection (8). The optimal barycentric projection is the OT map between Gaussians P and 1 γ ♯P for the quadratic cost. It is given by ∇ϕ * (x) = 1 γ x (Álvarez-Esteban et al., 2016, Theorem 2.3). In Figures 5c and 5d , we see that the learned T (x) ≈ 1 γ x, i.e., T captures the Importantly, we found that when γ > 1 2 (Figures 5c, 5d ), the transport map T extremely fluctuates during the optimization rather than converges to a solution. In Figure 6 , we visualize the evolution of T during training (for γ = 1). In all the cases, the barycentric projection T (x) ≈ x = ∇ϕ * (x) is almost correct. However, the "remaining" part of T is literally random. To explain the behavior, we integrate ∇ϕ * (x) = x and get that ϕ * (x) = 1 2 ∥x∥ 2 is an optimal restricted potential. We derive ψ(y) = ϕ * (y) -1 2 ∥y∥ 2 = 1 2 ∥y∥ 2 -1 2 ∥y∥ 2 ≡ 0 =⇒ U ψ (y) = U 0 (y) ≡ R D for every y ∈ R D . From our Theorem 1 it follows that T † ∈ arg inf T L(f * , T ) ⇔ Z T † x (z)dS(z) = ∇ϕ * (x) = x holds P-almost everywhere. Thus, a function T † recovered from (6) may be literally any function which captures the first conditional moment of a plan π * (y|x). This agrees with our practical observations. In Appendix C, we give an additional toy example illustrating the issue with fake solutions. Our results show that the γ-weak quadratic cost C 2,γ may be not a good choice for NOT algorithm due to fake solutions. However, prior works on OT (Korotin et al., 2023; Rout et al., 2022; Fan et al., 2022a; Gazdieva et al., 2022; Korotin et al., 2022) use strong/weak quadratic costs and show promising practical performance. Should we really care about solutions being fake? Yes. First, fake solutions T * ∈ arg inf T L(f * , T ) do not satisfy T * ♯(P×S) ̸ = Q, i.e., they are not distribution preserving (Lemma 1). Second, our analysis suggests that fake solutions might be one of the causes for the training instabilities reported in related works (Korotin et al., 2023, Appendix D) , (Korotin et al., 2021b, 4) : the map T may fluctuate between fake solutions rather than converge.

3.2. KERNEL WEAK QUADRATIC COST REMOVES FAKE SADDLE POINTS

In this section, we introduce kernel weak quadratic costs which generalize weak quadratic cost (3). We prove that for characteristic kernels the costs completely resolve the ambiguity of arg inf T sets. Henceforth, we assume that X = Y ⊂ R D are compact sets. Let H be a Hilbert space (feature space). Let u : X → H be a function (feature map). We define the γ-weak quadratic cost between features: C u,γ (x, µ) def = 1 2 Y ∥u(x) -u(y)∥ 2 H dµ(y) - γ 2 • 1 2 Y×Y ∥u(y) -u(y ′ )∥ 2 H dµ(y)dµ(y ′ ) . (12) We denote the PDS kernel k : Y × Y → R with the feature map u by k(y, y ′ ) def = ⟨u(y), u(y ′ )⟩ H . Cost ( 12) can be computed without knowing the map u, i.e., it is enough to know the PDS kernel k. By using ∥u(y) -u(y ′ )∥ 2 H = k(y, y) -2k(y, y ′ ) + k(y ′ , y ′ ), we obtain the equivalent form of ( 12): We call ( 12) and ( 13 We focus on characteristic kernels k and show that they resolve the ambiguity in arg inf T sets. C k,γ (x, µ) def = 1 2 k(x, x)+ 1-γ 2 Y k(y, y)dµ(y)- Y k(x, y)dµ(y)+ γ 2 Y×Y k(y, y ′ )dµ(y)dµ(y ′ ). ( ) Lemma 5 (Uniqueness of the optimal plan for characteristic kernel costs). Let k be a characteristic PDS kernel and γ ∈ (0, 1]. Then the OT plan π * for cost C k,γ (x, µ) is unique. Theorem 2 (Optimality of stochastic functions in all optimal saddle points). Let k be a continuous characteristic PDS kernel and γ ∈ (0, 1]. Consider weak OT problem (2) with cost C k,γ and its dual problem (5). For any optimal potential f * ∈ arg sup f inf T L(f, T ) it holds that T * ∈ arg inf T L(f * , T ) ⇐⇒ T * x ♯S = π * (y|x) holds true P-almost surely for all x ∈ X , (14) i.e., every optimal saddle point (f * , T * ) provides a stochastic OT map T * . Bilinear kernel k(x, y) = ⟨x, y⟩ is not characteristic and is not covered by our Theorem 2; its respective γ-weak quadratic cost C 2,γ suffers from fake solutions ( 3.1). In the next subsection, we give examples of practically interesting kernels k(x, y) which are ideologically similar to the bilinear but are characteristic. Consequently, their respective costs C k do not have ambiguity in arg inf T sets.

3.3. PRACTICAL ASPECTS OF LEARNING WITH KERNEL COSTS

Optimization. To learn the stochastic OT map T * for kernel cost (13), we use NOT's training procedure (Korotin et al., 2023, Algorithm 1) . It requires stochastic estimation of C k,γ (x, T x ♯S) to compute the corresponding term in (6). Similar to the γ-weak quadratic cost (Korotin et al., 2023, Equation 23) , it is possible to derive the following unbiased Monte-Carlo estimator C k,γ for x ∈ X and a batch Z ∼ S (|Z| ≥ 2): C k,γ x, T x (Z) = 1 2 k(x, x) + 1-γ 2|Z| z∈Z k T x (z), T x (z) - 1 |Z| z∈Z k(x, T x (z)) + γ 2|Z|(|Z| -1) z̸ =z ′ k T x (z), T x (z ′ ) ≈ C k,γ (x, T x ♯S). ( ) The time complexity of estimator ( 15) is O(|Z| 2 ) since it requires considering pairs z, z ′ in batch to estimate the variance term. Specifically for the bilinear kernel k(y, y ′ ) = ⟨y, y ′ ⟩, the variance can be 23), but NOT algorithm may suffer from fake solutions ( 3.1). Kernels. Consider the family of distance-induced kernels k(x, y) = 1 2 ∥x∥ α + 1 2 ∥y∥ α -1 2 ∥x -y∥ α . For these kernels, we have ∥u 12), ( 13) can be expressed as (x) -u(x ′ )∥ 2 H = ∥x -x ′ ∥ α , i.e., ( C k,γ (x, µ) = C u,γ (x, µ) = 1 2 Y ∥x -y∥ α dµ(y) - γ 2 • 1 2 Y×Y ∥y -y ′ ∥ α dµ(y)dµ(y ′ ) . ( ) For α = 2 the kernel is bilinear, i.e., k(x, y) = ⟨x, y⟩; it is PDS but not characteristic and ( 16) simply becomes the γ-weak quadratic cost (3). In the experiments ( 5), we focus on the case α = 1; it yields a PDS and characteristic kernel (Sejdinovic et al., 2013, Definition 13 & Proposition 14) .

4. RELATED WORK

In deep learning, OT costs are primarily used as losses to train generative models. Such approaches are called Wasserstein GANs (Arjovsky & Bottou, 2017) ; they are not related to our paper since they only compute OT costs but not OT plans. Below we discuss methods to compute OT plans. Existing OT solvers. NOT (Korotin et al., 2023) is the only parametric algorithm which is capable of computing OT plans for weak costs (2). Although NOT is generic, the authors tested it only with the γ-weak quadratic cost (3). The core of NOT is saddle point formulation (6) which subsumes analogs (Korotin et al., 2021b , Eq. 9), (Rout et al., 2022, Eq. 14) , (Fan et al., 2022a, Eq. 11) , (Henry-Labordere, 2019, Eq. 11), (Gazdieva et al., 2022, Eq. 10) , (Korotin et al., 2022, Eq. 7) for strong costs (1). For the strong quadratic cost, (Makkuva et al., 2020) , (Taghvaei & Jalali, 2019 , Eq. 2.2), (Korotin et al., 2021a, Eq. 10) consider analogous to (9) formulations restricted to convex potentials; they use Input Convex Neural Networks (ICNNs (Amos et al., 2017) ) to approximate the potentials. ICNNs are popular in OT (Korotin et al., 2021c; Mokrov et al., 2021; Huang et al., 2020; Alvarez-Melis et al., 2022; Bunne et al., 2022) but recent studies (Korotin et al., 2021b; 2022; Fan et al., 2022b) show that OT algorithms based on them underperform compared to unrestricted formulations such as NOT. In (Genevay et al., 2016; Seguy et al., 2018; Daniels et al., 2021) , the authors propose neural algorithms for f -divergence regularized costs (Genevay, 2019) . The first two methods suffer from bias in high dimensions (Korotin et al., 2021b, 4.2) . Algorithm (Daniels et al., 2021) alleviates the bias but is not end-to-end and is computationally expensive due to using the Langevin dynamics. There also exist GAN-based (Goodfellow et al., 2014) methods (Lu et al., 2020; Xie et al., 2019; González-Sanz et al., 2022) to learn OT plans (or maps) for strong costs. However, they are harder to set up in practice due to the large amount of tunable hyperparameters.

Kernels in OT.

In (Zhang et al., 2019; Oh et al., 2020) , the authors propose a strong kernel W 2 distance and an algorithm to approximate the transport map under the Gaussianity assumption on P, Q. In (Li et al., 2021) , the authors generalize Sinkhorn divergences (Genevay et al., 2019) to Hilbert spaces. These papers consider discrete OT formulations and data-to-data matching tasks; they do not use neural networks to approximate the OT map.

5. EVALUATION

In Appendix A, we learn OT between toy 1D distributions and perform comparisons with discrete OT. In Appendix B, we conduct tests on toy 2D distributions. In this section, we test our algorithm on an unpaired image-to-image translation task. We perform comparison with principal translation methods in Appendix K. The code is written in PyTorch framework and is available at https://github.com/iamalexkorotin/KernelNeuralOptimalTransport Image datasets. We test the following datasets as P, Q: aligned anime facesfoot_0 , celebrity faces (Liu et al., 2015) , shoes (Yu & Grauman, 2014) , Amazon handbags, churches from LSUN dataset (Yu et al., 2015) , outdoor images from the MIT places database (Zhou et al., 2014) , describable textures (Cimpoi et al., 2014) . The size of datasets varies from 5K to 500K images. Train-test split. We pick 90% of each dataset for unpaired training. The rest 10% are considered as the test set. All the results presented here are exclusively for test images, i.e., unseen data. Transport costs. We focus on the γ-weak cost for the kernel k(x, y) = 1 2 ∥x∥ + 1 2 ∥y∥ -1 2 ∥x -y∥. For completeness, we test other popular PDS kernels in Appendix E. Other training details (optimizers, architectures, pre-processing, etc.) are given in Appendix I. We learn stochastic OT maps between various pairs of datasets. We rescale images to 128 × 128 and use γ = 1 3 in the experiments with the kernel cost. Additionally, in Appendix D we analyse how varying parameter γ affects the diversity of generated samples. We provide the qualitative results in Figures 1, 7 and 8 ; extra results are in Appendix L. Thanks to the first term in ( 16), our translation map Tx (z) tries to minimally change the image content x in the pixel space. At the same time, the second term (kernel variance) in ( 16) enforces the map to produce diverse outputs for different z ∼ S. We provide quantitative comparison with NOT with the γ-weak quadratic cost C 2,γ . We compute FID score (Heusel et al., 2017) between the mapped input test subset and the output test subset (Table 1 ). For C 2,γ , we use the pre-trained models provided by the authors of NOT (Korotin et al., 2023, 5) . 2We observe that FID of NOT with kernel cost C k,γ is better than that of NOT with cost C 2,γ . We show qualitative examples in Appendix J. In Appendix H, we perform a detailed comparison of NOT's training stability with the weak quadratic and kernel costs.

6. DISCUSSION

Potential impact. Neural OT methods and their usage in generative models constantly advance. We expect our proposed weak kernel quadratic costs to improve applications of OT to unpaired learning. In particular, we hope that our theoretical analysis provides better understanding of the performance. Limitations (theory). In our Theorem 2, we implicitly assume the existence a maximizer f * of dual form (5) for kernel costs C k,γ . Deriving precise conditions for existence of such maximizers is a challenging question. We hope that this issue will be addressed in the future theoretical OT research. Limitations (practice). Applying kernel costs to domains of different nature (RGB images → depth maps, infrared images → RGB images) is not straightforward as it might require selecting meaningful shared features u (or kernel k). Studying this question is a promising avenue for the future research.

A TOY EXPERIMENTS IN 1D AND COMPARISON WITH DISCRETE OT

In this section, we learn transport plans between various pairs of toy 1D distributions and compare them with the discrete optimal transport (DOT) considered as the ground truth. We use the distanceinduced kernel and consider γ ∈ {1, 10}. All the rest training details (fully-connected architectures, optimizers, etc.) match those of (Korotin et al., 2023, Appendix C) . We consider Gaussian N (0, 1) → Mixture of 2 Gaussians and Mixture of 3 Gaussians → Mixture of 2 Gaussians. In Figure 9 , we visualize the pairs P, Q (1st and 2nd columns), the plan π learned by Kernel NOT (3rd column) and the plan π * learned by DOT (4th column). To compute DOT, we sample 10 3 random points x ∼ P, y ∼ Q and compute a discrete plan by ot.optim.cg solver from Python OT (POT) library https://pythonot.github.io/. Our learned plan π and DOT's plan π * nearly match. Note also that, as one may expect, with the increase of γ from 1 to 10, the conditional variance of the plan increases and for very high γ = 10 it becomes similar to the trivial plan P × Q. This is analogous to the entropic optimal transport, see, e.g., (Peyré et al., 2019, Figure 4 .2). 

B TOY EXPERIMENTS IN 2D

In this section, we learn transport maps between various common pairs of toy 2D distributions. We use the distance-induced kernel and γ = 1. All the rest training details (fully-connected architectures, optimizers, etc.) exactly match those of (Korotin et al., 2023, Appendix B) . We consider Gaussian N (0, 1 2 I 2 ) → Gaussian N (0, I 2 ) (the same experiment as in Figures 5d and 6 ), Gaussian N (0, [ 1 2 ] 2 I 2 ) → Mixture of 8 Gaussians and Gaussian N (0, [ 1 2 ] 2 I 2 ) → Swiss roll as P, Q pairs. In Figure 10 , we provide the learned stochastic (one-to-many) maps. Since the ground truth OT maps for kernel costs are not known, we provide only qualitative results. 

C ADDITIONAL TOY 2D EXAMPLE

In this section, we provide an additional toy 2D example demonstrating the issue with fake solutions for the weak quadratic cost. We consider a mixture of 4 Gaussians as P and the uniform distribution on a square as Q. We train NOT with the γ-weak quadratic cost for γ = 0, 1 2 , 1 and report the results in Figure 11 . For γ = 0 (Figure 11b ), we see that NOT with the weak quadratic costfoot_2 learns the target distribution. However, when γ = 1 2 (Figure 11c ) and γ = 1 (Figure 11d ), the method does not converge and yields fake solutions. In addition, in Figure 12 , we show that for γ = 1 the method notably fluctuates between the fake solutions. This is analogous to the toy example with Gaussians in 3.1, see Figure 6 . For completeness, we run NOT with our proposed kernel cost (16) on this pair (P, Q) and γ = 1 and show that it learns the distribution Q, see Figure 10d . 

D VARIANCE-SIMILARITY TRADE-OFF

In this section, we study how parameter γ affects the resulting learned transport map. In (Korotin et al., 2023, Appendix A) , the authors empirically show that for the γ-weak quadratic cost C 2,γ the variety of samples T x (z) produced for a fixed x and z ∼ S increases with the increase of γ, but their similarity to x decreases. We formalize this statement and generalize it for kernel costs C k,γ . For a plan π, we define its (feature) conditional variance and (the square of) input-output distance by respectively. Recall that Var u♯π(y|x) = Y×Y ∥u(y) -u(y ′ )∥ 2 H dπ(y|x)dπ(y ′ |x). We note that the γ-weak kernel cost of a plan π ∈ Π(P, Q) is given by Cost k,γ (π) = 1 2 Dist u (π) - γ 2 CVar u (π). ( ) Our following proposition explains the behaviour of the above mentioned values for OT plans. Proposition 3 (Behavior of the conditional variance and input-output distance). Let π * γ ∈ Π(P, Q) be an OT plan for γ-weak kernel cost. Then for γ 2 > γ 1 ≥ 0 it holds true  CVar u (π * γ1 ) ≤ CVar u (π * γ2 ) and Dist u (π * γ1 ) ≤ Dist u (π * γ2 ), γ = 0 γ = 1 3 γ = 2 3 γ = 1 γ = 4 3 CVaru( πγ) 0. ) ≥ Cost k,γ1 (π * γ2 ). ( ) The proof is given in Appendix G. We empirically check the proposition by training OT maps for C k,γ for texture → shoes translation (64×64), distance-induced k ( 5) and γ ∈ {0, 1 3 , 2 3 , 1, 4 3 }, see Figure 13 and Table 2 . We observe the increase of the variety of samples with the increase of γ. At the same time, with the increase of γ, output samples become less similar to the inputs.

E EXPERIMENTS WITH DIFFERENT KERNELS

We empirically test several popular kernels on texture → handbag translation (64 × 64), γ = 1 3 . We consider bilinear, distance-based, Gaussian and Laplacian kernels. The three latter kernels are characteristic. The quantitative and qualitative results are given in Figure 14 . For the kernels in view, the squared feature distance can be expressed as ∥u(x)-u(y)∥ 2 H = h(∥x-y∥) for some increasing function h : R + → R + . Due to this, all the stochastic transport maps try to preserve the input content in the pixel space. According to FID, the distance-based kernel performs better than the bilinear one, which agrees with the results in 5. Interestingly, both Gaussian and Laplacian kernels are slightly outperformed by the bilinear kernel. We do not know why this happens, but we presume that this might be related to their boundness (0 < k(x, y) ≤ 1) and exp operation.

F RESTRICTED DUALITY FOR THE WEAK QUADRATIC COST

In this section, we derive duality formula (9) for the γ-weak quadratic cost. First, we derive formula (9) for γ = 1 by using (Gozlan & Juillet, 2020, Theorems 1.1, 1.2). Next, following the discussion in (Gozlan & Juillet, 2020, 5. 2), we generalize duality formula (9) to arbitrary γ > 0. We note that the constants in our derivations differ from those in (Gozlan & Juillet, 2020) since our quadratic cost C 2,γ differs by a constant multiplicative factor. Part 1 (γ = 1). From (Gozlan & Juillet, 2020, Theorems 1.1, 1.2) it follows that there exists a lower-semi-continuous and convex function v * : R D → R ∪ {∞} which maximizes the following expression: W 2 2,1 (P, Q) = max v∈ l.s.c. Cvx X inf y∈R D v(y) + 1 2 ∥x -y∥ 2 dP(x) - Y v(y)dQ(y) . (21) Importantly, ϕ * (x) def = ( ∥•∥ 2 2 + v * )(x) is the optimal restricted potential, i.e., ∇ϕ * implements the projection of P to Q and is a 1-smooth convex function. We consider the change of variables ϕ (x) = ( ∥•∥ 2 2 + v)(x) in (21). Since ∥y∥ 2 2 + v(y) is 1-strongly convex, its conjugate ϕ(x) is 1-smooth. From the lower-semi-continuity of v it follows that ϕ(y) = ∥ • ∥ 2 2 + v (y) = ∥ • ∥ 2 2 + v (y) = 1 2 ∥y∥ 2 + v(y) =⇒ -v(y) = 1 2 ∥y∥ 2 -ϕ(y). ( ) We derive  inf y∈R D v(y) + 1 2 ∥x -y∥ 2 = 1 2 ∥x∥ 2 + inf y∈R D ϕ(y) -⟨x, y⟩ = 1 2 ∥x∥ 2 -sup y∈R D ⟨x, y⟩ -ϕ(y) = 1 2 ∥x∥ 2 -ϕ(x). ( ) We substitute ( 22) and ( 23) to ( 21) and obtain W 2 2,1 (P, Q) =max ϕ∈Cvx(1) X ∥x∥ 2 2 dP(x)+ Y ∥y∥ 2 2 dQ(x)- X ϕ(x)dP(x)+ Y ϕ(y)dQ(y) . ( ) We only need to note that (24) exactly matches the desired (9) for γ = 1. Part 2 (arbitrary γ > 0). In (Gozlan & Juillet, 2020, 5. 2), the authors show that the OT problem between P and Q for the γ-weak cost becomes the OT problem between 1 γ P and Q for the 1-weak cost. It holds -Veraguas et al., 2019, Thm. 1.4 ) W 2 2,γ (P, Q) = γ • inf P ′ ⪯Q W 2 2 ( 1 γ ♯P, P ′ ) =W 2 2,1 ( 1 γ ♯P,Q), see (Backhoff + 1 -γ 2γ X ∥x∥ 2 dP(x) + 1 -γ 2 Y ∥y∥ 2 dQ(x). (25) Moreover, ϕ * (x) is the optimal restricted potential for γ-weak cost between P, Q if and only if ϕ * 1 (x) = γϕ * (γ -1 x) is the optimal restricted potential between 1 γ ♯P and Q for the 1-weak quadratic cost. Note that ∇ϕ * (x) = ∇ϕ * 1 ( 1 γ x), i.e., ∇ϕ * (x) first scales x ∼ P by 1 γ and then implements projection ∇ϕ * 1 of 1 γ ♯P to Q. In particular, the function ϕ * is 1 γ -Lipschitz. We reparameterize the duality formula (24) for distributions 1 γ ♯P and Q as follows: W 2 2,1 ( 1 γ ♯P, Q) = max ϕ1∈Cvx(1) X ∥x∥ 2 2 d 1 γ ♯P (x)+ Y ∥y∥ 2 2 dQ(x)- X ϕ 1 (x)d 1 γ ♯P (x)+ Y ϕ 1 (y)dQ(y) = max ϕ1∈Cvx(1) 1 γ 2 X ∥x∥ 2 2 dP(x)+ Y ∥y∥ 2 2 dQ(x)- X ϕ 1 ( 1 γ x)dP(x)+ Y ϕ 1 (y)dQ(y) = (26) max ϕ∈Cvx( 1 γ ) 1 γ 2 X ∥x∥ 2 2 dP(x)+ Y ∥y∥ 2 2 dQ(x)- 1 γ X ϕ(x)dP(x)+ Y ϕ(y)dQ(y) . ( ) In transition from ( 26) to ( 27), we use the change of variables for ϕ(x) = γϕ 1 ( 1 γ x) known as the right scalar multiplication. It yields ϕ(y) = γϕ 1 (y). To finish the derivation of dual form (9), we simply substitute ( 27) to ( 25).

G PROOFS

Proof of Lemma 1. Assume the opposite, i.e., T † ♯(P × S) = Q. Then T † implicitly represents some transport plan π † ∈ Π(P, Q) between P and Q. By the definition of f * , T † and ( 6), it holds Cost(P, Q) = L(f * , T † ) = X C(x, T † x ♯S)dP(x) - X Z f * T † x (z) dS(z)dP(x) + Y f * (y)dQ(y) = (28) X C(x, T † x ♯S)dP(x) - Y f * (y)dQ(y) + Y f * (y)dQ(y) = (29) X C(x, T † x ♯S)dP(x) = X C(x, π † (•|x))dπ † (x), where in transition between lines ( 28) and ( 29), we use the change of variables formula for y = T † x (z) and the equality T † ♯(P × S) = Q. From (30) we see that the cost of the plan π † ∈ Π(P, Q) equals the optimal cost Cost(P, Q). As a result, it is an optimal plan by definition. In turn, T † is an stochastic OT map but not a fake solution. This is a contradiction. Thus, it holds that T † ♯(P × S) ̸ = Q. Proof of Lemma 2. We compute the C-transform of f * (y) = ∥y∥ 2 2 -ϕ * (x). We have (f * ) C (x) = inf µ∈P(Y) C 2,γ (x, µ) - Y f * (y)dµ(y) = inf µ∈P(Y) 1 2 Y ∥x -y∥ 2 dµ(y) - γ 2 Var(µ) - Y ∥y∥ 2 2 -ϕ * (y) dµ(y) = inf µ∈P(Y) 1 2 ∥x∥ 2 -⟨x, Y y dµ(y)⟩ + 1 2 Y ∥y∥ 2 dµ(y) - γ 2 Var(µ) - Y ∥y∥ 2 2 -ϕ * (y) dµ(y) = inf µ∈P(Y) 1 2 ∥x∥ 2 -⟨x, Y y dµ(y)⟩ - γ 2 Var(µ) + Y ϕ * (y)dµ(y) = 1 2 ∥x∥ 2 + inf µ∈P(Y) Y ϕ * (y) -⟨x, y⟩ dµ(y) - γ 2 Var(µ) = 1 2 ∥x∥ 2 + inf µ∈P(Y) Y ϕ * (y) - γ 2 ∥y∥ 2 -⟨x, y⟩ dµ(y) + γ 2 Y y dµ(y) Pick any µ ∈ P(Y) and denote its expectation by m = Y y dµ(y). Since ϕ * is 1 γ -smooth, it holds that ϕ * is γ-strongly convex, i.e., ϕ * (y) -γ 2 ∥y∥ 2 -⟨x, y⟩ is convex. We use the Jensen's inequality: Y ϕ * (y) - γ 2 ∥y∥ 2 -⟨x, y⟩ dµ(y) ≥ ϕ * (m) - γ 2 ∥m∥ 2 -⟨x, m⟩. Therefore, we may restrict the feasible set of inf in (31) to Dirac distributions δ m , m ∈ Y. That is, (f * ) C (x) = 1 2 ∥x∥ 2 + inf m∈Y ϕ * (m) - γ 2 ∥m∥ 2 -⟨x, m⟩ + γ 2 ∥m∥ 2 = 1 2 ∥x∥ 2 + inf m∈Y ϕ * (m) -⟨x, m⟩ = 1 2 ∥x∥ 2 -sup m∈Y ⟨x, m⟩ -ϕ * (m) =ϕ * (x) = 1 2 ∥x∥ 2 -ϕ * (x), ( ) where ϕ * (x) = ϕ * (x) holds since ϕ * is continuous. We substitute (33) to ( 5) and obtain X (f * ) C (x)dP(x) + Y f * (y)dQ(y) = X ∥x∥ 2 2 -ϕ * (x) dP(x) + Y ∥y∥ 2 2 -ϕ * (y) dQ(y) = W 2 2,γ (P, Q), where in line (34), we use the optimality of ϕ * and ( 9). We conclude that f * maximizes ( 5), ( 6). Proof of Proposition 1. For all y ′ , y ′′ ∈ U ψ (y), α ∈ [0, 1], from the convexity of ψ it follows that ψ αy ′ + (1 -α)y ′′ ≤ αψ(y ′ ) + (1 -α)ψ(y ′′ ). By the definition of the subgradient of a convex function it also holds ψ αy ′ + (1 -α)y ′′ ≥ ψ(y) + ⟨x, αy ′ + (1 -α)y ′′ -y⟩ = α ψ(y) + ⟨x, y ′ -y⟩ + (1 -α) • ψ(y) + ⟨x, y ′′ -y⟩ = αψ(y ′ ) + (1 -α)ψ(y ′′ ) for all x ∈ ∂ y ψ(y). Therefore, ( 35) and ( 36) are equalities, and αy ′ + (1 -α)y ′′ ∈ U ψ (y). Proof of Theorem 1. It holds that T † ∈ arg inf T L(f * , T ) ⇔ T † ∈ arg inf T X C x, T x ♯S - Z f * T x (z) dS(z) dP(x). The latter condition holds if and only if P-almost surely for all x ∈ X we have T † x ∈ arg inf Tx:Z→Y C x, T † x ♯S - Z f * T † x (z) dS(z) . We substitute f * = ∥•∥ 2 2 -ϕ * and C = C 2,γ . As a result, we derive C 2,γ x, T † x ♯S -f * T † x (z) dS(z) = 1 2 Z ∥x -T † x (z)∥ 2 dS(z) - γ 2 Var(T † x ♯S) - 1 2 Z ∥T † x (z)∥ 2 dS(z) + Z ϕ * T † x (z) dS(z) = 1 2 ∥x∥ 2 -⟨x, Z T † x (z)dS(z)⟩ - γ 2 Var(T † x ♯S) + Z ϕ * T † x (z) dS(z) = 1 2 ∥x∥ 2 -⟨x, Z T † x (z)dS(z)⟩ + γ 2 Z T † x (z)dS(z) 2 + Z ϕ * T † x (z) - γ 2 ∥T † x (z)∥ 2 dS(z) = 1 2 ∥x∥ 2 -⟨x, Z T † x (z)dS(z)⟩ + γ 2 Z T † x (z)dS(z) 2 + Z ψ T † x (z) dS(z) ≥ (38) 1 2 ∥x∥ 2 -⟨x, Z T † x (z)dS(z)⟩ + γ 2 Z T † x (z)dS(z) 2 + ψ Z T † x (z)dS(z) , where we substitute (Kakade et al., 2009) . In transition from (38) to (39), we use the convexity of ψ and the Jensen's inequality. ψ = ϕ * -γ 2 ∥ • ∥ 2 . Recall that ψ is convex since ϕ * is 1 γ -smooth Part 1. To begin with, we prove implication "=⇒" in (11). If (37) holds true, then (39) is the equality. If it is not, one may pick T ′ x (z) def = Z T † x (z ′ )dS(z ′ ) which provides smaller value (39) for (37) than T † x . Let T * be any stochastic OT map and π * be its respective OT plan. We know that T * ∈ arg inf T L(f * , T ), i.e., the optimal value (39) of (37) equals 1 2 ∥x∥ 2 -⟨x, Z T * x (z)dS(z)⟩ + γ 2 Z T * x (z)dS(z) 2 + ψ Z T * x (z)dS(z) = (40) 1 2 ∥x∥ 2 -⟨x, ∇ϕ * (x)⟩ + γ 2 ∥∇ϕ * (x)∥ 2 + ψ ∇ϕ * (x) , where we use the equalities ∇ϕ * (x) = Y ydπ * (y|x) = Z T * x (z)dS(z), see 2. The function m → 1 2 ∥x∥ 2 -⟨x, m⟩ + γ 2 ∥m∥ 2 + ψ m is γ-strongly convex in m, therefore, the minimizer m * = ∇ϕ * (x) is unique. This yields that Z T † x (z)dS(z) = m * = ∇ϕ * (x). We know that (38) equals (39), i.e., Z ψ T † x (z) dS(z) -ψ Z T † x (z)dS(z) = Z ψ T † x (z) dS(z) -ψ ∇ϕ * (x) = 0, and the Jensen's gap vanishes. Now we are going to prove that Supp(T x ♯S) ⊂ U ψ (∇ϕ * (x)). We need to show that for every x ′ ∈ ∂ y (∇ϕ * (x)) the following inequality ψ(y) ≥ ψ(∇ϕ * (x)) + ⟨x ′ , y -∇ϕ * (x)⟩ (43) is the equality T † x ♯S-almost surely for all y ∈ Supp(T † x ♯S). Assume the opposite, i.e., for some x ′ , (43) holds true not T † x ♯S-almost surely. We integrate (43) w.r.t. y ∼ T † x ♯S and get the strict inequality Y ψ(y)d T † x ♯S (y) > ψ ∇ϕ * (x) + ⟨x ′ , Y y d T † x ♯S (y) -∇ϕ * (x)⟩. We use the change of variables for y = T † x (z) and obtain Z ψ T † x (z) dS(z) > ψ ∇ϕ * (x) + ⟨y ′ , Z T † x (z)dS(z) -∇ϕ * (x) =0 ⟩ = ψ ∇ϕ * (x) , which contradicts (42). This finishes the proof of implication "=⇒". Part 2. Now we prove implication "⇐=" in (11). For every T † satisfying the conditions on the right-hand side of (11), the Jensen's gap (42) is zero since ψ is linear in U ψ ∇ϕ * (x) . Therefore, (38) equals (39). Due to Z T † x (z)dS(z) = ∇ϕ * (x), we have that (39) attains the optimal (minimal) values. Consequently, (37) holds. Finally, we note that convexity of arg inf T L(f * , T ) follows from the convexity of sets U ψ (•). Proof of Lemma 3. The plan π ∈ Π(P, Q) is optimal if and only if Y y dπ(y|x) = ∇ϕ * (x). That is, a stochastic map T : X × Z → Y is optimal if and only if it pushes P to Q (represents some plan π ∈ Π(P, Q)), i.e., T ♯(P×S) = Q, and Z T x (z)dS = ∇ϕ * (x). The optimal barycentric projection T * satisfies the second condition by the definition.

Consider implication "

(a) =⇒". Since T * is a stochastic OT map, the first condition T * ♯(P×S) = Q holds true. Recall that by the definition of ϕ * and T * we have T * ♯(P × S) = ∇ϕ * ♯P = ( 1 γ ♯P). Thus, Q = Proj ⪯Q ( 1 γ ♯P). Consider implication " ⇐=". Since Proj ⪯Q ( 1 γ ♯P) = Q, the first condition T * ♯(P×S) = Q holds true. Therefore, T * is a stochastic OT map.

Now we prove implication " (b)

=⇒". Let T * be any stochastic OT map. We compute the second moment of T * x ♯(P × S) = Q below: Y ∥y∥ 2 dQ(y) = X Z ∥T * x (z)∥ 2 dS(z)dP(x) = (44) X Var(T * x ♯S) + Z T * x (z)dS(z) 2 dP(x) = X Var(T * x ♯S)dP(x) + X ∥∇ϕ * (x)∥ 2 dP(x) = (45) X Var(T * x ♯S)dP(x) + Y ∥y∥ 2 dQ(y), where in transition to (45), we use the equality Z T * x (z)dS(z) = ∇ϕ * (x); in transition to (46), we use the change of variables formula for y = ∇ϕ * (x) and ∇ϕ * (x)♯P = Q. Finally, by comparing ( 44) and ( 46), we obtain that Var(T x ♯S) = 0 holds P-almost surely. That is, T * is deterministic (does not depend on z) and (P×S)-almost surely matches the optimal barycentric projection T * . Proof of Proposition 2. We are going to show that the second moment of T α ♯(P×S) is less than that of Q. Consequently, T α ♯(P×S) ̸ = Q, and T α can not be a stochastic OT map. First, for T † we have X Z ∥T † x (z)∥ 2 dS(z)dP(x) = Var T † ♯(P × S) ≤Var(Q) - X Z T † x (z)dS(z)dP(x) 2 ≤ Var(Q) - X ∇ϕ * (x)dP(x) 2 = Var(Q) -∥m Q ∥ 2 = X Z ∥T * x (z)∥ 2 dS(z)dP(x), where in the last equality we use T * ♯(P × S) = Q. Finally, we derive X Z ∥T α x (z)∥ 2 dS(z)dP(x) = X Z ∥αT * x (z) + (1 -α)T † x (z)∥ 2 dS(z)dP(x) < X Z α∥T * x (z)∥ 2 + (1 -α)∥T † x (z)∥ 2 dS(z)dP(x) = (48) α X Z ∥T * x (z)∥ 2 dS(z)dP(x) + (1 -α) X Z ∥T † x (z)∥ 2 dS(z)dP(x) ≤ X Z ∥T * x (z)∥ 2 dS(z)dP(x) = [second moment of Q]. In transition to (48), we use the Jensen's inequality for ∥ • ∥ 2 . The inequality is strict since ∥ • ∥ 2 is strictly convex and T † ̸ = T * (P × S-almost surely). In transition to (49), we use (47). That is, T α ♯(P × S) ̸ = Q as its second moment is smaller.

Remark.

The assumption Proj ⪯Q ( 1 γ ♯P) ̸ = Q in Proposition 2 is needed to guarantee the existence of a function T † ∈ arg inf T L(f * , T ) which differs from the given stochastic OT map T * . Due to our Lemma 3, the optimal barycenteric projection T * ∈ arg inf T L(f * , T ) is not an OT map. Thus, T † = T * is a suitable example. Since T * ♯(P×S) ⪯ Q, we also have Var T * ♯(P× S ≤ Var(Q). Proof of Lemma 4. The lower-semi-continuity of C k,γ (x, µ) = C u,γ (x, µ) in (x, µ) follows from the continuity of k, compactness of X = Y ⊂ R D and (Santambrogio, 2015, Lemma 7.3 ). 4The term 1 2 Y ∥u(x) -u(y)∥ 2 H dµ(y) in ( 12) is linear in µ and, consequently, convex. The second term in (12) equals to -γ 2 Var(u♯µ). The pushforward operator u♯ is linear and Var(•) is concave. Therefore, the second term is convex in µ (γ ≥ 0). As a result, C k,γ (x, µ) is convex in µ. To prove that C k,γ is lower-bounded, we rewrite (12) analogously to (3), i.e., C k,γ (x, µ) = 1 2 ∥u(x) - Y u(y)dµ(y)∥ 2 H + 1 -γ 2 • 1 2 Y×Y ∥u(y) -u(y ′ )∥ 2 H dµ(y)dµ(y ′ ) = 1 2 ∥u(x) - Y u(y)dµ(y)∥ 2 H + 1 -γ 2 • Var(u♯µ). (50) Both terms are non-negative (γ ∈ [0, 1]), i.e., C k,γ (x, µ) is lower bounded by 0. Proof of Lemma 5. To begin with, we prove that if k is characteristic, then C k,γ (x, µ) = C u,γ (x, µ) is strictly convex in µ. The term X ∥u(x) -u(y)∥ 2 H dµ(y) in ( 12) is linear in µ, so we focus on the second (variance) term -γ 2 Var(u♯µ). We prove that Var(u♯µ) is strictly concave. We derive Var(u♯µ) = 1 2 Y×Y ∥u(y) -u(y ′ )∥ 2 H dµ(y)dµ(y ′ ) = 1 2 Y ∥u(y)∥ 2 H dµ(y) - Y×Y ⟨u(y), u(y ′ )⟩ H dµ(y)dµ(y ′ ) + 1 2 Y ∥u(y ′ )∥ 2 H dµ(y ′ ) = Y ∥u(y)∥ 2 H dµ(y) - Y u(y)dµ(y), Y u(y ′ )dµ(y ′ ) H = Y ∥u(y)∥ 2 H dµ(y) - Y u(y)dµ(y) 2 H . The first term in ( 51) is linear in µ, so it sufficies to prove that Y u(y)dµ(y) 2 H = ∥u(µ)∥ 2 H is strictly convex. To do this, we pick any µ 1 ̸ = µ 2 ∈ P(X ), α ∈ (0, 1). Since the kernel k is characteristic, it holds that u(µ 1 ) ̸ = u(µ 2 ). The squared norm function ∥ • ∥ 2 H is strictly convex. As a result, the following strict inequality holds α∥u(µ 1 )∥ 2 H + (1 -α)∥u(µ 2 )∥ 2 H > ∥u αµ 1 + (1 -α)µ 2 ∥ 2 H , which yields strict convexity of ∥u(µ)∥ 2 H . Consequently, C k,γ (x, µ) is strictly convex in µ. In this case, the weak OT functional π → X C k,γ (x, π(•|x))dP(x) is also strictly convex and yields the unique minimizer π * ∈ Π(P, Q) which is the OT plan (Backhoff-Veraguas et al., 2019, 1.3.1) . Proof of Theorem 2. To begin with, we expand the functional L: T * ∈ arg inf T L(f * , T ) ⇐⇒ T * ∈ arg inf T X C x, T x ♯S - Z f * T x (z) dS(z) dP(x). (52) Define µ * x def = T * x ♯S. We are going to prove that µ * x ≡ π * (y|x), where π * ∈ Π(P, Q) is the (unique) optimal plan; this yields that T * is a stochastic OT map. Since the optimization over functions in NOT equals to the optimization over distributions that they generate (Korotin et al., 2023, 4 .1), we have {µ * x } ∈ arg inf {µx} X C x, µ x - Y f * (y)dµ x (y) dP(x), where the inf is taken over collections of distributions µ x ∈ P(Y) indexed by X . Importantly, ( 53) can be split into x ∈ X independent problems, i.e., we have that µ * x ∈ arg inf µ C x, µ x - Y f * (y)dµ x (y) holds true P-almost surely for all x ∈ X . Note that the functional µ → C(x, µ) -Y f * (y)dµ(y) consists of a strictly convex term C(x, µ), which follows from the proof of Lemma 5, and a linear term (integral over µ). Therefore, the functional itself is strictly convex. Since π * (y|x) minimizes this functional, it is the unique solution due to the strict convexity. Therefore, µ * (x) = π * (y|x) holds true P-almost surely for x ∈ X and T * x ♯S = µ * x = π * (y|x), i.e., T * is a stochastic OT map. Proof of Proposition 3. Since π * γ1 is optimal for the γ 1 -weak cost, for all π ∈ Π(P, Q) it holds Cost k,γ1 (π * γ1 ) ≤ Cost k,γ1 (π). In particular, for π = π * γ2 it holds true that Dist 2 u (π * γ1 ) -γ 1 CVar u (π * γ1 ) ≤ Dist 2 u (π * γ2 ) -γ 1 CVar u (π * γ2 ). Analogously, π * γ2 is optimal for the γ 2 -weak cost; the following holds: Dist 2 u (π * γ2 )-γ 2 CVar u (π * γ2 ) ≤ Dist 2 u (π * γ1 )-γ 2 CVar u (π * γ1 ). (56) We sum ( 55) and ( 56) and obtain -γ 1 CVar u (π * γ1 ) -γ 2 CVar u (π * γ2 ) ≤ -γ 1 CVar u (π * γ2 ) -γ 2 CVar u (π * γ1 ), or, equivalently, (γ 2 -γ 1 ) CVar u (π * γ1 ) ≤ (γ 2 -γ 1 ) CVar u (π * γ2 ), which is equivalent to CVar u (π * γ1 ) ≤ CVar u (π * γ2 ) since γ 2 -γ 1 > 0. Now we multiply ( 55) and ( 56) by γ 2 and γ 1 , respectively, and sum the resulting inequalities. We obtain γ 2 Dist 2 u (π * γ1 ) + γ 1 Dist 2 u (π * γ2 ) ≤ γ 2 Dist 2 u (π * γ2 ) + γ 1 Dist 2 u (π * γ1 ), or, equivalently, (γ 2 -γ 1 ) Dist 2 u (π * γ1 ) ≤ (γ 2 -γ 1 ) Dist 2 u (π * γ2 ), which provides Dist u (π * γ1 ) ≤ Dist u (π * γ2 ). Finally, we note that Cost k,γ2 (π * γ2 ) = inf π∈Π(P,Q) Cost k,γ2 (π) ≤ Cost k,γ2 (π * γ1 ) = Cost k,γ1 (π * γ1 ) -(γ 2 -γ 1 ) ≥0 CVar u (π * γ1 ) ≥0 ≥ Cost k,γ1 (π * γ1 ), which concludes the proof. H WEAK QUADRATIC VS. KERNEL COSTS ON REAL DATA The goal of this section is to demonstrate that our proposed kernel cost (12) consistently outperforms the weak quadratic cost (3) in the downstream task of unpaired image translation. We consider the distance-induced kernel k(x, y) = 1 2 ∥x∥ + 1 2 ∥y∥ -1 2 ∥x -y∥. In short, we run NOT (Korotin et al., 2023 , Algorithm 1) multiple times (with various random seeds) with the same hyperparameters (Appendix I) for weak and kernel costs, and then we compare the obtained FID (µ ± σ). Datasets. We consider shoes → handbags and celeba (female) → anime translation. We work only with small 32 × 32 images to speed up the training and be able to run many experiments. EXPERIMENT 1. We consider the γ-weak quadratic cost for each γ ∈ { 3, 4 provide several important insights. First, we see that with the increase of parameter γ, the best FID stably increases. Second, the overall training becomes less stable: max FID becomes extremely large, which indicates severe fluctuations of the model. In particular, both the mean and standard deviation of last FID (as well as max FID) drastically increase. Why does this happen? We think that with the increase of γ sets U ψ (•) (11) become large. These sets determine how much a fake solution may vary (Theorem 1). Thus, this naturally leads to high ambiguity of the solutions and results in unstable and unpredictable behaviour. EXPERIMENT 2. We pick the highest considered value γ = 2 and show that our γ-weak kernel cost performs better than the weak quadratic cost from the previous experiments. We run NOT with the kernel cost 5 times and report the results in the same Tables 3, 4 . The results show that even for high γ the issues with the fluctuation are notably softened. This is seen from the fact that for γ = 2 the gap between the last FID (or max FID) and best much smaller for the kernel cost than for the quadratic cost. In particular, this gap is comparable to the gap for γ = 1 2 -weak quadratic cost for which sets U ψ (•) are presumably small and provide less ambiguity to the solutions. Conclusion. Our empirical evaluation shows that NOT with our proposed kernel costs yields more stable behaviour than NOT with the weak quadratic cost. This agrees with our theory which suggests that one of the reasons for unstable behaviour and severe fluctuations might be the existence of the fake solutions ( 3.1). Our weak kernel cost removes all the fake solutions ( 3.2). 

I ADDITIONAL TRAINING DETAILS

Pre-processing. In all the cases, we rescale RGB channels of images from [0, 1] to [-1, 1]. As in (Korotin et al., 2023) , we beforehand rescale anime face images to 512 × 512, and do 256 × 256 crop with the center located 14 pixels above the image center to get the face. Next, for all the datasets except for the describable textures, we resize images to the required size (64 × 64 or 128 × 128). Specifically for the describable textures dataset (≈5K textures), we augment the samples. We rescale input textures to minimal border size of 300, do the random resized crop (from 128 to 300 pixels) and random horizontal & vertical flips. Then we resize images to the required size (64 × 64 or 128 × 128). Neural networks. We use WGAN-QC discriminator's ResNet architecture (Liu et al., 2019) for potential f . We use UNetfoot_5 (Ronneberger et al., 2015) as the stochastic transport map T (x, z) = T x (z). To condition it on z, we insert conditional instance normalization (CondIN) layers after each UNet's upscaling blockfoot_6 . We use CondIN from AugCycleGANfoot_7 (Almahairi et al., 2018) . In experiments, z is the 128-dimensional standard Gaussian noise. Optimization. To learn stochastic OT maps, we use NOT algorithm (Korotin et al., 2023, Algorithm 1) . We use the Adam optimizer (Kingma & Ba, 2014) with the default betas for both T θ and f ω . The learning rate is lr = 1 • 10 -4 . We use the MultiStepLR scheduler which decreases lr by 2 after [15k, 25k, 40k, 55k, 70k] (iterations of f ω ). The batch size is |X| = 64, |Z x | = 4. The number of inner iterations is k T = 10. In toy experiments, we do 10K total iterations of f ω update. In the image-to-image translation experiments, we observe convergence in ≈ 70k iterations for 128 × 128 datasets, in ≈ 40k iterations for 64 × 64 datasets. In image-to-image translation, we gradually change γ. Starting from γ = 0, we linearly increase it to the desired value (mostly 1 3 ) during 25K first iterations of f ω . Computational complexity. NOT with kernel costs for 128 × 128 images converges in 3-4 days on a 4× Tesla V100 GPUs (16 GB). This is slightly bigger than the respective time for NOT (Korotin et al., 2023, 6) with the quadratic cost due to the reasons discussed in 3.3. We do not include the results of AugCycleGAN on texture→shoes as it did not converge on these datasets (FID ≫ 100). We tried tuning its hyperparameters, but this did not yield improvement. 



kaggle.com/reitanaka/alignedanimefaces https://github.com/iamalexkorotin/NeuralOptimalTransport In this case, the cost is the strong quadratic cost. We use the lower semi-continuity (in µ) of C(x, µ) w.r.t. the weak convergence of distributions in P(Y). In contrast,Backhoff-Veraguas et al. (2019) work with µ ∈ Pp(Y) ⊂ P(Y), i.e., with the distributions which have a finite p-th moment. They prove the existence and duality results for weak OT (2) assuming that C(x, µ) is lower semi-continuous w.r.t. the convergence in the Wasserstein-p sence in Pp(Y). Since we consider compact Y, it holds that Pp(Y) = P(Y) and these notions of convergence coincide(Villani, 2008, Def. 6.8). We choose 25k iterations as the starting point because at this time point FID roughly stabilizes at a small level. This indicates the model has nearly converged and starts fluctuating around the optimum. github.com/milesial/Pytorch-UNet github.com/kgkgzrtk/cUNet-Pytorch github.com/ErfanMN/Augmented_CycleGAN_Pytorch github.com/eriklindernoren/PyTorch-GAN/tree/master/implementations/ cyclegan github.com/aalmah/augmented_cyclegan github.com/NVlabs/MUNIT



For a function u : Y → H (feature map), we denote the respective positive definite symmetric (PDS) kernel byk(y, y ′ ) def = ⟨u(y), u(y ′ )⟩ H . A PDS kernel k : Y × Y → R is called characteristic if the kernel mean embedding P(Y) ∋ µ → u(µ) def = X u(y)dµ(y) ∈ H is a one-to-one mapping.For a function ϕ : R D → R ∪ {∞}, we denote its convex conjugate by ϕ(y) def = sup x∈R D {⟨x, y⟩ -ϕ (x)}.

def = inf µ∈P(Y) {C(x, µ)-Y f (y)dµ(y)} is the weak C-transform.

Figure 3: Implicit representation of a transport plan via function T : X ×Z → Y.

(a) Input and target distributions P and Q. (b) Fitted map Tx(z) for γ = 1 2 . (c) Fitted map Tx(z) for γ = 3 4 . (d) Fitted map Tx(z) for γ = 1.

Figure 5: Stochastic maps T between P and Q fitted by NOT algorithm with costs C 2,γ for various γ.

(a) Iteration 10k. (b) Iteration 10k+100. (c) Iteration 10k+200. (d) Iteration 10k+300.

Figure 6: The evolution of the learned transport map T during training on a toy 2D example (γ = 1). conditional expectation of π * (y|x) shared by all OT plans π * ( 2). However, T ♯(P×S) ̸ = Q and NOT fails to learn an OT plan.

(a) Handbag → shoes, 128 × 128. (b) Shoes → handbags, 128 × 128.

Figure 7: Unpaired one-to-many translation with kernel Neural Optimal Transport (NOT).

) the γ-weak kernel cost. The γ-weak quadratic cost (3) is its particular case for H = R D and u(x) = x. The respective kernel k(x, y) = ⟨u(x), u(y)⟩ = ⟨x, y⟩ is bilinear. Lemma 4 (Weak kernel costs are appropriate). Let k be a continuous PDS kernel and γ ∈ [0, 1]. Then the cost C k,γ (x, µ) is convex, lower semi-continuous and lower bounded in µ. Corollary 2 (Existence and duality for kernel costs). Let k be a continuous PDS kernel and γ ∈ [0, 1]. Then an OT plan π * for cost C k,γ (x, µ) exists and duality formulas (5) and (6) hold true.

(a) Texture → shoes, 128 × 128. (b) Texture → handbags, 128 × 128.

Figure 8: Unpaired one-to-many translation with kernel Neural Optimal Transport (NOT). estimated in O(|Z|) operations (Korotin et al., 2023, Equation23), but NOT algorithm may suffer from fake solutions ( 3.1).

(a) Gaussian N (0, 1) → Mixture of 2 Gaussians, γ = 1. (b) Gaussian N (0, 1) → Mixture of 2 Gaussians, γ = 10. (c) Mixture of 3 Gaussians → Mixture of 2 Gaussians, γ = 1. (d) Mixture of 3 Gaussians → Mixture of 2 Gaussians, γ = 10.

Figure9: Stochastic plans (3rd and 4th columns) between toy 1D distributions (1st and 2nd columns) learned by our Kernel NOT (3rd column) and discrete OT with the weak kernel cost (4th column).14

(a) Gaussian N (0, [ 1 2 ] 2 I2) → Gaussian N (0, I2). (b) Gaussian N (0, [ 1 2 ] 2 I2) → Mixture of 8 Gaussians. (c) Gaussian N (0, 1 2 I2) → Swiss roll. (d) Mixture of 4 Gaussians → Uniform.

Figure 10: Stochastic (one-to-many) maps learned between toy 2D distributions by Kernel NOT.

Figure 11: Stochastic maps T between P and Q fitted by NOT with costs C 2,γ for various γ.

Iteration 10k+100. (c) Iteration 10k+200. (d) Iteration 10k+300.

Figure 12: The evolution of the learned transport map T during training on a toy 2D example (γ = 1).

Figure 13: Texture → shoes (64 × 64) translation with the γ-weak kernel cost for various values γ.

(a) Bilinear k(y, y ′ ) = ⟨y, y ′ ⟩ (Korotin et al., 2023), FID = 18.79 ± 0.08. (b) Distance k(y, y ′ ) = 1 2 ∥y∥+ 1 2 ∥y ′ ∥-1 2 ∥y ′ -y∥, FID = 16.13 ± 0.1. (c) Gaussian k(y, y ′ ) = exp -∥y-y ′ ∥ 2 2D , FID = 20.89 ± 0.09. (d) Laplacian k(y, y ′ ) = exp -∥y-y ′ ∥ 2D , FID = 20 ± 0.09.

Figure 14: Texture → handbags (64 × 64) translation with the different γ-weak kernel costs.

2} we run NOT with 5 different random seeds and train it for 60k iterations of f ω . In this experiment, during training, we evaluate test FID every 1k iteration and for each experiment we report 3 FID values: the best FID value of iterations 25-60k 5 indicating what the model can achieve best, the max FID value of iterations 25-60k indicating what the model achieves worst (because of potential training instabilities) and the last FID value at the end of training (60k) showing what the model actually achieved. The experimental results of Tables

COMPARISON WITH NOT WITH THE QUADRATIC COST (a) Outdoor → church, 128 × 128. (b) Celeba (f) → anime, 128 × 128. (c) Handbags → shoes, 128 × 128.

Figure 15: Unpaired translation with Neural Optimal Transport with various costs (C 2,0 , C 2,γ , C k,γ ).

(a) C2,0 (Korotin et al., 2023, 5.2). (b) C2,γ (Korotin et al., 2023, 5.3). (c) C k,γ [Ours].

Figure 16: Celeba (f) → anime (128 × 128) translation with NOT with costs C 2,0 , C 2,γ , C k,γ .

Figure 17: Outdoor → church (128 × 128) translation with NOT with costs C 2,0 , C 2,γ , C k,γ .

(a) C2,0 (Korotin et al., 2023, 5.2). (b) C2,γ (Korotin et al., 2023, 5.3). (c) C k,γ [Ours].

Figure 18: Handbags → shoes (128 × 128) translation with NOT with costs C 2,0 , C 2,γ , C k,γ .

NOT with C k,γ[Ours]

Figure 20: Outdoor → church (128 × 128) translation with various methods.

(a) Input images x and random translated examples Tx(z). (b) Interpolation in the conditional latent space, z = (1 -α)z1 + αz2.

Figure 22: Outdoor → church, 128 × 128. Additional examples.

Figure 23: Texture → shoes translation, 128 × 128. Additional examples.

Figure 25: Shoes → handbags translation, 128 × 128. Additional examples.

Test FID↓ of NOT with various costs.

The values of CVar( π γ ), Dist( π γ ) and Cost k,γ ( π γ ) of learned plans π γ ≈ π * γ with different values of γ on texture → shoes (64 × 64) translation with the kernel quadratic cost. i.e., for larger γ, the OT plan π * γ on average for each x yields more conditionally diverse samples π * (y|x) but they are less close to x in features w.r.t. ∥ • ∥ 2 H . The OT cost is non-increasing, i.e., Cost k,γ1 (π * γ1

Test FID↓ (µ ± σ) on shoes → handbags, 32 × 32 of C 2,γ and C k,γ for different γ. Red color highlights µ ≥ 30, orange color highlights σ ≥ 5.

Test FID↓ (µ ± σ) on celeba (female) → anime, 32 × 32 of C 2,γ and C k,γ for different γ. Red color highlights µ ≥ 40, orange color highlights σ ≥ 4.

Qualitative examples are shown in Figures 20, 19.

Test FID↓ of the considered image-to-image translation methods.

annex

Reproducibility. We provide the source code for all experiments and release the checkpoints for all models of 5. The details are given in README.MD in the official repository.

ACKNOWLEDGEMENTS.

The work was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021 -00145 02.11.2021) .Published as a conference paper at ICLR 2023 

