THE UNBALANCED GROMOV WASSERSTEIN DISTANCE: CONIC FORMULATION AND RELAXATION

Abstract

Comparing metric measure spaces (i.e. metric spaces endowed with a probability distribution) is at the heart of many machine learning problems. This includes for instance predicting properties of molecules in quantum chemistry or generating graphs with varying connectivity. The most popular distance between such metric measure spaces is the Gromov-Wasserstein (GW) distance, which is the solution of a quadratic assignment problem. This distance has been successfully applied to supervised learning and generative modeling, for applications as diverse as quantum chemistry and natural language processing. The GW distance is however limited to the comparison of metric measure spaces endowed with a probability distribution. This strong limitation is problematic for many applications in ML where there is no a priori natural normalization of the total mass of the data. Furthermore, imposing an exact conservation of mass across spaces is not robust to outliers and often leads to irregular matchings. To alleviate these issues, we introduce two Unbalanced Gromov-Wasserstein formulations: a distance and a more tractable upper-bounding relaxation. Both allow the comparison of metric spaces equipped with arbitrary positive measures up to isometries. The first formulation is a positive and definite divergence, based on a relaxation of the mass conservation constraint using a novel type of quadratically-homogeneous divergence. This divergence works hand in hand with the entropic regularization approach, which is popular for solving large-scale optimal transport problems. We show that the underlying non-convex optimization problem can be efficiently tackled using a highly parallelizable and GPU-friendly iterative scheme. The second formulation is a distance between mm-spaces up to isometries, based on a conic lifting. Lastly, we provide numerical simulations to highlight the salient features of the unbalanced divergence and its potential applications in ML.

1. INTRODUCTION

Comparing data distributions supported on different metric spaces is a basic problem in machine learning. This class of problems is for instance at the heart of surface (Bronstein et al., 2006) and graph matching (Xu et al., 2019) (equipping the surface or graph with its associated geodesic distance), regression problems in quantum chemistry (Gilmer et al., 2017) (viewing molecules as distributions of points in $\mathbb{R}^3$) and natural language processing (Grave et al., 2019; Alvarez-Melis & Jaakkola, 2018) (where texts in different languages are embedded as point distributions in different vector spaces).

Metric measure spaces. The mathematical way to formalize these problems is to model the data as metric measure spaces (mm-spaces). A mm-space is denoted $\mathcal{X} = (X, d, \mu)$, where $X$ is a complete separable set endowed with a distance $d$ and a positive Borel measure $\mu \in \mathcal{M}_+(X)$. For instance, if $X = (x_i)_i$ is a finite set of points, then $\mu = \sum_i m_i \delta_{x_i}$ (here $\delta_{x_i}$ is the Dirac mass at $x_i$) is simply a set of positive weights $m_i = \mu(\{x_i\}) \geq 0$ associated to each point $x_i$, which account for its mass or importance. For instance, setting some $m_i$ to 0 is equivalent to removing the point $x_i$. We refer to Sturm (2012) for a mathematical account of the theory of mm-spaces. In all the applications highlighted above, it makes sense to perform the comparisons up to isometric transformations of the data. Two mm-spaces $\mathcal{X} = (X, d_X, \mu)$ and $\mathcal{Y} = (Y, d_Y, \nu)$ are considered equal (denoted $\mathcal{X} \sim \mathcal{Y}$) if they are isometric, meaning that there exists a bijection $\psi : \mathrm{spt}(\mu) \to \mathrm{spt}(\nu)$ (where $\mathrm{spt}(\mu)$ is the support of $\mu$) such that $d_X(x, x') = d_Y(\psi(x), \psi(x'))$ and $\psi_\# \mu = \nu$. Here $\psi_\#$ is the push-forward operator, so that $\psi_\# \mu = \nu$ is equivalent to imposing $\nu(A) = \mu(\psi^{-1}(A))$ for any set $A \subset Y$. For discrete spaces where $\mu = \sum_i m_i \delta_{x_i}$, one should have $\nu = \psi_\# \mu = \sum_i m_i \delta_{\psi(x_i)}$.
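As a concrete illustration of this invariance, the following sketch (the function names are ours, not from the paper's code) checks numerically that a rigid motion of a finite point cloud leaves its pairwise-distance matrix, and hence any quantity built only on $(d, \mu)$, unchanged:

```python
import numpy as np

def distance_matrix(X):
    # Pairwise Euclidean distances d(x_i, x_j) of a point cloud X of shape (n, d).
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # points x_i (weights m_i are untouched by psi)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y = X @ R.T + np.array([3.0, -1.0])      # psi(x) = R x + t, an isometry of R^2

# The mm-spaces (X, d, mu) and (psi(X), d, psi_# mu) have identical distance matrices.
assert np.allclose(distance_matrix(X), distance_matrix(Y))
```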
As highlighted by Mémoli (2011), considering mm-spaces up to isometry is a powerful way to formalize and analyze a wide variety of problems, such as matching, regression and classification of distributions of points belonging to different spaces. The key to unlocking all these problems is the computation of a distance between mm-spaces up to isometry. So far, existing distances (reviewed below) assume that $\mu$ is a probability distribution, i.e. $\mu(X) = 1$. This constraint is not natural, and even problematic, for many practical applications in machine learning. The goal of this paper is to alleviate this restriction. We define for the first time a class of distances between unbalanced metric measure spaces, these distances being upper-bounded by divergences which can be approximated by an efficient numerical scheme.

Csiszár divergences. The simplest case is when $X = Y$ and one simply ignores the underlying metric. One can then use Csiszár divergences (or $\varphi$-divergences), which perform a pointwise comparison (in contrast with optimal transport distances, which perform a displacement comparison). Such a divergence is defined using an entropy function $\varphi : \mathbb{R}_+ \to [0, +\infty]$, which is a convex, lower semi-continuous, positive function with $\varphi(1) = 0$. The Csiszár $\varphi$-divergence reads
$$D_\varphi(\mu|\nu) \overset{\text{def.}}{=} \int_X \varphi\Big(\frac{d\mu}{d\nu}\Big)\, d\nu + \varphi_\infty \int_X d\mu^\perp,$$
where $\mu = \frac{d\mu}{d\nu}\nu + \mu^\perp$ is the Lebesgue decomposition of $\mu$ with respect to $\nu$, and $\varphi_\infty = \lim_{r \to \infty} \varphi(r)/r \in \mathbb{R} \cup \{+\infty\}$ is called the recession constant. This divergence $D_\varphi$ is convex, positive, 1-homogeneous and weak* lower semi-continuous; see Liero et al. (2015) for details. Particular instances of $\varphi$-divergences are the Kullback-Leibler divergence (KL) for $\varphi(r) = r\log(r) - r + 1$ (note that $\varphi_\infty = \infty$) and the Total Variation (TV) for $\varphi(r) = |r - 1|$.

Balanced and unbalanced optimal transport. If the common embedding space $X$ is equipped with a distance $d(x, y)$, one can use more elaborate methods such as optimal transport (OT) distances, which are computed by solving convex optimization problems.
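Both divergences admit closed forms on discrete measures. A minimal sketch (our own illustration, not the paper's code; the function names and the handling of mass where $\nu$ vanishes through $\varphi_\infty$ are our conventions):

```python
import numpy as np

def kl_div(mu, nu):
    # KL: phi(r) = r log r - r + 1, with phi_inf = +infinity, so any mass of mu
    # outside the support of nu makes the divergence infinite.
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((mu > 0) & (nu == 0)):
        return np.inf
    r = np.divide(mu, nu, out=np.zeros_like(mu), where=nu > 0)
    # phi(0) = 1, with the convention 0 log 0 = 0.
    phi = np.where(r > 0, r * np.log(np.where(r > 0, r, 1.0)) - r + 1.0, 1.0)
    return float(np.sum(np.where(nu > 0, nu * phi, 0.0)))

def tv_div(mu, nu):
    # TV: phi(r) = |r - 1|, phi_inf = 1, which reduces to the l1 norm |mu - nu|.
    return float(np.abs(np.asarray(mu, float) - np.asarray(nu, float)).sum())
```

For instance `kl_div([2, 0], [1, 1])` returns $2\log 2$, and the divergence is infinite as soon as $\mu$ charges a point where $\nu$ vanishes.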
This type of method has proven useful for ML problems as diverse as domain adaptation (Courty et al., 2014), supervised learning over histograms (Frogner et al., 2015) and unsupervised learning of generative models (Arjovsky et al., 2017). In this case, the extension from probability distributions to arbitrary positive measures $(\mu, \nu) \in \mathcal{M}_+(X)^2$ is now well understood and corresponds to the theory of unbalanced OT. Following Liero et al. (2015); Chizat et al. (2018c), a family of unbalanced Wasserstein distances is defined by solving
$$\mathrm{UW}(\mu, \nu)^q \overset{\text{def.}}{=} \inf_{\pi \in \mathcal{M}_+(X \times X)} \int \lambda(d(x, y))\, d\pi(x, y) + D_\varphi(\pi_1|\mu) + D_\varphi(\pi_2|\nu). \quad (1)$$
Here $(\pi_1, \pi_2)$ are the two marginals of the joint distribution $\pi$, defined by $\pi_1(A) = \pi(A \times X)$ for $A \subset X$. The mapping $\lambda : \mathbb{R}_+ \to \mathbb{R}$ and the exponent $q \geq 1$ should be chosen wisely, to ensure for instance that UW defines a distance (see Section 2.2.1). It is frequent to take $\rho D_\varphi$ instead of $D_\varphi$ (i.e. to take $\psi = \rho\varphi$) to adjust the strength of the marginal penalization. Balanced OT is retrieved with the convex indicator $\varphi = \iota_{\{1\}}$ or by taking the limit $\rho \to +\infty$, which enforces $\pi_1 = \mu$ and $\pi_2 = \nu$. When $0 < \rho < +\infty$, unbalanced OT operates a trade-off between transportation and creation of mass, which is crucial to be robust to outliers in the data and to cope with mass variations in the modes of the distributions. For supervised tasks, the value of $\rho$ should be cross-validated to obtain the best performance. Its use is gaining popularity in applications such as medical imaging registration (Feydy et al., 2019a), videos (Lee et al., 2019), generative learning (Balaji et al., 2020) and gradient flows to train neural networks (Chizat & Bach, 2018; Rotskoff et al., 2019). Furthermore, existing efficient algorithms for balanced OT extend to this unbalanced problem. In particular Sinkhorn's iterations, introduced in ML for balanced OT by Cuturi (2013), extend to unbalanced OT (Chizat et al., 2018a; Séjourné et al., 2019), as detailed in Section 3.
The Gromov-Wasserstein distance and its applications. The Gromov-Wasserstein (GW) distance (Mémoli, 2011; Sturm, 2012) generalizes the notion of OT to the setting of mm-spaces up to isometries. It corresponds to replacing the linear cost $\int \lambda(d)\, d\pi$ of OT by a quadratic functional,
$$\mathrm{GW}(\mathcal{X}, \mathcal{Y})^q \overset{\text{def.}}{=} \min_{\pi \in \mathcal{M}_+(X \times Y)} \Big\{ \int \lambda(|d_X(x, x') - d_Y(y, y')|)\, d\pi(x, y)\, d\pi(x', y') \;:\; \pi_1 = \mu,\ \pi_2 = \nu \Big\}. \quad (2)$$
It is proved in Mémoli (2011); Sturm (2012) that, with $\lambda(t) = t^q$, GW defines a distance up to isometries on balanced mm-spaces (i.e. when the measures are probability distributions). In this paper, we extend this construction to arbitrary positive measures, and provide explicit settings in Section 2.2.1. This distance has been applied successfully in natural language processing for unsupervised translation learning (Grave et al., 2019; Alvarez-Melis & Jaakkola, 2018), in generative learning for objects lying in spaces of different dimensions (Bunne et al., 2019) and to build VAEs for graphs (Xu et al., 2020). It has been adapted for domain adaptation over different spaces (Redko et al., 2020). It is also a relevant distance to compute barycenters between graphs or shapes (Vayer et al., 2018; Chowdhury & Needham, 2020). When $(\mathcal{X}, \mathcal{Y})$ are Euclidean spaces, this distance compares distributions up to rigid isometries, and is closely related (but not equal) to metrics defined by Procrustes analysis (Grave et al., 2019; Alvarez-Melis et al., 2019). The problem (2) is non-convex because the quadratic form $\int \lambda(|d_X - d_Y|)\, d\pi \otimes d\pi$ is not positive in general. It is in fact closely related to quadratic assignment problems (Burkard et al., 1998), which are used for graph matching and are known to be NP-hard in general. Nevertheless, non-convex optimization methods have been shown to be successful in practice when using GW distances for ML problems. This includes for instance alternating minimization (Mémoli, 2011; Redko et al., 2020) and entropic regularization (Peyré et al., 2016; Gold & Rangarajan, 1996).
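For finite mm-spaces, the GW objective is a quartic sum over pairs of pairs of points. A direct, deliberately naive $O(n^2 m^2)$ evaluation of this energy can be sketched as follows (names are illustrative, not from the paper's implementation):

```python
import numpy as np

def gw_energy(DX, DY, pi, q=2):
    # E(pi) = sum_{i,i',j,j'} |DX[i,i'] - DY[j,j']|^q * pi[i,j] * pi[i',j'],
    # evaluated by materializing the full (n, m, n, m) cost tensor.
    M = np.abs(DX[:, None, :, None] - DY[None, :, None, :]) ** q
    return float(np.einsum('ijkl,ij,kl->', M, pi, pi))
```

With $D_X = D_Y$ and the "identity" coupling $\pi = \mathrm{diag}(\mu)$, the energy vanishes, reflecting that a space is at GW distance 0 from itself.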
Related works and contributions. The concomitant work of De Ponti & Mondino (2020) extends the $L^p$ transportation distance defined in Sturm et al. (2006) to unbalanced mm-spaces and studies its geometric properties. This distortion distance is not equivalent to the GW distance, and is more difficult to estimate numerically because it explicitly imposes a triangle inequality constraint in the optimization problem. The work of Chapel et al. (2020) relaxes the GW distance to the unbalanced setting by hybridizing GW with partial OT (Figalli, 2010) for unsupervised labeling. It resembles one particular setting of our formulation, but with some important differences, detailed in Section 2. Our construction is also connected to partial matching methods, which find numerous applications in graphics and vision (Cosmo et al., 2016). In particular, Rodola et al. (2012) introduce a mass-conservation relaxation of the GW problem. The two main contributions of this paper are the definitions of two formulations relaxing the GW distance. The first one, called the Unbalanced Gromov-Wasserstein (UGW) divergence, can be computed efficiently on GPUs. The second one, called the Conic Gromov-Wasserstein (CGW) distance, is proved to be a distance between mm-spaces endowed with positive measures, up to isometries, as stated in Theorem 1, which is the main theoretical result of this paper. We also prove in Theorem 1 that UGW can be used as a surrogate upper-bounding CGW. We present these concepts and their properties in Section 2. We also detail in Section 3 an efficient computational scheme for a particular setting of UGW. This method computes an approximate stationary point of the non-convex energy. It leverages the strengths of entropic regularization and the Sinkhorn algorithm, namely that it is GPU-friendly and defines smooth loss functions amenable to back-propagation for ML applications.
Section 4 provides numerical experiments highlighting the qualitative behavior of this algorithm, which sheds some light on the favorable properties of UGW in coping with outliers and mass variations in the modes of the distributions.

2. UNBALANCED GROMOV-WASSERSTEIN FORMULATIONS

We present in this section our two new formulations and their properties. The first one, called UGW, is exploited in Sections 3 and 4 to derive an efficient algorithm used in the numerical experiments. The second one, called CGW, defines a distance between mm-spaces up to isometries. These results build upon those of Liero et al. (2015), and a summary of the construction of UOT is detailed in Appendix A. In all that follows, we consider complete separable mm-spaces endowed with a metric and a positive measure.

2.1. THE UNBALANCED GROMOV-WASSERSTEIN DIVERGENCE

This new formulation makes use of quadratic $\varphi$-divergences, defined as $D^\otimes_\varphi(\rho|\nu) \overset{\text{def.}}{=} D_\varphi(\rho \otimes \rho\,|\,\nu \otimes \nu)$, where $\rho \otimes \rho \in \mathcal{M}_+(X^2)$ is the tensor product measure defined by $d(\rho \otimes \rho)(x, y) = d\rho(x)\, d\rho(y)$. Note that $D^\otimes_\varphi$ is not a convex function in general.

Definition 1 (Unbalanced GW). The Unbalanced Gromov-Wasserstein divergence is defined as $\mathrm{UGW}(\mathcal{X}, \mathcal{Y}) = \inf_{\pi \in \mathcal{M}_+(X \times Y)} \mathcal{L}(\pi)$, where
$$\mathcal{L}(\pi) \overset{\text{def.}}{=} \int_{X^2 \times Y^2} \lambda(|d_X(x, x') - d_Y(y, y')|)\, d\pi(x, y)\, d\pi(x', y') + D^\otimes_\varphi(\pi_1|\mu) + D^\otimes_\varphi(\pi_2|\nu). \quad (3)$$

This definition can be understood as a hybridization of (1) and (2), but with a twist: one needs to use the quadratic divergence $D^\otimes_\varphi$ in place of $D_\varphi$. To the best of our knowledge, this is the first time such quadratic divergences are used and studied. In the TV case, this is the most important distinction between UGW and partial GW (Chapel et al., 2020). Note also that the balanced GW distance (2) is recovered as a particular case when using $\varphi = \iota_{\{1\}}$ or by letting $\rho \to +\infty$ for an entropy $\psi = \rho\varphi$. Using quadratic divergences results in UGW being 2-homogeneous: for $\theta \geq 0$, writing $(\mathcal{X}_\theta, \mathcal{Y}_\theta)$ for the spaces equipped with $(\theta\mu, \theta\nu)$, one has $\theta^{-2}\,\mathrm{UGW}(\mathcal{X}_\theta, \mathcal{Y}_\theta) = \mathrm{UGW}(\mathcal{X}, \mathcal{Y})$. When using non-tensorized $\varphi$-divergences, the resulting unbalanced Gromov-Wasserstein functional between $\mathcal{X}_\theta$ and $\mathcal{Y}_\theta$ has very different and inconsistent behaviors when $\theta \to 0$ and $\theta \to +\infty$: once normalized by $\theta^{-2}$ and $\theta^{-1}$ respectively, one obtains balanced GW and a Hellinger-type distance. Using tensorized divergences ensures that the behavior does not depend on $\theta$.

We first prove the existence of optimal plans $\pi$ solving (3), which holds for the three key settings of Section 2.2.1, namely for KL, TV, and for compact metric spaces (such as finite point clouds and graphs). All proofs are deferred to Appendix B.

Proposition 1 (Existence of minimizers). Assume that $(X, Y)$ are compact and that either (i) $\varphi$ is superlinear, i.e. $\varphi_\infty = \infty$, or (ii) $\lambda$ has compact sublevel sets in $\mathbb{R}_+$ and $2\varphi_\infty + \inf \lambda > 0$.
Then there exists $\pi \in \mathcal{M}_+(X \times Y)$ such that $\mathrm{UGW}(\mathcal{X}, \mathcal{Y}) = \mathcal{L}(\pi)$.

The following proposition ensures that the functional UGW can be used to compare mm-spaces.

Proposition 2 (Definiteness of UGW). Assume that $\varphi^{-1}(\{0\}) = \{1\}$ and $\lambda^{-1}(\{0\}) = \{0\}$. Then $\mathrm{UGW}(\mathcal{X}, \mathcal{Y}) \geq 0$, with equality if and only if $\mathcal{X} \sim \mathcal{Y}$.

We end this section with a reformulation of UGW (3) which is important to make the connection with the second formulation of the following section.

Lemma 1. Denoting by $(f \overset{\text{def.}}{=} \frac{d\mu}{d\pi_1},\ g \overset{\text{def.}}{=} \frac{d\nu}{d\pi_2})$ the Lebesgue densities of $(\mu, \nu)$ w.r.t. $(\pi_1, \pi_2)$, such that $\mu = f\pi_1 + \mu^\perp$ and $\nu = g\pi_2 + \nu^\perp$, one has
$$\mathcal{L}(\pi) = \int_{X^2 \times Y^2} L_{\lambda(|d_X - d_Y|)}(f \otimes f, g \otimes g)\, d\pi\, d\pi + \varphi(0)\big(|(\mu \otimes \mu)^\perp| + |(\nu \otimes \nu)^\perp|\big). \quad (4)$$
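On discrete spaces, the UGW objective of Definition 1 can be evaluated directly by forming the tensor products of the marginals. The sketch below (our own illustration with names of our choosing) does this for $D_\varphi = \rho\,\mathrm{KL}$, using the unnormalized KL divergence on the outer products:

```python
import numpy as np

def kl(a, b):
    # Unnormalized KL: sum a log(a/b) - m(a) + m(b); assumes b > 0 wherever a > 0.
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = a > 0
    return float(np.sum(a[m] * np.log(a[m] / b[m])) - a.sum() + b.sum())

def ugw_energy(DX, DY, mu, nu, pi, rho=1.0, q=2):
    # Quadratic transport term of (3).
    M = np.abs(DX[:, None, :, None] - DY[None, :, None, :]) ** q
    quad = np.einsum('ijkl,ij,kl->', M, pi, pi)
    # Tensorized marginal penalties D_phi(pi_1 x pi_1 | mu x mu) and its nu analogue.
    p1, p2 = pi.sum(1), pi.sum(0)
    pen = kl(np.outer(p1, p1), np.outer(mu, mu)) + kl(np.outer(p2, p2), np.outer(nu, nu))
    return float(quad + rho * pen)
```

When $\mathcal{X} = \mathcal{Y}$ and $\pi = \mathrm{diag}(\mu)$, both the quadratic term and the marginal penalties vanish, consistent with Proposition 2.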

2.2. THE CONIC GROMOV-WASSERSTEIN DISTANCE

We introduce a second "conic" formulation of unbalanced GW, which is connected to UGW, and whose construction is inspired by the conic formulation of UOT (see Appendix A for an overview).

2.2.1. BACKGROUND ON CONE SETS AND DISTANCES

The conic formulation lifts a point $x \in X$ to a couple $(x, r) \in X \times \mathbb{R}_+$, where $r$ encodes some (power of a) mass. We then seek optimal transport plans defined over $C[X] \overset{\text{def.}}{=} X \times \mathbb{R}_+ / (X \times \{0\})$, where the coordinates $(x, r = 0)$ carrying no mass are merged into a single point $0_X$ called the apex of the cone. In the sequel, points of $X \times \mathbb{R}_+$ are denoted $(x, r)$, while $[x, r]$ denotes the corresponding quotiented point of $C[X]$.

While transport plans depend on variables $([x, r], [y, s])$ and $([x', r'], [y', s'])$ in $C[X] \times C[Y]$, the transportation cost involved in our conic formulation only makes use of the 2-D cone $C[\mathbb{R}_+]$ over $\mathbb{R}_+$ endowed with the distance $|u - v|$ (note that any other distance on $\mathbb{R}$ could be used as well). More specifically, we consider coordinates of the form $([u, a], [v, b]) = ([d_X(x, x'), rr'], [d_Y(y, y'), ss']) \in C[\mathbb{R}_+] \times C[\mathbb{R}_+]$. Thus we now describe conic discrepancies $D$ on $C[\mathbb{R}_+]$, which are defined for $(p, q) \geq 0$ as
$$D([u, a], [v, b])^q \overset{\text{def.}}{=} H_{\lambda(|u - v|)}(a^p, b^p), \quad \text{where} \quad H_c(a^p, b^p) \overset{\text{def.}}{=} \inf_{\theta \geq 0} \theta L_c\Big(\frac{a^p}{\theta}, \frac{b^p}{\theta}\Big)$$
is the perspective transform of $L_c$ introduced in Lemma 1. The intuition underpinning the definition of this cost is that the perspective transform accounts for the possibility of rescaling a transport plan $\pi$ by a scalar $\theta$, but with the scaling performed pointwise instead of globally. In general $D$ is not a distance, but it is always definite, as stated by the following result, proved in Appendix A.

Proposition 3. Assume $\lambda^{-1}(\{0\}) = \{0\}$, $\varphi^{-1}(\{0\}) = \{1\}$ and $\varphi$ is coercive. Then $D$ is definite on $C[\mathbb{R}_+]$, i.e. $D([u, a], [v, b]) = 0$ if and only if $(a = b = 0)$ or $(a = b$ and $u = v)$.

Of particular interest are those $\varphi$ for which $D$ is a distance, which necessitates a careful choice of $\lambda$, $p$ and $q$. We now detail three examples where this is the case.

Gaussian Hellinger distance (GH). When $D_\varphi = \mathrm{KL}$, $\lambda(t) = t^2$ and $q = p = 2$, one has $D([u, a], [v, b])^2 = a^2 + b^2 - 2ab\, e^{-|u - v|^2/2}$. This cone distance (Burago et al., 2001) is further generalized by De Ponti (2019), who shows that $D$ is a distance for the power entropies $\varphi(s) = \frac{s^p - p(s - 1) - 1}{p(p - 1)}$ if $p \geq 1$ (the case $p = 1$ corresponding to $D_\varphi = \mathrm{KL}$).

Hellinger-Kantorovich (HK) / Wasserstein-Fisher-Rao distance (WFR). When $D_\varphi = \mathrm{KL}$, $\lambda(t) = -\log\cos^2(t \wedge \frac{\pi}{2})$ and $q = p = 2$, one has $D([u, a], [v, b])^2 = a^2 + b^2 - 2ab\cos(\frac{\pi}{2} \wedge |u - v|)$. This construction, which might seem peculiar, corresponds to the one used to make unbalanced OT a geodesic distance, as detailed in (Liero et al., 2015; Chizat et al., 2018c).

Partial optimal transport distance (PT). When $D_\varphi = \mathrm{TV}$, $\lambda(t) = t^q$, $q \geq 1$ and $p = 1$, then $D([u, a], [v, b])^q = a + b - (a \wedge b)(2 - |u - v|^q)_+$ defines a cone distance (Chizat et al., 2018c).
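The three closed-form cone costs above can be transcribed directly. The following sketch is our transcription of the stated formulas, with illustrative names:

```python
import numpy as np

def cone_gh(u, a, v, b):
    # Gaussian Hellinger: D^2 = a^2 + b^2 - 2 a b exp(-|u - v|^2 / 2).
    return a**2 + b**2 - 2*a*b*np.exp(-abs(u - v)**2 / 2)

def cone_hk(u, a, v, b):
    # Hellinger-Kantorovich / WFR: D^2 = a^2 + b^2 - 2 a b cos(min(|u - v|, pi/2)).
    return a**2 + b**2 - 2*a*b*np.cos(min(abs(u - v), np.pi/2))

def cone_pt(u, a, v, b, q=1):
    # Partial OT: D^q = a + b - min(a, b) * max(2 - |u - v|^q, 0).
    return a + b - min(a, b)*max(2.0 - abs(u - v)**q, 0.0)
```

All three vanish exactly on the diagonal ($a = b$, $u = v$) and at the apex ($a = b = 0$), in accordance with Proposition 3.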

2.2.2. DEFINITIONS AND PROPERTIES

The conic formulation consists in solving a GW problem on the cone, with the addition of two linear constraints. Informally speaking, $L_c$ from Lemma 1 becomes $D$, the term $(|(\mu \otimes \mu)^\perp| + |(\nu \otimes \nu)^\perp|)$ is taken into account by the constraints (5) below, and the variables $(f, g)$ are replaced by $(r^p, s^p)$.

It reads
$$\mathrm{CGW}(\mathcal{X}, \mathcal{Y}) \overset{\text{def.}}{=} \inf_{\alpha \in \mathcal{U}_p(\mu, \nu)} \mathcal{H}(\alpha), \quad \mathcal{H}(\alpha) \overset{\text{def.}}{=} \int D([d_X(x, x'), rr'], [d_Y(y, y'), ss'])^q\, d\alpha([x, r], [y, s])\, d\alpha([x', r'], [y', s']),$$
$$\mathcal{U}_p(\mu, \nu) \overset{\text{def.}}{=} \Big\{ \alpha \in \mathcal{M}_+(C[X] \times C[Y]) \;:\; \int_{\mathbb{R}_+} r^p\, d\alpha_1(\cdot, r) = \mu, \ \int_{\mathbb{R}_+} s^p\, d\alpha_2(\cdot, s) = \nu \Big\}. \quad (5)$$
It is similar to the conic formulation of UW, see Appendix A. Note that, similarly to the GW formulation (2), and in sharp contrast with the conic formulation of UW, here the transport plans are defined on the cone $C[X] \times C[Y]$, but the cost $D$ is a distance on $C[\mathbb{R}_+]$.

We now present the main contributions of this paper, proved in Appendix C. We state that CGW defines a distance under conditions that hold for the settings of Section 2.2.1, and that it is upper-bounded by UGW. While the distance $\mathrm{CGW}^{1/q}$ cannot be cast as a finite-dimensional program even in discrete settings (because it is defined on a lifted space), UGW can be approximated with efficient numerical schemes, as detailed in Section 3. The tightness of the bound between UGW and CGW and the computation of CGW are open questions left for future work.

Theorem 1. (i) The divergence CGW is symmetric, positive and definite up to isometries. (ii) If $D$ is a distance on $C[\mathbb{R}_+]$, then $\mathrm{CGW}^{1/q}$ is a distance on the set of mm-spaces up to isometries. (iii) For any $(D_\varphi, \lambda, p, q)$ with associated cost $D$ on the cone, one has $\mathrm{UGW} \geq \mathrm{CGW}$.

Sketch of proof. (i) CGW is positive and symmetric; definiteness holds thanks to Proposition 3. (ii) The triangle inequality is proved as for balanced OT, applying the gluing lemma (Villani, 2003, Lemma 7.6). The non-trivial part is showing that the latter lemma holds, because it glues two plans provided they have a common marginal. Since CGW is invariant under radial rescalings (called dilations in Appendix C), it is possible to dilate two plans such that they have a common marginal and remain optimal. (iii) Take an optimal plan $\pi$ for $\mathrm{UGW}(\mathcal{X}, \mathcal{Y})$. From this $\pi$ one can build a plan $\alpha$ such that $\mathcal{L}(\pi) \geq \mathcal{H}(\alpha)$ because $L_c \geq H_c$. Furthermore $\alpha \in \mathcal{U}_p(\mu, \nu)$, and is thus admissible and suboptimal, which yields $\mathrm{UGW}(\mathcal{X}, \mathcal{Y}) = \mathcal{L}(\pi) \geq \mathcal{H}(\alpha) \geq \mathrm{CGW}(\mathcal{X}, \mathcal{Y})$.

3. ALGORITHMS

The computation of the distance CGW is in practice out of reach because it requires an optimization over a lifted conic space, which would need to be discretized. We focus in this section on the numerical computation of the upper bound UGW, using an alternate minimization coupled with entropic regularization. The algorithm is presented for arbitrary measures, discrete measures being a particular case. The discretized formulas and algorithms are detailed in Appendix D; see also Chizat et al. (2018a); Peyré et al. (2016). All implementations are available at https://github.com/anonymous-conference-submission.

In order to derive a simple numerical approximation scheme, following Mémoli (2011), we introduce a lower bound obtained by introducing two transport plans. To further accelerate the method and enable GPU-friendly iterations, similarly to Gold et al. (1996); Solomon et al. (2016), we consider an entropic regularization. It reads, for any $\varepsilon \geq 0$,
$$\mathrm{UGW}_\varepsilon(\mathcal{X}, \mathcal{Y}) \overset{\text{def.}}{=} \inf_\pi \mathcal{L}(\pi) + \varepsilon\, \mathrm{KL}^\otimes(\pi|\mu \otimes \nu) \geq \inf_{\pi, \gamma} \mathcal{F}(\pi, \gamma) + \varepsilon\, \mathrm{KL}(\pi \otimes \gamma|(\mu \otimes \nu)^{\otimes 2}), \quad (6)$$
where
$$\mathcal{F}(\pi, \gamma) \overset{\text{def.}}{=} \int_{X^2 \times Y^2} \lambda(|d_X - d_Y|)\, d\pi \otimes \gamma + D_\varphi(\pi_1 \otimes \gamma_1|\mu \otimes \mu) + D_\varphi(\pi_2 \otimes \gamma_2|\nu \otimes \nu),$$
and $(\gamma_1, \gamma_2)$ denote the marginals of the plan $\gamma$. Note that, in contrast to the entropic regularization of GW of Peyré et al. (2016), here we use a tensorized entropy to maintain the overall homogeneity of the energy. A simple method to approximate this lower bound is to perform an alternate minimization on $\pi$ and $\gamma$, which is known to converge for smooth $\varphi$ to a stationary point, since the coupling term in the functional is smooth (Tseng, 2001). Note that if $\pi \otimes \gamma$ is optimal then so is $(s\pi) \otimes (\frac{1}{s}\gamma)$ for $s > 0$. Thus, without loss of generality, we optimize under the constraint $m(\pi) = m(\gamma)$ by setting $s = \sqrt{m(\gamma)/m(\pi)}$. In general, this bound is not expected to be tight, but empirically, alternate minimization often converges to a solution with $\pi = \gamma$ (as already observed for instance in Rangarajan et al. (1999); Solomon et al. (2016)), so that the algorithm also finds a local minimizer of the $\mathrm{UGW}_\varepsilon$ problem. In the balanced-GW case in Euclidean spaces, the optimum is known to satisfy $\pi = \gamma$ (Konno, 1976) and alternate descent is equivalent to a mirror-descent algorithm (Solomon et al., 2016).

Minimizing the lower bound of (6) with respect to either $\pi$ or $\gamma$ is non-trivial for an arbitrary $\varphi$. We restrict our attention to the Kullback-Leibler case $D_\varphi = \rho\mathrm{KL}$ with $\rho > 0$, which can be addressed by solving a regularized and convex unbalanced problem, as studied in Chizat et al. (2018a); Séjourné et al. (2019). This is explained in the following proposition.

Proposition 4. For a fixed $\gamma$, the optimal $\pi \in \arg\min_\pi \mathcal{F}(\pi, \gamma) + \varepsilon\mathrm{KL}(\pi \otimes \gamma|(\mu \otimes \nu)^{\otimes 2})$ is the solution of
$$\min_\pi \int c^\varepsilon_\gamma(x, y)\, d\pi(x, y) + \rho\, m(\gamma)\, \mathrm{KL}(\pi_1|\mu) + \rho\, m(\gamma)\, \mathrm{KL}(\pi_2|\nu) + \varepsilon\, m(\gamma)\, \mathrm{KL}(\pi|\mu \otimes \nu),$$
where $m(\gamma) \overset{\text{def.}}{=} \gamma(X \times Y)$ is the mass of $\gamma$, and where the cost associated to $\gamma$ is defined as
$$c^\varepsilon_\gamma(x, y) \overset{\text{def.}}{=} \int \lambda(|d_X(x, \cdot) - d_Y(y, \cdot)|)\, d\gamma + \rho \int \log\Big(\frac{d\gamma_1}{d\mu}\Big)\, d\gamma_1 + \rho \int \log\Big(\frac{d\gamma_2}{d\nu}\Big)\, d\gamma_2 + \varepsilon \int \log\Big(\frac{d\gamma}{d\mu \otimes \nu}\Big)\, d\gamma.$$

Computing the cost $c^\varepsilon_\gamma$ for spaces $X$ and $Y$ of $n$ points has in general a cost of $O(n^4)$ in time and memory. However, as explained for instance in Peyré et al. (2016), for the special case $\lambda(t) = t^2$ this cost is reduced to $O(n^3)$ in time and $O(n^2)$ in memory; this is the setting we consider in the numerical simulations. This makes the method applicable at scales of the order of $10^4$ points. For larger datasets, one should use approximation schemes such as hierarchical approaches (Xu et al., 2019) or Nyström compression of the kernel (Altschuler et al., 2018). The resulting alternate minimization method is detailed in Algorithm 1; see Appendix D for a discretized version. It uses the unbalanced Sinkhorn algorithm of Chizat et al. (2018a); Séjourné et al. (2019) as sub-iterations, and is initialized with $\pi = \mu \otimes \nu / \sqrt{m(\mu)m(\nu)}$.
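The $O(n^3)$ evaluation of the quadratic part of $c^\varepsilon_\gamma$ for $\lambda(t) = t^2$ follows from expanding $(a - b)^2 = a^2 + b^2 - 2ab$, so the $n^4$ cost tensor never needs to be formed. A sketch with illustrative names, showing only the quadratic (transport) term of the cost:

```python
import numpy as np

def quadratic_cost(DX, DY, gamma):
    # c_quad[i, j] = sum_{k,l} (DX[i,k] - DY[j,l])^2 * gamma[k,l], computed via the
    # expansion DX^2 + DY^2 - 2 DX DY, using only matrix products (O(n^3) time).
    g1, g2 = gamma.sum(1), gamma.sum(0)
    return (DX**2 @ g1)[:, None] + (DY**2 @ g2)[None, :] - 2 * DX @ gamma @ DY.T
```

This agrees with the naive tensorized evaluation while avoiding the $(n, m, n, m)$ intermediate array.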
This Sinkhorn algorithm operates over a pair of continuous functions (the so-called Kantorovich potentials) $f(x)$ and $g(y)$. For discrete spaces $X$ and $Y$ of size $n$, these functions are stored in vectors of size $n$, and the integrals involved in the updates become sums. Each Sinkhorn iteration thus has a cost of $n^2$ operations, and all the involved operations can be efficiently mapped to parallelizable GPU routines, as detailed in Chizat et al. (2018a); Séjourné et al. (2019). Another advantage of using an unbalanced Sinkhorn algorithm is its complexity $O(n^2/\varepsilon)$ to compute an $\varepsilon$-approximation, as stated in Pham et al. (2020), which should be compared to $O(n^2/\varepsilon^2)$ operations for balanced Sinkhorn. Note also that balanced GW is recovered as a special case by setting $\rho \to +\infty$, so that $\rho/(\varepsilon + \rho) \to 1$ should be used in the iterations. In order to speed up the Sinkhorn inner loops, especially for small values of $\varepsilon$, one can use linear extrapolation (Thibault et al., 2017) or non-linear Anderson acceleration (Scieur et al., 2016). There is an extra scaling step after computing $\gamma$, involving the mass $m(\pi)$. It corresponds to the rescaling $s$ of $\pi \otimes \gamma$ enforcing $m(\pi) = m(\gamma)$, and we observe that this scaling is key not only to impose this mass equality but also to stabilize the algorithm: without it, we observed that $m(\gamma) < 1 < m(\pi)$, with underflows whenever $m(\gamma) \to 0$ and $m(\pi) \to \infty$.

Algorithm 1 - UGW(X, Y, ρ, ε)
Input: mm-spaces (X, Y), relaxation ρ, regularization ε
Output: approximation (π, γ) minimizing (6)
1: Initialize π = γ = µ ⊗ ν / √(m(µ)m(ν)), g = 0.
2: while (π, γ) has not converged do
3:   Update π ← γ, then c ← c^ε_π, ρ ← m(π)ρ, ε ← m(π)ε
4:   while (f, g) has not converged do
5:     ∀x, f(x) ← −(ερ/(ε + ρ)) log ∫ e^{(g(y) − c(x,y))/ε} dν(y)
6:     ∀y, g(y) ← −(ερ/(ε + ρ)) log ∫ e^{(f(x) − c(x,y))/ε} dµ(x)
7:   Update γ(x, y) ← exp((f(x) + g(y) − c(x, y))/ε) µ(x)ν(y)
8:   Rescale γ ← (m(π)/m(γ)) γ
9: Return (π, γ).
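The inner Sinkhorn sub-iterations (lines 5-7 of Algorithm 1) can be sketched as follows for a fixed cost matrix; this is our simplified transcription (dense NumPy, no log-domain stabilization, and the outer loop on $(\pi, \gamma)$ and the mass rescaling are omitted):

```python
import numpy as np

def unbalanced_sinkhorn(c, mu, nu, rho, eps, n_iter=300):
    # Fixed-point iterations on the potentials (f, g) for a fixed cost c,
    # then the primal plan of line 7. rho -> infinity gives ratio -> 1 and
    # recovers balanced Sinkhorn.
    f, g = np.zeros(len(mu)), np.zeros(len(nu))
    ratio = rho / (eps + rho)
    for _ in range(n_iter):
        f = -eps * ratio * np.log(np.sum(nu[None, :] * np.exp((g[None, :] - c) / eps), axis=1))
        g = -eps * ratio * np.log(np.sum(mu[:, None] * np.exp((f[:, None] - c) / eps), axis=0))
    pi = np.exp((f[:, None] + g[None, :] - c) / eps) * mu[:, None] * nu[None, :]
    return f, g, pi
```

For very large $\rho$ the marginals of the returned plan match $(\mu, \nu)$ up to the entropic smoothing, mimicking balanced OT; moderate $\rho$ lets the marginals deviate, trading transport against mass creation.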

4. NUMERICAL EXPERIMENTS

This section presents numerical simulations on synthetic examples, to highlight the qualitative behavior of UGW with respect to mass variation and outliers. In all these experiments, $\mu$ and $\nu$ are probability distributions, which allows us to compare GW with UGW.

Robustness to imbalanced classes. In this first example, we take $X = Y = \mathbb{R}^2$ and consider $E$, $C$ and $S$ to be uniform distributions on an ellipse, a disk and a square. Figure 1 contrasts the transportation plans obtained by GW and UGW for a fixed $\mu = 0.5E + 0.5C$ and a $\nu$ obtained using two different mixtures of $E$ and $S$. The black segments show the largest entries of the transportation matrix $\pi$, for a sub-sampled set of points (to ease visibility), thus effectively displaying the matching induced by the plan. Furthermore, the widths of the dots are scaled according to the mass of the marginals $\pi_1 \approx \mu$ and $\pi_2 \approx \nu$, i.e. the smaller the point, the smaller the amount of transported mass. This figure shows that the exact conservation of mass imposed by GW leads to a poor geometrical matching of shapes which have different global mass. As should be expected, UGW recovers coherent matchings. We suspect the alternate minimization algorithm was able to find the global minimum in these cases.

Influence of ε and debiasing. This figure (and the following ones) does not show the influence of $\varepsilon$. This parameter is set to a low value $\varepsilon = 10^{-2}$ on a domain $[0, 1]^2$, so as to approximate the optimal plan of the unregularized UGW problem. The impact of $\varepsilon$ is similar to that in classical OT, namely that it introduces an extra diffusion bias.

Robustness to outliers. We then study the behavior of UGW in the presence of outliers (displayed in cyan). Decreasing the value of $\rho$ (thus allowing for more mass creation/destruction in place of transportation) is able to reduce and even remove the influence of the outliers, as expected. Furthermore, using small values of $\rho$ tends to favor "local structures", which is a behavior quite different from UW (1). Indeed, for UW, $\rho \to 0$ sets to zero all the mass of $\pi$ outside of the diagonal (points are not transported), while for UGW, it is rather pairs of points with dissimilar pairwise distances which cannot be transported together.

Graph matching and comparison with Partial-GW. We now consider two graphs $(X, Y)$ equipped with their respective geodesic distances. These graphs correspond to points embedded in $\mathbb{R}^2$, and the length of each edge is its Euclidean length. The two synthetic graphs are close to being isometric, but differ by the addition or modification of small sub-structures. The colors $c(x)$ are defined on the "source" graph $X$ and are mapped by an optimal plan $\pi$ onto $y \in Y$ as the color $\frac{1}{\pi_1(y)}\int_X c(x)\, d\pi(x, y)$. This allows us to visualize the matching induced by GW and UGW for varying $\rho$, as displayed in Figure 3. The plots for GW should be taken as reference, since there is no mass creation. The POT library (Flamary & Courty, 2017) is used to compute GW. For large values of $\rho$, UGW behaves similarly to GW, thus producing irregular matchings which do not preserve the overall geometry of the shapes. In sharp contrast, for smaller values of $\rho$ (e.g. $\rho = 10^{-1}$), some fine-scale structures (such as the target's small circle) are discarded, and UGW is able to produce a meaningful partial matching of the graphs. For intermediate values ($\rho = 10^0$), we observe that the two branches and the blue cluster of the source are correctly matched to the target, while for GW the blue points are scattered because of the marginal constraint.

Figure 4 shows a comparison with Partial-GW (Chapel et al., 2020), also computed using the POT library. It is close to UGW with a $\mathrm{TV}^\otimes$ penalty, since partial OT is equivalent to the use of a TV relaxation of the marginals. UGW with a $\mathrm{KL}^\otimes$ penalty is first computed for a given $\rho$; the total mass $m$ of the optimal plan is then computed and used as a parameter for PGW, which imposes this total mass as a constraint. Figures 3 and 4 display the transportation strategies associated to both methods. KL-UGW operates smooth transitions between transportation and creation of mass, while PGW either performs pure transportation or pure destruction/creation of mass. This can be observed in Figure 4, where nodes of the graphs are removed and not taken into account by the matching. Note also that since PGW is equivalent to solving GW on sub-graphs, the color distributions of GW and PGW are the same.

5. CONCLUSION AND PERSPECTIVES

This paper defines two Unbalanced Gromov-Wasserstein formulations. We prove that they are both positive and definite. We provide a scalable, GPU-friendly algorithm to compute one of them, and show that the other is a distance between mm-spaces up to isometry. These divergences and distances allow for the first time to blend in a seamless way the transportation geometry of GW with creation and destruction of mass. This hybridization is the key to unlocking both theoretical and practical issues. This work opens new questions for future works. On the theoretical side, the geodesic structures induced by unbalanced GW distances and divergences are an important subject of study. On the practical side, removing the bias introduced by the use of entropic regularization is important for applications to ML. Note that such a debiasing was successfully applied to Balanced-GW in Bunne et al. (2019), and is shown to lead to a valid divergence for balanced OT in Feydy et al. (2019b) and for UW in Séjourné et al. (2019). The design of efficient numerical solvers for the conic formulation is also an interesting avenue for future work.

A BACKGROUND ON UNBALANCED OPTIMAL TRANSPORT

Following Liero et al. (2015), this section reviews and generalizes the homogeneous and conic formulations of unbalanced optimal transport. These three formulations are equal in the convex setting of UOT. Our relaxed divergence UGW and conic distance CGW defined in Section 2 build upon these constructions, but are no longer equal due to the non-convexity of GW problems.

A.1 HOMOGENEOUS FORMULATION

To ease the description of the homogeneous formulation, we develop and refactor the Csiszàr divergence terms of (1) in a form analogous to Lemma 1. It reads

UW(µ, ν)^q = inf_{π ∈ M₊(X²)} ∫ L_{λ(d(x,y))}(f(x), g(y)) dπ(x, y) + ψ∞ (|µ⊥| + |ν⊥|),

where L_c(r, s) := c + rϕ(1/r) + sϕ(1/s) = c + ψ(r) + ψ(s), with ψ(r) := rϕ(1/r) the reverse entropy and ψ∞ := lim_{r→∞} ψ(r)/r = ϕ(0). Here |µ⊥| := µ⊥(X), and (f := dµ/dπ₁, g := dν/dπ₂) are the densities of the Lebesgue decomposition of (µ, ν) with respect to (π₁, π₂), i.e.

µ = f π₁ + µ⊥ and ν = g π₂ + ν⊥.

The authors of Liero et al. (2015) then define the homogeneous formulation HUW as

HUW(µ, ν)^q := inf_{π ∈ M₊(X²)} ∫ H_{λ(d(x,y))}(f(x), g(y)) dπ(x, y) + ψ∞ (|µ⊥| + |ν⊥|),

where the 1-homogeneous function H_c is the perspective transform of L_c:

H_c(r, s) := inf_{θ ≥ 0} θ (c + ψ(r/θ) + ψ(s/θ)) = inf_{θ ≥ 0} θ L_c(r/θ, s/θ).

By definition one has L_c ≥ H_c, though after optimization one has UW = HUW.
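As a numeric sanity check of the perspective transform: for ϕ = KL (i.e. ϕ(s) = s log s − s + 1), the reverse entropy is ψ(r) = r − 1 − log r, and H_c admits the closed form r + s − 2√(rs) e^{−c/2}. The sketch below (function names are ours) brute-forces the infimum over θ on a grid and compares it to the closed form.

```python
import numpy as np

def psi_kl(r):
    """Reverse entropy psi(r) = r * phi(1/r) for phi = KL."""
    return r - 1.0 - np.log(r)

def L_cost(c, r, s):
    """L_c(r, s) = c + psi(r) + psi(s), the cost before the perspective transform."""
    return c + psi_kl(r) + psi_kl(s)

def H_closed(c, r, s):
    """Closed form of the perspective transform H_c for phi = KL."""
    return r + s - 2.0 * np.sqrt(r * s) * np.exp(-c / 2.0)

def H_numeric(c, r, s, thetas=np.linspace(1e-3, 10.0, 400001)):
    """Brute-force H_c(r, s) = inf_{theta >= 0} theta * L_c(r/theta, s/theta)."""
    return float(np.min(thetas * L_cost(c, r / thetas, s / thetas)))
```

One checks for instance that `H_numeric(1.0, 2.0, 0.5)` agrees with `H_closed` up to grid resolution, and that L_c ≥ H_c pointwise, as stated above.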

A.2 CONE SETS, CONE DISTANCES AND EXPLICIT SETTINGS

The conic formulation detailed in Section A.3 is obtained by performing optimal transport on the cone set C[X] := (X × ℝ₊)/(X × {0}), where the extra coordinate accounts for the mass of the particle. All coordinates of the form (x, 0) are merged into a single point called the apex of the cone, noted 0_X. In the sequel, points of X × ℝ₊ are noted (x, r), and those of C[X] are noted [x, r] to emphasize the quotient operation at the apex. For a pair (p, q) ∈ ℝ₊², we define for any ([x, r], [y, s]) ∈ C[X]²

D_{C[X]}([x, r], [y, s])^q := H_{λ(d(x,y))}(r^p, s^p).   (11)

In general D_{C[X]} is not a distance, but it is always definite, as proved by the following result.

Proposition 5. Assume that d is definite, λ⁻¹({0}) = {0} and ϕ⁻¹({0}) = {1}. Assume also that for any (r, s) there always exists θ* such that H_c(r, s) = θ* L_c(r/θ*, s/θ*). Then D_{C[X]} is definite on C[X], i.e. D_{C[X]}([x, r], [y, s]) = 0 if and only if (r = s = 0) or (r = s and x = y).

Proof. Assume D_{C[X]}([x, r], [y, s]) = 0, and write θ* such that

D_{C[X]}([x, r], [y, s])^q = θ* L_c(r^p/θ*, s^p/θ*) = θ* λ(d(x, y)) + r^p ϕ(θ*/r^p) + s^p ϕ(θ*/s^p),

where the last equality is given by the definition of the reverse entropy. There are two cases. If θ* > 0, since all terms are non-negative, they are all equal to 0. By definiteness of d this yields x = y, and because ϕ⁻¹({0}) = {1} we have r^p = s^p = θ*, hence r = s. If θ* = 0, then D_{C[X]}([x, r], [y, s])^q = ϕ(0)(r^p + s^p). The assumption ϕ⁻¹({0}) = {1} implies ϕ(0) > 0, thus necessarily r = s = 0.

The function H_c can be computed in closed form for a number of common entropies ϕ, and we refer to Liero et al. (2015, Section 5) for an overview. Of particular interest are those ϕ for which D_{C[X]} is a distance, which requires a careful choice of λ, p and q. We now detail three particular settings where this is the case. In each setting we provide (D_ϕ, λ, p, q) and the associated cone distance D_{C[X]}.
Gaussian Hellinger distance. It corresponds to D_ϕ = KL, λ(t) = t² and q = p = 2, giving

D_{C[X]}([x, r], [y, s])² = r² + s² − 2rs e^{−d(x,y)²/2},

in which case it is proved in Liero et al. (2015) that D_{C[X]} is a cone distance.

Hellinger-Kantorovich / Wasserstein-Fisher-Rao distance. It reads D_ϕ = KL, λ(t) = −log cos²(t ∧ π/2) and q = p = 2, giving

D_{C[X]}([x, r], [y, s])² = r² + s² − 2rs cos(d(x, y) ∧ π/2),

in which case it is proved in Burago et al. (2001) that D_{C[X]} is a cone distance. The weight λ(t) = −log cos²(t ∧ π/2), which might seem more peculiar, is in fact the penalty that makes unbalanced OT a length space induced by the Gaussian-Hellinger distance (if the ground metric d is itself geodesic), as proved in Liero et al. (2016); Chizat et al. (2018b). This weight introduces a cut-off, because λ(d(x, y)) = +∞ if d(x, y) > π/2: there is no transport between points too far from each other. The choice of π/2 is arbitrary, and can be modified by scaling λ → λ(·/s) for some cutoff s.

Partial optimal transport. It corresponds to D_ϕ = TV, λ(t) = t^q with q ≥ 1, and p = 1, giving

D_{C[X]}([x, r], [y, s])^q = r + s − (r ∧ s)(2 − d(x, y)^q)₊,

in which case it is proved in Chizat et al. (2018c) that D_{C[X]} is a cone distance. The case D_ϕ = TV is equivalent to partial unbalanced OT, which produces discontinuities (because of the non-smoothness of the divergence) between regions of the supports which are being transported and regions where mass is being destroyed/created. Note that Liero et al. (2015) do not mention that this D_{C[X]} defines a distance, so this result is new to the best of our knowledge, although it can be proved without a conic lifting that partial OT defines a distance, as explained in Chizat et al. (2018c).
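The three closed forms above are straightforward to implement. The sketch below (helper names are ours; the ground metric d is passed as a callable, so the base space is arbitrary) can be used, e.g., to spot-check numerically that the Wasserstein-Fisher-Rao cone distance satisfies the triangle inequality.

```python
import numpy as np

def cone_ghk(x, r, y, s, d):
    """Gaussian-Hellinger cone cost (squared): r^2 + s^2 - 2 r s exp(-d(x,y)^2 / 2)."""
    return r**2 + s**2 - 2.0 * r * s * np.exp(-d(x, y)**2 / 2.0)

def cone_wfr(x, r, y, s, d):
    """Hellinger-Kantorovich / WFR cone distance (squared), with cutoff pi/2."""
    return r**2 + s**2 - 2.0 * r * s * np.cos(min(d(x, y), np.pi / 2.0))

def cone_partial(x, r, y, s, d, q=1):
    """Cone cost for partial OT (phi = TV, lambda(t) = t^q, p = 1)."""
    return r + s - min(r, s) * max(2.0 - d(x, y)**q, 0.0)
```

For instance, with d(x, y) = |x − y| on the line, `cone_partial` returns r + s (pure creation/destruction) as soon as d(x, y)^q ≥ 2, matching the cut-off behavior described above.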

A.3 CONIC FORMULATION OF UW

The last formulation reinterprets UW as an OT problem on the cone, with the addition of two linear constraints. Informally speaking, H_c becomes D_{C[X]}, the term (|µ⊥| + |ν⊥|) is taken into account by the constraints (13) below, and the variables (f, g) are replaced by (r^p, s^p). It reads

CUW(µ, ν)^q := inf_{α ∈ U_p(µ,ν)} ∫ D_{C[X]}([x, r], [y, s])^q dα([x, r], [y, s]),

where the constraint set U_p(µ, ν) is defined as

U_p(µ, ν) := { α ∈ M₊(C[X]²) : ∫_{ℝ₊} r^p dα₁(·, r) = µ, ∫_{ℝ₊} s^p dα₂(·, s) = ν }.   (13)

Thus CUW consists in minimizing the Wasserstein distance W_{D_{C[X]}}(α₁, α₂) on the cone (C[X], D_{C[X]}). The additional constraints on (α₁, α₂) mean that the lift of the mass onto the cone must be consistent with the total mass of (µ, ν). When D_{C[X]} is a distance, CUW inherits the metric properties of W_{D_{C[X]}}. Our theoretical results rely on an analogous construction for GW. The following proposition states the equality of the three formulations and recapitulates their main properties; the proofs are detailed in Liero et al. (2015).

Proposition 6 (From Liero et al. (2015)). One has UW = HUW = CUW, which are symmetric, positive and definite. Furthermore, if (X, d_X) and (C[X], D_{C[X]}) are metric spaces with X separable, then M₊(X) endowed with CUW is a metric space.

Proof. The equality UW = HUW is given by Liero et al. (2015, Theorem 5.8), while the equality HUW = CUW holds thanks to Liero et al. (2015, Theorem 6.7 and Remark 7.5), where the latter theorem can be straightforwardly generalized to any cone distance built as in Section 2.2.1. Since D_{C[X]} is symmetric, positive and definite (see Proposition 3), so is CUW. Furthermore, if D_{C[X]} satisfies the triangle inequality, the separability of X allows one to apply the gluing lemma (Liero et al., 2015, Corollary 7.14), which generalizes to any exponent p defining U_p(µ, ν) and any cone distance D_{C[X]}.

B UGW FORMULATION AND DEFINITENESS

We present in this section the proofs of the properties of our divergence UGW. We refer to Section 2 for the definition of the UGW formulation and its related concepts. For conciseness we write Γ(x, x′, y, y′) = |d_X(x, x′) − d_Y(y, y′)|. We start with the existence of minimizers stated in Proposition 1, which shows in particular that our divergence is well-defined.

Proposition 7 (Existence of minimizers). Assume (X, Y) to be compact mm-spaces and that either
1. ϕ is superlinear, i.e. ϕ∞ = +∞, or
2. λ has compact sublevel sets in ℝ₊ and 2ϕ∞ + inf λ > 0.
Then there exists π ∈ M₊(X × Y) such that UGW(X, Y) = L(π).

Proof. We adapt the proof of Liero et al. (2015, Theorem 3.3). The functional is lower semicontinuous as a sum of l.s.c. terms, so it suffices to obtain relative compactness of a minimizing sequence. Under either one of the assumptions, coercivity of the functional holds thanks to Jensen's inequality:

L(π) ≥ m(π)² inf λ(Γ) + m(µ)² ϕ(m(π)²/m(µ)²) + m(ν)² ϕ(m(π)²/m(ν)²)
     = m(π)² [ inf λ(Γ) + (m(µ)²/m(π)²) ϕ(m(π)²/m(µ)²) + (m(ν)²/m(π)²) ϕ(m(π)²/m(ν)²) ].

As m(π) → +∞, the bracket converges to 2ϕ∞ + inf λ > 0, which under either one of the assumptions yields L(π) → +∞, hence the coercivity. Thus we can restrict the minimization to plans with m(π) ≤ M for some M. Since the spaces are assumed to be compact, the Banach-Alaoglu theorem applies and gives relative compactness in M₊(X × Y). Take any sequence of plans π_n approaching UGW(X, Y) = inf L(π). By compactness, a subsequence π_{n_k} weak* converges to some π*. Because L is l.s.c., we have L(π*) ≤ inf L(π), thus L(π*) = inf L(π): the infimum is attained, and a minimizer exists.

Note that this formulation is non-negative and symmetric because the functional L is itself non-negative and symmetric in its inputs (X, Y). This formulation allows us to prove the definiteness of UGW in a straightforward way.

Proposition 8 (Definiteness of UGW).
Assume that ϕ⁻¹({0}) = {1} and λ⁻¹({0}) = {0}. The following assertions are equivalent:
1. UGW(X, Y) = 0;
2. there exists π ∈ M₊(X × Y) whose marginals are (µ, ν) and such that d_X(x, x′) = d_Y(y, y′) for π ⊗ π-a.e. (x, x′, y, y′) ∈ (X × Y)²;
3. there exists a mm-space (Z, d_Z, η) with full support and Borel maps ψ_X : Z → X and ψ_Y : Z → Y such that (ψ_X)♯η = µ, (ψ_Y)♯η = ν and d_Z = (ψ_X)*d_X = (ψ_Y)*d_Y.

The importance of dilations is given by the following lemma.

Lemma 3 (Invariance to dilation). The problem CGW is invariant to dilations, i.e. for any α ∈ U_p(µ, ν), we have Dil_v(α) ∈ U_p(µ, ν) and H(α) = H(Dil_v(α)).

Proof. First we prove the stability of U_p(µ, ν) under dilations. Take α ∈ U_p(µ, ν). For any test function ξ defined on X we have

∫ ξ(x) r^p dDil_v(α) = ∫ ξ(x) (r/v)^p v^p dα = ∫ ξ(x) r^p dα = ∫ ξ(x) dµ(x).

Similarly the second marginal constraint holds, thus Dil_v(α) ∈ U_p(µ, ν). It remains to prove the invariance of the functional. Recall that D^q is p-homogeneous in the mass variable. It yields

H(Dil_v(α)) = ∬ D([d_X(x, x′), rr′], [d_Y(y, y′), ss′])^q dDil_v(α) dDil_v(α)
= ∬ D([d_X(x, x′), (r/v)(r′/v)], [d_Y(y, y′), (s/v)(s′/v)])^q v^p · v^p dα dα
= ∬ (1/v^{2p}) D([d_X(x, x′), rr′], [d_Y(y, y′), ss′])^q v^{2p} dα dα
= ∬ D([d_X(x, x′), rr′], [d_Y(y, y′), ss′])^q dα dα = H(α).

Both the functional and the constraint set are invariant, thus the whole CGW problem is invariant to dilations.

The above lemma allows one to normalize the plan so that one of its marginals is fixed. Fixing a marginal makes it possible to generalize the gluing lemma, which is a key ingredient of the triangle inequality in optimal transport.

Lemma 4 (Normalization lemma). Assume there exists α ∈ U_p(µ, ν) such that CGW(X, Y) = H(α). Then there exists ᾱ ∈ U_p(µ, ν) such that CGW(X, Y) = H(ᾱ) and whose marginal on C[Y] is ν_{C[Y]} = P^{(C[Y])}♯ ᾱ = δ_{0_Y} + p♯(ν ⊗ δ₁), where p is the canonical injection from Y × ℝ₊ to C[Y].

Proof. The proof is exactly that of Liero et al. (2015, Lemma 7.10) and is included for completeness. Take an optimal plan α. Because the functional and the constraints are homogeneous in (r, s), the plan α̃ = α + δ_{0_X} ⊗ δ_{0_Y} verifies α̃ ∈ U_p(µ, ν) and H(α̃) = H(α). Indeed, because of this homogeneity, the contribution δ_{0_X} ⊗ δ_{0_Y} has (r, s) = (0, 0) and thus no impact. Considering α̃ instead of α allows us to assume without loss of generality that the transport plan charges the apex, i.e. setting S = {([x, r], [y, s]) ∈ C[X] × C[Y] : [y, s] = 0_Y}, one has ω_Y := α̃(S) ≥ 1. Then we define the scaling

v([x, r], [y, s]) = s if s > 0, and ω_Y^{−1/p} otherwise.

We now prove that Dil_v(α̃) has the desired marginal on C[Y] by considering test functions ξ([y, s]). We split the integral along the set S, writing α̃ = α̃|_S + α̃|_{S^c} the restrictions to S and S^c respectively. It reads

∫ ξ([y, s]) dDil_v(α̃) = ∫ ξ([y, s/v]) v^p dα̃
= ∫ ξ([y, s/v]) v^p dα̃|_S + ∫ ξ([y, s/v]) v^p dα̃|_{S^c}
= ∫ ξ(0_Y) ω_Y^{−1} dα̃|_S + ∫ ξ([y, s/s]) s^p dα̃|_{S^c}
= ξ(0_Y) · ω_Y · ω_Y^{−1} + ∫ ξ([y, 1]) s^p dα̃
= ξ(0_Y) + ∫ ξ(p(y, s)) d(ν(y) ⊗ δ₁(s))
= ∫ ξ([y, s]) d(δ_{0_Y} + p♯(ν ⊗ δ₁)),

which is the formula of the desired marginal on C[Y]. Since α̃ ∈ U_p(µ, ν), its dilation ᾱ := Dil_v(α̃) is also in U_p(µ, ν), and H(ᾱ) = H(α̃) = H(α).

C.1.1 PROOF OF THEOREM 1

Non-negativity and symmetry hold since H is a sum of non-negative, symmetric terms. To prove definiteness, assume CGW(X, Y) = 0, and write α an optimal plan. We have α ⊗ α-a.e. that d_X(x, x′) = d_Y(y, y′) and rr′ = ss′, because D is definite (see Proposition 3). Thanks to the completeness of (X, Y) and a result from Sturm (2012, Lemma 1.10), this property implies the existence of a Borel isometric bijection with Borel inverse between the supports of the measures, ψ : Supp(µ) → Supp(ν), where Supp denotes the support. The bijection ψ verifies d_X(x, x′) = d_Y(ψ(x), ψ(x′)). To prove X ∼ Y it remains to prove ψ♯µ = ν. By density of continuous functions of the form ξ(x)ξ(x′), the constraints of U_p(µ, ν) are equivalent to

∬_{ℝ₊} (rr′)^p dα₁(·, r) dα₁(·, r′) = µ ⊗ µ,  ∬_{ℝ₊} (ss′)^p dα₂(·, s) dα₂(·, s′) = ν ⊗ ν.

Take a continuous test function ξ defined on Supp(ν)². Writing y = ψ(x) and y′ = ψ(x′), one has

∬ ξ(y, y′) dν dν = ∬ ξ(y, y′)(ss′)^p dα dα = ∬ ξ(ψ(x), ψ(x′))(ss′)^p dα dα = ∬ ξ(ψ(x), ψ(x′))(rr′)^p dα dα = ∬ ξ(ψ(x), ψ(x′)) dµ dµ = ∬ ξ(x, x′) dψ♯µ dψ♯µ.

Since ψ is a bijection, there is a bijection between continuous functions on Supp(ν)² and on Supp(µ)². Thus we obtain ν = ψ♯µ, and we have X ∼ Y.

It remains to prove the triangle inequality; assume now that D satisfies it. Given three mm-spaces (X, Y, Z) respectively equipped with measures (µ, ν, η), consider α, β optimal plans for CGW(X, Y) and CGW(Y, Z). Applying Lemma 4 to both α and β, we can consider plans (ᾱ, β̄) which are also optimal and have a common marginal ν̄ on C[Y]. Thanks to this common marginal and the separability of (X, Y, Z), the standard gluing lemma (Villani, 2003, Lemma 7.6) applies and yields a glued plan γ ∈ M₊(C[X] × C[Y] × C[Z]) whose respective marginals on C[X] × C[Y] and C[Y] × C[Z] are (ᾱ, β̄). Furthermore, the marginal γ̄ of γ on C[X] × C[Z] is in U_p(µ, η): indeed, (γ, ᾱ) have the same marginal on C[X], and similarly (γ, β̄) on C[Z], hence this property. Write d_X = d_X(x, x′) for the sake of conciseness (and similarly for Y, Z). The calculation reads

CGW(X, Z)^{1/q}   (16)
≤ ( ∬ D([d_X, rr′], [d_Z, tt′])^q dγ̄([x, r], [z, t]) dγ̄([x′, r′], [z′, t′]) )^{1/q}   (17)
≤ ( ∬ D([d_X, rr′], [d_Z, tt′])^q dγ([x, r], [y, s], [z, t]) dγ([x′, r′], [y′, s′], [z′, t′]) )^{1/q}   (18)
≤ ( ∬ (D([d_X, rr′], [d_Y, ss′]) + D([d_Y, ss′], [d_Z, tt′]))^q dγ dγ )^{1/q}   (19)
≤ ( ∬ D([d_X, rr′], [d_Y, ss′])^q dγ dγ )^{1/q} + ( ∬ D([d_Y, ss′], [d_Z, tt′])^q dγ dγ )^{1/q}   (20)
≤ ( ∬ D([d_X, rr′], [d_Y, ss′])^q dᾱ([x, r], [y, s]) dᾱ([x′, r′], [y′, s′]) )^{1/q} + ( ∬ D([d_Y, ss′], [d_Z, tt′])^q dβ̄([y, s], [z, t]) dβ̄([y′, s′], [z′, t′]) )^{1/q}   (21)
≤ CGW(X, Y)^{1/q} + CGW(Y, Z)^{1/q}.   (22)

Since γ̄ ∈ U_p(µ, η), it is suboptimal, which yields Equation (17). Because γ̄ is the marginal of γ, we get Equation (18). Equations (19) and (20) are respectively obtained by the triangle and Minkowski inequalities, which hold because D is a distance. Equation (21) is the marginalization of γ, and Equation (22) is given by the optimality of (ᾱ, β̄), which ends the proof of the triangle inequality.

C.1.2 PROOF OF THE INEQUALITY BETWEEN UGW AND CGW

The proof consists in considering an optimal plan π for UGW, building a lift α of this plan onto the cone such that L(π) ≥ H(α), and proving that α is admissible for the program CGW, hence suboptimal. Using Equation (8), we have

µ ⊗ µ = (f ⊗ f)(π₁ ⊗ π₁) + (µ ⊗ µ)⊥,  (µ ⊗ µ)⊥ = µ⊥ ⊗ (fπ₁) + (fπ₁) ⊗ µ⊥ + µ⊥ ⊗ µ⊥,
ν ⊗ ν = (g ⊗ g)(π₂ ⊗ π₂) + (ν ⊗ ν)⊥,  (ν ⊗ ν)⊥ = ν⊥ ⊗ (gπ₂) + (gπ₂) ⊗ ν⊥ + ν⊥ ⊗ ν⊥.   (23)

Recall that the canonical injection p reads p(x, r) = [x, r]. Based on the above Lebesgue decomposition, we define the conic plan

α = (p(x, f(x)^{1/p}), p(y, g(y)^{1/p}))♯ π(x, y) + δ_{0_X} ⊗ p♯[ν⊥ ⊗ δ₁] + p♯[µ⊥ ⊗ δ₁] ⊗ δ_{0_Y}.

We have that α ∈ U_p(µ, ν). Indeed, for the first marginal (and similarly for the second), we have for any test function ξ(x)

∫ ξ(x) r^p dα = ∫ ξ(x) f(x) dπ₁(x) + 0 + ∫ ξ(x) 1^p dµ⊥(x) = ∫ ξ(x) d(f(x)π₁ + µ⊥) = ∫ ξ(x) dµ(x).

We define θ* = θ*_c(r, s) as the parameter which verifies H_c(r, s) = θ* L_c(r/θ*, s/θ*). We restrict α ⊗ α to the set S = {θ*_{λ(Γ)}((rr′)^p, (ss′)^p) > 0}. By construction, θ*_c(r, s) is 1-homogeneous in (r, s). Thus on S we necessarily have r, r′, s, s′ > 0. It yields

α ⊗ α|_S = (p(x, f(x)^{1/p}), p(y, g(y)^{1/p}), p(x′, f(x′)^{1/p}), p(y′, g(y′)^{1/p}))♯ (π ⊗ π).

D.2 DISCRETE SETTING AND FORMULAS

In order to implement those algorithms, one considers discrete mm-spaces X = (x_i)_{i=1}^n and Y = (y_j)_{j=1}^m, endowed with discrete measures µ = Σ_i µ_i δ_{x_i} and ν = Σ_j ν_j δ_{y_j}, where µ_i, ν_j ≥ 0. The distance matrices are D^X_{i,i′} := d_X(x_i, x_{i′}) and D^Y_{j,j′} := d_Y(y_j, y_{j′}). Transport plans are thus also discrete, π = Σ_{i,j} π_{i,j} δ_{(x_i, y_j)}. In this discrete setting, the functional L involves

∬ (d_X(x, x′) − d_Y(y, y′))² dπ(x, y) dπ(x′, y′) = Σ_{i,j,k,ℓ} (D^X_{i,j} − D^Y_{k,ℓ})² π_{i,k} π_{j,ℓ},

and

KL(π₁ ⊗ π₁ | µ ⊗ µ) = Σ_{i,j} log(π_{1,i} π_{1,j} / (µ_i µ_j)) π_{1,i} π_{1,j} − Σ_{i,j} π_{1,i} π_{1,j} + Σ_{i,j} µ_i µ_j = 2 m(π) Σ_i log(π_{1,i}/µ_i) π_{1,i} − m(π)² + m(µ)²,

where we define the marginals π_{1,i} := Σ_j π_{i,j}, π_{2,j} := Σ_i π_{i,j}, and m(π) = Σ_{i,j} π_{i,j}. When one runs the stabilized implementation of Sinkhorn's iterations with a ground cost C_{i,j} = C(x_i, y_j) between the points, it is necessary to use a log-sum-exp reduction, which reads

f_i ← −(ερ/(ε + ρ)) LSE_j[(g_j − C_{i,j})/ε + log ν_j],

where LSE_j is a reduction performed over the index j. It reads

LSE_j(C_{i,j}) := log Σ_j exp(C_{i,j} − max_k C_{i,k}) + max_k C_{i,k},

where the logarithm and exponential are pointwise operations.

Algorithm 2 — UGW(X, Y, ρ, ε) in discrete form
Input: mm-spaces X = (D^X_{i,j}, (µ_i)_i) and Y = (D^Y_{i,j}, (ν_j)_j), relaxation ρ, regularization ε
Output: approximation (π, γ) minimizing (6)
1: Initialize π_{i,j} = γ_{i,j} = µ_i ν_j / √((Σ_i µ_i)(Σ_j ν_j)), and the vector g^{(s=0)}_j = 0.
7: f_i ← −(ερ/(ε + ρ)) log Σ_j exp((g_j − c_{i,j})/ε + log ν_j)
8: g_j ← −(ερ/(ε + ρ)) log Σ_i exp((f_i − c_{i,j})/ε + log µ_i)
9: Update γ_{i,j} ← exp((f_i + g_j − c_{i,j})/ε) µ_i ν_j
10: Rescale γ ← (m(π)/m(γ)) γ
11: Return (π, γ).

We also provide an algorithm that computes the cost c^ε_π defined in Proposition 10. We focus on the case D_ϕ = ρKL and λ(t) = t², which is computable with complexity O(n³) as shown in Peyré et al. (2016), thanks to the decomposition of the quadratic cost recalled below.
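The stabilized update above can be sketched in a few lines of numpy. This is only the inner Sinkhorn loop for a fixed local cost C (the outer UGW loop, the mass rescalings and the cost updates of Algorithm 2 are omitted), and the helper names are ours:

```python
import numpy as np

def lse(M, axis):
    """Stabilized log-sum-exp reduction along the given axis."""
    mx = M.max(axis=axis, keepdims=True)
    return np.squeeze(mx, axis) + np.log(np.exp(M - mx).sum(axis=axis))

def unbalanced_sinkhorn(C, mu, nu, rho, eps, n_iter=500):
    """Inner Sinkhorn loop of the UGW solver (sketch): alternate the updates
        f_i <- -(eps*rho/(eps+rho)) * LSE_j[(g_j - C_ij)/eps + log nu_j]
    and symmetrically for g, then return the plan
        pi_ij = exp((f_i + g_j - C_ij)/eps) * mu_i * nu_j.
    """
    tau = eps * rho / (eps + rho)
    f, g = np.zeros(len(mu)), np.zeros(len(nu))
    for _ in range(n_iter):
        f = -tau * lse((g[None, :] - C) / eps + np.log(nu)[None, :], axis=1)
        g = -tau * lse((f[:, None] - C) / eps + np.log(mu)[:, None], axis=0)
    return np.exp((f[:, None] + g[None, :] - C) / eps) * mu[:, None] * nu[None, :]
```

As a sanity check, for a single pair of Diracs with zero cost and unit masses, the fixed point is the unit plan.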



Its proof is deferred to Appendix B. It splits UGW into two parts: the term ϕ(0)(|(µ ⊗ µ)⊥| + |(ν ⊗ ν)⊥|) accounts for the pure creation/destruction of mass, while a new transport cost L_c accounts for the remaining part (partial/pure transport and partial creation/destruction of mass).

Lemma 1. Defining L_c(a, b) := c + aϕ(1/a) + bϕ(1/b), and writing (f := dµ/dπ₁, g := dν/dπ₂) the densities of the Lebesgue decomposition of (µ, ν) with respect to (π₁, π₂).

Figure 1: GW vs. UGW transportation plan, using ν = 0.3E +0.7S on the left, and ν = 0.7E +0.3S on the right.

Figure 3: Comparison of UGW and GW for graph matching.

Such a form is helpful to make explicit the terms of pure mass creation/destruction (|µ⊥| + |ν⊥|), and to reinterpret the integral under π as a transport term with a new cost L_{λ(d)}.

4. There exists a Borel measurable bijection between the measures' supports ψ : spt(µ) → spt(ν), with Borel measurable inverse, such that ψ♯µ = ν and d_Y = ψ*d_X.

where w = v([x, r], [y, s]). It reads, for any test function ξ,

∫ ξ([x, r], [y, s]) dDil_v(α) = ∫ ξ([x, r/w], [y, s/w]) w^p dα.

For any measures (µ, ν, α, β) ∈ M₊(X), one has

KL(µ ⊗ ν | α ⊗ β) = m(ν) KL(µ|α) + m(µ) KL(ν|β) + (m(µ) − m(α))(m(ν) − m(β)).   (26)

In particular,

KL(µ ⊗ µ | ν ⊗ ν) = 2 m(µ) KL(µ|ν) + (m(µ) − m(ν))².   (27)

Proof. Assuming KL(µ ⊗ ν | α ⊗ β) to be finite, one has µ = fα and ν = gβ. It reads

KL(µ ⊗ ν | α ⊗ β) = ∬ log(f ⊗ g) dµ dν − m(µ)m(ν) + m(α)m(β)
= m(ν) ∫ log(f) dµ + m(µ) ∫ log(g) dν − m(µ)m(ν) + m(α)m(β)
= m(ν) [KL(µ|α) + m(µ) − m(α)] + m(µ) [KL(ν|β) + m(ν) − m(β)] − m(µ)m(ν) + m(α)m(β)
= m(ν) KL(µ|α) + m(µ) KL(ν|β) + m(µ)m(ν) − m(ν)m(α) − m(µ)m(β) + m(α)m(β)
= m(ν) KL(µ|α) + m(µ) KL(ν|β) + (m(µ) − m(α))(m(ν) − m(β)).

We now prove Proposition 4, which applies the above result.

Proposition 10. For a fixed γ, the optimal π ∈ argmin_π F(π, γ) + ε KL(π ⊗ γ | (µ ⊗ ν)⊗²) is the solution of

min_π ∫ c^ε_γ(x, y) dπ(x, y) + ρ m(γ) KL(π₁|µ) + ρ m(γ) KL(π₂|ν) + ε m(γ) KL(π | µ ⊗ ν),

where m(γ) := γ(X × Y) is the total mass of γ, and where we define the cost associated with γ as

c^ε_γ(x, y) := ∫ λ(Γ(x, ·, y, ·)) dγ + ρ ∫ log(dγ₁/dµ) dγ₁ + ρ ∫ log(dγ₂/dν) dγ₂ + ε ∫ log(dγ/(dµ dν)) dγ.

Proof. First note that F(γ, π) = F(π, γ), so that minimizing in the first or the second argument gives the same solution. The rest follows from the relation

KL(π₁ ⊗ γ₁ | µ ⊗ µ) = m(γ) KL(π₁|µ) + m(π) KL(γ₁|µ) + (m(γ) − m(µ))(m(π) − m(µ)),

and also from KL(γ₁|µ) = ∫ log(dγ₁/dµ) dγ₁ − (m(γ) − m(µ)). Similar formulas hold for (π₂, γ₂) and (π, γ).
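Identity (26) is cheap to verify numerically: the left-hand side sums over n² product terms, while the right-hand side only needs n. A small numpy check (helper names are ours), using the unbalanced KL between nonnegative vectors:

```python
import numpy as np

def kl(a, b):
    """Unbalanced KL between nonnegative vectors: sum a*log(a/b) - m(a) + m(b)."""
    return float(np.sum(a * np.log(a / b)) - a.sum() + b.sum())

def kl_product(mu, nu, alpha, beta):
    """KL(mu ⊗ nu | alpha ⊗ beta), computed naively on the n^2 product terms."""
    P, Q = np.outer(mu, nu), np.outer(alpha, beta)
    return float(np.sum(P * np.log(P / Q)) - P.sum() + Q.sum())
```

Comparing `kl_product` against the factored right-hand sides of (26) and (27) on random positive vectors confirms the decomposition.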

Steps 4–6 of Algorithm 2 read:
4: m(π) ← Σ_{i,j} π_{i,j}, ρ ← m(π)ρ, ε ← m(π)ε
5: Define c ← ComputeCost(X, Y, π, ρ, ε)
6: while (f, g) has not converged do

∫ (d_X(x, x′) − d_Y(y, y′))² dπ(x′, y′) = ∫ d_X(x, x′)² dπ₁(x′) + ∫ d_Y(y, y′)² dπ₂(y′) − 2 ∫ d_X(x, x′) d_Y(y, y′) dπ(x′, y′).

ComputeCost(X, Y, π, ρ, ε) in discrete form
Input: mm-spaces X = (D^X_{i,j}, (µ_i)_i) and Y = (D^Y_{k,ℓ}, (ν_k)_k), transport matrix (π_{j,k})_{j,k}, relaxation ρ, regularization ε
Output: cost c^ε_π defined in Proposition 10
1: Compute π_{1,j} ← Σ_k π_{j,k} and π_{2,k} ← Σ_j π_{j,k}    (π₁ = π𝟙 and π₂ = π^⊤𝟙)
2: Compute A_i ← Σ_j (D^X_{i,j})² π_{1,j}    (A = (D^X)^{∘2} π₁)
3: Compute B_ℓ ← Σ_k (D^Y_{k,ℓ})² π_{2,k}    (B = (D^Y)^{∘2} π₂)
4: Compute C_{i,ℓ} ← Σ_j D^X_{i,j} Σ_k D^Y_{k,ℓ} π_{j,k}    (C = D^X π D^Y)
5: Compute E ← ρ Σ_j log(π_{1,j}/µ_j) π_{1,j} + ρ Σ_k log(π_{2,k}/ν_k) π_{2,k} + ε Σ_{j,k} log(π_{j,k}/(µ_j ν_k)) π_{j,k}
6: Return c^ε_{π,i,ℓ} ← A_i + B_ℓ − 2 C_{i,ℓ} + E
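The routine above translates directly to vectorized numpy. The sketch below (helper name is ours; the distance matrices are assumed symmetric) can be checked against the naive O(n²m²) sum with ρ = ε = 0 so that the constant term E vanishes:

```python
import numpy as np

def compute_cost(DX, DY, pi, mu, nu, rho, eps):
    """Local cost c^eps_pi of Proposition 10 (sketch), using the factorization
    (D_X - D_Y)^2 = D_X^2 + D_Y^2 - 2 D_X D_Y of Peyre et al. (2016).
    DX, DY are symmetric distance matrices; pi is the current plan."""
    pi1, pi2 = pi.sum(axis=1), pi.sum(axis=0)
    A = (DX ** 2) @ pi1                       # integrates d_X(x, .)^2 against pi_1
    B = (DY ** 2) @ pi2                       # integrates d_Y(y, .)^2 against pi_2
    C = DX @ pi @ DY                          # cross term d_X * d_Y against pi
    E = (rho * np.sum(np.log(pi1 / mu) * pi1)
         + rho * np.sum(np.log(pi2 / nu) * pi2)
         + eps * np.sum(np.log(pi / np.outer(mu, nu)) * pi))
    return A[:, None] + B[None, :] - 2.0 * C + E
```

Each entry then satisfies c_{i,ℓ} − E = Σ_{j,k} (D^X_{i,j} − D^Y_{k,ℓ})² π_{j,k}, at O(n³) instead of O(n⁴) cost.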


Proof. Recall that (2) ⇔ (3) ⇔ (4) from Sturm (2012, Lemma 1.10), thus it remains to prove (1) ⇔ (2). If there is such a coupling plan π between (µ, ν), then one has π ⊗ π-a.e. that Γ = 0, and all ϕ-divergences are zero as well, so that L(π) = 0 and UGW(X, Y) = 0. Assume now that UGW(X, Y) = 0, and write π an optimal plan. All terms of L are non-negative, thus under our assumptions we have Γ = 0 π ⊗ π-a.e., π₁ ⊗ π₁ = µ ⊗ µ and π₂ ⊗ π₂ = ν ⊗ ν. Thus π has marginals (µ, ν), and d_X(x, x′) = d_Y(y, y′) holds π ⊗ π-a.e., which is assertion (2).

We end with a result on the reformulation of UGW, which is the first step to connect it with the conic formulation CGW.

Lemma 2. Defining L_c(r, s) := c + rϕ(1/r) + sϕ(1/s), and writing (f := dµ/dπ₁, g := dν/dπ₂) the densities of the Lebesgue decomposition of (µ, ν) with respect to (π₁, π₂).

Proof. Using Equation (23), one has

C CONIC FORMULATION AND METRIC PROPERTIES

We present in this section the proofs of the properties mentioned in Section 2. We refer to Section 2 and Appendix A for the definition of the conic formulation and its related concepts. In this section we frequently use the notion of marginal for measures. For any sets E, F, we write P^{(E)} : E × F → E the projection map. Consider π ∈ M₊(X × Y) a coupling plan, and define its marginals by π₁ = P^{(X)}♯π and π₂ = P^{(Y)}♯π. The definition of the marginals can also be seen through the use of test functions: in the case of π₁, it reads, for any test function ξ,

∫ ξ(x) dπ₁(x) = ∫ ξ(x) dπ(x, y).

C.1 PRELIMINARY RESULTS

We present in this section concepts and properties which are necessary for the proof of Theorem 1. We introduce a dilation operator whose role is to rescale the radial coordinate of a measure by a given scaling.

Definition 2 (Dilations). Consider v([x, r], [y, s]) a Borel measurable scaling function depending on the points of the cone.

Concerning the orthogonal part of the decomposition, note that whenever θ* = 0, due to the definition of H the cone distance reads D_{C[X]}([x, r], [y, s])^q = ϕ(0)(r^p + s^p). It geometrically means that the shortest path between [x, r] and [y, s] must pass via the apex, which corresponds to a pure mass creation/destruction regime. Furthermore, thanks to Equation (24), the first marginal constraint holds; note that each term of α ⊗ α involving a measure δ_{0_X} cancels out when integrated against (rr′)^p. Eventually, the computation gives (thanks to Lemma 1) L(π) ≥ H(α), and thus UGW(X, Y) = L(π) ≥ H(α) ≥ CGW(X, Y).

D ALGORITHMIC DETAILS AND FORMULAS

D.1 DECOMPOSITION OF KL QUADRATIC DIVERGENCE

We present in this section an additional property of the quadratic-KL divergence, which reduces the computational burden of evaluating it by expressing it through a standard KL divergence.

